sMAP: Standard Microarray Analysis Pipeline

An application for quality control, statistical analysis, and functional analysis of GEO datasets, with the goal of finding potential biomarkers.


This app was created by the STEM-Away RShiny Project Team (Session 1, 2021).

Upload Data Files in sMAP to begin


Demo data selected. Download from here.


Visualization Before Processing




Quality control is crucial because it shows how much of the variation between samples may be technical rather than biological. There are four options here for visualizing the data:

  • Normalized unscaled standard errors (NUSE) compare the probeset standard errors between samples.
  • Relative log expression (RLE) plots show, for each sample, the intensity of every probeset relative to that probeset's median across all samples.
  • Principal Component Analysis (PCA) plots show how well experimental groups can be separated from each other. Ideally, each group of interest should form a separate cluster.
  • Boxplots allow expression values to be easily visualized and are primarily useful for seeing how preprocessing steps change the data.
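
As a rough illustration of what an RLE plot displays, the sketch below computes relative log expression by hand in base R. The matrix `expr` and its values are made up for illustration, and this is not the app's own code, which relies on package functions such as RLE().

```r
# Hypothetical log2 expression matrix: rows are probesets, columns are samples.
expr <- matrix(
  c(8.0, 8.2, 7.9, 9.5,
    5.1, 5.0, 5.3, 6.4,
    10.2, 10.1, 10.4, 11.6),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("probe1", "probe2", "probe3"),
                  c("S1", "S2", "S3", "S4"))
)

# RLE: subtract each probeset's median across samples from its values.
probe_medians <- apply(expr, 1, median)
rle_values <- sweep(expr, 1, probe_medians)

# A sample whose RLE values are centred far from zero stands out.
sample_medians <- apply(rle_values, 2, median)
print(round(sample_medians, 2))
```

In this toy data, sample S4 sits well above zero for every probeset, which is the kind of pattern an RLE plot makes visible.
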
Normalization


Normalization accounts for variation between samples that is not due to biological or genetic differences.

  • Normalization of the data will make feature extraction easier and the organization of the data much more structured.
  • The MAS5 function will normalize each chip in the data independently while RMA and GCRMA use a multi-chip model.
  • Each normalization function performs background correction, normalization, and summarization.
  • NOTE: If the Affymetrix Human Gene 1.0 ST Array was used, only RMA can be used.
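
To make the idea of a multi-chip method more concrete, here is a minimal base R sketch of quantile normalization, the normalization step that RMA applies across arrays. It covers only that one step (no background correction or summarization), and the function name `quantile_normalize` and the example matrix are assumptions for illustration, not code from the app.

```r
# Quantile normalization: force every sample (column) to share one distribution.
quantile_normalize <- function(x) {
  ranks <- apply(x, 2, rank, ties.method = "first")
  sorted <- apply(x, 2, sort)
  reference <- rowMeans(sorted)  # mean of the k-th smallest values across samples
  normalized <- apply(ranks, 2, function(r) reference[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

# Hypothetical expression values: 4 probes (rows) x 3 samples (columns).
expr <- matrix(c(5, 2, 3, 4,
                 4, 1, 4, 2,
                 3, 4, 6, 8),
               nrow = 4,
               dimnames = list(paste0("probe", 1:4), c("S1", "S2", "S3")))

normalized_expr <- quantile_normalize(expr)
print(round(normalized_expr, 2))
```

After this step every sample contains the same set of values, only in a different order, which is why the method needs all arrays at once rather than one chip at a time.
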
Batch Correction & Visualization After Processing




Batch Correction makes two batches comparable to one another by correcting for differences in equipment and experimentation.

  • The effects these differences have on the data are called batch effects.
  • Batch correction takes the normalized data as input and returns an expression matrix that is corrected for batch effects.
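
The sketch below shows the input and output shape of such a step using a deliberately simple stand-in: shifting each batch so that its per-gene mean matches the overall mean. This is an assumption made for illustration, not the method the app uses, and the function name `center_batches` and the data are hypothetical.

```r
# Hypothetical normalized matrix: 2 genes x 6 samples; samples 4-6 are shifted by batch.
expr <- rbind(geneA = c(5.0, 5.2, 4.8, 7.0, 7.2, 6.8),
              geneB = c(3.0, 3.1, 2.9, 2.0, 2.1, 1.9))
colnames(expr) <- paste0("S", 1:6)
batch <- factor(c(1, 1, 1, 2, 2, 2))

# Simplified correction: for each gene, shift every batch so that its mean
# matches the gene's overall mean. Dedicated tools model this more carefully.
center_batches <- function(x, batch) {
  corrected <- x
  for (b in levels(batch)) {
    cols <- batch == b
    shift <- rowMeans(x[, cols, drop = FALSE]) - rowMeans(x)
    corrected[, cols] <- x[, cols, drop = FALSE] - shift
  }
  corrected
}

corrected <- center_batches(expr, batch)
print(round(corrected, 2))
```

A correction this simple would also erase real biology if the groups of interest were split along the same lines as the batches, which is one reason dedicated batch correction methods exist.
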





Visualizing the data after normalization is complete lets you see how the data has changed. Both the boxplot and the PCA plot reflect the processing that was applied to the data. The boxplot helps to visualize outliers, while the PCA plot helps to visualize patterns and variation in the data.

Find Potential Outliers


Potential Outliers:

To make the data more reliable, outliers need to be identified and removed so that results can truly be attributed to differences in gene function and expression. You can choose from three different methods, and all three will find and remove the outliers.

Group Samples to Analyze DEGs


Differentially Expressed Genes



The percentage you choose is the share of genes with the lowest expression levels that will be filtered out. Additionally, any duplicate or NA gene symbols will be removed.
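
A minimal base R sketch of this kind of filtering is shown below. The variable names, the use of mean expression as the filtering criterion, and the choice to keep the first row for a duplicated symbol are all assumptions for illustration; the app may apply different rules.

```r
# Hypothetical expression matrix and gene symbols (one symbol per row).
expr <- matrix(c(2.1, 2.0, 2.2,
                 8.5, 8.7, 8.4,
                 6.0, 6.1, 5.9,
                 7.2, 7.0, 7.1,
                 4.0, 4.2, 4.1),
               ncol = 3, byrow = TRUE)
symbols <- c("GENE1", "GENE2", NA, "GENE2", "GENE3")
percent_to_filter <- 20  # user-chosen percentage (assumed name)

# 1. Drop the lowest-expressed percentage of genes by mean expression.
gene_means <- rowMeans(expr)
cutoff <- quantile(gene_means, probs = percent_to_filter / 100)
keep <- gene_means > cutoff

# 2. Drop rows with NA symbols, then keep the first row for each symbol.
keep <- keep & !is.na(symbols)
keep[keep] <- !duplicated(symbols[keep])

filtered <- expr[keep, , drop = FALSE]
rownames(filtered) <- symbols[keep]
print(filtered)
```
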


Genes with an adjusted p-value below the cutoff and an absolute log fold change above the cutoff will be considered significantly differentially expressed.
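
The two cutoffs can be applied as in the following sketch. The table is invented, the column names `logFC` and `adj.P.Val` are assumed to follow the usual top table layout, and the cutoff values are examples rather than the app's defaults.

```r
# Hypothetical results table with log fold changes and adjusted p-values.
results <- data.frame(
  gene = c("GENE1", "GENE2", "GENE3", "GENE4"),
  logFC = c(2.3, -1.8, 0.4, -2.5),
  adj.P.Val = c(0.001, 0.020, 0.003, 0.200)
)
p_cutoff <- 0.05      # adjusted p-value cutoff (assumed value)
logfc_cutoff <- 1     # absolute log fold change cutoff (assumed value)

# A gene must pass both cutoffs to count as differentially expressed.
significant <- results$adj.P.Val < p_cutoff & abs(results$logFC) > logfc_cutoff
results$status <- ifelse(!significant, "not significant",
                         ifelse(results$logFC > 0, "up", "down"))
print(results)
```

GENE3 fails because its fold change is too small, and GENE4 fails because its adjusted p-value is too high, even though its fold change is the largest in magnitude.
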




                
Interactive Volcano Plot for Calculated DEGs


Volcano plots are a visualization used for differential gene expression analysis. The x-axis represents the log fold change and the y-axis represents the p-value, typically on a -log10 scale so that more significant genes appear higher on the plot.

KEGG Pathway Enrichment Analysis


KEGG is a database of biological pathways. Using the gene symbols in your data, it lets you determine which biological pathways are associated with your set of enriched genes. You will be able to view a dot plot and a bar plot, in which larger dots and longer bars indicate enriched pathways.
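
Over-representation analyses of this kind are commonly based on a hypergeometric test, which asks whether a pathway contains more of your genes than chance would predict. The base R sketch below works through one pathway with invented counts; it illustrates the statistic only and does not query KEGG.

```r
# Hypothetical counts for one pathway.
universe_size <- 20000  # genes in the background
pathway_size <- 100     # background genes annotated to the pathway
deg_count <- 500        # genes in the DEG list
overlap <- 10           # DEGs that fall in the pathway

# P(X >= overlap) when drawing deg_count genes without replacement.
p_value <- phyper(overlap - 1, pathway_size, universe_size - pathway_size,
                  deg_count, lower.tail = FALSE)

expected <- deg_count * pathway_size / universe_size
cat("Expected overlap by chance:", expected, "\n")
cat("Enrichment p-value:", signif(p_value, 3), "\n")
```

In practice the test is repeated for every pathway, and the resulting p-values are adjusted for multiple testing.
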

Gene Ontology Enrichment Analysis


Gene Ontology (GO) describes gene function in three categories: Cellular Component, Molecular Function, and Biological Process. The enrichment analysis relates these terms to your set of genes. The dot plot and bar plot are two ways of visualizing which gene ontology terms are most enriched, with these terms indicated by larger dots and longer bars.

Gene-Set Enrichment Analysis


GSEA evaluates microarray data at the level of gene sets. Its goal is to determine how the genes of a set are distributed within the data set. GSEA first ranks all genes in the data set, then calculates an enrichment score for each gene set, which shows whether its genes tend to be over- or under-expressed.
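
The ranking and scoring idea can be sketched with a simplified, unweighted running sum. Full GSEA additionally weights each step by the ranking metric and assesses significance by permutation, so the function `enrichment_score` below is an illustrative assumption, not the app's implementation.

```r
# Simplified, unweighted enrichment score for one gene set.
enrichment_score <- function(ranked_genes, gene_set) {
  in_set <- ranked_genes %in% gene_set
  n_hit <- sum(in_set)
  n_miss <- length(ranked_genes) - n_hit
  # Step up at each set member, step down at every other gene.
  steps <- ifelse(in_set, 1 / n_hit, -1 / n_miss)
  running <- cumsum(steps)
  running[which.max(abs(running))]  # largest deviation from zero
}

# Hypothetical genes ranked from most up-regulated to most down-regulated.
ranked_genes <- c("G1", "G2", "G3", "G4", "G5", "G6", "G7", "G8")
top_set <- c("G1", "G2", "G3")
bottom_set <- c("G6", "G7", "G8")

print(enrichment_score(ranked_genes, top_set))
print(enrichment_score(ranked_genes, bottom_set))
```

A set concentrated at the top of the ranking gets a positive score, and a set concentrated at the bottom gets a negative one.
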

What are the differences between all the file options in data importation?

The options differ in the kind of file they accept and in how the server function reads it:

  • Normalized or pre-processed expression data: the server function reads the uploaded CSV files into a list.
  • Raw CEL files: the file input accepts a tar-zipped archive, which is unzipped in the server function. Using radio buttons, the user can specify whether their microarray requires the oligo package, and the code uses the matching package to read in the CEL files.
  • Illumina data: this works similarly and takes a tar-zipped archive containing the raw IDAT files as well as a .bgx file. The server function reads these files with the read.idat() function.
  • GEO data: the app can also fetch data from the GEO database using functions from the GEOquery package. The user enters an accession number and specifies whether they want the series matrix, the raw CEL files, or the data set file. Based on this input, the code uses different GEOquery functions to read in the files.
  • Metadata files: these are read in a similar manner to the CSV files of normalized and pre-processed expression data.

How does batch correction work?

To automatically assign samples from multiple CSV files to batches, Group A combined several functions. seq_along() generates a sequence of integers, one per CSV file, and rep() repeats each integer once for every sample in the corresponding file. The result is a vector containing the batch number of each sample. factor() then converts this vector into an object of the factor class. Factor objects are particularly useful for the statistical analysis that follows.
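
The combination of seq_along(), rep(), and factor() described above can be sketched as follows. The list `csv_tables` and its contents are hypothetical stand-ins for the uploaded files, with samples assumed to be in columns.

```r
# Hypothetical input: one expression table per uploaded CSV file,
# with samples in columns.
csv_tables <- list(
  fileA = data.frame(S1 = c(1, 2), S2 = c(3, 4), S3 = c(5, 6)),
  fileB = data.frame(S4 = c(7, 8), S5 = c(9, 10))
)

samples_per_file <- sapply(csv_tables, ncol)

# One batch number per file, repeated once for each sample in that file.
batch <- rep(seq_along(csv_tables), times = samples_per_file)
batch <- factor(batch)

print(batch)
```

Here the three samples from the first file are assigned to batch 1 and the two samples from the second file to batch 2.
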

What is normalization and how do RMA, GCRMA, and MAS5 differ?

Normalizing the data the user imports makes its organization much more structured. The rma() function takes in an AffyBatch object and turns it into an expression set. If the user chooses the GCRMA method, the gcrma() function is used; it also takes in the AffyBatch object and turns it into an expression set, additionally using the probe sequence. If the user chooses the MAS5 method, the mas5() function is used. The mas5() function normalizes each array in the data independently, while rma() and gcrma() use a multi-chip model.

What do NUSE and RLE do in quality control?

The user has two choices: NUSE and RLE. If the user chooses the NUSE method, the server completes quality control using the NUSE() function, in a way that depends on the file that was imported. If the user chooses the RLE method, the server uses the RLE() function, again using different methods depending on the file type. Once that is complete, the user can see the visualization of the data before normalization and batch correction.

How do I analyze the quality control boxplot and PCA plot?

On the server side, if the user chooses Boxplot, boxplot() is used to generate the plot, with the x-axis representing the sample number and the y-axis representing the gene expression values. If the user instead chooses to visualize the data using a PCA plot, prcomp() is used to calculate the principal components. The user then chooses the components to plot and the feature the PCA plot will represent. The server returns a message if the user enters more or fewer than 2 components or does not specify a feature. Finally, ggplot() is used on the server side to draw the PCA plot.
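
The PCA step can be sketched in base R as below. The random expression matrix, the `group` factor, and the variable names are assumptions for illustration, and the sketch stops at the table of scores that a plotting function such as ggplot() would receive.

```r
# Hypothetical expression matrix: 4 genes (rows) x 6 samples (columns).
set.seed(1)
expr <- matrix(rnorm(24, mean = 8), nrow = 4,
               dimnames = list(paste0("gene", 1:4), paste0("S", 1:6)))
expr[, 4:6] <- expr[, 4:6] + 3  # shift one group so that it separates
group <- factor(rep(c("control", "treated"), each = 3))

# prcomp() expects samples in rows, so transpose the matrix.
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)

# Share of variance explained by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Scores for the two chosen components, ready for plotting.
components <- c(1, 2)
scores <- data.frame(pca$x[, components], group = group)
print(scores)
print(round(var_explained, 3))
```

Plotting the two score columns against each other and coloring the points by `group` gives the usual PCA plot, in which well-separated groups form distinct clusters.
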

How do I analyze volcano plots?

The volcano plot is used in the statistical analysis to visualize the top DEGs. The user specifies the logFC and adjusted p-value cutoffs for the plot. The plot is generated by the EnhancedVolcano function, which takes in the user input and the final top table. The volcano plot lets the user see at a glance which genes change the most and how statistically significant those changes are.

Why do you remove outliers?

Identifying and removing outliers is important because it makes the data more reliable. It also lets the user see how the data changes after the outliers are removed. The user can then be sure that extreme samples are not causing the results they see, and that results can truly be attributed to differences in gene function and expression.

Can I get more info on KEGG enrichment analysis and how it produces its output?

To get the KEGG pathways, enrichKEGG() was used, and dotplot() was used for the visualization. Additionally, barplot() was used to create a bar plot of the KEGG analysis. The number of categories shown for the KEGG analysis is determined by the user. setReadable() was then used so that the gene-concept network for the significant KEGG pathways could be presented as well. These plots matter because they represent the top DEGs from the data, and the ENTREZ IDs link them to additional biological information.

What is the most important step in Functional Analysis?

A crucial function in the functional analysis was select(), which converted the data to ENTREZ IDs. The select() function allowed the genes in the user's data to be identified in an NIH database.

Project Technical Lead:

  • Samuel Bharti

Team Members:

  • Nikita Krishnan
  • Disha Chauhan
  • Arian Veyssi
  • Maryam Momeni
  • Roman Ramirez
  • Hou Wang (Ivan) Lam
  • Aditi Verma
  • Huikun (Kelly) Li
  • Sneha Raj

Acknowledgement:

All team members thank STEM-Away for providing an excellent internship opportunity with complete technical and management resources, which made sMAP possible.

We would also like to acknowledge the interns who shared the ideas and discussions that brought us here.

  • Austin (Assisted in Task Management)
  • Marc (Logistic Help)
  • Ananya (Participation in UI discussion)

How to cite us:

Updated soon.

Contact Us

If you have questions or comments about sMAP, please email Samuel Bharti, samuelbharti@gmail.com.

Our Git Page: BI-STEM-Away