sMAP: Standard Microarray Analysis Pipeline
An application for performing quality control, statistical analysis, and functional analysis of a GEO dataset in order to find potential biomarkers.
This app was created by the STEM-Away RShiny Project Team, Session 1, 2021.
Upload Data Files in sMAP to begin
Visualization Before Processing
Quality control is crucial because it lets you determine how much of the variation between samples may be technical rather than biological. There are four options here for visualizing the data:
- Normalized unscaled standard error (NUSE) plots compare the probeset standard errors between samples.
- Relative log expression (RLE) plots show, for each sample, the distribution of log expression values relative to the median expression of each probeset across all samples.
- Principal Component Analysis (PCA) plots show how well experimental groups can be separated from each other. Ideally, each group of interest should form a separate cluster.
- Boxplots allow expression values to be easily visualized and are primarily useful for seeing how preprocessing steps change the data.
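As a rough illustration of what an RLE plot summarizes, the following Python sketch computes RLE values from log2 expression data. It is not the code sMAP runs; the function names and the toy data are made up for this example, and real pipelines compute these values from fitted probe-level models.

```python
from statistics import median

def rle_values(log_expr):
    """Return relative log expression values per sample.

    log_expr maps a sample name to a list of log2 expression values,
    one per probeset, in the same probeset order for every sample.
    """
    samples = list(log_expr)
    n_probesets = len(log_expr[samples[0]])
    # Median of each probeset across all samples.
    probe_medians = [
        median(log_expr[s][i] for s in samples) for i in range(n_probesets)
    ]
    # Subtract the probeset median from every sample's value.
    return {
        s: [log_expr[s][i] - probe_medians[i] for i in range(n_probesets)]
        for s in samples
    }

def rle_summary(log_expr):
    """Median RLE per sample; values far from 0 suggest a problem array."""
    return {s: median(v) for s, v in rle_values(log_expr).items()}

# Hypothetical toy data: three samples, three probesets.
example = {
    "sample_a": [8.0, 6.0, 10.0],
    "sample_b": [8.2, 6.1, 10.1],
    "sample_c": [9.5, 7.4, 11.6],  # shifted upward relative to the others
}
summary = rle_summary(example)
```

In this toy data, sample_c has a median RLE well above zero, which is the kind of pattern an RLE boxplot makes visible.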
Normalization accounts for variation between samples that is not due to biological or genetic differences.
- Normalization of the data will make feature extraction easier and the organization of the data much more structured.
- The MAS5 function will normalize each chip in the data independently while RMA and GCRMA use a multi-chip model.
- Each normalization function performs background correction, normalization, and summarization.
- NOTE: If the Affymetrix Human Gene 1.0 ST Array was used, only RMA can be used.
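To give a feel for what a multi-chip normalization does, here is a minimal Python sketch of quantile normalization, the normalization step used by RMA. It covers only that one step (no background correction or summarization), it is not sMAP's implementation, and the function name and data are illustrative.

```python
def quantile_normalize(columns):
    """Quantile-normalize equal-length lists of intensities (one per chip)."""
    n = len(columns[0])
    sorted_cols = [sorted(c) for c in columns]
    # Reference distribution: mean of the k-th smallest value across chips.
    reference = [
        sum(sc[k] for sc in sorted_cols) / len(columns) for k in range(n)
    ]
    normalized = []
    for col in columns:
        # Rank each value within its own chip (ties broken by position).
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            # Replace the value with the reference value of the same rank.
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized

# Hypothetical intensities for two chips and three probes.
chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(chips)
```

After this step every chip has the same distribution of values, which is why the reference distribution depends on all chips at once, unlike a per-chip method such as MAS5.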
Batch Correction & Visualization After Processing
Batch Correction makes two batches comparable to one another by correcting for differences in equipment and experimentation.
- The effects these differences have on the data are called batch effects.
- Batch correction takes the normalized data as input and returns an expression matrix that is corrected for batch effects.
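The idea of removing a batch effect can be sketched with a deliberately simplified method: shift each batch so that its per-gene mean matches the overall per-gene mean. The text above does not say which algorithm sMAP uses, so this is only an assumption-laden illustration. It ignores variance differences and assumes the biological groups are balanced across batches.

```python
def center_batches(expr, batches):
    """Shift each batch so its per-gene mean equals the overall per-gene mean.

    expr: list of samples, each a list of normalized expression values
          (genes in the same order for every sample).
    batches: list of batch labels, one per sample.
    """
    n_genes = len(expr[0])
    overall = [sum(s[g] for s in expr) / len(expr) for g in range(n_genes)]
    corrected = [list(s) for s in expr]
    for label in set(batches):
        members = [i for i, b in enumerate(batches) if b == label]
        for g in range(n_genes):
            batch_mean = sum(expr[i][g] for i in members) / len(members)
            shift = overall[g] - batch_mean
            for i in members:
                corrected[i][g] = expr[i][g] + shift
    return corrected

# Hypothetical data: four samples, two genes, batch B runs about 4 units higher.
expr = [[1.0, 5.0], [3.0, 7.0], [5.0, 9.0], [7.0, 11.0]]
batches = ["A", "A", "B", "B"]
corrected = center_batches(expr, batches)
```

The input is the normalized matrix and the output is a matrix of the same shape, matching the input and output described above.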
Visualizing the data after normalization is complete lets you see how normalization has changed your data. Both the boxplot and the PCA plot show the effect of the processing steps applied to the data. The boxplot helps you spot outliers, while the PCA plot helps you see patterns and variation in the data.
Find Potential Outliers
To make the data more reliable, outliers need to be identified and removed so that results can truly be attributed to differences in gene function and expression. You can choose from three different methods, and all three will find and remove the outliers.
Group Samples to Analyze DEGs
Differentially Expressed Genes
The percentage you choose is the percentage of genes that will be filtered out because of low expression. Additionally, any duplicate or NA gene symbols will be removed.
Genes with an adjusted p-value below the cutoff and a log fold change whose magnitude is above the cutoff will be considered significantly differentially expressed.
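The filtering logic described above can be sketched in Python as follows. This is an illustration, not sMAP's code: the field names (`symbol`, `mean_expr`, `log_fc`, `adj_p`), the default cutoffs, and the rule of keeping the most highly expressed entry among duplicates are all assumptions made for the example.

```python
def filter_low_expression(genes, percent):
    """Drop the given percentage of genes with the lowest mean expression."""
    ranked = sorted(genes, key=lambda g: g["mean_expr"])
    n_drop = int(len(ranked) * percent / 100)
    return ranked[n_drop:]

def dedupe_symbols(genes):
    """Drop entries without a symbol; keep one entry per symbol.

    Assumption: among duplicates, the entry with the highest mean
    expression is kept.
    """
    best = {}
    for g in genes:
        sym = g["symbol"]
        if sym is None:
            continue
        if sym not in best or g["mean_expr"] > best[sym]["mean_expr"]:
            best[sym] = g
    return list(best.values())

def significant(genes, p_cutoff=0.05, lfc_cutoff=1.0):
    """Keep genes passing both the adjusted p-value and |logFC| cutoffs."""
    return [
        g for g in genes
        if g["adj_p"] < p_cutoff and abs(g["log_fc"]) > lfc_cutoff
    ]

# Hypothetical results table.
genes = [
    {"symbol": "GENE_A", "mean_expr": 9.0, "log_fc": 2.1, "adj_p": 0.001},
    {"symbol": "GENE_B", "mean_expr": 2.0, "log_fc": 3.0, "adj_p": 0.001},
    {"symbol": None, "mean_expr": 8.0, "log_fc": -2.5, "adj_p": 0.002},
    {"symbol": "GENE_C", "mean_expr": 7.5, "log_fc": -1.8, "adj_p": 0.01},
    {"symbol": "GENE_D", "mean_expr": 8.5, "log_fc": 0.3, "adj_p": 0.0001},
    {"symbol": "GENE_A", "mean_expr": 6.0, "log_fc": 1.9, "adj_p": 0.04},
    {"symbol": "GENE_E", "mean_expr": 7.0, "log_fc": 1.5, "adj_p": 0.2},
    {"symbol": "GENE_F", "mean_expr": 1.0, "log_fc": -2.0, "adj_p": 0.03},
]
kept = dedupe_symbols(filter_low_expression(genes, percent=25))
degs = significant(kept)
deg_symbols = sorted(g["symbol"] for g in degs)
```

Note that GENE_D has a very small adjusted p-value but is still excluded because its fold change is small, and GENE_B is never tested because it falls in the low-expression fraction.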
Interactive Volcano Plot for Calculated DEGs
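Volcano plots are a visualization used for differential gene expression analysis. The x-axis represents the log fold change and the y-axis represents statistical significance, conventionally plotted as the negative log10 of the p-value so that more significant genes appear higher on the plot.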
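As a small sketch of how a gene becomes a point on a volcano plot, the Python function below converts a log fold change and a p-value into plot coordinates and a label. It assumes the common convention of plotting -log10(p) on the y-axis, which the text above does not confirm for sMAP; the cutoffs and label names are also illustrative.

```python
import math

def volcano_point(log_fc, p_value, p_cutoff=0.05, lfc_cutoff=1.0):
    """Return (x, y, label) for one gene; p_value must be greater than 0."""
    y = -math.log10(p_value)  # smaller p-values land higher on the plot
    if p_value < p_cutoff and log_fc > lfc_cutoff:
        label = "up"
    elif p_value < p_cutoff and log_fc < -lfc_cutoff:
        label = "down"
    else:
        label = "not significant"
    return (log_fc, y, label)

up_point = volcano_point(2.0, 0.001)
down_point = volcano_point(-1.5, 0.01)
ns_point = volcano_point(0.2, 0.001)
```

Genes of interest therefore sit in the upper left and upper right corners of the plot, far from zero on the x-axis and high on the y-axis.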
KEGG Pathway Enrichment Analysis
KEGG is a database that lets you map your genes, by gene symbol, to known biological systems, allowing you to determine which biological pathways are associated with your gene set. You can view the results as a dot plot and a bar plot, in which larger dots and longer bars indicate enriched pathways.
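Pathway over-representation is commonly assessed with a hypergeometric test: given how many genes in the background belong to a pathway, how surprising is the overlap with your gene list? The Python sketch below shows that calculation under this assumption. It is not a description of what sMAP does internally, and the function name and numbers are made up.

```python
from math import comb

def enrichment_p_value(n_background, n_pathway, n_selected, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_background: genes in the background set
    n_pathway:    background genes annotated to the pathway
    n_selected:   genes in your list
    n_overlap:    genes in your list that are in the pathway
    """
    total = comb(n_background, n_selected)
    upper = min(n_pathway, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 20 background genes, 5 in the pathway,
# 4 genes selected, 3 of which are in the pathway.
p = enrichment_p_value(20, 5, 4, 3)
p_no_overlap = enrichment_p_value(20, 5, 4, 0)
```

In practice a p-value like this is computed for every pathway and then adjusted for multiple testing before pathways are reported as enriched.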
Gene Ontology Enrichment Analysis
Gene Ontology (GO) describes gene function in three categories (Cellular Components, Molecular Functions, and Biological Processes) and relates these to your set of genes. The dot plot and bar plot are two ways of visualizing which gene ontology terms are most enriched. These terms are indicated by larger dots and longer bars.
Gene-Set Enrichment Analysis
GSEA evaluates microarray data at the level of gene sets. The goal of GSEA is to figure out how the genes of a set are distributed within the data set. GSEA first ranks all genes in the data set, then calculates an enrichment score for each gene set, which shows whether its genes tend to be over- or under-expressed.
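The rank-then-score idea can be sketched with an unweighted running-sum statistic: walk down the ranked gene list, step up at each gene in the set and down at each gene outside it, and report the largest deviation from zero. This Python example is a simplification under stated assumptions. Standard GSEA weights each step by the ranking metric and assesses significance by permutation, both of which are omitted here, and the names are illustrative.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score for one gene set.

    ranked_genes: gene names ordered from most up-regulated to most
                  down-regulated.
    gene_set:     set of gene names to test.
    """
    hits = [g in gene_set for g in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the list partially")
    running = 0.0
    best = 0.0
    for is_hit in hits:
        # Step up for a gene in the set, down for a gene outside it.
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

# Hypothetical ranked list of six genes.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
es_top = enrichment_score(ranked, {"g1", "g2"})
es_bottom = enrichment_score(ranked, {"g5", "g6"})
```

A positive score means the set's genes cluster near the top of the ranking (over-expressed), and a negative score means they cluster near the bottom (under-expressed).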
What are the differences between all the file options in data importation?
How does batch correction work?
What is normalization and how do RMA, GCRMA, and MAS5 differ?
What do NUSE and RLE do in quality control?
How do I interpret the quality control boxplot and PCA plot?
How do I analyze volcano plots?
Why do you remove outliers?
Can I get more information on KEGG enrichment analysis and how it arrives at its output?
What is the most important step in Functional Analysis?
Project Technical Lead:
- Samuel Bharti
- Nikita Krishnan
- Disha Chauhan
- Arian Veyssi
- Maryam Momeni
- Roman Ramirez
- Hou Wang (Ivan) Lam
- Aditi Verma
- Huikun (Kelly) Li
- Sneha Raj
All team members thank STEM-Away for providing an excellent internship opportunity with complete technical and management resources, which made sMAP possible.
We would also like to acknowledge the interns who shared ideas and discussions that brought us here.
- Austin (Assisted in Task Management)
- Marc (Logistic Help)
- Ananya (Participation in UI discussion)
How to cite us:
If you have questions or comments about sMAP, please email Samuel Bharti, email@example.com.
Our Git Page: BI-STEM-Away