The Association Tutorial
A genome-wide association study (GWAS) examines a genome-wide set of genetic variants in a population of individuals to determine whether any variant is associated with a trait. In this section, we will walk through the Association workflow available on the ‘Public workflows’ page. The workflow contains the following operations, in order: intersecting marker data with trait data, imputing missing markers, estimating population structure with PCA, and performing mixed-model-based analyses with two different tools (EMMAX and TASSEL MLM). The example data used here are a subset of published data (Morris, G., et al. PNAS 110.2 (2013): 453-458).
We will start by loading the workflow and visualizing its results, then create a new workflow that uses the EMMAX method, and finally add analyses that estimate population structure and perform an association analysis.
| Example data | Description | File name |
|---|---|---|
| Marker file (Sorghum bicolor v3 chromosome 9) | Marker data in TASSEL Hapmap format | marker.v3.txt.gz |
| Trait file | Trait data in TASSEL trait format | trait.txt |
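The workflow's first operation intersects the marker data with the trait data, so that only samples present in both files are carried forward. The sketch below illustrates that idea only. It assumes the sample IDs and trait values have already been read from the two files; the function name and the sample IDs are hypothetical and are not part of MergeG2P.

```python
def intersect_samples(marker_samples, trait_values):
    """Keep only samples that have both genotype and trait data.

    marker_samples: list of sample IDs in marker-file column order.
    trait_values: dict mapping sample ID -> trait value (None if missing).
    Returns (kept_ids, kept_traits) in marker-file order.
    """
    kept_ids = [s for s in marker_samples
                if trait_values.get(s) is not None]
    kept_traits = [trait_values[s] for s in kept_ids]
    return kept_ids, kept_traits


# Hypothetical sample IDs and trait values, for illustration only.
marker_samples = ["S1", "S2", "S3", "S4"]
trait_values = {"S2": 101.5, "S3": None, "S4": 88.0, "S5": 120.3}

ids, traits = intersect_samples(marker_samples, trait_values)
print(ids)     # ['S2', 'S4']
print(traits)  # [101.5, 88.0]
```

Samples that appear in only one file (S1, S5) or that lack a trait value (S3) are dropped, which is why downstream apps see fewer individuals than either input file lists.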
| App name | Version | Description | App link | Notes/other links |
|---|---|---|---|---|
| MergeG2P | 0.0.3 | Intersect marker data with trait data | MergeG2P-0.0.3 | |
| NPUTE | 0.0.3 | Imputes missing markers via voting from K-nearest-neighbors (KNN) | NPUTE-0.0.3 | NPUTE documentation |
| NumericalTransform | 0.0.3 | Numerical Transform of marker data using TASSEL and PLINK | NumericalTransform-0.0.3 | |
| MLM | 0.0.3 | Mixed Linear Model analysis using TASSEL | MLM-0.0.3 | TASSEL documentation |
| EMMAX | 0.0.4 | Association mapping with consideration of sample structure | EMMAX-0.0.4 | EMMAX documentation |
| MLMM | 0.0.3 | An efficient multi-locus mixed-model approach for GWAS | MLMM-0.0.3 | MLMM documentation |
| PCA | 0.0.3 | Principal Component Analysis | PCA-0.0.3 | PCA documentation |
| STRUCTURE | 220.127.116.11 | Parallelized STRUCTURE software for estimating population structures | STRUCTURE-18.104.22.168 | STRUCTURE documentation |
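The table describes NPUTE as imputing missing markers "via voting from K-nearest-neighbors". The sketch below shows the general idea of KNN voting on a toy genotype table. It is not NPUTE's implementation: the missing-call symbol `N`, the mismatch-fraction distance, and all names are assumptions made for illustration.

```python
from collections import Counter

MISSING = "N"  # assumed missing-call symbol for this sketch


def distance(a, b):
    """Mismatch fraction over sites called in both samples."""
    shared = [(x, y) for x, y in zip(a, b) if x != MISSING and y != MISSING]
    if not shared:
        return 1.0
    return sum(x != y for x, y in shared) / len(shared)


def knn_impute(genotypes, k=3):
    """Fill missing calls by majority vote among the k nearest samples.

    genotypes: dict sample ID -> list of calls, one per marker.
    Returns a new dict; the input is left unchanged.
    """
    imputed = {s: list(calls) for s, calls in genotypes.items()}
    for sample, calls in genotypes.items():
        # Rank the other samples by distance to this one.
        neighbors = sorted((o for o in genotypes if o != sample),
                           key=lambda o: distance(calls, genotypes[o]))
        for i, call in enumerate(calls):
            if call != MISSING:
                continue
            # Vote among the k nearest neighbors that have a call here.
            votes = [genotypes[o][i] for o in neighbors
                     if genotypes[o][i] != MISSING][:k]
            if votes:
                imputed[sample][i] = Counter(votes).most_common(1)[0][0]
    return imputed


# Hypothetical genotypes, for illustration only.
genotypes = {
    "S1": ["A", "C", "N", "T"],
    "S2": ["A", "C", "G", "T"],
    "S3": ["A", "C", "G", "T"],
    "S4": ["G", "T", "A", "C"],
}
print(knn_impute(genotypes, k=2)["S1"])  # ['A', 'C', 'G', 'T']
```

S1's missing third marker is filled from S2 and S3, its two closest samples, rather than from the dissimilar S4.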
Step 1: Importing the Workflow
This step will demonstrate how to import the Association workflow into your own workspace.
Log into SciApps at https://www.SciApps.org.
Click ‘Workflow’ (from the top navigation bar), then ‘Public workflows’ to load the public workflow page in the main panel.
Check the ‘Association’ workflow, then click “Relaunch”. The App forms will be loaded in the main panel, and analysis histories are loaded in the right panel.
To load analysis histories into the right panel while keeping the workflow list in the main panel, click ‘Load’. This is useful for checking the results of multiple workflows, or for building a new workflow from more than one workflow.
Alternatively, click ‘Visualize’ to display the workflow diagram. Analysis histories will also be loaded in the right panel.
Click the output node of the workflow diagram to go to the output folder. To check a specific output, click the output name in the History panel.
Step 2: Visualizing the Results
This step walks through how to visualize the results of EMMAX and PCA. You can check other results with similar operations.
Once the workflow is loaded, in the History panel, click the Visualization icon for EMMAX-0.0.4 to bring up its outputs.
Select manhattan_plot.view.tgz from the list of outputs, then click Visualize to open the Manhattan plot of the results. You can also view the Q-Q plot, and click the Manhattan plot to list genes near the clicked position.
The Manhattan plot is displayed in a new window, so please check whether pop-ups from SciApps are blocked by your web browser.
The example here uses chromosome 9 only, and the Manhattan plot is pre-configured to display chromosome 9 of sorghum (BTx623). For your own data, use the options on the left side to select a specific chromosome or all chromosomes of your genome.
Use the options in the left panel to adjust P-values, specify the species, chromosome, and neighbouring window size, and display the Q-Q plot.
Both the Manhattan plot and the Q-Q plot respond interactively to all of these options.
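To make the P-value options more concrete, the sketch below computes a multiple-testing cutoff and the point pairs behind a Q-Q plot. The tutorial does not say which adjustments the viewer offers, so the Bonferroni correction here is an assumption chosen as a common example; the function names and P-values are hypothetical.

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance cutoff after Bonferroni correction."""
    return alpha / n_tests


def qq_points(p_values):
    """Return (expected, observed) -log10(p) pairs for a Q-Q plot.

    Under the null, the i-th smallest of n p-values is expected near
    (i - 0.5) / n, a common plotting-position convention.
    """
    n = len(p_values)
    observed = sorted(p_values)
    return [(-math.log10((i - 0.5) / n), -math.log10(p))
            for i, p in enumerate(observed, start=1)]


# Hypothetical p-values, for illustration only.
p_values = [0.5, 0.001, 0.2, 0.04]
cutoff = bonferroni_threshold(0.05, len(p_values))
significant = [p for p in p_values if p < cutoff]
print(cutoff)        # 0.0125
print(significant)   # [0.001]
```

Points that rise well above the diagonal in the upper tail of a Q-Q plot correspond to SNPs whose observed P-values are smaller than expected by chance.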
Click on the most significant SNP to bring up the table of nearby genes. Then type 229800 in the search box (above the table) to locate a dwarf gene, dw1 (SORBI_009G229800).
From the left panel, you can increase the ‘window size’ to list more nearby genes.
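The effect of the ‘window size’ option can be sketched as an interval query around the clicked SNP. How the viewer actually defines the window is not stated in this tutorial; the sketch assumes a window in base pairs centered on the SNP, and the gene names and coordinates are made up.

```python
def genes_near(snp_pos, genes, window):
    """List genes overlapping [snp_pos - window, snp_pos + window].

    genes: iterable of (gene_id, start, end) on the same chromosome.
    """
    lo, hi = snp_pos - window, snp_pos + window
    return [gid for gid, start, end in genes if start <= hi and end >= lo]


# Hypothetical gene coordinates, for illustration only.
genes = [("geneA", 1_000, 2_000),
         ("geneB", 5_500, 6_200),
         ("geneC", 9_000, 9_800)]
print(genes_near(5_000, genes, window=1_000))  # ['geneB']
print(genes_near(5_000, genes, window=4_000))  # ['geneA', 'geneB', 'geneC']
```

A larger window returns more candidate genes, at the cost of including genes that are less likely to be linked to the significant SNP.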
To visualize PCA outputs, click the Visualization icon for PCA-0.0.3 to bring up its outputs. Select an image file from the list of outputs, then click Visualize to open the image in a web browser. There are two image outputs: pcplot and scree plot.
Step 3: Creating a New Workflow
This step demonstrates how to build a workflow from the loaded history. Assuming we only want to use EMMAX for association analysis, the workflow diagram above shows that we will also need MergeG2P, NPUTE, and NumericalTransform when given new marker data and/or new trait data.
Check the checkboxes for step 1 (MergeG2P), 2 (NPUTE), 3 (NumericalTransform), and 5 (EMMAX) in the History panel, then click the ‘build a workflow’ link to load the Workflow building page.
History panel checkboxes and the workflow building page are interactive. Use the ‘Select All’ or ‘Reset’ button to simplify the selection step.
Click the ‘Build Workflow’ button to visualize the workflow diagram.
On the ‘Workflow Diagram’, you can also save the workflow. Your saved workflows will appear in ‘My Workflows’ (under the ‘Workflow’ menu in the top navigation bar).
From ‘My Workflows’, you can load the new workflow to run it, or share it through a direct URL that gives others access to the entire analysis.
Step 4: Adding New Analyses to the Workflow
This step shows how to add new analyses to the workflow built above. We will use STRUCTURE instead of PCA to estimate population structure, then pass the estimate to MLM to perform the association analysis.
Click the Clustering category (left panel) or search for structure, then click STRUCTURE-22.214.171.124 to load STRUCTURE-126.96.36.199.
Click NumericalTransform-0.0.3 in the History panel to expand its outputs, then drag and drop nt1_marker.txt.gz into the Select marker file field.
If the input field is not empty, be sure to clear it before dragging and dropping new input there.
Enter 12063 for ‘number of loci’ and 310 for ‘number of individuals’, leave the others as defaults, then click the “Submit Job” button. Once the job has completed, click to expand its outputs.
These numbers are in the nt5.log file from the NumericalTransform-0.0.3 job.
Click the Mapping category, then MLM-0.0.3 to load the App form. Drag and drop mt1_trait.txt (MergeG2P-0.0.3) to Input Trait Data, npt_mm_marker.v3.txt.gz (NPUTE-0.0.3) to Input Marker Data, and s3_f (STRUCTURE-188.8.131.52) to Input Structure Data.
Leave others as defaults, then click the “Submit Job” button. Once completed, select all jobs to build a new workflow. The workflow can be re-run or shared as described before.
As in Step 2, you can visualize the Manhattan plot of MLM outputs and compare it with that of EMMAX or MLM (when PCA is used for estimating population structure).
Similarly, MLMM-0.0.3 can be used for multi-locus mixed-model testing: load the MLMM-0.0.3 App form, then drag and drop mt2_trait.txt (MergeG2P-0.0.3) to Input Trait Data and nt2mlmm.txt.gz (NumericalTransform-0.0.3) to Input Marker Data.
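The drag-and-drop wiring in this step amounts to a dependency graph between apps. The sketch below records which app consumes whose outputs and derives one valid run order with Python's standard `graphlib`. The edges into STRUCTURE, MLM, and MLMM follow the inputs described above; the two edges marked "assumed" are guesses about upstream inputs that this tutorial does not spell out. STRUCTURE is written without a version string, and nothing here reflects how SciApps stores workflows internally.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each app maps to the set of apps whose outputs it consumes.
dependencies = {
    "MergeG2P-0.0.3": set(),
    "NPUTE-0.0.3": {"MergeG2P-0.0.3"},             # assumed
    "NumericalTransform-0.0.3": {"NPUTE-0.0.3"},   # assumed
    "STRUCTURE": {"NumericalTransform-0.0.3"},
    "MLM-0.0.3": {"MergeG2P-0.0.3", "NPUTE-0.0.3", "STRUCTURE"},
    "MLMM-0.0.3": {"MergeG2P-0.0.3", "NumericalTransform-0.0.3"},
}

# static_order() yields every app after all of its prerequisites.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Any order produced this way runs STRUCTURE before MLM, which matches the requirement that the STRUCTURE job complete before its output is dropped into the MLM form.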
As shown in this section, various customized Association workflows can be constructed on SciApps. The interactive Manhattan plot provides an easy way to examine nearby genes annotated around significant loci.