Walkthrough of DNA Subway Green Line¶
The Green line runs within CyVerse DNA Subway and was developed leverages powerful computing and data storage infrastructure and uses the Stampede supercomputer cluster to provide a high performance analytical platform with a simple user interface suitable for both teaching and research.
Discontinuing support for Tuxedo workflow
The Tuxedo workflow previously implemented for the Green Line will be removed in June 2019. After that time you will no longer be able to use that workflow to analyze your data. Your data and previously analyzed results will still be available on the CyVerse Data Store.
Until then, you can still view and use the Tuexdo workflow by toggling between Kallisto and Tuexdo by selecting the Workflow button in the Project Information menu at the bottom of the Green Line page. See the next page on the new Kallisto/Sleuth Green Line workflow.
Some things to remember about the platform
- You must be a registered CyVerse user to use Green Line
- The Green line was designed to make RNA-Seq data analysis “simple”. However, we ask that users thoughtfully decide what “jobs” they want to submit.
- A single Green Line project may take a week to process since HPC computing is subject to queues which hundreds of other jobs may be staging for. Additionally these systems undergo regular maintenance and are subject to periodic disruption.
- DNA Subway implements the Tuxedo Protocol; RNA-Seq is a rapidly evolving method, and we anticipate upgrades to newer more efficient protocols. The important concepts behind RNA-Seq are still embodied in the current Subway architecture.
DNA Subway Green Line - Create an RNA-Seq Project to Examine Differential Expression¶
A. Create a project in Subway
Log-in to DNA Subway - unregistered users may NOT use Green Line.
Click on the Green “Next Generation Sequencing” square to start a Green Line project.
For ‘Select Project Type’ select either Single End Reads or Paired End Reads
For ‘Select an Organism’ select a species and genome build.
If you don’t see a desired species/genome contact us to have it added
Enter a project title, and description; click ‘Continue’
B. Upload Read Data to CyVerse Data Store The sequence read files used in these experiments are too large to upload using the Subway internet interface. You must upload your files (either .fastq or .fastq.gz) directly to the CyVerse Data Store.
- Upload your reads to the CyVerse Data Store using Cyberduck. See instructions: CyVerse Data Store Guide
DNA Subway Green Line - Manage Data and Check Quality with FASTQC¶
A. Select and pair files
- Click on the “Manage Data” stop: this opens a Data store window that says “Select your FASTQ files from the Data Store” (if you are not logged in to CyVerse, it will ask you to do so)
- Click on the folder that matches your CyVerse username and Navigate to the folder where your sequencing files are located.
- Select the sequencing files you want to analyze (either .fastq or .fastq.gz format).
- If working with paired-end reads, click the ‘Pair Mode’ button to toggle to on; check each pair of sequencing files to pair them.
B. Check sequencing quality with FastQC It is important to only work with high quality data. FastQC is a popular tool for determining sequencing quality.
- Once files have been loaded, in the ‘Manage Data’ window, click the ‘Run’ link in the ‘QC’ column to run FastQC.
- One the job is complete, click the ‘View’ link to view repeat_results
DNA Subway Green Line - Trim and Filter Reads with FastX Toolkit¶
Raw reads are first “quality trimmed” (remove poor quality bases off the end(s) of a read) and then are “quality filtered” (filter out entire poor quality reads) prior to aligning to the genome. After trimming and filtering, FastQC is run on the trimmed/filtered files.
Click “FastX ToolKit” to open the FastX Toolkit panel for all your data.
For each file, under ‘Basic’, Click ‘Run’ to filter the reads using default parameters or click ‘Advanced’ to run with desired parameters; repeat this process for all the FASTQ files in your dataset.
Once the job completes, click the ‘View’ link to view a generated FastQC report.
- Starting at this step, DNA Subway results are labeled with a job ID (e.g. fx####). These jobs are available in a ‘DNASubway’ folder in the home directory of your CyVerse Data Store.
- Starting at this step, you may see ‘Basic’ and ‘Advanced’ options for analyses. Clicking the ‘Advanced’ option allows you to set parameters. The Parameters shown in the ‘Advanced’ step are the defaults used in the ‘Basic’ option.
Since you may trim reads multiple times to achieve the desired quality of data record the job IDs (e.g. fx####) that you wish to use in the subsequent steps.
DNA Subway Green Line - Map Reads to Genome with TopHat¶
TopHat is the first component of the Tuxedo Protocol. This program aligns individual RNA-Seq reads to a previously assembled “reference” genome using a component program called Bowtie. TopHat then uses information from the newly mapped reads to determine what the intron/exon boundaries are for mapped transcripts, determining their splice sites.
- For each file, under ‘Basic’, Click ‘Run’ to begin the alignment using default parameters. (The reads will be aligned to the reference genome you selected when you created your project)
- Repeat this process for all the FASTQ files in your dataset.
- Click ‘TopHat’ to open the TopHat panel for all your data.
- Under ‘Advanced’ Click ‘Run’.
- Set the parameters as desired; Click ‘Submit’ to begin the alignment using default parameters. (The reads will be aligned to the reference genome you selected when you created your project).
- Repeat this process for all the FASTQ files in your dataset.
We generally recommend selecting the ‘No novel junctions’ option unless you have very high-coverage data (e.g. >100 million reads for a ~3Gb genome).
When this step completed you can view the summary mapping statistics, or view the aligned reads using the Integrated Genome Viewer (IGV).
DNA Subway Green Line - Examine Differential Expression with CuffDiff¶
Cuffdiff uses the Cufflinks output (and/or or reference genome) to calculate gene and transcript expression levels in one or more condition and tests them for significant differences. Depending on how many replicates and conditions you have, you may ultimately create several Cuffdiff jobs to test your desired combinations.
- Click ‘Cuffdiff’ to open the Cuffdiff panel for all your data.
- Under ‘Assign TopHat alignment files to samples and replicates’ assign all of your samples (e.g. wild type, time point 1, control, etc.) to a grouping (e.g. ‘Sample 1’, ‘Sample 2’, etc.)
- For each sample, select from the drop-down menu the TopHat job (previously TopHat mapped reads) and their replicates that belong with that sample group. (you may need to review the TopHat job names from the TopHat step).
- Either click ‘Submit’ (Basic) to run with default parameters, or use the ‘Advanced’ link to adjust parameters.
For the result you wish to examine, click the graph icon to view a collection of graphs that illustrate differences in expression between samples. You can also view a table of the results, including expression levels and comparison for each annotated gene.