Importing Data from the NCBI Sequence Read Archive (SRA) using the DE
Import data from the NCBI Sequence Read Archive (SRA) into your Data Store via the Discovery Environment
The NCBI Sequence Read Archive (SRA) is a repository for high-throughput sequencing data. You can import data from the SRA into your Data Store using the Discovery Environment SRA-Import App.
According to the SRA homepage: “Sequence Read Archive (SRA) makes biological sequence data available to the research community to enhance reproducibility and allow for new discoveries by comparing data sets. The SRA stores raw sequencing data and alignment information from high-throughput sequencing platforms, including Roche 454 GS System®, Illumina Genome Analyzer®, Applied Biosystems SOLiD System®, Helicos Heliscope®, Complete Genomics®, and Pacific Biosciences SMRT®.”
Downloads, access, and services
In order to complete this tutorial you will need access to the following services/software:
|CyVerse account||You will need a CyVerse account to complete this exercise||Register|
We will use the following CyVerse platform(s):
|Platform||Interface||Link||Platform Documentation||Quick Start|
|Discovery Environment||Web/Point-and-click||Discovery Environment||DE Manual||Guide|
Searching the SRA: Searching the SRA can be complicated. Often a paper or reference will specify the accession number(s) connected to a dataset. You can also search flexibly using a number of terms (such as the organism name) or filters (e.g. DNA vs. RNA). The SRA Help Manual provides several useful explanations. It is important to know that projects are organized and related at several levels; some important terms include:
- BioProject: A BioProject is a collection of biological data related to a single initiative, originating from a single organization or from a consortium of coordinating organizations; see for example BioProject 272719
- BioSample: A description of the source materials for a project
- Run: These are the actual sequencing runs (usually starting with SRR); see for example SRR1761506
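As a rough illustration of how these levels show up in accession identifiers, the sketch below classifies an accession string by its prefix. The prefix table is an assumption based on common NCBI/EBI/DDBJ naming conventions (this tutorial itself only states that runs usually start with SRR), and `classify_accession` is a hypothetical helper, not part of any NCBI or CyVerse tool.

```python
import re

# Assumed prefix conventions; only the "SRR" run prefix comes from this
# tutorial. The other prefixes are listed for illustration.
_PATTERNS = {
    "BioProject": re.compile(r"^PRJ(NA|EB|DB)\d+$"),
    "BioSample": re.compile(r"^SAM(N|EA?|D)\d+$"),
    "Run": re.compile(r"^[SED]RR\d+$"),
}


def classify_accession(accession: str) -> str:
    """Return the level an accession appears to belong to, or 'Unknown'."""
    acc = accession.strip().upper()
    for level, pattern in _PATTERNS.items():
        if pattern.match(acc):
            return level
    return "Unknown"


print(classify_accession("SRR1761506"))  # prints "Run"
```

A check like this can help catch the case where a project or sample identifier is pasted into a field that expects a run accession.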
- Obtain an SRA run accession number (starting with SRR, e.g. SRR1761506). If you do not have an accession, you can go to the SRA homepage and search using a variety of search terms and filters (e.g. DNA vs. RNA, exome vs. genome, etc.)
On the SRA page for each accession, you may wish to record some useful information about the run, including the sequencing format and the file size.
- Log in to the Discovery Environment and click on the NCBI-SRA-Fastq-dump-2.8.1 App, or click on Apps to search for and launch this App.
- Name your analysis and enter any desired comments
- Under “Inputs” enter the SRA accession run number (if you have already downloaded an SRA file you can use this App to decompress it into a fastq file - search for the file using the ‘Browse’ button)
Depending on the file size, the import may take several minutes or longer
- (optional) Under “optional parameters” check ‘Split files’ if your data are paired-end
The SRA page for your run should indicate ‘SINGLE’ or ‘PAIRED’ under Library Layout; https://www.ncbi.nlm.nih.gov/sra/?term=SRR1761506
- (optional) Under “Output” enter a custom name for ‘Sra output folder name’ or leave the default
- Click Launch Analysis
- To view the status of the import and obtain results click on the Analysis icon
- When the job status is marked ‘Completed’ in the Analysis window (you may have to refresh), click on the job name (e.g. ‘SRA-Import-0.1.0_analysis1’) to view the result in your Data Store
In addition to a folder of logs, you should have the following files:
- A compressed file (including sequence data and metadata) in the NCBI “.sra” format
- An output folder (default: ‘sra_out’) containing your fastq file (sequence data). If your data are paired-end and the ‘Split files’ option was checked, you will have two .fastq files (_1 for left reads, _2 for right reads).
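The relationship between Library Layout, the ‘Split files’ option, and the files you should expect can be sketched as below. This is a minimal sketch under two assumptions: that fastq files are named after the run accession, and that the `_1`/`_2` suffixes described above are appended directly before `.fastq`. The helpers `should_split`, `expected_fastq_names`, and `missing_outputs` are hypothetical names for illustration and are not part of the App.

```python
from typing import Iterable, List


def should_split(library_layout: str) -> bool:
    """Return True when the SRA 'Library Layout' field reads PAIRED."""
    return library_layout.strip().upper() == "PAIRED"


def expected_fastq_names(run_accession: str, split_files: bool) -> List[str]:
    """List the fastq files expected in the output folder (assumed naming)."""
    if split_files:
        return [f"{run_accession}_1.fastq", f"{run_accession}_2.fastq"]
    return [f"{run_accession}.fastq"]


def missing_outputs(found: Iterable[str], run_accession: str,
                    split_files: bool) -> List[str]:
    """Return the expected fastq names that are absent from a file listing."""
    present = set(found)
    return [name for name in expected_fastq_names(run_accession, split_files)
            if name not in present]


# Treating the layout as PAIRED purely for illustration.
split = should_split("PAIRED")
print(expected_fastq_names("SRR1761506", split))
# prints ['SRR1761506_1.fastq', 'SRR1761506_2.fastq']
```

Comparing the file names you see in the output folder against such a list is a quick way to notice that ‘Split files’ was left unchecked for a paired-end run.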
Some common next steps include:
- Using FastQC to check the quality of the sequence reads
- Using Trimmomatic to filter and trim reads for quality control
Both of these applications are available for use in the Discovery Environment. See the DE Apps catalog.