Importing Data from the NCBI Sequence Read Archive (SRA) using the DE

Goal

Import data from the NCBI Sequence Read Archive (SRA) into your Data Store via the Discovery Environment

The NCBI Sequence Read Archive (SRA) is a repository for high-throughput sequencing data. You can import data from the SRA into your Data Store using the Discovery Environment SRA-Import App.

Tip

According to the SRA homepage: “Sequence Read Archive (SRA) makes biological sequence data available to the research community to enhance reproducibility and allow for new discoveries by comparing data sets. The SRA stores raw sequencing data and alignment information from high-throughput sequencing platforms, including Roche 454 GS System®, Illumina Genome Analyzer®, Applied Biosystems SOLiD System®, Helicos Heliscope®, Complete Genomics®, and Pacific Biosciences SMRT®.”


Prerequisites

Downloads, access, and services

To complete this tutorial you will need access to the following services/software:

Prerequisite | Preparation/Notes | Link/Download
CyVerse account | You will need a CyVerse account to complete this exercise | Register

Platform(s)

We will use the following CyVerse platform(s):

Platform | Interface | Link | Platform Documentation | Quick Start
Discovery Environment | Web/Point-and-click | Discovery Environment | DE Manual | Guide

Input and example data

To complete this quickstart you will need to have the following inputs prepared:

Input File(s) | Format | Preparation/Notes | Example Data
SRA Accession number | N/A | We will cover how to search for an accession | In this example, we will download accession SRR1761506

Get started

Note

Searching the SRA: Searching the SRA can be complicated. Often a paper or reference will specify the accession number(s) connected to a dataset. You can also search flexibly using a number of terms (such as the organism name) or filters (e.g. DNA vs. RNA). The SRA Help Manual provides several useful explanations. It is important to know that projects are organized and related at several levels; some important terms include:

  • BioProject: A collection of biological data related to a single initiative, originating from a single organization or from a consortium of coordinating organizations; see for example BioProject 272719
  • BioSample: A description of the source materials for a project
  • Run: These are the actual sequencing runs (usually starting with SRR); see for example SRR1761506
  1. Obtain an SRA run accession number (starting with SRR). If you do not have an accession, you can go to the SRA homepage and search using a variety of search terms and filters (e.g. DNA vs. RNA, exome vs. genome, etc.).
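
Before pasting an accession into the App, a quick format check can catch copy-and-paste mistakes. The sketch below is not part of the tutorial's workflow. It assumes that run accessions consist of the prefix SRR, ERR, or DRR followed by digits, and it builds the SRA page URL in the same form as the link used later in this tutorial. The function names are hypothetical.

```python
import re

# Assumption: run accessions are SRR/ERR/DRR followed by digits (e.g. SRR1761506).
RUN_ACCESSION = re.compile(r"[SED]RR\d+")


def is_run_accession(text: str) -> bool:
    """Return True if text looks like a sequencing run accession."""
    return RUN_ACCESSION.fullmatch(text.strip()) is not None


def sra_page_url(accession: str) -> str:
    """Build the SRA search URL for a run accession."""
    accession = accession.strip()
    if not is_run_accession(accession):
        raise ValueError(f"not a run accession: {accession!r}")
    return f"https://www.ncbi.nlm.nih.gov/sra/?term={accession}"


if __name__ == "__main__":
    print(sra_page_url("SRR1761506"))
```

A BioProject number such as 272719 fails this check, which is the intent: the App input described below expects a run accession rather than a project-level identifier.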

Note

On the SRA page for each accession, you may wish to record some useful information about the run, including the sequencing format and the file size.

  1. Log in to the Discovery Environment and click on the NCBI-SRA-Fastq-dump-2.8.1 App, or click on Apps to search for and launch this App.
  2. Name your analysis and enter any desired comments
  3. Under “Inputs” enter the SRA run accession number. (If you have already downloaded an SRA file, you can use this App to decompress it into a fastq file; search for the file using the ‘Browse’ button.)

Tip

Depending on the file size, the import may take several minutes.

  1. (optional) Under “optional parameters” check ‘Split files’ if your data are paired-end

Tip

The SRA page for your run should indicate ‘SINGLE’ or ‘PAIRED’ under Library Layout; https://www.ncbi.nlm.nih.gov/sra/?term=SRR1761506
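
For readers curious about what the ‘Split files’ choice corresponds to outside the DE: the App name suggests it wraps the SRA Toolkit's fastq-dump, whose --split-files option writes paired reads to separate files. The sketch below only assembles such a command line and does not run it. That the App runs exactly this command is an assumption, and the helper name is hypothetical.

```python
import shlex


def fastq_dump_command(accession, layout, outdir="sra_out"):
    """Assemble (but do not run) a fastq-dump command line.

    layout is the Library Layout shown on the SRA page: 'SINGLE' or 'PAIRED'.
    """
    layout = layout.strip().upper()
    if layout not in ("SINGLE", "PAIRED"):
        raise ValueError(f"unexpected library layout: {layout!r}")
    command = ["fastq-dump", "--outdir", outdir]
    if layout == "PAIRED":
        # Write each mate of a pair to its own file (_1 and _2).
        command.append("--split-files")
    command.append(accession)
    return command


if __name__ == "__main__":
    # The layout passed here is only an illustration; check the SRA page for your run.
    print(shlex.join(fastq_dump_command("SRR1761506", "PAIRED")))
```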

  1. (optional) Under “Output” enter a custom name for ‘Sra output folder name’ or leave the default
  2. Click Launch Analysis
  3. To view the status of the import and obtain results click on the Analysis icon
  4. When the job status is marked ‘Completed’ in the Analysis window (you may have to refresh), click on the job name (e.g. ‘SRA-Import-0.1.0_analysis1’) to view the result in your data store

Summary

In addition to a folder of logs, you should have the following files:

  • A compressed file (including sequence data and metadata) in the NCBI “.sra” format
  • An output folder (default: ‘sra_out’) containing your fastq file (sequence data). If your data are paired-end and the ‘Split files’ option was checked, you will have two .fastq files (_1 for left reads, _2 for right reads).
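
If you download the results to a machine with Python, a simple sanity check on paired-end output is to confirm that the _1 and _2 files contain the same number of reads. This is a minimal sketch, assuming uncompressed FASTQ files with exactly four lines per read. The file paths in the usage section are hypothetical and depend on your accession and output folder name.

```python
from pathlib import Path


def count_fastq_reads(path):
    """Count reads in an uncompressed FASTQ file, assuming 4 lines per read."""
    with open(path, "r") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def paired_files_match(left, right):
    """Return True if the _1 and _2 files contain the same number of reads."""
    return count_fastq_reads(left) == count_fastq_reads(right)


if __name__ == "__main__":
    # Hypothetical paths; adjust to wherever the results were downloaded.
    folder = Path("sra_out")
    left = folder / "SRR1761506_1.fastq"
    right = folder / "SRR1761506_2.fastq"
    if left.exists() and right.exists():
        print("read counts match:", paired_files_match(left, right))
```

A mismatch or a line count that is not a multiple of four usually points to an incomplete transfer rather than a problem with the data itself, so re-downloading the files is a reasonable first step.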

Next Steps:

Some common next steps include:

  1. Using FastQC to check the quality of the sequence reads
  2. Using Trimmomatic to filter and trim reads for quality control

Both of these applications are available for use in the Discovery Environment. See the DE Apps catalog.


Additional information, help

Search for an answer: CyVerse Learning Center or CyVerse Wiki

Post your question to the user forum: Ask CyVerse


Send a note: Tutorials@CyVerse.org
