HTSeqQC Quick Start

Goal

HTSeqQC is an automated quality control analysis tool for single-end and paired-end high-throughput sequencing (HTS) data generated on Illumina sequencing platforms.


Features

  1. Simultaneously filter and/or trim reads for adapter or primer contamination, uncalled bases (N), and low-quality reads
  2. Supports single and paired-end reads
  3. Analyze multiple samples simultaneously
  4. Parallel computation to speed up analysis
  5. Visualization and statistics
  6. No dependency on external open-source tools

Prerequisites

Downloads, access, and services

To complete this tutorial, you will need access to the following services/software:

| Prerequisite | Preparation/Notes | Link/Download |
|---|---|---|
| CyVerse account | You will need a CyVerse account to complete this exercise | Register: CyVerse User Portal |

Platform(s)

We will use the following CyVerse platform(s):

| Platform | Interface | Link | Platform Documentation | Quick Start |
|---|---|---|---|---|
| Data Store | GUI/Command line | Data Store | Data Store Manual | Data Store Guide |
| Discovery Environment | Web/Point-and-click | Discovery Environment | DE Manual | Discovery Environment Guide |

Input and example data

To complete this tutorial, you will need to have the following inputs prepared:

| Input File(s) | Format | Preparation/Notes | Example Data |
|---|---|---|---|
| HTS-generated sequence reads | FASTQ | Single and paired-end sequence reads generated from an Illumina sequencing platform | Test data is available in the Discovery Environment in the Data window under Community Data -> iplantcollaborative -> example_data -> htseqqc |
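
Before uploading, it can help to confirm that each input is a well-formed FASTQ file and that the left and right files of a paired-end sample contain the same number of reads. The following is a minimal sketch, not part of HTSeqQC: it assumes uncompressed FASTQ with exactly four lines per record, and the function names are illustrative only.

```python
from itertools import islice


def read_fastq(path):
    """Yield (header, sequence, quality) tuples from an uncompressed FASTQ file.

    Assumes the common four-line-per-record layout (no wrapped sequences).
    """
    with open(path) as handle:
        while True:
            record = [line.rstrip("\n") for line in islice(handle, 4)]
            if len(record) < 4:
                break  # end of file, or a truncated final record
            header, sequence, _plus, quality = record
            if not header.startswith("@") or len(sequence) != len(quality):
                raise ValueError(f"Malformed FASTQ record near {header!r}")
            yield header, sequence, quality


def count_reads(path):
    """Count the records in a FASTQ file."""
    return sum(1 for _ in read_fastq(path))


def same_read_count(left_path, right_path):
    """Return True if the left and right files hold the same number of reads."""
    return count_reads(left_path) == count_reads(right_path)
```

For a paired-end sample, `same_read_count("sample_R1.fastq", "sample_R2.fastq")` (file names here are placeholders) should return True before the pair is submitted.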

Get started:

Tip

If you have not already imported your own sequence read files to CyVerse, you can follow the instructions for uploading data, for example using Cyberduck, in our Data Store Guide.

  1. Log in to the Discovery Environment.

  2. Click Apps in the DE workspace and search for HTSeqQC. Click HTSeqQC to open the app.

  3. Under “Analysis Name”, keep the default name or change it as appropriate.

  4. Under “Input”, provide single-end (Input file 1 only) or paired-end (both Input file 1 and 2) data by clicking the “Add” button. You can provide one or more files for the analysis.

  5. Under “Parameter”, select the appropriate parameters for the analysis. See the parameter list below for details.

  6. Click Launch Analysis. You will receive a notification and may close the Apps window.

  7. Click Analyses in the DE workspace and monitor the status of your submitted job (you may have to click Refresh to see the updated status).

  8. In the Analysis console, once the status appears as ‘Completed’, click the name of your analysis to navigate to the results.
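
The settings chosen under “Parameter” in step 5 correspond to the command-line options listed under “HTSeqQC optional parameters”. As a rough illustration of that mapping, the sketch below assembles an argument list from the documented long option names. The executable name `htseqqc` and the helper `build_htseqqc_args` are hypothetical placeholders; this guide does not state how the tool is invoked outside the Discovery Environment.

```python
import shlex


def build_htseqqc_args(left_files, right_files=None, qual_thresh=20,
                       trim=False, window_size=5, cpus=2, out_fmt="fastq"):
    """Return an argument list built from the long options documented in this guide."""
    # Multiple sample files are joined with commas, as the option help describes.
    args = ["--p1", ",".join(left_files)]
    if right_files:
        if len(right_files) != len(left_files):
            raise ValueError("paired-end runs need one right file per left file")
        args += ["--p2", ",".join(right_files)]
    args += ["--qthr", str(qual_thresh), "--cpu", str(cpus), "--ofmt", out_fmt]
    if trim:
        # The window size is meant to be set whenever trimming is enabled.
        args += ["--trim", "True", "--wsz", str(window_size)]
    return args


EXECUTABLE = "htseqqc"  # hypothetical name, not confirmed by this guide
example_args = build_htseqqc_args(["s1_R1.fastq"], ["s1_R2.fastq"], trim=True)
print(shlex.join([EXECUTABLE, *example_args]))
```

Keeping the arguments as a list until the last moment avoids quoting problems with file names that contain spaces.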


HTSeqQC optional parameters:

-a INPUT_FILES_1, --p1 INPUT_FILES_1
                      Single end input files or left files for paired-end
                      data (.fastq, .fq). Multiple sample files must be
                      separated by comma or space
-b INPUT_FILES_2, --p2 INPUT_FILES_2
                      Right files for paired-end data (.fastq, .fq).
                      Multiple files must be separated by comma or space
-c QUAL_FMT, --qfmt QUAL_FMT
                      Quality value format [1= Illumina 1.8, 2= Illumina
                      1.3,3= Sanger]. If quality format not provided, it
                      will automatically detect based on sequence data
-e N_CONT, --nb N_CONT
                      Filter the reads containing given % of uncalled bases
                      (N)
-f ADPT_SEQS, --adp ADPT_SEQS
                      Trim the adapter and truncate the read sequence
                      (multiple adapter sequences must be separated by
                      comma)
-d MIN_SIZE, --msz MIN_SIZE
                      Filter the reads which are lesser than minimum size
-g ADPT_MATCH, --per ADPT_MATCH
                      Truncate the read sequence if it matches to adapter
                      sequence equal or more than given percent (0.0-1.0)
                      [default=0.9]
-i QUAL_THRESH, --qthr QUAL_THRESH
                      Filter the read sequence if average quality of bases
                      in reads is lower than threshold (1-40) [default:20]
-n TRIM_OPT, --trim TRIM_OPT
                      If trim option set to True, the reads with low quality
                      (as defined by option --qthr) will be trimmed instead
                      of discarding [True|False] [default: False]
-p WIND_SIZE, --wsz WIND_SIZE
                      The window size for trimming (5->3) the reads. This
                      option should always set when -trim option is defined
                      [default: 5]
-r MIN_LEN_FILT, --mlk MIN_LEN_FILT
                      Minimum length of the reads to retain after trimming
-q CPU, --cpu CPU     Number of CPU [default:2]
-m OUT_FMT, --ofmt OUT_FMT
                      Output file format (fastq/fasta) [default:fastq]
-v VIS_OPT, --no-vis VIS_OPT
                      No figures will be produced [True|False]
                      [default:False]
--version             show program's version number and exit
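
To make the filtering and trimming options more concrete, here is a minimal sketch of the kind of checks that --nb, --msz, --qthr, --trim, and --wsz describe. It is an illustration under stated assumptions, not HTSeqQC's actual implementation: it assumes Phred+33 quality encoding (used by Illumina 1.8 and Sanger; Illumina 1.3 uses an offset of 64), the `max_n_percent` and `min_size` defaults are made-up values, and the exact comparison rules and window behaviour of the real tool may differ.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer PHRED scores."""
    return [ord(char) - offset for char in quality]


def n_percent(sequence):
    """Percentage of uncalled bases (N) in a read."""
    if not sequence:
        return 0.0
    return 100.0 * sequence.upper().count("N") / len(sequence)


def passes_filters(sequence, quality, max_n_percent=5.0, min_size=30,
                   qual_thresh=20):
    """Return True if a read survives the size, N-content, and quality filters.

    max_n_percent and min_size are illustrative defaults (assumptions);
    qual_thresh=20 matches the documented --qthr default.
    """
    if not sequence or len(sequence) < min_size:
        return False
    if n_percent(sequence) >= max_n_percent:
        return False
    scores = phred_scores(quality)
    return sum(scores) / len(scores) >= qual_thresh


def trim_by_window(sequence, quality, window_size=5, qual_thresh=20):
    """Scan 5'->3' and cut the read at the first window whose mean quality
    falls below the threshold; return the read unchanged if none does."""
    scores = phred_scores(quality)
    for start in range(0, len(scores) - window_size + 1):
        window = scores[start:start + window_size]
        if sum(window) / window_size < qual_thresh:
            return sequence[:start], quality[:start]
    return sequence, quality
```

With trimming enabled, a read would first be passed through `trim_by_window` and then checked against a minimum retained length (the role of --mlk) rather than being discarded outright.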

Output:

HTSeqQC produces the filtered, cleaned HTS data as FASTQ or FASTA files, together with statistics and visualizations of the cleaned datasets. The output is saved in a folder whose name ends with filtering_out.

HTSeqQC produces the following output files and graphs with the default settings:

| Files | Description |
|---|---|
| Statistics.txt | Detailed statistics of quality control evaluations for provided parameters |
| 1_S1_L001_R1_001_Clean.fastq | Cleaned sequence data in FASTQ format (left file for paired-end data) |
| 1_S1_L001_R2_001_Clean.fastq | Cleaned sequence data in FASTQ format (right file for paired-end data) |
| 1_S1_L001_R1_001_Qualdist.png | Comparative sequence PHRED quality distribution for raw and cleaned data (for left file) |
| 1_S1_L001_R2_001_Qualdist.png | Comparative sequence PHRED quality distribution for raw and cleaned data (for right file) |
| 1_S1_L001_R1_001_GCdist.png | Comparative percentage of GC content distribution for raw and cleaned data (for left file) |
| 1_S1_L001_R2_001_GCdist.png | Comparative percentage of GC content distribution for raw and cleaned data (for right file) |
| 1_S1_L001_R1_001_QualGroup.png | Group-wise sequence PHRED quality distribution for cleaned data (for left file) |
| 1_S1_L001_R2_001_QualGroup.png | Group-wise sequence PHRED quality distribution for cleaned data (for right file) |
| 1_S1_L001_R1_001_Basedist.png | Nucleotide base content for raw and cleaned data (for left file) |
| 1_S1_L001_R2_001_Basedist.png | Nucleotide base content for raw and cleaned data (for right file) |
| Command.log | Commands used for the HTSeqQC analysis |
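
After downloading a results folder, the file names above can be grouped by input file. The sketch below is one way to do that. It assumes the naming pattern seen in the example (input file prefix followed by a suffix such as `_Clean.fastq`); the suffix table and the `group_outputs` helper are illustrative and not part of HTSeqQC.

```python
from collections import defaultdict
from pathlib import Path

# Suffixes taken from the example output listing; other runs may differ.
OUTPUT_SUFFIXES = {
    "_Clean.fastq": "cleaned reads",
    "_Qualdist.png": "quality distribution (raw vs cleaned)",
    "_GCdist.png": "GC content distribution (raw vs cleaned)",
    "_QualGroup.png": "group-wise quality distribution (cleaned)",
    "_Basedist.png": "nucleotide base content (raw vs cleaned)",
}


def group_outputs(output_dir):
    """Map each input-file prefix to {description: path} for recognised files."""
    grouped = defaultdict(dict)
    for path in sorted(Path(output_dir).iterdir()):
        for suffix, description in OUTPUT_SUFFIXES.items():
            if path.name.endswith(suffix):
                prefix = path.name[: -len(suffix)]
                grouped[prefix][description] = path
    return dict(grouped)
```

Files that do not match a known suffix, such as Statistics.txt and Command.log, are simply skipped by this helper.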

Figure: Detailed statistics of quality control evaluations (Statistics.txt)

Figure: Comparative sequence PHRED quality distribution for raw and cleaned data (1_S1_L001_R1_001_Qualdist.png and 1_S1_L001_R2_001_Qualdist.png)

Figure: Comparative percentage of GC content distribution for raw and cleaned data (1_S1_L001_R1_001_GCdist.png and 1_S1_L001_R2_001_GCdist.png)

Figure: Nucleotide base content for raw and cleaned data (1_S1_L001_R1_001_Basedist.png and 1_S1_L001_R2_001_Basedist.png)

Figure: Group-wise sequence PHRED quality distribution for cleaned data (1_S1_L001_R1_001_QualGroup.png and 1_S1_L001_R2_001_QualGroup.png)

If you have any questions, comments or recommendations, please email me at reneshbe@gmail.com
