HTSeqQC Quick Start¶
HTSeqQC is an automated quality control analysis tool for single-end and paired-end high-throughput sequencing (HTS) data generated on Illumina sequencing platforms.
- Simultaneously filters and/or trims reads for adapter or primer contamination, uncalled bases (N), and low quality
- Supports single-end and paired-end reads
- Analyzes multiple samples simultaneously
- Uses parallel computation to speed up analysis
- Produces visualizations and statistics
- Does not depend on external open-source tools
Downloads, access, and services¶
To complete this tutorial, you will need access to the following services and software:
|CyVerse account||You will need a CyVerse account to complete this exercise||Register: CyVerse User Portal|
We will use the following CyVerse platform(s):
|Platform||Interface||Link||Platform Documentation||Quick Start|
|Data Store||GUI/Command line||Data Store||Data Store Manual||Data Store Guide|
|Discovery Environment||Web/Point-and-click||Discovery Environment||DE Manual||Discovery Environment Guide|
Input and example data¶
To complete this tutorial, you will need to have the following inputs prepared:
|Input File(s)||Format||Preparation/Notes||Example Data|
|HTS generated Sequence reads||FASTQ||Single and Paired-end sequence reads generated from Illumina sequencing platform||Test data for this is available in the Discovery Environment in the Data window under Community Data -> iplantcollaborative -> example_data -> htseqqc|
If you have not already imported your own sequence read files to CyVerse, you can follow the instructions for uploading data, for example using Cyberduck, in our Data Store Guide.
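For paired-end data, the left and right files have to be supplied in matching order. The following is a minimal sketch for sorting a list of FASTQ names into left, right, and single-end groups before selecting them as inputs. It assumes Illumina-style file names in which `_R1_` and `_R2_` mark the two mates, as in the example output names in this guide; the function name `pair_fastq_files` is hypothetical and not part of HTSeqQC.

```python
import re
from collections import defaultdict

# Assumption: the mate is marked by "_R1_" or "_R2_" in the file name.
MATE_PATTERN = re.compile(r"_R([12])_")


def pair_fastq_files(filenames):
    """Split FASTQ names into (left, right, single) lists.

    left[i] and right[i] belong to the same sample.
    """
    samples = defaultdict(dict)
    for name in filenames:
        match = MATE_PATTERN.search(name)
        if match is None:
            # No mate marker found: treat the file as single-end.
            samples[name]["1"] = name
            continue
        # Build a sample key by masking the mate number.
        key = name[:match.start()] + "_R*_" + name[match.end():]
        samples[key][match.group(1)] = name

    left, right, single = [], [], []
    for key in sorted(samples):
        mates = samples[key]
        if "1" in mates and "2" in mates:
            left.append(mates["1"])
            right.append(mates["2"])
        else:
            # Only one mate present: report it as single-end.
            single.extend(mates.values())
    return left, right, single


if __name__ == "__main__":
    files = ["1_S1_L001_R2_001.fastq", "1_S1_L001_R1_001.fastq"]
    print(pair_fastq_files(files))
```

The `left` list corresponds to Input file 1 and the `right` list to Input file 2. Files that use a different naming scheme would need a different pattern.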
Log in to the Discovery Environment.
Click Apps in the DE workspace and search for HTSeqQC, then click HTSeqQC to open the app.
Under “Analysis Name”, keep the default name or change it as appropriate.
Under “Input”, provide single-end data (Input file 1 only) or paired-end data (both Input file 1 and Input file 2) by clicking the “Add” button. You can provide a single file or multiple files for the analysis.
Under “Parameter”, select the appropriate parameters for the analysis. See the detailed parameter list below.
Click Launch Analysis. You will receive a notification and may close the Apps window.
Click Analyses in the DE workspace and monitor the status of your submitted job (you may have to click refresh to see the updated status).
In the Analysis console, once the status appears as ‘Completed’, click the name of your analysis to navigate to the results.
HTSeqQC optional parameters:¶
|-a INPUT_FILES_1, --p1 INPUT_FILES_1||Single end input files or left files for paired-end data (.fastq, .fq). Multiple sample files must be separated by comma or space|
|-b INPUT_FILES_2, --p2 INPUT_FILES_2||Right files for paired-end data (.fastq, .fq). Multiple files must be separated by comma or space|
|-c QUAL_FMT, --qfmt QUAL_FMT||Quality value format [1= Illumina 1.8, 2= Illumina 1.3, 3= Sanger]. If quality format not provided, it will automatically detect based on sequence data|
|-e N_CONT, --nb N_CONT||Filter the reads containing given % of uncalled bases (N)|
|-f ADPT_SEQS, --adp ADPT_SEQS||Trim the adapter and truncate the read sequence (multiple adapter sequences must be separated by comma)|
|-d MIN_SIZE, --msz MIN_SIZE||Filter the reads which are lesser than minimum size|
|-g ADPT_MATCH, --per ADPT_MATCH||Truncate the read sequence if it matches to adapter sequence equal or more than given percent (0.0-1.0) [default=0.9]|
|-i QUAL_THRESH, --qthr QUAL_THRESH||Filter the read sequence if average quality of bases in reads is lower than threshold (1-40) [default:20]|
|-n TRIM_OPT, --trim TRIM_OPT||If trim option set to True, the reads with low quality (as defined by option --qthr) will be trimmed instead of discarding [True|False] [default: False]|
|-p WIND_SIZE, --wsz WIND_SIZE||The window size for trimming (5->3) the reads. This option should always set when -trim option is defined [default: 5]|
|-r MIN_LEN_FILT, --mlk MIN_LEN_FILT||Minimum length of the reads to retain after trimming|
|-q CPU, --cpu CPU||Number of CPU [default:2]|
|-m OUT_FMT, --ofmt OUT_FMT||Output file format (fastq/fasta) [default:fastq]|
|-v VIS_OPT, --no-vis VIS_OPT||No figures will be produced [True|False] [default:False]|
|--version||show program's version number and exit|
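To make the filtering options more concrete, here is a small illustrative sketch of what the `--nb`, `--qthr`, and `--msz` thresholds measure on a single read. It is not HTSeqQC's implementation. The function names are hypothetical, and the exact comparison used for `--nb` (a read is discarded when its N percentage reaches the threshold) is an assumption. The Phred offsets follow the standard encodings: Illumina 1.8 and Sanger use Phred+33, Illumina 1.3 uses Phred+64.

```python
# Phred offsets for the --qfmt codes: 1 = Illumina 1.8, 2 = Illumina 1.3, 3 = Sanger.
QUAL_OFFSETS = {1: 33, 2: 64, 3: 33}


def n_percent(seq):
    """Percentage of uncalled bases (N) in a read sequence."""
    if not seq:
        return 0.0
    return 100.0 * seq.upper().count("N") / len(seq)


def mean_quality(qual, offset=33):
    """Average Phred score of a FASTQ quality string."""
    if not qual:
        return 0.0
    return sum(ord(ch) - offset for ch in qual) / len(qual)


def keep_read(seq, qual, qual_fmt=1, max_n_percent=None,
              min_quality=20, min_size=None):
    """Return True if the read passes all enabled filters.

    Assumed semantics (illustration only):
      min_size      -> discard reads shorter than this length (--msz)
      max_n_percent -> discard reads whose N percentage is >= this value (--nb)
      min_quality   -> discard reads whose mean quality is below this value (--qthr)
    """
    if min_size is not None and len(seq) < min_size:
        return False
    if max_n_percent is not None and n_percent(seq) >= max_n_percent:
        return False
    return mean_quality(qual, QUAL_OFFSETS[qual_fmt]) >= min_quality


if __name__ == "__main__":
    print(keep_read("ACGTACGT", "IIIIIIII"))   # high-quality read
    print(keep_read("ACGTACGT", "########"))   # low-quality read
```

With `--trim` set to True, HTSeqQC trims low-quality reads instead of discarding them; that window-based step is not modeled here.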
HTSeqQC produces the filtered, cleaned HTS data as FASTQ or FASTA files, together with statistics and visualizations of the cleaned datasets. The output is saved in a folder whose name ends with filtering_out.
With default settings, HTSeqQC produces the following output files and graphs:
|Statistics.txt||Detailed statistics of quality control evaluations for provided parameters|
|1_S1_L001_R1_001_Clean.fastq||Cleaned sequence data in FASTQ format (left file for paired-end data)|
|1_S1_L001_R2_001_Clean.fastq||Cleaned sequence data in FASTQ format (right file for paired-end data)|
|1_S1_L001_R1_001_Qualdist.png||Comparative sequence PHRED quality distribution for raw and cleaned data (for left file)|
|1_S1_L001_R2_001_Qualdist.png||Comparative sequence PHRED quality distribution for raw and cleaned data (for right file)|
|1_S1_L001_R1_001_GCdist.png||Comparative percentage of GC content distribution for raw and cleaned data (for left file)|
|1_S1_L001_R2_001_GCdist.png||Comparative percentage of GC content distribution for raw and cleaned data (for right file)|
|1_S1_L001_R1_001_QualGroup.png||Group-wise sequence PHRED quality distribution for cleaned data (for left file)|
|1_S1_L001_R2_001_QualGroup.png||Group-wise sequence PHRED quality distribution for cleaned data (for right file)|
|1_S1_L001_R1_001_Basedist.png||Nucleotide base content for raw and cleaned data (for left file)|
|1_S1_L001_R2_001_Basedist.png||Nucleotide base content for raw and cleaned data (for right file)|
|Command.log||Commands used for the HTSeqQC analysis|
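When checking results for many samples, it can help to derive the expected per-sample output names from each input file name. The sketch below follows the naming pattern seen in the table above (input name without its extension, followed by `_Clean`, `_Qualdist`, `_GCdist`, `_QualGroup`, or `_Basedist`). The pattern is inferred from the example names only, the `.fasta` extension for FASTA output is an assumption, and the function `expected_outputs` is hypothetical.

```python
from pathlib import Path

# Plot name suffixes taken from the example output table.
PLOT_SUFFIXES = ("Qualdist", "GCdist", "QualGroup", "Basedist")


def expected_outputs(input_fastq, out_fmt="fastq"):
    """List the per-sample output names expected for one input file."""
    name = Path(input_fastq).name
    # Strip a known FASTQ extension; fall back to the generic stem.
    for ext in (".fastq", ".fq"):
        if name.endswith(ext):
            stem = name[:-len(ext)]
            break
    else:
        stem = Path(name).stem
    files = [f"{stem}_Clean.{out_fmt}"]
    files += [f"{stem}_{suffix}.png" for suffix in PLOT_SUFFIXES]
    return files


if __name__ == "__main__":
    for output_name in expected_outputs("1_S1_L001_R1_001.fastq"):
        print(output_name)
```

Statistics.txt and Command.log are not tied to a single input file, so they are not included in the returned list.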
Detailed statistics of quality control evaluations (Statistics.txt)
Comparative sequence PHRED quality distribution for raw and cleaned data (1_S1_L001_R1_001_Qualdist.png and 1_S1_L001_R2_001_Qualdist.png)
Comparative percentage of GC content distribution for raw and cleaned data (1_S1_L001_R1_001_GCdist.png and 1_S1_L001_R2_001_GCdist.png)
Nucleotide base content for raw and cleaned data (1_S1_L001_R1_001_Basedist.png and 1_S1_L001_R2_001_Basedist.png)
Group-wise sequence PHRED quality distribution for cleaned data (1_S1_L001_R1_001_QualGroup.png and 1_S1_L001_R2_001_QualGroup.png)
If you have any questions, comments or recommendations, please email me at firstname.lastname@example.org