Fast, streaming QA/QC for High Throughput Sequencing data


Welcome to the HTStream application page.

HTStream is a quality control and processing pipeline for High Throughput Sequencing (HTS) data. What distinguishes HTStream from other tools is its use of a tab-delimited FASTQ format, which allows data to stream from one application to the next. This streaming creates substantial efficiencies when processing HTS data and makes HTStream fully interoperable with standard Linux tools.

Benefits Include:

  • No intermediate files (reduces storage footprint)
  • Reduced I/O (files are only read in and written out once)
  • Handles single-end and paired-end reads at the same time
  • Applications can run simultaneously, allowing for process parallelization
  • Built on top of mature C++ Boost libraries to reduce bugs and memory leaks
  • Designed following the philosophy of Program Design in the UNIX Environment
  • Works with native Unix/Linux applications such as grep/sed/awk, etc. (see the example below)
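As a quick illustration of that interoperability: records are streamed as tab-delimited lines, so standard Unix tools slot directly into a pipeline. A minimal sketch (file names are placeholders) that tallies the number of tab-separated fields per record emitted by hts_Stats:
hts_Stats -1 R1.fastq.gz -2 R2.fastq.gz | awk -F'\t' '{print NF}' | sort | uniq -c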

Installing HTStream

Options for installing HTStream include the following:

1) Download and unpack the binary:

The easiest and most straightforward option, if you are using a Linux system, is to download the compiled binary.
wget https://github.com/s4hts/HTStream/releases/download/v1.3.0-release/HTStream_v1.3.0-release.tar.gz
tar xvf HTStream_v1.3.0-release.tar.gz
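Assuming the tarball unpacks the hts_* executables into the current directory (adjust the path if your version creates a subdirectory), add them to your PATH and verify the install:
export PATH=$PWD:$PATH
hts_Stats --help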

2) Bioconda:

Many people use conda environments to manage software. HTStream is available through the Bioconda channel. If you don't have conda installed, download a copy of the Miniconda installer and follow the steps below:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh

#Add channels for bioconda
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

#Install HTStream
conda install htstream
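To keep HTStream and its dependencies isolated from other software, it can instead be installed into a dedicated environment; a minimal sketch using standard conda commands:
conda create -n htstream htstream
conda activate htstream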
        

3) Compiling HTStream on Linux

Alternatively, you may choose to compile HTStream yourself. This process can be more complicated, but we offer some suggestions below.

Linux Prerequisites
  • CMake 3.2 or greater
  • Boost 1.56 or greater, including the following libraries:
    • libboost-dev libboost-system-dev libboost-program-options-dev libboost-iostreams-dev libboost-filesystem-dev
On Ubuntu or similar systems, the prerequisites can be installed as follows:
sudo apt-get install cmake
sudo apt install libboost-dev libboost-system-dev libboost-program-options-dev libboost-iostreams-dev libboost-filesystem-dev
Alternatively, Boost can be installed from source:
wget -O 1_60.tar.bz2 https://sourceforge.net/projects/boost/files/boost/1.60.0/boost_1_60_0.tar.bz2/download
tar --bzip2 -xf 1_60.tar.bz2
cd boost_1_60_0
./bootstrap.sh 
./b2 # This will take a while.
BOOST_INCLUDE=$(pwd)
BOOST_INCLUDE_LIB=$(pwd)/stage/lib
Install HTStream
git clone https://github.com/s4hts/HTStream.git
cd HTStream
mkdir build
cd build
cmake ..
#When using an alternative Boost install:
#cmake -DCMAKE_INCLUDE_PATH=$BOOST_INCLUDE -DCMAKE_LIBRARY_PATH=$BOOST_INCLUDE_LIB .. 
#To specify an install location use -DCMAKE_INSTALL_PREFIX=
make 
make install # installs to /usr/local/bin (requires sudo), or to the location specified with -DCMAKE_INSTALL_PREFIX (no sudo required)
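For a user-level install that does not require sudo, one option is to point the install prefix at your home directory; this sketch assumes the usual CMake convention of placing binaries under <prefix>/bin:
cmake -DCMAKE_INSTALL_PREFIX=$HOME/htstream ..
make
make install
export PATH=$HOME/htstream/bin:$PATH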

4) Building on Mac OSX

First, install Homebrew, then use it to install the prerequisites:
brew upgrade
brew install cmake
brew install boost
git clone https://github.com/s4hts/HTStream.git
cd HTStream
mkdir build
cd build
cmake ..
#When using an alternative Boost install:
#cmake -DCMAKE_INCLUDE_PATH=$BOOST_INCLUDE -DCMAKE_LIBRARY_PATH=$BOOST_INCLUDE_LIB .. 
#To specify an install location use -DCMAKE_INSTALL_PREFIX=
make 
make install # installs to /usr/local/bin (requires sudo), or to the location specified with -DCMAKE_INSTALL_PREFIX (no sudo required)

Troubleshooting

Some users have experienced issues with the Boost install. We believe this arises from having other versions of Boost on the system. Try the following:
cmake -DBoost_NO_SYSTEM_PATHS=TRUE -DBOOST_INCLUDEDIR=/usr/local/include/ ..
Or
export CC=`which gcc`
export CXX=`which g++`
mkdir -p build
cd build
cmake -DBoost_NO_SYSTEM_PATHS=TRUE -DCMAKE_INCLUDE_PATH=$BOOST_INCLUDE -DCMAKE_LIBRARY_PATH=$BOOST_INCLUDE_LIB ..

Additional CMake Options

To build HTStream with statically linked libraries, use:
cmake -DBUILD_STATIC_BIN=ON ..

HTStream Applications Overview

hts_AdapterTrimmer

hts_AdapterTrimmer trims adapters, which are sequenced when the fragment insert length is shorter than the read length. In this case, the sequencer reads past the end of the insert and into the Illumina sequencing adapter. hts_AdapterTrimmer first overlaps the reads, then determines whether a 3' overhang exists. Since a 3' overhang can typically only occur when adapter is present in the reads, hts_AdapterTrimmer trims these overhangs. The algorithm used is the hts_Overlapper algorithm, except that the reads are returned as a pair, which is useful for programs that don't accept a mix of SE and PE reads. It is important to note that because hts_AdapterTrimmer does not rely on adapter pattern matching, it can effectively trim adapters as short as a single base pair.
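A typical paired-end invocation, borrowing the input/output conventions used elsewhere on this page (file names are placeholders), might look like:
hts_AdapterTrimmer -1 R1.fastq.gz -2 R2.fastq.gz -f Sample1_trimmed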

For detailed usage instructions please run:

hts_AdapterTrimmer --help

hts_CutTrim

The hts_CutTrim application trims a fixed number of bases from the 5' and/or 3' end of each read. For example, to trim the first 10 bases from a single-end read:
hts_CutTrim -a 10 -U R1.fastq.gz -F R1_left_trimmed
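A paired-end sketch using the same input/output conventions as the other tools on this page (file names are placeholders; confirm with --help whether -a applies to both reads of a pair):
hts_CutTrim -a 10 -1 R1.fastq.gz -2 R2.fastq.gz -f Sample1_cut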

For detailed usage instructions please run:

hts_CutTrim --help

hts_Overlapper

The hts_Overlapper application attempts to overlap paired end reads to produce the original fragment, trims adapters, and can correct sequencing errors.

Overlapping pairs of bases are handled as follows:
  • match: q-scores are added, up to a maximum q-score of 40
  • mismatch: the base with the highest q-score is kept, and the q-scores are subtracted
  • mismatch with equal q-scores: the base from read 1 is kept
For example, matching bases with Q30 and Q25 produce a consensus base of Q40 (the sum, 55, capped at 40), while mismatching bases with Q30 and Q25 keep the Q30 base at a reduced Q5.
Reads come in three flavors:
1) sins: Reads produced from an insert shorter than the read length result in a single read in the orientation of R1, with adapters trimmed.
R1:         --------------->
R2:     <---------------
Output:     ------------  
2) mins: Reads produced from a medium insert, longer than the read length but somewhat shorter than 2x the read length, produce a single SE read in the orientation of R1.
R1:     --------------->
R2:           <---------------
Output: --------------------->
3) lins: Reads produced from long inserts, which do not overlap significantly, result in a PE read.
R1:     --------------->
R2:                        <---------------
Output: --------------->   <---------------

For detailed usage instructions please run:

hts_Overlapper --help

hts_QWindowTrim

The quality trimmer uses a sliding window approach to remove the low quality ends of reads. It slides a window from each end of the read, moving inwards. When the average quality of the bases within the window rises above the avg-qual threshold, trimming stops.
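A typical paired-end invocation (file names are placeholders):
hts_QWindowTrim -1 R1.fastq.gz -2 R2.fastq.gz -f Sample1_qtrimmed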

For detailed usage instructions please run:

hts_QWindowTrim --help

hts_Stats

The hts_Stats program generates a JSON-formatted file containing a set of statistical measures about the input read data. This output can be used for plotting or to develop a better understanding of a dataset.
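For example, to record statistics to a JSON log while writing the reads back out (file names are placeholders):
hts_Stats -L Sample1_stats.log -1 R1.fastq.gz -2 R2.fastq.gz -f Sample1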

For detailed usage instructions please run:

hts_Stats --help

hts_NTrimmer

hts_NTrimmer trims reads to the longest subsequence that contains no Ns.

For detailed usage instructions please run:

hts_NTrimmer --help

hts_PolyATTrim

hts_PolyATTrim attempts to trim poly-A and poly-T sequences from the ends of reads. It starts at either end of a read, expanding a trimming window until the specified number of mismatches (non-A/T bases) is reached.
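A sketch mirroring the option used in Example 2 below (file names are placeholders):
hts_PolyATTrim -m 100 -1 R1.fastq.gz -2 R2.fastq.gz -f Sample1_polyAT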

For detailed usage instructions please run:

hts_PolyATTrim --help

hts_SeqScreener

hts_SeqScreener is a simple sequence screening tool that uses a k-mer lookup approach to identify reads from an unwanted source. By default it looks for reads likely to have come from PhiX (commonly added to Illumina sequencing runs), but the user can also provide a sequence or set of sequences to screen against (sequences should be no more than a few kb in length). This tool can also be useful for removing primer dimers and other reads containing sequencing adapters. For example, setting -k 15 -x .01 in combination with a collection of adapters in FASTA format has been found to work well for this purpose.
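For example, to screen paired reads against an adapter collection with the settings suggested above (file names are placeholders):
hts_SeqScreener --seq adapters.fa -k 15 -x .01 -1 R1.fastq.gz -2 R2.fastq.gz -f Sample1_screened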

For detailed usage instructions please run:

hts_SeqScreener --help

hts_SuperDeduper

hts_SuperDeduper is a reference-free duplicate read removal tool. Traditionally, PCR duplicates have been removed by comparing the mapping locations of paired reads and discarding pairs with identical R1 and R2 mapping locations. hts_SuperDeduper avoids relying on mapping location by using a short subsequence within each read as a barcode for the read pair. Any other read pair with the same barcode is identified as a PCR duplicate and discarded. The default settings for this tool have been tuned to produce sets of duplicated reads that are highly congruent with those identified by Picard MarkDuplicates.
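A typical paired-end invocation (file names are placeholders):
hts_SuperDeduper -1 R1.fastq.gz -2 R2.fastq.gz -f Sample1_dedup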

For detailed usage instructions please run:

hts_SuperDeduper --help

hts_Primers

The hts_Primers application identifies primer sequences located on the 5' ends of R1 and R2, or on the 5' and 3' ends of SE reads. Optionally, reads can be flipped based on primer orientation. The read ID is modified by appending the primer identifier. Primer IDs are seq1, seq2, etc. if supplied on the command line, or the sequence names if supplied as a FASTA file.

For detailed usage instructions please run:

hts_Primers --help

Example Pipelines

Building pipelines with HTStream will feel natural to Unix/Linux users. Simply string together a set of HTS tools using Linux pipes, and your pipeline is done.

Example 1

A simple pipeline for overlapping reads and screening for PhiX with JSON logging data written to a common logging file:
hts_Overlapper -e 0.1 -L ./01-cleaned/Sample1_stats.log -1 ./00-RawData/Sample1_R1.fastq.gz -2 ./00-RawData/Sample1_R2.fastq.gz \
      | hts_SeqScreener -k 12 -AL ./01-cleaned/Sample1_stats.log -f ./01-cleaned/Sample1 

Example 2

A more complex example which 1) deduplicates reads, 2) screens them for PhiX, 3) trims poly-A/T bases, 4) screens resulting reads against a collection of adapter sequences, and 5) collects detailed statistics before writing the reads out. Note that in this example, two statistics files will be created. One will contain program-specific statistics, while the second will contain statistics about the cleaned set of reads.
hts_SuperDeduper -L ./01-Cleaned/Sample1_stats.log -1 ./00-RawData/Sample1_L001_R1_001.fastq.gz -2 ./00-RawData/Sample1_L001_R2_001.fastq.gz \
    | hts_SeqScreener -A -L ./01-Cleaned/Sample1_stats.log \
    | hts_PolyATTrim -m 100 -A -L ./01-Cleaned/Sample1_stats.log \
    | hts_SeqScreener -A -L ./01-Cleaned/Sample1_stats.log --seq adapters.fa -k 15 -x .01 \
    | hts_Stats -N Detailed_Stats -A -L ./01-Cleaned/Sample1_StatsStats.log -f ./01-Cleaned/Sample1

Example 3

For most projects, the number of samples is simply too large to write HTStream pipelines by hand. Additionally, copy/paste errors can easily ruin your day (or a whole analysis). Instead, a better practice is to script the analysis. The following is an example Python script which generates HTStream cleaning pipelines for a set of PE read files. In its current form, it creates a bash script, but it could easily be modified to enable submission to a cluster. The script assumes that gzipped reads are stored in a 00-RawData folder, and that the file names end in "_L001_R1_001.fastq.gz". These assumptions will likely need to be adjusted on a case-by-case basis.

#!/usr/bin/env python
from glob import glob
import os

# Create the output directory if it does not already exist (the pipelines write into it)
if not os.path.exists("./01-Cleaned"):
    os.makedirs("./01-Cleaned")

cleaning = open("cleaning_commands.sh", 'w')

for r1 in glob("./00-RawData/*_R1_*.gz"):
    # Derive the matching R2 file and the sample name from the R1 file name
    r2 = r1.replace("R1", "R2")
    s = r1.split('/')[-1].replace("_L001_R1_001.fastq.gz", '')
    # Build one pipeline per sample: deduplicate, screen PhiX, trim poly-A/T,
    # screen adapters, then collect statistics before writing the reads out
    cmd = "hts_SuperDeduper -L ./01-Cleaned/" + s + "_stats.log -1 " + r1 + " -2 " + r2 + " | "
    cmd += "hts_SeqScreener -AL ./01-Cleaned/" + s + "_stats.log | "
    cmd += "hts_PolyATTrim -m 100 -AL ./01-Cleaned/" + s + "_stats.log | "
    cmd += "hts_SeqScreener -AL ./01-Cleaned/" + s + "_stats.log --seq adapters.fa -k 15 -x .01 | "
    cmd += "hts_Stats -N phix-remover-adapters -AL ./01-Cleaned/" + s + "_StatsStats.log -f ./01-Cleaned/" + s
    cleaning.write(cmd + '\n')

cleaning.close()
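Assuming the script is saved as generate_cleaning.py (a name chosen here for illustration), generate and run the pipelines with:
python generate_cleaning.py
bash cleaning_commands.sh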


Additional Examples And Tutorials

An excellent HTStream tutorial for processing RNA-seq data is available from the UC Davis Bioinformatics Core 2018 RNA-Seq Workshop.

Parsing JSON log files in R

The JSON log files produced by HTStream can be parsed easily from within R using the jsonlite package. This is a very simple example showing how to load and access data from hts_Stats.
## R
> library(jsonlite)
> results = fromJSON("test.log")
> names(results)
[1] "hts_Stats_23737"

> results$hts_Stats_23737$Paired_end

$PE_in
[1] 19595

$PE_out
[1] 19595

$R1_bpLen
[1] 5708153

$R1_bQ30
[1] 5233672

$R2_bpLen
[1] 5708153

$R2_bQ30
[1] 4946344
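From these values, useful summaries fall out directly; for example, the fraction of R1 bases at or above Q30 is results$hts_Stats_23737$Paired_end$R1_bQ30 / results$hts_Stats_23737$Paired_end$R1_bpLen, roughly 0.92 here, while the corresponding R2 fraction is roughly 0.87.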