Fast, streaming QA/QC for High Throughput Sequencing data


Welcome to the HTStream application page.

HTStream is a quality control and processing pipeline for High Throughput Sequencing (HTS) data. What distinguishes HTStream from other tools is its use of a tab-delimited FASTQ format, which allows data to stream from one application to the next. This streaming creates substantial efficiencies when processing HTS data and makes HTStream fully interoperable with standard Linux tools.

Benefits Include:

  • No intermediate files (reduces storage footprint)
  • Reduced I/O (files are only read in and written out once)
  • Handles single-end and paired-end reads at the same time
  • Applications can run simultaneously, allowing for process parallelization
  • Built on top of mature C++ Boost libraries to reduce bugs and memory leaks
  • Designed following the philosophy of Program Design in the UNIX Environment
  • Works with native Unix/Linux applications such as grep/sed/awk, etc. (see the example below)
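As a quick illustration of that interoperability: records are streamed as tab-delimited lines, so standard Unix tools slot directly into a pipeline. A minimal sketch (file names are placeholders) that tallies the number of tab-separated fields per record emitted by hts_Stats:
hts_Stats -1 R1.fastq.gz -2 R2.fastq.gz | awk -F'\t' '{print NF}' | sort | uniq -c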

Installing HTStream

Options for installing HTStream include the following:

1) Download and unpack the binary:

The easiest and most straightforward option, if you are using a Linux system, is to download the compiled binary.
wget https://github.com/s4hts/HTStream/releases/download/v1.3.0-release/HTStream_v1.3.0-release.tar.gz
tar xvf HTStream_v1.3.0-release.tar.gz
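Assuming the tarball unpacks the hts_* executables into the current directory (adjust the path if your version creates a subdirectory), add them to your PATH and verify the install:
export PATH=$PWD:$PATH
hts_Stats --help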

2) Bioconda:

Many people use conda environments to manage software. HTStream is available through the Bioconda channel. If you don't have conda installed, download a copy of the Miniconda installer and follow the steps below:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh

#Add channels for bioconda
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

#Install HTStream
conda install htstream
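To keep HTStream and its dependencies isolated from other software, it can instead be installed into a dedicated environment; a minimal sketch using standard conda commands:
conda create -n htstream htstream
conda activate htstream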
        

3) Compiling HTStream on Linux

Alternatively, you may choose to compile HTStream yourself. This process can be more complicated, but we offer some suggestions below.

Linux Prerequisites
  • CMake 3.2 or greater
  • Boost 1.56 or greater, including the following libraries:
    • libboost-dev libboost-system-dev libboost-program-options-dev libboost-iostreams-dev libboost-filesystem-dev
On Ubuntu or similar systems, the prerequisites can be installed as follows:
sudo apt-get install cmake
sudo apt install libboost-dev libboost-system-dev libboost-program-options-dev libboost-iostreams-dev libboost-filesystem-dev
Alternatively, Boost can be installed from source:
wget -O 1_60.tar.bz2 https://sourceforge.net/projects/boost/files/boost/1.60.0/boost_1_60_0.tar.bz2/download
tar --bzip2 -xf 1_60.tar.bz2
cd boost_1_60_0
./bootstrap.sh 
./b2 # This will take a while.
BOOST_INCLUDE=$(pwd)
BOOST_INCLUDE_LIB=$(pwd)/stage/lib
Install HTStream
git clone https://github.com/s4hts/HTStream.git
cd HTStream
mkdir build
cd build
cmake ..
#When using an alternative Boost install:
#cmake -DCMAKE_INCLUDE_PATH=$BOOST_INCLUDE -DCMAKE_LIBRARY_PATH=$BOOST_INCLUDE_LIB .. 
#To specify an install location use -DCMAKE_INSTALL_PREFIX=
make 
make install # installs to /usr/local/bin (requires sudo), or to the location specified with -DCMAKE_INSTALL_PREFIX (no sudo required)
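For a user-level install that does not require sudo, one option is to point the install prefix at your home directory; this sketch assumes the usual CMake convention of placing binaries under <prefix>/bin:
cmake -DCMAKE_INSTALL_PREFIX=$HOME/htstream ..
make
make install
export PATH=$HOME/htstream/bin:$PATH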

4) Building on Mac OSX

First, install Homebrew, then use it to install the prerequisites:
brew upgrade
brew install cmake
brew install boost
git clone https://github.com/s4hts/HTStream.git
cd HTStream
mkdir build
cd build
cmake ..
#When using an alternative Boost install:
#cmake -DCMAKE_INCLUDE_PATH=$BOOST_INCLUDE -DCMAKE_LIBRARY_PATH=$BOOST_INCLUDE_LIB .. 
#To specify an install location use -DCMAKE_INSTALL_PREFIX=
make 
make install # installs to /usr/local/bin (requires sudo), or to the location specified with -DCMAKE_INSTALL_PREFIX (no sudo required)

Troubleshooting

Some users have experienced issues with the Boost install. We believe this arises from having other versions of Boost on the system. Try the following:
cmake -DBoost_NO_SYSTEM_PATHS=TRUE -DBOOST_INCLUDEDIR=/usr/local/include/ ..
Or
export CC=`which gcc`
export CXX=`which g++`
mkdir -p build
cd build
cmake -DBoost_NO_SYSTEM_PATHS=TRUE -DCMAKE_INCLUDE_PATH=$BOOST_INCLUDE -DCMAKE_LIBRARY_PATH=$BOOST_INCLUDE_LIB ..

Additional CMake Options

To build HTStream with statically linked libraries, use:
cmake -DBUILD_STATIC_BIN=ON ..

HTStream Applications Overview

hts_AdapterTrimmer

hts_AdapterTrimmer trims adapters, which are sequenced when the fragment insert length is shorter than the read length. In this case, the sequencer reads past the end of the insert and into the Illumina sequencing adapter. hts_AdapterTrimmer first overlaps the reads, then determines whether a 3' overhang exists. Since a 3' overhang can typically only occur when adapter is present in the reads, hts_AdapterTrimmer trims these overhangs. The algorithm used is the hts_Overlapper algorithm, except that the reads are returned as a pair, which is useful for programs that don't accept a mix of SE and PE reads. It is important to note that because hts_AdapterTrimmer does not rely on adapter pattern matching, it can effectively trim adapters as short as a single base pair.
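A typical paired-end invocation, borrowing the input/output conventions used elsewhere on this page (file names are placeholders), might look like:
hts_AdapterTrimmer -1 R1.fastq.gz -2 R2.fastq.gz -f Sample1_trimmed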

For detailed usage instructions please run:

hts_AdapterTrimmer --help

hts_CutTrim

The hts_CutTrim application trims a fixed number of bases from the 5' and/or 3' end of each read. For example, to trim the first 10 bases from a single-end read:
hts_CutTrim -a 10 -U R1.fastq.gz -F R1_left_trimmed
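A paired-end sketch using the same input/output conventions as the other tools on this page (file names are placeholders; confirm with --help whether -a applies to both reads of a pair):
hts_CutTrim -a 10 -1 R1.fastq.gz -2 R2.fastq.gz -f Sample1_cut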

For detailed usage instructions please run:

hts_CutTrim --help

hts_Overlapper

The hts_Overlapper application attempts to overlap paired end reads to produce the original fragment, trims adapters, and can correct sequencing errors.

Overlapping pairs of bases are handled as follows:
  • match: q-scores are added, up to a maximum q-score of 40
  • mismatch: the base with the highest q-score is kept, and the q-scores are subtracted
  • mismatch with equal q-scores: the base from read 1 is kept
For example, matching bases with Q30 and Q25 produce a consensus base of Q40 (the sum, 55, capped at 40), while mismatching bases with Q30 and Q25 keep the Q30 base at a reduced Q5.
Reads come in three flavors:
1) sins: Reads produced from an insert shorter than the read length result in a single read in the orientation of R1, with adapters trimmed.
R1:         --------------->
R2:     <---------------
Output:     ------------  
2) mins: Reads produced from a medium insert, longer than the read length but somewhat shorter than 2x the read length, produce a single SE read in the orientation of R1.
R1:     --------------->
R2:           <---------------
Output: --------------------->
3) lins: Reads produced from long inserts, which do not overlap significantly, result in a PE read.
R1:     --------------->
R2:                        <---------------
Output: --------------->   <---------------

For detailed usage instructions please run:

hts_Overlapper --help

hts_QWindowTrim

The quality trimmer uses a sliding window approach to remove the low quality ends of reads. It slides a window from each end of the read, moving inwards. When the average quality of the bases within the window rises above the avg-qual threshold, trimming stops.
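A typical paired-end invocation (file names are placeholders):
hts_QWindowTrim -1 R1.fastq.gz -2 R2.fastq.gz -f Sample1_qtrimmed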

For detailed usage instructions please run:

hts_QWindowTrim --help

hts_Stats

The hts_Stats program generates a JSON-formatted file containing a set of statistical measures about the input read data. This output can be used for plotting or to develop a better understanding of a dataset.
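For example, to record statistics to a JSON log while writing the reads back out (file names are placeholders):
hts_Stats -L Sample1_stats.log -1 R1.fastq.gz -2 R2.fastq.gz -f Sample1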

For detailed usage instructions please run:

hts_Stats --help

hts_NTrimmer

hts_NTrimmer trims reads to the longest subsequence that contains no Ns.

For detailed usage instructions please run:

hts_NTrimmer --help

hts_PolyATTrim

hts_PolyATTrim attempts to trim poly-A and poly-T sequences from the ends of reads. It starts at either end of a read, expanding a trimming window until the specified number of mismatches (non-A/T bases) is reached.
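A sketch mirroring the option used in Example 2 below (file names are placeholders):
hts_PolyATTrim -m 100 -1 R1.fastq.gz -2 R2.fastq.gz -f Sample1_polyAT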

For detailed usage instructions please run:

hts_PolyATTrim --help

hts_SeqScreener

hts_SeqScreener is a simple sequence screening tool that uses a k-mer lookup approach to identify reads from an unwanted source. By default it looks for reads likely to have come from PhiX (commonly added to Illumina sequencing runs), but the user can also provide a sequence or set of sequences to screen against (sequences should be no more than a few kb in length). This tool can also be useful for removing primer dimers and other reads containing sequencing adapters. For example, setting -k 15 -x .01 in combination with a collection of adapters in FASTA format has been found to work well for this purpose.
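For example, to screen paired reads against an adapter collection with the settings suggested above (file names are placeholders):
hts_SeqScreener --seq adapters.fa -k 15 -x .01 -1 R1.fastq.gz -2 R2.fastq.gz -f Sample1_screened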

For detailed usage instructions please run:

hts_SeqScreener --help

hts_SuperDeduper

hts_SuperDeduper is a reference-free duplicate read removal tool. Traditionally, PCR duplicates have been removed by comparing the mapping locations of paired reads and discarding pairs with identical R1 and R2 mapping locations. hts_SuperDeduper avoids relying on mapping location by using a short subsequence within each read as a barcode for the read pair. Any other read pair with the same barcode is identified as a PCR duplicate and discarded. The default settings for this tool have been tuned to produce sets of duplicated reads that are highly congruent with those identified by Picard MarkDuplicates.
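A typical paired-end invocation (file names are placeholders):
hts_SuperDeduper -1 R1.fastq.gz -2 R2.fastq.gz -f Sample1_dedup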

For detailed usage instructions please run:

hts_SuperDeduper --help

hts_Primers

The hts_Primers application identifies primer sequences located on the 5' ends of R1 and R2, or on the 5' and 3' ends of SE reads. Optionally, reads can be flipped based on primer orientation. The read ID is modified by appending the primer identifier. Primer IDs are seq1, seq2, etc. if supplied on the command line, or the sequence names if supplied as a FASTA file.

For detailed usage instructions please run:

hts_Primers --help

Example Pipelines

Building pipelines with HTStream will feel natural to Unix/Linux users. Simply string together a set of HTS tools using Linux pipes, and your pipeline is done.

Example 1

A simple pipeline for overlapping reads and screening for PhiX with JSON logging data written to a common logging file:
hts_Overlapper -e 0.1 -L ./01-cleaned/Sample1_stats.log -1 ./00-RawData/Sample1_R1.fastq.gz -2 ./00-RawData/Sample1_R2.fastq.gz \
      | hts_SeqScreener -k 12 -AL ./01-cleaned/Sample1_stats.log -f ./01-cleaned/Sample1 

Example 2

A more complex example which 1) deduplicates reads, 2) screens them for PhiX, 3) trims poly-A/T bases, 4) screens resulting reads against a collection of adapter sequences, and 5) collects detailed statistics before writing the reads out. Note that in this example, two statistics files will be created. One will contain program-specific statistics, while the second will contain statistics about the cleaned set of reads.
hts_SuperDeduper -L ./01-Cleaned/Sample1_stats.log -1 ./00-RawData/Sample1_L001_R1_001.fastq.gz -2 ./00-RawData/Sample1_L001_R2_001.fastq.gz \
    | hts_SeqScreener -A -L ./01-Cleaned/Sample1_stats.log \
    | hts_PolyATTrim -m 100 -A -L ./01-Cleaned/Sample1_stats.log \
    | hts_SeqScreener -A -L ./01-Cleaned/Sample1_stats.log --seq adapters.fa -k 15 -x .01 \
    | hts_Stats -N Detailed_Stats -A -L ./01-Cleaned/Sample1_StatsStats.log -f ./01-Cleaned/Sample1

Example 3

For most projects, the number of samples is simply too large to write HTStream pipelines by hand. Additionally, copy/paste errors can easily ruin your day (or a whole analysis). Instead, a better practice is to script the analysis. The following is an example Python script which generates HTStream cleaning pipelines for a set of PE read files. In its current form, it creates a bash script, but it could easily be modified to enable submission to a cluster. The script assumes that gzipped reads are stored in a 00-RawData folder, and that the file names end in "_L001_R1_001.fastq.gz". These assumptions will likely need to be adjusted on a case-by-case basis.

#!/usr/bin/env python
from glob import glob
import os

# Create the output directory if it does not already exist (the pipelines write into it)
if not os.path.exists("./01-Cleaned"):
    os.makedirs("./01-Cleaned")

cleaning = open("cleaning_commands.sh", 'w')

for r1 in glob("./00-RawData/*_R1_*.gz"):
    # Derive the matching R2 file and the sample name from the R1 file name
    r2 = r1.replace("R1", "R2")
    s = r1.split('/')[-1].replace("_L001_R1_001.fastq.gz", '')
    # Build one pipeline per sample: deduplicate, screen PhiX, trim poly-A/T,
    # screen adapters, then collect statistics before writing the reads out
    cmd = "hts_SuperDeduper -L ./01-Cleaned/" + s + "_stats.log -1 " + r1 + " -2 " + r2 + " | "
    cmd += "hts_SeqScreener -AL ./01-Cleaned/" + s + "_stats.log | "
    cmd += "hts_PolyATTrim -m 100 -AL ./01-Cleaned/" + s + "_stats.log | "
    cmd += "hts_SeqScreener -AL ./01-Cleaned/" + s + "_stats.log --seq adapters.fa -k 15 -x .01 | "
    cmd += "hts_Stats -N phix-remover-adapters -AL ./01-Cleaned/" + s + "_StatsStats.log -f ./01-Cleaned/" + s
    cleaning.write(cmd + '\n')

cleaning.close()
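Assuming the script is saved as generate_cleaning.py (a name chosen here for illustration), generate and run the pipelines with:
python generate_cleaning.py
bash cleaning_commands.sh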


Additional Examples And Tutorials

An excellent HTStream tutorial for processing RNA-seq data is available from the UC Davis Bioinformatics Core 2018 RNA-Seq Workshop.

Parsing JSON log files in R

The JSON log files produced by HTStream can be parsed easily from within R using the jsonlite package. This is a very simple example showing how to load and access data from hts_Stats.
## R
> library(jsonlite)
> results = fromJSON("test.log")
> names(results)
[1] "hts_Stats_23737"

> results$hts_Stats_23737$Paired_end

$PE_in
[1] 19595

$PE_out
[1] 19595

$R1_bpLen
[1] 5708153

$R1_bQ30
[1] 5233672

$R2_bpLen
[1] 5708153

$R2_bQ30
[1] 4946344
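From these values, useful summaries fall out directly; for example, the fraction of R1 bases at or above Q30 is results$hts_Stats_23737$Paired_end$R1_bQ30 / results$hts_Stats_23737$Paired_end$R1_bpLen, roughly 0.92 here, while the corresponding R2 fraction is roughly 0.87.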