Introduction

The goal of this practical session is to evaluate using an OTU calling approach as an alternative to denoising for the processing of 16S sequence data. While a large number of tools are available for each step in a typical OTU calling pipeline. This tutorial relies on FLASh for amplicon assembly, and USEARCH for OTU calling.

  1. Using a text editor to document your actions
  2. Fetching 16S sequence data
  3. Combining paired-end reads into contiguous sequences
  4. Generating a unique set of 16S gene sequences
  5. Clustering unique sequences into operational taxonomic units
  6. Creating a table of OTU abundance
  7. Assigning taxonomy to OTU sequences
  8. Amalgamating code into a pipeline script

Conclusions

Session One

1. Using a text editor to document your actions (Optional)

During this session you may wish to keep a record of the commands used to analyse your 16S data. One way to do this is to write each command to a file. From the command line, this can be acheived using a text editor. There are many to choose from, but this session will use Nano.

# Create a working directory for this session
mkdir ~/amplicon_otu
cd ~/amplicon_otu
pwd

# Open a new file
nano 16s_pipeline.sh

Opening a new text file you should see something similar to this:

Nano

Try writing some text to the document 16s_pipeline.sh
Try saving the document, closing it, and re-opening it.

Because the saved document is a shell script, it is also possible to run the entire script from the command line.

# Make the file 16s_pipeline.sh executable from the command line.
chmod u+x 16s_pipeline.sh

# run the code in 16s_pipeline.sh
./16s_pipeline.sh

For more information on using nano, see the Nano online documentation.

2. Fetching 16S sequence data

To begin, it’s necessary to fetch the 16S sequence data that will be used for this tutorial.

# Create a subdirectory for storing fastq files
cd ~/amplicon_otu
mkdir fastqs
cd fastqs
cp  ~/.thursday_pm/data/input_fastq/*fastq.*.gz .
ls

These data were generated by sequencing the V1-V3 region of the 16S rRNA gene using 2x300 base pair paired-end sequencing on the Illumina MiSeq. There should be two files per sample, with the files *.fastq.1.gz and *.fastq.2.gz containing the first and second reads in each pair, respectively. Have a look at the first four lines in one of the fastqs.

zcat sample1_trimmed.fastq.1.gz | head -4

@M03204:217:000000000-B8W8J:1:2108:24791:7353
AGAGTTTGATCCTGGCTCAGGATGAACGCTGGCGGCATGCCTTACACATGCAAGTCGGACGGGAAGTGGTGTTTCCAGTGGCGGACGGGTGAGTAACGCGTAAGAACCTACCCTTGGGAGGGGAACAACAGCTGGAAACGGCTGCTAATACCCCGTAGGCTGAGGAGCAAAAGGAGGAATCCGCCCGAGGAGGGGTTCGCGTCTGATTAGCTAGTTGGTGAGGCAATAGCTTACCAAGGCGATGATCAGTAGCTGGTCCGAGAGGATGATCAGCCACACTGGGACTGAGACACGGCCCAGA
+
CCCCCGFGGGGGGFCFFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDDFGGGGGGGGGGGGGGGGGGGFGGGGDGGGGGGGGGFGGGGGGEGB:FEFGGGFGGECGGGGFFGFGFGGGGGGGGFGFEGGGGGGFFFGGGGGGFECGGGFGGGGGGGGGGGGGGGGGGGGF6CFFGGGGDGGEFFCF:5*)5>29*5<)5>GF4)

NB. Sequences in these files have been trimmed to remove adapters, but are otherwise untouched.

3. Combining paired-end reads into contiguous sequences

In order to generate a single, contiguous sequence spanning the target region of the 16S gene, it’s necessary the paired reads in each fastq file. To do this we will use a tool called FLASh. It’s first necessary to ensure this tool is accessible from the command line.

export PATH=~/.thursday_pm/bin/:$PATH
which flash

We can then start by combining fastq files containing forward and reverse reads for a single sample.

# Check you've returned to the run directory for this session
cd ~/amplicon_otu
pwd

# Run FLASh for a single sample
flash ./fastqs/sample1_trimmed.fastq.1.gz ./fastqs/sample1_trimmed.fastq.2.gz

# The files created by FLASh should be in the current directory
ls

Have a look at the files created by FLASh, in particular the file out.extendedFrags.fastq
Are all sequences the same length? How many sequences were successfully assembled and how many failed to assemble?

For information on options for running FLASh and the files it creates visit the website, or access the help documentation on the command line.

flash --help

3.b. Running FLASh using a for loop

While it’s possible to run FLASh individually for each sample, for convenience we will use a for loop to iteratively combine fastqs for all the samples in the fastq directory.

# make a directory in which to store the combined fastq files
mkdir fastqs_combined

ls 
ls fastqs_combined

# Run FLASh for all samples in the working directory
for FASTQ1 in fastqs/*.fastq.1.gz
do
  # capture the file prefix, which corresponds to the sample name
  SAMPLEID="$(basename ${FASTQ1%.fastq.1.gz})"
  # assign the second fastq file to the variable FASTQ2
  FASTQ2="fastqs/${SAMPLEID}.fastq.2.gz"
  # run FLASh 
  flash $FASTQ1 $FASTQ2 --output-prefix=$SAMPLEID --output-directory=fastqs_combined
done

ls fastqs_combined

If you’re using your text editor, copy the for loop into your pipeline script.
Remember to annotate your code so that you can return to it later.

3.c. Converting fastq files to fasta format

From here on, it’s necessary to work with fasta files, rather than fastq. It’s possible to convert our data from fastq to fasta format using the FASTX-toolkit.

# Run fastq_to_fasta for the combined fastq files created by FLASh
for FASTQ in fastqs_combined/*.extendedFrags.fastq
do
  # capture the file prefix, which corresponds to the sample name
  SAMPLEID="${FASTQ%.extendedFrags.fastq}"
  printf 'Processing %s\n' $FASTQ
  # run fastq_to_fasta
  fastq_to_fasta -Q33 -i $FASTQ -o $SAMPLEID.fasta
done

# The fasta files should now be in the same directory as the combined fastqs
ls fastqs_combined

4. Generating a unique set of 16S gene sequences

Having successfully processed our sequence data, the goals now are to:

Identify distinct biological taxa
Quantify the relative abundance of each distinct taxon

These goals are predicated on the assumption that the extent to which two distinct bacterial taxa are related correlates with the similarity of their 16S rRNA gene sequences. Different taxa can therefore be identified by resolving unique 16S gene sequences and counting their abundance.

As we’re using USEARCH for this purpose, the first step is to pool 16S gene sequences from different samples into one file. In order to keep track of which sequence originated from which sample, it is also necessary to add the sample name to samplesthe header line for each fasta sequence. For simplicity, this step will be carried out using a custom script.

# The location of the script for combining fasta files
which samples2fasta

# Combine samples into a single file located in the run directory
samples2fasta fastqs_combined all_samples.fasta

The second step is to create a file containing a single copy of each unique sequence in the dataset, as each unique sequence potentially represents a uniqlue bacterial taxon. When the -sizeout argument is provided USEARCH keeps a record of the number of times each unique sequence appears in the data set.

# How many sequences are in the combined fasta file?
cat all_samples.fasta | grep ">" | wc -l

# Fetch the latest version of USEARCH
rm ~/.thursday_pm/bin/usearch
ln -s ~/.thursday_pm/bin/src/usearch11.0.667_i86linux32 ~/.thursday_pm/bin/usearch
which usearch

# Run USEARCH to generate a file of unique sequences
usearch -fastx_uniques all_samples.fasta -fastaout all_unique_seqs.fasta -sizeout

Have a look at the first four lines of all_unique_seqs.fasta to see where sizeout is recorded.

Sequence variation may represent distinct bacterial taxa, but it may also be caused by errors during sequencing. If a bacterial taxon is present at a detectable abundance in these samples, then its 16S gene sequence is likely to be represented multiple times in the data set. By contrast, sequencing error is assumed to be (more-or-less) random, meaning that errors due to sequencing have only a small likelihood of occurring more than once. For this reason it is common practice to discard unique sequences that occur one, or a few times. The next step is to sort unique sequences based on their frequency of occurrence, and to discard those that occur only once.

# Sort unique sequences based on the number of times they occur, discarding singletons
usearch -sortbysize all_unique_seqs.fasta -fastaout all_unique_seqs_sorted.fasta -minsize 2

How many unique sequences remain after removing singletons?
How many unique sequences remain if you discard sequences occurring less than five times?

5. Clustering unique sequences into operational taxonomic units

The goal of this practical is to identify and quantify distinct bacterial taxa. Assuming that a certain amount of 16S gene sequence variation exists within species (different bacterial strains?) we will generate operational taxonomic units (OTUs) based on the assumption that two sequences which are more than 97% similar belong to the same species. There is a more nuanced discussion of exactly what 97% OTUs mean on the USEARCH website.

Uparse

The UPARSE-OTU algorithm, implemented in USEARCH, can be used to generate OTUs based on this 97% similarity threshold. It selects a single representative sequence for each generated OTU.

# Cluster OTUs using a 97% similarity threshold
usearch -cluster_otus all_unique_seqs_sorted.fasta -otus otus.fasta

How many OTUs have been generated?
How many chimeras were detected?

Note that chimeras can be a significant problem in amplicon-based sequence analysis. There are dedicated tools for chimera detection and removal (e.g. ChimeraSlayer); however the UPARSE-OTU algorithm implicitly filters chimeras.

6. Creating a table of OTU abundance

Having generated a set of sequences representing distinct bacterial taxa (OTUs). The next step is to quantify OTU abundance. This can be acheived by counting the number of sequences in each sample that match each OTU. For a sequence to “match” an OTU it must be more than 97% similar to the representative sequence for that OTU.

The first step is to relabel each of our OTUs with a unique identifier. We will do this using a python script fasta_number.py

# fetch the fasta_number.py python script to the current working directory
ln -s ~/.thursday_pm/bin/drive5_py/fasta_number.py .

# call python script to rename the OTUs in the format OTU_*
python fasta_number.py otus.fasta OTU_ > otus_renamed.fasta

The next step is to use the pooled fasta file (all_samples.fasta) generated in Section 4 to map the sequences originating from each sample file back their closest matching OTUs. Sample sequences that do not match any OTU sequenced at >97% similarity are assumed to be sequencing errors and are discarded.

# Use USEARCH otutab command to map original sequences back to OTUs
usearch -otutab all_samples.fasta -otus otus_renamed.fasta -otutabout otu_matrix.tsv -mapout map.txt

The resulting file otu_matrix.tsv is a count matrix in which columns are samples, rows are OTUs and each cell represents the number of sequences assigned to each OTU in each sample.

7. Assigning taxonomy to OTU sequences

Having generated a matrix of OTU abundance in different samples it’s necessary to match OTU sequences against a reference database in order to infer their taxonmy. One simple way to do this would be to copy each sequence into NCBI’s BLAST tool, which can be used to match sequences to the RefSeq 16S ribosomal RNA sequences database. (See the accompanying ASV tutorial for an example of this.)

An alternative method for classifying 16S gene sequences is to use the Ribosomal Database Project (RDP) classifier tool, which compares sequences to the RDP reference database. The RDP classifier can be run from the command line without the need to download the entire reference database.

# Run the RDP classifer. As this tool is written in java, it's necessary to first run
# java then pass the location of the installed classifier
java -Xmx1g -jar ~/.thursday_pm/bin/rdp_classifier_2.14/dist/classifier.jar classify \
  -f filterbyconf \
  -c 0.8 \
  -o otu_taxonomy_rdp_0.8.tsv \
  -h otu_taxonomy.hierachy \
  otus_renamed.fasta

Have a look at the two files (otu_taxonomy_rdp_0.8.tsv and otu_taxonomy.hierachy) produced by the RDP classifier.

The RDP classifier produces two files the first otu_taxonomy_rdp_0.8.tsv contains a taxonomic classification from each OTU from Kingdom to Genus level. The second otu_taxonomy.hierachy contains similar information in a commonly used hierachical format. Each taxonomic level encountered in our dataset is listed in this file, with the final column showing the number of times it is encountered.

For further information on the options available when running the RDP classifer have a look at the help pages.

# Running RDP without input arguments will cause it to print its help pages.
java -Xmx1g -jar ~/.thursday_pm/bin/rdp_classifier_2.14/dist/classifier.jar 
java -Xmx1g -jar ~/.thursday_pm/bin/rdp_classifier_2.14/dist/classifier.jar classify

What does the command -c 0.8 mean?

8. Amalgamating code into a pipeline script

Finally, having worked out how to run each of the steps necssary to generate an OTU count table, it should now be possible to take the code for each step and add it to the script 16s_pipeline.sh.

This effectively creates a simple computational pipeline. As the resulting script does not contain any hardcoded sample names, it should be possible to run it on other datasets.

Conclusions

This session provides an overview of the fundamental steps taken to process 16S gene sequence data from raw reads into operational taxonomic units that can be used for downstream analysis.

Consider heading to the second tutorial in this session for a comparison of this approach versus a ‘denoising’ approach that generates amplicons sequence variants (ASVs).