02 Analysing 16S data with ASVs
Introduction
The goal of this practical session is to evaluate using a denoising approach as an alternative to OTU calling for removal of errors in 16S sequence data. A number of tools have beend developed for this purpose (e.g. Minimum entropy decomposition, UNOISE, and DEBLUR). This practical us based around the use of DADA2).
1. Fetching Data for DADA2 Analysis
2. Open DADA2 and Plot Read Quality
3. Trim and filter forward and reverse reads based on quality
4. Denoise reads with DADA2
5. Genererate amplicons and create an abundance matrix
6. Assign Taxonomy to DADA2 Amplicon Sequence Variants
1. Fetching Data for DADA2 Analysis
We’ll begin by opening a terminal, then creating a new working directory for this session.
cd ~
# Create working directory
mkdir amplicon_asv
cd amplicon_asv
# Fetch input fastq files containing amplicon sequence data
mkdir fastqs
cd fastqs
cp ~/.thursday_pm/data/input_fastq/*fastq.*.gz .Note that there are now gzip-compressed paired end sequence data for three samples in this location.
ls -l
# Files are gzip compressed, but it is still possible to view them
# using the linux pipe.
zcat sample1_trimmed.fastq.1.gz | head -4
# Move back into our working directory
cd ..
2. Open DADA2 and Plot Read Quality
As the DADA2 package is written in R, most of this session will be carried out in RStudio.
Open RStudio from your linux desktop and open a new R script.

Remember, you can write (and comment!) code in your new R script then run it via
the R terminal, either by copying and pasting it across, or by highlighting
it and selecting the Run icon in the top-right side of your script window.

Having opened RStudio, set your working directory to the location of your run directory and load the Dada2 library.
# Find out which working directory R is running in?
getwd()
# Set your working directory and check the contents
setwd("~/amplicon_asv/")
list.files('.')
list.files('./fastqs')
# Load Dada2
library('dada2')Note that these fastq files have been trimmed to remove PCR and sequencing primers.
However, they have not been trimmed to remove low quality sequences at the 3’
end of reads. We will use the DADA2 function plotQualityProfile() to visualize
the distribution in quality at each base in the forward and reverse reads from
the first sample.
# Create vectors in R containing details of forward and reverse reads
fastq1 <- sort(list.files('fastqs', pattern='.fastq.1.gz', full.names=T))
fastq2 <- sort(list.files('fastqs', pattern='.fastq.2.gz', full.names=T))
# Plot the read quality distribution for each sample pair
plotQualityProfile(c(fastq1, fastq2))
In this plot lines show the mean (green), median (orange) and 25th/75th percentiles (dashed) of the quality score at each base position in the forward and reverse reads.
3. Trim and filter forward and reverse reads based on quality
The first step in processing reads with DADA2 is to remove poor quality sequence. (Note in the OTU tutorial we were relying on the FLASh assembler and downstream OTU calling to account for low quality sequence.)
Similar to working on the commandline, it’s necessary to define the names of input and output files before running the filtering step.
# Create a subdirectory from within R to contain quality trimmed sequences
dir.create('fastqs_trimmed')
# This subdirectory now appears in your working directory
list.files('.')
list.files('./fastqs_trimmed')
# Fetch a list of sample names from the list of forward reads
sample.names <- sapply(strsplit(basename(fastq1), '_'), `[`, 1)
# Create vectors of output filenames for both forward and reverse reads
fastq1.filtered <- file.path('fastqs_trimmed', paste0(sample.names, '.fastq.1.gz'))
fastq2.filtered <- file.path('fastqs_trimmed', paste0(sample.names, '.fastq.2.gz'))
# Run quality trimming and filtering in DADA2
out <- filterAndTrim(fwd=fastq1, filt=fastq1.filtered,
rev=fastq2, filt.rev=fastq2.filtered,
truncLen=c(260, 260),
maxEE=c(2,2),
multithread=T)
# The subdirectory should now contain quality trimmed reads
list.files('./fastqs_trimmed')
print(out)
# Have a closer look at the arguments supplied to the filterAndTrim function
?filterAndTrimIn the code above, we’ve set the minimum permissible truncated length of for a sequence to be 260bp for both forward and reverse reads. These were originally 2x300bp reads generated on the Illumina MiSeq platform to span the (~500bp) V1-V3 region. However, 20bp has been trimmed from the 5’ end of each read during adapter removal. This minimum lengths used above have therefore been selected to ensure (minimal!) overlap between reads during amplicon assembly.
The histogram below shows a typical distribution in the length of V1-V3 amplicons generated from a human gut microbiome study.

What biases may be introduced by setting a stringent
truncLenvalue?
For more information on the arguments supplied in the quality trimming and filtering step type
?filterAndTriminto the R terminal.
Try altering
truncLenand replacingmaxEEwith a minimum quality score (minQ) to see what effect these parameters have on the number of reads remaining after filtering.
4. Denoise reads with DADA2
Denoising in DADA2 is a two step process, first it models the expected error in the forward and reverse reads, then it removes this error to create a corrected read set. For further detail see the methods and supplementary material in the original DADA2 publication.
# Model error in forward and reverse reads
error1 <- learnErrors(fastq1.filtered)
error2 <- learnErrors(fastq2.filtered)
# Remove exact replicate sequences
dereplicated1 <- derepFastq(fastq1.filtered, verbose=T)
names(dereplicated1) <- sample.names
dereplicated2 <- derepFastq(fastq2.filtered, verbose=T)
names(dereplicated2) <- sample.names
# Then apply the DADA2 denoising algorithm
corrected1 <- dada(dereplicated1, err=error1)
corrected2 <- dada(dereplicated2, err=error2)
corrected1
corrected2
5. Genererate amplicons and create an abundance matrix
Having corrected sequencing errors in forward and reverse reads, the remaining steps are to merge read pairs into amplicons, remove chimeras, and to calculate the abundance of each amplicon in each sample.
# Combine the forward and reverse read into a single set of amplicons
amplicons <- mergePairs(dadaF=corrected1, derepF=dereplicated1,
dadaR=corrected2, derepR=dereplicated2,
verbose=T)
# Generate a count matrix containing the counts for each denoised sequence
sequence.tab <- makeSequenceTable(amplicons, orderBy='abundance')
# The rows of this matrix are the samples
rownames(sequence.tab)
# The columns are the sequences
colnames(sequence.tab)
# Remove chimeras directly from the count.matrix
sequence.tab.nochimeras <- removeBimeraDenovo(sequence.tab, verbose=T)
# Finally, it's possible to write the denoised amplicons to a fasta file
uniquesToFasta(getUniques(sequence.tab.nochimeras), 'dada2_asvs.fasta')How many unique amplicons were created using the DADA2 denoising pipeline?
How many sequences were lost compared to using the FLASh assembler?
6. Assign Taxonomy to DADA2 Amplicon Sequence Variants
Within the DADA2 pipeline, it is possible to assign taxonomy to denoised
sequences via a two step process. The first step is to use the assignTaxonomy()
function. (Use ?assignTaxonomy to see how this compares to the method used in
the accompanying OTU tutorial).
As this function only returns classifications to genus level, the second step addSpecies()
attempts to find exact matchesfor each sequence in a curated reference database.
First, it’s necessary to retrieve reference databases for taxonomic and species assignment. These have been downloaded for you and can be copied directly into your current working directory.
# In your bash terminal cd into the current working directory
cd ~/amplicon_asv
# Copy the required reference fasta files to your working directory
cp ~/.thursday_pm/data/GTDBr95* .
ls -lHaving retrieved the reference databases, return to your RStudio session to perform taxonomic assignment.
# Assign taxonomy to genus level
taxonomy.table <- assignTaxonomy(seqs=sequence.tab.nochimeras,
refFasta="GTDBr95-Genus.fna",
multithread=T)
# Assign species via exact matching
taxonomy.table <- addSpecies(taxtab=taxonomy.table,
refFasta="GTDBr95-Species.fna",
verbose=T)
# As row names in the resulting table are the sequences
# remove them before viewing.
taxa.print <- taxonomy.table
row.names(taxa.print) <- NULL
taxa.print <- as.data.frame(taxa.print)
View(taxa.print)Alternatively, it is possible to assign taxonomy to the denoised amplicon sequences by using the NCBI nucleotide BLAST tool. This can be done by copying and pasting individual sequences into the online BLAST portal. Then selecting the highly-curated NCBI 16S rRNA database as a reference.

Try running BLAST for the most and least abundant ASVs in your output fasta file.
Try running BLAST against the NCBI Nucleotide collection database, rather than the 16S rRNA database.
Note it’s also possible to install BLAST locally and run it from the commandline. However, if you do this for taxonomic assignment, be careful about the arguments you provide.
Conclusions
Biological judgement’s required when interpreting both OTU and denoised amplicon data. In this example, all three samples were generated from sequencing the same mock community that (when put together) contained 36 species and 24 genera.
A table of the taxa present in the sample is given below:
| Actinomyces naeslundii MG-1 | Akkermansia muciniphila | Bacteroides caccae |
| Bacteroides vulgatus | Bifidobacterium dentium | Bifidobacterium longum subsp. infantis |
| Corynebacterium accolens | Corynebacterium amycolatum | Corynebacterium matruchotti |
| Enterococcus faecalis | Escherichia coli | Eubacterium brachy |
| Faecalibacterium prausnitzii | Fusobacterium nucleatum subsp. nucleatum | Gardnerella vaginalis |
| Lactobacillus crispatus | Lactobacillus gasseri | Lactobacillus iners |
| Lactobacillus jensenii | Lautropia mirabilis | Moraxella catarrhalis |
| Porphyromonas gingivalis | Prevotella melaninogenica | Prevotella nigrescens |
| Prevotella oralis | Cutibacterium acnes | Rothia dentocariosa |
| Ruminococcus lactaris | Staphylococcus aureus subsp. aureus | Streptococcus agalactiae |
| Streptococcus mutans | Streptococcus pneumoniae | Streptococcus sanguinis |
| Tannerella forsythia | Treponema denticola | Veillonella parvula |
A (non-exhaustive) list of points to consider:
How many ASVs were generated by Dada2? How does this compare to the number of unique taxa present in this mock community?
What impact does using a denoising approach like Dada2 have on sample richness estimations?
What proportion of ASVs could be identified to species level?
To what extent does taxonomic classification depend on a) the approach and b) the reference database used?
(NB. you can rerun the above code using alternative reference database, which are available for use with Dada2 here.)
How many ASVs could be matched to each genus present in the mock community?
How do these approaches handle intragenomic 16S gene copy variation?
How does using ASVs compare to using OTUs for community-level microbiome analysis?