When there are many samples with long sample names, the rendering of this document can become cluttered. To mitigate this (somewhat), each sample is mapped to a unique number that is plotted on each graph; the table below links the numbers shown in the plots to the sample names.
sample | reportId |
---|---|
stool-2DSS__1 | 1 |
stool-2DSS__10 | 2 |
stool-2DSS__11 | 3 |
stool-2DSS__12 | 4 |
stool-2DSS__13 | 5 |
stool-2DSS__14 | 6 |
stool-2DSS__16 | 7 |
stool-2DSS__18 | 8 |
stool-2DSS__19 | 9 |
stool-2DSS__2 | 10 |
stool-2DSS__21 | 11 |
stool-2DSS__23 | 12 |
stool-2DSS__24 | 13 |
stool-2DSS__25 | 14 |
stool-2DSS__27 | 15 |
stool-2DSS__28 | 16 |
stool-2DSS__3 | 17 |
stool-2DSS__30 | 18 |
stool-2DSS__5 | 19 |
stool-2DSS__6 | 20 |
stool-2DSS__7 | 21 |
stool-2DSS__8 | 22 |
stool-2DSS__9 | 23 |
stool-3DSS__11 | 24 |
stool-3DSS__16 | 25 |
stool-3DSS__2 | 26 |
stool-3DSS__21 | 27 |
stool-3DSS__22 | 28 |
stool-3DSS__23 | 29 |
stool-3DSS__24 | 30 |
stool-3DSS__26 | 31 |
stool-3DSS__27 | 32 |
stool-3DSS__29 | 33 |
stool-3DSS__30 | 34 |
stool-3DSS__7 | 35 |
stool-4DSS__1 | 36 |
stool-4DSS__16 | 37 |
stool-4DSS__23 | 38 |
stool-4DSS__24 | 39 |
stool-4DSS__6 | 40 |
The first stage of the dada2 pipeline is filtering and trimming of reads. The number of reads that remain for downstream analysis depends on the filtering and trimming parameters that were set. In most cases the vast majority of reads are expected to remain after this step. Note that dada2 does not accept any “N” bases and will therefore remove any read that contains an N.
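For reference, below is a minimal sketch of the filtering call as performed with dada2. The file names and the truncation/quality parameters are illustrative assumptions and will differ from the values the pipeline was actually run with; maxN = 0 (the default) is what discards reads containing an N.

```r
library(dada2)

# Illustrative per-sample file paths (not the pipeline's actual paths).
fwd   <- "stool-2DSS__1_R1.fastq.gz"
rev   <- "stool-2DSS__1_R2.fastq.gz"
filtF <- "filtered/stool-2DSS__1_R1_filt.fastq.gz"
filtR <- "filtered/stool-2DSS__1_R2_filt.fastq.gz"

# maxN = 0 discards any read containing an "N"; truncLen, maxEE and
# truncQ are example values only and should match the pipeline settings.
out <- filterAndTrim(fwd, filtF, rev, filtR,
                     truncLen = c(240, 160), maxN = 0,
                     maxEE = c(2, 2), truncQ = 2,
                     rm.phix = TRUE, compress = TRUE, multithread = TRUE)
out  # matrix of reads.in and reads.out per sample
```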
Below is a summary of the number of input reads and the number of output reads for each sample.
Number of input and output reads during filtering step
Dada2 next performs a step in which it learns the sequencing error model. Taken from the tutorial:
The DADA2 algorithm makes use of a parametric error model (err) and every amplicon dataset has a different set of error rates. The learnErrors method learns this error model from the data, by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution. As in many machine-learning problems, the algorithm must begin with an initial guess, for which the maximum possible error rates in this data are used (the error rates if only the most abundant sequence is correct and all the rest are errors).
In contrast to the dada2 tutorial, and for the purposes of parallelisation, we learn the error model for each sample separately. This may not be ideal in all situations, but it does speed up data processing. It also means that an error-model plot is produced for each sample; it is not feasible to display them all here, so only one is shown in this report, while the others are available in the directory where the pipeline was run. Note that at the moment this is restricted to the forward reads (so no error is thrown when using single-end data).
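Below is a minimal sketch of the per-sample error learning and plotting, assuming the filtered forward-read file from the previous step; the file name is illustrative.

```r
library(dada2)

# Learn the error model from a single sample's (forward) filtered reads;
# the tutorial pools all samples, whereas here each sample is run separately.
errF <- learnErrors("filtered/stool-2DSS__1_R1_filt.fastq.gz",
                    multithread = TRUE)

# Plot observed error frequencies against the fitted error model.
plotErrors(errF, nominalQ = TRUE)
```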
Error model
The next stage of the dada2 pipeline involves dereplication, sample inference, merging (if paired-end) and chimera removal. Again from the tutorial, dereplication combines all identical sequencing reads into “unique sequences” with a corresponding “abundance” equal to the number of reads with that unique sequence. These are then taken forward into the sample inference stage and chimera removal. It is useful to see how many sequences remain after this has been done; the majority of reads should contribute to the final overall counts.
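Below is a minimal sketch of this stage for a single paired-end sample; the file names are illustrative and the error models (errF, errR) are learned per sample as in the previous step.

```r
library(dada2)

# Per-sample error models (reverse-read model included so merging is runnable).
errF <- learnErrors("filtered/stool-2DSS__1_R1_filt.fastq.gz", multithread = TRUE)
errR <- learnErrors("filtered/stool-2DSS__1_R2_filt.fastq.gz", multithread = TRUE)

# Dereplication: collapse identical reads into unique sequences with abundances.
derepF <- derepFastq("filtered/stool-2DSS__1_R1_filt.fastq.gz")
derepR <- derepFastq("filtered/stool-2DSS__1_R2_filt.fastq.gz")

# Sample inference with the learned error models.
dadaF <- dada(derepF, err = errF, multithread = TRUE)
dadaR <- dada(derepR, err = errR, multithread = TRUE)

# Merge paired reads, build the sequence table and remove chimeras.
merged        <- mergePairs(dadaF, derepF, dadaR, derepR)
seqtab        <- makeSequenceTable(merged)
seqtab.nochim <- removeBimeraDenovo(seqtab, method = "consensus",
                                    multithread = TRUE)

# Track how many reads survive each stage.
getN <- function(x) sum(getUniques(x))
c(denoised = getN(dadaF), merged = getN(merged), nonchim = sum(seqtab.nochim))
```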
Number of input and output reads during the sample inference, merging and chimera removal steps
The next stage is to assign each of the amplicon sequence variants (ASVs) to a taxonomic group. Below is a summary of the number of ASVs identified in each sample and the taxonomic groups found among them.
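Below is a minimal sketch of the taxonomic assignment, building on the chimera-free sequence table (seqtab.nochim) from the previous sketch; the reference training-set file name is an illustrative assumption and should match whichever database the pipeline was configured to use.

```r
library(dada2)

# Assign taxonomy to each ASV against a reference training set
# (the file name shown here is an example only).
taxa <- assignTaxonomy(seqtab.nochim,
                       "silva_nr99_v138.1_train_set.fa.gz",
                       multithread = TRUE)
head(taxa)  # per-ASV assignments from Kingdom down to Genus
```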
Taxonomic assignments of ASVs