When there are many samples with long sample names, the rendering of this document can become cluttered. To mitigate this (somewhat), each sample is mapped to a unique number that is plotted on each graph; the table below links the numbers shown in the plots to the sample names.
sample | reportId |
---|---|
stool-2DSS__1 | 1 |
stool-2DSS__10 | 2 |
stool-2DSS__11 | 3 |
stool-2DSS__12 | 4 |
stool-2DSS__13 | 5 |
stool-2DSS__14 | 6 |
stool-2DSS__16 | 7 |
stool-2DSS__18 | 8 |
stool-2DSS__19 | 9 |
stool-2DSS__2 | 10 |
stool-2DSS__21 | 11 |
stool-2DSS__23 | 12 |
stool-2DSS__24 | 13 |
stool-2DSS__25 | 14 |
stool-2DSS__27 | 15 |
stool-2DSS__28 | 16 |
stool-2DSS__3 | 17 |
stool-2DSS__30 | 18 |
stool-2DSS__5 | 19 |
stool-2DSS__6 | 20 |
stool-2DSS__7 | 21 |
stool-2DSS__8 | 22 |
stool-2DSS__9 | 23 |
stool-3DSS__11 | 24 |
stool-3DSS__16 | 25 |
stool-3DSS__2 | 26 |
stool-3DSS__21 | 27 |
stool-3DSS__22 | 28 |
stool-3DSS__23 | 29 |
stool-3DSS__24 | 30 |
stool-3DSS__26 | 31 |
stool-3DSS__27 | 32 |
stool-3DSS__29 | 33 |
stool-3DSS__30 | 34 |
stool-3DSS__7 | 35 |
stool-4DSS__1 | 36 |
stool-4DSS__16 | 37 |
stool-4DSS__23 | 38 |
stool-4DSS__24 | 39 |
stool-4DSS__6 | 40 |
The first stage of the dada2 pipeline is filtering and trimming of reads. The number of reads that remain for downstream analysis depends on the filtering and trimming parameters that were set. In most cases the vast majority of reads are expected to remain after this step. Note that dada2 does not accept any “N” bases and will therefore remove any read that contains an N.
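For reference, below is a minimal sketch of the filtering call as performed with dada2. The file names and the truncation/quality parameters are illustrative assumptions and will differ from the values the pipeline was actually run with; maxN = 0 (the default) is what discards reads containing an N.

```r
library(dada2)

# Illustrative per-sample file paths (not the pipeline's actual paths).
fwd   <- "stool-2DSS__1_R1.fastq.gz"
rev   <- "stool-2DSS__1_R2.fastq.gz"
filtF <- "filtered/stool-2DSS__1_R1_filt.fastq.gz"
filtR <- "filtered/stool-2DSS__1_R2_filt.fastq.gz"

# maxN = 0 discards any read containing an "N"; truncLen, maxEE and
# truncQ are example values only and should match the pipeline settings.
out <- filterAndTrim(fwd, filtF, rev, filtR,
                     truncLen = c(240, 160), maxN = 0,
                     maxEE = c(2, 2), truncQ = 2,
                     rm.phix = TRUE, compress = TRUE, multithread = TRUE)
out  # matrix of reads.in and reads.out per sample
```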
Below is a summary of the number of input reads and the number of output reads for each sample.
Number of input and output reads during filtering step
Dada2 next performs a step in which it learns the sequencing error model. Taken from the tutorial:
The DADA2 algorithm makes use of a parametric error model (err) and every amplicon dataset has a different set of error rates. The learnErrors method learns this error model from the data, by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution. As in many machine-learning problems, the algorithm must begin with an initial guess, for which the maximum possible error rates in this data are used (the error rates if only the most abundant sequence is correct and all the rest are errors).
In contrast to the dada2 tutorial, and for the purposes of parallelisation, we learn the error model for each sample separately. This may not be ideal in all situations, but it does speed up data processing. It also means that an error-model plot is produced for each sample; it is not feasible to display them all here, so only one is shown in this report, while the others are available in the directory where the pipeline was run. Note that at the moment this is restricted to the forward reads (so no error is thrown when using single-end data).
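Below is a minimal sketch of the per-sample error learning and plotting, assuming the filtered forward-read file from the previous step; the file name is illustrative.

```r
library(dada2)

# Learn the error model from a single sample's (forward) filtered reads;
# the tutorial pools all samples, whereas here each sample is run separately.
errF <- learnErrors("filtered/stool-2DSS__1_R1_filt.fastq.gz",
                    multithread = TRUE)

# Plot observed error frequencies against the fitted error model.
plotErrors(errF, nominalQ = TRUE)
```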
Error model
The next stage of the dada2 pipeline involves dereplication, sample inference, merging (if paired-end) and chimera removal. Again from the tutorial, dereplication combines all identical sequencing reads into “unique sequences” with a corresponding “abundance” equal to the number of reads with that unique sequence. These are then taken forward into the sample inference stage and chimera removal. It is useful to see how many sequences remain after this has been done; the majority of reads should contribute to the final overall counts.
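Below is a minimal sketch of this stage for a single paired-end sample; the file names are illustrative and the error models (errF, errR) are learned per sample as in the previous step.

```r
library(dada2)

# Per-sample error models (reverse-read model included so merging is runnable).
errF <- learnErrors("filtered/stool-2DSS__1_R1_filt.fastq.gz", multithread = TRUE)
errR <- learnErrors("filtered/stool-2DSS__1_R2_filt.fastq.gz", multithread = TRUE)

# Dereplication: collapse identical reads into unique sequences with abundances.
derepF <- derepFastq("filtered/stool-2DSS__1_R1_filt.fastq.gz")
derepR <- derepFastq("filtered/stool-2DSS__1_R2_filt.fastq.gz")

# Sample inference with the learned error models.
dadaF <- dada(derepF, err = errF, multithread = TRUE)
dadaR <- dada(derepR, err = errR, multithread = TRUE)

# Merge paired reads, build the sequence table and remove chimeras.
merged        <- mergePairs(dadaF, derepF, dadaR, derepR)
seqtab        <- makeSequenceTable(merged)
seqtab.nochim <- removeBimeraDenovo(seqtab, method = "consensus",
                                    multithread = TRUE)

# Track how many reads survive each stage.
getN <- function(x) sum(getUniques(x))
c(denoised = getN(dadaF), merged = getN(merged), nonchim = sum(seqtab.nochim))
```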
Number of input and output reads during the sample inference, merging and chimera removal steps
The next stage is to assign each of the amplicon sequence variants (ASVs) to a taxonomic group. Below is a summary of the number of ASVs identified in each sample and the taxonomic groups found among them.
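Below is a minimal sketch of the taxonomic assignment, building on the chimera-free sequence table (seqtab.nochim) from the previous sketch; the reference training-set file name is an illustrative assumption and should match whichever database the pipeline was configured to use.

```r
library(dada2)

# Assign taxonomy to each ASV against a reference training set
# (the file name shown here is an example only).
taxa <- assignTaxonomy(seqtab.nochim,
                       "silva_nr99_v138.1_train_set.fa.gz",
                       multithread = TRUE)
head(taxa)  # per-ASV assignments from Kingdom down to Genus
```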
Taxonomic assignments of ASVs