“MMP-9 in DSS colitis”

Sample mapping

When there are many samples with long sample names, the rendering of this document is suboptimal. To mitigate this (somewhat), we map each sample to a unique number that is plotted on each graph. The table below links the numbers you see in the plots to the sample names.

sample reportId
stool-2DSS__1 1
stool-2DSS__10 2
stool-2DSS__11 3
stool-2DSS__12 4
stool-2DSS__13 5
stool-2DSS__14 6
stool-2DSS__16 7
stool-2DSS__18 8
stool-2DSS__19 9
stool-2DSS__2 10
stool-2DSS__21 11
stool-2DSS__23 12
stool-2DSS__24 13
stool-2DSS__25 14
stool-2DSS__27 15
stool-2DSS__28 16
stool-2DSS__3 17
stool-2DSS__30 18
stool-2DSS__5 19
stool-2DSS__6 20
stool-2DSS__7 21
stool-2DSS__8 22
stool-2DSS__9 23
stool-3DSS__11 24
stool-3DSS__16 25
stool-3DSS__2 26
stool-3DSS__21 27
stool-3DSS__22 28
stool-3DSS__23 29
stool-3DSS__24 30
stool-3DSS__26 31
stool-3DSS__27 32
stool-3DSS__29 33
stool-3DSS__30 34
stool-3DSS__7 35
stool-4DSS__1 36
stool-4DSS__16 37
stool-4DSS__23 38
stool-4DSS__24 39
stool-4DSS__6 40
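A minimal sketch of how such a map can be built (this is illustrative, not the pipeline's actual code). Note that the report IDs follow the lexicographic order of the sample names, which is why, for example, stool-2DSS__10 (reportId 2) precedes stool-2DSS__2 (reportId 10):

```r
# Illustrative sketch (not the pipeline's actual code): map each sample
# name to a sequential reportId after lexicographic sorting.
samples <- c("stool-2DSS__2", "stool-2DSS__1", "stool-2DSS__10")
sample_map <- data.frame(sample   = sort(samples),
                         reportId = seq_along(samples))
sample_map
#>           sample reportId
#> 1  stool-2DSS__1        1
#> 2 stool-2DSS__10        2
#> 3  stool-2DSS__2        3
```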

dada2: filtering reads

The first stage of the dada2 pipeline is filtering and trimming of reads. The number of reads that remain for downstream analysis depends on the parameters that were set for filtering and trimming. In most cases the vast majority of reads would be expected to remain after this step. It is noteworthy that dada2 does not accept any “N” bases, and so will remove a read if there is an N anywhere in its sequence.
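This step can be sketched with dada2's `filterAndTrim` function. The file paths and parameter values below are illustrative assumptions, not the settings actually used by this pipeline:

```r
library(dada2)

# Illustrative paths and parameters -- not the pipeline's actual settings.
fnFs   <- sort(list.files("fastq", pattern = "_R1.fastq.gz", full.names = TRUE))
filtFs <- file.path("filtered", basename(fnFs))

out <- filterAndTrim(fnFs, filtFs,
                     truncLen    = 240,  # truncate reads at 240 bp
                     maxN        = 0,    # dada2 accepts no N bases; reads with an N are removed
                     maxEE       = 2,    # maximum expected errors per read
                     truncQ      = 2,    # truncate at the first base with quality <= 2
                     rm.phix     = TRUE, # remove PhiX spike-in reads
                     multithread = TRUE)

# 'out' is a matrix with one row per input file and columns
# reads.in / reads.out -- the numbers summarised below.
head(out)
```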

Number of reads input/output during filtering and trimming

Below is a summary of the number of input reads and the number of output reads for each sample.

Number of input and output reads during filtering step


Learning the error model

Dada2 next performs a step in which it learns the sequencing error model. From the dada2 tutorial:

The DADA2 algorithm makes use of a parametric error model (err) and every amplicon dataset has a different set of error rates. The learnErrors method learns this error model from the data, by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution. As in many machine-learning problems, the algorithm must begin with an initial guess, for which the maximum possible error rates in this data are used (the error rates if only the most abundant sequence is correct and all the rest are errors).

In contrast to the dada2 tutorial, and for the purposes of parallelisation, we learn the error model for each sample separately. It should be noted that this may not be ideal in all situations, but it does speed up data processing. It also means that we produce a plot for each sample separately. It is not feasible to display them all here, so for the purposes of this report we inspect just one; all the others are available where the pipeline was run. Note that at the moment this is restricted to the forward reads (so no error is thrown when using single-end data).
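The per-sample variant of this step can be sketched as below, using dada2's `learnErrors` and `plotErrors`. The file paths are illustrative assumptions:

```r
library(dada2)

# Illustrative sketch: learn the error model per sample (forward reads only),
# rather than pooling reads across samples as in the dada2 tutorial.
filtFs <- sort(list.files("filtered", pattern = "_R1.fastq.gz", full.names = TRUE))

# One fitted error model per sample; each call can run as a separate job.
errs <- lapply(filtFs, function(f) learnErrors(f, multithread = TRUE))

# One error-rate plot per sample; the report shows only the first.
plotErrors(errs[[1]], nominalQ = TRUE)
```

Learning errors per sample trades some accuracy (fewer reads per fit) for embarrassingly parallel execution, which is why the text above notes it may not be ideal in all situations.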