DADA2 Test

Sample mapping

When there are many samples with long or complicated names, this report renders poorly. To reduce the clutter, each sample is mapped to a unique number that is used as its label on every plot. The table below links the numbers shown in the plots back to the sample names.

sample reportId
ERR1382318 1
ERR1382380 2
ERR1382381 3
ERR1382389 4
ERR1382449 5
ERR1382452 6
ERR1382480 7
ERR1382483 8
ERR1382494 9
ERR1382520 10
ERR1382539 11
ERR1382553 12
ERR1382581 13
ERR1382596 14
ERR1382601 15
ERR1382610 16
ERR1382637 17
ERR1382659 18
ERR1382664 19
ERR1382724 20
ERR1382742 21
ERR1382773 22
ERR1382776 23
ERR1382785 24
ERR1382792 25
ERR1382793 26
ERR1382798 27
ERR1382865 28
ERR1382879 29
ERR1382908 30
ERR1382957 31
ERR1382965 32
ERR1383009 33
ERR1383010 34
ERR1383027 35
ERR1383048 36
ERR1383068 37
ERR1383095 38
ERR1383097 39
ERR1383123 40
ERR1383135 41
ERR1383137 42
ERR1383192 43
ERR1383242 44
ERR1383250 45
ERR1383258 46
ERR1383263 47
ERR1383270 48
ERR1383311 49
ERR1383315 50
ERR1383329 51
ERR1383347 52
ERR1383389 53
ERR1383414 54
ERR1383441 55
ERR1383476 56
ERR1383536 57
ERR1383541 58
ERR1383552 59
ERR1383569 60
ERR1383591 61
ERR1383646 62
ERR1383672 63
ERR1383798 64
ERR1383807 65
ERR1383867 66
ERR1383895 67
ERR1383999 68
ERR1384007 69
ERR1384030 70
ERR1384037 71
ERR1384067 72
ERR1384083 73
ERR1384091 74
ERR1384116 75
ERR1384120 76
ERR1384128 77
ERR1384162 78
ERR1384177 79
ERR1384184 80
ERR1384218 81
ERR1384246 82
ERR1384345 83
ERR1384388 84
ERR1384422 85
ERR1384441 86
ERR1384483 87
ERR1384600 88
ERR1384727 89
ERR1384736 90
ERR1384815 91
ERR1384901 92
ERR1384935 93
ERR1385099 94
ERR1385242 95
ERR1385294 96
ERR1385395 97
ERR1385526 98
ERR1385542 99
ERR1385550 100
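
A mapping like the one above can be sketched in a few lines. This is only an illustration: it assumes the ids are assigned to the sample names in sorted order, which matches the table but is not confirmed by the report, and the function name build_report_ids is hypothetical.

```python
def build_report_ids(sample_names):
    """Map each unique sample name to a 1-based integer id.

    Assumption (not confirmed by the report): ids follow the
    sorted order of the sample names.
    """
    unique_sorted = sorted(set(sample_names))
    return {name: idx for idx, name in enumerate(unique_sorted, start=1)}


# The first three sample names from the table, deliberately shuffled.
mapping = build_report_ids(["ERR1382381", "ERR1382318", "ERR1382380"])
for name, report_id in mapping.items():
    print(name, report_id)
```

Plot labels can then use the short integer id, and the mapping is printed once as a lookup table.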

dada2 filtering reads

The first stage of the dada2 pipeline filters and trims the reads. The number of reads retained for downstream analysis depends on the filtering and trimming parameters that were set. In most cases the vast majority of reads are expected to pass this step. Note that dada2 does not accept ambiguous “N” bases, so any read containing an N is removed.
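
The effect of the “N” rule on read counts can be illustrated with a small sketch. This is not dada2's own filtering code, only a toy version of the single rule described above; the function name drop_reads_with_n and the example reads are made up for illustration.

```python
def drop_reads_with_n(reads):
    """Remove reads that contain an ambiguous 'N' base.

    Returns the kept reads plus the input and output counts,
    mirroring the input/output summary reported per sample.
    """
    kept = [read for read in reads if "N" not in read.upper()]
    return kept, len(reads), len(kept)


# Toy reads (hypothetical): two of the four contain an N.
example_reads = ["ACGT", "ACNT", "GGCA", "NACG"]
kept_reads, n_input, n_output = drop_reads_with_n(example_reads)
print(f"input={n_input} output={n_output}")
```

In a real run other filters (for example truncation length or expected errors) would also reduce the output count, so the N rule is only one contribution to the difference between input and output.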

Number of reads input/output during filtering and trimming

Below is a summary of the number of input reads and the number of output reads for each sample.

Number of input and output reads during filtering step

Learning the error model

dada2 includes a step in which it learns the sequencing error model. From the tutorial:

The DADA2 algorithm makes use of a parametric error model (err) and every amplicon dataset has a different set of error rates. The learnErrors method learns this error model from the data, by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution. As in many machine-learning problems, the algorithm must begin with an initial guess, for which the maximum possible error rates in this data are used (the error rates if only the most abundant sequence is correct and all the rest are errors).

In contrast to the dada2 tutorial, and for the purposes of parallelisation, we learn the error model for each sample separately. This may not be ideal in all situations, but it does speed up data processing. It also means that a separate plot is produced for each sample. It is not feasible to display them all here, so only one is shown in this report; the others are available where the pipeline was run. Note that at the moment this is restricted to the forward reads (so that no error is thrown when using single-end data).
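
The initial guess mentioned in the tutorial excerpt (only the most abundant sequence is correct, everything else is an error) can be sketched as follows. This is a heavily simplified illustration, not the learnErrors implementation: it ignores quality scores, assumes all reads have the same length, and the function name initial_error_guess is hypothetical.

```python
from collections import Counter


def initial_error_guess(reads):
    """Estimate substitution rates assuming only the most abundant
    read is error-free and every other read derives from it.

    Simplification: quality scores are ignored and reads whose length
    differs from the reference are skipped.
    """
    counts = Counter(reads)
    reference, _ = counts.most_common(1)[0]

    # Count how often each reference base is observed as each base.
    transitions = Counter()
    for read, abundance in counts.items():
        if len(read) != len(reference):
            continue
        for ref_base, obs_base in zip(reference, read):
            transitions[(ref_base, obs_base)] += abundance

    # Normalise per reference base to turn counts into rates.
    totals = Counter()
    for (ref_base, _), n in transitions.items():
        totals[ref_base] += n
    return {pair: n / totals[pair[0]] for pair, n in transitions.items()}


# Toy data (hypothetical): 8 copies of one read, 2 copies with a T->A change.
rates = initial_error_guess(["ACGT"] * 8 + ["ACGA"] * 2)
print(rates[("T", "A")])
```

Because every difference from the most abundant read is counted as an error, these rates are an upper bound; the iterative procedure described in the tutorial then refines them.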

Error model

De-replication, sample inference, merging and chimera removal

The next stage of the dada2 pipeline involves dereplication, sample inference, merging (if paired-end) and chimera removal. Again from the tutorial, dereplication combines all identical sequencing reads into “unique sequences” with a corresponding “abundance” equal to the number of reads with that unique sequence. These are then taken forward into the sample inference and chimera removal stages. It is useful to see how many sequences remain once this has been done. The majority of reads should contribute to the final overall counts.
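
Dereplication as described above amounts to counting identical reads. The sketch below illustrates the idea only; it is not the dada2 implementation, and the function names and toy reads are hypothetical.

```python
from collections import Counter


def dereplicate(reads):
    """Collapse identical reads into unique sequences with abundances."""
    return Counter(reads)


def fraction_retained(n_input, n_final):
    """Share of input reads that contribute to the final counts."""
    return n_final / n_input if n_input else 0.0


# Toy reads (hypothetical): three unique sequences among six reads.
uniques = dereplicate(["ACGT", "ACGT", "GGCA", "ACGT", "GGCA", "TTAA"])
for sequence, abundance in uniques.most_common():
    print(sequence, abundance)
```

A helper such as fraction_retained can be applied to the per-sample input and final read counts to check that most reads survive the whole stage.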

Number of input and output reads during filtering step

Taxonomic assignment

The next stage is to assign each of the amplicon sequence variants (ASVs) to a taxonomic group. Below is a summary of the number of ASVs that were identified in each sample and the taxonomic groups found among them.
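
The kind of per-sample summary described above can be sketched as a simple tally. The record layout (sample, ASV id, taxon), the function name summarise_asvs and the example values are all assumptions made for illustration and do not come from the pipeline output.

```python
from collections import Counter, defaultdict


def summarise_asvs(records):
    """Count ASVs per sample and tally the taxa they were assigned to.

    Each record is a (sample, asv_id, taxon) tuple; records are
    assumed to be unique per sample/ASV pair.
    """
    asvs_per_sample = defaultdict(set)
    taxa_per_sample = defaultdict(Counter)
    for sample, asv_id, taxon in records:
        asvs_per_sample[sample].add(asv_id)
        taxa_per_sample[sample][taxon] += 1
    asv_counts = {sample: len(asvs) for sample, asvs in asvs_per_sample.items()}
    return asv_counts, dict(taxa_per_sample)


# Toy records (hypothetical values, keyed by report id rather than sample name).
toy_records = [
    ("1", "ASV1", "Firmicutes"),
    ("1", "ASV2", "Bacteroidetes"),
    ("1", "ASV3", "Firmicutes"),
    ("2", "ASV1", "Firmicutes"),
]
asv_counts, taxa_counts = summarise_asvs(toy_records)
print(asv_counts)
```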

Taxonomic assignments of ASVs
