“DADA2 Test”

Sample mapping

When there are many samples with long or complicated names, this document renders poorly. To reduce these visual problems, each sample is mapped to a unique number that is shown on each plot. The table below links the numbers in the plots to the sample names.

sample	reportId
ERR1382318	1
ERR1382380	2
ERR1382381	3
ERR1382389	4
ERR1382449	5
ERR1382452	6
ERR1382480	7
ERR1382483	8
ERR1382494	9
ERR1382520	10
ERR1382539	11
ERR1382553	12
ERR1382581	13
ERR1382596	14
ERR1382601	15
ERR1382610	16
ERR1382637	17
ERR1382659	18
ERR1382664	19
ERR1382724	20
ERR1382742	21
ERR1382773	22
ERR1382776	23
ERR1382785	24
ERR1382792	25
ERR1382793	26
ERR1382798	27
ERR1382865	28
ERR1382879	29
ERR1382908	30
ERR1382957	31
ERR1382965	32
ERR1383009	33
ERR1383010	34
ERR1383027	35
ERR1383048	36
ERR1383068	37
ERR1383095	38
ERR1383097	39
ERR1383123	40
ERR1383135	41
ERR1383137	42
ERR1383192	43
ERR1383242	44
ERR1383250	45
ERR1383258	46
ERR1383263	47
ERR1383270	48
ERR1383311	49
ERR1383315	50
ERR1383329	51
ERR1383347	52
ERR1383389	53
ERR1383414	54
ERR1383441	55
ERR1383476	56
ERR1383536	57
ERR1383541	58
ERR1383552	59
ERR1383569	60
ERR1383591	61
ERR1383646	62
ERR1383672	63
ERR1383798	64
ERR1383807	65
ERR1383867	66
ERR1383895	67
ERR1383999	68
ERR1384007	69
ERR1384030	70
ERR1384037	71
ERR1384067	72
ERR1384083	73
ERR1384091	74
ERR1384116	75
ERR1384120	76
ERR1384128	77
ERR1384162	78
ERR1384177	79
ERR1384184	80
ERR1384218	81
ERR1384246	82
ERR1384345	83
ERR1384388	84
ERR1384422	85
ERR1384441	86
ERR1384483	87
ERR1384600	88
ERR1384727	89
ERR1384736	90
ERR1384815	91
ERR1384901	92
ERR1384935	93
ERR1385099	94
ERR1385242	95
ERR1385294	96
ERR1385395	97
ERR1385526	98
ERR1385542	99
ERR1385550	100
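
A mapping like the one above could be generated along the following lines. This is a minimal Python sketch, not the pipeline's own code: the function name `build_report_ids` is hypothetical, and numbering the samples in sorted order is an assumption (the table is consistent with it).

```python
def build_report_ids(sample_names):
    """Map each unique sample name to a 1-based report ID, in sorted order.

    Sorted-order numbering is an assumption made for this sketch.
    """
    unique_names = sorted(set(sample_names))
    return {name: idx for idx, name in enumerate(unique_names, start=1)}


# First three samples from the table above, deliberately given out of order.
report_ids = build_report_ids(["ERR1382381", "ERR1382318", "ERR1382380"])
```

Looking up `report_ids["ERR1382380"]` then gives the number used for that sample in the plots.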

dada2 filtering reads

The first stage of the dada2 pipeline filters and trims the reads. The number of reads that remain for downstream analysis depends on the filtering and trimming parameters that were set. In most cases the vast majority of reads are expected to pass this step. Note that dada2 does not accept any “N” bases, so any read containing an N is removed.
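
The effect of this step on read counts can be illustrated with a small Python sketch. It is not dada2's implementation: the function `filter_reads` and its `trunc_len` parameter are hypothetical, and only truncation and the no-N rule are modelled (quality-based filters are left out).

```python
def filter_reads(reads, trunc_len=None):
    """Return the reads that survive truncation and the no-N rule.

    reads: iterable of sequence strings.
    trunc_len: hypothetical parameter for this sketch; if set, reads shorter
    than this are dropped and longer reads are cut to this length.
    """
    kept = []
    for seq in reads:
        if trunc_len is not None:
            if len(seq) < trunc_len:
                continue  # too short to truncate to the requested length
            seq = seq[:trunc_len]
        if "N" in seq.upper():
            continue  # reads with an ambiguous base are removed
        kept.append(seq)
    return kept


reads_in = ["ACGTACGT", "ACGNACGT", "ACG", "ACGTACGN"]
reads_out = filter_reads(reads_in, trunc_len=6)
n_input, n_output = len(reads_in), len(reads_out)
```

In this toy input the last read survives because its N lies beyond the truncation point, which shows how the trimming parameters change the number of reads retained.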

Number of reads input/output during filtering and trimming

Below is a summary of the number of input reads and the number of output reads for each sample.

Number of input and output reads during filtering step

Learning the error model

dada2 includes a step in which it learns the sequencing error model from the data. From the tutorial:

The DADA2 algorithm makes use of a parametric error model (err) and every amplicon dataset has a different set of error rates. The learnErrors method learns this error model from the data, by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution. As in many machine-learning problems, the algorithm must begin with an initial guess, for which the maximum possible error rates in this data are used (the error rates if only the most abundant sequence is correct and all the rest are errors).

In contrast to the dada2 tutorial, and for the purposes of parallelisation, we learn the error model for each sample separately. This may not be ideal in all situations, but it does speed up data processing. It also means that a separate plot is produced for each sample. It is not feasible to display them all here, so only one is shown in this report; all others are available where the pipeline was run. Note that at the moment this is restricted to the forward reads (so that no error is thrown when using single-end data).
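
The "initial guess" mentioned in the tutorial excerpt (only the most abundant sequence is correct and everything else is an error) can be sketched in Python as follows. This is a deliberate simplification and an assumption-laden illustration, not the `learnErrors` algorithm: the function name `initial_error_rates` is hypothetical, all sequences are assumed to have the same length so that no alignment is needed, and quality scores, which the real error model takes into account, are ignored.

```python
from collections import Counter


def initial_error_rates(unique_counts):
    """Estimate substitution rates, treating the most abundant sequence as truth.

    unique_counts: dict mapping sequence -> abundance. All sequences are
    assumed to be the same length (no alignment is performed).
    Returns a dict mapping (true_base, observed_base) -> rate.
    """
    reference = max(unique_counts, key=unique_counts.get)
    transitions = Counter()  # (true, observed) -> number of observed bases
    totals = Counter()       # true base -> number of observed bases
    for seq, abundance in unique_counts.items():
        for true_base, obs_base in zip(reference, seq):
            transitions[(true_base, obs_base)] += abundance
            totals[true_base] += abundance
    return {pair: n / totals[pair[0]] for pair, n in transitions.items()}


# Toy data: 8 reads of the dominant sequence, 2 reads differing at the last base.
rates = initial_error_rates({"ACGT": 8, "ACGA": 2})
```

Here every T-to-A difference is counted as an error, which is why this starting point is an upper bound that the iterative procedure then refines.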

Error model

De-replication, sample inference, merging and chimera removal

The next stage of the dada2 pipeline involves dereplication, sample inference, merging (if paired-end) and chimera removal. Again from the tutorial, dereplication combines all identical sequencing reads into “unique sequences” with a corresponding “abundance” equal to the number of reads with that unique sequence. These are then taken forward into the sample inference and chimera removal stages. It is useful to see how many sequences remain once this has been done. The majority of reads should contribute to the final overall counts.
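
Dereplication as described above amounts to counting identical reads. A minimal Python sketch of the idea, using toy reads and a hypothetical function name `dereplicate` (dada2's own dereplication also keeps quality information, which is omitted here):

```python
from collections import Counter


def dereplicate(reads):
    """Collapse identical reads into unique sequences with abundances."""
    return Counter(reads)


uniques = dereplicate(["ACGT", "ACGT", "TTGA", "ACGT", "TTGA", "GGCC"])
# most_common() orders unique sequences from most to least abundant.
ranked = uniques.most_common()
# The abundances always sum back to the number of input reads.
total_reads = sum(uniques.values())
```

Working on unique sequences rather than individual reads is what keeps the later inference steps tractable.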

Number of input and output reads during filtering step

Taxonomic assignment

The next stage is to assign each of the amplicon sequence variants (ASVs) to a taxonomic group. Below is a summary of the number of ASVs that were identified in each sample and the taxonomic groups found among them.
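
A per-sample summary of this kind could be computed as in the Python sketch below. The data structures, the function name `summarise_asvs`, the ASV identifiers and the example counts are all hypothetical and chosen only for illustration; they do not reflect the pipeline's actual output format or results.

```python
from collections import Counter


def summarise_asvs(counts, taxonomy, rank="Phylum"):
    """Count the ASVs present in each sample and tally them by taxonomic group.

    counts: dict mapping sample -> {asv_id: read count}.
    taxonomy: dict mapping asv_id -> {rank name: taxon or None}.
    """
    summary = {}
    for sample, asv_counts in counts.items():
        present = [asv for asv, n in asv_counts.items() if n > 0]
        groups = Counter(
            taxonomy[asv].get(rank) or "unassigned" for asv in present
        )
        summary[sample] = {"n_asvs": len(present), "groups": dict(groups)}
    return summary


# Hypothetical toy data, keyed by report ID rather than sample name.
counts = {
    "1": {"ASV1": 120, "ASV2": 0, "ASV3": 15},
    "2": {"ASV1": 40, "ASV2": 7, "ASV3": 0},
}
taxonomy = {
    "ASV1": {"Phylum": "Firmicutes"},
    "ASV2": {"Phylum": "Bacteroidetes"},
    "ASV3": {"Phylum": None},  # no assignment at this rank
}
summary = summarise_asvs(counts, taxonomy)
```

ASVs without an assignment at the chosen rank are grouped under "unassigned" so that they still count towards the per-sample totals.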

Taxonomic assignments of ASVs