Author: Xiyu Peng
Next-generation amplicon sequencing is a powerful tool for investigating microbial communities. One main challenge is to distinguish true biological variants from errors caused by PCR and sequencing. In the traditional analysis pipeline, such errors are eliminated by clustering reads within a sequence similarity threshold, usually 97%, and constructing operational taxonomic units (OTUs). However, the arbitrary threshold can lead to low resolution and high false-positive rates.
Here, we introduce AmpliCI, a reference-free model-based method for rapidly resolving the number, abundance, and identity of error-free sequences in massive Illumina amplicon datasets. AmpliCI takes into account quality information and allows the data, rather than an arbitrary threshold or an external database, to drive the conclusion. AmpliCI estimates a mixture model, using a greedy strategy to gradually select error-free sequences while approximately maximizing the likelihood. In a simulation study, our method achieves better accuracy than competing methods, especially in communities of closely related species. AmpliCI also shows comparable or better accuracy than other methods when analyzing real mock datasets.