SRST2 [1], ARIBA [2], and KmerResistance [3] are three widely used standalone tools for read-based detection of target genes in bacterial genomes. Published in 2014, SRST2 is recognised as the pioneer amongst these three tools [2, 3]. In this post, I concisely compare the methodologies underlying these tools to help with choosing appropriate software for gene detection. In particular, I assume here that detecting antimicrobial resistance determinants is the only use case. Unless otherwise specified, the software versions referred to in this post are SRST2 v0.2.0, ARIBA v2.14.4, and KmerResistance v2.2.
1. Reference database
All three tools use redundant reference databases, in which matches are often not unique [5].
SRST2
- Only contains presence-or-absence sequences;
- Clusters sequences by similarity (default: 80% nucleotide identity and default settings of CD-HIT-EST [4]);
- A cluster may consist of multiple sequences.
ARIBA
- Four kinds of sequences: coding versus non-coding, “presence or absence” versus resistance-associated variants (“variant only”);
- Clusters sequences by similarity (default: 90% nucleotide identity and 100% length identity);
- A cluster may consist of multiple sequences;
- Consistency assessment for quality control:
  - Excludes partial coding sequences, a key difference from SRST2 and KmerResistance;
  - Excludes variant-only sequences whose translations do not match known protein-level variants (defined in the metadata file of a database).
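The idea of excluding partial coding sequences can be illustrated with a minimal sketch. This is not ARIBA's actual implementation: the function name `is_complete_cds` and the set of accepted start codons are assumptions made for illustration, and ARIBA's own checks may differ in detail.

```python
# Illustrative only: a simple completeness check for a coding sequence.
STOP_CODONS = {"TAA", "TAG", "TGA"}
# Assumed set of bacterial start codons; the real tool may accept others.
START_CODONS = {"ATG", "GTG", "TTG"}


def is_complete_cds(seq):
    """Return True if seq looks like a complete, uninterrupted coding sequence."""
    seq = seq.upper()
    # A complete CDS needs at least a start and a stop codon, in frame.
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if codons[0] not in START_CODONS:
        return False
    if codons[-1] not in STOP_CODONS:
        return False
    # Any stop codon before the last codon indicates a premature stop.
    return not any(codon in STOP_CODONS for codon in codons[1:-1])


print(is_complete_cds("ATGAAATAA"))     # complete: start, one codon, stop
print(is_complete_cds("ATGAAATA"))      # partial: length not a multiple of 3
print(is_complete_cds("ATGTAAAAATAA"))  # interrupted: internal stop codon
```

A filter of this kind would be applied to every coding reference sequence before clustering, so that truncated database entries cannot become the "best match" for an intact gene in a sample.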
KmerResistance
- Requires indexing a reference gene database (the ResFinder database by default) and an optional species database (by default, the one provided by the Danish Center for Genomic Epidemiology) using KMA [5].
- No clustering of the reference sequences.
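For the tools that do cluster their references (SRST2 and ARIBA), the following minimal sketch shows the general idea of greedy, identity-based clustering in the spirit of CD-HIT. It is a toy example under stated assumptions: `identity` is a crude ungapped comparison rather than a real alignment, and both function names are hypothetical.

```python
def identity(a, b):
    """Ungapped identity: matching positions divided by the longer length."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(seqs, min_identity=0.9):
    """Cluster sequences greedily, longest first.

    seqs maps sequence names to nucleotide strings. The first member of
    each returned cluster is its representative; a sequence joins the first
    cluster whose representative it matches at or above min_identity.
    """
    clusters = []
    # sorted() is stable, so sequences of equal length keep their input order.
    for name in sorted(seqs, key=lambda n: len(seqs[n]), reverse=True):
        for cluster in clusters:
            if identity(seqs[cluster[0]], seqs[name]) >= min_identity:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


references = {
    "a1": "ACGTACGTAC",
    "a2": "ACGTACGTAT",  # one mismatch to a1 (90% identity)
    "b1": "TTTTGGGGCC",
}
print(greedy_cluster(references, min_identity=0.9))
```

Raising `min_identity` splits alleles into more clusters, which is why the choice of threshold determines how many "best matches" per sample a tool can report.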
2. Software dependencies
Both SRST2 and ARIBA are Python programs, whereas KmerResistance is built from C code.
SRST2
- Bowtie 2;
- Python 2 (a move to Python 3 will probably happen soon).
ARIBA
- MUMmer 3;
- Bowtie 2;
- Python 3.
KmerResistance
- KMA: indexing the gene database, mapping reads to references, and making the species database sparse [6].
3. Input sequencing reads
SRST2
- Paired-end or single-end short reads or both.
ARIBA
- Paired-end short reads.
KmerResistance
- Short reads, single-end or paired-end;
- Long reads such as those from Nanopore sequencing.
4. Sequence alignment
This section summarises the software and algorithms that SRST2, ARIBA, and KmerResistance use to find the best hit for each target gene in query bacterial genomes.
SRST2
- Mapping reads to reference sequences using Bowtie 2, producing BAM files;
- Generating pileups from BAM files using SAMtools;
- Scoring the matched reference alleles of each sequence cluster by the slope of a Q-Q plot generated from binomial tests; the reference allele with the lowest score is chosen as the best match.
ARIBA (two rounds of read mapping, one round of assembly, and one round of assembly-reference alignment)
- Mapping reads to clustered reference sequences with minimap;
- Assembling reads mapped to each cluster into contigs;
- Within each cluster to which reads are mapped, identifying the reference sequence that best matches the contigs using nucmer of MUMmer;
- Comparing the query allele (in the sample genome) with its closest reference sequence: calling variants in the assembly against the best-matched reference using show-snps of MUMmer, for predicting their effects on translation;
- Assessing the purity (uniqueness) of allele calls: mapping reads to the assembly using Bowtie 2 and calling variants (e.g., SNPs or heterozygous sites in the assembly);
- Repeating the last three steps for other mapped clusters.
KmerResistance
- Heuristic k-mer mapping, which employs the Needleman-Wunsch algorithm for fine alignment;
- Scoring scheme (ConClave) summing together alignment scores of each candidate reference sequence.
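The general idea of k-mer mapping followed by per-reference score summation can be sketched as below. This is a deliberately simplified illustration, not KMA's algorithm or its ConClave scheme: the "winner takes all" assignment of each read to its best-matching reference, the use of shared k-mer counts as scores, and all names (`kmers`, `score_templates`) are assumptions.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def score_templates(reads, templates, k=4):
    """Sum, per reference, the scores of the reads assigned to it.

    Each read is scored against every reference by its number of shared
    k-mers and is assigned to the single best-scoring reference (an
    assumption for this sketch; ties go to whichever reference is seen
    first).
    """
    # Index: k-mer -> names of references containing it.
    index = defaultdict(set)
    for name, seq in templates.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)

    totals = defaultdict(int)
    for read in reads:
        hits = defaultdict(int)
        for kmer in kmers(read, k):
            for name in index.get(kmer, ()):
                hits[name] += 1
        if hits:  # reads without any shared k-mer are ignored
            best = max(hits, key=hits.get)
            totals[best] += hits[best]
    return dict(totals)


templates = {"geneA": "ACGTACGGTTCA", "geneB": "TTGACCAGTAGG"}
reads = ["ACGTACGG", "CCAGTAGG", "GGGGGGGG"]
print(score_templates(reads, templates))
```

In a redundant database many references share k-mers, so how ties and near-ties are resolved is the interesting part of a real scoring scheme; this sketch sidesteps that question.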
5. Key outputs
SRST2
- A gene profile showing the most likely alleles per sequence cluster;
- Consensus sequences generated from pileup files;
- Score files summarising mapping results.
ARIBA
- A gene profile showing the most likely alleles per sequence cluster;
- Consensus sequences;
- Calls to best matched alleles for loci with flags showing conditions of query allele sequences in sample genomes. For instance:
    $ ariba flag 539  # Retrieve the meaning of flag 539 in results
    Meaning of flag 539
    [X] assembled
    [X] assembled_into_one_contig
    [ ] region_assembled_twice
    [X] complete_gene
    [X] unique_contig
    [ ] scaffold_graph_bad
    [ ] assembly_fail
    [ ] variants_suggest_collapsed_repeat
    [ ] hit_both_strands
    [X] has_variant
    [ ] ref_seq_choose_fail
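When parsing many reports, it can be handy to decode such flags programmatically. The sketch below assumes that the flag is a bit field in which the i-th name in the listing above corresponds to bit i (value 2^i); this assumption reproduces the example for flag 539 (1 + 2 + 8 + 16 + 512). The function name `decode_flag` is hypothetical and not part of ARIBA.

```python
# Flag names in the order printed by `ariba flag`; bit i <-> i-th name (assumed).
ARIBA_FLAG_NAMES = [
    "assembled",
    "assembled_into_one_contig",
    "region_assembled_twice",
    "complete_gene",
    "unique_contig",
    "scaffold_graph_bad",
    "assembly_fail",
    "variants_suggest_collapsed_repeat",
    "hit_both_strands",
    "has_variant",
    "ref_seq_choose_fail",
]


def decode_flag(flag):
    """Return the names of all bits that are set in an integer flag."""
    return [
        name
        for bit, name in enumerate(ARIBA_FLAG_NAMES)
        if flag & (1 << bit)
    ]


print(decode_flag(539))
# ['assembled', 'assembled_into_one_contig', 'complete_gene',
#  'unique_contig', 'has_variant']
```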
KmerResistance
- Consensus sequences generated by KMA with reference-guided assembly (majority voting and McNemar tests) for the best match [5];
  - Upper case (significantly overrepresented) and lower case (insignificant majority) letters for nucleotides;
- Calls to best matched alleles, regardless of whether the loci in sample genomes are functional (for instance, a locus may be disrupted by a large insertion or truncation).
6. Discussion and conclusions
I used SRST2 for years during my PhD in the Holt Lab. I have also used ARIBA (published in 2017) on several occasions, but have not yet tried KmerResistance (2016). Surprisingly, the authors of KMA (2018), which is used as the read mapper of KmerResistance, do not mention ARIBA at all in their article [5]. Although results from these three tools are highly concordant with each other and with the gold standard, traditional phenotypic susceptibility tests [2, 3], their strengths and limitations differ. Specifically, as a function-oriented tool (rather than a sequence-oriented one, such as SRST2), ARIBA demonstrates the following advantages over SRST2 and KmerResistance (checkout 041bc89b832cf6a3b7629d76b4dffb4c7428caab, committed to Bitbucket on 13 Apr 2016) for certain studies [2].
- For database curation: exclusion of partial coding sequences and erroneous mutation records. ARIBA reported 102 partial sequences in the reference database SRST2-ARGannot v2, and these sequences can be traced back to the original ARG-ANNOT database.
- The use of de novo assembly of clustered reads enables ARIBA to determine whether a query coding sequence is complete, functional, or undisrupted (by an insertion, for example, which results in multiple contigs that make up a single assembly) [2]. In addition, flags of allele calls in ARIBA reports provide users with a comprehensive classification of every query sequence.
- Functional prediction: ARIBA lists and annotates the divergence of query sequences from their closest reference sequences. For instance, it distinguishes between synonymous and non-synonymous mutations.
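The distinction between synonymous and non-synonymous mutations can be made concrete with a short sketch. It is not ARIBA's code: it assumes the standard genetic code, two in-frame coding sequences of equal length that differ only by substitutions, and a hypothetical function name `classify_substitutions`.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitutions(ref_cds, query_cds):
    """Compare two in-frame CDSs codon by codon.

    Returns a list of (codon position, amino acid change, kind) tuples,
    where kind is "synonymous" or "non-synonymous". Insertions and
    deletions are out of scope for this sketch.
    """
    if len(ref_cds) != len(query_cds) or len(ref_cds) % 3 != 0:
        raise ValueError("sequences must be in frame and of equal length")
    changes = []
    for i in range(0, len(ref_cds), 3):
        ref_codon = ref_cds[i:i + 3].upper()
        query_codon = query_cds[i:i + 3].upper()
        if ref_codon == query_codon:
            continue
        ref_aa = CODON_TABLE[ref_codon]
        query_aa = CODON_TABLE[query_codon]
        position = i // 3 + 1  # 1-based codon position
        kind = "synonymous" if ref_aa == query_aa else "non-synonymous"
        changes.append((position, f"{ref_aa}{position}{query_aa}", kind))
    return changes


# GCT -> GCC keeps alanine; AAA -> AGA changes lysine to arginine.
print(classify_substitutions("ATGGCTAAATAA", "ATGGCCAGATAA"))
# [(2, 'A2A', 'synonymous'), (3, 'K3R', 'non-synonymous')]
```

Only the non-synonymous changes can alter the protein, which is why a function-oriented report separates the two classes rather than listing raw nucleotide differences.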
On the other hand, parameters and algorithms for the read clustering and assembly processes of ARIBA may introduce erroneous allele calls or recovered allele sequences.
KmerResistance and its read mapper, KMA, have been shown to offer the following advantages over SRST2 when predictions of antimicrobial susceptibility are compared with phenotypic susceptibility [3, 5].
- Lower false-positive rate;
- Higher accuracy;
- Results that are less sensitive to reduced read depths;
- Lower requirement for computational resources.
Nevertheless, I would argue that the phenotypic susceptibility test is the gold standard only for benchmarking software that detects functional antimicrobial resistance determinants, and that high-quality finished-grade genome assemblies are probably the gold standard for nucleotide-level sequence detection. Since SRST2 targets the presence of nucleotide sequences rather than intact coding sequences, it is either unfair or misleading to compare its results to phenotypic profiles such as those obtained from phenotypic susceptibility tests. To conclude, although it is worthwhile cross-validating results from different tools, I believe that using SRST2, ARIBA, or KmerResistance for detecting nucleotide sequences does not make a substantial difference.
References
[1] Inouye, M., Dashnow, H., Raven, L-A., Schultz, M., Pope, B., Tomita, T., … Holt, K. (2014). SRST2: Rapid genomic surveillance for public health and hospital microbiology labs. Genome Medicine, 6(11), 90. Retrieved from http://genomemedicine.com/content/6/11/90.
[2] Hunt, M., Mather, A. E., Sánchez-Busó, L., Page, A. J., Parkhill, J., Keane, J. A., & Harris, S. R. (2017). ARIBA: rapid antimicrobial resistance genotyping directly from sequencing reads. Microbial Genomics, 3(10). Retrieved from https://mgen.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000131.
[3] Clausen, P. T. L. C., Zankari, E., Aarestrup, F. M., & Lund, O. (2016). Benchmarking of methods for identification of antimicrobial resistance genes in bacterial whole genome data. Journal of Antimicrobial Chemotherapy, 71(9), 2484–2488. https://doi.org/10.1093/jac/dkw184.
[4] Li, W., & Godzik, A. (2006). Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22(13), 1658–1659. https://doi.org/10.1093/bioinformatics/btl158.
[5] Clausen, P. T. L. C., Aarestrup, F. M., & Lund, O. (2018). Rapid and precise alignment of raw reads against redundant databases with KMA. BMC Bioinformatics, 19(1), 307. https://doi.org/10.1186/s12859-018-2336-6.