Popular reference databases of antimicrobial resistance genes
Reliable and up-to-date databases play a pivotal role in reference-based detection of antimicrobial resistance genes (ARGs) in bacteria. Nonetheless, these databases differ in content and quality, which makes it challenging to choose an appropriate reference database for a particular research project. To address this challenge, this post reviews several ARG databases that are publicly available and widely used in bacterial genomics. In particular, I focus on databases that are still regularly maintained, and I do not discuss any program released with these databases for sequence search or statistical analysis.
1. Popular databases
The Comprehensive Antibiotic Resistance Database (CARD) is an ongoing project that continually collects and curates information on antimicrobial resistance (AMR) determinants, ARG products, and associated phenotypes [1, 2]. The database of AMR determinants includes non-redundant nucleotide and protein sequences of ARGs and resistance-associated SNPs (see the download page). For BLAST searches of ARGs, users can use the FASTA file nucleotide_fasta_protein_homolog_model.fasta, in which the sequences are defined as follows:
>gb|AJ920369|+|23-860|ARO:3001071|SHV-12 [Escherichia coli]
ATGCGTTATATTCGCCTGTGTATTATCTCCCTGTTAGCCACCCTGCCGCTGGCGGTAC...
>gb|AY587956|+|0-774|ARO:3001773|OXA-61 [Campylobacter jejuni]
ATGAAAAAAATAACTTTATTTTTACTTTTCTTAAATTTAGTGTTTGGGCAAGATAAGAT...
>gb|U14749|+|690-1557|ARO:3002243|CARB-4 [Pseudomonas aeruginosa]
ATGAAGCTTTTACTGGTATTTTCGCTTTTAATACCGTCTATGGTGTTTGCAAATAGTTC...
>gb|AF462019|+|26-656|ARO:3002681|catB9 [Vibrio cholerae]
ATGAACTTCTTTACGTCTCCATTTTCTGGGATTCCCTTAGATCAGCAAGTAACAAATCC...
>gb|U95363|+|0-861|ARO:3000912|TEM-43 [Klebsiella pneumoniae]
ATGAGTATTCAACATTTCCGTGTCGCCCTTATTCCCTTTTTTGCGGCATTTTGCCTTCC...
where the ARO number is the Antibiotic Resistance Ontology (ARO) accession in CARD. For example, comprehensive details of blaCARB-4 can be retrieved by typing the accession ARO:3002243 into the search box on the CARD website.
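The header layout shown above can be split into its fields with a few lines of Python. This is only a minimal sketch, assuming every header follows the pattern of the examples (`gb`, accession, strand, coordinates, ARO accession, then a name followed by the organism in square brackets); the class `CardHeader` and the function `parse_card_header` are hypothetical names introduced for illustration and are not part of CARD.

```python
from typing import NamedTuple


class CardHeader(NamedTuple):
    accession: str  # GenBank accession, e.g. U14749
    strand: str     # "+" or "-"
    start: int      # first coordinate, as written in the header
    end: int        # second coordinate, as written in the header
    aro: str        # ARO accession, e.g. ARO:3002243
    name: str       # gene or model name, e.g. CARB-4
    organism: str   # text inside the trailing square brackets


def parse_card_header(line: str) -> CardHeader:
    """Split a CARD-style FASTA header into its pipe-delimited fields."""
    # Assumed layout: >gb|<accession>|<strand>|<start>-<end>|<ARO>|<name> [<organism>]
    _, accession, strand, coords, aro, description = (
        line.lstrip(">").strip().split("|", 5)
    )
    start, end = (int(x) for x in coords.split("-"))
    # The organism is whatever follows the first " [" in the description.
    name, _, organism = description.partition(" [")
    return CardHeader(accession, strand, start, end, aro, name, organism.rstrip("]"))


header = parse_card_header(
    ">gb|U14749|+|690-1557|ARO:3002243|CARB-4 [Pseudomonas aeruginosa]"
)
print(header.aro, header.name)
```

The ARO accession extracted this way is the value that can be pasted into the search box on the CARD website.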
Alleles containing AMR-associated point mutations are stored in two other FASTA files (nucleotide_fasta_protein_variant_model.fasta and nucleotide_fasta_rRNA_gene_variant_model.fasta). For instance, a resistant gyrA allele is defined in nucleotide_fasta_protein_variant_model.fasta:
>gb|NC_002952|+|7004-9665|ARO:3003296|Staphylococcus aureus gyrA conferring resistance to fluoroquinolones [Staphylococcus aureus subsp. aureus MRSA252]
ATGGCTGAATTACCTCAATCAAGAATAAATGAACGAAATATTACCAGTGAAATGCGTGAATCA...
and a resistant rRNA allele is defined in nucleotide_fasta_rRNA_gene_variant_model.fasta:
>gb|U00096|+|4166659-4168200|ARO:3003223|Escherichia coli 16S rRNA mutation conferring resistance to edeine [Escherichia coli K-12]
AAATTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCG...
CARD also contains a large number of bacterial genomes (chromosomes and plasmids) and whole-genome sequencing (WGS) assemblies. A curated version of CARD is used as a reference database by the Genefinder pipeline [3, 4] of Public Health England (PHE) [5].
In version 3.0.4 (Aug 2019), nucleotide_fasta_protein_homolog_model.fasta comprises 2,602 reference ARG sequences.
The Antibiotic Resistance Gene-Annotation (ARG-ANNOT) database consists only of ARGs, although the database article states that it includes point mutations known to be associated with AMR [6]. The database has been updated annually since 2017, and the latest version is V4 (2018). Each release offers one FASTA file of reference nucleotide sequences and another of protein sequences. Compared to CARD, however, the ARG-ANNOT database does not provide detailed functional annotations of genes. Nucleotide sequences in ARG-ANNOT are defined with the format:
>([Product class])[gene name]:[GenBank accession]:[coordinates]:[length (bp)]
For instance, three sequences from the database read:
>(Sul)sul1:AF071413:6700-7539:840
ATGGTGACGGTGTTCGGCATTCTGAATCTCACCGAGGACTCCTTCTTCGAT...
>(Sul)sul2:EU360945:1617-2432:816
ATGAATAAATCGCTCATCATTTTCGGCATCGTCAACATAACCTCGGACAGT...
>(Sul)sul3:HQ875016:7396-8187:792
ATGAGCAAGATTTTTGGAATCGTAAATATAACCACCGATAGTTTTTCCGAT...
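As a rough illustration of the ARG-ANNOT header format, the sketch below splits a header into product class, gene name, accession, coordinates, and length. It assumes that every header matches the format given above; the function name `parse_argannot_header` and the `length_consistent` check are my own additions for illustration, not something shipped with ARG-ANNOT.

```python
def parse_argannot_header(line: str) -> dict:
    """Split an ARG-ANNOT V4 header of the form >(Class)gene:accession:start-end:length."""
    body = line.lstrip(">").strip()
    # Split from the right so that a colon inside a gene name would not break parsing.
    name_part, accession, coords, length = body.rsplit(":", 3)
    # The product class sits inside the leading parentheses; only the first ")" ends it,
    # because the gene name that follows may contain parentheses of its own.
    product_class, _, gene = name_part[1:].partition(")")
    start, end = (int(x) for x in coords.split("-"))
    return {
        "class": product_class,
        "gene": gene,
        "accession": accession,
        "start": start,
        "end": end,
        "length": int(length),
        # Sanity check (an assumption): the coordinate span equals the stated length.
        "length_consistent": abs(end - start) + 1 == int(length),
    }


record = parse_argannot_header(">(Sul)sul1:AF071413:6700-7539:840")
print(record["class"], record["gene"], record["length_consistent"])
```

For the three example headers, the coordinate span does match the stated length (for sul1, 7539 - 6700 + 1 = 840), which makes this a cheap check to run when curating a local copy of the database.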
I am delighted to see that the curators of the ARG-ANNOT database have corrected gene names to lower-case in V4; in all previous versions, the gene names were actually written as protein names (e.g., sul1 was SulI).
A popular derivative of the ARG-ANNOT database is provided with SRST2 releases. The latest version is revision 3 (ARGannot_r3.fasta), and I am the curator of revision 2 (ARGannot_r2.fasta). The SRST2-compatible databases r2 and r3 are non-redundant and were carefully curated for sequence and annotation accuracy. Every sequence in the databases is defined as:
>[Sequence cluster ID]__[Gene name]_[Product class]__[Gene name]__[sequence ID] [Semicolon-delimited additional gene annotation]
>225__SulI_Sul__SulI__1616 no;no;SulI;Sul;U37105;4069-4908;840
ATGGTGACGGTGTTCGGCATTCTGAATCTCACCGAGGACTCCTTCTTCGA...
The sequence cluster IDs are assigned on the basis of sequence clustering (e.g., with CD-HIT-EST [8]) of the whole nucleotide database, and the “gene name” is actually the cluster name, which can be customised and does not necessarily correspond to any gene name in public databases. See an online tutorial for details of sequence clustering for database generation. The ARG-ANNOT databases released along with SRST2 (ARGannot.fasta to ARGannot_r3.fasta) were clustered at 80% nucleotide identity [7]. Since distinct sequences may share the same gene name for historical reasons, an index (the sequence ID) is appended to each sequence in the database. Accordingly, in the output of SRST2, [gene name]__[sequence ID] becomes [gene name]_[sequence ID], which is treated as an allele ID of the gene. In the example above, the additional gene annotation is composed of seven fields (cf. ARGannot_clustered80_r3.csv, the spreadsheet used by csv_to_gene_db.py to generate the database ARGannot_r3.fasta):
- Whether the cluster contains multiple genes (yes/no)
- Whether the gene is found in multiple clusters (yes/no)
- Gene name or product name (e.g., SulI)
- Class of antimicrobials (e.g., Sul)
- GenBank accession
- Coordinates in the GenBank record
- Sequence length (bp)
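The SRST2-compatible header and its seven annotation fields can be unpacked as sketched below. This is an illustration under stated assumptions rather than code from SRST2: the function name `parse_srst2_header` and the dictionary keys are hypothetical, the field order follows the example header (where the name SulI precedes the class Sul), and neither the identifier parts nor the annotation fields are assumed to contain the delimiters `__` or `;`.

```python
def parse_srst2_header(line: str) -> dict:
    """Split an SRST2-style ARG-ANNOT header into identifier parts and annotations."""
    # The identifier and the annotation string are separated by the first space.
    ident, _, annotation = line.lstrip(">").strip().partition(" ")
    # Identifier: [cluster ID]__[cluster name]__[gene name]__[sequence ID]
    cluster_id, cluster_name, gene, seq_id = ident.split("__")
    # Annotation: seven semicolon-delimited fields, in the order seen in the example.
    (multi_gene, multi_cluster, name, drug_class,
     accession, coords, length) = annotation.split(";")
    return {
        "cluster_id": cluster_id,
        "cluster_name": cluster_name,
        "gene": gene,
        "seq_id": seq_id,
        # SRST2 output joins gene name and sequence ID with a single underscore.
        "allele_id": f"{gene}_{seq_id}",
        "cluster_has_multiple_genes": multi_gene == "yes",
        "gene_in_multiple_clusters": multi_cluster == "yes",
        "name": name,
        "class": drug_class,
        "accession": accession,
        "coordinates": coords,
        "length": int(length),
    }


entry = parse_srst2_header(
    ">225__SulI_Sul__SulI__1616 no;no;SulI;Sul;U37105;4069-4908;840"
)
print(entry["allele_id"], entry["class"], entry["length"])
```

Unpacking into exactly four and seven names is deliberate: a header that deviates from the expected layout raises a `ValueError` instead of being silently misread.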
In total, ARG-ANNOT V4 and the SRST2-compatible ARGannot_r3.fasta consist of 2,038 and 1,856 reference ARG sequences, respectively.
The ResFinder database [9], which is now actively curated by Valeria Bortolaia from the Technical University of Denmark, consists only of alleles of acquired ARGs. At the time of writing this post, the latest database was released on 19 Jul 2019. Authors of the database offer guidance for BLAST searches against the database. The ResFinder website also offers a PointFinder database for chromosomal AMR-associated mutations.
Unlike the CARD or ARG-ANNOT databases, ResFinder groups reference nucleotide sequences by antimicrobial class and stores the sequences of each class in a separate FASTA file (file extension: .fsa). Every sequence header follows the format:
>[Allele name]_[Sequence index of the same allele name]_[GenBank accession]
For example, sequences of three alleles conferring colistin resistance can be retrieved from the file colistin.fsa:
>mcr-1.1_1_KP347127
ATGATGCAGCATACTTCTGTGTGGTACCGACGCTCGGTCAGT...
>mcr-1.2_1_KX236309
ATGATGCTGCATACTTCTGTGTGGTACCGACGCTCGGTCAGT...
>mcr-1.3_1_KU934208
ATGATGCAGCATACTTCTGTGTGGTACCGACGCTCGGTCAGT...
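A ResFinder header can be split into allele name, sequence index, and accession with a regular expression, as sketched below. The pattern is my own assumption based on the format above: it treats the first purely numeric field between two underscores as the sequence index, so that an accession containing an underscore would stay intact. The function name `parse_resfinder_header` is hypothetical.

```python
import re

# Assumed layout: [allele name]_[numeric index]_[accession]
HEADER_PATTERN = re.compile(r">?(?P<allele>.+?)_(?P<index>\d+)_(?P<accession>\S+)")


def parse_resfinder_header(line: str) -> tuple:
    """Return (allele name, sequence index, accession) from a ResFinder header."""
    match = HEADER_PATTERN.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"Unexpected header format: {line!r}")
    return match["allele"], int(match["index"]), match["accession"]


for header in (">mcr-1.1_1_KP347127", ">mcr-1.2_1_KX236309", ">mcr-1.3_1_KU934208"):
    print(parse_resfinder_header(header))
```

A plain split on underscores would also work for the three examples, but it would mis-assign fields whenever the accession itself contains an underscore, which is why the sketch anchors on the numeric index instead.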
The database also provides phenotypes.txt to describe functions of the ARGs and antibiotic_classes.txt to list classes of antimicrobials that these ARGs confer resistance to.
Published in 2016, the MEGARes database aims to address artificial count inflation, uncertainty, and poor platform compatibility in high-throughput, count-based, and hierarchical statistical analysis of AMR at the population level [10]. In addition to manual curation of reference sequences of AMR determinants, the main innovation of MEGARes is a hierarchical and acyclic annotation structure, which balances nucleotide identity against functional grouping within gene clusters. Nonetheless, the database currently seems idle: it has not been updated since Dec 2016 (version 1.0.1).
Each version of the database comprises three files [10]:
- A FASTA file (*.fasta) of non-redundant nucleotide sequences of ARGs and AMR-associated mutations: the sequence database.
- A comma-delimited file (*.csv) of hierarchical sequence annotations: the annotation database.
- A tab-delimited file (*.tsv) mapping sequence headers in the FASTA file to those in external databases from which the reference sequences were drawn. This file is not essential for sequence searches against the MEGARes database or statistical analysis.
Annotations of AMR determinants consist of three levels: drug class, mechanism, and group. A few examples of annotations from megares_annotations_v1.01.csv are listed as follows:
header,class,mechanism,group
Bla|OXA-223|JN248564|1-825|825|betalactams|Class_D_betalactamases|OXA,betalactams,Class D betalactamases,OXA
1172|AF317511.1|AF317511|betalactams|Class_B_betalactamases|VIM,betalactams,Class B betalactamases,VIM
Flq|CP001918.1|gene3562|Fluoroquinolones|Fluoroquinolone-resistant_DNA_topoisomerases|GYRA|RequiresSNPConfirmation,Fluoroquinolones,Fluoroquinolone-resistant DNA topoisomerases,GYRA
Flq|NC_012491.1.7720231|Fluoroquinolones|Fluoroquinolone-resistant_DNA_topoisomerases|PARC|RequiresSNPConfirmation,Fluoroquinolones,Fluoroquinolone-resistant DNA topoisomerases,PARC
CARD|phgb|AJ012256|208-1069|ARO:3000939|TEM-73|betalactams|Class_A_betalactamases|TEM,betalactams,Class A betalactamases,TEM
Headers are keys linking the annotation database and sequence database. For example, sequences corresponding to annotations shown above can be retrieved from megares_database_v1.01.fasta:
>Bla|OXA-223|JN248564|1-825|825|betalactams|Class_D_betalactamases|OXA
ATGAACATTAAAACACTCTTACTTATAACAAGCGCTATTTTT...
>1172|AF317511.1|AF317511|betalactams|Class_B_betalactamases|VIM
ATGTTAAAAGTTATTAGTAGTTTATTGGTCTACATGACCGCG...
>Flq|CP001918.1|gene3562|Fluoroquinolones|Fluoroquinolone-resistant_DNA_topoisomerases|GYRA|RequiresSNPConfirmation
ATGAGCGACCTTGCGAGAGAAATTACACCGGTTAACATCGAG...
>Flq|NC_012491.1.7720231|Fluoroquinolones|Fluoroquinolone-resistant_DNA_topoisomerases|PARC|RequiresSNPConfirmation
ATGCTGTCCAATCAAATTATTAATCAGAGCTTCGCGGAGATT...
>CARD|phgb|AJ012256|208-1069|ARO:3000939|TEM-73|betalactams|Class_A_betalactamases|TEM
ATGAGTATTCAACATTTCCGTGTCGCCCTTATTCCCTTTTTT...
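Because the headers act as keys, the annotation and sequence databases can be joined with the Python standard library alone. The following is a minimal sketch, assuming the CSV has the four columns shown above and that FASTA headers match the `header` column exactly; `read_fasta` and `annotate_sequences` are hypothetical helper names, and the short sequence in the usage lines is a made-up placeholder rather than real MEGARes data.

```python
import csv
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def annotate_sequences(fasta_handle, csv_handle):
    """Map each FASTA header to (class, mechanism, group, sequence length)."""
    annotations = {row["header"]: row for row in csv.DictReader(csv_handle)}
    joined = {}
    for header, sequence in read_fasta(fasta_handle):
        row = annotations.get(header)
        if row is None:
            continue  # sequence without an annotation entry is skipped
        joined[header] = (row["class"], row["mechanism"], row["group"], len(sequence))
    return joined


# Usage with in-memory stand-ins for the two files (placeholder sequence).
key = "Bla|OXA-223|JN248564|1-825|825|betalactams|Class_D_betalactamases|OXA"
csv_text = f"header,class,mechanism,group\n{key},betalactams,Class D betalactamases,OXA\n"
fasta_text = f">{key}\nATGAACATT\nAAAACA\n"
print(annotate_sequences(io.StringIO(fasta_text), io.StringIO(csv_text)))
```

With the real files, the two `io.StringIO` objects would be replaced by handles opened on megares_database_v1.01.fasta and megares_annotations_v1.01.csv.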
The sequences were drawn from CARD, ARG-ANNOT, ResFinder, NCBI beta-lactamase data, and the Lahey Clinic beta-lactamase archive (Note: in the MEGARes article [10], the authors refer to the latter two resources as “the National Center for Biotechnology Information (NCBI) Lahey Clinic beta-lactamase archive”, which is inaccurate according to its citations and the NCBI web page).
2. Assessment and curation of databases
Not surprisingly, comparisons between ARG databases are scarce in the literature and often lack depth owing to the complexity and dynamics of the databases. Nonetheless, several bioinformatic tools have been developed to facilitate curation of the databases, which remains a time-consuming process that usually relies on expert decisions and is hence subject to errors. Authors of these tools have also proposed guidelines for database curation. For instance, ARGDIT (Antimicrobial Resistance Gene Data Integration Toolkit) [11] and ARG-miner [12] are two recently published platforms for inspecting, validating, and curating ARG sequence data. On the one hand, the ARGDIT article offers a detailed procedure and criteria for database validation and curation [11]. On the other hand, interestingly, ARG-miner adopts a crowdsourcing approach to reduce the cost and time of sequence classification.
- Jia, B., Raphenya, A. R., Alcock, B., Waglechner, N., Guo, P., Tsang, K. K., … McArthur, A. G. (2017). CARD 2017: expansion and model-centric curation of the comprehensive antibiotic resistance database. Nucleic Acids Research, 45(D1), D566–D573. https://doi.org/10.1093/nar/gkw1004.
- McArthur, A. G., Waglechner, N., Nizam, F., Yan, A., Azad, M. A., Baylay, A. J., … Wright, G. D. (2013). The Comprehensive Antibiotic Resistance Database. Antimicrobial Agents and Chemotherapy, 57(7), 3348–3357. https://doi.org/10.1128/AAC.00419-13.
- Day, M., Doumith, M., Jenkins, C., Dallman, T. J., Hopkins, K. L., Elson, R., … Woodford, N. (2016). Antimicrobial resistance in Shiga toxin-producing Escherichia coli serogroups O157 and O26 isolated from human cases of diarrhoeal disease in England, 2015. Journal of Antimicrobial Chemotherapy, 72(1), 145–152. https://doi.org/10.1093/jac/dkw371.
- Day, M. R., Doumith, M., Do Nascimento, V., Nair, S., Ashton, P. M., Jenkins, C., … Godbole, G. (2017). Comparison of phenotypic and WGS-derived antimicrobial resistance profiles of Salmonella enterica serovars Typhi and Paratyphi. Journal of Antimicrobial Chemotherapy, 73(2), 365–372. https://doi.org/10.1093/jac/dkx379.
- Ingle, D. J., Nair, S., Hartman, H., Ashton, P. M., Dyson, Z. A., Day, M., … Dallman, T. J. (2019). Informal genomic surveillance of regional distribution of Salmonella Typhi genotypes and antimicrobial resistance via returning travellers. PLOS Neglected Tropical Diseases, 13(9), e0007620. https://doi.org/10.1371/journal.pntd.0007620.
- Gupta, S. K., Padmanabhan, B. R., Diene, S. M., Lopez-Rojas, R., Kempf, M., Landraud, L., & Rolain, J. M. (2014). ARG-ANNOT, a new bioinformatic tool to discover antibiotic resistance genes in bacterial genomes. Antimicrobial Agents and Chemotherapy, 58(1), 212–220. https://doi.org/10.1128/aac.01310-13.
- Inouye, M., Dashnow, H., Raven, L.-A., Schultz, M., Pope, B., Tomita, T., … Holt, K. (2014). SRST2: Rapid genomic surveillance for public health and hospital microbiology labs. Genome Medicine, 6(11), 90. http://genomemedicine.com/content/6/11/90.
- Li, W., & Godzik, A. (2006). Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22(13), 1658–1659. https://doi.org/10.1093/bioinformatics/btl158.
- Zankari, E., Hasman, H., Cosentino, S., Vestergaard, M., Rasmussen, S., Lund, O., … Larsen, M. V. (2012). Identification of acquired antimicrobial resistance genes. Journal of Antimicrobial Chemotherapy, 67(11), 2640–2644. https://doi.org/10.1093/jac/dks261.
- Lakin, S. M., Dean, C., Noyes, N. R., Dettenwanger, A., Ross, A. S., Doster, E., … Boucher, C. (2016). MEGARes: an antimicrobial resistance database for high throughput sequencing. Nucleic Acids Research, 45(D1), D574–D580. https://doi.org/10.1093/nar/gkw1009.
- Chiu, J. K. H., & Ong, R. T.-H. (2018). ARGDIT: a validation and integration toolkit for Antimicrobial Resistance Gene Databases. Bioinformatics, 35(14), 2466–2474. https://doi.org/10.1093/bioinformatics/bty987.
- Arango-Argoty, G. A., Guron, G. K. P., Garner, E., Riquelme, M. V, Heath, L. S., Pruden, A., … Zhang, L. (2019). ARG-miner: A web platform for crowdsourcing-based curation of antibiotic resistance genes. BioRxiv. https://doi.org/10.1101/274282.