Notes on an online metagenomics course


In this post, I compile my notes from the course Metagenomics applied to surveillance of pathogens and antimicrobial resistance, a three-week course offered by the Technical University of Denmark and freely accessible on Coursera. As a graduate researcher working on antimicrobial resistance (AMR) in bacterial populations, I have read a great deal of literature on bacterial population genomics, surveillance, and metagenomics over the past four years, so I am supposed to be familiar with the content of this course. Nevertheless, it proved quite helpful, as it led me to assemble individual concepts into a comprehensive knowledge framework of metagenomics. Here, I focus on knowledge that was once unfamiliar or ambiguous to me and may be new to some readers as well. More information can be found in the course materials on Coursera.

1. Study design and sampling

1.1. Surveillance

  • Bacterial population genomics and surveillance provide timely information for disease control, policy making, etc.
    • Continuous assessment of distributions and their changes.
    • Plan, implement, and continuously evaluate control measures.
  • Three approaches to microbial genomic information
    • 16S rRNA sequencing.
    • Whole-genome sequencing of a single bacterial colony (culturable) or a metagenomic sample (includes unculturable bacteria).
    • Single-cell sequencing, useful when the bacterium of interest is unculturable. Flow cytometry and laser tweezers can be used for isolating a single cell from a complex sample.
  • For evaluating the effectiveness of a surveillance method
    • Sensitivity: the detection rate among true cases (conditional on the case group); see the sketch after this list.
    • Specificity: the true-negative rate among controls (conditional on the control group).
  • Dimensions for the reliability of a surveillance system or measurements
    • Accuracy: reflects the deviation (error) of an individual measurement from the true value. Accuracy relates to the method used to measure an attribute.
    • Precision: inversely related to the variation (dispersion) among repeated measurements of the same sample. Sample size controls precision.
    • Four combinations of (high/low) accuracy and (high/low) precision.
      • Desirable: high accuracy (i.e., the range of measurements includes the true value) + high precision (small dispersion of measurements).
      • Acceptable: high accuracy + low precision.
      • Misleading: low accuracy + high precision.
      • Useless: low accuracy + low precision.
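
To make the two evaluation metrics above concrete, here is a minimal R sketch that computes sensitivity and specificity from a two-by-two confusion table; all counts are hypothetical.

```r
# Hypothetical counts from a validation experiment.
tp <- 90; fn <- 10  # case group: detected vs. missed
tn <- 95; fp <- 5   # control group: correct negatives vs. false positives

sensitivity <- tp / (tp + fn)  # detection rate among true cases
specificity <- tn / (tn + fp)  # true-negative rate among controls
c(sensitivity = sensitivity, specificity = specificity)  # 0.90, 0.95
```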

1.2. Sampling for surveillance

  • Surveillance is defined as the systematic and continuous collection of health-related data, followed by data analysis and interpretation of the results.
  • Probability sampling: simple random sampling, stratified random sampling, and cluster random sampling.
  • Non-probability sampling: convenience sampling, purposive sampling (useful for studying rare subjects), etc.
  • Acquiring approvals from the relevant authorities usually takes considerable time.
  • To avoid social desirability bias in the responses of interviewees, interviewers should independently score biosecurity and animal welfare.
  • In metagenomics, samples are often cross-sectional due to costs, time, and other constraints.
  • Controls for each stage of a metagenomic study, including sample collection, DNA/RNA extraction, library preparation, etc.
    • Handle and process all samples in the same way throughout the same study.
    • Control samples (e.g., a blank control for DNA extraction, a mock mixture of different bacteria).
  • Factors to be taken into account when defining a sampling plan for surveillance:
    • Purpose of the study.
    • Population size (representativeness).
    • Expected value of the variable to be estimated.
    • Sample variation of the variable of interest (may be seasonal).
    • Sensitivity and specificity of methods used in the study (e.g., sample pooling, DNA extraction, sequencing technique, bioinformatic software).
  • Selection criteria for sampling cover sampling units, sample size, sampling frequency, batches of samples, etc.
  • A pilot study may be necessary to determine the sampling plan and selection criteria.
  • Validity of conclusions drawn in a study
    • Internal validity: robustness of conclusions within the context of a particular study. Randomisation and standardisation of the sampling procedure determine a study's internal validity.
    • External validity: the extent to which the conclusions of a study can be generalised to and across other contexts or situations. The study domain is the basis for a study's external validity.
  • Sample pooling (see the sketch after this list)
    • Pros: improved time and cost efficiency.
    • Cons: reduced methodological sensitivity.
    • The uncertainty of estimates increases as the pool size grows.
  • Sample storage
    • Short term: mimic the original environment of the samples (e.g., aerobic or anaerobic conditions).
    • Prolonged storage: freeze samples immediately. Make aliquots of each sample for different examinations to avoid freeze-thaw cycles.
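
As an aside on pooling, below is a hedged R sketch of the standard pooled-prevalence estimator, assuming a perfect test and pools of equal size; the function name and counts are hypothetical. It also hints at why uncertainty grows with the pool size: each pool only reports whether at least one of its k units is positive.

```r
# Estimate prevalence from pooled testing: with n pools of k units each and
# x positive pools, p_hat = 1 - (1 - x/n)^(1/k), assuming a perfect test.
pooled_prevalence <- function(x, n, k) {
  1 - (1 - x / n)^(1 / k)
}
pooled_prevalence(x = 12, n = 50, k = 5)  # hypothetical: 12 of 50 pools positive
```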

1.3. Sequencing

  • To avoid contamination, physically separate lab areas for DNA/RNA extraction, library preparation, and sequencing; ideally, perform these tasks in different rooms.
  • Reverse transcription of RNA into cDNA allows the detection of RNA viruses.
  • In 454 and Ion Torrent systems, signal saturation caused by homopolymers can lead to over- or underestimation of the number of nucleotides.
  • Sequence case and control samples in the same sequencing runs.
  • Always check for DNA contamination from hosts, reference genomes, and the environment.
  • Instruments and kits used for DNA/RNA extraction, library preparation, and sequencing can come from different companies. For instance, we may use the QIAamp Fast DNA Stool Mini Kit (Qiagen) for DNA extraction, the NEXTflex PCR-free DNA prep kit (Bioo Scientific) for amplification-free library preparation, and an Illumina HiSeq 4000 for sequencing.

2. Antimicrobial resistance

  • Antibiotics are chemical compounds produced by microorganisms, whereas antimicrobials can also be synthetic or semi-synthetic. Antibiotics are therefore a subset of antimicrobials.
  • Sources of acquired AMR: mutations in conserved genes and acquisition of exogenous DNA.
  • Gram-negative bacteria are intrinsically resistant to vancomycin because their outer membrane is impermeable to this antibiotic.
  • Enterococcus spp. are intrinsically resistant to cephalosporins due to a lack of penicillin-binding proteins.
  • A resistome refers to the collection of all AMR genes in a metagenomic sample.
  • The ResFinder database
    • Only includes acquired ("de facto") AMR genes, in contrast to the separate point-mutation database.
    • Includes all alleles.

3. Bioinformatic analysis of metagenomic data

  • Analyses after quality assessment of reads
    • Approaches
      • Assembly based: database independent; limited ability to deal with complex samples; slow and resource-intensive.
        • Genomes that are present but fail to assemble cause false-negative signals.
        • Compared to mapping-based methods, we can identify novel taxa using assemblies.
        • Assemblers: IDBA-UD, Ray Meta, MEGAHIT, and MetaVelvet.
        • Characteristics of contigs and scaffolds: read coverage, k-mer frequency, GC content, codon usage, and taxonomic assignment of contigs (binning).
        • Binning of contigs and scaffolds
          • Principle: contigs and scaffolds from the same (resolvable) OTU tend to have strongly correlated read depths (across samples) and similar codon usage and GC content.
          • Methods: differential-abundance (DA) binning and nucleotide-composition (NC) binning.
          • Automated binners with unsupervised algorithms (scalable and reproducible): MetaBAT, Canopy, CONCOCT, MaxBin, and GroopM.
          • Manual curation of bins can improve the result.
          • Known single-copy essential genes of bacteria can be used as positive controls to assess the binning process, in terms of completeness and contamination.
        • The lecture does not mention homologous regions that cannot be resolved via either assembly or mapping.
      • Read based: database dependent (some OTUs are absent from all databases); sensitive but of limited resolution; quick and lightweight; entails a trade-off between sensitivity and accuracy.
        • Similarity-based approach: suffers from ambiguity in read alignments.
        • Composition-based approach.
    • Purposes
      • Taxonomic identification: a taxonomic profile of each sample
        • Binning: assigning reads (mapping-based approach) or contigs and scaffolds (assembly-based approach) to OTUs.
      • Quantitative analysis: relative or absolute abundances of targets
      • Functional analysis: a functional profile per sample
        • Mapping reads against a database of marker genes encoding AMR, virulence, genetic transposition, enzymes, etc.
        • Genome annotation pipeline: automatic annotation + manual annotation (curation with human expertise and potential experimental verification).
        • Databases: KEGG, SEED, NOG, COG, GO, Pfam, TIGRFAM.
  • Common sequence search algorithms
    • Suitable for dealing with long reference sequences
      • Variations of BLAST (semi-local alignment): MEGAN4, MG-RAST
      • BWA (Burrows-Wheeler Aligner): MGmapper
      • Bowtie/Bowtie 2
    • Hidden Markov Models: HMMER3.
    • K-mer based methods: Kraken (see the k-mer sketch after this list).
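
As a toy illustration of the k-mer decomposition underlying Kraken-style classifiers (see Section 3.2), the R sketch below enumerates all overlapping k-mers of a sequence. This covers only the first step of such methods, not the classification itself.

```r
# Enumerate all overlapping k-mers of a DNA sequence.
kmers <- function(seq, k) {
  substring(seq, 1:(nchar(seq) - k + 1), k:nchar(seq))
}
kmers("ATCGGATC", k = 4)
# [1] "ATCG" "TCGG" "CGGA" "GGAT" "GATC"
```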

3.1. MGmapper

  • High sensitivity and specificity at the species level and above, but the sensitivity decreases at the strain level.
  • Uses BWA-MEM for read alignment against reference genomes.
  • The pipeline starts with QC (adapter trimming with cutadapt and removal of PhiX control DNA from read sets).
  • Postprocessing
    • Size-normalised abundance: number of reads mapped / size of the reference genome (see the sketch after this list).
    • Number or proportion of reads mapped uniquely to a single reference.
  • Usage
    • Command-line or web based.
    • Suitable for detection of known species in a metagenomic sample.
    • Not suitable for providing an overview of the microbes in an exotic environment that contains many unknown species.
    • Users may want to reduce the tolerance to nucleotide mismatches when the query sequence and the reference sequence are highly similar.
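
A minimal R sketch of the size-normalised abundance described above, using hypothetical read counts and genome sizes:

```r
# Size-normalised abundance = number of mapped reads / reference genome size (bp).
mapped_reads <- c(E_coli = 120000, S_enterica = 45000)  # hypothetical counts
genome_size  <- c(E_coli = 5.0e6,  S_enterica = 4.8e6)  # genome lengths in bp
mapped_reads / genome_size  # reads per base of reference genome
```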

3.2. Kraken

  • A k-mer based method
    • Generates k-mers from reads and reference genomes, and searches query k-mers against reference k-mers.
    • Assigns each k-mer to a taxonomic unit or to the lowest common ancestor (LCA) of a group of candidate OTUs.
    • For each read, prunes the tree of taxonomic IDs linked to the read's k-mers, finds the highest-weighted root-to-leaf path for classification, and reports the taxonomic ID of the lowest node on that path.
    • Can resolve taxa down to the species and even strain level.
  • Standard and custom databases. The database size affects sensitivity.
  • Workflow: kraken (terminal) > kraken-report (terminal) > pavian::runApp() (visualisation of results in R); see the sketch after this list.
  • Users can then use Bowtie or BWA to align reads against the reference genomes of identified OTUs and produce BAM files, in order to inspect the genomic locations where the reads map.
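
Below is a hedged sketch of this workflow, assuming Kraken 1 command-line conventions, a database directory named minikraken, and hypothetical file names; the terminal steps are wrapped in R calls so that the Pavian step follows naturally.

```r
# Classify reads against the database and write per-read assignments.
system2("kraken", c("--db", "minikraken", "--fastq-input", "reads.fastq",
                    "--output", "sample.kraken"))
# Summarise the assignments into a per-taxon report.
system2("kraken-report", c("--db", "minikraken", "sample.kraken"),
        stdout = "sample_report.txt")
# Visualise the report interactively (pavian is installable from GitHub:
# fbreitwieser/pavian).
pavian::runApp()
```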

3.3. Case study: the EFFORT project

  • A tremendous amount of work went into sample collection and library preparation.
  • Bioinformatic pipeline: a custom-made and streamlined version of MGmapper, and a result database.
  • Lessons from this project
    • Since continuous updates of the working reference database affect results, we need to freeze the database at some point for analysis; nonetheless, this should be done as late as possible.
    • Assess the performance (particularly the accuracy) of software when it is used for purposes other than those it was designed for. Some modifications to the code may be necessary.
    • Provide an appropriate number of alternative results to reviewers when there is no single correct parameter value or way to interpret the data. Investigators may want to recommend a setting based on their expertise with the parameters.

4. Quantitative analysis

4.1. Metrics derived from counts of mapped reads

  • Count matrix for OTUs: samples by OTUs
    • A common way to visualise it: a heat map.
    • Normalisation: a combination of the following methods may be used (see the vegan sketch after this list).
      • By relative read count: divide the raw count of each OTU by the total read count of the sample.
      • By the reference genome size.
      • Wisconsin double standardisation: standardise OTUs by their maxima, then samples by their totals.
  • Richness: number of OTUs (e.g., species) identified in a metagenomic sample.
    • Absolute abundance of OTUs usually cannot be compared between samples owing to variation in read depths.
  • Evenness: how evenly the relative abundances of OTUs are distributed within a sample.
  • Diversity: a measure combining richness and evenness
    • Alpha diversity indices: Shannon-Wiener, Simpson, Fisher's alpha, etc.
    • Beta diversity: e.g., the Bray-Curtis coefficient, resulting in a dissimilarity matrix.
    • Diversity indices are comparable between samples.
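
The R sketch below walks through these metrics with the vegan package, using a small hypothetical count matrix `otu` (samples as rows, OTUs as columns):

```r
library(vegan)

# Hypothetical count matrix: 3 samples (rows) x 4 OTUs (columns).
otu <- matrix(c(10, 0, 25, 5,
                 8, 3, 30, 0,
                 2, 9,  1, 7),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("sample", 1:3), paste0("OTU", 1:4)))
heatmap(otu)                                  # common visualisation of the matrix

rel <- decostand(otu, method = "total")       # relative read counts per sample
wis <- wisconsin(otu)                         # Wisconsin double standardisation
richness <- specnumber(otu)                   # richness: number of OTUs per sample
shannon  <- diversity(otu, index = "shannon") # alpha diversity (Shannon-Wiener)
bray     <- vegdist(otu, method = "bray")     # Bray-Curtis dissimilarity matrix
```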

4.2. Differential abundance analysis

  • R packages: DESeq2, metagenomeSeq, edgeR, and baySeq (see the DESeq2 sketch after this list).
  • Confounders to be accounted for.
  • Presence-absence status of genes or genomes is hard to determine in metagenomics.
    • Use spike-in experiments to determine the detection limit.
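
A minimal DESeq2 sketch for differential abundance analysis, assuming a hypothetical OTU-by-sample count matrix `counts` and a data frame `meta` (one row per sample, row names matching the column names of `counts`) with a factor column `group`; confounders can be added to the design formula.

```r
library(DESeq2)

# counts: OTUs (rows) x samples (columns); meta: sample metadata.
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData   = meta,
                              design    = ~ group)  # e.g., ~ farm + group to adjust for a confounder
dds <- DESeq(dds)    # fit the negative-binomial model
res <- results(dds)  # log2 fold changes and adjusted p-values per OTU
head(res[order(res$padj), ])
```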

4.3. Ordination analysis

  • Summarise multivariate data along a few dimensions and explore patterns; ordination is a question-driven process.
  • Principal coordinate analysis (PCoA) and principal component analysis (PCA).
    • Centroids of clusters can facilitate result explanation.
  • The analysis is based on the dissimilarity matrix (see the sketch after this list).
    • R package: vegan.
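
A brief PCoA sketch with vegan, reusing the hypothetical `otu` matrix from the sketch in Section 4.1 (samples as rows):

```r
library(vegan)

bray <- vegdist(otu, method = "bray")      # Bray-Curtis dissimilarity matrix
pcoa <- cmdscale(bray, k = 2, eig = TRUE)  # classical multidimensional scaling = PCoA
plot(pcoa$points, xlab = "PCoA 1", ylab = "PCoA 2")
text(pcoa$points, labels = rownames(otu), pos = 3)  # label samples
```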

5. Application of metagenomics in surveillance

  • Epidemiological analysis and reports
  • Potential explanatory variables making up metadata
    • About the sampling and analysis processes.
    • Sample IDs should encode samples in a meaningful way.
    • Steps of quality control and bioinformatic analysis.
  • Potential explanatory variables making up epidemiological data (epidata)
    • Epidemiological information.
  • Statistical analysis
    • Community composition (distribution): cf. Sections 4.1 and 4.3, correlation matrices, and network analysis.
    • Epidemiological analysis (determinants): differential abundance analysis, multivariate regression, meta-analysis (spatiotemporal analysis), sophisticated machine learning methods (e.g., classification models), etc.
    • Visualisation: stacked bar plots, forest plots, etc.
  • Data interpretation, from presence to functions: meta-transcriptomics, meta-proteomics, and metabolomics.
  • Integrated and global surveillance
    • Metagenomics can be the next frontier of surveillance.
    • The One Health approach.
    • A collaborative international surveillance community.
    • Global collection of relevant explanatory data.
      • Considerations for determining explanatory variables?
    • Standard protocols for assuring both comparability and reproducibility of results, which in turn affect the interpretation of results.
    • Use global training and ring trials to fill knowledge gaps.