Notes about ClonalFrameML

Here are my notes on the article about ClonalFrameML, a program that detects recombined regions in a multiple sequence alignment, infers phylogenetic relationships while correcting for recombination, reconstructs ancestral states, and imputes SNPs under a maximum-likelihood (ML) framework.

Reference: Didelot, X., & Wilson, D. J. (2015). ClonalFrameML: Efficient Inference of Recombination in Whole Bacterial Genomes. PLoS Computational Biology, 11(2), e1004041.

  • Two sources of recombination, defined relative to the population under study

    • Internal (intra-population recombination): does not introduce new polymorphisms, but produces homoplasy (caused by homologous recombination rather than de novo mutation) and genetic incompatibility.
    • External (inter-population recombination): introduces new polymorphisms; its effect is particularly prominent when all genomes under study come from a single lineage or even a single clone.
  • Quantities

    • Rate of point mutation: $\theta/2$ per site per coalescent unit of time $t_0$
    • Recombination rate: $R/2$ per coalescent unit of time $t_0$
    • Coalescent unit of time: $t_0 = N_e \times g$ (that is, the effective population size times the duration of a generation)
    • Ratio of recombination rate to mutation rate: $R/\theta$
  • Assumptions

    • The length of a recombined region follows an exponential distribution with probability density function (PDF) $\lambda e^{-\lambda x}$. Let $\delta = 1 / \lambda$ denote the mean length and $v$ the per-site substitution probability.
    • The parameters $R/\theta$, $\delta$, and $v$ are constant across all branches.
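
The exponential length assumption can be illustrated with a short simulation. This is only a minimal sketch and not part of ClonalFrameML: the function name `sample_tract_lengths` and the value $\delta = 500$ are hypothetical choices made for illustration. It draws lengths with rate $\lambda = 1/\delta$ and checks that the sample mean is close to $\delta$.

```python
import random


def sample_tract_lengths(delta, n, seed=0):
    """Draw n recombined-region lengths from an exponential
    distribution with mean delta (rate lambda = 1 / delta)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    lam = 1.0 / delta
    return [rng.expovariate(lam) for _ in range(n)]


# Hypothetical mean length of 500 sites, chosen only for illustration.
lengths = sample_tract_lengths(delta=500.0, n=100_000)
mean_length = sum(lengths) / len(lengths)
print(f"sample mean length: {mean_length:.1f} (delta = 500)")
```

Because the distribution is memoryless, most sampled regions are shorter than $\delta$, while a few are several times longer.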
  • Overall steps of the ClonalFrameML algorithm

    1. Take an initial ML tree as input.

    2. Reconstruct ancestral sequences for internal nodes and impute base calls for the input sequences. The next three steps can be skipped if the option -imputation_only is turned on.

    3. Estimate recombination and tree parameters using an ML approach.

    4. Infer importation for each site using an ML approach.

    5. Estimate the uncertainty of the parameter estimates using a parametric bootstrap method.
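
The steps above map onto a single command-line call. The sketch below only assembles the argument list and does not run the program. It assumes the usual positional order (tree, alignment, output prefix) and the `-emsim` option for the number of bootstrap replicates, as I recall them from the program's usage text, so check them against the help output of your installed version. The helper name `build_clonalframeml_command` and the file names are hypothetical.

```python
import shlex


def build_clonalframeml_command(tree, alignment, prefix,
                                imputation_only=False,
                                bootstrap_replicates=0):
    """Assemble a ClonalFrameML argument list (hypothetical helper).

    Assumed layout: ClonalFrameML <tree> <alignment> <prefix> [OPTIONS].
    """
    cmd = ["ClonalFrameML", tree, alignment, prefix]
    if imputation_only:
        # Stop after step 2 (ancestral reconstruction and imputation).
        cmd += ["-imputation_only", "true"]
    elif bootstrap_replicates > 0:
        # Step 5: assumed option for the number of bootstrap replicates.
        cmd += ["-emsim", str(bootstrap_replicates)]
    return cmd


# Hypothetical file names, used only for illustration.
cmd = build_clonalframeml_command("initial.nwk", "alignment.fasta", "cfml_out",
                                  bootstrap_replicates=100)
print(shlex.join(cmd))
```

Returning a list rather than a single string means the result can be passed directly to `subprocess.run` without shell quoting issues.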