Understanding Panaroo's outputs

Page content

This live post summarises my understandings of Panaroo’s methods and outputs in complementary to interpretations in the software’s official documentation and original paper. Please be cautious of my possible misunderstandings. Comments are welcomed.

Methods

How does Panaroo cluster protein-coding genes to create an initial graph?

Panaroo clusters protein sequences with cd-hit (not cd-hit-est) at a high threshold of amino acid identities (default: ≥ 0.98). Each gene refers to a protein-encoding locus in an input genome annotation (in the GFF3 format). This step generates clusters of orthologous genes (COGs) and is implemented in Panaroo’s main function.

# Cluster protein sequences using cdhit
cd_hit_out = args.output_dir + "combined_protein_cdhit_out.txt"
run_cdhit(input_file=args.output_dir + "combined_protein_CDS.fasta",
          output_file=cd_hit_out,
          id=args.id,
          s=args.len_dif_percent,
          quiet=(not args.verbose),
          n_cpu=args.n_cpu)  # Function run_cdhit is defined in script cdhit.py.

Parameter args.id as defined in function get_options corresponds to Panaroo’s option --threshold “sequence identity threshold (default=0.98)”.

Panaroo then identifies two kinds of gene clusters:

  • non-paralogous-genes-only
  • paralogues (marked by “1” in node attribute “paralog” in the final graph final_graph.gml)
    • Each genome appears in one paralogue cluster only once.
    • Paralogues share the same centroid ID in node attributes
    • Paralogues may have more than one centroid.

How are gene families collapsed?

To my understandings of the paper, a gene family refers to a cluster of orthologous gene clusters in the initial graph as determined based on a protein-family similarity threadshold (such as the conventional threshold of 70% by default). Panaroo uses both amino acid similarity and gene context to merge (“collapse”) nodes of initial gene clusters to produce the final pangenome graph. Specifically, if gene clusters sharing a common neighbourhood in the initial graph show amino acid identity of ≥ 70%, these clusters (nodes) are collapsed into a new node. The node collapsion is called by the following code block in function main:

# merge again in case refinding has resolved issues
if args.verbose:
    print("collapse gene families with refound genes...")
	G = collapse_families(G,
                      	  seqid_to_centroid=seqid_to_centroid,
                      	  outdir=temp_dir,
                          family_threshold=args.family_threshold,
                          correct_mistranslations=False,
                          length_outlier_support_proportion=args.
                          length_outlier_support_proportion,
                          n_cpu=args.n_cpu,
                          quiet=(not args.verbose),
                          distances_bwtn_centroids=distances_bwtn_centroids,
                          centroid_to_index=centroid_to_index)[0]

Function argument family_threshold takes the value of Panaroo paramter -f/--family_threshold “protein family sequence identity threshold (default=0.7)”.

Final pangenome graph

File: final_graph.gml. In this graph, each node represents a COG that may be collaposed from clusters of the same protein family, and each edge connects two COGs that are adjecent in at least one contig of a genome.

Node attributes

  • label: a unique numeric identifier of a gene cluster (node), which is assigned by Panaroo. The “shared name” and “shared interaction” attributes in Edge Table are built on node labels.
  • geneIDs: names of genes clustered in a COG and corresponding genome IDs (numeric) are listed in attribute genomeIDs in the same order with semicolon delimiters.
  • lengths: length of the coding sequence in each COG. I haven’t seen different lengths in the same cluster.
  • annotation: gene names from inputs.
  • seqIDs: the last ID in geneIDs and may not be the same as centroid.
  • longCentroidID: ID of the longer centroid sequence, and it may not be the last ID in centroid.
  • dna and protein: nucleotide and protein sequences of each centroid (not geneIDs) in the same order as centroid IDs, separated by semicolons.