Understanding SRST2 outputs

SRST2 is a widely used tool screening Illumina reads of bacterial genomes for known genes (that is, targeted gene detection). Its capability includes MLST profiling and detection of known antimicrobial resistance genes (ARGs), virulence genes, plasmids, etc. I have been using SRST2 throughout my PhD project and coded my package GeneMates on the grounds of SRST2’s outputs. Here, I explain the output formats of SRST2 in order to help users to gain a better understanding of this versatile tool. Comments and corrections from readers are welcomed since this post is based on my own understandings and experience.

Supposing that we aim to screen paired-end Illumina reads of a collection of bacterial genomes for ARGs, we submit several SRST2 jobs through SLURM using the following commands:

# Gene detection
slurm_srst2.py --script srst2.py --output demo --input_pe reads/*_[1,2].fastq.gz --min_coverage 90 --max_divergence 10 --gene_max_mismatch 10 --walltime '0-1:0:0' --threads 4 --memory 8192 --rundir out --other_args "--gene_db data/ARGannot_r2.fasta --save_scores --report_all_consensus" > srst2.log
    
# Compile allele calls per sample
srst2.py --prev_output out/*_demo__genes__ARGannot_r2__results.txt --output demo

The second command line creates a compiled allele profile demo__compiledResults.txt.

1. Output files

This section explains output files of SRST2 when it is run for MLST and gene typing simultaneously.

1.1. MLST

  • [isolate name]_[output prefix]__mlst__[MLST database name]__results.txt: a single-line output on ST and MLST gene calls. An average read depth of all MLST gene calls is also included.
  • isolate name]_[output prefix]__[isolate name].[MLST database name].scores: consists of the score, average depth, coverage, number of mismatches, and other information of every allele of every MLST gene in the MLST database. Therefore, an alignment summary of an allele is included even if this allele is not an allele call for the isolate.
  • [allele name].[isolate name]_[output prefix]__[isolate name].[MLST database name].pileup: an alignment of reads mapping to the allele call of each MLST gene.
  • [isolate name]_[output prefix]__[isolate name].[MLST database name].pileup: a large file that consists of alignments of every allele of every MLST gene for the isolate, even if some alleles are not called during the pipeline.
  • [isolate name]_[output prefix]__[isolate name].[MLST database name].sorted.bam: a binary SAM file corresponding to [isolate name]_[output prefix]__[isolate name].[MLST database name].pileup.

1.2. Gene typing

Format of sequence headers in an SRST2-formatted reference database:

>[clusterID]__[gene]_[product/phenotype]__[allele]__[seqID];[other stuff]

Output files of gene typing are as follows:

  • [isolate name]_[output prefix]__genes__[gene database name]__results.txt: all allele calls of this isolate. Allele variants and ambiguous allele calls are denoted by an asterisk and a question mark respectively. It is in fact the second part of the compiled file [isolate name]_[output prefix]__compiledResults.txt.
  • [isolate name]_[output prefix]__fullgenes__[gene database name]__esults.txt: details of every allele call of this isolate, including allele-specific read depths and divergences from reference sequences.
  • [isolate name]_[output prefix]__[isolate name].[gene database name].scores: the same type of files as the [isolate name]_[output prefix]__[isolate name].[gene database name].scores. The information of read depth of every allele call is the same as those in last file ([isolate name]_[output prefix]__fullgenes__[gene database name]__results.txt). In addition, it contains information of other alleles of the same gene.
  • [cluster ID]__[gene name]__[allele name].[isolate name]_[output prefix]__[isolate name].[gene database name].pileup: an alignment of reads to an allele call, from which a consensus sequence is extracted.
  • [isolate name]_[output prefix]__[isolate name].[gene database name].pileup: the same kind of file as the [isolate name]_[output prefix]__[isolate name].[MLST database name].pileup. Alignments against all reference sequences although most of them are not allele calls.
  • [isolate name]_[output prefix]__[isolate name].[gene database name].sorted.bam: the binary SAM file corresponding to the file [isolate name]_[output prefix]__[isolate name].[gene database name].pileup.

1.3. Compiled results

  • [isolate name]_[output prefix]__compiledResults.txt: the content of [isolate name]_[output prefix]__mlst__[MLST database name]__results.txt plus allele calls of resistance genes in genes__*__results.txt.
  • [isolate name]_[output prefix].all_consensus_alleles.fasta: consensus sequences reconstructed from pileup files for all allele calls, including alleles of MLST genes. Suspicious allele calls (”?") are also included.
  • [isolate name]_[output prefix].new_consensus_alleles.fasta: this file only includes consensus sequences of reliable allele variants ("*") compared with the *.all_consensus_alleles.fasta. Sequences in this file are the same as those in *.all_consensus_alleles.fasta. Suspicious allele calls are not included in this file.

Using PAMmaker, we can compile score files for allele calls:

# Compile score files
PAMmaker/reliability_assessment/collate_topAllele_scores.py --allele_calls out/*_demo__genes__ARGannot_r2__results.txt --allele_scores out/*_Kp__*.ARGannot_r2.scores --prefix mergedScores

This command creates a compiled score file mergedScores__gene.scores for gene typing or mergedScores__mlst.scores for MLST. It is much easier to find out summary statistics of allele calls in this file.

1.4. Pileup files in SRST2’s outputs

SRST2 produces a pileup file for read alignments against each reference sequence and calculates summary statistics (such as mean read depth, coverage, nucleotide identity) based on this file. These statistics are used for allele calling. Here, I explain the pileup file and calculations of these statistics.

There are two kinds of files in pileup format generated by SAMtools mpileup in SRST2’s pipeline. The first kind is allelic pileup files, which describe alignments per sample per allele if any reads are mapped to the reference, while the second kind is sample pileup files, which compile all allelic alignments into a single file for each sample.

These kinds of pileup files differ in formats of filenames. For example, an allelic pileup file of isolate 1 is named as 66__FloR_Phe__FloR__1212.isolate1_demo__isolate1.ARGannot.r2.pileup, whereas the sample pileup file is named as isolate1_S1E10__isolate1.ARGannot.r2.pileup. These files are read and summarised by SRST2 to generate *.score files and *__fullgenes__*__results.txt files.

2. Summary statistics and comparisons used for determining allele calls

SRST2 evaluates coverage, nucleotide divergence, the number of mismatches and read depths per alignment against a reference sequence to make an allele call. Nonetheless, there are subtle differences in comparisons of these summary statistics to user-specified thresholds. Since in our demonstration, we set “–min_coverage 90 –max_divergence 10 –gene_max_mismatch 10” for SRST2, I am going to explain how SRST2 makes comparisons using these thresholds on the basis of code in srst2.py.

2.1. Coverage

The coverage of reads to a reference sequence is the percentage of the reference allele length that remains after removing truncated bases and deletions. Evidently, insertions does not affect the coverage.

coverage_allele[allele] = 100 * (allele_size - total_missing_bases - del_poscount)/float(allele_size) # includes in-read deletions

Under the gene-typing mode, each allele call must show a coverage > minimum coverage. In our demonstration, it is coverage > 90%. Note that the inequality expression is not “≥”. The coverage is not assessed under the MLST mode.

def score_alleles(...)
    if (run_type == "mlst") or (coverage_allele[allele] > args.min_coverage)
        ...

2.2. Nucleotide divergence

In order to calculate the nucleotide divergence for the alignment to each reference sequence, the raw number of mismatches counts substitutions and deletions but not insertions. However, SRST2 only report the number of substituted bases as the number of mismatches in the score file, while the number of deleted bases (not truncated) is accounted for in the number of indels.

mismatch_allele[allele] = total_mismatch - del_poscount  # SNPs only
indel_allele[allele] = del_poscount + ins_poscount  # insertions or deletions

Since the divergence is only printed to full-gene reports but not to score files that are often used in analysis, we can recover the divergence from a score file using the formula “Mismatches / (Size - Truncated_bases) * 100”. Although it is counterintuitive, the exclusion of deletions from the calculation of divergence does make sense because as we have explained about read depths, SRST2 only aims to evaluate how well the presence of a target in a sample is supported by reads. If a query sequence does have deletions, then the divergence of the remaining part of the reference in this sample to the complete reference sequence should only consider SNPs as deleted bases are not present in the query. Therefore, when we think about SRST2’s performance and its results, we should take a target-oriented perspective, namely, how likely a reference sequence is actually present in an isolate, rather than a query-oriented perspective.

Under the gene-typing mode, each allele call must show a divergence ≤ maximum divergence. In our demonstration, it is divergence ≤ 10%, or equivalently, identity ≥ 90%. The divergence is not assessed under the MLST mode either.

def map_fileSet_to_db(...)
    elif run_type == "genes" and len(allele_scores) > 0:
        for gene in allele_scores:
            if divergence*100 <= float(args.max_divergence):
                column_header = cluster_symbols[cluster_id]
                results[sample_name][column_header] = allele_name
                if diffs != "":
                    results[sample_name][column_header] += "*"
                if depth_problem != "":
                    results[sample_name][column_header] += "?"
                if column_header not in gene_list:
                    gene_list.append(column_header)

Note that the divergence is not considered for making an allele call in MLST. The upper bound for the divergence (by default, it is 10%) between a consensus sequence and the most closely-related reference allele sequence does not apply to MLST alleles. Particularly, it only applies to the genotyping process. As a result, an MLST allele can be reported even if its divergence is greater than 10%, as long as its coverage remains above 90% (the default threshold). The cause lies at the stage of writing alleles into output files in function map_fileSet_to_db of srst2.py:

def map_fileSet_to_db(...)
    # Report MLST results to __mlst__ file
    if run_type == "mlst" and len(allele_scores) > 0:
        ...
    elif run_type == "genes" and len(allele_scores) > 0:
        ...
        # store for gene result table only if divergence passes minimum threshold:
        if divergence*100 <= float(args.max_divergence):
            ...

Here, the divergence is assessed only for gene typing. By contrast, the coverage is assessed for both MLST and genotyping because it is implemented in another function read_pileup_data which is called by both kinds of run types, before writing alleles into the outputs. Accordingly, we must be particularly careful when we see an allele variant of an MLST gene. We should not take the allele ID directly from an allele variant without considering its divergence to its reference.

2.3. Number of mismatches to a reference allele

SRST2 determines whether a mismatch (namely, either a single-nucleotide substitution, also known as a SNP, or a deletion) is present at the current position in the reference sequence. Specifically, for this position, SRST2 does not directly count the number of reads having mismatches (“A, G, C, T” or N) from the column of alignment status, instead, it subtracts the number of reads having matches from the per-base read depth. Obviously, the number of reads having mismatches at each position is the total number of substitutions and deletions. Insertions are not counted because they do not contribute to the read depth at any positions.

# In a loop of the following code through all positions
num_mismatch = nuc_depth - num_match  # the number of reads showing mismatches at the current each base
	
# Then assess if we have sufficient evidence for the presence of substitutions or indels at this position.
if num_mismatch > num_match:
    total_mismatch += 1 # record this position as a mismatch (could be a SNP or deletion)

SRST2 requires the number of mismatches ≤ a maximum number to make an allele call under both the MLST mode and the gene-typing mode. In our demonstration, every allele call requires no more than 10 mismatches.

def modify_bowtie_sam(...)
    ...
    with open(raw_bowtie_sam) as sam, open(raw_bowtie_sam + '.mod', 'w') as sam_mod:
        for line in sam:
            ...
            if m != None:
                num_mismatch = m.group(1)
                if int(num_mismatch) <= int(max_mismatch):
                    sam_mod.write('\t'.join([fields[0], str(flag)] + fields[2:]))

2.4. Number of insertions

SRST2 determines whether an insertion is present at the current position. First, it counts the number of reads having insertions at this position. Particularly, the size of each insertion is not taken into account. For instance, in a the pileup file, the alignment for allele SHV-16 is:

164__SHV-OKP-LEN_Bla__SHV-16__1292	272	T	17	,.,+1c,+1c.+1C,+1c.+1C.+1C.+1C.+1C.+1C.+1C.+1C,+1c,+1c,+1c.+1C

In srst2.py, the code counting insertions is as follows:

if aligned_bases[i] == "+":
    i += int(aligned_bases[i+1]) + 2 # skip to next read's result
    ins_readcount += 1
    continue

For each position in the reference, an insertion, whose first base is located at the next position, is considered as confident if there are more than half of aligned reads displaying insertions at the current position:

if ins_readcount > nuc_depth / 2:  # paired-end sequencing?
    ins_poscount += 1  # an allele-level variable: the count of positions where an insertion is present

It is important to emphasise that headers of insertions (that is, strings such as “+1c”) are not counted into the read depth at each position.

2.5. Number of truncations

SRST2 determines whether the current base is not covered by any reads. In other words, this base in the sample is truncated from the reference. According to srst2.py, the number of truncated bases, also reported as the number of (single-base) holes, is the number of reference bases that are NOT covered by any reads (from the beginning to the end of every read) at all. For these bases, their read depth is zero or get skipped in the pileup file. In fact, they must belong to deletions wider than read lengths so that no reads can span it across. On the other hand, the number of deletions (which will be talked about later) is the number of bases that are covered by reads showing consensus gaps at these positions. The read depth is positive for these bases in the reference.

To conclude, a truncation refers to a base that is not covered by any reads at all (namely, from a large regional deletion), whereas a deletion refers to a base that is deleted from the reference but is still covered by reads (that is, from a deleted region that is smaller than the read length). Obviously, there is no intersections between truncated bases (who have no coverage at all) and deletions (whose coverage > 0). Particularly, a truncation may be comprised of either a single base or a fragment, at any places in a reference sequence.

for fields in lines:  # go through every base of the reference allele (one line per base in the pileup file)
    exp_nuc_num += 1
    nuc_num = int(fields[1])  # Actual position in ref allele, read from the pileup file
    nuc_depth = int(fields[3])  # read depth at the current position
    
    # The pileup file skips reference bases that are not covered by any reads. Particularly, no reads span across the truncated region (otherwise, it becomes a multi-base deletion).
    # A truncation at the 5' end of a reference falls into the situation where "nuc_num > exp_nuc_num". The size of a 3' truncation is not calculated here, instead, it is calculated at the allele level (see the next section).

    if nuc_num > exp_nuc_num:  # if this is a large deletion, whose base indices are skipped in the pileup file
        total_missing_bases += abs(exp_nuc_num - nuc_num)  # regional deletion
    exp_nuc_num = nuc_num  # synchronise the expected coordinate
    if nuc_depth == 0:  # if the current base not covered by any reads. This situation does not overlap with the first one.
        total_missing_bases += 1

For example, the resistance gene blaOqxB-49 is truncated by 33 bases at its 3’ end in a genome. The pileup of this gene ends at the 3120th base, whereas the complete reference is 3153-base long.

0__OqxB_Flq__OqxB__49        3119        G        28        ,.,.,,,,,,,,,,,,,,,,,,,,,,,,
0__OqxB_Flq__OqxB__49        3120        C        28        ,$.$,$.$,$,$,$,$,$,$,$,$,$,$,$,$,$,$,$,$,$,$,$,$,$,$,$,$

In the same genome, this gene harbours another 7-base truncation as well, which is indicated by a leap of coordinates from 2522 to 2530 in the pileup file.

0__OqxB_Flq__OqxB__49        2521        T        7        .$....,,        EHF@BGH
0__OqxB_Flq__OqxB__49        2522        C        6        .$.$.$.$,$,$        EHGKFG  # six reads terminate here
0__OqxB_Flq__OqxB__49        2530        C        10        ^~.^~.^~.^~,^~,^~,^~.^~.^~.^~.        DEDGHHDDFB  # 10 reads start here
0__OqxB_Flq__OqxB__49        2531        A        10        ...,,,....        CGDHFCCDH@

Accordingly, the code total_missing_bases += abs(exp_nuc_num - nuc_num) returns the size of intra-sequence truncation correctly.

2.6. Per-base read depth

For each position in the reference allele sequence, SRST2 directly reads the read depth from the fourth column of the current pileup file. This process tells me the way to get a vector of per-base read depths from a pileup file for any reference sequence. Actually, I can draw a bar plot of these read depths to inspect their distribution.

nuc_depth = int(fields[3])

This per-base read depth is the number of reads having matches, substitutions (SNPs) or deletions at this position (see the example below). The number of reads having insertions following the current position does not contribute to the read depth because insertions are not present in the reference sequence (The term “read depth” is a concept related to a reference sequence). In summary, per-base read depth = matches + substitutions + deletions.

2.7. Average read depth across each alignment and edge depths

According to srst2.py, the average depth does not take insertions into account. It is the mean of per-base read depths on the length of aligned reference sequence. Hence, along with edge depths, the average read depth evaluates how much evidence we gain from reads for aligned bases (including deletions) in the reference sequence. It does not reveal how deep the query sequence is covered by reads.

# Loop through every allele
allele_line = 1 # Keep track of line for this allele
total_depth = 0  # for each allele
...
    # Loop through every base of the current allele
    nuc_depth = int(fields[3])  # read the depth column in the pileup file
    total_depth = total_depth + nuc_depth  # cumulate the read depth of every aligned base
    allele_line += 1  # It becomes the number of aligned bases in the reference, including deletions but excluding insertions.
avg_depth = round(total_depth / float(allele_line),3)

SRST2 requires a read depth > a minimum depth in order to make an allele call for both MLST calling and gene typing. The same inequality expression applies to edge depths.

def parse_scores(...)
    ...
    if hash_edge_depth[top_allele][0] > args.min_edge_depth and hash_edge_depth[top_allele][1] > args.min_edge_depth:
        if next_to_del_depth_allele[top_allele] != "NA":
            if float(next_to_del_depth_allele[top_allele]) > args.min_edge_depth:
                if avg_depth_allele[top_allele] > args.min_depth:
                    adequate_depth = True
                else:
                    depth_problem="depth"+str(avg_depth_allele[top_allele])
            else:
                depth_problem = "del"+str(next_to_del_depth_allele[top_allele])
        elif avg_depth_allele[top_allele] > args.min_depth:
            adequate_depth = True
        else:
            depth_problem="depth"+str(avg_depth_allele[top_allele])
    else:
        depth_problem = "edge"+str(min(hash_edge_depth[top_allele][0],hash_edge_depth[top_allele][1]))

2.8. Cumulative and average edge depths

By default, the calculation of cumulative edge depths particularly applies to the first and last two bases of every reference sequence. Specifically, there are two kinds of edge depths — the edge 1 depth (5’ edge) and the edge 2 depth (3’ edge), which are computed through dividing a corresponding cumulative edge depth by the size of the edge region (2 bp by default). Obviously, an edge depth of zero does not necessarily mean there is a large (that is, > 2 bp) truncation at an end of the reference. It is worthy pointing out here that when a truncation happens at an end of the reference sequence, the corresponding edge depth becomes zero, which is the default value of variables depth_a and depth_z.

edge_a = edge_z = 2  # define the first and last 2 bp in a reference for calculating edge depths
...
depth_a = depth_z = 0  # cumulative edge depth = 0 if either end of the reference failed mapping
nuc_num = int(fields[1]) # actual position in the reference allele
...

# Calculate cumulative depths for the edge region
if nuc_num <= edge_a:  # if the current position is inside of the first 2 bp, then calcualte the edge 1 depth
    depth_a += nuc_depth  # add the read depth of this position to the total (within 2-bp range)
    ...
if abs(nuc_num - allele_size) < edge_z:  # calculate the other depth if the current base is within the last 2 bp
    depth_z += nuc_depth
    ...
...

# Calculate average edge depths
avg_a = depth_a / float(edge_a)  # average coverage depth at the 5' end (2-bp region)—edge depth 1, or start_depth in function score_alleles(..)
avg_z = depth_z / float(edge_z)  # average coverage depth at the 3' end (2-bp region)—edge depth 2, or end_depth in function score_alleles(..)
...

def read_pileup_data(...):
    hash_edge_depth[allele] = (avg_a, avg_z)  # {"allele name" : (avg_a, avg_z)}
    
def score_alleles(..., hash_edge_depth, ...):
    # This function writes *.scores files which contain columns Edge1_depth and Edge2_depth.
    ...

3. Conditions for making a variant and/or dubious allele call?

There are four kinds of allele calls (i.e. the “top allele” identified in srst2.py) in an SRST2 report: precise allele calls (100% nucleotide identity across all of the reference sequence), allele variants (flagged by a “*"), dubious but precise alleles calls (”?"), and dubious allele variants ("*?"). Two independent dimensions of criteria are used for designating a symbol to an allele call: the similarity and alignment coverage/depth (table 1) of an alignment. To be more specific, among all allele calls reported, a sequence is flagged as an allele variant (denoted by an asterisk) if there is any SNP or indel found between the consensus sample sequence and the reference; a dubious allele (denoted by a question mark) is an allele call whose coverage depths fall below the user-set thresholds (average depth >5 and the minimum depth of edge one and two >2 by default, i.e. –min_depth 5 and –min_edge_depth 2). Note that the designation of symbols “*” and “?” are mutually independent, and SRST2 identifies top alleles before assigning them any symbol based on the identity and depth of alignments.

Table 1. symbols used for flagging allele calls based on two criteria: alignment similarity and depth.

Similarity/coverage Confident Dubious
precise ?
variant * *?

Note that the term “dubious” indicates that the evidence for this allele call is insufficient in terms of the alignment coverage.