Microbial Systems Biology

This live post summarises my understanding of Panaroo’s methods and outputs, complementing the interpretations in the software’s official documentation and original paper. Please be wary of possible misunderstandings on my part. Comments are welcome.

This actively developed guidebook covers bioinformatics methods for processing sequencing data from MinION flow cells.

It was a disaster that, just before the New Year, I got a positive result from my pre-departure SARS-CoV-2 RT-PCR test at a commercial test centre (Site 1) and had to cancel my international flights for the reunion with my wife. I was shocked by the result because I had been self-isolating for more than 10 days before the test, with limited outdoor activities (such as shopping for groceries), and I had no COVID-19 symptoms at all. After this moment of confusion and disappointment, I did a lateral-flow-device (LFD) antigen test at home and got a negative result. A second LFD test the next day also returned a negative result. In the afternoon of the same day, I had my second PCR test at a different site (Site 2), where a professional swabbed my tonsils and nasal cavity so thoroughly that I even smelled a hint of blood. The result came quickly the next morning, and I immediately booked my third PCR test at another site (Site 3) for the same morning. All three sites are accredited by the UK Health Security Agency (UKHSA). All my results were reported to the NHS for testing and tracing.

Despite the popularity of the GFF3 format for genome annotations, to my knowledge there are no published tools for extracting the DNA sequences of contigs from GFF3 files and storing them in a multi-FASTA file. EMBOSS seqret is only able to pull out the last contig from a GFF3 file, whereas other tools aim to extract the DNA sequence of each feature. Therefore, I developed two Linux one-liners in this post for extracting contig sequences from a GFF3 file and transferring them to a FASTA file.
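The idea can be illustrated with a minimal Python sketch. This is not one of the one-liners from the post; it assumes that the GFF3 file embeds its contig sequences after a `##FASTA` directive, which the GFF3 specification permits, and the function name `gff3_to_fasta` is hypothetical.

```python
import io


def gff3_to_fasta(gff_handle, fasta_handle):
    """Copy the embedded FASTA section of a GFF3 stream to fasta_handle.

    Assumes the sequences follow a '##FASTA' directive line.
    Returns the number of FASTA records (contigs) written.
    """
    in_fasta = False
    n_contigs = 0
    for line in gff_handle:
        if not in_fasta:
            # Skip annotation lines until the FASTA directive is reached.
            if line.strip() == "##FASTA":
                in_fasta = True
            continue
        if line.startswith(">"):
            n_contigs += 1
        fasta_handle.write(line)
    return n_contigs


if __name__ == "__main__":
    # Toy input with two contigs; real files would be opened with open().
    demo = io.StringIO(
        "##gff-version 3\n"
        "contig_1\tdemo\tCDS\t1\t6\t.\t+\t0\tID=gene_1\n"
        "##FASTA\n"
        ">contig_1\nACGTACGT\n"
        ">contig_2\nGGCC\n"
    )
    out = io.StringIO()
    print(gff3_to_fasta(demo, out), "contigs written")
    print(out.getvalue(), end="")
```

If a GFF3 file has no `##FASTA` section, this sketch writes nothing and returns 0, so the sequences would have to come from a separate FASTA file.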

Here are my notes on the article about ClonalFrameML, a program that detects recombined regions in a multiple sequence alignment, infers phylogenetic relationships while correcting for recombination, reconstructs ancestral states, and imputes SNPs under a maximum-likelihood (ML) framework.

SRST2¹, ARIBA², and KmerResistance (web service, code)³ are three widely used pieces of standalone software for read-based detection of target genes in bacterial genomes. Published in 2014, SRST2 is recognised as the pioneer amongst these three tools²,³. In this post, I concisely compare the methodologies underlying these tools to shed light on the selection of appropriate software for gene detection. In particular, I assume here that detecting antimicrobial resistance determinants is the only use case. Unless otherwise specified, the software versions referred to in this post are SRST2 v0.2.0, ARIBA v2.14.4, and KmerResistance v2.2.

In addition to Conda, [environment modules](https://en.wikipedia.org/wiki/Environment_Modules_(software)) provide users with a convenient approach to switching software environments on Linux machines. This approach is widely used on computer clusters that offer computational services to a large number of users, and the environment modules are shared by authorised users. These modules, however, are not Linux kernel modules, which the OS loads automatically at start-up; users must load environment modules manually. I learnt how to use module commands for bioinformatic analysis when I was studying at the University of Melbourne. Loading a module essentially modifies your environment variable `$PATH`. In this post, I set up a module manager for users of my Xubuntu system.
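To make the `$PATH` point concrete, here is a minimal Python sketch of what loading and unloading a module amounts to for that one variable. It does not use or reproduce the Environment Modules software; the function names and the directory `/opt/apps/samtools/bin` are hypothetical, and a real module file may also change other variables (for example `MANPATH` or `LD_LIBRARY_PATH`).

```python
import os


def prepend_to_path(env, directory):
    """Return a copy of env with directory at the front of PATH ("load")."""
    new_env = dict(env)
    parts = [p for p in new_env.get("PATH", "").split(os.pathsep)
             if p and p != directory]
    new_env["PATH"] = os.pathsep.join([directory] + parts)
    return new_env


def remove_from_path(env, directory):
    """Return a copy of env with directory dropped from PATH ("unload")."""
    new_env = dict(env)
    parts = [p for p in new_env.get("PATH", "").split(os.pathsep)
             if p and p != directory]
    new_env["PATH"] = os.pathsep.join(parts)
    return new_env


if __name__ == "__main__":
    base = {"PATH": os.pathsep.join(["/usr/bin", "/bin"])}
    tool_dir = "/opt/apps/samtools/bin"  # hypothetical install location
    loaded = prepend_to_path(base, tool_dir)
    print("after load:  ", loaded["PATH"])
    print("after unload:", remove_from_path(loaded, tool_dir)["PATH"])
```

Placing the directory at the front of `PATH` means the shell finds the module's executables before any system-wide copies, which is why the order of entries matters.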

Linux is a popular family of operating systems (OS) used in bioinformatics. Amongst its numerous distributions, Xubuntu is a lightweight derivative of Ubuntu Linux that aims to run on machines with modest hardware. As a Windows user, I often need to switch to a Linux environment for program development and testing. To this end, VirtualBox offers an easy-to-use, resource-light alternative to a dedicated physical machine or dedicated disk space (dual boot). This post records my key steps for setting up Xubuntu in VirtualBox for basic bioinformatic work.

The Python script gbk2tbl.py in my GitHub repository BINF_toolkit is a popular tool for preparing input files for NCBI Sequin from GenBank files. Nonetheless, this script has only supported Python 2 since its first release in 2015, causing inconvenience to some users. Today, I found some time to make the script compatible with Python 3 using the tool 2to3 and some manual adjustments. The new script has been tested under Python 3.5.2 and pushed to my GitHub repository.

Reliable and up-to-date databases play a pivotal role in reference-based detection of antimicrobial resistance genes (ARGs) in bacteria. Nonetheless, these databases differ in their content and quality, making it challenging to choose an appropriate reference database for a particular research project. To address this challenge, this post reviews several ARG databases that are publicly available and widely used in bacterial genomics. In particular, I focus on databases that are still regularly maintained, and I do not discuss any programs released with these databases for sequence search or statistical analysis.

Understanding Panaroo's outputs

Guidebook for processing Nanopore sequencing data

A generalised Bayesian model for the probability of getting a false-positive PCR result

Linux one-liners converting GFF3 to FASTA files of contigs

Notes about ClonalFrameML

Comparisons between SRST2, ARIBA, and KmerResistance

Installing environment modules on Xubuntu

Setting up Xubuntu in VirtualBox for bioinformatic work

Script gbk2tbl.py now supports Python 3

Popular reference databases of antimicrobial resistance genes