Software

This live post summarises my understanding of Panaroo’s methods and outputs, to complement the interpretations in the software’s official documentation and original paper. Please be cautious of my possible misunderstandings. Comments are welcome.

This actively developed guidebook covers bioinformatics methods for processing sequencing data from MinION flow cells.

Despite the popularity of the GFF3 format for genome annotations, to my knowledge there are no published tools for extracting the DNA sequences of contigs from GFF3 files and storing them in a multi-FASTA file. EMBOSS seqret is only able to pull out the last contig from a GFF3 file, whereas other tools aim to extract the DNA sequence of each feature. Therefore, in this post I develop two Linux one-liners that extract contig sequences from a GFF3 file and write them to a FASTA file.
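The idea can also be sketched in Python. This is a minimal illustration rather than the one-liners from the post, and it assumes the GFF3 file embeds its contig sequences after a `##FASTA` directive, as the GFF3 specification allows. The function name `gff3_contigs_to_fasta` is hypothetical.

```python
import io


def gff3_contigs_to_fasta(gff_handle, fasta_handle):
    """Copy every line after the ##FASTA directive to fasta_handle.

    Assumption: contig sequences are embedded at the end of the GFF3
    file, after a line starting with '##FASTA'.
    Returns the number of FASTA records (header lines) written.
    """
    in_fasta = False
    n_records = 0
    for line in gff_handle:
        if in_fasta:
            if line.startswith(">"):
                n_records += 1  # each '>' header starts one contig
            fasta_handle.write(line)
        elif line.startswith("##FASTA"):
            in_fasta = True  # everything below this line is sequence data
    return n_records


# Small in-memory example with two contigs
example_gff = io.StringIO(
    "##gff-version 3\n"
    "contig1\tsrc\tgene\t1\t4\t.\t+\t.\tID=g1\n"
    "##FASTA\n"
    ">contig1\nACGT\n"
    ">contig2\nGGCC\n"
)
example_out = io.StringIO()
n_written = gff3_contigs_to_fasta(example_gff, example_out)
print(n_written, "contigs written")
```

For real files, the two handles would be opened with `open("annotation.gff")` and `open("contigs.fna", "w")`; both file names are placeholders.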

SRST2¹, ARIBA², and KmerResistance³ (available as a web service and as source code) are three widely used pieces of standalone software for read-based detection of target genes in bacterial genomes. Published in 2014, SRST2 is recognised as the pioneer amongst these three tools²,³. In this post, I concisely compare the methodologies underlying these tools to shed light on the selection of appropriate software for gene detection. In particular, I assume here that detecting antimicrobial resistance determinants is the only use case. Unless otherwise specified, the software versions referred to in this post are SRST2 v0.2.0, ARIBA v2.14.4, and KmerResistance v2.2.

In addition to Conda, [environment modules](https://en.wikipedia.org/wiki/Environment_Modules_(software)) provide users with a convenient approach to switching software environments on Linux machines. This approach is widely used on computer clusters that offer computational services to a large number of users, and the environment modules are shared by authorised users. These modules, however, are not Linux kernel modules, which the OS launches automatically at start-up; users must load environment modules manually. I learnt how to use module commands for bioinformatic analysis when I was studying at the University of Melbourne. Loading a module essentially modifies your environment variable `$PATH`. In this post, I set up a module manager for users of my Xubuntu system.
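The effect of loading a module on `$PATH` can be illustrated with a toy model in Python. This is only a sketch of the idea, not how the module command is implemented. The functions `load_module` and `unload_module` and the directory `/opt/apps/demo_tool/1.0/bin` are hypothetical.

```python
import os


def load_module(path_value, module_bin):
    """Return a PATH string with module_bin prepended exactly once."""
    parts = [p for p in path_value.split(os.pathsep) if p and p != module_bin]
    return os.pathsep.join([module_bin] + parts)


def unload_module(path_value, module_bin):
    """Return a PATH string with every occurrence of module_bin removed."""
    parts = [p for p in path_value.split(os.pathsep) if p and p != module_bin]
    return os.pathsep.join(parts)


# Hypothetical installation directory of a tool managed as a module
tool_bin = "/opt/apps/demo_tool/1.0/bin"
base_path = os.pathsep.join(["/usr/local/bin", "/usr/bin"])

loaded = load_module(base_path, tool_bin)      # tool directory is searched first
restored = unload_module(loaded, tool_bin)     # back to the original PATH
print(loaded)
print(restored)
```

Prepending the directory means the shell finds the module's executables before any system-wide version with the same name, and unloading restores the previous search order.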

Linux is a popular family of operating systems (OS) used in bioinformatics. Amongst its numerous distributions, Xubuntu is a lightweight derivative of Ubuntu Linux that aims to run on machines with limited resources. As a Windows user, I often need to switch to a Linux environment for program development and testing. To this end, VirtualBox offers an easy-to-use and resource-light alternative to a dedicated physical machine or dedicated disc space (dual OS). This post records my key steps for setting up Xubuntu in VirtualBox for basic bioinformatic work.

The Python script gbk2tbl.py in my GitHub repository BINF_toolkit is a popular tool for preparing input files of NCBI Sequin from GenBank files. Nonetheless, this script has only supported Python 2 since its first release in 2015, causing inconvenience to some users. Today, I got some time to make the script compatible with Python 3 using the tool 2to3 and some manual adjustments. The new script has been tested under Python 3.5.2 and pushed to my GitHub.

I finally got some time this morning to write a Python script, gbk2tsv.py, which converts several GenBank files into tab-delimited feature tables (plain text files with the extension “.tsv”). It can be a useful tool when we need to summarise genome annotations or acquire nucleotide and protein sequences of certain genomic features. Although the Holt Lab, where I did my PhD, has an in-house script that does a similar job, it would be inappropriate for me to use or share that intellectual property for projects outside the Holt Lab without specific permission. Therefore, I decided to create a script from scratch after a discussion on genome annotation with Hao Luo, a PhD student at the Chalmers University of Technology, Sweden, during a lunch break of the course MESB19.

Phylogenetic reconstruction is of crucial importance for elucidating bacterial population structure, epidemiology and evolutionary histories. Phylogenetic networks and trees are by far the most common approaches used for studying the evolutionary history of a bacterial population. However, the concepts and methodology underlying phylogenetic reconstruction can be challenging for beginners. I therefore share my notes on the relevant literature in this post to address these obstacles. In particular, I compare different kinds of phylogenetic networks to show their pros and cons under various conditions.

SRST2 is a widely used tool for screening Illumina reads of bacterial genomes for known genes (that is, targeted gene detection). Its capabilities include MLST profiling and detection of known antimicrobial resistance genes (ARGs), virulence genes, plasmids, etc. I have been using SRST2 throughout my PhD project and built my package GeneMates on SRST2’s outputs. Here, I explain the output formats of SRST2 to help users gain a better understanding of this versatile tool. Comments and corrections from readers are welcome, since this post is based on my own understanding and experience.

Understanding Panaroo's outputs

Guidebook for processing Nanopore sequencing data

Linux one-liners converting GFF3 to FASTA files of contigs

Comparisons between SRST2, ARIBA, and KmerResistance

Installing environment modules on Xubuntu

Setting up Xubuntu in VirtualBox for bioinformatic work

Script gbk2tbl.py now supports Python 3

gbk2tsv.py: tabulating genomic features in GenBank files

To tree or not to tree: an introduction of phylogenetic networks

Understanding SRST2 outputs