Linux one-liners converting GFF3 to FASTA files of contigs

Despite the popularity of the GFF3 format for genome annotations, to my knowledge there are no published tools for extracting DNA sequences of contigs from GFF3 files and storing them in a multi-FASTA file. EMBOSS seqret is only able to pull out the last contig from a GFF3 file, whereas other tools aim to extract the DNA sequence of each feature. Therefore, I develop two Linux one-liners in this post for extracting contig sequences from a GFF3 file and transferring them to a FASTA file.
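For readers who prefer a script, the same idea can be sketched in Python. This is only an illustration, not the one-liners from the post: it assumes the GFF3 file embeds its contig sequences after a `##FASTA` directive, which the GFF3 format allows, and the function name `gff3_to_fasta` is made up for this example.

```python
import io


def gff3_to_fasta(gff_handle, fasta_handle):
    """Copy the sequence section that follows ##FASTA into fasta_handle.

    Returns the number of FASTA records (header lines) copied.
    Assumption: the GFF3 input embeds its sequences after a ##FASTA line.
    """
    in_fasta = False
    n_records = 0
    for line in gff_handle:
        if not in_fasta:
            # Everything before the ##FASTA directive is annotation; skip it.
            if line.strip() == "##FASTA":
                in_fasta = True
            continue
        if not line.strip():
            continue  # drop blank lines inside the sequence section
        if line.startswith(">"):
            n_records += 1
        # Make sure every copied line ends with a newline.
        fasta_handle.write(line if line.endswith("\n") else line + "\n")
    return n_records


# Small in-memory demonstration with two hypothetical contigs.
example_gff = io.StringIO(
    "##gff-version 3\n"
    "contig1\t.\tgene\t1\t4\t.\t+\t.\tID=g1\n"
    "##FASTA\n"
    ">contig1\nACGT\n"
    ">contig2\nGGCC\n"
)
example_out = io.StringIO()
n_contigs = gff3_to_fasta(example_gff, example_out)
```

With real files, the two handles would simply be opened with `open("annotation.gff")` and `open("contigs.fna", "w")`; both file names are placeholders.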

Notes about ClonalFrameML

Here are my notes on the article about ClonalFrameML, a program that detects recombined regions in a multi-sequence alignment, infers phylogenetic relationships while correcting for recombination, reconstructs ancestral states, and imputes SNPs under a maximum-likelihood (ML) framework.

Comparisons between SRST2, ARIBA, and KmerResistance

SRST2 [1], ARIBA [2], and KmerResistance (web service, code) [3] are three widely used pieces of standalone software for read-based detection of target genes in bacterial genomes. Published in 2014, SRST2 is recognised as the pioneer amongst these three tools [2, 3]. In this post, I compare methodologies underlying these tools in a concise manner to shed light on the selection of appropriate software for gene detection. Particularly, I herein presume that detecting antimicrobial resistance determinants is the only use case. Whenever unspecified, software versions referred to in this post are: SRST2 v0.2.0, ARIBA 2.14.4, and KmerResistance v2.2.

Installing environment modules on Xubuntu

In addition to Conda, environment modules provide users with a convenient approach to switching software environments on Linux machines. This approach is widely used on computer clusters that offer computational services to a large number of users, and the environment modules are shared by authorised users. These modules, however, are not Linux kernel modules, which are automatically launched by the OS at start-up; instead, they must be loaded manually by users. I learnt how to use module commands for bioinformatic analysis when I was studying at the University of Melbourne. Loading a module essentially modifies your environment variable `$PATH`. In this post, I set up a module manager for users of my Xubuntu system.
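To make the effect on `$PATH` concrete, here is a minimal Python sketch of what loading a module amounts to at its simplest: putting a tool's directory at the front of the search path. It is a simplification rather than how any module manager is implemented, since real module files can also change other variables and can be unloaded again. The function name `prepend_to_path` and the directory `/opt/example_tool/bin` are hypothetical.

```python
import os


def prepend_to_path(path_value, directory):
    """Return a PATH-style string with `directory` moved or added to the front.

    Existing copies of `directory` and empty entries are dropped so that
    repeated "loads" do not make the value grow.
    """
    parts = [p for p in path_value.split(os.pathsep) if p and p != directory]
    return os.pathsep.join([directory] + parts)


# Hypothetical tool directory, used only for illustration.
tool_dir = "/opt/example_tool/bin"
current_path = os.environ.get("PATH", "")
new_path = prepend_to_path(current_path, tool_dir)
```

Assigning `new_path` to `os.environ["PATH"]` would affect only the running Python process and its children, which mirrors the fact that a loaded module changes the current shell session and not the whole system.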

Setting up Xubuntu in VirtualBox for bioinformatic work

Linux is a popular family of operating systems (OS) used in bioinformatics. Amongst its numerous distributions, Xubuntu is a lightweight derivative of Ubuntu Linux that aims to run on machines with low system requirements. As a Windows user, I often need to switch to a Linux environment for program development and testing. To this end, VirtualBox offers an easy-to-use, low-resource alternative to a dedicated physical machine or disc space (dual OS). This post records my key steps for setting up Xubuntu in VirtualBox for basic bioinformatic work.

Script now supports Python 3

The Python script in my GitHub repository BINF_toolkit is a popular tool for preparing input files of NCBI Sequin from GenBank files. Nonetheless, this script has only supported Python 2 since its first release in 2015, causing inconvenience to some users. Today, I found some time to make the script compatible with Python 3 using the tool 2to3 and some manual adjustments. The new script has been tested under Python 3.5.2 and pushed to my GitHub.

Popular reference databases of antimicrobial resistance genes

Reliable and up-to-date databases play a pivotal role in reference-based detection of antimicrobial resistance genes (ARGs) in bacteria. Nonetheless, these databases differ in their content and quality, making it challenging to decide on an appropriate reference database for a particular research project. In order to address this challenge, this post offers a review of several ARG databases that are publicly available and widely used in bacterial genomics. Particularly, I focus on databases that are still undergoing regular maintenance, and I do not discuss any program released with these databases for sequence search or statistical analysis.

Tabulating genomic features in GenBank files

I finally got some time this morning to write a Python script, which converts several GenBank files into tab-delimited feature tables (plain text files with an extension “.tsv”). It can be a useful tool when we need to summarise genome annotations or acquire nucleotide and protein sequences of certain genomic features. Although the Holt Lab, where I did my PhD, has an in-house script to do a similar job, it is inappropriate for me to use or share that intellectual property for projects outside of the Holt Lab without specific permission. Therefore, I decided to create a script from scratch after a discussion on genome annotation with Hao Luo, a PhD student at the Chalmers University of Technology, Sweden, during a lunch break of the course MESB19.

Notes of an online metagenomics course

In this post, I compile my notes on the course Metagenomics applied to surveillance of pathogens and antimicrobial resistance. This three-week course is offered by the Technical University of Denmark and is freely accessible on Coursera. As a graduate researcher working on antimicrobial resistance (AMR) in bacterial populations, I have read countless pieces of literature about bacterial population genomics, surveillance and metagenomics over the past four years, and I am supposed to be familiar with the content of this course. Nevertheless, the course remains quite helpful to me since it has led me to build a comprehensive knowledge framework of metagenomics from individual concepts. Here, I focus on knowledge that was once unfamiliar or ambiguous to me and that may be new to some readers as well. More information can be found in the course materials on Coursera.

To tree or not to tree: an introduction of phylogenetic networks

Phylogenetic reconstruction is of crucial importance for elucidating bacterial population structure, epidemiology and evolutionary histories. Phylogenetic networks and trees are by far the most common approaches used for studying the evolutionary history of a bacterial population. However, the concepts and methodology underlying phylogenetic reconstruction can be challenging for beginners. As such, I share my notes on relevant literature in this post to address these obstacles. In particular, I compare different kinds of phylogenetic networks to show their pros and cons under various conditions.