Bioinformatics

Linux one-liners converting GFF3 to FASTA files of contigs

Despite the popularity of the GFF3 format for genome annotations, to my knowledge there are no published tools for extracting the DNA sequences of contigs from GFF3 files and storing them in a multi-FASTA file. EMBOSS seqret can only pull out the last contig from a GFF3 file, whereas other tools aim to extract the DNA sequence of each feature. Therefore, in this post I develop two Linux one-liners for extracting contig sequences from a GFF3 file and writing them to a FASTA file.
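
The idea can also be illustrated in Python. The sketch below is not one of the one-liners from the post. It is a minimal alternative that assumes the GFF3 file embeds its contig sequences after a `##FASTA` directive, as the GFF3 specification allows and as annotation tools commonly do. The function name `extract_fasta_lines` is hypothetical.

```python
def extract_fasta_lines(gff3_lines):
    """Yield the FASTA lines that follow the '##FASTA' directive.

    Assumption: contig sequences are embedded at the end of the GFF3
    file, after a line consisting of '##FASTA'. Files without that
    directive yield nothing.
    """
    in_fasta = False
    for line in gff3_lines:
        line = line.rstrip("\r\n")
        if not in_fasta:
            # Skip annotation lines until the directive is reached.
            if line.strip() == "##FASTA":
                in_fasta = True
            continue
        if line:  # drop blank lines inside the sequence section
            yield line


def gff3_to_fasta(gff3_path, fasta_path):
    """Write the embedded contig sequences of a GFF3 file to a multi-FASTA file."""
    with open(gff3_path) as gff3, open(fasta_path, "w") as fasta:
        for line in extract_fasta_lines(gff3):
            fasta.write(line + "\n")
```

For example, `gff3_to_fasta("genome.gff", "contigs.fasta")` would copy every contig record into `contigs.fasta`; both file names are placeholders.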

Comparisons between SRST2, ARIBA, and KmerResistance

SRST2 [1], ARIBA [2], and KmerResistance (web service, code) [3] are three widely used standalone tools for read-based detection of target genes in bacterial genomes. Published in 2014, SRST2 is recognised as the pioneer amongst these three tools [2, 3]. In this post, I concisely compare the methodologies underlying these tools to inform the selection of appropriate software for gene detection. In particular, I presume here that detecting antimicrobial resistance determinants is the only use case. Unless otherwise specified, the software versions referred to in this post are SRST2 v0.2.0, ARIBA v2.14.4, and KmerResistance v2.2.

Setting up Xubuntu in VirtualBox for bioinformatic work

Linux is a popular family of operating systems (OS) used in bioinformatics. Amongst its numerous distributions, Xubuntu is a lightweight derivative of Ubuntu Linux that aims to run on machines with low system specifications. As a Windows user, I often need to switch to a Linux environment for program development and testing. To this end, VirtualBox offers an easy-to-use, low-resource alternative to a dedicated physical machine or a dedicated disc partition (dual OS). This post records my key steps for setting up Xubuntu in VirtualBox for basic bioinformatic work.

Script gbk2tbl.py now supports Python 3

The Python script gbk2tbl.py in my GitHub repository BINF_toolkit is a popular tool for preparing input files for NCBI Sequin from GenBank files. Nonetheless, this script has only supported Python 2 since its first release in 2015, which has inconvenienced some users. Today, I found some time to make the script compatible with Python 3 using the tool 2to3 and some manual adjustments. The new script has been tested under Python 3.5.2 and pushed to my GitHub repository.

Popular reference databases of antimicrobial resistance genes

Reliable and up-to-date databases play a pivotal role in reference-based detection of antimicrobial resistance genes (ARGs) in bacteria. Nonetheless, these databases differ in content and quality, making it challenging to choose an appropriate reference database for a particular research project. To address this challenge, this post reviews several ARG databases that are publicly available and widely used in bacterial genomics. In particular, I focus on databases that are still regularly maintained, and I do not discuss any programs released with these databases for sequence search or statistical analysis.

gbk2tsv.py: tabulating genomic features in GenBank files

I finally got some time this morning to write a Python script, gbk2tsv.py, which converts several GenBank files into tab-delimited feature tables (plain-text files with the extension “.tsv”). It can be a useful tool when we need to summarise genome annotations or acquire the nucleotide and protein sequences of certain genomic features. Although the Holt Lab, where I did my PhD, has an in-house script that does a similar job, it would be inappropriate for me to use or share that intellectual property for projects outside the Holt Lab without specific permission. Therefore, I decided to create a script from scratch after a discussion on genome annotation with Hao Luo, a PhD student at Chalmers University of Technology, Sweden, during a lunch break of the course MESB19.

Bioinformatic resources for investigating clostridial metabolism

Solventogenic clostridia offer a promising and sustainable alternative to petroleum-based production of butanol, an important industrial chemical feedstock and fuel additive or replacement [1]. They also draw attention for their potential to reduce greenhouse gas emissions and mitigate the threat of global warming. It is of paramount importance to elucidate the metabolism of clostridia for metabolic engineering and industrial applications of gas fermentation. In addition to standard experimental approaches, bioinformatics provides an efficient way to identify targets (genetic or biochemical) that can be controlled to improve product formation. In this post, I briefly summarise the bioinformatic resources that are publicly accessible to date for interrogating clostridial metabolism.

Understanding SRST2 outputs

SRST2 is a widely used tool for screening Illumina reads of bacterial genomes for known genes (that is, targeted gene detection). Its capabilities include MLST profiling and detection of known antimicrobial resistance genes (ARGs), virulence genes, plasmids, etc. I used SRST2 throughout my PhD project and built my package GeneMates on SRST2’s outputs. Here, I explain the output formats of SRST2 to help users gain a better understanding of this versatile tool. Comments and corrections from readers are welcome, since this post is based on my own understanding and experience.

A complete list of my software

This page introduces the computer code that I have developed and published for the research community since 2015.

1. Population genomics

1.1. Detection of horizontal gene co-transfer between bacteria

GeneMates

The latest version: v0.2.2, released on 21 March 2020. (Documentation)

This R package implements my network approach for the detection of intra-species horizontal gene co-transfer (HGcoT) between bacteria. A manuscript describing it is in preparation. GeneMates takes as input bacterial whole-genome sequencing (WGS) data (in the form of short reads and/or genome assemblies) and creates networks showing evidence of HGcoT at the allele level.