gbk2tsv.py: tabulating genomic features in GenBank files

2019-09-13

I finally found some time this morning to write a Python script, gbk2tsv.py, which converts one or more GenBank files into tab-delimited feature tables (plain-text files with the extension “.tsv”). It is useful for summarising genome annotations or extracting the nucleotide and protein sequences of particular genomic features. Although the Holt Lab, where I did my PhD, has an in-house script that does a similar job, it would be inappropriate for me to use or share that intellectual property in projects outside the Holt Lab without specific permission. I therefore decided to write a script from scratch after a discussion on genome annotation with Hao Luo, a PhD student at Chalmers University of Technology, Sweden, during a lunch break of the course MESB19.

Usage

usage: gbk2tsv.py [-h] -g GBKS [GBKS ...] [-o OUTDIR] [-f FEATURES] [-n] [-p]

Convert GenBank files to tab-delimited text files

optional arguments:
  -h, --help            show this help message and exit
  -g GBKS [GBKS ...], --gbk GBKS [GBKS ...]
                        Input GenBank files
  -o OUTDIR, --outdir OUTDIR
                        Output directory (no backslash or forward slash)
  -f FEATURES, --features FEATURES
                        Comma-separated features to store (default
                        CDS,tRNA,rRNA)
  -n, --nucl_seq        Turn on this option to print nucleotide sequences of
                        features
  -p, --prot_seq        Turn on this option to print protein sequences of CDS
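
The help message above looks like standard argparse output, so the command-line interface can be approximated as follows. This is only a sketch reconstructed from the help text, not the actual source of gbk2tsv.py: the function name build_parser, the dest names and the default output directory "." are my assumptions.

```python
import argparse


def build_parser():
    """Approximate the gbk2tsv.py interface shown in the help message."""
    parser = argparse.ArgumentParser(
        prog="gbk2tsv.py",
        description="Convert GenBank files to tab-delimited text files",
    )
    # nargs="+" lets --gbk take one or more file names
    parser.add_argument("-g", "--gbk", dest="gbks", nargs="+", required=True,
                        help="Input GenBank files")
    # Assumption: the current directory is the default output location
    parser.add_argument("-o", "--outdir", dest="outdir", default=".",
                        help="Output directory (no backslash or forward slash)")
    parser.add_argument("-f", "--features", dest="features",
                        default="CDS,tRNA,rRNA",
                        help="Comma-separated features to store")
    # Both sequence options are simple on/off switches
    parser.add_argument("-n", "--nucl_seq", action="store_true",
                        help="Print nucleotide sequences of features")
    parser.add_argument("-p", "--prot_seq", action="store_true",
                        help="Print protein sequences of CDS")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.gbks, args.outdir, args.features.split(","))
```

Splitting the --features string on commas gives the list of feature types to keep.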

The script accepts three forms of input file names:

1. Single GenBank file

python gbk2tsv.py --gbk demo.gbk

2. Multiple GenBank files with known names

python gbk2tsv.py --gbk demo1.gbk demo2.gbk demo3.gbk

3. Multiple GenBank files matched by a wildcard

python gbk2tsv.py --gbk *.gbk
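
In the third form, shells such as bash expand *.gbk into a list of file names before Python starts, so the script receives the same kind of argument list as in the second form. Some shells (for example, the Windows command prompt) pass the pattern through unchanged. Whether gbk2tsv.py handles that case is not stated here; the sketch below shows one way it could be done with the standard glob module. The function name expand_inputs is hypothetical.

```python
import glob


def expand_inputs(names):
    """Expand wildcard patterns that the shell left unexpanded."""
    paths = []
    for name in names:
        matches = sorted(glob.glob(name))
        # Keep the literal name when nothing matches, so that a later
        # attempt to open it reports the missing file clearly
        paths.extend(matches if matches else [name])
    return paths
```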

For details on using Biopython to process GenBank files, see the post by Peter Cock from the University of Warwick.
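
To give a flavour of the approach, here is a minimal sketch of tabulating features with Biopython. It is not the code of gbk2tsv.py: the column names, the choice of the locus_tag and product qualifiers, and the function names feature_rows and gbk_to_tsv are my assumptions for illustration.

```python
import csv

# Assumed column layout; the real script may use different columns
HEADER = ["record", "type", "locus_tag", "start", "end", "strand", "product"]


def feature_rows(record, wanted=("CDS", "tRNA", "rRNA")):
    """Yield one row per feature of a wanted type in a SeqRecord."""
    for feature in record.features:
        if feature.type not in wanted:
            continue
        qualifiers = feature.qualifiers  # maps qualifier name to a list of values
        yield [
            record.id,
            feature.type,
            qualifiers.get("locus_tag", [""])[0],
            int(feature.location.start) + 1,  # Biopython is 0-based; report 1-based
            int(feature.location.end),
            feature.location.strand,
            qualifiers.get("product", [""])[0],
        ]


def gbk_to_tsv(gbk_path, tsv_path):
    """Write the selected features of every record in a GenBank file to a TSV file."""
    # Imported here so that feature_rows can be used without Biopython installed
    from Bio import SeqIO

    with open(tsv_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(HEADER)
        for record in SeqIO.parse(gbk_path, "genbank"):
            writer.writerows(feature_rows(record))
```

For example, gbk_to_tsv("demo.gbk", "demo.tsv") would write one row per CDS, tRNA and rRNA feature. Sequences could be added as extra columns with feature.extract(record.seq) for nucleotides and the translation qualifier of a CDS for proteins.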