I finally found some time this morning to write a Python script, gbk2tsv.py, which converts one or more GenBank files into tab-delimited feature tables (plain-text files with the extension “.tsv”). It is useful when we need to summarise genome annotations or extract the nucleotide and protein sequences of particular genomic features. Although the Holt Lab, where I did my PhD, has an in-house script for a similar job, it would be inappropriate for me to use or share that intellectual property for projects outside the Holt Lab without specific permission. I therefore decided to write a script from scratch after discussing genome annotation with Hao Luo, a PhD student at Chalmers University of Technology, Sweden, during a lunch break of the course MESB19.
usage: gbk2tsv.py [-h] -g GBKS [GBKS ...] [-o OUTDIR] [-f FEATURES] [-n] [-p]

Convert GenBank files to tab-delimited text files

optional arguments:
  -h, --help            show this help message and exit
  -g GBKS [GBKS ...], --gbk GBKS [GBKS ...]
                        Input GenBank files
  -o OUTDIR, --outdir OUTDIR
                        Output directory (no backslash or forward slash)
  -f FEATURES, --features FEATURES
                        Comma-separated features to store (default CDS,tRNA,rRNA)
  -n, --nucl_seq        Turn on this option to print nucleotide sequences of features
  -p, --prot_seq        Turn on this option to print protein sequences of CDS
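For readers who want to build a similar command-line interface, the help message above can be reproduced fairly closely with Python's argparse module. The following is only a sketch reconstructed from the help text, not the code of gbk2tsv.py itself: the function name build_parser and the default output directory "." are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the help message shown above (a sketch)."""
    parser = argparse.ArgumentParser(
        prog="gbk2tsv.py",
        description="Convert GenBank files to tab-delimited text files",
    )
    # nargs="+" collects one or more file names into a list.
    parser.add_argument("-g", "--gbk", dest="gbks", nargs="+", required=True,
                        help="Input GenBank files")
    # Assumption: the real default output directory is not stated in the post.
    parser.add_argument("-o", "--outdir", default=".",
                        help="Output directory (no backslash or forward slash)")
    parser.add_argument("-f", "--features", default="CDS,tRNA,rRNA",
                        help="Comma-separated features to store (default CDS,tRNA,rRNA)")
    parser.add_argument("-n", "--nucl_seq", action="store_true",
                        help="Turn on this option to print nucleotide sequences of features")
    parser.add_argument("-p", "--prot_seq", action="store_true",
                        help="Turn on this option to print protein sequences of CDS")
    return parser
```

Calling build_parser().parse_args() in a script would then give an object whose gbks attribute is a list of file names and whose features attribute can be split on commas to get the feature types to keep.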
The script accepts three forms of input file names:
1. Single GenBank file
python gbk2tsv.py --gbk demo.gbk
2. Multiple GenBank files with known names
python gbk2tsv.py --gbk demo1.gbk demo2.gbk demo3.gbk
3. Multiple GenBank files matched by a wildcard
python gbk2tsv.py --gbk *.gbk
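In all three cases the script simply receives a list of file names. With a wildcard, it is normally the shell (for example bash) that expands *.gbk before Python starts, so a shell that does not expand wildcards would pass the pattern through unchanged. The sketch below shows one way a script could cope with both situations and derive an output file name for each input. The helper names expand_inputs and output_path, and the rule "same base name with a .tsv extension", are assumptions for illustration rather than a description of what gbk2tsv.py does.

```python
import glob
import os


def expand_inputs(names):
    """Expand wildcard patterns; names without a match are kept as given."""
    paths = []
    for name in names:
        matches = sorted(glob.glob(name))
        # Fall back to the literal name so that a missing file is reported
        # later, when the script tries to open it.
        for path in (matches or [name]):
            if path not in paths:  # skip duplicates, keep first-seen order
                paths.append(path)
    return paths


def output_path(gbk_path, outdir="."):
    """Hypothetical naming rule: demo.gbk -> <outdir>/demo.tsv."""
    stem = os.path.splitext(os.path.basename(gbk_path))[0]
    return os.path.join(outdir, stem + ".tsv")
```

Using os.path.join here also avoids having to worry about a trailing slash in the output directory name.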
For details on using Biopython to process GenBank files, see the post by Peter Cock from the University of Warwick.
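To give a rough idea of what such a conversion involves, here is a minimal sketch that reads GenBank records with Biopython's SeqIO.parse and writes selected features as tab-delimited rows. It is not the code of gbk2tsv.py: the column layout, the helper names (first, feature_row, genbank_rows, write_tsv) and the choice of the locus_tag and product qualifiers are all assumptions made for illustration.

```python
import csv
import sys

# Assumed column layout; the real script may use different columns.
COLUMNS = ["record", "feature", "start", "end", "strand", "locus_tag", "product"]


def first(qualifiers, key):
    """Return the first value of a GenBank qualifier, or an empty string."""
    values = qualifiers.get(key, [])
    return values[0] if values else ""


def feature_row(record_id, ftype, start, end, strand, qualifiers):
    """Build one table row; start is 0-based (as in Biopython), output is 1-based."""
    strand_symbol = {1: "+", -1: "-"}.get(strand, ".")
    return [record_id, ftype, start + 1, end, strand_symbol,
            first(qualifiers, "locus_tag"), first(qualifiers, "product")]


def genbank_rows(gbk_path, wanted=("CDS", "tRNA", "rRNA")):
    """Yield one row per wanted feature in a GenBank file."""
    # Imported here so the helpers above can be used without Biopython installed.
    from Bio import SeqIO

    for record in SeqIO.parse(gbk_path, "genbank"):
        for feature in record.features:
            if feature.type in wanted:
                loc = feature.location
                yield feature_row(record.id, feature.type, int(loc.start),
                                  int(loc.end), loc.strand, feature.qualifiers)


def write_tsv(rows, handle=sys.stdout):
    """Write a header line followed by the rows, separated by tabs."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(rows)
```

With Biopython installed, write_tsv(genbank_rows("demo.gbk")) would print such a table for demo.gbk to the screen. Nucleotide sequences could be added in the same loop with feature.extract(record.seq), and protein sequences of CDS features are usually available from the translation qualifier.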