Setting up Xubuntu in VirtualBox for bioinformatic work

Linux is a popular family of operating systems (OS) in bioinformatics. Among its numerous distributions, Xubuntu is a lightweight derivative of Ubuntu that aims to run on machines with low system resources. As a Windows user, I often need to switch to a Linux environment for program development and testing. To this end, VirtualBox offers an easy-to-use, low-resource alternative to a dedicated physical machine or a separate disk partition (dual boot). This post records my key steps for setting up Xubuntu in VirtualBox for basic bioinformatic work.

Software versions

  • VirtualBox 6.0
  • Linux: Xubuntu 18.04 (Bionic Beaver)



1. Accessing code repositories

1.1. Microsoft OneDrive

Use OneDrive Free Client to access your personal account [1].

sudo apt-get install onedrive
onedrive

Authorize this app visiting:

https://login.microsoftonline.com/common/oauth2/v2.0/authorize?client_id=...

Enter the response uri:

Copy the response URL from your web browser (it looks something like https://login.microsoftonline.com/common/oauth2/nativeclient?code=xxxxx-xxxx-xxx) and paste it at the terminal prompt asking for a URI. Press the Enter key and a OneDrive folder will be synchronised in your home directory. Afterwards, run the command onedrive whenever you want to synchronise files.

Issues

  • OneDrive on Ubuntu does not recognise directories whose names start with a period (such as “.git” for Git). As a result, a Git repository cannot yet be fully synchronised between Windows and Linux through OneDrive, and the .git directories may diverge if users push commits from both OS.
  • In addition, I found that the command onedrive -m works better than onedrive in keeping files up to date between OS when OneDrive is running in both the host OS (Windows) and the guest OS (Xubuntu): changes made under Xubuntu are sometimes not uploaded by the latter command.
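Before relying on OneDrive for a code folder, it can help to list the directories that fall under the first issue. The following is a minimal sketch: the function name list_hidden_dirs is hypothetical, and the folder path ~/OneDrive is an assumption about where the client places the synchronised folder.

```shell
# List directories whose names start with a period (e.g. ".git") under a folder.
# These are the directories affected by the first issue described above.
list_hidden_dirs() {
    local root="$1"
    # Nothing to report if the folder does not exist.
    [ -d "$root" ] || return 0
    # -mindepth 1 skips the folder itself; -prune avoids descending into matches.
    find "$root" -mindepth 1 -type d -name '.*' -prune -print
}

# Assumed location of the synchronised folder.
list_hidden_dirs "$HOME/OneDrive"
```

Any path printed by this command would need another transfer route, such as pushing to and pulling from GitHub.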

1.2. Git and GitHub

sudo apt-get install git-core git-gui git-doc  # install three components of Git
git config --global user.email "wanyuac@microbialsystems.cn"

# Clone a public repository
cd Code
git clone https://github.com/wanyuac/BINF_toolkit.git

Follow the Git manual for other manipulations.



2. Installing Conda for managing software environments

I became aware of the importance of software environments (specifically, systems for package management and environment management [2]) after reading a blog post. Accordingly, I tried Conda and its Bioconda channel on my computer.

2.1. Conda

Conda provides every user with a versatile manager for handling package dependencies. I installed Miniconda3 v4.7.12 on my Xubuntu virtual machine. Astonishingly, the shell script for installing Miniconda is composed of 281,032 lines and has a file size of 70.5 MB.

Most importantly, we only need to install a single copy of Conda in a system directory (such as /opt) for all users of a server. Users do not need to install their own Conda release (Section 2.1.1). By default, however, Conda is installed in a user’s own home directory and is therefore inaccessible to other users [4].

2.1.1. Multi-user installation

The administrator of a bioinformatics unit can provide users with a centralised Conda system for their routine work. Installation process:

sudo bash ./Miniconda3-latest-Linux-x86_64.sh
>>> /opt/miniconda3  # Destination directory
Do you wish the installer to initialize Miniconda3
by running conda init? [yes|no]
[no] >>> yes  # Done, simple and fast

which conda
/home/yu/Program/miniconda3/bin/conda  # Still using the private Conda program

echo $PATH
/opt/miniconda3/bin:/opt/miniconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin

/opt/miniconda3/bin/conda info  # Still uses the user config file ~/.condarc

# Create a symbolic link in /usr/local/bin/ so that every user can access conda
# Note that by default /usr/local/bin/ is included in everyone's $PATH.
sudo ln -s /opt/miniconda3/bin/conda /usr/local/bin/conda

which conda
/opt/miniconda3/bin/conda  # Now basically users do not need to install conda under their home directories.

The command conda init adds a section, delimited by “# >>> conda initialize >>>” and “# <<< conda initialize <<<”, to ~/.bashrc. This section adds Conda's directories to $PATH and launches a Conda environment at login. The command conda info prints the location of the user config file .condarc (at ~/.condarc).

Remove the user’s previous private installation where possible.
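To see exactly what conda init wrote, the managed section can be printed from the start-up file. This is a minimal sketch: the function name show_conda_block is hypothetical, and it assumes the section is delimited by the lines “# >>> conda initialize >>>” and “# <<< conda initialize <<<”.

```shell
# Print the section that `conda init` manages in a shell start-up file.
show_conda_block() {
    local rc_file="${1:-$HOME/.bashrc}"
    # Nothing to print if the file does not exist.
    [ -f "$rc_file" ] || return 0
    # sed -n with a /start/,/end/ range prints only the lines between the markers.
    sed -n '/^# >>> conda initialize >>>/,/^# <<< conda initialize <<</p' "$rc_file"
}

show_conda_block "$HOME/.bashrc"
```

An empty result suggests that conda init has not been run for the current user.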

rm -rf ~/Program/miniconda3

Shared environments?

It does not seem to be a good idea to create shared environments for all users based on discussions in a post. Instead, we could either share an environment YML file or create several program modules on the server.
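Sharing an environment YML file could look like the sketch below. The file name environment.yml, the environment name py3 and the listed packages are illustrative assumptions rather than a record of my own setup; the conda commands appear only as comments so that the snippet runs without touching any installation.

```shell
# Write a minimal environment file that an administrator could hand to users.
cat > environment.yml <<'EOF'
name: py3
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - python=3.7
  - biopython
EOF

# Alternatively, the file can be exported from an existing environment:
#     conda env export --name py3 > environment.yml
# Each user then builds a private copy of the environment from the file:
#     conda env create --file environment.yml

cat environment.yml
```

Every user ends up with the same package list, while the environments themselves stay in the users' own directories.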

2.1.2. Single-user installation (deprecated)

Installation of a private Conda program does not require the root privilege:

cd ~/Program

# Download the installer based on Python 3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh  # Installation
>>> ~/Program/miniconda3  # Specify a non-default path for installation

conda info  # Display information of the current installation

which conda  # Check the installation path of conda
/home/yu/Program/miniconda3/bin/conda  # This directory is inaccessible to other users.

There is no difference between the usage of a centralised Conda installation and a private installation.

2.2. Bioconda

Bioconda is a bioinformatic channel for Conda. Enabling it amounts to adding Bioconda and the channels it depends on to Conda's configuration. The installation instructions also describe how to create a new environment with the command conda create.

conda config --add channels defaults  # Most likely, it has been registered upon installation.
conda config --add channels bioconda
conda config --add channels conda-forge  # Highest priority

cat ~/.condarc  # The channel added last is listed first
channels:
  - conda-forge
  - bioconda
  - defaults
  
conda env list
# conda environments:
#
base                  *  /opt/miniconda3

Now Bioconda is enabled.
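To double-check which channel currently has the highest priority, one option is the command conda config --show channels. The sketch below does a similar job by reading a .condarc-style file directly. The function name top_channel is hypothetical, and the parsing assumes the simple list layout shown above.

```shell
# Print the first entry under "channels:" in a .condarc-style file.
# The first entry is the channel with the highest priority.
top_channel() {
    [ -f "$1" ] || return 0
    awk '/^channels:/ { in_list = 1; next }
         in_list && /^[[:space:]]*-/ {
             sub(/^[[:space:]]*-[[:space:]]*/, "")  # strip the leading "  - "
             print
             exit
         }
         in_list && /^[^[:space:]]/ { exit }        # left the channels list' "$1"
}

top_channel "$HOME/.condarc"
```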



3. Establishing software environments

In practice, creating several minimal software environments (e.g., an R-only or Python-only environment) for specific analyses is less likely to cause incompatibility between packages than creating a single mixed environment for all possible analyses. This trick reflects the strength of using Conda and other environment managers in bioinformatics.

3.1. Python and Biopython

3.1.1. Private environment

Online instructions

conda create --name py3 python=3.7  # Environment location: /home/yu/.conda/envs/py3
conda activate py3  # Python 3.7.3 is launched

(py3) yu@Sysbio:~$ python -V  # "(py3)" shows the current active environment.
Python 3.7.3

conda install --channel conda-forge biopython  # Under py3

I noted that ipython and jupyter have been installed as well. Since some bioinformatic software, such as SRST2, still relies on Python 2, I created a Python 2 environment as well.

conda create --name py2 python=2.7  # It will install Python 2.7.15 from channel conda-forge
(py3) yu@Sysbio:~$ conda activate py2  # Switch to the new environment
(py2) yu@Sysbio:~$ python -V  # Returns Python 2.7.15
(py2) yu@Sysbio:~$ conda deactivate
(py3) yu@Sysbio:~$   # Returns to the py3 environment
(py3) yu@Sysbio:~$ conda info --envs  # The same as the command "conda env list"
# conda environments:
#
py2                      /home/yu/.conda/envs/py2
py3                   *  /home/yu/.conda/envs/py3  # The asterisk denotes the current environment.
base                     /opt/miniconda3

Install Biopython for Python 2:

(py2) yu@Sysbio:~$ conda install --channel conda-forge biopython  # Biopython 1.74

Display a list of packages in each environment:

(py2) yu@Sysbio:~$ conda list --name py2
# packages in environment at /home/yu/Program/miniconda3/envs/py2:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
biopython                 1.74             py27h516909a_0    conda-forge
ca-certificates           2019.9.11            hecc5488_0    conda-forge
certifi                   2019.9.11                py27_0    conda-forge
libblas                   3.8.0               14_openblas    conda-forge
libcblas                  3.8.0               14_openblas    conda-forge
libffi                    3.2.1             he1b5a44_1006    conda-forge
libgcc-ng                 9.1.0                hdf63c60_0  
libgfortran-ng            7.3.0                hdf63c60_2    conda-forge
liblapack                 3.8.0               14_openblas    conda-forge
libopenblas               0.3.7                h6e990d7_2    conda-forge
libstdcxx-ng              9.1.0                hdf63c60_0  
ncurses                   6.1               hf484d3e_1002    conda-forge
numpy                     1.16.4           py27h95a1406_0    conda-forge
openssl                   1.1.1c               h516909a_0    conda-forge
pip                       19.3.1                   py27_0    conda-forge
python                    2.7.15            h5a48372_1009    conda-forge
readline                  8.0                  hf8c457e_0    conda-forge
setuptools                41.4.0                   py27_0    conda-forge
sqlite                    3.30.1               hcee41ef_0    conda-forge
tk                        8.6.9             hed695b0_1003    conda-forge
wheel                     0.33.6                   py27_0    conda-forge
zlib                      1.2.11            h516909a_1006    conda-forge

(py3) yu@Sysbio:~$ gcc --version
gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0

(py2) yu@Sysbio:~$ gcc --version  # Shares the same GCC as the py3 environment
gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0

Both environments share the same Biopython version. Nevertheless, as I said in the previous section, a comprehensive environment (e.g., the base environment) may not be preferable.

3.1.2. New shared environment for all users

We can create a shared environment in a system directory for all users:

sudo conda create --prefix /opt/miniconda3/envs/ --name py3.5  # Illegal command. Cannot use --prefix and --name at the same time.
sudo conda create --prefix /opt/miniconda3/envs/py3.5 python=3.5.2  # The directory name "py3.5" in the prefix becomes the environment name.
# To activate this environment, use
#
#     $ conda activate /opt/miniconda3/envs/py3.5
#
# To deactivate an active environment, use
#
#     $ conda deactivate
(base) yu@Sysbio:~$ conda activate /opt/miniconda3/envs/py3.5
(py3.5) yu@Sysbio:~$  # The environment prefix looks concise, which is desirable.

# In fact, we can activate the shared environment py3.5 using a simpler command:
(base) yu@Sysbio:~$ conda activate py3.5
(py3.5) yu@Sysbio:~$

(py3.5) yu@Sysbio:~$ conda env list  # Environments are recorded in ~/.conda/environments.txt.
# conda environments:
#
py2                      /home/yu/.conda/envs/py2
py3                      /home/yu/.conda/envs/py3
r3.6                     /home/yu/.conda/envs/r3.6
base                     /opt/miniconda3
py3.5                 *  /opt/miniconda3/envs/py3.5

Another user can use the shared environment through the following steps:

user1@Sysbio:~$ conda init bash  # Only need to run this command once. Restart the console afterwards.

(base)user1@Sysbio:~$ conda env list  # The new environment appears on the list.
# conda environments:
#
base                  *  /opt/miniconda3
py3.5                    /opt/miniconda3/envs/py3.5

(base)user1@Sysbio:~$ conda activate py3.5  # We only need to type the environment name rather than the full path to activate an environment.
(py3.5)user1@Sysbio:~$

(py3.5) yu@Sysbio:~$ echo $PATH
/opt/miniconda3/envs/py3.5/bin:/opt/miniconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin

Nonetheless, only an administrator (a user who is listed in the sudoers file and can therefore run commands with root privileges) can make changes to the shared environment. This is a desirable safety control. In addition, as the last line of the output above shows, the directory of installed programs (/opt/miniconda3/envs/py3.5/bin) of the environment py3.5 is added to the front of $PATH when the environment is active. The directory “/opt/miniconda3/condabin” is common to all environments since it only contains the Conda program.
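A user can check whether a shared environment is writable for them before trying to install anything into it. This is a minimal sketch: the function name can_modify_env is hypothetical, and the prefix is the one used in this section.

```shell
# Succeed only if the environment directory exists and is writable
# by the current user.
can_modify_env() {
    local prefix="$1"
    [ -d "$prefix" ] && [ -w "$prefix" ]
}

env_prefix="/opt/miniconda3/envs/py3.5"
if can_modify_env "$env_prefix"; then
    echo "You can install packages into $env_prefix"
else
    echo "Read-only for you (or missing): $env_prefix"
fi
```

For an ordinary user the second message is the expected outcome, which matches the safety control described above.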

3.1.3. Shared default environment “base” (deprecated)

I do not recommend changing the shared environment “base” because it is the default environment of every user. Despite this, I made a few changes to the base environment as a demonstration: since Python 3.7.4 and its dependencies had been installed with Miniconda, I updated these packages using the following commands in order to obtain the latest compatible versions from the conda-forge channel (which has the highest priority):

(base) yu@Sysbio:~$ sudo conda update --help  # Check options for update
(base) yu@Sysbio:~$ sudo conda update --channel conda-forge python
python -V
  Python 3.7.3 | packaged by conda-forge | (default, Jul  1 2019, 21:52:21) 
  [GCC 7.3.0] :: Anaconda, Inc. on linux

To install Biopython through Conda:

(base) yu@Sysbio:~$ sudo conda install --channel conda-forge biopython  # Search biopython in channel conda-forge

Done. I restored the default base environment afterwards.

3.1.4. Deactivating an environment

Finally, deactivation of an environment involves removal of the path to the software directory of the current environment from $PATH. In practice, it is easy to quit any Conda environment and return to the system’s default - just deactivate the base environment:

(base) yu@Sysbio:~$ echo $PATH
/opt/miniconda3/bin:/opt/miniconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin

(base) yu@Sysbio:~$ conda deactivate

yu@Sysbio:~$ echo $PATH
/opt/miniconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin

yu@Sysbio:~$ which conda
/opt/miniconda3/condabin/conda

yu@Sysbio:~$ python  # System's default environment
Python 2.7.15+ (default, Oct  7 2019, 17:39:04) 

Note that the path “/opt/miniconda3/bin”, which holds the software installed in the base environment at /opt/miniconda3/ (including the programs “activate” and “deactivate”), was removed from $PATH, whereas “/opt/miniconda3/condabin”, which only contains the conda program, was retained after deactivation of the base environment.
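The effect of deactivation on $PATH can be imitated with a few standard tools. The sketch below is only an illustration of the outcome, not Conda's actual implementation, and the function name remove_from_path is hypothetical.

```shell
# Drop one directory from a colon-separated list and print the result.
# $1: directory to drop; $2: list to edit (defaults to the current $PATH).
remove_from_path() {
    local dir="$1" list="${2:-$PATH}"
    printf '%s\n' "$list" \
        | tr ':' '\n' \
        | grep -vxF -- "$dir" \
        | paste -sd ':' -
}

# Imitate deactivating the base environment on a shortened example list.
remove_from_path /opt/miniconda3/bin \
    "/opt/miniconda3/bin:/opt/miniconda3/condabin:/usr/bin"
# prints /opt/miniconda3/condabin:/usr/bin
```

The option -x makes grep match whole lines only, so /opt/miniconda3/condabin is not removed along with /opt/miniconda3/bin.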

3.2. R and packages

conda create --name r3.6
conda activate r3.6  # Switch to the new environment so that R is installed there
conda install --channel conda-forge r-essentials r-base  # Find an R recipe in the channel and install the corresponding R components
which R  # Display where the R program is: /home/yu/.conda/envs/r3.6/bin/R
R  # Run R

# Install packages via conda
conda install --channel conda-forge r-ggplot2

(r3.6) yu@Sysbio:~$ conda env list  # Print paths of all environments
# conda environments:
#
py2                      /home/yu/.conda/envs/py2
py3                      /home/yu/.conda/envs/py3
r3.6                  *  /home/yu/.conda/envs/r3.6
base                     /opt/miniconda3

R 3.6.1 (released on 5 July 2019) has been installed successfully under the r3.6 environment. Note that the ggtree package is not available from the current channel.

See the Anaconda documentation for more instructions.

3.3. RStudio

RStudio 1.2.5001 - Ubuntu 18/Debian 10 (64-bit) was downloaded from the RStudio website and installed automatically (path to the program: /usr/lib/rstudio/bin/rstudio). RStudio can then be found under the Development submenu of the Whisker Menu.

Configuring RStudio for R installed through Conda

By default, RStudio does not recognise the R program (3.6.1) installed through Conda. Instead, it launches the R version 3.4.4 pre-installed as /usr/lib/R/bin/R.

Solution: add the following command to ~/.profile [3].

export RSTUDIO_WHICH_R=$HOME/.conda/envs/r3.6/bin/R  # Specify the R version for RStudio

Log out of the current Xubuntu session, log in again and re-open RStudio. R 3.6.1 is now launched correctly.
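A slightly more defensive variant for ~/.profile sets the variable only when the Conda R program actually exists, so that RStudio keeps its default R if the environment is later removed. This is a sketch under that assumption, and the function name set_rstudio_r is hypothetical.

```shell
# Export RSTUDIO_WHICH_R only if the given R program exists and is executable.
set_rstudio_r() {
    if [ -x "$1" ]; then
        export RSTUDIO_WHICH_R="$1"
    else
        echo "R not found at $1; RStudio keeps its default R" >&2
        return 1
    fi
}

# Path of the R program in the r3.6 environment created above.
set_rstudio_r "$HOME/.conda/envs/r3.6/bin/R" || true
```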



References

  1. https://askubuntu.com/questions/958406/how-to-setup-onedrive-in-ubuntu-17-04.
  2. https://docs.conda.io/en/latest/.
  3. https://support.rstudio.com/hc/en-us/articles/200486138-Changing-R-versions-for-RStudio-desktop.
  4. https://docs.conda.io/projects/conda/en/latest/user-guide/configuration/admin-multi-user-install.html.



Appendix

Adding a user to a particular group

sudo adduser --ingroup users user1
Adding user `user1' ...
Adding new user `user1' (1001) with group `users' ...
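Group membership can be verified afterwards with the id command. The following is a minimal sketch: the function name in_group is hypothetical, and user1 and users are the names from the example above.

```shell
# Succeed if the given user belongs to the given group.
in_group() {
    local user="$1" group="$2"
    # id -nG prints the names of all groups of the user on one line.
    id -nG "$user" 2>/dev/null | tr ' ' '\n' | grep -qxF -- "$group"
}

if in_group user1 users; then
    echo "user1 is a member of users"
else
    echo "user1 is not a member of users (or does not exist)"
fi
```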