Setting up Xubuntu in VirtualBox for bioinformatic work
Linux is a popular family of operating systems (OS) used in bioinformatics. Amongst its numerous distributions, Xubuntu is a lightweight derivative of Ubuntu Linux that aims to run on machines with modest hardware. As a Windows user, I often need to switch to a Linux environment for program development and testing. To this end, VirtualBox offers an easy-to-use, resource-light alternative to a dedicated physical machine or a separate disk partition (dual boot). This post records my key steps for setting up Xubuntu in VirtualBox for basic bioinformatic work.
- VirtualBox 6.0
- Linux: Xubuntu 18.04 (Bionic Beaver)
1. Accessing code repositories
1.1. Microsoft OneDrive
Use the OneDrive Free Client to access your personal account [1].
sudo apt-get install onedrive
onedrive
Authorize this app visiting:
https://login.microsoftonline.com/common/oauth2/v2.0/authorize?client_id=...
Enter the response uri:
Visit the authorisation URL in your web browser, then copy the response URL from the browser (it looks something like https://login.microsoftonline.com/common/oauth2/nativeclient?code=xxxxx-xxxx-xxx) and paste it at the prompt for a uri in the terminal. Press the Enter key and you will have a OneDrive folder synchronised in your home directory. Afterwards, run the command onedrive each time you want to synchronise files.
- OneDrive on Ubuntu does not recognise directories whose names start with a period (such as “.git” for Git). As a result, a Git repository cannot currently be synchronised between Windows and Linux through OneDrive, and the .git directory may fall out of sync if users push commits from both operating systems.
- In addition, I found that the command onedrive -m works better than onedrive at keeping files up to date between the two systems when OneDrive is running in both the host OS (Windows) and the guest OS (Xubuntu): changes made under Xubuntu are sometimes not uploaded by the latter command.
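To see which directories are affected by the period-prefix limitation, the hidden directories under the synchronised folder can be listed with find. This is only a minimal sketch: the function name list_hidden_dirs and the folder ~/OneDrive are assumptions for illustration, not part of the OneDrive client.

```shell
# List directories whose names start with a period (e.g. ".git")
# below a given folder. The function name is hypothetical.
list_hidden_dirs() {
    # -mindepth 1 skips the top folder itself;
    # -name '.*' matches directory names that begin with a period
    find "$1" -mindepth 1 -type d -name '.*' | sort
}

# Example (assumed location of the synchronised folder):
# list_hidden_dirs "$HOME/OneDrive"
```

Every path printed is a directory that, according to the note above, would not be synchronised.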
1.2. Git and GitHub
sudo apt-get install git-core git-gui git-doc  # install three components of Git
git config --global user.email "firstname.lastname@example.org"
# Clone a public repository
cd Code
git clone https://github.com/wanyuac/BINF_toolkit.git
Follow the Git manual for other manipulations.
2. Installing Conda for managing software environments
I became aware of the importance of software environments (specifically, systems for package management and environment management [2]) after reading a blog post. Accordingly, I tried Conda and its Bioconda channel on my computer.
Conda provides every user with a versatile manager for handling package dependencies. I installed Miniconda3 v4.7.12 on my Xubuntu virtual machine. Astonishingly, the shell script for installing Miniconda is composed of 281,032 lines and has a file size of 70.5 MB.
Most importantly, only a single copy of Conda needs to be installed in a system directory (such as /opt) for all users of a server; users do not need to install their own Conda release (Section 2.1.1). By default, however, Conda is installed in a user’s own home directory and is inaccessible to other users [4].
2.1.1. Multi-user installation
The administrator of a bioinformatics unit can provide users with a centralised Conda system for their routine work. The installation process is as follows:
sudo bash ./Miniconda3-latest-Linux-x86_64.sh
>>> /opt/miniconda3  # Destination directory
Do you wish the installer to initialize Miniconda3 by running conda init? [yes|no]
[no] >>> yes
# Done, simple and fast
which conda
/home/yu/Program/miniconda3/bin/conda  # Still using the private Conda program
echo $PATH
/opt/miniconda3/bin:/opt/miniconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
/opt/miniconda3/bin/conda info  # Still uses the user config file ~/.condarc
# Create a symbolic link in /usr/local/bin/ so that every user can access conda
# Note that by default /usr/local/bin/ is included in everyone's $PATH.
sudo ln -s /opt/miniconda3/bin/conda /usr/local/bin/conda
which conda
/opt/miniconda3/bin/conda
# Now basically users do not need to install conda under their home directories.
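When a private and a system-wide installation coexist, it can help to list every copy of conda reachable through $PATH, in search order. The following is a sketch only; find_on_path is a hypothetical helper, not a Conda command.

```shell
# Print every executable called "$1" found in a colon-separated
# path list ("$2", defaulting to $PATH), in search order.
# The body runs in a subshell so that IFS and globbing settings do not leak.
find_on_path() (
    name=$1
    path_list=${2:-$PATH}
    IFS=:      # split the list on colons only
    set -f     # disable globbing so entries are taken literally
    for dir in $path_list; do
        [ -n "$dir" ] || continue
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
        fi
    done
)

# Example:
# find_on_path conda
```

The first line printed is the copy that the shell would run. In bash, type -a conda gives similar information.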
conda init adds a section, from “# >>> conda initialize >>>” to “# <<< conda initialize <<<”, to ~/.bashrc. This section adds Conda to $PATH and activates a Conda environment at login. The command conda info prints the location of the user config file .condarc (at ~/.condarc).
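Whether conda init has already modified a start-up file can be checked by searching for the two marker lines. A minimal sketch; the function name has_conda_init_block is an assumption for illustration.

```shell
# Succeed if a shell start-up file contains the block written by "conda init".
# The function name is hypothetical; the marker lines are those quoted in the text.
has_conda_init_block() {
    grep -q '^# >>> conda initialize >>>' "$1" &&
    grep -q '^# <<< conda initialize <<<' "$1"
}

# Example:
# has_conda_init_block "$HOME/.bashrc" && echo "conda init has been run"
```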
Remove the user’s previous private installation where possible:
rm -rf ~/Program/miniconda3
Based on the discussion in a post, it does not seem to be a good idea to create shared environments for all users. Instead, we could either share an environment YML file or create several program modules on the server.
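Sharing an environment file could look like the sketch below. The function name, the file name py3.yml, the environment name py3 and the package list are all illustrative assumptions; conda env create --file and conda env export are the Conda commands that read and write such files.

```shell
# Write a minimal environment file that other users can build from.
# Function name, environment name and package list are assumptions.
write_env_file() {
    cat > "$1" <<'EOF'
name: py3
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - python=3.7
  - biopython
EOF
}

# Each user would then create a private copy of the environment:
# write_env_file py3.yml
# conda env create --file py3.yml
#
# An existing environment can be exported instead of written by hand:
# conda env export --name py3 > py3.yml
```

With this approach every user owns the resulting environment, so nobody needs write access to a system directory.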
2.1.2. Single-user installation (deprecated approach)
Installing a private copy of Conda does not require root privileges:
cd ~/Program
# Download the installer for Python 3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh  # Installation
>>> ~/Program/miniconda3  # Specify a non-default path for installation
conda info  # Display information of the current installation
which conda  # Check the installation path of conda
/home/yu/Program/miniconda3/bin/conda  # This directory is inaccessible to other users.
There is no difference between the usage of a centralised Conda installation and a private installation.
Bioconda is a bioinformatic channel for Conda. Installing it essentially means adding Bioconda and the channels it depends on to Conda. The installation instructions also include a method for creating a new environment. The channels are added as follows:
conda config --add channels defaults  # Most likely, it has been registered upon installation.
conda config --add channels bioconda
conda config --add channels conda-forge  # Highest priority
cat .condarc
channels:
  - conda-forge
  - bioconda
  - defaults
conda env list
# conda environments:
#
base                  *  /opt/miniconda3
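The channel order matters because the first channel listed in .condarc has the highest priority. As a small sketch, the order can be read back from the file with awk; the function name list_channels is hypothetical, and the parser assumes the plain “channels:” list layout shown above rather than arbitrary YAML.

```shell
# Print the channels listed in a .condarc file, highest priority first.
# This is not a general YAML parser; it only handles a simple "channels:" list.
list_channels() {
    awk '
        /^channels:/ { in_list = 1; next }      # start of the channel list
        in_list && /^[ \t]*-/ {                 # a list item: strip the leading dash
            sub(/^[ \t]*-[ \t]*/, "")
            print
            next
        }
        { in_list = 0 }                         # any other line ends the list
    ' "$1"
}

# Example:
# list_channels "$HOME/.condarc"
```

Conda itself can report the same setting with conda config --show channels.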
Now Bioconda is enabled.
3. Establishing software environments
In practice, creating several minimal software environments (e.g., an R-only or Python-only environment) for specific analyses is less likely to run into incompatibilities between packages than creating one mixed environment for all possible analyses. This is a major strength of Conda and other environment managers in bioinformatics.
3.1. Python and Biopython
3.1.1. Private environment
conda create --name py3 python=3.7  # Environment location: /home/yu/.conda/envs/py3
conda activate py3  # Python 3.7.3 is launched
(py3) yu@Sysbio:~$ python -V  # "(py3)" shows the current active environment.
Python 3.7.3
conda install --channel conda-forge biopython  # Under py3
I noted that ipython and jupyter have been installed as well. Since some bioinformatic software, such as SRST2, still relies on Python 2, I created a Python 2 environment as well.
conda create --name py2 python=2.7  # It will install Python 2.7.15 from channel conda-forge
(py3) yu@Sysbio:~$ conda activate py2  # Switch to the new environment
(py2) yu@Sysbio:~$ python -V  # Returns Python 2.7.15
(py2) yu@Sysbio:~$ conda deactivate
(py3) yu@Sysbio:~$  # Returns to the py3 environment
(py3) yu@Sysbio:~$ conda info --envs  # The same as the command "conda env list"
# conda environments:
#
py2                      /home/yu/.conda/envs/py2
py3                   *  /home/yu/.conda/envs/py3  # The asterisk denotes the current environment.
base                     /opt/miniconda3
Install Biopython for Python 2:
(py2) yu@Sysbio:~$ conda install --channel conda-forge biopython # Biopython 1.74
Display a list of packages in each environment:
(py2) yu@Sysbio:~$ conda list --name py2
# packages in environment at /home/yu/Program/miniconda3/envs/py2:
#
# Name            Version     Build            Channel
_libgcc_mutex     0.1         main
biopython         1.74        py27h516909a_0   conda-forge
ca-certificates   2019.9.11   hecc5488_0       conda-forge
certifi           2019.9.11   py27_0           conda-forge
libblas           3.8.0       14_openblas      conda-forge
libcblas          3.8.0       14_openblas      conda-forge
libffi            3.2.1       he1b5a44_1006    conda-forge
libgcc-ng         9.1.0       hdf63c60_0
libgfortran-ng    7.3.0       hdf63c60_2       conda-forge
liblapack         3.8.0       14_openblas      conda-forge
libopenblas       0.3.7       h6e990d7_2       conda-forge
libstdcxx-ng      9.1.0       hdf63c60_0
ncurses           6.1         hf484d3e_1002    conda-forge
numpy             1.16.4      py27h95a1406_0   conda-forge
openssl           1.1.1c      h516909a_0       conda-forge
pip               19.3.1      py27_0           conda-forge
python            2.7.15      h5a48372_1009    conda-forge
readline          8.0         hf8c457e_0       conda-forge
setuptools        41.4.0      py27_0           conda-forge
sqlite            3.30.1      hcee41ef_0       conda-forge
tk                8.6.9       hed695b0_1003    conda-forge
wheel             0.33.6      py27_0           conda-forge
zlib              1.2.11      h516909a_1006    conda-forge
(py3) yu@Sysbio:~$ gcc --version
gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
(py2) yu@Sysbio:~$ gcc --version  # Shares the same GCC as the py3 environment
gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
Both environments share the same Biopython version. Nevertheless, as I said in the previous section, a comprehensive environment (e.g., the base environment) may not be preferable.
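To compare the version of one package between environments, the relevant row can be pulled out of the conda list output. A small sketch; pkg_version is a hypothetical helper that reads the listing from standard input.

```shell
# Print the version (second column) of the package named "$1",
# reading "conda list" output from standard input.
# Header lines start with "#", so they never match a package name.
pkg_version() {
    awk -v pkg="$1" '$1 == pkg { print $2 }'
}

# Example:
# conda list --name py2 | pkg_version biopython
# conda list --name py3 | pkg_version biopython
```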
3.1.2. New shared environment for all users
We can create a shared environment in a system directory for all users:
sudo conda create --prefix /opt/miniconda3/envs/ --name py3.5  # Illegal command. Cannot use --prefix and --name at the same time.
sudo conda create --prefix /opt/miniconda3/envs/py3.5 python=3.5.2  # The directory name "py3.5" in the prefix becomes the environment name.
# To activate this environment, use
#
#     $ conda activate /opt/miniconda3/envs/py3.5
#
# To deactivate an active environment, use
#
#     $ conda deactivate
(base) yu@Sysbio:~$ conda activate /opt/miniconda3/envs/py3.5
(py3.5) yu@Sysbio:~$  # The environment prefix looks concise, which is desirable.
# In fact, we can activate the shared environment py3.5 using a simpler command:
(base) yu@Sysbio:~$ conda activate py3.5
(py3.5) yu@Sysbio:~$
(py3.5) yu@Sysbio:~$ conda env list  # Environments are recorded in ~/.conda/environments.txt.
# conda environments:
#
py2                      /home/yu/.conda/envs/py2
py3                      /home/yu/.conda/envs/py3
r3.6                     /home/yu/.conda/envs/r3.6
base                     /opt/miniconda3
py3.5                 *  /opt/miniconda3/envs/py3.5
Another user can use the shared environment through the following steps:
user1@Sysbio:~$ conda init bash  # Only need to run this command once. Restart the console afterwards.
(base) user1@Sysbio:~$ conda env list  # The new environment appears on the list.
# conda environments:
#
base                  *  /opt/miniconda3
py3.5                    /opt/miniconda3/envs/py3.5
(base) user1@Sysbio:~$ conda activate py3.5  # We only need to type the key rather than the full path to activate an environment.
(py3.5) user1@Sysbio:~$
(py3.5) yu@Sysbio:~$ echo $PATH
/opt/miniconda3/envs/py3.5/bin:/opt/miniconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
Nonetheless, only the administrator (a user who is recorded in the sudoers file and has the root password) can make changes to the shared environment. This is a desirable safety control. In addition, as the last line of the output above shows, the directory of installed programs under the environment py3.5 (/opt/miniconda3/envs/py3.5/bin) is added to $PATH when the environment is active. The directory “/opt/miniconda3/condabin” is common to all environments since it only contains the Conda program.
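Since programs are looked up in $PATH from left to right, an active environment takes precedence only when its bin directory comes first. The check below is a sketch under that assumption; both function names are hypothetical, while CONDA_PREFIX is the variable that conda activate sets to the prefix of the active environment.

```shell
# Return the first directory of a colon-separated path list ($1, default: $PATH).
first_path_entry() (
    list=${1:-$PATH}
    printf '%s\n' "${list%%:*}"   # drop everything from the first colon onwards
)

# Succeed if the bin directory of the environment prefix ($1)
# leads the path list ($2, default: $PATH).
env_is_first_on_path() {
    [ "$(first_path_entry "${2:-$PATH}")" = "$1/bin" ]
}

# Example:
# env_is_first_on_path "$CONDA_PREFIX" && echo "active environment takes precedence"
```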
3.1.3. Shared default environment “base” (deprecated)
I do not recommend changing the shared environment “base” because it is the default environment of every user. Despite this, I made a few changes to the base environment as a demonstration: since Python 3.7.4 and its dependencies had been installed with Miniconda, I updated these packages using the following commands in order to obtain the latest compatible versions from the conda-forge channel (which has the highest priority):
(base) yu@Sysbio:~$ sudo conda update --help  # Check options for update
(base) yu@Sysbio:~$ sudo conda update --channel conda-forge python
python -V
Python 3.7.3 | packaged by conda-forge | (default, Jul 1 2019, 21:52:21) [GCC 7.3.0] :: Anaconda, Inc. on linux
To install Biopython through Anaconda:
(base) yu@Sysbio:~$ sudo conda install --channel conda-forge biopython # Search biopython in channel conda-forge
Done. I restored the default base environment afterwards.
3.1.4. Deactivating an environment
Finally, deactivation of an environment involves removal of the path to the software directory of the current environment from $PATH. In practice, it is easy to quit any Conda environment and return to the system’s default - just deactivate the base environment:
(base) yu@Sysbio:~$ echo $PATH
/opt/miniconda3/bin:/opt/miniconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
(base) yu@Sysbio:~$ conda deactivate
yu@Sysbio:~$ echo $PATH
/opt/miniconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
yu@Sysbio:~$ which conda
/opt/miniconda3/condabin/conda
yu@Sysbio:~$ python  # System's default environment
Python 2.7.15+ (default, Oct 7 2019, 17:39:04)
Note that the path “/opt/miniconda3/bin” (which contains the software installed in the base environment at /opt/miniconda3/, including the programs “activate” and “deactivate”) was removed from $PATH, whereas “/opt/miniconda3/condabin”, which only contains the conda program, was retained after deactivation of the base environment.
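The effect of deactivation on $PATH can be imitated by dropping one entry from a colon-separated list. This sketch only illustrates the outcome seen in the output above; it is not how Conda implements deactivation, and remove_path_entry is a hypothetical name.

```shell
# Remove every occurrence of one directory ($2) from a colon-separated list ($1).
# The body runs in a subshell so that IFS and globbing settings do not leak.
remove_path_entry() (
    result=
    IFS=:      # split the list on colons only
    set -f     # disable globbing so entries are taken literally
    for dir in $1; do
        [ "$dir" = "$2" ] && continue
        result=${result:+$result:}$dir   # re-join the kept entries with colons
    done
    printf '%s\n' "$result"
)

# Example:
# remove_path_entry "$PATH" /opt/miniconda3/bin
```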
3.2. R and packages
conda create --name r3.6
conda activate r3.6  # Switch to the new environment before installing into it
conda install --channel conda-forge r-essentials r-base  # Find an R recipe in the channel and install the corresponding R components
which R  # Display where the R program is: /home/yu/.conda/envs/r3.6/bin/R
R  # Run R
# Install packages via conda
conda install --channel conda-forge r-ggplot2
(r3.6) yu@Sysbio:~$ conda env list  # Print paths of all environments
# conda environments:
#
py2                      /home/yu/.conda/envs/py2
py3                      /home/yu/.conda/envs/py3
r3.6                  *  /home/yu/.conda/envs/r3.6
base                     /opt/miniconda3
R 3.6.1 (released on 5 July 2019) has been installed successfully under the r3.6 environment (at /home/yu/.conda/envs/r3.6/bin/R). Note that the ggtree package is not available from the current channel.
See the Anaconda documentation for more instructions.
RStudio 1.2.5001 - Ubuntu 18/Debian 10 (64-bit) was downloaded from the RStudio website and installed automatically (path to the program: /usr/lib/rstudio/bin/rstudio). RStudio can then be found under the Development submenu in the Whisker Menu.
Configuring RStudio for R installed through Conda
By default, RStudio does not recognise the R program (3.6.1) installed through Conda. Instead, it launches the R version 3.4.4 pre-installed as /usr/lib/R/bin/R.
Solution: add the following command to ~/.profile [3].
export RSTUDIO_WHICH_R=$HOME/.conda/envs/r3.6/bin/R # Specify the R version for Rstudio
Log out of the current Xubuntu session, log in again and re-open RStudio; R 3.6.1 is now launched correctly.
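When the export line is added to ~/.profile from a script, it is worth making sure that repeated runs do not duplicate it. A minimal sketch; append_once is a hypothetical helper name.

```shell
# Append a line ($2) to a file ($1) only if that exact line is not there yet.
append_once() {
    # -x: match the whole line; -F: treat the text literally, not as a pattern
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Example (single quotes keep $HOME unexpanded until ~/.profile is read):
# append_once "$HOME/.profile" 'export RSTUDIO_WHICH_R=$HOME/.conda/envs/r3.6/bin/R'
```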
Adding a user to a particular group
sudo adduser --ingroup users user1
Adding user `user1' ...
Adding new user `user1' (1001) with group `users' ...
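After creating the account, its group membership can be verified with id. The helper below is a sketch with a hypothetical name (in_group); it assumes group names without spaces.

```shell
# Succeed if user $1 belongs to group $2 (primary or supplementary).
in_group() {
    # id -nG prints the user's group names separated by spaces
    id -nG "$1" | tr ' ' '\n' | grep -qxF -- "$2"
}

# Example:
# in_group user1 users && echo "user1 is in group users"
```

Note that adduser --ingroup creates a new account with the given primary group. For an account that already exists, sudo adduser user1 users (Debian/Ubuntu) adds it to the group instead.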