A virus identifier for High Throughput Sequencing datasets

VirHunter

VirHunter is a deep learning method that uses Convolutional Neural Networks (CNNs) and a Random Forest classifier to identify viruses in sequencing datasets. More precisely, VirHunter classifies previously assembled contigs as viral, host or bacterial (contamination).

System Requirements

VirHunter installation requires a Unix environment with Python 3.8. It was tested on Linux and macOS. VirHunter is not yet fully compatible with MacBooks that use the M1 chip.

To run VirHunter, we recommend installing its dependencies with conda. If you are installing conda for the first time, we suggest the lightweight Miniconda distribution. Alternatively, you can install the dependencies with pip.

Installation

The full installation process should take less than 15 minutes on a standard computer.

First, clone the repository from GitHub:

git clone https://github.com/cbib/virhunter.git

Then go to the VirHunter root folder:

cd virhunter/

Installing dependencies with Conda

First, create the environment from the envs/environment.yml file. The installation may take around 500 MB of disk space.

conda env create -f envs/environment.yml

Second, activate the environment:

conda activate virhunter

Installing dependencies with pip

If you don't have Conda installed on your system, you can install the Python dependencies with pip:

pip install -r envs/requirements.txt

On macOS you will also need to install wget to run some of the scripts (the Conda installation already includes it). You can do this with the brew package manager:

brew install wget

Testing the VirHunter installation

You can check that VirHunter was installed successfully by running it on the toy dataset we provide. IMPORTANT: the toy dataset is intended only to test that VirHunter works correctly. The models trained on it should not be used for prediction on your datasets!

First, download the toy dataset:

bash scripts/download_test_installation.sh

Then launch the script that tests the training and prediction Python scripts of VirHunter:

bash scripts/test_installation.sh

Using VirHunter for prediction

VirHunter takes as input a fasta file with contigs and outputs, for each contig, a prediction of whether it is viral, host (plant) or bacterial.

For the given contigs VirHunter produces a tab-delimited file with the predictions. The id column stores the fasta header of a contig and the length column stores the length of the contig. The virus, plant and bacteria columns store the number of fragments of the contig that received the corresponding prediction from the RF classifier. Finally, the decision column gives the final decision for a given contig. Refer to this column when filtering viral contigs.
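As an illustration of filtering on the decision column, here is a minimal sketch that reads a tab-delimited prediction file and keeps the ids of viral contigs. The column names are the ones described above. The label "virus" in the decision column and the example rows are assumptions made for this sketch, so check them against a real output file.

```python
import csv
import io


def viral_contigs(handle, label="virus"):
    """Return the ids of contigs whose final decision equals `label`.

    `label="virus"` is an assumed value of the decision column.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["id"] for row in reader if row["decision"] == label]


# In-memory example with made-up rows; for a real file, pass
# open(path, newline="") instead of the StringIO object.
example = io.StringIO(
    "id\tlength\tvirus\tplant\tbacteria\tdecision\n"
    "contig_1\t2000\t2\t0\t0\tvirus\n"
    "contig_2\t1000\t0\t1\t0\tplant\n"
)
print(viral_contigs(example))
```

The function only looks at the id and decision columns, so it keeps working if further columns are present.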

To make predictions, VirHunter needs to be fully trained for fragment sizes 500 and 1000. VirHunter will discard from prediction contigs shorter than 500 bp. The model trained on fragment size 500 will be used for contigs with 750 < length < 1500. The model trained on fragment size 1000 will be used for contigs longer than 1500 bp.
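The length-based choice of model can be sketched as a small function. This is an illustration, not VirHunter's own code. It assumes that the two trained fragment sizes are 500 and 1000. It uses 500 bp as the default lower cutoff, which is left as a parameter because the text also gives 750 bp as the lower bound for the 500 model. The treatment of contigs that sit exactly on a threshold is also an assumption.

```python
from typing import Optional


def fragment_size_for(length: int,
                      min_length: int = 500,
                      switch_length: int = 1500) -> Optional[int]:
    """Pick the fragment size of the model to use for a contig.

    Returns None when the contig is too short to be predicted.
    Boundary handling (exactly min_length or switch_length) is assumed.
    """
    if length < min_length:
        return None   # contig is discarded from prediction
    if length < switch_length:
        return 500    # model trained on 500 bp fragments
    return 1000       # model trained on 1000 bp fragments


for contig_length in (400, 1000, 2000):
    print(contig_length, fragment_size_for(contig_length))
```

Passing min_length=750 reproduces the stricter reading, in which contigs between 500 and 750 bp are also left out.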

Before running VirHunter you have to fill in the configs/predict_config.yaml file.

You can run VirHunter with the pre-trained models. We provide fully trained models for 3 host species (peach, grapevine, sugar beet) and for fragment sizes 500 and 1000. The weights for these models can be downloaded by running the download_weights.sh script:

bash scripts/download_weights.sh

Once the weights are downloaded, add the paths of the model you want to use to configs/predict_config.yaml. For example, to use the model trained on peach, add the paths weights/peach/1000 and weights/peach/500.
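Before launching a prediction it can help to confirm that both weight folders are in place. The sketch below is a hypothetical helper, not part of VirHunter. It assumes the weights/<host>/<fragment size> layout suggested by the example paths above and reports which folders are missing.

```python
from pathlib import Path


def missing_weight_dirs(host, root="weights", sizes=(1000, 500)):
    """Return the expected weight folders that do not exist.

    Assumes the layout <root>/<host>/<fragment size>, as in the example
    paths weights/peach/1000 and weights/peach/500.
    """
    expected = [Path(root) / host / str(size) for size in sizes]
    return [path.as_posix() for path in expected if not path.is_dir()]


missing = missing_weight_dirs("peach")
if missing:
    print("Missing weight folders:", ", ".join(missing))
else:
    print("Both weight folders found.")
```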

The command to run predictions is then:

python virhunter/predict.py configs/predict_config.yaml

Training your own model

You can train your own model, for example for a specific host species. Before training, you need to collect sequence data for three reference datasets: viruses, bacteria and host. Examples are provided by scripts/download_test_installation.sh, which downloads the viruses.fasta, host.fasta and bacteria.fasta files. Real reference datasets should correspond, for example, to the whole genome of the host, all bacteria and all viruses from the NCBI.

Training requires execution of the following steps:

  • prepare the training dataset for the neural network module from fasta files with prepare_ds_nn.py. This step splits the reference datasets into fragments of fixed size (specified in the train_config.yaml file, see below)
  • prepare the training dataset for Random Forest classifier module with prepare_ds_rf.py
  • train the neural network module with train_nn.py
  • train the Random Forest module with train_rf.py
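To make the fragmentation in the first step concrete, here is a minimal sketch of cutting a sequence into fixed-size fragments. It is not the code of prepare_ds_nn.py. The function name, the optional step parameter and the choice to drop a trailing piece shorter than the fragment size are assumptions made for illustration.

```python
from typing import List, Optional


def split_into_fragments(sequence: str, size: int,
                         step: Optional[int] = None) -> List[str]:
    """Cut `sequence` into fragments of exactly `size` characters.

    `step` is the distance between fragment starts; it defaults to `size`,
    which gives non-overlapping fragments. A trailing piece shorter than
    `size` is dropped (an assumption of this sketch).
    """
    if size <= 0:
        raise ValueError("size must be positive")
    step = size if step is None else step
    if step <= 0:
        raise ValueError("step must be positive")
    last_start = len(sequence) - size
    return [sequence[i:i + size] for i in range(0, last_start + 1, step)]


print(split_into_fragments("ACGTACGTAC", size=4))          # non-overlapping
print(split_into_fragments("ACGTACGTAC", size=4, step=2))  # overlapping
```

A sequence shorter than the fragment size yields an empty list, which mirrors the idea that short contigs cannot be represented at that fragment size.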

Successful training of VirHunter produces weights for the three neural networks of the first module and weights for the trained Random Forest classifier. They can subsequently be used for prediction.

To execute the training steps, you must first fill in configs/train_config.yaml. This file already lists all the expected inputs. Once train_config.yaml is filled in, launch the scripts one after another, passing the config file to each of them like this:

python virhunter/prepare_ds_nn.py configs/train_config.yaml
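The four steps can also be chained from a small driver script. The sketch below is not part of VirHunter. It assumes that all four scripts live in the virhunter/ folder, like prepare_ds_nn.py, and that each accepts the config file as its only argument.

```python
import subprocess
import sys

# Training steps in the order listed above.
STEPS = ["prepare_ds_nn.py", "prepare_ds_rf.py", "train_nn.py", "train_rf.py"]


def training_commands(config="configs/train_config.yaml", script_dir="virhunter"):
    """Build one command per training step (script_dir is an assumption)."""
    return [[sys.executable, f"{script_dir}/{step}", config] for step in STEPS]


def run_training(config="configs/train_config.yaml"):
    """Run the steps in order and stop at the first one that fails."""
    for command in training_commands(config):
        subprocess.run(command, check=True)


# Show what would be executed, without starting the training.
for command in training_commands():
    print(" ".join(command))
```

Calling run_training() from the VirHunter root folder starts the actual training. With check=True, a failing step raises an error, so later steps never run on incomplete data.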

VirHunter on GPU

If you plan to train VirHunter on a GPU, please use environment_gpu.yml or requirements_gpu.txt to install the dependencies. These recipes were tested only on a Linux cluster with multiple GPUs. If you plan to train VirHunter on a cluster with multiple GPUs, uncomment the line with the CUDA_VISIBLE_DEVICES variable in the header of train_nn.py and replace "" with "N", where N is the index of the GPU you want to use.

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "N"
