A virus identifier for High Throughput Sequencing datasets
VirHunter
VirHunter is a deep learning method that uses Convolutional Neural Networks (CNNs) and a Random Forest Classifier to identify viruses in sequencing datasets. More precisely, VirHunter classifies previously assembled contigs as viral, host or bacterial (contamination).
System Requirements
VirHunter installation requires a Unix environment with Python 3.8. It was tested on Linux and macOS. For now, VirHunter is not fully compatible with MacBooks that use the M1 chip.
To run VirHunter, your installation should include conda. If you are installing it for the first time, we suggest the lightweight miniconda. Otherwise, you can install the dependencies with pip.
Installation
The full installation process should take less than 15 minutes on a standard computer.
Clone the repository from GitHub
git clone https://github.com/cbib/virhunter.git
Go to the VirHunter root folder
cd virhunter/
Installing dependencies with Conda
First, you have to create the environment from the envs/environment.yml file.
The installation may take around 500 MB of drive space.
conda env create -f envs/environment.yml
Second, activate the environment:
conda activate virhunter
Installing dependencies with pip
If you don't have Conda installed on your system, you can install the Python dependencies with pip:
pip install -r envs/requirements.txt
If you are on macOS, you will also need to install wget to run some of the scripts (the Conda installation already includes it). You can do this with the brew package manager.
brew install wget
Testing your installation of VirHunter
You can test that VirHunter was successfully installed using the toy dataset we provide. IMPORTANT: the toy dataset is intended only to check that VirHunter works correctly. The models trained on it should not be used for prediction on your datasets!
First, you have to download the toy dataset
bash scripts/download_test_installation.sh
Then launch the script that tests the training and prediction Python scripts of VirHunter
bash scripts/test_installation.sh
Using VirHunter for prediction
VirHunter takes as input a fasta file with contigs and predicts whether each contig is viral, host (plant) or bacterial.
For the given contigs, VirHunter produces a tab-delimited csv file with the predictions. Column id stores the fasta header of a contig and column length stores the length of the contig. Columns virus, plant and bacteria store the number of fragments of the contig that received the corresponding prediction from the RF classifier. Finally, column decision gives the final decision for a given contig. You should refer to it when filtering viral contigs.
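As a minimal sketch of filtering on the decision column, the snippet below reads the tab-delimited output with Python's standard csv module and keeps the ids of viral contigs. The function name and the label "virus" are assumptions made for illustration; check the values that actually appear in the decision column of your output file.

```python
import csv


def viral_contig_ids(prediction_path, viral_label="virus"):
    """Return the ids of contigs whose final decision equals viral_label.

    viral_label is an assumed value of the decision column; adjust it to
    whatever label your prediction file actually contains.
    """
    with open(prediction_path, newline="") as handle:
        # The prediction file is tab-delimited, with one header row.
        reader = csv.DictReader(handle, delimiter="\t")
        return [row["id"] for row in reader if row["decision"] == viral_label]
```

The returned ids are the fasta headers, so they can be used to pull the matching sequences out of the original contigs file.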
To make predictions, VirHunter needs to be fully trained for fragment sizes 500 and 1000. VirHunter will discard from prediction contigs shorter than 500 bp. The model trained on fragment size 500 is used for contigs with 750 < length < 1500. The model trained on fragment size 1000 is used for contigs longer than 1500 bp.
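Because short contigs are dropped from prediction, it can help to check contig lengths beforehand. The following is a small sketch, not part of VirHunter, that measures each contig in a fasta file with plain Python and lists those under a cut-off you supply. The function names and the file name in the usage note are hypothetical.

```python
def contig_lengths(fasta_path):
    """Map each fasta header (without the leading '>') to its sequence length."""
    lengths = {}
    header = None
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new record starts; sequence lines that follow belong to it.
                header = line[1:]
                lengths[header] = 0
            elif header is not None:
                lengths[header] += len(line)
    return lengths


def too_short(lengths, min_length):
    """Return the headers of contigs shorter than min_length bp."""
    return [name for name, n in lengths.items() if n < min_length]
```

For example, `too_short(contig_lengths("contigs.fasta"), 500)` would list the contigs below the 500 bp cut-off stated above.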
Before running VirHunter you have to fill in the predict_config.yaml file.
To run VirHunter you can use the already pre-trained models. Provided are fully trained models for 3 host species (peach, grapevine, sugar beet) and
for fragment sizes 500 and 1000. Weights for these models can be downloaded by running the download_weights.sh script:
bash scripts/download_weights.sh
Once the weights are downloaded, you can point VirHunter to them. For example, to use the model trained on peach, add the paths weights/peach/1000 and weights/peach/500 to configs/predict_config.yaml.
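Before editing the config, you may want to confirm that the weights are where you expect them. The sketch below assumes the weights/host/fragment-size layout implied by the peach example; the helper names are hypothetical and not part of VirHunter.

```python
from pathlib import Path


def weights_paths(host, root="weights", fragment_sizes=(1000, 500)):
    """Build the expected weights paths for a host, e.g. weights/peach/1000."""
    return [Path(root) / host / str(size) for size in fragment_sizes]


def missing_weights(host, root="weights"):
    """Return the expected weights paths that do not exist on disk."""
    return [path for path in weights_paths(host, root) if not path.exists()]
```

An empty list from `missing_weights("peach")` suggests both paths are present and can be written into configs/predict_config.yaml.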
The command to run predictions is then:
python virhunter/predict.py configs/predict_config.yaml
Training your own model
You can train your own model, for example for a specific host species. Before training, you need to collect sequence
data for training for three reference datasets: viruses, bacteria and host.
Example files can be obtained by running scripts/download_test_installation.sh, which downloads the viruses.fasta, host.fasta and bacteria.fasta files. Real reference datasets should correspond e.g. to the whole genome of the host, all bacteria and all viruses from the NCBI.
Training requires execution of the following steps:
- prepare the training dataset for the neural network module from fasta files with prepare_ds_nn.py. This step splits the reference datasets into fragments of fixed size (specified in the config.yaml file, see below)
- prepare the training dataset for the Random Forest classifier module with prepare_ds_rf.py
- train the neural network module with train_nn.py
- train the Random Forest module with train_rf.py
The successful training of VirHunter produces weights for the three neural networks from the first module and weights for the trained Random Forest classifier. They can be subsequently used for prediction.
To execute the training steps you must first fill in train_config.yaml. This file already contains information on all expected inputs.
Once train_config.yaml is filled in, you can launch the scripts consecutively, providing each of them with the config file like this:
python virhunter/prepare_ds_nn.py configs/train_config.yaml
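The four training steps can also be chained from a single Python script. This is a rough sketch that assumes every script lives in virhunter/ and takes the config file as its only argument, as in the command above; the function names are hypothetical.

```python
import subprocess
import sys

# Training scripts in the order listed above.
TRAINING_SCRIPTS = [
    "prepare_ds_nn.py",
    "prepare_ds_rf.py",
    "train_nn.py",
    "train_rf.py",
]


def training_commands(config="configs/train_config.yaml", script_dir="virhunter"):
    """Build one command per training step, using the current Python interpreter."""
    return [[sys.executable, f"{script_dir}/{script}", config]
            for script in TRAINING_SCRIPTS]


def run_training(config="configs/train_config.yaml"):
    """Run the steps in order; check=True stops at the first failing step."""
    for command in training_commands(config):
        subprocess.run(command, check=True)
```

Call `run_training()` from the VirHunter root folder with the environment activated. Stopping at the first failure avoids training on an incomplete dataset.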
VirHunter on GPU
If you plan to train VirHunter on a GPU, please use environment_gpu.yml or requirements_gpu.txt to install the dependencies. These recipes were tested only on a Linux cluster with multiple GPUs.
If you plan to train VirHunter on a cluster with multiple GPUs, you will need to uncomment the line with the CUDA_VISIBLE_DEVICES variable in the header of train_nn.py and replace "" with "N", where N is the index of the GPU you want to use.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "N"