A virus identifier for High Throughput Sequencing datasets

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

VirHunter

VirHunter is a deep learning method that uses Convolutional Neural Networks (CNNs) and a Random Forest classifier to identify viruses in sequencing datasets. More precisely, VirHunter classifies previously assembled contigs as viral, host or bacterial (contamination).

System Requirements

VirHunter installation requires a Unix environment with Python 3.8. It was tested on Linux and macOS. VirHunter is not yet fully compatible with M1-chip MacBooks.

To run VirHunter, your installation should include conda. If you are installing it for the first time, we suggest using the lightweight Miniconda. Otherwise, you can install the dependencies with pip.

Installation

The full installation process should take less than 15 minutes on a standard computer.

First, clone the repository from GitHub:

git clone https://github.com/cbib/virhunter.git

Then go to the VirHunter root folder:

cd virhunter/

Installing dependencies with Conda

First, create the environment from the envs/environment.yml file. The installation may take around 500 MB of drive space.

conda env create -f envs/environment.yml

Second, activate the environment:

conda activate virhunter

Installing dependencies with pip

If you don't have Conda installed on your system, you can install the Python dependencies with pip:

pip install -r envs/requirements.txt

On macOS, you will also need to install wget to run some of the scripts (the Conda installation already includes it). You can do this with the brew package manager:

brew install wget

Testing the installation of VirHunter

You can check that VirHunter was installed successfully by running it on the toy dataset we provide. IMPORTANT: the toy dataset is intended only to test that VirHunter works correctly. The modules trained on it should not be used for prediction on your own datasets!

First, download the toy dataset:

bash scripts/download_test_installation.sh

Then launch the script that tests the training and prediction Python scripts of VirHunter:

bash scripts/test_installation.sh

Using VirHunter for prediction

VirHunter takes a fasta file with contigs as input and predicts whether each contig is viral, host (plant) or bacterial.

For the given contigs, VirHunter produces a tab-delimited csv file with predictions. The id column stores the fasta header of a contig, and length stores the length of the contig. The virus, plant and bacteria columns store the number of fragments of the contig that received the corresponding prediction from the RF classifier. Finally, the decision column gives the final decision for the contig. Refer to it when filtering viral contigs.
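
As an illustration of filtering on the decision column, the sketch below reads the tab-delimited output with Python's csv module and keeps the contigs whose final decision is viral. The column names come from the description above. The label value "virus" in the decision column, the function name and the file name in the usage comment are assumptions, so check them against your own output file.

```python
import csv
import io


def viral_contigs(handle, viral_label="virus"):
    """Return (id, length) pairs for contigs whose `decision` is viral.

    `viral_label` is an assumed value for the decision column.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        (row["id"], int(row["length"]))
        for row in reader
        if row["decision"] == viral_label
    ]


# Tiny made-up table in the documented column layout (not real output).
example = io.StringIO(
    "id\tlength\tvirus\tplant\tbacteria\tdecision\n"
    "contig_1\t2000\t3\t1\t0\tvirus\n"
    "contig_2\t1200\t0\t2\t0\tplant\n"
)
print(viral_contigs(example))

# With a real file (hypothetical path):
# with open("predictions.csv", newline="") as fh:
#     hits = viral_contigs(fh)
```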

To make predictions, VirHunter needs to be fully trained for fragment sizes 500 and 1000. VirHunter discards contigs shorter than 500 bp from prediction. The model trained on fragment size 500 is used for contigs with 750 < length < 1500 bp, and the model trained on fragment size 1000 is used for contigs longer than 1500 bp.
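
The length thresholds above can be written as a small selection function. This is a sketch of the rule as stated in the text, not VirHunter's own code. The function name is hypothetical, and it assumes that contigs longer than 1500 bp go to the model trained on the larger of the two fragment sizes (1000). Lengths that the text does not assign to a model (500 to 750 bp, and exactly 1500 bp) are returned as None rather than guessed.

```python
def model_fragment_size(contig_length):
    """Return the fragment size of the model to use, or None.

    None means the contig is discarded or the text above does not
    say which model applies.
    """
    if contig_length < 500:
        return None  # discarded from prediction
    if 750 < contig_length < 1500:
        return 500
    if contig_length > 1500:
        return 1000  # assumption: the larger trained fragment size
    return None  # 500-750 bp and exactly 1500 bp: not specified


for length in (400, 900, 3000):
    print(length, model_fragment_size(length))
```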

Before running VirHunter you have to fill in the predict_config.yaml file.

To run VirHunter you can use the pre-trained models. Fully trained models are provided for 3 host species (peach, grapevine, sugar beet) and for fragment sizes 500 and 1000. Weights for these models can be downloaded by running the download_weights.sh script:

bash scripts/download_weights.sh

Once the weights are downloaded, add their paths to configs/predict_config.yaml. For example, to use the model trained on peach, add the paths weights/peach/1000 and weights/peach/500.

The command to run predictions is then:

python virhunter/predict.py configs/predict_config.yaml

Training your own model

You can train your own model, for example for a specific host species. Before training, you need to collect sequence data for three reference datasets: viruses, bacteria and host. Examples are provided by running scripts/download_test_installation.sh, which downloads viruses.fasta, host.fasta and bacteria.fasta files (real reference datasets should correspond, for example, to the whole genome of the host and to all bacteria and all viruses from NCBI).

Training requires execution of the following steps:

- prepare the training dataset for the neural network module from fasta files with prepare_ds_nn.py. This step splits the reference datasets into fragments of fixed size (specified in the config.yaml file, see below)
- prepare the training dataset for the Random Forest classifier module with prepare_ds_rf.py
- train the neural network module with train_nn.py
- train the Random Forest module with train_rf.py
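
To make the fragmenting step concrete, the sketch below cuts a sequence into consecutive, non-overlapping fragments of a fixed size and drops a shorter trailing piece. This layout is an assumption made for illustration; prepare_ds_nn.py may sample or overlap fragments differently, and the function name is hypothetical.

```python
def split_into_fragments(sequence, fragment_size):
    """Cut `sequence` into consecutive fragments of exactly `fragment_size`.

    A trailing piece shorter than `fragment_size` is dropped (assumption).
    """
    if fragment_size <= 0:
        raise ValueError("fragment_size must be positive")
    last_start = len(sequence) - fragment_size
    return [
        sequence[start:start + fragment_size]
        for start in range(0, last_start + 1, fragment_size)
    ]


print(split_into_fragments("ACGTACGTAC", 4))
```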

The successful training of VirHunter produces weights for the three neural networks from the first module and weights for the trained Random Forest classifier. They can be subsequently used for prediction.

To execute the training steps, you must first fill in train_config.yaml. This file already contains information on all expected inputs. Once train_config.yaml is filled in, launch the scripts consecutively, passing the config file to each one, for example:

python virhunter/prepare_ds_nn.py configs/train_config.yaml
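
Launching the four scripts consecutively can also be scripted. The sketch below builds one command per training step, in the order of the steps listed earlier, and runs them with subprocess, stopping at the first failure. It assumes that all four scripts live in the virhunter/ directory like prepare_ds_nn.py and that each takes the config file as its only argument. The helper names are hypothetical.

```python
import subprocess
import sys

# Script names as given in the training steps; their location is assumed.
TRAINING_SCRIPTS = [
    "prepare_ds_nn.py",
    "prepare_ds_rf.py",
    "train_nn.py",
    "train_rf.py",
]


def training_commands(config_path, script_dir="virhunter"):
    """Build one command per training step, in execution order."""
    return [
        [sys.executable, f"{script_dir}/{script}", config_path]
        for script in TRAINING_SCRIPTS
    ]


def run_training(config_path):
    """Run the steps in order; check=True raises on the first failure."""
    for command in training_commands(config_path):
        print("Running:", " ".join(command))
        subprocess.run(command, check=True)


# Usage, from the VirHunter root folder with the environment active:
# run_training("configs/train_config.yaml")
```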

VirHunter on GPU

If you plan to train VirHunter on a GPU, please use environment_gpu.yml or requirements_gpu.txt to install the dependencies. These recipes were tested only on a Linux cluster with multiple GPUs. If you plan to train VirHunter on a cluster with multiple GPUs, uncomment the line with the CUDA_VISIBLE_DEVICES variable in the header of train_nn.py and replace "" with "N", where N is the index of the GPU you want to use.

import os
# Replace N with the index of the GPU to expose to the process, e.g. "0"
os.environ["CUDA_VISIBLE_DEVICES"] = "N"

This version

1.0.0

Feb 22, 2022

Download files

Source Distribution

virhunter-1.0.0.tar.gz (4.3 kB)

Uploaded Feb 22, 2022 Source

Built Distribution

virhunter-1.0.0-py3-none-any.whl (4.3 kB)

Uploaded Feb 22, 2022 Python 3

File details

Details for the file virhunter-1.0.0.tar.gz.

File metadata

Download URL: virhunter-1.0.0.tar.gz
Upload date: Feb 22, 2022
Size: 4.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.50.2 importlib-metadata/4.11.1 keyring/21.4.0 rfc3986/1.5.0 colorama/0.4.4 CPython/3.8.5

File hashes

Hashes for virhunter-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`195329e55815b82356db2cf88857a44b887c022e7a4ab43f9f00b31e76e31c5e`
MD5	`dc17ffa726ac0666e8420e05d2d128d5`
BLAKE2b-256	`6f26ece8083d93139a177c239a1dc6c67359fe6ff200bbbcbd66d342446593c1`

File details

Details for the file virhunter-1.0.0-py3-none-any.whl.

File metadata

Download URL: virhunter-1.0.0-py3-none-any.whl
Upload date: Feb 22, 2022
Size: 4.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.50.2 importlib-metadata/4.11.1 keyring/21.4.0 rfc3986/1.5.0 colorama/0.4.4 CPython/3.8.5

File hashes

Hashes for virhunter-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5579484e179a3f5c8291f1c7b97f5dc1b61e34cbb04bdd93f603682789c41ab8`
MD5	`ad59781203c558ae4b59f733447deaa6`
BLAKE2b-256	`223081faa86fdc5c8a9def9bbf3b9ef2dbdf244ef968ec99d55b6dbf14e463fa`
