
A module for the segmentation of phage endolysin domains based on the PAE matrix from AlphaFold.


Segmentation of PhAge Endolysin Domains

SPAED is a tool to identify domains in phage endolysins. It takes as input the PAE file(s) obtained from AlphaFold and outputs a CSV file with the delineations of the predicted domains, linkers and terminal disordered regions.

Additional scripts are provided to visualize predicted domains with PyMOL and to obtain their amino acid sequences.

Installation & usage

Check out www.spaed.ca to launch SPAED quickly!

First create a virtual environment, then:

From PyPI:

pip install spaed  ### note the spelling of spaed

ex. spaed pae_path --output_file spaed_predictions.csv

From source:

git clone https://github.com/Rousseau-Team/spaed.git

pip install numpy pandas scipy

ex. python spaed/src/spaed/spaed.py pae_path

Advanced usage

Optional dependency for structure visualisation: pymol (conda install -c conda-forge -c schrodinger pymol-bundle). Python>3.10 is required, 3.12.9 worked for me.
ex. (install from pip). pymol_vis pred_path pdb_path --output_folder pymol_output --output_type {pse|png|both}
ex. (install from source). python spaed/src/spaed/pymol_vis.py pred_path pdb_path --output_folder pymol_output --output_type {pse|png|both}

Positional arguments:

  • pae_path - Path to a single PAE file, or to a folder of PAE files, in JSON format as output by AlphaFold2/3 or ColabFold.
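
For readers who want to inspect a PAE file before running SPAED, the following is a minimal sketch of a loader. It is not SPAED's own reader. The helper name load_pae is hypothetical, and the key names are assumptions about common layouts: AlphaFold DB style files are assumed to wrap one record in a list under "predicted_aligned_error", while ColabFold and AlphaFold3 files are assumed to store the matrix under "pae".

```python
import json


def load_pae(path):
    """Return the PAE matrix from a JSON file as a list of rows.

    Assumed layouts (not taken from SPAED itself):
      - [{"predicted_aligned_error": [[...], ...]}]  (AlphaFold DB style)
      - {"pae": [[...], ...]}                        (ColabFold / AlphaFold3 style)
    """
    with open(path) as handle:
        data = json.load(handle)

    # Some files wrap the single record in a one-element list.
    if isinstance(data, list):
        data = data[0]

    for key in ("predicted_aligned_error", "pae"):
        if key in data:
            return data[key]
    raise KeyError(f"no PAE matrix found in {path}")


# Usage (hypothetical file name):
# pae = load_pae("my_endolysin_pae.json")
# print(len(pae), "residues")
```

A square matrix whose side equals the protein length is a quick sanity check that the right file was picked up.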

Optional arguments:

  • output_file - File to save table of segmented domains in csv format. (default spaed_predictions.csv)
  • fasta_path - Path to a fasta file or to a folder containing fasta files. If specified, spaed will save the sequences corresponding to predicted domains, linkers and terminal disordered regions into new fasta files named "spaed_predicted_{seq_type}.faa" in the same output folder as output_file. Ensure that fasta file names or headers correspond to the entries in the PAE files.
  • RATIO_NUM_CLUSTERS - The maximum number of clusters initially generated by hierarchical clustering is len(protein) // RATIO_NUM_CLUSTERS. (default 10). For a protein 400 residues long, up to 40 clusters will be generated.
  • MIN_DOMAIN_SIZE - Minimum size a domain can have. (default 30).
  • PAE_SCORE_CUTOFF - Cutoff on the PAE score used to make adjustments to predicted domains/linkers/terminal disordered regions. Residues with PAE score < PAE_SCORE_CUTOFF are considered close together. (default = 4).
  • MIN_DISORDERED_SIZE - Minimum size a terminal disordered region can be to be considered a separate entity from the domain it is next to (default 20).
  • FREQ_DISORDERED - For a given residue in the PAE matrix, frequency of residues that can align to it with a low PAE score and still be considered "not part of a domain". Values < MIN_DOMAIN_SIZE are logical. The higher the value, the more lenient the algorithm becomes towards non-domain regions (more will be predicted). (default 6).
  • PROP_DISORDERED - Proportion of residues in a given region that must meet FREQ_DISORDERED criteria to be considered a terminal disordered region. The greater the value, the stricter the criteria to predict the region as disordered. (default 80%).
  • FREQ_LINKER - For a given residue in the PAE matrix, frequency of residues that can align to it with a low PAE score and still be considered as part of the linker. Values < MIN_DOMAIN_SIZE are logical as they are less than the expected size of the nearest domain. Increasing the value leads to a more lenient assignment of residues to the linker. (default 20).
  • version - Display installed SPAED version number.
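
As a quick illustration of the RATIO_NUM_CLUSTERS arithmetic described above, the sketch below computes the cap on the initial number of clusters for a few protein lengths. The function name max_initial_clusters is hypothetical and only mirrors the formula len(protein) // RATIO_NUM_CLUSTERS; it does not call SPAED.

```python
def max_initial_clusters(protein_length, ratio_num_clusters=10):
    """Upper bound on initial clusters: integer division of length by the ratio."""
    return protein_length // ratio_num_clusters


print(max_initial_clusters(400))      # 40
print(max_initial_clusters(251))      # 25
print(max_initial_clusters(400, 20))  # 20
```

A larger RATIO_NUM_CLUSTERS therefore means fewer initial clusters for the same protein.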

If you are interested in the disordered regions at the N- or C-terminus, consider increasing FREQ_DISORDERED ([4-30]), decreasing MIN_DISORDERED_SIZE ([10-30]) or decreasing PROP_DISORDERED ([50-95]). This will result in more (and longer) terminal disordered regions being detected, but also in many false positives. I would not change them all at the same time, as this will probably increase the sensitivity too much.
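
To make the interplay of PAE_SCORE_CUTOFF, FREQ_DISORDERED, PROP_DISORDERED and MIN_DISORDERED_SIZE more concrete, here is a minimal sketch of one plausible reading of the parameter descriptions. It is not SPAED's implementation: the function names, counting along rows, excluding the diagonal and comparing with <= are all assumptions made for illustration.

```python
def low_pae_counts(pae, cutoff=4.0):
    """For each residue i, count the other residues j with pae[i][j] < cutoff."""
    n = len(pae)
    return [
        sum(1 for j in range(n) if j != i and pae[i][j] < cutoff)
        for i in range(n)
    ]


def looks_disordered(counts, start, end,
                     freq_disordered=6, prop_disordered=0.8, min_size=20):
    """Check a 0-based, half-open region [start, end) against the thresholds.

    Assumption: a residue counts as "not part of a domain" when it has at most
    freq_disordered low-PAE partners.
    """
    size = end - start
    if size < min_size:
        return False
    flagged = sum(1 for c in counts[start:end] if c <= freq_disordered)
    return flagged / size >= prop_disordered


# Toy matrix: residues 0-19 have high PAE to everything (flexible tail),
# residues 20-59 have low PAE among themselves (one compact domain).
n = 60
toy_pae = [[1.0 if (i >= 20 and j >= 20) else 20.0 for j in range(n)]
           for i in range(n)]
for i in range(n):
    toy_pae[i][i] = 0.0

counts = low_pae_counts(toy_pae)
print(looks_disordered(counts, 0, 20))   # True
print(looks_disordered(counts, 20, 60))  # False
```

Under this reading, raising freq_disordered or lowering prop_disordered and min_size makes it easier for a terminal region to pass, which matches the direction of the tuning advice above.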

If you are interested in linkers or have a protein that is less well folded, consider modifying the FREQ_LINKER parameter ([4-30]). This value is used to adjust the boundaries of the linkers and as such, a higher value will result in longer linkers. However, linkers that were missed will still not be detected.

Outputs

A csv file containing the proteinID, protein length, number of predicted domains, domain delineations, linker delineations, terminal disordered region delineations. Delineations for each domain are separated by a ";".
Ex.

       | length | # domains | domains        | linkers | disordered
prot 1 | 251    | 2         | 1-120;130-251  | 121-129 |
prot 2 | 386    | 2         | 86-203;217-386 | 204-216 | 1-85
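
The delineation strings can be turned into numeric ranges with a few lines of Python. The sketch below is a hypothetical helper, not part of SPAED; it assumes the coordinates are 1-based and inclusive, as the example rows suggest (prot 1 ends at 251, its full length), and treats missing cells as empty.

```python
def parse_delineations(field):
    """Turn '1-120;130-251' into [(1, 120), (130, 251)]; empty cells give []."""
    # Non-string values (e.g. NaN from a CSV reader) are treated as empty.
    if not isinstance(field, str) or not field.strip():
        return []
    ranges = []
    for part in field.strip().split(";"):
        start, end = part.split("-")
        ranges.append((int(start), int(end)))
    return ranges


def slice_segments(sequence, ranges):
    """Extract subsequences for 1-based, inclusive (start, end) ranges."""
    return [sequence[start - 1:end] for start, end in ranges]


domains = parse_delineations("86-203;217-386")
print(domains)                   # [(86, 203), (217, 386)]
print(parse_delineations(""))    # []

toy_sequence = "ABCDEFGHIJ"
print(slice_segments(toy_sequence, [(1, 3), (8, 10)]))  # ['ABC', 'HIJ']
```

For real sequences, the fasta_path option already writes the corresponding fasta files, so this kind of slicing is only needed for custom post-processing.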

Citation

Alexandre Boulay, Emma Cremelie, Clovis Galiez, Yves Briers, Elsa Rousseau, Roberto Vázquez, SPAED: harnessing AlphaFold output for accurate segmentation of phage endolysin domains, Bioinformatics, Volume 41, Issue 10, October 2025, btaf531, https://doi.org/10.1093/bioinformatics/btaf531.
