NVIDIA Resiliency Extension
This project combines multiple resiliency-related solutions.
- Fault Tolerance package
- Straggler Detection package
- PyTorch Lightning callbacks
Installation
From sources:
git clone --recursive <this repo URL>
cd <repo>
pip install .
Requirements:
- Python >= 3.10
- gcc >= 8.0
- CUDA >= 11.8
Fault Tolerance integration guide
This section describes Fault Tolerance callback integration with a PTL-based workload (e.g. NeMo).
Let's define some terms used in this section:
- PTL is PyTorch Lightning.
- Fault Tolerance, FT is the fault_tolerance package, included in nvidia_resiliency_ext.
- FT callback, FaultToleranceCallback is a PTL callback defined in the ptl_resiliency package, included in nvidia_resiliency_ext.
- ft_launcher is a launcher tool included in FT, which is based on torchrun.
- heartbeat is a lightweight message sent from a rank to its rank monitor that indicates that the rank is alive.
- rank monitor is a special side process started by ft_launcher that monitors heartbeats from its rank.
- timeouts are time intervals used by a rank monitor to detect that a rank is not alive. There are two separate timeouts: one for the initial heartbeat and one for subsequent heartbeats.
- launcher script is a bash script that invokes ft_launcher.
0. Use ft_launcher to start the workload
ft_launcher is similar to torchrun, but it additionally starts a rank monitor for each launched rank.
ft_launcher takes the FT configuration in a YAML file (--fault-tol-cfg-path) or via CLI args (--ft-param-...).
FT configuration items are described in FaultToleranceConfig docstring.
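For illustration, a minimal FT config file could look like the following. The two timeout fields are described in the FaultToleranceConfig docstring; the values here are arbitrary, and the assumption that the settings live under a fault_tolerance section should be checked against that docstring:

```yaml
# ft_cfg.yaml - illustrative FT configuration (values are examples, tune per workload)
fault_tolerance:
  initial_rank_heartbeat_timeout: 600   # seconds to wait for the first heartbeat from a rank
  rank_heartbeat_timeout: 180           # seconds allowed between subsequent heartbeats
```

The workload would then be started with something like `ft_launcher --fault-tol-cfg-path=ft_cfg.yaml train.py`, forwarding the usual torchrun-style arguments to ft_launcher.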
1. Add FT callback to the trainer
Add FT callback to PTL callbacks.
fault_tol_cb = FaultToleranceCallback(
autoresume=True,
calculate_timeouts=True,
logger_name="test_logger",
exp_dir=tmp_path,
)
trainer = pl.Trainer(
...
callbacks=[..., fault_tol_cb],
)
Core FT callback functionality is:
- Establishing a connection with a rank monitor
- Sending heartbeats during training and evaluation steps
- Disconnecting from a rank monitor
Optionally, it can also:
- Compute timeouts that will be used instead of timeouts defined in the FT config
- Create a flag file when the training is completed
FT callback initialization params:
def __init__(
self,
autoresume: bool,
calculate_timeouts: bool,
simulated_fault_params: Optional[Any] = None,
exp_dir: Union[str, pathlib.Path, None] = None,
logger_name: Optional[str] = "nemo_logger.FaultToleranceCallback",
):
"""
Initialize callback instance.
This is a lightweight initialization. Most of the initialization is conducted in the 'setup' hook.
Args:
autoresume (bool): Set to `True` if the FT auto-resume feature is used (e.g., there are multiple training jobs to be run).
calculate_timeouts (bool): Set to `True` if FT timeouts should be calculated based on observed heartbeat intervals.
Calculated timeouts overwrite the timeouts from the FT config.
Timeouts are computed at the end of a training job, if there was checkpoint loading and saving.
For example, for training started from scratch, the timeouts are computed at the end of the second job.
simulated_fault_params (Optional[Any], optional): Simulated fault spec. It's for debugging only. Defaults to None.
exp_dir (Union[str, pathlib.Path, None], optional): Directory where the FT state should be saved.
Must be available for all training jobs. NOTE: Beware that PTL/NeMo can move files written directly to `trainer.log_dir`.
Defaults to None, in which case it defaults to `trainer.log_dir/ft_state/`.
logger_name (Optional[str], optional): Logger name to be used.
Defaults to "nemo_logger.FaultToleranceCallback".
"""
2. Implementing auto-resume
Auto-resume is a feature that simplifies running training that consists of multiple subsequent training jobs.
NOTE: Auto-resume is not a part of the FT package. It is entirely implemented in a launcher script and the FaultToleranceCallback.
FaultToleranceCallback exposes an "interface" that allows implementing an auto-resume launcher script.
Specifically, if autoresume=True the FT callback creates a special marker file when a training is completed.
The marker file location is expected to be set in the FAULT_TOL_FINISHED_FLAG_FILE environment variable.
The following mechanism can be used to implement an auto-resuming launcher script:
- The launcher script starts the ranks with ft_launcher.
- FAULT_TOL_FINISHED_FLAG_FILE should be passed to the rank processes.
- When ft_launcher exits, the launcher script checks whether the FAULT_TOL_FINISHED_FLAG_FILE file was created.
  - If FAULT_TOL_FINISHED_FLAG_FILE exists, the auto-resume loop can be broken, as the training is completed.
  - If FAULT_TOL_FINISHED_FLAG_FILE does not exist, a continuation job can be issued (other conditions can also be checked, e.g. whether the maximum number of failures has been reached).
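The steps above can be sketched as a small bash helper. The function name and the max-failures policy are illustrative, not part of the FT package; only the FAULT_TOL_FINISHED_FLAG_FILE contract comes from the FT callback:

```shell
# Sketch of an auto-resume loop built on the FAULT_TOL_FINISHED_FLAG_FILE contract.
run_with_auto_resume() {
    # usage: run_with_auto_resume <max_failures> <command...>
    # <command...> is typically an ft_launcher invocation.
    local max_failures=$1
    shift
    local failures=0
    while true; do
        "$@" || true  # a failed job must not abort the loop; the checks below decide
        if [ -f "${FAULT_TOL_FINISHED_FLAG_FILE:?must be set and exported}" ]; then
            return 0  # marker file exists: training is completed
        fi
        failures=$((failures + 1))
        if [ "$failures" -ge "$max_failures" ]; then
            return 1  # too many unsuccessful jobs: give up
        fi
    done
}
```

A real launcher script would export FAULT_TOL_FINISHED_FLAG_FILE (and make sure it reaches the rank processes) and then call, e.g., `run_with_auto_resume 3 ft_launcher --fault-tol-cfg-path=ft_cfg.yaml train.py`.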
Straggler Detection integration guide
Include ptl_resiliency.StragglerDetectionCallback in the PTL trainer's callbacks.
straggler_cb_args = dict(
report_time_interval=300.0,
calc_relative_gpu_perf=True,
calc_individual_gpu_perf=True,
num_gpu_perf_scores_to_log=3,
gpu_relative_perf_threshold=0.7,
gpu_individual_perf_threshold=0.7,
stop_if_detected=False,
logger_name="test_logger",
)
straggler_det_cb = StragglerDetectionCallback(**straggler_cb_args)
trainer = pl.Trainer(
...
callbacks=[..., straggler_det_cb],
)
StragglerDetectionCallback initialization params:
def __init__(
self,
report_time_interval: float,
calc_relative_gpu_perf: bool,
calc_individual_gpu_perf: bool,
num_gpu_perf_scores_to_log: int,
gpu_relative_perf_threshold: float,
gpu_individual_perf_threshold: float,
stop_if_detected: bool,
logger_name: Optional[str] = "nemo_logger.StragglerDetectionCallback",
):
"""
Initialize straggler detection callback instance.
Args:
report_time_interval (float): Interval [seconds] of the straggler check
calc_relative_gpu_perf (bool): Calculate relative GPU performance
calc_individual_gpu_perf (bool): Calculate individual GPU performance
num_gpu_perf_scores_to_log (int): How many best and worst scores to log (0 - does not log periodically, but only if stragglers are detected)
gpu_relative_perf_threshold (float): Threshold for relative GPU performance scores
gpu_individual_perf_threshold (float): Threshold for individual GPU performance scores
stop_if_detected (bool): Set to True, to terminate the workload if stragglers are detected
logger_name (Optional[str], optional): Defaults to "nemo_logger.StragglerDetectionCallback".
Raises:
ValueError: If invalid config was provided.
"""
More info on straggler detection can be found in the straggler package's README.