NVIDIA Resiliency Package

NVIDIA Resiliency Extension

The NVIDIA Resiliency Extension (NVRx) integrates multiple resiliency-focused solutions for PyTorch-based workloads. Users can modularly integrate NVRx capabilities into their own infrastructure to maximize AI training productivity at scale. NVRx maximizes goodput by enabling system-wide health checks, quickly detecting faults at runtime and resuming training automatically. NVRx minimizes loss of work by enabling fast and frequent checkpointing.

For detailed documentation and usage information about each component, please refer to https://nvidia.github.io/nvidia-resiliency-ext/.

⚠️ NOTE: This project is still experimental and under active development. The code, features, and documentation are evolving rapidly. Please expect frequent updates and breaking changes. Contributions are welcome and we encourage you to watch for updates.

Figure highlighting core NVRx features including automatic restart, hierarchical checkpointing, fault detection and health checks

Core Components and Capabilities

  • Fault Tolerance

    • Detection of hung ranks.
    • Restarting training in-job, without the need to reallocate SLURM nodes.
  • In-Process Restarting

    • Detecting failures and enabling quick recovery.
  • Async Checkpointing

    • Providing an efficient framework for asynchronous checkpointing.
  • Local Checkpointing

    • Providing an efficient framework for local checkpointing.
  • Straggler Detection

    • Monitoring GPU and CPU performance of ranks.
    • Identifying slower ranks that may impede overall training efficiency.
  • Framework Integration

    • Facilitating seamless fault tolerance and straggler detection integration with PyTorch Lightning based workloads.
    • Providing integration with NVIDIA NeMo framework, a scalable and cloud-native generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (e.g. Automatic Speech Recognition and Text-to-Speech).
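
The idea behind straggler detection (comparing per-rank performance and flagging ranks that lag behind the rest) can be illustrated with a minimal sketch. This is not the NVRx API: the function name `find_stragglers`, the `threshold` parameter, and the timing data are all hypothetical, and NVRx's own scoring may work quite differently.

```python
from statistics import median


def find_stragglers(step_times, threshold=1.5):
    """Return ranks whose mean step time exceeds `threshold` x the median rank mean.

    step_times: dict mapping rank -> list of step durations in seconds.
    Hypothetical helper for illustration only; not part of NVRx.
    """
    # Mean step time per rank, skipping ranks with no samples.
    means = {rank: sum(t) / len(t) for rank, t in step_times.items() if t}
    if not means:
        return []
    # The median is robust to a few slow ranks, so it serves as the baseline.
    baseline = median(means.values())
    return sorted(rank for rank, m in means.items() if m > threshold * baseline)


# Made-up timings: rank 2 is roughly 2.5x slower than its peers.
timings = {0: [1.0, 1.1], 1: [1.0, 0.9], 2: [2.4, 2.6], 3: [1.05, 0.95]}
print(find_stragglers(timings))  # [2]
```

Using the median rather than the mean as the baseline keeps a single very slow rank from shifting the reference point it is compared against.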

Installation

From sources

  • git clone https://github.com/NVIDIA/nvidia-resiliency-ext
  • cd nvidia-resiliency-ext
  • pip install .
  • pip install '.[attribution]' if you also need the log-analysis / attribution extras (the quotes keep shells such as zsh from expanding the brackets)

From PyPI wheel

  • pip install nvidia-resiliency-ext
  • pip install 'nvidia-resiliency-ext[attribution]' for attribution extras
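
After installing, it can be useful to confirm which version ended up in the environment. The following is a small sketch using only the Python standard library; the helper name `installed_version` is an illustrative choice, and the distribution name is taken from the `pip install` command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints the installed version, or None if the package is not installed.
print(installed_version("nvidia-resiliency-ext"))
```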

Platform Support

| Category | Supported Versions / Requirements |
|---|---|
| Architecture | x86_64, arm64 |
| Operating System | Ubuntu 22.04, 24.04 |
| Python Version | >= 3.10, < 3.13 |
| PyTorch Version | >= 2.5.1 (>= 2.8.0 for Fault Attribution) |
| CUDA & CUDA Toolkit | >= 12.8 |
| NVML Driver | >= 535 (570 required for GPU health check) |
| NCCL Version | < 2.28.3 or >= 2.28.9 (avoid NCCL 2.28.3–2.28.8 due to an inprocess issue) |
| TE Version | >= 2.5 |
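
The version constraints above, in particular the excluded NCCL range, are easy to misread, so here is a minimal sketch that encodes a few of them as plain tuple comparisons. The helper names (`parse_version`, `python_ok`, `nccl_ok`, `torch_ok`) are hypothetical and not part of NVRx; version strings would have to be supplied by the caller, for example from `torch.__version__`.

```python
import re
import sys


def parse_version(text):
    """Turn '2.28.3' or '2.8.0+cu128' into a tuple of up to three ints."""
    # Drop any local suffix such as '+cu128', then keep the leading numbers.
    return tuple(int(p) for p in re.findall(r"\d+", text.split("+")[0])[:3])


def python_ok(v=None):
    """Python must be >= 3.10 and < 3.13."""
    v = tuple(sys.version_info[:2]) if v is None else v
    return (3, 10) <= v < (3, 13)


def nccl_ok(v):
    """NCCL must be < 2.28.3 or >= 2.28.9 (2.28.3 through 2.28.8 are excluded)."""
    v = parse_version(v)
    return v < (2, 28, 3) or v >= (2, 28, 9)


def torch_ok(v, fault_attribution=False):
    """PyTorch must be >= 2.5.1, or >= 2.8.0 when Fault Attribution is used."""
    minimum = (2, 8, 0) if fault_attribution else (2, 5, 1)
    return parse_version(v) >= minimum


print(nccl_ok("2.28.5"))   # False: inside the excluded range
print(nccl_ok("2.28.9"))   # True
print(torch_ok("2.7.0", fault_attribution=True))  # False
```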

Source Distributions

No source distribution files are available for this release.

Built Distributions

nvidia_resiliency_ext-0.6.0-cp312-cp312-manylinux_2_39_x86_64.whl (773.6 kB)

CPython 3.12, manylinux: glibc 2.39+, x86-64

nvidia_resiliency_ext-0.6.0-cp312-cp312-manylinux_2_39_aarch64.whl (770.6 kB)

CPython 3.12, manylinux: glibc 2.39+, ARM64

nvidia_resiliency_ext-0.6.0-cp311-cp311-manylinux_2_39_x86_64.whl (772.2 kB)

CPython 3.11, manylinux: glibc 2.39+, x86-64

nvidia_resiliency_ext-0.6.0-cp311-cp311-manylinux_2_39_aarch64.whl (770.3 kB)

CPython 3.11, manylinux: glibc 2.39+, ARM64

nvidia_resiliency_ext-0.6.0-cp310-cp310-manylinux_2_39_x86_64.whl (771.5 kB)

CPython 3.10, manylinux: glibc 2.39+, x86-64

nvidia_resiliency_ext-0.6.0-cp310-cp310-manylinux_2_39_aarch64.whl (769.5 kB)

CPython 3.10, manylinux: glibc 2.39+, ARM64

File details

Hashes for nvidia_resiliency_ext-0.6.0-cp312-cp312-manylinux_2_39_x86_64.whl
Algorithm Hash digest
SHA256 7d5ac6567841aef0173af3bcc0c2044971d2b225918028b99af6e4eb7e6e2f5d
MD5 27cf308551aec17d900e0d3c35c54023
BLAKE2b-256 c40b957a223c71959497e96ed868acb2ec869170caa95a60a0edbab8909aa98b

Hashes for nvidia_resiliency_ext-0.6.0-cp312-cp312-manylinux_2_39_aarch64.whl
Algorithm Hash digest
SHA256 a71c1a1650b32c72f7fdf7d6f73b4d275961ce5d68bd68efaff9e0d7099031cb
MD5 43a1f08e8b08e9681491fa5ca841b6d2
BLAKE2b-256 2de18d3be847258068862df502123f92cc76dabe869fccde893ecf9ba07168d1

Hashes for nvidia_resiliency_ext-0.6.0-cp311-cp311-manylinux_2_39_x86_64.whl
Algorithm Hash digest
SHA256 ced5ac042f8fc1ba05b7986f47216429a8e7c63296998e5711bda208217281fb
MD5 e5051341335cb78ab277c1143cc3609b
BLAKE2b-256 fb5fb0eccb1a8248f299c9e31203361da7861e254439ec6de93dc7fec34dcd36

Hashes for nvidia_resiliency_ext-0.6.0-cp311-cp311-manylinux_2_39_aarch64.whl
Algorithm Hash digest
SHA256 d54cdcdbc3e64552910c8ff5f65c5943c6041a15772dbe5b7c3e5938cfe7079e
MD5 643ab9533b8bf659c25685772bed4546
BLAKE2b-256 acbb212965e51e3b2b6648f9b964c8ddd6aae8fc7805f2f240dc9edcf4761e04

Hashes for nvidia_resiliency_ext-0.6.0-cp310-cp310-manylinux_2_39_x86_64.whl
Algorithm Hash digest
SHA256 a3f251d6c189593946463def0d8649d3566c6ab2a4be02aacc75bf25017cdde3
MD5 50570e1f253288bd519eba1602ad0313
BLAKE2b-256 c116253803fd02fe4265c7ae84b776be3bf445a3fed3c39830dcf4913ad8fd29

Hashes for nvidia_resiliency_ext-0.6.0-cp310-cp310-manylinux_2_39_aarch64.whl
Algorithm Hash digest
SHA256 da7086be06b8359f27ac0e68fca08981d9cadfb937b4543e2164d371b59202ad
MD5 e1068adbf5807dd352dd59e4d7d8182a
BLAKE2b-256 f2723d1f92f0c3cd26d32707a393e539b3f6d2b59733e1de4442d4289766d07c
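
The digests listed above can be used to check a downloaded wheel before installing it. A minimal sketch using only the standard library follows; the helper names `sha256_of` and `matches` are illustrative, and the path in the usage comment assumes the wheel sits in the current directory.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Return True if the file's SHA256 equals the expected hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# Usage (assumes the wheel was downloaded to the current directory):
# matches(
#     "nvidia_resiliency_ext-0.6.0-cp312-cp312-manylinux_2_39_x86_64.whl",
#     "7d5ac6567841aef0173af3bcc0c2044971d2b225918028b99af6e4eb7e6e2f5d",
# )
```

Reading in chunks keeps memory use flat regardless of file size, which matters little for sub-megabyte wheels but makes the helper reusable for larger artifacts.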
