
MLflow deployment plugin for Modal serverless GPU infrastructure (actively maintained)


mlflow-modal-deploy


Deploy MLflow models to Modal's serverless GPU infrastructure with a single command.

If you find this project useful, please consider giving it a star! It helps others discover the project and motivates continued development. Using it in production? Share your experience - we'd love to hear from you!

Installation

pip install mlflow-modal-deploy

Features

  • One-command deployment: Deploy any MLflow model to Modal's serverless infrastructure
  • GPU support: T4, L4, L40S, A10, A100, A100-40GB, A100-80GB, H100, H200, B200
  • Streaming predictions: predict_stream() API compatible with the MLflow Databricks client
  • Auto-scaling: Configure min/max containers and scale-down windows
  • Dynamic batching: Built-in request batching for high-throughput workloads
  • Automatic dependency detection: Extracts requirements from model artifacts
  • Wheel file support: Handles private dependencies packaged as wheel files
  • Private PyPI support: Deploy with private packages via pip_index_url or Modal secrets
  • MLflow CLI integration: Use familiar mlflow deployments commands

How it Works

flowchart LR
    A[MLflow Model] --> B[Extract Dependencies]
    B --> C[Modal Volume]
    C --> D[Generate Modal App]
    D --> E[HTTPS Endpoint]

  1. Extract: MLflow model artifacts and dependencies are extracted from the model URI
  2. Upload: Model files are uploaded to a Modal Volume for persistent storage
  3. Generate: A Modal app is generated with FastAPI endpoints (/invocations, /predict_stream)
  4. Deploy: Modal builds a container with all dependencies and deploys to serverless infrastructure
  5. Serve: An HTTPS endpoint URL is returned, ready to handle prediction requests

The generated container mirrors your training environment, ensuring consistent behavior between development and production.
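To make the Serve step concrete, here is a minimal sketch of how a request to the /invocations route could be built with the standard library alone. The endpoint URL is hypothetical, and the {"inputs": ...} payload follows MLflow's usual scoring convention; whether the generated app accepts exactly this shape is an assumption, so check it against your own deployment.

```python
import json
import urllib.request


def build_invocation_request(endpoint_url, inputs):
    """Build (but do not send) a POST request for the /invocations route."""
    # Assumption: the endpoint accepts MLflow's {"inputs": ...} JSON payload.
    body = json.dumps({"inputs": inputs}).encode("utf-8")
    return urllib.request.Request(
        url=endpoint_url.rstrip("/") + "/invocations",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical endpoint URL; use the one returned by your own deployment.
request = build_invocation_request(
    "https://example--my-classifier.modal.run",
    {"feature1": [1, 2, 3], "feature2": [4, 5, 6]},
)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen() would send it; the sketch stops short of that so it can run without a live deployment.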

Quick Start

Python API

from mlflow.deployments import get_deploy_client

# Get the Modal deployment client
client = get_deploy_client("modal")

# Deploy a model
deployment = client.create_deployment(
    name="my-classifier",
    model_uri="runs:/abc123/model",
    config={
        "gpu": "T4",
        "memory": 2048,
        "min_containers": 1,
    }
)

print(f"Deployed to: {deployment['endpoint_url']}")

# Make predictions
predictions = client.predict(
    deployment_name="my-classifier",
    inputs={"feature1": [1, 2, 3], "feature2": [4, 5, 6]}
)

CLI

# Deploy a model
mlflow deployments create -t modal -m runs:/abc123/model --name my-model

# Deploy with GPU
mlflow deployments create -t modal -m runs:/abc123/model --name gpu-model \
    -C gpu=T4 -C memory=4096

# List deployments
mlflow deployments list -t modal

# Get deployment info
mlflow deployments get -t modal --name my-model

# Delete deployment
mlflow deployments delete -t modal --name my-model

Configuration Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| gpu | str/list | None | GPU type (T4, L4, L40S, A10, A100, A100-40GB, A100-80GB, H100, H200, B200), multi-GPU (H100:8), dedicated (H100!), or fallback list (["H100", "A100"]) |
| memory | int | 512 | Memory allocation in MB |
| cpu | float | 1.0 | CPU cores |
| timeout | int | 300 | Request timeout in seconds |
| startup_timeout | int | None | Container startup timeout (overrides timeout during model loading) |
| scaledown_window | int | 60 | Seconds before idle container scales down |
| concurrent_inputs | int | 1 | Max concurrent requests per container |
| target_inputs | int | None | Target concurrency for autoscaler (enables smarter scaling) |
| min_containers | int | 0 | Minimum warm containers |
| max_containers | int | None | Maximum containers |
| buffer_containers | int | None | Extra idle containers to maintain under load |
| enable_batching | bool | False | Enable dynamic batching |
| max_batch_size | int | 8 | Max batch size when batching enabled |
| batch_wait_ms | int | 100 | Batch wait time in milliseconds |
| python_version | str | auto | Python version (auto-detected from model) |
| extra_pip_packages | list | [] | Additional pip packages to install at deployment time |
| pip_index_url | str | None | Custom PyPI index URL for private packages |
| pip_extra_index_url | str | None | Additional PyPI index URL (fallback) |
| modal_secret | str | None | Modal secret name containing pip credentials |
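The gpu option accepts several shapes (a single type, a count suffix, a dedicated marker, or a fallback list). The sketch below shows one way to read those shapes. parse_gpu and normalize_gpu_option are hypothetical helpers written for illustration and are not part of the plugin; the plugin's own validation may differ.

```python
# GPU names taken from the configuration table above.
KNOWN_GPUS = {
    "T4", "L4", "L40S", "A10", "A100",
    "A100-40GB", "A100-80GB", "H100", "H200", "B200",
}


def parse_gpu(spec):
    """Split one GPU string into (name, count, dedicated)."""
    dedicated = spec.endswith("!")          # "H100!" marks a dedicated GPU
    base = spec[:-1] if dedicated else spec
    name, _, count = base.partition(":")    # "H100:8" requests eight GPUs
    if name not in KNOWN_GPUS:
        raise ValueError(f"unknown GPU type: {name!r}")
    return name, int(count) if count else 1, dedicated


def normalize_gpu_option(gpu):
    """Accept None, a single string, or a fallback list; return parsed tuples."""
    if gpu is None:
        return []
    specs = [gpu] if isinstance(gpu, str) else list(gpu)
    return [parse_gpu(spec) for spec in specs]


print(normalize_gpu_option("H100:8"))
print(normalize_gpu_option(["H100", "A100"]))
```

Running a check like this before calling create_deployment() can surface a mistyped GPU name locally instead of during the container build.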

Authentication

Configure Modal authentication before deploying:

# Interactive setup
modal setup

# Or use environment variables
export MODAL_TOKEN_ID=your-token-id
export MODAL_TOKEN_SECRET=your-token-secret
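When relying on environment variables (for example in CI), a quick pre-flight check can report which ones are unset. This is a minimal sketch with a hypothetical helper name; it only inspects the two variables shown above and does not know about credentials stored by modal setup, so an empty environment is not necessarily an error.

```python
import os

# Variable names as shown in the Authentication section.
REQUIRED_VARS = ("MODAL_TOKEN_ID", "MODAL_TOKEN_SECRET")


def missing_modal_env(env=None):
    """Return the names of Modal token variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_modal_env()
if missing:
    print("Not set in the environment:", ", ".join(missing))
else:
    print("Both Modal token variables are set.")
```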

Local Testing (Recommended)

Before deploying to Modal's cloud infrastructure, test your deployment locally to catch issues early:

from mlflow_modal import run_local

run_local(
    target_uri="modal",
    name="test-model",
    model_uri="runs:/abc123/model",
    config={"gpu": "T4"}
)

This runs modal serve locally, allowing you to verify:

  • Model loads correctly with all dependencies
  • Inference endpoint responds as expected
  • GPU configuration is valid

Once local testing passes, deploy to production with create_deployment().

Advanced Usage

Streaming Predictions

For LLM and generative models, use predict_stream() for token-by-token streaming responses. This API is compatible with MLflow's Databricks client, enabling consistent code across deployment targets.

from mlflow.deployments import get_deploy_client

client = get_deploy_client("modal")

# Stream predictions (for LLM models)
for chunk in client.predict_stream(
    deployment_name="my-llm",
    inputs={
        "messages": [{"role": "user", "content": "Hello!"}],
        "temperature": 0.7,
        "max_tokens": 100,
    },
):
    print(chunk, end="", flush=True)

How it works:

  • Models with native predict_stream() support (LLMs) stream token-by-token
  • Non-streaming models (sklearn, XGBoost, etc.) return predictions in a single chunk
  • Uses Server-Sent Events (SSE) format for efficient streaming over HTTP
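For readers unfamiliar with Server-Sent Events, the sketch below parses the generic SSE wire format: data: lines accumulate into one event, and a blank line ends the event. It illustrates the format only. The exact payload the plugin places in each event is not documented here, so treating each event as plain text is an assumption, and iter_sse_data is a hypothetical helper.

```python
def iter_sse_data(lines):
    """Yield the data payload of each Server-Sent Event in an iterable of lines."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space after the colon is dropped
        if field == "data":
            buffer.append(value)
    if buffer:
        # Lenient choice: also emit a trailing event that lacks a final blank line.
        yield "\n".join(buffer)


stream = ["data: Hel\n", "\n", ": keep-alive\n", "data: lo\n", "data: !\n", "\n"]
events = list(iter_sse_data(stream))
print(events)
```

In normal use client.predict_stream() handles this for you; the parser is only useful when calling the HTTP endpoint directly.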

Deploy to Specific Workspace

from mlflow.deployments import get_deploy_client

# Use a workspace-specific target URI
client = get_deploy_client("modal:/production")

Or via CLI:

mlflow deployments create -t modal:/production -m runs:/abc123/model --name my-model

High-Throughput Deployment with Batching

client.create_deployment(
    name="batch-classifier",
    model_uri="runs:/abc123/model",
    config={
        "gpu": "A100",
        "enable_batching": True,
        "max_batch_size": 32,
        "batch_wait_ms": 50,
        "min_containers": 2,
        "max_containers": 20,
    }
)
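To build intuition for how max_batch_size and batch_wait_ms interact, the sketch below simulates a simplified batching policy: a batch is flushed when it is full or when the wait window, measured from the first request in the batch, has passed. This is a toy model under that stated assumption, not Modal's or the plugin's actual scheduler.

```python
def simulate_batches(arrivals_ms, max_batch_size, batch_wait_ms):
    """Group request arrival times (in ms) into batches under a simplified policy."""
    batches = []
    current = []
    deadline = None
    for t in sorted(arrivals_ms):
        if current and t > deadline:
            # The wait window expired before this request arrived.
            batches.append(current)
            current = []
        if not current:
            # Assumption: the window starts with the first request of a batch.
            deadline = t + batch_wait_ms
        current.append(t)
        if len(current) == max_batch_size:
            batches.append(current)  # batch is full, flush immediately
            current = []
    if current:
        batches.append(current)
    return batches


batches = simulate_batches([0, 10, 20, 200, 210], max_batch_size=2, batch_wait_ms=50)
print(batches)  # [[0, 10], [20], [200, 210]]
```

A larger batch_wait_ms tends to produce fuller batches at the cost of added latency for the first request in each batch, which is the trade-off to tune for a given workload.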

Adding Extra Packages at Deployment Time

Use extra_pip_packages when the model's auto-detected requirements are incomplete or you need production-specific packages:

client.create_deployment(
    name="my-model",
    model_uri="runs:/abc123/model",
    config={
        "gpu": "A100",
        "extra_pip_packages": [
            "accelerate>=0.24",      # GPU inference optimization
            "prometheus_client",     # Monitoring
            "structlog",             # Production logging
        ],
    }
)

Common use cases:

  • Missing transitive dependencies: Packages MLflow didn't auto-detect
  • Inference optimizations: accelerate, bitsandbytes, onnxruntime-gpu
  • Production monitoring: prometheus_client, opentelemetry-api
  • Version overrides: Pin specific versions for compatibility

Deploying with Private Packages

For private PyPI servers or authenticated package repositories:

Step 1: Create a Modal secret with your credentials:

# Create a secret with your private PyPI credentials
modal secret create pypi-auth \
    PIP_INDEX_URL="https://user:token@pypi.my-company.com/simple/" \
    PIP_EXTRA_INDEX_URL="https://pypi.tw.martin98.com/simple/"

Step 2: Reference the secret in your deployment:

client.create_deployment(
    name="my-model",
    model_uri="runs:/abc123/model",
    config={
        # Option 1: Use Modal secret for authenticated access
        "modal_secret": "pypi-auth",
        "extra_pip_packages": ["my-private-package>=1.0"],

        # Option 2: Direct URL (for unauthenticated private repos)
        # "pip_index_url": "https://pypi.my-company.com/simple/",
        # "pip_extra_index_url": "https://pypi.tw.martin98.com/simple/",
    }
)

Supported private package sources:

  • Private PyPI servers: Artifactory, CodeArtifact, DevPI, Nexus
  • Authenticated indexes: Any pip-compatible index with auth tokens
  • Wheel files: Already supported via the code/ directory in model artifacts
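Choosing between the two options above can be expressed as a small rule: if the index URL embeds credentials, keep it out of the deployment config and reference a Modal secret instead. The helper below is a hypothetical sketch of that rule and assumes the secret has already been created as in Step 1.

```python
from urllib.parse import urlsplit


def private_index_config(index_url, secret_name):
    """Return config entries for a private index without embedding credentials."""
    parts = urlsplit(index_url)
    if parts.username or parts.password:
        # Credentials in the URL: rely on the pre-created Modal secret instead.
        return {"modal_secret": secret_name}
    # No credentials: the URL can be passed directly.
    return {"pip_index_url": index_url}


print(private_index_config("https://user:token@pypi.my-company.com/simple/", "pypi-auth"))
print(private_index_config("https://pypi.my-company.com/simple/", "pypi-auth"))
```

The returned dictionary can be merged into the config argument of create_deployment() alongside extra_pip_packages.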

Models with Private Dependencies

If your model includes wheel files in the code/ directory, they are automatically detected and installed:

model/
├── MLmodel
├── requirements.txt
├── code/
│   └── my_private_package-1.0.0-py3-none-any.whl  # Auto-detected
└── ...
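To confirm which wheel files a model directory would contribute, you can list them yourself before deploying. The sketch below mirrors the layout above using a temporary directory; find_wheels is a hypothetical helper, and the plugin's own detection logic may differ.

```python
import tempfile
from pathlib import Path


def find_wheels(model_dir):
    """Return the sorted names of wheel files in <model_dir>/code."""
    code_dir = Path(model_dir) / "code"
    return sorted(path.name for path in code_dir.glob("*.whl"))


# Demo on a throwaway directory that imitates the layout shown above.
with tempfile.TemporaryDirectory() as tmp:
    code_dir = Path(tmp) / "code"
    code_dir.mkdir()
    (code_dir / "my_private_package-1.0.0-py3-none-any.whl").touch()
    (code_dir / "helpers.py").touch()  # non-wheel files are ignored
    found = find_wheels(tmp)

print(found)
```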

Troubleshooting

Modal Authentication Fails

# Re-authenticate with Modal
modal setup

# Verify authentication
modal profile list

"MLmodel not found" Error

  • Ensure the model was logged with mlflow.pyfunc.log_model() or a similar MLflow logging function
  • Verify the model URI is correct: runs:/<run_id>/model or models:/<name>/<version>
  • Check that the model directory contains an MLmodel file
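A quick shape check on the model URI can rule out typos before a deployment is attempted. The sketch below recognizes only the two forms listed above; MLflow accepts other URI schemes as well, which this hypothetical helper deliberately reports as unrecognized rather than invalid.

```python
import re

# Patterns for the two URI forms mentioned in this section.
RUNS_URI = re.compile(r"^runs:/(?P<run_id>[^/]+)/(?P<path>.+)$")
MODELS_URI = re.compile(r"^models:/(?P<name>[^/]+)/(?P<version>[^/]+)$")


def classify_model_uri(uri):
    """Return "runs", "models", or None when the URI matches neither form."""
    if RUNS_URI.match(uri):
        return "runs"
    if MODELS_URI.match(uri):
        return "models"
    return None


for uri in ("runs:/abc123/model", "models:/my-model/3", "runs:/abc123"):
    print(uri, "->", classify_model_uri(uri))
```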

Deployment Times Out

For large models that take longer to load:

client.create_deployment(
    name="large-model",
    model_uri="runs:/abc123/model",
    config={
        "startup_timeout": 600,  # 10 minutes for model loading
        "timeout": 300,          # 5 minutes for inference requests
    }
)

Missing Dependencies at Runtime

If the model fails with import errors:

client.create_deployment(
    name="my-model",
    model_uri="runs:/abc123/model",
    config={
        "extra_pip_packages": ["missing-package>=1.0"],
    }
)

View Build Logs

Check the Modal Dashboard for detailed build and runtime logs.

Requirements

  • Python 3.10+
  • MLflow 2.10.0+
  • Modal 1.0.0+

Contributing

Contributions welcome! Please see CONTRIBUTING.md for guidelines.

Development Setup

# Clone the repository
git clone https://github.com/debu-sinha/mlflow-modal-deploy.git
cd mlflow-modal-deploy

# Install with dev dependencies
uv sync --extra dev

# Install pre-commit hooks
uv run pre-commit install

# Run tests
uv run pytest tests/ -v

License

Apache License 2.0

Acknowledgments

  • MLflow - Open source platform for the ML lifecycle
  • Modal - Serverless cloud for AI/ML
