# mlflow-modal-deploy

Deploy MLflow models to Modal's serverless GPU infrastructure with a single command.

If you find this project useful, please consider giving it a star! It helps others discover the project and motivates continued development. Using it in production? Share your experience - we'd love to hear from you!

## Installation

```bash
pip install mlflow-modal-deploy
```
## Features

- One-command deployment: Deploy any MLflow model to Modal's serverless infrastructure
- GPU support: T4, L4, L40S, A10, A100, A100-40GB, A100-80GB, H100, H200, B200
- Streaming predictions: `predict_stream()` API compatible with the MLflow Databricks client
- Auto-scaling: Configure min/max containers and scale-down windows
- Dynamic batching: Built-in request batching for high-throughput workloads
- Automatic dependency detection: Extracts requirements from model artifacts
- Wheel file support: Handles private dependencies packaged as wheel files
- Private PyPI support: Deploy with private packages via `pip_index_url` or Modal secrets
- MLflow CLI integration: Use familiar `mlflow deployments` commands
## How it Works

```mermaid
flowchart LR
    A[MLflow Model] --> B[Extract Dependencies]
    B --> C[Modal Volume]
    C --> D[Generate Modal App]
    D --> E[HTTPS Endpoint]
```

- Extract: MLflow model artifacts and dependencies are extracted from the model URI
- Upload: Model files are uploaded to a Modal Volume for persistent storage
- Generate: A Modal app is generated with FastAPI endpoints (`/invocations`, `/predict_stream`)
- Deploy: Modal builds a container with all dependencies and deploys it to serverless infrastructure
- Serve: An HTTPS endpoint URL is returned, ready to handle prediction requests

The generated container mirrors your training environment, ensuring consistent behavior between development and production.
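The dependency-extraction step can be sketched as follows. This is a hypothetical illustration, not the plugin's actual implementation: it just shows the kind of parsing involved in reading a model's logged `requirements.txt`.

```python
# Illustrative sketch of the "Extract Dependencies" step: read an MLflow
# model's requirements.txt and return the pip requirement strings.
# NOT the plugin's real code - assumptions: plain requirements.txt syntax,
# no "#" characters inside requirement URLs.

def parse_requirements(text: str) -> list[str]:
    """Return pip requirements, skipping comments and blank lines."""
    reqs = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if line:
            reqs.append(line)
    return reqs

example = """
# requirements.txt logged alongside the model
mlflow==2.10.0
scikit-learn>=1.3   # model flavor dependency

pandas
"""
print(parse_requirements(example))  # → ['mlflow==2.10.0', 'scikit-learn>=1.3', 'pandas']
```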
## Quick Start

### Python API

```python
from mlflow.deployments import get_deploy_client

# Get the Modal deployment client
client = get_deploy_client("modal")

# Deploy a model
deployment = client.create_deployment(
    name="my-classifier",
    model_uri="runs:/abc123/model",
    config={
        "gpu": "T4",
        "memory": 2048,
        "min_containers": 1,
    },
)
print(f"Deployed to: {deployment['endpoint_url']}")

# Make predictions
predictions = client.predict(
    deployment_name="my-classifier",
    inputs={"feature1": [1, 2, 3], "feature2": [4, 5, 6]},
)
```
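Under the hood, `client.predict()` posts JSON to the deployment's `/invocations` endpoint. The sketch below builds the same request body by hand, assuming the endpoint follows MLflow's standard scoring payload shape (an `inputs` key); the endpoint URL is a placeholder, not a real deployment.

```python
import json

# Build the JSON body that client.predict() would send to /invocations.
# Assumption: the endpoint accepts MLflow's standard scoring protocol
# with an "inputs" key.
payload = {"inputs": {"feature1": [1, 2, 3], "feature2": [4, 5, 6]}}
body = json.dumps(payload)

# To call the endpoint directly (URL is a placeholder):
# requests.post("https://<your-app>.modal.run/invocations",
#               data=body, headers={"Content-Type": "application/json"})
print(body)
```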
### CLI

```bash
# Deploy a model
mlflow deployments create -t modal -m runs:/abc123/model --name my-model

# Deploy with GPU
mlflow deployments create -t modal -m runs:/abc123/model --name gpu-model \
  -C gpu=T4 -C memory=4096

# List deployments
mlflow deployments list -t modal

# Get deployment info
mlflow deployments get -t modal --name my-model

# Delete deployment
mlflow deployments delete -t modal --name my-model
```
## Configuration Options

| Option | Type | Default | Description |
|---|---|---|---|
| `gpu` | `str`/`list` | `None` | GPU type (T4, L4, L40S, A10, A100, A100-40GB, A100-80GB, H100, H200, B200), multi-GPU (`H100:8`), dedicated (`H100!`), or fallback list (`["H100", "A100"]`) |
| `memory` | `int` | `512` | Memory allocation in MB |
| `cpu` | `float` | `1.0` | CPU cores |
| `timeout` | `int` | `300` | Request timeout in seconds |
| `startup_timeout` | `int` | `None` | Container startup timeout (overrides `timeout` during model loading) |
| `scaledown_window` | `int` | `60` | Seconds before an idle container scales down |
| `concurrent_inputs` | `int` | `1` | Max concurrent requests per container |
| `target_inputs` | `int` | `None` | Target concurrency for the autoscaler (enables smarter scaling) |
| `min_containers` | `int` | `0` | Minimum warm containers |
| `max_containers` | `int` | `None` | Maximum containers |
| `buffer_containers` | `int` | `None` | Extra idle containers to maintain under load |
| `enable_batching` | `bool` | `False` | Enable dynamic batching |
| `max_batch_size` | `int` | `8` | Max batch size when batching is enabled |
| `batch_wait_ms` | `int` | `100` | Batch wait time in milliseconds |
| `python_version` | `str` | auto | Python version (auto-detected from model) |
| `extra_pip_packages` | `list` | `[]` | Additional pip packages to install at deployment time |
| `pip_index_url` | `str` | `None` | Custom PyPI index URL for private packages |
| `pip_extra_index_url` | `str` | `None` | Additional PyPI index URL (fallback) |
| `modal_secret` | `str` | `None` | Modal secret name containing pip credentials |
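The `gpu` option accepts several shapes. A small, hypothetical validator (not part of the plugin's API) illustrates the accepted forms: a bare type, a `TYPE:count` multi-GPU spec, a `TYPE!` dedicated spec, or a fallback list.

```python
# Hypothetical helper illustrating the gpu config grammar documented above.
SUPPORTED_GPUS = {"T4", "L4", "L40S", "A10", "A100", "A100-40GB",
                  "A100-80GB", "H100", "H200", "B200"}

def is_valid_gpu(spec) -> bool:
    """Check a gpu value: "H100", "H100:8", "H100!", or a fallback list."""
    if isinstance(spec, list):                 # fallback list: all entries valid
        return bool(spec) and all(is_valid_gpu(s) for s in spec)
    if not isinstance(spec, str):
        return False
    if spec.endswith("!"):                     # dedicated GPU marker
        spec = spec[:-1]
    base, _, count = spec.partition(":")       # multi-GPU "TYPE:count"
    if count and not count.isdigit():
        return False
    return base in SUPPORTED_GPUS

print(is_valid_gpu("H100:8"))          # → True
print(is_valid_gpu(["H100", "A100"]))  # → True
print(is_valid_gpu("RTX4090"))         # → False
```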
## Authentication

Configure Modal authentication before deploying:

```bash
# Interactive setup
modal setup

# Or use environment variables
export MODAL_TOKEN_ID=your-token-id
export MODAL_TOKEN_SECRET=your-token-secret
```
## Local Testing (Recommended)

Before deploying to Modal's cloud infrastructure, test your deployment locally to catch issues early:

```python
from mlflow_modal import run_local

run_local(
    target_uri="modal",
    name="test-model",
    model_uri="runs:/abc123/model",
    config={"gpu": "T4"},
)
```

This runs `modal serve` locally, allowing you to verify that:

- The model loads correctly with all dependencies
- The inference endpoint responds as expected
- The GPU configuration is valid

Once local testing passes, deploy to production with `create_deployment()`.
## Advanced Usage

### Streaming Predictions

For LLMs and generative models, use `predict_stream()` for token-by-token streaming responses. This API is compatible with MLflow's Databricks client, enabling consistent code across deployment targets.

```python
from mlflow.deployments import get_deploy_client

client = get_deploy_client("modal")

# Stream predictions (for LLM models)
for chunk in client.predict_stream(
    deployment_name="my-llm",
    inputs={
        "messages": [{"role": "user", "content": "Hello!"}],
        "temperature": 0.7,
        "max_tokens": 100,
    },
):
    print(chunk, end="", flush=True)
```

How it works:

- Models with native `predict_stream()` support (LLMs) stream token-by-token
- Non-streaming models (sklearn, XGBoost, etc.) return predictions in a single chunk
- Uses Server-Sent Events (SSE) format for efficient streaming over HTTP
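Because the stream is delivered as Server-Sent Events, a plain HTTP client can consume it too. The parser below is a minimal, hypothetical sketch of splitting an SSE payload into data chunks; it is not the plugin's client code, and the `[DONE]` sentinel in the sample is illustrative only.

```python
# Minimal sketch of SSE parsing: events are separated by blank lines and
# each "data:" field carries one chunk. Illustration only.

def parse_sse(raw: str) -> list[str]:
    """Extract the data payloads from a Server-Sent Events stream."""
    chunks = []
    for event in raw.split("\n\n"):            # events are blank-line separated
        for line in event.splitlines():
            if line.startswith("data: "):      # data field carries the chunk
                chunks.append(line[len("data: "):])
    return chunks

stream = "data: Hello\n\ndata: , world\n\ndata: [DONE]\n\n"
print(parse_sse(stream))   # → ['Hello', ', world', '[DONE]']
```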
### Deploy to a Specific Workspace

```python
# Use a workspace-specific URI
client = get_deploy_client("modal:/production")
```

Or via the CLI:

```bash
mlflow deployments create -t modal:/production -m runs:/abc123/model --name my-model
```
### High-Throughput Deployment with Batching

```python
client.create_deployment(
    name="batch-classifier",
    model_uri="runs:/abc123/model",
    config={
        "gpu": "A100",
        "enable_batching": True,
        "max_batch_size": 32,
        "batch_wait_ms": 50,
        "min_containers": 2,
        "max_containers": 20,
    },
)
```
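As a back-of-envelope way to reason about these settings: `batch_wait_ms` bounds the extra latency a request can incur while a batch fills, and per-container throughput is roughly batch size divided by (wait + inference time). The helper below is illustrative arithmetic, not a plugin API; the 150 ms inference time is an assumed figure.

```python
# Rough capacity estimate for one container under dynamic batching.
# Assumption: batches fill completely and each batch runs one inference pass.

def batch_throughput_per_container(max_batch_size: int,
                                   batch_wait_ms: float,
                                   infer_ms: float) -> float:
    """Approximate requests/sec one container sustains when batches fill."""
    cycle_s = (batch_wait_ms + infer_ms) / 1000.0  # wait + one inference pass
    return max_batch_size / cycle_s

# With the config above (batches of 32, 50 ms wait) and an assumed
# 150 ms inference pass per batch:
print(round(batch_throughput_per_container(32, 50, 150)))  # → 160 req/s
```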
### Adding Extra Packages at Deployment Time

Use `extra_pip_packages` when the model's auto-detected requirements are incomplete or you need production-specific packages:

```python
client.create_deployment(
    name="my-model",
    model_uri="runs:/abc123/model",
    config={
        "gpu": "A100",
        "extra_pip_packages": [
            "accelerate>=0.24",   # GPU inference optimization
            "prometheus_client",  # Monitoring
            "structlog",          # Production logging
        ],
    },
)
```

Common use cases:

- Missing transitive dependencies: packages MLflow didn't auto-detect
- Inference optimizations: `accelerate`, `bitsandbytes`, `onnxruntime-gpu`
- Production monitoring: `prometheus_client`, `opentelemetry-api`
- Version overrides: pin specific versions for compatibility
### Deploying with Private Packages

For private PyPI servers or authenticated package repositories:

Step 1: Create a Modal secret with your credentials:

```bash
# Create a secret with your private PyPI credentials
modal secret create pypi-auth \
  PIP_INDEX_URL="https://user:token@pypi.my-company.com/simple/" \
  PIP_EXTRA_INDEX_URL="https://pypi.tw.martin98.com/simple/"
```

Step 2: Reference the secret in your deployment:

```python
client.create_deployment(
    name="my-model",
    model_uri="runs:/abc123/model",
    config={
        # Option 1: Use a Modal secret for authenticated access
        "modal_secret": "pypi-auth",
        "extra_pip_packages": ["my-private-package>=1.0"],
        # Option 2: Direct URL (for unauthenticated private repos)
        # "pip_index_url": "https://pypi.my-company.com/simple/",
        # "pip_extra_index_url": "https://pypi.tw.martin98.com/simple/",
    },
)
```

Supported private package sources:

- Private PyPI servers: Artifactory, CodeArtifact, DevPI, Nexus
- Authenticated indexes: any pip-compatible index with auth tokens
- Wheel files: already supported via the `code/` directory in model artifacts
### Models with Private Dependencies

If your model includes wheel files in the `code/` directory, they are automatically detected and installed:

```
model/
├── MLmodel
├── requirements.txt
├── code/
│   └── my_private_package-1.0.0-py3-none-any.whl  # Auto-detected
└── ...
```
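A hypothetical sketch of that auto-detection (not the plugin's actual code): scan the model directory's `code/` subdirectory for `.whl` files.

```python
# Illustration of wheel auto-detection in a model's code/ directory.
# NOT the plugin's real implementation.
from pathlib import Path
import tempfile

def find_wheels(model_dir: str) -> list[str]:
    """Return wheel filenames found under the model's code/ directory."""
    code_dir = Path(model_dir) / "code"
    if not code_dir.is_dir():
        return []
    return sorted(p.name for p in code_dir.glob("*.whl"))

# Demo with a throwaway model layout
with tempfile.TemporaryDirectory() as d:
    (Path(d) / "code").mkdir()
    (Path(d) / "code" / "my_private_package-1.0.0-py3-none-any.whl").touch()
    (Path(d) / "MLmodel").touch()
    print(find_wheels(d))  # → ['my_private_package-1.0.0-py3-none-any.whl']
```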
## Troubleshooting

### Modal Authentication Fails

```bash
# Re-authenticate with Modal
modal setup

# Verify authentication
modal profile list
```

### "MLmodel not found" Error

- Ensure the model was logged with `mlflow.pyfunc.log_model()` or a similar MLflow logging function
- Verify the model URI is correct: `runs:/<run_id>/model` or `models:/<name>/<version>`
- Check that the model directory contains an `MLmodel` file
### Deployment Times Out

For large models that take longer to load:

```python
client.create_deployment(
    name="large-model",
    model_uri="runs:/abc123/model",
    config={
        "startup_timeout": 600,  # 10 minutes for model loading
        "timeout": 300,          # 5 minutes for inference requests
    },
)
```
### Missing Dependencies at Runtime

If the model fails with import errors, add the missing packages explicitly:

```python
client.create_deployment(
    name="my-model",
    model_uri="runs:/abc123/model",
    config={
        "extra_pip_packages": ["missing-package>=1.0"],
    },
)
```
### View Build Logs

Check the Modal Dashboard for detailed build and runtime logs.

## Requirements

- Python 3.10+
- MLflow 2.10.0+
- Modal 1.0.0+

## Contributing

Contributions welcome! Please see CONTRIBUTING.md for guidelines.
### Development Setup

```bash
# Clone the repository
git clone https://github.com/debu-sinha/mlflow-modal-deploy.git
cd mlflow-modal-deploy

# Install with dev dependencies
uv sync --extra dev

# Install pre-commit hooks
uv run pre-commit install

# Run tests
uv run pytest tests/ -v
```
## License

Apache License 2.0

## Useful Links

- Modal Documentation - Modal platform docs and tutorials
- MLflow Deployment Guide - MLflow deployment concepts
- MLflow Model Format - Understanding MLflow models
- Modal GPU Guide - GPU types and configuration

## Support

- GitHub Issues - Bug reports and feature requests
- MLflow Slack - Community discussion
- Modal Community - Modal-specific questions
## File details

Details for the file `mlflow_modal_deploy-0.6.2.tar.gz`.

### File metadata

- Filename: mlflow_modal_deploy-0.6.2.tar.gz
- Size: 30.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `b4874e3fe01246a5fe0812fe11d23bdd2c4bdae5100c3038f4d018723ce146f8` |
| MD5 | `463b1220ec071049c8f2340993d3f23c` |
| BLAKE2b-256 | `be16e63a98a4ba989cdb910a5dc07e3c6a3ec13155ea5f8e4ae7cd4624cb0f6a` |
### Provenance

The following attestation bundle was made for `mlflow_modal_deploy-0.6.2.tar.gz`:

Publisher: `release.yml` on debu-sinha/mlflow-modal-deploy

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: mlflow_modal_deploy-0.6.2.tar.gz
- Subject digest: b4874e3fe01246a5fe0812fe11d23bdd2c4bdae5100c3038f4d018723ce146f8
- Sigstore transparency entry: 1068774143
- Permalink: debu-sinha/mlflow-modal-deploy@b82b5535229580677ec15d47be4dcba0953431ed
- Branch / Tag: refs/tags/v0.6.2
- Owner: https://github.com/debu-sinha
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@b82b5535229580677ec15d47be4dcba0953431ed
- Trigger Event: push
## File details

Details for the file `mlflow_modal_deploy-0.6.2-py3-none-any.whl`.

### File metadata

- Filename: mlflow_modal_deploy-0.6.2-py3-none-any.whl
- Size: 21.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `06c1d52d776bc1a61ce50cfc778c9843f7b7d9fb3ba3838a53c788a79b42ef85` |
| MD5 | `97da80c20236a2096708cd2e81e3db66` |
| BLAKE2b-256 | `a30eb9098879b3dd2606f7f2ab8f90956a9f77adcc9aaa14b99acf8fa381919e` |
### Provenance

The following attestation bundle was made for `mlflow_modal_deploy-0.6.2-py3-none-any.whl`:

Publisher: `release.yml` on debu-sinha/mlflow-modal-deploy

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: mlflow_modal_deploy-0.6.2-py3-none-any.whl
- Subject digest: 06c1d52d776bc1a61ce50cfc778c9843f7b7d9fb3ba3838a53c788a79b42ef85
- Sigstore transparency entry: 1068774193
- Permalink: debu-sinha/mlflow-modal-deploy@b82b5535229580677ec15d47be4dcba0953431ed
- Branch / Tag: refs/tags/v0.6.2
- Owner: https://github.com/debu-sinha
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@b82b5535229580677ec15d47be4dcba0953431ed
- Trigger Event: push