LLM inference serving — deploy any HuggingFace or custom model behind a modal.web_endpoint() with token streaming, WebSocket support, and sub-10ms overhead latency via globally distributed compute.
Multi-node distributed training — configure gang-scheduled multi-node runs on up to 128 B200s with 3200 Gbps InfiniBand using Modal's cluster API in a single Python file.
Batch & async inference pipelines — process large-scale embedding generation, re-ranking, or dataset synthesis jobs across thousands of parallel GPU workers with no job orchestration overhead.
Sandbox execution for RL rollouts — programmatically instantiate hundreds of thousands of concurrent modal.Sandbox environments for reinforcement learning trajectory collection, keeping GPU inference resources saturated.
Parallel hyperparameter sweeps — use .map() or .starmap() to fan out hundreds of training experiments simultaneously, with automatic resource cleanup and per-second billing.
Secure agent execution environments — build background or coding agents that run in fully isolated sandboxes with custom images, injected secrets, and controlled network access.
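
The fan-out shape behind the hyperparameter sweep item can be sketched locally. This is a minimal illustration that uses only the Python standard library as a stand-in for remote workers, not Modal's own API: the `train` function, its toy scoring formula, the `run_sweep` helper, and the grid values are all hypothetical. On Modal, the same pattern would presumably go through the `.map()` or `.starmap()` calls named in the list, with each grid point running in its own container.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def train(learning_rate: float, batch_size: int) -> dict:
    """Hypothetical stand-in for one training run; returns a toy score."""
    # Toy objective (not a real loss): lowest at learning_rate=0.01, batch_size=64.
    score = abs(learning_rate - 0.01) * 100 + abs(batch_size - 64) / 64
    return {"learning_rate": learning_rate, "batch_size": batch_size, "score": score}


def run_sweep(learning_rates, batch_sizes, max_workers=4):
    """Fan out one train() call per grid point; results come back in grid order."""
    grid = list(product(learning_rates, batch_sizes))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Local analogue of a starmap-style fan-out: each tuple is
        # unpacked into the positional arguments of train().
        return list(pool.map(lambda args: train(*args), grid))


results = run_sweep([0.001, 0.01, 0.1], [32, 64])
best = min(results, key=lambda r: r["score"])
print(f"{len(results)} runs, best config: {best}")
```

The grid is built up front so that every experiment is an independent unit of work; that independence is what lets a sweep scale out to many workers without any coordination between runs.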
