Hypha
Hypha is a self-managing system for distributed machine learning: think "Kubernetes for AI," but simpler. Train and serve models across heterogeneous infrastructure, from GPU farms to commodity hardware.
Get started in minutes by following the quick start guide.
Built on the battle-tested libp2p network stack with additional security hardening, Hypha maintains high security and reliability while remaining simple to set up. The system implements DiLoCo-style (Distributed Low-Communication) training, an approach that dramatically reduces communication overhead compared to traditional data-parallel training, making it feasible to train across data centers.
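The core DiLoCo idea can be sketched in a few lines: each worker takes many local optimizer steps on its own, and the network synchronizes only once per outer round by averaging parameter deltas. The toy sketch below is plain Python and illustrative only; it is not Hypha's API, and the function names and hyperparameters are made up for the example.

```python
# Minimal sketch of DiLoCo-style training (illustrative, not Hypha's API):
# workers run many local SGD steps, then sync infrequently by averaging
# parameter deltas, cutting communication from every step to once per round.

def local_steps(params, grads_fn, lr, steps):
    """Run `steps` local SGD updates with zero network traffic."""
    p = params[:]
    for _ in range(steps):
        g = grads_fn(p)
        p = [pi - lr * gi for pi, gi in zip(p, g)]
    return p

def diloco_round(global_params, workers, lr=0.1, local=50, outer_lr=1.0):
    """One outer round: independent local training, then a single sync."""
    deltas = []
    for grads_fn in workers:  # each worker trains on its own shard
        local_params = local_steps(global_params, grads_fn, lr, local)
        deltas.append([lp - gp for lp, gp in zip(local_params, global_params)])
    # the only communication step: average the deltas (the outer "gradient")
    avg = [sum(d[i] for d in deltas) / len(deltas)
           for i in range(len(global_params))]
    return [gp + outer_lr * a for gp, a in zip(global_params, avg)]

# Toy example: two workers minimizing (x - target)^2 with different targets.
w1 = lambda p: [2 * (p[0] - 3.0)]
w2 = lambda p: [2 * (p[0] - 5.0)]
params = [0.0]
for _ in range(5):
    params = diloco_round(params, [w1, w2])
print(round(params[0], 2))  # → 4.0, the consensus of both workers
```

With 50 local steps per sync, the workers exchange parameters 50× less often than lockstep data parallelism, which is why this style tolerates high-latency, low-bandwidth links.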
Key Features
- Distributed Training — Run DiLoCo-style training across workers with infrequent synchronization, ideal for bandwidth-constrained or geographically distributed setups. Learn more →
- Production Inference (in development) — The same decentralized architecture supports scalable, resilient inference serving with automatic load balancing.
- Security — End-to-end encryption via mTLS, certificate revocation for immediate access control, and a permissioned network model. Security guide →
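To make the mTLS point concrete: mutual TLS differs from ordinary TLS in that the server also demands and verifies a certificate from the client. A generic sketch using Python's standard `ssl` module, purely illustrative (the function name and file paths are hypothetical, and this is not how Hypha configures its nodes):

```python
# Illustrative mutual-TLS (mTLS) setup with Python's stdlib `ssl` module.
# Not Hypha's API; paths and names are placeholders.
import ssl

def make_mtls_context(ca_file=None, cert_file=None, key_file=None):
    """Build a server-side TLS context that REQUIRES a client certificate.
    Requiring (and verifying) the client's certificate is what turns
    one-way TLS into mutual TLS."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.verify_mode = ssl.CERT_REQUIRED  # reject peers without a valid cert
    if ca_file:
        ctx.load_verify_locations(ca_file)        # trust anchor: cluster CA
    if cert_file and key_file:
        ctx.load_cert_chain(cert_file, key_file)  # this node's own identity
    return ctx

ctx = make_mtls_context()
print(ctx.verify_mode == ssl.CERT_REQUIRED)  # True
```

In a permissioned network like the one described above, every node holds a certificate issued by a shared CA, so `verify_mode = CERT_REQUIRED` on both ends is what enforces that only enrolled nodes can join.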
Next Steps
New to Hypha? Start with the Quick Start to get a local cluster running in minutes, then explore the architecture and deployment guides when you're ready for production.
- Quick Start — Set up a local cluster and run your first training job
- Architecture — How Gateways, Schedulers, Workers, and Data Nodes fit together
- Deployment — Deploy Hypha on cloud infrastructure
Hypha
- Quick Start (Onboarding) — Step-by-step guide for installing Hypha, generating certificates, configuring nodes, and running a first training job.
- Installation (Reference) — Explains every supported method for installing or removing Hypha binaries and where to go next.
- Architecture Overview (Reference) — Explains how gateways, schedulers, workers, and data nodes interact across Hypha's decentralized network.
- DiLoCo Training (Onboarding) — Walkthrough of running Hypha's DiLoCo training workflow, explaining component roles and execution flow.
- Gateway (Reference) — Describes gateway responsibilities plus configuration, telemetry, and protocol participation guidance.
- Worker Node (Reference) — Details worker responsibilities, configuration options, and executor setup for running training or inference jobs.
- Data Node (Reference) — Documents how data nodes store, announce, and serve SafeTensors datasets, along with preparation guidance.
- Scheduler (Reference) — Covers scheduler duties, configuration, and job specification fields for orchestrating distributed training.
- Networking (Reference) — Explains how Hypha nodes discover and reach each other, plus guidance on ports, NAT, and external addresses.
- Security (Reference) — Outlines Hypha's mTLS design, certificate hierarchy, and node authentication flow.
- Troubleshooting (Reference) — Helps resolve common issues encountered when using Hypha.
- Deploying Hypha — Deploy Hypha on cloud infrastructure.

Reference
Command-line reference for Hypha tools.
- hypha-worker CLI — Auto-generated reference for the hypha-worker command, covering configuration and operational subcommands.
- hypha-gateway CLI — Auto-generated reference for the hypha-gateway command, including init, probe, and run usage.
- hypha-data CLI — Auto-generated reference for the hypha-data command, covering init, probe, and run modes.
- hypha-inspect CLI — Auto-generated reference for the hypha-inspect command, covering configuration and operational subcommands.
- hypha-scheduler CLI — Auto-generated reference for the hypha-scheduler command, its init/probe/run options, and flags.
- hypha-certutil CLI — Auto-generated reference for the hypha-certutil command and its subcommands.