Quick Start Guide
Overview
Hypha is an open-source tool for democratizing large-scale machine learning across heterogeneous compute resources. It helps discover and utilize diverse compute resources (CPUs, GPUs, and specialized accelerators) to run distributed training and inference with minimal operational overhead.
This guide walks through installation, configuration, and running a first distributed training job.
Prerequisites
Make sure you have the following installed:
| Requirement | Version | Check Command | Install Documentation |
|---|---|---|---|
| uv | ≥ 0.9.7 | uv --version | https://docs.astral.sh/uv/getting-started/installation/ |
| git | any | git --version | https://git-scm.com/book/en/v2/Getting-Started-Installing-Git |
| git-lfs | any | git-lfs --version | https://git-lfs.github.com/ |
| curl | any | curl --version | https://curl.se/docs/install.html |
Also if you want to build from source, you need your rust toolchain.
Installation
Install Hypha using the standalone installer script:
curl -fsSL https://hypha-space.org/install.sh | sh
For alternative installation methods, see the Installation Guide.
Configuration
All Hypha components require setup, and certificates need to be prepared.
Certificates
Create the necessary certificates first. See hypha-certutil --help for details.
hypha-certutil root --organization root
hypha-certutil org --root-cert ./root-ca-cert.pem --root-key ./root-ca-key.pem -o quickstart
hypha-certutil node --ca-cert ./quickstart-ca-cert.pem --ca-key ./quickstart-ca-key.pem -n gateway
hypha-certutil node --ca-cert ./quickstart-ca-cert.pem --ca-key ./quickstart-ca-key.pem -n scheduler
hypha-certutil node --ca-cert ./quickstart-ca-cert.pem --ca-key ./quickstart-ca-key.pem -n worker1
hypha-certutil node --ca-cert ./quickstart-ca-cert.pem --ca-key ./quickstart-ca-key.pem -n worker2
hypha-certutil node --ca-cert ./quickstart-ca-cert.pem --ca-key ./quickstart-ca-key.pem -n worker3
hypha-certutil node --ca-cert ./quickstart-ca-cert.pem --ca-key ./quickstart-ca-key.pem -n data1
hypha-certutilis only for development and testing. For production clusters, follow the mTLS guidance in Security and issue node certificates from your organization’s PKI with proper rotation and CRL distribution.
Setting Up Nodes
Set up the nodes next. This guide assumes the following configuration:
Gateway Node
Generate a configuration for local development using hypha-gateway init:
hypha-gateway init -n gateway -o gateway-config.toml --exclude-cidr 192.0.2.0/24
Configuration written to: "gateway-config.toml"
Scheduler Node
Generate the necessary configuration using hypha-scheduler init:
hypha-scheduler init -n scheduler -o scheduler-config.toml --exclude-cidr 192.0.2.0/24 --gateway /ip4/127.0.0.1/tcp/8080
Configuration written to: "scheduler-config.toml"
Send metrics to AIM
If you want the Scheduler to send metrics to AIM, you can download and set up our AIM Driver Connector from our releases page. Please follow its instructions to set up the connector and scheduler.
Worker Nodes
Generate worker configuration using hypha-worker init:
hypha-worker init -n worker1 -o worker1-config.toml --exclude-cidr 192.0.2.0/24 --gateway /ip4/127.0.0.1/tcp/8080
hypha-worker init -n worker2 -o worker2-config.toml --exclude-cidr 192.0.2.0/24 --gateway /ip4/127.0.0.1/tcp/8080
hypha-worker init -n worker3 -o worker3-config.toml --exclude-cidr 192.0.2.0/24 --gateway /ip4/127.0.0.1/tcp/8080
Configuration written to: "worker1-config.toml"
Configuration written to: "worker2-config.toml"
Configuration written to: "worker3-config.toml"
Accelerate Configuration
Create an accelerate.yaml file with the following configuration:
cat <<'EOF' > accelerate.yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: "NO"
downcast_bf16: "no"
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: "no"
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
EOF
Data Node
Generate a data configuration using hypha-data init:
hypha-data init -n data1 --dataset-path="mnist/mnist" -o data-config.toml --exclude-cidr 192.0.2.0/24 --gateway /ip4/127.0.0.1/tcp/8080
Configuration written to: "data-config.toml"
Download the Training Dataset
In the project folder (where all configuration files and certificates are), clone the HuggingFace dataset. To ensure a clean folder structure afterwards, use this quick workaround:
git clone https://huggingface.co/datasets/hypha-space/mnist
Cloning into 'mnist'...
Congratulations! Hypha is now set up correctly.
Running Hypha for Training
With everything set up, start the Hypha nodes and begin training.
In different terminals, start the gateway first:
hypha-gateway run -c gateway-config.toml
INFO libp2p_swarm: local_peer_id=<PEER_ID>
INFO hypha_network::listen: Listening address=/ip4/127.0.0.1/tcp/8080
INFO hypha_network::listen: Listening address=/ip4/127.0.0.1/udp/8080/quic-v1
WARN libp2p_kad::behaviour: Failed to trigger bootstrap: No known peers.
Then start the worker and data nodes (the gateway must be running first):
hypha-worker run -c worker1-config.tomlhypha-worker run -c worker2-config.tomlhypha-worker run -c worker3-config.tomlhypha-data run -c data-config.toml
Finally, once all the nodes above are running properly, start the scheduler to begin training:
hypha-scheduler run -c scheduler-config.toml
Expected Output
After successfully running training, the scheduler terminal should display these messages:
2025-11-13T15:59:07.446178Z INFO hypha_scheduler::scheduling::batch_scheduler: Job is completed.
2025-11-13T15:59:07.446185Z INFO hypha_scheduler: Batch Scheduler finished
2025-11-13T15:59:07.446455Z ERROR hypha_network::request_response: Failed to unregister handler error=Other("Failed to send action")
2025-11-13T15:59:07.446471Z ERROR hypha_network::request_response: Failed to unregister handler error=Other("Failed to send action")
2025-11-13T15:59:07.446482Z INFO hypha_scheduler::scheduling::data_scheduler: Data handler finished.
NOTE: We are aware of the error messages and they are expected (for now). As long as these are the only errors, everything worked as expected.
If you encounter any issue, please have a look at the Troubleshooting Guide.