
Quick Start Guide

Overview

Hypha is an open-source tool for democratizing large-scale machine learning across heterogeneous compute resources. It helps discover and utilize diverse compute resources (CPUs, GPUs, and specialized accelerators) to run distributed training and inference with minimal operational overhead.

This guide walks through installation, configuration, and running a first distributed training job.


Prerequisites

Make sure you have the following installed:

| Requirement | Version | Check Command | Install Documentation |
| --- | --- | --- | --- |
| uv | ≥ 0.9.7 | uv --version | https://docs.astral.sh/uv/getting-started/installation/ |
| git | any | git --version | https://git-scm.com/book/en/v2/Getting-Started-Installing-Git |
| git-lfs | any | git-lfs --version | https://git-lfs.github.com/ |
| curl | any | curl --version | https://curl.se/docs/install.html |

To build from source, you also need a Rust toolchain.
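The prerequisite checks can be scripted. The sketch below defines two helpers, `missing_tools` and `version_ge`; both names are illustrative and not part of Hypha. It assumes that `sort` supports `-V` (GNU coreutils and recent macOS/BSD do) and that `uv --version` prints the version number as its second field.

```shell
#!/bin/sh
# Sketch: verify the prerequisites listed in the table above.

# Print each command that is not on PATH; return non-zero if any is missing.
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            status=1
        fi
    done
    return "$status"
}

# Return 0 when dotted version $1 is greater than or equal to $2.
# Assumption: sort supports -V (version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage:
#   missing_tools uv git git-lfs curl
#   version_ge "$(uv --version | awk '{print $2}')" 0.9.7 || echo "uv is too old"
```

Running the check before installation catches a missing `git-lfs` early, which is otherwise easy to overlook until the dataset is cloned.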


Installation

Install Hypha using the standalone installer script:

curl -fsSL https://hypha-space.org/install.sh | sh

For alternative installation methods, see the Installation Guide.


Configuration

Every Hypha component needs its own configuration, and each node needs a certificate. Prepare the certificates first.


Certificates

Create the necessary certificates first. See hypha-certutil --help for details.

hypha-certutil root --organization root
hypha-certutil org --root-cert ./root-ca-cert.pem --root-key ./root-ca-key.pem -o quickstart
hypha-certutil node --ca-cert ./quickstart-ca-cert.pem --ca-key ./quickstart-ca-key.pem -n gateway
hypha-certutil node --ca-cert ./quickstart-ca-cert.pem --ca-key ./quickstart-ca-key.pem -n scheduler
hypha-certutil node --ca-cert ./quickstart-ca-cert.pem --ca-key ./quickstart-ca-key.pem -n worker1
hypha-certutil node --ca-cert ./quickstart-ca-cert.pem --ca-key ./quickstart-ca-key.pem -n worker2
hypha-certutil node --ca-cert ./quickstart-ca-cert.pem --ca-key ./quickstart-ca-key.pem -n worker3
hypha-certutil node --ca-cert ./quickstart-ca-cert.pem --ca-key ./quickstart-ca-key.pem -n data1

hypha-certutil is only for development and testing. For production clusters, follow the mTLS guidance in Security and issue node certificates from your organization’s PKI with proper rotation and CRL distribution.
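The six `hypha-certutil node` calls differ only in the node name, so they can be wrapped in a loop. This is a minimal sketch that reuses exactly the flags shown above. The function name `issue_node_certs` and the `CERTUTIL` override are illustrative additions, not Hypha settings; the override exists only so the loop can be dry-run with `CERTUTIL=echo`.

```shell
#!/bin/sh
# Sketch: issue one development certificate per node name.
# CERTUTIL is a hypothetical override for dry runs (e.g. CERTUTIL=echo).
issue_node_certs() {
    for name in "$@"; do
        "${CERTUTIL:-hypha-certutil}" node \
            --ca-cert ./quickstart-ca-cert.pem \
            --ca-key ./quickstart-ca-key.pem \
            -n "$name" || return 1
    done
}

# Usage (after the root and org certificates exist):
#   issue_node_certs gateway scheduler worker1 worker2 worker3 data1
```

The loop stops at the first failing call, so a missing CA file is reported once rather than six times.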


Setting Up Nodes

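Set up the nodes next. This guide assumes the following configuration: one gateway, one scheduler, three workers, and one data node, all running on the local machine and connecting through the gateway at /ip4/127.0.0.1/tcp/8080.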


Gateway Node

Generate a configuration for local development using hypha-gateway init:

hypha-gateway init -n gateway -o gateway-config.toml --exclude-cidr 192.0.2.0/24
Configuration written to: "gateway-config.toml"

Scheduler Node

Generate the necessary configuration using hypha-scheduler init:

hypha-scheduler init -n scheduler -o scheduler-config.toml --exclude-cidr 192.0.2.0/24 --gateway /ip4/127.0.0.1/tcp/8080
Configuration written to: "scheduler-config.toml"

Send metrics to AIM

To have the scheduler send metrics to AIM, download the AIM Driver Connector from the releases page and follow its instructions to set up both the connector and the scheduler.

Worker Nodes

Generate worker configuration using hypha-worker init:

hypha-worker init -n worker1 -o worker1-config.toml --exclude-cidr 192.0.2.0/24 --gateway /ip4/127.0.0.1/tcp/8080
hypha-worker init -n worker2 -o worker2-config.toml --exclude-cidr 192.0.2.0/24 --gateway /ip4/127.0.0.1/tcp/8080
hypha-worker init -n worker3 -o worker3-config.toml --exclude-cidr 192.0.2.0/24 --gateway /ip4/127.0.0.1/tcp/8080
Configuration written to: "worker1-config.toml"
Configuration written to: "worker2-config.toml"
Configuration written to: "worker3-config.toml"
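When the number of workers changes, a loop avoids copying the `hypha-worker init` line by hand. The sketch below uses only the flags shown above. `init_workers`, `GATEWAY_ADDR`, and `WORKER_BIN` are hypothetical names introduced for illustration. Each worker still needs a matching certificate created with `hypha-certutil node -n workerN`.

```shell
#!/bin/sh
# Sketch: generate N worker configurations that share one gateway address.
# GATEWAY_ADDR and WORKER_BIN are illustrative variables, not Hypha settings.
GATEWAY_ADDR="${GATEWAY_ADDR:-/ip4/127.0.0.1/tcp/8080}"

init_workers() {
    count="$1"
    i=1
    while [ "$i" -le "$count" ]; do
        "${WORKER_BIN:-hypha-worker}" init \
            -n "worker${i}" \
            -o "worker${i}-config.toml" \
            --exclude-cidr 192.0.2.0/24 \
            --gateway "$GATEWAY_ADDR" || return 1
        i=$((i + 1))
    done
}

# Usage:
#   init_workers 3
```

Keeping the gateway address in one variable makes it harder for a worker to end up pointing at a different gateway than the scheduler and data node.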

Accelerate Configuration

Create an accelerate.yaml file with the following configuration:

cat <<'EOF' > accelerate.yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: "NO"
downcast_bf16: "no"
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: "no"
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
EOF
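The settings that matter most for this guide are `distributed_type: "NO"` and `num_processes: 1`, which describe a single, non-distributed process per worker. Parsers that follow YAML 1.1 read a bare `NO` or `no` as boolean false, which is presumably why those values are quoted. The helper below is a rough sketch for checking the written file. `yaml_get` is a hypothetical name, and it only handles flat one-line `key: value` entries like the ones above; it is not a YAML parser.

```shell
#!/bin/sh
# Sketch: read a top-level "key: value" entry from a flat YAML file.
# $1 = file, $2 = key. Prints the value with surrounding double quotes removed.
yaml_get() {
    sed -n "s/^$2:[[:space:]]*//p" "$1" | sed 's/^"\(.*\)"$/\1/'
}

# Usage:
#   [ "$(yaml_get accelerate.yaml distributed_type)" = "NO" ] || echo "unexpected distributed_type"
#   [ "$(yaml_get accelerate.yaml num_processes)" = "1" ] || echo "unexpected num_processes"
```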

Data Node

Generate a data configuration using hypha-data init:

hypha-data init -n data1 --dataset-path="mnist/mnist" -o data-config.toml --exclude-cidr 192.0.2.0/24 --gateway /ip4/127.0.0.1/tcp/8080
Configuration written to: "data-config.toml"

Download the Training Dataset

In the project folder (the one that holds all configuration files and certificates), clone the dataset from Hugging Face. To keep the folder structure clean afterwards, use this quick workaround:

git clone https://huggingface.co/datasets/hypha-space/mnist
Cloning into 'mnist'...
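If `git-lfs` was not active during the clone, large files can end up as small text stubs instead of real data. The sketch below lists such stubs. It assumes the dataset stores its large files with Git LFS, which the `git-lfs` prerequisite suggests but this guide does not state. `lfs_pointers` is a hypothetical helper name; the pointer header it looks for is the standard first line of a Git LFS pointer file.

```shell
#!/bin/sh
# Sketch: print files under directory $1 that are still Git LFS pointer stubs.
# A pointer stub is a text file whose first line starts with
# "version https://git-lfs.github.com/spec/v1".
lfs_pointers() {
    find "$1" -type f ! -path '*/.git/*' | while IFS= read -r file; do
        if head -c 64 "$file" | grep -q '^version https://git-lfs\.github\.com/spec/v1'; then
            printf '%s\n' "$file"
        fi
    done
}

# Usage:
#   if [ -n "$(lfs_pointers mnist)" ]; then
#       (cd mnist && git lfs pull)
#   fi
```

An empty result means every file in the clone contains real content.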

Congratulations! Hypha is now set up correctly.


Running Hypha for Training

With everything set up, start the Hypha nodes and begin training.

Run each node in its own terminal. Start the gateway first:

hypha-gateway run -c gateway-config.toml
INFO libp2p_swarm: local_peer_id=<PEER_ID>
INFO hypha_network::listen: Listening address=/ip4/127.0.0.1/tcp/8080
INFO hypha_network::listen: Listening address=/ip4/127.0.0.1/udp/8080/quic-v1
WARN libp2p_kad::behaviour: Failed to trigger bootstrap: No known peers.

Then start the worker and data nodes (the gateway must be running first):

  1. hypha-worker run -c worker1-config.toml
  2. hypha-worker run -c worker2-config.toml
  3. hypha-worker run -c worker3-config.toml
  4. hypha-data run -c data-config.toml

Finally, once all the nodes above are running properly, start the scheduler to begin training:

hypha-scheduler run -c scheduler-config.toml
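For repeated runs, the start order above can be scripted instead of using separate terminals. This is a sketch under stated assumptions: `start_node`, `wait_for_log`, and the `logs/` directory are illustrative choices, and gateway readiness is detected by waiting for the `Listening address` line shown in the gateway output above. The readiness messages of worker and data nodes are not shown in this guide, so check their logs manually before starting the scheduler.

```shell
#!/bin/sh
# Sketch: start nodes in the documented order, one log file per node.

# Wait until PATTERN ($2) appears in FILE ($1); give up after TIMEOUT ($3) seconds.
wait_for_log() {
    file="$1"; pattern="$2"; timeout="${3:-30}"
    elapsed=0
    until grep -q -- "$pattern" "$file" 2>/dev/null; do
        [ "$elapsed" -ge "$timeout" ] && return 1
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Run a command in the background, writing output to logs/<name>.log
# and its process ID to logs/<name>.pid.
start_node() {
    name="$1"; shift
    mkdir -p logs
    "$@" >"logs/$name.log" 2>&1 &
    echo "$!" >"logs/$name.pid"
}

# Usage (mirrors the manual steps):
#   start_node gateway hypha-gateway run -c gateway-config.toml
#   wait_for_log logs/gateway.log 'Listening address' || exit 1
#   for w in worker1 worker2 worker3; do
#       start_node "$w" hypha-worker run -c "$w-config.toml"
#   done
#   start_node data1 hypha-data run -c data-config.toml
#   hypha-scheduler run -c scheduler-config.toml
```

The recorded PID files make it possible to stop the background nodes afterwards, for example with `kill "$(cat logs/gateway.pid)"`.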

Expected Output

After training completes successfully, the scheduler terminal should display messages like these:

2025-11-13T15:59:07.446178Z  INFO hypha_scheduler::scheduling::batch_scheduler: Job is completed.
2025-11-13T15:59:07.446185Z  INFO hypha_scheduler: Batch Scheduler finished
2025-11-13T15:59:07.446455Z ERROR hypha_network::request_response: Failed to unregister handler error=Other("Failed to send action")
2025-11-13T15:59:07.446471Z ERROR hypha_network::request_response: Failed to unregister handler error=Other("Failed to send action")
2025-11-13T15:59:07.446482Z  INFO hypha_scheduler::scheduling::data_scheduler: Data handler finished.

NOTE: The "Failed to unregister handler" errors are a known issue and are expected for now. As long as they are the only errors, the run succeeded.
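The success criterion above (job completed, no errors other than the known one) can be checked mechanically on a saved log. The sketch below assumes the scheduler output was captured to a file, for example with `hypha-scheduler run -c scheduler-config.toml 2>&1 | tee scheduler.log`, and that log lines keep the format shown above. `check_scheduler_log` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: decide whether a saved scheduler log shows a clean run.
check_scheduler_log() {
    log="$1"
    if ! grep -q 'Job is completed' "$log"; then
        echo "job did not complete"
        return 1
    fi
    # Any ERROR line other than the known "Failed to unregister handler" is unexpected.
    unexpected="$(grep ' ERROR ' "$log" | grep -v 'Failed to unregister handler' || true)"
    if [ -n "$unexpected" ]; then
        printf 'unexpected errors:\n%s\n' "$unexpected"
        return 1
    fi
    echo "ok"
}

# Usage:
#   check_scheduler_log scheduler.log
```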

If you encounter any issues, see the Troubleshooting Guide.