
Deploying Hypha – Overview & Planning Guide

This guide is aimed at DevOps and platform engineers who need to run Hypha across multiple machines (cloud, on-prem, or hybrid). It highlights the architectural decisions, references detailed component docs, and points you to deeper deployment guides such as Deploying a small AWS GPU Worker and Deploying a CPU Worker. If you only want to experiment on a single machine, start with the Quick Start Guide.

1. Define the goal

Before creating instances or opening ports, document what you are trying to accomplish: what workload you plan to run, how many machines you expect to use and where they live (cloud, on-prem, or hybrid), which networks and firewalls sit between them, and what level of certificate and isolation management your organization requires.

These answers determine how many machines you need, which roles you can colocate, how aggressive the firewall rules must be, and whether you can rely on hypha-certutil for certificates.

2. Understand the components

A DiLoCo deployment uses four main services:

  1. Gateway: the entry point every other node must be able to reach in order to join the network.
  2. Scheduler: launches and tracks jobs once enough workers hold leases.
  3. Workers: run the executors (for example, the Accelerate-based DiLoCo runner) that perform training.
  4. Data node(s): serve the datasets that schedulers and workers resolve.

Small clusters often colocate roles (for example, run the scheduler and parameter server on the same host as the gateway). Larger deployments dedicate machines to each role and add multiple gateways or data nodes for redundancy.

Hypha does not require a one-to-one mapping between “node” and “machine.” You can run multiple Hypha daemons on a single host as long as you provision enough CPU, memory, and disk. Resource numbers in configs are advisory—they guide the scheduler but do not enforce OS-level limits.
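For example, a single large host could run two worker daemons side by side, each with its own config and listen port. The run -c pattern mirrors the commands shown later in this guide; the file names and layout below are illustrative only:

  # each daemon gets its own config with a distinct listen port;
  # the resource numbers inside are scheduler hints, not OS-level limits
  hypha-worker run -c /etc/hypha/worker-a.toml &
  hypha-worker run -c /etc/hypha/worker-b.toml &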

3. Certificates and trust model

Hypha uses mutual TLS everywhere. Every process presents an Ed25519-backed certificate that chains to a trusted CA bundle. Follow the hierarchy described in Security:

  1. Offline root CA
  2. Per-tenant intermediate/organization CA
  3. Individual node certificates for gateways, schedulers, workers, and data nodes

hypha-certutil is acceptable for local testing, but production deployments should integrate with your corporate PKI and CRL/OCSP processes. Plan how you will distribute and rotate certificates (manual copy for pilots, Ansible/Salt for staging, automated secret delivery for production). Never reuse the same certificate across multiple nodes because PeerIDs are derived from the leaf key.
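If you are wiring this into a generic PKI rather than hypha-certutil, the three tiers map onto standard tooling. The openssl sketch below is illustrative only (file names, subjects, and lifetimes are assumptions; hypha-certutil or your corporate PKI replaces these steps in practice):

  # 1. offline root CA (Ed25519)
  openssl genpkey -algorithm ED25519 -out root-ca.key
  openssl req -x509 -new -key root-ca.key -days 3650 \
      -subj "/CN=Example Hypha Root CA" -out root-ca.crt

  # 2. per-tenant intermediate CA, signed by the root
  openssl genpkey -algorithm ED25519 -out tenant-ca.key
  openssl req -new -key tenant-ca.key -subj "/CN=Example Tenant CA" -out tenant-ca.csr
  printf "basicConstraints=critical,CA:true\n" > ca-ext.cnf
  openssl x509 -req -in tenant-ca.csr -CA root-ca.crt -CAkey root-ca.key \
      -CAcreateserial -days 1825 -extfile ca-ext.cnf -out tenant-ca.crt

  # 3. one leaf key and certificate per node (never shared: the PeerID derives from this key)
  openssl genpkey -algorithm ED25519 -out worker-01.key
  openssl req -new -key worker-01.key -subj "/CN=worker-01" -out worker-01.csr
  openssl x509 -req -in worker-01.csr -CA tenant-ca.crt -CAkey tenant-ca.key \
      -CAcreateserial -days 365 -out worker-01.crt

  # trust bundle distributed to every node
  cat root-ca.crt tenant-ca.crt > ca-bundle.pem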

4. Install binaries and tooling

Each machine typically needs:

  1. The Hypha CLI suite (hypha-*) — install via the methods in the Installation guide.
  2. uv plus any executor-specific dependencies (only required on workers running Python-based executors like the Accelerate-based DiLoCo runner).

Automate both steps via cloud-init, Ansible, or your preferred provisioning system so new workers can join the network quickly.
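A rough cloud-init sketch for a worker might look like the following. The uv one-liner is the installer published by uv's authors; the Hypha install step is left as a placeholder because it depends on the method you choose from the Installation guide:

  #cloud-config
  runcmd:
    # uv is only needed on workers that run Python-based executors
    - curl -LsSf https://astral.sh/uv/install.sh | sh
    # install the hypha-* CLI suite here using your chosen method from the
    # Installation guide, then copy certificates and the worker config
    # before starting the daemon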

5. Networking and gateway configuration

Gateways must be directly reachable by every node. Configure them with accurate listen/external addresses and CIDR filters so the DHT never advertises unusable IP ranges.

Listen addresses — prefer binding to specific interfaces:

/ip4/10.0.1.100/tcp/55000
/ip4/10.0.1.100/udp/55001/quic-v1

Use /ip4/0.0.0.0/... only when infrastructure-as-code cannot predict the private IP.

External addresses — advertise the address other peers should dial (/ip4/203.0.113.10/tcp/55000 for public gateways, /ip4/10.0.1.100/tcp/55000 for intra-VPC deployments, or /dns4/gateway.example.com/tcp/55000).

exclude_cidr — filter out unroutable ranges (loopback, link-local, internal ranges you do not want propagated). Without these filters, peers may announce 127.0.0.1, causing other nodes to dial themselves.
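For example, a gateway that should only advertise routable VPC addresses might filter the usual local ranges (the exclude_cidr key name comes from this guide; the exact list syntax may differ in your scaffolded config):

  exclude_cidr = ["127.0.0.0/8", "169.254.0.0/16", "fe80::/10"]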

Open both the TCP and UDP ports you configure, since Hypha can operate over TCP or QUIC. Workers, schedulers, and data nodes usually need only outbound connectivity plus SSH for administration.

Enable QUIC (UDP) whenever possible. It offers better performance on high-latency links and significantly improves connection success rates through restrictive firewalls. If you are forced to run TCP-only, explicitly open each worker’s listening port in your security groups.
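On a Linux host using ufw, opening the example ports from above looks like this (adjust the numbers to whatever you configured; cloud security groups need the equivalent TCP and UDP rules):

  sudo ufw allow 55000/tcp   # TCP listener
  sudo ufw allow 55001/udp   # QUIC (UDP) listener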

6. Capacity planning

Decide how many workers you need, whether they are GPU or CPU machines, and how much CPU, memory, and disk each role (including data nodes) requires. Document these choices so you can justify cloud spend and replicate the environment later.

7. Generate and edit configs

Use the init subcommands to scaffold configs, then edit them to match your topology.

Wire in certificate paths, adjust listen_addresses/external_addresses, and define scheduler job specs and worker executors. Keep configs in version control where possible so infrastructure changes are auditable.
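A trimmed gateway.toml might end up looking something like the sketch below. The listen_addresses, external_addresses, and exclude_cidr names come from this guide; the certificate keys and overall layout are illustrative, so check the scaffolded file for the real structure:

  # illustrative layout only; compare against the file produced by init
  listen_addresses   = ["/ip4/10.0.1.100/tcp/55000", "/ip4/10.0.1.100/udp/55001/quic-v1"]
  external_addresses = ["/dns4/gateway.example.com/tcp/55000"]
  exclude_cidr       = ["127.0.0.0/8", "169.254.0.0/16"]

  # certificate paths (key names assumed)
  certificate = "/etc/hypha/certs/gateway.crt"
  private_key = "/etc/hypha/certs/gateway.key"
  ca_bundle   = "/etc/hypha/certs/ca-bundle.pem"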

8. Bring the cluster online

Start services in this order, verifying each step before moving on:

  1. Gateway — hypha-gateway run -c gateway.toml
  2. Data node(s) — ensures schedulers can resolve datasets.
  3. Workers — run hypha-worker probe before hypha-worker run to confirm mTLS and connectivity.
  4. Scheduler — launches jobs once enough workers have leases.

Probe commands (hypha-worker probe <gateway> or hypha-data probe <gateway>) reuse the same certificates and addresses as production services and act as smoke tests. During training, watch scheduler logs for Job is completed plus data handler shutdown messages.
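Put together, a first bring-up might look like the sequence below, run from each machine in turn. The gateway command and the probe commands are quoted from this guide; the remaining run invocations and the scheduler binary name follow the same pattern but are assumptions, so substitute your actual commands and config paths:

  # 1. on the gateway host
  hypha-gateway run -c gateway.toml

  # 2. on each data node (probe first, then start)
  hypha-data probe <gateway>
  hypha-data run -c data.toml        # assumed invocation

  # 3. on each worker
  hypha-worker probe <gateway>
  hypha-worker run -c worker.toml    # assumed invocation

  # 4. on the scheduler host, once enough workers hold leases
  hypha-scheduler run -c scheduler.toml   # binary and config names assumed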

9. Hardening and operations
