Worker Node

Worker nodes execute training and inference tasks in the Hypha network. They evaluate task advertisements, bid for work, and run jobs through configurable executors. This section covers worker deployment, configuration, and executor setup.

CLI Reference: See the hypha-worker CLI Reference for complete command-line documentation.

Role and Responsibilities

Workers implement several subsystems to participate in the distributed training workflow:

Request Management: Subscribes to advertisements on the hypha/worker topic and evaluates incoming requests against configurable criteria. The arbiter maintains a pricing threshold and only responds to requests that meet its resource requirements and price expectations. When multiple schedulers compete for resources simultaneously, the evaluation system scores and ranks requests using configurable strategies that prioritize profit maximization. See the Decentralized Resource Allocation Protocol RFC for details on the negotiation mechanism.

Lease Management: The LeaseManager handles time-bounded resource reservations. Temporary leases prevent double-booking during offer negotiation. Accepted offers transition to renewable leases maintained through periodic scheduler renewal.

Job Execution: The JobManager spawns executor processes, provides isolated work directories, and manages job lifecycles. Each job receives a Unix socket for communication via the job bridge.

Data Fetching: Workers request data slices from schedulers, retrieve them from data nodes via stream protocols, and prefetch to minimize wait times.

Metric Reporting: Workers report training metrics (loss, batch processing times, data points processed) to schedulers for performance-aware scheduling decisions.

Gradient Communication: For DiLoCo training, workers send pseudo-gradients to parameter servers at synchronization points and receive updated global gradients.

Installation and Setup

Install the hypha-worker binary following the Installation guide.

Work Directory Setup

Workers require a base directory for job isolation:

# Create work directory with appropriate permissions
sudo mkdir -p /var/hypha/work
sudo chown $(whoami):$(whoami) /var/hypha/work
chmod 750 /var/hypha/work

Each job creates a subdirectory hypha-{uuid} under the base path. Jobs have full access to their subdirectories.

Configuration Parameters

Worker configuration uses TOML format with network, security, and storage settings. Generate an example configuration file using the hypha-worker init command. Provide certificate paths along with gateway addresses and listen/external multiaddrs.
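
A minimal worker.toml, sketched from the settings covered in this section, might look like the following. The cert_pem, key_pem, and work_dir keys are implied by the HYPHA_* overrides shown later; the gateway, listen, and external key names and addresses are illustrative assumptions, so treat the file generated by hypha-worker init as the authoritative template.

# Identity and transport (gateway/listen/external key names are assumptions)
cert_pem = "/etc/hypha/certs/worker-cert.pem"
key_pem = "/etc/hypha/certs/worker-key.pem"
gateway = ["/dns4/gateway.example.org/tcp/8443"]      # hypothetical gateway multiaddr
listen = ["/ip4/0.0.0.0/tcp/8443"]                    # hypothetical listen multiaddr
external = ["/dns4/worker-01.example.org/tcp/8443"]   # hypothetical external multiaddr

# Job isolation (see Work Directory below)
work_dir = "/var/hypha/work"

# Advertised capacity and pricing (see the following sections)
[resources]
cpu = 16        # cores
memory = 64     # GB
storage = 500   # GB
gpu = 24        # GB (vram)

[offer]
price = 150.0
floor = 120.0
strategy = "whole"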

Resource Advertisement

Specify available resources in GB (memory, storage, GPU) or cores (CPU):

[resources]
cpu = 16        # cores
memory = 64     # GB
storage = 500   # GB
gpu = 24        # GB (vram)

These values advertise capacity to schedulers. Actual resource limiting is planned but not yet implemented.

Offer Strategies and Pricing

Workers expose their bidding policy through the [offer] section:

[offer]
price = 150.0           # Preferred whole-worker price
floor = 120.0           # Ignore bids below this value
strategy = "whole"

The strategy and pricing configuration allow a single worker to operate either as the existing flexible node or as a dedicated “whole box” bidder suited to exclusive GPU hosts.

Work Directory

work_dir = "/var/hypha/work"

Executors receive isolated subdirectories under this path.

Executor Configuration

Executors define how workers handle jobs. Each executor specifies a job class, a name, a runtime, and, for process runtimes, the command and arguments used to launch it.

Example: Accelerate Executor for Training

[[executors]]
class = "train"
name = "diloco-transformer"
runtime = "process"
cmd = "uv"
args = [
    "run",
    "--python", "3.12",
    "--no-project",
    "--with", "hypha-accelerate-executor[<extra for CUDA/ROCm version>] @ https://github.com/hypha-space/hypha/releases/download/v<version>/hypha_accelerate_executor-<version without semver channel or metadata>-py3-none-any.whl",
    "--",
    "accelerate",
    "launch",
    "--config_file", "/etc/hypha/accelerate.yaml",
    "-m", "hypha.accelerate_executor.training",
    "--socket", "{SOCKET_PATH}",
    "--work-dir", "{WORK_DIR}",
    "--job", "{JOB_JSON}",
]

Example: Parameter Server Executor

[[executors]]
class = "aggregate"
name = "parameter-server"
runtime = "parameter-server"

Template Substitutions: When launching the executor, the worker replaces {SOCKET_PATH} with the path of the job's Unix socket, {WORK_DIR} with the job's isolated work directory, and {JOB_JSON} with the job description serialized as JSON.

OpenTelemetry

OpenTelemetry enables distributed tracing and metrics collection for debugging and monitoring your workers in production. Configure telemetry either via the TOML configuration file or using standard OTEL_* environment variables.

Configuration File (Example uses Grafana Cloud): Specify telemetry settings in your worker.toml:

telemetry_attributes = "service.name=<Node Name>,service.namespace=<Namespace>,deployment.environment=<Environment>"
telemetry_endpoint = "https://otlp-gateway-prod-eu-west-2.grafana.net/otlp"
telemetry_headers = "Authorization=Basic <Api Key>"
telemetry_protocol = "http/protobuf"
telemetry_sampler = "parentbased_traceidratio"
telemetry_sample_ratio = 0.1

Environment Variables (Example uses Grafana Cloud): Alternatively, use standard OpenTelemetry environment variables:

export OTEL_EXPORTER_OTLP_ENDPOINT="https://otlp-gateway-prod-eu-west-2.grafana.net/otlp"
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic <Api Key>"
export OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf"
export OTEL_SERVICE_NAME="worker-01"
export OTEL_RESOURCE_ATTRIBUTES="service.namespace=production,deployment.environment=prod"

hypha-worker run --config /etc/hypha/worker.toml

Environment variables take precedence over configuration file settings, allowing flexible per-deployment customization.

Environment Overrides

Workers load configuration in layers: TOML file → HYPHA_* env vars → CLI flags. Example:

HYPHA_CERT_PEM=/etc/hypha/certs/worker-cert.pem \
HYPHA_KEY_PEM=/etc/hypha/certs/worker-key.pem \
HYPHA_WORK_DIR=/var/hypha/work \
hypha-worker run --config /etc/hypha/worker.toml

Use this pattern to inject secrets or per-host tweaks (e.g., different GPUs or work directories) while sharing a single committed config.

Executors

Executors handle actual computation. Workers support multiple executor types simultaneously.
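
As a sketch, the two executor definitions shown earlier can sit side by side in a single worker configuration, so one worker can accept both training and aggregation jobs:

[[executors]]
class = "train"
name = "diloco-transformer"
runtime = "process"
cmd = "uv"
# args omitted here; use the full argument list from the Accelerate example above

[[executors]]
class = "aggregate"
name = "parameter-server"
runtime = "parameter-server"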

Accelerate Executor

The Accelerate executor enables multi-GPU training through HuggingFace Accelerate, which provides tensor parallelism across multiple devices. Without tensor parallelism, each GPU must hold the entire model in memory. Accelerate distributes model layers across devices, enabling training of models larger than a single GPU's capacity. A worker with 4 GPUs can therefore train models up to roughly 4x larger than would fit on a single GPU.

The executor is distributed as a Python wheel and uses uv for dependency management and execution.

For detailed installation instructions, configuration examples, PyTorch variant selection, and troubleshooting, see the Accelerate Executor README.

Parameter Server Executor

The Parameter Server executor provides built-in support for DiLoCo (Distributed Low-Communication) based training, enabling workers to act as parameter servers that aggregate model updates across distributed training runs.

Configuration

The parameter server executor is built into Hypha workers—no additional installation required. Add it to your worker configuration:

[[executors]]
class = "aggregate"
name = "parameter-server"
runtime = "parameter-server"