Worker Node
Worker nodes execute training and inference tasks in the Hypha network. They evaluate task advertisements, bid for work, and run jobs through configurable executors. This section covers worker deployment, configuration, and executor setup.
CLI Reference: See the hypha-worker CLI Reference for complete command-line documentation.
Role and Responsibilities
Workers implement several subsystems to participate in the distributed training workflow:
Request Management: The worker subscribes to advertisements on the hypha/worker topic and evaluates incoming requests against configurable criteria. The arbiter maintains a pricing threshold and only responds to requests that meet resource requirements and price expectations. When multiple schedulers compete for resources simultaneously, the evaluation system scores and ranks requests using configurable strategies, prioritizing profit maximization. See the Decentralized Resource Allocation Protocol RFC for details on the negotiation mechanism.
Lease Management: The LeaseManager handles time-bounded resource reservations. Temporary leases prevent double-booking during offer negotiation. Accepted offers transition to renewable leases maintained through periodic scheduler renewal.
Job Execution: The JobManager spawns executor processes, provides isolated work directories, and manages job lifecycles. Each job receives a Unix socket for communication via the job bridge.
Data Fetching: Workers request data slices from schedulers, retrieve them from data nodes via stream protocols, and prefetch to minimize wait times.
Metric Reporting: Workers report training metrics (loss, batch processing times, data points processed) to schedulers for performance-aware scheduling decisions.
Gradient Communication: For DiLoCo training, workers send pseudo-gradients to parameter servers at synchronization points and receive updated global gradients.
Installation and Setup
Install the worker binary following the Installation guide.
Work Directory Setup
Workers require a base directory for job isolation:
# Create work directory with appropriate permissions
sudo mkdir -p /var/hypha/work
sudo chown $(whoami):$(whoami) /var/hypha/work
chmod 750 /var/hypha/work
Each job creates a subdirectory hypha-{uuid} under the base path. Jobs have full access to their subdirectories.
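For example, with two jobs running, the work directory looks like the following (UUIDs shortened and illustrative):
/var/hypha/work/
├── hypha-1b2c3d4e-…/   # work directory for one job
└── hypha-9f8e7d6c-…/   # work directory for another job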
Configuration Parameters
Worker configuration uses TOML format with network, security, and storage settings. Generate an example configuration file using the hypha-worker init command. Provide certificate paths along with gateway addresses and listen/external multiaddrs.
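As a rough sketch of the resulting file (the authoritative key names come from the file generated by hypha-worker init; the network keys and value shapes below are illustrative, while cert_pem, key_pem, and work_dir match the HYPHA_* overrides shown later in this section):
# Certificate and key issued for this worker
cert_pem = "/etc/hypha/certs/worker-cert.pem"
key_pem = "/etc/hypha/certs/worker-key.pem"

# Gateway address and listen/external multiaddrs (key names illustrative)
gateway = "/dns4/gateway.example.org/tcp/4001"
listen = "/ip4/0.0.0.0/tcp/4001"
external = "/ip4/203.0.113.10/tcp/4001"

# Base directory for job isolation
work_dir = "/var/hypha/work"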
Resource Advertisement
Specify available resources in GB (memory, storage, GPU) or cores (CPU):
[resources]
cpu = 16 # cores
memory = 64 # GB
storage = 500 # GB
gpu = 24 # GB (vram)
These values advertise capacity to schedulers. Actual resource limiting is planned but not yet implemented.
Offer Strategies and Pricing
Workers expose their bidding policy through the [offer] section:
[offer]
price = 150.0 # Preferred whole-worker price
floor = 120.0 # Ignore bids below this value
strategy = "whole"
- price represents the worker's ask. Bids below this value are rejected unless the floor is higher.
- floor is the hard minimum bid. Requests below the floor are dropped without a response.
- strategy selects how resources are bundled:
  - flexible mirrors the scheduler's bid as long as it meets the pricing thresholds and only reserves the resources requested by the scheduler.
  - whole offers all configured resources as a single unit and responds at the higher of the scheduler's bid or the configured price.
The strategy and pricing configuration let a single worker operate either as a flexible node that follows scheduler bids or as a dedicated “whole box” bidder suited to exclusive GPU hosts.
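For comparison, a worker that only mirrors scheduler bids above its thresholds could use the flexible strategy (values illustrative):
[offer]
price = 100.0 # Worker's ask
floor = 80.0 # Drop requests bidding below this value
strategy = "flexible" # Reserve only the resources the scheduler requests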
Work Directory
work_dir = "/var/hypha/work"
Executors receive isolated subdirectories under this path.
Executor Configuration
Executors define how workers handle jobs. Each executor specifies:
- class: Job category (e.g., "train", "inference")
- name: Specific executor variant (e.g., "diloco-transformer")
- runtime: Execution mechanism ("process" or "parameter-server")
- cmd (process only): Command to execute
- args (process only): Arguments, with template substitutions
Example: Accelerate Executor for Training
[[executors]]
class = "train"
name = "diloco-transformer"
runtime = "process"
cmd = "uv"
args = [
"run",
"--python", "3.12",
"--no-project",
"--with", "hypha-accelerate-executor[<extra for CUDA/ROCm version>] @ https://github.com/hypha-space/hypha/releases/download/v<version>/hypha_accelerate_executor-<version without semver channel or metadata>-py3-none-any.whl",
"--",
"accelerate",
"launch",
"--config_file", "/etc/hypha/accelerate.yaml",
"-m", "hypha.accelerate_executor.training",
"--socket", "{SOCKET_PATH}",
"--work-dir", "{WORK_DIR}",
"--job", "{JOB_JSON}",
]
Example: Parameter Server Executor
[[executors]]
class = "aggregate"
name = "parameter-server"
runtime = "parameter-server"
Template Substitutions:
- {SOCKET_PATH}: Unix socket path for job bridge communication
- {WORK_DIR}: Job-specific work directory
- {JOB_JSON}: Job specification as JSON string
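These placeholders can appear in the args of any process executor. As a sketch, a hypothetical custom training script could be wired up like this (the executor name and script path are illustrative):
[[executors]]
class = "train"
name = "custom-script" # illustrative executor name
runtime = "process"
cmd = "python3"
args = [
    "/opt/my-executor/main.py", # hypothetical script consuming the job bridge
    "--socket", "{SOCKET_PATH}",
    "--work-dir", "{WORK_DIR}",
    "--job", "{JOB_JSON}",
]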
OpenTelemetry
OpenTelemetry enables distributed tracing and metrics collection for debugging and monitoring your workers in production. Configure telemetry either via the TOML configuration file or using standard OTEL_* environment variables.
Configuration File (Example uses Grafana Cloud): Specify telemetry settings in your worker.toml:
telemetry_attributes = "service.name=<Node Name>,service.namespace=<Namespace>,deployment.environment=<Environment>"
telemetry_endpoint = "https://otlp-gateway-prod-eu-west-2.grafana.net/otlp"
telemetry_headers = "Authorization=Basic <Api Key>"
telemetry_protocol = "http/protobuf"
telemetry_sampler = "parentbased_traceidratio"
telemetry_sample_ratio = 0.1
Environment Variables (Example uses Grafana Cloud): Alternatively, use standard OpenTelemetry environment variables:
export OTEL_EXPORTER_OTLP_ENDPOINT="https://otlp-gateway-prod-eu-west-2.grafana.net/otlp"
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic <Api Key>"
export OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf"
export OTEL_SERVICE_NAME="worker-01"
export OTEL_RESOURCE_ATTRIBUTES="service.namespace=production,deployment.environment=prod"
hypha-worker run --config /etc/hypha/worker.toml
Environment variables take precedence over configuration file settings, allowing flexible per-deployment customization.
Environment Overrides
Workers load configuration in layers: TOML file → HYPHA_* env vars → CLI flags. Example:
HYPHA_CERT_PEM=/etc/hypha/certs/worker-cert.pem \
HYPHA_KEY_PEM=/etc/hypha/certs/worker-key.pem \
HYPHA_WORK_DIR=/var/hypha/work \
hypha-worker run --config /etc/hypha/worker.toml
Use this pattern to inject secrets or per-host tweaks (e.g., different GPUs or work directories) while sharing a single committed config.
Executors
Executors handle actual computation. Workers support multiple executor types simultaneously.
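For example, a single worker can register both the training executor and the built-in parameter server at the same time:
[[executors]]
class = "train"
name = "diloco-transformer"
runtime = "process"
cmd = "uv"
# args as in the Accelerate executor example above

[[executors]]
class = "aggregate"
name = "parameter-server"
runtime = "parameter-server"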
Accelerate Executor
The Accelerate executor enables multi-GPU training through HuggingFace Accelerate, which provides tensor parallelism across multiple devices. Without tensor parallelism, each GPU must hold the entire model in memory. Accelerate distributes model layers across devices, enabling training of models larger than a single GPU's capacity. A worker with four GPUs can therefore train models roughly four times larger than a single GPU could hold.
The executor is distributed as a Python wheel and uses uv for dependency management and execution.
For detailed installation instructions, configuration examples, PyTorch variant selection, and troubleshooting, see the Accelerate Executor README.
Parameter Server Executor
The Parameter Server executor provides built-in support for DiLoCo (Distributed Low-Communication) based training, enabling workers to act as parameter servers that aggregate model updates across distributed training runs.
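As a rough sketch of the aggregation step (notation illustrative; the exact outer optimizer depends on the training configuration), each worker i runs H local steps from the shared parameters and returns a pseudo-gradient, which the parameter server averages and applies through an outer optimizer:
\Delta_i = \theta_t - \theta_i^{(H)}, \qquad \theta_{t+1} = \mathrm{OuterOpt}\Big(\theta_t,\ \tfrac{1}{N}\sum_{i=1}^{N} \Delta_i\Big)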
Configuration
The parameter server executor is built into Hypha workers—no additional installation required. Add it to your worker configuration:
[[executors]]
class = "aggregate"
name = "parameter-server"
runtime = "parameter-server"