Scheduler
The scheduler orchestrates distributed training jobs by coordinating workers, managing resource allocation, and controlling DiLoCo synchronization. This section covers scheduler configuration, deployment, and operational details.
CLI Reference: See the hypha-scheduler CLI Reference for complete command-line documentation.
Role and Responsibilities
The scheduler serves as the coordination point for distributed training:
Task Advertisement: Publishing resource requirements to workers via Gossipsub. Workers subscribe to the hypha/worker topic and evaluate incoming advertisements against their capabilities.
Worker Selection: Collecting offers from workers and selecting optimal matches. The scheduler evaluates offers based on a configurable price range and picks the best price/performance ratio rather than simply the lowest absolute bid.
Lease Management: Maintaining resource reservations through periodic renewal. Leases prevent resource double-booking while ensuring automatic cleanup if schedulers fail.
DiLoCo Synchronization Coordination: Tracking worker progress and determining when to trigger weight synchronization. The scheduler uses performance-aware scheduling to minimize stragglers by simulating future completion times.
Data Distribution Tracking: Assigning dataset slices to workers and ensuring complete coverage. The scheduler tracks slice states (AVAILABLE → ASSIGNED → USED) and reassigns slices when epochs complete (see the sketch after this list).
Performance-Aware Scheduling: Allocating heterogeneous batch sizes based on worker capabilities. Faster workers process more data per synchronization round, improving overall efficiency.
Parameter Server Lifecycle Management: Coordinating parameter server startup, informing it which workers will participate, and setting collection timeouts.
Metrics Aggregation: Optionally forwarding training metrics to Grafana for visualization and analysis, or storing them to a file.
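As a concrete picture of the slice lifecycle mentioned under Data Distribution Tracking above, the following Python sketch walks a set of slices through AVAILABLE → ASSIGNED → USED. It is purely illustrative; the class and method names are hypothetical and do not mirror the scheduler's internal API.
# Hypothetical illustration of the AVAILABLE -> ASSIGNED -> USED lifecycle;
# the names here do not mirror the scheduler's internal API.
from enum import Enum, auto

class SliceState(Enum):
    AVAILABLE = auto()  # not yet handed to any worker this epoch
    ASSIGNED = auto()   # currently being processed by a worker
    USED = auto()       # fully consumed in the current epoch

class SliceTracker:
    def __init__(self, num_slices: int):
        self.states = {i: SliceState.AVAILABLE for i in range(num_slices)}

    def assign(self, slice_id: int) -> None:
        assert self.states[slice_id] is SliceState.AVAILABLE
        self.states[slice_id] = SliceState.ASSIGNED

    def mark_used(self, slice_id: int) -> None:
        self.states[slice_id] = SliceState.USED

    def maybe_start_new_epoch(self) -> None:
        # When every slice is USED the epoch is complete and all slices
        # become AVAILABLE again for reassignment.
        if all(s is SliceState.USED for s in self.states.values()):
            self.states = {i: SliceState.AVAILABLE for i in self.states}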
Installation
Install the scheduler binary following the Installation guide.
Configuration Parameters
Scheduler configuration uses TOML format with network, security, and job specification settings. Generate an example configuration file using the hypha-scheduler init command. You will need to provide paths to TLS certificates, configure network settings (gateway addresses, listen addresses), and define your training job specification.
Network and Security Settings
TLS Certificates (required for mTLS):
cert_pem: Path to the scheduler's certificate (PEM format)
key_pem: Path to the scheduler's private key (PKCS#8 format)
trust_pem: Path to the CA certificate bundle
crls_pem: Optional certificate revocation list
Network Addresses:
gateway_addresses: List of gateway nodes for bootstrap (Multiaddr format)
listen_addresses: Local addresses to bind (e.g., /ip4/0.0.0.0/tcp/0)
external_addresses: Publicly reachable addresses to advertise
exclude_cidr: CIDR ranges to exclude from peer discovery (defaults to reserved ranges)
Relay Configuration:
relay_circuit: Enable relay via gateways (default: true)
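The generated scheduler.toml is the source of truth for where these keys live. As a hedged convenience, the following Python snippet shows one way to sanity-check the TLS paths before starting the scheduler, looking the keys up wherever they appear in the parsed file.
# Hypothetical pre-flight check: confirm the TLS files referenced by
# cert_pem, key_pem, and trust_pem exist, wherever they sit in the
# scheduler.toml produced by hypha-scheduler init.
import tomllib
from pathlib import Path

def find_key(table: dict, key: str):
    """Recursively look up `key` anywhere in the parsed TOML tree."""
    if key in table:
        return table[key]
    for value in table.values():
        if isinstance(value, dict):
            found = find_key(value, key)
            if found is not None:
                return found
    return None

with open("scheduler.toml", "rb") as f:
    config = tomllib.load(f)

for key in ("cert_pem", "key_pem", "trust_pem"):
    path = find_key(config, key)
    if path is None or not Path(path).is_file():
        raise SystemExit(f"missing or unreadable TLS file for '{key}': {path}")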
Job Specification
The [scheduler.job] section defines the training job:
Model Configuration:
[scheduler.job.model]
repository = "owner/repo" # HuggingFace repository
revision = "main" # Optional branch/tag
filenames = ["config.json", "model.safetensors"]
token = "hf_..." # Optional auth token
type = "image-classification" # Model type
input_names = ["images"] # Input names to the model
The token will be shared with all workers. Only use a token with limited rights and invalidate it once training is finished.
All transformers Auto Classes are supported.
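For orientation, the fields above map onto a standard Hugging Face download. The following Python sketch shows the roughly equivalent transformers call for type = "image-classification"; it is an illustration of what the fields mean, not the worker's actual loading code.
# Illustrative only: what the [scheduler.job.model] fields correspond to on
# the Hugging Face side for type = "image-classification".
from transformers import AutoModelForImageClassification

model = AutoModelForImageClassification.from_pretrained(
    "owner/repo",      # repository
    revision="main",   # revision
    token="hf_...",    # token (use a narrowly scoped one)
)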
Preprocessor Configuration (optional):
[scheduler.job.preprocessor]
repository = "l45k/Resnet50"
filenames = ["preprocessor_config.json"]
token = "hf_..."
type = "image" # Preprocessor Type
input_names = ["images"] # Input names to the preprocessor
The token will be shared with all workers. Only use a token with limited rights and invalidate it once training is finished.
Dataset Configuration:
[scheduler.job.dataset]
dataset = "imagenet" # Name advertised by data nodes
Inner Optimizer (AdamW):
[scheduler.job.inner_optimizer]
learning_rate = 0.001
# Optional: betas = [0.9, 0.999]
# Optional: epsilon = 1e-8
Outer Optimizer (Nesterov Momentum):
[scheduler.job.outer_optimizer]
learning_rate = 0.7
momentum = 0.9
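As a hedged illustration, the two optimizer blocks correspond to familiar PyTorch optimizers, assuming the inner optimizer behaves like torch.optim.AdamW and the outer optimizer like SGD with Nesterov momentum:
# Illustrative only: the hyperparameters from the examples above mapped onto
# the corresponding PyTorch optimizers.
import torch

model = torch.nn.Linear(16, 4)  # stand-in for the configured model

# Inner optimizer: local steps performed by each worker between syncs.
inner_opt = torch.optim.AdamW(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-8)

# Outer optimizer: applied to the averaged update at each DiLoCo synchronization.
outer_opt = torch.optim.SGD(model.parameters(), lr=0.7, momentum=0.9, nesterov=True)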
Upload Trained Model To HF: To upload the trained model to Hugging Face, provide a repository and a token with the appropriate rights:
[scheduler.job.model_destination]
repository="repository/identifier"
token="hf_*"
This will overwrite any existing model and model weights in the repository. If the repository doesn't exist, the scheduler will try to create a new one with the account defaults (unless changed, this results in a public repository). Additionally, the token will be shared with a worker. Only use a token with limited rights and invalidate it once training is finished.
Training Duration:
Two parameters control the training process: the number of DiLoCo rounds and the number of data samples to process between updates. The Scheduler automatically determines the batch size each worker can process based on its capabilities. To prevent a single worker from running with an extremely large batch size, the batch size can be capped with max_batch_size.
Additionally, the Scheduler can assign up to multi_batch_size batches within a single message. This reduces communication between the Scheduler and Workers and can speed up training on high-latency connections. It also defines how many batches the Scheduler will project into the future before assigning batches to a worker.
[scheduler.job.rounds]
update_rounds = 100
avg_samples_between_updates = 1200
max_batch_size = 600
multi_batch_size = 3
The avg_samples_between_updates value is a minimum: the Scheduler may use more samples depending on the batch sizes of the participating Workers, and it will not reduce batch sizes to match the value exactly.
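To make the interaction between these settings concrete, the following Python sketch runs the arithmetic for one synchronization round with invented worker batch sizes; it is not the scheduler's actual assignment algorithm.
# Hypothetical arithmetic only; batch sizes are invented stand-ins for the
# capability-derived values the scheduler computes.
import math

AVG_SAMPLES_BETWEEN_UPDATES = 1200
MAX_BATCH_SIZE = 600
MULTI_BATCH_SIZE = 3

worker_batch_sizes = {"worker-a": 512, "worker-b": 700, "worker-c": 256}
capped = {w: min(bs, MAX_BATCH_SIZE) for w, bs in worker_batch_sizes.items()}

# Hand out one batch per worker at a time until the round covers at least the
# configured minimum; the total may overshoot because batches are not shrunk
# to hit the value exactly.
assigned_samples = 0
batches = {w: 0 for w in capped}
while assigned_samples < AVG_SAMPLES_BETWEEN_UPDATES:
    for w, bs in capped.items():
        batches[w] += 1
        assigned_samples += bs
        if assigned_samples >= AVG_SAMPLES_BETWEEN_UPDATES:
            break

# Up to MULTI_BATCH_SIZE batches can be bundled into one scheduler message.
messages = {w: math.ceil(n / MULTI_BATCH_SIZE) for w, n in batches.items()}
print(batches, messages, assigned_samples)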
Resource Requirements:
[scheduler.job.resources]
num_workers = 4
[[scheduler.job.resources.worker]]
type = "GPU"
min = 10.0
[[scheduler.job.resources.worker]]
type = "CPU"
min = 2.0
[[scheduler.job.resources.worker]]
type = "Memory"
min = 8.0
[[scheduler.job.resources.worker]]
kind = "diloco-transformer" # Require workers that advertise this executor
[[scheduler.job.resources.parameter_server]]
type = "CPU"
min = 4.0
[[scheduler.job.resources.parameter_server]]
type = "Memory"
min = 16.0
[[scheduler.job.resources.parameter_server]]
kind = "parameter-server"
Each [[...worker]] or [[...parameter_server]] table serializes directly into a worker Requirement.
Make sure that the GPU requirements for the worker match the GPU memory required for training with a batch size of 1. A good estimate in GB is the number of parameters * 24 * 9e-10. This corresponds to the model memory (1 fp32) + AdamW state (4 fp32) + gradient (1 fp32) and activation (1 fp32). For a better estimation, consult Transformer math. The number of parameters can easily be computed in torch with sum(v.shape.numel() for v in model.state_dict().values()).
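Putting the rule of thumb into runnable form (the model below is an arbitrary stand-in; substitute the model you actually configured):
# Runnable version of the estimate above, using a small stand-in model.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
)

num_params = sum(v.shape.numel() for v in model.state_dict().values())
estimated_gb = num_params * 24 * 9e-10
print(f"{num_params} parameters -> roughly {estimated_gb:.3f} GB at batch size 1")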
Price Ranges: Configure bid/maximum pairs for workers and parameter servers to express how far the scheduler is willing to counter-offer without revealing the cap to workers:
[scheduler.job.resources.worker_price]
bid = 110.0 # published bid workers see
max = 160.0 # private cap used for filtering
[scheduler.job.resources.parameter_server_price]
bid = 80.0
max = 120.0
The bid is broadcast to workers, while the max remains local to the scheduler. Received offers whose price exceeds the configured max are ignored. Within that range, offers are ranked by the scheduler's resource evaluator to prioritize the most capacity per token spent.
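The following Python sketch illustrates that behaviour; the Offer type, score field, and selection function are hypothetical stand-ins rather than the scheduler's real data structures.
# Hypothetical sketch of the offer handling described above.
from dataclasses import dataclass

@dataclass
class Offer:
    worker_id: str
    price: float
    score: float  # stand-in for the resource evaluator's capacity estimate

def select_offers(offers: list[Offer], max_price: float, n: int) -> list[Offer]:
    # The private max filters out offers that are too expensive; only the bid
    # was ever broadcast to workers.
    affordable = [o for o in offers if o.price <= max_price]
    # Rank by capacity per token spent, not by lowest absolute price.
    ranked = sorted(affordable, key=lambda o: o.score / o.price, reverse=True)
    return ranked[:n]

offers = [
    Offer("w1", price=150.0, score=90.0),
    Offer("w2", price=170.0, score=99.0),  # dropped: exceeds max = 160.0
    Offer("w3", price=120.0, score=80.0),
]
print(select_offers(offers, max_price=160.0, n=2))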
OpenTelemetry
OpenTelemetry enables distributed tracing and metrics collection for debugging and monitoring your scheduler in production. Configure telemetry either via the TOML configuration file or using standard OTEL_* environment variables.
Configuration File (Example uses Grafana Cloud): Specify telemetry settings in your scheduler.toml:
telemetry_attributes = "service.name=<Node Name>,service.namespace=<Namespace>,deployment.environment=<Environment>"
telemetry_endpoint = "https://otlp-gateway-prod-eu-west-2.grafana.net/otlp"
telemetry_headers = "Authorization=Basic <Api Key>"
telemetry_protocol = "http/protobuf"
telemetry_sampler = "parentbased_traceidratio"
telemetry_sample_ratio = 0.1
Environment Variables (Example uses Grafana Cloud): Alternatively, use standard OpenTelemetry environment variables:
export OTEL_EXPORTER_OTLP_ENDPOINT="https://otlp-gateway-prod-eu-west-2.grafana.net/otlp"
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic <Api Key>"
export OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf"
export OTEL_SERVICE_NAME="scheduler-01"
export OTEL_RESOURCE_ATTRIBUTES="service.namespace=production,deployment.environment=prod"
hypha-scheduler run --config /etc/hypha/scheduler.toml
Environment variables take precedence over configuration file settings, allowing flexible per-deployment customization.
Monitoring
The Scheduler supports three different ways to monitor a training run and collect metrics. The methods are not exclusive and all can be used at the same time.
OpenTelemetry
If OpenTelemetry is configured, metrics can be directly forwarded to Grafana:
[[scheduler.job.metrics]]
type = "otel"
JSONL
Metrics can be stored in a local JSONL file.
[[scheduler.job.metrics]]
type = "jsonl"
path = "metrics.jsonl"
CSV
Metrics can be stored in a local CSV file.
[[scheduler.job.metrics]]
type = "csv"
path = "metrics.csv"