Deploying an AMD GPU Worker (MI300X)

This guide walks you through deploying a high-performance AMD GPU worker using the AMD Developer Cloud.

Goal: A powerful worker (MI300X) for running Hypha DiLoCo training jobs.

To follow this guide, ensure you have a running Gateway reachable by the worker, a Node Certificate and Key for the worker (refer to Security for details), and an account on amd.digitalocean.com.

1. Infrastructure Specification

We will use the AMD Developer Cloud to provision a "GPU Droplet" with the following key specifications:

Component | Specification                         | Rationale
--------- | ------------------------------------- | ---------
Image     | ROCm 6.4                              | Select the base "ROCm Software" image. Do not use the "AI/ML" or "GPT-OSS" pre-configured images, as they contain unnecessary bloat.
Instance  | MI300X (1 GPU)                        | 20 vCPU, 240 GB RAM, 192 GB VRAM.
Storage   | Boot: 720 GB NVMe; Scratch: 5 TB NVMe | Massive storage for large models and datasets.

Begin by logging in to the AMD Developer Cloud Console. Navigate to GPU Droplets and select Create GPU Droplet. Choose the MI300X (1 GPU) plan, then select the preinstalled ROCm™ Software image, version 6.4. Before creating the droplet, add your public SSH key. Once configured, click Create.

Ensure the ROCm version is set to 6.4; it must match the PyTorch index URL we will configure later.
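
After the droplet boots, you can SSH in and sanity-check the GPU and ROCm install. A quick sketch (the version file path assumes a standard ROCm install under /opt/rocm):

rocm-smi                      # should list a single MI300X
cat /opt/rocm/.info/version   # should report a 6.4.x release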

2. Install & Configure Hypha

2.1 Install Dependencies

First, install uv using the official installation guide. A quick way to do this is:

curl -LsSf https://astral.sh/uv/install.sh | sh

Next, install the hypha-worker binary. You can find detailed instructions in the installation guide, or use the following command to install the latest version:

curl -LsSf https://hypha-space.org/install.sh | sh
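
Before moving on, verify that both tools landed on your PATH (you may need to restart your shell or source the env file the installers print first):

uv --version              # prints the installed uv version
command -v hypha-worker   # prints the path to the hypha-worker binary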

2.2 Configuration

First, upload your node certificates (cert.pem, key.pem, and ca.pem) to /etc/hypha/certs/. A minimal sketch of this step, run from your local machine (the root login and the <DROPLET_IP> placeholder are assumptions):
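
ssh root@<DROPLET_IP> "mkdir -p /etc/hypha/certs"
scp cert.pem key.pem ca.pem root@<DROPLET_IP>:/etc/hypha/certs/
ssh root@<DROPLET_IP> "chmod 600 /etc/hypha/certs/key.pem"

Then, initialize a base configuration file using hypha-worker init, specifying a name for your worker, the gateway's multiaddr, and the designated work directory: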

hypha-worker init \
  -n amd-worker-1 \
  --gateway <GATEWAY_MULTIADDR> \
  --work-dir /mnt/data/work

After generating config.toml, you must adjust the resources to reflect the MI300X's capabilities. Critically, configure the executor for ROCm.

Resources:

[resources]
cpu = 20
memory = 240   # GB
gpu = 192      # GB (MI300X)
storage = 5000 # GB (Scratch Disk)
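
You can cross-check these values against the droplet itself; a quick sketch (the rocm-smi flag assumes ROCm 6.x, and the mount point matches the work directory configured above):

nproc                         # vCPU count; expect 20
free -g                       # memory in GiB; expect roughly 240
rocm-smi --showmeminfo vram   # GPU VRAM; expect roughly 192 GB
df -h /mnt/data               # scratch disk capacity; expect roughly 5 TB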

Executor Configuration (ROCm):

Within the [[executors]] block for diloco-transformer, configure the args to use the ROCm PyTorch index. This ensures uv installs ROCm-compatible PyTorch packages.

[[executors]]
class = "train"
name = "diloco-transformer"
runtime = "process"
cmd = "uv"
args = [
    "run",
    "--python", "3.12",
    "--no-project",
    "--with", "https://github.com/hypha-space/hypha/releases/download/v<version>/hypha_accelerate_executor-<PEP 440-ish version derived from the release version (e.g., `1.0.0a19` for `v1.0.0-alpha.19`)>-py3-none-any.whl",
    "--index", "https://download.pytorch.org/whl/rocm6.4",
    "--index-strategy", "unsafe-best-match",
    "--",
    "accelerate",
    "launch",
    "--config_file", "<path/to/accelerate.yaml>",
    "-m", "hypha.accelerate_executor.training",
    "--socket", "{SOCKET_PATH}",
    "--work-dir", "{WORK_DIR}",
    "--job", "{JOB_JSON}",
]
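
The <path/to/accelerate.yaml> placeholder points to a Hugging Face Accelerate config file. A minimal single-GPU sketch to start from (the field values are assumptions; adjust mixed precision and the rest to your training setup):

compute_environment: LOCAL_MACHINE
distributed_type: 'NO'
mixed_precision: bf16
num_machines: 1
num_processes: 1
machine_rank: 0
main_training_function: main
use_cpu: false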

3. Start

Once configuration is complete, start the worker with:

hypha-worker run -c config.toml
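
To keep the worker running across logouts and reboots, you might wrap it in a systemd unit. A sketch, assuming the binary lives at /usr/local/bin/hypha-worker and the config at /etc/hypha/config.toml:

[Unit]
Description=Hypha worker
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/local/bin/hypha-worker run -c /etc/hypha/config.toml
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

Save it as /etc/systemd/system/hypha-worker.service and enable it with systemctl enable --now hypha-worker.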