Adding an RTX 5090 GPU Server to a Mac Mini Cluster
The Mac Mini cluster handles distributed inference well, but at 9 tok/s on an 8B model, it's not fast. I needed a GPU. So I built a workstation around an RTX 5090 and connected it to the cluster.
The Hardware
- AMD Ryzen 9 9950X3D - 16 cores, 32 threads
- NVIDIA RTX 5090 - 32 GB VRAM, 600W TDP
- 128 GB DDR5 RAM (4x 32 GB)
- 2 TB NVMe SSD
- MSI X870E Tomahawk WiFi motherboard
OS: Ubuntu Server 25.04
I considered Proxmox but decided against it. For a single-GPU AI workstation:
- Native NVIDIA drivers have zero virtualization overhead; GPU passthrough in Proxmox VE loses 5-10% and adds IOMMU complexity
- Docker + nvidia-container-toolkit gives full GPU access in containers
- 500 MB OS overhead vs 4-8 GB for a hypervisor
- The RTX 5090 is brand new - day-1 driver support is better on bare Ubuntu than through passthrough
The NIC Problem
The MSI X870E Tomahawk has a Realtek RTL8126 5GbE NIC that no mainline Linux kernel supports out of the box - not the 6.8 kernel in Ubuntu 24.04, and not even 6.14 in Ubuntu 25.04.
The board also has a 2.5GbE port (RTL8125) that works fine. I plugged into that one and moved on. The 5GbE port needs an out-of-tree r8126 driver built after install.
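If you're not sure which physical port is which, the PCI device IDs tell the two Realtek controllers apart. A minimal sketch - the lspci lines below are an assumed sample, not captured from this exact board:

```shell
# Assumed sample of `lspci -nn` output. Realtek's PCI vendor ID is 10ec,
# so the 2.5G controller shows up as 10ec:8125 and the 5G one as 10ec:8126.
sample='01:00.0 Ethernet controller [0200]: Realtek RTL8125 2.5GbE Controller [10ec:8125]
02:00.0 Ethernet controller [0200]: Realtek RTL8126 5GbE Controller [10ec:8126]'

# The 8125 is handled by the in-tree r8169 driver; the 8126 needs the out-of-tree r8126
echo "$sample" | grep -c '10ec:812[56]'   # prints 2 (both controllers present)
```

On the real machine, `lspci -k` also shows which kernel driver (if any) is bound to each port.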
I tried three ISOs before getting a working install:
- Ubuntu 24.04 LTS (kernel 6.8) - no network at all
- Ubuntu 25.04 Server (kernel 6.14) - same, kernel panic on network detection
- Ubuntu 25.04 with custom ISO - worked after plugging into the 2.5G port
Custom ISO via JetKVM
I have a JetKVM connected to the machine for remote BIOS access. Instead of burning USBs, I built a custom ISO with the r8126 driver source baked in and an autoinstall config for zero-touch setup:
# Extract ISO, add driver + autoinstall, rebuild
xorriso -indev ubuntu-25.04-live-server-amd64.iso -osirrox on -extract / /tmp/isoroot
cp -r r8126-driver /tmp/isoroot/
xorriso -as mkisofs ... -o ubuntu-25.04-chimera.iso /tmp/isoroot
Mount the custom ISO via JetKVM's virtual media, boot from it, done.
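For reference, the autoinstall portion of the rebuilt ISO only needs a handful of keys. A minimal sketch - the username, password placeholder, and driver paths are illustrative, not the actual config:

```yaml
#cloud-config
autoinstall:
  version: 1
  identity:
    hostname: chimera
    username: admin              # placeholder
    password: "<crypted hash>"   # generate with e.g. mkpasswd
  late-commands:
    # Copy the bundled r8126 source into the installed system for a post-boot build
    - cp -r /cdrom/r8126-driver /target/opt/r8126-driver
```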
NVIDIA Driver: Open Kernel Modules Required
The RTX 5090 (GB202 chip) does NOT work with NVIDIA's proprietary kernel modules. After installing nvidia-driver-570, I got:
NVRM: installed in this system requires use of the NVIDIA open kernel modules.
NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:886)
nvidia-smi showed "No devices found" even though the driver loaded.
The fix:
sudo apt install nvidia-driver-570-open
# NOT nvidia-driver-570
After reboot, the RTX 5090 appeared with full 32 GB VRAM.
Docker + GPU
sudo apt install docker.io docker-compose-v2
# Add NVIDIA container toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update && sudo apt install nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Important: use runtime: nvidia in docker-compose.yml, NOT the deploy.resources.reservations.devices syntax. The deploy syntax comes from Docker Swarm, and while newer Compose versions can translate it into GPU requests, on this setup it was silently ignored: the GPU was invisible inside the container and inference fell back to CPU.
# WRONG - GPU not accessible
services:
  ollama:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
# CORRECT
services:
  ollama:
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
I learned this the hard way when Ollama was doing 12 tok/s instead of 262 tok/s. Always verify with docker exec ollama nvidia-smi.
Services
Everything runs as Docker containers:
services:
  ollama:        # LLM inference
  comfyui:       # Image/video generation
  langfuse-web:  # LLM observability
  llm-guard:     # Prompt injection detection
  # + Postgres, ClickHouse, Redis, MinIO for Langfuse v3
  # + nvidia-exporter, cAdvisor, node-exporter for monitoring
Benchmarks
LLM Inference (Ollama)
| Model | Speed | GPU Util | VRAM | Power | Temp |
|---|---|---|---|---|---|
| Llama 3.1 8B | 131 tok/s | 90% | 9.9 GB | 587W | 57C |
| Qwen3 32B | 53 tok/s | 97% | 28.5 GB | 600W (TDP) | 63C |
The Qwen3 32B test maxes out the 600W power limit. The GPU is fully saturated at 97% utilization, using 28.5 of 32 GB VRAM.
For comparison, the Mac Mini M4 does about 9 tok/s on the same 8B model. The RTX 5090 is roughly 15x faster.
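The multiple is just the ratio of the two measured rates:

```shell
# 131 tok/s on the RTX 5090 vs 9 tok/s on the Mac Mini M4
awk 'BEGIN { printf "%.1fx\n", 131 / 9 }'   # prints 14.6x
```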
GPU Compute (FP16 MatMul)
| Matrix Size | Time | TFLOPS |
|---|---|---|
| 8192x8192 | 4.6ms | 236.8 |
| 16384x16384 | 38.2ms | 230.2 |
237 TFLOPS sustained, about 75% of the 317 TFLOPS theoretical peak. Normal efficiency for real workloads.
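The TFLOPS figures can be sanity-checked by hand: an NxN matmul costs about 2N^3 floating-point operations, so dividing by the measured wall time gives sustained throughput. The timings in the table are rounded, hence the small drift from 236.8/230.2:

```shell
# Sustained TFLOPS from matrix size (n) and measured time in milliseconds (ms)
tflops() {
    awk -v n="$1" -v ms="$2" 'BEGIN { printf "%.0f\n", 2*n*n*n / (ms/1000) / 1e12 }'
}
tflops 8192 4.6     # prints 239
tflops 16384 38.2   # prints 230
```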
Suspend Breaks CUDA
This one hurt. After a suspend/resume cycle, CUDA stops working. Ollama silently falls back to CPU. nvidia-smi still works - only CUDA compute is broken.
The symptoms:
- total_vram="0 B" in Ollama logs
- ggml_cuda_init: failed to initialize CUDA: unknown error
- Inference runs at 12 tok/s instead of 131
A reboot fixes it. Reloading the nvidia_uvm kernel module sometimes helps:
sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm
I updated the auto-sleep script to restart Ollama after every wake, but a full reboot is more reliable. Something to keep in mind if you're planning suspend/resume workflows with NVIDIA GPUs on Linux.
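One way to automate the module reload is a systemd-sleep hook. A sketch, assuming the standard /usr/lib/systemd/system-sleep/ mechanism; the filename and the decision to reload unconditionally are my choices, not the author's script:

```shell
#!/bin/sh
# Hypothetical hook: /usr/lib/systemd/system-sleep/99-nvidia-uvm
# systemd invokes it with "pre" before suspend and "post" after resume.
reload_uvm() {
    # Reload the UVM module that suspend leaves in a broken state
    rmmod nvidia_uvm 2>/dev/null || true
    modprobe nvidia_uvm
}
case "${1:-}" in
    post) reload_uvm ;;
esac
```

Even with the hook, services holding stale CUDA contexts (like Ollama) still need a restart after resume.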
Power Management
The RTX 5090 draws 15W idle and 600W under load. Leaving it on 24/7 isn't practical.
I set up auto-sleep: the machine checks every 5 minutes for GPU activity, Docker workloads, SSH sessions, and Ollama model status. After 30 minutes of inactivity, it suspends. Wake-on-LAN brings it back.
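The idle check boils down to ANDing a few activity signals. A sketch of the decision logic only - the thresholds and the way the inputs would be gathered (nvidia-smi, who, ollama ps) are assumptions, not the actual script:

```shell
#!/bin/sh
# Returns 0 ("safe to suspend") only when every activity signal is quiet.
should_suspend() {
    gpu_util=$1      # % GPU utilization, e.g. parsed from nvidia-smi
    ssh_sessions=$2  # active SSH sessions, e.g. counted from `who`
    loaded_models=$3 # models resident in Ollama, e.g. from `ollama ps`
    [ "$gpu_util" -lt 5 ] && [ "$ssh_sessions" -eq 0 ] && [ "$loaded_models" -eq 0 ]
}

should_suspend 0 0 0  && echo "idle: ok to suspend"
should_suspend 97 0 0 || echo "busy: stay awake"
```

In the real setup this would run from a timer every 5 minutes and call systemctl suspend after 30 minutes of consecutive idle checks.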
The management node runs a proxy that intercepts Ollama API requests. If chimera is sleeping, it sends a WoL packet, waits for boot (~30s), then forwards the request. From the client's perspective, the first request is slow but it just works.
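A WoL magic packet is simple to construct: 6 bytes of 0xFF followed by the target MAC address repeated 16 times. A sketch that builds the payload as hex (the MAC is a placeholder):

```shell
#!/bin/sh
# Build a Wake-on-LAN magic packet as a hex string:
# 6 x ff, then the MAC (colons stripped) repeated 16 times = 102 bytes total.
build_magic_packet() {
    mac=$(echo "$1" | tr -d ':')
    printf 'ffffffffffff'
    i=0
    while [ $i -lt 16 ]; do
        printf '%s' "$mac"
        i=$((i + 1))
    done
}

pkt=$(build_magic_packet "aa:bb:cc:dd:ee:ff")
echo "${#pkt}"   # prints 204 (102 bytes as hex)
```

In practice a tool like wakeonlan or etherwake handles the UDP broadcast; the proxy only needs to know the sleeping machine's MAC.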
What I'd Do Differently
The only real regret is the motherboard NIC situation. If I were buying again, I'd pick a board with an Intel 2.5G NIC instead of Realtek 5G. The Intel i226-V has had Linux support since kernel 5.x. The Realtek RTL8126 still doesn't have mainline support as of kernel 6.14.