Adding an RTX 5090 GPU Server to a Mac Mini Cluster
The Mac Mini cluster handles distributed inference well, but at 9 tok/s on an 8B model, it's not fast. I needed a GPU. So I built a workstation around an RTX 5090 and connected it to the cluster.
The Hardware
- AMD Ryzen 9 9950X3D - 16 cores, 32 threads
- NVIDIA RTX 5090 - 32 GB VRAM, 600W TDP
- 128 GB DDR5 RAM (4x 32 GB)
- 2 TB NVMe SSD
- MSI X870E Tomahawk WiFi motherboard
OS: Ubuntu Server 25.04
I considered Proxmox but decided against it. For a single-GPU AI workstation:
- Native NVIDIA drivers have zero virtualization overhead; GPU passthrough in Proxmox VE loses 5-10% and adds IOMMU complexity
- Docker + nvidia-container-toolkit gives full GPU access in containers
- 500 MB OS overhead vs 4-8 GB for a hypervisor
- The RTX 5090 is brand new - day-1 driver support is better on bare Ubuntu than through passthrough
The NIC Problem
The MSI X870E Tomahawk has a Realtek RTL8126 5GbE NIC that no mainline Linux kernel supports out of the box - not the 6.8 kernel in Ubuntu 24.04, and not even 6.14 in Ubuntu 25.04.
The board also has a 2.5GbE port (RTL8125) that works fine. I plugged into that one and moved on. The 5GbE port needs an out-of-tree r8126 driver built after install.
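If you're not sure which physical port is which, the PCI device IDs tell the two Realtek controllers apart. A minimal sketch - the lspci lines below are an assumed sample, not captured from this exact board:

```shell
# Assumed sample of `lspci -nn` output. Realtek's PCI vendor ID is 10ec,
# so the 2.5G controller shows up as 10ec:8125 and the 5G one as 10ec:8126.
sample='01:00.0 Ethernet controller [0200]: Realtek RTL8125 2.5GbE Controller [10ec:8125]
02:00.0 Ethernet controller [0200]: Realtek RTL8126 5GbE Controller [10ec:8126]'

# The 8125 is handled by the in-tree r8169 driver; the 8126 needs the out-of-tree r8126
echo "$sample" | grep -c '10ec:812[56]'   # prints 2 (both controllers present)
```

On the real machine, `lspci -k` also shows which kernel driver (if any) is bound to each port.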
I tried three ISOs before getting a working install:
- Ubuntu 24.04 LTS (kernel 6.8) - no network at all
- Ubuntu 25.04 Server (kernel 6.14) - same, kernel panic on network detection
- Ubuntu 25.04 with custom ISO - worked after plugging into the 2.5G port
Custom ISO via JetKVM
I have a JetKVM connected to the machine for remote BIOS access. Instead of burning USBs, I built a custom ISO with the r8126 driver source baked in and an autoinstall config for zero-touch setup:
# Extract ISO, add driver + autoinstall, rebuild
xorriso -indev ubuntu-25.04-live-server-amd64.iso -osirrox on -extract / /tmp/isoroot
cp -r r8126-driver /tmp/isoroot/
xorriso -as mkisofs ... -o ubuntu-25.04-chimera.iso /tmp/isoroot
Mount the custom ISO via JetKVM's virtual media, boot from it, done.
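For reference, the autoinstall portion of the rebuilt ISO only needs a handful of keys. A minimal sketch - the username, password placeholder, and driver paths are illustrative, not the actual config:

```yaml
#cloud-config
autoinstall:
  version: 1
  identity:
    hostname: chimera
    username: admin              # placeholder
    password: "<crypted hash>"   # generate with e.g. mkpasswd
  late-commands:
    # Copy the bundled r8126 source into the installed system for a post-boot build
    - cp -r /cdrom/r8126-driver /target/opt/r8126-driver
```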
NVIDIA Driver: Open Kernel Modules Required
The RTX 5090 (GB202 chip) does NOT work with NVIDIA's proprietary kernel modules. After installing nvidia-driver-570, I got:
NVRM: installed in this system requires use of the NVIDIA open kernel modules.
NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:886)
nvidia-smi showed "No devices found" even though the driver loaded.
The fix:
sudo apt install nvidia-driver-570-open
# NOT nvidia-driver-570
After reboot, the RTX 5090 appeared with full 32 GB VRAM.
Docker + GPU
sudo apt install docker.io docker-compose-v2
# Add NVIDIA container toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update && sudo apt install nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Important: use runtime: nvidia in docker-compose.yml, NOT the deploy.resources.reservations.devices syntax. The deploy syntax comes from Docker Swarm, and while newer Compose versions can translate it into GPU requests, on this setup it was silently ignored: the GPU was invisible inside the container and inference fell back to CPU.
# WRONG - GPU not accessible
services:
  ollama:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
# CORRECT
services:
  ollama:
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
I learned this the hard way when Ollama was doing 12 tok/s instead of 262 tok/s. Always verify with docker exec ollama nvidia-smi.
Services
Everything runs as Docker containers:
services:
  ollama:        # LLM inference
  comfyui:       # Image/video generation
  langfuse-web:  # LLM observability
  llm-guard:     # Prompt injection detection
  # + Postgres, ClickHouse, Redis, MinIO for Langfuse v3
  # + nvidia-exporter, cAdvisor, node-exporter for monitoring
Benchmarks
LLM Inference (Ollama)
| Model | Speed | GPU Util | VRAM | Power | Temp |
|---|---|---|---|---|---|
| Llama 3.1 8B | 131 tok/s | 90% | 9.9 GB | 587W | 57C |
| Qwen3 32B | 53 tok/s | 97% | 28.5 GB | 600W (TDP) | 63C |
The Qwen3 32B test maxes out the 600W power limit. The GPU is fully saturated at 97% utilization, using 28.5 of 32 GB VRAM.
For comparison, the Mac Mini M4 does about 9 tok/s on the same 8B model. The RTX 5090 is roughly 15x faster.
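The multiple is just the ratio of the two measured rates:

```shell
# 131 tok/s on the RTX 5090 vs 9 tok/s on the Mac Mini M4
awk 'BEGIN { printf "%.1fx\n", 131 / 9 }'   # prints 14.6x
```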
GPU Compute (FP16 MatMul)
| Matrix Size | Time | TFLOPS |
|---|---|---|
| 8192x8192 | 4.6ms | 236.8 |
| 16384x16384 | 38.2ms | 230.2 |
237 TFLOPS sustained, about 75% of the 317 TFLOPS theoretical peak. Normal efficiency for real workloads.
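The TFLOPS figures can be sanity-checked by hand: an NxN matmul costs about 2N^3 floating-point operations, so dividing by the measured wall time gives sustained throughput. The timings in the table are rounded, hence the small drift from 236.8/230.2:

```shell
# Sustained TFLOPS from matrix size (n) and measured time in milliseconds (ms)
tflops() {
    awk -v n="$1" -v ms="$2" 'BEGIN { printf "%.0f\n", 2*n*n*n / (ms/1000) / 1e12 }'
}
tflops 8192 4.6     # prints 239
tflops 16384 38.2   # prints 230
```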
Suspend Breaks CUDA
This one hurt. After a suspend/resume cycle, CUDA stops working. Ollama silently falls back to CPU. nvidia-smi still works - only CUDA compute is broken.
The symptoms:
- total_vram="0 B" in Ollama logs
- ggml_cuda_init: failed to initialize CUDA: unknown error
- Inference runs at 12 tok/s instead of 131
A reboot fixes it. Reloading the nvidia_uvm kernel module sometimes helps:
sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm
I updated the auto-sleep script to restart Ollama after every wake, but a full reboot is more reliable. Something to keep in mind if you're planning suspend/resume workflows with NVIDIA GPUs on Linux.
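One way to automate the module reload is a systemd-sleep hook. A sketch, assuming the standard /usr/lib/systemd/system-sleep/ mechanism; the filename and the decision to reload unconditionally are my choices, not the author's script:

```shell
#!/bin/sh
# Hypothetical hook: /usr/lib/systemd/system-sleep/99-nvidia-uvm
# systemd invokes it with "pre" before suspend and "post" after resume.
reload_uvm() {
    # Reload the UVM module that suspend leaves in a broken state
    rmmod nvidia_uvm 2>/dev/null || true
    modprobe nvidia_uvm
}
case "${1:-}" in
    post) reload_uvm ;;
esac
```

Even with the hook, services holding stale CUDA contexts (like Ollama) still need a restart after resume.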
Power Management
The RTX 5090 draws 15W idle and 600W under load. Leaving it on 24/7 isn't practical.
I set up auto-sleep: the machine checks every 5 minutes for GPU activity, Docker workloads, SSH sessions, and Ollama model status. After 30 minutes of inactivity, it suspends. Wake-on-LAN brings it back.
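The idle check boils down to ANDing a few activity signals. A sketch of the decision logic only - the thresholds and the way the inputs would be gathered (nvidia-smi, who, ollama ps) are assumptions, not the actual script:

```shell
#!/bin/sh
# Returns 0 ("safe to suspend") only when every activity signal is quiet.
should_suspend() {
    gpu_util=$1      # % GPU utilization, e.g. parsed from nvidia-smi
    ssh_sessions=$2  # active SSH sessions, e.g. counted from `who`
    loaded_models=$3 # models resident in Ollama, e.g. from `ollama ps`
    [ "$gpu_util" -lt 5 ] && [ "$ssh_sessions" -eq 0 ] && [ "$loaded_models" -eq 0 ]
}

should_suspend 0 0 0  && echo "idle: ok to suspend"
should_suspend 97 0 0 || echo "busy: stay awake"
```

In the real setup this would run from a timer every 5 minutes and call systemctl suspend after 30 minutes of consecutive idle checks.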
The management node runs a proxy that intercepts Ollama API requests. If chimera is sleeping, it sends a WoL packet, waits for boot (~30s), then forwards the request. From the client's perspective, the first request is slow but it just works.
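A WoL magic packet is simple to construct: 6 bytes of 0xFF followed by the target MAC address repeated 16 times. A sketch that builds the payload as hex (the MAC is a placeholder):

```shell
#!/bin/sh
# Build a Wake-on-LAN magic packet as a hex string:
# 6 x ff, then the MAC (colons stripped) repeated 16 times = 102 bytes total.
build_magic_packet() {
    mac=$(echo "$1" | tr -d ':')
    printf 'ffffffffffff'
    i=0
    while [ $i -lt 16 ]; do
        printf '%s' "$mac"
        i=$((i + 1))
    done
}

pkt=$(build_magic_packet "aa:bb:cc:dd:ee:ff")
echo "${#pkt}"   # prints 204 (102 bytes as hex)
```

In practice a tool like wakeonlan or etherwake handles the UDP broadcast; the proxy only needs to know the sleeping machine's MAC.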
What I'd Do Differently
The only real regret is the motherboard NIC situation. If I were buying again, I'd pick a board with an Intel 2.5G NIC instead of Realtek 5G. The Intel i226-V has had Linux support since kernel 5.x. The Realtek RTL8126 still doesn't have mainline support as of kernel 6.14.