Building a Hybrid AI Cluster - Mac Mini, Linux, and NVIDIA Under One Roof

March 15, 2026
mac-mini, linux, nvidia, exo, thunderbolt, ai, homelab, hybrid

My desk looks like a hardware store had an argument with a data center. Cables everywhere, four Mac Minis stacked in ways Apple never intended, a GPU workstation humming under the desk, and a tangle of Thunderbolt cables that I keep telling myself I'll clean up next weekend. I never do. The mess has its own logic - everything is where it is for a reason, even if that reason only makes sense at 2 AM.

I wanted to run large language models locally without renting GPU time from cloud providers. Not on a single machine - on a cluster that combines Apple Silicon for memory-efficient inference with an NVIDIA GPU for raw throughput. Mac Minis form the distributed pool, a Linux workstation with a high-end NVIDIA card does the heavy lifting, and everything is managed from one control plane.

The result is a hybrid environment where macOS and Linux nodes work together - different architectures, different operating systems, different strengths. Getting them to cooperate took longer than expected, and I spend more time in this lab than I probably should. Here's what actually happened.

The Hardware

Four Mac Mini M4 (base model):

  • 10-core CPU, 16 GB unified memory, 256 GB SSD
  • 3x Thunderbolt 4 / USB-C ports

Total cluster: 40 CPU cores, 48 GB of pooled worker memory (3 x 16 GB), 40 Gbps inter-node bandwidth.

One machine acts as the management node (olab0). The other three are workers that run exo.

A Note on M4 vs M4 Pro

The base M4 has Thunderbolt 4 (40 Gbps). The M4 Pro has Thunderbolt 5 (80 Gbps) with RDMA support. RDMA makes a real difference for distributed inference - it drops inter-node latency from ~300us to under 50us.

If you're buying new hardware specifically for an exo cluster, the M4 Pro is the better pick. I went with base M4 and it works, just slower on the network side.

Exo's dashboard might show "Thunderbolt 5 hardware detected - Enable RDMA." On a base M4 this is a false positive - ignore it. Verify what you actually have:

system_profiler SPThunderboltDataType | grep Speed
# TB4: "Speed: 40 Gb/s"
# TB5: "Speed: Up to 80 Gb/s"

The Network

Thunderbolt Mesh

Three workers, three cables, full mesh:

olab1 ---- olab2
  \        /
   olab3

Each Mac Mini has 3 TB ports. A full mesh of 3 nodes uses 2 ports each, leaving one free. I connected the management node to two workers with the remaining ports.

Plug cables one at a time. macOS pops up "New Interface Detected" in Network Settings. Accept it. Then set static IPs on the Thunderbolt Bridge:

sudo networksetup -setmanual "Thunderbolt Bridge" 10.66.2.11 255.255.255.0

The Bridge Loop Problem

A full mesh of 3 nodes creates a layer 2 switching loop. macOS bridges all TB ports into a single bridge0 interface, so broadcast frames can circulate olab1 to olab2 to olab3 to olab1 indefinitely.

macOS uses STP (Spanning Tree Protocol) to handle this automatically. You might see warnings about "network routing cycle detected." Don't follow the suggestion to disable Thunderbolt Bridge - that kills the entire high-speed interconnect. STP handles the loop fine.

If the warnings bother you, use a chain topology instead (remove one cable), but then traffic between end nodes has to hop through the middle one.

First Pings Fail

The Thunderbolt Bridge needs a few seconds for STP to converge and MAC learning to settle, so your first ping between newly connected nodes will time out. Run pings in both directions and wait 5-10 seconds. This is normal.
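
Rather than eyeballing retries, a small reachability loop can wait for the bridge to converge. A generic sketch (host and port are placeholders, not anything from my setup): it retries a TCP connect to the peer's SSH port until it answers or a deadline passes.

```python
import socket
import time

def wait_for_peer(host: str, port: int = 22, timeout_s: float = 30.0) -> bool:
    """Retry a TCP connect until the peer answers or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # One-second connect timeout per attempt
            with socket.create_connection((host, port), timeout=1):
                return True
        except OSError:
            time.sleep(0.5)   # bridge still converging; try again shortly
    return False
```

Call it as `wait_for_peer("10.66.2.12")` right after plugging a cable, and script the rest of the bring-up once it returns True.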

Jumbo frames don't work on TB Bridge (ifconfig bridge0 mtu 9000 fails with "Invalid argument"). Stuck with 1500 MTU.

Setting Up Headless Mac Minis

The workers run without keyboards, mice, or displays. This creates several problems you wouldn't hit on a normal Mac.

SSH Is Your Only Way In

Enable Remote Login in System Settings before going headless, or you're locked out. Configure SSH keys from the management node:

ssh-keygen -f ~/.ssh/olab -t ed25519
ssh-copy-id -i ~/.ssh/olab pawel@10.66.1.11

FileVault Blocks Headless Boot

If FileVault (disk encryption) is on, the Mac sits at a pre-boot password screen after every reboot or power outage. No SSH, no services, nothing. You'd need physical access to type the password.

Disable FileVault on headless workers. Keep it on the management node if you want - it has a display.

The tricky part: sudo fdesetup disable prompts for username and password interactively. Over SSH, you have to pipe it as XML:

echo '<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0"><dict>
  <key>Username</key><string>pawel</string>
  <key>Password</key><string>SECRET</string>
</dict></plist>' | ssh olab1 'sudo fdesetup disable -inputplist'

GUI Apps Don't Launch Over SSH

Any app that needs a window - including OrbStack, Docker Desktop, and even EXO.app - fails with "Domain does not support specified action" when launched via SSH. No GUI session means no GUI apps.

This caught me multiple times. The workarounds:

  • EXO: use the bundled CLI binary instead of the .app
  • Docker: use Colima (headless Docker runtime) instead of Docker Desktop
  • OrbStack: same problem as Docker Desktop, use Colima

Auto-Login vs Headless

If you really need GUI apps on headless Macs, enable auto-login in System Settings. Then the Mac boots into a full desktop session (just with no display) and GUI apps work. I didn't go this route because Colima solved my Docker needs without it.

Installing Exo

This section alone took me 2 hours.

Three Wrong Ways

What I tried, and what happened:

  • pip install exo - installed an unrelated bioinformatics tool ("Yeast Epigenome Project"). Wrong package entirely.
  • pip install exo-explore - package doesn't exist on PyPI.
  • Building from source - failed with xcrun: error: unable to find utility "metal". The Metal shader compiler ships only with full Xcode, not the Command Line Tools.

What Actually Works

Download the pre-built macOS app from exolabs.net:

# Download once, copy to all workers
curl -L -o /tmp/EXO-latest.dmg https://assets.exolabs.net/EXO-latest.dmg
scp /tmp/EXO-latest.dmg olab1:/tmp/

# Install
ssh olab1 'hdiutil attach /tmp/EXO-latest.dmg -nobrowse -quiet
sudo cp -R /Volumes/EXO/EXO.app /Applications/
hdiutil detach /Volumes/EXO -quiet
mkdir -p ~/.exo/models'

Run it headless (not open -a EXO - that needs GUI):

nohup /Applications/EXO.app/Contents/Resources/exo/exo > /tmp/exo.log 2>&1 &

Xcode CLI Tools Over SSH

xcode-select --install pops up a GUI dialog. Useless over SSH. Use softwareupdate instead:

sudo touch /tmp/.com.apple.dt.CommandLineTools.installondemand.in-progress
sudo softwareupdate -i "Command Line Tools for Xcode 26.3-26.3"

Downloads sometimes fail with PKDownloadError error 8. Just retry.

Python Version

Exo requires Python 3.13+. I wasted time installing 3.12 first. If you see requires a different Python: 3.12.x not in '>=3.13', that's why.

Exo Cluster Behavior

Exo's peer discovery works via mDNS. Start the binary on all nodes and they find each other within 15 seconds. One node gets elected master automatically.

Small Models Don't Shard

If a model fits in a single node's memory, exo runs it on one node only. My 8B model (4.6 GB) fits easily in 16 GB, so exo assigned all 32 layers to one node. world_size=1. The other two nodes sit idle.

This is actually correct - sharding across nodes adds network latency with no benefit when the model already fits. The cluster's value is in running models that don't fit on one machine (70B+, which need 40+ GB).

Single-node performance on M4: about 9 tok/s for Llama 3.1 8B via MLX.
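
The fits-or-shards decision comes down to simple arithmetic. A back-of-envelope sketch (my rule of thumb, not exo's actual placement logic): weight memory is roughly params x bits/8, plus ~20% for KV cache and runtime overhead.

```python
def fits_on_node(params_b: float, bits: int, node_gb: float = 16.0) -> bool:
    """Rough check: do quantized weights + ~20% overhead fit in one node's RAM?"""
    needed_gb = params_b * bits / 8 * 1.2   # weights + KV cache / runtime slack
    return needed_gb <= node_gb

print(fits_on_node(8, 4))    # 8B at 4-bit ~ 4.8 GB -> True, one node is enough
print(fits_on_node(70, 4))   # 70B at 4-bit ~ 42 GB -> False, needs the pooled 48 GB
```

That lines up with what exo did: the 8B model stayed on one node, and only 70B-class models would spill across the cluster.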

mDNS Issues

One node kept failing to discover peers. Turned out mDNS was broken on that specific machine. dns-sd -B _ssh._tcp local showed only itself, not the other nodes. Restarting mDNSResponder (sudo killall mDNSResponder) fixed it - but be careful, this can temporarily kill SSH access if the hostname resolves via mDNS.

Management Infrastructure

Running the cluster from the management node requires a few supporting services. I run all of these as Docker containers on olab0:

DNS (AdGuard Home)

Every service gets a .olab domain name. ollama.olab, grafana.olab, exo.olab, etc. Way better than remembering IPs and ports.

AdGuard Home handles DNS resolution for the entire cluster and also logs all queries - useful for security monitoring.
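
The name-to-IP mapping lives in AdGuard Home's DNS rewrites. A sketch of the relevant AdGuardHome.yaml fragment - the key layout varies between versions, and the IPs are assumptions standing in for my actual hosts:

```yaml
filtering:
  rewrites:
    - domain: ollama.olab      # each service name points at the host running it
      answer: 10.66.1.10
    - domain: grafana.olab
      answer: 10.66.1.10
    - domain: exo.olab
      answer: 10.66.1.11
```

The same rules can be added through the web UI under Filters, which is what I actually did; the YAML just shows what ends up on disk.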

Caching Proxy (Squid)

Installing Homebrew, Python, and pip packages on 3 identical machines downloads the same files 3 times. A Squid caching proxy on the management node caches everything after the first download.

One gotcha: Docker Desktop's NAT changes the source IP, so ACLs based on subnet (10.66.1.0/24) don't match. Set acl localnet src all for an internal-only proxy.
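
For reference, the relevant squid.conf fragment might look like this - a sketch, where the cache size, port, and paths are my choices rather than anything you must keep:

```
# Docker's NAT rewrites container source IPs, so a subnet ACL misses them.
# For an internal-only proxy, allowing everything is acceptable.
acl localnet src all
http_access allow localnet
http_access deny all

http_port 3128
cache_dir ufs /var/spool/squid 20000 16 256   # ~20 GB on-disk cache
maximum_object_size 4 GB                      # keep big .dmg / wheel downloads
```

Point Homebrew and pip at it with ALL_PROXY/HTTPS_PROXY environment variables on each worker.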

Monitoring (Grafana + Prometheus)

Prometheus scrapes node_exporter on every machine. Grafana shows CPU, memory, disk, network for all nodes. On the GPU server, I also run nvidia_gpu_exporter and cAdvisor for GPU and container metrics.

Docker Desktop on macOS can't route to LAN IPs from inside containers by default. You need docker-mac-net-connect for Prometheus to reach the other machines.
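
The scrape side is plain Prometheus. A sketch with my subnet's addresses assumed (node_exporter listens on 9100 by default, nvidia_gpu_exporter on 9835):

```yaml
scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - 10.66.1.11:9100   # olab1 (worker IPs are assumptions)
          - 10.66.1.12:9100   # olab2
          - 10.66.1.13:9100   # olab3
  - job_name: gpu
    static_configs:
      - targets: ['10.66.1.20:9835']   # GPU server, nvidia_gpu_exporter
```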

Performance Tuning

For headless compute nodes, disable everything you don't need:

sudo pmset -a sleep 0 displaysleep 1 hibernatemode 0 powernap 0
sudo nvram boot-args="serverperfmode=1"  # needs reboot
sudo mdutil -a -i off                    # Spotlight
sudo networksetup -setairportpower en1 off  # Wi-Fi
sudo tmutil disable                       # Time Machine

serverperfmode=1 raises macOS's kernel resource limits - more processes, threads, and network buffers - for sustained server workloads. It doesn't bypass thermal management, but under constant load the Mini runs warmer anyway.

Power Management

Four Mac Minis running 24/7 at ~5W each (idle) isn't terrible, but it adds up. I set up auto-sleep for when the cluster isn't being used:

./cluster idle 30    # sleep after 30 min idle
./cluster wake       # wake all + start exo
./cluster status     # who's awake?

Wake-on-LAN wakes the machines. I wrote a proxy on the management node that intercepts API requests to Ollama or exo - if the target machine is sleeping, it sends a WoL packet and waits for it to boot before forwarding the request. First request takes ~30 seconds, then it's instant.
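
The wake side of that proxy is simple: a Wake-on-LAN magic packet is 6 bytes of 0xFF followed by the target MAC repeated 16 times, sent as a UDP broadcast. A minimal sketch - the MAC and broadcast address are placeholders, and my actual proxy adds the wait-for-boot logic on top:

```python
import socket

def magic_packet(mac: str) -> bytes:
    """Build a WoL magic packet: 6 x 0xFF, then the MAC repeated 16 times."""
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    return b"\xff" * 6 + mac_bytes * 16

def send_wol(mac: str, broadcast: str = "10.66.1.255", port: int = 9) -> None:
    """Broadcast the packet on the LAN (UDP port 9 is conventional for WoL)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(magic_packet(mac), (broadcast, port))

pkt = magic_packet("aa:bb:cc:dd:ee:ff")   # placeholder MAC
print(len(pkt))   # 102 bytes: 6 + 16 * 6
```

The proxy calls send_wol for the sleeping node's MAC, then polls the target (the same reachability loop works) before forwarding the original request.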

What I'd Do Differently

  1. Buy M4 Pro - the extra RAM (48-64 GB per node) and TB5 RDMA would make a real difference for larger models
  2. Skip Thunderbolt for exo - exo's libp2p networking uses mDNS and picks whichever IP it finds first (usually Ethernet), not the faster TB bridge. There's no way to force it. The TB mesh is useful for file transfers but exo doesn't take advantage of it
  3. Start with the .dmg - I wasted hours trying to pip install and source-build exo before discovering the pre-built app

What's Next

  • Test multi-node sharding with a 70B model
  • Add an RTX 5090 GPU server to the mix (already done - see next post)
  • Build a proper monitoring dashboard (Grafana)
  • Set up LLM security pipeline (content safety, prompt injection detection)