Two Bugs, One Docker

March 20, 2026
docker · dns · networking · debugging · docker-compose · embedded · mqtt

Two production issues hit the same evening. One on the central platform, one on an edge device. Both were Docker Compose operational bugs - not code bugs, not infrastructure bugs. Just the kind of thing that happens when containers run for months without being recreated.


Bug 1: The Missing Compose Overlay

The central platform's market prices page stopped working. The frontend loaded, but the API calls returned 404:

WARN path=/api/v1/market/prices status=404
WARN path=/api/v1/market/weather status=404

The architecture is straightforward: a Go API server proxies /api/v1/market/* requests to a separate marketd microservice. The proxy is only registered if the MARKET_API_URL environment variable is set:

if deps.MarketAPIURL != "" {
    mp := NewMarketProxy(deps.MarketAPIURL)
    r.Get("/market/*", mp.ProxyHandler)
}

No env var, no route, no proxy. Just 404.

The Diagnosis

The marketd service was running and healthy. I could reach it from inside the API container:

$ docker exec collector wget -qO- http://marketd:8090/health
{"status":"ok"}

But listing the API container's environment showed no MARKET_API_URL. It also had no LOKI_URL and no REALM_ID - all variables defined in the production Compose overlay.
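A quick way to spot this class of problem is to diff the variable names the overlay defines against what the container actually received. A minimal sketch - `missing_env` is a hypothetical helper that works on plain text; in practice you'd feed it the output of `docker inspect collector --format '{{range .Config.Env}}{{println .}}{{end}}'`:

```shell
# missing_env: print expected variable names absent from the actual env.
# $1 = newline-separated expected names, $2 = actual KEY=VALUE lines.
missing_env() {
  for name in $1; do
    echo "$2" | cut -d= -f1 | grep -qx "$name" || echo "$name"
  done
}

expected='MARKET_API_URL
LOKI_URL
REALM_ID'

# Stand-in for the real `docker inspect` output:
actual='PATH=/usr/local/bin
LOKI_URL=http://loki:3100'

missing_env "$expected" "$actual"
# → MARKET_API_URL
#   REALM_ID
```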

The project uses two Compose files stacked together:

docker compose -f docker-compose.yml -f docker-compose.production.yml up -d

The base file defines service structure. The production overlay adds environment variables, overrides ports, and configures monitoring. Someone (probably me, weeks ago) had recreated the collector using only the base file:

# What was probably run:
docker compose up -d --force-recreate collector

# What should have been run:
docker compose -f docker-compose.yml -f docker-compose.production.yml up -d --force-recreate collector

One forgotten -f flag. The container started fine, passed health checks, served most API routes correctly. Only the market proxy was silently missing because its env var didn't exist.

The Port Conflict Surprise

Fixing it revealed a second problem. The production overlay had ports: "8080:8080" on the API container, but the reverse proxy (Caddy) in the base Compose also mapped 8080:8080. When both were recreated with the full overlay, they fought over the port:

Error: Bind for 0.0.0.0:8080 failed: port is already allocated

The API service shouldn't expose ports directly - that's what the reverse proxy is for. The base Compose correctly used expose: "8080" (internal only), but the production overlay incorrectly added a host port mapping. Removing the ports directive from the production overlay fixed it.
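The corrected split looks roughly like this - a sketch using the service names from this stack, with the marketd URL shown as an assumed example value:

```yaml
# docker-compose.yml (base)
services:
  caddy:
    ports:
      - "8080:8080"     # the reverse proxy is the only host-bound service
  collector:
    expose:
      - "8080"          # reachable by other containers, not the host

# docker-compose.production.yml (overlay)
services:
  collector:
    environment:
      MARKET_API_URL: http://marketd:8090
      # no `ports:` here - the overlay layers config, it doesn't publish ports
```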

Takeaway

Docker Compose overlays are powerful but fragile. There's no warning when you run docker compose up without all your overlay files - Docker just uses whatever files you specify. The container starts, passes health checks, and silently runs with a subset of its intended configuration.

If you use Compose overlays in production, protect yourself:

# In your deploy script or .bashrc:
alias dc='docker compose -f docker-compose.yml -f docker-compose.production.yml'

Or put it in a docker-compose.override.yml which Docker loads automatically. Or use a Makefile. Anything that removes the human from the command.
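Another option that removes the human: Compose reads the COMPOSE_FILE environment variable (colon-separated on Linux/macOS; the separator is configurable via COMPOSE_PATH_SEPARATOR), and it also picks the variable up from a .env file in the project directory. Set it once and a bare docker compose up -d always includes the overlay:

```shell
# Set once in the deploy environment or in the project's .env file.
export COMPOSE_FILE=docker-compose.yml:docker-compose.production.yml

# Every subsequent `docker compose up -d` now uses both files.
# Inspect what the variable expands to:
echo "$COMPOSE_FILE" | tr ':' '\n'
# → docker-compose.yml
#   docker-compose.production.yml
```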


Bug 2: The DNS Ghost

The same evening, an edge device stopped being able to register with the central platform. Different system, different host, different error:

httpx.ConnectError: [Errno -3] Temporary failure in name resolution

DNS failure. Simple, right? It took longer than it should have to find the real cause - because every obvious check passed.

The Setup

The device is an embedded Linux box running a Docker Compose stack - an API service, a data collector, a reverse proxy, and a time-series database. It connects to an external MQTT broker by hostname for telemetry sync. The stack had been running for about 8 months without issues.

Then the UI showed a registration error: DNS resolution was failing inside the API container. No code changes, no deployments, no Docker updates. It just stopped working.

The Misleading Clues

First instinct: check if DNS works on the host.

$ ping -c1 mqtt.example.com
PING mqtt.example.com (91.x.x.x) 56(84) bytes of data.
64 bytes from host-91-x-x-x.example.com: icmp_seq=1 ttl=58 time=17.4ms

Host DNS is fine. Check the host's resolver:

# /etc/resolv.conf (host)
# Generated by NetworkManager
nameserver 192.168.200.100

Normal. Now check inside the container:

$ docker exec api python3 -c "import socket; socket.getaddrinfo('mqtt.example.com', 8883)"
# socket.gaierror: [Errno -3] Temporary failure in name resolution

Fails. But the container's /etc/resolv.conf looks standard for a Compose network:

nameserver 127.0.0.11
options ndots:0

127.0.0.11 is Docker's embedded DNS resolver. It handles container-to-container name resolution (service discovery) and forwards external queries to whatever DNS the host was using when the container started. This is normal for any container on a user-defined network.

The Red Herring

Maybe Docker's embedded DNS is broken on this network? Spin up a fresh container on the same Compose network:

$ docker run --rm --network myapp_default alpine nslookup google.com
Server:    127.0.0.11
Address:   127.0.0.11:53

Non-authoritative answer:
Name:   google.com
Address: 142.251.98.100

Works perfectly. Same network, same DNS resolver address, same Docker daemon. But the 8-month-old container can't resolve anything.

The Ghost

The answer was hiding in plain sight, in a comment inside /etc/resolv.conf:

Fresh container:

# ExtServers: [host(192.168.200.100)]

8-month-old container:

# ExtServers: [host(192.168.65.7)]

192.168.65.7 was the host's DNS server eight months ago, when the container was created. The network configuration changed at some point - maybe a router swap, maybe a DHCP lease change - and the host's /etc/resolv.conf updated to 192.168.200.100. But Docker's embedded DNS resolver inside the running container still had 192.168.65.7 cached as its upstream forwarder.

The old DNS server no longer existed on the network.
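A drift check can be scripted from that comment. The sed pattern below matches the ExtServers line format shown above; in practice you'd pipe in `docker exec api cat /etc/resolv.conf` and compare the result against the host's current nameserver. A sketch, run here against sample text so it works anywhere:

```shell
# Extract the upstream forwarder Docker recorded at container creation.
ext_server() {
  sed -n 's/.*ExtServers: \[host(\([0-9.]*\))].*/\1/p'
}

# Sample input: the stale container's resolv.conf from this incident.
printf '# ExtServers: [host(192.168.65.7)]\nnameserver 127.0.0.11\n' | ext_server
# → 192.168.65.7
```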

How Docker's Embedded DNS Works

When you create a container on a user-defined network (which Docker Compose does by default), Docker sets up an embedded DNS server at 127.0.0.11 inside the container. This resolver does two things:

  1. Service discovery - resolves container names and service aliases within the Docker network
  2. External forwarding - forwards all other queries to the host's DNS server

The critical detail: the external forwarder address is captured at container creation time from the host's /etc/resolv.conf. It's baked into the container's network namespace. If the host's DNS changes while the container is running, the embedded resolver keeps forwarding to the old address.

For containers on the default bridge network, Docker takes a different approach - it copies the host's /etc/resolv.conf directly into the container (the "legacy" mode). These containers get the host's DNS server address verbatim and don't use the embedded resolver at all. Note that the fresh-container test earlier succeeded for a different reason: it ran on the same Compose network as the broken container, but because it was newly created, it captured the host's current DNS.

Why This Is Insidious

This failure mode has several properties that make it hard to catch:

It's silent. No log message says "DNS forwarder changed." The container doesn't monitor the host's resolver config. It just starts failing when it needs to resolve an external name.

It's selective. Container-to-container resolution still works perfectly (that's handled by Docker's internal DNS, not the forwarder). So api can still talk to influxdb by name. Only external hostnames break.

It's inconsistent. New containers on the same network work fine because they capture the current DNS at creation time. This makes it look like a container-specific bug rather than a network issue.

It's delayed. The DNS could change months before the failure manifests. If your containers don't resolve external names frequently, you won't notice until something triggers it.

The Fix

Restart the affected containers:

docker compose up -d --force-recreate api collector

--force-recreate destroys and recreates the containers, which captures the current DNS configuration. A simple docker compose restart would NOT fix it - restart just stops and starts the existing container without recreating its network namespace.

If you also run a reverse proxy (like Caddy or nginx), recreate that too - it caches upstream container IPs that change on recreate.

Prevention

Option 1: Configure DNS explicitly in Docker daemon

Add DNS servers to /etc/docker/daemon.json:

{
  "dns": ["8.8.8.8", "1.1.1.1"]
}

This overrides the host's /etc/resolv.conf for all containers, so the DNS servers won't change when your local network does. Note that daemon.json changes only take effect after restarting the Docker daemon, and only for containers created afterwards. Downside: you lose automatic DNS from DHCP, and you depend on external DNS availability.

Option 2: Configure DNS in Compose

services:
  api:
    dns:
      - 8.8.8.8
      - 1.1.1.1

Same idea, scoped to specific services.

Option 3: Health checks that catch it

Add a health check that resolves an external name:

services:
  api:
    healthcheck:
      test: ["CMD", "python3", "-c", "import socket; socket.getaddrinfo('dns-check.example.com', 443)"]
      interval: 60s
      timeout: 5s
      retries: 3

This won't prevent the issue, but it'll mark the container as unhealthy when it happens, making it visible in monitoring.

Option 4: Don't run containers for 8 months

Regular deployments naturally recreate containers. The issue only affects long-running containers that outlive their network configuration. If you deploy weekly (or even monthly), you'll never hit this.

Embedded and IoT devices that run unattended for months are the real danger zone. These devices sit on home networks where routers get replaced, ISPs change, and DHCP configurations drift. The containers keep running, oblivious.

The Common Thread

Both bugs share the same root cause: Docker containers snapshot their configuration at creation time and never update it.

Bug 1: the Compose overlay defines env vars at creation time. Miss the overlay, and the container runs forever with the wrong config. No warning, no health check failure, just a silently missing feature.

Bug 2: the DNS forwarder is captured at creation time. Change the network, and the container keeps forwarding to a dead server. No warning, no log entry, just failed resolution.

Docker containers are not VMs. They don't have an init system that re-reads config files. They don't poll the host for network changes. They don't validate their environment against what was intended. They run with whatever they were given at birth, for as long as they live.

For services that get deployed weekly, this doesn't matter - containers are ephemeral by default. But for embedded devices, IoT gateways, and self-hosted infrastructure that runs unattended for months, these time-bomb failures are real.

Debugging Checklist

When a Dockerized service silently breaks after months of working:

  1. Check the environment. docker inspect <container> --format '{{json .Config.Env}}' - compare against what your Compose files define. Look for missing variables.
  2. Check the DNS. Read /etc/resolv.conf inside the container. Compare the ExtServers comment against the host's current /etc/resolv.conf. If they don't match, recreate.
  3. Test on a fresh container. docker run --rm --network <compose_network> alpine nslookup google.com - if this works but your old container doesn't, it's a stale config issue.
  4. Check your Compose command. Did you use all the overlay files? docker inspect <container> --format '{{index .Config.Labels "com.docker.compose.project.config_files"}}' tells you which files were used.
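Check 4 lends itself to a deploy-time guard. A sketch - the label value is stubbed with a hypothetical path here; in practice it comes from the docker inspect command shown in the checklist:

```shell
# Verify the production overlay was part of the container's creation.
# Stub value for illustration - really comes from:
#   docker inspect collector --format \
#     '{{index .Config.Labels "com.docker.compose.project.config_files"}}'
files='/srv/app/docker-compose.yml'

case "$files" in
  *docker-compose.production.yml*) echo "overlay present" ;;
  *) echo "overlay MISSING - recreate with both -f flags" ;;
esac
# → overlay MISSING - recreate with both -f flags
```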

The fix is always the same: docker compose up -d --force-recreate. The question is whether you catch it before your users do.