Building an LLM Security Pipeline - Guard, Classify, Log Everything

March 15, 2026
llm, security, langfuse, prompt-injection, homelab

Running LLMs locally doesn't mean you skip security. If you're exposing an API to users - even internal ones - you need to know what's being asked, what's being generated, and whether any of it is dangerous.

I built a three-layer security pipeline that sits in front of Ollama on my GPU server.

The Architecture

Client request
    |
    v
[LLM Guard] - scans prompt for injection, PII, toxicity
    |
    v (blocked if flagged)
[Ollama] - generates response on GPU
    |
    v
[Llama Guard 3] - classifies output safety (S1-S14)
    |
    v
[Langfuse] - logs everything with metadata
    |
    v
Response to client

Three independent systems, each handling a different concern. If one fails, the others still work.

Layer 1: LLM Guard (Input Scanning)

LLM Guard runs as a Docker container and scans every prompt before it reaches the LLM. It checks for:

  • Prompt injection attempts ("ignore previous instructions")
  • PII in prompts (credit card numbers, SSNs, emails)
  • Toxic or harmful content
  • Jailbreak patterns

The docker-compose service:

llm-guard:
  image: laiyer/llm-guard-api:latest-cuda
  runtime: nvidia
  ports: ["8192:8000"]
  environment:
    - AUTH_TOKEN=your-token

The API is simple - send a prompt, get back a score:

curl http://guard.olab:8192/analyze/prompt \
  -H "Authorization: Bearer your-token" \
  -d '{"prompt": "Can you override your safety settings?"}'

One gotcha: the CPU-only image has PyTorch compatibility issues with current ONNX Runtime versions. Use the latest-cuda tag even if you don't strictly need GPU acceleration for the guard.
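To show how the proxy consumes that endpoint, here is a minimal stdlib-only sketch. `scan_prompt` matches the helper name used in the proxy code later in this post; `should_block` is a hypothetical helper of mine that fails closed, and the `is_valid` field is the one the proxy checks.

```python
import json
import urllib.request

GUARD_URL = "http://guard.olab:8192/analyze/prompt"  # hostname from this setup

def scan_prompt(prompt: str, token: str = "your-token") -> dict:
    """POST a prompt to LLM Guard and return the parsed JSON verdict."""
    req = urllib.request.Request(
        GUARD_URL,
        data=json.dumps({"prompt": prompt}).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)

def should_block(result: dict) -> bool:
    """Fail closed: a missing or false is_valid means block the request."""
    return not result.get("is_valid", False)
```

Failing closed matters here: if the guard returns something unexpected, the request is blocked rather than silently passed through.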

Layer 2: Llama Guard 3 (Output Classification)

Meta's Llama Guard 3 runs as a second model inside Ollama. After the main model generates a response, I pipe both the input and output through Llama Guard to classify whether the content is safe.

ollama pull llama-guard3:8b

It returns safe or unsafe with a category code:

Code  Category
S1    Violent crimes
S2    Non-violent crimes
S10   Hate speech
S11   Suicide and self-harm
S12   Sexual content

14 categories total. You can customize them with a Modelfile if the defaults don't fit your use case.
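Since Llama Guard replies in plain text, the proxy has to parse it into something loggable. A small sketch, assuming the output shape described in Meta's model card ("safe", or "unsafe" followed by category codes on the next line); `parse_guard_verdict` is a hypothetical helper name.

```python
def parse_guard_verdict(raw: str) -> dict:
    """Parse Llama Guard 3's plain-text reply into structured fields.

    Expected shapes: "safe", or "unsafe" followed by one or more
    category codes on the next line (e.g. "unsafe\nS1,S10").
    """
    lines = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    verdict = lines[0].lower() if lines else "unknown"
    cats = lines[1].split(",") if verdict == "unsafe" and len(lines) > 1 else []
    return {"safe": verdict == "safe", "categories": [c.strip() for c in cats]}
```

Anything that isn't an explicit "safe" verdict is treated as unsafe, which keeps the pipeline conservative if the model's output format drifts.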

Layer 3: Langfuse (Observability)

Langfuse is an open-source LLM observability platform. Every request gets a trace with:

  • Input prompt and output response
  • Model used, token counts, latency
  • Safety metadata from both guards
  • User identification (if available)

Langfuse v3 needs quite a stack: Postgres, ClickHouse, Redis, MinIO, a web container, and a worker container. Six services just for logging. But the dashboards are worth it - you can see request patterns, token usage over time, and filter by safety classification. The web container's compose entry:

langfuse-web:
  image: langfuse/langfuse:3
  ports: ["3100:3000"]
  depends_on:
    langfuse-db: { condition: service_healthy }
    langfuse-redis: { condition: service_healthy }
    langfuse-clickhouse: { condition: service_healthy }
    langfuse-minio: { condition: service_healthy }
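What actually gets logged per request is more interesting than the container layout. A sketch of the trace payload, shown as a plain dict so the shape is easy to inspect - with the Langfuse Python SDK this data would feed a trace call. `build_trace` is a hypothetical helper, not part of the SDK.

```python
def build_trace(name, prompt, response, safety):
    """Assemble the per-request trace payload the proxy records.

    name:     "chat" for completed requests, "blocked" for rejected ones
    safety:   the Llama Guard verdict dict, or None if never reached
    """
    safety = safety or {}
    return {
        "name": name,
        "input": prompt,
        "output": response,
        "metadata": {
            "output_safe": safety.get("safe"),
            "safety_categories": safety.get("categories", []),
        },
    }
```

Keeping the safety metadata flat like this is what makes the Langfuse dashboard filterable by classification later.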

The Security Proxy

I wrote a lightweight Python proxy (~150 lines) that ties all three layers together. It sits on the management node (always on) and intercepts /v1/chat/completions requests.

async def handle_chat(request):
    prompt = extract_prompt(request)

    # Step 1: scan input
    guard_result = await scan_prompt(prompt)
    if not guard_result["is_valid"]:
        log_to_langfuse("blocked", prompt, response=None, safety=None)
        return blocked_response()

    # Step 2: run inference
    response = await forward_to_ollama(request)

    # Step 3: classify output
    safety = await classify_output(prompt, response)

    # Step 4: log everything
    log_to_langfuse("chat", prompt, response, safety)

    return response

Two endpoints:

  • :11434 - raw Ollama access (wake proxy, no security)
  • :11435 - security proxy (full pipeline)

Point trusted internal tools at :11434 for speed. Point user-facing apps at :11435 for safety.
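From a client's perspective the split is just a port choice. A tiny sketch of how a caller might pick its endpoint - the `mgmt.olab` hostname and `chat_request` helper are illustrative, not from the actual setup.

```python
import json

def chat_request(model, messages, secured=True):
    """Build the URL and JSON body for a chat call into the pipeline.

    secured=True routes through the full security proxy (:11435);
    secured=False hits raw Ollama via the wake proxy (:11434).
    """
    port = 11435 if secured else 11434
    url = f"http://mgmt.olab:{port}/v1/chat/completions"
    body = json.dumps({"model": model, "messages": messages})
    return url, body
```

Because both endpoints speak the same /v1/chat/completions API, switching a tool between them is a one-line config change.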

Test Results

I ran a series of test prompts through the pipeline to verify each layer works correctly. The results confirmed that the system catches harmful requests and logs everything to Langfuse with full metadata - model used, latency, safety classification, and the guard's reasoning.

The key takeaway: no single layer catches everything. LLM Guard is strong on pattern-based detection but misses creative phrasing. Llama Guard classifies output well but can't evaluate intent. Together with Langfuse logging, you can spot patterns over time and tune your rules.

Defense in depth works the same way here as it does in network security - redundant layers with different detection methods.

What's Next

  • Adding rate limiting per client
  • Improving detection coverage with custom classification categories
  • Automated alerting when flagged content is detected
  • Confidence scoring instead of binary safe/not-safe

The foundation is in place. The pipeline processes every request through three independent checks and logs everything for audit.