$ cat case-studies/orion.md

ORION — a multi-agent platform for a homelab

How I built a chat-room model for Claude / Ollama / OpenAI agents that can open GitOps PRs against a real Kubernetes cluster — with audit trails, tool-tier authorization, and an evaluation harness that catches drift before it ships.

duration: ~6 months
team: solo
stack: TS / Python / K8s
status: running prod

01 · The problem

Off-the-shelf agent frameworks gave me two options: a single agent burning context on every step, or a multi-agent setup that hallucinated tool calls because the underlying model couldn't actually structure them properly. Neither was good enough to drive real infrastructure work — opening PRs, kicking off deploys, reading from a cluster, responding to ops events.

I wanted a system where:

→ multiple agents share a room, with explicit ring-leader / specialist routing
→ every tool call is gated, audited, and attributable
→ agents drive real change via GitOps PRs, not direct kubectl
→ Claude, Ollama, and OpenAI agents all participate equally

02 · Architecture

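Three pieces form the core, all fronted by a single Next.js app: the web app itself (which hosts the room agents and the tool registry), a Node worker, and Postgres + Redis for state.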

┌─────────────────────────────────────────────────────────────┐
│  Next.js web        ─── chat rooms, dashboards, REST API     │
│   │                                                          │
│   ├── room-agents   ─── per-room LLM dispatch, tool routing  │
│   │     │                                                    │
│   │     ├── callClaude ─→ orion-claude sidecar (MCP)         │
│   │     ├── callOpenAIChat ─→ ext models / Ollama OAI-compat │
│   │     └── callOllamaChat ─→ native /api/chat (no tools)    │
│   │                                                          │
│   └── tool-registry ─── 60+ tools across tiers               │
│         │                                                    │
│         ├── tasks   ─── create / close / reopen / validate   │
│         ├── gitops  ─── propose / ls / validate manifest     │
│         ├── k8s     ─── stat / logs / events                 │
│         └── ...                                              │
│                                                              │
│  Worker (Node)      ─── long-running tasks, watchers,        │
│                         retries with backoff                 │
│                                                              │
│  Postgres + Redis   ─── chat state, audit log, rate limits   │
└─────────────────────────────────────────────────────────────┘

Every agent reply runs through a dispatch loop that owns the message history, the tool-call protocol, and the per-room rate / token budget. Tool calls go to a central registry that enforces tier-based authorization (read · create · modify · destructive) and writes an audit row attributing every change to either a user or an agent.
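To make the tier check concrete, here is a minimal sketch of a registry that gates calls by tier and writes an audit row for every attempt. It is not the ORION source: the four tier names come from the text above, but treating them as a strict ordering, the `maxTier` field, the in-memory audit array (standing in for the Postgres table), and the tool names `k8s_logs` and `tasks_close` are all assumptions made for illustration.

```typescript
// Illustrative sketch only. Tier names are from the text; the ordering is assumed.
type Tier = "read" | "create" | "modify" | "destructive";
const TIER_RANK: Record<Tier, number> = { read: 0, create: 1, modify: 2, destructive: 3 };

type Actor = { kind: "user" | "agent"; id: string; maxTier: Tier };
type ToolDef = { name: string; tier: Tier; run: (args: Record<string, unknown>) => unknown };
type AuditRow = { tool: string; tier: Tier; actor: string; allowed: boolean; at: string };
type CallResult = { ok: true; value: unknown } | { ok: false; error: string };

class ToolRegistry {
  // Stand-in for the audit table; a real system would insert a database row.
  readonly audit: AuditRow[] = [];
  private readonly tools = new Map<string, ToolDef>();

  register(tool: ToolDef): void {
    this.tools.set(tool.name, tool);
  }

  call(actor: Actor, name: string, args: Record<string, unknown>): CallResult {
    const tool = this.tools.get(name);
    if (!tool) return { ok: false, error: `unknown tool: ${name}` };

    const allowed = TIER_RANK[actor.maxTier] >= TIER_RANK[tool.tier];
    // Record the attempt before running anything, so denied calls are audited too.
    this.audit.push({
      tool: name,
      tier: tool.tier,
      actor: `${actor.kind}:${actor.id}`,
      allowed,
      at: new Date().toISOString(),
    });
    if (!allowed) {
      return { ok: false, error: `${name} needs tier "${tool.tier}", actor is limited to "${actor.maxTier}"` };
    }
    return { ok: true, value: tool.run(args) };
  }
}

// Hypothetical tools and actor, for illustration.
const registry = new ToolRegistry();
registry.register({ name: "k8s_logs", tier: "read", run: () => "log lines" });
registry.register({ name: "tasks_close", tier: "modify", run: () => "closed" });

const readOnlyAgent: Actor = { kind: "agent", id: "sre-bot", maxTier: "read" };
const logsResult = registry.call(readOnlyAgent, "k8s_logs", { pod: "web-0" });
const closeResult = registry.call(readOnlyAgent, "tasks_close", { id: 42 });
```

The read call succeeds and the modify call is refused, but both land in the audit log. Auditing denials as well as successes is a design choice in this sketch, not something the text above commits to.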

03 · Three hard problems

Tool calling across model providers

Claude gets its tools through MCP, via the orion-claude sidecar. OpenAI takes them through the tools array on chat completions. Ollama's native /api/chat path was the odd one out: in my setup, models tended to write tool names out as prose instead of emitting structured calls. So I route tool-using Ollama models through Ollama's OpenAI-compatible /v1/chat/completions endpoint, where structured tool calls actually round-trip, and added a fake-tool-call detector that catches replies embedding a tool call as inline JSON and rejects them.
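A fake-tool-call detector can be quite small. The sketch below is one plausible shape, not the actual ORION detector: it scans the reply text for top-level `{...}` spans, parses each as JSON, and flags any object whose `name` or `tool` field matches a registered tool. The function names and the choice of those two keys are assumptions.

```typescript
// Heuristic sketch: find JSON objects embedded in prose.
// Braces inside JSON string literals are ignored; unbalanced braces in the prose can hide a match.
function extractJsonObjects(text: string): unknown[] {
  const found: unknown[] = [];
  let depth = 0;
  let start = -1;
  let inString = false;
  let escaped = false;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    // Only track string literals once we are inside a candidate object.
    if (ch === '"' && depth > 0) {
      inString = true;
    } else if (ch === "{") {
      if (depth === 0) start = i;
      depth++;
    } else if (ch === "}" && depth > 0) {
      depth--;
      if (depth === 0) {
        try {
          found.push(JSON.parse(text.slice(start, i + 1)));
        } catch {
          // Not valid JSON (for example a code snippet), so ignore it.
        }
      }
    }
  }
  return found;
}

// Returns the tool name the model tried to "call" in prose, or null if the text is clean.
function detectFakeToolCall(content: string, knownTools: ReadonlySet<string>): string | null {
  for (const candidate of extractJsonObjects(content)) {
    if (typeof candidate !== "object" || candidate === null || Array.isArray(candidate)) continue;
    const obj = candidate as Record<string, unknown>;
    const name = obj["name"] ?? obj["tool"];
    if (typeof name === "string" && knownTools.has(name)) return name;
  }
  return null;
}

const knownToolNames = new Set(["gitops_propose"]);
const fakeCall = detectFakeToolCall(
  'Sure, calling it now: {"name": "gitops_propose", "arguments": {"path": "apps/web.yaml"}}',
  knownToolNames,
);
const harmlessJson = detectFakeToolCall('The current config is {"replicas": 3}.', knownToolNames);
```

Matching against the set of registered tool names keeps ordinary JSON in a reply (config snippets, API responses) from tripping the detector.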

Agents driving real changes

The agent doesn't apply manifests — it proposes them. The gitops_propose tool branches, commits, opens a PR in Gitea, and policy gates decide what auto-merges. A separate ArgoCD-style reconciler watches the repo. That gives me three layers of review (the agent's reasoning, the policy, and Argo's diff) before anything touches the cluster, and a clean history of every change.
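The policy gate is the piece that decides whether a proposed PR may merge without a human. As a hedged sketch of what such a gate could look like: the rules below (deletions, sensitive kinds, an allow-list of namespaces) are invented for illustration and are not ORION's actual policy, and the sketch stops at the decision without calling Gitea.

```typescript
// Hypothetical policy gate. All rule sets below are assumptions for illustration.
type ProposedChange = {
  path: string;
  kind: string; // Kubernetes kind of the manifest, e.g. "Deployment"
  namespace: string;
  op: "add" | "update" | "delete";
};
type Decision = { action: "auto-merge" | "needs-review"; reasons: string[] };

const SENSITIVE_KINDS = new Set(["Secret", "ClusterRole", "ClusterRoleBinding", "Namespace"]);
const AUTO_MERGE_NAMESPACES = new Set(["sandbox", "dev"]);

function evaluatePolicy(changes: ProposedChange[]): Decision {
  const reasons: string[] = [];
  if (changes.length === 0) reasons.push("empty change set");
  for (const change of changes) {
    if (change.op === "delete") {
      reasons.push(`${change.path}: deletions need a human`);
    }
    if (SENSITIVE_KINDS.has(change.kind)) {
      reasons.push(`${change.path}: kind ${change.kind} is sensitive`);
    }
    if (!AUTO_MERGE_NAMESPACES.has(change.namespace)) {
      reasons.push(`${change.path}: namespace ${change.namespace} is not on the auto-merge list`);
    }
  }
  // Auto-merge only when no rule objected; otherwise surface every reason on the PR.
  return { action: reasons.length === 0 ? "auto-merge" : "needs-review", reasons };
}

const safeDecision = evaluatePolicy([
  { path: "apps/dev/web.yaml", kind: "Deployment", namespace: "dev", op: "update" },
]);
const riskyDecision = evaluatePolicy([
  { path: "apps/prod/db-secret.yaml", kind: "Secret", namespace: "prod", op: "delete" },
]);
```

Collecting every objection instead of stopping at the first gives the agent (and the reviewer) the full list of reasons in one pass.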

Evaluation and observability

Every tool call, every model call, every retry is logged with a correlation ID. I added a small eval harness that replays scripted conversations against the current build and scores them on a few rubrics (tool use, refusal accuracy, hallucination). Failures show up as a regression in CI before the model change ships — cheaper than finding out in prod that a Claude minor bump broke tool routing.
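The scoring half of a replay harness fits in a few functions. This is a sketch under assumptions, not the harness itself: the three rubric names come from the paragraph above, while the 0-to-1 scoring, the type shapes, and the threshold-based failure check are illustrative choices. It scores an already-recorded run, so the replay step that produces `Observed` is out of scope here.

```typescript
// Illustrative scoring for one replayed conversation. Shapes and scales are assumptions.
type Scenario = { id: string; expectedTools: string[]; shouldRefuse: boolean };
type Observed = { correlationId: string; toolsCalled: string[]; refused: boolean };
type Scores = { toolUse: number; refusal: number; hallucination: number };

function scoreRun(scenario: Scenario, observed: Observed, knownTools: ReadonlySet<string>): Scores {
  // Tool use: share of expected tools that were actually called.
  // With nothing expected, any tool call counts as a miss.
  const hits = scenario.expectedTools.filter((t) => observed.toolsCalled.includes(t)).length;
  const toolUse =
    scenario.expectedTools.length === 0
      ? observed.toolsCalled.length === 0 ? 1 : 0
      : hits / scenario.expectedTools.length;

  // Refusal accuracy: did the agent refuse exactly when it should have?
  const refusal = observed.refused === scenario.shouldRefuse ? 1 : 0;

  // Hallucination: share of calls that named a tool that really exists (1 is best).
  const unknownCalls = observed.toolsCalled.filter((t) => !knownTools.has(t)).length;
  const hallucination =
    observed.toolsCalled.length === 0 ? 1 : 1 - unknownCalls / observed.toolsCalled.length;

  return { toolUse, refusal, hallucination };
}

// A scenario fails if any rubric drops below the threshold; CI can fail on a non-empty result.
function failingScenarios(results: { id: string; scores: Scores }[], minScore: number): string[] {
  return results
    .filter((r) => Math.min(r.scores.toolUse, r.scores.refusal, r.scores.hallucination) < minScore)
    .map((r) => r.id);
}

// Hypothetical scenario and recorded run.
const evalKnownTools = new Set(["k8s_logs", "gitops_propose"]);
const logsScenario: Scenario = { id: "read-pod-logs", expectedTools: ["k8s_logs"], shouldRefuse: false };
const logsRun: Observed = { correlationId: "run-001", toolsCalled: ["k8s_logs", "kubectl_apply"], refused: false };
const logsScores = scoreRun(logsScenario, logsRun, evalKnownTools);
const failures = failingScenarios([{ id: logsScenario.id, scores: logsScores }], 0.8);
```

In this example the agent called the right tool but also invented `kubectl_apply`, so the run scores well on tool use and refusal yet still fails on the hallucination rubric.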

04 · Where it landed

+ 60+ tools registered across 7 categories
+ 3 model providers fully interoperable (Claude · Ollama · OpenAI)
+ Per-room rate limits and token budgets, backed by Redis with Sentinel
+ SOC 2-aligned audit trail on every write tool
+ GitOps integration — agents open PRs that the cluster auto-reconciles
+ Runs on a single-node Talos cluster behind Authentik SSO + CrowdSec

05 · What I'd tell someone starting

Pick the model abstraction before the framework. You're going to swap models more than you think. Define a single dispatch interface that takes a model ID and returns tokens / tool calls, then build everything else on top. Don't let framework conventions leak.
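As a rough sketch of that single dispatch interface: one `ModelProvider` contract, one `Dispatcher` that resolves a model ID to a provider. The `provider:model` ID format, the type names, and the `supportsTools` flag are assumptions for illustration, and the echo provider stands in for a real Claude, OpenAI, or Ollama client.

```typescript
// Sketch of a provider-agnostic dispatch interface. Names and ID format are assumptions.
type ChatMessage = { role: "system" | "user" | "assistant" | "tool"; content: string };
type ToolSpec = { name: string; description: string };
type ToolCall = { id: string; name: string; args: Record<string, unknown> };
type ModelReply = { text: string; toolCalls: ToolCall[] };

interface ModelProvider {
  readonly name: string;
  readonly supportsTools: boolean;
  chat(model: string, messages: ChatMessage[], tools: ToolSpec[]): Promise<ModelReply>;
}

class Dispatcher {
  private readonly providers = new Map<string, ModelProvider>();

  register(prefix: string, provider: ModelProvider): void {
    this.providers.set(prefix, provider);
  }

  // Split on the first colon only, so model names that contain colons survive intact.
  resolve(modelId: string): { provider: ModelProvider; model: string } {
    const split = modelId.indexOf(":");
    if (split <= 0) throw new Error(`model ID must look like "provider:model", got "${modelId}"`);
    const prefix = modelId.slice(0, split);
    const provider = this.providers.get(prefix);
    if (!provider) throw new Error(`no provider registered for "${prefix}"`);
    return { provider, model: modelId.slice(split + 1) };
  }

  async dispatch(modelId: string, messages: ChatMessage[], tools: ToolSpec[]): Promise<ModelReply> {
    const { provider, model } = this.resolve(modelId);
    // Providers without structured tool support never see the tool list.
    return provider.chat(model, messages, provider.supportsTools ? tools : []);
  }
}

// Stand-in provider so the sketch runs without network access.
const echoProvider: ModelProvider = {
  name: "echo",
  supportsTools: false,
  async chat(model, messages) {
    const last = messages[messages.length - 1];
    return { text: `[${model}] ${last ? last.content : ""}`, toolCalls: [] };
  },
};

const dispatcher = new Dispatcher();
dispatcher.register("ollama", echoProvider);
const resolved = dispatcher.resolve("ollama:llama3:8b");
```

Everything above the provider boundary sees only `ModelReply`, which is what keeps framework or vendor conventions from leaking into the dispatch loop.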

Audit attribution from day one. Once an agent can write to a database or open a PR, you need to know which agent and on whose behalf. Retrofitting this is awful — wire it through the request context from the first endpoint.
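One way to wire attribution through from the first endpoint is to make a request context a required argument of every write path, so the compiler refuses unattributed writes. The sketch below assumes that approach; the `RequestContext` shape, `delegateTo`, and `openPullRequest` are hypothetical names, and the array stands in for a real audit table.

```typescript
// Sketch: attribution carried in an explicit request context. All names are hypothetical.
type Principal = { kind: "user" | "agent"; id: string };
type RequestContext = { correlationId: string; actor: Principal; onBehalfOf: Principal | null };
type AuditEntry = {
  correlationId: string;
  actor: string;
  onBehalfOf: string | null;
  action: string;
  target: string;
};

const label = (p: Principal): string => `${p.kind}:${p.id}`;

// When a request hands work to an agent, the agent becomes the actor and the
// original requester is preserved, even across several hops.
function delegateTo(ctx: RequestContext, agentId: string): RequestContext {
  return {
    correlationId: ctx.correlationId,
    actor: { kind: "agent", id: agentId },
    onBehalfOf: ctx.onBehalfOf ?? ctx.actor,
  };
}

function auditEntry(ctx: RequestContext, action: string, target: string): AuditEntry {
  return {
    correlationId: ctx.correlationId,
    actor: label(ctx.actor),
    onBehalfOf: ctx.onBehalfOf ? label(ctx.onBehalfOf) : null,
    action,
    target,
  };
}

// A write path: the context is a required parameter, so it cannot be skipped.
function openPullRequest(ctx: RequestContext, branch: string, auditLog: AuditEntry[]): void {
  auditLog.push(auditEntry(ctx, "gitops.open_pr", branch));
  // The actual PR creation would happen here.
}

const attributionLog: AuditEntry[] = [];
const userCtx: RequestContext = {
  correlationId: "req-123",
  actor: { kind: "user", id: "alice" },
  onBehalfOf: null,
};
const plannerCtx = delegateTo(userCtx, "planner");
const deployerCtx = delegateTo(plannerCtx, "deployer");
openPullRequest(deployerCtx, "agent/bump-web-image", attributionLog);
```

After two hops the audit row still answers both questions: the deployer agent acted, and it did so on Alice's behalf, under the same correlation ID as her original request.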

Treat the tool registry as a product surface. Tool descriptions are prompts. Schemas are prompts. Error messages are prompts. The LLM reads all of it. Write them like you're writing copy for a junior engineer who'll never ask a follow-up question.
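To illustrate what "error messages are prompts" can mean in practice, here is a small sketch of a tool spec and a validator whose errors tell the model what to do next. Only the tool name gitops_propose comes from this write-up; its parameters, descriptions, and the validator are invented for the example.

```typescript
// Hypothetical tool spec. The parameters below are assumptions, not the real gitops_propose schema.
type ParamSpec = { type: "string" | "number"; description: string; required: boolean };
type ToolSpec = { name: string; description: string; parameters: Record<string, ParamSpec> };

const gitopsPropose: ToolSpec = {
  name: "gitops_propose",
  description:
    "Open a pull request that changes one manifest in the GitOps repo. " +
    "Use this for every cluster change. It does not apply anything directly; " +
    "the change takes effect only after the PR merges.",
  parameters: {
    path: { type: "string", description: "repo-relative path of the manifest to change", required: true },
    summary: { type: "string", description: "one-sentence reason for the change, used as the PR title", required: true },
  },
};

// Returns null when args are valid, otherwise a message written for the model to act on.
function validateArgs(spec: ToolSpec, args: Record<string, unknown>): string | null {
  const allowed = Object.keys(spec.parameters);
  for (const key of Object.keys(args)) {
    if (!allowed.includes(key)) {
      return `Unknown argument "${key}". ${spec.name} accepts only: ${allowed.join(", ")}.`;
    }
  }
  for (const key of allowed) {
    const param = spec.parameters[key];
    if (!param) continue;
    const value = args[key];
    if (value === undefined) {
      if (param.required) {
        return `Missing required argument "${key}" (${param.description}). Call ${spec.name} again with it set.`;
      }
      continue;
    }
    if (typeof value !== param.type) {
      return `Argument "${key}" must be a ${param.type}, got ${typeof value}. Call ${spec.name} again with a ${param.type}.`;
    }
  }
  return null;
}

const missingSummary = validateArgs(gitopsPropose, { path: "apps/web/deployment.yaml" });
const validCall = validateArgs(gitopsPropose, {
  path: "apps/web/deployment.yaml",
  summary: "Bump web image to fix crash loop",
});
```

Compare the missing-argument message with a bare "validation failed": the first gives the model enough to retry correctly without a follow-up question.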

Don't deploy agents without evals. The eval harness was the single highest-ROI piece of infrastructure. You don't need a fancy one: even a YAML-driven replay suite with a dozen scenarios catches most regressions (roughly 80%, in my experience).

$ echo "building something similar?"

I'm happy to walk through the architecture in more detail, share specific code patterns, or just pressure-test your current design. Most of these decisions are reversible if you make them early enough.
