$ cat services.md

What I do

I focus on the parts of an AI product that aren't the model itself: the surrounding system that makes it reliable, observable, and economical. I work with teams either as a hands-on engineer shipping alongside your team or as an advisor reviewing architecture.

Agent system design

From single-purpose tool callers to multi-agent rooms with handoff and delegation.

→ Picking the right pattern — one agent + tools, planner+executor, multi-agent room, or ReAct loop
→ Designing tool calling that actually works — schemas, errors, retries, safety tiers
→ Memory and context strategy — sliding windows, summarization, RAG, structured memory

Model routing & cost / latency tuning

Stop overpaying. Get the cheapest model that hits your quality bar, and a fallback for when it doesn't.

→ Routing strategies across Claude, OpenAI, and self-hosted models (Ollama, vLLM)
→ Caching, prompt deduplication, and context-window economics
→ Latency budgets — streaming, parallel tool calls, speculative decoding where it helps
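
To make "cheapest model that hits your quality bar, plus a fallback" concrete, here is a minimal sketch of cheapest-first routing. Everything in it is an assumption for illustration: the `Route` type, the route names, the relative costs, and the `non_empty` quality check are hypothetical, and the two model functions are stubs standing in for real client calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Route:
    name: str                   # hypothetical label, not a real model ID
    cost_per_call: float        # illustrative relative cost
    call: Callable[[str], str]  # stand-in for a real client call


def route(prompt: str, routes: list[Route], good_enough: Callable[[str], bool]):
    """Try routes cheapest-first; return the first answer that passes the check."""
    tried = []
    for r in sorted(routes, key=lambda r: r.cost_per_call):
        tried.append(r.name)
        try:
            answer = r.call(prompt)
        except Exception:
            continue  # treat a provider error like a failed quality check
        if good_enough(answer):
            return r.name, answer, tried
    raise RuntimeError(f"no route met the quality bar; tried {tried}")


# Stub models (assumptions): the small one gives up on longer prompts.
def small(prompt: str) -> str:
    return "4" if len(prompt) < 20 else ""


def large(prompt: str) -> str:
    return f"answer to: {prompt}"


def non_empty(answer: str) -> bool:
    return bool(answer.strip())


ROUTES = [Route("large-hosted", 10.0, large), Route("small-local", 1.0, small)]
```

In a real system the quality check is the hard part; an eval-backed scorer would usually replace `non_empty`, and the list of routes tried would go to tracing.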

Tool calling & MCP integration

The part most agent projects get wrong. Schema design, execution sandboxing, and tier-based authorization.

→ MCP server design — what to expose, what to gate behind approval
→ Read / write / destructive tool tiers, audit logging, agent attribution
→ Tool result caching, parallel execution, error recovery patterns
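
As a rough illustration of read / write / destructive tiers with audit logging and agent attribution, the sketch below gates each tool call on its tier. The `Tier`, `Tool`, and `Gateway` names, the `approved` flag, and the example tools are all hypothetical; this is not an MCP API, just one way the idea could be laid out.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Any, Callable


class Tier(IntEnum):
    READ = 0
    WRITE = 1
    DESTRUCTIVE = 2


@dataclass
class Tool:
    name: str
    tier: Tier
    fn: Callable[..., Any]


@dataclass
class Gateway:
    tools: dict[str, Tool]
    auto_approve_up_to: Tier = Tier.READ  # anything above this needs approval
    audit_log: list[dict] = field(default_factory=list)

    def call(self, agent: str, tool_name: str, approved: bool = False, **kwargs: Any) -> Any:
        tool = self.tools[tool_name]
        allowed = tool.tier <= self.auto_approve_up_to or approved
        # Log every attempt, allowed or not, attributed to the calling agent.
        self.audit_log.append(
            {"agent": agent, "tool": tool_name, "tier": tool.tier.name, "allowed": allowed}
        )
        if not allowed:
            raise PermissionError(f"{tool_name} ({tool.tier.name}) needs approval")
        return tool.fn(**kwargs)


# Example tools (assumptions for illustration only).
gateway = Gateway(
    tools={
        "get_user": Tool("get_user", Tier.READ, lambda user_id: {"id": user_id}),
        "delete_user": Tool("delete_user", Tier.DESTRUCTIVE, lambda user_id: f"deleted {user_id}"),
    }
)
```

Here `gateway.call("agent-1", "get_user", user_id=7)` runs immediately, while `delete_user` raises `PermissionError` unless the call arrives with `approved=True`, which in practice would come from a human approval step rather than from the agent itself.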

Evaluation & observability

Agents that pass eyeball tests in a notebook usually fail in prod. Build the feedback loop early.

→ Eval harness design — synthetic conversations, regression suites, scoring rubrics
→ Tracing — every tool call, every token, every decision the agent makes
→ Production telemetry — cost per request, drop-off, intervention rate

Self-hosted infrastructure

For when you need data residency or cost ceilings, or just don't want to ship customer data to OpenAI.

→ Kubernetes (Talos, k3s) cluster design — auth, ingress, secrets, monitoring
→ GitOps workflows — Gitea, ArgoCD-style sync, agent-driven PR proposals
→ Self-hosted models via Ollama / vLLM, with OpenAI-compatible front doors

$ engagement_models --list

advisory — weekly architecture reviews, async help via Slack/Linear
hands-on — embedded with your team for a defined scope (typically 4–12 weeks)
turnkey — I build it, you run it; includes runbook + handover