$ khalisio.io
$ cat services.md

What I do

I focus on the parts of an AI product that aren't the model itself: the surrounding system that makes it reliable, observable, and economical. I work with teams either as a hands-on engineer shipping alongside you or as an advisor reviewing your architecture.

01

Agent system design

From single-purpose tool callers to multi-agent rooms with handoff and delegation.

  • Picking the right pattern — one agent + tools, planner+executor, multi-agent room, or ReAct loop
  • Designing tool calling that actually works — schemas, errors, retries, safety tiers
  • Memory and context strategy — sliding windows, summarization, RAG, structured memory

02

Model routing & cost / latency tuning

Stop overpaying. Get the cheapest model that hits your quality bar, and a fallback for when it doesn't.

  • Routing strategies across Claude, OpenAI, Ollama, and self-hosted models
  • Caching, prompt deduplication, and context-window economics
  • Latency budgets — streaming, parallel tool calls, speculative decoding where it helps
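
As a rough illustration of the "cheapest model that hits your quality bar, plus a fallback" idea, here is a minimal Python sketch. Everything in it is hypothetical: `Route`, `route`, and `good_enough` are illustrative names, the costs are relative numbers rather than real pricing, and the lambdas stand in for real model clients.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Route:
    name: str                    # hypothetical label for a model endpoint
    cost_per_call: float         # illustrative relative cost, not real pricing
    call: Callable[[str], str]   # stand-in for a real model client


def route(prompt: str, routes: Sequence[Route],
          good_enough: Callable[[str], bool]) -> Tuple[str, str]:
    """Try routes cheapest-first; return (route name, answer) for the
    first answer that passes the quality check."""
    last_error: Optional[Exception] = None
    for r in sorted(routes, key=lambda r: r.cost_per_call):
        try:
            answer = r.call(prompt)
        except Exception as exc:  # a failed call also falls through to the next route
            last_error = exc
            continue
        if good_enough(answer):
            return r.name, answer
    raise RuntimeError("no route met the quality bar") from last_error


# Example: the cheap route returns an empty answer, so the router falls back.
cheap = Route("cheap", 1.0, lambda p: "")
strong = Route("strong", 10.0, lambda p: f"answer to: {p}")
name, answer = route("hi", [strong, cheap], good_enough=lambda a: len(a) > 0)
```

In practice the quality check is the hard part; here it is reduced to a non-empty test purely to keep the sketch short.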

03

Tool calling & MCP integration

The part most agent projects get wrong. Schema design, execution sandboxing, and tier-based authorization.

  • MCP server design — what to expose, what to gate behind approval
  • Read / write / destructive tool tiers, audit logging, agent attribution
  • Tool result caching, parallel execution, error recovery patterns
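
To make the read / write / destructive tiers concrete, here is a minimal Python sketch of a gate that auto-runs low-tier tools, requires explicit approval for higher tiers, and records which agent asked. `Tier`, `ToolGate`, and the example tools are hypothetical names for illustration, not part of MCP or any specific library.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Any, Callable, Dict, List, Tuple


class Tier(IntEnum):
    READ = 0
    WRITE = 1
    DESTRUCTIVE = 2


@dataclass
class ToolGate:
    # Highest tier that may run without a human approving the call.
    max_auto_tier: Tier = Tier.READ
    tools: Dict[str, Tuple[Tier, Callable[..., Any]]] = field(default_factory=dict)
    audit_log: List[dict] = field(default_factory=list)

    def register(self, name: str, tier: Tier, fn: Callable[..., Any]) -> None:
        self.tools[name] = (tier, fn)

    def call(self, agent: str, name: str, approved: bool = False, **kwargs: Any) -> Any:
        tier, fn = self.tools[name]
        allowed = tier <= self.max_auto_tier or approved
        # Every attempt is logged with the calling agent, allowed or not.
        self.audit_log.append(
            {"agent": agent, "tool": name, "tier": tier.name, "allowed": allowed}
        )
        if not allowed:
            raise PermissionError(f"{name} ({tier.name}) needs approval")
        return fn(**kwargs)


# Example: a read-only tool runs without approval.
gate = ToolGate()
gate.register("list_files", Tier.READ, lambda: ["a.txt"])
gate.register("delete_file", Tier.DESTRUCTIVE, lambda path: f"deleted {path}")
listing = gate.call("agent-1", "list_files")
```

Calling `delete_file` without `approved=True` raises `PermissionError` here, and the denied attempt still lands in the audit log.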

04

Evaluation & observability

Agents that pass eyeball tests in a notebook usually fail in production. Build the feedback loop early.

  • Eval harness design — synthetic conversations, regression suites, scoring rubrics
  • Tracing — every tool call, every token, every decision the agent makes
  • Production telemetry — cost per request, drop-off, intervention rate
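
Two of the telemetry numbers above, cost per request and intervention rate, can be computed from very little data. The Python sketch below assumes a hypothetical `RequestRecord` shape and made-up per-1k-token prices; real pricing and record fields will differ.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class RequestRecord:
    input_tokens: int
    output_tokens: int
    human_intervened: bool   # True if a person had to step in


def summarize(records: Iterable[RequestRecord],
              usd_per_1k_input: float,
              usd_per_1k_output: float) -> Dict[str, float]:
    """Average cost per request and share of requests needing a human."""
    rows = list(records)
    if not rows:
        return {"requests": 0, "cost_per_request": 0.0, "intervention_rate": 0.0}
    total_cost = sum(
        r.input_tokens / 1000 * usd_per_1k_input
        + r.output_tokens / 1000 * usd_per_1k_output
        for r in rows
    )
    interventions = sum(1 for r in rows if r.human_intervened)
    return {
        "requests": len(rows),
        "cost_per_request": total_cost / len(rows),
        "intervention_rate": interventions / len(rows),
    }


# Example with illustrative prices (1.0 and 2.0 USD per 1k tokens).
stats = summarize(
    [RequestRecord(1000, 500, False), RequestRecord(3000, 500, True)],
    usd_per_1k_input=1.0,
    usd_per_1k_output=2.0,
)
```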

05

Self-hosted infrastructure

When you need data residency, cost ceilings, or just don't want to ship customer data to OpenAI.

  • Kubernetes (Talos, k3s) cluster design — auth, ingress, secrets, monitoring
  • GitOps workflows — Gitea, ArgoCD-style sync, agent-driven PR proposals
  • Self-hosted models via Ollama / vLLM, with OpenAI-compatible front doors

$ engagement_models --list
  • advisory — weekly architecture reviews, async help via Slack/Linear
  • hands-on — embedded with your team for a defined scope (typically 4–12 weeks)
  • turnkey — I build it, you run it; includes runbook + handover