AI where it creates measurable business value — not as an end in itself, not as pilot theatre.
You've been hearing about AI transformation for two years. You see impressive demos. And you suspect that most initiatives are stuck in the pilot stage, without anyone saying why. What you're looking for: someone who treats AI not as the next chatbot prototype but as an integral part of your systems, with clear boundaries, measurable benefit, and the question of who carries accountability settled in the right places.
Why AI projects get stuck in pilot limbo
The typical pattern: a department experiments with ChatGPT, builds an impressive prototype and presents it internally; everyone is excited. Three months later nothing is in production, because nobody clarified who is liable when the AI hallucinates, how token costs per request add up, what happens when the LLM provider swaps the model, and who writes eval tests against regressions.
The demo-to-production gap
An impressive prototype without an architecture plan lands in pilot limbo. Token costs, hallucination rate, model migration and eval decide whether something reaches production, not the wow factor of the demo.
Or, even more often: there are two parallel AI initiatives, one in marketing with a chatbot, one in IT with a RAG prototype. They use different LLMs, different vector databases, different embedding models. In six months you have two siloed solutions, three API invoices and no coherent AI strategy.
This has a shared root cause: AI gets treated as a tooling topic instead of an architecture topic. Demo speed gets rewarded, production readiness is rarely recognised, and eval pipelines are almost never sold as visible work. Multi-agent architectures sound modern, until you try to debug an agentic pipeline in which three LLMs have been talking to each other.
Multi-agent magic
Architectures with three chattering LLMs look futuristic, and they aren't debuggable when results drift. Conservative before magic is architectural discipline here, not a brake on innovation.
My approach: think of AI as an integrated layer in your systems, not as an isolated prototype. Modular, so LLM swaps don't trigger architectural breaks. Observable, so token costs per request, latency and hallucination rate are transparent. Conservative: RAG before fine-tuning, tool-use before multi-agent, deterministic wherever possible. And with clear human accountability where decisions have consequences.
My approach
Four principles I anchor AI architecture on, grown from 25 years of distributed systems and production use of current AI architectures:
01 Modular instead of monolithic. LLM, retrieval, tooling and eval layer as exchangeable building blocks. Models change every few months; an architecture that can't survive that becomes technical debt in 18 months. Built modular, the system stays load-bearing even when the model provider changes.
02 Observable instead of black box. Tracing over every inference, token costs per request, latency profile, eval pipeline against regressions. Running AI in production without this layer means flying blind through every incident and noticing model drift only when customers call.
03 Conservative before magic. RAG before fine-tuning, tool-use before multi-agent, deterministic before LLM wherever possible. A majority of 'AI problems' can be solved more cleanly with classical software. Recognising that is part of a consultant's responsibility, not a reluctance to sell.
04 Human in the accountability chain. AI takes the routine: research, pre-qualification, code reviews, documentation. Decisions with consequence (compliance approvals, financial commitments, medical assessments) stay with humans. Eval pipelines and guardrails are the technical implementation of that separation; they're not optional extras but a prerequisite for productive operation.
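To make the first two principles concrete, here is a minimal sketch of what "modular" and "observable" can look like in code. It is an illustration under assumptions, not a description of any specific client system: the names `ChatModel`, `Completion`, `EchoModel` and `TracedModel` are hypothetical, and `EchoModel` is a stand-in for a real provider client.

```python
from __future__ import annotations

import time
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Completion:
    text: str
    input_tokens: int
    output_tokens: int


class ChatModel(Protocol):
    """The only surface the rest of the system sees (hypothetical interface)."""

    name: str

    def complete(self, prompt: str) -> Completion: ...


@dataclass
class EchoModel:
    """Stand-in for a real provider client; it just upper-cases the prompt."""

    name: str = "echo-v1"

    def complete(self, prompt: str) -> Completion:
        words = prompt.split()
        # Word counts serve as a crude placeholder for real token counts.
        return Completion(prompt.upper(), len(words), len(words))


@dataclass
class TracedModel:
    """Wraps any ChatModel and records one trace entry per inference."""

    inner: ChatModel
    traces: list[dict] = field(default_factory=list)

    @property
    def name(self) -> str:
        return self.inner.name

    def complete(self, prompt: str) -> Completion:
        start = time.perf_counter()
        result = self.inner.complete(prompt)
        self.traces.append({
            "model": self.inner.name,
            "latency_s": time.perf_counter() - start,
            "input_tokens": result.input_tokens,
            "output_tokens": result.output_tokens,
        })
        return result


def answer(model: ChatModel, question: str) -> str:
    """Application code depends on the interface, not on a vendor SDK."""
    return model.complete(question).text


traced = TracedModel(EchoModel())
reply = answer(traced, "who approves this invoice")
```

Swapping the provider then means writing one new class that satisfies `ChatModel`; `answer` and the tracing wrapper stay untouched. In a real system the trace entries would go to a tracing backend instead of an in-memory list.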
What you actually get
AI strategy & use-case assessment
Honest assessment of your AI initiatives: where does AI demonstrably bring value, where is classical software the clean answer, where does a RAG system pay off over a chatbot. You get a prioritised list with expected value, effort and risk per use case — no blanket 'AI does everything'.
RAG pipelines in production
Document ingestion, chunking strategies, hybrid search (vector + keyword), re-ranking. Proven in production with Dify and Weaviate on self-hosted infrastructure; this website is the live demo. Hands-on experience with pgvector and Qdrant for existing environments too.
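For readers who want to see what "hybrid search" means mechanically: one common way to merge a vector ranking and a keyword ranking is reciprocal rank fusion (RRF). The sketch below is a generic illustration of that technique with made-up document IDs. It is not a claim about how Dify or Weaviate fuse results internally, and the constant `k=60` is simply the value commonly used in the literature.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) per list it appears in;
    documents found by both retrievers accumulate a higher score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken by ID so the order is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the two retrievers.
vector_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Documents that both retrievers agree on rise to the top, which is the practical reason hybrid search tends to be more robust than either method alone. A re-ranking step would then operate on the fused list.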
MCP servers & tool integration
Several MCP servers built and running in production, including one for Enterprise DataHub queries at Hoffmann Group. Claude, GPT-4 and local models connected to real tools in a controlled way, with boundaries, auth layer and audit trail.
AI agents in the workflow
Custom AI agents that take over development tasks autonomously and communicate with the team via the ticketing system — as a tool amplifier, not a replacement. Automated code reviews in GitLab, GitHub and Azure DevOps. The architect defines boundaries and quality gates.
Eval, cost control & guardrails
Eval pipelines against model drift, cost monitoring per request, rate limiting, input validation, prompt-injection protection. Fallback chains on model outage. All of this as an architectural layer, not bolted on after the first production incident.
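A fallback chain on model outage can be very small. The following is a minimal sketch, assuming each model is wrapped in a plain function and signals an outage with a dedicated exception; `ModelUnavailable`, `primary` and `secondary` are hypothetical names for illustration, and the outage is simulated.

```python
from __future__ import annotations

from typing import Callable, Sequence


class ModelUnavailable(Exception):
    """Raised by a model call when the provider is down or times out."""


def complete_with_fallback(
    prompt: str,
    chain: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, call) pair in order and return (name_used, answer)."""
    errors = []
    for name, call in chain:
        try:
            return name, call(prompt)
        except ModelUnavailable as exc:
            # Remember why this model was skipped, then try the next one.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all models in the chain failed: " + "; ".join(errors))


def primary(prompt: str) -> str:
    raise ModelUnavailable("simulated outage")


def secondary(prompt: str) -> str:
    return f"[secondary] {prompt}"


used, text = complete_with_fallback(
    "summarise ticket 42",
    [("primary", primary), ("secondary", secondary)],
)
```

Returning the name of the model that actually answered matters: it belongs in the trace, because a fallback model may have a different cost and quality profile than the primary one.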
From the trenches
Hoffmann Group — AI integration in Enterprise DataHubs
An MCP server made the DataHub backend directly accessible to AI tools for the first time. Developers now formulate queries in natural language instead of memorising GraphQL schemas. AI assistance in the playground noticeably accelerates onboarding of new use cases and consumers — without the data model or the existing GraphQL Federation knowing anything about it.
th3chris.com — RAG system as a live demo
This website isn't just a portfolio but is itself a RAG architecture running in production: Dify as orchestration, Weaviate as vector database, custom knowledge base, GPT-4 for answer generation with streaming. Self-hosted on a Kubernetes cluster with GitOps. Anyone who wants to know how a RAG pipeline in production feels can try the chat at the bottom right.
AI agents in the dev workflow — human-machine integration
Custom AI agents take over development tasks autonomously and communicate with human team members via GitLab tickets. In addition, automated code reviews in GitLab, GitHub and Azure DevOps. Bridge technology between classical microservice landscapes and LLM-backed tools: the architect defines tasks, the AI executes, a human reviews.
"... outstanding mind with excellent skills in development; great software architect. Highly recommended if you need to find a professional fast and scalable solution. We've been working together on a project that was rated by Microsoft professionals as 'not possible'. Together with Christian our team managed to deliver a great working solution/product!"
Oscar Angress — Cyber Security Consultant, Bosch Engineering GmbH
Questions you might be asking
When is AI NOT the answer?
When determinism counts (accounting, compliance, billing): rule-based systems beat LLMs in every audit. When the corpus fits in the context window: putting it straight into the prompt is simpler and easier to verify than RAG. When the output isn't checkable: LLM answers without an eval pipeline are a compliance risk, not a feature. When consequences are irreversible (medical, financial, safety-critical decisions): the human belongs in the loop.
Self-hosted or cloud LLM?
That's decided by the use case, not by vendor preference. Cloud LLMs (OpenAI, Anthropic) are often the faster answer and for many workloads the more economical one. Self-hosted (Llama, Mistral, your own models) pays off when you have data sovereignty requirements, operate in a regulated industry, or run high volumes that need cost control. Often the answer is hybrid: sensitive paths self-hosted, the rest in the cloud.
How do you measure AI quality?
Eval pipelines with defined test sets per use case. Hallucination rate, factual consistency, token cost per request, latency profile. For RAG additionally: retrieval precision (are the right documents found?) and answer faithfulness (is the answer actually supported by the retrieved documents?). Without this layer 'AI quality' is a claim, not a measurement.
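As an illustration of the retrieval side, here is a minimal sketch of precision at k over a tiny test set, plus a simple regression gate. The test cases, the document IDs and the threshold of 0.4 are invented for the example; a real test set is curated per use case, and answer faithfulness needs a separate check (for instance a human or model-based judge) that is not shown here.

```python
def precision_at_k(retrieved, relevant, k):
    """Share of the top-k retrieved document IDs that are actually relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in relevant) / len(top)


def mean_precision(test_set, k):
    """Average precision@k over all cases in the test set."""
    scores = [precision_at_k(c["retrieved"], c["relevant"], k) for c in test_set]
    return sum(scores) / len(scores)


# Hypothetical test set: what the retriever returned vs. what it should find.
test_set = [
    {"retrieved": ["a", "b", "c"], "relevant": {"a", "c"}},
    {"retrieved": ["x", "y", "z"], "relevant": {"y"}},
]

score = mean_precision(test_set, k=3)

# Regression gate: fail the pipeline if quality drops below an agreed floor.
MIN_PRECISION = 0.4  # assumed threshold, to be set per use case
passed = score >= MIN_PRECISION
```

Run against every model or prompt change, a gate like this turns "the answers feel worse" into a number that either clears the floor or blocks the release.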
What does AI cost in production?
Token costs plus infrastructure (vector DB, compute, monitoring). At low volumes token costs dominate, at high volumes infrastructure does. Concretely: a production RAG system with ~10k requests/month typically sits in the low three- to four-digit euro range per month, depending on model choice, chunk size and whether caching is used. Cost monitoring per request belongs in the architecture from the start, otherwise there are surprises.
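The arithmetic behind such an estimate is simple enough to write down. The sketch below is a back-of-the-envelope model only: every number in it (token counts per request, prices per million tokens, the fixed infrastructure figure) is a placeholder assumption, not a quote from any provider, and should be replaced with your own values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostAssumptions:
    requests_per_month: int
    input_tokens_per_request: int   # prompt plus retrieved chunks
    output_tokens_per_request: int
    eur_per_1m_input_tokens: float
    eur_per_1m_output_tokens: float
    fixed_infra_eur_per_month: float  # vector DB, compute, monitoring


def monthly_cost(a: CostAssumptions) -> dict:
    """Split the monthly bill into token cost and fixed infrastructure."""
    if a.requests_per_month <= 0:
        raise ValueError("requests_per_month must be positive")
    token_eur_per_request = (
        a.input_tokens_per_request * a.eur_per_1m_input_tokens
        + a.output_tokens_per_request * a.eur_per_1m_output_tokens
    ) / 1_000_000
    token_eur = token_eur_per_request * a.requests_per_month
    total_eur = token_eur + a.fixed_infra_eur_per_month
    return {
        "token_eur": token_eur,
        "infra_eur": a.fixed_infra_eur_per_month,
        "total_eur": total_eur,
        "eur_per_request": total_eur / a.requests_per_month,
    }


# Placeholder figures for illustration only.
estimate = monthly_cost(CostAssumptions(
    requests_per_month=10_000,
    input_tokens_per_request=4_000,
    output_tokens_per_request=500,
    eur_per_1m_input_tokens=5.0,
    eur_per_1m_output_tokens=15.0,
    fixed_infra_eur_per_month=300.0,
))
```

With these placeholder values the model yields roughly 275 EUR in tokens plus 300 EUR in infrastructure per month. Chunk size shows up directly in `input_tokens_per_request`, which is why it is one of the main levers alongside model choice and caching.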
How do you handle AI compliance?
Data flow diagrams per use case (which data leaves your system, where to, under which contract), logging of all inferences for audit trails, input validation against prompt injection, clear responsibility assignment for who reviews what. With GDPR-sensitive data often: self-hosting or cloud provider with DPA and EU region. AI compliance isn't a retrofitted patch but an architectural property.