The Problem
Why standard chatbots and RAG pipelines fail for complex business workflows.
Rigid State Machines
Traditional chatbots use hardcoded finite state machines (FSMs) that break on multi-intent messages. When a user says "Wroclaw, women, laser hair removal" in a single message, an FSM can only consume one slot per turn, forcing the user through three separate prompts.
RAG Hallucinations
Standard Retrieval-Augmented Generation retrieves context but still lets the LLM fabricate prices, services, and availability that don't exist in the database.
Context Loss
Stateless chatbots forget user preferences between messages. A returning customer has to re-specify their studio, gender, and preferred device every session.
No Observability
When something goes wrong, there's no trace of which tool was called, what the LLM decided, or why the user got a wrong answer. Debugging is guesswork.
The Solution: Brain-First Architecture
A single fine-tuned LLM reads a comprehensive system prompt and decides which tools to call — no hardcoded state machines, no rigid flows.
User Message
Incoming message from any channel (Web, Telegram, Instagram, Facebook)
Load Session
Restore conversation context, preferences, cart, booking history from PostgreSQL
Build Context
Construct dynamic system prompt with current studio, gender, cart, pending questions
Brain LLM
Fine-tuned GPT-4.1 Mini decides which tools to call based on user intent
Tool Execution
Up to 5 sequential tool calls with context sync between each step via prepareStep
Response
Natural language answer grounded in database facts, with UI selectors when needed
Key Innovation: prepareStep
Before each LLM step, prepareStep rebuilds the system prompt with real-time state from tool executions. When the user says "Wroclaw, women, laser" — the LLM processes it in 4 sequential steps, with each step seeing the updated context from the previous tool call.
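The prepareStep idea can be sketched in a few lines: before every LLM step, rebuild the system prompt from whatever state the previous step's tools mutated. The names here (SessionState, buildSystemPrompt) are illustrative, not the production API.

```typescript
// Minimal sketch of the prepareStep pattern: the system prompt is a pure
// function of session state, so each step sees the previous step's results.
interface SessionState {
  studio?: string;
  gender?: string;
  cart: string[];
}

function buildSystemPrompt(state: SessionState): string {
  const parts = ["You are a booking assistant. Answer only from tool results."];
  if (state.studio) parts.push(`Current studio: ${state.studio}.`);
  if (state.gender) parts.push(`Client gender: ${state.gender}.`);
  if (state.cart.length > 0) parts.push(`Cart: ${state.cart.join(", ")}.`);
  return parts.join(" ");
}

// Called before every LLM step, mirroring the AI SDK's prepareStep hook:
function prepareStep(state: SessionState): { system: string } {
  return { system: buildSystemPrompt(state) };
}

// Step 1's tool call set the studio; step 2's prompt already reflects it.
const state: SessionState = { cart: [] };
state.studio = "Wroclaw";
console.log(prepareStep(state).system);
```

Because the prompt is derived from state rather than appended to, no step can drift out of sync with what the tools actually did.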
The Tech Stack
Production-grade infrastructure for a serverless AI agent at scale.
AI / LLM
Backend
Channels
Infrastructure
21 Orchestrated Tools
The Brain LLM has access to 21 specialized tools organized into 5 categories. Each tool validates its own context and returns typed results.
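The contract each tool follows can be sketched as below. The Tool interface and findServices tool are hypothetical stand-ins (the production code defines tools through the Vercel AI SDK), but the shape is the same: typed input, self-contained context validation, typed result.

```typescript
// Illustrative shape of one tool in the registry: it validates its own
// context before doing any work and returns a typed result either way.
interface ToolResult<T> {
  ok: boolean;
  data?: T;
  error?: string;
}

interface Tool<I, O> {
  name: string;
  execute(input: I, ctx: { studio?: string }): ToolResult<O>;
}

const findServices: Tool<{ query: string }, string[]> = {
  name: "findServices",
  execute(input, ctx) {
    // Each tool checks its required context instead of trusting the LLM.
    if (!ctx.studio) return { ok: false, error: "studio not selected" };
    const catalog = ["laser hair removal", "laser facial"]; // stand-in for a DB query
    return { ok: true, data: catalog.filter((s) => s.includes(input.query)) };
  },
};
```

Returning a structured error instead of throwing lets the Brain LLM read the failure and ask the user for the missing context.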
Near-Zero Hallucination Strategy
Six layers of protection ensure the agent never fabricates information.
Database-Only Policy
System prompt explicitly forbids inventing prices, services, studios, or availability. All factual queries must go through tools.
Tool-Level Validation
Each tool validates required context. addToCart checks serviceId against lastShownServices. getBookingLink validates time slots against lastAvailabilityTimes.
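The addToCart check can be sketched as follows, assuming a simplified session shape: the tool only accepts a serviceId that was actually shown to the user earlier in the session, so the LLM cannot add an invented service.

```typescript
// Sketch of tool-level validation: serviceId must come from the list of
// services the agent actually displayed (lastShownServices).
interface Session {
  lastShownServices: { id: string; name: string }[];
  cart: string[];
}

function addToCart(session: Session, serviceId: string): { ok: boolean; error?: string } {
  const known = session.lastShownServices.some((s) => s.id === serviceId);
  if (!known) {
    // Anything outside lastShownServices is rejected, hallucinated ids included.
    return { ok: false, error: `unknown serviceId: ${serviceId}` };
  }
  session.cart.push(serviceId);
  return { ok: true };
}
```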
Two-Stage Vector Search
Context-aware search with studio/gender/country filters first. Global fallback with lower threshold (0.25) only if contextual search returns nothing.
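The two-stage flow can be sketched like this. The searchVectors function stands in for the real embedding query, and the stage-one threshold of 0.5 is an assumption; only the 0.25 fallback threshold comes from the source.

```typescript
// Sketch of two-stage retrieval: strict filtered pass first, global
// fallback with a looser threshold only when the first pass is empty.
interface Doc {
  text: string;
  score: number;
  studio?: string;
  gender?: string;
}

function searchVectors(
  docs: Doc[],
  minScore: number,
  filter?: { studio: string; gender: string },
): Doc[] {
  return docs.filter(
    (d) =>
      d.score >= minScore &&
      (!filter || (d.studio === filter.studio && d.gender === filter.gender)),
  );
}

function twoStageSearch(docs: Doc[], ctx: { studio: string; gender: string }): Doc[] {
  const contextual = searchVectors(docs, 0.5, ctx); // stage 1: filtered, strict (assumed threshold)
  if (contextual.length > 0) return contextual;
  return searchVectors(docs, 0.25); // stage 2: global fallback, lower threshold
}
```

Running the strict pass first means a contextual hit always wins; the looser global pass can never shadow studio-specific content.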
FAQ Direct Answer
High-similarity FAQ matches (>= 0.85) return the stored answer directly without LLM rewriting, eliminating any chance of distortion.
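The short-circuit is simple enough to sketch directly; the FaqMatch shape is illustrative:

```typescript
// Sketch of the FAQ short-circuit: a match at or above 0.85 similarity
// returns the stored answer verbatim, bypassing the LLM entirely.
interface FaqMatch {
  question: string;
  answer: string;
  similarity: number;
}

function answerFromFaq(matches: FaqMatch[]): string | null {
  const best = [...matches].sort((a, b) => b.similarity - a.similarity)[0];
  if (best && best.similarity >= 0.85) return best.answer; // verbatim, no rewriting
  return null; // fall through to the normal LLM path
}
```

Skipping the LLM on high-confidence matches is also the cheapest path: no tokens are generated at all.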
Injection Detection
Pre-LLM guard layer with pattern matching for prompt injection attempts, 500-char message limit, and DB-based rate limiting for serverless.
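A minimal version of that guard might look like this. The patterns here are illustrative examples, and the DB-backed rate limiter is omitted; only the 500-character cap is taken from the source.

```typescript
// Sketch of the pre-LLM guard: cheap pattern checks plus a hard length
// cap, run before the message ever reaches the model.
const INJECTION_PATTERNS = [
  /ignore (all )?previous instructions/i,
  /you are now/i,
  /system prompt/i,
];

function guardMessage(text: string): { allowed: boolean; reason?: string } {
  if (text.length > 500) return { allowed: false, reason: "message too long" };
  for (const p of INJECTION_PATTERNS) {
    if (p.test(text)) return { allowed: false, reason: "possible prompt injection" };
  }
  return { allowed: true };
}
```

Running this before the LLM call means a blocked message costs zero tokens and never touches the tool layer.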
Incident Snapshots
On 3+ consecutive failures, user dislikes, or complaints — a SupportIncident is created with full conversation snapshot, independent from conversation lifecycle.
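The trigger logic can be sketched as below; the Message and SupportIncident shapes are illustrative, and the complaint path is folded into the dislike flag for brevity.

```typescript
// Sketch of the incident trigger: 3+ consecutive failures or a dislike
// snapshots the whole conversation into an independent record.
interface Message {
  role: "user" | "assistant";
  text: string;
}
interface SupportIncident {
  reason: string;
  snapshot: Message[];
  createdAt: Date;
}

function maybeCreateIncident(
  consecutiveFailures: number,
  disliked: boolean,
  history: Message[],
): SupportIncident | null {
  if (consecutiveFailures >= 3 || disliked) {
    return {
      reason: disliked ? "user_dislike" : "consecutive_failures",
      // Copy the messages so later conversation changes cannot mutate the snapshot.
      snapshot: history.map((m) => ({ ...m })),
      createdAt: new Date(),
    };
  }
  return null;
}
```

Copying rather than referencing the history is what makes the incident independent of the conversation lifecycle: the conversation can be reset or handed off without losing the evidence.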
Key Architectural Decisions
The reasoning behind critical design choices.
Brain-First over FSM
Single LLM agent decides all flows, enabling flexibility for multi-intent requests and natural conversations instead of rigid state transitions.
Native Multi-Step (Vercel AI SDK)
Uses generateText with stopWhen + prepareStep instead of custom loop. prepareStep rebuilds the system prompt before each LLM step with real-time state changes.
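The semantics of that loop can be simulated without the SDK. The fake model below stands in for a real LLM call; the step cap plays the role of a stopWhen condition like stepCountIs(5).

```typescript
// Minimal simulation of the generateText step loop: rebuild the prompt
// before each step (prepareStep), stop at the cap (stopWhen) or when the
// model answers without calling a tool.
type Step = { system: string; calledTool: boolean };

function runSteps(
  maxSteps: number,
  prepareStep: (stepIndex: number) => string,
  model: (system: string, stepIndex: number) => { calledTool: boolean },
): Step[] {
  const steps: Step[] = [];
  for (let i = 0; i < maxSteps; i++) {           // stopWhen: step count reached the cap
    const system = prepareStep(i);               // rebuild the prompt with fresh state
    const out = model(system, i);
    steps.push({ system, calledTool: out.calledTool });
    if (!out.calledTool) break;                  // model produced a final answer
  }
  return steps;
}
```

Using the SDK's native loop instead of a hand-rolled one like this keeps streaming, usage accounting, and tracing consistent across steps.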
Session vs Conversation split
Session is temporary (24h TTL) for cart/booking state. Conversation is permanent for preferences, studio, gender, and booking history that persists across sessions.
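The split can be sketched with two shapes and a TTL check. In production both live in PostgreSQL via Prisma; the names and fields here are illustrative.

```typescript
// Sketch of the Session/Conversation split: an expired Session is simply
// discarded on load, while the Conversation record is never reset.
interface Session {
  cart: string[];
  updatedAt: Date;
} // ephemeral: cart and in-flight booking state
interface Conversation {
  studio?: string;
  gender?: string;
  bookings: string[];
} // permanent: preferences and booking history

const SESSION_TTL_MS = 24 * 60 * 60 * 1000; // 24h

function loadSession(existing: Session | null, now: Date): Session {
  if (!existing || now.getTime() - existing.updatedAt.getTime() > SESSION_TTL_MS) {
    return { cart: [], updatedAt: now }; // expired or absent: start fresh
  }
  return existing;
}
```

A returning customer therefore gets an empty cart after a day but keeps their studio, gender, and booking history from the Conversation record.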
No LangChain / LlamaIndex
Direct Vercel AI SDK + Prisma for simpler, more transparent architecture. Full control over prompt construction, tool execution, and state management.
Fine-tuned model over prompting
Custom training data pipeline (merge, validate, audit, quality-check scripts) for consistent tool calling behavior and tone across 6 languages.
Permanent booking history
Never reset, even after handoff or session restart. Enables repeat booking, personalization, and analytics across the entire customer lifecycle.
The Outcome
$0.015 per Conversation
Fine-tuned GPT-4.1 Mini keeps the cost per dialog roughly 100x lower than a human operator's
~4s Average Response
Even with 2-5 sequential tool calls, prompt caching keeps the average reply around four seconds
Near-Zero Hallucinations
Brain-first architecture with tool-level validation ensures every fact comes from the database
18 Studios, 5 Countries
Live in production across the entire European chain with full localization in 6 languages
4 Channels, 1 Context
Website, Telegram, Instagram, Facebook — unified conversation and booking history
Complete Observability
Langfuse traces every tool call, token usage, and cost per conversation in real-time
Interested in building something similar?