Rush Commerce
Field Notes · 4 min read

We gave our phone system a brain: building an AI voice receptionist with LiveKit + Twilio

The build log for the AI receptionist that answers every call across three retail stores — architecture, latency lessons, and the gotchas that ate our weekend.

Retail phones are a lose-lose. During a rush, staff choose between the customer in front of them and the one on the line. After hours, calls go to a voicemail nobody checks. So we built an AI receptionist that answers every call across three store locations — and it's been on the phones ever since. This is the build log. The case study version lives here.

The architecture

Three pieces:

  1. Twilio owns the phone numbers and SIP trunking — calls hit Twilio first
  2. LiveKit bridges telephony into a realtime session and runs the agent framework
  3. A realtime speech model does the actual conversation — speech in, speech out, no text-in-the-middle pipeline

The agent knows each location's hours, address, parking quirks, and the twenty questions that make up most call volume. Anything outside its lane — a complaint, a negotiation, anything with heat — it transfers to a human with context.

The routing trick that simplified everything: a catch-all dispatch rule. Every inbound number lands on the same agent, which adapts by store. One agent to maintain, three stores covered, new location = new number pointed at the same place.
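A minimal sketch of the "one agent, adapts by store" idea, assuming the agent can read the dialed number when a call arrives. The phone numbers, store details, and the names `StoreProfile`, `STORES`, and `instructions_for` are all hypothetical, and nothing here is a LiveKit or Twilio API call; it only shows the lookup that a catch-all rule makes possible.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StoreProfile:
    name: str
    hours: str
    address: str
    parking: str


# Hypothetical numbers and store details, for illustration only.
STORES = {
    "+15550100001": StoreProfile(
        "Downtown", "10am-8pm daily", "100 Main St", "Garage behind the building"
    ),
    "+15550100002": StoreProfile(
        "Northside", "9am-9pm daily", "200 Oak Ave", "Free lot out front"
    ),
    "+15550100003": StoreProfile(
        "Mall", "11am-7pm daily", "300 Center Blvd", "Park near the east entrance"
    ),
}

BASE_PROMPT = (
    "You are the phone receptionist for the {name} store. "
    "Hours: {hours}. Address: {address}. Parking: {parking}."
)


def instructions_for(dialed_number: str) -> Optional[str]:
    """Return store-specific instructions, or None for an unknown number."""
    store = STORES.get(dialed_number)
    if store is None:
        # Unknown number: the caller of this function decides the fallback,
        # for example a generic greeting or a transfer to a human.
        return None
    return BASE_PROMPT.format(
        name=store.name,
        hours=store.hours,
        address=store.address,
        parking=store.parking,
    )
```

Under this sketch, adding a location means adding one entry to `STORES`; the agent code stays the same.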

Lesson 1: latency is the product

Nothing else matters if the turn-taking feels wrong. Humans notice pauses above roughly 500ms; at a full second, callers start saying "hello?" — and once that happens twice, trust is gone regardless of how smart the answers are.
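One way to keep those thresholds honest is to treat the gap as a budget and check it. The sketch below assumes you can measure, per turn, the delays that add up to the silence the caller hears. The stage names and the millisecond values are placeholders for illustration, not measurements from this deployment.

```python
NOTICEABLE_MS = 500   # roughly where callers start to notice the pause
HELLO_MS = 1000       # roughly where callers start saying "hello?"

# Placeholder stage delays for a single turn, in milliseconds.
EXAMPLE_STAGES = {
    "end_of_turn_wait": 300,
    "model_first_audio": 400,
    "network_and_playout": 100,
}


def response_gap_ms(stage_delays_ms: dict) -> int:
    """Total silence the caller hears: the stage delays simply add up."""
    return sum(stage_delays_ms.values())


def classify_gap(gap_ms: int) -> str:
    """Bucket a response gap using the two thresholds above."""
    if gap_ms <= NOTICEABLE_MS:
        return "natural"
    if gap_ms < HELLO_MS:
        return "noticeable"
    return "caller says hello"


if __name__ == "__main__":
    gap = response_gap_ms(EXAMPLE_STAGES)
    print(gap, classify_gap(gap))
```

Because the delays are additive, shaving any single stage helps, and adding a stage always costs.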

What moved the needle:

  • Realtime speech-to-speech instead of STT → LLM → TTS. The classic pipeline stacks three latencies and loses prosody. Speech-native models cut both.
  • On-device voice activity detection and end-of-turn detection, tuned for retail calls — background music, register noise, two people talking near the phone. Stock thresholds interrupted people mid-sentence; retail callers pause mid-thought ("do you have it in… uh, a size 11").
  • Ruthless prompt budget. Every instruction token adds to time-to-first-word. The persona prompt earns its length or gets cut.
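To make the turn-detection point concrete, here is a deliberately simple silence-timer sketch. It is not the detector used in this build and not LiveKit's turn detection; it assumes some VAD has already labeled each audio frame as speech or not, and the frame size and thresholds are hypothetical. It shows why a short silence threshold cuts off a caller who pauses mid-thought.

```python
from typing import Optional, Sequence


def end_of_turn_index(
    speech_flags: Sequence[bool],
    frame_ms: int = 20,
    min_silence_ms: int = 500,
) -> Optional[int]:
    """Return the frame index where the turn is declared over, or None.

    The turn ends once the caller has spoken at least once and then stayed
    silent for min_silence_ms of consecutive frames.
    """
    heard_speech = False
    silence_ms = 0
    for i, is_speech in enumerate(speech_flags):
        if is_speech:
            heard_speech = True
            silence_ms = 0  # any speech resets the silence timer
        elif heard_speech:
            silence_ms += frame_ms
            if silence_ms >= min_silence_ms:
                return i
    return None


# "do you have it in... uh, a size 11": 200 ms of speech, a 400 ms pause,
# 200 ms more speech, then 1 s of silence (20 ms frames).
UTTERANCE = [True] * 10 + [False] * 20 + [True] * 10 + [False] * 50

if __name__ == "__main__":
    # A 300 ms threshold fires inside the mid-sentence pause.
    print(end_of_turn_index(UTTERANCE, min_silence_ms=300))
    # A 700 ms threshold waits for the real end of the sentence.
    print(end_of_turn_index(UTTERANCE, min_silence_ms=700))
```

The trade-off is visible in the numbers: a longer threshold stops the interruptions but adds directly to the response gap, which is why the tuning matters.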

Lesson 2: the boring failure modes are the real ones

The model was never the problem. The problems were:

  • Managed-number plumbing. Numbers provisioned through certain messaging services behave differently than raw voice numbers when you attach SIP trunks. That mismatch ate a weekend. Check how the number was provisioned before wiring anything.
  • Knowing when to shut up. Early versions answered everything enthusiastically, including things they shouldn't ("can you hold five pairs for my cousin?"). The fix wasn't more intelligence — it was a tighter lane and a graceful, fast transfer.
  • The persona test. Read your agent's greeting out loud. If it sounds like a phone tree wearing a costume, rewrite it. Ours is short, warm, and store-specific — callers regularly don't clock it as AI until it tells them.
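The "tighter lane" can be sketched as an allowlist: answer what is on the list, transfer everything else with a note for the human. This assumes an upstream step has already labeled the caller's request with an intent. The intent names, `IN_LANE`, `Decision`, and `decide` are hypothetical and not taken from the production agent.

```python
from dataclasses import dataclass

# Hypothetical intent labels the agent is allowed to answer on its own.
IN_LANE = {"hours", "directions", "parking", "stock_check", "hold_policy"}


@dataclass(frozen=True)
class Decision:
    action: str             # "answer" or "transfer"
    handoff_note: str = ""  # context passed to the human on transfer


def decide(intent: str, caller_summary: str) -> Decision:
    """Answer in-lane intents; transfer anything else with context."""
    if intent in IN_LANE:
        return Decision("answer")
    note = f"Caller asked about '{intent}': {caller_summary}"
    return Decision("transfer", handoff_note=note)
```

Under this sketch, explaining the hold policy stays in lane, while a request like "hold five pairs for my cousin" would be labeled as something outside the list and handed off with the summary attached.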

Lesson 3: scope it like an employee, not a feature

The unlock was writing the agent's job description before its prompt: what would we tell a new hire answering phones on day one? Hours, directions, stock checks, hold policy, when to grab a manager. That document became the system prompt almost verbatim — and gave us the eval checklist for free.
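A rough sketch of how one job description can feed both the prompt and the eval checklist. The duties below paraphrase the list above; the wording of each rule, and the names `JOB_DESCRIPTION`, `system_prompt`, and `eval_checklist`, are assumptions for illustration rather than the actual document.

```python
# Each duty is (title, what a new hire would be told). Wording is illustrative.
JOB_DESCRIPTION = [
    ("Store hours", "Give today's hours for the store the caller dialed."),
    ("Directions", "Give the address and the parking quirks."),
    ("Stock checks", "Say whether an item is available, or offer a transfer."),
    ("Hold policy", "Explain the hold policy; do not make exceptions."),
    ("Escalation", "Transfer complaints and negotiations to a manager."),
]


def system_prompt(duties) -> str:
    """Render the job description as a system prompt, one duty per line."""
    lines = ["You answer the phones for a retail store. Your duties:"]
    lines += [f"- {title}: {rule}" for title, rule in duties]
    return "\n".join(lines)


def eval_checklist(duties) -> list:
    """Turn each duty into one pass/fail question for reviewing calls."""
    return [f"{title}: did the agent do this? ({rule})" for title, rule in duties]


if __name__ == "__main__":
    print(system_prompt(JOB_DESCRIPTION))
    for item in eval_checklist(JOB_DESCRIPTION):
        print(item)
```

Keeping a single source means a change to the job description updates the prompt and the checklist together.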

What it costs to run

In order-of-magnitude terms, realtime model minutes plus telephony come to cents per call. A human receptionist answering the same volume across three locations would be a five-figure line item. The ROI conversation is short. The 7-automations post has the missed-call math.
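The per-call arithmetic is simple enough to sketch. Every number below is a placeholder, not a real price from any provider and not a figure from this deployment; plug in your own per-minute rates and call volume.

```python
def cost_per_call(
    minutes: float,
    model_rate_per_min: float,
    telephony_rate_per_min: float,
) -> float:
    """Per-call cost: call length times the combined per-minute rate."""
    return round(minutes * (model_rate_per_min + telephony_rate_per_min), 2)


def monthly_cost(calls_per_month: int, per_call: float) -> float:
    """Monthly spend at a given call volume."""
    return round(calls_per_month * per_call, 2)


if __name__ == "__main__":
    # Placeholder inputs: a 2-minute call, 0.10 per model minute,
    # 0.01 per telephony minute, 3000 calls a month.
    per_call = cost_per_call(2, 0.10, 0.01)
    print(per_call, monthly_cost(3000, per_call))
```

With placeholders in that range, the per-call figure lands in cents, which is the comparison the ROI argument rests on.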

What's next

A public demo line you can call from this site is on the roadmap. Until then, the write-up is the tour — and the phones at all three stores are the production deployment.

Key takeaways

  • Twilio (telephony) + LiveKit (realtime agent) + speech-to-speech model = an AI receptionist in production
  • Latency is the product: realtime models, tuned turn detection, and short prompts beat raw intelligence
  • The failures are boring — number provisioning, scope creep, robotic personas — plan for them
  • Write the agent's job description like a new hire's; it becomes the prompt and the eval set

Missing calls at your business? We'll scope a voice agent for your operation — honestly, including whether you actually need one. Automations & AI →

  • #voice-ai
  • #livekit
  • #twilio
  • #build-log

Tommy Rush — Founder, Rush Commerce

Operator turned builder. Runs a three-store retail operation and ships the software it runs on.
