Production-Grade Telephony: SIP, Failover, and Reliability

Running AI voice agents in production means your system lives and dies by telephony reliability. When a caller dials in and gets silence or a dropped connection, no amount of clever NLU matters. After operating SIP-based voice agents handling thousands of concurrent calls, here is what I have learned about building infrastructure that holds up.

SIP Fundamentals for Voice AI

Session Initiation Protocol (SIP) is the backbone of modern telephony. An inbound call arrives as an INVITE; your system typically answers with provisional responses such as 100 Trying and 180 Ringing, then a 200 OK; the caller ACKs; and media begins flowing over RTP. The critical detail most teams miss: the SIP signaling path and the RTP media path are separate. Your SIP proxy might be healthy while your media server silently drops packets.
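To make the signaling/media split concrete, here is a minimal sketch that reads the SDP body carried in an INVITE or 200 OK and extracts where RTP audio is actually expected. The function name, the example addresses (from documentation ranges), and the SDP itself are illustrative assumptions, not taken from any particular deployment.

```python
from typing import NamedTuple


class MediaTarget(NamedTuple):
    ip: str
    port: int


def media_target_from_sdp(sdp: str) -> MediaTarget:
    """Return the IP and port where the peer expects to receive RTP audio."""
    session_ip = None   # session-level c= line, applies unless a media section overrides it
    audio_ip = None     # c= line inside the first audio section, if any
    audio_port = None
    section = "session"
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("m="):
            # m=<media> <port> <proto> <fmt> ...
            fields = line[2:].split()
            if fields[0] == "audio" and audio_port is None:
                section = "audio"
                audio_port = int(fields[1])
            else:
                section = "other"
        elif line.startswith("c="):
            # c=<nettype> <addrtype> <connection-address>
            address = line[2:].split()[2]
            if section == "session":
                session_ip = address
            elif section == "audio":
                audio_ip = address
    ip = audio_ip or session_ip
    if ip is None or audio_port is None:
        raise ValueError("SDP has no usable audio media description")
    return MediaTarget(ip, audio_port)


# Hypothetical example: signaling arrives from the proxy at 203.0.113.10,
# but the SDP tells us to send audio to a different host entirely.
SIGNALING_IP = "203.0.113.10"
EXAMPLE_SDP = """v=0
o=- 20518 0 IN IP4 203.0.113.10
s=-
c=IN IP4 203.0.113.77
t=0 0
m=audio 49170 RTP/AVP 0 8
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
"""

target = media_target_from_sdp(EXAMPLE_SDP)
media_is_separate = target.ip != SIGNALING_IP
```

In this example the health of 203.0.113.10 says nothing about whether audio reaches 203.0.113.77, which is why the two paths need separate checks.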

In a typical architecture, SIP trunks from a carrier terminate at a Session Border Controller (SBC), which routes to your application server. The application server bridges caller audio to your AI pipeline (speech recognition, language model, TTS) and streams synthesized speech back. Every hop is a potential failure point.

Designing for Failover

Single points of failure in telephony are unacceptable. Here is the pattern that works in production:

Carrier-level redundancy. Configure at least two SIP trunk providers with weighted routing. Your SBC should detect trunk failures via SIP OPTIONS pings every 10 seconds, marking a trunk as down after 3 missed responses and rerouting automatically.
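The detection rule above can be sketched as a small state tracker. This is an illustration only: the class names, trunk names, and weights are assumptions, and a real SBC would implement this in its own configuration rather than in application code. Something else is expected to send the OPTIONS pings on the 10-second schedule and report each result.

```python
import random
from dataclasses import dataclass


@dataclass
class Trunk:
    name: str
    weight: int            # relative share of traffic while the trunk is up
    missed_pings: int = 0  # consecutive OPTIONS pings with no response


class TrunkRouter:
    """Tracks OPTIONS ping results and picks a healthy trunk by weight."""

    def __init__(self, trunks, max_missed=3):
        self.trunks = {t.name: t for t in trunks}
        self.max_missed = max_missed

    def record_ping(self, name, answered):
        # Any answer resets the counter; a miss increments it.
        trunk = self.trunks[name]
        trunk.missed_pings = 0 if answered else trunk.missed_pings + 1

    def is_up(self, name):
        return self.trunks[name].missed_pings < self.max_missed

    def pick(self, rng=random):
        # Weighted random choice among trunks currently considered up.
        candidates = [t for t in self.trunks.values() if self.is_up(t.name)]
        if not candidates:
            raise RuntimeError("no healthy SIP trunk available")
        weights = [t.weight for t in candidates]
        return rng.choices(candidates, weights=weights, k=1)[0]


# Hypothetical 80/20 split between two carriers.
router = TrunkRouter([Trunk("carrier_a", 80), Trunk("carrier_b", 20)])
```

With a 10-second ping interval and a threshold of 3, a dead trunk is taken out of rotation roughly 30 seconds after it stops responding, and a single answered ping brings it back.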

SBC clustering. Deploy SBCs in an active-active pair behind a DNS SRV record. For cloud deployments, a network load balancer provides faster failover than DNS-based approaches, because DNS failover is only as fast as record TTLs and client-side caching allow.

Application server pools. Run voice application servers as a stateless pool. Each instance should expose a health endpoint that checks not just process liveness but whether the instance can actually process a call right now, factoring in CPU, memory, and downstream dependencies.
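As a rough sketch of what such a health check might compute, the function below turns a snapshot of instance state into an HTTP status and body. The field names, dependency names, and the 80% and 85% thresholds are assumptions chosen for illustration; collecting the snapshot and serving it over HTTP are left out.

```python
from dataclasses import dataclass


@dataclass
class InstanceSnapshot:
    cpu_percent: float
    memory_percent: float
    active_calls: int
    max_calls: int
    dependencies: dict  # dependency name -> reachable right now (bool)


def readiness(snapshot, cpu_limit=80.0, memory_limit=85.0):
    """Return (http_status, body): 200 only if a new call could be served now."""
    reasons = []
    if snapshot.cpu_percent >= cpu_limit:
        reasons.append("cpu above limit")
    if snapshot.memory_percent >= memory_limit:
        reasons.append("memory above limit")
    if snapshot.active_calls >= snapshot.max_calls:
        reasons.append("at call capacity")
    for name, reachable in snapshot.dependencies.items():
        if not reachable:
            reasons.append(f"dependency down: {name}")
    status = 200 if not reasons else 503
    return status, {"ready": not reasons, "reasons": reasons}
```

Returning the reasons alongside the status keeps the endpoint useful for debugging: the load balancer only needs the status code, while an operator can see why an instance was pulled from the pool.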

Monitoring the Right Metrics

Standard infrastructure metrics will not catch telephony-specific failures. Track these:

Answer-Seizure Ratio (ASR, not to be confused with automatic speech recognition): Percentage of call attempts resulting in a connection. A healthy system runs above 95%. Alert on any 5-minute window below 90%.

Post-Dial Delay (PDD): Time from INVITE to the first ringing response (typically a 180 Ringing). Keep this under 2 seconds. High PDD means callers hear dead air and hang up.

RTP packet loss and jitter: Even 1% packet loss degrades speech recognition accuracy. Monitor per-call MOS (Mean Opinion Score) and flag calls below 3.5.

SIP response codes: Track 4xx and 5xx responses by trunk. A spike in 503 means it is time to shift traffic to your backup trunk.
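The signaling metrics above can be derived from per-call records. The sketch below assumes a hypothetical CallAttempt record populated from SIP logs or call detail records; the field names are illustrative, and the thresholds are the ones stated above (90% answer-seizure ratio over 5 minutes, 2 seconds PDD).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CallAttempt:
    trunk: str
    invite_at: float              # seconds since epoch when the INVITE was seen
    ringing_at: Optional[float]   # first ringing response, None if never received
    final_code: int               # final SIP response code, e.g. 200, 486, 503


def window_metrics(attempts, now, window_s=300):
    """Summarize the attempts whose INVITE fell in the last window_s seconds."""
    recent = [a for a in attempts if now - window_s <= a.invite_at <= now]
    if not recent:
        return None
    answered = sum(1 for a in recent if a.final_code == 200)
    pdds = [a.ringing_at - a.invite_at for a in recent if a.ringing_at is not None]
    count_503 = {}
    for a in recent:
        if a.final_code == 503:
            count_503[a.trunk] = count_503.get(a.trunk, 0) + 1
    return {
        "answer_seizure_ratio": answered / len(recent),
        "max_pdd_s": max(pdds) if pdds else None,
        "count_503_by_trunk": count_503,
    }


def alerts(metrics, min_ratio=0.90, max_pdd_s=2.0):
    """Return human-readable alerts for a metrics dict from window_metrics."""
    out = []
    if metrics["answer_seizure_ratio"] < min_ratio:
        out.append("answer-seizure ratio below threshold")
    if metrics["max_pdd_s"] is not None and metrics["max_pdd_s"] > max_pdd_s:
        out.append("post-dial delay above threshold")
    return out
```

Keeping the 503 count keyed by trunk matters: an aggregate count hides the case where one carrier is failing while the other is healthy, which is exactly when traffic should shift.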

Handling Graceful Degradation

When partial failures occur, degrade gracefully. If your AI pipeline is overloaded, route overflow calls to a simpler IVR fallback that collects basic information and offers a callback. Implement call admission control at the SBC: once the fallback is also full, reject new calls with a 486 (Busy Here) rather than accepting calls you cannot serve. A busy signal and a retry beat a broken agent experience.
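The degradation order described here (AI agent first, IVR fallback second, 486 last) can be written as a single decision function. This is a sketch under assumed inputs: the capacity numbers and the way active call counts are obtained are hypothetical, and in practice the decision would live in the SBC or routing layer.

```python
from enum import Enum


class Admission(Enum):
    AI_AGENT = "ai_agent"          # normal path
    IVR_FALLBACK = "ivr_fallback"  # collect basic information, offer a callback
    REJECT_486 = "reject_486"      # respond 486 Busy Here


def admit(active_ai_calls, ai_capacity, active_ivr_calls, ivr_capacity):
    """Decide what to do with a new inbound call given current load."""
    if active_ai_calls < ai_capacity:
        return Admission.AI_AGENT
    if active_ivr_calls < ivr_capacity:
        return Admission.IVR_FALLBACK
    return Admission.REJECT_486
```

For example, with the AI pipeline full at 200 of 200 calls and the IVR at 10 of 50, admit(200, 200, 10, 50) selects the IVR fallback; only when both are full does the caller get a busy signal.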

Lessons from Production Incidents

The worst outages I have seen were subtle. A carrier silently transcoded audio from G.711 to G.729 during a capacity crunch, destroying speech recognition accuracy. A TLS certificate expired on a Saturday, and the mutual TLS failure went unnoticed because the SBC quietly fell back to UDP. A DNS change propagated unevenly, sending half the traffic to a decommissioned IP for six hours.

Every one of these was detectable with proper monitoring. Build runbooks with SIP-specific diagnostics: capture a SIP ladder diagram, check codec negotiation in the SDP, verify RTP flows bidirectionally, and confirm DNS resolution from multiple vantage points.
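As one example of a runbook check, the sketch below inspects the codecs listed in an SDP answer and flags anything other than G.711, which would have caught the transcoding incident above. The function names and the allowed set are assumptions; the static payload type numbers (0 for PCMU, 8 for PCMA, 18 for G729) come from the standard RTP audio/video profile. It assumes a single audio section and treats the first listed format as the codec in use.

```python
# Static RTP payload types that may appear without an a=rtpmap line.
STATIC_PAYLOADS = {0: "PCMU", 8: "PCMA", 18: "G729"}


def offered_audio_codecs(sdp):
    """Return codec names for the audio section, in the order listed on m=."""
    payload_types = []
    rtpmap = {}
    in_audio = False
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("m="):
            # m=audio <port> <proto> <payload type> <payload type> ...
            fields = line[2:].split()
            in_audio = fields[0] == "audio"
            if in_audio:
                payload_types = [int(pt) for pt in fields[3:]]
        elif in_audio and line.startswith("a=rtpmap:"):
            # a=rtpmap:<payload type> <encoding name>/<clock rate>
            pt, encoding = line[len("a=rtpmap:"):].split(None, 1)
            rtpmap[int(pt)] = encoding.split("/")[0].upper()
    return [
        rtpmap.get(pt) or STATIC_PAYLOADS.get(pt, f"unknown({pt})")
        for pt in payload_types
    ]


def unexpected_codec(sdp_answer, allowed=frozenset({"PCMU", "PCMA"})):
    """Return the first listed codec if it is not in the allowed set, else None."""
    codecs = offered_audio_codecs(sdp_answer)
    if codecs and codecs[0] not in allowed:
        return codecs[0]
    return None
```

Running a check like this against sampled live calls, rather than only during an incident, turns a silent codec change into an alert instead of a slow decline in recognition quality.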