API-First Development: Designing Resilient Third-Party Integrations
Every production system I have worked on integrates with at least three external APIs. Salesforce for CRM data, telephony providers for call routing, EHR systems for patient records, payment processors, identity providers — the list grows with every project. The uncomfortable truth is that third-party APIs will fail, and your system’s reliability is only as strong as your weakest integration.
After years of building integrations that handle millions of API calls daily, here are the patterns that actually survive production.
Define Contracts Before Writing Code
API-first means your integration layer has a clearly defined contract independent of the third party’s actual API surface. Create an internal interface that describes what your system needs, not what the external API provides. This abstraction layer does two critical things: it isolates your business logic from upstream API changes, and it makes testing straightforward because you can mock at the contract boundary.
Define your internal data models with explicit types using Pydantic or dataclasses. Map the external API response to your internal model in a single adapter function. When Salesforce renames a field or changes a response structure, you fix one adapter — not forty call sites scattered across your codebase.
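As a minimal sketch of the adapter idea, assuming a CRM payload with the keys "Id", "Name", and "Email" (these field names and the Customer model are illustrative, not taken from any specific API):

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class Customer:
    """Internal model: what our system needs, not what the vendor returns."""
    customer_id: str
    full_name: str
    email: Optional[str]


def customer_from_crm(payload: Mapping[str, Any]) -> Customer:
    """The only place that knows the external response shape.

    The keys "Id", "Name", and "Email" are assumed for illustration.
    """
    return Customer(
        customer_id=str(payload["Id"]),
        full_name=payload["Name"].strip(),
        # Normalize missing or empty emails to None.
        email=payload.get("Email") or None,
    )
```

If the vendor renames "Name", only customer_from_crm changes; everything else keeps consuming Customer.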
Circuit Breakers Are Non-Negotiable
The circuit breaker pattern prevents cascading failures when a downstream API degrades. Without one, a slow or failing API call blocks your threads, exhausts connection pools, and takes down services that have nothing to do with the failing integration.
Implement a circuit breaker that tracks failure counts over a rolling window. After a configurable threshold of failures, the breaker opens and immediately returns a fallback response without attempting the API call. After a cooldown period, it enters a half-open state, allowing a single probe request through. If that succeeds, the breaker closes and normal traffic resumes. If it fails, the breaker reopens and the cooldown starts again.
In Python, pybreaker implements this pattern out of the box (tenacity, often mentioned alongside it, is a retry library that pairs with a breaker rather than replacing one), but even a simple implementation with a failure counter in Redis works. The key is tuning the thresholds to match the downstream API’s behavior. A telephony API that occasionally drops a call needs different thresholds than a CRM endpoint that returns stale data for hours during maintenance windows.
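To make the state machine concrete, here is a small in-memory sketch. It is not pybreaker's API. The class name, parameters, and defaults are assumptions for illustration, and it is single-threaded, so it does not enforce "exactly one probe" under concurrency.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Hypothetical breaker: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold: int = 5, window_seconds: float = 60.0,
                 cooldown_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._failures: Deque[float] = deque()  # timestamps of recent failures
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.cooldown_seconds:
            return "half-open"
        return "open"

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        state = self.state
        if state == "open":
            return fallback()  # fail fast, do not touch the API
        try:
            result = func()
        except Exception:
            self._record_failure(was_probe=(state == "half-open"))
            return fallback()
        if state == "half-open":
            # Probe succeeded: close the breaker and forget old failures.
            self._failures.clear()
            self._opened_at = None
        return result

    def _record_failure(self, was_probe: bool) -> None:
        now = self._clock()
        if was_probe:
            self._opened_at = now  # probe failed: reopen, restart cooldown
            return
        self._failures.append(now)
        # Drop failures that have aged out of the rolling window.
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()
        if len(self._failures) >= self.failure_threshold:
            self._opened_at = now
```

Injecting the clock keeps the breaker testable without sleeping. A version shared across workers would keep the same state in Redis instead of instance attributes.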
Retry with Backoff and Jitter
Naive retries make outages worse. If an API is struggling under load and fifty of your workers all retry simultaneously after a fixed delay, you create a thundering herd that pushes the API further into failure.
Use exponential backoff with jitter. Start with a base delay, double it on each retry, and add a random jitter component. This spreads retry attempts over time and prevents synchronized retry storms. Cap the maximum delay, and set the maximum retry count according to how much staleness your data can tolerate.
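One way to compute such delays is the "full jitter" variant, where the whole delay is drawn at random between zero and the capped exponential ceiling. This is one of several jitter strategies, and the function name and default values below are assumptions chosen for illustration.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    The ceiling doubles with each attempt and is capped at `cap`;
    the returned delay is uniform in [0, ceiling] ("full jitter").
    """
    source = rng or random  # allow a seeded generator in tests
    ceiling = min(cap, base * (2 ** attempt))
    return source.uniform(0.0, ceiling)
```

With base=0.5 and cap=30.0, the ceilings are 0.5, 1, 2, 4, 8, 16, 30, 30, ... seconds, and each worker lands at a different point below its ceiling.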
Equally important: classify which errors are retryable. A 429 (rate limited) or 503 (service unavailable) is retryable. A 400 (bad request) or 401 (unauthorized) is not — retrying those wastes time and can trigger rate limiting on your own credentials.
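A sketch of that classification wired into a retry loop follows. The send callable returning a (status, body) pair is a hypothetical stand-in for a real HTTP call, and the retryable set contains only the two codes named above; many teams also add 502 and 504.

```python
import random
import time
from typing import Callable, Tuple

# Only the codes named in the text; extend deliberately, per integration.
RETRYABLE_STATUSES = frozenset({429, 503})


def is_retryable(status: int) -> bool:
    return status in RETRYABLE_STATUSES


def send_with_retries(send: Callable[[], Tuple[int, str]],
                      max_attempts: int = 4,
                      base_delay: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> Tuple[int, str]:
    """Call `send` until it returns a non-retryable status or attempts run out."""
    for attempt in range(max_attempts):
        status, body = send()
        is_last_attempt = attempt == max_attempts - 1
        if not is_retryable(status) or is_last_attempt:
            # Success, a permanent error such as 400/401, or out of attempts.
            return status, body
        # Jittered exponential delay before the next attempt.
        sleep(random.uniform(0.0, base_delay * 2 ** attempt))
    raise ValueError("max_attempts must be at least 1")
```

A 400 or 401 returns after a single call, so a bad payload or an expired credential surfaces immediately instead of burning the retry budget.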
Idempotency Protects Against Duplicate Processing
When you retry a request that might have succeeded but timed out before you received the response, you risk creating duplicate records or processing a transaction twice. Every write operation against an external API should include an idempotency key.
Generate a deterministic key from the request payload: a hash of the operation type, entity ID, and relevant parameters. Send it as a header or parameter with each request. Some APIs support this natively. Stripe, for example, accepts an Idempotency-Key header, and Salesforce offers upserts keyed on an external ID field, which gives a similar guarantee. For APIs that do not, track completed operations in your own datastore keyed by the idempotency token, and check before re-issuing the call.
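A minimal sketch of both halves, assuming JSON-serializable parameters. The function and class names are hypothetical, and the in-memory dict stands in for a real datastore.

```python
import hashlib
import json
from typing import Any, Callable, Dict, Mapping


def idempotency_key(operation: str, entity_id: str, params: Mapping[str, Any]) -> str:
    """Deterministic key: the same logical request always hashes the same."""
    canonical = json.dumps(
        {"op": operation, "id": entity_id, "params": dict(params)},
        sort_keys=True,            # key order must not change the hash
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class CompletedOperations:
    """Local dedupe for APIs without native idempotency support.

    In-memory for illustration only; a real version needs durable storage.
    """

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}

    def run_once(self, key: str, operation: Callable[[], Any]) -> Any:
        if key in self._results:
            return self._results[key]  # already done: do not re-issue
        result = operation()
        self._results[key] = result
        return result
```

One consequence of a purely payload-derived key is that two intentionally identical operations collapse into one, so include whatever distinguishes them (an order number, a request timestamp bucket) in the parameters.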
Timeouts Must Be Explicit
Never use default timeouts. Default connection and read timeouts in most HTTP libraries are either too generous (waiting 60+ seconds for a response) or not set at all. Set explicit connect and read timeouts on every HTTP client you create.
A connect timeout of 3-5 seconds is reasonable for most APIs. Read timeouts depend on the operation — a simple GET might warrant 10 seconds, while a complex report generation endpoint might need 30. The critical point is that these values should be conscious decisions documented alongside the integration, not afterthoughts.
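One way to keep those decisions visible is a small table of per-operation timeouts with the reasoning attached. The operation names and the TimeoutPolicy type below are hypothetical; the values come from the ranges discussed above. The (connect, read) tuple shape matches what the requests library accepts for its timeout argument.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class TimeoutPolicy:
    connect_seconds: float
    read_seconds: float
    reason: str  # the documented "why" lives next to the number

    def as_tuple(self) -> Tuple[float, float]:
        """(connect, read), e.g. for requests.get(url, timeout=...)."""
        return (self.connect_seconds, self.read_seconds)


# Hypothetical operations for illustration.
TIMEOUTS: Dict[str, TimeoutPolicy] = {
    "crm.get_customer": TimeoutPolicy(3.0, 10.0, "simple GET"),
    "crm.generate_report": TimeoutPolicy(5.0, 30.0, "server-side report generation"),
}


def timeout_for(operation: str) -> Tuple[float, float]:
    # Deliberately raises KeyError: an operation with no documented
    # timeout should fail loudly rather than fall back to a default.
    return TIMEOUTS[operation].as_tuple()
```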
Build a Degraded Mode
When a critical integration is down, your system should not crash — it should degrade gracefully. Define what degraded mode means for each integration before you need it.
For a CRM integration, degraded mode might mean serving cached customer data that is up to an hour stale. For a telephony integration, it might mean queueing call routing requests in a local store and processing them when the API recovers. For an EHR system, it might mean displaying a clear message that records are temporarily unavailable rather than showing an error page.
Cache the last known good response for every read operation. Use write-ahead logging for write operations so nothing is lost during an outage. When the circuit breaker closes and the API recovers, drain the write-ahead log in order.
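The two mechanisms can be sketched as follows. Both classes are hypothetical and in-memory, which is an assumption made for brevity: a real write-ahead log must be durable (disk or a database), or it loses exactly the writes it exists to protect.

```python
import time
from collections import deque
from typing import Any, Callable, Deque, Dict, Tuple


class LastKnownGood:
    """Serve the last successful read when the live call fails."""

    def __init__(self, max_age_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_age_seconds = max_age_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def fetch(self, key: str, load: Callable[[], Any]) -> Tuple[Any, bool]:
        """Return (value, is_stale). Re-raises if no usable cache entry exists."""
        try:
            value = load()
        except Exception:
            cached = self._entries.get(key)
            if cached is None or self._clock() - cached[0] > self.max_age_seconds:
                raise  # nothing fresh enough to serve
            return cached[1], True
        self._entries[key] = (self._clock(), value)
        return value, False


class WriteAheadQueue:
    """Hold writes during an outage and replay them in order."""

    def __init__(self) -> None:
        self._pending: Deque[Any] = deque()

    def __len__(self) -> int:
        return len(self._pending)

    def append(self, write: Any) -> None:
        self._pending.append(write)

    def drain(self, send: Callable[[Any], None]) -> int:
        """Send pending writes oldest-first; stop (by raising) on first failure."""
        sent = 0
        while self._pending:
            send(self._pending[0])   # if this raises, the entry stays queued
            self._pending.popleft()
            sent += 1
        return sent
```

Returning the is_stale flag lets the caller tell users the data may be out of date instead of silently presenting it as current.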
Observability Across Integration Boundaries
Log every outbound API call with the endpoint, response status, latency, and a correlation ID that ties back to the originating request in your system. Track error rates and latency percentiles per integration endpoint, not just per service. An overall error rate of 1% hides the fact that one specific Salesforce endpoint is failing 30% of the time.
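A sketch of a logging wrapper that captures those four fields follows. The send callable (which returns an HTTP status code), the logger name, and the JSON field names are all assumptions for illustration.

```python
import json
import logging
import time
import uuid
from typing import Callable, Optional

logger = logging.getLogger("integrations")


def logged_call(integration: str, endpoint: str, send: Callable[[], int],
                correlation_id: Optional[str] = None) -> int:
    """Run `send` and emit one structured log line, even if it raises."""
    correlation_id = correlation_id or str(uuid.uuid4())
    started = time.perf_counter()
    status: Optional[int] = None
    try:
        status = send()
        return status
    finally:
        latency_ms = (time.perf_counter() - started) * 1000.0
        logger.info(json.dumps({
            "integration": integration,
            "endpoint": endpoint,          # per-endpoint, not just per-service
            "status": status,              # None means the call raised
            "latency_ms": round(latency_ms, 2),
            "correlation_id": correlation_id,
        }))
```

Because each line carries both the integration and the endpoint, error rates and latency percentiles can be grouped per endpoint downstream.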
Set alerts on error rate changes rather than absolute thresholds. A jump from 0.1% to 2% is more meaningful than crossing a static 5% threshold, because it tells you something changed rather than that you hit an arbitrary number.
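A change-based alert rule can be as small as the sketch below. The function name, the 5x ratio, and the 0.5% noise floor are assumed values for illustration, not recommendations; tune them per endpoint.

```python
def error_rate_alert(baseline_rate: float, current_rate: float,
                     min_ratio: float = 5.0, min_rate: float = 0.005) -> bool:
    """Alert when the error rate jumps relative to its own baseline.

    Rates are fractions (0.02 means 2%). `min_rate` is a noise floor so
    that a move from 0.001% to 0.01% does not page anyone.
    """
    if current_rate < min_rate:
        return False
    if baseline_rate <= 0.0:
        return True  # errors appeared where there were none
    return current_rate / baseline_rate >= min_ratio
```

Under these assumed settings, the jump from 0.1% to 2% fires, while a drift from 4.5% to 5% does not, even though only the latter crosses a static 5% line.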
These patterns are not theoretical — they are the difference between an integration that pages you at 3 AM and one that self-heals while you sleep.
