Deploying Python Microservices on Kubernetes: Lessons Learned

After two years of running Python microservices on Kubernetes across AWS EKS and GCP GKE, I have accumulated a list of mistakes I wish someone had warned me about. These are not theoretical concerns from documentation — they are patterns that caused real outages, wasted compute, and late-night debugging sessions.

Get Your Health Checks Right

The single most common failure mode I have seen is misconfigured liveness and readiness probes. A Python service using FastAPI or Flask might take several seconds to load ML models or establish database connection pools at startup. If your liveness probe fires before initialization completes, the kubelet kills the container and restarts it, the probe fails again, and the pod ends up in CrashLoopBackOff.

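The fix is straightforward: separate your readiness probe from your liveness probe, and keep the liveness probe from running until startup has finished. The readiness probe should gate traffic until your application is fully initialized. The liveness probe should only fail when the process is genuinely stuck. I typically set initialDelaySeconds on the liveness probe to at least 30 seconds for services that load models; a startupProbe does the same job without a fixed delay. I back the liveness probe with a lightweight /healthz endpoint that just confirms the server is responding, distinct from a /ready endpoint for the readiness probe that also checks downstream dependencies like database connections and API credentials. Keeping dependency checks out of the liveness path means a database outage takes pods out of rotation instead of restarting all of them.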
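To make the split between the two probes concrete, here is a minimal sketch using only the standard library's WSGI conventions. The paths /livez and /readyz, the initialized flag, and database_reachable() are placeholder names I made up for illustration; map them to whatever your service already exposes. The same shape carries over to FastAPI or Flask routes.

```python
import json
import threading

# Set once startup work (model loading, connection pools) has finished.
initialized = threading.Event()


def database_reachable() -> bool:
    """Hypothetical dependency check; replace with a cheap query such as SELECT 1."""
    return True


def health_app(environ, start_response):
    """Minimal WSGI app exposing separate liveness and readiness endpoints."""
    path = environ.get("PATH_INFO", "")
    if path == "/livez":
        # Liveness: answering at all shows the process is not stuck.
        status, body = "200 OK", {"status": "alive"}
    elif path == "/readyz":
        # Readiness: accept traffic only once initialized and dependencies respond.
        ok = initialized.is_set() and database_reachable()
        status = "200 OK" if ok else "503 Service Unavailable"
        body = {"status": "ready" if ok else "not ready"}
    else:
        status, body = "404 Not Found", {"status": "unknown path"}
    payload = json.dumps(body).encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(payload)))])
    return [payload]
```

Call initialized.set() at the end of your startup routine. Until then the readiness path returns 503 while the liveness path keeps answering 200, so the pod receives no traffic but is not restarted.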

Resource Requests and Limits Are Not Optional

Python’s memory behavior makes resource configuration tricky. CPython does not release memory back to the OS eagerly due to its internal memory allocator. A service that spikes to 512MB during a burst will often hold that memory indefinitely, even after the objects are garbage collected.

Set your resource requests based on steady-state usage and your limits based on observed peak plus a 20% buffer. Use memory_profiler or Prometheus metrics with process_resident_memory_bytes to understand actual consumption before you set values. I have seen teams set limits too aggressively and get OOMKilled under normal load because they tested locally where memory pressure behaves differently.
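The sizing rule above can be sketched as a small helper. It assumes you have already exported resident-memory samples in bytes (for example from process_resident_memory_bytes) and it treats the median sample as "steady state", which is my simplification and not the only reasonable choice. The function name and the sample values are hypothetical.

```python
import math
import statistics


def suggest_memory_settings(samples_bytes, buffer=0.20):
    """Return (request_mib, limit_mib) from observed resident-memory samples.

    request = median sample (taken as steady state); limit = peak plus `buffer`.
    """
    if not samples_bytes:
        raise ValueError("need at least one memory sample")
    mib = 1024 * 1024
    steady_state = statistics.median(samples_bytes)
    peak = max(samples_bytes)
    request_mib = math.ceil(steady_state / mib)
    limit_mib = math.ceil(peak * (1 + buffer) / mib)
    return request_mib, limit_mib


# Hypothetical samples: mostly around 200 MiB with one 512 MiB burst.
samples = [200 * 2**20, 210 * 2**20, 205 * 2**20, 512 * 2**20, 208 * 2**20]
request_mib, limit_mib = suggest_memory_settings(samples)
print(f"requests.memory: {request_mib}Mi, limits.memory: {limit_mib}Mi")
```

With these samples the helper suggests a 208Mi request and a 615Mi limit. Collect the samples from production-like load, not from a laptop, for the reason given above.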

For CPU, remember that in standard CPython builds the GIL means a single process will not run Python bytecode on more than one core at a time, no matter how many threads it starts. Setting a CPU limit of 4 on a single-process synchronous Flask app wastes quota. Match your limits to your concurrency model: if you are running Gunicorn with 4 workers that spend most of their time waiting on I/O, a CPU request of 1 and limit of 2 is usually reasonable.
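One way to keep the worker count and the CPU limit in step is to derive the former from the latter at startup. This sketch assumes the container runs under cgroup v2, where the limit is exposed in /sys/fs/cgroup/cpu.max as "<quota> <period>" or "max <period>". The helper names are hypothetical, and the default of two workers per core simply mirrors the 4-workers-at-limit-2 example above; it is an assumption for I/O-bound services, not a rule.

```python
import math
from pathlib import Path


def parse_cpu_max(text):
    """Parse cgroup v2 cpu.max content ("<quota> <period>" or "max <period>").

    Returns the CPU limit in cores, or None when no limit is set.
    """
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def detect_cpu_limit(path="/sys/fs/cgroup/cpu.max"):
    """Read the container's CPU limit; None if the file is absent or unparsable."""
    try:
        return parse_cpu_max(Path(path).read_text())
    except (OSError, ValueError):
        return None


def worker_count(cpu_limit, workers_per_core=2, default=1):
    """Scale worker processes with the CPU limit, never below one."""
    if cpu_limit is None:
        return default
    return max(1, math.floor(cpu_limit * workers_per_core))


# Could be placed in a gunicorn.conf.py, where `workers` is a recognized setting.
workers = worker_count(detect_cpu_limit())
```

This avoids relying on os.cpu_count(), which reports the node's cores and not the container's limit.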

Graceful Shutdown Matters More Than You Think

When Kubernetes terminates a pod, it sends SIGTERM and then waits for terminationGracePeriodSeconds (default 30 seconds) before sending SIGKILL. Many Python frameworks do not handle SIGTERM gracefully out of the box. If your service is mid-request or holding a database transaction, you get dropped connections and data inconsistency.

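Register a signal handler that sets a shutdown flag, stops accepting new work, and drains in-flight requests. For async services using uvicorn, set --timeout-graceful-shutdown to a value below the pod’s grace period so the drain is bounded. Celery workers already perform a warm shutdown on SIGTERM and let running tasks finish, so set terminationGracePeriodSeconds longer than your longest task; the --without-heartbeat flag only turns off worker heartbeat events and does not affect shutdown. Also configure a preStop hook in your pod spec with a short sleep. The hook runs before SIGTERM is delivered, which gives the endpoints controller and kube-proxy time to remove the pod from the Service while the application is still accepting connections.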
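For services that manage their own request loop, the flag-and-drain pattern might look like the following sketch. The GracefulShutdown class and its method names are hypothetical; the handler itself only flips a boolean, because doing real work inside a signal handler is fragile.

```python
import signal
import threading


class GracefulShutdown:
    """Tracks in-flight requests and a shutdown flag set by SIGTERM."""

    def __init__(self):
        self.shutting_down = False
        self._in_flight = 0
        self._idle = threading.Condition()

    def install(self):
        # Must be called from the main thread; Python only allows signal
        # handlers to be registered there.
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        # Keep the handler tiny: only flip the flag.
        self.shutting_down = True

    def try_begin(self):
        """Return True if new work may start, False once shutdown has begun."""
        with self._idle:
            if self.shutting_down:
                return False
            self._in_flight += 1
            return True

    def end(self):
        """Mark one unit of work as finished."""
        with self._idle:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.notify_all()

    def drain(self, timeout):
        """Block until in-flight work finishes; False if `timeout` expires first."""
        with self._idle:
            return self._idle.wait_for(lambda: self._in_flight == 0, timeout)
```

Wrap each request in try_begin() and end(), and once shutting_down is set, call drain() with a timeout comfortably below terminationGracePeriodSeconds so the process exits before SIGKILL arrives.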

Container Image Discipline

Use multi-stage Docker builds. Your build stage can include gcc, build-essential, and development headers, but your runtime image should be as lean as possible. I use python:3.11-slim as the base for production and install only the wheels built in the prior stage. This typically cuts image size from 1.2GB to under 300MB, which directly impacts pod startup time when nodes need to pull the image.

Pin every dependency. Do not use a bare pip install flask in a Dockerfile; install from a requirements.txt with exact versions generated by pip-compile (part of pip-tools). A floating dependency that works today and breaks tomorrow will cause a rollout failure at the worst possible time.
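A cheap guard is a CI step that rejects any requirement without an exact pin. The sketch below is a deliberately simple check, not a full parser for the requirements file format: it ignores comments and pip option lines and flags everything that is not of the form name==version. The function name and the sample file contents are illustrative.

```python
import re

# Matches "name==version" with optional extras, e.g. "uvicorn[standard]==0.30.1".
PINNED = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[A-Za-z0-9,._-]+\])?==[^=\s;]+")


def unpinned_requirements(text):
    """Return requirement lines that lack an exact `==` version pin."""
    problems = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not PINNED.match(line):
            problems.append(line)
    return problems


# Illustrative file contents; the versions are examples only.
sample = """\
flask==3.0.3
requests>=2.0
# a comment
-r base.txt
gunicorn
uvicorn[standard]==0.30.1
"""
print(unpinned_requirements(sample))
```

For the sample above the check reports requests>=2.0 and gunicorn as unpinned.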

Observability From Day One

Instrument your services before you need to debug them. At minimum, expose Prometheus metrics using prometheus-client, ship structured JSON logs with correlation IDs so you can trace requests across services, and add OpenTelemetry tracing if you are running more than three services. The cost of adding observability after an outage is always higher than building it in upfront.
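As a starting point for the logging piece, here is a sketch of structured JSON logs with a correlation ID, using only the standard library. The logger name "service", the field names, and the handle_request function are assumptions for illustration; in a real service the incoming ID would typically come from a request header of your choosing.

```python
import contextvars
import json
import logging
import uuid

# Holds the correlation ID for the request currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        }
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def configure_logging(stream=None):
    """Attach a JSON handler to a dedicated logger (stderr when stream is None)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("service")  # hypothetical logger name
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger, incoming_id=None):
    # Reuse the caller's ID when present so one ID follows the request across services.
    token = correlation_id.set(incoming_id or uuid.uuid4().hex)
    try:
        logger.info("request handled")
    finally:
        correlation_id.reset(token)
```

Because the ID lives in a ContextVar, it stays isolated per thread and per asyncio task, so concurrent requests do not overwrite each other's IDs.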