10 Kubernetes Best Practices for Production Success

INCIDENT REPORT: POST-MORTEM #882-B (THE “FRIDAY AFTERNOON ARCHITECT SPECIAL”)
TIMESTAMP: 2024-05-17T03:04:12Z
INITIAL ALERT: CRITICAL - PagerDuty - Service: checkout-api - Severity: 1 - Status: FAILED
FIRST LOG ENTRY:
kubelet[1024]: E0517 03:04:12.442103 1024 remote_runtime.go:116] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to setup network for sandbox: failed to allocate for range 0: no IP addresses available in range set: 10.244.2.0-10.244.2.255"


1. The 3 AM Meltdown: Why Everything Broke at Once

I was finally in a REM cycle when the pager started screaming. Not the “disk is at 80%” chirp. The “your entire regional cluster is a smoking crater” siren. I opened my laptop, eyes burning like I’d stared into a solar flare, and saw the Slack channel already filled with “Architects” asking if we’d “tried restarting the pods.”

Restarting the pods. Brilliant. Why didn’t I think of that? Oh, wait, I did, and the kube-scheduler is currently having a nervous breakdown because someone decided to push a “minor optimization” to our Kubernetes 1.29 control plane at 4:45 PM on a Friday.

The failure wasn’t a single point. It was a cascading failure of epic proportions. It started with the checkout-api deployment. Our “Architectural Lead”—who I’m convinced hasn’t touched a CLI since the Obama administration—decided that we needed to “maximize density.” They stripped out all the resource requests and limits because “Kubernetes is smart enough to figure it out.”

Narrator: It was not smart enough to figure it out.

By 03:00, the nodes were so oversubscribed that the Linux kernel started executing processes like it was the French Revolution. The OOM Killer was the only thing working at full capacity. Because there were no resource boundaries, a single memory leak in a Java sidecar (don’t ask why there’s a Java sidecar in a Go service) ballooned until it choked the kubelet. When the kubelet died, the node went NotReady. When the node went NotReady, the scheduler tried to move 400 pods to the remaining three nodes.

You can guess what happened next. It was a digital suicide pact.

$ kubectl get nodes
NAME             STATUS     ROLES           AGE   VERSION
ip-10-0-1-12     NotReady   worker          45d   v1.29.2
ip-10-0-1-13     NotReady   worker          45d   v1.29.2
ip-10-0-1-14     NotReady   worker          45d   v1.29.2
ip-10-0-1-15     Ready      control-plane   45d   v1.29.2

$ kubectl describe node ip-10-0-1-12 | grep -A 5 Events
Events:
  Type     Reason                   Age                From     Message
  ----     ------                   ----               ----     -------
  Warning  EvictionThresholdMet     14m                kubelet  Attempting to reclaim memory
  Normal   NodeHasInsufficientMemory 14m (x20 over 2h)  kubelet  Node ip-10-0-1-12 status is now: NodeHasInsufficientMemory

The “Architects” wanted density. They got it. They got 100% density of failure.


2. Resource Limits: The Lie We Tell Ourselves

Let’s talk about the YAML I found in the checkout-api manifest. It’s a work of art if your medium is “pure incompetence.” In Kubernetes 1.29, we have sophisticated cgroup v2 support, better memory pressure handling, and what do we do? We leave the resources block empty.

# The “Architect” special
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-api
spec:
  selector:
    matchLabels:
      app: checkout-api
  template:
    metadata:
      labels:
        app: checkout-api
    spec:
      containers:
      - name: app
        image: checkout:latest # Because who needs versioning?
        resources: {} # “Let the cluster decide,” they said.

When you don’t define requests, the scheduler thinks the pod needs nothing and will cram it onto any node; with no requests or limits at all, the pod also lands in the BestEffort QoS class. In the hierarchy of “who gets killed first when the node is sweating,” BestEffort is the first against the wall. But it’s worse than that. Without limits, the container will try to consume every byte of RAM on the host.

I spent four hours watching dmesg logs on a dying node. With no requests defined, these pods get the smallest CPU weight the CFS (Completely Fair Scheduler) will hand out, so once the node was oversubscribed they were starved down to table scraps: the application’s response time went from 20ms to 15,000ms.

# Checking the carnage on the node
$ journalctl -u kubelet --since "1 hour ago" | grep "OOM"
May 17 03:15:22 node-1 kubelet[1024]: Task 'app' (pid 12345) killed due to OOMKill.
May 17 03:15:24 node-1 kubelet[1024]: Task 'sidecar' (pid 12360) killed due to OOMKill.

If you want to survive a production load, you follow the actual Kubernetes best practice here: Guaranteed QoS. You set requests equal to limits. This tells the scheduler, “Do not put this pod on a node unless you can actually give me this memory.” It prevents the “noisy neighbor” syndrome where a dev-test pod starts eating the production database’s memory. But no, that would be “too expensive.” You know what’s expensive? Being down for 48 hours while I manually prune dead containers from the containerd shim.
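
For the record, this is all it takes. A minimal sketch of a Guaranteed QoS resources block; the numbers are illustrative placeholders, not profiling data from our service:

# Guaranteed QoS: requests == limits for every resource on every container.
# Sizes are placeholders; base yours on actual profiling, not vibes.
resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    cpu: "500m"
    memory: "1Gi"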


3. Probes are Not Optional: Liveness vs. Readiness vs. Reality

While the nodes were burning, the Load Balancer was still sending traffic to pods that were in a CrashLoopBackOff. Why? Because the “Architects” thought livenessProbes and readinessProbes were “boilerplate fluff.”

Here is what I found in the production manifest:

# Actual snippet from the wreckage
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 0
  periodSeconds: 1

They set initialDelaySeconds to 0 for a Java application that takes 90 seconds to warm up its JVM and connect to the connection pool. The result? The kubelet started hitting the /health endpoint before the app was even finished loading its classes. The app didn’t respond, so the kubelet marked it as unready. Then, because they used the same logic for a livenessProbe, the kubelet killed the container and restarted it.

It was an infinite loop of death. The app would start, get poked by the kubelet, fail to respond instantly, get killed, and start over.

I had to explain—for the tenth time this year—that a readinessProbe tells the Service “don’t send me traffic yet,” while a livenessProbe tells the Kubelet “kill me, I’m stuck.” If you point them both at the same endpoint with a 1-second timeout, you are essentially DDoSing your own startup routine.

Kubernetes has had startupProbes for years (stable since 1.20, and very much still there in 1.29). Use them. They are designed for this exact scenario. They give the app time to breathe before the liveness probe starts swinging its axe. But that would require reading the documentation, which apparently isn’t as fun as writing “Thought Leadership” posts on LinkedIn.
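
Here is roughly what that probe stack should look like for a slow-starting JVM service. The endpoints and timings below are assumptions for illustration, not the values from our manifest:

# Sketch: give the JVM room to boot, then probe it like an adult
startupProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 5
  failureThreshold: 30        # up to 150s to finish booting before anything gets killed
readinessProbe:
  httpGet:
    path: /ready              # "can I take traffic?" (assumed endpoint)
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 2
livenessProbe:
  httpGet:
    path: /health             # "am I wedged?" - keep this cheap, no dependency checks
    port: 8080
  periodSeconds: 10
  failureThreshold: 3

The liveness and readiness probes do not even start until the startupProbe has succeeded, which is the entire point.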


4. Security Contexts: Why Are You Still Running as Root?

As I was digging through the kubectl describe pod output to figure out why the CNI was failing, I noticed something even more horrifying. Every single pod was running with privileged: true or, at the very least, as the root user.

$ kubectl get pod checkout-api-6f789-abcde -o jsonpath='{.spec.containers[0].securityContext}'
{"privileged":true,"runAsUser":0}

Why? “Because the app needs to write to a log file in /var/log.”

I want to scream. It’s 2024. We have emptyDir volumes. We have fluentd for log aggregation. There is zero reason—NONE—for a checkout service to have root access to the host’s kernel. And because these pods were privileged, they ran with every capability, no seccomp profile, and full access to host devices, so when the memory leak happened the damage wasn’t confined to the container; it reached host-level processes.

One pod managed to trigger a kernel panic because it exhausted the host’s file descriptors. If we had used a proper securityContext, the container would have been capped.

# What a sane person would write
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL
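
And if the app genuinely needs to write to /var/log, you give it a scratch volume instead of root. A sketch of the pod-spec fragment that pairs with readOnlyRootFilesystem above; the volume name and mount path are assumptions about this particular app:

# Writable scratch space for logs, no root required
volumes:
- name: app-logs
  emptyDir: {}
containers:
- name: app
  volumeMounts:
  - name: app-logs
    mountPath: /var/log/app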

But no, “security slows down development.” You know what else slows down development? Having the entire engineering team locked in a Zoom bridge for two days because a compromised or buggy container wiped the node’s root partition.


5. The Networking Rabbit Hole: CNI Failures and Latency

By hour 20, we got the pods to stop crashing, but the latency was astronomical. We’re talking 5 seconds for a simple DNS lookup. I checked CoreDNS.

$ kubectl logs -n kube-system -l k8s-app=kube-dns
[ERROR] plugin/errors: 2 checkout-api.production.svc.cluster.local. A: read udp 10.244.0.15:45678->10.244.0.2:53: i/o timeout

The “Architects” had configured the application to use ndots: 5 in the dnsConfig. For those who don’t live in the trenches, that means any name with fewer than five dots gets expanded through every entry in the resolver’s search path (namespace.svc.cluster.local, svc.cluster.local, cluster.local, plus whatever the node adds) before the client finally tries it as an absolute name.

checkout-api -> checkout-api.production.svc.cluster.local -> checkout-api.svc.cluster.local -> checkout-api.cluster.local -> checkout-api.

Every single database query was generating five DNS queries. CoreDNS was being hammered so hard that the conntrack tables on the nodes were overflowing.

$ dmesg | grep "table full"
[72431.123456] nf_conntrack: table full, dropping packet

I had to manually go in and patch the coredns ConfigMap to serve stale cache entries and increase the memory limit. I also had to explain that we should be using ExternalName Services or at least pointing at the FQDN with a trailing dot to bypass the ndots search hell.
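
The application-side fix is even less glamorous: lower ndots so names that are clearly external stop getting dragged through the cluster search path. A sketch of the pod-spec dnsConfig we rolled out; the value of 2 is a judgment call, not a magic constant:

# Pod spec fragment: stop expanding every external FQDN through the search domains
dnsConfig:
  options:
  - name: ndots
    value: "2"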

And then there was the CNI. We’re using a standard overlay network, but because someone set the MTU (Maximum Transmission Unit) without accounting for the overlay’s encapsulation overhead, full-size packets no longer fit on the wire. The pods could talk to each other as long as the payload was small, but as soon as a JSON response filled a 1500-byte packet, the encapsulated frame exceeded the path MTU and was dropped silently. No error, just a timeout.

I spent six hours with tcpdump and wireshark inside a debug container just to prove that it wasn’t “the network being slow” but rather “the network being misconfigured by people who think YAML is just a suggestion.”


6. The “Kubernetes Best” Practices We Ignored (And Why We’re Fired)

Finally, at hour 40, we started looking at the cluster’s overall health. The reason the failure was so “catastrophic” and not just “annoying” was that we ignored every single Kubernetes best practice regarding high availability.

First: Pod Topology Spread Constraints.
The “Architects” didn’t define these. As a result, the scheduler—in its infinite, unguided wisdom—placed 90% of our critical pods on the same physical rack in the same availability zone. When that zone had a minor power blip, the entire “redundant” cluster went dark.

# What we should have had
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: checkout-api

Second: Horizontal Pod Autoscaler (HPA) Misconfiguration.
The HPA was configured to scale based on CPU at 50%. Sounds safe, right? Wrong. The application is IO-bound, not CPU-bound. During the outage, the CPU stayed low because the app was waiting on DNS timeouts. So the HPA started scaling down the number of pods because it thought the system was idle, which increased the load on the remaining pods, which caused them to OOM, which caused the HPA to scale down further. It was a “death spiral” orchestrated by a YAML file.
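
If you are stuck scaling on CPU for an IO-bound service, at least stop the HPA from panic-scaling down during a metrics blind spot. A hedged sketch using the autoscaling/v2 behavior block; the replica counts, window, and percentages are illustrative, not our production values:

# Slow, suspicious scale-down: wait out the "idle" signal before removing capacity
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  minReplicas: 6
  maxReplicas: 40
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600   # ten minutes of sustained "low load" before shrinking
      policies:
      - type: Percent
        value: 10                       # and never remove more than 10% of pods per minute
        periodSeconds: 60

The real fix is scaling on a metric that actually tracks request load instead of CPU, but that requires a metrics pipeline, which is a different fight.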

Third: Taints and Tolerations.
We have “special” nodes with NVMe drives for our databases. Someone forgot to add taints to them. So, a bunch of “Hello World” cronjobs from the marketing team’s experimental namespace got scheduled onto the high-performance database nodes, eating all the IOPS and starving the production DB.
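
The fix is two declarations of intent: taint the NVMe nodes so nothing lands there by accident, and give the database pods a matching toleration plus a nodeSelector so they still land there on purpose. A sketch; the workload=database key, value, and label are names I made up for illustration:

# Node spec fragment for the NVMe nodes (they also need the matching label, e.g. workload: database)
spec:
  taints:
  - key: workload
    value: database
    effect: NoSchedule

# Pod spec fragment for the database: tolerate the taint AND explicitly ask for those nodes
tolerations:
- key: workload
  operator: Equal
  value: database
  effect: NoSchedule
nodeSelector:
  workload: database

Remember that a toleration only permits scheduling onto tainted nodes; it does not attract pods to them, which is why the nodeSelector is there.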

I’m currently sitting in the office, the sun is coming up for the second time, and I’m looking at a “Strategy Document” from the architects about “Moving to a Multi-Cluster Service Mesh.”

I haven’t slept. I smell like stale coffee and failure. I have 400 tabs of kubectl logs open. And these people want to add more complexity? We can’t even get a readinessProbe right, and they want to implement Istio?

Here is the reality: Kubernetes 1.29 is a rock-solid platform. It’s the most stable, feature-rich version we’ve ever had. It has the tools to prevent every single thing that happened this weekend. But Kubernetes is a mirror. If your engineering culture is a mess, your cluster will be a mess. If your architects think they are too good to understand how a cgroup works, your production environment will stay down.

I’m going home. If PagerDuty goes off again because someone changed the imagePullPolicy to Always on a 5GB image, I’m throwing my phone into the river.

Resolution:
1. Re-implemented requests and limits across all namespaces.
2. Added startupProbes to all JVM-based services.
3. Fixed the ndots issue in the global dnsConfig.
4. Applied topologySpreadConstraints to ensure multi-AZ resilience.
5. Deleted the “Architect’s” write access to the production repository.

Status: Cluster stable. SRE unstable.


EOF – End of Report
