```text
$ kubectl get pods -n prod
NAME                               READY   STATUS             RESTARTS      AGE
api-gateway-v2-7f8d9b4c-xhq2z      0/1     CrashLoopBackOff   42 (3m ago)   14h
order-processor-66d5f4e3-99abc     0/1     OOMKilled          12 (1m ago)   14h
payment-service-55c2a1b0-zxy98     1/1     Running            0             14h
marketing-tracker-88f123a4-bbbbb   1/1     Running            0             14h

$ kubectl logs -f api-gateway-v2-7f8d9b4c-xhq2z --previous
{"level":"fatal","ts":1715432100.123,"caller":"main.go:45","msg":"failed to connect to redis","error":"dial tcp 10.96.0.15:6379: i/o timeout"}
{"level":"info","ts":1715432105.456,"msg":"Attempting reconnection... (Attempt 43)"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x8a2f31]

$ terraform plan
╷
│ Error: Error acquiring the state lock
│
│ Error message: conditional check failed
│ Lock Info:
│   ID:        6e2f1a3b-8c9d-4e5f-a1b2-c3d4e5f6a7b8
│   Path:      prod-infrastructure/terraform.tfstate
│   Operation: OperationTypePlan
│   Who:       jenkins-worker-04@ip-10-0-45-12
│   Version:   1.5.7
│   Created:   2024-05-11 02:14:05.123456 +0000 UTC
│   Info:
╵
```
The sun is coming up. Or maybe it’s going down. I can’t tell. The blue light from my triple-monitor setup has burned a permanent rectangular ghost into my retinas. My hands smell like stale coffee and the cheap mechanical keyboard grease that accumulates after 72 hours of frantic typing.
Three days ago, some "Growth Hacker" in a slim-fit suit decided we needed a "Flash Loyalty Reward Event" to coincide with a celebrity tweet. They didn't tell Engineering. They didn't tell SRE. They just pushed the "Go" button on a campaign that hit forty million users simultaneously.
And now, here I am. Staring at a terminal that’s screaming at me because our “elastic” infrastructure decided to snap like a dry twig. If this is what you call **DevOps best practices**, I’m moving to a farm. I’m done. I’m writing this because if I don’t, the next person who inherits this cluster will probably jump off the roof.
## The YAML Hell We Built for Ourselves
We’re running Kubernetes v1.29. It’s supposed to be the pinnacle of container orchestration. Instead, it’s a 1.5-million-line YAML suicide note. We’ve abstracted the infrastructure so far away from the hardware that nobody knows where the packets actually go anymore.
When the marketing spike hit, the `api-gateway` started failing its liveness probes. Why? Because the "DevOps Architects" decided that every pod needed fifteen sidecars for "observability," "security," and "service mesh magic." By the time a request actually hits the application code, it’s traveled through three proxies, been encrypted and decrypted four times, and had its headers bloated by 4KB of tracing metadata.
The `kubelet` on worker node `ip-10-0-112-4` decided it had enough. It went `NotReady`. Why? Because the `conntrack` table overflowed. We’re pushing so many tiny, useless UDP packets for "telemetry" that the kernel literally forgot how to talk to the network.
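If you want to confirm that particular failure mode yourself, this is roughly the triage I ran on the node over SSH. Paths assume a standard Linux kernel with the nf_conntrack module loaded; your node image may differ.

```bash
# Rough conntrack triage on a suspect worker node.
# Current number of tracked connections vs. the kernel's ceiling:
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# The kernel logs this message when the table overflows:
dmesg -T | grep -i "nf_conntrack: table full, dropping packet"

# Stopgap until someone turns off the telemetry firehose: raise the ceiling.
# This is a band-aid, not a fix, and it costs kernel memory.
sudo sysctl -w net.netfilter.nf_conntrack_max=262144
```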
Look at this absolute garbage. This is the "standard" deployment manifest for the service that died first.
```yaml
# This is the manifest that killed the cluster.
# "Optimized" by a consultant who left six months ago.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-processor
spec:
  replicas: 50 # Marketing said we'd need "scale"
  selector: # assuming the app label; the paste I grabbed didn't include it
    matchLabels:
      app: order-processor
  template:
    metadata:
      labels:
        app: order-processor
    spec:
      containers:
        - name: app
          image: internal-repo/order-processor:latest # Because versioning is for cowards
          resources:
            limits:
              cpu: "200m" # Not enough to actually process an order
              memory: "256Mi"
            requests:
              cpu: "100m"
              memory: "128Mi"
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 3
            periodSeconds: 5 # Let's hammer the app while it's struggling to boot
        - name: telemetry-sidecar
          image: telemetry-vendor/agent:v4.2.1
          resources:
            limits:
              cpu: "500m" # The sidecar gets more CPU than the app. Brilliant.
              memory: "512Mi"
```
I spent four hours yesterday just trying to figure out why the order-processor was OOMKilled. It turns out the “telemetry-sidecar” has a memory leak that triggers whenever the network latency exceeds 50ms. And since the network was saturated by the marketing spike, the sidecar ate all the node’s memory, the kubelet panicked, and the OOM Killer started executing random pods like a firing squad.
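If you want to see for yourself which container got shot, the last terminated state is recorded right on the pod. Something like this did the job (the pod name is from the capture at the top of this page; adjust it for whatever hash the ReplicaSet is using by the time you read this):

```bash
# Which container in the pod was OOMKilled, and why?
kubectl describe pod order-processor-66d5f4e3-99abc -n prod | grep -A 5 "Last State"

# Or pull the terminated-state reason for every container in one shot:
kubectl get pod order-processor-66d5f4e3-99abc -n prod \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.lastState.terminated.reason}{"\n"}{end}'
```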
## Terraform v1.5.7: State Drift is a Lifestyle
We use Terraform v1.5.7 to manage our AWS environment. “Infrastructure as Code,” they said. “It’ll be idempotent,” they said.
Lies. All of it.
When the outage started, I tried to scale the RDS instance. But I couldn’t. Why? Because someone (probably a “Full Stack” developer who thinks they know CloudFormation) manually changed the security groups in the AWS Console at 4:00 AM. Now the Terraform state has drifted so far it’s in a different zip code.
Every time I run `terraform plan`, it wants to destroy the production database and recreate it because of a “forced replacement” on a tag. A TAG.
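For what it’s worth, Terraform 1.5 can at least show you the drift without threatening to replace anything. This is roughly how I eventually reconciled the console cowboy’s handiwork; adjust the resource address to whatever your state actually calls the database:

```bash
# Show what changed out-of-band, without proposing any create/destroy actions.
terraform plan -refresh-only

# Accept the manual console edits into state (after reading the plan. Twice.)
terraform apply -refresh-only

# If one attribute (like that tag) still forces replacement, go stare at it:
terraform state show aws_db_instance.main
```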
Here is the HCL for the database that lives inside our “flexible” VPC module. It’s a masterclass in how to build a cage you can’t escape from.
# The "Flexible" VPC Module
resource "aws_db_instance" "main" {
allocated_storage = 100
engine = "postgres"
engine_version = "15.3"
instance_class = "db.m5.large"
name = "prod_db"
username = var.db_user
password = var.db_pass # Stored in plain text in the tfvars file. Kill me.
# This block is the reason I haven't slept.
# Someone hardcoded the subnet IDs instead of using data lookups.
replicate_source_db = null
vpc_security_group_ids = [
"sg-0a1b2c3d4e5f6g7h8",
"sg-9i0j1k2l3m4n5o6p7" # This SG was deleted manually 3 months ago.
]
lifecycle {
prevent_destroy = false # Why was this set to false in PROD?!
}
}
I had to manually edit the JSON state file using vim while my hands were shaking from too much caffeine. Do you know how terrifying it is to manually delete a dependency line in a 14MB Terraform state file while the CEO is screaming in a Slack “War Room” channel? Everyone talks about DevOps best practices until the database starts dropping connections and the state lock is held by a Jenkins job that crashed two hours ago.
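If you ever end up in the same spot: the lock in the error banner at the top of this page can be released without hand-editing anything, assuming you have confirmed the Jenkins job holding it is actually dead and not just slow:

```bash
# Release the stale state lock left behind by the crashed Jenkins worker.
# The ID comes straight from the "Error acquiring the state lock" output.
terraform force-unlock 6e2f1a3b-8c9d-4e5f-a1b2-c3d4e5f6a7b8
```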
## Why Your CI/CD Pipeline is a Rube Goldberg Machine
Our deployment pipeline is a nightmare of Jenkinsfiles, GitHub Actions, and “custom” Bash scripts that have more “if” statements than a choose-your-own-adventure novel.
We have this “automated” rollback feature. It’s supposed to detect a failure and revert the last commit. During the peak of the outage, the “automated” rollback triggered. But because the api-gateway was in a CrashLoopBackOff, the health check failed, which triggered another rollback.
The system entered a recursive loop of rolling back to versions of the code that didn’t even exist in the container registry anymore. I had to kill the Jenkins master just to make it stop.
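What that pipeline needed, and didn’t have, was a guard in front of the rollback step. Here’s a minimal sketch of one, assuming the pipeline shells out to kubectl; the deployment and image names are my guesses based on the pod names above, and the skopeo registry check is my addition, not something our Jenkinsfile actually does:

```bash
#!/bin/bash
# Sketch of a rollback guard: refuse to loop, and refuse to roll back
# to an image the registry has already garbage-collected.
set -euo pipefail

DEPLOYMENT="api-gateway-v2"
NAMESPACE="prod"
TARGET_IMAGE="internal-repo/api-gateway:${ROLLBACK_TAG:?set ROLLBACK_TAG}"
GUARD_FILE="/tmp/rollback-guard-${DEPLOYMENT}"

# One automated rollback per incident; after that, a human decides.
if [[ -f "${GUARD_FILE}" ]]; then
  echo "Rollback already attempted for ${DEPLOYMENT}. Stopping the loop." >&2
  exit 1
fi
touch "${GUARD_FILE}"

# Don't roll back to a version that no longer exists in the registry.
if ! skopeo inspect "docker://${TARGET_IMAGE}" > /dev/null 2>&1; then
  echo "Image ${TARGET_IMAGE} not found in registry. Refusing to roll back." >&2
  exit 1
fi

kubectl -n "${NAMESPACE}" rollout undo "deployment/${DEPLOYMENT}"
kubectl -n "${NAMESPACE}" rollout status "deployment/${DEPLOYMENT}" --timeout=120s
```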
Here is the “cleanup” script that runs after every failed deployment. It’s a crime against humanity.
```bash
#!/bin/bash
# "Temporary" fix for orphaned volumes. Added: Oct 2022.
# Still here. Still breaking things.
echo "Starting cleanup of orphaned resources..."

# Find all PVCs that are 'Pending' and delete them.
# What could go wrong?
kubectl get pvc --all-namespaces | grep Pending | awk '{print $2}' | xargs kubectl delete pvc

# Force delete pods that are stuck in Terminating
# Because we don't understand how finalizers work.
for pod in $(kubectl get pods --all-namespaces | grep Terminating | awk '{print $2}'); do
  kubectl delete pod $pod --grace-period=0 --force
done

# Check if the database is still alive.
# If not, just restart the whole node group.
# (Note: This actually happened during the outage).
if ! curl -s --connect-timeout 2 http://db-internal:5432; then
  echo "DB unreachable. Nuking the worker nodes."
  aws autoscaling terminate-instance-in-auto-scaling-group --instance-id $(curl -s http://169.254.169.254/latest/meta-data/instance-id)
fi
```
Look at that last part. If the database doesn’t respond to a curl on its port (which it won’t, because it’s Postgres, not an HTTP server), the script terminates the instance it’s running on. This script was written by a “Senior DevOps Engineer” who now works at a crypto startup. I spent two hours wondering why my SSH session kept dropping. It was because the system was literally committing suicide every five minutes.
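If anyone ever rewrites that check, Postgres ships a tool for exactly this. A minimal replacement for the curl line, assuming db-internal resolves to the RDS endpoint:

```bash
# pg_isready speaks the Postgres protocol instead of pretending the DB is a web server.
# Exit codes: 0 = accepting connections, 1 = rejecting, 2 = no response, 3 = bad invocation.
if ! pg_isready -h db-internal -p 5432 -t 2; then
  echo "DB unreachable. Page a human instead of terminating the node you are standing on." >&2
fi
```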
## Prometheus v2.45 and the Lie of “Observability”
We have dashboards. Oh, we have so many dashboards. We have Grafana boards that look like the flight deck of the Space Shuttle. We’re running Prometheus v2.45 with a sidecar for long-term storage.
But here’s the thing: when the system actually fails, the dashboards are the first thing to go.
Prometheus couldn’t scrape the targets because the service discovery was failing. The service discovery was failing because the Kubernetes API server was overloaded. The API server was overloaded because every single pod was trying to report its own death simultaneously.
I was flying blind. I had to use tcpdump on a worker node like it was 1998.
```bash
# Trying to find where the 504s are coming from
# while the world burns around me.
sudo tcpdump -i eth0 -A 'tcp port 80 and (dst host 10.0.0.1 or src host 10.0.0.1)' | grep "HTTP/1.1 504"
```
The “Observability” stack we spent $200k on last year told me everything was “Green” for the first twenty minutes of the outage because the metrics were cached. By the time the alerts fired, the database was already a smoking crater.
We followed the DevOps best-practices guide to the letter, and yet, here we are, staring at a 503 and a bunch of empty graphs. “Observability” is just a fancy word for “looking at the wreckage after the plane has already crashed.” It doesn’t prevent the crash. It just gives you a high-resolution video of the impact.
## The Marketing-Driven Death Spiral
The root cause wasn’t technical. It never is. The root cause was a Jira ticket titled “LTY-999: Implement Flash Rewards.” It was marked as “Low Effort” by a Product Manager who hasn’t seen a line of code since the Obama administration.
They wanted a “real-time” leaderboard. To do this, the developers decided to bypass the cache and query the primary database directly every time a user refreshed their profile page.
“It’s fine,” they said. “We’ll just use a read replica.”
But they didn’t use a read replica. They used the primary. And they didn’t use an index. They did a full table scan on a table with 500 million rows.
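I still don’t have the exact query (it’s buried somewhere in the leaderboard service), but the shape of the problem and the boring fix are not complicated. Table, column, and host names below are placeholders, not our schema:

```bash
# Confirm the sequential scan (hypothetical table/column names).
psql -h prod-db -U readonly -d prod_db \
  -c "EXPLAIN SELECT user_id, points FROM loyalty_events WHERE campaign_id = 999 ORDER BY points DESC LIMIT 100;"

# The boring fix: an index that matches the filter, built without locking writes.
psql -h prod-db -U admin -d prod_db \
  -c "CREATE INDEX CONCURRENTLY idx_loyalty_campaign_points ON loyalty_events (campaign_id, points DESC);"
```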
When the celebrity tweeted the link, 100,000 people clicked it. That’s 100,000 full table scans per second. The RDS instance’s CPU went from 5% to 100% in three seconds. The IOPS hit the burst limit, and the EBS volume throttled.
And because our “microservices” are all tightly coupled via synchronous REST calls (another “architectural” decision), the entire stack backed up. The order-processor waited for the user-service, which waited for the leaderboard-service, which was stuck waiting for the database.
Every single thread in the entire cluster was blocked. And that’s when the liveness probes started failing. Kubernetes, in its infinite “wisdom,” decided the pods were dead and started killing them. But the new pods couldn’t start because the database was still locked.
It was a circular firing squad of “cloud-native” technology.
## The “DevOps” Best Practices Myth
I’m tired of hearing about “DevOps.” In this company, “DevOps” just means “Operations people doing twice the work for the same pay while Developers get to play with new JS frameworks.”
We’re told to “Shift Left.” That just means I have to teach a 22-year-old how to write a Dockerfile so they don’t accidentally include their entire Downloads folder in the image. We’re told to “Automate Everything.” That just means I have to maintain 5,000 lines of Python scripts that glue together tools that were never meant to talk to each other.
Here’s a script I wrote at 4:30 AM this morning to try to recover the corrupted Redis cache. It’s ugly. It’s dangerous. It’s the opposite of “clean code.”
```python
import redis
import time

# This is a hack. If you are reading this, I am sorry.
# The marketing-tracker service flooded Redis with 10GB of
# session data that never expires. This script tries to
# delete keys matching the pattern 'sess:*' without
# killing the entire cluster.

r = redis.Redis(host='redis-prod', port=6379, db=0)

def emergency_cleanup():
    cursor = 0
    while True:
        # Scan instead of KEYS to avoid blocking the event loop.
        # Although at this point, the loop is already dead.
        cursor, keys = r.scan(cursor=cursor, match='sess:*', count=1000)
        if keys:
            r.delete(*keys)
            print(f"Deleted {len(keys)} keys...")
        # Sleep to give Redis a chance to breathe.
        # As if a piece of software can breathe.
        time.sleep(0.01)
        # SCAN returns cursor 0 when the iteration is complete.
        if cursor == 0:
            break

if __name__ == "__main__":
    print("Starting desperate recovery attempt...")
    try:
        emergency_cleanup()
    except Exception as e:
        print(f"Even the recovery script failed: {e}")
```
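In hindsight, redis-cli can do roughly the same thing from a shell, and UNLINK reclaims the memory in a background thread instead of blocking. A sketch, assuming redis-prod is reachable from wherever you’re standing:

```bash
# Scan for session keys and unlink them in batches.
# UNLINK (Redis >= 4.0) frees memory asynchronously, unlike DEL.
redis-cli -h redis-prod --scan --pattern 'sess:*' | xargs -r -L 500 redis-cli -h redis-prod unlink
```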
Your DevOps best-practices strategy is just a fancy way of saying “make the SREs fix it.” We build these incredibly complex systems so we can feel smart, but we forget that someone has to stay awake for three days when the “smart” system decides to eat itself.
## The Aftermath
The outage is “over.” I’ve patched the security groups. I’ve manually scaled the RDS instance to a db.r5.4xlarge, which is costing us $15 an hour. I’ve killed the recursive Jenkins jobs.
The Marketing team is celebrating. They got “record engagement.” They’re probably having champagne right now. They have no idea that the only reason the site is back up is because I spent six hours manually editing a Terraform state file and writing a Python script to delete millions of Redis keys.
They’ll do it again next week. They’ll find some new “engagement” metric to chase, and they’ll push another “Low Effort” ticket that bypasses the staging environment because “it’s just a minor change.”
And I’ll be here. Drinking cold coffee. Staring at `kubectl get pods -w`. Waiting for the pager to scream.
If you’re reading this leaked wiki page, do yourself a favor. If your company starts talking about “digital transformation” or “scaling their DevOps culture,” run. Go find a job where you manage a single Linux server in a closet. At least then, when it breaks, you know why.
I’m going home. I’m turning off my phone. If the cluster dies again, let it. I’ve given 72 hours of my life to a “Flash Loyalty Reward Event.” I have nothing left to give.
The YAML is still there. The state drift is still there. The “temporary” bash scripts are still there.
Everything is fine. Until it isn’t.
**Post-Mortem Action Items (that will never be done):**
1. Fix the Terraform state hardcoding (Assigned to: Nobody).
2. Implement actual rate limiting on the API (Status: “In Backlog” for 2 years).
3. Stop letting Marketing push to Prod (Status: Rejected by VP of Growth).
4. Buy the SRE team a bottle of Scotch (Status: Not in budget).
Now get out of my office. I need to sleep for a decade.