10 Essential DevOps Best Practices for Faster Delivery

POST-MORTEM: THE BLACK SUNDAY CATASTROPHE
TIMESTAMP: 2024-05-19 04:12:09 UTC
INCIDENT DURATION: 72 Hours, 14 Minutes
STATUS: Partially Restored / Total Mental Collapse
OPERATOR: Senior SRE (Employee #402, currently fueled by spite and expired Ibuprofen)


1. Summary of the Carnage

At 04:00 UTC on Sunday, while the rest of the world was sleeping or having a normal life, our entire production infrastructure decided to commit ritual suicide. This wasn’t a “glitch.” This wasn’t a “minor degradation.” This was a total, scorched-earth erasure of our primary and secondary regions.

The root cause? A combination of hubris, “ClickOps” by a junior dev who thought they knew better, and a Terraform state file that had become so corrupted it looked like a Rorschach test of pure incompetence. We lost the load balancers, the ingress controllers, and three-quarters of our persistent volumes in under twelve minutes. By the time the first PagerDuty alert hit my phone, the “devops best” practices we’ve been preaching for three years had been set on fire by the very people who wrote the wiki.

We spent the next three days in a Zoom bridge that felt less like a technical troubleshooting session and more like a hostage negotiation where the hostage was our sanity. If I hear the phrase “let’s circle back” one more time, I am going to throw my mechanical keyboard into the server rack.


2. The Terraform State of Emergency: v1.7.4 and the Death of Logic

It started with Terraform v1.7.4. We were supposed to be moving to a modular structure. Instead, someone—and I have the git blame logs to prove who—decided to manually edit the remote state file in S3 because the terraform plan was “taking too long.”

They bypassed the locking mechanism. They ignored the warnings. They forced a state push that left a null where our entire VPC CIDR block used to be. When the CI/CD pipeline triggered on a “hotfix” for a CSS change (don’t ask me why a CSS change triggers an infra run; that’s another circle of hell), Terraform looked at the state, looked at reality, and decided that reality shouldn’t exist.

$ terraform plan -out=tfplan
Error: Failed to marshal state to json: 
json: error calling MarshalJSON for type *states.SyncState: 
invalid memory address or nil pointer dereference

$ terraform apply "tfplan"
Error: State lock error: 
ID: 4e5f-8821-9901-bc42
Info: Operation "Plan" already has a lock for customer-prod-vpc.
The state file is corrupted beyond local repair. 
Manual intervention required. 
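
For the record, the legitimate version of “manual intervention” is about four commands, none of which involve opening the state file in a text editor. A rough sketch (the lock ID is the one from the error above; everything else is generic):

$ terraform state pull > state-backup-$(date +%s).json   # snapshot the remote state before touching anything
$ terraform force-unlock 4e5f-8821-9901-bc42             # release the stale lock, not the safety rails
$ terraform state list                                   # see what Terraform still believes exists
$ terraform plan                                         # read it; do not apply a plan that is wall-to-wall destroys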

The “manual intervention” turned into a bloodbath. Because the state no longer described anything that actually existed, Terraform decided every resource—every RDS instance, every S3 bucket, every EKS node group—was orphaned and needed to be purged. By the time we killed the runner, the database was gone. Not stopped. Gone. Deleted from the AWS console like it never existed.
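
A few lines of HCL would have made that last part impossible. A minimal sketch, assuming the database is declared as something like aws_db_instance.primary (the name here is illustrative):

resource "aws_db_instance" "primary" {
  # ... existing arguments unchanged ...
  deletion_protection = true   # AWS refuses the delete call outright

  lifecycle {
    prevent_destroy = true     # Terraform refuses to even plan the destroy
  }
}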


3. Kubernetes v1.29.1: The OOMKiller’s Banquet

With the underlying network shredded, the Kubernetes clusters (v1.29.1) began to thrash. Without a VPC to call home, the nodes went NotReady. But the control plane—bless its heart—tried to keep going. It started rescheduling pods onto non-existent nodes, creating a recursive loop of scheduling failures that spiked the API server’s CPU to 400%.
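
Confirming that a cluster is eating itself takes two commands, not a war room. A sketch, with nothing specific to our setup:

$ kubectl get nodes --no-headers | awk '$2 != "Ready"'       # that morning: every node we had
$ kubectl get pods -A --field-selector=status.phase=Pending --no-headers | wc -l   # the scheduler backlog, in pods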

We tried to recover using Helm v3.14.0 to redeploy the core services, but since the ConfigMaps and Secrets were tied to the now-deleted RDS endpoints, every single pod entered a CrashLoopBackOff.

$ kubectl describe pod api-gateway-7f8d9b6c5-xf2k1 -n production
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  2m                 default-scheduler  Successfully assigned api-gateway to ip-10-0-42-12
  Warning  Failed     90s                kubelet            Error: secret "db-credentials" not found
  Warning  BackOff    30s (x5 over 2m)   kubelet            Back-off restarting failed container
  StandardError: 
  [FATAL] 2024-05-19T04:22:11Z: Could not connect to database at 10.0.128.4:5432. 
  Connection refused. Kernel OOMKiller has terminated process 1422 (java).
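
Once there was a database to point at again, the redeploy itself is one Helm command. A sketch, assuming the chart exposes the endpoint as a value called database.host (yours will be called something else):

$ helm upgrade --install api-gateway ./charts/api-gateway -n production \
    --set database.host=<new-rds-endpoint> \
    --set database.port=5432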

The OOMKiller wasn’t just killing pods; it was hunting them. Because the junior dev had stripped out the resource limits in the values.yaml to “improve performance” (I’m still shaking as I write this), a single failing Java service managed to consume 64GB of RAM on a worker node before the kernel finally put it out of its misery. This took down the logging agent, which meant that, for four hours, we were flying blind in a hurricane.
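
The fix for that one is boring and it lives in values.yaml. A sketch, assuming the chart passes a standard resources block straight through to the pod spec (key names depend on the chart):

resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    cpu: "2"
    memory: "2Gi"   # and the JVM gets -Xmx or -XX:MaxRAMPercentage safely below this ceiling

A limit doesn’t make the leak go away; it just means the OOMKiller executes one container instead of the whole node. That is the trade we want.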


4. Jenkins v2.440.1: The CI/CD Pipeline to Nowhere

Our Jenkins instance (v2.440.1) is a Frankenstein’s monster of Groovy scripts and prayers. During the outage, the Jenkins runners—which were running as pods in the same cluster that was dying—began to fail-over to a “backup” node that hadn’t been patched since the Obama administration.

The Jenkinsfile tried to execute a kubectl apply, but the credentials had expired because the Vault sidecar couldn’t reach the Vault server because the Vault server was sitting on a volume that had been unceremoniously detached by the Terraform collapse.

[Pipeline] { (Deploy to Prod)
[sh] + kubectl apply -f k8s/deployment.yaml
error: You must be logged in to the server (Unauthorized)
[Pipeline] }
[ERROR] java.io.IOException: Failed to execute shell script. 
Exit code: 1. 
Runner 'jenkins-agent-001' has disconnected unexpectedly.
Check stderr for details: 
  /var/lib/jenkins/workspace/prod-deploy@tmp/durable-7721/script.sh: line 1: 
  142 Segmentation fault (core dumped) kubectl apply -f k8s/deployment.yaml

This is where the “devops best” practices really shine. We have “automated” everything to the point where no one knows how to do anything manually. When the automation breaks, the team stands around like NPCs with broken pathfinding. I had to manually SSH into a bastion host—which I had to recreate from an AMI because the original was deleted—and hand-roll kubeconfig files just to see the extent of the damage.
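
For anyone who has never had to do it: hand-rolling a kubeconfig is character-building, but the AWS CLI will write the file for you, assuming the control plane still answers and the cluster is named something like customer-prod (illustrative):

$ aws eks update-kubeconfig --name customer-prod --region us-east-1 --kubeconfig ./rescue.kubeconfig
$ kubectl --kubeconfig ./rescue.kubeconfig get nodes -o wide   # now you can at least see the wreckage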


5. The ClickOps Sin: Why We Can’t Have Nice Things

In the middle of the crisis, someone decided that the fastest way to get the site back up was to go into the AWS Console and start clicking buttons. They manually recreated the Load Balancer. They manually added listeners. They manually attached security groups.

This is ClickOps. This is the eighth deadly sin.

By doing this, they created a “drift” so massive that when we finally got Terraform back online, the terraform plan wanted to destroy everything they had just built because it wasn’t in the code. We spent six hours just trying to import manually created resources back into the state file.
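
Those six hours looked roughly like this, one resource at a time (addresses and IDs below are illustrative, not ours):

$ terraform import aws_lb.clickops arn:aws:elasticloadbalancing:us-east-1:111122223333:loadbalancer/app/clickops-lb/50dc6c495c0c9188
$ terraform import aws_security_group.clickops_sg sg-0a1b2c3d4e5f67890
$ terraform plan   # repeat until the plan stops trying to destroy whatever is currently serving traffic

And every single import first needs a matching resource block written in the .tf files, which is rather the point.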

If you are an engineer and you find yourself clicking “Create” in a web console during an outage without a senior SRE breathing down your neck, you aren’t helping. You are adding more fuel to the fire. You are creating a shadow infrastructure that will haunt us for months. We found three different “Test-LB-Don’t-Delete” instances that were costing us $400 an hour and doing absolutely nothing but routing traffic to a 404 page.


6. The “Three Ways” as a Tourniquet

People love to talk about the “Three Ways” of DevOps in their cozy LinkedIn posts. Let me tell you what they look like when the servers are melting.

The First Way: Flow.
Flow isn’t about how fast you can push code. It’s about how fast you can recover. Our “flow” was blocked by a 40-minute CI/CD pipeline that required five manual approvals. During a catastrophic outage, that’s not a process; it’s a death sentence. We had to bypass the entire pipeline and run scripts from our local machines like it was 2005. The “flow” was a trickle of desperate commands sent over a laggy VPN.

The Second Way: Feedback.
Feedback is only useful if it’s actionable. My feedback was 4,000 Slack notifications from a bot that just said “Service Down.” No context. No trace IDs. No logs. Just noise. We had to silence the alerts just to hear ourselves think. Real feedback would have been a dashboard that showed the exact moment the Terraform state was corrupted, but we had “optimized” our monitoring to save on Datadog costs.
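
For contrast, this is roughly what an actionable alert looks like as a Prometheus rule. A sketch that assumes a stock http_requests_total metric from the gateway; your metric and label names will differ:

groups:
  - name: actionable-alerts
    rules:
      - alert: ApiGateway5xxRatioHigh
        expr: |
          sum(rate(http_requests_total{namespace="production", job="api-gateway", code=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{namespace="production", job="api-gateway"}[5m])) > 0.05
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "api-gateway in production: more than 5% of requests are 5xx"
          description: "Check RDS connectivity first; current failure ratio is {{ $value | humanizePercentage }}."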

The Third Way: Continuous Learning.
The only thing we learned is that we are one “minor update” away from total extinction. Continuous learning in this context means writing this post-mortem so that the next time someone suggests editing a state file manually, I can point to this document and legally be allowed to slap them.


7. The Manifesto: Reality vs. “DevOps Best”

We need to stop lying to ourselves. We talk about “devops best” practices as if they are a holy scripture that will protect us. They won’t. A “best practice” is just a suggestion until the database is deleted.

True SRE work isn’t about writing YAML. It’s about understanding the underlying systems. It’s about knowing that when you run helm install, there are a thousand things that can go wrong between your terminal and the Kubelet. It’s about knowing that “serverless” just means “someone else’s servers that you can’t SSH into when they break.”

We have built a “tapestry”—wait, I can’t use that word—we have built a complex, fragile web of dependencies that no single person fully understands. We use tools like Kubernetes v1.29.1 to solve scaling problems we don’t even have, and in doing so, we create complexity problems that we can’t solve.

The “devops best” way to handle this isn’t more tools. It’s less. It’s simpler configurations. It’s harder resource limits. It’s a refusal to accept “it works on my machine” as a valid excuse. If it doesn’t work in a clean, reproducible environment with zero manual intervention, it doesn’t work. Period.


8. The Burn List: Never Again

I am tired. I have been awake for 72 hours. My eyes feel like they’ve been rubbed with sandpaper and my soul is currently 404 Not Found. Before I go to sleep for the next two days, here is the Burn List. If I see any of these in a PR, I will reject it without comment.

  1. Manual State Edits: If you touch a .tfstate file with a text editor, you are dead to me. Use terraform import or terraform state rm. If you don’t know how, learn.
  2. Missing Resource Limits: Every container gets a resources.limits.cpu and resources.limits.memory. No exceptions. I don’t care if it’s a “small utility.” Small utilities are the ones that leak memory and kill the node.
  3. ClickOps: The AWS Console is for viewing, not for doing. If it isn’t in code, it doesn’t exist. If you create a resource manually, I will delete it. I don’t care if it’s “fixing” production.
  4. Vague Alerts: “Service is down” is not an alert. “Service X in Namespace Y is failing 5xx probes due to Connection Timeout to RDS” is an alert. Fix your Prometheus queries.
  5. Jenkins Plugins: We are freezing the Jenkins plugin list. No more “cool new UI” plugins. Every plugin is a security vulnerability and a potential point of failure.
  6. Hardcoded Secrets: If I find a plain-text password in a repo, I am revoking your git access. We have Vault. Use it (there is a two-command example after this list).
  7. Ignoring the “Three Ways”: If you don’t understand Flow, Feedback, and Learning, you aren’t a DevOps engineer; you’re just a sysadmin with a fancy title and a worse work-life balance.
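
On point 6, because apparently it needs spelling out: getting a secret into and out of Vault is two commands. A sketch with an illustrative path, assuming the KV v2 engine is mounted at secret/:

$ vault kv put secret/production/api-gateway db_password="$(openssl rand -base64 24)"
$ vault kv get -field=db_password secret/production/api-gateway

The pods then read it through the Vault sidecar, which, as covered above, only works when the Vault server’s volume still exists.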

The Black Sunday Outage was avoidable. It was the result of a thousand small shortcuts that eventually added up to a cliff. We fell off that cliff. We are currently climbing back up, bruised and bleeding.

Next time, we follow the code. Next time, we respect the state. Next time, I’m turning my phone off.

End of Report.
Go away.
