What is DevOps? Definition, Benefits, and Best Practices

TIMESTAMP: 03:14 AM. The pagers started screaming, and that’s when I realized nobody in this building can actually answer “what is devops.”

2024-05-22 03:14:02.112 UTC [14209] LOG:  process 14209 still waiting for AccessExclusiveLock on relation 16402 of database 16384 after 1000.045 ms
2024-05-22 03:14:05.881 UTC [14209] ERROR:  deadlock detected
2024-05-22 03:14:05.881 UTC [14209] DETAIL:  Process 14209 waits for AccessExclusiveLock on relation 16402 of database 16384; blocked by process 14311.
Process 14311 waits for ShareLock on relation 16408 of database 16384; blocked by process 14209.
2024-05-22 03:14:10.001 UTC [SRE_HELL] ALERT: Database CPU Utilization @ 100.00%

My eyes are vibrating. I’ve been awake for forty-eight hours, fueled by lukewarm espresso and the bitter realization that our “cloud-native” architecture is actually just three raccoons in a trench coat holding a soldering iron. The Postgres cluster didn’t just fail; it committed ritual suicide because some “Full Stack Ninja” decided that taking an ACCESS EXCLUSIVE lock on the users table during peak traffic was the best way to run a migration.

Welcome to the autopsy. Grab a coffee. It’s going to be a long night.

The 3 AM Incident Report: Death by a Thousand YAMLs

The first thing I saw when I logged into the cluster was a wall of red. Kubernetes v1.29.1 is a marvel of engineering, but it doesn’t mean a damn thing when your liveness probes are configured by someone who thinks “high availability” is a setting you toggle in a UI.
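
For the record, a liveness probe does not have to be a murder weapon. Here is a minimal sketch of what sane probes could look like, assuming the gateway actually serves /healthz on 8080 and needs up to a minute to boot; both assumptions are mine, because nobody documented either.

# Hypothetical excerpt from a Deployment spec; not what is currently in the cluster
    containers:
    - name: api-gateway
      ports:
      - containerPort: 8080
      startupProbe:              # give a slow boot up to 60s before liveness gets a vote
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 5
        failureThreshold: 12
      livenessProbe:             # only restart a container that came up and then wedged
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
        failureThreshold: 3
      readinessProbe:            # pull the pod out of the Service instead of executing it
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 5

Separate “is it alive” from “is it ready” and half of the CrashLoopBackOff theater disappears. We did not separate them.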

I ran kubectl get pods -n prod and watched the terminal flicker like a dying neon sign.

NAME                                     READY   STATUS             RESTARTS         AGE
api-gateway-7f8d9b6c5-2wzxl              0/1     CrashLoopBackOff   42 (3m ago)      14h
order-processor-5d4f3e2a1-m9qrs          0/1     OOMKilled          12 (1m ago)      14h
payment-service-8b7c6d5e4-p0onk          1/1     Running            0                14h
user-profile-v2-6c5b4a3d2-x7y6z          0/1     Pending            0                2m

The order-processor was hitting its memory limit because of a memory leak in a Node.js library that hasn’t been updated since the Obama administration. But the real kicker was the api-gateway. I ran a describe to see why it was looping.

$ kubectl describe pod api-gateway-7f8d9b6c5-2wzxl -n prod

Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  15m                default-scheduler  Successfully assigned prod/api-gateway-7f8d9b6c5-2wzxl to node-04
  Normal   Pulled     14m (x5 over 15m)  kubelet            Container image "internal-reg:5000/api-gw:latest" already present on machine
  Normal   Created    14m (x5 over 15m)  kubelet            Created container api-gateway
  Normal   Started    14m (x5 over 15m)  kubelet            Started container api-gateway
  Warning  Unhealthy  13m (x20 over 15m) kubelet            Liveness probe failed: Get "http://10.244.1.45:8080/healthz": dial tcp 10.244.1.45:8080: connect: connection refused

The “latest” tag. I want to find whoever pushed to latest and sentence them to a lifetime of writing COBOL on a tablet. We have no traceability. We have no rollback path. We have a “DevOps pipeline” that is actually just a Jenkins job running a shell script that someone wrote in 2019 and then quit.
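
If I get one wish out of this postmortem, it is an image reference that never moves. A sketch of the idea, with a tag I am inventing because nobody stamped this build with anything:

# Hypothetical fix: an immutable, build-stamped tag instead of :latest
    containers:
    - name: api-gateway
      image: internal-reg:5000/api-gw:1.8.3-build.417   # invented tag; the point is that it never changes
      imagePullPolicy: IfNotPresent                     # and nothing gets silently re-pulled under the same name

Pin by digest if you want to be properly paranoid. Either way, kubectl rollout undo starts pointing at a real artifact instead of at whatever “latest” meant fourteen hours ago.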

I checked journalctl -xe on the node. The kernel was screaming.

May 22 03:16:10 node-04 kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=kubepods-burstable-pod7f8d9b6c5.slice,mems_allowed=0,oom_memcg=/kubepods-burstable-pod7f8d9b6c5.slice,task_memcg=/kubepods-burstable-pod7f8d9b6c5.slice/api-gateway,task=node,pid=28441,uid=1000
May 22 03:16:10 node-04 kernel: Memory cgroup out of memory: Killed process 28441 (node) total-vm:2456780kB, anon-rss:1890432kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:4820kB oom_score_adj:985

This is the reality. It’s not a “vibrant ecosystem” of tools. It’s a dark room where you’re trying to figure out why a cgroup is murdering your process because a developer thought that --max-old-space-size was just a suggestion.
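
If you are going to run Node inside a memory-capped cgroup, at least tell V8 where the cap is. A minimal sketch, with numbers I am making up for illustration: set the heap ceiling comfortably below the container limit, because old space is not the whole process.

# Hypothetical resources block: let Node hit its own heap limit before the kernel hits it with SIGKILL
    containers:
    - name: api-gateway
      env:
      - name: NODE_OPTIONS
        value: "--max-old-space-size=1536"   # MB; leaves headroom under the 2Gi limit below
      resources:
        requests:
          memory: "2Gi"
          cpu: "250m"
        limits:
          memory: "2Gi"

The leak is still a leak, but at least the process dies inside its own heap limit, with a stack trace, instead of being SIGKILLed mid-request by the OOM-killer.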

The Wall of Confusion: Throwing Fire Over the Fence

We were told that DevOps would break down the silos. They lied. All we did was move the silos into a Slack channel where developers ignore my @channel pings and I ignore their “why is the staging environment slow” tickets.

The “Wall of Confusion” hasn’t been torn down; it’s just been reinforced with automated tooling that nobody understands. The developers still “throw code over the fence.” The only difference is that now the fence is a CI/CD pipeline that fails 40% of the time because of an expired TLS certificate (on a box still pinned to OpenSSL 3.0.13) that I’ve told the security team about three times this month.

I remember the meeting six months ago. The CTO sat there, sipping a $9 oat milk latte, and asked me with a straight face, “what is devops anyway? Is it just Jira but faster?”

I didn’t have the heart to tell him that for this company, DevOps is the practice of automating our own demise. It’s the process of taking a manual, error-prone deployment and turning it into a high-speed, automated disaster. He wanted a definition he could put on a slide for the board. I wanted him to understand that if we didn’t stop treating the operations team like a janitorial service, the whole thing would burn.

He didn’t listen. He bought a license for a “DevOps Platform” that costs more than my house and provides “single-pane-of-glass visibility” into a dumpster fire.

The developers think DevOps means they have sudo access in production. The managers think DevOps means we don’t need a QA team anymore. I think DevOps is the psychological trauma of knowing exactly how fragile the world is because I’m the one holding the duct tape.

The Jenkins Nightmare: A Pipeline Built on Hope and Duct Tape

Let’s talk about the pipeline. Our Jenkins instance (v2.440.2) is a museum of bad decisions. It has 142 plugins, 30 of which have critical security vulnerabilities, and 10 of which are incompatible with each other but somehow still function through sheer spite.

The failure that triggered this 48-hour nightmare started in a Groovy script. We use Terraform v1.7.4 to manage our infrastructure, but because we “don’t have time” to do it right, the state file is stored in an S3 bucket with no locking mechanism.

A developer triggered a build. The pipeline, in its infinite wisdom, decided that since the previous build had timed out, it should “clean up” the environment.

// The line that killed the world
sh "terraform destroy -auto-approve" 

Except the logic for the conditional check was flawed. It didn’t check if it was in the dev workspace. It just saw a non-zero exit code from a previous npm install and decided the entire production VPC needed to cease existing.

I spent four hours manually reconstructing the state file from S3 versioning because the “automated backups” were actually just a cron job that had been failing since March because the disk was full.

error: failed to load state: State restore failed: 
  - 2 errors occurred:
    * state file is corrupted: invalid character ' ' looking for beginning of value
    * check your S3 permissions

This isn’t “automation.” This is a Rube Goldberg machine where the final step is a hammer hitting a glass bottle of nitroglycerin. We call it “Continuous Integration,” but there is nothing continuous about it except the anxiety. We “do DevOps” by checking boxes on a maturity model while our actual infrastructure is a series of Bash scripts held together by hope and the fact that the primary database hasn’t had a hardware failure yet.

Observability vs. Monitoring: The Lie of the Dashboard

Our “Monitoring” setup is a Grafana dashboard that is 90% green, even when the site is down. Why? Because we monitor the wrong things. We monitor “CPU usage” and “Disk I/O,” but we don’t monitor “Can a user actually log in?”
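
The alert I actually want is “can a user log in,” checked end to end. A sketch of what that could look like, assuming a blackbox_exporter probe hitting the login endpoint and exporting the standard probe_success metric; the job name and threshold are mine, not anything that exists in our Prometheus today.

# Hypothetical Prometheus alerting rule: page on the user journey, not on the CPU graph
groups:
- name: user-facing
  rules:
  - alert: LoginProbeFailing
    expr: avg_over_time(probe_success{job="blackbox-login"}[5m]) < 0.9
    for: 2m
    labels:
      severity: page
    annotations:
      summary: "Synthetic login is failing, no matter how green the dashboard looks"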

I spent an hour staring at Prometheus, trying to write a PromQL query that would tell me why the payment-service was so slow.

sum(rate(http_request_duration_seconds_count{job="payment-service", status=~"5.."}[5m])) 
/ 
sum(rate(http_request_duration_seconds_count{job="payment-service"}[5m])) > 0.05

The query returned nothing. Why? Because the payment-service wasn’t throwing 500s. It was just hanging. It was accepting connections and then doing nothing, effectively black-holing every request. The “Monitoring” said everything was fine because the process was “Running.”
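
Error-rate math is useless when the failure mode is “accepts the connection and goes catatonic.” What would have caught this is a latency signal. Something like the rule below, assuming the service exports the http_request_duration_seconds histogram that the counter above belongs to; the threshold is my guess, not policy.

# Hypothetical latency rule: catches the slow-death case the error-rate query above cannot
groups:
- name: payment-latency
  rules:
  - alert: PaymentServiceSlow
    expr: |
      histogram_quantile(0.99,
        sum(rate(http_request_duration_seconds_bucket{job="payment-service"}[5m])) by (le)
      ) > 1
    for: 5m
    labels:
      severity: page
    annotations:
      summary: "payment-service p99 above 1s; Running is not the same as working"

For the true black-hole case, where requests never complete at all and never land in a histogram bucket, you also want a gauge of in-flight requests. But that means someone has to export one.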

This is the difference between monitoring and observability. Monitoring tells you that the system is broken. Observability—if we actually had it—would tell you why it’s broken without you having to SSH into a container and run gdb like it’s 1995.

We have “logs,” but they’re scattered across three different systems. We have “traces,” but they stop at the load balancer because someone forgot to propagate the headers in the Go rewrite.

When I finally found the issue, it was a classic: a hardcoded timeout in a sidecar proxy that was shorter than the database’s own internal timeout. The system was essentially DDOSing itself. Every time the database got slightly slow, the proxy would kill the connection, the application would retry instantly, and the database would get even slower.

# The silent killer in the service mesh config
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: payment-service
spec:
  hosts:
  - payment-service
  http:
  - route:
    - destination:
        host: payment-service
    timeout: 0.5s  # Database takes 0.6s under load. Good job, team.
    retries:
      attempts: 10
      perTryTimeout: 0.1s

Ten retries. Every single failed request turned into ten more requests. It was a recursive loop of failure, a digital Ouroboros eating its own tail until the CPU hit 100% and the kernel started swinging the OOM-killer axe.
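
The fix, whenever someone lets me merge it, is embarrassingly small. Something in this shape (the exact numbers are my judgment call, not gospel): make the proxy more patient than the database’s worst case, cap the retries, and only retry the failures that are actually safe to retry.

# The same VirtualService, minus the self-inflicted DDoS
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: payment-service
spec:
  hosts:
  - payment-service
  http:
  - route:
    - destination:
        host: payment-service
    timeout: 5s                  # comfortably above the database's worst observed latency
    retries:
      attempts: 2                # two retries, not a firing squad
      perTryTimeout: 2s
      retryOn: connect-failure,refused-stream,503   # do not retry requests that merely ran long

The retryOn list matters as much as the count: retrying a request that merely timed out is how you build your own load generator.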

The Myth of the “DevOps Engineer” Job Title

There is no such thing as a “DevOps Engineer.” You can’t be a “Culture Engineer.” But HR doesn’t care. They want a unicorn who knows Kubernetes, AWS, Python, React, Jenkins, Terraform, and can also perform open-heart surgery on a legacy Oracle DB.

By labeling us “DevOps Engineers,” the organization has successfully offloaded the responsibility of quality from the developers to a small group of overworked SREs. If the code is bad, it’s an “Ops problem.” If the deployment fails, it’s an “Ops problem.”

I read the “State of DevOps Report” every year. I see the metrics: Deployment Frequency, Lead Time for Changes, Mean Time to Recovery (MTTR). They look great on paper. But they don’t capture the human cost. They don’t capture the fact that our “Deployment Frequency” is high only because we have to push five hotfixes for every one feature. They don’t capture the fact that our MTTR is low only because I’ve stopped sleeping and just live in the terminal.

The report says “High Performers” have a culture of trust. I don’t trust anyone. I don’t trust the code, I don’t trust the “automated” tests that have been skipped in the codebase since 2022, and I certainly don’t trust the “Self-Healing” capabilities of a cloud provider that has a major outage every time a squirrel chews through a fiber optic cable in Virginia.

We are told that DevOps is about “Empathy.” Where is the empathy for the person who gets paged at 3:00 AM because a “Senior Developer” didn’t want to write a migration script and decided to just “wing it” in the production console?

Why Automation is a Double-Edged Sword

We were promised that automation would set us free. Instead, it just made the failures more spectacular.

In the old days, if I made a mistake, I broke one server. I could fix one server. Today, if I make a mistake in a Terraform module, I can delete an entire region in forty-five seconds. I have built a machine that can destroy my career at the speed of light.

Look at this snippet from our “Infrastructure as Code” repository. It’s a masterpiece of technical debt.

# terraform/modules/vpc/main.tf (v1.7.4)
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  # "Temporary" fix added 2 years ago
  route {
    cidr_block     = "0.0.0.0/0"
    # NAT gateways and internet gateways take different arguments,
    # so the hack needs two conditionals instead of one
    nat_gateway_id = var.is_emergency ? aws_nat_gateway.emergency[0].id : null
    gateway_id     = var.is_emergency ? null : aws_internet_gateway.main.id
  }
}

The is_emergency flag. It was meant for a one-time migration. Now, it’s baked into every environment. Nobody knows what happens if we set it to false. The last person who tried it caused a routing loop that took down the VPN. So we leave it. We leave the rot. We automate the rot.

This is the “DevOps” reality. It’s not a “journey” toward excellence. It’s a desperate scramble to keep the technical debt from collapsing under its own weight. We use tools like Ansible and Chef to manage configurations that are so complex that no single human brain can hold the entire state. We use Docker to package up “it works on my machine” and ship it to production, where it promptly stops working because “my machine” wasn’t running under a restricted securityContext.
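
For the curious: “restricted” is not mystical. It mostly means the pod runs like a grown-up: non-root, no privilege escalation, every capability dropped, a default seccomp profile, and, while you are at it, a read-only root filesystem. A sketch of the securityContext that “my machine” never had to satisfy; the values are the usual hardening defaults, not copied from any manifest of ours.

# Hypothetical pod hardening in the spirit of the "restricted" Pod Security Standard
    securityContext:
      runAsNonRoot: true
      runAsUser: 10001
      seccompProfile:
        type: RuntimeDefault
    containers:
    - name: app
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true   # not required by the profile, but it finds every app that writes to /
        capabilities:
          drop: ["ALL"]

This is exactly the kind of thing “works on my machine” never exercises, which is why it detonates at deploy time instead of in code review.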

The Autopsy Conclusion: The Smell of Ozone and Failure

As I sit here, watching the last of the pods transition to Running, I realize that we haven’t solved anything. We’ve just moved the pain.

The database is back up, but the data is inconsistent. We’ll be spending the next week running manual reconciliation scripts. The “DevOps Pipeline” is still broken, but we’ve patched it with a sleep 30 command to “let the network settle.”

The CTO sent an email at 05:00 AM. “Great job on the recovery, team! Let’s have a ‘Lunch and Learn’ on Friday to discuss how we can use AI to prevent this next time.”

I want to scream. AI won’t fix a deadlock caused by a lack of basic database theory. AI won’t fix a culture where “shipping fast” is the only metric that matters. AI won’t answer “what is devops” in a way that actually results in me getting a full night’s sleep.

DevOps is not a tool. It’s not a title. It’s the uncomfortable, grimy, exhausting work of taking responsibility for the entire lifecycle of a system. It’s the realization that if you build it, you run it—and if you run it, you’re going to get burned.

The smell of ozone isn’t just from a short circuit in a server rack; it’s the smell of a team burning out. It’s the smell of a “DevOps Transformation” that forgot about the people and focused only on the YAML.

I’m going home. I’m turning off my phone. If the site goes down again, let it stay down. I’ve done my time in the trenches, and I’ve realized that the only “seamless” thing in this entire industry is the way we transition from one disaster to the next.

TIMESTAMP: 07:42 AM.
STATUS: System "Stable" (for now).
SRE STATUS: Depleted.
LOG OFF.
