```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ExplicitDenyAllExceptVPC",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::critical-prod-assets/*",
        "arn:aws:s3:::critical-prod-assets"
      ],
      "Condition": {
        "StringNotEquals": {
          "aws:SourceVpc": "vpc-0a1b2c3d4e5f6g7h8"
        },
        "Bool": {
          "aws:PrincipalIsAWSService": "false"
        }
      }
    },
    {
      "Sid": "AllowAppRoleAccess",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:role/application-server-role"
      },
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::critical-prod-assets/*"
    }
  ]
}
```
The snippet above is why I haven't slept. Someone—probably a "Senior Architect" who hasn't touched a terminal since 2018—decided to enforce VPC-only access to our primary S3 bucket. They forgot that the `Deny` evaluation logic in **AWS** is absolute. It doesn't matter if the `Allow` block is there. It doesn't matter if the IAM role has `AdministratorAccess`. The moment that `StringNotEquals` condition matched because a Lambda function was executing outside the VPC or a CloudFront distribution tried to fetch an origin object, the entire frontend went dark. 403 Forbidden. Everywhere.
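For anyone unlucky enough to inherit a policy like this: the only way out is a principal the Deny doesn't catch (in our case, a session inside that VPC). A rough break-glass sketch, not a recommendation:

```bash
# snapshot the current policy before touching anything
aws s3api get-bucket-policy \
  --bucket critical-prod-assets \
  --query Policy --output text > bucket-policy.backup.json

# then, from a principal the Deny doesn't apply to (e.g. a session inside the VPC),
# pull the policy entirely and re-apply a sane one later
aws s3api delete-bucket-policy --bucket critical-prod-assets
```

The longer-term fix is presumably to carve the application role and the relevant service principals out of the Deny condition instead of denying the world, but that's a conversation for the RCA.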
I’ve been staring at this for 72 hours. My eyes feel like they’ve been rubbed with sandpaper. The bridge call has 45 people on it, 40 of whom are "Project Managers" asking for an ETA every six minutes.
## Ticket #9902: The Cascading Failure of the "Cost-Optimized" Tier
The incident started at 03:00 UTC. We were running our worker nodes on `t3.medium` instances. Marketing decided to launch a "flash sale" without telling Engineering. The burst credits on those `t3` instances? Gone in fifteen minutes. When a `t3` runs out of credits, it doesn't just stop; it throttles you to a baseline performance that is essentially a digital paperweight.
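In hindsight, the one-liner that would have bought us breathing room is flipping the fleet to unlimited credit mode; you pay for surplus credits instead of getting throttled. A sketch, per instance:

```bash
# let a burstable instance keep bursting past its baseline (billed as surplus credits)
aws ec2 modify-instance-credit-specification \
  --instance-credit-specifications "InstanceId=i-049f8234567890abcdef,CpuCredits=unlimited"
```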
I tried to pull the metrics. The console was timing out because the control plane was under heavy load. I had to drop to the CLI.
```bash
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUCreditBalance \
  --dimensions Name=InstanceId,Value=i-049f8234567890abcdef \
  --start-time 2023-10-24T03:00:00Z \
  --end-time 2023-10-24T04:00:00Z \
  --period 300 \
  --statistics Average
```
The output was a string of zeros. The nodes were pinned at 100% CPU utilization but were only actually processing at 20% of their capacity because of the credit exhaustion. The Auto Scaling Group (ASG) saw the 100% CPU and tried to spin up more instances. But because the existing instances were so throttled, they couldn't even finish their `cloud-init` scripts to report as `InService`.
We had a graveyard of “zombie instances” that were costing us money but doing zero work. The AWS Auto Scaling logic kept killing them for failing health checks, then starting new ones, which then failed. A death spiral in real time.
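The only thing that stops that kind of churn long enough to think is suspending the ASG processes doing the recycling. Roughly (the ASG name here is made up):

```bash
# pause health-check-driven replacement so throttled instances can finish booting
aws autoscaling suspend-processes \
  --auto-scaling-group-name prod-worker-asg \
  --scaling-processes HealthCheck ReplaceUnhealthy

# resume once the fleet is actually healthy again
aws autoscaling resume-processes \
  --auto-scaling-group-name prod-worker-asg \
  --scaling-processes HealthCheck ReplaceUnhealthy
```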
## Ticket #9905: The False Fix and the EBS Wall
The “fix” was supposed to be simple: migrate the instance family to `c6g.xlarge`. No more burst credits. Graviton processors. Better price-to-performance. Or so the whitepapers say. We updated the Launch Template and forced an instance refresh.
Then we hit the next wall: EBS IOPS. We were using gp2 volumes. For the uninitiated, gp2 performance is tied to volume size. You want more IOPS? You have to buy more storage you don’t need. We were on 100GB volumes, giving us a baseline of 300 IOPS. Under the new load, the disk queue depth exploded.
```bash
aws ec2 describe-volumes \
  --volume-ids vol-05923847561029384 \
  --query 'Volumes[*].Iops'
```
The volumes were choked. The application was waiting on I/O, which caused the thread pool to saturate, which caused the load balancer to start throwing 504 Gateway Timeouts. We tried to switch to gp3 on the fly to provision 3,000 IOPS independently of storage size. But AWS rate-limits volume modifications: per-volume cooldowns and API throttling both kick in when you try to convert an entire fleet at once. We hit them.
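The conversion itself is a single `modify-volume` call per volume, which is exactly why the throttling stung. Roughly, reusing the volume ID from above:

```bash
# gp2 -> gp3: IOPS and throughput become independent of volume size
aws ec2 modify-volume \
  --volume-id vol-05923847561029384 \
  --volume-type gp3 \
  --iops 3000 \
  --throughput 125
```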
I had to get on the phone with a TAM (Technical Account Manager) to beg for a limit increase while the site was hemorrhaging $50,000 a minute. The “elasticity” of the cloud is a lie if you don’t have the quota to stretch it.
## Ticket #9912: Why the NAT Gateway is Eating Our Series B
While the compute layer was burning, I noticed a spike in our “Data Transfer” costs. We were processing 50TB of data through a NAT Gateway. For those who don’t look at the bill, a NAT Gateway in AWS charges you twice: once for the hourly uptime and once for every gigabyte that passes through it.
We were pulling massive container images from an S3 bucket and hitting an external API. Because we didn’t have a VPC Endpoint for S3 configured in that specific subnet, all that traffic was routed through the NAT Gateway.
```bash
aws ec2 describe-nat-gateways \
  --filter "Name=vpc-id,Values=vpc-0a1b2c3d4e5f6g7h8" \
  --query 'NatGateways[*].NatGatewayAddresses'
```
The egress fees were astronomical. We were paying $0.045 per GB just to move data within the same region. It’s a tax on poor architectural choices. I had to manually rewrite the route tables in the middle of the outage to point S3 traffic toward a Gateway Endpoint. If you misconfigure a route table during a Sev-1, you lose SSH access to your fleet. I did it anyway. My hands were shaking from the caffeine.
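The fix itself is insultingly small, which is what makes the bill hurt. Something like this (the route table ID is a placeholder):

```bash
# S3 Gateway Endpoint: S3-bound traffic leaves via the endpoint instead of
# the NAT Gateway, so the per-GB processing fee disappears
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0a1b2c3d4e5f6g7h8 \
  --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-0123456789abcdef0
```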
## Ticket #9920: The Reality Check of Service Limits
By hour 40, we tried to scale the database. We’re using Aurora MySQL. “Limitless scale,” they said. Except for the part where you hit the `max_connections` limit because the application doesn’t use a connection pooler like ProxySQL or RDS Proxy.
Every time a Lambda function triggered, it opened a new connection. Under load, we had 5,000 concurrent Lambdas trying to talk to a database that was configured for 2,000 connections.
```bash
aws rds describe-db-instances \
  --db-instance-identifier prod-db-cluster \
  --query 'DBInstances[*].DBInstanceClass'
```
We were on a `db.r6g.2xlarge`. We needed to scale up, but the “Modify DB Instance” operation takes time. It’s not instantaneous. While the database was “modifying,” it was effectively in a read-only state for certain operations. The “Serverless” dream died right there. We were managing more infrastructure than we ever did on-prem, just with different names and more expensive invoices.
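One blunt stopgap while the instance resizes is to cap the functions so they physically can't open more connections than the database can hold. A sketch (the number is a back-of-envelope budget, and the real answer is RDS Proxy):

```bash
# reserved concurrency doubles as a crude connection budget:
# at most 1,500 concurrent executions means at most ~1,500 connections
aws lambda put-function-concurrency \
  --function-name prod-order-processor \
  --reserved-concurrent-executions 1500
```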
The “soft limits” are the real killer. You don’t know they exist until you hit them. `DescribeInstances` API rate limits? Hit them. CloudFormation stack resource limits? Hit them. IAM policy size limits? Hit those too. We had to split our managed policies because they exceeded the 6,144-character limit.
## Ticket #9931: The Technical Debt of Lambda Execution Environments
The “Serverless” advocates in the company wanted to move everything to Lambda to “save money.” What they didn’t account for were “cold starts” and the execution environment overhead. When you put a Lambda in a VPC, it needs an ENI (Elastic Network Interface). While AWS has optimized this with Hyperplane, it still adds latency.
We had a microservice that was chaining four Lambda calls. Each call added 200ms of cold start latency plus the actual execution time. The user experience was like trying to run through waist-high molasses.
```bash
aws lambda get-function-concurrency \
  --function-name prod-order-processor
```
We had to implement “Provisioned Concurrency” to keep the functions warm. Do you know what Provisioned Concurrency costs? It’s basically paying for a server to sit idle so that your “serverless” function can start quickly. We’ve come full circle. We’re paying for idle compute again, but this time it’s wrapped in a proprietary API that makes it impossible to migrate away.
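For the record, coming full circle looks roughly like this (the `live` alias is hypothetical):

```bash
# pay to keep 50 execution environments initialized and warm for one alias
aws lambda put-provisioned-concurrency-config \
  --function-name prod-order-processor \
  --qualifier live \
  --provisioned-concurrent-executions 50
```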
The logs were another nightmare. CloudWatch Logs are priced at $0.50 per GB ingested. Our debug logs were so verbose that the logging cost was higher than the compute cost. I had to run a script to mass-update the retention policies of 500 log groups because the default is “Never Expire.”
```bash
for group in $(aws logs describe-log-groups --query 'logGroups[*].logGroupName' --output text); do
  aws logs put-retention-policy --log-group-name "$group" --retention-in-days 7
done
```
## The Hard Truth: The Architecture Critique
The problem with AWS isn’t the technology; it’s the complexity masquerading as simplicity. They give you a “Click to Deploy” button that hides a labyrinth of networking, IAM, and billing traps.
Our architecture was a “distributed monolith.” We had the worst of both worlds: the complexity of microservices with the tight coupling of a monolith. One misconfigured S3 bucket policy (the one I started this post with) brought down the entire stack because every service relied on that one bucket for configuration state.
We used VPC Peering to connect our legacy environment to the new one. But VPC peering isn’t transitive. If VPC A is peered with VPC B, and VPC B is peered with VPC C, A cannot talk to C. We ended up with a “full mesh” of peering connections that was impossible to manage. We should have used a Transit Gateway, but the “Transit Gateway Processing Fee” was another line item the CFO wouldn’t approve.
We were also victims of “Managed Service Fetishism.” We used Amazon MQ because we didn’t want to manage RabbitMQ ourselves. But Amazon MQ lags behind upstream broker versions and exposes only limited configuration options. When we hit a bug in the underlying broker, we couldn’t patch it. We had to wait for AWS to release a new minor version. We were stuck in a waiting room while our production environment was on fire.
The “Global Infrastructure” is another marketing point that falls apart under pressure. Yes, there are multiple Availability Zones (AZs). But cross-AZ data transfer isn’t free. If your app server in us-east-1a talks to your database in us-east-1b, you’re paying for that data move. At scale, this “inter-AZ” tax becomes a massive part of the bill. We had to implement “AZ Affinity” in our load balancers to keep traffic within the same zone, which then created an imbalance in resource utilization. You can’t win.
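The “AZ affinity” knob, if you’re curious, is a target group attribute on the load balancer; something like this, with a placeholder ARN:

```bash
# disable cross-zone routing for this target group: each ALB node only sends
# traffic to targets in its own AZ, trading transfer fees for uneven load
aws elbv2 modify-target-group-attributes \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/prod-web/0123456789abcdef \
  --attributes Key=load_balancing.cross_zone.enabled,Value=false
```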
## The Descent into Madness: The Control Plane is Down
The final blow came during hour 60. We tried to roll back a deployment using CodeDeploy. But the AWS control plane for the region started experiencing “increased error rates.” We couldn’t deploy. We couldn’t roll back. We couldn’t even see our instances in the console.
I was flying blind. I had to point the CLI at `--region us-west-2` just to see if STS (Security Token Service) would still answer from somewhere else. It wouldn’t. We were locked out of our own infrastructure by the very provider we were paying six figures a month to.
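“Still responding” meant something as basic as this returning anything at all:

```bash
# the simplest possible liveness probe: can I still find out who I am?
aws sts get-caller-identity --region us-west-2
```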
This is the “Hard Truth” of the cloud. You are renting someone else’s computer, and they have the master kill switch. When their control plane fails, your “highly available” architecture is just a collection of expensive, unreachable bits.
We spent the next twelve hours manually rebuilding the state in a different region. Do you know how hard it is to move 10TB of data across regions during a global outage? The egress fees alone will require a board meeting to explain.
```bash
aws s3 sync s3://prod-data-us-east-1 s3://prod-data-us-west-2 \
  --source-region us-east-1 \
  --region us-west-2
```
The `s3 sync` command is a lie when you have millions of small files. The overhead of the API calls to `ListObjectsV2` and `HeadObject` means it would take weeks to finish. We had to use S3 Batch Operations, which is another service, another configuration, and another set of IAM permissions to get wrong.
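Before giving up on `s3 sync` entirely, the only lever worth pulling is CLI-side parallelism; it helps, but it doesn’t change the per-object API math. A sketch:

```bash
# more parallel requests per sync process; every small object still costs
# its own round-trips, so this is a bandage, not a cure
aws configure set default.s3.max_concurrent_requests 100
aws configure set default.s3.max_queue_size 10000
aws s3 sync s3://prod-data-us-east-1 s3://prod-data-us-west-2 \
  --source-region us-east-1 --region us-west-2 --only-show-errors
```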
## The Technical Debt: The Aftermath
It’s hour 72. The site is back up, mostly. We’re running on a skeleton crew. The “Root Cause Analysis” (RCA) is due in four hours. I know what I’m going to write, but I also know it won’t matter.
The RCA will say “human error” regarding the S3 bucket policy. It will say “unforeseen traffic spike.” It will recommend “better monitoring.”
But the real root cause is the AWS complexity tax. We have built a system so complex that no single human can understand the failure modes. We have layers of abstractions—containers on top of VMs on top of software-defined networks on top of proprietary storage APIs. When one layer wobbles, the whole tower shakes.
We have “Infrastructure as Code” (Terraform), but the state file is corrupted because of a timed-out apply during the peak of the outage. Now I have to manually import resources back into the state file, praying that I don’t accidentally trigger a Destroy on the production database.
```bash
terraform import aws_db_instance.database prod-db-cluster
```
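And the paranoid ritual after every import, before trusting the state file again:

```bash
# confirm the resource is in state, then check the plan shows no destroys
terraform state list | grep aws_db_instance.database
terraform plan -target=aws_db_instance.database
```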
Every time I run a command, I’m terrified. One typo in a CLI flag and I’m back at hour zero. The stress is physical. My heart rate hasn’t dropped below 90 in three days. I can smell the ozone and the stale coffee.
We’re “Cloud Native” now. That’s what the brochure said. But “Cloud Native” just means you’ve traded your hardware problems for distributed systems problems that you can’t see or touch. You’ve traded a fixed CAPEX for an uncapped OPEX that can bankrupt you in a weekend if a loop goes rogue.
I look at the junior developer who just joined the team. He’s smiling. He’s looking at the AWS console and talking about how “easy” it is to spin up a Kubernetes cluster with EKS. He thinks he’s an architect because he can click a few buttons and see a dashboard.
He has no idea. He hasn’t seen the `RequestLimitExceeded` errors in his dreams yet. He hasn’t had to explain to a CEO why a NAT Gateway cost more than the entire engineering payroll for the month. He hasn’t felt the cold dread of a Deny policy that he can’t delete because he accidentally removed his own `s3:PutBucketPolicy` permissions.
## The Hard Advice
Listen closely, kid. You think you’re “deploying” when you click that button in the console. You’re not. You’re signing a blood pact with an entity that doesn’t care about your uptime.
If you think you can just “click and deploy” without understanding the underlying VPC routing, the IOPS credit math, and the absolute finality of an IAM Deny statement, you are a liability. The cloud is not a playground; it is a high-voltage power grid where every wire is live and nothing is labeled.
Before you launch that next “serverless” function, go look at the pricing page for Data Transfer. Then look at the service limits for your region. Then go read the IAM evaluation logic documentation until you can recite it in your sleep. If you don’t know the difference between a gp2 burst bucket and a gp3 throughput limit, stay away from the production environment.
The console is a lie designed to make you spend money. The CLI is the only truth, and even the truth is eventually consistent. Now get out of my sight; I have a post-mortem to write and a state file to fix.