10 Essential AWS Best Practices for Cloud Optimization

INTERNAL POST-MORTEM: INCIDENT #8842-BRAVO
DATE: Monday, October 16, 2023
DURATION: 72 Hours, 14 Minutes
TOTAL ESTIMATED LOSS: $85,422.19 (Infrastructure Egress + Compute Over-provisioning + Lost Revenue)
STATUS: SEV-0 (Mitigated, Not Resolved)
AUTHOR: Senior SRE (Platform Reliability Team)


1. THE INCIDENT SUMMARY

TIMESTAMP: 2023-10-13 23:43:12 UTC
The pager went off. I was three sips into a beer. By 23:45, the AWS Billing Alert hit my inbox.

SUBJECT: [URGENT] AWS Billing Alert: Estimated charges for the current month have exceeded your threshold of $10,000.00.

The “Current Estimate” wasn’t $10,001. It was $42,908.12. By the time I logged into the Billing Dashboard, it was $44,200. We were burning $1,200 an hour on a Friday night. This wasn’t a traffic spike. This wasn’t a successful marketing campaign. This was a catastrophic failure of engineering discipline.

The dashboard showed a vertical line in Data Transfer costs. Someone had bypassed the staging environment and pushed a “hotfix” to the production VPC that turned our internal data sync into a global egress nightmare. We spent the next 72 hours clawing back our infrastructure from the brink of bankruptcy.


2. THE IAM POLICY THAT LEAKED THE KEYS TO THE KINGDOM

The first point of failure wasn’t the code; it was the permissions. A junior developer, frustrated by “access denied” errors while trying to debug a Lambda function, decided to “simplify” the IAM policy. They attached a policy that effectively turned our production environment into an open playground.

Following AWS best practices isn’t a suggestion; it’s a survival tactic, and it was ignored here. We found a policy attached to a dev-temp-role that looked like this:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "*",
            "Resource": "*"
        }
    ]
}

This is the “God Mode” policy. It was used to “test” a script that was supposed to move logs to S3. Instead, the script had a logic loop. Because the role had iam:CreateUser and iam:AttachUserPolicy permissions, a compromised set of temporary credentials allowed an automated bot to spin up 50 p3.16xlarge instances in us-east-1, us-west-2, and eu-central-1 for GPU mining.

THE FIX:
I had to run a scorched-earth script to identify every principal with AdministratorAccess that wasn’t the break-glass account.

# Identifying the carnage: who actually has AdministratorAccess?
aws iam list-entities-for-policy --policy-arn arn:aws:iam::aws:policy/AdministratorAccess
aws iam list-attached-user-policies --user-name <redacted>
aws iam list-roles --query 'Roles[?contains(AssumeRolePolicyDocument.Statement[].Principal.AWS, `*`)].RoleName'

We revoked the sessions and implemented a Service Control Policy (SCP) at the Organization level to deny the ability to create p3 or g4 instances in any region except our primary. If you want to use a $24-an-hour instance, you now have to justify it to me in person.
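
The SCP itself is a short deny. A minimal sketch, assuming us-east-1 is the primary region (the Sid, the region, and the instance-type patterns are placeholders to adapt):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyGpuOutsidePrimary",
            "Effect": "Deny",
            "Action": "ec2:RunInstances",
            "Resource": "arn:aws:ec2:*:*:instance/*",
            "Condition": {
                "StringLike": {
                    "ec2:InstanceType": ["p3.*", "g4*"]
                },
                "StringNotEquals": {
                    "aws:RequestedRegion": "us-east-1"
                }
            }
        }
    ]
}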

The new policy for the Lambda function now follows the Principle of Least Privilege. It targets specific ARNs. No wildcards. No shortcuts.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:PutObjectAcl"
            ],
            "Resource": "arn:aws:s3:::prod-logs-app-01/*",
            "Condition": {
                "StringEquals": {
                    "aws:SourceVpc": "vpc-0a1b2c3d4e5f6g7h8"
                }
            }
        }
    ]
}

3. WHY OUR VPC PEERING WAS A LATENCY NIGHTMARE

While the crypto-miners were eating our compute budget, our core application was dying because of a “flat network” design. Someone thought it would be “easier” to peer every VPC together in a full mesh.

We had 10.0.0.0/16 in Production and 10.0.0.0/16 in Staging. Yes, you read that right. Overlapping CIDR blocks, which AWS won’t even let you peer directly. To “fix” this, a previous engineer had set up a complex series of NAT Gateways and secondary IP ranges that created a routing loop.

When the mining instances started saturating the NAT Gateway, the connection tracking table (conntrack) hit its limit. The NAT Gateway has a limit of 55,000 concurrent connections to a single destination. We hit that in four minutes.

THE FIX:
We had to tear down the peering and move to a Transit Gateway architecture. But first, I had to identify where the traffic was actually going. I ran this to check the NAT Gateway metrics:

aws cloudwatch get-metric-statistics \
    --namespace AWS/NATGateway \
    --metric-name ErrorPortAllocation \
    --dimensions Name=NatGatewayId,Value=nat-0123456789abcdef0 \
    --start-time 2023-10-14T00:00:00Z \
    --end-time 2023-10-14T01:00:00Z \
    --period 60 \
    --statistics Sum

The output was a wall of non-zero integers. We were dropping packets because the NAT Gateway was choked. We migrated the critical services to use VPC Endpoints (Gateway Endpoints for S3 and DynamoDB, which are free). This kept the traffic on the AWS private backbone, off the public internet, and off the NAT Gateway entirely, cutting our NAT Gateway bill by 80% instantly.
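
The endpoint creation is one call per VPC. A rough sketch for S3 (the route table ID and the region in the service name are placeholders; DynamoDB is the same call with a different --service-name):

aws ec2 create-vpc-endpoint \
    --vpc-id vpc-0a1b2c3d4e5f6g7h8 \
    --vpc-endpoint-type Gateway \
    --service-name com.amazonaws.us-east-1.s3 \
    --route-table-ids rtb-0123456789abcdef0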

We also re-addressed the subnets. No more /16 for everything. We moved to a structured tier:
Public Subnets: /24 (Load Balancers, Bastions)
Private App Subnets: /22 (EC2 Fleet, EKS Nodes)
Data Subnets: /24 (RDS, ElastiCache)


4. S3 BUCKET NEGLIGENCE: PUBLIC IS NOT A PERMISSION

At 03:00 Saturday, I discovered why the egress bill was still climbing even after I killed the mining instances. An S3 bucket named company-assets-backup had been set to public.

Why? Because a frontend dev couldn’t get the CORS policy right for a staging site, so they just hit the “Make Public” button and checked “I understand the risks.” They didn’t understand the risks.

A crawler found the bucket. The bucket contained 4TB of uncompressed database snapshots (another failure for the Storage section). The crawler started downloading the entire bucket from a GCP region. We were paying $0.09 per GB for someone to steal our data.

THE FIX:
I didn’t just fix the bucket; I locked the entire account.

# The "I'm done with this" command
aws s3api put-public-access-block \
    --bucket company-assets-backup \
    --public-access-block-configuration "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"
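
To lock the whole account rather than one bucket at a time, there is an account-level version of the same call (the account ID here is a placeholder):

# Account-wide lockdown
aws s3control put-public-access-block \
    --account-id 123456789012 \
    --public-access-block-configuration "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"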

Then I audited every bucket in the account:

for bucket in $(aws s3api list-buckets --query "Buckets[].Name" --output text); do
    echo "Checking $bucket"
    aws s3api get-public-access-block --bucket "$bucket" > /dev/null 2>&1 || echo "$bucket IS EXPOSED"
done

We implemented S3 Object Lock and moved all “backup” data to S3 Glacier Deep Archive with a lifecycle policy. Storing 4TB of “backups” in S3 Standard is an expensive way to prove you don’t know how to use cold storage.
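
The lifecycle rule that does the Deep Archive move is a few lines. A sketch against the same bucket (the rule ID and the one-day transition are my assumptions; tune the prefix and timing):

aws s3api put-bucket-lifecycle-configuration \
    --bucket company-assets-backup \
    --lifecycle-configuration '{
        "Rules": [{
            "ID": "backups-to-deep-archive",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},
            "Transitions": [{"Days": 1, "StorageClass": "DEEP_ARCHIVE"}]
        }]
    }'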


5. THE DATABASE DEADLOCK THAT COST US A QUARTER’S PROFIT

By Saturday afternoon, the app was back up, but the RDS instance (an m5.4xlarge) was pinned at 99% CPU. The “architectural shortcut” here was a lack of Read Replicas. The application was hitting the primary writer for every single analytics query.

Worse, the storage was configured as gp2. For those who don’t spend their lives in the console, gp2 uses a burst credit system for IOPS. Once you run out of credits, your disk performance drops to the baseline of 3 IOPS per provisioned GB. For a 100GB volume, that baseline is 300 IOPS. Our app needs 5,000.

The database was in an I/O wait death spiral.

THE FIX:
We performed a zero-downtime migration to gp3 and scaled the storage to 1TB, which lifts the RDS gp3 baseline from 3,000 to 12,000 IOPS, and set the provisioned IOPS explicitly to 12,000.

aws rds modify-db-instance \
    --db-instance-identifier prod-db-master \
    --allocated-storage 1000 \
    --storage-type gp3 \
    --iops 12000 \
    --apply-immediately

But that wasn’t enough. I had to kill the long-running queries that were locking the tables. I logged into the instance and saw the horror:

SELECT * FROM orders JOIN users JOIN tracking_events WHERE orders.created_at > '2023-01-01';

No indexes. Full table scans on three joined tables. I added a Read Replica (r5.large) and forced the analytics engine to point there.

aws rds create-db-instance-read-replica \
    --db-instance-identifier prod-db-replica-01 \
    --source-db-instance-identifier prod-db-master \
    --db-instance-class db.r5.large
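
Moving the load doesn’t fix the query. At minimum it needs an index on the filter column; a hypothetical starting point (the index name is mine, and the real schema almost certainly needs more):

CREATE INDEX idx_orders_created_at ON orders (created_at);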

6. CLOUDWATCH LOGS: THE 4 AM GHOST IN THE MACHINE

Sunday morning, 04:00. The billing dashboard showed a new spike: CloudWatch Logs.

In the panic of Friday night, someone had enabled DEBUG logging across the entire EKS cluster to “see what was happening.” We were ingesting 500GB of logs per hour. CloudWatch ingestion costs $0.50 per GB. Do the math. That’s $250 an hour just to watch the cluster breathe.

Most of these logs were “Connection Refused” errors from the aforementioned NAT Gateway failure, repeating 100 times a second.

THE FIX:
I had to bulk-update the retention policies and the log levels. We were keeping logs “Forever.” Why? Because “storage is cheap.” No, it isn’t.

# Finding the offenders
aws logs describe-log-groups --query 'logGroups[?storedBytes > `1000000000`].[logGroupName, storedBytes]'

# Setting a 7-day retention because we aren't a library
for group in $(aws logs describe-log-groups --query 'logGroups[].logGroupName' --output text); do
    aws logs put-retention-policy --log-group-name $group --retention-in-days 7
done

We also implemented a Fluent Bit filter to drop any log entry that didn’t have a level of ERROR or CRITICAL in production. If you want to DEBUG, do it in staging.
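
The filter itself is a few lines of Fluent Bit config. A sketch, assuming the records carry a level field and the cluster streams are tagged kube.* (both are assumptions about our pipeline; adjust to match):

[FILTER]
    Name    grep
    Match   kube.*
    Regex   level (ERROR|CRITICAL)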


7. FINAL RECKONING: IMPLEMENTING AWS BEST PRACTICES BEFORE WE GO UNDER

It is now Monday. I have had four hours of sleep. The $85,000 is gone. It’s a “learning expense” now. But if I see another t2.micro in a production environment, I’m quitting.

The root cause of this weekend wasn’t a technical glitch. It was “drift.” We allowed the infrastructure to drift away from the code. We allowed manual changes in the console (“ClickOps”) instead of enforcing Terraform.

THE NON-NEGOTIABLES:

  1. Infrastructure as Code (IaC) is Law: No one touches the console. If it’s not in a Terraform file, it doesn’t exist. A cron job runs terraform plan every 60 minutes to detect drift; if someone manually changes a Security Group, the next automated terraform apply reverts it.
  2. Egress Monitoring: We are implementing VPC Flow Logs and sending them to a dedicated S3 bucket for analysis with Athena (see the sketch after the launch command below). If egress exceeds $100/hour, the circuit breaker trips.
  3. Instance Selection: No more burstable instances (t-series) for production workloads. We use m5 or c5 instances with fixed-performance CPUs. We will use gp3 for all EBS volumes to decouple IOPS from capacity.
  4. IAM Hardening: All developers are losing their direct access to the prod account. You will use AWS IAM Identity Center (formerly SSO) to assume short-lived roles.
  5. Tagging Policy: Every resource must have a CostCenter, Environment, and Owner tag.

# The new standard for launching anything (if I ever let you again)
aws ec2 run-instances \
    --image-id ami-0abcdef1234567890 \
    --instance-type m5.large \
    --subnet-id subnet-01234567 \
    --tag-specifications 'ResourceType=instance,Tags=[{Key=CostCenter,Value=Engineering},{Key=Environment,Value=Production},{Key=Owner,Value=platform-reliability}]' \
    --monitoring "Enabled=true"
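
The flow-log half of item 2 is one call per VPC. Roughly (the destination bucket is a placeholder, and the Athena table setup isn’t shown):

aws ec2 create-flow-logs \
    --resource-type VPC \
    --resource-ids vpc-0a1b2c3d4e5f6g7h8 \
    --traffic-type ALL \
    --log-destination-type s3 \
    --log-destination arn:aws:s3:::prod-vpc-flow-logs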

THE FINAL WORD:

We were lucky. If the botnet had stayed active for the whole weekend, the bill would have been $250,000. We are currently operating on a “Trust, but Verify” model, but after this, the “Trust” part is gone.

Following AWS best practices is the only thing standing between us and the total liquidation of this company’s assets. If you think a security group rule with 0.0.0.0/0 is “fine for a quick test,” please hand in your badge.

I’m going home. Do not page me unless the building is literally on fire. Even then, check the CloudWatch logs first to see if the fire is in the budget or the server room.

END OF REPORT


APPENDIX A: THE “SHAME” LIST (RESOURCES DELETED)

Resource ID     | Type          | Reason                        | Cost (72h)
i-0992837465    | p3.16xlarge   | Unauthorized Mining           | $1,728.00
nat-012345678   | NAT Gateway   | Egress Loop                   | $12,400.00
vol-088776655   | io2           | Over-provisioned (100k IOPS)  | $4,200.00
cw-logs-prod    | CloudWatch    | Debug Log Ingestion           | $18,900.00
s3-egress       | Data Transfer | Public Bucket Leak            | $32,000.00

APPENDIX B: THE RECOVERY SCRIPT (SNIPPET)

This script was used to force-terminate the rogue instances across all regions. It’s blunt, but effective.

#!/bin/bash
REGIONS=$(aws ec2 describe-regions --query "Regions[].RegionName" --output text)

for region in $REGIONS; do
    echo "Checking region: $region"
    INSTANCES=$(aws ec2 describe-instances --region $region \
        --filters "Name=instance-state-name,Values=running" \
        --query "Reservations[].Instances[?InstanceType=='p3.16xlarge'].InstanceId" --output text)

    if [ ! -z "$INSTANCES" ]; then
        echo "TERMINATING ROGUE INSTANCES IN $region: $INSTANCES"
        aws ec2 terminate-instances --region $region --instance-ids $INSTANCES
    fi
done

This script is now part of our automated incident response. If an instance type we don’t use appears in our account, it is terminated within 60 seconds. No questions asked. No exceptions.
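
One way to wire that up (an assumption on my part about the exact mechanism) is an EventBridge schedule that runs the sweep every minute; the rule name is a placeholder and the Lambda/SSM target that actually executes the script is not shown:

aws events put-rule \
    --name rogue-instance-sweep \
    --schedule-expression "rate(1 minute)"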

POST-MORTEM COMPLETE.
