10 Essential AWS Best Practices for Cloud Optimization

INTERNAL POST-MORTEM: INCIDENT #8842-BRAVO
DATE: Monday, October 16, 2023
DURATION: 72 Hours, 14 Minutes
TOTAL ESTIMATED LOSS: $85,422.19 (Infrastructure Egress + Compute Over-provisioning + Lost Revenue)
STATUS: SEV-0 (Mitigated, Not Resolved)
AUTHOR: Senior SRE (Platform Reliability Team)
1. THE INCIDENT SUMMARY
TIMESTAMP: 2023-10-13 23:43:12 UTC
The pager went off. I was three sips into a …

Mastering Amazon AWS: A Complete Guide for Beginners

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ExplicitDenyAllExceptVPC",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::critical-prod-assets/*",
        "arn:aws:s3:::critical-prod-assets"
      ],
      "Condition": {
        "StringNotEquals": { "aws:SourceVpc": "vpc-0a1b2c3d4e5f6g7h8" },
        "Bool": { "aws:PrincipalIsAWSService": "false" }
      }
    },
    {
      "Sid": "AllowAppRoleAccess",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::123456789012:role/application-server-role" },
      "Action": [ "s3:GetObject", "s3:PutObject" ],
      "Resource": "arn:aws:s3:::critical-prod-assets/*"
    }
  ]
}
The …

What is Kubernetes? A Complete Guide to Orchestration

TIMESTAMP: 2024-05-22 04:12:08 UTC
STATUS: CRITICAL / DEGRADED
INCIDENT ID: #8829-BETA-CASCADING-FAILURE
OPERATOR: SRE_042 (COFFEE_LEVEL: CRITICAL)
The hum of the data center fans is a sound I can hear even in my own apartment now. It’s a low-frequency vibration that lives in the base of my skull. I’ve been staring at a Grafana dashboard for forty-eight …

docker compose – Guide

$ docker compose up -d
[+] Running 5/5
 ⠿ Network infrastructure_default  Created  0.1s
 ⠿ Container postgres-db           Started  0.5s
 ⠿ Container redis-cache           Started  0.4s
 ⠿ Container backend-api           Error    2.1s
 ⠿ Container frontend-app          Started  0.8s
Error response from daemon: failed to mount local volume: mount /home/sre_hell/app/config:/etc/app/config, flags: 0x1000: no such file or directory
$ docker compose …

10 Kubernetes Best Practices for Production Success

INCIDENT REPORT: POST-MORTEM #882-B (THE “FRIDAY AFTERNOON ARCHITECT SPECIAL”)
TIMESTAMP: 2024-05-17T03:04:12Z
INITIAL ALERT: CRITICAL - PagerDuty - Service: checkout-api - Severity: 1 - Status: FAILED
FIRST LOG ENTRY: kubelet[1024]: E0517 03:04:12.442103 1024 remote_runtime.go:116] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to setup network for sandbox: failed to allocate for …

Artificial Intelligence Best Practices: A Complete Guide

[2024-05-22 03:14:22] ERROR: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.50 GiB (GPU 0; 79.35 GiB total capacity; 64.12 GiB already allocated; 10.23 GiB free; 66.12 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2024-05-22 03:14:22] …

10 DevOps Best Practices for Faster Software Delivery

$ kubectl get pods -n prod
NAME                               READY   STATUS             RESTARTS      AGE
api-gateway-v2-7f8d9b4c-xhq2z      0/1     CrashLoopBackOff   42 (3m ago)   14h
order-processor-66d5f4e3-99abc     0/1     OOMKilled          12 (1m ago)   14h
payment-service-55c2a1b0-zxy98     1/1     Running            0             14h
marketing-tracker-88f123a4-bbbbb   1/1     Running            0             14h
$ kubectl logs -f api-gateway-v2-7f8d9b4c-xhq2z --previous
{"level":"fatal","ts":1715432100.123,"caller":"main.go:45","msg":"failed to connect to redis","error":"dial tcp 10.96.0.15:6379: i/o timeout"}
{"level":"info","ts":1715432105.456,"msg":"Attempting reconnection... (Attempt 43)"}
…

10 Docker Best Practices to Optimize Your Containers

[2024-05-22 03:14:02] INFO: Deployment triggered by 'Rockstar_Dev_69'
[2024-05-22 03:15:44] ERROR: Failed to pull image "registry.internal/awesome-app:latest"
[2024-05-22 03:15:44] ERROR: RPC error: code = Unknown desc = failed to register layer: Error processing tar file(exit status 1): write /usr/src/app/node_modules/huge-useless-library/dist/bundle.js: no space left on device
[2024-05-22 03:16:10] CRITICAL: Node ip-10-0-42-12.ec2.internal is DiskPressure
[2024-05-22 03:16:15] CRITICAL: Kubelet stopped …

Top Cybersecurity Jobs in 2024: Careers, Salary & Skills

May 14 02:14:01 prod-web-04 sshd[14202]: Failed password for root from 192.168.1.45 port 54322 ssh2
May 14 02:14:05 prod-web-04 sshd[14202]: Accepted password for root from 192.168.1.45 port 54322 ssh2
May 14 02:14:06 prod-web-04 systemd[1]: Starting User Manager for UID 0...
May 14 02:14:06 prod-web-04 systemd-logind[645]: New session 142 of user root.
whoami
root
uname -a
Linux …

Master the Python List: A Complete Guide with Examples

INCIDENT REPORT #882-B: Why our heap looks like a disaster zone.
DATE: October 14, 2023
TO: Engineering Lead, CTO, and anyone else who thinks “it’s just a script”
FROM: Senior Systems Engineer (Level 4, Infrastructure)
SUBJECT: Post-Mortem of the Node-04 Memory Exhaustion Event (Python 3.11.4)
SYSTEM INCIDENT LOG: 2023-10-12T03:14:22Z
03:14:22 - CRITICAL - kernel: [99283.12] …