```text
[2024-05-22 03:14:02] INFO: Deployment triggered by 'Rockstar_Dev_69'
[2024-05-22 03:15:44] ERROR: Failed to pull image "registry.internal/awesome-app:latest"
[2024-05-22 03:15:44] ERROR: RPC error: code = Unknown desc = failed to register layer: Error processing tar file(exit status 1): write /usr/src/app/node_modules/huge-useless-library/dist/bundle.js: no space left on device
[2024-05-22 03:16:10] CRITICAL: Node ip-10-0-42-12.ec2.internal is DiskPressure
[2024-05-22 03:16:15] CRITICAL: Kubelet stopped posting node status.
[2024-05-22 03:17:00] FATAL: Production cluster is unresponsive. PagerDuty initiated.
```
I am currently staring at a cold cup of coffee and a terminal screen that is bleeding red. It is 4:00 AM. I have been awake for 72 hours because you decided that the "docker best" way to deploy a simple Node.js microservice was to wrap it in a 3GB container image and push it directly to our production registry using the `latest` tag.
Our nodes didn't just fail; they choked. They tried to pull your bloated monstrosity, ran out of disk space on the root partition, and the Kubelet died in a fit of rage. I had to manually prune the build cache on twelve different workers just to get the control plane to talk to me again.
Sit down. We are going to talk about why your "it works on my machine" mentality is a cancer on this infrastructure.
## Stop Using 'Latest' Like a Gambler
You tagged your image as `latest`. Do you know what `latest` means in Docker Engine 24.0.7? It means absolutely nothing. It is a default label, not a pointer to the most recent stable build. When you pushed that 3GB image, every single node in the cluster that had an `imagePullPolicy: Always` setting tried to pull it simultaneously.
Because you didn't version your image, I couldn't roll back. I couldn't just tell Kubernetes to go back to `v1.2.3` because you’ve been overwriting `latest` for the last three weeks. We had to dig through the registry API to find the SHA256 hash of the previous working image while the CEO was screaming in the Slack bridge.
In a real environment, you pin your versions. You use semantic versioning. You use the git commit hash. You do anything except use that godforsaken tag. If I see another deployment manifest with `image: awesome-app:latest`, I am revoking your registry write permissions and making you deploy via carrier pigeon.
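Since this apparently needs to be spelled out, here is a sketch of what tagging by commit hash looks like. The registry path is the one from the logs above; the deployment and container names are illustrative, not gospel:

```bash
# Tag with the git commit instead of `latest` (deployment/container names are illustrative)
GIT_SHA=$(git rev-parse --short HEAD)
docker build -t registry.internal/awesome-app:${GIT_SHA} .
docker push registry.internal/awesome-app:${GIT_SHA}

# Rolling back becomes a one-liner instead of registry archaeology
kubectl set image deployment/awesome-app awesome-app=registry.internal/awesome-app:<previous-sha>
```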
## Your Layers are Bloated and You Should Feel Bad
I ran a `docker history` on your image. It was like performing an autopsy on a man who died from eating too much lead. Here is what I saw:
```bash
IMAGE          CREATED       CREATED BY                                      SIZE      COMMENT
<missing>      2 hours ago   /bin/sh -c #(nop)  CMD ["npm" "start"]          0B
<missing>      2 hours ago   /bin/sh -c npm install                          1.8GB
<missing>      2 hours ago   /bin/sh -c #(nop) COPY dir:7e3... in /app       1.1GB
<missing>      2 hours ago   /bin/sh -c apt-get update && apt-get install…   450MB
<missing>      3 weeks ago   /bin/sh -c #(nop) WORKDIR /app                  0B
<missing>      3 weeks ago   /bin/sh -c #(nop)  FROM node:bookworm           850MB
```
Three gigabytes. For a Hello World API.
You used `node:bookworm` (Debian Bookworm) as your base image. Why? Do you need a full suite of build tools, man pages, and a C compiler in production? No. You don't. You included the `.git` directory. You included the `tests` folder. You included the `node_modules` from your local macOS environment which, newsflash, doesn't work on Linux when you have native C++ bindings.
You need to understand how the overlay2 storage driver works. Every `RUN`, `COPY`, and `ADD` instruction in your Dockerfile creates a new layer. These layers are stacked. If you `COPY` 1GB of data in one layer and then `RUN rm -rf` that data in the next, your image is still 1GB larger. The data is still there, hidden in the lowerdir, haunting the filesystem like a vengeful ghost. To keep things "docker best" compliant, you must clean up in the same layer where the mess was made.
## Multi-Stage Builds are Not a Suggestion
If you want to be a “rockstar,” learn to use a multi-stage Dockerfile. This isn’t 2014. We don’t need to ship the compiler with the binary. We use one stage to build the assets and a second, clean stage to run them.
Here is the “Broken” Dockerfile you pushed, which I have pinned to the wall of shame:
```dockerfile
# THE BROKEN WAY - DO NOT DO THIS
FROM node:20
WORKDIR /app
COPY . .
RUN apt-get update && apt-get install -y python3 make g++ # Why is this here?
RUN npm install
EXPOSE 3000
CMD npm start
```
And here is what a professional would have written using Alpine 3.19 to keep the footprint small and the attack surface narrow:
```dockerfile
# THE BETTER WAY - Multi-stage with Alpine 3.19
# Stage 1: Build
FROM node:20-alpine3.19 AS builder
WORKDIR /app
COPY package*.json ./
# Use a build mount to cache npm packages across builds
RUN --mount=type=cache,target=/root/.npm npm ci
COPY . .
RUN npm run build
# Drop dev dependencies so only production node_modules ships
RUN npm prune --omit=dev

# Stage 2: Runtime
FROM node:20-alpine3.19
RUN apk add --no-cache tini
WORKDIR /app
# Copy only the necessary artifacts
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./package.json
# Security: Never run as root
USER 10001
ENTRYPOINT ["/sbin/tini", "--"]
CMD ["node", "dist/index.js"]
```
By using multi-stage builds, you separate the build-time dependencies (like python3 or g++ for native modules) from the runtime environment. The final image only contains the compiled code and the production node_modules. This would have turned your 3GB disaster into a 150MB image that pulls in three seconds instead of three minutes.
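And don't take my word for the size. Build it and look for yourself. A rough sketch follows; the tag is illustrative, and BuildKit is assumed because the cache mount needs it:

```bash
# Build with BuildKit (required for --mount=type=cache) and check the resulting size
DOCKER_BUILDKIT=1 docker build -t registry.internal/awesome-app:$(git rev-parse --short HEAD) .
docker images registry.internal/awesome-app --format '{{.Repository}}:{{.Tag}}  {{.Size}}'
```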
## Stop Running as Root Before I Revoke Your Sudo
I noticed your container process was running as root. Do you have any idea how dangerous that is? If there is a container breakout vulnerability in the Linux kernel or the container runtime, the attacker doesn’t just have your app; they have the host.
You wrote `USER 10001` in your head and then forgot to put it in the file. In the "docker best" world, we follow the principle of least privilege. We don't need UID 0 to serve a JSON payload over HTTP. By specifying a non-privileged UID (like 10001), even if someone exploits your shitty code, they are trapped in a sandbox with no permissions to do anything meaningful to the underlying node.
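Checking this takes ten seconds. A sketch, assuming the image and tag are whatever you actually pushed:

```bash
# If this prints uid=0(root), go back and fix your Dockerfile
docker run --rm registry.internal/awesome-app:<tag> id
# Expected with USER 10001: uid=10001 gid=0(root)
```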
Also, notice the `apk add --no-cache tini` in the runtime stage. This brings us to the next point that you clearly ignored while you were busy "disrupting" the uptime.
## ENTRYPOINT is Not a Suggestion, and Shell Form is a Lie
In your Dockerfile, you used `CMD npm start`. This is the "shell form." When you do this, Docker starts a shell (`/bin/sh -c`) and then the shell starts your process. Your application is no longer PID 1; the shell is.
Why does this matter? Because shells do not forward signals.
When Kubernetes tries to stop your pod, it sends a SIGTERM. It waits 30 seconds for your app to shut down gracefully—closing database connections, finishing active requests, and saving state. But because you used the shell form, your app never sees the SIGTERM. It just sits there, oblivious, until the 30-second grace period expires and the kernel sends a SIGKILL.
SIGKILL is the digital equivalent of a bullet to the back of the head. It doesn’t allow for cleanup. It leaves database locks dangling. It corrupts files.
Use the exec form (the JSON array syntax) for `ENTRYPOINT` and `CMD`.
- Shell form: `CMD node index.js` -> runs as `/bin/sh -c "node index.js"`, with the shell as PID 1
- Exec form: `CMD ["node", "index.js"]` -> runs `node index.js` directly as PID 1
And use `tini`. It's a tiny init process that handles zombie processes and forwards signals correctly. It ensures that when I tell the cluster to restart, it actually restarts, rather than hanging for 30 seconds while the Kubelet gets impatient.
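If you don't believe me, see it for yourself. A rough sketch, with illustrative container and image names:

```bash
# Start the container and look at what is actually PID 1 inside it
docker run -d --name sigtest registry.internal/awesome-app:<tag>
docker exec sigtest ps -o pid,comm   # shell form: PID 1 is "sh", not "node"

# Shell form waits out docker stop's 10-second default timeout before the SIGKILL;
# exec form (or tini) returns almost immediately
time docker stop sigtest
```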
## Build Performance and the Myth of the Clean Slate
You complained that the build was "slow" on your machine, so you started skipping the `.dockerignore` file. You thought it was faster to just `COPY . .` and let Docker figure it out.
Every time you change a single line in your README.md, you invalidate the cache for the `COPY . .` layer. Because that layer is invalidated, every subsequent layer, including the `npm install`, has to run from scratch. That is why our CI/CD pipeline was backed up for two hours.
By using `--mount=type=cache`, as shown in my "Fixed" Dockerfile, we can persist the npm cache across builds. Even if the layer is invalidated, the package manager doesn't have to re-download the entire internet. It just checks the local cache.
Also, for the love of all that is holy, use a `.dockerignore` file:
```
.git
node_modules
npm-debug.log
dist
tests
.env
```
If these files are in your build context, they are sent to the Docker daemon. If your `.git` folder is 1.2GB (which yours was, because you never prune your objects), then every time you run `docker build`, you are transferring 1.2GB of useless metadata to the daemon. It's slow, it's inefficient, and it's the opposite of "docker best" behavior.
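Before you blame the daemon again, check what you are actually shipping to it. A quick sketch:

```bash
# See the heavyweights sitting in your build context
du -sh .git node_modules 2>/dev/null
# BuildKit also prints the context transfer size near the top of the `docker build` output
```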
## The Anatomy of overlay2 and Why You Broke the Disk
Let's get technical, since you seem to think "DevOps" is just a buzzword you put on LinkedIn. Docker's default storage driver, overlay2, is built on OverlayFS and manages image layers using a "lowerdir" (the read-only layers), an "upperdir" (the writable layer of the container), and a "merged" view.
When you run `apt-get update`, you are writing hundreds of megabytes of package metadata to the upperdir. If you don't delete that metadata in the same `RUN` command, it is committed to a read-only layer.
```dockerfile
# WRONG: The cache stays in the image forever
RUN apt-get update
RUN apt-get install -y heavy-package
RUN rm -rf /var/lib/apt/lists/*

# RIGHT: The cache is deleted before the layer is committed
RUN apt-get update && \
    apt-get install -y heavy-package && \
    rm -rf /var/lib/apt/lists/*
```
In your image, you had five different RUN commands for system updates. You had layers that were nothing but apt metadata. You were literally shipping a graveyard of cached files that served no purpose other than to make my life miserable at 3:00 AM.
When the container runtime pulls these layers, it has to extract them. Extraction requires disk space. If your image is 3GB compressed, it might be 6GB or 7GB uncompressed. Our worker nodes have 20GB root partitions. You took up 35% of the entire disk with one single, poorly optimized container. When the second and third pods tried to schedule on the same node, the disk filled up, the Kubelet couldn’t write its own logs, and the node went NotReady.
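For the record, this is roughly what I was staring at on every node. The commands are the standard ones, not your exact output:

```bash
# How much space Docker is eating: images, containers, local volumes, build cache
docker system df
# And how close the root partition is to falling over
df -h /var/lib/docker
```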
## The Cost of Ignorance
This isn’t just about disk space. It’s about the blast radius. Because the node went NotReady, Kubernetes tried to reschedule those pods onto other nodes. Those nodes also had 20GB disks. They also tried to pull your 3GB image. They also ran out of space. It was a cascading failure—a “death spiral” triggered by your lack of basic container hygiene.
I spent four hours writing scripts to purge the `/var/lib/docker/overlay2` directory across the cluster because `docker system prune` wasn't enough to recover the nodes in their hung state. I had to stop the Docker daemon, manually unmount stuck overlay points, and pray that I didn't corrupt the underlying XFS filesystem.
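For the curious, the per-node cleanup looked roughly like this. A heavily simplified sketch; the real thing involved wedged mounts and a great deal of swearing:

```bash
# Rough shape of the per-node recovery (simplified; exact steps depend on your runtime and distro)
systemctl stop kubelet docker
# ...manually unmount whatever is wedged under /var/lib/docker/overlay2...
systemctl start docker
docker system prune -af   # reclaim images and build cache once the daemon answers
systemctl start kubelet
```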
We are moving to a strict policy. From now on, any image over 500MB gets automatically rejected by the admission controller. Any image using the `latest` tag will be deleted from the registry by a cron job every hour. Any container trying to run as root will be rejected by Pod Security Admission.
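The non-root part, at least, is a single label with Pod Security Admission. A sketch, with the namespace name as an assumption:

```bash
# Enforce the "restricted" Pod Security Standard on the prod namespace; root-running pods get rejected at admission
kubectl label namespace prod pod-security.kubernetes.io/enforce=restricted
```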
You want to be a rockstar? Start by acting like an engineer. Learn how the Linux kernel handles signals. Learn how the filesystem manages layers. Stop treating Docker like a “magic box” that hides your bad habits.
I’m going home now. I’m going to sleep for fourteen hours. When I come back, if I see a 3GB image in the staging registry, I’m not going to fix it. I’m just going to point the PagerDuty alerts directly to your personal phone and see how much you enjoy the “rockstar” lifestyle at 4:00 AM.
Go read the documentation. And don’t you dare use the word “optimized” in the stand-up tomorrow unless you’ve actually reduced the layer count.
Actually, don’t bother. You’ll probably just find a way to include a 2GB machine learning model in a CSS-only microservice by next week anyway. I’m updating my resume.