the dockerfile VOLUME instruction is a kubernetes footgun
About
Kubernetes nodes were kernel-panicking under what seemed like a normal workload. After a long investigation, it turned out two VOLUME lines in a Dockerfile — leftovers that hadn’t been relevant in years — were causing containerd to copy gigabytes of data on every container start.
The Setup
We had a service that shipped ML model data baked into its Docker image — about 500MB of serialized model files at /mnt/models. The image also declared a VOLUME at that path, a leftover from the old Docker-era --volumes-from data container pattern:
FROM python:3.11-slim
# ... build steps ...
ADD models/ /mnt/models
VOLUME ["/mnt/models"]
In Kubernetes, the actual volume mounting is handled by the kubelet — pod specs define emptyDir, hostPath, PVC, etc. The VOLUME instruction in the Dockerfile doesn’t do anything useful.. or so we thought.
What VOLUME Actually Does in containerd
When containerd creates a container from an image that has a VOLUME declaration, it copies everything at that path from the image layers into a per-container directory on the node’s disk:
/var/lib/containerd/io.containerd.grpc.v1.cri/containers/<container-id>/volumes/
This happens before the container starts, whether or not Kubernetes mounts anything at that path. Containerd is faithfully implementing the Docker spec — if no external volume is mounted, the container should still see the image’s data at that path. The copy ensures that.
But in Kubernetes, nothing ever reads from this copy. The kubelet manages volumes through pod specs, not Dockerfile declarations. So containerd writes hundreds of megabytes per container to a directory that just sits there on disk doing nothing.
How This Becomes a Node Killer
On NVMe-backed instances with local storage, this is invisible. NVMe throughput is so high that even copying 500MB per container takes fractions of a second.
But when our scheduler started placing these pods on EBS-backed nodes.. things went sideways.
The math:
- Model data per container: ~500MB
- Pods bin-packed on one node: 50
- Total unnecessary writes: 50 x 500MB = ~25GB
- gp3 EBS baseline throughput: 125 MB/s
- Time to write 25GB at 125 MB/s: ~200 seconds
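The arithmetic above can be sanity-checked in a few lines of shell (numbers taken straight from the list; decimal MB/GB throughout):

```shell
# Sanity check: 50 pods x 500 MB each, drained at the
# gp3 baseline of 125 MB/s (decimal units).
per_container_mb=500
pods=50
throughput_mb_s=125

total_mb=$((per_container_mb * pods))      # 25000 MB = 25 GB
seconds=$((total_mb / throughput_mb_s))    # 200 s

echo "${total_mb} MB of unnecessary writes, ~${seconds}s to flush"
```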
That’s over 3 minutes of sustained disk I/O before a single container actually starts running. During that time, all other container operations on the node are blocked behind this I/O — new containers can’t start, health checks time out, and the kernel’s writeback subsystem gets backed up.
In our case, the I/O pressure triggered kernel panics via hung tasks in the overlay filesystem’s sync path. Nodes rebooted, containerd’s overlay snapshots got corrupted, and pods entered CrashLoopBackOff with exec format error because their binaries were now zero-byte files.
What We Thought the Problem Was
Kubernetes doesn’t have a native “data volume” concept — there’s no way to say “share these files from this image with these containers” without copying. So the pattern we used was a data container: bake the model files into an image, run an init container that copies them from the image to an emptyDir, and mount that emptyDir into the runtime container.
initContainers:
- name: data-loader
  image: my-models:latest
  command: ["cp", "-r", "/mnt/models/.", "/dest/models/"]
  volumeMounts:
  - name: model-data
    mountPath: /dest/models
containers:
- name: app
  volumeMounts:
  - name: model-data
    mountPath: /mnt/models
volumes:
- name: model-data
  emptyDir: {}
When pods started taking forever on EBS nodes, we assumed the bottleneck was this cp step — 50 init containers all reading from the same image layers on a single EBS volume at once. Shared reads from a throughput-constrained disk.. that seemed right.
But when we dug into the actual disk I/O, the writes were the problem, not the reads. And the writes were happening before any init container ran.
What Was Actually Happening
We checked the node filesystem and found the smoking gun:
$ du -sh /var/lib/containerd/io.containerd.grpc.v1.cri/containers/*/volumes/
533M /var/lib/containerd/.../containers/abc123/volumes/
533M /var/lib/containerd/.../containers/def456/volumes/
533M /var/lib/containerd/.../containers/ghi789/volumes/
# ... repeated for every single container
Every container had a full 500MB copy of the model data sitting in containerd’s volumes directory — completely separate from the emptyDir copy that the init container was making. Containerd was honoring the VOLUME declaration in the Dockerfile, duplicating the data before any of our code even ran.
The init container cp was actually fast — it reads from already-cached image layers in memory. The slow part was containerd writing 500MB x 50 containers = 25GB to EBS in the background.. completely invisible unless you go look at the node’s disk directly.
The Fix
Two lines deleted from the Dockerfile:
FROM python:3.11-slim
# ... build steps ...
ADD models/ /mnt/models
# VOLUME ["/mnt/models"] <<<<< removed
That’s it. No other changes — the ADD instruction still puts the data in the image layers, the Kubernetes volume mounts still work exactly the same, and the init containers that copy data out at runtime are unaffected.
Should You Ever Use VOLUME in a Dockerfile?
Probably not. The VOLUME instruction made sense in the early Docker days for --volumes-from data-container patterns. In Kubernetes, volumes are declared in pod specs, not Dockerfiles. So the Dockerfile VOLUME instruction just.. causes containerd to copy data on every container start for no reason. It can’t be overridden without rebuilding the image, and it doesn’t show up in kubectl describe or any Kubernetes-level tooling — so you won’t even know it’s happening.
If you’re inheriting base images, check if they declare volumes you don’t need:
$ docker inspect <image> --format '{{ .Config.Volumes }}'
map[/mnt/models:{}]
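If you want CI to catch this, a small guard along these lines works. The fail_on_volumes helper and the sample values are illustrative, not standard tooling; it expects the output of docker inspect with the json template function, which prints null for an image with no volumes:

```shell
# Hypothetical CI guard: reject images that declare any VOLUMEs.
# Pass in the output of:
#   docker inspect "$image" --format '{{ json .Config.Volumes }}'
fail_on_volumes() {
  vols="$1"
  if [ "$vols" != "null" ] && [ "$vols" != "{}" ]; then
    echo "image declares VOLUMEs: $vols" >&2
    return 1
  fi
}

fail_on_volumes 'null'   # clean image: passes
if fail_on_volumes '{"/mnt/models":{}}'; then
  echo "guard missed a leftover VOLUME" >&2
else
  echo "guard caught the leftover VOLUME"
fi
```

Wiring this into the image build pipeline means a stray VOLUME in a base image fails loudly at build time instead of silently burning node I/O at runtime.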
Final Thoughts
The fix was two lines.. finding it was the hard part. The VOLUME instruction is a Docker-era artifact that containerd still faithfully implements, and in Kubernetes it just works against you. If you’re running containers on Kubernetes — especially on EBS or any throughput-constrained storage — audit your Dockerfiles for VOLUME declarations. You might be burning gigabytes of I/O per node and not even know it.