Why you should use immutable Docker tags
June 02, 2022 · by Nils Caspar · 5 min read
Many companies use tags such as `latest` for images of their Docker-based services. Those tags are usually updated automatically to point at the most recently built image from a specific Git branch. This way of managing Docker tags is commonly referred to as rolling tags. In contrast to that are immutable tags: Docker tags that, once pushed, will always point at the exact same image. Often version numbers such as `1.5.213` are used or, my preferred way, a combination of Git branch name and commit hash, e.g. `main-3035eca`.
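As a rough sketch, building and pushing an image under such a branch-plus-commit tag might look like this in a build script (the registry URL and image name are made up for illustration):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical registry and image name -- substitute your own.
REGISTRY="registry.example.com/yolo-inc"
IMAGE="web-app"

# Derive an immutable tag from the current branch and commit, e.g. main-3035eca.
BRANCH="$(git rev-parse --abbrev-ref HEAD)"
COMMIT="$(git rev-parse --short HEAD)"
TAG="${BRANCH}-${COMMIT}"

# Build and push; this tag is never reused for a different image.
docker build -t "${REGISTRY}/${IMAGE}:${TAG}" .
docker push "${REGISTRY}/${IMAGE}:${TAG}"
```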
The title of the article already gives it away, but I strongly recommend treating Docker tags as immutable for any of your own production services. To show you why, let's play through an example of an organization that uses rolling tags: meet Yolo Inc.
A bad commit is merged, built, and tagged
Let's face it: Developers are humans and humans make mistakes. Regardless of how rigorous your test automation and review processes are, bugs happen and sneak into production.
Jeffrey, a software engineer working for Yolo Inc., has his co-worker review a PR that, unbeknownst to everybody, introduces an expensive SQL query running on a very common request. Oops.
The PR is reviewed, tested in staging, and finally makes its way to the `main` branch. The trusty CI system picks up the change, builds a new Docker image, and tags it as `latest`, overwriting the previous `latest` tag.
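In a typical rolling-tag setup, the CI step amounts to something like the following (image name hypothetical); the important detail is that every build of `main` overwrites the same tag:

```bash
# Rolling tags: every successful build of main replaces the previous image
# behind the very same :latest reference.
docker build -t registry.example.com/yolo-inc/web-app:latest .
docker push registry.example.com/yolo-inc/web-app:latest
```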
The virus spreads
Even without triggering a deployment, the bad version will begin to make its way into production. Yolo Inc. uses AWS ECS with its task definition configured to always run the `latest` tag in production. Whenever ECS decides to scale up or cycle a task for whatever reason, the bad version suddenly starts running in production alongside tasks still running the previous `latest` tag.
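A minimal sketch of the relevant part of such a setup, with a made-up task definition family and image name, could be registered roughly like this; note that the image reference ends in the rolling `:latest` tag:

```bash
# Container definition whose image reference is the rolling :latest tag.
cat > container-defs.json <<'EOF'
[
  {
    "name": "web-app",
    "image": "registry.example.com/yolo-inc/web-app:latest",
    "essential": true,
    "memory": 512
  }
]
EOF

# Register the task definition; any task started from it pulls whatever
# :latest happens to point at in that moment.
aws ecs register-task-definition \
  --family yolo-web-app \
  --container-definitions file://container-defs.json
```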
Since the bad version in this case will increase response time, the system will continue scaling up and run more and more tasks using the Docker image with Jeffrey's bad commit. This in turn increases the number of requests running the expensive SQL query which drives up the response time again. Rinse and repeat.
Have you tried turning it off and on again?
At this point Jeffrey and his co-workers notice that something is wrong. The CEO starts sending angry messages on Slack and demands to know what the heck is going on.
"Have you tried turning it off and on again" is often the default response to any kind of trouble in the production environment and as such, Jeffrey decides to test his luck by just redeploying the current task definition. "That's the same version that's already been running in production, right?", Jeffrey murmurs to himself.
This, of course, makes things worse, as now the last holdouts running the previous `latest` tag are cycled out of production and replaced with tasks running the new, buggy `latest` version.
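Jeffrey's "redeploy" roughly corresponds to forcing a new deployment of the existing service, something along these lines (cluster and service names are made up):

```bash
# Force ECS to replace all running tasks; with a rolling :latest tag,
# every replacement task pulls whatever :latest currently points at.
aws ecs update-service \
  --cluster production \
  --service yolo-web-app \
  --force-new-deployment
```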
Time for a rollback
At this point Jeffrey realizes that `latest` is a rolling tag and things must have gone awry due to a recent commit. But which one? Sandra and Ben have been committing a bunch of suspicious things as well, right around the time when things started breaking. When exactly was the `latest` tag last pointing at a working build?
After an hour of looking through commits, staring at metrics, and parsing logs, Jeffrey's bad commit is finally identified as the culprit. A revert is merged to the `main` Git branch. Oh, but what's that? The CI system is down for maintenance and as such the image can't be built automatically. Darn, I guess that's just our luck today.
Fortunately there is a runbook for this situation. Of course it is slightly outdated since it hasn't been tested in a couple of moons, but after reading some tutorials on how to use the `docker buildx` CLI command, Jeffrey manages to push and overwrite the `latest` tag with a fixed version. And, after instructing ECS to cycle all tasks again, response times finally normalize and the incident is declared closed. The CEO is furious at this point since it took multiple hours to get back to a working state.
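The manual rebuild from the runbook would boil down to something like this (platform and image name assumed for illustration):

```bash
# Build the fixed image locally and push it, overwriting the rolling :latest tag.
docker buildx build \
  --platform linux/amd64 \
  -t registry.example.com/yolo-inc/web-app:latest \
  --push .
```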
How would immutable tags have helped?
While immutable tags would not have fully prevented the incident, they would have drastically simplified the debugging and rollback process.
- First of all, the new version would not have started rolling out automatically. ECS would be configured to run a specific immutable tag at all times – this ensures that during scaling operations the last known-good version is used instead of whatever was recently pushed.
- After the bad version was explicitly deployed by updating the task definition, it would have been simple to identify which version was the last good one.
- Rolling back would not have required updating the Docker image repository at all. In fact, it would have been as simple as instructing ECS to run the previous task definition, which was pointing at the previous immutable Docker tag (see the sketch below).
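With immutable tags, each task definition revision pins an exact image, so a rollback is just pointing the service back at an earlier revision, roughly like this (cluster, service, family, and revision number are hypothetical):

```bash
# Revision 41 (hypothetical) references the last known-good immutable tag,
# e.g. an earlier main-<commit> build. No image needs to be rebuilt or re-pushed.
aws ecs update-service \
  --cluster production \
  --service yolo-web-app \
  --task-definition yolo-web-app:41
```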
I hope Jeffrey and his team at Yolo Inc. identified rolling tags as a major problem during their postmortem of the outage – and I hope you did as well, without having to experience a costly outage first!