
Docker caching for NodeJS monorepo projects

About

Building efficient Dockerfiles is critical for ensuring you have clean, optimized, and reproducible builds. Unfortunately, while designing your Dockerfile it's easy to fall into simple traps that leave your builds spending tons of time rebuilding layers that don't actually need to be rebuilt.

This post will talk about NodeJS “monorepos”, and how to properly keep your dependencies (node_modules/**) cached while files in your application are constantly changing.

NodeJS Monorepo Setup

I’ve created a very simple monorepo at https://github.com/diranged/blog-docker-caching-with-node-projects. This repository has a pretty simple structure that you would see in many projects:

package.json
packages
packages/lib-c
packages/lib-c/package.json
packages/lib-b
packages/lib-b/package.json
packages/lib-a
packages/lib-a/package.json

In the top-level package.json, I’ve informed Yarn that we have a series of packages under the packages/* path:

# package.json
{
  "name": "my-demo-monorepo",
  "private": true,
  "workspaces": ["packages/*"]
}

Then each of the nested libraries has a really simple package definition that installs a different dependency. For example, here’s packages/lib-c/package.json:

# packages/lib-c/package.json
{
  "name": "lib-c",
  "version": "1.0.0",
  "main": "index.js",
  "dependencies": {
    "moment": "^2.29.4"
  }
}
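
Only lib-c is shown in the post, but based on the yarn list output further down, the other two packages follow the same pattern with different dependencies. For illustration, lib-a might look something like this (the exact dependency and version are assumptions):

# packages/lib-a/package.json (illustrative)
{
  "name": "lib-a",
  "version": "1.0.0",
  "main": "index.js",
  "dependencies": {
    "axios": "^1.11.0"
  }
}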

With this environment in place, a single yarn install will automatically install the dependencies for every packages/lib-*/package.json (see the yarn.lock for details):

% yarn install
yarn install v1.22.22
[1/4] 🔍  Resolving packages...
[2/4] 🚚  Fetching packages...
[3/4] 🔗  Linking dependencies...
[4/4] 🔨  Building fresh packages...

success Saved lockfile.
✨  Done in 1.16s.
% yarn list
yarn list v1.22.22
...
├─ axios@1.11.0
│  └─ ...
├─ lodash@4.17.21
├─ moment@2.30.1
...
└─ proxy-from-env@1.1.0
✨  Done in 0.05s.

Example of a “bad” Dockerfile

In Dockerfile.bad I have laid out a structure that at first glance seems like it works:

################################################################################
# Common base layer
################################################################################
FROM node:20-alpine AS base
WORKDIR /app

################################################################################
# An easy trap to fall into ... thinking that you can do a yarn install in one
# stage and have that cached, and then not re-build later stages.
################################################################################
FROM base AS installer
WORKDIR /app
COPY . /app
RUN yarn install
RUN echo "If this ran, cache was invalidated inside the installer stage"

###############################################################################
# Now you might think that the install is cached ... and that code changes
# won't invalidate the cache.
###############################################################################
FROM base
COPY --from=installer /app /app
RUN echo "If this ran, then the installer cache has invalidated the final stage cache"

On the first docker build, we can see that all of the layers need to be built:

% BUILDKIT_PROGRESS=plain docker build . --file Dockerfile.bad
#0 building with "orbstack" instance using docker driver

#1 [internal] load build definition from Dockerfile.bad
#1 transferring dockerfile: 1.11kB done
#1 DONE 0.0s

#2 [internal] load metadata for docker.io/library/node:20-alpine
#2 DONE 0.7s

#3 [internal] load .dockerignore
#3 transferring context: 55B done
#3 DONE 0.0s

#4 [internal] load build context
#4 transferring context: 52.36kB done
#4 DONE 0.0s

#5 [base 1/2] FROM docker.io/library/node:20-alpine@sha256:df02558528d3d3d0d621f112e232611aecfee7cbc654f6b375765f72bb262799
#5 resolve docker.io/library/node:20-alpine@sha256:df02558528d3d3d0d621f112e232611aecfee7cbc654f6b375765f72bb262799 0.0s done
...
#5 sha256:daf846a830553a0ff809807b7f2d956dbd9dcb959c875d23b6feb3d3aecdecef 0B / 42.67MB 0.2s
#5 DONE 2.5s

#6 [base 2/2] WORKDIR /app
#6 DONE 0.2s

#7 [installer 1/4] WORKDIR /app
#7 DONE 0.0s

#8 [installer 2/4] COPY . /app
#8 DONE 0.0s

#9 [installer 3/4] RUN yarn install
#9 0.252 yarn install v1.22.22
#9 0.285 [1/4] Resolving packages...
#9 0.304 [2/4] Fetching packages...
#9 1.122 [3/4] Linking dependencies...
#9 1.287 [4/4] Building fresh packages...
#9 1.291 success Saved lockfile.
#9 1.293 Done in 1.04s.
#9 DONE 1.3s

#10 [installer 4/4] RUN echo "If this ran, cache was invalidated inside the installer stage"
#10 0.115 If this ran, cache was invalidated inside the installer stage
#10 DONE 0.1s

#11 [stage-2 1/2] COPY --from=installer /app /app
#11 DONE 0.1s

#12 [stage-2 2/2] RUN echo "If this ran, then the installer cache has invalidated the final stage cache"
#12 0.091 If this ran, then the installer cache has invalidated the final stage cache
#12 DONE 0.1s

#13 exporting to image
#13 exporting layers
#13 exporting layers 0.1s done
#13 writing image sha256:b98364c03053a5b34eb2e882a342c2eb6fee170a7bd8d919d0531c7d1dc0b06d done
#13 DONE 0.1s
%

Now, re-running the build without changing any files, we can see it does seem to be cached:

% BUILDKIT_PROGRESS=plain docker build . --file Dockerfile.bad
#0 building with "orbstack" instance using docker driver

#1 [internal] load build definition from Dockerfile.bad
#1 transferring dockerfile: 1.11kB done
#1 DONE 0.0s

#2 [internal] load metadata for docker.io/library/node:20-alpine
#2 DONE 0.3s

#3 [internal] load .dockerignore
#3 transferring context: 55B done
#3 DONE 0.0s

#4 [base 1/2] FROM docker.io/library/node:20-alpine@sha256:df02558528d3d3d0d621f112e232611aecfee7cbc654f6b375765f72bb262799
#4 DONE 0.0s

#5 [internal] load build context
#5 transferring context: 4.78kB done
#5 DONE 0.0s

#6 [installer 4/4] RUN echo "If this ran, cache was invalidated inside the installer stage"
#6 CACHED

#7 [stage-2 1/2] COPY --from=installer /app /app
#7 CACHED

#8 [base 2/2] WORKDIR /app
#8 CACHED

#9 [installer 3/4] RUN yarn install
#9 CACHED

#10 [installer 1/4] WORKDIR /app
#10 CACHED

#11 [installer 2/4] COPY . /app
#11 CACHED

#12 [stage-2 2/2] RUN echo "If this ran, then the installer cache has invalidated the final stage cache"
#12 CACHED

#13 exporting to image
#13 exporting layers done
#13 writing image sha256:b98364c03053a5b34eb2e882a342c2eb6fee170a7bd8d919d0531c7d1dc0b06d done
#13 DONE 0.0s
%

Now here’s the problem… most application repositories do not change their dependencies nearly as often as the actual application code changes. So when application code (or even some unrelated file like a README) is changed, we want to avoid re-installing all of the dependencies. Let’s see what happens if we touch an unrelated file and re-run the build:

# First, we'll touch a totally unrelated file that just so happens to be in the Docker build context.
% echo $(date) > tmp/trigger

# Now we re-run the build..
% BUILDKIT_PROGRESS=plain docker build . --file Dockerfile.bad
#0 building with "orbstack" instance using docker driver

#1 [internal] load build definition from Dockerfile.bad
#1 transferring dockerfile: 1.11kB done
#1 DONE 0.0s

#2 [internal] load metadata for docker.io/library/node:20-alpine
#2 DONE 0.3s

#3 [internal] load .dockerignore
#3 transferring context: 55B done
#3 DONE 0.0s

#4 [base 1/2] FROM docker.io/library/node:20-alpine@sha256:df02558528d3d3d0d621f112e232611aecfee7cbc654f6b375765f72bb262799
#4 DONE 0.0s

#5 [internal] load build context
#5 transferring context: 4.82kB done
#5 DONE 0.0s

#6 [base 2/2] WORKDIR /app
#6 CACHED

#7 [installer 1/4] WORKDIR /app
#7 CACHED

#8 [installer 2/4] COPY . /app
#8 DONE 0.0s                         <<<<<< The cache has been invalidated

#9 [installer 3/4] RUN yarn install  <<<<<< The `yarn install` is being re-run now
#9 0.290 yarn install v1.22.22
#9 0.328 [1/4] Resolving packages...
#9 0.347 [2/4] Fetching packages...
#9 1.194 [3/4] Linking dependencies...
#9 1.362 [4/4] Building fresh packages...
#9 1.365 success Saved lockfile.
#9 1.367 Done in 1.08s.
#9 DONE 1.4s

#10 [installer 4/4] RUN echo "If this ran, cache was invalidated inside the installer stage"
#10 0.114 If this ran, cache was invalidated inside the installer stage
#10 DONE 0.1s                        <<<<<<  We can see we've fully invalidated the "installer" stage

#6 [base 2/2] WORKDIR /app
#6 CACHED

#11 [stage-2 1/2] COPY --from=installer /app /app
#11 DONE 0.1s

#12 [stage-2 2/2] RUN echo "If this ran, then the installer cache has invalidated the final stage cache"
#12 0.091 If this ran, then the installer cache has invalidated the final stage cache
#12 DONE 0.1s                        <<<<<<  And now in the final application stage, we can see that we re-ran our fake build command

#13 exporting to image
#13 exporting layers 0.1s done
#13 writing image sha256:1da5edf25b9a75945992940ef7d4b1811f4b957fbcf5237ce4f74bcb170dc4ec done
#13 DONE 0.1s
%

What went wrong?

If we look carefully at Dockerfile.bad, we can see that any change to any file in the Docker build context is going to trigger invalidation beginning at the COPY . /app instruction:

FROM base AS installer
WORKDIR /app
COPY . /app          <<<<<< Any file change is going to cause this to be invalidated
RUN yarn install
RUN echo "If this ran, cache was invalidated inside the installer stage"

As soon as the COPY . /app is invalidated, the RUN yarn install is also invalidated and must be re-run, which is what we wanted to avoid.

Making matters even worse, the odds that the RUN yarn install will output exactly the same bytes run after run are very low… so when that command is re-run, it in turn invalidates the next stage:

FROM base
COPY --from=installer /app /app
RUN echo "If this ran, then the installer cache has invalidated the final stage cache"

If you look closely at the output, you can see that not only did the installer stage get invalidated, but the final application stage was invalidated as well!

...

#10 [installer 4/4] RUN echo "If this ran, cache was invalidated inside the installer stage"
#10 0.114 If this ran, cache was invalidated inside the installer stage
#10 DONE 0.1s                        <<<<<<  We can see we've fully invalidated the "installer" stage
...
#12 [stage-2 2/2] RUN echo "If this ran, then the installer cache has invalidated the final stage cache"
#12 0.091 If this ran, then the installer cache has invalidated the final stage cache
#12 DONE 0.1s                        <<<<<<  And now in the final application stage, we can see that we re-ran our fake build command
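
A quick way to see that the final image itself changed is to build twice, touching only the unrelated trigger file in between, and compare the image IDs printed by docker build --quiet (an assumed verification workflow, not something from the demo repository):

% docker build -q . --file Dockerfile.bad
% echo $(date) > tmp/trigger
% docker build -q . --file Dockerfile.bad   # a different image ID here means the final stage was rebuilt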

This invalidation of the final build stage can be extremely costly depending on the size of your NodeJS project… we have several projects that take 30-40 minutes to build, and a bug like this can cause those 30-40 minute rebuilds to occur when we’ve actually made no changes to the underlying application (a README update, a script update, etc.).

A “good” Dockerfile that avoids intermediate stage rebuilds

The fix here is subtle… what we need to do is isolate changes to the package.json and yarn.lock files and only re-run the installation step when those files change. However, because it’s a monorepo, we want to find these files dynamically so that as our team members add new packages, they don’t have to remember to touch the Dockerfile every time. How do we do that?

The simple answer here is that we need more build stages… we need a stage that dynamically finds all of the package.json and yarn.lock files, and then a separate stage that runs the yarn install, and finally a stage that runs our application build.

Let’s now look at Dockerfile.good:

################################################################################
# Start off with a common base image - this is primarily to make sure that all
# of our yarn steps (install, cache, etc.) are run in the same environment.
################################################################################
FROM node:20-alpine AS base
WORKDIR /app

###############################################################################
# In the first stage, we have to dynamically find all of the package.json and
# yarn.lock files in our repository - while excluding anything found in the
# nested node_modules directory.
#
# Hint: Definitely add "node_modules" to your .dockerignore file to avoid even
# copying that path into the build context.
###############################################################################
FROM base AS package_parser
COPY . /app
RUN mkdir /out
RUN find . \
        \( -name "package.json" -o -name "yarn.lock" \) \
        -not -path "*/node_modules/*" \
        -exec cp --parents {} /out \;
RUN echo "Discovered Package Files:" && find /out
RUN echo "If this ran, cache was invalidated inside the installer stage"

###############################################################################
# Now we separate out the yarn install step into its own stage. This is the
# critical piece ... pulling this out into its own stage means that it only
# has its cache invalidated if the "COPY --from=package_parser ..." step changes
# its output. Otherwise, the yarn install is cached and not re-run on a build
# where some other unrelated file is changed.
###############################################################################
FROM base AS package_installer
COPY --from=package_parser /out /app
RUN --mount=type=cache,target=/usr/local/share/.cache yarn install

###############################################################################
# Application Stage - Do your application build here...
###############################################################################
FROM base
COPY --from=package_installer /app /app
RUN echo "If this ran, then the installer cache has invalidated the final stage cache"

With Dockerfile.good in place, we can re-run our test case from above: we’ll touch the trigger file again and see which steps rebuild:

# Let's touch our trigger file again
% echo $(date) > tmp/trigger

# Now we run our build
% docker build . --file Dockerfile.good
#0 building with "orbstack" instance using docker driver

#1 [internal] load build definition from Dockerfile.good
#1 transferring dockerfile: 2.21kB done
#1 DONE 0.0s

#2 [internal] load metadata for docker.io/library/node:20-alpine
#2 DONE 0.3s

#3 [internal] load .dockerignore
#3 transferring context: 55B done
#3 DONE 0.0s

#4 [base 1/2] FROM docker.io/library/node:20-alpine@sha256:df02558528d3d3d0d621f112e232611aecfee7cbc654f6b375765f72bb262799
#4 DONE 0.0s

#5 [internal] load build context
#5 transferring context: 4.82kB 0.0s done
#5 DONE 0.0s

#6 [base 2/2] WORKDIR /app
#6 CACHED

#7 [package_parser 1/5] COPY . /app
#7 DONE 0.0s

#8 [package_parser 2/5] RUN mkdir /out
#8 DONE 0.1s

#9 [package_parser 3/5] RUN find .         ( -name "package.json" -o -name "yarn.lock" )         -not -path "*/node_modules/*"         -exec cp --parents {} /out ;
#9 DONE 0.1s

#10 [package_parser 4/5] RUN echo "Discovered Package Files:" && find /out
#10 0.092 Discovered Package Files:
#10 0.093 /out
#10 0.093 /out/package.json
#10 0.093 /out/packages
#10 0.093 /out/packages/lib-a
#10 0.093 /out/packages/lib-a/package.json
#10 0.093 /out/packages/lib-b
#10 0.093 /out/packages/lib-b/package.json
#10 0.093 /out/packages/lib-c
#10 0.093 /out/packages/lib-c/package.json
#10 0.093 /out/yarn.lock
#10 DONE 0.1s

#11 [package_parser 5/5] RUN echo "If this ran, cache was invalidated inside the installer stage"
#11 0.145 If this ran, cache was invalidated inside the installer stage       <<<<<< This is OK - we invalidated our package parsing stage.. but the final output stays the same.
#11 DONE 0.2s

#12 [package_installer 1/2] COPY --from=package_parser /out /app
#12 CACHED                                                                    <<<<<< Now in the installer stage, we are still cached, so there is no yarn install!

#13 [package_installer 2/2] RUN --mount=type=cache,target=/usr/local/share/.cache yarn install
#13 CACHED

#14 [stage-3 1/2] COPY --from=package_installer /app /app
#14 CACHED

#15 [stage-3 2/2] RUN echo "If this ran, then the installer cache has invalidated the final stage cache"
#15 CACHED

#16 exporting to image
#16 exporting layers done
#16 writing image sha256:583d977647355590e3cf84a382e19a13096db97ae9cce33c7c86ecbd55d32b77 done
#16 DONE 0.0s
%

How does this work?

The “good” Dockerfile has a few important changes that work together:

Dynamic package.json and yarn.lock discovery

In the package_parser stage, we copy the entire application context in (COPY . /app), but then we search for the files we care about with a RUN find ... command and copy those files into a temporary directory /out. By doing this, we isolate the files that change infrequently, and allow them to be cached in the next stage.
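
If you want to sanity-check what that stage will collect before running a build, you can run the same find expression locally from the repository root (a sketch that only lists the matches rather than copying them; output order may vary):

% find . \
    \( -name "package.json" -o -name "yarn.lock" \) \
    -not -path "*/node_modules/*"
./package.json
./yarn.lock
./packages/lib-a/package.json
./packages/lib-b/package.json
./packages/lib-c/package.json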

Installation in a separate stage

After the package_parser stage has executed, we use COPY --from=package_parser /out /app to copy in only the infrequently changing dependency management files. By doing this, the package_installer stage’s cache is only invalidated if the contents of the /out directory change… this means that a change to /app/tmp/trigger will not change the /out files, and therefore the package_installer stage stays cached.
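
To see the other side of the trade-off, change an actual dependency and rebuild; this time the package_installer stage should re-run yarn install (an assumed verification step, with left-pad used purely as an example package):

% yarn workspace lib-c add left-pad        # updates packages/lib-c/package.json and yarn.lock
% BUILDKIT_PROGRESS=plain docker build . --file Dockerfile.good

Note also the --mount=type=cache,target=/usr/local/share/.cache on the install step: when running as root, Yarn 1.x keeps its download cache under /usr/local/share/.cache/yarn, so even when the install does re-run, previously fetched packages can be served from the BuildKit cache mount instead of being downloaded again.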

Final Thoughts

There are a number of ways to approach this problem… tools like Turborepo are great and can solve these issues for you. Hopefully, though, the examples above give you a clear low-level idea of what can go wrong, and how to approach fixing it if something like Turborepo isn’t right for your situation!
