Perfecting my recipe for Java and Docker

I maintain a lot of Docker builds: Hadoop, Livy, Hive. I have a few broad challenges:

  • I bounce between being a developer and being a user. At times I want something off the shelf with minor customization. Other times I want to patch and rebuild.
  • The applications can be fairly sizable in terms of dependencies.

One of the biggest challenges I have run into is complications involving Docker layers. The idea behind Docker layers is that if you take a given Dockerfile:

RUN mkdir /abc
RUN mkdir /def

and place the frequently changed things closer to the bottom, then the upper layers are reusable. This is true, but there are big challenges.

  1. In a large Java project that you would roll into an assembly JAR, a one-line change to a single file out of 1000 source files invalidates the entire JAR.
  2. Building requires many packages that do not make it into the final product (the ~/.m2/repository), and they periodically invalidate. My .m2 is 802 MB.
  3. Build processes frequently retouch files, changing timestamps, and scratch directories change. Docker can't (easily) tell which changes do not matter.

If at this point you just want the answer, jump to the Solution section. For the brave who want to go step by step, let's look more deeply at the problem:

BuildKit and --mount=type=cache

Docker has cache volumes, which on the surface seem perfect. You can use them across containers and they 'cache' stuff, so it sounds like that will just magic the problem away.

$ cat Dockerfile
...
...
RUN git clone https://github.com/edwardcapriolo/deliverance.git
RUN --mount=type=cache,target=/root/.m2 cd /build/deliverance && mvn install -Dmaven.test.skip.exec=true -Dgpg.skip=true

$ cat run.sh
DOCKER_BUILDKIT=1 docker build \
--target deliverance \
-t deliverance .

Again this looks like it will work, but it really doesn't work in practice. Why?

Because Docker keys the "git clone" step on the instruction text, and that string is the same even when the repo has changed, so the cached layer is reused. But each time we change source code we want to rebuild, so we have to pass --no-cache and invalidate everything. Doing that on every source code change has no benefit at all. In fact it's worse, because the cache volume is "pig-ish" on disk.
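
To make the "it is just a string" point concrete, here is a toy model of a layer cache key. It is a deliberate simplification, not BuildKit's real algorithm: the helper layer_key is hypothetical and simply hashes the parent key together with the instruction text.

```shell
#!/bin/bash
# Toy model only: a layer key is the hash of the parent key plus the
# instruction text. BuildKit's real cache keys are more involved.
layer_key() {
  local parent="$1" instruction="$2"
  printf '%s\n%s\n' "$parent" "$instruction" | sha256sum | cut -d' ' -f1
}

base=$(layer_key "" "FROM ecapriolo/jdk-25:0.0.1")

# Same instruction text gives the same key, whatever was pushed upstream.
monday=$(layer_key "$base" "RUN git clone https://github.com/edwardcapriolo/deliverance.git")
friday=$(layer_key "$base" "RUN git clone https://github.com/edwardcapriolo/deliverance.git")

# Putting the commit into the instruction makes the key follow the source.
pinned=$(layer_key "$base" "RUN git checkout 610e5fd74af857c2082527d6f05b38f8771d6be1")

[ "$monday" = "$friday" ] && echo "clone layer: cache hit, possibly stale source"
[ "$monday" != "$pinned" ] && echo "pinned layer: different key"
```

In this model the clone step can never notice a new commit, because nothing in its text changes. Anything that should trigger a rebuild has to show up in the instruction or its arguments.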

When you are running Maven locally on your laptop, the .m2 is "multi-mount": all your builds can share it. If something is really wrong you can wipe it (rm -rf ~/.m2/repository) and start over, but generally the artifacts don't change, so it's a great cache.

Multi-stage builds

One way to deal with this problem is a multi-stage build. If one Dockerfile won't work, why not two? This does work for the basic system packages like GCC. It helps, but only solves some of the problem. That dreaded build line: put it in the top file or the bottom, and each time you change the Java code something gets invalidated. Drats!

FROM ecapriolo/jdk-25:0.0.1 AS deliverance-base
RUN apk add git
RUN apk add maven

#native module
RUN apk add curl jq unzip bash make clang

RUN mkdir /build
WORKDIR /build
RUN cd /build
RUN git clone https://github.com/edwardcapriolo/deliverance.git
...
RUN --mount=type=cache,target=/root/.m2 cd /build/deliverance && mvn install -Dmaven.test.skip.exec=true -Dgpg.skip=true

3 stages and build arg to the rescue

The key to solving the problem is introducing a build-arg and linking it to a git commit. I try not to make my Docker and bash hideous Frankenstein mutants, but the solution cannot be pure Dockerfile. I make sure I keep the bash really, really small: a simple harness to generate a Dockerfile.

SHA=610e5fd74af857c2082527d6f05b38f8771d6be1

DOCKER_BUILDKIT=1 docker build \
--target deliverance-base \
-t deliverance-base .

DOCKER_BUILDKIT=1 docker build \
--build-arg REPO_COMMIT_SHA=$SHA \
--target deliverance-sha \
-t deliverance-sha .

The reason this build-arg is important is that the SHA is now part of the build, so a new commit changes the step and Docker reruns it. You can try to auto detect it (git/cur), but here I changed it by hand. Let's move on to the final step!
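
If you would rather not paste the SHA by hand, one possible approach is to ask the remote for it. This is a sketch, not what the harness above does: the function names sha_from_ls_remote and resolve_sha are made up for illustration, and it assumes git is installed and the remote is reachable.

```shell
#!/bin/bash
# Print the first column (the commit id) of the first line of
# `git ls-remote` output read from stdin.
sha_from_ls_remote() {
  awk 'NR == 1 { print $1 }'
}

# Ask the remote which commit a ref points at, without cloning.
# Usage: resolve_sha <repo-url> [ref]   (ref defaults to HEAD)
resolve_sha() {
  git ls-remote "$1" "${2:-HEAD}" | sha_from_ls_remote
}

# Example (needs network access):
#   SHA=$(resolve_sha https://github.com/edwardcapriolo/deliverance.git)
#   DOCKER_BUILDKIT=1 docker build --build-arg REPO_COMMIT_SHA=$SHA \
#     --target deliverance-sha -t deliverance-sha .
```

Setting it by hand keeps builds reproducible. Auto detection trades that for convenience, so it is probably best kept for development builds.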

Solution

For the first base layer we get all the system stuff that changes rarely. We check out our source code. We don't run "install"; instead we use a different command that only pulls what is needed for compilation from Maven.

$ mvn dependency:go-offline

#!/bin/bash -e

# Quote EOF so the shell leaves \$REPO_COMMIT_SHA alone for the Docker build to expand.
cat << 'EOF' > Dockerfile
##################STAGE1#################
FROM ecapriolo/jdk-25:0.0.1 AS deliverance-base
RUN apk add git
RUN apk add maven

#native module
RUN apk add curl jq unzip bash make clang

RUN mkdir /build
WORKDIR /build
RUN cd /build
RUN git clone https://github.com/edwardcapriolo/deliverance.git

RUN cd /build/deliverance
WORKDIR /build/deliverance

RUN --mount=type=cache,target=/root/.m2 cd /build/deliverance && mvn dependency:go-offline

This is the clever bit. This layer stays good unless we change a version in a pom! Even if it gets slightly out of sync, the only cost is that the missing libraries get pulled again on each build. 1000 IQ! This is your big savings, in time and bandwidth, from not constantly pulling all the jars on every build.
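
One way to notice when the poms have drifted is to fingerprint them. The sketch below is an optional add-on, not part of the harness: pom_digest is a hypothetical helper, and it assumes you have a local checkout of the project to point it at.

```shell
#!/bin/bash
# Digest of every pom.xml under a checkout. Only the paths and contents of
# pom.xml files are hashed, so editing .java files or touching timestamps
# leaves the digest unchanged.
pom_digest() {
  local root="${1:-.}"
  ( cd "$root" &&
    find . -name pom.xml -type f | LC_ALL=C sort |
    while IFS= read -r f; do sha256sum "$f"; done |
    sha256sum | cut -d' ' -f1 )
}

# Hypothetical usage: remember the last digest and refresh the base stage
# only when it moves.
#   new=$(pom_digest ~/src/deliverance)
#   [ "$new" = "$(cat .pom_digest 2>/dev/null)" ] || echo "poms changed: rebuild deliverance-base"
#   echo "$new" > .pom_digest
```

When the digest moves, that is the cue to rebuild deliverance-base so that go-offline runs against the new poms. Ordinary source changes never touch it.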

Next, we use our build-arg. We pick a commit SHA and pass it into the build. The stage checks out that version, and Docker invalidates that step because it is indeed a different set of source files. 1000 IQ.

##################STAGE2#################
FROM deliverance-base AS deliverance-sha
ARG REPO_COMMIT_SHA=LATEST
# Fetch first: the clone cached in deliverance-base may predate the requested commit.
RUN cd /build/deliverance && git fetch origin && git checkout $REPO_COMMIT_SHA
RUN --mount=type=cache,target=/root/.m2 cd /build/deliverance && mvn install -Dmaven.test.skip.exec=true -Dgpg.skip=true

Finally, we go to our third stage. We take what we need from the second stage and copy it in. This third stage is only as large as the runtime. None of the packages needed to build the native code are here, and none of the packages needed for the Maven build are here.

 

##################STAGE3#################
FROM ecapriolo/jdk-25:0.0.1 AS deliverance
RUN mkdir /deliverance
RUN apk add bash

RUN addgroup -S deliverance && adduser -S -G deliverance -H -D deliverance
RUN mkdir /deliverance/logs && chown deliverance:deliverance /deliverance/logs
COPY --from=deliverance-sha /build/deliverance/web/target/web-0.0.4-SNAPSHOT.jar /deliverance/web.jar
COPY entry_point.sh /deliverance/entry_point.sh
COPY inlinerules.json /deliverance/inlinerules.json
COPY simple.properties /deliverance/

RUN chmod u+x /deliverance/entry_point.sh
WORKDIR /deliverance
USER deliverance
ENTRYPOINT ["/deliverance/entry_point.sh"]
EOF

The last piece is to kick it all off:

SHA=610e5fd74af857c2082527d6f05b38f8771d6be1

DOCKER_BUILDKIT=1 docker build \
--target deliverance-base \
-t deliverance-base .

DOCKER_BUILDKIT=1 docker build \
--build-arg REPO_COMMIT_SHA=$SHA \
--target deliverance-sha \
-t deliverance-sha .

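# Pass the same SHA again so the deliverance-sha stage is reused from cache.
DOCKER_BUILDKIT=1 docker build \
--build-arg REPO_COMMIT_SHA=$SHA \
--target deliverance \
-t deliverance .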
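
The three invocations can also be folded into a small function so the SHA is passed in one place. This is a sketch under assumptions: build_stage, build_all and the DRY_RUN switch are names invented here, and the stage names match the generated Dockerfile.

```shell
#!/bin/bash
# Build one stage of the generated Dockerfile. With DRY_RUN=1 the command
# is printed instead of executed, which makes the harness easy to check.
build_stage() {
  local target="$1" sha="${2:-}"
  local cmd=(docker build --target "$target" -t "$target")
  if [ -n "$sha" ]; then
    cmd+=(--build-arg "REPO_COMMIT_SHA=$sha")
  fi
  cmd+=(.)
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "${cmd[*]}"
  else
    DOCKER_BUILDKIT=1 "${cmd[@]}"
  fi
}

# Build base, then the SHA-pinned stage, then the runtime image.
# Stops at the first stage that fails.
build_all() {
  local sha="$1"
  build_stage deliverance-base &&
  build_stage deliverance-sha "$sha" &&
  build_stage deliverance "$sha"
}

# Example:
#   DRY_RUN=1 build_all 610e5fd74af857c2082527d6f05b38f8771d6be1
```

Running with DRY_RUN=1 first is a cheap way to confirm that the SHA reaches the stages that need it before spending time on a real build.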

The results: 

alpine-builder0:/data/alpine/deliverence/docker$ docker image list
REPOSITORY         TAG      IMAGE ID       CREATED          SIZE
deliverance        latest   caaf05160828   54 minutes ago   419MB
deliverance-sha    latest   1daf76c6c287   55 minutes ago   1.07GB
deliverance-base   latest   eb4071d013d5   56 minutes ago   897MB

Winning. Small final image. Fast incremental builds. 
