Dockerfile best practices

When writing a Dockerfile, the possibilities are endless. Two images can serve the same purpose and behave similarly yet be structured very differently. From the choice of a base image (FROM), through the order of the commands executed during the build, to the use of intermediate images (multistage builds), there are many decisions to make.

This document collects some of the most important guidelines for reducing the time it takes to build an image, improving its security and shrinking the space it takes up.

1. Command order matters

Because of the way the build cache works, Docker can detect whether a command has already been executed in a previous build and, if so, reuse the cached result instead of running it again. The problem is that, once one of the commands has changed, none of the commands that follow it can be reused from the cache, because any of them may have been affected and could produce a different result.

This is why it is recommended to order the commands by how often they change, from least to most frequent. If we were creating an image that contains an application, for example, the code would change most often, followed by the resources and finally the dependencies. The Dockerfile should therefore handle the dependencies first, then the resources, and the code last, so that as much of the cache as possible is reused.
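As a rough sketch of this ordering, assuming a Maven project with the standard src/main/resources and src/main/java layout (the layout is an assumption, not something the guideline requires), the Dockerfile could look like this:

```dockerfile
FROM maven:3.6.3-jdk-11
WORKDIR /app

# Dependencies change least often: resolve them first so this layer
# stays cached until pom.xml itself is edited
COPY pom.xml .
RUN mvn -e -B dependency:go-offline

# Resources change more often than dependencies
COPY src/main/resources ./src/main/resources

# Code changes most often, so it comes last
COPY src/main/java ./src/main/java
RUN mvn -e -B clean package
```

With this order, editing a Java file invalidates only the last two commands, while the dependency download is served from the cache.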

2. Layer the commands together

In a Dockerfile, commands such as RUN, COPY and ADD each create a layer in the final image. It is important to group into a single command the actions that share the same logic (installing dependencies, for example), both to improve the use of the cache and to make the Dockerfile more maintainable.

However, we should bear in mind that the more actions we pack into a single command, the more work is lost when any one of them changes: the cached layer is no longer valid and the whole command has to be rerun. Therefore, it is important to study each scenario and evaluate the best way to do it.

WRONG ❌

FROM ubuntu
RUN apt update && apt install openjdk-8-jdk -y
RUN apt update && apt install vim -y

GOOD ✅

FROM ubuntu
RUN apt update && apt install openjdk-8-jdk vim -y

3. Delete the cache you do not need

Cache is good, yes, but which one? We have to understand that when building an image there are two types of cache: 1. that which is generated by Docker with the layers of our image and 2. that generated by our commands within the image itself. The first is good for improving build time, but the second probably is not.

The second type of cache is usually generated when installing dependencies or compiling an application. It is very unlikely to be used again once the image is built, so most of the time it is only taking up space.

Look at the last line of the following Dockerfile:

FROM maven:3.6.3-jdk-11
# The exec form does not expand wildcards, so a shell is used to resolve *.jar
ENTRYPOINT ["sh", "-c", "exec java -jar target/*.jar"]
COPY pom.xml .
COPY src ./src
RUN mvn -e -B clean package && rm -rf /root/.m2

It is important to point out that, in order to delete a file from the image, it is necessary that the file be created and deleted in the same command. If done in different commands, the file will appear to have disappeared, but it will still be in the layer in which we created it and it will continue consuming space.

This file still exists:

FROM busybox
RUN touch a
RUN rm a

This one does not:

FROM busybox
RUN touch a && rm a
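The same rule applies to package manager caches. The following is a minimal sketch, assuming a Debian or Ubuntu base image where apt-get stores its package lists under /var/lib/apt/lists:

```dockerfile
FROM ubuntu
# Install and clean up in the same RUN, so the downloaded package lists
# are never stored in a layer
RUN apt-get update \
    && apt-get install -y --no-install-recommends openjdk-8-jdk \
    && rm -rf /var/lib/apt/lists/*
```

Moving the rm into a separate RUN would hide the files but leave them in the earlier layer, exactly as in the busybox example above.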

If the goal is to reduce the size of the final image and we cannot delete this kind of file in the same command that created it, we can use the --squash option of docker build (an experimental feature), which merges the layers into a single one, so files deleted in later commands no longer take up space. But beware! The --squash option has other implications, such as losing the image history, so use it only when strictly necessary.

4. Choose the base image well

When choosing an image to start from, our first instinct may be to take one that has nothing but the basics (an operating system) and install everything we need on it. This may work, but it is much better in terms of security, maintainability and space to use an image from a trusted provider who has already done that work.

For example, suppose we need an image with Python 3.6 installed. We could use alpine as a base and install Python with the package manager, or use the python:3.6-alpine image, which already has Python installed and is maintained by Python developers (in addition to other things).
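A minimal sketch of the second option follows; app.py is a hypothetical script used only for illustration:

```dockerfile
# Python 3.6 is already installed in the base image,
# so no package manager step is needed
FROM python:3.6-alpine
# app.py is a placeholder for the application being packaged
COPY app.py /app.py
CMD ["python", "/app.py"]
```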

Perhaps the best example of this approach is Google's Distroless Docker images, a set of base images that contain only the dependencies needed to run your application and leave out everything else (such as package managers, shells and other commands), therefore reducing the attack surface of our containers. These images are specific to each language and the one you need may not be supported, but if it is, it will be hard to find a more secure image from which to start.
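As an illustration only, a Java application could be packaged on a Distroless base roughly as follows. The image name gcr.io/distroless/java17-debian12 and the JAR path are assumptions; check the Distroless repository for the images currently published.

```dockerfile
# Assumed image name: a Distroless Java runtime with no shell or package manager
FROM gcr.io/distroless/java17-debian12
# target/app.jar is a hypothetical JAR built beforehand on the host
COPY target/app.jar /app.jar
# The Distroless Java images are documented as using "java -jar" as their
# entrypoint, so only the JAR path is passed here
CMD ["/app.jar"]
```

Because the image has no shell, shell-form commands and docker exec sessions that rely on sh will not work in it.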

5. Specify the version of the base image

As you may have noticed, when choosing the Python image we used a tag. This is also important. For an image to be reproducible, we must choose a tag that we know will not change over time (tags like latest or slim do change, watch out!).

Actually, there is no guarantee that a tag we choose will always stay the same, regardless of whether it is a generic one like latest or a specific one like 3.6.8-alpine-slim. The best practice of all would be to choose the specific version of an image that we want to use, and use its identifier. This identifier can be obtained with the command:

docker images --format "{{.Repository}}:{{.Tag}} {{.ID}}"

For example, if I wanted the identifier of the busybox image I just pulled to my machine, I would run:

$ docker images --format "{{.Repository}}:{{.Tag}} {{.ID}}" |grep busybox
busybox:latest 83aa35aa1c79

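A local image ID like this one only resolves on a machine that already has the image; to pin a version across machines, the registry digest (shown by docker images --digests and referenced as busybox@sha256:<digest>) is the portable alternative. With the image available locally, I could use the identifier in the FROM line of my Dockerfile: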

FROM 83aa35aa1c79
CMD ["echo", "Hello!"]

6. The potential of multistage builds

When we create an image, we can generate intermediate images that serve a specific purpose (such as generating an artifact) and are then discarded, so they are not part of the final image (although the artifact they generated is). This is called a multistage build, and it is very useful in cases where we have to compile an application, for example.

Using multistage builds will make our final image lighter, and probably more secure. Notice how in the following Dockerfile we compile the application in an image which ends up not being used, and generate a JAR that we execute in the final image where we have neither JDK nor Maven.

# Build with JDK 8 so the resulting JAR matches the Java 8 runtime used below
FROM maven:3.6.3-jdk-8 AS builder
WORKDIR /app
COPY pom.xml .
RUN mvn -e -B dependency:go-offline
COPY src ./src
RUN mvn -e -B clean package

FROM adoptopenjdk:8u242-b08-jre-hotspot
COPY --from=builder /app/target/*.jar /app.jar
ENTRYPOINT ["java", "-jar", "/app.jar"]

7. User without privileges

It is considered good practice in a Dockerfile to switch the final user of the image to one that has only the privileges needed to fulfil the purpose of the image and nothing more. This makes our image more secure and reduces the risk that a process running as root in the container gains access to the host.

To do this, it is best to add a new user (and a group) and give them the permissions they need. For example:

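FROM ubuntu
# Create a system group, then a system user that belongs to it
RUN groupadd -r usergroup && useradd -r -g usergroup user
ENTRYPOINT ["sh", "/myScript.sh"]
COPY ./myScript.sh /myScript.sh
RUN chown user /myScript.sh
# From here on, including the container's main process, everything runs as user
USER user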
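A possible variation, sketched here with the same user, group and script names, sets the ownership while copying by means of the --chown flag of COPY, which avoids the extra layer created by a separate chown command:

```dockerfile
FROM ubuntu
RUN groupadd -r usergroup && useradd -r -g usergroup user
# Ownership is assigned during the copy, so no separate chown layer is needed
COPY --chown=user:usergroup ./myScript.sh /myScript.sh
USER user
ENTRYPOINT ["sh", "/myScript.sh"]
```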

8. Keep your secrets hidden

It is very common that in an image, we need to use credentials, access tokens or files with information that we do not want to share. If we pass these elements to the image using commands such as COPY or ADD, they will be visible in the image and anyone who has access to it will be able to see them.

There is a way to provide this information to our containers, called docker secret. How to implement it is too involved to cover in this document, since it depends on the way you are going to deploy the image (docker-compose, kubernetes, …). Introduction to Docker Secrets or Distribute Credentials Securely Using Secrets could be a good starting point.
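For secrets that are needed only while the image is being built, one option (not covered by the articles above, and shown here as a sketch that assumes BuildKit is enabled) is a secret mount. The id api_token is a hypothetical name chosen for this example:

```dockerfile
# syntax=docker/dockerfile:1
FROM busybox
# The secret is available at /run/secrets/api_token only while this RUN
# executes; it is not written to any layer of the image
RUN --mount=type=secret,id=api_token \
    test -s /run/secrets/api_token && echo "token available during build"
```

The value would be supplied at build time with something like docker build --secret id=api_token,src=./api_token.txt . where api_token.txt is a local file that stays out of the build context.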

9. Copy only what you need

The image we generate should contain only the files it needs. It is common to see commands like COPY . /app, which copy the entire build context to the /app directory. This is not necessarily bad, depending on the context and what we intend to do, but in many cases we may be copying files that we are not going to use or that contain confidential information.

There are two ways to avoid this:

Copy only the files that we are going to use. If there are many of them and they are not organized in directories, however, this can create too many layers.
Use .dockerignore. This file has the same syntax as .gitignore and lets us decide which files or directories are left out of the build context. See the Docker documentation on .dockerignore for more information.
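The first option can be sketched as follows for a small Python application; requirements.txt, the src directory and src/main.py are hypothetical names used only for illustration:

```dockerfile
FROM python:3.6-alpine
WORKDIR /app
# Copy only the dependency manifest, then install from it
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy only the source directory instead of the whole build context
COPY src ./src
CMD ["python", "src/main.py"]
```

Anything else in the build context (local configuration, credentials, version control data) never reaches the image because no command copies it.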

10. Copy, do not add

There are two very similar commands in a Dockerfile: COPY and ADD. The first copies files or directories from the host to the image. The second does the same, but it can also download elements from URLs or repositories and automatically extracts local tar archives (including compressed ones) into the destination. For more information on ADD, see the documentation.

Since both do the same job and ADD is more powerful, it may be tempting to use only ADD, but you should avoid it. Use COPY for most situations, which will be copying from the host, and only use ADD when you need something you cannot achieve with COPY. Using ADD without regard for the difference may carry security risks such as zip bombs.
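The difference is easiest to see with a local archive. In this sketch, site.tar.gz is a hypothetical file in the build context:

```dockerfile
FROM busybox
# COPY places the archive in the image exactly as it is on the host
COPY site.tar.gz /srv/site.tar.gz
# ADD recognizes the local tar archive and unpacks its contents
# into the target directory
ADD site.tar.gz /srv/site/
```

If the archive comes from an untrusted source, the implicit extraction performed by ADD is precisely the behavior to be wary of.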

Conclusion

Although writing a Dockerfile may seem simple, it is important to follow certain recommendations that will make our building process run faster and make the resulting image smaller and more secure.

In this article we have reviewed some of the most important points, which at the same time, in most cases, are very easy to follow. You can find more tips like these in the official documentation.