How do I reduce a Python (Docker) image size using a multi-stage build?

I am looking for a way to create multi-stage builds with Python and a Dockerfile.

For example, using the following images:

1st image: install all compile-time requirements, and install all needed Python modules

2nd image: copy all compiled/built packages from the first image to the second, without the compilers themselves (gcc, postgres-dev, python-dev, etc.)

The final objective is to have a smaller image that contains only Python and the Python packages that I need.

In short: how can I 'wrap' all the compiled modules (site-packages / external libs) that were created in the first image, and copy them in a 'clean' manner to the 2nd image?



Solution 1:[1]

I recommend the approach detailed in this article (section 2). The author uses a virtualenv so that pip install stores all the Python code, binaries, etc. under one folder instead of spreading them all over the file system. Then it's easy to copy just that one folder to the final "production" image. In summary:

Compile image

  • Create a virtualenv at some path of your choosing.
  • Prepend that virtualenv's bin directory to PATH with Docker's ENV instruction. This is all the virtualenv needs to function in every later RUN and CMD instruction.
  • Install system dev packages and pip install xyz as usual.

Production image

  • Copy the virtualenv folder from the Compile Image.
  • Prepend the virtualenv's bin directory to PATH with Docker's ENV instruction.
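
The steps above can be sketched roughly as follows. This is only an illustration, not the article's exact Dockerfile: the base image tag, the /opt/venv path, the compile-image stage name and the requirements.txt file are all assumptions chosen for the example.

```dockerfile
# Compile image: has the compilers needed to build packages with C extensions.
FROM python:3.10-slim AS compile-image
RUN apt-get update \
 && apt-get install --no-install-recommends --assume-yes build-essential

# Create the virtualenv at an assumed path, /opt/venv.
RUN python -m venv /opt/venv
# Putting its bin directory first on PATH has the same effect as activating it
# for every later RUN and CMD instruction.
ENV PATH="/opt/venv/bin:$PATH"

# Everything pip installs now lands under /opt/venv.
COPY requirements.txt .
RUN pip install -r requirements.txt

# Production image: same base image, but without the compilers.
FROM python:3.10-slim
COPY --from=compile-image /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
```

Only the /opt/venv folder crosses from the first stage to the second, so build-essential never reaches the final image.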

Solution 2:[2]

This is a place where using a Python virtual environment inside Docker can be useful. Copying a virtual environment normally is tricky since it needs to be the exact same filesystem path on the exact same Python build, but in Docker you can guarantee that.

(This is the same basic recipe @mpoisot describes in their answer and it appears in other SO answers as well.)

Say you're installing the psycopg PostgreSQL client library. Building it from source requires the Python C development headers plus the PostgreSQL C client library headers; but to run it you only need the PostgreSQL C runtime library. So here you can use a multi-stage build: the first stage installs the virtual environment using the full C toolchain, and the final stage copies the built virtual environment but only includes the minimum required libraries.

A typical Dockerfile could look like:

# Name the single Python image we're using everywhere.
ARG python=python:3.10-slim

# Build stage:
FROM ${python} AS build

# Install a full C toolchain and C build-time dependencies for
# everything we're going to need.
RUN apt-get update \
 && DEBIAN_FRONTEND=noninteractive \
    apt-get install --no-install-recommends --assume-yes \
      build-essential \
      libpq-dev

# Create the virtual environment.
RUN python3 -m venv /venv
ENV PATH=/venv/bin:$PATH

# Install the Python library dependencies, including those with
# C extensions.  They'll get installed into the virtual environment.
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt

# Final stage:
FROM ${python}

# Install the runtime-only C library dependencies we need.
RUN apt-get update \
 && DEBIAN_FRONTEND=noninteractive \
    apt-get install --no-install-recommends --assume-yes \
      libpq5

# Copy the virtual environment from the first stage.
COPY --from=build /venv /venv
ENV PATH=/venv/bin:$PATH

# Copy the application in.  main.py must be executable and start
# with a "#!" line for this form of CMD to work.
WORKDIR /app
COPY . .
CMD ["./main.py"]

If your application uses a Python entry point script then you can do everything in the first stage: RUN pip install . will copy the application into the virtual environment and create a wrapper script in /venv/bin for you. In the final stage you don't need to COPY the application again. Set the CMD to run the wrapper script out of the virtual environment, which is already at the front of the $PATH.
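
A sketch of that entry point variant might look like the following. It assumes the project root contains a setup.py or pyproject.toml that declares a console script; the script name myapp is hypothetical.

```dockerfile
ARG python=python:3.10-slim

# Build stage: toolchain, virtual environment, and the application itself.
FROM ${python} AS build
RUN apt-get update \
 && DEBIAN_FRONTEND=noninteractive \
    apt-get install --no-install-recommends --assume-yes \
      build-essential \
      libpq-dev
RUN python3 -m venv /venv
ENV PATH=/venv/bin:$PATH

# Copy the whole project and install it into the virtual environment.
# pip also writes the wrapper script into /venv/bin.
WORKDIR /app
COPY . .
RUN pip install .

# Final stage: runtime libraries plus the copied virtual environment.
FROM ${python}
RUN apt-get update \
 && DEBIAN_FRONTEND=noninteractive \
    apt-get install --no-install-recommends --assume-yes \
      libpq5
COPY --from=build /venv /venv
ENV PATH=/venv/bin:$PATH

# "myapp" is a hypothetical console script name declared by the package.
CMD ["myapp"]
```

Note that this relies on a regular install. An editable install (pip install -e .) would leave the virtual environment pointing back at /app, which does not exist in the final stage.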

Again, note that this approach only works because it is the same Python base image in both stages, and because the virtual environment is on the exact same path. If it is a different Python or a different container path the transplanted virtual environment may not work correctly.
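
One way to catch a mismatch early is to run a check during the build of the final stage. The following is a minimal sketch of that idea, with the C toolchain and application left out for brevity; the assertion itself is an addition for illustration and not part of the recipe above.

```dockerfile
ARG python=python:3.10-slim

# Build stage: only creates the virtual environment in this sketch.
FROM ${python} AS build
RUN python3 -m venv /venv

# Final stage: same base image, same /venv path.
FROM ${python}
COPY --from=build /venv /venv
ENV PATH=/venv/bin:$PATH

# Fail the build if "python" does not resolve to the copied environment.
# On failure the assertion message shows the prefix that was actually found.
RUN python -c "import sys; assert sys.prefix == '/venv', sys.prefix"
```

A similar RUN line that imports one of the compiled packages would also surface a missing runtime C library at build time rather than at container start.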

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution      Source
Solution 1    mpoisot
Solution 2    David Maze