Including another file in Dataflow Python flex template, ImportError
Is there an example of a Python Dataflow Flex Template with more than one file where the script is importing other files included in the same folder?
My project structure is like this:
```
├── pipeline
│   ├── __init__.py
│   ├── main.py
│   ├── setup.py
│   └── custom.py
```
I'm trying to import custom.py inside main.py for a Dataflow Flex Template.
The pipeline execution fails with the following error:

```
ModuleNotFoundError: No module named 'custom'
```

The pipeline works fine if I put all of the code in a single file and don't make any imports.
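To illustrate why the plain `import custom` fails at launch time, here is a small self-contained sketch (the file name and `greet` helper are hypothetical, mirroring the layout above): Python can only import custom.py if the directory containing it is on `sys.path` when main.py runs, which is exactly what installing a proper setup.py package guarantees.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Recreate the layout: a sibling module custom.py (hypothetical content --
# the real custom.py would hold the pipeline's helper code).
workdir = Path(tempfile.mkdtemp())
(workdir / "custom.py").write_text("def greet():\n    return 'hello from custom'\n")

# Without the module's directory on sys.path, the import fails with the
# same ModuleNotFoundError the Flex Template launcher reports.
try:
    import custom  # noqa: F401
except ModuleNotFoundError as err:
    print(err)  # No module named 'custom'

# Once the directory is on sys.path (which installing a setup.py package
# achieves by copying it into site-packages), the import succeeds.
sys.path.insert(0, str(workdir))
custom = importlib.import_module("custom")
print(custom.greet())
```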
Example Dockerfile:
```dockerfile
FROM gcr.io/dataflow-templates-base/python3-template-launcher-base

ARG WORKDIR=/dataflow/template/pipeline
RUN mkdir -p ${WORKDIR}
WORKDIR ${WORKDIR}

COPY pipeline /dataflow/template/pipeline
COPY spec/python_command_spec.json /dataflow/template/

ENV DATAFLOW_PYTHON_COMMAND_SPEC /dataflow/template/python_command_spec.json

RUN pip install avro-python3 pyarrow==0.11.1 apache-beam[gcp]==2.24.0

ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py"
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/main.py"
```
Python spec file:
```json
{
  "pyFile": "/dataflow/template/pipeline/main.py"
}
```
I am deploying the template with the following command:
```shell
gcloud builds submit --project=${PROJECT} --tag ${TARGET_GCR_IMAGE} .
```
Any help is appreciated.
Solution 1:[1]
I actually solved this by passing an additional setup_file parameter to the template execution. You also need to add the setup_file parameter to the template metadata:

```
--parameters setup_file="/dataflow/template/pipeline/setup.py"
```

Apparently the line ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py" in the Dockerfile is useless and doesn't actually pick up the setup file.
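For context, a full template run command with the extra setup_file parameter might look like this (the job name, bucket path, and region here are placeholders, not from the original question):

```shell
gcloud dataflow flex-template run "pipeline-job" \
  --template-file-gcs-location "gs://YOUR_BUCKET/templates/pipeline.json" \
  --region "us-central1" \
  --parameters setup_file="/dataflow/template/pipeline/setup.py"
```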
My setup file looked like this:
```python
import setuptools

setuptools.setup(
    packages=setuptools.find_packages(),
    install_requires=[
        'apache-beam[gcp]==2.24.0'
    ],
)
```
Solution 2:[2]
After some tests I found out that, for some unknown reason, Python files in the working directory (WORKDIR) cannot be referenced with an import. But it works if you create a subfolder and move the Python dependencies into it. I tested this and it worked; for example, in your use case you can use the following structure:
```
├── pipeline
│   ├── main.py
│   ├── setup.py
│   └── mypackage
│       ├── __init__.py
│       └── custom.py
```
And you will be able to reference it with `import mypackage.custom`. The Dockerfile should move custom.py into the proper directory:
```dockerfile
RUN mkdir -p ${WORKDIR}/mypackage
RUN touch ${WORKDIR}/mypackage/__init__.py
COPY custom.py ${WORKDIR}/mypackage
```
And the dependency will be added to the Python installation directory:

```
$ docker exec -it <container> /bin/bash
# find / -name custom.py
/usr/local/lib/python3.7/site-packages/mypackage/custom.py
```
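This behavior lines up with how setuptools discovers code: `find_packages()` only picks up directories that contain an `__init__.py`, so a loose custom.py sitting next to setup.py is never installed into site-packages. A quick sketch (using hypothetical temp paths) shows the difference:

```python
import tempfile
from pathlib import Path

import setuptools

root = Path(tempfile.mkdtemp())

# Flat layout: custom.py sits directly next to setup.py.
# find_packages() ignores loose top-level modules, so nothing is packaged.
(root / "custom.py").touch()
print(setuptools.find_packages(where=str(root)))   # []

# Package layout: a subfolder with an __init__.py is discovered and installed.
pkg = root / "mypackage"
pkg.mkdir()
(pkg / "__init__.py").touch()
(pkg / "custom.py").touch()
print(setuptools.find_packages(where=str(root)))   # ['mypackage']
```

(Loose top-level modules can still be shipped by listing them explicitly via the `py_modules` argument to `setup()`.)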
Solution 3:[3]
@pavan-kumar-kattamuri asked me to post my solution, so here it is.
```dockerfile
FROM gcr.io/dataflow-templates-base/python3-template-launcher-base:flex_templates_base_image_release_20210120_RC00

ARG WORKDIR=/dataflow/template
RUN mkdir -p ${WORKDIR}
WORKDIR ${WORKDIR}

COPY requirements.txt .
# Read https://stackoverflow.com/questions/65766066/can-i-make-flex-template-jobs-take-less-than-10-minutes-before-they-start-to-pro#comment116304237_65766066
# to understand why apache-beam is not being installed from requirements.txt
RUN pip install --no-cache-dir -U apache-beam==2.26.0
RUN pip install --no-cache-dir -U -r ./requirements.txt

COPY mymodule.py setup.py ./
COPY protoc_gen protoc_gen/

ENV FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE="${WORKDIR}/requirements.txt"
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/mymodule.py"
ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py"
```
and here is my setup.py:
```python
import setuptools

setuptools.setup(
    packages=setuptools.find_packages(),
    install_requires=[],
    name="my df job modules",
)
```
Solution 4:[4]
OK, with Apache Beam 2.27 it seems that we need to follow the original practice of passing a setup_file parameter. A shame.
Solution 5:[5]
@Akshay, answering your original question with a working example for the rest of the community.
A working example of a Python Dataflow Flex Template with more than one file, where the script imports other files included in the same folder, can be found here: https://github.com/toransahu/apache-beam-eg/tree/main/python/using_flex_template_adv1
Solution 6:[6]
For me, there was no need to pass setup_file in the command that triggers the flex template. Here is my Dockerfile:
```dockerfile
FROM gcr.io/dataflow-templates-base/python38-template-launcher-base

ARG WORKDIR=/dataflow/template
RUN mkdir -p ${WORKDIR}
WORKDIR ${WORKDIR}

COPY . .

ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/main.py"
ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py"

# Install apache-beam and other dependencies to launch the pipeline
RUN pip install apache-beam[gcp]
RUN pip install -U -r ./requirements.txt
```
This is the command:

```shell
gcloud dataflow flex-template run "job_ft" \
  --template-file-gcs-location "$TEMPLATE_PATH" \
  --parameters paramA="valA" \
  --region "europe-west1"
```
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Akshay Apte |
| Solution 2 | rsantiago |
| Solution 3 | jamiet |
| Solution 4 | user1068378 |
| Solution 5 | Toran Sahu |
| Solution 6 | Idhem |
