How to run a PySpark job with Airflow in a dockerized environment
I followed the official Airflow Docker guide. It works fine for most of the simple jobs I have.
I tried to use this guide; for that, I needed to add the following line to the .env file:

```
_PIP_ADDITIONAL_REQUIREMENTS=pyspark xlrd apache-airflow-providers-apache-spark
```

Unfortunately, the DAG is not being loaded.
The problem seems to be related to JAVA_HOME, because the Docker output shows this message:

```
airflow-scheduler_1 | is not set
```
In the Airflow web GUI it shows the following error:

```
Broken DAG: [/opt/airflow/dags/SparkETL.py] Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.7/site-packages/pyspark/context.py", line 339, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "/home/airflow/.local/lib/python3.7/site-packages/pyspark/java_gateway.py", line 108, in launch_gateway
    raise RuntimeError("Java gateway process exited before sending its port number")
RuntimeError: Java gateway process exited before sending its port number
```
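For context, a DAG shaped roughly like the sketch below hits exactly this error while the scheduler is still parsing the file. This is a minimal illustration only: the real SparkETL.py is not reproduced here, and every identifier in the sketch is made up. The point is that calling getOrCreate() at module level makes pyspark launch its Java gateway inside the Airflow container during DAG parsing, which fails when no JDK or valid JAVA_HOME is available.

```python
# Minimal sketch only: SparkETL.py itself is not shown above, and all names
# here (dag_id, task_id, appName) are invented for illustration.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from pyspark.sql import SparkSession

# This line runs while the scheduler parses the DAG file, so pyspark tries to
# launch its Java gateway inside the Airflow container; without a JDK and a
# valid JAVA_HOME it dies with "Java gateway process exited before sending
# its port number", and the DAG shows up as broken.
spark = SparkSession.builder.appName("spark_etl").getOrCreate()


def run_etl():
    spark.range(10).show()  # placeholder for the actual ETL logic


with DAG(
    dag_id="spark_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
    PythonOperator(task_id="run_etl", python_callable=run_etl)
```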
I tried to add an `install -y openjdk-11-jdk` command in the docker-compose file, and also set `JAVA_HOME: '/usr/lib/jvm/java-11-openjdk-amd64'` there. With that change, airflow-scheduler reports that the path does not exist.
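For reference, the JAVA_HOME part of that change looked roughly like this. This is a sketch following the layout of the official docker-compose.yaml, not the exact file; the `openjdk-11-jdk` install step is indicated only as a comment because its exact form is not reproduced here.

```yaml
# Sketch only: mirrors the structure of the official docker-compose.yaml.
x-airflow-common: &airflow-common
  # ... image, volumes, depends_on as in the official compose file ...
  # plus an apt-get install -y openjdk-11-jdk step injected here
  environment:
    &airflow-common-env
    JAVA_HOME: '/usr/lib/jvm/java-11-openjdk-amd64'
    # picks up the line added to .env above
    _PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:-}
```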