How to run a PySpark job with Airflow in a dockerized environment

I followed the official Airflow docker guide. It works fine for most of the simple jobs I have.

I tried to follow this guide; for that, I needed to add this line to the .env file:

_PIP_ADDITIONAL_REQUIREMENTS=pyspark xlrd apache-airflow-providers-apache-spark
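As a sanity check that _PIP_ADDITIONAL_REQUIREMENTS actually took effect, I verified from inside a container that the packages are importable (a small stdlib sketch; the package names are the ones from my .env):

```python
import importlib.util

def installed(package):
    """True if the top-level package can be found by this interpreter."""
    return importlib.util.find_spec(package) is not None

# pyspark and xlrd come from _PIP_ADDITIONAL_REQUIREMENTS
for name in ("pyspark", "xlrd"):
    print(name, "installed:", installed(name))
```

Both packages show up as installed, so the pip step itself seems to work.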

Unfortunately, the DAG does not load.

The problem seems to be related to JAVA_HOME because the docker output shows this message:

airflow-scheduler_1  | is not set

In the Airflow web GUI it shows the following error:

Broken DAG: [/opt/airflow/dags/SparkETL.py] Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.7/site-packages/pyspark/context.py", line 339, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "/home/airflow/.local/lib/python3.7/site-packages/pyspark/java_gateway.py", line 108, in launch_gateway
    raise RuntimeError("Java gateway process exited before sending its port number")
RuntimeError: Java gateway process exited before sending its port number
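The traceback suggests that pyspark cannot find a Java runtime when the DAG is parsed. To confirm this, I ran a quick check inside the scheduler container (a stdlib sketch of the diagnostic, not part of the DAG itself):

```python
import os
import shutil

def java_visible():
    """Report whether a Java runtime is reachable from this process."""
    java_home = os.environ.get("JAVA_HOME")  # None if the variable is unset
    java_bin = shutil.which("java")          # None if java is not on PATH
    print("JAVA_HOME:", java_home)
    print("java on PATH:", java_bin)
    return java_home is not None and java_bin is not None

java_visible()
```

In my containers both come back as None, which matches the scheduler log above.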

I tried adding an install -y openjdk-11-jdk command in the docker-compose file, and also setting JAVA_HOME: '/usr/lib/jvm/java-11-openjdk-amd64' there. In that case airflow-scheduler reports that the path does not exist.
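For reference, the docker-compose change I attempted looked roughly like this (reconstructed from memory; the anchor names follow the official compose file):

```yaml
# docker-compose.yaml (excerpt) -- reconstruction of what I tried;
# the official compose file defines an x-airflow-common anchor
x-airflow-common:
  &airflow-common
  environment:
    &airflow-common-env
    JAVA_HOME: '/usr/lib/jvm/java-11-openjdk-amd64'
    # I also tried running "install -y openjdk-11-jdk" in the containers,
    # but airflow-scheduler then reports that the JAVA_HOME path does not exist
```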



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
