Can't run PySpark in an Airflow DAG with docker-compose
I downloaded https://github.com/puckel/docker-airflow, tried to start it, and that worked fine. But I need to use PySpark and I have no idea how to set it up correctly. I have seen many answers, but none of them work for me. So now I have two variants of docker-compose.
The first one:
version: '3.7'
services:
  jdkSetup:
    image: openjdk:9
    command:
      - sh
      - '-c'
      - 'echo $JAVA_HOME'
    environment:
      - JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
    ports:
      - 6060:6060
  postgres:
    image: postgres:9.6
    environment:
      - POSTGRES_USER=airflow
      - POSTGRES_PASSWORD=airflow
      - POSTGRES_DB=airflow
    logging:
      options:
        max-size: 10m
        max-file: "3"
  webserver:
    image: puckel/docker-airflow:1.10.9
    restart: always
    depends_on:
      - postgres
      - jdkSetup
    environment:
      - LOAD_EX=n
      - EXECUTOR=Local
      - PYSPARK_SUBMIT_ARGS=--master local[2] pyspark-shell
      - HADOOP_HOME=/usr/lib/hadoop/
      - JRE_HOME=/usr/lib/jre7
      - JAVA_HOME=/usr/local/bin/java
    logging:
      options:
        max-size: 10m
        max-file: "3"
    volumes:
      - ${JAVA_HOME}:/usr/local/bin/java
      - ./dags:/usr/local/airflow/dags
      - ./data:/usr/local/airflow/data
      # the image's entrypoint pip-installs /requirements.txt on startup
      - ./requirements.txt:/requirements.txt
      - ./requirements.txt:/usr/airflow/requirements.txt
      - C:/Program Files/hadoop:/usr/lib/hadoop
      - C:/Program Files/Java/jre7:/usr/lib/jre7
      # - ./plugins:/usr/local/airflow/plugins
    ports:
      - "8082:8080"
    command: webserver
    healthcheck:
      test: ["CMD-SHELL", "[ -f /usr/local/airflow/airflow-webserver.pid ]"]
      interval: 30s
      timeout: 30s
      retries: 3
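Before wiring anything into a DAG, I check whether a JVM is reachable inside the webserver container at all, since that is what PySpark's Java gateway needs. A rough sketch of that check (the exact paths depend on what actually gets mounted; I run it with docker-compose exec webserver python):

# Sanity check inside the webserver container. Spark's launcher uses
# $JAVA_HOME/bin/java when JAVA_HOME is set, otherwise `java` from PATH;
# if neither works, the gateway dies with "Java gateway process exited
# before sending its port number".
import os
import shutil
import subprocess

java_home = os.environ.get("JAVA_HOME")
print("JAVA_HOME =", java_home)

java_bin = (os.path.join(java_home, "bin", "java")
            if java_home else shutil.which("java"))
print("java binary =", java_bin)

if java_bin and os.path.exists(java_bin):
    subprocess.run([java_bin, "-version"], check=True)  # prints to stderr
else:
    print("no usable java found - this is where the gateway error comes from")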
And another one:
version: '3.7'
services:
  master:
    image: gettyimages/spark
    command: bin/spark-class org.apache.spark.deploy.master.Master -h master
    hostname: master
    environment:
      MASTER: spark://master:7077
      SPARK_CONF_DIR: /conf
      SPARK_PUBLIC_DNS: localhost
    expose:
      - 7001
      - 7002
      - 7003
      - 7004
      - 7005
      - 7077
      - 6066
    ports:
      - 4040:4040
      - 6066:6066
      - 7077:7077
      - 8080:8080
    volumes:
      - ./conf/master:/conf
      - ./data:/tmp/data
  worker:
    image: gettyimages/spark
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 1g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8081
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
    ports:
      - 8081:8081
    volumes:
      - ./conf/worker:/conf
      - ./data:/tmp/data
  postgres:
    image: postgres:9.6
    environment:
      - POSTGRES_USER=airflow
      - POSTGRES_PASSWORD=airflow
      - POSTGRES_DB=airflow
    logging:
      options:
        max-size: 10m
        max-file: "3"
  webserver:
    image: puckel/docker-airflow:1.10.9
    restart: always
    depends_on:
      - postgres
    environment:
      - LOAD_EX=y
      - EXECUTOR=Local
    logging:
      options:
        max-size: 10m
        max-file: "3"
    volumes:
      - ./dags:/usr/local/airflow/dags
      # Add this to have third party packages
      - ./requirements.txt:/requirements.txt
      - ./data/.:/usr/local/airflow/data/
      # - ./plugins:/usr/local/airflow/plugins
    ports:
      - "8082:8080"
    command: webserver
    healthcheck:
      test: ["CMD-SHELL", "[ -f /usr/local/airflow/airflow-webserver.pid ]"]
      interval: 30s
      timeout: 30s
      retries: 3
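With this second setup, the idea is that the DAG task talks to the standalone cluster instead of spawning everything locally. A minimal sketch of what I expect that task to look like (assuming pyspark is installed in the webserver container via requirements.txt and matches the Spark version of the gettyimages/spark image):

from pyspark.sql import SparkSession

def start_pyspark_on_cluster():
    # "master" is the hostname of the master service in the compose file.
    # Note: the driver still runs in this container, so a local JVM is
    # required here as well - a remote master does not remove that need.
    spark = (
        SparkSession.builder
        .master("spark://master:7077")
        .appName("PySpark_bikes")
        .getOrCreate()
    )
    print(spark.sparkContext)
    spark.stop()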
But I always receive "Error: Java gateway process exited before sending its port number". The code with the error:
import pyspark
from pyspark.sql import SparkSession

def start_pyspark(file):
    sc = pyspark.SparkContext("spark://master:7077")  # fails here with the gateway error
    print(sc)
    spark = SparkSession.builder \
        .master("local[*]") \
        .appName('PySpark_bikes') \
        .getOrCreate()
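If I understand the traceback correctly, the error is raised before any connection to the master is even attempted: the gateway fails while launching the driver's local JVM. A variant I have been experimenting with, with my assumptions marked in comments:

import os
from pyspark.sql import SparkSession

def start_pyspark(file):
    # Assumption: JAVA_HOME must point to a JDK/JRE directory (so that
    # $JAVA_HOME/bin/java exists), not directly at the java binary.
    # The path below is hypothetical - whatever JDK the container really has.
    os.environ.setdefault("JAVA_HOME", "/usr/lib/jvm/java-8-openjdk-amd64")
    # One SparkSession only, instead of a separate SparkContext pointing
    # at a different master than the session.
    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("PySpark_bikes")
        .getOrCreate()
    )
    print(spark.sparkContext)
    return spark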
What should I do? Waiting for your help.
Source: Stack Overflow, licensed under CC BY-SA 3.0.