Can't run PySpark in an Airflow DAG with docker-compose

I downloaded https://github.com/puckel/docker-airflow and tried to start it, which worked fine. But I need to use PySpark and have no idea how to set it up correctly. I saw many answers, but they don't work for me. So now I have two variants of docker-compose.

The first one:

version: '3.7'
services:
    jdkSetup:
        image: openjdk:9
        command:
            - sh
            - '-c'
            - 'echo $JAVA_HOME'
        environment:
            - JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
        ports:
            - 6060:6060
    postgres:
        image: postgres:9.6
        environment:
            - POSTGRES_USER=airflow
            - POSTGRES_PASSWORD=airflow
            - POSTGRES_DB=airflow
        logging:
            options:
                max-size: 10m
                max-file: "3"

    webserver:
        image: puckel/docker-airflow:1.10.9
        restart: always
        depends_on:
            - postgres
            - jdkSetup
        environment:
            - LOAD_EX=n
            - EXECUTOR=Local
            - PYSPARK_SUBMIT_ARGS=--master local[2] pyspark-shell
            - HADOOP_HOME=/usr/lib/hadoop/
            - JRE_HOME=/usr/lib/jre7
            # JAVA_HOME should point at a JVM directory, not at a java binary
            - JAVA_HOME=/usr/local/bin/java
        logging:
            options:
                max-size: 10m
                max-file: "3"
        volumes:
            - ${JAVA_HOME}:/usr/local/bin/java
            - ./dags:/usr/local/airflow/dags
            - ./data:/usr/local/airflow/data
            - ./requirements.txt:/requirements.txt
            - ./requirements.txt:/usr/airflow/requirements.txt
            - C:/Program Files/hadoop:/usr/lib/hadoop
            - C:/Program Files/Java/jre7:/usr/lib/jre7
            # - ./plugins:/usr/local/airflow/plugins
        ports:
            - "8082:8080"
        # the image's entrypoint already installs /requirements.txt when it exists,
        # so the service command should just be the Airflow subcommand
        command: webserver
        healthcheck:
            test: ["CMD-SHELL", "[ -f /usr/local/airflow/airflow-webserver.pid ]"]
            interval: 30s
            timeout: 30s
            retries: 3
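
Note that the jdkSetup service above only puts Java into its own container, and mounting a Windows JRE (C:/Program Files/Java/jre7) into a Linux container can't work, because those binaries aren't built for Linux. The usual fix is to bake a JVM and the pyspark package into the Airflow image itself. A minimal sketch of such a Dockerfile, assuming a PySpark release that runs on Java 11 (the JDK packaged for Debian buster, which this Airflow image is based on):

FROM puckel/docker-airflow:1.10.9

USER root
# slim base images lack this directory, which the JDK package expects
RUN mkdir -p /usr/share/man/man1 \
    && apt-get update \
    && apt-get install -y --no-install-recommends openjdk-11-jre-headless \
    && rm -rf /var/lib/apt/lists/*
# point pyspark at the JVM inside the container, not at a java binary
ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

# pin a pyspark version that matches the Spark cluster you target
RUN pip install --no-cache-dir pyspark

USER airflow

The webserver service would then use build: . in place of image: puckel/docker-airflow:1.10.9, and the JAVA_HOME, HADOOP_HOME, and Windows volume workarounds above could be dropped.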

And another one:

version: '3.7'
services:
    master:
        image: gettyimages/spark
        command: bin/spark-class org.apache.spark.deploy.master.Master -h master
        hostname: master
        environment:
            MASTER: spark://master:7077
            SPARK_CONF_DIR: /conf
            SPARK_PUBLIC_DNS: localhost
        expose:
            - 7001
            - 7002
            - 7003
            - 7004
            - 7005
            - 7077
            - 6066
        ports:
            - 4040:4040
            - 6066:6066
            - 7077:7077
            - 8080:8080
        volumes:
            - ./conf/master:/conf
            - ./data:/tmp/data

    worker:
        image: gettyimages/spark
        command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
        hostname: worker
        environment:
            SPARK_CONF_DIR: /conf
            SPARK_WORKER_CORES: 2
            SPARK_WORKER_MEMORY: 1g
            SPARK_WORKER_PORT: 8881
            SPARK_WORKER_WEBUI_PORT: 8081
            SPARK_PUBLIC_DNS: localhost
        links:
            - master
        expose:
            - 7012
            - 7013
            - 7014
            - 7015
            - 8881
        ports:
            - 8081:8081
        volumes:
            - ./conf/worker:/conf
            - ./data:/tmp/data

    postgres:
        image: postgres:9.6
        environment:
            - POSTGRES_USER=airflow
            - POSTGRES_PASSWORD=airflow
            - POSTGRES_DB=airflow
        logging:
            options:
                max-size: 10m
                max-file: "3"

    webserver:
        image: puckel/docker-airflow:1.10.9
        restart: always
        depends_on:
            - postgres
        environment:
            - LOAD_EX=y
            - EXECUTOR=Local
        logging:
            options:
                max-size: 10m
                max-file: "3"
        volumes:
            - ./dags:/usr/local/airflow/dags
            # Add this to have third party packages
            - ./requirements.txt:/requirements.txt
            - ./data/.:/usr/local/airflow/data/
            # - ./plugins:/usr/local/airflow/plugins
        ports:
            - "8082:8080"
        command: webserver
        healthcheck:
            test: ["CMD-SHELL", "[ -f /usr/local/airflow/airflow-webserver.pid ]"]
            interval: 30s
            timeout: 30s
            retries: 3
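
Note that even with a separate Spark cluster like this, the driver still runs inside the webserver container when a DAG task calls pyspark, so that container needs a JVM and the pyspark package as well; the gettyimages/spark image only provides the master and worker. A quick way to check whether the webserver container can see Java at all:

docker-compose exec webserver bash -c 'java -version && echo $JAVA_HOME'

If java is not found there, the "Java gateway process exited before sending its port number" error is expected, no matter how the cluster containers are configured.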

But I always receive the error "Java gateway process exited before sending its port number". The code that raises it:

import pyspark
from pyspark.sql import SparkSession

def start_pyspark(file):
    # creates a context against the standalone cluster...
    sc = pyspark.SparkContext("spark://master:7077")
    print(sc)
    # ...then requests a session with a different, local master
    spark = SparkSession.builder \
        .master("local[*]") \
        .appName('PySpark_bikes') \
        .getOrCreate()
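
For comparison, a minimal sketch of a consistent version that picks a single master and reuses the session's context (SPARK_MASTER is a hypothetical environment variable here, not one set by these images):

import os
from pyspark.sql import SparkSession

def start_pyspark(file):
    # choose one master; mixing spark://master:7077 and local[*] conflicts
    master = os.environ.get("SPARK_MASTER", "local[*]")  # hypothetical variable
    spark = SparkSession.builder \
        .master(master) \
        .appName('PySpark_bikes') \
        .getOrCreate()
    sc = spark.sparkContext  # reuse the session's context instead of building a second one
    print(sc)
    return spark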

What should I do? Waiting for your help.


