'Apache Beam - Error with example using Spark Runner pointing to a local spark master URL

I need to support a use case where we are able to run Beam pipelines in an external spark URL.

I took a basic example of a beam pipeline as below

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class ConvertToByteArray(beam.DoFn):
    def __init__(self):
        pass

    def setup(self):
        pass

    def process(self, row):
        try:
            yield bytearray(row + '\n', 'utf-8')

        except Exception as e:
            raise e


def run():

    options = PipelineOptions([
        "--runner=SparkRunner",
        # "--spark_master_url=spark://0.0.0.0:7077",
        # "--spark_version=3",
    ])
    with beam.Pipeline(options=options) as p:
        lines = (p
                 | 'Create words' >> beam.Create(['this is working'])
                 | 'Split words' >> beam.FlatMap(lambda words: words.split(' '))
                 | 'Build byte array' >> beam.ParDo(ConvertToByteArray())
                 | 'Group' >> beam.GroupBy()  # Do future batching here
                 | 'print output' >> beam.Map(print)
                 )
        
        
if __name__ == "__main__":
    run()

I try to run this pipeline in 2 ways

  1. Using Apache Beam's internal spark runner
  2. Running Spark locally and passing the spark master URL.

Approach 1 works fine and i'm able to see the output (screenshot below)

Screenshot of Output

Approach 2 gives a class incompatible error on the spark server Running spark as a docker container and natively on my machine both gave the same error. Exception trace

Spark Executor Command: "/opt/bitnami/java/bin/java" "-cp" "/opt/bitnami/spark/conf/:/opt/bitnami/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=62420" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://[email protected]:62420" "--executor-id" "0" "--hostname" "172.17.0.2" "--cores" "4" "--app-id" "app-20220425143553-0000" "--worker-url" "spark://[email protected]:44575"
========================================

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
22/04/25 14:35:55 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 165@ace075bc56c0
22/04/25 14:35:55 INFO SignalUtils: Registering signal handler for TERM
22/04/25 14:35:55 INFO SignalUtils: Registering signal handler for HUP
22/04/25 14:35:55 INFO SignalUtils: Registering signal handler for INT
22/04/25 14:35:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/04/25 14:35:56 INFO SecurityManager: Changing view acls to: spark,nikamath
22/04/25 14:35:56 INFO SecurityManager: Changing modify acls to: spark,nikamath
22/04/25 14:35:56 INFO SecurityManager: Changing view acls groups to: 
22/04/25 14:35:56 INFO SecurityManager: Changing modify acls groups to: 
22/04/25 14:35:56 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(spark, nikamath); groups with view permissions: Set(); users  with modify permissions: Set(spark, nikamath); groups with modify permissions: Set()
22/04/25 14:35:57 INFO TransportClientFactory: Successfully created connection to /192.168.8.120:62420 after 194 ms (0 ms spent in bootstraps)
22/04/25 14:35:57 INFO SecurityManager: Changing view acls to: spark,nikamath
22/04/25 14:35:57 INFO SecurityManager: Changing modify acls to: spark,nikamath
22/04/25 14:35:57 INFO SecurityManager: Changing view acls groups to: 
22/04/25 14:35:57 INFO SecurityManager: Changing modify acls groups to: 
22/04/25 14:35:57 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(spark, nikamath); groups with view permissions: Set(); users  with modify permissions: Set(spark, nikamath); groups with modify permissions: Set()
22/04/25 14:35:57 INFO TransportClientFactory: Successfully created connection to /192.168.8.120:62420 after 19 ms (0 ms spent in bootstraps)
22/04/25 14:35:58 INFO DiskBlockManager: Created local directory at /tmp/spark-41a0e50e-81aa-48b7-ae55-888ae3a0a4ca/executor-a5ff3b34-0166-405f-9c00-d5a5ee6f8688/blockmgr-d7bd5b95-ddb3-4642-9863-261a6e109fc4
22/04/25 14:35:58 INFO MemoryStore: MemoryStore started with capacity 366.3 MiB
22/04/25 14:35:58 INFO CoarseGrainedExecutorBackend: Connecting to driver: spark://[email protected]:62420
22/04/25 14:35:58 INFO WorkerWatcher: Connecting to worker spark://[email protected]:44575
22/04/25 14:35:58 INFO TransportClientFactory: Successfully created connection to /172.17.0.2:44575 after 7 ms (0 ms spent in bootstraps)
22/04/25 14:35:58 INFO WorkerWatcher: Successfully connected to spark://[email protected]:44575
22/04/25 14:35:58 INFO ResourceUtils: ==============================================================
22/04/25 14:35:58 INFO ResourceUtils: No custom resources configured for spark.executor.
22/04/25 14:35:58 INFO ResourceUtils: ==============================================================
22/04/25 14:35:58 INFO CoarseGrainedExecutorBackend: Successfully registered with driver
22/04/25 14:35:58 INFO Executor: Starting executor ID 0 on host 172.17.0.2
22/04/25 14:35:59 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 39027.
22/04/25 14:35:59 INFO NettyBlockTransferService: Server created on 172.17.0.2:39027
22/04/25 14:35:59 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
22/04/25 14:35:59 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(0, 172.17.0.2, 39027, None)
22/04/25 14:35:59 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(0, 172.17.0.2, 39027, None)
22/04/25 14:35:59 INFO BlockManager: Initialized BlockManager: BlockManagerId(0, 172.17.0.2, 39027, None)
22/04/25 14:35:59 INFO Executor: Fetching spark://192.168.8.120:62420/jars/beam-runners-spark-3-job-server-2.33.0.jar with timestamp 1650897351765
22/04/25 14:35:59 INFO TransportClientFactory: Successfully created connection to /192.168.8.120:62420 after 5 ms (0 ms spent in bootstraps)
22/04/25 14:35:59 INFO Utils: Fetching spark://192.168.8.120:62420/jars/beam-runners-spark-3-job-server-2.33.0.jar to /tmp/spark-41a0e50e-81aa-48b7-ae55-888ae3a0a4ca/executor-a5ff3b34-0166-405f-9c00-d5a5ee6f8688/spark-cd481993-e8df-46fd-b00c-9a31e17d245d/fetchFileTemp1955801007454794690.tmp
22/04/25 14:36:02 INFO Utils: Copying /tmp/spark-41a0e50e-81aa-48b7-ae55-888ae3a0a4ca/executor-a5ff3b34-0166-405f-9c00-d5a5ee6f8688/spark-cd481993-e8df-46fd-b00c-9a31e17d245d/-16622853561650897351765_cache to /opt/bitnami/spark/work/app-20220425143553-0000/0/./beam-runners-spark-3-job-server-2.33.0.jar
22/04/25 14:36:04 INFO Executor: Adding file:/opt/bitnami/spark/work/app-20220425143553-0000/0/./beam-runners-spark-3-job-server-2.33.0.jar to class loader
22/04/25 14:36:04 INFO CoarseGrainedExecutorBackend: Got assigned task 0
22/04/25 14:36:04 INFO CoarseGrainedExecutorBackend: Got assigned task 1
22/04/25 14:36:04 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
22/04/25 14:36:04 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
22/04/25 14:36:04 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.io.InvalidClassException: org.apache.spark.util.AccumulatorV2; local class incompatible: stream classdesc serialVersionUID = 8273715124741334009, local class serialVersionUID = 574976528730727648
    at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:699)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2005)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1852)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2005)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1852)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
    at org.apache.spark.scheduler.Task.metrics$lzycompute(Task.scala:72)
    at org.apache.spark.scheduler.Task.metrics(Task.scala:71)
    at org.apache.spark.scheduler.Task.run(Task.scala:100)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
22/04/25 14:36:04 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.io.InvalidClassException: org.apache.spark.util.AccumulatorV2; local class incompatible: stream classdesc serialVersionUID = 8273715124741334009, local class serialVersionUID = 574976528730727648
    at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:699)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2005)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1852)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2005)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1852)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
    at org.apache.spark.scheduler.Task.metrics$lzycompute(Task.scala:72)
    at org.apache.spark.scheduler.Task.metrics(Task.scala:71)
    at org.apache.spark.scheduler.Task.run(Task.scala:100)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
22/04/25 14:36:04 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 0.0 in stage 0.0 (TID 0),5,main]
java.lang.Error: java.io.InvalidClassException: org.apache.spark.util.AccumulatorV2; local class incompatible: stream classdesc serialVersionUID = 8273715124741334009, local class serialVersionUID = 574976528730727648
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1155)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.InvalidClassException: org.apache.spark.util.AccumulatorV2; local class incompatible: stream classdesc serialVersionUID = 8273715124741334009, local class serialVersionUID = 574976528730727648
    at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:699)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2005)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1852)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2005)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1852)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
    at org.apache.spark.scheduler.Task.metrics$lzycompute(Task.scala:72)
    at org.apache.spark.scheduler.Task.metrics(Task.scala:71)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$collectAccumulatorsAndResetStatusOnFailure$1(Executor.scala:424)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$collectAccumulatorsAndResetStatusOnFailure$1$adapted(Executor.scala:423)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.executor.Executor$TaskRunner.collectAccumulatorsAndResetStatusOnFailure(Executor.scala:423)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:704)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    ... 2 more
22/04/25 14:36:04 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 1.0 in stage 0.0 (TID 1),5,main]
java.lang.Error: java.io.InvalidClassException: org.apache.spark.util.AccumulatorV2; local class incompatible: stream classdesc serialVersionUID = 8273715124741334009, local class serialVersionUID = 574976528730727648
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1155)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.InvalidClassException: org.apache.spark.util.AccumulatorV2; local class incompatible: stream classdesc serialVersionUID = 8273715124741334009, local class serialVersionUID = 574976528730727648
    at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:699)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2005)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1852)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2005)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1852)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
    at org.apache.spark.scheduler.Task.metrics$lzycompute(Task.scala:72)
    at org.apache.spark.scheduler.Task.metrics(Task.scala:71)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$collectAccumulatorsAndResetStatusOnFailure$1(Executor.scala:424)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$collectAccumulatorsAndResetStatusOnFailure$1$adapted(Executor.scala:423)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.executor.Executor$TaskRunner.collectAccumulatorsAndResetStatusOnFailure(Executor.scala:423)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:704)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    ... 2 more
22/04/25 14:36:05 INFO MemoryStore: MemoryStore cleared
22/04/25 14:36:05 INFO BlockManager: BlockManager stopped
22/04/25 14:36:05 INFO ShutdownHookManager: Shutdown hook called
22/04/25 14:36:05 INFO ShutdownHookManager: Deleting directory /tmp/spark-41a0e50e-81aa-48b7-ae55-888ae3a0a4ca/executor-a5ff3b34-0166-405f-9c00-d5a5ee6f8688/spark-cd481993-e8df-46fd-b00c-9a31e17d245d

I confirmed that a worker was running and the spark was set up properly. I submitted a sample spark job to this master and get it to work implying that there isn't anything wrong with the spark master or worker.

Versions used:

  1. Spark 3.2.1
  2. Apache Beam 2.37.0
  3. Python 3.7


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source