Stream data from one Kafka topic to another using PySpark
I have been trying to stream some sample data from one Kafka topic to another using PySpark. (I want to apply some transformations along the way, but I could not even get the basic data movement to work.) Below is my Spark code.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, from_json, col
from pyspark.sql.types import StructType, StringType, IntegerType
import time

confluentApiKey = 'someapikeyvalue'
confluentSecret = 'someapikey'

spark = SparkSession.builder \
    .appName("repartition-job") \
    .config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1') \
    .getOrCreate()

df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "pkc-cloud:9092") \
    .option("subscribe", "test1") \
    .option("topic", "test1") \
    .option("sasl.mechanisms", "PLAIN") \
    .option("security.protocol", "SASL_SSL") \
    .option("sasl.username", confluentApiKey) \
    .option("kafka.sasl.jaas.config", "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username='{}' password='{}';".format(confluentApiKey, confluentSecret)) \
    .option("kafka.ssl.endpoint.identification.algorithm", "https") \
    .option("sasl.password", confluentSecret) \
    .option("startingOffsets", "earliest") \
    .option("basic.auth.credentials.source", "USER_INFO") \
    .option("failOnDataLoss", "true") \
    .load()

df.printSchema()

query = df \
    .selectExpr("CAST(key AS STRING) AS key", "to_json(struct(*)) AS value") \
    .writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "pkc-cloud:9092") \
    .option("topic", "test2") \
    .option("sasl.mechanisms", "PLAIN") \
    .option("security.protocol", "SASL_SSL") \
    .option("sasl.username", confluentApiKey) \
    .option("kafka.sasl.jaas.config", "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username='{}' password='{}';".format(confluentApiKey, confluentSecret)) \
    .option("kafka.ssl.endpoint.identification.algorithm", "https") \
    .option("sasl.password", confluentSecret) \
    .option("startingOffsets", "latest") \
    .option("basic.auth.credentials.source", "USER_INFO") \
    .option("checkpointLocation", "/tmp/checkpoint") \
    .start()
I am able to get the schema printed fine:
root
|-- key: binary (nullable = true)
|-- value: binary (nullable = true)
|-- topic: string (nullable = true)
|-- partition: integer (nullable = true)
|-- offset: long (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- timestampType: integer (nullable = true)
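For context, this is the kind of transformation I eventually want to apply. Since value arrives as binary, I would cast it to a string and parse it with from_json; the schema below is only a placeholder for the sample messages:

# placeholder schema for illustration; the real payload looks different
schema = StructType() \
    .add("id", IntegerType()) \
    .add("name", StringType())

parsed = df.select(
    col("key").cast("string").alias("key"),
    from_json(col("value").cast("string"), schema).alias("data"))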
But when attempting to write to the other Kafka topic using writeStream, I see the logs below: no data is written, and Spark shuts down.
22/02/04 18:29:26 INFO CheckpointFileManager: Writing atomically to file:/tmp/checkpoint/metadata using temp file file:/tmp/checkpoint/.metadata.e6c58f93-5c1c-4f26-97cf-a8d3ed389a57.tmp
22/02/04 18:29:26 INFO CheckpointFileManager: Renamed temp file file:/tmp/checkpoint/.metadata.e6c58f93-5c1c-4f26-97cf-a8d3ed389a57.tmp to file:/tmp/checkpoint/metadata
22/02/04 18:29:27 INFO MicroBatchExecution: Starting [id = 71f0aeb8-46fc-49b5-8baf-3b83cb4df71f, runId = 7a274767-8830-448c-b9bc-d03217cd4465]. Use file:/tmp/checkpoint to store the query checkpoint.
22/02/04 18:29:27 INFO MicroBatchExecution: Reading table [org.apache.spark.sql.kafka010.KafkaSourceProvider$KafkaTable@3be72f6d] from DataSourceV2 named 'kafka' [org.apache.spark.sql.kafka010.KafkaSourceProvider@276c9fdc]
22/02/04 18:29:27 INFO SparkUI: Stopped Spark web UI at http://spark-sample-9d328d7ec5fee0bc-driver-svc.default.svc:4045
22/02/04 18:29:27 INFO KubernetesClusterSchedulerBackend: Shutting down all executors
22/02/04 18:29:27 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
22/02/04 18:29:27 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
22/02/04 18:29:27 INFO MicroBatchExecution: Starting new streaming query.
22/02/04 18:29:27 INFO MicroBatchExecution: Stream started from {}
22/02/04 18:29:27 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/02/04 18:29:27 INFO MemoryStore: MemoryStore cleared
22/02/04 18:29:27 INFO BlockManager: BlockManager stopped
22/02/04 18:29:27 INFO BlockManagerMaster: BlockManagerMaster stopped
22/02/04 18:29:27 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/02/04 18:29:28 INFO SparkContext: Successfully stopped SparkContext
22/02/04 18:29:28 INFO ConsumerConfig: ConsumerConfig values:
....
....
....
ssl.keystore.type = JKS
ssl.protocol = TLS
ssl.provider = null
ssl.secure.random.implementation = null
ssl.trustmanager.algorithm = PKIX
ssl.truststore.location = null
ssl.truststore.password = null
ssl.truststore.type = JKS
value.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer
22/02/04 18:29:28 INFO ShutdownHookManager: Shutdown hook called
22/02/04 18:29:28 INFO ShutdownHookManager: Deleting directory /tmp/spark-7acecef5-0f0b-4b9a-af81-c8aa12f7fcad
22/02/04 18:29:28 INFO AppInfoParser: Kafka version: 2.4.1
22/02/04 18:29:28 INFO AppInfoParser: Kafka commitId: c57222ae8cd7866b
22/02/04 18:29:28 INFO AppInfoParser: Kafka startTimeMs: 1643999368467
22/02/04 18:29:28 INFO ShutdownHookManager: Deleting directory /var/data/spark-641f2e65-8f10-46b9-9821-d3b1f3536c0e/spark-1a103622-3329-4444-8e69-40f5a341c372/pyspark-59e822b4-a4a4-403b-9937-170d99c67584
22/02/04 18:29:28 INFO KafkaConsumer: [Consumer clientId=consumer-spark-kafka-source-75381ad2-1ce9-4e2b-a0b7-18d6ecb5ea8b--2090736517-driver-0-1, groupId=spark-kafka-source-75381ad2-1ce9-4e2b-a0b7-18d6ecb5ea8b--2090736517-driver-0] Subscribed to topic(s): test1
22/02/04 18:29:28 INFO ShutdownHookManager: Deleting directory /var/data/spark-641f2e65-8f10-46b9-9821-d3b1f3536c0e/spark-1a103622-3329-4444-8e69-40f5a341c372
22/02/04 18:29:28 INFO MetricsSystemImpl: Stopping s3a-file-system metrics system...
22/02/04 18:29:28 INFO MetricsSystemImpl: s3a-file-system metrics system stopped.
22/02/04 18:29:28 INFO MetricsSystemImpl: s3a-file-system metrics system shutdown complete.
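What strikes me in these logs is that MicroBatchExecution starts and the SparkContext stops within the same second. As I understand it, start() returns immediately and the query runs on background threads, so if nothing blocks at the end of the script the driver simply exits. Do I need to block on the query, i.e.

# start() returns immediately; without this the script (and the driver) just ends
query.awaitTermination()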
Also, I sometimes see the logs below, where the Kafka connection fails to be established.
22/02/06 04:50:55 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-fc21e146-82f0-4fc7-a2da-34f3e8f70026-289148490-driver-0-1, groupId=spark-kafka-source-fc21e146-82f0-4fc7-a2da-34f3e8f70026-289148490-driver-0] Bootstrap broker pkc-.confluent.cloud:9092 (id: -1 rack: null) disconnected
22/02/06 04:50:56 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-fc21e146-82f0-4fc7-a2da-34f3e8f70026-289148490-driver-0-1, groupId=spark-kafka-source-fc21e146-82f0-4fc7-a2da-34f3e8f70026-289148490-driver-0] Bootstrap broker pkc-.confluent.cloud:9092 (id: -1 rack: null) disconnected
22/02/06 04:50:57 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-fc21e146-82f0-4fc7-a2da-34f3e8f70026-289148490-driver-0-1, groupId=spark-kafka-source-fc21e146-82f0-4fc7-a2da-34f3e8f70026-289148490-driver-0] Bootstrap broker pkc-.confluent.cloud:9092 (id: -1 rack: null) disconnected
22/02/06 04:50:58 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-fc21e146-82f0-4fc7-a2da-34f3e8f70026-289148490-driver-0-1, groupId=spark-kafka-source-fc21e146-82f0-4fc7-a2da-34f3e8f70026-289148490-driver-0] Bootstrap broker pkc-.confluent.cloud:9092 (id: -1 rack: null) disconnected
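As far as I can tell, repeated "Bootstrap broker ... disconnected" warnings are what a client logs when it speaks PLAINTEXT to a SASL_SSL listener, which would fit the unprefixed security.protocol option being silently ignored. So I suspect the write side needs the same kafka.-prefixed naming; a sketch of what I mean (untested, placeholders as above):

query = (df
    .selectExpr("CAST(key AS STRING) AS key", "to_json(struct(*)) AS value")
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "pkc-cloud:9092")
    .option("topic", "test2")  # sink option: no prefix
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.sasl.jaas.config",
            "org.apache.kafka.common.security.plain.PlainLoginModule required "
            "username='{}' password='{}';".format(confluentApiKey, confluentSecret))
    .option("checkpointLocation", "/tmp/checkpoint")
    .start())
query.awaitTermination()  # keep the driver alive while the stream runs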
What am I doing wrong?