Category "spark-structured-streaming"

Spark structured streaming- checkpoint metadata growing indefinitely

I use spark struture streaming 3.1.2. I need to use s3 for storing checkpoint metadata (I know, it's not optimal storage for checkpoint metadata). Compaction in

Spark Structured Streaming writeStream trigger set to once is recording much less data than it should

I have a program that runs every hour, it receives streaming data and writes it in parquet format in batches into a datalake every time it runs, to be later pro

Generate Kafka message with Headers using Apache Spark

I have an ETL (spark-scala). After writing in a table, a message with "header" must be sent to Kafka. I couldn't add the header in the message. I have a spark D

Multi-schema data pipeline

We want to create a Spark-based streaming data pipeline that consumes from a source (e.g. Kinesis), apply some basic transformations, and write the data to a fi

Accessing the current sliding window in foreachBatch in Spark Structured Streaming

I am using in Spark Structured Streaming foreachBatch() to maintain manually a sliding window, consisting of the last 200000 entries. With every microbatch I re

pyspark.sql.utils.AnalysisException: Failed to find data source: kafka

I am trying to read a stream from kafka using pyspark. I am using spark version 3.0.0-preview2 and spark-streaming-kafka-0-10_2.12 Before this I just stat zoo

Connection between kafka and spark : Failed to find data source : kafka

I am trying to do link between kafka and spark by reading data from one topic and tryy to print the content of this topic into a DataFrame, but by doing connect

Structured Streaming to Save JSON to HDFS

My Structured Spark Streaming program is to read JSON data from Kafka and write to HDFS in JSON format. I am able to save JSON to HDFS but it saves the JSON st

Distinct Count on Column in Dataset in Structured Streaming

I am New in Structure Streaming Topic. so facing issue while calculating distinct count in column in Dataset/Dataframe. //DataFrame val readFromKafka = sparks

How to stream data from mongodb in Structured Streaming?

Is it possible to use spark structured streaming to read data from mongo db with a readStream ? For standard use of structured streaming, I usually do so: va

Pyspark UDF monitoring with prometheus

I am am trying to monitor some logic in a udf using counters. i.e. counter = Counter(...).labels("value") @ufd def do_smthng(col): if col: counter.label(