Category "apache-spark"

Spark Structured Streaming: checkpoint metadata growing indefinitely

I use Spark Structured Streaming 3.1.2. I need to use S3 for storing checkpoint metadata (I know it's not the optimal storage for checkpoint metadata). Compaction in
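For context, retention of streaming metadata is controlled by a handful of session options. A minimal sketch, assuming the growth comes from the streaming commit/file-sink logs; the option names are standard Spark settings, but the values here are placeholders, not recommendations:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("checkpoint-retention-sketch")
        # retain fewer committed batches in checkpoint metadata (default is 100)
        .config("spark.sql.streaming.minBatchesToRetain", "20")
        # let the file-sink metadata log compact and delete expired entries
        .config("spark.sql.streaming.fileSink.log.deletion", "true")
        .config("spark.sql.streaming.fileSink.log.compactInterval", "10")
        .config("spark.sql.streaming.fileSink.log.cleanupDelay", "600000")  # milliseconds
        .getOrCreate()
    )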

spark-shell commands throwing error: “error: not found: value spark”

Problem screenshot: ":14: error: not found: value spark  import spark.implicits._" and ":14: error: not found: value spark  import spark.sql". Here is my environment con
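In reports like this, the shell usually failed to construct the implicit `spark` session at startup (environment or permission problems), so the imports that depend on it fail too. In the Scala shell the common workaround is `val spark = org.apache.spark.sql.SparkSession.builder.getOrCreate()` followed by `import spark.implicits._`; a PySpark-flavoured sketch of the same idea, with a placeholder app name and master:

    from pyspark.sql import SparkSession

    # Sketch: build the session explicitly instead of relying on the
    # shell-provided `spark` value; app name and master are placeholders.
    spark = (
        SparkSession.builder
        .appName("manual-session")
        .master("local[*]")
        .getOrCreate()
    )
    print(spark.version)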

Reading from Cassandra in Spark does not return all the data when using joinWithCassandraTable

I'm querying data from Cassandra in Spark using SCC 2.5 and Spark 2.4.7 (pyspark). The table I'm reading from has a composite partition key (partition_key_1, pa
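One frequent pitfall with a composite partition key is joining on only part of the key. A sketch of the DataFrame-level equivalent under SCC 2.5+, assuming the connector is already on the classpath; the keyspace, table, and column names are placeholders taken from the question's wording:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        # enables the connector's direct-join optimization in SCC 2.5+
        .config("spark.sql.extensions", "com.datastax.spark.connector.CassandraSparkExtensions")
        .getOrCreate()
    )

    cass = (
        spark.read.format("org.apache.spark.sql.cassandra")
        .options(keyspace="my_keyspace", table="my_table")
        .load()
    )

    # Join on *every* column of the composite partition key; leaving one out is
    # a common reason for missing rows (or a full scan) in connector joins.
    keys = spark.createDataFrame(
        [("a", 1), ("b", 2)], ["partition_key_1", "partition_key_2"]
    )
    result = keys.join(cass, on=["partition_key_1", "partition_key_2"], how="inner")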

Better/more efficient way to filter out Spark DataFrame rows with multiple conditions

I have a DataFrame that looks like this: id pub_date version unique_id c_id p_id type source  lni001 20220301 1
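A common pattern here is to express all the conditions as a single boolean column and filter once, rather than chaining many filters. A minimal sketch, assuming the DataFrame is named df and using placeholder values for the column names shown in the snippet:

    from pyspark.sql import functions as F

    # Sketch: one combined predicate; values are illustrative only.
    cond = (
        (F.col("pub_date") >= "20220301")
        & (F.col("type") == "some_type")
        & (F.col("source").isin("src_a", "src_b"))
    )
    filtered = df.filter(cond)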

Converting consecutive PySpark withColumn calls to SQL

I need help in converting the below function into an SQL query: start_time: 1649289600, end_time: 1649375999. test_data = df.withColumn("from_timestamp", to_t
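Since the snippet is truncated, only the general shape of the translation is shown here: each chained withColumn becomes one expression in a single SELECT list. A sketch assuming the epoch-second values from the question and a hypothetical second column name:

    from pyspark.sql import functions as F

    start_time, end_time = 1649289600, 1649375999

    # DataFrame version (column names partly assumed)
    test_data = (
        df.withColumn("from_timestamp", F.to_timestamp(F.from_unixtime(F.lit(start_time))))
          .withColumn("to_timestamp", F.to_timestamp(F.from_unixtime(F.lit(end_time))))
    )

    # SQL version: each withColumn maps to one expression in the SELECT list
    df.createOrReplaceTempView("df")
    test_data_sql = spark.sql(f"""
        SELECT *,
               to_timestamp(from_unixtime({start_time})) AS from_timestamp,
               to_timestamp(from_unixtime({end_time}))   AS to_timestamp
        FROM df
    """)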

Does the computeSVD function use MapReduce in PySpark?

Does computeSVD() use map/reduce, since it is a predefined function? I couldn't find the code of the function. from pyspark.mllib.linalg import Vectors from py
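On the question itself: computeSVD runs as ordinary Spark jobs over the distributed rows of the RowMatrix (map/aggregate-style operations on the RDD), not literal Hadoop MapReduce. A usage sketch with illustrative data, assuming an existing SparkSession named spark:

    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.linalg.distributed import RowMatrix

    sc = spark.sparkContext
    rows = sc.parallelize([
        Vectors.dense([1.0, 2.0, 3.0]),
        Vectors.dense([4.0, 5.0, 6.0]),
        Vectors.dense([7.0, 8.0, 9.0]),
    ])
    mat = RowMatrix(rows)
    svd = mat.computeSVD(2, computeU=True)   # distributed computation over the RDD
    print(svd.s)                             # singular values returned to the driver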

Time series with Delta time travel in Databricks

I'm storing the prices of products in a Delta table. The schema of the table is like this: id | price | updated, with rows such as 1 | 3 | 2022-03-21 and 2 | 4 | 2022-03-20
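For reference, Delta time travel exposes historical versions of a table either by timestamp or by version number. A minimal sketch with a placeholder table path and timestamp:

    # Read the price table as it was at an earlier point in time.
    prices_yesterday = (
        spark.read.format("delta")
        .option("timestampAsOf", "2022-03-20")   # or .option("versionAsOf", 3)
        .load("/mnt/tables/prices")              # placeholder path
    )

    # Equivalent SQL on a registered table (table name assumed):
    # SELECT * FROM prices TIMESTAMP AS OF '2022-03-20'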

How to split a comma-separated CSV value into individual rows in a new column using PySpark

I have a CSV log file which has a column that contains a list of file paths separated by commas. I want to split those file paths into new rows using PySpark (or exce
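The usual PySpark route is split followed by explode. A minimal sketch, assuming the file path and a column named filepaths:

    from pyspark.sql import functions as F

    df = spark.read.option("header", True).csv("/path/to/log.csv")   # placeholder path
    # One output row per file path in the comma-separated list.
    exploded = df.withColumn("filepath", F.explode(F.split(F.col("filepaths"), ",")))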

java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/metadata/HiveException while running DAG in Airflow

I am running a Python project through a DAG in Airflow, and I encounter the following exception when the DAG runs this line from the project: df = spark.sql(quer
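This class lives in Spark's Hive integration module, so the first things to check are that Hive support is enabled and that a spark-hive jar matching the installed Spark/Scala versions is on the classpath of the session the task builds. A sketch under that assumption; the coordinates below are an example for Spark 3.1.2 / Scala 2.12 and must match the actual installation:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("airflow-task")
        # resolve the Hive integration jars if they are not already deployed
        .config("spark.jars.packages", "org.apache.spark:spark-hive_2.12:3.1.2")
        .enableHiveSupport()
        .getOrCreate()
    )
    df = spark.sql("SELECT 1")   # stand-in for the failing spark.sql(query) call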

Extract value from array in Spark

I am trying to extract a value from an array in Spark SQL, but am getting the error below. Example column customer_details: {"original_customer_id":"ch_382820","fi
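A sketch of the two usual approaches, depending on whether customer_details is a JSON string or a real array column; the field path follows the structure shown in the question and is otherwise an assumption:

    from pyspark.sql import functions as F

    # If customer_details is a JSON string, pull a single field by path.
    df2 = df.withColumn(
        "original_customer_id",
        F.get_json_object("customer_details", "$.original_customer_id"),
    )

    # If the column is a real array<struct<...>>, index it and then access the field:
    # df.withColumn("first_id", F.element_at("customer_details", 1)["original_customer_id"])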

Computing number of business days between start/end columns

I have two DataFrames: facts (columns: data, start_date and end_date) and holidays (column: holiday_date). What I want is a way to produce another DataFrame that has
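One pure-DataFrame approach is to expand each fact row into its individual dates, drop weekends and holiday dates, and count what remains. A sketch using the column names from the snippet (dates are assumed to be date-typed):

    from pyspark.sql import functions as F

    # One row per calendar day in [start_date, end_date], weekends removed.
    expanded = facts.withColumn(
        "d", F.explode(F.sequence(F.col("start_date"), F.col("end_date")))
    ).filter(~F.dayofweek("d").isin(1, 7))          # 1 = Sunday, 7 = Saturday

    # Remove holidays, then count the remaining days per fact row.
    business_days = (
        expanded.join(holidays, expanded.d == holidays.holiday_date, "left_anti")
        .groupBy("data", "start_date", "end_date")
        .agg(F.count("*").alias("business_days"))
    )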

Insert a Spark DataFrame into a partitioned table

I have seen methods for inserting into a Hive table, such as insertInto(table_name, overwrite=True), but I couldn't work out how to handle the scenario below. For
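For reference, insertInto is position-based (the DataFrame's column order must match the table definition, with partition columns last) and, combined with dynamic partition overwrite, replaces only the partitions present in the DataFrame. A minimal sketch with a placeholder table name:

    # Overwrite only the partitions present in df, not the whole table.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    df.write.mode("overwrite").insertInto("db.partitioned_table")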

Hive Beeline and Spark load counts don't match for Hive tables

I am using Spark 2.4.4 and Hive 2.3 ... Using Spark, I am loading a DataFrame into a Hive table using DF.insertInto(hiveTable); if a new table is created during the run (o
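A frequent explanation for this kind of mismatch is stale metadata on the Hive side: Beeline may answer count(*) from table statistics, or miss partitions written outside of Hive. A sketch of the usual refresh steps, with a placeholder table name:

    # Refresh partition metadata and statistics after the Spark write so a
    # Beeline count(*) (which can be served from stats) lines up with Spark's count.
    spark.sql("MSCK REPAIR TABLE db.hive_table")                 # register partitions added by Spark
    spark.sql("ANALYZE TABLE db.hive_table COMPUTE STATISTICS")  # refresh row-count statistics
    spark.catalog.refreshTable("db.hive_table")                  # drop Spark's cached metadata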

Issue with display()/collect() on a large DataFrame in PySpark

Getting the following issue in PySpark when performing a display()/collect() operation on top of a generated DataFrame. The df contains a single column and row (JSON d
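When a single row holds a very large JSON document, collect()/display() can exceed driver-side limits. A sketch of the knobs usually involved; the column name value is an assumption, the sizes are placeholders, and these configs must be set before the driver/session is created:

    from pyspark.sql import SparkSession, functions as F

    spark = (
        SparkSession.builder
        .config("spark.driver.memory", "8g")          # driver heap for large collects
        .config("spark.driver.maxResultSize", "4g")   # cap on serialized results sent to the driver
        .getOrCreate()
    )

    # Often a truncated preview is enough and avoids pulling the whole blob back:
    df.select(F.substring("value", 1, 1000).alias("preview")).show(truncate=False)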

Spark UDF error AttributeError: 'NoneType' object has no attribute '_jvm'

I found a similar question (link), but no answer explained how to fix the issue. I want to make a UDF that would extract words from a column for me. So, I want to cr
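This error commonly appears when pyspark.sql.functions are called inside the UDF body, or when the UDF is defined before a SparkSession exists; the UDF body should be plain Python. A sketch of a word-extraction UDF with an assumed column name:

    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    # Keep the body pure Python: no F.* calls inside the UDF.
    @F.udf(returnType=ArrayType(StringType()))
    def extract_words(text):
        return text.split() if text else []

    df2 = df.withColumn("words", extract_words(F.col("description")))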

Transpose a group of repeating columns in a large horizontal DataFrame into a new vertical DataFrame using Scala or PySpark in Databricks

Although this question may seem previously answered, it is not. All the transposing answers seem to relate to one column and pivoting the data in that column. I want to ma
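One way to go from a wide row with repeating column groups to a vertical layout is the SQL stack() generator, which unpivots several columns per output row. A minimal sketch with hypothetical column names and a group count of 2:

    from pyspark.sql import functions as F  # noqa: F401 (kept for further transforms)

    # Each call to stack emits 2 rows, taking (colA_i, colB_i) pairs in order.
    long_df = df.selectExpr(
        "id",
        "stack(2, colA_1, colB_1, colA_2, colB_2) AS (colA, colB)",
    )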

Specifying a column with multiple datatypes in a Spark schema

I am trying to create a schema to parse JSON into a Spark DataFrame. I have a column value in the JSON which could be either a struct or a string: "value": { "entity-type":
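A Spark schema cannot declare one field as "struct or string", so a common workaround is to ingest the ambiguous field as a plain string and apply from_json afterwards to the rows where it is actually a struct. A sketch under that assumption; the input path and the struct schema are placeholders:

    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType

    # Read `value` as a raw string regardless of whether it is a struct or a string.
    raw_schema = StructType([StructField("value", StringType(), True)])
    df = spark.read.schema(raw_schema).json("/path/to/input.json")

    # Parse the struct variant where applicable; rows holding a plain string yield null here.
    entity_schema = StructType([StructField("entity-type", StringType(), True)])
    parsed = df.withColumn("value_struct", F.from_json("value", entity_schema))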

How to create an encoder for a row type of Map[(String,String), List[Row]] for creating a Dataset[Map[(String,String), List[Row]]] in Spark?

This is my piece of code. There is a lot of business logic happening here. I have tried to explain it in as understandable a manner as possible. I have

Spark Structured Streaming writeStream trigger set to once is recording much less data than it should

I have a program that runs every hour; it receives streaming data and writes it in Parquet format in batches into a data lake every time it runs, to be later pro
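Without the full code it is hard to say more, but apparent data loss across hourly Trigger.Once runs is often traced to the checkpoint location or the source's starting offsets changing between runs. A reference sketch of the write side with an explicit, stable checkpoint; paths are placeholders:

    # Hourly run: process everything available since the last checkpoint, then stop.
    query = (
        stream_df.writeStream
        .format("parquet")
        .option("path", "/mnt/datalake/events")
        .option("checkpointLocation", "/mnt/datalake/_checkpoints/events")  # must stay the same across runs
        .trigger(once=True)
        .start()
    )
    query.awaitTermination()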

Exception when trying to use saved Spark ML model for distributed computations with SHAP

I am getting this error: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkConte
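The message describes the constraint: a Spark ML model object holds a reference to the SparkContext, so it cannot be shipped to executors inside a UDF or broadcast. A heavily hedged sketch of the common driver-side workaround, where predict_fn (a plain-Python scoring function), assembled_df, and feature_cols are hypothetical names:

    import shap

    # Work on a sampled pandas copy on the driver instead of distributing the Spark ML model.
    sample_pdf = assembled_df.select(*feature_cols).limit(5000).toPandas()

    explainer = shap.KernelExplainer(predict_fn, sample_pdf.sample(100, random_state=0))
    shap_values = explainer.shap_values(sample_pdf)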