Category "apache-spark"

How to split a CSV comma-separated value into new rows in a new column using pyspark

I have a CSV log file with a column that contains a list of filepaths separated by commas. I want to split those filepaths into new rows using pyspark (or excel)…
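A minimal sketch of the usual approach: split() turns the comma-separated string into an array and explode() emits one row per element. The column names (job_id, file_paths) are invented for illustration.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split, explode

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical log rows: one column holds comma-separated file paths
    df = spark.createDataFrame(
        [("job1", "/a/x.csv,/a/y.csv"), ("job2", "/b/z.csv")],
        ["job_id", "file_paths"],
    )

    # split() builds an array column; explode() yields one row per path
    result = df.withColumn("file_path", explode(split("file_paths", ",")))
    result.show(truncate=False)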

java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/metadata/HiveException while running DAG in Airflow

I am running a Python project through a DAG in Airflow, and I encounter the following exception when the DAG runs this line from the project: df = spark.sql(query)…
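A NoClassDefFoundError for HiveException usually points at missing Hive support jars on the classpath of the JVM the Airflow task launches, not at the query itself. A hedged sketch, with a placeholder query:

    from pyspark.sql import SparkSession

    # enableHiveSupport() only works if spark-hive and its hive-exec
    # dependency (which provides HiveException) ship with this Spark install
    spark = (
        SparkSession.builder
        .appName("airflow_task")
        .enableHiveSupport()
        .getOrCreate()
    )

    query = "SELECT * FROM some_db.some_table"  # hypothetical query
    df = spark.sql(query)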

Extract value from array in Spark

I am trying to extract a value from an array in SparkSQL, but I am getting the error below. Example column customer_details: {"original_customer_id":"ch_382820","fi…
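If customer_details is actually a JSON string (the excerpt is truncated), get_json_object() can pull a single field out of it; the first_name field below is an assumed completion of the sample:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import get_json_object

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [('{"original_customer_id":"ch_382820","first_name":"Ada"}',)],
        ["customer_details"],
    )

    # Pull one field out of the JSON string by path
    df.select(
        get_json_object("customer_details", "$.original_customer_id")
        .alias("customer_id")
    ).show()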

Computing number of business days between start/end columns

I have two DataFrames: facts, with columns data, start_date, and end_date; and holidays, with column holiday_date. What I want is a way to produce another DataFrame that has…
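One common way to do this (F.sequence needs Spark 2.4+), sketched under the column names above: explode every date in the range, drop weekends, then drop holidays with a left anti join.

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    facts = spark.createDataFrame(
        [("a", "2023-01-02", "2023-01-10")],
        ["data", "start_date", "end_date"],
    ).select("data",
             F.col("start_date").cast("date"),
             F.col("end_date").cast("date"))
    holidays = spark.createDataFrame(
        [("2023-01-06",)], ["holiday_date"]
    ).select(F.col("holiday_date").cast("date"))

    # One row per calendar day in [start_date, end_date], minus weekends
    days = facts.withColumn(
        "day", F.explode(F.sequence("start_date", "end_date"))
    ).where(~F.dayofweek("day").isin(1, 7))  # 1 = Sunday, 7 = Saturday

    business_days = (
        days.join(holidays, days.day == holidays.holiday_date, "left_anti")
        .groupBy("data", "start_date", "end_date")
        .agg(F.count("*").alias("business_days"))
    )
    business_days.show()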

Insert Spark dataframe to partitioned table

I have seen methods for inserting into a Hive table, such as insertInto(table_name, overwrite=True), but I couldn't work out how to handle the scenario below. For…
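For overwriting only the affected partitions, the usual combination is dynamic partition overwrite plus insertInto(); this assumes an existing spark session, and the table and column names are placeholders.

    # Overwrite only the partitions present in df, not the whole table
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
    spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

    # insertInto() matches columns by position, so the partition column(s)
    # must come last, in the order of the table definition
    (df.select("col_a", "col_b", "partition_col")
       .write
       .insertInto("my_db.my_partitioned_table", overwrite=True))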

Hive beeline and Spark load counts don't match for Hive tables

I am using Spark 2.4.4 and Hive 2.3 ... Using Spark, I am loading a dataframe into a Hive table using DF.insertInto(hiveTable). If a new table is created during the run (o…
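A frequent cause of this kind of mismatch is stale metadata or statistics rather than lost rows; a short sketch of the usual checks, assuming an existing spark session and a placeholder table name:

    # Refresh Spark's cached metadata after writes from other sessions
    spark.catalog.refreshTable("my_db.my_table")

    # If beeline answers the count from stale Hive statistics, recompute them
    spark.sql("ANALYZE TABLE my_db.my_table COMPUTE STATISTICS")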

Issue with display()/collect() on a large DataFrame in PySpark

I am getting the following issue in PySpark when performing a display()/collect() operation on top of a generated dataframe. The df contains a single column & row (JSON d…
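collect() ships the whole result to the driver, and a single huge JSON row commonly trips spark.driver.maxResultSize. A hedged starting point; the sizes are assumptions, and these static confs must be set when the application launches, not mid-session:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .config("spark.driver.memory", "8g")         # assumed size
        .config("spark.driver.maxResultSize", "4g")  # assumed size
        .getOrCreate()
    )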

Spark UDF error AttributeError: 'NoneType' object has no attribute '_jvm'

I found a similar question (link), but no answer explained how to fix the issue. I want to make a UDF that would extract words from a column for me. So, I want to create…
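This error typically appears when pyspark.sql.functions are called inside the UDF body, or when the UDF is built before the SparkContext exists. A minimal working pattern using only plain Python inside the UDF:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StringType
    import re

    spark = SparkSession.builder.getOrCreate()  # context must exist first

    # Only plain Python in the body; calling pyspark.sql.functions here is
    # what raises "'NoneType' object has no attribute '_jvm'"
    @udf(returnType=ArrayType(StringType()))
    def extract_words(text):
        return re.findall(r"\w+", text) if text else []

    df = spark.createDataFrame([("hello spark world",)], ["col"])
    df.select(extract_words("col").alias("words")).show(truncate=False)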

Transpose a group of repeating columns in a large horizontal dataframe into a new vertical dataframe using Scala or PySpark in Databricks

Although this question may seem previously answered, it is not. All transposing questions seem to relate to one column and pivoting the data in that column. I want to ma…
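For repeating column groups (rather than a single pivot column), SQL's stack() generator is the usual tool; the (item, qty) pairs below are invented to show the shape:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical wide layout: repeating (item, qty) column pairs per row
    df = spark.createDataFrame(
        [(1, "a", 10, "b", 20)],
        ["id", "item_1", "qty_1", "item_2", "qty_2"],
    )

    # stack(n, k1, v1, k2, v2, ...) emits n rows per input row
    tall = df.select(
        "id",
        F.expr("stack(2, item_1, qty_1, item_2, qty_2) as (item, qty)"),
    )
    tall.show()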

Specifying column with multiple datatypes in Spark Schema

I am trying to create a schema to parse JSON into a Spark dataframe. I have a column value in the JSON which could be either a struct or a string: "value": { "entity-type":…
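A schema cannot declare one field as both struct and string, but declaring it as StringType makes Spark keep the raw JSON text of either shape, which can then be parsed selectively; the path and the struct DDL below are assumptions:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.getOrCreate()

    # Keep the ambiguous field as a raw JSON string on read...
    schema = StructType([StructField("value", StringType(), True)])
    df = spark.read.schema(schema).json("path/to/data.json")  # placeholder

    # ...then parse the struct variant; plain-string rows come back null here
    parsed = df.withColumn(
        "value_struct",
        F.from_json("value", "struct<`entity-type`: string>"),
    )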

How to create an encoder for a row type of Map[(String,String),List[Row]] for creating a Dataset[Map[(String,String),List[Row]]] in Spark?

This is my piece of code. There is a lot of business logic happening here. I have tried to explain it in as understandable a manner as possible. I have…

Spark Structured Streaming writeStream trigger set to once is recording much less data than it should

I have a program that runs every hour; it receives streaming data and writes it in parquet format in batches into a datalake every time it runs, to be later processed…
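One thing worth ruling out: trigger(once=True) processes only the data the source reports as available when the query starts, so records arriving mid-run wait for the next hourly invocation and can look like missing data. A sketch, assuming df is the streaming DataFrame and the paths are placeholders:

    query = (
        df.writeStream
        .format("parquet")
        .option("path", "s3://datalake/events/")                   # placeholder
        .option("checkpointLocation", "s3://datalake/chk/events/")  # placeholder
        .trigger(once=True)
        .start()
    )
    query.awaitTermination()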

Exception when trying to use saved Spark ML model for distributed computations with SHAP

I am getting this error: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext…

Update the data in the existing table based on different levels in Spark SQL

I have this existing table tb1 in my database. Now new data comes in and is stored in another table tb2. Earlier, Account_Number 9988 was Level 2, but now…
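Plain Spark SQL tables cannot be updated in place (formats like Delta support MERGE for this), so the common workaround is a join that prefers the new level, followed by a rewrite of the result; the Level column name is assumed from the description:

    import pyspark.sql.functions as F

    tb1 = spark.table("tb1")  # existing data
    tb2 = spark.table("tb2")  # incoming updates

    # Take the new level where tb2 has the account, else keep the old one
    updated = (
        tb1.alias("old")
        .join(tb2.alias("new"), "Account_Number", "left")
        .select(
            "Account_Number",
            F.coalesce(F.col("new.Level"), F.col("old.Level")).alias("Level"),
        )
    )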

How to preserve the case of a JSON key inside a Glue table which uses a SerDe?

I have created a Glue table which converts the JSON to parquet files. In one of the columns, which is defined as Map<String,String> and contains nested JSON…

What is the right memory allocation for multiple Spark streaming jobs processed in a single EMR cluster (m5.xlarge)?

I have 12 Spark streaming jobs, and each receives a small amount of data at any time. These scripts have Spark transformations and joins. What is the right memory allocation…
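There is no single right answer without measuring, but with 4 vCPUs and 16 GiB per m5.xlarge node and small batches, keeping each job's footprint minimal so YARN can pack all twelve is a reasonable starting point; every number below is an assumption to tune:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("small_streaming_job")
        .config("spark.executor.instances", "1")
        .config("spark.executor.cores", "1")
        .config("spark.executor.memory", "1g")
        .config("spark.driver.memory", "1g")
        .config("spark.dynamicAllocation.enabled", "false")
        .getOrCreate()
    )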

What's the best way to rate-limit a Spark application?

I have an application that does the following: reads URLs from a Hive table; creates HTTP requests from those URLs, hits a server with them, and parses the responses; w…
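A simple pattern is to cap the partition count and pace requests inside mapPartitions, so the total rate is roughly partitions × per-partition rate; the table name, url column, and rate are assumptions, and the requests library must be installed on the executors:

    import time
    import requests  # assumed to be available on the executors

    MAX_RPS_PER_PARTITION = 5  # assumption: tune to the server's limit

    def fetch(urls):
        for url in urls:
            start = time.monotonic()
            yield url, requests.get(url, timeout=10).status_code
            # Sleep off the remainder of this request's time budget
            elapsed = time.monotonic() - start
            time.sleep(max(0.0, 1.0 / MAX_RPS_PER_PARTITION - elapsed))

    urls_rdd = spark.table("urls").rdd.map(lambda r: r.url)  # placeholder names
    results = urls_rdd.repartition(4).mapPartitions(fetch)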

pyspark error: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD

When I tried to search from Spark to Elasticsearch, an error occurred. The code that I use is the following: from pyspark import SparkContext; from pyspark.sql import…
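Rather than the low-level newAPIHadoopRDD path, the same elasticsearch-hadoop connector exposes a DataFrame reader that is usually less fragile; host, port, and index below are placeholders, and the connector jar must be on the classpath:

    # e.g. spark-submit --jars elasticsearch-spark-20_2.11-<version>.jar ...
    df = (
        spark.read.format("org.elasticsearch.spark.sql")
        .option("es.nodes", "localhost")  # placeholder host
        .option("es.port", "9200")
        .load("my_index")                 # placeholder index
    )
    df.show()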

Spark dataframe from dictionary

I'm trying to create a Spark dataframe from a dictionary which has data in the format {'33_45677': 0, '45_3233': 25, '56_4599': 43524}, etc. dict_pairs={'33…
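Since dict.items() yields (key, value) pairs, they map directly onto a two-column dataframe; the column names below are made up:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    dict_pairs = {'33_45677': 0, '45_3233': 25, '56_4599': 43524}
    df = spark.createDataFrame(list(dict_pairs.items()), ["id", "value"])
    df.show()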

Spark SQL: find the number of extensions for a record

I have a dataset as below:

    col1   extension_col1
    2345   2246
    2246   2134
    2134   2091
    2091   Null
    1234   1111
    1111   Null

I need to find the number of extensions available for…
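One way to count chain length without recursive SQL is to follow the extension link one hop per join, up to an assumed maximum depth; a sketch over the sample data:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    edges = spark.createDataFrame(
        [("2345", "2246"), ("2246", "2134"), ("2134", "2091"),
         ("2091", None), ("1234", "1111"), ("1111", None)],
        ["col1", "extension_col1"],
    )

    # State per record: the node currently reached and hops counted so far
    result = edges.select(
        "col1",
        F.col("extension_col1").alias("cur"),
        F.when(F.col("extension_col1").isNull(), 0).otherwise(1)
         .alias("extensions"),
    )
    step = edges.select(F.col("col1").alias("cur"),
                        F.col("extension_col1").alias("nxt"))

    for _ in range(10):  # assumed maximum chain depth; assumes no cycles
        result = (
            result.join(step, "cur", "left")
            .select(
                "col1",
                F.col("nxt").alias("cur"),
                (F.col("extensions")
                 + F.when(F.col("nxt").isNull(), 0).otherwise(1))
                .alias("extensions"),
            )
        )

    result.select("col1", "extensions").show()  # e.g. 2345 -> 3, 2091 -> 0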