Category "apache-spark"

How to extract values from key value map?

I have a column of type map, where the key and value change. I am trying to extract the value and create a new column. Input: +----------------+ |symbols
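
A minimal PySpark sketch of one way to pull the values out of a map column; the column name symbols and the sample data are assumptions based on the truncated preview.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: a map column whose key differs from row to row
df = spark.createDataFrame(
    [({"AAPL": 150.0},), ({"GOOG": 2800.0},)],
    ["symbols"],
)

# map_values() collects all values into an array; element_at() picks one of them
df = df.withColumn("values", F.map_values("symbols"))
df = df.withColumn("value", F.element_at("values", 1))
df.show(truncate=False)
```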

SPARK SQL - case when then

I'm new to Spark SQL. Is there an equivalent to "CASE WHEN 'CONDITION' THEN 0 ELSE 1 END" in Spark SQL? select case when 1=1 then 1 else 0 end from table Tha
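
Spark SQL does accept CASE WHEN directly, and the DataFrame API has when/otherwise; a small sketch (the view and column names are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,)], ["x"])

# Spark SQL parses CASE WHEN as-is
df.createOrReplaceTempView("tbl")
spark.sql("SELECT CASE WHEN x = 1 THEN 1 ELSE 0 END AS flag FROM tbl").show()

# DataFrame API equivalent: when / otherwise
df.select(F.when(F.col("x") == 1, 1).otherwise(0).alias("flag")).show()
```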

How to use spark with large decimal numbers?

My database has numeric values, which go up to 256-bit unsigned integers. However, Spark's DecimalType has a limit of Decimal(38,18). When I try to do calculatio
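
A hedged sketch of one common workaround when values exceed Decimal(38,18): carry the number as a string and do the arbitrary-precision arithmetic outside Spark SQL (for example in a Python UDF, where int has no fixed width). The column name and sample value are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# DecimalType tops out at precision 38, while a 256-bit unsigned integer can
# have up to 78 decimal digits, so it cannot be stored losslessly as a decimal.
schema = StructType([StructField("big_number", StringType())])
df = spark.createDataFrame(
    [("115792089237316195423570985008687907853269984665640564039457584007913129639935",)],
    schema,
)

# Arithmetic can then be done in a Python UDF, where int is arbitrary precision.
df.show(truncate=False)
```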

How can I increase spark.driver.memoryOverhead in Google dataproc?

I am getting two types of errors when running a job on Google Dataproc, and they are causing executors to be lost one by one until the last executor is lost and the j
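
A hedged sketch of passing the overhead properties at submit time on Dataproc; the cluster, region, and job file names are placeholders, the values are in MiB, and on Spark versions before 2.3 the YARN-prefixed names (spark.yarn.driver.memoryOverhead, spark.yarn.executor.memoryOverhead) apply instead.

```
gcloud dataproc jobs submit pyspark my_job.py \
  --cluster=my-cluster \
  --region=us-central1 \
  --properties=spark.driver.memoryOverhead=2048,spark.executor.memoryOverhead=2048
```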

Do I need to use Spark with YARN to achieve NODE LOCAL data locality with HDFS?

Do I need to use Spark with YARN to achieve NODE LOCAL data locality with HDFS? If I use the Spark standalone cluster manager and have my data distributed in HDFS c

I am trying to start a Spark Session in Jupyter Notebook and get the following error

This is my first time using PySpark. I am using a Mac and I am trying to start up a session within Jupyter Notebook using the code below: import pyspark from py
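
A minimal sketch of starting a local session from a notebook, assuming findspark is installed to point the kernel at the Spark installation (it can be skipped if pyspark itself was pip-installed):

```python
# Assumes findspark is installed (pip install findspark); it makes a locally
# installed Spark visible to the Jupyter kernel.
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("jupyter-test")
    .getOrCreate()
)
spark.range(5).show()
```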

Class com.hadoop.compression.lzo.LzoCodec not found for Spark on CDH 5?

I have been working on this problem for two days and still have not found a way. Problem: Our Spark installation via the newest CDH 5 always complains about the lost

Different default persist for RDD and Dataset

I was trying to find a good answer for why the default persist for RDD is MEMORY_ONLY whereas for Dataset it is MEMORY_AND_DISK. But I couldn't find it. Does an
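
A small sketch that just prints the two defaults, as a way to confirm the difference (in recent Spark releases the Dataset default is the deserialized MEMORY_AND_DISK variant):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

rdd = spark.sparkContext.parallelize(range(10))
df = spark.range(10)

rdd.persist()   # RDD default: MEMORY_ONLY
print(rdd.getStorageLevel())

df.persist()    # Dataset/DataFrame default: MEMORY_AND_DISK
print(df.storageLevel)
```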

Why does persist(StorageLevel.MEMORY_AND_DISK) give different results than cache() with HBase?

I may sound naive asking this question, but this is a problem I have recently faced in my project. I need a better understanding of it. df.persist(Stora

Pyspark UDF monitoring with prometheus

I am trying to monitor some logic in a UDF using counters, i.e. counter = Counter(...).labels("value") @udf def do_smthng(col): if col: counter.label(
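
A hedged sketch of one pattern for this, assuming prometheus_client is installed on the executors as well as the driver; the metric and label names are illustrative. The counter is created lazily inside the worker because prometheus_client objects hold locks and may not pickle cleanly from the driver, and counters incremented inside a UDF live in the executor processes, so they are only visible if each executor exposes or pushes its metrics (e.g. via a Pushgateway).

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

_counter = None

def _get_counter():
    # Create the counter lazily inside each worker process; prometheus_client
    # objects hold locks and may not survive being pickled from the driver.
    global _counter
    if _counter is None:
        from prometheus_client import Counter
        _counter = Counter("udf_rows_total", "Rows seen by the UDF", ["outcome"])
    return _counter

@udf(returnType=StringType())
def do_smthng(col):
    c = _get_counter()
    if col:
        c.labels("non_null").inc()
        return col.upper()
    c.labels("null").inc()
    return None

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), (None,)], ["col"])
df.select(do_smthng("col").alias("out")).show()
```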

spark elasticsearch: Multiple ES-Hadoop versions detected in the classpath

I'm new to Spark. I'm trying to run a Spark job that loads data into Elasticsearch. I've built a fat jar from my code and used it during spark-submit. spark-subm

to_date gives null on format yyyyww (202001 and 202053)

I have a dataframe with a yearweek column that I want to convert to a date. The code I wrote seems to work for every week except for weeks '202001' and '202053',
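
A hedged workaround sketch, assuming Spark 3.0+ (for make_date): since week-based datetime patterns are not reliably supported by to_date, compute the date arithmetically from the year and week parts. This uses a simple "week 1 starts on Jan 1" convention rather than ISO weeks, so adjust if ISO semantics are required.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("202001",), ("202053",)], ["yearweek"])

df = (
    df.withColumn("year", F.substring("yearweek", 1, 4).cast("int"))
      .withColumn("week", F.substring("yearweek", 5, 2).cast("int"))
      # Jan 1 of the year plus (week - 1) whole weeks
      .withColumn("week_start", F.expr("date_add(make_date(year, 1, 1), (week - 1) * 7)"))
)
df.show()
```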

Sum a column values based on a condition using spark scala

I have a dataframe like this:

JoiKey  period   Age  Amount
Jk1     2022-02  2    200
Jk1     2022-02  3    450
Jk2     2022-03  5    500
Jk3     2022-03  0    200
Jk2     2022-02  8    300
Jk3     2022-03  9
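
The question asks for Scala, but to keep one language across these sketches, here is the analogous DataFrame idea in PySpark; the condition (Age >= 2) and the grouping columns are assumptions, since the preview is cut off.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Jk1", "2022-02", 2, 200), ("Jk1", "2022-02", 3, 450),
     ("Jk2", "2022-03", 5, 500), ("Jk3", "2022-03", 0, 200),
     ("Jk2", "2022-02", 8, 300)],
    ["JoiKey", "period", "Age", "Amount"],
)

# Sum Amount only for rows matching the assumed condition Age >= 2,
# grouped by JoiKey and period; non-matching rows contribute 0.
result = df.groupBy("JoiKey", "period").agg(
    F.sum(F.when(F.col("Age") >= 2, F.col("Amount")).otherwise(0)).alias("total")
)
result.show()
```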

Converting Pandas dataframe into Spark dataframe error

I'm trying to convert a Pandas DF into a Spark one. DF head: 10000001,1,0,1,12:35,OK,10002,1,0,9,f,NA,24,24,0,3,9,0,0,1,1,0,0,4,543 10000001,2,0,1,12:36,OK,10002,1
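
A hedged sketch: mixed values such as the 'NA' strings in the preview often break Spark's type inference during conversion, so passing an explicit schema is the usual fix. The column names and types below are illustrative.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, StringType

spark = SparkSession.builder.getOrCreate()

# Hypothetical subset of the CSV-like data from the question
pdf = pd.DataFrame({
    "id": [10000001, 10000001],
    "seq": [1, 2],
    "status": ["OK", "OK"],
    "flag": ["f", "NA"],   # 'NA' kept as a plain string, not a missing value
})

schema = StructType([
    StructField("id", LongType()),
    StructField("seq", LongType()),
    StructField("status", StringType()),
    StructField("flag", StringType()),
])

sdf = spark.createDataFrame(pdf, schema=schema)
sdf.show()
```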

DAG of Spark Sort application spanning two jobs

I've written a very simple Sort Scala program with Spark. object Sort { def main(args: Array[String]): Unit = { if (args.length < 2) {
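
For reference, a tiny PySpark sketch of why a plain sort shows up as two jobs: sortByKey builds a RangePartitioner, which runs a small sampling job to pick partition boundaries before the job that actually shuffles and writes the output.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([(k, 1) for k in [5, 3, 1, 4, 2]], numSlices=4)

# sortByKey needs a RangePartitioner, which samples the input in a small
# preliminary job to pick partition boundaries; shuffling and writing the
# sorted data is then a second job, so the UI shows two jobs for one sort.
sorted_rdd = rdd.sortByKey()
sorted_rdd.saveAsTextFile("/tmp/sorted-output")  # output path is illustrative
```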

OutOfMemoryError : Java heap space in Spark

I'm facing some problems regarding memory, but I'm unable to solve them. Any help is highly appreciated. I am new to Spark and pyspark functionalities a
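
A hedged sketch of the usual first step for heap-space errors: raise driver and executor memory. The sizes are placeholders, and on a real cluster these are normally passed to spark-submit rather than set in code, since driver memory must be fixed before the driver JVM starts.

```python
from pyspark.sql import SparkSession

# Sizes are placeholders; on a cluster these are usually passed to spark-submit
# (--driver-memory / --executor-memory), because spark.driver.memory only takes
# effect if it is set before the driver JVM is launched.
spark = (
    SparkSession.builder
    .appName("memory-tuning-sketch")
    .config("spark.driver.memory", "4g")
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)
```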

Livy and Elasticsearch-Spark: Multiple ES-Hadoop versions detected

I'm trying to read from Elasticsearch in a Livy job using the elasticsearch-spark jar. When I upload the jar to a Livy client (like the example here) I get this e

Spark saveAsTextFile() results in Mkdirs failed to create for half of the directory

I am currently running a Java Spark application in Tomcat and receiving the following exception: Caused by: java.io.IOException: Mkdirs failed to create file:/

DataFrame partitionBy to a single Parquet file (per partition)

I would like to repartition / coalesce my data so that it is saved into one Parquet file per partition. I would also like to use the Spark SQL partitionBy API.
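
A hedged sketch of the common pattern: repartition by the same column you pass to partitionBy, so each partition value is handled by a single task and each output directory ends up with one file. The column names and output path are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "2020-01-01"), (2, "2020-01-01"), (3, "2020-01-02")],
    ["id", "event_date"],
)

# Repartitioning on the same column used for partitionBy routes all rows for a
# given value to one task, so each partition directory gets exactly one file.
(
    df.repartition("event_date")
      .write.mode("overwrite")
      .partitionBy("event_date")
      .parquet("/tmp/events")   # output path is illustrative
)
```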

Comparing schema of dataframe using Pyspark

I have a data frame (df). To show its schema I use: from pyspark.sql.functions import * df1.printSchema() And I get the following result: #root # |-- na
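
A small sketch of two ways to compare schemas in PySpark; the dataframes are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1, "a")], ["id", "name"])
df2 = spark.createDataFrame([(2, "b")], ["id", "name"])

# StructType objects compare directly...
print(df1.schema == df2.schema)

# ...or compare field by field to see exactly what differs
fields1 = {f.name: f.dataType.simpleString() for f in df1.schema.fields}
fields2 = {f.name: f.dataType.simpleString() for f in df2.schema.fields}
print(set(fields1.items()) ^ set(fields2.items()))
```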