Category "apache-spark"

Error while reading date and datetime column from mariadb via spark

I am reading a MariaDB table that has date and datetime fields from Spark. Spark throws an error while reading. Below is the schema of the MariaDB table: Spar

At what point should you force a cache in Spark when performing heavy transformations?

Say you have something like this: big_table1 = spark.table('db.big_table1').cache() big_table2 = spark.table('db.big_table2').cache() big_table2 = spark.table('

Building Cube in Apache Kylin hangs on the second step

I am trying to build a Cube in Kylin. It successfully completes the first step and then just keeps running, stuck at 50%. Log from step 2: 2022-05-13 13:54:40,640

HDFS Date partition directory loop

I have an HDFS directory as below: /user/staging/app_name/2022_05_06. Under this directory I have around 1000 part files. I want to loop each of the part file

Partition not working in mongodb spark read in java connector

I was trying to read data using MongoDb spark connector, and want to partition the dataset on a key, reading from mongoD standalone instance. I was looking at t

Scala spark UDF function that takes input and puts it in an Array

I am trying to create a Scala UDF for Spark, that can be used in Spark SQL. The objective of the function is to accept any column type as input, and put it in a

What does the HiveWarehouseConnector executeUpdate() function return?

I can't believe I have to ask this here but there seems to be no documentation on what the HWC actually does. All I can find is that it returns a boolean: publi

spark how to convert a json string to a struct column without schema

Spark: 3.0.0 Scala: 2.12.8 My data frame has a column with JSON string and I want to create a new column from it with the StructType. |temp_json_string

Is there any way to read multiple parquet paths from s3 in parallel using spark?

My data is stored in s3 (parquet format) under different paths and I'm using spark.read.parquet(pathes:_*) in order to read all the paths into one dataframe. Un

(py)spark weighted average taking account of missing values

Is there a canonical way to compute the weighted average in pyspark ignoring missing values in the denominator sum? Take the following example: # create data da

Hi All, facing an issue of spark sql query for delete on basis of timestamp

I am running the delete query with the < (less than) and > (greater than) conditions on the timestamp field, but we are not getting the desired results. Fir

Issue while running command spark-shell on windows

I have been trying to set up spark to use it further for pyspark library. I installed JDK, Hadoop and spark. Also provided the environment variables correctly.

Java 17 solution for Spark - java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.storage.StorageUtils

There are some solutions here Windows Spark Error java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.storage.StorageUtils The mentioned

How to select all columns except 2 of them from a large table on pyspark sql?

Errors when running spark-submit on a local machine with Apache Spark (stand alone, single node)

Spark writing extra rows when saving to CSV

How to close the spark instance

How to find the number of Inserts and Updates of Merge command?

Processing data from a kafka stream using Pyspark

Error to write dataframe in Cassandra table on Amazon Keyspaces
