    import org.apache.spark.sql.SparkSession

    object RDDBroadcast extends App {
      val spark = SparkSession.builder()
        .appName("SparkByExamples.com")
        .master("local[1]") // assumed local master; the original snippet is cut off at ".maste"
        .getOrCreate()
    }
A data file is imported into a SQL Server table. One of the columns in the data file is of text data type, with the values in this column expected to be integers only. The correspo
I am using Spark 3.1.2 with Hadoop 3.2.0 to run a Spark Structured Streaming (SSS) aggregation job, running on Spark on K8s. These jobs read files from S3 usi
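For context, a minimal sketch of such a job, assuming a JSON file source and hypothetical bucket paths and schema (the original question does not show its code):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, LongType

    spark = SparkSession.builder.appName("sss-s3-aggregation").getOrCreate()

    # A file-source stream needs an explicit schema; this one is hypothetical.
    schema = StructType([
        StructField("user_id", StringType()),
        StructField("amount", LongType()),
    ])

    events = (spark.readStream
        .schema(schema)
        .json("s3a://my-bucket/events/"))  # hypothetical path, read via the s3a connector

    agg = events.groupBy("user_id").agg(F.sum("amount").alias("total"))

    query = (agg.writeStream
        .outputMode("complete")
        .format("console")
        .option("checkpointLocation", "s3a://my-bucket/checkpoints/agg-job/")  # hypothetical
        .start())

    query.awaitTermination()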
Question: In an Apache Spark DataFrame, using Python, how can we get the data type and length of each column? I'm using the latest version of Python. Using pandas data
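For reference, a sketch of one common approach: df.dtypes returns (column, type) pairs, and the maximum value length per column can be computed with length(); the sample data here is made up.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("dtypes-demo").getOrCreate()
    df = spark.createDataFrame([("alice", 30), ("bob", 4)], ["name", "age"])

    # Data type of each column: a list of (column, type) tuples.
    print(df.dtypes)  # [('name', 'string'), ('age', 'bigint')]

    # Maximum length of each column's values (cast to string first so it
    # also works for non-string columns).
    lengths = df.select([
        F.max(F.length(F.col(c).cast("string"))).alias(c) for c in df.columns
    ])
    lengths.show()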
I tried to run working Apache Beam code with the Spark runner, using the command spark-submit --class org.apache.beam.examples.WordCount --master spark://localhost:404
I am having trouble retrieving the old value of a column before a cast in Spark. Initially, all my inputs are strings, and I want to cast the column num1 into a
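One way to keep the pre-cast value, sketched under the assumption that the goal is an integer cast: copy the column before overwriting it.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("cast-demo").getOrCreate()
    df = spark.createDataFrame([("1",), ("2",), ("oops",)], ["num1"])

    # Keep the original string in num1_raw, then overwrite num1 with the cast.
    # Rows that fail the cast (like "oops") become null in num1 but keep num1_raw.
    df2 = (df
        .withColumn("num1_raw", F.col("num1"))
        .withColumn("num1", F.col("num1").cast("int")))

    df2.show()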
I see Spark as superior to Flink. Below is my research. I see that most features of Spark are covered in Flink, except for the Fair sche
I'm facing an issue when trying to use pyspark=3.1.2. I have Java 1.8 installed and added to my user PATH. But according to the docs it does not need any other
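A quick sanity check that PySpark can see the JDK, sketched with a hypothetical JAVA_HOME path (PySpark 3.1.x runs on Java 8 or 11):

    import os

    # Hypothetical JDK location; adjust to wherever Java 1.8 is installed.
    os.environ.setdefault("JAVA_HOME", "/usr/lib/jvm/java-8-openjdk")

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("env-check").getOrCreate()
    print(spark.version)  # should print 3.1.2 if the install is healthy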
I have an RDD as follows:

    rdd
      .filter { case (_, record) => predicates.forall(_.accept(record)) }
      .toDS()
      .cache()

It basically filters down an RDD
I use Spark Structured Streaming 3.1.2. I need to use S3 for storing checkpoint metadata (I know, it's not optimal storage for checkpoint metadata). Compaction in
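For context, the checkpoint location in Structured Streaming is just an option on the sink; a minimal sketch with a hypothetical s3a bucket, using the built-in rate source so it is self-contained:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()

    # The rate source generates rows locally, so the example runs anywhere.
    stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

    query = (stream.writeStream
        .format("parquet")
        .option("path", "s3a://my-bucket/output/")              # hypothetical output path
        .option("checkpointLocation", "s3a://my-bucket/ckpt/")  # checkpoint metadata on S3
        .start())

    query.awaitTermination()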
problem screenshot:

    :14: error: not found: value spark
           import spark.implicits._
                  ^
    :14: error: not found: value spark
           import spark.sql
                  ^

here is my environment con
I'm querying data from Cassandra in Spark using SCC 2.5 and Spark 2.4.7 (pyspark). The table I'm reading from has a composite partition key (partition_key_1, pa
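For reference, reading such a table with the Spark Cassandra Connector looks roughly like this (host, keyspace, table, and key values are hypothetical); equality filters on every partition-key column can be pushed down to Cassandra:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        .appName("cassandra-read")
        .config("spark.cassandra.connection.host", "127.0.0.1")  # hypothetical host
        .getOrCreate())

    df = (spark.read
        .format("org.apache.spark.sql.cassandra")
        .options(table="my_table", keyspace="my_ks")  # hypothetical names
        .load()
        # Filtering on all partition-key columns lets the connector push the
        # predicate down instead of scanning the whole table.
        .where("partition_key_1 = 'a' AND partition_key_2 = 'b'"))

    df.show()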
I have a dataframe that looks like this:

    id      pub_date  version  unique_id  c_id  p_id  type  source
    lni001  20220301  1
I need help in converting the below function into an SQL query:

    start_time: 1649289600
    end_time:   1649375999

    test_data = df.withColumn("from_timestamp", to_t
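A sketch of the SQL-side equivalent, assuming to_t is the start of to_timestamp and that the column holds epoch seconds (the view and column names here are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("epoch-to-ts").getOrCreate()
    df = spark.createDataFrame([(1649289600,), (1649375999,)], ["epoch_s"])
    df.createOrReplaceTempView("test_data")

    # from_unixtime turns epoch seconds into a timestamp string,
    # and to_timestamp casts that string to a proper timestamp.
    spark.sql("""
        SELECT epoch_s,
               to_timestamp(from_unixtime(epoch_s)) AS from_timestamp
        FROM test_data
    """).show(truncate=False)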
Does computeSVD() use map and reduce, since it is a predefined function? I couldn't find the source code of the function.

    from pyspark.mllib.linalg import Vectors
    from py
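For context, computeSVD() is a method on RowMatrix; a minimal, self-contained call looks like the sketch below (the data is made up). The implementation itself is in Scala, in org.apache.spark.mllib.linalg.distributed.RowMatrix, which is where one would look to see how the distributed computation is done.

    from pyspark.sql import SparkSession
    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.linalg.distributed import RowMatrix

    spark = SparkSession.builder.appName("svd-demo").getOrCreate()
    sc = spark.sparkContext

    rows = sc.parallelize([
        Vectors.dense(1.0, 0.0, 7.0),
        Vectors.dense(2.0, 4.0, 1.0),
        Vectors.dense(3.0, 5.0, 2.0),
    ])

    mat = RowMatrix(rows)
    svd = mat.computeSVD(2, computeU=True)  # top-2 singular values

    print(svd.s)            # singular values
    print(svd.U.numRows())  # U is itself a distributed RowMatrix
    print(svd.V)            # V is a local dense matrix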
I'm storing the prices of products in a delta table. The schema of the table is like this:

    id | price | updated
    1  | 3     | 2022-03-21
    2  | 4     | 2022-03-20
I have a log file in CSV that has a column containing a list of filepaths separated by commas. I want to split those filepaths into new rows using pyspark (or exce
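One common pyspark approach, sketched with hypothetical column names: split the string on commas into an array, then explode the array into one row per element.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("split-paths").getOrCreate()
    df = spark.createDataFrame(
        [("job1", "/a/b.log,/a/c.log"), ("job2", "/d/e.log")],
        ["job_id", "filepaths"],  # hypothetical schema
    )

    # split() produces an array column; explode() emits one row per element.
    result = df.withColumn("filepath", F.explode(F.split(F.col("filepaths"), ",")))
    result.show(truncate=False)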
I am running a Python project through a DAG in Airflow, and I encounter the following exception when the DAG runs this line from the project: df = spark.sql(quer
I am trying to extract a value from an array in SparkSQL, but am getting the error below.

Example column customer_details:

    {"original_customer_id":"ch_382820","fi
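If customer_details is a JSON string (as the sample suggests), one way to pull a single field in Spark SQL is get_json_object; the view name below is hypothetical, and the sample JSON is trimmed to the one field shown in the question.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("json-extract").getOrCreate()
    df = spark.createDataFrame(
        [('{"original_customer_id":"ch_382820"}',)],
        ["customer_details"],
    )
    df.createOrReplaceTempView("customers")  # hypothetical view name

    # get_json_object takes a JSONPath-style expression rooted at $.
    spark.sql("""
        SELECT get_json_object(customer_details, '$.original_customer_id')
                 AS original_customer_id
        FROM customers
    """).show()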