    import org.apache.spark.sql.SparkSession

    object RDDBroadcast extends App {
      val spark = SparkSession.builder()
        .appName("SparkByExamples.com")
        .master("local[1]") // assumed local master; the original snippet is cut off at ".maste"
        .getOrCreate()
    }
A data file is imported into a SQL Server table. One of the columns in the data file is of text data type, with the values in this column expected to be integers only. The correspo
I am using Spark 3.1.2 with Hadoop 3.2.0 to run a Spark Structured Streaming (SSS) aggregation job, running on Spark on K8s. These jobs read files from S3 usi
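For context, a minimal sketch of such a job, assuming a JSON file source and hypothetical bucket paths and schema (the original question does not show its code):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, LongType

    spark = SparkSession.builder.appName("sss-s3-aggregation").getOrCreate()

    # A file-source stream needs an explicit schema; this one is hypothetical.
    schema = StructType([
        StructField("user_id", StringType()),
        StructField("amount", LongType()),
    ])

    events = (spark.readStream
        .schema(schema)
        .json("s3a://my-bucket/events/"))  # hypothetical path, read via the s3a connector

    agg = events.groupBy("user_id").agg(F.sum("amount").alias("total"))

    query = (agg.writeStream
        .outputMode("complete")
        .format("console")
        .option("checkpointLocation", "s3a://my-bucket/checkpoints/agg-job/")  # hypothetical
        .start())

    query.awaitTermination()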
Question: In an Apache Spark DataFrame, using Python, how can we get the data type and length of each column? I'm using the latest version of Python. Using pandas data
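For reference, a sketch of one common approach: df.dtypes returns (column, type) pairs, and the maximum value length per column can be computed with length(); the sample data here is made up.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("dtypes-demo").getOrCreate()
    df = spark.createDataFrame([("alice", 30), ("bob", 4)], ["name", "age"])

    # Data type of each column: a list of (column, type) tuples.
    print(df.dtypes)  # [('name', 'string'), ('age', 'bigint')]

    # Maximum length of each column's values (cast to string first so it
    # also works for non-string columns).
    lengths = df.select([
        F.max(F.length(F.col(c).cast("string"))).alias(c) for c in df.columns
    ])
    lengths.show()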
I tried to run working Apache Beam code with the Spark runner, using the command spark-submit --class org.apache.beam.examples.WordCount --master spark://localhost:404
I am having trouble retrieving the old value of a column before a cast in Spark. Initially, all my inputs are strings, and I want to cast the column num1 into a
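One way to keep the pre-cast value, sketched under the assumption that the goal is an integer cast: copy the column before overwriting it.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("cast-demo").getOrCreate()
    df = spark.createDataFrame([("1",), ("2",), ("oops",)], ["num1"])

    # Keep the original string in num1_raw, then overwrite num1 with the cast.
    # Rows that fail the cast (like "oops") become null in num1 but keep num1_raw.
    df2 = (df
        .withColumn("num1_raw", F.col("num1"))
        .withColumn("num1", F.col("num1").cast("int")))

    df2.show()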
I see Spark as superior to Flink. Below is my research. I see that most features of Spark are covered in Flink, except for the Fair sche
I'm facing an issue when trying to use pyspark=3.1.2. I have Java 1.8 installed and added to my user PATH. But according to the docs it does not need any other
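A quick sanity check that PySpark can see the JDK, sketched with a hypothetical JAVA_HOME path (PySpark 3.1.x runs on Java 8 or 11):

    import os

    # Hypothetical JDK location; adjust to wherever Java 1.8 is installed.
    os.environ.setdefault("JAVA_HOME", "/usr/lib/jvm/java-8-openjdk")

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("env-check").getOrCreate()
    print(spark.version)  # should print 3.1.2 if the install is healthy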
I have an RDD as follows:

    rdd
      .filter { case (_, record) => predicates.forall(_.accept(record)) }
      .toDS()
      .cache()

It basically filters down an RDD
I use Spark Structured Streaming 3.1.2. I need to use S3 for storing checkpoint metadata (I know, it's not optimal storage for checkpoint metadata). Compaction in
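For context, the checkpoint location in Structured Streaming is just an option on the sink; a minimal sketch with a hypothetical s3a bucket, using the built-in rate source so it is self-contained:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()

    # The rate source generates rows locally, so the example runs anywhere.
    stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

    query = (stream.writeStream
        .format("parquet")
        .option("path", "s3a://my-bucket/output/")              # hypothetical output path
        .option("checkpointLocation", "s3a://my-bucket/ckpt/")  # checkpoint metadata on S3
        .start())

    query.awaitTermination()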
problem screenshot:

    :14: error: not found: value spark
           import spark.implicits._
                  ^
    :14: error: not found: value spark
           import spark.sql
                  ^

here is my environment con
I'm querying data from Cassandra in Spark using SCC 2.5 and Spark 2.4.7 (pyspark). The table I'm reading from has a composite partition key (partition_key_1, pa
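For reference, reading such a table with the Spark Cassandra Connector looks roughly like this (host, keyspace, table, and key values are hypothetical); equality filters on every partition-key column can be pushed down to Cassandra:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        .appName("cassandra-read")
        .config("spark.cassandra.connection.host", "127.0.0.1")  # hypothetical host
        .getOrCreate())

    df = (spark.read
        .format("org.apache.spark.sql.cassandra")
        .options(table="my_table", keyspace="my_ks")  # hypothetical names
        .load()
        # Filtering on all partition-key columns lets the connector push the
        # predicate down instead of scanning the whole table.
        .where("partition_key_1 = 'a' AND partition_key_2 = 'b'"))

    df.show()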
I have a dataframe that looks like this:

    id      pub_date  version  unique_id  c_id  p_id  type  source
    lni001  20220301  1
I need help in converting the below function into an SQL query:

    start_time: 1649289600
    end_time:   1649375999

    test_data = df.withColumn("from_timestamp", to_t
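A sketch of the SQL-side equivalent, assuming to_t is the start of to_timestamp and that the column holds epoch seconds (the view and column names here are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("epoch-to-ts").getOrCreate()
    df = spark.createDataFrame([(1649289600,), (1649375999,)], ["epoch_s"])
    df.createOrReplaceTempView("test_data")

    # from_unixtime turns epoch seconds into a timestamp string,
    # and to_timestamp casts that string to a proper timestamp.
    spark.sql("""
        SELECT epoch_s,
               to_timestamp(from_unixtime(epoch_s)) AS from_timestamp
        FROM test_data
    """).show(truncate=False)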
Does computeSVD() use map and reduce, since it is a predefined function? I couldn't find the source code of the function.

    from pyspark.mllib.linalg import Vectors
    from py
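For context, computeSVD() is a method on RowMatrix; a minimal, self-contained call looks like the sketch below (the data is made up). The implementation itself is in Scala, in org.apache.spark.mllib.linalg.distributed.RowMatrix, which is where one would look to see how the distributed computation is done.

    from pyspark.sql import SparkSession
    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.linalg.distributed import RowMatrix

    spark = SparkSession.builder.appName("svd-demo").getOrCreate()
    sc = spark.sparkContext

    rows = sc.parallelize([
        Vectors.dense(1.0, 0.0, 7.0),
        Vectors.dense(2.0, 4.0, 1.0),
        Vectors.dense(3.0, 5.0, 2.0),
    ])

    mat = RowMatrix(rows)
    svd = mat.computeSVD(2, computeU=True)  # top-2 singular values

    print(svd.s)            # singular values
    print(svd.U.numRows())  # U is itself a distributed RowMatrix
    print(svd.V)            # V is a local dense matrix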
I'm storing the prices of products in a delta table. The schema of the table is like this:

    id | price | updated
    1  | 3     | 2022-03-21
    2  | 4     | 2022-03-20
I have a log file in CSV that has a column containing a list of filepaths separated by commas. I want to split those filepaths into new rows using pyspark (or exce
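One common pyspark approach, sketched with hypothetical column names: split the string on commas into an array, then explode the array into one row per element.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("split-paths").getOrCreate()
    df = spark.createDataFrame(
        [("job1", "/a/b.log,/a/c.log"), ("job2", "/d/e.log")],
        ["job_id", "filepaths"],  # hypothetical schema
    )

    # split() produces an array column; explode() emits one row per element.
    result = df.withColumn("filepath", F.explode(F.split(F.col("filepaths"), ",")))
    result.show(truncate=False)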
I am running a Python project through a DAG in Airflow, and I encounter the following exception when the DAG runs this line from the project: df = spark.sql(quer
I am trying to extract a value from an array in SparkSQL, but am getting the error below.

Example column customer_details:

    {"original_customer_id":"ch_382820","fi
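If customer_details is a JSON string (as the sample suggests), one way to pull a single field in Spark SQL is get_json_object; the view name below is hypothetical, and the sample JSON is trimmed to the one field shown in the question.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("json-extract").getOrCreate()
    df = spark.createDataFrame(
        [('{"original_customer_id":"ch_382820"}',)],
        ["customer_details"],
    )
    df.createOrReplaceTempView("customers")  # hypothetical view name

    # get_json_object takes a JSONPath-style expression rooted at $.
    spark.sql("""
        SELECT get_json_object(customer_details, '$.original_customer_id')
                 AS original_customer_id
        FROM customers
    """).show()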