Category "rdd"

Sort by key (Month) using RDDs in Pyspark

I have this RDD and wanna sort it by Month (Jan --> Dec). How can i do it in pyspark? Note: Don't want to use spark.sql or Dataframe. +-----+-----+ |Month|co

Pyspark AttributeError: 'NoneType' object has no attribute 'split''

I am working on a Pyspark using the flatMap function and I am using the split within the function. But I am getting an error which says: AttributeError: 'NoneTy

Pipe Pyspark OSError: [WinError 87] The parameter is incorrect

I have installed Spark 3.0.0 on a Windows 64 bit machine with Python 3.9.7 using an anaconda base environment. I'm trying to execute the next code in the pyspar

Spark partition size greater than the executor memory

I have four questions. Suppose in spark I have 3 worker nodes. Each worker node has 3 executors and each executor has 3 cores. Each executor has 5 gb memory. (T

How did spark RDD map to Cassandra table?

I am new to Spark, and recently I saw a code is saving data in RDD format to Cassandra table. But I am not able to figure it out how it is doing the column mapp

ValueError: RDD is empty-- Pyspark (Windows Standalone)

I am trying to create an RDD but spark not creating it, throwing back error, pasted below; data = records.map(lambda r: LabeledPoint(extract_label(r), extract_

Jupyter Notebook PySpark OSError [WinError 123] The filename, directory name, or volume label syntax is incorrect:

System Configuration: Operating System: Windows 10 Python Version: 3.7 Spark Version: 2.4.4 SPARK_HOME: C:\spark\spark-2.4.4-bin-hadoop2.7 Problem I am using

Spark dataframe transform multiple rows to column

I am a novice to spark, and I want to transform below source dataframe (load from JSON file): +--+-----+-----+ |A |count|major| +--+-----+-----+ | a| 1| m

Category "rdd"

Other Categories