Category "pyspark"

Spark writing extra rows when saving to CSV

I wrote a file to parquet containing 1,000,000 rows. When I read the parquet file back, the result is 1,000,000 rows. df = spark.read.parquet(parquet_path) df.
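
A minimal sketch of one way to compare the two formats; the paths are placeholders, and the quoting options address the common case where string fields containing newlines produce "extra rows" when the CSV is read back.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the parquet file back and confirm the row count (path is a placeholder).
    df = spark.read.parquet("s3://my-bucket/data.parquet")
    print(df.count())  # expected: 1000000

    # When writing to CSV, string fields with embedded newlines are a frequent
    # cause of a higher apparent row count unless quoting/escaping is set.
    (df.write
       .option("header", True)
       .option("quoteAll", True)
       .option("escape", '"')
       .mode("overwrite")
       .csv("s3://my-bucket/data_csv"))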

Py4JJavaError: An error occurred while calling o143.parquet

So I have parquet files in an S3 bucket and I want to load them using pyspark in Python, but I'm getting an error; here's what I have tried so far. I'm using Jupyter
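
A rough sketch of loading parquet from S3 with the s3a connector; the bucket, credentials, and hadoop-aws version below are placeholders and must match the cluster setup.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("read-s3-parquet")
             # hadoop-aws must match the Hadoop version bundled with your Spark
             .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4")
             .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")
             .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")
             .getOrCreate())

    df = spark.read.parquet("s3a://my-bucket/path/to/data/")
    df.show(5)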

Pyspark: join and union in for loop

I have some really simple logic that I would like to make work in pyspark. for data in df1: spark_data_row = spark.createDataFrame(data
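
Rather than calling createDataFrame row by row inside a Python loop, the usual pattern is to union the frames once and join the result; df_a, df_b, df_c and the join key below are illustrative.

    from functools import reduce
    from pyspark.sql import DataFrame

    # Assume df_a, df_b, df_c share the same schema.
    dfs = [df_a, df_b, df_c]
    combined = reduce(DataFrame.unionByName, dfs)

    # Join the unioned result back to another frame in one pass.
    result = combined.join(df1, on="id", how="left")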

How to close the spark instance

I want to stop my Spark instance once I complete my job running in a Jupyter notebook. I executed spark.stop() at the end, but when I open my terminal, I'
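
A minimal sketch: spark.stop() shuts down the SparkContext, but the notebook's JVM gateway and kernel can keep running until the kernel itself is stopped, which is likely what still shows up in the terminal.

    # Stop the SparkSession (and its SparkContext) at the end of the notebook.
    spark.stop()

    # Processes such as the py4j gateway may remain until the Jupyter kernel
    # is shut down or restarted.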

Pyspark create schema for maptype with different value types

I need to give the correct schema to an RDD I have, but am struggling with a MapType that has different value types. I guess the problem is that one specific key ha
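
MapType allows only a single value type, so a common workaround is to declare the values as strings and cast the heterogeneous values before building rows; the field names below are made up.

    from pyspark.sql.types import StructType, StructField, StringType, MapType

    schema = StructType([
        StructField("id", StringType(), True),
        # One value type is required, so mixed values are stored as strings.
        StructField("attributes", MapType(StringType(), StringType()), True),
    ])

    rows = [("a", {"height": "180", "name": "Sam"})]
    df = spark.createDataFrame(rows, schema)
    df.printSchema()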

Processing data from a kafka stream using Pyspark

What the console of the kafka consumer looks like:
["2017-12-31 16:06:01", 12472391, 1]
["2017-12-31 16:06:01", 12472097, 1]
["2017-12-31 16:05:59", 12471979,

How to apply a pandas geocode function to a Pyspark column

Table is like this:
id  ADDRESS
0   6101 SUMMITVIEW AVE STE 200 YAKIMA
1   527 CEDAR WAY SUITE 105 OAKMONT
2   1700 N ROSE AVE SUITE 460 OXNARD
3   1275 YORK AVE NEW YOR
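
One common route is wrapping the pandas logic in a pandas_udf; geocode_series below is a stand-in for whatever pandas-based geocoder is actually used.

    import pandas as pd
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    def geocode_series(addresses: pd.Series) -> pd.Series:
        # Placeholder for the real pandas-based geocoding call.
        return addresses.str.title()

    geocode_udf = F.pandas_udf(geocode_series, returnType=StringType())

    df = df.withColumn("GEOCODED", geocode_udf(F.col("ADDRESS")))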

Pyspark: Return next week's Saturday

I'm trying to return next week's Saturday date from the date-type column rel_d. Normally, in Python, I'd subtract the number of days till next Saturday and add it to the
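
A sketch using the built-in next_day function; whether the extra 7 days is needed depends on how "next week's Saturday" is defined.

    from pyspark.sql import functions as F

    # next_day returns the first Saturday strictly after rel_d.
    df = df.withColumn("next_sat", F.next_day(F.col("rel_d"), "Sat"))

    # Shift by a week if the Saturday of the following week is wanted instead.
    df = df.withColumn("next_week_sat", F.date_add(F.col("next_sat"), 7))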

How to handle a memory issue in Databricks with PySpark when writing data where a particular column contains very large values in each record

I have a set of records with 10 columns. There is a column 'x' which contains an array of float values, and the length of the array can be very large (e.g., the len
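
Two knobs that often help with very wide rows, sketched here with illustrative numbers and paths: cap the records per output file and spread the rows over more partitions before writing.

    # Limit how many (wide) records land in each output file.
    spark.conf.set("spark.sql.files.maxRecordsPerFile", 10000)

    # Spread the rows over more partitions before writing; tune the count.
    df = df.repartition(400)
    df.write.mode("overwrite").parquet("/mnt/output/wide_rows")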

How to manually checkpoint a delta table using PySpark?

I have a delta table, and am trying to append data to it and then checkpoint that table. By default I believe it checkpoints every 10 commits, but I would like
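
The checkpoint interval is exposed as a Delta table property, so one workable approach (sketched with an illustrative table path) is to lower the interval rather than checkpointing by hand.

    # delta.checkpointInterval defaults to 10 commits; setting it to 1 makes
    # every commit write a checkpoint.
    spark.sql("""
        ALTER TABLE delta.`/mnt/delta/my_table`
        SET TBLPROPERTIES ('delta.checkpointInterval' = '1')
    """)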

Spark partition size greater than the executor memory

I have four questions. Suppose in Spark I have 3 worker nodes. Each worker node has 3 executors and each executor has 3 cores. Each executor has 5 GB of memory. (T

How to divide two aggregate sum dataframes

I want to divide the sum of two columns in pyspark. For example, I have a dataset like below:
   A  B  C
1  1  2  3
2  1  2  3
3  1  2  3
What I want is t
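
A minimal sketch of dividing the two sums, both over the whole frame and per group of A:

    from pyspark.sql import functions as F

    # One overall ratio of the column sums.
    df.agg((F.sum("B") / F.sum("C")).alias("b_over_c")).show()

    # Or the same ratio computed within each value of A.
    df.groupBy("A").agg((F.sum("B") / F.sum("C")).alias("b_over_c")).show()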

Spark SQL: Parse date string from dd/mm/yyyy to yyyy/mm/dd

I want to use spark SQL or pyspark to reformat a date field from 'dd/mm/yyyy' to 'yyyy/mm/dd'. The field type is string: from pyspark.sql import SparkSession fr
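
A sketch of the usual two-step conversion, assuming the string column is called date_str:

    from pyspark.sql import functions as F

    # Parse dd/MM/yyyy into a date, then render it back as yyyy/MM/dd.
    df = df.withColumn(
        "date_fmt",
        F.date_format(F.to_date(F.col("date_str"), "dd/MM/yyyy"), "yyyy/MM/dd"),
    )

    # Equivalent Spark SQL:
    # SELECT date_format(to_date(date_str, 'dd/MM/yyyy'), 'yyyy/MM/dd') FROM t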

java.lang.NoClassDefFoundError: org/apache/log4j/spi/Filter in SparkSubmit

I've been trying to submit applications to a Kubernetes cluster. I have followed the tutorial in https://spark.apache.org/docs/latest/running-on-kubernetes.html such as

Pyspark Window function on entire data frame

Consider a pyspark data frame. I would like to summarize the entire data frame, per column, and append the result for every row. +-----+----------+-----------+
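
A sketch with an unpartitioned window, which spans the whole frame (Spark will warn that all data moves to a single partition); the column name "value" is illustrative.

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # No partitionBy: the window covers every row of the DataFrame.
    w = Window.partitionBy()

    df = (df
          .withColumn("value_sum", F.sum("value").over(w))
          .withColumn("value_avg", F.avg("value").over(w)))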

Delta Table / Athena And Spark

I have my delta table, which can be read from Athena. When I try to get the data through a Spark query, I get the following error: Caused by: org.apache.sp

pyspark.sql.utils.AnalysisException: Failed to find data source: kafka

I am trying to read a stream from kafka using pyspark. I am using spark version 3.0.0-preview2 and spark-streaming-kafka-0-10_2.12. Before this I just started zoo
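
The kafka source is not bundled with pyspark, so the matching package has to be supplied at launch; the version below is an example and has to match the Spark build, and the broker/topic names are placeholders.

    # e.g. spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 app.py

    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "my_topic")
          .load())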

Spark SQL error from EMR notebook with AWS Glue table partition

I'm testing some pyspark code in an EMR notebook before I deploy it and keep running into this strange error with Spark SQL. I have all my tables and metadata i

Spatial queries with SparkSQL/Python in a Synapse Spark Pool using apache-sedona?

I would like to run spatial queries on large data sets; e.g. geopandas would be too slow. I found inspiration here: https://anant-sharma.medium.com/apache-sedon
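
A rough sketch of registering Sedona's SQL functions on an existing session; the Sedona and geotools jars must already be available on the pool, and the table/column names below are made up.

    from sedona.register import SedonaRegistrator

    # Registers the ST_* functions with the current SparkSession.
    SedonaRegistrator.registerAll(spark)

    spark.sql("""
        SELECT ST_Point(CAST(lon AS Decimal(24, 20)),
                        CAST(lat AS Decimal(24, 20))) AS geom
        FROM points_raw
    """).show(5)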

IllegalArgumentException: File must be dbfs or s3n: /

dbutils.fs.mount( source = f"wasbs://{blob.storage_account_container}@{blob.storage_account_name}.blob.core.windows.net/", mount_point = "/mnt/MLRExtract/"
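
A commonly reported cause of this error is the trailing slash on mount_point; below is a sketch of the mount call without it, with the storage key pulled from a placeholder secret scope (dbutils is only available on Databricks).

    dbutils.fs.mount(
        source=f"wasbs://{blob.storage_account_container}@{blob.storage_account_name}.blob.core.windows.net/",
        mount_point="/mnt/MLRExtract",  # no trailing slash
        extra_configs={
            f"fs.azure.account.key.{blob.storage_account_name}.blob.core.windows.net":
                dbutils.secrets.get(scope="my-scope", key="storage-account-key")
        },
    )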