Category "apache-spark-sql"

Create column using Spark pandas_udf, with dynamic number of input columns

I have this df: df = spark.createDataFrame( [('row_a', 5.0, 0.0, 11.0), ('row_b', 3394.0, 0.0, 4543.0), ('row_c', 136111.0, 0.0, 219255.0), (

Extract value from complex array of map type to string

I have a dataframe like below. No comp_value 1 [[ -> 10]] 2 [[ -> 35]] The schema type of column - value is. comp_value: array (nullable = tru

Perform sklearn DBSCAN on PySpark dataframe column

I have a Spark dataframe that looks like this: +-----+----------+--------+-----+ |key1 |date |variable|value| +-----+----------+--------+-----+ | A49|2022

How to assume an AWS role in PySpark

I am currently using spark 3.1, and I am using spark_context._jsc.hadoopConfiguration().set("fs.s3a.access.key", config.access_id) spark_context._jsc.hadoopConf

Removing whitespace in column values of SQL output

I am not able to remove whitespace from SQL query output used in PySpark code. I tried trim, ltrim, rtrim, replace (including multiple nested calls), and regex replace. Any other

How to pass dataframe to pyspark parallel operation?

I'm trying to filter the data frame by values of salary then saving them as CSV files using pyspark. spark = SparkSession.builder.appName('SparkByExamples.com')

Validate date strictly against format - more than 4 characters for year - PySpark

I am trying to validate a date received in a file against a configured date format (using to_timestamp/to_date). schema = StructType([ \ StructField("date",String

Create dataframe from JSON string having true/false values

I want to create a Spark dataframe from a JSON string without using a schema in Python. The JSON is multilevel nested and may contain arrays. I had used below for

spark-sql error column is neither present in the group by, nor is it an aggregate function can't solve with first_value, collected_list

I am stuck with a spark.sql error that I couldn't solve with answers on Stack Overflow. The point is I tried "first_value, collected_list" and they are not solving erro

Pyspark join on multiple aliased table columns

Python doesn't like the ampersand below. I get the error: & is not a supported operation for types str and str. Please review your code. Any idea how to get

Apache Spark Dataframe - Get length of each column

Question: In an Apache Spark Dataframe, using Python, how can we get the data type and length of each column? I'm using the latest version of Python. Using pandas data

spark-sql overwrite Hive table: why did duplicate records occur?

Duplicate records occurred when spark-sql overwrote a Hive table, when the Spark job had failed stages, but the dataframe has no duplicate records. When I run the jo

spark-shell commands throwing error : “error: not found: value spark”

Problem screenshot: :14: error: not found: value spark import spark.implicits._ ^ :14: error: not found: value spark import spark.sql ^ Here is my environment con

Pyspark how to join common columns values to a list value

I am trying to join column values to a list of values df1= name | department| state | id| -----+-----------+-------+---+ James|Sales |NY |101 Maria|F

Converting PySpark's consecutive withColumn to SQL

I need help in converting the below function into an SQL query: start_time :- 1649289600 end_time :- 1649375999 test_data = df.withColumn("from_timestamp",to_t

How to split csv comma separated value as single row in a new column using pyspark

I have a log file in CSV which has a column containing a list of filepaths separated by commas. I want to split those filepaths into new rows using PySpark (or exce

Extract value from array in Spark

I am trying to extract a value from an array in SparkSQL, but getting the error below: Example column customer_details {"original_customer_id":"ch_382820","fi

Computing number of business days between start/end columns

I have two Dataframes. facts: columns: data, start_date and end_date. holidays: column: holiday_date. What I want is a way to produce another Dataframe that has

Insert Spark dataframe to partitioned table

I have seen methods for inserting into a Hive table, such as insertInto(table_name, overwrite=True), but I couldn't work out how to handle the scenario below. For

Spark UDF error AttributeError: 'NoneType' object has no attribute '_jvm'

I found a similar question (link), but no answer explained how to fix the issue. I want to make a UDF that would extract words from a column for me. So, I want to cr