Category "apache-spark-sql"

Spark dataframe from dictionary

I'm trying to create a spark dataframe from a dictionary which has data in the format {'33_45677': 0, '45_3233': 25, '56_4599': 43524} .. etc. dict_pairs={'33
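A minimal sketch of one way to build such a DataFrame, assuming the dictionary keys become one column and the values another (column names are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    dict_pairs = {'33_45677': 0, '45_3233': 25, '56_4599': 43524}

    # each (key, value) pair becomes one row; column names are placeholders
    df = spark.createDataFrame(list(dict_pairs.items()), ["id_pair", "value"])
    df.show()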

Encountering an error when converting an RDD to a DataFrame in PySpark

I am trying to turn an RDD into a DataFrame. The operation seems to be successful, but when I try to count the number of elements in the DataFrame I get an error.
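A sketch of the usual conversion paths, assuming an active SparkSession named spark and an RDD of tuples; because evaluation is lazy, malformed records typically only surface when an action such as count() runs:

    from pyspark.sql import Row

    rdd = spark.sparkContext.parallelize([(1, "a"), (2, "b")])

    # either pass the RDD of tuples with an explicit column list...
    df = spark.createDataFrame(rdd, ["id", "label"])

    # ...or map to Rows and call toDF(); either way, bad records only
    # blow up when an action such as count() forces evaluation
    df2 = rdd.map(lambda t: Row(id=t[0], label=t[1])).toDF()
    print(df.count(), df2.count())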

Azure Databricks - Write to parquet file using spark.sql with union and subqueries

Issue: I'm trying to write to a parquet file using spark.sql, but I run into issues when the query has unions or subqueries. I know there's some syntax I can't seem
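A sketch of one pattern that usually sidesteps the syntax trouble: run the whole UNION/subquery as a single spark.sql() call and write the returned DataFrame. Table names and the output path below are placeholders, assuming an active SparkSession named spark:

    result = spark.sql("""
        SELECT id, amount FROM sales_2021
        UNION ALL
        SELECT id, amount
        FROM (SELECT id, amount FROM sales_2022 WHERE amount > 0) s
    """)

    result.write.mode("overwrite").parquet("/mnt/output/sales_combined")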

How to have a single CSV file after applying partitionBy in PySpark

I have to first partition by a "customer group", but I also want to make sure that I have a single CSV file per "customer_group". This is because it is timeseri
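One common approach, sketched under the assumption that each group fits comfortably in a single task: repartition by the same column before partitionBy, so every group value lands in exactly one task and therefore one output file per directory. The sort column is a placeholder:

    # assumes an existing DataFrame `df` with a customer_group column
    (df
     .repartition("customer_group")           # each group goes entirely to one task
     .sortWithinPartitions("event_time")      # keep the time series ordered inside each file
     .write
     .partitionBy("customer_group")
     .mode("overwrite")
     .csv("/mnt/output/by_customer_group", header=True))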

Spark SQL: find the number of extensions for a record

I have a dataset as below:

    col1  extension_col1
    2345  2246
    2246  2134
    2134  2091
    2091  Null
    1234  1111
    1111  Null

I need to find the number of extensions available fo
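There is no single built-in for this kind of follow-the-chain count; a hedged sketch that walks the chain with repeated self-joins, assuming chains are shorter than the fixed bound of 10 hops and an active SparkSession named spark:

    from pyspark.sql import functions as F

    df = spark.createDataFrame(
        [("2345", "2246"), ("2246", "2134"), ("2134", "2091"),
         ("2091", None), ("1234", "1111"), ("1111", None)],
        ["col1", "extension_col1"],
    )

    result = (df.withColumn("extensions", F.lit(0))
                .withColumn("current", F.col("extension_col1")))
    lookup = df.select(F.col("col1").alias("current"),
                       F.col("extension_col1").alias("next"))

    # follow one link per pass; rows whose chain has ended stop incrementing
    for _ in range(10):
        result = (result.join(lookup, "current", "left")
                  .withColumn("extensions",
                              F.when(F.col("current").isNotNull(),
                                     F.col("extensions") + 1)
                               .otherwise(F.col("extensions")))
                  .withColumn("current", F.col("next"))
                  .drop("next"))

    result.select("col1", "extensions").show()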

Group by id and create a column based on priority in Pyspark

Can someone help me with the below. I have an input dataframe.

    ID  process_type    STP_stagewise
    1   loan_creation   Manual
    1   loan creation   NSTP
    1   reimbursement   STP
    2
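A sketch of one way to pick a single status per ID, assuming a priority order of Manual over NSTP over STP (the excerpt is cut off before the actual order is stated) and an existing DataFrame df:

    from pyspark.sql import functions as F

    prio = (F.when(F.col("STP_stagewise") == "Manual", 1)
             .when(F.col("STP_stagewise") == "NSTP", 2)
             .otherwise(3))

    # keep, per ID, the row with the smallest priority rank
    result = (df.withColumn("prio", prio)
                .groupBy("ID")
                .agg(F.min(F.struct("prio", "STP_stagewise")).alias("best"))
                .select("ID", F.col("best.STP_stagewise").alias("ID_status")))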

Why compute row_number() order by monotonically_increasing_id() in Spark?

It is suggested that you can 'generate unique increasing numeric values' by select row_number() over (order by monotonically_increasing_id()) from /* ... */ Bu
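For reference, the pattern the question quotes, written as DataFrame code: monotonically_increasing_id() is unique and increasing but leaves gaps, while wrapping it in row_number() yields a consecutive 1..N sequence, at the price of sorting everything through a single window partition:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # assumes an existing DataFrame `df`
    df = df.withColumn(
        "seq",
        F.row_number().over(Window.orderBy(F.monotonically_increasing_id())),
    )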

Update a highly nested column from string to struct

    |-- x: array (nullable = true)
    |    |-- element: struct (containsNull = true)
    |    |    |-- y: long (nullable = true)
    |    |    |-- z: array (nullable = tru
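A hedged sketch (Spark 3.1+) of one way to rewrite a nested field inside an array of structs: rebuild each element, replacing a string field with a parsed struct. The field name z and the schema string are placeholders, since the real ones are cut off in the excerpt:

    from pyspark.sql import functions as F

    df = df.withColumn(
        "x",
        F.transform("x", lambda e: e.withField(
            "z", F.from_json(e["z"], "struct<a:string,b:long>"))),
    )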

Spark - Update a nested column to string

    |-- x: array (nullable = true)
    |    |-- element: struct (containsNull = true)
    |    |    |-- y: struct (nullable = true)
    |    |    |-- z: struct (nullable =
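A sketch of the converse direction (Spark 3.1+): serialise a nested struct field of each array element back into a JSON string. Field names are placeholders:

    from pyspark.sql import functions as F

    df = df.withColumn(
        "x",
        F.transform("x", lambda e: e.withField("z", F.to_json(e["z"]))),
    )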

How to EFFICIENTLY upload a PySpark dataframe as a zipped CSV or parquet file (similar to .gz format)

I have a 130 GB csv.gz file in S3 that was loaded using a parallel unload from Redshift to S3. Since it contains multiple files, I wanted to reduce the number of f
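A sketch of the two usual options, assuming an existing DataFrame df; bucket names, paths, and the partition count are placeholders:

    # fewer, gzip-compressed csv part files
    (df.repartition(32)
       .write.mode("overwrite")
       .option("compression", "gzip")
       .csv("s3://my-bucket/out_csv"))

    # parquet is usually the better target: columnar, snappy-compressed by default
    (df.repartition(32)
       .write.mode("overwrite")
       .parquet("s3://my-bucket/out_parquet"))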

An error occurred while calling o590.save. : java.lang.RuntimeException: quote cannot be more than one character

When I use pyspark to write to the csv file:

    sql_df.write\
        .format("csv")\
        .option('sep', '\t')\
        .option("compression", "gzip")\
        .option("quote"
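The quote option only accepts a single character. A sketch of two variants that avoid the error, reusing sql_df from the excerpt; the output path is a placeholder:

    # pass exactly one character (the default quote is ")
    (sql_df.write
        .format("csv")
        .option("sep", "\t")
        .option("compression", "gzip")
        .option("quote", '"')
        .save("/mnt/output/tsv"))

    # or disable quoting entirely with an empty string:
    # .option("quote", "")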

pyspark.sql.functions.lit() not nullable conversion [duplicate]

When I create a new column with F.lit(1), calling printSchema() gives column_name: integer (nullable = false), as the lit function docs is qui
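F.lit() does produce a non-nullable column. One commonly used workaround (a sketch, not the only way) is to wrap the literal in when() without an otherwise(), which Spark must treat as nullable:

    from pyspark.sql import functions as F

    # assumes an existing DataFrame `df`
    df = df.withColumn("flag_not_nullable", F.lit(1))                    # nullable = false
    df = df.withColumn("flag_nullable", F.when(F.lit(True), F.lit(1)))   # nullable = true
    df.printSchema()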

Calculate a sequence of Markov chain values

I have a Spark question: for each entity k the input is a sequence of probabilities p_i, each with an associated value v_i; for example the data can look like

ParseException: SQL CTE

    result = aml_identity_g.connectedComponents()
    conn_comps = result.select("id", "component", 'type') \
        .createOrReplaceTempView("components")
    display(result)
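Two details worth noting: createOrReplaceTempView returns None (so conn_comps above is not a DataFrame), and a WITH ... CTE is scoped to a single statement, so it has to live in the same spark.sql() call as the SELECT that uses it. A sketch under those assumptions, reusing the names from the excerpt:

    result = aml_identity_g.connectedComponents()
    result.select("id", "component", "type").createOrReplaceTempView("components")

    # the WITH clause and the SELECT that uses it must be one spark.sql() call
    conn_comps = spark.sql("""
        WITH comp_sizes AS (
            SELECT component, COUNT(*) AS members
            FROM components
            GROUP BY component
        )
        SELECT * FROM comp_sizes WHERE members > 1
    """)
    display(conn_comps)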

LEAD function with date scenario

I have multiple files, but let's consider 2 files which have filename and start date columns.

    Start_Date   FileName
    2022-01-01   product 1
    2022-02-02   product 2

pl
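A sketch of the usual LEAD pattern for this: each file's end date is the day before the next file's start date, with the newest file falling back to a far-future default. Assumes an existing DataFrame df with the columns shown above:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    w = Window.orderBy("Start_Date")
    df = df.withColumn(
        "End_Date",
        F.coalesce(F.date_sub(F.lead("Start_Date").over(w), 1),
                   F.lit("9999-12-31").cast("date")),
    )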

Glue DynamicFrame: parse a text file with a ¶ delimiter

I have a text file which looks like below:

    HDR¶20200101
    BDY¶1¶Jimmy
    BDY¶1¶Something
    TRL¶123

I would like to parse it to a Glue Dyn
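A sketch of one approach that usually works in a Glue job: read the file with Spark's csv reader, which accepts an arbitrary single-character separator such as ¶, then wrap the result in a DynamicFrame. Bucket, path, and the frame name are placeholders:

    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.dynamicframe import DynamicFrame

    sc = SparkContext.getOrCreate()
    glue_ctx = GlueContext(sc)

    # read the ¶-delimited file with the plain Spark csv reader
    df = (glue_ctx.spark_session.read
          .option("sep", "¶")
          .option("header", "false")
          .csv("s3://my-bucket/input/file.txt"))

    # wrap it back into a DynamicFrame for the rest of the Glue job
    dyf = DynamicFrame.fromDF(df, glue_ctx, "parsed_file")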

Spark Scala - Split a DataFrame column into multiple columns depending on the size of the column

I need to split a column into several columns depending on the number of fields each record has; for example, if I have the following DF: +---+--------------
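A sketch of the general idea (shown in PySpark for consistency with the rest of this list, though the question itself is Scala): split on the delimiter, find the widest row, and expand into that many columns. The source column name and the delimiter are placeholders:

    from pyspark.sql import functions as F

    parts = F.split(F.col("value"), ",")
    max_n = df.select(F.max(F.size(parts)).alias("n")).first()["n"]
    df = df.select("*", *[parts.getItem(i).alias(f"col_{i}") for i in range(max_n)])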

Spark-SQL plug in on HIVE

Hive has a metastore, and HiveServer2 listens for SQL requests; with the help of the metastore, the query is executed and the result is passed back. The Thrift frame
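If the question is about whether Spark SQL can sit on top of Hive's metastore, a minimal sketch of that wiring: Spark reads table metadata from the metastore and runs the queries with its own engine:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("spark-sql-on-hive")
             .enableHiveSupport()
             .getOrCreate())

    spark.sql("SHOW DATABASES").show()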

How to change a value in a Map datatype

I have a DataFrame with a column of type MapType<StringType, StringType>.

    |-- identity: map (nullable = true)
    |    |-- key: string
    |    |-- value: st
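A hedged sketch (Spark 3.1+) of one way to rewrite a single key's value inside the map column; the key and the new value are placeholders:

    from pyspark.sql import functions as F

    df = df.withColumn(
        "identity",
        F.transform_values(
            "identity",
            lambda k, v: F.when(k == F.lit("some_key"), F.lit("new_value")).otherwise(v),
        ),
    )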

SQL Azure Databricks

We have a 1-day table aggregated with a group by on call_date, tdlinx_id, work_request_id, category_name; in another table we have 1-week level data aggregated w