I have a unit test for Databricks code, and I want to run it locally on Windows. Unfortunately, when I run pytest from PyCharm, it throws the following exception: Exc
I have two tables: one with 50K records and the other with 2.5K records, and I want to update these 2.5K records into table one. Currently I was doing this by us
Is there a way of counting approximately after a group by on an SQL dataset in Spark? Or, more generally, what is the fastest way of group-by counting in Spark?
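A minimal sketch of both readings of the question, assuming a toy frame with key/value columns: a plain groupBy().count() is a single shuffle and is usually the fastest exact option, while approx_count_distinct covers the case where "approximately" means approximate distinct counts per group.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "value"])

# Exact per-group row count: one shuffle, usually as fast as group-by counting gets.
exact = df.groupBy("key").count()

# Approximate *distinct* counting per group, if that is what "approximately" means here.
approx = df.groupBy("key").agg(F.approx_count_distinct("value").alias("approx_distinct"))

exact.show()
approx.show()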
I am attempting to use Scala with Apache Spark locally to query a Hive table which is secured with Kerberos. I have no issues connecting and querying the data pro
I want to use Spark SQL or PySpark to reformat a date field from 'dd/mm/yyyy' to 'yyyy/mm/dd'. The field type is string: from pyspark.sql import SparkSession fr
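A minimal sketch, assuming the 'mm' in the question means months (Spark's pattern syntax uses MM for months and mm for minutes) and a hypothetical column name dt_str: parse the string with its current pattern, then render it with the target pattern.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("25/12/2021",)], ["dt_str"])

# Parse with the source pattern, then format with the target pattern.
out = df.withColumn(
    "dt_reformatted",
    F.date_format(F.to_date("dt_str", "dd/MM/yyyy"), "yyyy/MM/dd"),
)
out.show()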
Consider a PySpark DataFrame. I would like to summarize the entire data frame, per column, and append the result to every row. +-----+----------+-----------+
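A hedged sketch of one way to do this, using mean as a stand-in for whatever per-column summary is actually wanted: aggregate the whole frame into a single summary row, then cross-join it back so every row carries the summary columns.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 10.0), (2, 20.0), (3, 30.0)], ["a", "b"])

# One-row frame of per-column summaries (mean is only an example summary).
summary = df.agg(*[F.mean(c).alias(f"{c}_mean") for c in df.columns])

# Broadcast the single summary row onto every row of the original frame.
result = df.crossJoin(F.broadcast(summary))
result.show()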
I'm testing some PySpark code in an EMR notebook before I deploy it, and I keep running into this strange error with Spark SQL. I have all my tables and metadata i
Having dates in one column, how can I create a column containing the ISO week date? An ISO week date is composed of year, week number, and weekday. The year is not the same as
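A sketch that assumes Spark's weekofyear() already follows ISO-8601 numbering (it does) and recovers the ISO year as the calendar year of the Thursday that falls in the same ISO week; the column name dt is hypothetical.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2021-01-01",), ("2022-01-03",)], ["dt"]) \
          .withColumn("dt", F.to_date("dt"))

df = (
    # ISO weekday: Monday=1 .. Sunday=7 (Spark's dayofweek() is Sunday=1 .. Saturday=7).
    df.withColumn("iso_weekday", (F.dayofweek("dt") + 5) % 7 + 1)
      # weekofyear() uses ISO-8601 week numbering.
      .withColumn("iso_week", F.weekofyear("dt"))
      # ISO year = calendar year of the Thursday in the same ISO week.
      .withColumn("iso_year", F.year(F.expr("date_add(dt, 4 - iso_weekday)")))
      .withColumn(
          "iso_week_date",
          F.format_string("%d-W%02d-%d", F.col("iso_year"), F.col("iso_week"), F.col("iso_weekday")),
      )
)
df.show()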
I have a problem regarding merging CSV files using PySpark SQL with a Delta table. I managed to create an upsert function that updates if matched and inserts if not mat
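A minimal sketch of a Delta MERGE upsert with the delta-spark Python API; the paths and the merge key id are assumptions to adapt to the real tables.

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical locations: incoming CSV batch and the existing Delta target.
updates_df = spark.read.option("header", "true").csv("/tmp/incoming/*.csv")
target = DeltaTable.forPath(spark, "/tmp/delta/target_table")

(
    target.alias("t")
    .merge(updates_df.alias("s"), "t.id = s.id")   # "id" is an assumed merge key
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)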
I have a PySpark DataFrame, df, with some columns as shown below. The hour column is in UTC time and I want to create a new column that has the local time based
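A hedged sketch assuming the timezone sits in its own column (the names hour_utc and tz are assumptions): from_utc_timestamp accepts a timezone column, so each row can get its own local time.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("2021-06-01 12:00:00", "America/New_York"), ("2021-06-01 12:00:00", "Asia/Tokyo")],
    ["hour_utc", "tz"],
)

# Shift each UTC timestamp into the row's own timezone.
out = df.withColumn("hour_local", F.from_utc_timestamp(F.to_timestamp("hour_utc"), F.col("tz")))
out.show(truncate=False)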
I have a requirement where I am reading data from a CSV file and writing data to a Delta table using Scala on Windows OS. My Scala code is given below: import co
df1 = df.withColumn('etl_load_dt_part_new', concat_ws("-", year(df.ETL_LOAD_DT_PART), lit('12'), lit('31')).cast('date')) I am trying to add a new column named e
This is my dataset: from pyspark.sql import SparkSession, functions as F spark = SparkSession.builder.getOrCreate() df = spark.createDataFrame([('2021-02-07',)
In my project, I need to read an image dataset [each folder has a different object, and I want to read these folders as a stream one by one], and then I need to extrac
I am trying to create a table in Spark SQL by providing the schema and giving the location. However, when I run a select on the table, I see only half the columns. (
I have a case where I may have null values in the column that needs to be summed up in a group. If I encounter a null in a group, I want the sum of that group t
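Assuming the truncated requirement is that a group containing any null should sum to null (Spark's sum() skips nulls by default), a minimal sketch with hypothetical grp/val columns:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("g1", 1.0), ("g1", 2.0), ("g2", 3.0), ("g2", None)],
    ["grp", "val"],
)

out = df.groupBy("grp").agg(
    # count(lit(1)) counts all rows, count("val") only non-null ones; if they differ, emit null.
    F.when(F.count(F.lit(1)) == F.count("val"), F.sum("val")).alias("sum_or_null")
)
out.show()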
Hi, I'm trying to run Spark on my local laptop. I created a Maven project in IntelliJ IDEA, and in my main class I have one line like below, and when I try to run a projec
I have a large dataset like so:
| SEQ_ID|RESULT|
+-------+------+
|3462099|239.52|
|3462099|239.66|
|3462099|239.63|
|3462099|239.64|
|3462099|239.57|
|3462099|
I am running this on Databricks. My goal is to make a select statement with all the values in the column, comma separated. Content of my df: For example, I want
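A minimal sketch with a hypothetical column name col1: collect the values into an array and join them with concat_ws; the resulting string can then be spliced into a select or IN (...) clause (quote the items first if they are strings).

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("b",), ("c",)], ["col1"])

# Collapse the column into one comma-separated string.
joined = df.agg(F.concat_ws(",", F.collect_list("col1")).alias("all_values")).first()["all_values"]
print(joined)   # a,b,c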
Is there a way in PySpark to recover, for an even number of rows, the two values of a median? For example, I have this dataframe: df1 = spark.createDataFrame
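A hedged sketch with a hypothetical value column: number the rows in sorted order and keep the two central positions n/2 and n/2 + 1 when the row count n is even.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql import Window

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1.0,), (2.0,), (3.0,), (4.0,)], ["value"])

n = df1.count()
w = Window.orderBy("value")

# For an even row count, the median's two central values sit at positions n/2 and n/2 + 1.
two_middle = (
    df1.withColumn("rn", F.row_number().over(w))
       .where(F.col("rn").isin(n // 2, n // 2 + 1))
       .select("value")
)
two_middle.show()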