Category "apache-spark-sql"

Removing white space in column values of SQL output

I'm not able to remove white space from SQL query output used in PySpark code. I tried trim, ltrim, rtrim, replace (multiple nested calls too) and regexp_replace. Any other…
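When trim/ltrim/rtrim appear to do nothing, the culprit is often a non-breaking space (\u00A0) or other Unicode whitespace that the built-ins don't match. A minimal sketch with regexp_replace, assuming a string column named col1:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("  some\u00a0 value  ",)], ["col1"])

# Replace runs of any whitespace, explicitly including the non-breaking
# space \u00A0 that trim/ltrim/rtrim do not touch, then trim the ends
cleaned = df.withColumn(
    "col1", F.trim(F.regexp_replace("col1", r"[\s\u00A0]+", " "))
)
cleaned.show(truncate=False)
```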

How to pass a DataFrame to a PySpark parallel operation?

I'm trying to filter the DataFrame by salary values and then save each subset as a CSV file using PySpark. spark = SparkSession.builder.appName('SparkByExamples.com')…
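Rather than looping over salary values on the driver, partitionBy writes one output directory per salary in a single parallel pass. A sketch, assuming a hypothetical input file and column name (employees.csv, salary) since the snippet is cut off:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkByExamples.com").getOrCreate()
df = spark.read.csv("employees.csv", header=True, inferSchema=True)  # hypothetical input

# One sub-directory per distinct salary value, written in one job
(df.write
   .partitionBy("salary")
   .mode("overwrite")
   .csv("output/by_salary", header=True))
```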

Validate a date strictly against a format (more than 4 characters for the year) in PySpark

I am trying to validate dates received in a file against a configured date format (using to_timestamp/to_date). schema = StructType([StructField("date", String…
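to_date alone may still accept a five-digit year depending on the parser policy, so a round-trip check helps: the string is valid only if re-formatting the parsed date reproduces it exactly. A sketch, assuming the configured format is yyyy-MM-dd:

```python
from pyspark.sql import functions as F

fmt = "yyyy-MM-dd"  # the configured format; adjust to match the file

df = df.withColumn("parsed", F.to_date("date", fmt))

# Round-trip check: re-formatting the parsed value must reproduce the raw
# string exactly, which rejects 5-digit years, missing padding, etc.
validated = df.withColumn(
    "is_valid",
    F.col("parsed").isNotNull() & (F.date_format("parsed", fmt) == F.col("date")),
)
```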

Create a DataFrame from a JSON string containing true/false values

I want to create a Spark DataFrame from a JSON string in Python without supplying a schema. The JSON is nested over multiple levels and may contain arrays. I used the below for…
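spark.read.json can infer the full nested schema, arrays included; the only catch with booleans is that the string must contain JSON's lowercase true/false, which json.dumps produces from Python's True/False. A minimal sketch:

```python
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# json.dumps converts Python True/False to valid JSON true/false
json_str = json.dumps(
    {"name": "a", "active": True, "tags": [{"k": 1}, {"k": 2}]}
)

# Schema (booleans, nested structs, arrays) is inferred automatically
df = spark.read.json(spark.sparkContext.parallelize([json_str]))
df.printSchema()
```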

spark-sql error "column is neither present in the group by, nor is it an aggregate function"; can't solve with first_value, collected_list

I'm stuck on a spark.sql error that I couldn't solve with existing Stack Overflow answers; I tried first_value and collected_list and they don't solve the error…
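The error means some selected column is neither grouped nor aggregated; note also that the function is collect_list, not collected_list, so an unresolved-function error may be masking the real fix. Two sketches against a hypothetical emp table:

```python
# Option 1: add the offending column to the GROUP BY
spark.sql("""
    SELECT dept, name, SUM(salary) AS total
    FROM emp
    GROUP BY dept, name
""")

# Option 2: wrap it in an aggregate (first or collect_list, not collected_list)
spark.sql("""
    SELECT dept, first(name) AS any_name, collect_list(name) AS names
    FROM emp
    GROUP BY dept
""")
```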

PySpark join on multiple aliased table columns

Python doesn't like the ampersand below. I get the error: "& is not a supported operation for types str and str. Please review your code." Any idea how to get…
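That error appears when the join conditions are plain strings instead of Column expressions; & only combines Columns, and each comparison needs its own parentheses because & binds tighter than ==. A sketch with hypothetical id/dept join keys:

```python
from pyspark.sql import functions as F

a = df1.alias("a")
b = df2.alias("b")

# Each condition is a parenthesized Column expression, not an "a.id = b.id" string
joined = a.join(
    b,
    (F.col("a.id") == F.col("b.id")) & (F.col("a.dept") == F.col("b.dept")),
    "inner",
)
```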

Apache Spark DataFrame - Get the length of each column

Question: In an Apache Spark DataFrame, using Python, how can we get the data type and length of each column? I'm using the latest version of Python. Using pandas data…
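Data types live in the schema (df.dtypes), while length is a property of the values, so it needs an aggregation. A sketch that reports the maximum string length per column:

```python
from pyspark.sql import functions as F

# Schema-level types: [('name', 'string'), ('salary', 'int'), ...]
print(df.dtypes)

# Value-level lengths: max observed length of each string column
df.select(
    [F.max(F.length(c)).alias(c) for c, t in df.dtypes if t == "string"]
).show()
```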

spark-sql overwrite of a Hive table: why did duplicate records occur?

Duplicate records occurred when spark-sql overwrote a Hive table. This happens when the Spark job has failed stages, yet the DataFrame itself has no duplicate records. When I run the job…

spark-shell commands throwing error: “error: not found: value spark”

Problem screenshot: ":14: error: not found: value spark / import spark.implicits._" and ":14: error: not found: value spark / import spark.sql". Here is my environment con…

PySpark: how to collect common column values into a list value

I am trying to join column values into a list of values. df1 =

name  | department | state | id
------+------------+-------+----
James | Sales      | NY    | 101
Maria | F…
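If the goal is one row per key with the other values gathered into lists, groupBy plus collect_list does it. A sketch, assuming department is the shared key (the excerpt is cut off):

```python
from pyspark.sql import functions as F

result = df1.groupBy("department").agg(
    F.collect_list("name").alias("names"),   # e.g. ["James", ...] for Sales
    F.collect_list("state").alias("states"),
    F.collect_list("id").alias("ids"),
)
```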

Converting PySpark's consecutive withColumn calls to SQL

I need help converting the function below into an SQL query: start_time: 1649289600, end_time: 1649375999. test_data = df.withColumn("from_timestamp", to_t…
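Each withColumn maps to a derived expression in the SELECT list once the DataFrame is registered as a view. A sketch, assuming the truncated call was to_timestamp over the two epoch bounds:

```python
df.createOrReplaceTempView("events")

test_data = spark.sql("""
    SELECT *,
           to_timestamp(from_unixtime(1649289600)) AS from_timestamp,
           to_timestamp(from_unixtime(1649375999)) AS to_timestamp
    FROM events
""")
```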

How to split a comma-separated CSV value into new rows using PySpark

I have a CSV log file with a column containing a list of filepaths separated by commas. I want to split those filepaths into new rows using PySpark (or exce…
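split turns the comma-separated string into an array and explode emits one row per element. A sketch, assuming the column is named filepaths:

```python
from pyspark.sql import functions as F

# One output row per filepath; the regex also swallows spaces around commas
exploded = df.withColumn(
    "filepath", F.explode(F.split(F.col("filepaths"), r"\s*,\s*"))
)
```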

Extract value from array in Spark

I am trying to extract a value from an array in Spark SQL, but I'm getting the error below. Example column customer_details: {"original_customer_id":"ch_382820","fi…
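If customer_details is a string holding JSON like the sample, get_json_object can pull out a single field by path without defining a schema. A sketch:

```python
from pyspark.sql import functions as F

extracted = df.select(
    F.get_json_object("customer_details", "$.original_customer_id").alias("customer_id")
)
# Equivalent in plain Spark SQL:
#   SELECT get_json_object(customer_details, '$.original_customer_id') FROM t
```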

Computing number of business days between start/end columns

I have two DataFrames. facts has columns data, start_date and end_date; holidays has column holiday_date. What I want is a way to produce another DataFrame that has…
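One approach: expand each start/end pair into the days it spans with sequence, drop weekends with dayofweek, remove holidays, and count what's left. A sketch that collects the (presumably small) holidays table to the driver and assumes start_date/end_date are date-typed:

```python
from pyspark.sql import functions as F

holiday_dates = [r["holiday_date"] for r in holidays.collect()]  # small-table assumption

business_days = (
    facts
    .withColumn("day", F.explode(F.sequence("start_date", "end_date")))
    .filter(~F.dayofweek("day").isin(1, 7))      # 1 = Sunday, 7 = Saturday
    .filter(~F.col("day").isin(holiday_dates))
    .groupBy("data", "start_date", "end_date")
    .agg(F.count("*").alias("business_days"))
)
```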

Insert a Spark DataFrame into a partitioned table

I have seen methods for inserting into a Hive table, such as insertInto(table_name, overwrite=True), but I couldn't work out how to handle the scenario below. For…
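For a partitioned Hive table, insertInto with dynamic partition overwrite replaces only the partitions present in the DataFrame rather than the whole table. A sketch, assuming a hypothetical table db.partitioned_table:

```python
# Overwrite only the partitions that appear in the DataFrame
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

(df.write
   .mode("overwrite")
   .insertInto("db.partitioned_table"))  # column order must match the table,
                                         # with partition columns last
```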

Spark UDF error AttributeError: 'NoneType' object has no attribute '_jvm'

I found a similar question (link), but no answer explains how to fix the issue. I want to make a UDF that extracts words from a column. So, I want to cr…
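That error usually means pyspark.sql.functions is being called inside the UDF body (often via from pyspark.sql.functions import * shadowing builtins); executors have no JVM gateway, so _jvm is None there. Only plain Python belongs inside a UDF. A sketch with a hypothetical text_col column:

```python
import re

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

# Plain Python only inside the UDF body; calling pyspark.sql.functions here
# is what raises "'NoneType' object has no attribute '_jvm'" on executors
@F.udf(returnType=ArrayType(StringType()))
def extract_words(text):
    return re.findall(r"\w+", text) if text else []

df = df.withColumn("words", extract_words("text_col"))
```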

How to retrieve data from different AWS regions for my Glue job?

I have Glue DBs (db1 and db2) and tables (tbl1 and tbl2) available in different AWS regions (eu-west-1 and us-east-1) respectively. My Glue job in eu-west-1 needs…
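The Glue Data Catalog is regional, so the job in eu-west-1 can't see db2 through its catalog integration; one workaround is to look up the remote table's S3 location with boto3 and read the files directly, since S3 paths are globally addressable. A sketch, assuming the table's data is Parquet:

```python
import boto3

# Query the remote region's catalog for the table's storage location
glue = boto3.client("glue", region_name="us-east-1")
tbl = glue.get_table(DatabaseName="db2", Name="tbl2")
s3_path = tbl["Table"]["StorageDescriptor"]["Location"]

# S3 is globally addressable, so the eu-west-1 job can read it directly
df_remote = spark.read.parquet(s3_path)  # assumes Parquet-backed data
```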

How to create an encoder for Map[(String,String),List[Row]] to build a Dataset[Map[(String,String),List[Row]]] in Spark?

This is my piece of code. There is a lot of business logic happening here. I have tried to explain it as understandably as possible. I have…

Update the data in an existing table based on different levels in Spark SQL

I have an existing table tb1 in my database. Now new data arrives and is stored in another table tb2. Earlier, Account_Number 9988 was Level 2, but now…
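Without a MERGE-capable format, a left join from the existing table to the new one with coalesce keeps the old value wherever no update arrived. A sketch over assumed Account_Number/Level columns (the excerpt only shows those two names):

```python
from pyspark.sql import functions as F

# Take tb2's Level where the Account_Number matched, otherwise keep tb1's
updated = (
    tb1.alias("old")
       .join(tb2.alias("new"), "Account_Number", "left")
       .select(
           "Account_Number",
           F.coalesce(F.col("new.Level"), F.col("old.Level")).alias("Level"),
       )
)
```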

How to process JSON data in a column using Python/PySpark?

I'm trying to process JSON data in a column on Databricks. Below is sample data from a table (it's weather device record info): JSON_Info {"sampleData":"dataD…
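from_json parses the column once it has a schema, and schema_of_json can infer that schema from one representative value. A sketch, assuming all rows share the same JSON shape:

```python
from pyspark.sql import functions as F

# Infer the schema from one sample value, then parse every row with it
sample = df.select("JSON_Info").first()[0]
parsed = df.withColumn(
    "data", F.from_json("JSON_Info", F.schema_of_json(F.lit(sample)))
)
parsed.select("data.*").show(truncate=False)
```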