Category "pyspark"

Removing white space in column values of SQL o/p

Not able to remove white space from SQL query output used in pyspark code. I tried, trim,ltrim,rtrim,replace (multiple nested also) and regex replace. Any other

How do I create an array from a grouping of row_number()?

I have code that uses row_number() partitioned by date. I would like to create an array that contains data grouped by the row_number that is partitioned by date

How to pass dataframe to pyspark parallel operation?

I'm trying to filter the data frame by values of salary then saving them as CSV files using pyspark. spark = SparkSession.builder.appName('SparkByExamples.com')

Validate Date strict to format - more than 4 character for year - pySpark

I am trying to validate date received in file against configured date format(using to_timestamp /to_date). schema = StructType([ \ StructField("date",String

Create dataframe from json string having true false value

Wanted to create a spark dataframe from json string without using schema in Python. The json is mutlilevel nested which may contain array. I had used below for

Pyspark join on multiple aliased table columns

Python doesn't like the ampersand below. I get the error:& is not a supported operation for types str and str. Please review your code. Any idea how to get

Apache Spark Dataframe - Get length of each column

Question: In Apache Spark Dataframe, using Python, how can we get the data type and length of each column? I'm using latest version of python. Using pandas data

why does changing global variable not work in pyspark?

i have two python scripts. the main script is like from testa import modify, see from pyspark import SparkContext if __name__ == '__main__': sc = SparkConte

How to do if else condition on Pyspark columns with regex?

I have a pyspark dataframe event_name 0 a-markets-l1 1 a-markets-watch 2 a-markets-buy 3 a-markets-z2 4 scroll_down This dataframe has event_name column EXCL

How can we truncate and load the documents to a cosmos dB collection with out dropping it in pyspark

I have a monthly job in databricks where I want to truncate all records for previous month and then load for current month in cosmos db so I tried with option("

Exception: Java gateway process exited before sending its port number

I'm facing an issue when trying to use pyspark=3.1.2. I have java 1.8 installed and added in my user path. But according to the docs it does not need any other

spark-shell commands throwing error : “error: not found: value spark”

problem screenshot :14: error: not found: value spark import spark.implicits._ ^ :14: error: not found: value spark import spark.sql ^ here is my enviroment con

Reading from cassandra in Spark does not return all the data when using JoinWithCassandraTable

I'm querying data from Cassandra in Spark using SCC 2.5 and Spark 2.4.7 (pyspark). The table I'm reading from has a composite partition key (partition_key_1, pa

Caused by: java.net.SocketTimeoutException: Accept timed out

I am getting this error while running pyspark package in pycharm using python 3.9 using this below code. from pyspark.sql import SparkSession from pyspark.sql.t

Pyspark how to join common columns values to a list value

i am trying to join columns values to a list of values df1= name | department| state | id| -----+-----------+-------+---+ James|Sales |NY |101 Maria|F

Converting PySpark's consecutive withColumn to SQL

I need help in converting the below function into an SQL query: start_time :- 1649289600end_time :- 1649375999 test_data = df.withColumn("from_timestamp",to_t

Does this function computeSVD use MapReduce in Pyspark

Does computeSVD() use map , reduce since it is a predefined function? i couldn't know the code of the function. from pyspark.mllib.linalg import Vectors from py

Time Serie with delta time travel in databricks

I'm storing in a delta table the prices of products. The schema of the table is like this: id | price | updated 1 | 3 | 2022-03-21 2 | 4 | 2022-03-20

Cast Issue with AWS Glue 3.0 - Pyspark

I'm using Glue 3.0 data = [("Java", "6241499.16943521594684385382059800664452")] rdd = spark.sparkContext.parallelize(data) df = rdd.toDF() df.show() df.select(

How to split csv comma separated value as single row in a new column using pyspark

I have a log file in csv which has a column contains a list of filepaths separated by comma. I want to split those filepaths into new rows using pyspark(or exce