Category "pyspark"

How to use ODBC connection for pyspark.pandas

In the following Python code I can successfully connect to an MS Azure SQL DB using an ODBC connection, and can load data into an Azure SQL table using pandas' datafra…
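Spark itself has no ODBC writer, so one common workaround is to convert the pandas-on-Spark frame to a Spark DataFrame and write through the JDBC driver instead. A minimal sketch (server, table, and credentials below are placeholders):

    import pyspark.pandas as ps

    psdf = ps.DataFrame({"id": [1, 2], "name": ["a", "b"]})

    (psdf.to_spark()                      # pandas-on-Spark -> Spark DataFrame
         .write
         .format("jdbc")
         .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb")
         .option("dbtable", "dbo.my_table")
         .option("user", "my_user")
         .option("password", "my_password")
         .mode("append")
         .save())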

Create column using Spark pandas_udf, with dynamic number of input columns

I have this df: df = spark.createDataFrame( [('row_a', 5.0, 0.0, 11.0), ('row_b', 3394.0, 0.0, 4543.0), ('row_c', 136111.0, 0.0, 219255.0), (…
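One way to handle a variable number of input columns is to pack them into a struct, which a pandas_udf receives as a pandas DataFrame (Spark 3+). A sketch, assuming the first column is an id and the goal is a row-wise aggregate:

    import pandas as pd
    from pyspark.sql import functions as F
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("double")
    def row_sum(cols: pd.DataFrame) -> pd.Series:
        # cols has one pandas column per struct field
        return cols.sum(axis=1)

    num_cols = [c for c in df.columns if c != "id"]   # dynamic column list
    df = df.withColumn("total", row_sum(F.struct(*num_cols)))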

How do we optimise an incremental merge involving a very large target table (10 TB) and a smaller incremental source table in a data lake environment?

I came across this question recently in an interview and haven't been able to find a satisfying answer. The incremental merge could co…
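A sketch of the usual levers, assuming a Delta Lake target: partition or cluster the target on the merge keys, and put the partition column in the merge condition so only affected files of the 10 TB table are rewritten (path and names are placeholders):

    from delta.tables import DeltaTable

    target = DeltaTable.forPath(spark, "/lake/big_table")
    (target.alias("t")
           .merge(updates.alias("s"),                             # `updates` = incremental source
                  "t.id = s.id AND t.event_date = s.event_date")  # partition column prunes files
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())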

Extract value from complex array of map type to string

I have a dataframe like the one below, with columns No and comp_value: 1, [[ -> 10]]; 2, [[ -> 35]]. The schema of the comp_value column is: comp_value: array (nullable = tru…
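If each array holds a single map, one way to reach the value is element_at plus map_values. A sketch using the column name from the question:

    from pyspark.sql import functions as F

    df = df.withColumn(
        "comp_str",
        F.map_values(F.element_at("comp_value", 1))[0].cast("string"))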

Perform sklearn DBSCAN on PySpark dataframe column

I have a Spark dataframe that looks like this: +-----+----------+--------+-----+ |key1 |date |variable|value| +-----+----------+--------+-----+ | A49|2022…
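scikit-learn does not run distributed, but applyInPandas can run DBSCAN independently per group. A sketch assuming clustering on `value` within each `key1` group (eps and min_samples are placeholders):

    import pandas as pd
    from sklearn.cluster import DBSCAN

    def cluster(pdf: pd.DataFrame) -> pd.DataFrame:
        pdf["cluster"] = DBSCAN(eps=0.5, min_samples=5).fit_predict(pdf[["value"]])
        return pdf

    schema = "key1 string, date string, variable string, value double, cluster long"
    result = df.groupBy("key1").applyInPandas(cluster, schema=schema)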

How to assume an AWS role in pyspark

I am currently using Spark 3.1, with spark_context._jsc.hadoopConfiguration().set("fs.s3a.access.key", config.access_id) spark_context._jsc.hadoopConf…
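With Hadoop 3.x, the s3a connector can assume a role itself via AssumedRoleCredentialProvider; a sketch (the role ARN is a placeholder):

    hconf = spark_context._jsc.hadoopConfiguration()
    hconf.set("fs.s3a.aws.credentials.provider",
              "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider")
    hconf.set("fs.s3a.assumed.role.arn", "arn:aws:iam::123456789012:role/my-role")
    # base credentials used to call STS AssumeRole
    hconf.set("fs.s3a.assumed.role.credentials.provider",
              "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")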

How do I create a single SparkSession in one file and reuse it in another file

I have two .py files, com/demo/DemoMain.py and com/demo/Sample.py. In both files I am recreating the SparkSession object. In PySpark, how do I create a S…
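SparkSession.builder.getOrCreate() returns the already-running session if one exists, so a small shared helper module is enough. A sketch (the module name is hypothetical):

    # com/demo/spark_utils.py
    from pyspark.sql import SparkSession

    def get_spark():
        return SparkSession.builder.appName("demo").getOrCreate()

    # In DemoMain.py and Sample.py alike:
    # from com.demo.spark_utils import get_spark
    # spark = get_spark()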

Removing white space in column values of SQL output

I am not able to remove white space from SQL query output used in PySpark code. I tried trim, ltrim, rtrim, replace (also nested multiple times) and regexp_replace. Any other…
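When trim and friends fail, the "space" is often a non-breaking space (\u00A0) or a tab, which plain trim does not touch. A sketch (the column name is a placeholder):

    from pyspark.sql import functions as F

    df = df.withColumn(
        "col_clean",
        F.regexp_replace("col", r"[\s\u00A0]+", ""))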

How do I create an array from a grouping of row_number()?

I have code that uses row_number() partitioned by date. I would like to create an array that contains data grouped by the row_number that is partitioned by date
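A sketch of one way to do it: number the rows per date, then collect them into an array, sorting structs by the row number since collect_list alone does not guarantee order (the ts and value column names are assumptions):

    from pyspark.sql import functions as F, Window

    w = Window.partitionBy("date").orderBy("ts")
    df = df.withColumn("rn", F.row_number().over(w))

    arrays = (df.groupBy("date")
                .agg(F.sort_array(F.collect_list(F.struct("rn", "value")))
                      .getField("value")
                      .alias("values")))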

How to pass dataframe to pyspark parallel operation?

I'm trying to filter the dataframe by salary values and then save them as CSV files using PySpark: spark = SparkSession.builder.appName('SparkByExamples.com')…
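Rather than looping over salary values on the driver, partitionBy on write lets Spark split the output in one parallel job. A sketch (paths are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("SparkByExamples.com").getOrCreate()
    df = spark.read.csv("/data/employees.csv", header=True, inferSchema=True)

    (df.write
       .partitionBy("salary")      # one output folder per salary value
       .mode("overwrite")
       .csv("/out/by_salary"))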

Validate date strictly against a format - more than 4 characters for the year - PySpark

I am trying to validate dates received in a file against a configured date format (using to_timestamp/to_date). schema = StructType([ StructField("date", String…
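to_date alone may accept years longer than four digits, so pairing it with a regex check is one way to be strict. A sketch assuming the yyyy-MM-dd format:

    from pyspark.sql import functions as F

    df = df.withColumn(
        "valid",
        F.col("date").rlike(r"^\d{4}-\d{2}-\d{2}$") &
        F.to_date("date", "yyyy-MM-dd").isNotNull())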

Create dataframe from JSON string having true/false values

I wanted to create a Spark dataframe from a JSON string without using a schema in Python. The JSON is nested on multiple levels and may contain arrays. I had used the below for…
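spark.read.json can infer a nested schema, booleans included, straight from a parallelized JSON string. A minimal sketch:

    json_str = '{"a": true, "b": {"c": [1, 2]}}'
    df = spark.read.json(spark.sparkContext.parallelize([json_str]))
    df.printSchema()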

Pyspark join on multiple aliased table columns

Python doesn't like the ampersand below. I get the error: "& is not a supported operation for types str and str. Please review your code." Any idea how to get…
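That error usually means the join conditions are plain strings; they need to be Column expressions, with each comparison parenthesized because & binds tightly. A sketch with placeholder names:

    from pyspark.sql import functions as F

    joined = df_a.alias("a").join(
        df_b.alias("b"),
        (F.col("a.id") == F.col("b.id")) & (F.col("a.dt") == F.col("b.dt")),
        "inner")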

Apache Spark Dataframe - Get length of each column

Question: in an Apache Spark dataframe, using Python, how can we get the data type and length of each column? I'm using the latest version of Python. Using pandas data…
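df.dtypes covers the data types; the length of the values needs an expression such as max(length(...)). A sketch:

    from pyspark.sql import functions as F

    print(df.dtypes)   # [('name', 'string'), ('age', 'bigint'), ...]

    # longest value per column, cast to string first
    df.select([F.max(F.length(F.col(c).cast("string"))).alias(c)
               for c in df.columns]).show()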

Why does changing a global variable not work in pyspark?

I have two Python scripts. The main script is like: from testa import modify, see; from pyspark import SparkContext; if __name__ == '__main__': sc = SparkConte…
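Globals mutated inside a task only change that executor's serialized copy; an accumulator is the supported way to send a value back to the driver. A minimal sketch:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    counter = sc.accumulator(0)

    # each task's updates are merged back on the driver
    sc.parallelize(range(10)).foreach(lambda _: counter.add(1))
    print(counter.value)   # 10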

How to do an if/else condition on PySpark columns with regex?

I have a PySpark dataframe with an event_name column containing values like a-markets-l1, a-markets-watch, a-markets-buy, a-markets-z2, scroll_down. This dataframe has an event_name column EXCL…
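when/otherwise combined with rlike handles regex-based branching. A sketch keyed to the a-markets-* names in the question:

    from pyspark.sql import functions as F

    df = df.withColumn(
        "event_group",
        F.when(F.col("event_name").rlike(r"^a-markets-"), "markets")
         .otherwise("other"))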

How can we truncate and load documents into a Cosmos DB collection without dropping it in pyspark

I have a monthly job in Databricks where I want to truncate all records for the previous month and then load the current month into Cosmos DB, so I tried with option("…
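The Cosmos DB Spark 3 OLTP connector has no truncate, but it can delete items by writing them back with the ItemDelete strategy before appending the new month. A sketch (all config values and the month filter are placeholders):

    cfg = {
        "spark.cosmos.accountEndpoint": "https://myaccount.documents.azure.com:443/",
        "spark.cosmos.accountKey": "<key>",
        "spark.cosmos.database": "mydb",
        "spark.cosmos.container": "mycoll",
    }

    old = (spark.read.format("cosmos.oltp").options(**cfg).load()
                .filter("month = '2024-01'"))
    (old.write.format("cosmos.oltp").options(**cfg)
        .option("spark.cosmos.write.strategy", "ItemDelete")
        .mode("append").save())

    (new_df.write.format("cosmos.oltp").options(**cfg)
           .mode("append").save())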

Exception: Java gateway process exited before sending its port number

I'm facing an issue when trying to use pyspark==3.1.2. I have Java 1.8 installed and added to my user PATH. But according to the docs it does not need any other…
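This error almost always means PySpark could not launch the JVM. A quick check is to confirm JAVA_HOME points at the JDK before building the session (the path below is a placeholder):

    import os
    os.environ.setdefault("JAVA_HOME", r"C:\Program Files\Java\jdk1.8.0_301")

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.master("local[*]").getOrCreate()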

spark-shell commands throwing error: “error: not found: value spark”

Problem (from the console screenshot): ":14: error: not found: value spark" for both "import spark.implicits._" and "import spark.sql". Here is my environment con…

Reading from cassandra in Spark does not return all the data when using JoinWithCassandraTable

I'm querying data from Cassandra in Spark using SCC 2.5 and Spark 2.4.7 (pyspark). The table I'm reading from has a composite partition key (partition_key_1, pa…
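From pyspark, SCC 2.5 can push the join down to Cassandra through the DataFrame API, provided CassandraSparkExtensions is enabled and every partition-key column appears in the join. A sketch (table, keyspace, and key names are placeholders):

    cass = (spark.read.format("org.apache.spark.sql.cassandra")
                 .options(table="my_table", keyspace="my_ks")
                 .load())

    # the pushed-down join fetches only the partitions named by the
    # FULL composite key, so every key column must be in the condition
    joined = keys_df.join(cass, on=["partition_key_1", "partition_key_2"])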