Category "pyspark"

How to have a single csv file after applying partitionBy in Pyspark

I have to first partition by a "customer group", but I also want to make sure that I have a single csv file per "customer_group". This is because it is timeseri
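One common way to get a single file per value, sketched below under the assumption that the column is literally named customer_group and that the input/output paths are placeholders: repartition on the same column before partitionBy, so each group is written by exactly one task.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder input path; assumes a "customer_group" column exists.
df = spark.read.csv("s3://bucket/input/", header=True)

# Repartitioning by the same column used in partitionBy puts all rows of a
# given customer_group into one task, so each output directory holds a
# single CSV part file.
(df.repartition("customer_group")
   .write
   .partitionBy("customer_group")
   .mode("overwrite")
   .csv("s3://bucket/output/", header=True))
```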

Python argparse unexpected behavior when passing "``" to the argument string in pyspark cluster mode

I am trying to pass a string in my pyspark code and it works fine, but when I pass the following string to escape reserved keyword `date` or any value passed in

How to effectively run tasks in parallel in pyspark

I am working on writing a framework that basically does a data sanity check. I have a set of inputs like { "check_1": [ sql_query_1, sql_query_2 ], "check_2":
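A minimal sketch of one way to overlap such checks, assuming hypothetical table names and a driver-side thread pool (Spark's scheduler accepts jobs submitted from multiple driver threads, so the queries can run concurrently on the cluster):

```python
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical check definitions; each check is a list of SQL queries.
checks = {
    "check_1": ["SELECT COUNT(*) AS c FROM table_a",
                "SELECT COUNT(*) AS c FROM table_b"],
    "check_2": ["SELECT COUNT(*) AS c FROM table_c"],
}

def run_check(name, queries):
    # The driver thread only submits work; executors do the heavy lifting.
    return name, [spark.sql(q).collect() for q in queries]

with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(lambda kv: run_check(*kv), checks.items()))
```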

Java gateway process exited before sending its port number

I am trying to install PySpark on my Windows 10 to be used on Jupyter Lab. I have already installed Java and running Python 3.7.3: openjdk version "1.8.0_242" O
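This error usually means PySpark could not launch the JVM gateway. A common fix on Windows, sketched with placeholder paths, is to point JAVA_HOME at the JDK before the session is created:

```python
import os

# Placeholder paths: set these to wherever the JDK was actually installed,
# before importing/starting PySpark so the gateway can find Java.
os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk1.8.0_242"
os.environ.setdefault("PYSPARK_PYTHON", "python")

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
print(spark.version)
```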

Is there a way to configure the memory resources for Spark using Pyspark

I'm working on an ETL job with a SageMaker notebook that uses spark 2.4.0. After joining a couple of tables I keep getting the following errors: Update-- I was
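A minimal sketch of setting memory through the session builder. The values are illustrative, not recommendations, and they only take effect if supplied before the session (and its JVM) is created:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("etl-job")
    .config("spark.driver.memory", "8g")     # illustrative values only
    .config("spark.executor.memory", "8g")
    .config("spark.executor.cores", "4")
    .getOrCreate()
)
```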

Why is pyspark taking so long to create a SparkSession on jupyter?

Well, I'm learning PySpark. I installed ipykernel, jupyterlab, notebook and pyspark via pip, and Java 8 via .exe. The problem is when I need to create the sessi

Group by id and create a column based on priority in Pyspark

Can someone help me with the below. I have an input dataframe. ID process_type STP_stagewise 1 loan_creation Manual 1 loan creation NSTP 1 reimbursement STP 2
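A sketch of one common pattern for this kind of priority pick, assuming a hypothetical ranking of the STP_stagewise values and the sample rows reconstructed from the truncated excerpt:

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

# Sample rows reconstructed from the excerpt; the full input is truncated.
df = spark.createDataFrame(
    [(1, "loan_creation", "Manual"),
     (1, "loan_creation", "NSTP"),
     (1, "reimbursement", "STP")],
    ["ID", "process_type", "STP_stagewise"],
)

# Hypothetical priority order; the real ranking is whatever the business rule says.
priority = (F.when(F.col("STP_stagewise") == "Manual", 1)
             .when(F.col("STP_stagewise") == "NSTP", 2)
             .otherwise(3))

w = Window.partitionBy("ID").orderBy(priority)

# Keep one row per ID: the one with the highest-priority stage.
result = (df.withColumn("rn", F.row_number().over(w))
            .filter("rn = 1")
            .drop("rn"))
result.show()
```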

Why does spark need S3 to connect to a Redshift warehouse, while python pandas can read a Redshift table directly?

Sorry in advance for this dumb question. I am just beginning with AWS and Pyspark. I was reviewing the pyspark library and I see pyspark needs a tempdir in S3 to be a

Programmatic way to find the cluster version from CDSW - Cloudera Data Science Workbench

Is there any programmatic way to find out the cluster version (CDH6 or CDP7) from a CDSW session? Could any environment variable give a fool-proof way to determi

Update a highly nested column from string to struct

|-- x: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- y: long (nullable = true) | | |-- z: array (nullable = tru

Spark - Update a nested column to string

|-- x: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- y: struct (nullable = true) | | |-- z: struct (nullable =

Is it possible to connect to serverless sql pool via azure databricks?

I'm trying to connect to synapse serverless pool via databricks. I need to create synapse views and external tables directly in databricks as part of an existin
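A sketch of reading from the serverless SQL endpoint over plain JDBC from a Databricks notebook, with placeholder host, database and credentials (the Databricks-provided `spark` session is assumed):

```python
# Serverless SQL pool exposes a standard SQL Server endpoint, so a plain
# JDBC read works from Databricks. Host, database, query and credentials
# below are placeholders.
df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:sqlserver://<workspace>-ondemand.sql.azuresynapse.net:1433;database=<db>")
      .option("query", "SELECT TOP 10 * FROM sys.objects")
      .option("user", "<user>")
      .option("password", "<password>")
      .load())
df.show()
```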

Add comments to delta

If a pyspark dataframe is reading some data from a table and writing it to Azure delta lake, can we add comments to this newly written file? E.g. Df = sql("se
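A sketch of one way to attach comments after the write, assuming the data is saved as a Delta table registered in the metastore; table name, column name and comment text are placeholders:

```python
# Assumes `df` is the dataframe from the question and that the target is a
# managed Delta table rather than a bare path.
df.write.format("delta").mode("overwrite").saveAsTable("my_db.my_table")

# Table-level comment via table properties (placeholder text).
spark.sql(
    "ALTER TABLE my_db.my_table "
    "SET TBLPROPERTIES ('comment' = 'loaded by the nightly ETL job')"
)

# Column-level comment (placeholder column name and text).
spark.sql(
    "ALTER TABLE my_db.my_table "
    "ALTER COLUMN some_col COMMENT 'business meaning of the column'"
)
```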

Pyspark - explode returns an empty dataframe when a nested collection has no items

I have the following dataframe +---------------+--------+ |book_id |Chapters| +---------------+--------+ |865731 |[] | +---------------+----
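A minimal sketch of the usual fix, with the element type of Chapters assumed to be string since the excerpt only shows an empty array: explode() drops rows whose array is empty or null, while explode_outer() keeps them.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Row reconstructed from the excerpt: a book with an empty Chapters array
# (element type assumed to be string).
df = spark.createDataFrame([(865731, [])], "book_id long, Chapters array<string>")

# explode_outer keeps the row and puts null in the new column.
df.withColumn("Chapter", F.explode_outer("Chapters")).show()
```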

Reuse Spark Session Across Modules/Packages

We are building a reusable data framework using PySpark. As part of this, we had built one big utilities package that hosted all the methods. But now, we are pl
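A minimal sketch of the common pattern: a small shared module (hypothetical name spark_session.py) whose helper every package imports, relying on getOrCreate() to hand back the session already running in the application.

```python
# spark_session.py -- hypothetical shared module
from pyspark.sql import SparkSession

def get_spark() -> SparkSession:
    # getOrCreate() returns the existing session if one is already active,
    # so every module that calls this helper shares the same session
    # instead of creating its own.
    return SparkSession.builder.appName("data-framework").getOrCreate()
```

Other modules then do `from spark_session import get_spark` and call `get_spark()` wherever a session is needed.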

How to EFFICIENTLY upload a pyspark dataframe as a zipped csv or parquet file (similar to .gz format)

I have a 130 GB csv.gz file in S3 that was loaded using a parallel unload from redshift to S3. Since it contains multiple files I wanted to reduce the number of f
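A sketch assuming placeholder S3 paths and an illustrative partition count: Parquet is compressed by default (snappy), and for CSV the codec can be requested explicitly, so a single repartition-and-write produces fewer, compressed files.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark reads .csv.gz transparently; paths and partition count are placeholders.
df = spark.read.csv("s3://bucket/unload/", header=True)

# Parquet output: columnar, snappy-compressed by default.
df.repartition(32).write.mode("overwrite").parquet("s3://bucket/compacted/parquet/")

# CSV output with an explicit gzip codec, if CSV is required.
(df.repartition(32).write
   .option("compression", "gzip")
   .mode("overwrite")
   .csv("s3://bucket/compacted/csv_gz/", header=True))
```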

An error occurred while calling o590.save. : java.lang.RuntimeException: quote cannot be more than one character

When I use pyspark to write to the csv file: sql_df.write\ .format("csv")\ .option('sep', '\t')\ .option("compression", "gzip")\ .option("quote"
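The error comes from the quote option: the CSV writer accepts a single character (or an empty string to disable quoting), and anything longer raises this RuntimeException. A sketch assuming sql_df from the question and a placeholder output path:

```python
# Assumes sql_df is the dataframe from the question.
(sql_df.write
    .format("csv")
    .option("sep", "\t")
    .option("compression", "gzip")
    .option("quote", '"')   # exactly one character; use "" to disable quoting
    .mode("overwrite")
    .save("s3://bucket/output/"))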

java.lang.IllegalArgumentException: Illegal Capacity: -102 when reading a large parquet file by pyspark

I have a large parquet file (~5GB) and I want to load it in spark. The following command executes without any error: df = spark.read.parquet("path/to/file.parqu

pyspark.sql.functions.lit() not nullable conversion [duplicate]

As I create a new column with F.lit(1), while calling printSchema() I get column_name: integer (nullable = false) as lit function docs is qui

Calculate a sequence of Markov chain values

I have a Spark question: for each entity k the input is a sequence of probabilities p_i with an associated value v_i; for example the data can look like