Can someone help me with the below? I have an input dataframe:

    +---+---------------+--------------+
    | ID| process_type  | STP_stagewise|
    +---+---------------+--------------+
    |  1| loan_creation | Manual       |
    |  1| loan creation | NSTP         |
    |  1| reimbursement | STP          |
    |  2| …             | …            |
Sorry in advance for this dumb question. I am just beginning with AWS and PySpark. I was reviewing the pyspark library and I see PySpark needs a tempdir in S3 to be a…
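In case this refers to the Spark–Redshift connector (an assumption; the question is cut off), that connector does stage data through an S3 tempdir. A minimal sketch, with cluster, bucket, and credential values as placeholders:

```python
# Hedged sketch: reading from Redshift with the community spark-redshift connector,
# which stages data through an S3 "tempdir". All names below are placeholders.
df = (spark.read
      .format("io.github.spark_redshift_community.spark.redshift")
      .option("url", "jdbc:redshift://example-cluster:5439/dev?user=...&password=...")
      .option("dbtable", "public.my_table")
      .option("tempdir", "s3a://my-bucket/spark-tmp/")   # the S3 staging area Spark needs
      .option("forward_spark_s3_credentials", "true")
      .load())
```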
Is there any programmatic way to find out the cluster version (CDH6 or CDP7) from a CDSW session? Could any environment variable give a fool-proof way to determine this?
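One hedged approach, assuming the `hadoop` CLI is on the PATH inside the CDSW session: the Hadoop build string embeds the vendor suffix (e.g. a 3.0.0-cdh6.x build on CDH6 versus a 3.1.1.7.x build on CDP7).

```python
import re
import subprocess

# Hedged sketch: infer CDH6 vs CDP7 from the Hadoop build string.
# Assumes the `hadoop` CLI is available inside the CDSW session.
first_line = subprocess.check_output(["hadoop", "version"], text=True).splitlines()[0]

if re.search(r"cdh6", first_line):
    print("CDH6 cluster:", first_line)         # e.g. "Hadoop 3.0.0-cdh6.3.2"
elif re.search(r"\b3\.1\.1\.7\.", first_line):
    print("Likely CDP7 cluster:", first_line)  # CDP builds carry a 7.x suffix
else:
    print("Unrecognised distribution:", first_line)
```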
 |-- x: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- y: long (nullable = true)
 |    |    |-- z: array (nullable = true)
 |-- x: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- y: struct (nullable = true)
 |    |    |-- z: struct (nullable = …)
I'm trying to connect to a Synapse serverless SQL pool via Databricks. I need to create Synapse views and external tables directly from Databricks as part of an existing…
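A hedged sketch of one common route: run the DDL over ODBC with pyodbc, since Spark's JDBC reader is not meant for executing DDL. This assumes pyodbc and the Microsoft ODBC driver are installed on the cluster, and every endpoint/credential value below is a placeholder:

```python
import pyodbc

# Hypothetical connection to a Synapse serverless SQL pool endpoint.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myworkspace-ondemand.sql.azuresynapse.net;"
    "DATABASE=mydb;UID=myuser;PWD=mypassword",
    autocommit=True,   # let each DDL statement commit on its own
)
conn.cursor().execute("""
    CREATE VIEW dbo.my_view AS
    SELECT *
    FROM OPENROWSET(
        BULK 'https://mystorage.dfs.core.windows.net/container/path/*.parquet',
        FORMAT = 'PARQUET'
    ) AS rows
""")
conn.close()
```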
If a PySpark dataframe is reading some data from a table and writing it to Azure Delta Lake, can we add comments to this newly written file? For e.g. Df = sql("se…
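If the goal is to attach a free-text note to the write itself (one interpretation; the question is cut off), Delta Lake's userMetadata write option stores a string in the commit history. The path and message below are placeholders:

```python
# Hedged sketch: attach a free-text comment to a Delta write via commit metadata.
(df.write.format("delta")
   .mode("append")
   .option("userMetadata", "loaded from source table")   # free-text note
   .save("/mnt/delta/my_table"))

# The note appears in the userMetadata column of the table history:
spark.sql("DESCRIBE HISTORY delta.`/mnt/delta/my_table`") \
     .select("version", "userMetadata").show(truncate=False)
```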
I have the following dataframe:

    +---------------+--------+
    |book_id        |Chapters|
    +---------------+--------+
    |865731         |[]      |
    +---------------+--------+
    …
We are building a reusable data framework using PySpark. As part of this, we had built one big utilities package that hosted all the methods. But now, we are planning…
I have a 130 GB csv.gz file in S3 that was loaded using a parallel UNLOAD from Redshift to S3. Since it contains multiple files, I wanted to reduce the number of files…
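A minimal sketch of the usual compaction pattern (paths and the target file count are assumptions): read every gzipped part under the prefix, then repartition down before writing, since the number of partitions controls the number of output files. Note that gzip is not splittable, so each input part is read by a single task.

```python
# Hedged sketch: compact many gzipped CSV parts into fewer output files.
# Bucket names, prefixes, and the partition count are placeholders.
df = (spark.read
      .option("header", "true")
      .csv("s3a://my-bucket/unload-prefix/"))   # picks up every part file

(df.repartition(16)                             # -> roughly 16 output files
   .write
   .option("header", "true")
   .option("compression", "gzip")
   .csv("s3a://my-bucket/compacted/"))
```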
When I use pyspark to write to the csv file:

    sql_df.write \
        .format("csv") \
        .option('sep', '\t') \
        .option("compression", "gzip") \
        .option("quote"…
I have a large parquet file (~5 GB) and I want to load it in Spark. The following command executes without any error: df = spark.read.parquet("path/to/file.parquet")…
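Worth noting, since the question is cut off: spark.read.parquet is lazy, so the line above only reads schema/footer metadata; the actual scan happens when an action runs.

```python
# Hedged illustration of the lazy read: the load itself is cheap,
# the first action is where the ~5 GB scan actually happens.
df = spark.read.parquet("path/to/file.parquet")  # fast: metadata only
df.count()                                       # slow: triggers the full read
```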
As I create a new column with F.lit(1), while calling printSchema() I get column_name: integer (nullable = false), as the lit function docs are quite…
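A quick reproduction of the behaviour described (the column name is illustrative):

```python
from pyspark.sql import functions as F

df = spark.range(3).withColumn("column_name", F.lit(1))
df.printSchema()
# root
#  |-- id: long (nullable = false)
#  |-- column_name: integer (nullable = false)   <- literals are non-null by construction
```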
I have a Spark question: for the input, for each entity k I have a sequence of probabilities p_i with an associated value v_i; for example the data can look like…
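If the end goal is something like an expected value per entity (purely a guess at intent; the question is cut off), the computation would take this shape, with all column names hypothetical:

```python
from pyspark.sql import functions as F

# Hypothetical input: one row per (k, p_i, v_i).
df = spark.createDataFrame(
    [("k1", 0.2, 10.0), ("k1", 0.8, 5.0), ("k2", 1.0, 3.0)],
    ["k", "p", "v"],
)

# Expected value per entity: sum_i p_i * v_i
df.groupBy("k").agg(F.sum(F.col("p") * F.col("v")).alias("expected_v")).show()
```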
I have this RDD and want to sort it by Month (Jan --> Dec). How can I do it in PySpark? Note: I don't want to use spark.sql or DataFrames.

    +-----+-----+
    |Month|count|
    +-----+-----+
    …
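A hedged RDD-only sketch, assuming the rows are (month, count) tuples: sortBy with a calendar-order lookup.

```python
# Hedged sketch: sort an RDD of (month_abbrev, count) pairs in calendar order,
# without spark.sql or DataFrames. Sample data is illustrative.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
order = {m: i for i, m in enumerate(months)}

rdd = sc.parallelize([("Mar", 5), ("Jan", 3), ("Dec", 7)])
print(rdd.sortBy(lambda row: order[row[0]]).collect())
# [('Jan', 3), ('Mar', 5), ('Dec', 7)]
```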
When I try to use dyF.show() it returns an empty field, even though I checked the schema and count() and I know the table is populated. I transformed it into…
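One hedged workaround often used with AWS Glue DynamicFrames: convert to a Spark DataFrame first and show that instead.

```python
# Hedged sketch: DynamicFrame.show() sometimes prints nothing even when rows exist;
# converting to a Spark DataFrame usually displays them.
df = dyF.toDF()
df.show(20, truncate=False)
print(df.count())   # cross-check against the DynamicFrame's count()
```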
I work on Databricks with a PySpark dataframe containing string-type columns. I use .withColumnRenamed() to rename one of them. Later in the process I use a .filter…
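A minimal reproduction of that flow (all names are placeholders): after withColumnRenamed, only the new name resolves on the dataframe.

```python
from pyspark.sql import functions as F

df = spark.createDataFrame([("a",), ("b",)], ["old_name"])
df2 = df.withColumnRenamed("old_name", "new_name")

df2.filter(F.col("new_name") == "a").show()   # filter must use the new name
# df2.filter(F.col("old_name") == "a")        # would raise an AnalysisException
```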
I have a program that contains a few lines of functions that use pyspark (the rest is plain Python). The portion of my code that uses pyspark: X.to_csv(r'first.t…
Need help on an AWS Multi-Region Access Point (MRAP). I'm using a Spark dataframe to write data to an MRAP and that is erroring out: Df.write(<mrap alias>.acce…
I am building a data warehouse in Azure Synapse where one of the sources is about 20 different types of XML files (each with a different XSD schema) and 1 base schema…
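If the question is heading toward parsing those XML files with Spark (an assumption; it's cut off), the spark-xml package is the usual route. The rowTag and storage path below are placeholders, and the com.databricks:spark-xml package must be attached to the cluster:

```python
# Hedged sketch: read one XML source type with the spark-xml package.
df = (spark.read.format("xml")
      .option("rowTag", "Record")    # element that becomes one row
      .load("abfss://raw@mystorage.dfs.core.windows.net/xml/type_a/"))
df.printSchema()
```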