I'm trying to create a Spark DataFrame from a dictionary whose data looks like {'33_45677': 0, '45_3233': 25, '56_4599': 43524}, etc.:

dict_pairs = {'33_45677': 0, '45_3233': 25, '56_4599': 43524}
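A minimal sketch: the dictionary's items() yield (key, value) pairs, which createDataFrame accepts directly as rows; the column names below are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

dict_pairs = {'33_45677': 0, '45_3233': 25, '56_4599': 43524}

# Each (key, value) pair becomes one row of the DataFrame.
df = spark.createDataFrame(list(dict_pairs.items()), ['id', 'value'])
df.show()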
I am trying to turn an RDD into a DataFrame. The operation seems to be successful, but when I try to count the number of elements in the DataFrame I get an error.
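The conversion is lazy, so a bad row or a mismatched schema often only surfaces once an action such as count() runs. A minimal sketch that supplies column names explicitly, assuming an active SparkSession named spark:

# Build an RDD of tuples and convert it with an explicit schema.
rdd = spark.sparkContext.parallelize([(1, 'a'), (2, 'b')])
df = spark.createDataFrame(rdd, ['id', 'name'])

# count() is the action that actually materializes the rows,
# which is where malformed data finally raises.
print(df.count())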
Issue: I'm trying to write to a Parquet file using spark.sql, however I encounter issues when the query has unions or subqueries. I know there's some syntax I can't seem to get right.
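One way to sidestep most of the syntax trouble is to run the union or subquery as an ordinary SELECT and write the resulting DataFrame; the table names and output path below are placeholders (an active SparkSession named spark is assumed).

# Run the union as a plain query...
result = spark.sql("""
    SELECT id, amount FROM sales_2021
    UNION ALL
    SELECT id, amount FROM sales_2022
""")

# ...then write the DataFrame, outside of SQL entirely.
result.write.mode("overwrite").parquet("/tmp/sales_union")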
I have to first partition by a "customer group", but I also want to make sure that I have a single CSV file per "customer_group". This is because it is timeseries data.
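A sketch, assuming the DataFrame is named df and the path is a placeholder: repartitioning on the same column before partitionBy places all rows of a group in one task, so each customer_group directory receives a single part file.

(df.repartition("customer_group")        # every group lands wholly in one task
   .write
   .partitionBy("customer_group")        # one output directory per group
   .mode("overwrite")
   .csv("/path/to/output"))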
I have a dataset as below:

col1  extension_col1
2345  2246
2246  2134
2134  2091
2091  Null
1234  1111
1111  Null

I need to find the number of extensions available for each value in col1.
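The extensions form chains (2345 → 2246 → 2134 → 2091) that a single join cannot follow. A sketch, assuming chains are short and acyclic, using a bounded self-join loop (an active SparkSession named spark is assumed):

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [('2345', '2246'), ('2246', '2134'), ('2134', '2091'), ('2091', None),
     ('1234', '1111'), ('1111', None)],
    ['col1', 'extension_col1'])

# Start each value at its first hop; every successful hop adds one.
counts = df.select(
    'col1',
    F.col('extension_col1').alias('cur'),
    F.when(F.col('extension_col1').isNotNull(), 1).otherwise(0).alias('extensions'))

nxt = df.select(F.col('col1').alias('cur'),
                F.col('extension_col1').alias('step'))

for _ in range(5):  # bound chosen for this sample; loop to a fixpoint in practice
    counts = (counts.join(nxt, 'cur', 'left')
                    .select('col1',
                            F.col('step').alias('cur'),
                            (F.col('extensions') +
                             F.when(F.col('step').isNotNull(), 1).otherwise(0)
                             ).alias('extensions')))

counts.select('col1', 'extensions').show()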
Can someone help me with the below? I have an input dataframe:

ID  process_type   STP_stagewise
1   loan_creation  Manual
1   loan_creation  NSTP
1   reimbursement  STP
2   ...
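A sketch, assuming the goal is one row per ID with each process_type's stage-wise statuses gathered together (an active SparkSession named spark is assumed):

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(1, 'loan_creation', 'Manual'),
     (1, 'loan_creation', 'NSTP'),
     (1, 'reimbursement', 'STP')],
    ['ID', 'process_type', 'STP_stagewise'])

# Pivot process_type into columns, collecting the statuses per ID.
(df.groupBy('ID')
   .pivot('process_type')
   .agg(F.collect_list('STP_stagewise'))
   .show(truncate=False))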
It is suggested that you can 'generate unique increasing numeric values' with select row_number() over (order by monotonically_increasing_id()) from /* ... */. But is this actually guaranteed to work?
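A sketch of the quoted pattern, assuming an active SparkSession named spark. Note that a window with no PARTITION BY pulls every row into a single partition, which is why this yields gap-free 1..N numbering but does not scale to large data:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.range(5)

# monotonically_increasing_id() is unique and increasing, but not gap-free;
# row_number() over it converts that into a dense 1..N sequence.
w = Window.orderBy(F.monotonically_increasing_id())
df.withColumn('seq', F.row_number().over(w)).show()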
|-- x: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- y: long (nullable = true)
|    |    |-- z: array (nullable = true)
|-- x: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- y: struct (nullable = true)
|    |    |-- z: struct (nullable = true)
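For schemas shaped like the two above, flattening usually means exploding the outer array and then reaching into the struct elements with dotted paths; df stands in for the DataFrame in question, and any field names beyond y and z are assumptions.

from pyspark.sql import functions as F

flat = (df.select(F.explode('x').alias('e'))   # one row per array element
          .select('e.y', 'e.z'))               # pull the struct fields out
flat.printSchema()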
I have a 130 GB csv.gz dataset in S3 that was loaded using a parallel unload from Redshift to S3. Since it consists of multiple files, I wanted to reduce the number of files.
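A sketch, assuming the files share one layout; paths and the partition count are placeholders. Since gzip files are not splittable, keeping a modest number of output partitions preserves some read parallelism downstream:

# Read all the unloaded parts, then shrink the file count before rewriting.
df = spark.read.option('header', False).csv('s3://bucket/unload/prefix/')
(df.coalesce(16)                                  # 16 output files instead of hundreds
   .write.option('compression', 'gzip')
   .csv('s3://bucket/merged/'))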
When I use pyspark to write to the csv file:

sql_df.write \
    .format("csv") \
    .option('sep', '\t') \
    .option("compression", "gzip") \
    .option("quote", ...)
As I create a new column with F.lit(1), calling printSchema() gives me column_name: integer (nullable = false), and the lit function docs are quite brief about why.
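This is expected: a literal can never be null, so Spark marks the column non-nullable. If a nullable column is needed, routing the literal through an expression that can produce null flips the flag; a sketch, assuming an active SparkSession named spark:

from pyspark.sql import functions as F

df = spark.range(3).withColumn('column_name', F.lit(1))
df.printSchema()   # column_name: integer (nullable = false)

# when() without an otherwise() can yield null, so the result is nullable.
nullable_df = df.withColumn('column_name', F.when(F.lit(True), F.lit(1)))
nullable_df.printSchema()   # column_name: integer (nullable = true)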
I have a Spark question: for the input, each entity k has a sequence of probabilities p_i, each with an associated value v_i; for example, the data can look like ...
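Purely illustrative, assuming one row per (k, p_i, v_i) and that a per-entity expectation sum(p_i * v_i) is the kind of aggregation wanted (an active SparkSession named spark is assumed):

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [('k1', 0.2, 10.0), ('k1', 0.8, 20.0), ('k2', 1.0, 5.0)],
    ['k', 'p', 'v'])

# Weight each value by its probability and sum per entity.
df.groupBy('k').agg(F.sum(F.col('p') * F.col('v')).alias('expected_v')).show()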
result = aml_identity_g.connectedComponents()
conn_comps = result.select("id", "component", "type")
conn_comps.createOrReplaceTempView("components")  # returns None, so register the view separately
display(result)
I have multiple files, but let's consider 2 files, which have FileName and Start_Date columns:

Start_Date  FileName
2022-01-01  product 1
2022-02-02  product 2

Please ...
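A sketch, assuming the files should be loaded together with each row tagged by its origin; the path is a placeholder (an active SparkSession named spark is assumed):

from pyspark.sql import functions as F

# Read every file at once and record which file each row came from.
df = (spark.read.option('header', True).csv('s3://bucket/input/*.csv')
           .withColumn('source_file', F.input_file_name()))
df.show(truncate=False)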
I have a text file which looks like below:

HDR¶20200101
BDY¶1¶Jimmy
BDY¶1¶Something
TRL¶123

I would like to parse it into a Glue DynamicFrame.
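A sketch in plain Spark (the result can be wrapped afterwards with DynamicFrame.fromDF(body, glueContext, "body")): read the file as text, split on the ¶ delimiter, and keep the BDY records. The path and the output column names are placeholders.

from pyspark.sql import functions as F

raw = spark.read.text('s3://bucket/input.txt')

# Split each line on the record delimiter; the first token is the record tag.
parts = raw.withColumn('p', F.split('value', '¶'))
body = (parts.filter(F.col('p')[0] == 'BDY')
             .select(F.col('p')[1].alias('id'),
                     F.col('p')[2].alias('name')))
body.show()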
I need to split a column into several columns depending on the number of fields that each record has; for example, if I have the following DF: ...
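A sketch, assuming a delimited string column named raw and a known upper bound on the field count; getItem returns null past the end of short records, so ragged rows pad themselves (an active SparkSession named spark is assumed):

from pyspark.sql import functions as F

df = spark.createDataFrame([('a:b:c',), ('d:e',)], ['raw'])

parts = F.split('raw', ':')
n = 3  # maximum number of fields expected
df.select('raw', *[parts.getItem(i).alias(f'col{i+1}') for i in range(n)]).show()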
Hive has a metastore, and HiveServer2 listens for SQL requests; with the help of the metastore, the query is executed and the result is passed back. The Thrift framework is what HiveServer2 is built on to accept those client requests.
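For example, a Python client can submit SQL to HiveServer2 over that Thrift service with the PyHive library; the host, port, and query below are placeholders.

from pyhive import hive

# Connect to HiveServer2's Thrift endpoint (10000 is the usual default port).
conn = hive.Connection(host='hiveserver2.example.com', port=10000)
cur = conn.cursor()
cur.execute('SELECT * FROM some_table LIMIT 10')
print(cur.fetchall())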
I have a dataframe having a column of type MapType<StringType, StringType>:

|-- identity: map (nullable = true)
|    |-- key: string
|    |-- value: string (valueContainsNull = true)
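Two common ways to work with such a column, assuming an active SparkSession named spark: look a key up directly, or explode the map into key/value rows.

from pyspark.sql import functions as F

df = spark.createDataFrame([({'a': '1', 'b': '2'},)], ['identity'])

# Direct lookup of one key (null if the key is absent).
df.select(F.col('identity').getItem('a').alias('a')).show()

# One row per map entry.
df.select(F.explode('identity').alias('key', 'value')).show()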
We have a 1-day table aggregated with group by call_date, tdlinx_id, work_request_id, category_name, and another table where we have 1-week level data aggregated with ...
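A sketch, assuming the goal is to line the daily table up against the weekly one: derive the week start from call_date and join on the shared keys. The table names day_agg and week_agg, and the week_start column in the weekly table, are assumptions.

from pyspark.sql import functions as F

# Map each daily row onto the week it belongs to.
daily = spark.table('day_agg').withColumn(
    'week_start', F.date_trunc('week', F.col('call_date')))
weekly = spark.table('week_agg')

joined = daily.join(
    weekly,
    ['week_start', 'tdlinx_id', 'work_request_id', 'category_name'],
    'left')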