I would like to repartition / coalesce my data so that it is saved into one Parquet file per partition. I would also like to use the Spark SQL partitionBy API.
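A sketch of the usual pattern, assuming the partition column is named date and the paths are placeholders: repartition on the same column you pass to partitionBy, so each output partition directory is written by exactly one task.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("/path/to/input")  # illustrative input path

    # Repartitioning on the partition column puts all rows for a given
    # 'date' into a single task, so each partitionBy directory gets
    # exactly one Parquet file.
    (df.repartition("date")
       .write.mode("overwrite")
       .partitionBy("date")
       .parquet("/path/to/output"))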
I have a data frame (df). To show its schema I use:

    from pyspark.sql.functions import *
    df1.printSchema()

And I get the following result:

    #root
    # |-- na
I am trying to update a Delta table in Databricks, using the Databricks documentation here as an example. That document only talks about updating with a literal value
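If the sticking point is that the documented example only sets literals, DeltaTable.update also accepts column expressions in its set map. A sketch with an assumed table path and columns (on Databricks, spark is already defined):

    from delta.tables import DeltaTable
    from pyspark.sql.functions import col, lit

    dt = DeltaTable.forPath(spark, "/delta/events")  # placeholder path

    # 'set' takes column expressions, not just literal values
    dt.update(
        condition=col("status") == lit("stale"),
        set={"amount": col("amount") * 1.1},
    )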
I have a Spark df with the following schema:

    |-- col1: string
    |-- col2: string
    |-- customer: struct
    |    |-- smt: string
    |    |-- attributes: array (null
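The listing is cut off before the actual question, but with a schema like this the usual moves are dot paths for struct fields and explode for the array. A sketch where the attributes element type is assumed to be string:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("a", "b", ("s1", ["x", "y"]))],
        "col1 string, col2 string, customer struct<smt:string, attributes:array<string>>",
    )

    # Struct fields via dot paths; one row per array element via explode
    df.select("col1", "customer.smt",
              F.explode("customer.attributes").alias("attribute")).show()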
I want to add a column with a default date ('1901-01-01') to an existing DataFrame using PySpark. I used the below code snippet:

    from pyspark.sql import functions a
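A sketch of one way to add it, with default_date as an assumed column name; lit builds the constant and the cast stores it as a proper DateType rather than a string:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("x",)], ["col1"])  # stand-in for the existing df

    df = df.withColumn("default_date", F.lit("1901-01-01").cast("date"))
    df.show()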
My dataset is pretty big, and I would like to understand when groupByKey makes sense over reduceByKey.
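The short version: reduceByKey combines values per key on each partition before the shuffle, so far less data crosses the network; groupByKey ships every value and only makes sense when you genuinely need the whole group (a median, say). A minimal sketch with toy data:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

    sums = rdd.reduceByKey(lambda x, y: x + y)   # map-side combine, then shuffle
    groups = rdd.groupByKey().mapValues(list)    # full shuffle of all values
    print(sums.collect(), groups.collect())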
Why does column 1st_from_end contain null:

    from pyspark.sql.functions import split
    df = sqlContext.createDataFrame([('a b c d',)], ['s',])
    df.select( split(d
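The snippet is cut off, but this question usually comes from indexing the split array with -1: bracket/getItem indexing does not support negative positions and quietly returns null. A sketch of a workaround, assuming Spark 2.4+ where element_at accepts a negative index:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split, element_at, col

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([('a b c d',)], ['s'])

    # element_at counts from the end when given a negative index
    df.select(element_at(split(col('s'), ' '), -1).alias('1st_from_end')).show()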
I want to write Spark batch results data to Apache Druid. I know Druid has native batch ingestion methods such as index_parallel. Druid runs Map-Reduce jobs in the
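One pattern that sidesteps the Map-Reduce path entirely is to land the Spark output on deep storage and then submit a native index_parallel task to the Overlord over HTTP. A sketch; the Overlord URL, bucket, datasource, and column names are all placeholders to adapt:

    import requests
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("2024-01-01T00:00:00Z", "a")], ["ts", "dim"])

    # 1) Write the batch results where Druid's ingestion can read them
    df.write.mode("overwrite").json("s3a://bucket/druid-staging/events/")

    # 2) Submit a native-batch ingestion spec to the Overlord
    spec = {
        "type": "index_parallel",
        "spec": {
            "dataSchema": {
                "dataSource": "events",
                "timestampSpec": {"column": "ts", "format": "iso"},
                "dimensionsSpec": {"dimensions": ["dim"]},
                "granularitySpec": {"segmentGranularity": "day"},
            },
            "ioConfig": {
                "type": "index_parallel",
                "inputSource": {"type": "s3",
                                "prefixes": ["s3://bucket/druid-staging/events/"]},
                "inputFormat": {"type": "json"},
            },
        },
    }
    requests.post("http://druid-overlord:8081/druid/indexer/v1/task", json=spec)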
Table A has many columns, including a date column; Table B has a datetime and a value. The data in both tables are generated sporadically, with no regular interval. Ta
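The question is truncated, but the setup reads like an as-of join: pair each row of Table A with the most recent Table B value at or before A's date. If that is the goal, a sketch with assumed names and toy data:

    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.getOrCreate()
    a = spark.createDataFrame([("2020-01-05",)], ["date"]) \
             .withColumn("date", F.to_date("date"))
    b = spark.createDataFrame(
        [("2020-01-01 09:30:00", 1.0), ("2020-01-04 17:00:00", 2.0)],
        ["ts", "value"]).withColumn("ts", F.to_timestamp("ts"))

    # Keep, per A.date, only the latest B row not after that date
    joined = (a.join(b, F.to_date(b["ts"]) <= a["date"], "left")
                .withColumn("rn", F.row_number().over(
                    Window.partitionBy("date").orderBy(F.desc("ts"))))
                .filter("rn = 1")
                .drop("rn"))
    joined.show()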
I am trying to fetch multiple rows in Zeppelin using Spark SQL. Here's my SQL statement:

    select id, name from target where updated_at = "val1", "val2", "val3"
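Spark SQL matches a column against several values with IN rather than a comma-separated list after =. A sketch run through spark.sql, with the table name and values as placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.sql("""
        SELECT id, name
        FROM target
        WHERE updated_at IN ('val1', 'val2', 'val3')
    """).show()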
I've been searching for a while for a way to use a Scala class in PySpark, and I haven't found any documentation or guide on the subject. Let's
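One route that does work is Py4J: PySpark exposes the driver JVM through the SparkContext, so a compiled Scala class shipped with --jars can be instantiated from Python. A sketch; com.example.MyScalaClass and someMethod are hypothetical stand-ins for your own code:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # _jvm is the Py4J gateway into the driver JVM; the class must be on
    # the driver classpath (e.g. via spark-submit --jars).
    instance = spark.sparkContext._jvm.com.example.MyScalaClass()
    result = instance.someMethod("arg")  # hypothetical method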