I have a unit test for Databricks code, and I want to run it locally on Windows. Unfortunately, when I run pytest from PyCharm, it throws the following exception: Exc
I have two tables: one with 50K records and the other with 2.5K records, and I want to update these 2.5K records into table one. Currently I was doing this by us
Is there a way of counting approximately after a group by on an SQL dataset in Spark? Or, more generally, what is the fastest way of group-by counting in Spark?
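A minimal sketch of both readings of the question, assuming a toy frame with key/value columns: a plain groupBy().count() is a single shuffle and is usually the fastest exact option, while approx_count_distinct covers the case where "approximately" means approximate distinct counts per group.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "value"])

# Exact per-group row count: one shuffle, usually as fast as group-by counting gets.
exact = df.groupBy("key").count()

# Approximate *distinct* counting per group, if that is what "approximately" means here.
approx = df.groupBy("key").agg(F.approx_count_distinct("value").alias("approx_distinct"))

exact.show()
approx.show()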
I am attempting to use Scala with Apache Spark locally to query a Hive table which is secured with Kerberos. I have no issues connecting and querying the data pro
I want to use Spark SQL or PySpark to reformat a date field from 'dd/mm/yyyy' to 'yyyy/mm/dd'. The field type is string: from pyspark.sql import SparkSession fr
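A minimal sketch, assuming the 'mm' in the question means months (Spark's pattern syntax uses MM for months and mm for minutes) and a hypothetical column name dt_str: parse the string with its current pattern, then render it with the target pattern.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("25/12/2021",)], ["dt_str"])

# Parse with the source pattern, then format with the target pattern.
out = df.withColumn(
    "dt_reformatted",
    F.date_format(F.to_date("dt_str", "dd/MM/yyyy"), "yyyy/MM/dd"),
)
out.show()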
Consider a PySpark DataFrame. I would like to summarize the entire data frame, per column, and append the result to every row. +-----+----------+-----------+
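A hedged sketch of one way to do this, using mean as a stand-in for whatever per-column summary is actually wanted: aggregate the whole frame into a single summary row, then cross-join it back so every row carries the summary columns.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 10.0), (2, 20.0), (3, 30.0)], ["a", "b"])

# One-row frame of per-column summaries (mean is only an example summary).
summary = df.agg(*[F.mean(c).alias(f"{c}_mean") for c in df.columns])

# Broadcast the single summary row onto every row of the original frame.
result = df.crossJoin(F.broadcast(summary))
result.show()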
I'm testing some PySpark code in an EMR notebook before I deploy it, and I keep running into this strange error with Spark SQL. I have all my tables and metadata i
Having dates in one column, how can I create a column containing the ISO week date? An ISO week date is composed of year, week number, and weekday. The year is not the same as
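A sketch that assumes Spark's weekofyear() already follows ISO-8601 numbering (it does) and recovers the ISO year as the calendar year of the Thursday that falls in the same ISO week; the column name dt is hypothetical.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2021-01-01",), ("2022-01-03",)], ["dt"]) \
          .withColumn("dt", F.to_date("dt"))

df = (
    # ISO weekday: Monday=1 .. Sunday=7 (Spark's dayofweek() is Sunday=1 .. Saturday=7).
    df.withColumn("iso_weekday", (F.dayofweek("dt") + 5) % 7 + 1)
      # weekofyear() uses ISO-8601 week numbering.
      .withColumn("iso_week", F.weekofyear("dt"))
      # ISO year = calendar year of the Thursday in the same ISO week.
      .withColumn("iso_year", F.year(F.expr("date_add(dt, 4 - iso_weekday)")))
      .withColumn(
          "iso_week_date",
          F.format_string("%d-W%02d-%d", F.col("iso_year"), F.col("iso_week"), F.col("iso_weekday")),
      )
)
df.show()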
I have a problem regarding merging CSV files using PySpark SQL with a Delta table. I managed to create an upsert function that updates if matched and inserts if not mat
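A minimal sketch of a Delta MERGE upsert with the delta-spark Python API; the paths and the merge key id are assumptions to adapt to the real tables.

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical locations: incoming CSV batch and the existing Delta target.
updates_df = spark.read.option("header", "true").csv("/tmp/incoming/*.csv")
target = DeltaTable.forPath(spark, "/tmp/delta/target_table")

(
    target.alias("t")
    .merge(updates_df.alias("s"), "t.id = s.id")   # "id" is an assumed merge key
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)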
I have a PySpark DataFrame, df, with some columns as shown below. The hour column is in UTC time and I want to create a new column that has the local time based
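A hedged sketch assuming the timezone sits in its own column (the names hour_utc and tz are assumptions): from_utc_timestamp accepts a timezone column, so each row can get its own local time.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("2021-06-01 12:00:00", "America/New_York"), ("2021-06-01 12:00:00", "Asia/Tokyo")],
    ["hour_utc", "tz"],
)

# Shift each UTC timestamp into the row's own timezone.
out = df.withColumn("hour_local", F.from_utc_timestamp(F.to_timestamp("hour_utc"), F.col("tz")))
out.show(truncate=False)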
I have a requirement where I am reading data from a CSV file and writing data to a Delta table using Scala on Windows OS. My Scala code is given below: import co
df1 = df.withColumn('etl_load_dt_part_new', concat_ws("-", year(df.ETL_LOAD_DT_PART), lit('12'), lit('31')).cast('date')) I am trying to add a new column named e
This is my dataset: from pyspark.sql import SparkSession, functions as F spark = SparkSession.builder.getOrCreate() df = spark.createDataFrame([('2021-02-07',)
In my project, I need to read an image dataset [each folder has a different object, and I want to read these folders as a stream one by one], and then I need to extrac
I am trying to create a table in Spark SQL by providing the schema and giving the location. However, when I run a select on the table, I see only half the columns. (
I have a case where I may have null values in the column that needs to be summed up in a group. If I encounter a null in a group, I want the sum of that group t
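Assuming the truncated requirement is that a group containing any null should sum to null (Spark's sum() skips nulls by default), a minimal sketch with hypothetical grp/val columns:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("g1", 1.0), ("g1", 2.0), ("g2", 3.0), ("g2", None)],
    ["grp", "val"],
)

out = df.groupBy("grp").agg(
    # count(lit(1)) counts all rows, count("val") only non-null ones; if they differ, emit null.
    F.when(F.count(F.lit(1)) == F.count("val"), F.sum("val")).alias("sum_or_null")
)
out.show()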
Hi, I'm trying to run Spark on my local laptop. I created a Maven project in IntelliJ IDEA, and in my main class I have one line like below, and when I try to run a projec
I have a large dataset like so:
| SEQ_ID|RESULT|
+-------+------+
|3462099|239.52|
|3462099|239.66|
|3462099|239.63|
|3462099|239.64|
|3462099|239.57|
|3462099|
I am running this on Databricks. My goal is to make a select statement with all the values in the column, comma separated. Content of my df: For example, I want
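A minimal sketch with a hypothetical column name col1: collect the values into an array and join them with concat_ws; the resulting string can then be spliced into a select or IN (...) clause (quote the items first if they are strings).

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("b",), ("c",)], ["col1"])

# Collapse the column into one comma-separated string.
joined = df.agg(F.concat_ws(",", F.collect_list("col1")).alias("all_values")).first()["all_values"]
print(joined)   # a,b,c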
Is there a way in PySpark to recover, for an even number of rows, the two values of a median? For example, I have this dataframe: df1 = spark.createDataFrame
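A hedged sketch with a hypothetical value column: number the rows in sorted order and keep the two central positions n/2 and n/2 + 1 when the row count n is even.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql import Window

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1.0,), (2.0,), (3.0,), (4.0,)], ["value"])

n = df1.count()
w = Window.orderBy("value")

# For an even row count, the median's two central values sit at positions n/2 and n/2 + 1.
two_middle = (
    df1.withColumn("rn", F.row_number().over(w))
       .where(F.col("rn").isin(n // 2, n // 2 + 1))
       .select("value")
)
two_middle.show()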