I followed the dynamic allocation setup configuration; however, I get the following error when the executors start: ERROR TaskSchedulerImpl: Lost execu
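For reference, a minimal sketch of the kind of session configuration dynamic allocation expects on YARN; the executor bounds here are illustrative assumptions, not values from the question:

    from pyspark.sql import SparkSession

    # Minimal dynamic-allocation setup; min/max executor counts are assumptions.
    spark = (
        SparkSession.builder
        .appName("dynamic-allocation-example")
        .config("spark.dynamicAllocation.enabled", "true")
        # On YARN, dynamic allocation also needs the external shuffle service
        # (or shuffle tracking on Spark 3.x) so executors can be released safely.
        .config("spark.shuffle.service.enabled", "true")
        .config("spark.dynamicAllocation.minExecutors", "1")
        .config("spark.dynamicAllocation.maxExecutors", "20")
        .getOrCreate()
    )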
The error described below occurs when I run a Spark job on Databricks a second time (less often on the first run). The SQL query just performs a CREATE TABLE AS SELECT
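For context, the query shape involved is a plain CTAS along these lines (table and source names are hypothetical placeholders); note that a plain CREATE TABLE AS SELECT fails on a re-run if the table already exists, unless it is dropped first or written as CREATE OR REPLACE TABLE (supported for Delta tables on Databricks):

    # Table and source names are placeholders, shown only to illustrate the shape.
    spark.sql("""
        CREATE TABLE target_db.target_table
        AS SELECT * FROM source_db.source_table
    """)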
I'm looking for a reliable way in Spark (v2+) to programmatically adjust the number of executors in a session. I know about dynamic allocation and the ability
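If programmatic control is needed beyond dynamic-allocation bounds, one heavily hedged option is to reach through PySpark's py4j gateway to the Scala SparkContext.requestExecutors developer API; _jsc is an internal attribute, so treat this as a sketch of an unsupported path, honored only by coarse-grained cluster managers such as YARN:

    # Developer API reached via PySpark internals; asks the cluster manager
    # for 2 additional executors. Not a stable, supported interface.
    sc = spark.sparkContext
    sc._jsc.sc().requestExecutors(2)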
These are the contents of my build.sbt file: name := "WordCounter" version := "0.1" scalaVersion := "2.13.1" libraryDependencies ++= Seq( "org.apache.spar
I'm using Spark to deal with my data, like this: dataframe_mysql = spark.read.format('jdbc').options( url='jdbc:mysql://xxxxxxx',
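For comparison, a fuller version of that JDBC read; the URL, driver, table, and credentials below are placeholders, not values from the question:

    # Placeholder connection details; only the overall pattern matters here.
    dataframe_mysql = (
        spark.read.format("jdbc")
        .options(
            url="jdbc:mysql://host:3306/dbname",
            driver="com.mysql.cj.jdbc.Driver",
            dbtable="my_table",
            user="user",
            password="password",
        )
        .load()
    )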
I'm trying to compare two data frames which have the same number of columns, i.e. 4 columns, with id as the key column in both data frames: df1 = spark.read.csv("/path/to/
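A sketch of two common ways to compare the frames once both are loaded; the file paths and the join approach are assumptions, with only the id key taken from the question:

    # Paths are placeholders; only "id" comes from the question.
    df1 = spark.read.csv("/path/to/first.csv", header=True, inferSchema=True)
    df2 = spark.read.csv("/path/to/second.csv", header=True, inferSchema=True)

    # Whole-row differences in either direction:
    only_in_df1 = df1.subtract(df2)
    only_in_df2 = df2.subtract(df1)

    # Or line the rows up by key for a column-by-column comparison:
    joined = df1.alias("a").join(df2.alias("b"), on="id", how="full_outer")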
So I am very new to pyspark, but I am still unable to correctly create my own query. I try googling my problems, but I just don't understand how most of this works
I want to be able to use Apache Sedona for distributed GIS computing on AWS EMR. We need the right bootstrap script so that all dependencies are in place. I tried setting up
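Once the dependencies are on the cluster, the PySpark side looks roughly like this; a sketch that assumes the apache-sedona Python package and matching Sedona jars were installed by the bootstrap step (exact package versions vary):

    from pyspark.sql import SparkSession
    from sedona.register import SedonaRegistrator

    # Assumes the Sedona jars and Python package were installed by the EMR
    # bootstrap action; the serializer settings follow the Sedona documentation.
    spark = (
        SparkSession.builder
        .appName("sedona-on-emr")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .config("spark.kryo.registrator", "org.apache.sedona.core.serde.SedonaKryoRegistrator")
        .getOrCreate()
    )
    SedonaRegistrator.registerAll(spark)  # registers the ST_* SQL functions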
I have a CSV file with the below data: Id Subject Marks; 1 M,P,C 10,8,6; 2 M,P,C 5,7,9; 3 M,P,C 6,7,4. I need to find out the max value in the Marks column for each Id an
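Since the question is cut off, here is a sketch of just the per-Id maximum, using the sample rows above; splitting the comma-separated marks and exploding them keeps the approach version-agnostic:

    from pyspark.sql import functions as F

    # Recreate the sample rows from the question.
    df = spark.createDataFrame(
        [(1, "M,P,C", "10,8,6"), (2, "M,P,C", "5,7,9"), (3, "M,P,C", "6,7,4")],
        ["Id", "Subject", "Marks"],
    )

    # One row per mark, cast to int, then the max per Id.
    exploded = df.withColumn("mark", F.explode(F.split("Marks", ",")))
    max_marks = (
        exploded.withColumn("mark", F.col("mark").cast("int"))
        .groupBy("Id")
        .agg(F.max("mark").alias("max_mark"))
    )
    max_marks.show()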
I have a Spark job that needs to store the last time it ran to a text file. This has to work both on HDFS and on the local fs (for testing). However, it seems
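One way to keep this filesystem-agnostic is to let Spark do the I/O and vary only the URI scheme (hdfs:// in production, file:// in tests); a minimal sketch, with function names of my own choosing:

    from datetime import datetime, timezone

    def save_last_run(spark, path):
        # Write a single-row, single-column text file at the given URI.
        ts = datetime.now(timezone.utc).isoformat()
        (
            spark.createDataFrame([(ts,)], ["last_run"])
            .coalesce(1)
            .write.mode("overwrite")
            .text(path)
        )

    def load_last_run(spark, path):
        rows = spark.read.text(path).collect()
        return rows[0].value if rows else None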
I am trying to pivot a dataframe of roughly 6 GB of raw data, and it takes about 30 minutes (aggregation function: sum): x_pivot = raw_df.groupBy("a", "b", "c"
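One commonly cited way to speed a pivot up is to pass the list of pivot values explicitly, so Spark can skip the extra job that computes the distinct values first; the pivot column, value column, and value list below are placeholders, since the snippet cuts off after groupBy:

    # "d", "metric", and the value list are assumptions; only groupBy("a", "b", "c")
    # and the sum aggregation come from the question.
    pivot_values = ["v1", "v2", "v3"]  # distinct pivot values, known up front
    x_pivot = (
        raw_df.groupBy("a", "b", "c")
        .pivot("d", pivot_values)
        .sum("metric")
    )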
I have an Excel file with damaged rows at the top (the first 3 rows) which need to be skipped. I'm using the spark-excel library to read the Excel file; on their GitHu
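A sketch of the read, assuming the crealytics spark-excel data source and its dataAddress option for starting at a given cell; the sheet name, start cell, and path are placeholders:

    # "Sheet1" and the A4 start cell are assumptions; dataAddress tells the
    # spark-excel data source where the usable data begins, skipping the
    # damaged rows above it.
    df = (
        spark.read.format("com.crealytics.spark.excel")
        .option("dataAddress", "'Sheet1'!A4")
        .option("header", "true")
        .load("/path/to/file.xlsx")
    )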
I need to calculate the standard deviation row-wise, assuming that I already have a column with the calculated mean per row. I tried this: SD = (reduce(sqrt((add, (abs(col
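A sketch of the row-wise calculation, assuming four value columns and a mean column named "mean" (both names are mine); it applies sqrt(sum((x_i - mean)^2) / n), i.e. the population form, so divide by n - 1 instead for the sample form:

    from functools import reduce
    from operator import add
    from pyspark.sql import functions as F

    # value_cols and the "mean" column name are assumptions.
    value_cols = ["c1", "c2", "c3", "c4"]
    n = len(value_cols)

    # Sum the squared deviations from the per-row mean, then take the root.
    squared_diffs = [F.pow(F.col(c) - F.col("mean"), 2) for c in value_cols]
    df = df.withColumn("sd", F.sqrt(reduce(add, squared_diffs) / n))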
I would like to take small parquet files that are spread across multiple partition layers on S3 and compact them into larger files with a single partition
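A sketch of the rewrite, with placeholder paths and an assumed target file count; repartition controls how many output files are produced, and omitting partitionBy writes everything into a single partition layer:

    # Paths and the repartition count are placeholders.
    df = spark.read.parquet("s3://bucket/input/path/")
    (
        df.repartition(16)  # pick a count that yields reasonably sized output files
        .write.mode("overwrite")
        .parquet("s3://bucket/output/path/")
    )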
I have a few questions which I would like to clarify before installation. Please bear with me, as I am still new to data science and to installing packages. 1)
Whenever I try to run my main program directly in IntelliJ, I get this error: Error:(5, 12) object apache is not a member of package org import org.apache.common
I am writing unit tests for my Spark/Scala application. I am also using ScalaMock to mock objects, specifically Session / Session Factory. In one of my test
I have a Spark/Scala job in which I do this: 1: compute a big DataFrame df1 and cache it in memory; 2: use df1 to compute dfA; 3: read raw data into df2 (again,
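Sketched in PySpark for brevity (the Scala DataFrame API is analogous); the inputs and transformations are placeholders standing in for the steps above:

    # Placeholders for the three steps listed above.
    df1 = spark.read.parquet("/data/big_input").cache()   # step 1: cache df1
    df1.count()                                           # action to materialize the cache

    dfA = df1.groupBy("key").count()                      # step 2: derive dfA from df1

    df2 = spark.read.parquet("/data/raw")                 # step 3: read the raw data again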
I am executing a Spark job in a Databricks cluster. I am triggering the job via an Azure Data Factory pipeline, and it executes at a 15-minute interval, so after the su
Launching pyspark in client mode: bin/pyspark --master yarn-client --num-executors 60. Importing numpy in the shell goes fine, but it fails in the KMeans step. Someho
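A quick way to tell whether numpy is importable on the executors (and which version) rather than only on the driver; the partition count is arbitrary and sc is the shell's SparkContext:

    # Runs the import on each partition's executor and collects the versions.
    def numpy_version(_):
        import numpy
        yield numpy.__version__

    print(sc.parallelize(range(4), 4).mapPartitions(numpy_version).collect())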