I have a Spark job that needs to store the last time it ran to a text file. This has to work both on HDFS and on the local filesystem (for testing). However, it seems
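A minimal sketch of one common approach, assuming PySpark: go through Hadoop's FileSystem API via the JVM gateway, which resolves both hdfs:// and file:// URIs, so the same code works on the cluster and locally. The path and timestamp below are placeholders.

```python
# Sketch: persist a "last run" timestamp through Hadoop's FileSystem API,
# which handles both hdfs:// and file:// schemes (path below is an assumption).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def write_last_run(path, stamp):
    jvm = spark._jvm
    hpath = jvm.org.apache.hadoop.fs.Path(path)
    fs = hpath.getFileSystem(spark._jsc.hadoopConfiguration())
    out = fs.create(hpath, True)  # True = overwrite any existing file
    out.write(bytearray(stamp, "utf-8"))
    out.close()

write_last_run("file:///tmp/last_run.txt", "2021-06-01T12:00:00")
```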
I am trying to pivot a DataFrame of raw data about 6 GB in size, and it takes around 30 minutes (aggregation function: sum): x_pivot = raw_df.groupBy("a", "b", "c"
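Since the call is cut off, here is a hedged completion of that pivot. One documented way to speed it up is to pass the pivot column's distinct values explicitly, which lets Spark skip the extra job that computes them; the column names and value list below are assumptions, not from the question.

```python
# Sketch: pivot with an explicit value list so Spark skips the distinct-values
# scan; "d", "e", and the values are assumed names.
x_pivot = (raw_df
           .groupBy("a", "b", "c")
           .pivot("d", ["v1", "v2", "v3"])
           .sum("e"))
```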
I have an Excel file with damaged rows at the top (the first 3 rows) which need to be skipped. I'm using the spark-excel library to read the Excel file; on their GitHub
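As a sketch of what the spark-excel README describes: the dataAddress option sets the cell where reading starts, so pointing it at row 4 skips the three damaged rows. The sheet name and file path here are assumptions.

```python
# Sketch: start reading at A4 to skip the first 3 (damaged) rows; sheet name
# and path are assumptions.
df = (spark.read
      .format("com.crealytics.spark.excel")
      .option("dataAddress", "'Sheet1'!A4")
      .option("header", "true")
      .load("/path/to/file.xlsx"))
```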
I need to calculate the standard deviation row-wise, assuming that I already have a column with the calculated mean per row. I tried this: SD = (reduce(sqrt((add, (abs(col
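The snippet cuts off, but a working version of the reduce-based idea looks like the sketch below: with the row mean already available, the sample standard deviation over the value columns is sqrt(sum((x - mean)^2) / (n - 1)). The column names are assumptions.

```python
# Sketch: row-wise sample standard deviation given a precomputed "mean" column;
# c1..c3 are assumed value-column names.
from functools import reduce
from operator import add
from pyspark.sql.functions import col, sqrt

value_cols = ["c1", "c2", "c3"]
n = len(value_cols)
sq_diffs = [(col(c) - col("mean")) ** 2 for c in value_cols]
df = df.withColumn("sd", sqrt(reduce(add, sq_diffs) / (n - 1)))
```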
I would like to take small parquet files that are spread out across multiple partition layers on S3 and compact them into larger files with a single partition
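A minimal sketch of the usual compaction pattern, assuming PySpark and placeholder S3 paths: read the whole partition tree, reduce the file count, and rewrite under the flatter layout.

```python
# Sketch: compact many small parquet files into fewer, larger ones; paths and
# the target file count are assumptions.
df = spark.read.parquet("s3://bucket/input/")
(df.repartition(16)          # or coalesce(16) to avoid a full shuffle
   .write
   .mode("overwrite")
   .parquet("s3://bucket/output/"))
```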
I have a few questions which I would like to clarify before installation. Please bear with me, as I am still new to data science and installing packages. 1)
Whenever I try to run my main program directly in IntelliJ, I get this error: Error:(5, 12) object apache is not a member of package org import org.apache.common
I am writing unit tests for my Spark/Scala application. I am also using ScalaMock to mock objects, specifically Session / Session Factory. In one of my test
I have a Spark/Scala job in which I do this:
1. Compute a big DataFrame df1 and cache it in memory
2. Use df1 to compute dfA
3. Read raw data into df2 (again,
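The question is in Scala terms, but the listed flow can be sketched as below (shown in PySpark to match the other examples here); the sources and the dfA computation are assumptions. An eager action right after cache() ensures df1 is actually materialized before it is reused.

```python
# Sketch of the listed flow; sources and aggregation are assumed.
df1 = spark.read.parquet("/data/big")     # step 1: compute big df1
df1.cache()
df1.count()                               # eager action to populate the cache
dfA = df1.groupBy("key").count()          # step 2: derive dfA from df1
df2 = spark.read.parquet("/data/raw")     # step 3: read raw data into df2
```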
I am executing a Spark job on a Databricks cluster. I am triggering the job via an Azure Data Factory pipeline, and it executes at a 15-minute interval, so after the su
Launching pyspark in client mode: bin/pyspark --master yarn-client --num-executors 60. Importing numpy in the shell goes fine, but it fails in the KMeans. Somehow
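The classic cause of this is numpy being installed on the driver but not on the YARN worker nodes. A hedged diagnostic sketch, run in the same pyspark shell (where sc already exists), that probes the executors:

```python
# Sketch: probe every partition to see which executor hosts cannot import numpy.
def probe(_):
    try:
        import numpy  # noqa: F401
        return []
    except ImportError:
        import socket
        return ["numpy missing on " + socket.gethostname()]

print(sc.parallelize(range(120), 120).mapPartitions(probe).distinct().collect())
```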
Using "spark.sql.warehouse.dir" in the same jupyter session (no databricks) works. But after a kernel restart in jupyter the catalog db and tables arent't re
I'm new to Delta Lake, but I want to create some indexes for fast retrieval on some tables in Delta Lake. Based on the docs, the closest thing is b
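If the doc section in question is the one on data skipping and Z-ordering, a hedged sketch of how that is typically invoked follows; OPTIMIZE ... ZORDER BY requires a Delta engine that supports it (e.g. Databricks or recent OSS Delta), and the table path and column are assumptions.

```python
# Sketch: cluster the table files by a query column so data skipping can
# prune reads; path and column name are assumptions.
spark.sql("OPTIMIZE delta.`/mnt/tables/events` ZORDER BY (event_id)")
```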
Anyone know why I keep getting this error in Jupyter Notebooks? I've been trying to load my TensorFlow model into Apache Spark via SparkFlow, but I can't seem to
I can't understand why my code isn't working. The last line is the problem:
import findspark
findspark.init()
from pyspark import SparkConf, SparkContext
from p
I have an Apache Beam project which works fine if I run it directly. But if I try to create a jar using mvn clean package, it creates an uber jar using maven sha
I'm learning PySpark and I'm trying the code below. Can someone help me understand what's wrong? >>> pairs=data.flatMap(lambda x:x.split(' ')).map(lambda x
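For reference, the usual word-count shape that line appears to be building is sketched below; the second map argument and the reduce step are assumptions, since the line is cut off.

```python
# Sketch: classic word-count pairing and reduction; `data` is an RDD of lines.
pairs = data.flatMap(lambda x: x.split(" ")).map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)
print(counts.take(5))
```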
I'm using Spark to run a grid-search job using the spark-sklearn package. Here's my config:
NUM_SLAVES = 14
DRIVER_SPARK_MEMORY = 53  # "spark.driver.memory"
EXECUTOR_
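For context, a hedged sketch of how spark-sklearn's grid search is typically wired up; the estimator, grid, and data below are placeholders, and the package's GridSearchCV takes the SparkContext as its first argument.

```python
# Sketch: spark-sklearn fans the scikit-learn parameter grid out over Spark
# executors; estimator, grid, and data are assumptions.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits
from spark_sklearn import GridSearchCV

X, y = load_digits(return_X_y=True)
grid = {"n_estimators": [100, 300], "max_depth": [5, 10]}
search = GridSearchCV(sc, RandomForestClassifier(), param_grid=grid)
search.fit(X, y)
```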
I am new to Spark, and I want to transform the source DataFrame below (loaded from a JSON file):
+--+-----+-----+
|A |count|major|
+--+-----+-----+
| a|    1|    m