Category "apache-spark"

pyspark delta-lake metastore

Using "spark.sql.warehouse.dir" in the same Jupyter session (no Databricks) works. But after a kernel restart in Jupyter, the catalog db and tables aren't re

Create index for tables within Delta Lake

I'm new to Delta Lake, but I want to create some indexes for fast retrieval on some tables in Delta Lake. Based on the docs, the closest is b

Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext

Does anyone know why I keep getting this error in Jupyter Notebooks? I've been trying to load my TensorFlow model into Apache Spark via SparkFlow but I can't seem to

Coding reduceByKey(lambda) in map doesn't work in PySpark

I can't understand why my code isn't working. The last line is the problem: import findspark findspark.init() from pyspark import SparkConf, SparkContext from p

Why does Spark Submit cause a NoSuchMethodError when I run an uber jar made through the maven shade plugin?

I have an Apache Beam project which works fine if I run it directly. But if I try to create a jar using maven clean:package it creates an uber jar using maven sha

pyspark wordcount sort by value

I'm learning pyspark and I'm trying the code below. Can someone help me understand what's wrong? >>> pairs=data.flatMap(lambda x:x.split(' ')).map(lambda x

VCores used is always less than VCores total in Spark on YARN on AWS EMR?

I'm using Spark to run a grid search job using the spark sklearn package. Here's my config: NUM_SLAVES = 14 DRIVER_SPARK_MEMORY=53 # "spark.driver.memory" EXECUTOR_

Spark dataframe transform multiple rows to column

I am a novice to Spark, and I want to transform the source dataframe below (loaded from a JSON file): +--+-----+-----+ |A |count|major| +--+-----+-----+ | a| 1| m

Spark on Windows - java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0

In Win10, in IntelliJ, this path ("C:/hive/Orders_[0-9]*.csv") works well when run as a standalone Java Spark job, but does not work as a Spring Boot Spark job. Seems

Spark 3.2.1 cache throws NullPointerException

A job that runs for about 1 day throws the exception since I upgraded the Spark version to 3.2.1. I set it up with a driver and 2 executors; executors are allocated 2g memory

Increasing Spark application timeout in Jupyter/Livy

I'm using a shared EMR cluster with Jupyterhub installed. If my cluster is under heavy load, I get an error. How do I increase the timeout for a Spark applicati

org.apache.hadoop.hbase.io.ImmutableBytesWritable exception in HBase

We tried to test the following example code for accessing HBase tables (Spark-1.3.1, HBase-1.1.1, Hadoop-2.7.0): import sys from pyspark import SparkContext

How to stream data from MongoDB in Structured Streaming?

Is it possible to use Spark Structured Streaming to read data from MongoDB with a readStream? For standard use of structured streaming, I usually do so: va

Access Apache Spark WebUI running in Vagrant

So I set up a Vagrant environment with Spark 1.5.0 installed. Then I use sbin/start-all.sh to start Spark. Inside the VM I can curl localhost:8080 to get the HTML co

Convert df.apply to Spark to run in parallel using all the cores

We have a pandas dataframe that we are using. We have a function we use on retail data which runs on a daily basis, row by row, to calculate the item to item differe

Pyspark-pandas not working on Spark 3.1.2

I am using Spark 3.1.2 and attempting to use pyspark-pandas. However, when attempting from pyspark import pandas as ps I am getting the following error: ImportEr

Job aborted due to stage failure: ShuffleMapStage 20 (repartition at data_prep.scala:87) has failed the maximum allowable number of times: 4

I am submitting a Spark job with the following specification (the same program has been used to run different sizes of data, ranging from 50GB to 400GB): /usr/hdp/2.6.0.3-8/

Databricks display() function equivalent or alternative to Jupyter

I'm in the process of migrating current Databricks Spark notebooks to Jupyter notebooks. Databricks provides a convenient and beautiful display(data_frame) functi

How can you parse a string that is json from an existing temp table using PySpark?

I have an existing Spark dataframe that has columns as such: -------------------- pid | response -------------------- 12 | {"status":"200"} response is a st