Category "apache-spark"

Spark job failing on Jackson dependencies

I have a Spark job that is failing after upgrading CDH from 5.5.4 (which had Spark 1.5.0) to CDH 5.13.0 (which has Spark 1.6.0). The job is running with the

Is there an issue with the package name openjdk-8-jdk-headless?

I am trying to install Spark, which requires Java, using !apt-get install openjdk-8-jdk-headless -qq > /dev/null, and I get an error afterwards. E: Failed

Zeppelin+Spark+Kubernetes: Let Zeppelin Job run on existing Spark Cluster

In a k8s cluster, how do you configure Zeppelin to run Spark jobs on an existing Spark cluster instead of spinning up a new pod? I've got a k8s cluster up and r
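
The usual approach is to point Zeppelin's Spark interpreter at the existing master instead of letting it spawn its own driver pod. A minimal sketch of the interpreter properties; the Service host name and port below are assumptions:

    # Hypothetical Zeppelin Spark interpreter settings (interpreter UI or
    # conf/interpreter.json); adjust the master URL to your cluster's Service.
    master                   spark://spark-master-svc.spark.svc.cluster.local:7077
    spark.submit.deployMode  client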

Another SparkContext is being constructed Error

I am using Spark and got this error when trying to run 'pyspark' in the Windows command prompt. I tried to install PySpark on Windows with this tutorial (h
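
This error typically means a SparkContext already exists in the process. A minimal PySpark sketch of the usual workaround, reusing the running context via getOrCreate:

    from pyspark import SparkConf, SparkContext

    # getOrCreate returns the already-running SparkContext if one exists,
    # instead of constructing a second one and raising this error.
    conf = SparkConf().setAppName("example").setMaster("local[*]")
    sc = SparkContext.getOrCreate(conf)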

java.io.FileNotFoundException: YAML file does not exist

When I submit the Spark job from the terminal I get the error below saying that the file does not exist, although I have already placed the config file on my local machine. spa
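
One common cause: in cluster mode the driver does not see the submitter's local filesystem, so the config has to be shipped with --files on spark-submit. A minimal sketch, assuming the file was passed as --files config.yaml (the file name is an assumption):

    from pyspark import SparkFiles
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Files shipped via --files are staged on every node; SparkFiles.get
    # resolves the node-local path instead of the submit-time path.
    with open(SparkFiles.get("config.yaml")) as f:
        raw_config = f.read()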

Apache Spark - Is it possible to use a Dependency Injection Mechanism

Is there any possibility of using a framework to enable Dependency Injection in a Spark application? Is it possible to use Guice, for instance? If so,
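
Guice itself lives on the JVM side; as a language-neutral illustration, here is a minimal hand-rolled sketch in PySpark of the same idea: wire dependencies in one composition root on the driver and keep anything captured by closures serializable. All names are hypothetical.

    from pyspark.sql import SparkSession

    class Enricher:
        """Hypothetical injected dependency."""
        def __init__(self, suffix):
            self.suffix = suffix
        def apply(self, value):
            return f"{value}{self.suffix}"

    def build_enricher():
        # Composition root: the one place that decides concrete wiring.
        return Enricher("_enriched")

    spark = SparkSession.builder.getOrCreate()
    enricher = build_enricher()
    rdd = spark.sparkContext.parallelize(["a", "b"])
    # The dependency travels to executors inside the closure, so it must
    # be serializable (picklable in PySpark).
    print(rdd.map(enricher.apply).collect())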

spark ETL and spark thrift server

Some details: Spark SQL (version 3.2.1); Driver: Hive JDBC (version 2.3.9); ThriftCLIService: Starting ThriftBinaryCLIService on port 10000 with 5...500 worker t

Write single CSV file using spark-csv

I am using https://github.com/databricks/spark-csv and trying to write a single CSV file, but I am not able to; it creates a folder. I need a Scala function which wil
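
The question asks for Scala, but the idea is the same in any binding: collapse the data to one partition before writing. A minimal PySpark sketch (the output path is an assumption); note Spark still writes a directory, it just contains a single part-*.csv file:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

    # coalesce(1) forces a single partition, so exactly one part file
    # is produced inside the output folder.
    df.coalesce(1).write.option("header", True).mode("overwrite").csv("/tmp/single_csv")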

Why does calling cache take a long time on a Spark Dataset?

I'm loading large datasets and then caching them for reference throughout my code. The code looks something like this: val conversations = sqlContext.read .f
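
One common explanation: cache() only marks the plan for caching, so the cost usually shows up at the first action that materializes it. A minimal sketch (the input path is an assumption):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    conversations = spark.read.parquet("/data/conversations")  # hypothetical path

    conversations.cache()   # lazy: just marks the plan, returns immediately
    conversations.count()   # first action pays the cost of building the cache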

Spark-submit not working when application jar is in hdfs

I'm trying to run a Spark application using bin/spark-submit. When I reference my application jar inside my local filesystem, it works. However, when I copied m

Merging similar column names while joining two dataframes using pyspark

In the program below, duplicate columns are created while joining two DataFrames in PySpark. >>> spark = SparkSession.builder.appName("Jo
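
A minimal sketch of the usual fix: pass the join key as a list of column names rather than an expression, so Spark keeps a single copy of the key column. Column names here are assumptions.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df1 = spark.createDataFrame([(1, "a")], ["id", "left_val"])
    df2 = spark.createDataFrame([(1, "b")], ["id", "right_val"])

    # Joining on ["id"] (not df1.id == df2.id) de-duplicates the key column.
    df1.join(df2, ["id"], "inner").show()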

Spark read from S3 working, but I am unable to write using the same session [duplicate]

I am using a PySpark test script to read and write files to S3. Here is how I initialize the Spark session: import findspark from pyspark.sql
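
A minimal sketch of an s3a session that both reads and writes; the bucket name and static credentials are assumptions, and write failures are often just missing PutObject permissions on the bucket rather than Spark configuration:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")
             .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")
             .getOrCreate())

    df = spark.createDataFrame([(1, "a")], ["id", "val"])
    # Reads and writes go through the same s3a filesystem client.
    df.write.mode("overwrite").parquet("s3a://my-bucket/out/")  # hypothetical bucket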

Error when running jar: Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.$conforms()Lscala/Predef$$less$colon$less;

I work on a Spark application (Spark 2.0.0 & Scala 2.11.8) and the application works fine within the IntelliJ IDEA environment. I've extracted the application a

Synapse spark job fails as input folder does not exist

How do I do exception handling for file reading? For example, I have a daily job that runs at 8:00 am. It reads files from Azure Data Lake Storage (Gen2). The
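
A minimal sketch of one way to guard the read: probe the path through the Hadoop FileSystem API before loading. The abfss path is an assumption, and _jvm/_jsc are internal PySpark accessors:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    path = "abfss://container@account.dfs.core.windows.net/daily/"  # hypothetical

    jvm = spark._jvm
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(
        jvm.java.net.URI(path), spark._jsc.hadoopConfiguration())

    # Only read when the input folder actually exists; otherwise skip cleanly.
    if fs.exists(jvm.org.apache.hadoop.fs.Path(path)):
        df = spark.read.parquet(path)
    else:
        print(f"Input folder missing, skipping this run: {path}")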

pyspark get element from array Column of struct based on condition

I have a Spark df with the following schema:
 |-- col1: string
 |-- col2: string
 |-- customer: struct
 |    |-- smt: string
 |    |-- attributes: array (null
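
A minimal sketch using the filter higher-order function (Spark 2.4+); the attribute field names and the 'color' condition are assumptions standing in for the real ones:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("a", "b", ("x", [("color", "red"), ("size", "L")]))],
        "col1 string, col2 string, "
        "customer struct<smt:string, attributes:array<struct<key:string,value:string>>>")

    # filter() keeps the struct elements matching the predicate; [0] takes
    # the first match (null if nothing matched).
    picked = df.withColumn(
        "color_attr",
        F.expr("filter(customer.attributes, x -> x.key = 'color')[0]"))
    picked.select("color_attr.value").show()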

How to add custom method to Pyspark Dataframe class by inheritance

I am trying to inherit the DataFrame class and add additional custom methods as below, so that I can chain fluently and also ensure all methods refer to the same dataf
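
A minimal sketch of the inheritance approach, assuming a recent Spark (3.3+, where DataFrame.__init__ accepts a SparkSession); the key trick is re-wrapping every return value so chained calls stay on the subclass. Names are hypothetical.

    from pyspark.sql import DataFrame, SparkSession, functions as F

    class CustomDF(DataFrame):
        def __init__(self, df):
            # Wrap an existing DataFrame by reusing its JVM handle.
            super().__init__(df._jdf, df.sparkSession)

        def with_constant(self, name, value):
            # Re-wrap so the next chained call also returns CustomDF.
            return CustomDF(self.withColumn(name, F.lit(value)))

    spark = SparkSession.builder.getOrCreate()
    CustomDF(spark.range(3)).with_constant("tag", "x").with_constant("n", 1).show()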

Databricks: Z-order vs partitionBy

I am learning Databricks and I have some questions about Z-order and partitionBy. When I read about the two features they sound pretty similar. Both function
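
A minimal Delta Lake sketch contrasting the two, with hypothetical table path and column names: partitionBy fixes a folder layout at write time, while Z-ordering reclusters data inside existing files after the fact (OPTIMIZE ... ZORDER BY is Delta/Databricks SQL):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("2024-01-01", 42)], ["event_date", "user_id"])

    # Coarse, low-cardinality column -> one physical folder per value.
    df.write.format("delta").partitionBy("event_date").save("/delta/events")

    # High-cardinality column -> co-locate rows for data skipping, post hoc.
    spark.sql("OPTIMIZE delta.`/delta/events` ZORDER BY (user_id)")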

How do I add a new date column with constant value to a Spark DataFrame (using PySpark)?

I want to add a column with a default date ('1901-01-01') to an existing DataFrame using PySpark. I used the code snippet below: from pyspark.sql import functions a
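
A minimal sketch of the usual approach: lit() creates the constant and to_date() makes it a proper DateType column instead of a string (the new column name is an assumption):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1,), (2,)], ["id"])

    # to_date(lit(...)) yields DateType; lit(...) alone would stay a string.
    df = df.withColumn("default_date", F.to_date(F.lit("1901-01-01")))
    df.printSchema()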

When is Spark groupByKey preferred over reduceByKey?

My dataset is pretty big and I would like to understand when groupByKey makes sense over reduceByKey.
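
A minimal RDD sketch of the trade-off: reduceByKey combines values map-side before the shuffle, so for large datasets far less data crosses the network; groupByKey ships every value and is worth it only when you truly need all values per key:

    from operator import add
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])

    # Pre-aggregates per partition, then shuffles only the partial sums.
    print(rdd.reduceByKey(add).collect())

    # Shuffles every (key, value) pair; use only when all values are needed.
    print(rdd.groupByKey().mapValues(list).collect())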

Reading file from s3 in pyspark using org.apache.hadoop:hadoop-aws

I am trying to read files from S3 using hadoop-aws. The command used to run the code is mentioned below. Please help me resolve this and understand what I am doing wrong
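
A minimal sketch of a session wired for s3a; the hadoop-aws version must match the Hadoop build Spark ships with (the version and bucket below are assumptions), and credentials here come from environment variables:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             # Must match your Hadoop version; 3.3.4 is an assumption.
             .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4")
             .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                     "com.amazonaws.auth.EnvironmentVariableCredentialsProvider")
             .getOrCreate())

    df = spark.read.csv("s3a://my-bucket/data.csv", header=True)  # hypothetical bucket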