Category "apache-spark"

Another SparkContext is being constructed Eror

I am using spark, and got such an error which try to enter 'pyspark' in windows command prompt. I try to install the pyspark on my windows with this tutorial (h

Java.io.FileNotFoundException : YAML file does not exists

When I am submitting the spark job from terminal I am getting below error that file does not exists. Although I have already placed config file to my local. spa

Apache Spark - Is it possible to use a Dependency Injection Mechanism

Is there any possibility using a framework for enabling / using Dependency Injection in a Spark Application? Is it possible to use Guice, for instance? If so,

spark ETL and spark thrift server

Some details: Spark SQL (version 3.2.1) Driver: Hive JDBC (version 2.3.9) ThriftCLIService: Starting ThriftBinaryCLIService on port 10000 with 5...500 worker t

Write single CSV file using spark-csv

I am using https://github.com/databricks/spark-csv , I am trying to write a single CSV, but not able to, it is making a folder. Need a Scala function which wil

Why does calling cache take a long time on a Spark Dataset?

I'm loading large datasets and then caching them for reference throughout my code. The code looks something like this: val conversations = sqlContext.read .f

Spark-submit not working when application jar is in hdfs

I'm trying to run a spark application using bin/spark-submit. When I reference my application jar inside my local filesystem, it works. However, when I copied m

Merging the similar column names while joining two dataframes using pyspark

In the below program ,the duplicate columns are getting created while joining two dataframes in pyspark . >>> spark = SparkSession.builder.appName("Jo

Spark read from S3 working, but I am unable to write using the same session [duplicate]

I am using a pyspark test script to read and write files to S3. Here is how I initialize the spark-session: import findspark from pyspark.sql

Error when run jar Exception in thread "main" java.lang.NoSuchMethodError scala.Predef$.$conforms()Lscala/Predef$$less$colon$less;

I work on spark application using (spark 2.0.0 & scala 2.11.8) and the application works fine within intellij Idea environment. I've extracted application a

Synapse spark job fails as input folder does not exist

How to do exception handling for file reading. For example, I have a daily job that will run at 8:00 am. It reads files from Azure data lake storage(Gen 2). The

pyspark get element from array Column of struct based on condition

I have a spark df with the following schema: |-- col1 : string |-- col2 : string |-- customer: struct | |-- smt: string | |-- attributes: array (null

How to add custom method to Pyspark Dataframe class by inheritance

I am trying to inherit DataFrame class and add additional custom methods as below so that i can chain fluently and also ensure all methods refers the same dataf

Databricks: Z-order vs partitionBy

I am learning Databricks and I have some questions about z-order and partitionBy. When I am reading about both functions it sounds pretty similar. Both function

How do I add a new date column with constant value to a Spark DataFrame (using PySpark)?

I want to add a column with a default date ('1901-01-01') with exiting dataframe using pyspark? I used below code snippet from pyspark.sql import functions a

When is spark groupby preferred over reducebykey?

My dataset is pretty big and I would like to understand when groupby makes sense over reducebykey?

Reading file from s3 in pyspark using org.apache.hadoop:hadoop-aws

Trying to read files from s3 using hadoop-aws, The command used to run code is mentioned below. please help me resolve this and understand what I am doing wrong

How do I get the last item from a list using pyspark?

Why does column 1st_from_end contain null: from pyspark.sql.functions import split df = sqlContext.createDataFrame([('a b c d',)], ['s',]) df.select( split(d

How to write Spark SQL batch job results to the Apache Druid?

I want to write Spark batch results data to the Apache Druid. I know Druid has native batch ingestions such as index_parallel. Druid runs Map-Reduce jobs in the

Need to load data from Hadoop to Druid after applying transformations. If I use Spark, can we load data from Spark RDD or dataframe to Druid directly?

I have data present in hive tables. I want to apply bunch of transformations before loading that data into druid. So there are ways but I'm not sure about those