I am using Spark, and I get an error when I try to enter 'pyspark' in the Windows command prompt. I tried to install PySpark on Windows following this tutorial (h
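A common cause on Windows is that SPARK_HOME/HADOOP_HOME are not set (and winutils.exe is missing) before PySpark is imported. A minimal sketch of wiring that up, assuming a findspark-based setup; the install paths below are hypothetical:

```python
import os

# Hypothetical paths -- adjust to your own install locations.
os.environ["SPARK_HOME"] = r"C:\spark\spark-3.2.1-bin-hadoop3.2"
os.environ["HADOOP_HOME"] = r"C:\hadoop"  # must contain bin\winutils.exe

import findspark
findspark.init()  # adds pyspark to sys.path using SPARK_HOME

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("smoke-test").getOrCreate()
print(spark.version)  # if this prints, the shell problem is environmental
```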
When I submit the Spark job from the terminal, I get the error below saying that the file does not exist, although I have already placed the config file on my local machine. spa
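One hedged fix for this class of error: ship the config file with --files so every node gets a copy, then resolve it through SparkFiles rather than a hard-coded local path. The file name and paths here are hypothetical:

```python
# Hypothetical launch command:
#   spark-submit --files /local/path/app.conf my_job.py
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# SparkFiles.get resolves the local copy Spark distributed to this node.
conf_path = SparkFiles.get("app.conf")
with open(conf_path) as f:
    print(f.read())
```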
Is there any possibility of using a framework to enable Dependency Injection in a Spark application? Is it possible to use Guice, for instance? If so,
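Guice itself is a JVM library; to keep the examples in this list in one language, here is a hedged Python sketch of the core constraint any DI approach must respect in Spark: wire dependencies on the driver, and capture only small, serializable state in closures shipped to executors. The Enricher class is a hypothetical stand-in for an injected component:

```python
from pyspark.sql import SparkSession

class Enricher:
    """A 'service' wired on the driver (stand-in for an injected component)."""
    def __init__(self, suffix: str):
        self.suffix = suffix  # plain, picklable state only

    def enrich(self, value: str) -> str:
        return value + self.suffix

spark = SparkSession.builder.getOrCreate()
enricher = Enricher("!")  # "injection" happens driver-side

# Only the small, serializable object is captured by the closure;
# connections, sockets, etc. would have to be created inside the task instead.
rdd = spark.sparkContext.parallelize(["a", "b"])
print(rdd.map(enricher.enrich).collect())
```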
Some details:
Spark SQL (version 3.2.1)
Driver: Hive JDBC (version 2.3.9)
ThriftCLIService: Starting ThriftBinaryCLIService on port 10000 with 5...500 worker t
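For reference, a minimal sketch of connecting to a Spark Thrift Server listening on the port 10000 from that log line. This assumes the PyHive client (not part of Spark itself) and a hypothetical hostname/username:

```python
from pyhive import hive  # assumption: PyHive installed (pip install pyhive)

# Connect to the ThriftBinaryCLIService from the log line above.
conn = hive.connect(host="localhost", port=10000, username="spark")
cur = conn.cursor()
cur.execute("SHOW DATABASES")
print(cur.fetchall())
```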
I am using https://github.com/databricks/spark-csv and am trying to write a single CSV file, but I am not able to: it creates a folder instead. I need a Scala function which wil
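The question asks for Scala, but the pattern is the same in PySpark (used for the examples in this list): collapse the DataFrame to one partition so Spark emits a single part file. The output path is hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# coalesce(1) forces a single partition, hence a single part-*.csv file.
df.coalesce(1).write.mode("overwrite").option("header", "true").csv("/tmp/single_csv")
# Note: the path is still a directory; move/rename the lone part file
# afterwards if a bare file name is required.
```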
I'm loading large datasets and then caching them for reference throughout my code. The code looks something like this:
val conversations = sqlContext.read
  .f
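For comparison, a hedged sketch of the usual cache lifecycle in PySpark; the dataset path and format are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
conversations = spark.read.parquet("/data/conversations")  # hypothetical path

conversations.cache()      # mark for caching (lazy; nothing happens yet)
conversations.count()      # first action materializes the cache
# ... reuse `conversations` across the job; reads now hit the memory/disk cache ...
conversations.unpersist()  # release executor cache when no longer needed
```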
I'm trying to run a Spark application using bin/spark-submit. When I reference my application jar on my local filesystem, it works. However, when I copied m
In the program below, duplicate columns are created while joining two DataFrames in PySpark. >>> spark = SparkSession.builder.appName("Jo
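A minimal sketch of the usual fix: join on a column name (or list of names) instead of an expression, so the join key appears only once. Column names here are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("JoinDemo").getOrCreate()
df1 = spark.createDataFrame([(1, "a")], ["id", "left_val"])
df2 = spark.createDataFrame([(1, "b")], ["id", "right_val"])

# Joining on an expression keeps both id columns:
dup = df1.join(df2, df1.id == df2.id)        # -> id, left_val, id, right_val

# Joining on the column name deduplicates the key:
clean = df1.join(df2, on="id", how="inner")  # -> id, left_val, right_val
clean.show()
```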
I am using a PySpark test script to read and write files to S3. Here is how I initialize the Spark session:
import findspark
from pyspark.sql
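A hedged sketch of a session wired for s3a access: the hadoop-aws package and credential settings are supplied on the builder. The version (assumed 3.3.4 here), keys, and bucket are placeholders to verify against your Hadoop build:

```python
import findspark
findspark.init()
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("s3-test")
    # hadoop-aws must match the Hadoop version Spark was built against (assumed here)
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4")
    .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")
    .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")
    .getOrCreate()
)
df = spark.read.csv("s3a://my-bucket/input.csv", header=True)  # hypothetical bucket
```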
I work on a Spark application (Spark 2.0.0 & Scala 2.11.8), and the application works fine within the IntelliJ IDEA environment. I've extracted the application a
How do I do exception handling for file reading? For example, I have a daily job that runs at 8:00 am. It reads files from Azure Data Lake Storage (Gen 2). The
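A minimal sketch, assuming the failure mode is a missing input path: PySpark surfaces that as an AnalysisException, which can be caught around the read. The abfss path is hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.getOrCreate()
path = "abfss://container@account.dfs.core.windows.net/daily/2023-01-01/"  # hypothetical

try:
    df = spark.read.parquet(path)
except AnalysisException as e:
    # Typically "Path does not exist" when the daily file hasn't landed yet.
    print(f"Skipping run, input not ready: {e}")
else:
    df.show()
```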
I have a spark df with the following schema:
 |-- col1: string
 |-- col2: string
 |-- customer: struct
 |    |-- smt: string
 |    |-- attributes: array (null
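The question text is cut off, but as a hedged illustration of working with a schema like this one: nested struct fields can be selected with dot paths, and the array flattened with explode_outer (df is assumed to already have the schema above):

```python
from pyspark.sql import functions as F

flat = df.select(
    "col1",
    "col2",
    F.col("customer.smt").alias("smt"),                     # nested struct field
    F.explode_outer("customer.attributes").alias("attribute"),  # one row per element
)
flat.printSchema()
```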
I am trying to inherit from the DataFrame class and add additional custom methods as below, so that I can chain fluently and also ensure all methods refer to the same dataf
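Subclassing is awkward because most DataFrame methods return a plain DataFrame, not the subclass. A commonly suggested alternative, sketched here under that assumption, is chaining plain functions with DataFrame.transform (PySpark 3.0+); the function names are hypothetical:

```python
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F

def with_greeting(df: DataFrame) -> DataFrame:
    return df.withColumn("greeting", F.lit("hello"))

def with_upper(col_name: str):
    def inner(df: DataFrame) -> DataFrame:
        return df.withColumn(f"{col_name}_upper", F.upper(F.col(col_name)))
    return inner

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("alice",)], ["name"])

# Each step returns a new DataFrame, so the chain stays fluent.
result = df.transform(with_greeting).transform(with_upper("name"))
result.show()
```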
I am learning Databricks and I have some questions about z-order and partitionBy. When I read about both features, they sound pretty similar. Both function
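A hedged sketch of how the two differ in practice (assumes a Databricks/Delta Lake environment, an existing DataFrame df, and hypothetical paths/columns): partitionBy fixes a directory-level layout at write time, while z-ordering reclusters the data files inside each partition afterwards for data skipping.

```python
# partitionBy: coarse, directory-level split chosen at write time.
df.write.format("delta").partitionBy("event_date").save("/mnt/events")

# Z-order: co-locates values of a column within each partition's files,
# applied after the write via Databricks SQL:
spark.sql("OPTIMIZE delta.`/mnt/events` ZORDER BY (customer_id)")
```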
I want to add a column with a default date ('1901-01-01') to an existing DataFrame using PySpark. I used the code snippet below: from pyspark.sql import functions a
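A minimal working version of that idea, assuming the goal is a proper DateType column rather than a string:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("x",), ("y",)], ["val"])

# lit() alone gives a string; to_date (or a cast) makes it a real date column.
df = df.withColumn("default_date", F.to_date(F.lit("1901-01-01"), "yyyy-MM-dd"))
df.printSchema()  # default_date: date
```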
My dataset is pretty big, and I would like to understand when groupByKey makes sense over reduceByKey.
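A minimal sketch of the trade-off: reduceByKey combines values map-side before the shuffle, so on large datasets it usually moves far less data than groupByKey, which ships every (key, value) pair across the network:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])

# reduceByKey: partial sums are computed on each partition before shuffling.
print(rdd.reduceByKey(lambda x, y: x + y).collect())  # [('a', 2), ('b', 1)]

# groupByKey: every pair is shuffled first, then aggregated.
print(rdd.groupByKey().mapValues(sum).collect())      # [('a', 2), ('b', 1)]
```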
I am trying to read files from S3 using hadoop-aws; the command used to run the code is below. Please help me resolve this and understand what I am doing wrong.
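For context, a hedged sketch of a working invocation pattern: the hadoop-aws artifact is supplied at submit time and the file is read through the s3a scheme. The version and bucket are assumptions to check against your Spark distribution:

```python
# Hypothetical launch command:
#   spark-submit --packages org.apache.hadoop:hadoop-aws:3.3.4 read_s3.py
# (the hadoop-aws version must match the Hadoop libraries bundled with Spark)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-s3").getOrCreate()
df = spark.read.text("s3a://my-bucket/path/data.txt")  # note s3a://, not s3://
df.show(5)
```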
Why does column 1st_from_end contain null?
from pyspark.sql.functions import split
df = sqlContext.createDataFrame([('a b c d',)], ['s',])
df.select( split(d
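The likely culprit: indexing the array produced by split() with a negative literal returns null in Spark. element_at (Spark 2.4+) does support negative positions counted from the end; a sketch:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, element_at, col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('a b c d',)], ['s'])

# split(...)[-1] yields null; element_at counts -1 from the end instead.
df.select(
    element_at(split(col('s'), ' '), -1).alias('1st_from_end')
).show()  # -> d
```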
I want to write Spark batch result data to Apache Druid. I know Druid has native batch ingestion such as index_parallel. Druid runs Map-Reduce jobs in the
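One hedged pattern (not a Druid connector, just a sketch): land the Spark output as files in deep storage and point an index_parallel ingestion spec at that directory. Paths and the datasource name are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
result = spark.range(10).withColumnRenamed("id", "metric")  # stand-in for batch results

# Write newline-delimited JSON where Druid's index_parallel task can reach it.
result.write.mode("overwrite").json("hdfs:///staging/druid/my_datasource/")
# A separate index_parallel spec, submitted to the Druid Overlord API,
# would set its inputSource to that directory.
```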
I have data present in Hive tables, and I want to apply a bunch of transformations before loading that data into Druid. There are ways to do this, but I'm not sure about those
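A hedged sketch of the Spark-side half of that pipeline: read the Hive table with Hive support enabled, transform, and stage the result for Druid's native batch ingestion. Table, columns, and paths are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder.appName("hive-to-druid")
    .enableHiveSupport()  # lets spark.sql read tables from the Hive metastore
    .getOrCreate()
)

df = spark.sql("SELECT * FROM db.events")                  # hypothetical table
transformed = df.withColumn("day", F.to_date("event_ts"))  # example transformation

# Stage for Druid's native batch ingestion (e.g. index_parallel).
transformed.write.mode("overwrite").parquet("hdfs:///staging/druid/events/")
```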