Category "apache-spark"

pyspark get element from array Column of struct based on condition

How to add custom method to Pyspark Dataframe class by inheritance

I am trying to inherit DataFrame class and add additional custom methods as below so that i can chain fluently and also ensure all methods refers the same dataf

Databricks: Z-order vs partitionBy

I am learning Databricks and I have some questions about z-order and partitionBy. When I am reading about both functions it sounds pretty similar. Both function

How do I add a new date column with constant value to a Spark DataFrame (using PySpark)?

I want to add a column with a default date ('1901-01-01') with exiting dataframe using pyspark? I used below code snippet from pyspark.sql import functions a

When is spark groupby preferred over reducebykey?

My dataset is pretty big and I would like to understand when groupby makes sense over reducebykey?

Reading file from s3 in pyspark using org.apache.hadoop:hadoop-aws

Trying to read files from s3 using hadoop-aws, The command used to run code is mentioned below. please help me resolve this and understand what I am doing wrong

How do I get the last item from a list using pyspark?

Why does column 1st_from_end contain null: from pyspark.sql.functions import split df = sqlContext.createDataFrame([('a b c d',)], ['s',]) df.select( split(d

How to write Spark SQL batch job results to the Apache Druid?

I want to write Spark batch results data to the Apache Druid. I know Druid has native batch ingestions such as index_parallel. Druid runs Map-Reduce jobs in the

Need to load data from Hadoop to Druid after applying transformations. If I use Spark, can we load data from Spark RDD or dataframe to Druid directly?

I have data present in hive tables. I want to apply bunch of transformations before loading that data into druid. So there are ways but I'm not sure about those

How do we do Spark Dataframe testing using JUnit?

We are trying to build an integration test suite using JUnit. Our pipeline (built in Spark using Scala) gives us DataFrames as output, we plan to compare them a

How to correctly import pyspark.sql.functions?

from pyspark.sql.functions import isnan, when, count, sum , etc... It is very tiresome adding all of it. Is there a way to import all of it at once?

Spark Streaming "ERROR JobScheduler: error in job generator"

I build a spark Streaming application to keep receiving messages from Kafka and then write them into a table HBase. This app runs pretty good for first 25 mins

How to Override log4j with log4j2 version to resolve "SocketServer class vulnerable to deserialization" for apache-core_2.12 version

How to Override log4j version 1.2.17 with log4j-core 2.16.0 version to resolve "SocketServer class vulnerable to deserialization" for spark-core_2.12 binaries.

Google cloud dataproc cluster created with an environment.yaml with a jupyter resource but environment not available as a jupyter kernel

I have created a new dataproc cluster with a specific environment.yaml. Here is the command that I have used to create that cluster: gcloud dataproc clusters cr

Category "apache-spark"

pyspark get element from array Column of struct based on condition

How to add custom method to Pyspark Dataframe class by inheritance

Databricks: Z-order vs partitionBy

How do I add a new date column with constant value to a Spark DataFrame (using PySpark)?

When is spark groupby preferred over reducebykey?

Reading file from s3 in pyspark using org.apache.hadoop:hadoop-aws

How do I get the last item from a list using pyspark?

How to write Spark SQL batch job results to the Apache Druid?

Need to load data from Hadoop to Druid after applying transformations. If I use Spark, can we load data from Spark RDD or dataframe to Druid directly?

How do we do Spark Dataframe testing using JUnit?

How to correctly import pyspark.sql.functions?

Spark Streaming "ERROR JobScheduler: error in job generator"

How to Override log4j with log4j2 version to resolve "SocketServer class vulnerable to deserialization" for apache-core_2.12 version

Google cloud dataproc cluster created with an environment.yaml with a jupyter resource but environment not available as a jupyter kernel

Getting java.lang.NoSuchMethodError: org.yaml.snakeyaml.Yaml.<init> while running spark based spring boot application

How do you create merge_asof functionality in PySpark?

How to apply the describe function after grouping a PySpark DataFrame?

Spark streaming jdbc read the stream as and when data comes - Data source jdbc does not support streamed reading

How to use a Scala class inside Pyspark

Category "apache-spark"

Other Categories