Category "apache-spark"

Spark 3.0 is much slower to read JSON files than Spark 2.4

I have a large number of JSON files that Spark 2.4 can read in 36 seconds, but Spark 3.0 takes almost 33 minutes to read the same files. On closer analysis, it looks like Spark

PySpark throwing error while trying to read Parquet

I am new to PySpark. While trying to read a Parquet file through PySpark, I get the error below. I have tried various things, like reinstalling the JRE and jd

How to convert from Pandas' DatetimeIndex to DataFrame in PySpark?

I have the following code: # Get the min and max dates minDate, maxDate = df2.select(f.min("MonthlyTransactionDate"), f.max("MonthlyTransactionDate")).first()

Pyspark Fetching MongoDB records using MongoConnector and Where Clause

I'm trying to read MongoDB using this guide df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load() df = df.select(['my_cols']) df = df.where('date

Calling Kubernetes Spark Operator with Java api

There are a good number of examples of creating Spark jobs using the Kubernetes Spark Operator and simply submitting a request with the following: kubectl apply -f spa

Spark Streaming takes a long time to read from Kafka

I built a cluster using CDH5.14.2 with 5 nodes, each with 130G of memory and 40 CPU cores. I built a Spark Streaming application to read from multiple

Trigger IF statement only when two Spark DataFrames meet the conditions

I have two identical Spark DataFrames. They have the same columns. I am trying to create an if-else statement in one line but couldn't find a better way to do it.

KernelRestarter: restart failed in Jupyter, kernel died

[I 10:43:53.627 NotebookApp] Serving notebooks from local directory: /opt/soft/recommender/jupyter [I 10:43:53.627 NotebookApp]

Cannot connect to Cassandra in spark-shell

I am trying to connect to a remote Cassandra cluster in my spark-shell using the Spark Cassandra connector, but it's throwing some unusual errors. I do the usual

Py4JJavaError in an Azure Databricks notebook pipeline

I have a curious issue when launching a Databricks notebook from a caller notebook through dbutils.notebook.run (I am working in Azure Databricks). One intere

Spark DataFrame is Untyped vs DataFrame has schema?

I am a beginner with Spark. While reading about DataFrames, I have often come across the following two statements: 1) DataFrame is untyped 2) DataFrame has sch

Provider com.google.cloud.spark.bigquery.BigQueryRelationProvider could not be instantiated while reading from bigquery in Jupyter lab

I have followed this post pyspark error reading bigquery: java.lang.ClassNotFoundException: org.apache.spark.internal.Logging$class and followed the resolution

Unable to register with external shuffle server. Failed to connect on standalone Spark cluster

Spark SQL - org.apache.spark.sql.AnalysisException

Programmatically add/remove executors to a Spark Session

sbt package is trying to download a package whose path does not exist

Why do I get TypeError: cannot pickle '_thread.RLock' object when using PySpark

Compare two DataFrames in PySpark

Join two DataFrames using the closest timestamp in PySpark

Setup Apache Sedona on EMR
