Category "apache-spark"

How to find position of substring column in a another column using PySpark?

If I have a PySpark DataFrame with two columns, text and subtext, where subtext is guaranteed to occur somewhere within text. How would I calculate the positio

Spark SQL error : org.apache.spark.sql.catalyst.parser.ParseException: extraneous input '$' expecting

I am forming a query in a String Builder like below : println(dataQuery) Execution started at 2019-10-31 02:58:24.006019 PST res245: String = " SELECT transac

Efficient way to read specific columns from parquet file in spark

What is the most efficient way to read only a subset of columns in spark from a parquet file that has many columns? Is using spark.read.format("parquet").load(

Modify date (month) in spark date column based on condition

I would like to modify my date column in spark df to subtract 1 month only if certain months appear. I.e. only if date is yyyy-07-31 or date is yyyy-04-30 chang

Remove special character from a column in dataframe

I am trying to remove a special character (å) from a column in a dataframe. My data looks like: ClientID,PatientID AR0001å,DH_HL704221157198295_9

Remove special character from a column in dataframe

I am trying to remove a special character (å) from a column in a dataframe. My data looks like: ClientID,PatientID AR0001å,DH_HL704221157198295_9

How do I interpret Input size / records in Spark Stage UI

I'm looking at the Spark UI (Spark v1.6.0) for a stage of a job I'm currently running and I don't understand how to interpret what its telling me: The number o

How to create Dataframe form presto db table of Array Data type column using spark

I am trying to create spark Dataframe from presto db table which has few columns as Array DataType. I tried multiple ways but I am getting same exception java.s

Spark Catalog w/ AWS Glue: database not found

Ive created an EMR cluster with the Glue Data catalog. When I invoke the spark-shell, I am able to successfully list tables stored within a Glue database via s

how to use bm25 in spark

I have more than 1 million documents to search, and more than 100,000 keywords. Each keyword needs to search 10 most similar documents in the offline way. So ho

How to create a CSV file with PySpark?

I have a short question about pyspark write. read_jdbc = spark.read \ .format("jdbc") \ .option("url", "jdbc:postgresql:dbserver") \ .option("dbtabl

When to cache a DataFrame?

My question is - when should I do dataframe.cache() and when it's useful? Also, in my code should I cache the dataframes in the commented lines? Note: My datafr

Error while installing Spark on Google Colab

I am getting error while installing spark on Google Colab. It says tar: spark-2.2.1-bin-hadoop2.7.tgz: Cannot open: No such file or directory tar: Error

spark-streaming-kafka-0-8 vs spark-streaming-kafka-0-10

I am a new beginner in the big data field, I need to make a demo which streams data from Kafka topic using spark stream then make some aggregation and filtering

Unable to start spark-shell failing to submit spark-submit

I am trying to submit spark-submit but its failing with as weird message. Error: Could not find or load main class org.apache.spark.launcher.Main /opt/spark/b

Spark executor metrics don't reach prometheus sink

Circumstances: I have read through these:https://spark.apache.org/docs/3.1.2/monitoring.html https://dzlab.github.io/bigdata/2020/07/03/spark3-monitoring-1/ ver

No module named 'pyspark.streaming.kafka' even with older spark version

In another similar question, they hint 'install older spark 2.4.5.' EDIT: the solution from above link says 'install spark 2.4.5 and it does have kafkautils. Bu

dataframe Spark scala explode json array

Let's say I have a dataframe which looks like this: +--------------------+--------------------+--------------------------------------------------------------+

How to escape ( parentheses in Spark Scala?

I am trying to replace parentheses in a string (i.e. column names). It is working fine with white spaces but not with ( parentheses. I tried """, \(, \\( but I

Spark 3.0 is much slower to read json files than Spark 2.4

I have large amount of json files that Spark can read in 36 seconds but Spark 3.0 takes almost 33 minutes to read the same. On closer analysis, looks like Spark