Category "apache-spark"

Installing Mesos on Ubuntu 20.04 causes Makefile issue

I was trying to install the latest version of Mesos (1.9.0) on Ubuntu 20.04 using a Dockerfile. FROM ubuntu:20.04 ENV MESOS_VERSION 1.9.0 ENV MESOS_ARTIFACT_FILENAME

Spark in SBT console: "Could not find spark-version-info.properties"

I'm trying to instantiate a SparkContext inside an SBT console, using the following Scala commands: import org.apache.spark.SparkConf import org.apache.spark.Spa

Add UUID to spark dataset [duplicate]

I am trying to add a UUID column to my dataset. getDataset(Transaction.class)).withColumn("uniqueId", functions.lit(UUID.randomUUID().toStrin

Spark Scala DataFrame UDF returning rows

Say I have a DataFrame which contains a column (called colA) which is a Seq of Row. I want to append a new field to each record of colA. (And the new field

How to quickly check if row exists in PySpark Dataframe?

I have a PySpark dataframe like this: +------+------+ | A| B| +------+------+ | 1| 2| | 1| 3| | 2| 3| | 2| 5| +------+--

Why is Spark 100 times faster than Hadoop MapReduce?

Why is Spark faster than Hadoop MapReduce? As per my understanding, if Spark is faster due to in-memory processing, then Hadoop also loads data into RAM, then i

Spark job as a web service?

A peer of mine has created code that opens a RESTful API web service within an interactive Spark job. The intent of our company is to use his code as a means o

How to convert timestamp column of Spark Dataframe to string column

I want to convert all TIMESTAMP columns of a Spark DataFrame into String columns. Could anybody say how to do that automatically for each DataFrame? The position of

How to find position of substring column in another column using PySpark?

I have a PySpark DataFrame with two columns, text and subtext, where subtext is guaranteed to occur somewhere within text. How would I calculate the positio

Spark SQL error: org.apache.spark.sql.catalyst.parser.ParseException: extraneous input '$' expecting

I am forming a query in a StringBuilder like below: println(dataQuery) Execution started at 2019-10-31 02:58:24.006019 PST res245: String = " SELECT transac

Efficient way to read specific columns from Parquet file in Spark

What is the most efficient way to read only a subset of columns in Spark from a Parquet file that has many columns? Is using spark.read.format("parquet").load(

Modify date (month) in spark date column based on condition

I would like to modify my date column in spark df to subtract 1 month only if certain months appear. I.e. only if date is yyyy-07-31 or date is yyyy-04-30 chang

Remove special character from a column in dataframe

I am trying to remove a special character (å) from a column in a dataframe. My data looks like: ClientID,PatientID AR0001å,DH_HL704221157198295_9

How do I interpret Input size / records in Spark Stage UI

I'm looking at the Spark UI (Spark v1.6.0) for a stage of a job I'm currently running and I don't understand how to interpret what it's telling me: The number o

How to create a DataFrame from a Presto DB table with an Array data type column using Spark

I am trying to create a Spark DataFrame from a Presto DB table which has a few columns of Array data type. I tried multiple ways but I am getting the same exception java.s

Spark Catalog w/ AWS Glue: database not found

I've created an EMR cluster with the Glue Data Catalog. When I invoke the spark-shell, I am able to successfully list tables stored within a Glue database via s

How to use BM25 in Spark

I have more than 1 million documents to search, and more than 100,000 keywords. For each keyword, I need to find the 10 most similar documents offline. So ho

How to create a CSV file with PySpark?

I have a short question about PySpark write. read_jdbc = spark.read \ .format("jdbc") \ .option("url", "jdbc:postgresql:dbserver") \ .option("dbtabl

When to cache a DataFrame?

My question is: when should I call dataframe.cache(), and when is it useful? Also, in my code should I cache the dataframes in the commented lines? Note: My datafr