Category "pyspark"

Does the computeSVD function use MapReduce in PySpark?

Does computeSVD() use map/reduce, since it is a predefined function? I couldn't find the code of the function. from pyspark.mllib.linalg import Vectors from py…
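For reference, computeSVD() lives on mllib's distributed RowMatrix, and the actual computation runs inside the JVM, where the Gramian is aggregated across partitions with map/reduce-style operations (treeAggregate); it is not Python-level map()/reduce(). A minimal sketch, assuming an active SparkSession named spark:

```python
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix

rows = spark.sparkContext.parallelize([
    Vectors.dense(1.0, 2.0, 3.0),
    Vectors.dense(4.0, 5.0, 6.0),
    Vectors.dense(7.0, 8.0, 9.0),
])
mat = RowMatrix(rows)

# the Python call only delegates to the Scala implementation
svd = mat.computeSVD(2, computeU=True)
print(svd.s)        # singular values
svd.U.rows.take(2)  # left singular vectors, still a distributed RowMatrix
```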

Time series with Delta time travel in Databricks

I'm storing the prices of products in a Delta table. The schema of the table is like this: id | price | updated, with rows such as 1 | 3 | 2022-03-21 and 2 | 4 | 2022-03-20.
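One way to read the table as of an earlier point in time is SQL time travel. A minimal sketch, assuming the Delta table is registered as prices:

```python
# read the table as it looked at the given timestamp
hist = spark.sql("SELECT * FROM prices TIMESTAMP AS OF '2022-03-20'")
hist.show()

# list the versions/timestamps that are available for time travel
spark.sql("DESCRIBE HISTORY prices").show(truncate=False)
```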

Cast issue with AWS Glue 3.0 - PySpark

I'm using Glue 3.0. data = [("Java", "6241499.16943521594684385382059800664452")] rdd = spark.sparkContext.parallelize(data) df = rdd.toDF() df.show() df.select(…
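That literal has 39 significant digits (7 integer + 32 fractional), one more than Spark's decimal precision cap of 38, which is typically why such casts lose digits or come back null. A hedged sketch of a lossy but non-null cast:

```python
from pyspark.sql.functions import col
from pyspark.sql.types import DecimalType

data = [("Java", "6241499.16943521594684385382059800664452")]
df = spark.createDataFrame(data, ["lang", "value"])

# DecimalType(38, 31) keeps all 7 integer digits and rounds the
# fraction to 31 places instead of producing null
df.select(col("value").cast(DecimalType(38, 31)).alias("value_dec")) \
  .show(truncate=False)
```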

How to split a CSV comma-separated value into single rows in a new column using PySpark

I have a log file in CSV which has a column containing a list of file paths separated by commas. I want to split those file paths into new rows using PySpark (or Exce…
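The usual pattern is split() followed by explode(). A minimal sketch; the column name filepaths and the sample row are hypothetical:

```python
from pyspark.sql.functions import col, explode, split

df = spark.createDataFrame(
    [("job1", "/data/a.csv,/data/b.csv,/data/c.csv")],
    ["job", "filepaths"],
)

# split the comma-separated string into an array, then give each
# element its own row
result = df.withColumn("filepath", explode(split(col("filepaths"), ",")))
result.select("job", "filepath").show(truncate=False)
```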

Computing number of business days between start/end columns

I have two DataFrames: facts (columns: data, start_date and end_date) and holidays (column: holiday_date). What I want is a way to produce another DataFrame that has…
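One approach: expand each start/end range into days with sequence(), drop weekends with dayofweek(), and anti-join the holidays. A sketch with made-up rows (sequence() needs Spark 2.4+):

```python
from pyspark.sql.functions import col, dayofweek, explode, sequence

facts = spark.createDataFrame(
    [("a", "2022-03-01", "2022-03-10")], ["data", "start_date", "end_date"]
).withColumn("start_date", col("start_date").cast("date")) \
 .withColumn("end_date", col("end_date").cast("date"))

holidays = spark.createDataFrame([("2022-03-08",)], ["holiday_date"]) \
    .withColumn("holiday_date", col("holiday_date").cast("date"))

# one row per calendar day in each range
days = facts.withColumn("day", explode(sequence("start_date", "end_date")))

# keep Mon-Fri (dayofweek: 1 = Sunday, 7 = Saturday), then drop holidays
business = days.where(~dayofweek("day").isin(1, 7)) \
    .join(holidays, days["day"] == holidays["holiday_date"], "left_anti")

business.groupBy("data", "start_date", "end_date").count() \
    .withColumnRenamed("count", "business_days").show()
```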

Insert a Spark DataFrame into a partitioned table

I have seen methods for inserting into a Hive table, such as insertInto(table_name, overwrite=True), but I couldn't work out how to handle the scenario below. For…
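For partitioned Hive tables, insertInto() combined with dynamic partition overwrite is a common answer. A sketch; the table sales partitioned by dt is hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# "dynamic" makes overwrite replace only the partitions present in the
# DataFrame rather than the whole table (Spark 2.3+)
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

df = spark.createDataFrame([(1, 9.99, "2022-03-21")], ["id", "amount", "dt"])

# insertInto matches columns by POSITION, so order them as in the table,
# partition column(s) last
df.select("id", "amount", "dt").write.mode("overwrite").insertInto("sales")
```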

Issue with display()/collect() on a large DataFrame in PySpark

Getting the following issue in PySpark when performing a display()/collect() operation on top of a generated DataFrame. The df contains a single column and row (JSON d…
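collect() (and display() on Databricks) pulls the rows to the driver, which is exactly where one huge JSON row tends to fail. Two driver-friendly alternatives, sketched with a hypothetical json_col column:

```python
from pyspark.sql.functions import col, substring

df = spark.createDataFrame([('{"k": "very large payload"}',)], ["json_col"])

# peek at a truncated slice instead of materializing the full payload
df.select(substring(col("json_col"), 1, 500).alias("preview")) \
  .show(1, truncate=False)

# or write it straight to storage so the driver never holds the string
df.write.mode("overwrite").text("/tmp/json_debug")  # hypothetical path
```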

Spark UDF error AttributeError: 'NoneType' object has no attribute '_jvm'

I found a similar question (link), but no answer explained how to fix the issue. I want to make a UDF that would extract words from a column for me. So, I want to cr…
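This particular AttributeError usually means the UDF body calls pyspark.sql.functions (or otherwise touches the SparkContext); those need the driver's JVM, which is None inside executors. A sketch of a word-extracting UDF that sticks to plain Python:

```python
import re

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# only plain Python inside the UDF body - no pyspark.sql.functions calls
@udf(returnType=ArrayType(StringType()))
def extract_words(text):
    if text is None:
        return []
    return re.findall(r"[A-Za-z]+", text)

df = spark.createDataFrame([("hello, spark 3!",)], ["line"])
df.select(extract_words("line").alias("words")).show(truncate=False)
```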

Transpose a group of repeating columns in a large horizontal DataFrame into a new vertical DataFrame using Scala or PySpark in Databricks

Although this question may seem previously answered, it is not. All the transposing questions seem to relate to one column and pivoting the data in that column. I want to ma…
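For repeating column groups, the stack() SQL generator turns each group into its own row without self-joins. A sketch over a hypothetical (name_i, val_i) layout:

```python
from pyspark.sql.functions import expr

wide = spark.createDataFrame(
    [(1, "a", 10, "b", 20, "c", 30)],
    ["id", "name_1", "val_1", "name_2", "val_2", "name_3", "val_3"],
)

# stack(n, k1, v1, k2, v2, ...) emits n rows per input row
tall = wide.select(
    "id",
    expr("stack(3, name_1, val_1, name_2, val_2, name_3, val_3) "
         "as (name, value)"),
)
tall.show()
```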

Spark Structured Streaming writeStream trigger set to once is recording much less data than it should

I have a program that runs every hour; it receives streaming data and writes it in Parquet format in batches into a data lake every time it runs, to be later pro…
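A sketch of the trigger(once=True) shape; source, schema and paths are placeholders. One thing worth verifying in this scenario is that the checkpoint location stays identical across the hourly runs, since losing it loses the source offsets:

```python
# trigger(once=True) processes one micro-batch of the data available when
# the query starts; records landing afterwards wait for the next run
stream = (
    spark.readStream
    .schema("id INT, price DOUBLE")   # hypothetical schema
    .json("/datalake/incoming")       # hypothetical input path
)

query = (
    stream.writeStream
    .format("parquet")
    .option("path", "/datalake/events")             # hypothetical
    .option("checkpointLocation", "/datalake/_cp")  # must be stable across runs
    .trigger(once=True)  # Spark 3.3+ also offers trigger(availableNow=True)
    .start()
)
query.awaitTermination()
```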

Exception when trying to use saved Spark ML model for distributed computations with SHAP

I am getting this error: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkConte…

What are the right memory allocations for multiple Spark streaming jobs processed on a single EMR cluster (m5.xlarge)?

I have 12 Spark streaming jobs, and each receives a small amount of data at any given time. These scripts contain Spark transformations and joins. What is the right memory alloca…
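There is no universally right number without measuring, but with a dozen small jobs a common starting point is a small, fixed footprint per job so YARN can pack them onto the node. A hedged sketch; every figure is an assumption to tune (m5.xlarge: 4 vCPU, 16 GiB per node), and on YARN these settings are normally passed at submit time rather than in-process:

```python
from pyspark.sql import SparkSession

# assumed starting values for many small concurrent streaming jobs;
# leave headroom for YARN and the OS on each node
spark = (
    SparkSession.builder.appName("small-streaming-job")
    .config("spark.driver.memory", "1g")
    .config("spark.executor.memory", "1g")
    .config("spark.executor.cores", "1")
    .config("spark.executor.instances", "1")
    .config("spark.dynamicAllocation.enabled", "false")  # keep footprint fixed
    .getOrCreate()
)
```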

How to process JSON data in a column using Python/PySpark?

Trying to process JSON data in a column on Databricks. Below is sample data from a table (it's a weather device's recorded info): JSON_Info {"sampleData":"dataD…
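from_json() with an explicit schema is the standard tool here. A minimal sketch built around the visible snippet; the schema is an assumption from that fragment:

```python
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

schema = StructType([StructField("sampleData", StringType(), True)])

df = spark.createDataFrame([('{"sampleData":"dataDetails"}',)], ["JSON_Info"])

parsed = df.withColumn("parsed", from_json(col("JSON_Info"), schema))
parsed.select("parsed.sampleData").show(truncate=False)
```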

PySpark error: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD

When I tried to query from Spark into Elasticsearch, an error occurred. The code that I use is the following: from pyspark import SparkContext from pyspark.sql impor…
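This call frequently fails simply because the elasticsearch-hadoop connector jar is missing from the classpath (e.g. start with pyspark --jars elasticsearch-hadoop-<version>.jar, with <version> a placeholder). A hedged sketch of the documented EsInputFormat read; host, port and index are placeholders:

```python
es_conf = {
    "es.nodes": "localhost",
    "es.port": "9200",
    "es.resource": "my_index",
}

rdd = spark.sparkContext.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_conf,
)
print(rdd.take(1))
```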

Is there an elegant, easy and fast way to move data out of HBase into MongoDB?

Is there an elegant, easy and fast way to move data out of HBase into MongoDB? I want to migrate from HBase to MongoDB. I am new to MongoDB. Could someone please hel…

Why does PySpark code running in PyCharm generate this information?

I'm new to Python and PySpark. When I run PySpark code in PyCharm, it always generates the information below. I want to know the reason and a solut…
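Those lines are Spark's log4j INFO output, and the level can be lowered per application. A minimal sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("quiet").getOrCreate()

# hide INFO/WARN console chatter; the job itself is unaffected
spark.sparkContext.setLogLevel("ERROR")
```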

Consolidating address columns from multiple tables into one column (3 million rows)

I have a table that looks like this: common_id | table1_address | table2_address | table3_address | table4_address, with rows such as 123 | null | null | stack building12 | null and 157 | 123road stree…
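coalesce() returns the first non-null value per row, left to right, which matches this shape directly. A sketch with the two visible sample rows:

```python
from pyspark.sql.functions import coalesce

df = spark.createDataFrame(
    [(123, None, None, "stack building12", None),
     (157, "123road street", None, None, None)],
    "common_id int, table1_address string, table2_address string, "
    "table3_address string, table4_address string",
)

merged = df.withColumn(
    "address",
    coalesce("table1_address", "table2_address",
             "table3_address", "table4_address"),
)
merged.select("common_id", "address").show(truncate=False)
```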

Spark DataFrame from a dictionary

I'm trying to create a Spark DataFrame from a dictionary which has data in the format {'33_45677': 0, '45_3233': 25, '56_4599': 43524}, etc. dict_pairs={'33…
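dict.items() already yields (key, value) pairs, which createDataFrame() accepts directly. A minimal sketch:

```python
dict_pairs = {"33_45677": 0, "45_3233": 25, "56_4599": 43524}

# two columns: the dict key and the dict value
df = spark.createDataFrame(list(dict_pairs.items()), ["id", "value"])
df.show()
```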

Encountering an error converting an RDD into a DataFrame in PySpark

I am trying to turn an RDD into a DataFrame. The operation seems to be successful, but when I try to count the number of elements in the DataFrame I get an error.
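toDF() infers the schema from sampled rows and evaluates lazily, so a malformed row often only surfaces at the count() action. Supplying an explicit schema makes the failure immediate and the error readable; a generic sketch:

```python
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("value", IntegerType(), True),
])

rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2)])

# with the schema given up front, rows that don't fit fail fast
df = spark.createDataFrame(rdd, schema)
print(df.count())
```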

Python function to iterate over each unique column and transform it using PySpark

I'm building the following global function in PySpark to go through each column in my CSV (each in a different format) and convert them all to one uniform format…
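One hedged way to normalize mixed date formats: try each candidate format with to_date() and keep the first that parses, per column. The formats list below is an assumption; with the default parser policy a non-matching format simply yields null, so coalesce() picks the first success:

```python
from pyspark.sql.functions import coalesce, col, to_date

FORMATS = ["yyyy-MM-dd", "MM/dd/yyyy", "dd-MM-yyyy"]  # assumed candidates

def normalize_dates(df, columns):
    # replace each column with its first successful parse
    for c in columns:
        df = df.withColumn(c, coalesce(*[to_date(col(c), f) for f in FORMATS]))
    return df

df = spark.createDataFrame([("2022-03-21", "03/21/2022")], ["d1", "d2"])
normalize_dates(df, ["d1", "d2"]).show()
```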