Category "pyspark"

What are the right memory allocations for multiple Spark Streaming jobs running on a single EMR cluster (m5.xlarge)?

I have 12 Spark Streaming jobs, each receiving a small amount of data at any time. These scripts have Spark transformations and joins. What is the right memory alloca

How to process JSON data in a column using Python/PySpark?

Trying to process JSON data in a column on Databricks. Below is the sample data from a table (it's weather device record info) JSON_Info {"sampleData":"dataD

pyspark error: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD

When I tried to search Elasticsearch from Spark, an error occurred. The code that I use is the following: from pyspark import SparkContext from pyspark.sql impor

Is there an elegant, easy and fast way to move data out of HBase into MongoDB?

Is there an elegant, easy and fast way to move data out of HBase into MongoDB? I want to migrate from HBase to MongoDB. I am new to MongoDB. Could someone please hel

Why does PySpark code running in PyCharm generate this information?

I'm new to Python and PySpark. When I run PySpark code in PyCharm, it always generates the information below. I want to know the reason and solut

Combining address columns from multiple tables into one column (3 million rows)

I have a table that looks like this common_id table1_address table2_address table3_address table4_address 123 null null stack building12 null 157 123road stree

Spark dataframe from dictionary

I'm trying to create a spark dataframe from a dictionary which has data in the format {'33_45677': 0, '45_3233': 25, '56_4599': 43524} .. etc. dict_pairs={'33

Error converting an RDD to a DataFrame in PySpark

I am trying to turn an RDD into a DataFrame. The operation seems to be successful but when I try to count the number of elements in the DataFrame I get an error.

Python function to iterate each unique column and transform using pyspark

I'm building the following global function in Pyspark to go through each column in my CSV that is in different formats and convert them all to one unique format

How to have a single CSV file after applying partitionBy in PySpark

I have to first partition by a "customer group" but I also want to make sure that I have a single csv file per "customer_group" . This is because it is timeseri

Python argparse unexpected behavior when passing "``" to the argument string in PySpark cluster mode

I am trying to pass a string in my PySpark code and it works fine, but when I pass the following string to escape reserved keyword `date` or any value passed in

How to effectively run tasks in parallel in PySpark

I am working on writing a framework that basically does a data sanity check. I have a set of inputs like { "check_1": [ sql_query_1, sql_query_2 ], "check_2":

Java gateway process exited before sending its port number

I am trying to install PySpark on Windows 10 for use with JupyterLab. I have already installed Java and am running Python 3.7.3: openjdk version "1.8.0_242" O

Is there a way to configure the memory resources for Spark using PySpark?

I'm working on an ETL job with a SageMaker notebook that uses Spark 2.4.0. After joining a couple of tables I keep getting the following errors: Update-- I was

Why is PySpark taking so long to create a SparkSession in Jupyter?

Well, I'm learning PySpark. I installed ipykernel, jupyterlab, notebook and pyspark via pip, and Java 8 via .exe. The problem is when I need to create the sessi

Group by id and create a column based on priority in Pyspark

Can someone help me with the below. I have an input dataframe. ID process_type STP_stagewise 1 loan_creation Manual 1 loan creation NSTP 1 reimbursement STP 2

Why does Spark need S3 to connect to a Redshift warehouse, while pandas can read a Redshift table directly?

Sorry in advance for this dumb question. I am just beginning with AWS and PySpark. I was reviewing the pyspark library and I see PySpark needs a tempdir in S3 to be a

Programmatic way to find the cluster version from CDSW - Cloudera Data Science Workbench

Is there any programmatic way to find out the cluster version (CDH6 or CDP7) from a CDSW session? Could any environment variable give a fool-proof way to determi

Update a highly nested column from string to struct

Spark - Update a nested column to string
