I have 12 Spark Streaming jobs, and each receives a small amount of data at any time. These scripts have Spark transformations and joins. What is the right memory alloca
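For reference, a minimal sketch of sizing such a job via session config; the values are placeholders to tune, not a recommendation. Note that spark.driver.memory generally has to be set on spark-submit instead, since the driver JVM is already running by the time the builder executes.

    from pyspark.sql import SparkSession

    # Placeholder sizes for a low-volume streaming job; tune per workload.
    spark = (SparkSession.builder
             .appName("small-streaming-job")
             .config("spark.executor.memory", "2g")
             .config("spark.executor.cores", "2")
             .config("spark.executor.instances", "2")
             .getOrCreate())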
Trying to process JSON data in a column on Databricks. Below is the sample data from a table (it records weather device info): JSON_Info {"sampleData":"dataD
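Parsing a JSON string column is usually done with from_json plus an explicit schema; a minimal sketch, assuming the column is named JSON_Info and guessing the schema from the visible fragment:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([('{"sampleData":"dataD..."}',)], ["JSON_Info"])

    # Schema guessed from the fragment above; extend it to match the real payload.
    schema = StructType([StructField("sampleData", StringType(), True)])

    parsed = df.withColumn("parsed", F.from_json("JSON_Info", schema))
    parsed.select("parsed.sampleData").show(truncate=False)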
When I tried to query Elasticsearch from Spark, an error occurred. The code that I use is the following: from pyspark import SparkContext from pyspark.sql impor
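Since the actual error is cut off, here is only a known-good baseline read with the elasticsearch-hadoop connector (its jar must be on the Spark classpath; host, port, and index name are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("es-read").getOrCreate()

    df = (spark.read
          .format("org.elasticsearch.spark.sql")
          .option("es.nodes", "localhost")
          .option("es.port", "9200")
          .load("my_index"))
    df.show()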
Is there an elegant, easy, and fast way to move data out of HBase into MongoDB? I want to migrate from HBase to MongoDB. I am new to MongoDB. Could someone please hel
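One common route is Spark as the bridge: read HBase into a DataFrame with an HBase connector, then write it out with the MongoDB Spark connector. A sketch of the write side only, assuming hbase_df already holds the HBase rows and the MongoDB Spark connector 3.x jar is available (the newer 10.x connector uses format "mongodb" instead); the URI is a placeholder:

    # hbase_df: a DataFrame already loaded from HBase (e.g. via hbase-spark).
    (hbase_df.write
        .format("mongo")
        .option("uri", "mongodb://localhost:27017/mydb.mycoll")
        .mode("append")
        .save())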
I'm new to Python and PySpark. When I run PySpark code in PyCharm, it always generates the information below. I want to know the reason and solut
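If the "information" is just INFO/WARN log output (an assumption, since the message itself is cut off), raising the log level silences it:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # Suppress runtime chatter; use "WARN" to keep warnings visible.
    spark.sparkContext.setLogLevel("ERROR")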
I have a table that looks like this:

common_id  table1_address  table2_address  table3_address    table4_address
123        null            null            stack building12  null
157        123road stree
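If the goal is one merged address per row, F.coalesce picks the first non-null value left to right; a sketch assuming df is the table above:

    from pyspark.sql import functions as F

    df = df.withColumn(
        "address",
        F.coalesce("table1_address", "table2_address",
                   "table3_address", "table4_address"))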
I'm trying to create a Spark DataFrame from a dictionary which has data in the format {'33_45677': 0, '45_3233': 25, '56_4599': 43524}, etc. dict_pairs={'33
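One direct way is to turn the dict into (key, value) rows; the column names here are assumptions:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    dict_pairs = {'33_45677': 0, '45_3233': 25, '56_4599': 43524}

    # Each (key, value) pair becomes one row.
    df = spark.createDataFrame(list(dict_pairs.items()), ["id", "value"])
    df.show()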
I am trying to turn an RDD into a DataFrame. The operation seems to be successful, but when I try to count the number of elements in the DataFrame I get an error.
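Because DataFrame creation is lazy, malformed rows often only surface at the first action such as count(). Passing an explicit schema tends to expose the mismatch earlier; a sketch with placeholder fields:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    spark = SparkSession.builder.getOrCreate()

    rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2)])
    schema = StructType([
        StructField("name", StringType(), True),
        StructField("n", IntegerType(), True),
    ])
    df = spark.createDataFrame(rdd, schema)
    print(df.count())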
I'm building the following global function in PySpark to go through each column in my CSV that is in a different format and convert them all to one unique format
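Assuming "different formats" means date formats (a guess from the truncated question), one pattern is to coalesce to_date over a list of candidate patterns; column names and patterns are placeholders:

    from pyspark.sql import functions as F

    def normalize_dates(df, columns,
                        patterns=("yyyy-MM-dd", "dd/MM/yyyy", "MM-dd-yyyy")):
        # For each column, the first pattern that parses wins; unparseable
        # values come out null (with ANSI mode off, the default).
        for c in columns:
            df = df.withColumn(
                c, F.coalesce(*[F.to_date(F.col(c), p) for p in patterns]))
        return df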
I have to first partition by a "customer group", but I also want to make sure that I have a single CSV file per "customer_group". This is because it is timeseri
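The usual trick is to repartition on the same column before partitionBy: each group then lives in exactly one shuffle partition, so each output directory gets a single part file. A sketch with a placeholder path:

    (df.repartition("customer_group")
       .write
       .partitionBy("customer_group")
       .mode("overwrite")
       .csv("/path/to/output"))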
I am trying to pass a string in my PySpark code and it works fine, but when I pass the following string to escape the reserved keyword `date`, or any value passed in
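For reference, reserved words used as identifiers in Spark SQL are escaped with backticks, while the DataFrame API sidesteps the issue entirely (spark, df, and my_table are placeholders here):

    from pyspark.sql import functions as F

    # SQL side: backticks escape the reserved keyword.
    spark.sql("SELECT `date` FROM my_table WHERE `date` >= '2021-01-01'")

    # DataFrame side: plain column access, no escaping needed.
    df.select(F.col("date"))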
I am writing a framework that basically does data sanity checks. I have a set of inputs like { "check_1": [ sql_query_1, sql_query_2 ], "check_2":
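A minimal sketch of a runner for such a check map, assuming the convention that a check passes when each of its queries returns zero rows (the convention itself is a guess, since the question is cut off):

    def run_checks(spark, checks):
        results = {}
        for name, queries in checks.items():
            # A check passes only if every one of its queries finds no rows.
            results[name] = all(spark.sql(q).count() == 0 for q in queries)
        return results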
I am trying to install PySpark on my Windows 10 machine to use in JupyterLab. I have already installed Java and am running Python 3.7.3: openjdk version "1.8.0_242" O
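For reference, a common Windows setup after pip install pyspark findspark: point JAVA_HOME at the JDK, let findspark locate the pip-installed Spark, then build the session inside Jupyter (the JDK path below is a placeholder):

    import os
    os.environ["JAVA_HOME"] = r"C:\Program Files\AdoptOpenJDK\jdk8u242"  # placeholder

    import findspark
    findspark.init()

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.master("local[*]").getOrCreate()
    print(spark.version)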
I'm working on an ETL job in a SageMaker notebook that uses Spark 2.4.0. After joining a couple of tables I keep getting the following errors: Update -- I was
Well, I'm learning PySpark. I installed ipykernel, jupyterlab, notebook, and pyspark via pip, and Java 8 via an .exe installer. The problem is when I need to create the sessi
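For reference, the standard way to create the session locally (the appName is arbitrary):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")
             .appName("learning-pyspark")
             .getOrCreate())
    print(spark.version)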
Can someone help me with the below? I have an input DataFrame:

ID  process_type   STP_stagewise
1   loan_creation  Manual
1   loan creation  NSTP
1   reimbursement  STP
2
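One reading of the goal (a guess, since the question is cut off) is a count per ID across the STP buckets, which groupBy plus pivot gives; assuming df is the table above:

    out = (df.groupBy("ID")
             .pivot("STP_stagewise", ["STP", "NSTP", "Manual"])
             .count()
             .na.fill(0))
    out.show()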
Sorry in advance for this dumb question. I am just beginning with AWS and PySpark. I was reviewing the pyspark library and I see PySpark needs a tempdir in S3 to be a
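The tempdir requirement usually comes from the Redshift connector, which stages data through S3 before loading (an assumption, since the question is cut off). A sketch with the community connector; URL, table, and bucket are placeholders, and older Databricks builds use format "com.databricks.spark.redshift" instead:

    df = (spark.read
          .format("io.github.spark_redshift_community.spark.redshift")
          .option("url", "jdbc:redshift://host:5439/db?user=u&password=p")
          .option("dbtable", "my_table")
          .option("tempdir", "s3a://my-bucket/tmp/")
          .load())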
Is there any programmatic way to find out the cluster version (CDH6 or CDP7) from a CDSW session? Could any environment variable give a fool-proof way to determi
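Not fool-proof, but one heuristic: vendor builds embed the distribution in their version strings (e.g. Hadoop "3.0.0-cdh6.x" on CDH6 versus a "3.1.1.7.x"-style string on CDP7), so inspecting them from the session can work:

    import subprocess

    # First line looks like "Hadoop 3.0.0-cdh6.3.2" on CDH6.
    print(subprocess.check_output(["hadoop", "version"]).decode().splitlines()[0])

    # Spark's build version can carry a vendor suffix too (assumes an active session).
    print(spark.version)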
|-- x: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- y: long (nullable = true)
|    |    |-- z: array (nullable = tru
|-- x: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- y: struct (nullable = true)
|    |    |-- z: struct (nullable =
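Both snippets look like printSchema() output for nested columns. If the goal is to flatten them (a guess, since the surrounding question is missing), explode un-nests one array level at a time; for the first schema:

    from pyspark.sql import functions as F

    flat = (df
        .withColumn("x_elem", F.explode("x"))            # one row per element of x
        .select(F.col("x_elem.y").alias("y"),
                F.explode("x_elem.z").alias("z_elem")))   # then un-nest z

    # For the second schema (structs rather than arrays inside the element),
    # dotted paths suffice after the first explode:
    # F.col("x_elem.y"), F.col("x_elem.z")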