Every time I run a simple groupBy, PySpark returns different values, even though I haven't made any modification to the dataframe. Here is the code I am using: df
I have a Spark DataFrame like this:

    +-------+------+-----+---------------+
    |Account|nature|value|           time|
    +-------+------+-----+---------------+
    |
I have a PySpark dataframe as

    DOCTOR | PATIENT
    JOHN   | SAM
    JOHN   | PETER
    JOHN   | ROBIN
    BEN    | ROSE
    BEN    | GRAY

and need to concatenate patient names
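The question is cut off, but the usual way to group-concatenate strings in PySpark is collect_list plus concat_ws; the comma separator below is an assumption:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("JOHN", "SAM"), ("JOHN", "PETER"), ("JOHN", "ROBIN"),
         ("BEN", "ROSE"), ("BEN", "GRAY")],
        ["DOCTOR", "PATIENT"])

    # Collect each doctor's patients into a list, then join into one string
    result = (df.groupBy("DOCTOR")
                .agg(F.concat_ws(",", F.collect_list("PATIENT")).alias("PATIENTS")))
    result.show()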
I want to define an environment variable in a Databricks init script and then read it in a PySpark notebook. I wrote this: dbutils.fs.put("/databricks/scripts/i
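A sketch of one commonly cited workaround: have the init script append the variable to /etc/environment, then read it with os.environ once the cluster has restarted with that script configured. The script path and variable name here are placeholders, not from the original question:

    # Write the init script to DBFS (path and variable are placeholders)
    dbutils.fs.put("/databricks/scripts/set-env.sh", """#!/bin/bash
    echo MY_VAR=some_value >> /etc/environment
    """, True)

    # Later, in a notebook on a cluster configured with that init script:
    import os
    print(os.environ.get("MY_VAR"))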
This is my first time using PySpark. I am on a Mac and am trying to start a session in a Jupyter Notebook using the code below: import pyspark from py
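The snippet is cut off; a minimal local session that typically works in Jupyter looks like this (the app name is a placeholder):

    import pyspark
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")          # run locally on all cores
             .appName("jupyter-test")
             .getOrCreate())
    spark.range(5).show()                 # quick smoke test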
I am trying to monitor some logic in a UDF using counters, i.e.:

    counter = Counter(...).labels("value")

    @udf
    def do_smthng(col):
        if col:
            counter.label(
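One caveat worth noting: a UDF executes on the executors, so a driver-side counter object will not see the increments. A sketch of the same idea using a Spark accumulator instead, which does flow back to the driver (everything beyond the names in the snippet is an assumption):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.getOrCreate()
    non_empty = spark.sparkContext.accumulator(0)

    @udf(StringType())
    def do_smthng(col):
        if col:
            non_empty.add(1)   # accumulator updates are sent back to the driver
        return col

    df = spark.createDataFrame([("a",), (None,)], ["c"])
    df.select(do_smthng("c")).collect()
    print(non_empty.value)     # 1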
I have a dataframe with a yearweek column that I want to convert to a date. The code I wrote seems to work for every week except weeks '202001' and '202053',
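The conversion code is cut off; one ISO-week-aware sketch uses Python's %G/%V/%u directives inside a UDF, which handle boundary weeks such as '202001' and '202053' correctly (the column name yearweek follows the question; the rest is an assumption):

    from datetime import datetime
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DateType

    spark = SparkSession.builder.getOrCreate()

    @udf(DateType())
    def yearweek_to_date(yw):
        # %G = ISO year, %V = ISO week, %u = ISO weekday (1 = Monday)
        return datetime.strptime(yw + "-1", "%G%V-%u").date() if yw else None

    df = spark.createDataFrame([("202001",), ("202053",)], ["yearweek"])
    df.withColumn("week_start", yearweek_to_date("yearweek")).show()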
I am trying to install PySpark like this: python setup.py install. I get this error: Could not import pypandoc - required to package PySpark. pypandoc is inst
I am trying to connect to BigQuery from Databricks (latest version, 7.1+, Spark 3.0) with PySpark as the script editor/base language. We ran the below PySpark script to f
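The actual script is cut off; a minimal sketch using the spark-bigquery connector looks like this (the table name is a placeholder, and spark is the session that Databricks provides):

    # Read a BigQuery table through the spark-bigquery connector
    df = (spark.read.format("bigquery")
          .option("table", "my-project.my_dataset.my_table")
          .load())
    df.show()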
How do I send a pandas dataframe to a Hive table? I know that if I have a Spark dataframe, I can register it as a temporary table using df.registerTempTable("table_
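A sketch of the usual route: convert the pandas frame to a Spark DataFrame, then persist it with saveAsTable (the database and table names are placeholders):

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    pdf = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
    sdf = spark.createDataFrame(pdf)                           # pandas -> Spark
    sdf.write.mode("overwrite").saveAsTable("my_db.my_table")  # persist to Hive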
I'm facing some memory issues but I'm unable to solve them. Any help is highly appreciated. I am new to Spark and PySpark functionalities a
I have already researched a lot but could not find a solution. The closest question I could find here is "Why my SPARK works very slowly with mongoDB". I am trying t
I have a data frame (df). To show its schema I use:

    from pyspark.sql.functions import *
    df1.printSchema()

And I get the following result:

    #root
    # |-- na
I am using Spark and got an error when trying to enter 'pyspark' in the Windows command prompt. I tried to install PySpark on Windows following this tutorial (h
Some details:

    Spark SQL (version 3.2.1)
    Driver: Hive JDBC (version 2.3.9)
    ThriftCLIService: Starting ThriftBinaryCLIService on port 10000 with 5...500 worker t
I have a Spark DataFrame which contains groups of training data. Each group is identified by the "group" column.

    group | feature_1 | feature_2 | label
    --------
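One per-group pattern on Spark 3.x is groupBy plus applyInPandas, which hands each group to a pandas function; the placeholder "model" below just attaches the group's mean label, since the real training step is not in the question:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("a", 1.0, 2.0, 0.0), ("a", 3.0, 4.0, 1.0), ("b", 5.0, 6.0, 1.0)],
        ["group", "feature_1", "feature_2", "label"])

    def train_group(pdf):
        # pdf holds one group's rows as a pandas DataFrame; fit a model here
        return pdf.assign(prediction=pdf["label"].mean())

    result = df.groupBy("group").applyInPandas(
        train_group,
        schema="group string, feature_1 double, feature_2 double, "
               "label double, prediction double")
    result.show()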
I am using a PySpark test script to read and write files to S3. Here is how I initialize the Spark session:

    import findspark
    from pyspark.sql
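The initialization is cut off; a sketch of an S3-enabled session via the s3a connector might look like this (the bucket name, credentials, and hadoop-aws version are assumptions):

    import findspark
    findspark.init()

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("s3-test")
             .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4")
             .config("spark.hadoop.fs.s3a.access.key", "<access-key>")
             .config("spark.hadoop.fs.s3a.secret.key", "<secret-key>")
             .getOrCreate())

    df = spark.read.csv("s3a://my-bucket/input.csv", header=True)
    df.write.mode("overwrite").parquet("s3a://my-bucket/output/")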
How do I do exception handling for file reading? For example, I have a daily job that runs at 8:00 am. It reads files from Azure Data Lake Storage (Gen 2). The
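A sketch of catching a missing daily file (the abfss path is a placeholder and spark is an existing session):

    from pyspark.sql.utils import AnalysisException

    path = "abfss://container@account.dfs.core.windows.net/daily/file.csv"
    try:
        df = spark.read.csv(path, header=True)
    except AnalysisException as e:
        # Spark raises AnalysisException when the path cannot be resolved
        print(f"Skipping today's load, file not readable: {e}")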
I have a Spark df with the following schema:

    |-- col1: string
    |-- col2: string
    |-- customer: struct
    |    |-- smt: string
    |    |-- attributes: array (null
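A sketch of flattening that structure with explode (the array element type and the frame construction are assumptions, since the schema is cut off):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("x", "y", ("s", ["a1", "a2"]))],
        "col1 string, col2 string, "
        "customer struct<smt:string, attributes:array<string>>")

    # One output row per element of customer.attributes
    flat = (df
            .withColumn("attribute", explode(col("customer.attributes")))
            .select("col1", "col2",
                    col("customer.smt").alias("smt"), "attribute"))
    flat.show()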