Category "pyspark"

How to execute an Azure Databricks notebook from Excel

Is there any way to trigger an Azure Databricks notebook from Excel? If there is, please help me with how. Many thanks
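
There is no direct Excel-to-Databricks hook, but one common approach is to wrap the notebook in a Databricks job and trigger it over the Jobs REST API, which Excel can call from VBA or a small helper script. A minimal Python sketch, assuming such a job already exists (the host, token, and job ID below are placeholders):

    import requests

    DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
    TOKEN = "<personal-access-token>"  # placeholder
    JOB_ID = 123  # a Databricks job configured to run the notebook

    # Trigger the job via the Jobs 2.1 run-now endpoint.
    resp = requests.post(
        f"{DATABRICKS_HOST}/api/2.1/jobs/run-now",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"job_id": JOB_ID},
    )
    resp.raise_for_status()
    print(resp.json())  # contains the run_id of the triggered run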

ImportError: No module named numpy on spark workers

Launching pyspark in client mode: bin/pyspark --master yarn-client --num-executors 60. Importing numpy in the shell goes fine, but it fails in the KMeans step. Somehow…
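
The usual cause is that the executors' Python has no numpy installed: importing it in the driver shell proves nothing about the workers. A sketch of pointing every process at an environment that has numpy and smoke-testing the workers; the interpreter path is an assumption:

    import os
    from pyspark import SparkConf, SparkContext

    # Must be set before the context starts; the path must exist on every node.
    os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"  # hypothetical path with numpy

    sc = SparkContext.getOrCreate(SparkConf().setAppName("numpy-check"))

    # Try importing numpy inside each executor partition.
    def has_numpy(_):
        try:
            import numpy  # noqa: F401
            return ["ok"]
        except ImportError as e:
            return [str(e)]

    print(sc.parallelize(range(4), 4).mapPartitions(has_numpy).collect())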

DataFrame empty check in PySpark

I am trying to check if a DataFrame is empty in PySpark using the code below: print(df.head(1).isEmpty). But I am getting an error: AttributeError: 'list' object has no…
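
df.head(1) returns a plain Python list, which has no isEmpty attribute; that method belongs to the Scala/RDD APIs. A minimal fix:

    # Cheap: fetches at most one row, then checks the list's length.
    is_empty = len(df.head(1)) == 0

    # Equivalent alternative via the underlying RDD:
    is_empty = df.rdd.isEmpty()

    # Spark 3.3+ also has a built-in:
    # is_empty = df.isEmpty()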

How to read data from multiple folders in ADLS into a Databricks DataFrame

The file path format is data/<year>/<weeknumber>/<day>/data_<hour>.parquet, e.g. data/2022/05/01/00/data_00.parquet, data/2022/05/01/01/data_01.parquet, data/2022/05/01/02/da…
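
Spark's path globbing handles this kind of nested layout directly; each * matches one directory level. A sketch, assuming the ADLS container is mounted or otherwise reachable at the data/ root:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import input_file_name

    spark = SparkSession.builder.getOrCreate()

    # One read across all days and hours of week 05 of 2022.
    df = spark.read.parquet("data/2022/05/*/*/data_*.parquet")

    # Optionally keep track of which file each row came from.
    df = df.withColumn("source_file", input_file_name())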

Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext

Anyone know why I keep getting this error in Jupyter Notebooks? I've been trying to load my TensorFlow model into Apache Spark via SparkFlow, but I can't seem to…
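
This Py4J error usually means the JVM-side SparkContext failed to initialize, and in notebooks the most common reason is that a context already exists in the kernel. A sketch of the usual workaround: reuse the existing context instead of constructing a new one, or restart the kernel if it is wedged:

    from pyspark import SparkContext
    from pyspark.sql import SparkSession

    # Reuse the running context rather than creating a second one.
    sc = SparkContext.getOrCreate()
    spark = SparkSession.builder.appName("sparkflow-demo").getOrCreate()

    # If the kernel is in a bad state, stop cleanly and start over:
    # spark.stop()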

Coding reduceByKey(lambda) in map doesn't work in PySpark

I can't understand why my code isn't working. The last line is the problem: import findspark; findspark.init(); from pyspark import SparkConf, SparkContext; from p…
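
Without the full snippet it is hard to be sure, but a frequent mistake is calling reduceByKey inside map. reduceByKey is an RDD method applied after each element is mapped to a (key, value) pair. A minimal working pattern:

    import findspark
    findspark.init()

    from pyspark import SparkConf, SparkContext

    sc = SparkContext.getOrCreate(
        SparkConf().setMaster("local[*]").setAppName("reduce-by-key-demo"))

    # Map each word to (word, 1), then reduce by key to count.
    pairs = (sc.parallelize(["a b a", "b c"])
               .flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1)))
    counts = pairs.reduceByKey(lambda a, b: a + b)
    print(counts.collect())  # e.g. [('a', 2), ('b', 2), ('c', 1)] (order may vary)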

Using monotonically_increasing_id() to assign row numbers to a PySpark DataFrame

I am using monotonically_increasing_id() to assign a row number to a PySpark DataFrame with the syntax below: df1 = df1.withColumn("idx", monotonically_increasing_id(…
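
Worth knowing: monotonically_increasing_id() is unique and increasing but not consecutive, because the partition ID is encoded in the upper bits, so values jump between partitions. For a consecutive 1..N index, row_number() over a window is the usual substitute. A sketch, where "order_col" is a placeholder for whatever defines row order (note a single global window pulls all rows into one partition):

    from pyspark.sql.functions import row_number
    from pyspark.sql.window import Window

    w = Window.orderBy("order_col")  # hypothetical ordering column
    df1 = df1.withColumn("idx", row_number().over(w))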

PySpark wordcount: sort by value

I'm learning PySpark and trying the code below. Can someone help me understand what's wrong? >>> pairs = data.flatMap(lambda x: x.split(' ')).map(lambda x…
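
To sort a word count by its value rather than its key, sortBy with the count as the sort key is the usual approach. A sketch, assuming data is an RDD of text lines:

    counts = (data.flatMap(lambda line: line.split(" "))
                  .map(lambda word: (word, 1))
                  .reduceByKey(lambda a, b: a + b))

    # Sort by the count (element [1] of each pair), highest first.
    by_value = counts.sortBy(lambda kv: kv[1], ascending=False)
    print(by_value.take(10))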

How can I access a Python variable in Spark SQL?

I have a Python variable created under %python in my Jupyter notebook file in Azure Databricks. How can I access the same variable to make comparisons under %sql?
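
A %sql cell cannot see Python locals directly, so the value has to be passed across. Two common options, sketched below using the spark session Databricks predefines (my_table is a placeholder):

    # Option 1: build and run the query from a %python cell.
    threshold = 100
    result = spark.sql(f"SELECT * FROM my_table WHERE amount > {threshold}")

    # Option 2: expose the value where %sql can reach it, e.g. a one-row temp view.
    spark.createDataFrame([(threshold,)], ["threshold"]) \
         .createOrReplaceTempView("params")
    # Then in a %sql cell:
    #   SELECT * FROM my_table WHERE amount > (SELECT threshold FROM params)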

Pyspark-pandas not working on Spark 3.1.2

I am using Spark 3.1.2 and attempting to use pyspark-pandas. However, when attempting from pyspark import pandas as ps, I am getting the following error: ImportEr…
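
pyspark.pandas was merged into Spark in 3.2.0, so the import cannot work on 3.1.2. On 3.1.x the same API is available from the standalone Koalas package (pip install koalas); a sketch:

    import databricks.koalas as ks

    kdf = ks.DataFrame({"a": [1, 2, 3]})
    print(kdf.head())

    # After upgrading to Spark >= 3.2, switch to the built-in module:
    # from pyspark import pandas as ps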

How can you parse a string that is JSON from an existing temp table using PySpark?

I have an existing Spark DataFrame with columns like this:

    pid | response
    ----|-----------------
    12  | {"status":"200"}

response is a st…
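
from_json with an explicit schema is the standard way to turn such a string column into queryable fields. A sketch, assuming the temp table is named "responses" and the payload only carries "status" (extend the schema as needed):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.getOrCreate()
    df = spark.table("responses")  # hypothetical temp table name

    schema = StructType([StructField("status", StringType())])
    parsed = df.withColumn("response_parsed", from_json(col("response"), schema))
    parsed.select("pid", col("response_parsed.status")).show()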

Where to set the S3 configuration in Spark locally?

I've set up a Docker container that starts a Jupyter notebook using Spark. I've integrated the necessary jars into Spark's directory to be able to access…
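
With the hadoop-aws jars in place, the s3a settings can be passed straight into the session builder, so nothing needs to live in core-site.xml. A sketch for local use (keys, endpoint, and bucket are placeholders):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("s3-local")
             .config("spark.hadoop.fs.s3a.access.key", "<access-key>")
             .config("spark.hadoop.fs.s3a.secret.key", "<secret-key>")
             .config("spark.hadoop.fs.s3a.endpoint", "s3.amazonaws.com")
             .getOrCreate())

    df = spark.read.parquet("s3a://my-bucket/path/")  # hypothetical bucket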

How to split a list into multiple columns in PySpark?

I have:

    key | value
    ----|--------
    a   | [1,2,3]
    b   | [2,3,4]

I want:

    key | value1 | value2 | value3
    ----|--------|--------|-------
    a   | 1      | 2      | 3
    b   | 2      | 3      | 4

It seems that in Scala I can wr…
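
In PySpark the same thing falls out of indexing the array column. A sketch, assuming df is the frame above and "value" always has three elements:

    from pyspark.sql.functions import col

    df2 = df.select(
        "key",
        col("value")[0].alias("value1"),
        col("value")[1].alias("value2"),
        col("value")[2].alias("value3"),
    )
    df2.show()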

pickle.PicklingError: args[0] from __newobj__ args has the wrong class with hadoop python

I am trying to delete stop words via Spark; the code is as follows: from nltk.corpus import stopwords; from pyspark.context import SparkContext; from…
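
The pickling error typically comes from closing over NLTK's lazy corpus loader inside a Spark lambda; it does not serialize cleanly. The usual workaround is to materialize the stop words into a plain set on the driver and broadcast that. A sketch, assuming the NLTK stopwords corpus is downloaded on the driver:

    from pyspark.sql import SparkSession
    from nltk.corpus import stopwords

    sc = SparkSession.builder.getOrCreate().sparkContext

    # A plain Python set pickles fine; broadcast it once to all workers.
    stops = sc.broadcast(set(stopwords.words("english")))

    lines = sc.parallelize(["this is a test", "spark removes the stop words"])
    filtered = lines.map(lambda line: [w for w in line.split() if w not in stops.value])
    print(filtered.collect())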

Airflow/Luigi for AWS EMR automatic cluster creation and pyspark deployment

I am new to Airflow automation. I don't know if it is possible to do this with Apache Airflow (or Luigi, etc.) or whether I should just write a long bash file to do it. I…
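
This is what the Amazon provider's EMR operators are for, so a DAG is a better fit than a long bash file. A sketch, assuming apache-airflow-providers-amazon on Airflow 2.x; the cluster config and the S3 script path are placeholders to fill in:

    from datetime import datetime
    from airflow import DAG
    from airflow.providers.amazon.aws.operators.emr import (
        EmrCreateJobFlowOperator,
        EmrAddStepsOperator,
        EmrTerminateJobFlowOperator,
    )

    JOB_FLOW_OVERRIDES = {  # minimal cluster config; adjust to taste
        "Name": "pyspark-cluster",
        "ReleaseLabel": "emr-6.9.0",
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [{"InstanceRole": "MASTER",
                                "InstanceType": "m5.xlarge",
                                "InstanceCount": 1}],
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

    SPARK_STEPS = [{
        "Name": "run-pyspark",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/job.py"],  # hypothetical script
        },
    }]

    with DAG("emr_pyspark", start_date=datetime(2024, 1, 1), schedule=None) as dag:
        create = EmrCreateJobFlowOperator(task_id="create_cluster",
                                          job_flow_overrides=JOB_FLOW_OVERRIDES)
        add_steps = EmrAddStepsOperator(task_id="submit_pyspark",
                                        job_flow_id=create.output,
                                        steps=SPARK_STEPS)
        terminate = EmrTerminateJobFlowOperator(task_id="terminate_cluster",
                                                job_flow_id=create.output)
        create >> add_steps >> terminate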

AttributeError: Can't get attribute '_fill_function' on <module 'pyspark.cloudpickle' from 'pyspark/cloudpickle/__init__.py'>

While executing PySpark code from a script, I get the following error on df.show(): from pyspark.sql.types import StructType, StructField, StringType, IntegerTy…
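
This AttributeError usually indicates a PySpark version mismatch between driver and executors: cloudpickle's internal _fill_function moved between releases, so pickles produced by one version fail to load in another. A quick check that compares versions on both sides:

    import pyspark
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    print("driver:", pyspark.__version__)

    def worker_version(_):
        import pyspark
        return [pyspark.__version__]

    print("workers:", set(spark.sparkContext.parallelize(range(2), 2)
                               .mapPartitions(worker_version).collect()))
    # Fix: install the same pyspark everywhere, or point PYSPARK_PYTHON
    # at an environment with the matching version.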

PySpark error: AnalysisException: 'Cannot resolve column name

I am trying to transform an entire df to a single vector column using df_vec = vectorAssembler.transform(df.drop('col200')). I am being thrown this error: F…
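
A likely cause is that the assembler's inputCols still lists 'col200' (or some other name missing from the dropped frame). Deriving inputCols from the actual columns avoids the mismatch; a sketch:

    from pyspark.ml.feature import VectorAssembler

    features = [c for c in df.columns if c != "col200"]
    assembler = VectorAssembler(inputCols=features, outputCol="features")
    df_vec = assembler.transform(df.drop("col200"))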

pyspark SQL cannot resolve 'explode()' due to data type mismatch

Running a PySpark script, I get the following error depending on which XML I query: cannot resolve 'explode(...)' due to data type mismatch. The PySpark code: fr…
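
explode() only accepts array or map columns, and XML sources infer an element that occurs once per record as a struct rather than an array, which is why the error depends on the input file. A sketch that handles both shapes, with df as the frame read from XML and "item" standing in for the real column name:

    from pyspark.sql.functions import explode, col, array
    from pyspark.sql.types import ArrayType

    df.printSchema()  # check what was actually inferred

    dtype = df.schema["item"].dataType
    items = col("item") if isinstance(dtype, ArrayType) else array(col("item"))
    df_exploded = df.withColumn("item", explode(items))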

Vertica data into PySpark throws "Failed to find data source"

I have Spark 3.2 and Vertica 9.2. spark = SparkSession.builder.appName("Ukraine").master("local[*]").config("spark.jars", '/home/shivamanand/spark-3.2.1-bin-hado…
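
"Failed to find data source" means the format's connector jar never made it onto the classpath: either the Vertica Spark connector has to be added via spark.jars, or plain JDBC with the Vertica JDBC driver works as a fallback. A sketch of the JDBC route (jar path, host, table, and credentials are placeholders):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("vertica-read")
             .config("spark.jars", "/path/to/vertica-jdbc.jar")  # hypothetical path
             .getOrCreate())

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:vertica://host:5433/dbname")
          .option("driver", "com.vertica.jdbc.Driver")
          .option("dbtable", "my_table")
          .option("user", "dbadmin")
          .option("password", "<password>")
          .load())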

Can you get the filename from input_file_name() in AWS S3 when using gzipped files?

I've been searching for an answer to this for quite a while now but can't seem to figure it out. I've read "Why is input_file_name() empty for S3 catalog sources" in…
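
input_file_name() is known to come back empty for some AWS Glue catalog sources; reading the S3 path directly with spark.read preserves it, and gzipped files are decompressed transparently along the way. A sketch (bucket and layout are placeholders):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import input_file_name

    spark = SparkSession.builder.getOrCreate()

    df = (spark.read
          .option("header", "true")
          .csv("s3://my-bucket/data/*.csv.gz")
          .withColumn("source_file", input_file_name()))

    df.select("source_file").distinct().show(truncate=False)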