Category "pyspark"

How to execute an Azure Databricks notebook from Excel

Is there any way to trigger an Azure Databricks notebook from Excel? If there is, please help me with how. Many thanks
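
There is no direct Excel-to-Databricks hook, but one common approach is to wrap the notebook in a Databricks job and trigger it over the Jobs REST API, which Excel can call from VBA or a small helper script. A minimal Python sketch, assuming such a job already exists (the host, token, and job ID below are placeholders):

    import requests

    DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
    TOKEN = "<personal-access-token>"  # placeholder
    JOB_ID = 123  # a Databricks job configured to run the notebook

    # Trigger the job via the Jobs 2.1 run-now endpoint.
    resp = requests.post(
        f"{DATABRICKS_HOST}/api/2.1/jobs/run-now",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"job_id": JOB_ID},
    )
    resp.raise_for_status()
    print(resp.json())  # contains the run_id of the triggered run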

ImportError: No module named numpy on spark workers

Launching pyspark in client mode: bin/pyspark --master yarn-client --num-executors 60. Importing numpy in the shell goes fine, but it fails in the KMeans step. Somehow…
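
The usual cause is that the executors' Python has no numpy installed: importing it in the driver shell proves nothing about the workers. A sketch of pointing every process at an environment that has numpy and smoke-testing the workers; the interpreter path is an assumption:

    import os
    from pyspark import SparkConf, SparkContext

    # Must be set before the context starts; the path must exist on every node.
    os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"  # hypothetical path with numpy

    sc = SparkContext.getOrCreate(SparkConf().setAppName("numpy-check"))

    # Try importing numpy inside each executor partition.
    def has_numpy(_):
        try:
            import numpy  # noqa: F401
            return ["ok"]
        except ImportError as e:
            return [str(e)]

    print(sc.parallelize(range(4), 4).mapPartitions(has_numpy).collect())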

DataFrame empty check in PySpark

I am trying to check if a DataFrame is empty in PySpark using the code below: print(df.head(1).isEmpty). But I am getting an error: AttributeError: 'list' object has no…
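
df.head(1) returns a plain Python list, which has no isEmpty attribute; that method belongs to the Scala/RDD APIs. A minimal fix:

    # Cheap: fetches at most one row, then checks the list's length.
    is_empty = len(df.head(1)) == 0

    # Equivalent alternative via the underlying RDD:
    is_empty = df.rdd.isEmpty()

    # Spark 3.3+ also has a built-in:
    # is_empty = df.isEmpty()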

How to read data from multiple folders in ADLS into a Databricks DataFrame

The file path format is data/<year>/<weeknumber>/<day>/data_<hour>.parquet, e.g. data/2022/05/01/00/data_00.parquet, data/2022/05/01/01/data_01.parquet, data/2022/05/01/02/da…
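
Spark's path globbing handles this kind of nested layout directly; each * matches one directory level. A sketch, assuming the ADLS container is mounted or otherwise reachable at the data/ root:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import input_file_name

    spark = SparkSession.builder.getOrCreate()

    # One read across all days and hours of week 05 of 2022.
    df = spark.read.parquet("data/2022/05/*/*/data_*.parquet")

    # Optionally keep track of which file each row came from.
    df = df.withColumn("source_file", input_file_name())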

Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext

Anyone know why I keep getting this error in Jupyter Notebooks? I've been trying to load my TensorFlow model into Apache Spark via SparkFlow, but I can't seem to…
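
This Py4J error usually means the JVM-side SparkContext failed to initialize, and in notebooks the most common reason is that a context already exists in the kernel. A sketch of the usual workaround: reuse the existing context instead of constructing a new one, or restart the kernel if it is wedged:

    from pyspark import SparkContext
    from pyspark.sql import SparkSession

    # Reuse the running context rather than creating a second one.
    sc = SparkContext.getOrCreate()
    spark = SparkSession.builder.appName("sparkflow-demo").getOrCreate()

    # If the kernel is in a bad state, stop cleanly and start over:
    # spark.stop()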

Coding reduceByKey(lambda) in map doesn't work in PySpark

I can't understand why my code isn't working. The last line is the problem: import findspark; findspark.init(); from pyspark import SparkConf, SparkContext; from p…
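
Without the full snippet it is hard to be sure, but a frequent mistake is calling reduceByKey inside map. reduceByKey is an RDD method applied after each element is mapped to a (key, value) pair. A minimal working pattern:

    import findspark
    findspark.init()

    from pyspark import SparkConf, SparkContext

    sc = SparkContext.getOrCreate(
        SparkConf().setMaster("local[*]").setAppName("reduce-by-key-demo"))

    # Map each word to (word, 1), then reduce by key to count.
    pairs = (sc.parallelize(["a b a", "b c"])
               .flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1)))
    counts = pairs.reduceByKey(lambda a, b: a + b)
    print(counts.collect())  # e.g. [('a', 2), ('b', 2), ('c', 1)] (order may vary)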

Using monotonically_increasing_id() to assign row numbers to a PySpark DataFrame

I am using monotonically_increasing_id() to assign a row number to a PySpark DataFrame with the syntax below: df1 = df1.withColumn("idx", monotonically_increasing_id(…
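
Worth knowing: monotonically_increasing_id() is unique and increasing but not consecutive, because the partition ID is encoded in the upper bits, so values jump between partitions. For a consecutive 1..N index, row_number() over a window is the usual substitute. A sketch, where "order_col" is a placeholder for whatever defines row order (note a single global window pulls all rows into one partition):

    from pyspark.sql.functions import row_number
    from pyspark.sql.window import Window

    w = Window.orderBy("order_col")  # hypothetical ordering column
    df1 = df1.withColumn("idx", row_number().over(w))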

PySpark wordcount: sort by value

I'm learning PySpark and trying the code below. Can someone help me understand what's wrong? >>> pairs = data.flatMap(lambda x: x.split(' ')).map(lambda x…
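
To sort a word count by its value rather than its key, sortBy with the count as the sort key is the usual approach. A sketch, assuming data is an RDD of text lines:

    counts = (data.flatMap(lambda line: line.split(" "))
                  .map(lambda word: (word, 1))
                  .reduceByKey(lambda a, b: a + b))

    # Sort by the count (element [1] of each pair), highest first.
    by_value = counts.sortBy(lambda kv: kv[1], ascending=False)
    print(by_value.take(10))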

How can I access a Python variable in Spark SQL?

I have a Python variable created under %python in my Jupyter notebook file in Azure Databricks. How can I access the same variable to make comparisons under %sql?
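
A %sql cell cannot see Python locals directly, so the value has to be passed across. Two common options, sketched below using the spark session Databricks predefines (my_table is a placeholder):

    # Option 1: build and run the query from a %python cell.
    threshold = 100
    result = spark.sql(f"SELECT * FROM my_table WHERE amount > {threshold}")

    # Option 2: expose the value where %sql can reach it, e.g. a one-row temp view.
    spark.createDataFrame([(threshold,)], ["threshold"]) \
         .createOrReplaceTempView("params")
    # Then in a %sql cell:
    #   SELECT * FROM my_table WHERE amount > (SELECT threshold FROM params)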

Pyspark-pandas not working on Spark 3.1.2

I am using Spark 3.1.2 and attempting to use pyspark-pandas. However, when attempting from pyspark import pandas as ps, I am getting the following error: ImportEr…
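
pyspark.pandas was merged into Spark in 3.2.0, so the import cannot work on 3.1.2. On 3.1.x the same API is available from the standalone Koalas package (pip install koalas); a sketch:

    import databricks.koalas as ks

    kdf = ks.DataFrame({"a": [1, 2, 3]})
    print(kdf.head())

    # After upgrading to Spark >= 3.2, switch to the built-in module:
    # from pyspark import pandas as ps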

How can you parse a string that is JSON from an existing temp table using PySpark?

I have an existing Spark DataFrame with columns like this:

    pid | response
    ----|-----------------
    12  | {"status":"200"}

response is a st…
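
from_json with an explicit schema is the standard way to turn such a string column into queryable fields. A sketch, assuming the temp table is named "responses" and the payload only carries "status" (extend the schema as needed):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.getOrCreate()
    df = spark.table("responses")  # hypothetical temp table name

    schema = StructType([StructField("status", StringType())])
    parsed = df.withColumn("response_parsed", from_json(col("response"), schema))
    parsed.select("pid", col("response_parsed.status")).show()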

Where to set the S3 configuration in Spark locally?

I've set up a Docker container that starts a Jupyter notebook using Spark. I've integrated the necessary jars into Spark's directory to be able to access…
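
With the hadoop-aws jars in place, the s3a settings can be passed straight into the session builder, so nothing needs to live in core-site.xml. A sketch for local use (keys, endpoint, and bucket are placeholders):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("s3-local")
             .config("spark.hadoop.fs.s3a.access.key", "<access-key>")
             .config("spark.hadoop.fs.s3a.secret.key", "<secret-key>")
             .config("spark.hadoop.fs.s3a.endpoint", "s3.amazonaws.com")
             .getOrCreate())

    df = spark.read.parquet("s3a://my-bucket/path/")  # hypothetical bucket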

How to split a list into multiple columns in PySpark?

I have:

    key | value
    ----|--------
    a   | [1,2,3]
    b   | [2,3,4]

I want:

    key | value1 | value2 | value3
    ----|--------|--------|-------
    a   | 1      | 2      | 3
    b   | 2      | 3      | 4

It seems that in Scala I can wr…
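
In PySpark the same thing falls out of indexing the array column. A sketch, assuming df is the frame above and "value" always has three elements:

    from pyspark.sql.functions import col

    df2 = df.select(
        "key",
        col("value")[0].alias("value1"),
        col("value")[1].alias("value2"),
        col("value")[2].alias("value3"),
    )
    df2.show()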

pickle.PicklingError: args[0] from __newobj__ args has the wrong class with hadoop python

I am trying to delete stop words via Spark; the code is as follows: from nltk.corpus import stopwords; from pyspark.context import SparkContext; from…
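
The pickling error typically comes from closing over NLTK's lazy corpus loader inside a Spark lambda; it does not serialize cleanly. The usual workaround is to materialize the stop words into a plain set on the driver and broadcast that. A sketch, assuming the NLTK stopwords corpus is downloaded on the driver:

    from pyspark.sql import SparkSession
    from nltk.corpus import stopwords

    sc = SparkSession.builder.getOrCreate().sparkContext

    # A plain Python set pickles fine; broadcast it once to all workers.
    stops = sc.broadcast(set(stopwords.words("english")))

    lines = sc.parallelize(["this is a test", "spark removes the stop words"])
    filtered = lines.map(lambda line: [w for w in line.split() if w not in stops.value])
    print(filtered.collect())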

Airflow/Luigi for AWS EMR automatic cluster creation and pyspark deployment

I am new to Airflow automation. I don't know if it is possible to do this with Apache Airflow (or Luigi, etc.) or whether I should just write a long bash file to do it. I…
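
This is what the Amazon provider's EMR operators are for, so a DAG is a better fit than a long bash file. A sketch, assuming apache-airflow-providers-amazon on Airflow 2.x; the cluster config and the S3 script path are placeholders to fill in:

    from datetime import datetime
    from airflow import DAG
    from airflow.providers.amazon.aws.operators.emr import (
        EmrCreateJobFlowOperator,
        EmrAddStepsOperator,
        EmrTerminateJobFlowOperator,
    )

    JOB_FLOW_OVERRIDES = {  # minimal cluster config; adjust to taste
        "Name": "pyspark-cluster",
        "ReleaseLabel": "emr-6.9.0",
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [{"InstanceRole": "MASTER",
                                "InstanceType": "m5.xlarge",
                                "InstanceCount": 1}],
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

    SPARK_STEPS = [{
        "Name": "run-pyspark",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/job.py"],  # hypothetical script
        },
    }]

    with DAG("emr_pyspark", start_date=datetime(2024, 1, 1), schedule=None) as dag:
        create = EmrCreateJobFlowOperator(task_id="create_cluster",
                                          job_flow_overrides=JOB_FLOW_OVERRIDES)
        add_steps = EmrAddStepsOperator(task_id="submit_pyspark",
                                        job_flow_id=create.output,
                                        steps=SPARK_STEPS)
        terminate = EmrTerminateJobFlowOperator(task_id="terminate_cluster",
                                                job_flow_id=create.output)
        create >> add_steps >> terminate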

AttributeError: Can't get attribute '_fill_function' on <module 'pyspark.cloudpickle' from 'pyspark/cloudpickle/__init__.py'>

While executing PySpark code from a script, I get the following error on df.show(): from pyspark.sql.types import StructType, StructField, StringType, IntegerTy…
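
This AttributeError usually indicates a PySpark version mismatch between driver and executors: cloudpickle's internal _fill_function moved between releases, so pickles produced by one version fail to load in another. A quick check that compares versions on both sides:

    import pyspark
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    print("driver:", pyspark.__version__)

    def worker_version(_):
        import pyspark
        return [pyspark.__version__]

    print("workers:", set(spark.sparkContext.parallelize(range(2), 2)
                               .mapPartitions(worker_version).collect()))
    # Fix: install the same pyspark everywhere, or point PYSPARK_PYTHON
    # at an environment with the matching version.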

PySpark error: AnalysisException: 'Cannot resolve column name

I am trying to transform an entire df to a single vector column using df_vec = vectorAssembler.transform(df.drop('col200')). I am being thrown this error: F…
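
A likely cause is that the assembler's inputCols still lists 'col200' (or some other name missing from the dropped frame). Deriving inputCols from the actual columns avoids the mismatch; a sketch:

    from pyspark.ml.feature import VectorAssembler

    features = [c for c in df.columns if c != "col200"]
    assembler = VectorAssembler(inputCols=features, outputCol="features")
    df_vec = assembler.transform(df.drop("col200"))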

pyspark SQL cannot resolve 'explode()' due to data type mismatch

Running a PySpark script, I get the following error depending on which XML I query: cannot resolve 'explode(...)' due to data type mismatch. The PySpark code: fr…
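
explode() only accepts array or map columns, and XML sources infer an element that occurs once per record as a struct rather than an array, which is why the error depends on the input file. A sketch that handles both shapes, with df as the frame read from XML and "item" standing in for the real column name:

    from pyspark.sql.functions import explode, col, array
    from pyspark.sql.types import ArrayType

    df.printSchema()  # check what was actually inferred

    dtype = df.schema["item"].dataType
    items = col("item") if isinstance(dtype, ArrayType) else array(col("item"))
    df_exploded = df.withColumn("item", explode(items))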

Vertica data into PySpark throws "Failed to find data source"

I have Spark 3.2 and Vertica 9.2. spark = SparkSession.builder.appName("Ukraine").master("local[*]").config("spark.jars", '/home/shivamanand/spark-3.2.1-bin-hado…
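
"Failed to find data source" means the format's connector jar never made it onto the classpath: either the Vertica Spark connector has to be added via spark.jars, or plain JDBC with the Vertica JDBC driver works as a fallback. A sketch of the JDBC route (jar path, host, table, and credentials are placeholders):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("vertica-read")
             .config("spark.jars", "/path/to/vertica-jdbc.jar")  # hypothetical path
             .getOrCreate())

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:vertica://host:5433/dbname")
          .option("driver", "com.vertica.jdbc.Driver")
          .option("dbtable", "my_table")
          .option("user", "dbadmin")
          .option("password", "<password>")
          .load())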

Can you get the filename from input_file_name() in AWS S3 when using gzipped files?

I've been searching for an answer to this for quite a while now but can't seem to figure it out. I've read "Why is input_file_name() empty for S3 catalog sources" in…
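
input_file_name() is known to come back empty for some AWS Glue catalog sources; reading the S3 path directly with spark.read preserves it, and gzipped files are decompressed transparently along the way. A sketch (bucket and layout are placeholders):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import input_file_name

    spark = SparkSession.builder.getOrCreate()

    df = (spark.read
          .option("header", "true")
          .csv("s3://my-bucket/data/*.csv.gz")
          .withColumn("source_file", input_file_name()))

    df.select("source_file").distinct().show(truncate=False)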