Category "apache-spark"

Bucketing with QuantileDiscretizer using the groupBy function in PySpark

I have a large dataset like so:
+-------+------+
| SEQ_ID|RESULT|
+-------+------+
|3462099|239.52|
|3462099|239.66|
|3462099|239.63|
|3462099|239.64|
|3462099|239.57|
|3462099|
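
A minimal PySpark sketch of both readings of the question, assuming a SparkSession named spark; the 4-bucket count and the sample rows are illustrative. QuantileDiscretizer buckets over the whole column, while ntile over a window buckets within each SEQ_ID group:

    from pyspark.sql import functions as F, Window
    from pyspark.ml.feature import QuantileDiscretizer

    df = spark.createDataFrame(
        [(3462099, 239.52), (3462099, 239.66), (3462099, 239.63)],
        ["SEQ_ID", "RESULT"],
    )

    # Global quantile buckets computed over the whole RESULT column.
    discretizer = QuantileDiscretizer(numBuckets=4, inputCol="RESULT", outputCol="RESULT_BUCKET")
    global_buckets = discretizer.fit(df).transform(df)

    # Per-group buckets: ntile(4) splits each SEQ_ID's rows into quartiles.
    w = Window.partitionBy("SEQ_ID").orderBy("RESULT")
    per_group = df.withColumn("RESULT_BUCKET", F.ntile(4).over(w))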

Spark illegal character in path

I am trying to start up Spark on my machine, but when I try to launch it using spark-shell I get an error that there is an illegal character in the path. Caused by
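
On Windows this error often traces back to a URI-unfriendly default for the warehouse directory; a hedged sketch of one common workaround (the path itself is an assumption):

    from pyspark.sql import SparkSession

    # Point the warehouse at an explicit file:/// URI containing no spaces or
    # other characters that are illegal in a URI (illustrative path).
    spark = (
        SparkSession.builder
        .config("spark.sql.warehouse.dir", "file:///C:/tmp/spark-warehouse")
        .getOrCreate()
    )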

Structured Streaming to Save JSON to HDFS

My Structured Spark Streaming program reads JSON data from Kafka and writes it to HDFS in JSON format. I am able to save JSON to HDFS but it saves the JSON st
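
A minimal sketch of the usual pattern, assuming a SparkSession named spark; the broker, topic, schema, and paths are placeholders. Kafka's value column is binary, so it has to be cast and parsed with from_json before the JSON sink, otherwise the sink persists the raw JSON string as a single column:

    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType

    # Illustrative schema; replace with the real shape of the Kafka messages.
    schema = StructType([StructField("id", StringType()), StructField("body", StringType())])

    raw = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "host:9092")  # placeholder
        .option("subscribe", "mytopic")                  # placeholder
        .load()
    )

    # Cast the binary value to string and parse it into typed columns.
    parsed = raw.select(F.from_json(F.col("value").cast("string"), schema).alias("j")).select("j.*")

    query = (
        parsed.writeStream.format("json")
        .option("path", "hdfs:///data/out")               # placeholder
        .option("checkpointLocation", "hdfs:///chk/out")  # placeholder
        .start()
    )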

How to convert a SQL Server HASHBYTES string to its Spark equivalent

I have a process using the following SELECT statement in SQL Server:
SELECT HASHBYTES('SHA1', CAST('4100119300' AS NVARCHAR(100))) AS StringConverted
This give
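
A hedged PySpark equivalent, assuming a SparkSession named spark: HASHBYTES over an NVARCHAR hashes UTF-16LE bytes, while Spark's sha1 on a plain string hashes UTF-8 bytes, so the encoding must be made explicit (sha1 returns lowercase hex without SQL Server's 0x prefix):

    from pyspark.sql import functions as F

    df = spark.createDataFrame([("4100119300",)], ["s"])

    # Encode to UTF-16LE first so the hashed bytes match CAST(... AS NVARCHAR(100)).
    out = df.select(F.upper(F.sha1(F.encode(F.col("s"), "UTF-16LE"))).alias("StringConverted"))
    out.show(truncate=False)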

Error while using readStream from Delta on Azure Data Lake Gen2

I get the below error while reading data from Delta Lake. The detailed log on Azure shows it is failing to read a .tmp file from the _delta_log folder. I have tried
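
For reference, the baseline streaming-read pattern the error occurs in, with an illustrative ADLS Gen2 path; this sketch shows the expected setup rather than a fix for the _delta_log .tmp failure itself:

    # Assumes a SparkSession named spark with the Delta Lake package configured.
    stream = (
        spark.readStream.format("delta")
        .load("abfss://container@account.dfs.core.windows.net/delta/events")  # placeholder path
    )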

Does AWS Glue support positional arguments

How can I capture a Glue job's arguments by position rather than using the getResolvedOptions function and passing the arguments as key-value pairs?
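
getResolvedOptions only understands named --key value pairs, but a Glue Python job still receives the raw argument vector, so positional access through sys.argv is one hedged option (the index positions are assumptions, since Glue injects arguments of its own):

    import sys

    # Inspect the full vector first; Glue prepends and injects its own arguments,
    # so fixed positions should be verified before being relied on.
    print(sys.argv)
    first_arg = sys.argv[1] if len(sys.argv) > 1 else None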

What is "est" in a filter in the Spark UI SQL tab

I am trying to debug my Spark job, and in the SQL tab of the Spark UI I am getting this red mark on a filter's description; I am trying to figure out what it means. Spark UI s

Reading from Elasticsearch with PySpark fails with exception java.lang.NoClassDefFoundError: org/apache/commons/httpclient/protocol/ProtocolSocketFactory

I have a Spark cluster in Kubernetes based on the image mcr.microsoft.com/mmlspark/spark2.4:v4. Spark version 2.4.0, using Scala version 2.11.12, OpenJDK
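
The missing class ships in the old Apache commons-httpclient 3.x artifact, which the ES-Hadoop connector depends on; one hedged fix is pulling it in via spark.jars.packages (the Maven coordinate below is the standard one, but checking it against your connector version is still advisable):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .config("spark.jars.packages", "commons-httpclient:commons-httpclient:3.1")
        .getOrCreate()
    )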

SQL order of execution

I wonder how this query executes successfully. As we know, the HAVING clause is evaluated before the SELECT one, so how can an alias name defined in the SELECT statement
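
A sketch of why this can work in Spark SQL (table and column names are made up): although HAVING is logically evaluated before SELECT, Spark resolves a name it cannot find among the grouping columns against the SELECT-list aliases, so a query like the one below succeeds:

    # Assumes a SparkSession named spark and a registered table "employees".
    spark.sql("""
        SELECT dept, count(*) AS cnt
        FROM   employees
        GROUP  BY dept
        HAVING cnt > 10
    """).show()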

PySpark's df.writeStream generates no output

I'm trying to store the tweets from my Kafka cluster into Elasticsearch. Initially, I set the output format to 'org.elasticsearch.spark.sql'. But it creat
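
A hedged sketch of the ES sink wiring, assuming a streaming DataFrame named parsed (node address, index name, and checkpoint path are assumptions); a missing checkpointLocation is a common reason a streaming query silently emits nothing:

    query = (
        parsed.writeStream
        .format("org.elasticsearch.spark.sql")
        .option("es.nodes", "localhost:9200")            # placeholder
        .option("checkpointLocation", "/tmp/es-checkpoint")
        .start("tweets")  # target index
    )
    query.awaitTermination()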

Scala Spark partitionBy and get current partition name

I'm using Scala Spark and have a DataFrame:
Source | Column1 | Column2
A      | ...     | ...
B      | ...     | ...
B      | ...     | ...
C      | ...
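
Shown as a PySpark sketch with placeholder paths (the same API exists in Scala): with partitionBy, the "partition name" is just the column value encoded into the directory, recoverable on read, while spark_partition_id() gives the task-level partition index rather than a name:

    from pyspark.sql import functions as F

    # Writes Source=A/, Source=B/, ... directories; the value is the partition name.
    df.write.partitionBy("Source").parquet("/out/path")

    # Reading back restores Source as a column; input_file_name() exposes the file.
    back = spark.read.parquet("/out/path").withColumn("file", F.input_file_name())

    # The shuffle-level notion of "current partition" is an index, not a name.
    with_pid = df.withColumn("pid", F.spark_partition_id())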

Read and group JSON files by date element using PySpark

I have multiple JSON files (~10 TB) in an S3 bucket, and I need to organize these files by a date element present in every JSON document. What I think that my c
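
One common approach, sketched with placeholder paths and field name: read everything and rewrite it partitioned by the date element, so the target bucket ends up with one prefix per date (at 10 TB, parallelism and per-file record limits would also need tuning):

    # Assumes a SparkSession named spark and a top-level "date" field.
    df = spark.read.json("s3a://source-bucket/raw/")
    df.write.partitionBy("date").mode("overwrite").json("s3a://target-bucket/by-date/")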

Spark 3.0 timestamp parsing doesn't work even after passing the format

This is an issue I am facing with Spark 3.0; it worked before without even specifying a format. Now I have tried explicitly specifying the format, but it still doesn't
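
Spark 3.0 switched to new DateTimeFormatter-style patterns, so formats that worked in 2.x can now fail or return null. Two hedged options, with an illustrative column name and pattern:

    from pyspark.sql import functions as F

    # Option 1: revert to the 2.x parser behaviour.
    spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

    # Option 2: use a pattern valid for the new parser (format is illustrative).
    df = df.withColumn("ts", F.to_timestamp("ts_str", "MM/dd/yyyy HH:mm:ss"))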

Efficient way to parse a file with different JSON schemas in Spark

I am trying to find the best way to parse a JSON file with an inconsistent schema (but the schema of any given type is known and consistent) in Spark, in order to sp
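
One workable pattern, sketched with an assumed discriminator field "type" and a made-up per-type schema: read records as raw text, route them by type, then apply the known schema with from_json:

    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType

    raw = spark.read.text("s3a://bucket/mixed/*.json")  # placeholder path

    # Illustrative schema for records of type "A".
    schema_a = StructType([StructField("type", StringType()), StructField("a", StringType())])

    type_a = (
        raw.filter(F.get_json_object("value", "$.type") == "A")
           .select(F.from_json("value", schema_a).alias("j"))
           .select("j.*")
    )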

Could not create lake database from Synapse notebooks

New to Azure Synapse; I am trying to create a database (managed table) from a Synapse notebook. I also added the Storage Blob Data Contributor role for the Synapse workspace and spec
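
For reference, a minimal sketch of the notebook-side calls (database and table names are placeholders); if these fail with an authorization error, the role assignment may not have propagated yet, or it may also need to cover the workspace's managed identity:

    spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")
    df.write.mode("overwrite").saveAsTable("demo_db.my_table")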

Spark History Server not listing completed jobs

I'm running Spark standalone jobs on Windows. I would like to monitor my Spark jobs using the Spark History Server. I have launched the history server with be
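
For completed applications to show up, the jobs themselves must write event logs to the directory the history server reads; a hedged sketch with Windows-flavoured placeholder paths (spark.history.fs.logDirectory on the server side must point at the same location):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .config("spark.eventLog.enabled", "true")
        .config("spark.eventLog.dir", "file:///C:/tmp/spark-events")  # placeholder
        .getOrCreate()
    )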

Spark RDD: Find the single row that has the highest count and for that row report the month, count and hashtag name. Output using println

[Spark RDD] Find the single row that has the highest count and, for that row, report the month, count and hashtag name. Print the result to the terminal output us
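
A minimal PySpark sketch, assuming an RDD of (hashtag, month, count) tuples and a SparkContext named sc; rdd.max with a key function finds the highest-count row in one pass (print is the PySpark counterpart of the requested println):

    # Illustrative data standing in for the real hashtag counts.
    rdd = sc.parallelize([("#spark", "Jan", 120), ("#scala", "Feb", 300), ("#python", "Mar", 250)])

    hashtag, month, count = rdd.max(key=lambda row: row[2])
    print(month, count, hashtag)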

Unable to write data using spark-submit

When I'm doing spark-submit using this command on Cloudera:
time spark-submit \
  --deploy-mode client \
  --conf spark.app.name='XXXxxxxxx' \
  --conf spark.master=l

Random sampling based on one column after groupBy

I have a Spark table which contains 400+ million records/rows. I used spark.table to convert it into a DataFrame. The DF looks like this:
id  pub_date
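
sampleBy does stratified sampling keyed on one column, which matches "random sampling based on one column after groupBy"; the rates and the exact-count alternative below are illustrative:

    from pyspark.sql import functions as F, Window

    # Approximate: sample each id at its own rate (collecting distinct keys of a
    # 400M-row table can itself be expensive if there are many ids).
    fractions = {row["id"]: 0.01 for row in df.select("id").distinct().collect()}
    sample = df.sampleBy("id", fractions=fractions, seed=42)

    # Exact N per group: rank rows randomly within each id and keep the first N.
    w = Window.partitionBy("id").orderBy(F.rand(seed=42))
    exact = df.withColumn("rn", F.row_number().over(w)).filter(F.col("rn") <= 100).drop("rn")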

How can we use the multimap_agg function in Spark SQL, and is there an equivalent or alternative function?

Can anyone explain how the multimap_agg function works in SQL and whether it can be used in Spark SQL?
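
Spark SQL has no multimap_agg (it is a Presto/Trino aggregate), but its map<key, array<value>> result can be rebuilt from collect_list and map_from_entries; a sketch with made-up table and column names, runnable via spark.sql:

    result = spark.sql("""
        SELECT id,
               map_from_entries(collect_list(struct(k, vals))) AS mmap
        FROM (
            SELECT id, k, collect_list(v) AS vals
            FROM   t
            GROUP  BY id, k
        ) grouped
        GROUP BY id
    """)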