Category "apache-spark"

How to fix "php spark serve" not working: error thrown in C:\xampp\htdocs\ci-news\system\CLI\CLI.php on line 758

This is what cmd said, and I don't know how to fix it. I saw similar cases on Stack Overflow, but their suggestions didn't fix my problem. I hope you

How to connect Spark workers to the Spark driver in Kubernetes (standalone cluster)

I created a Dockerfile with just Debian and Apache Spark downloaded from the main website. I then created a Kubernetes deployment to have 1 pod running spark dr

Databricks Runtime 10.4 LTS - AnalysisException: No such struct field id in 0, 1 after upgrading

We are working to migrate to Databricks Runtime 10.4 LTS from 9.1 LTS, but we're running into weird behavioral issues. Our existing code works up until runtime

Why compute row_number() order by monotonically_increasing_id() in Spark?

It is suggested that you can 'generate unique increasing numeric values' by select row_number() over (order by monotonically_increasing_id()) from /* ... */ Bu

Unable to open or query .parquet files due to corrupted column

I am sending JSON telemetry data from Azure Stream Analytics to Azure Data Lake Gen2 serialized as .parquet files. From the data lake I've then created a view i

Programmatic way to find the cluster version from CDSW - Cloudera Data Science Workbench

Is there any programmatic way to find out the cluster version (CDH6 or CDP7) from a CDSW session? Could any environment variable give a fool-proof way to determi

Update a highly nested column from string to struct

 |-- x: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- y: long (nullable = true)
 |    |    |-- z: array (nullable = tru

Spark - Update a nested column to string

 |-- x: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- y: struct (nullable = true)
 |    |    |-- z: struct (nullable =

Spark writes to CSV/Hive take too much time: performance benchmark

I am having a very simple problem with Spark, but there is very little information on the web. I have encountered this problem using both PySpark and Scala. The

Solr distinct query: I want only certain fields to be listed

I want to find the number of unique records based on myparam value. Solr distinct query I want only certain fields to be listed. too many ifs in the distinctVal

PySpark - explode returns an empty DataFrame when a nested collection has no items

I have the following dataframe
+---------------+--------+
|book_id        |Chapters|
+---------------+--------+
|865731         |[]      |
+---------------+----

AttributeError: 'AioClientCreator' object has no attribute '_register_lazy_block_unknown_fips_pseudo_regions'

Recently, I have started to use the AWS platform, but when trying to use SageMaker, I get the following error, and I don't know if it is because of SageMaker or

Reuse Spark Session Across Modules/Packages

We are building a reusable data framework using PySpark. As part of this, we had built one big utilities package that hosted all the methods. But now, we are pl

Spark agg with multiple collect_list calls: can we guarantee that the index of multiple columns in the same row is the same?

Is it possible to ensure that the value at the same index of each Collect_set is on a single line of the original dataframe? ("a",1) ,("b",2)

pyspark.sql.functions.lit() not nullable conversion [duplicate]

As I create a new column with F.lit(1), while calling printSchema() I get column_name: integer (nullable = false) as lit function docs is qui

Calculate a sequence of Markov chain values

I have a Spark question, so for the input for each entity k I have a sequence of probability p_i with a value associated v_i, for example the data can look like

Show method for DynamicFrame in AWS Glue returns empty field

When I try to use the dyF.show() it returns an empty field, even though I checked the schema and count() and I know the table is populated. I transformed it int

Generate Kafka message with Headers using Apache Spark

I have an ETL (spark-scala). After writing in a table, a message with "header" must be sent to Kafka. I couldn't add the header in the message. I have a spark D

How to get list of all leaf folders from ADLS Gen2 path via Scala code?

We have folders and subfolders with year, month, and day folders in them. How can we get only the last leaf-level folder list using the dbutils.fs.ls utility? Exampl

How to pass Cardinality Threshold value for Histogram in Deequ package?

By default the variable DEFAULT_CARDINALITY_THRESHOLD is set to 120 in Deequ. This is very low for our use case. Can anyone please suggest if we can set this va