This is what cmd said, and I don't know how to fix it. I saw similar cases like this on Stack Overflow, but their suggestions didn't fix my problem. I hope you can help.
I created a Dockerfile with just Debian and Apache Spark downloaded from the main website. I then created a Kubernetes deployment to have one pod running the Spark driver.
We are working to migrate from Databricks Runtime 9.1 LTS to 10.4 LTS, but we're running into odd behavioral issues. Our existing code works up until the runtime upgrade.
It is suggested that you can 'generate unique increasing numeric values' with select row_number() over (order by monotonically_increasing_id()) from /* ... */. But ...
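A minimal PySpark sketch of that pattern, assuming a hypothetical single-column DataFrame df; note that an unpartitioned window pulls every row onto one executor, so row_number() here is mostly advisable for modest data sizes:

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("b",), ("c",)], ["value"])  # hypothetical input

# monotonically_increasing_id() is unique and increasing but not consecutive;
# numbering the rows ordered by it yields consecutive values 1, 2, 3, ...
w = Window.orderBy(F.monotonically_increasing_id())
df.withColumn("row_num", F.row_number().over(w)).show()
```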
I am sending JSON telemetry data from Azure Stream Analytics to Azure Data Lake Storage Gen2, serialized as .parquet files. From the data lake, I've then created a view in ...
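As a rough sketch of that last step, assuming a Spark-based view and a hypothetical abfss:// location (account, container, and folder names are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder ADLS Gen2 path written by Stream Analytics.
path = "abfss://telemetry@myaccount.dfs.core.windows.net/parquet/"

# Read the .parquet output and expose it as a queryable view.
df = spark.read.parquet(path)
df.createOrReplaceTempView("telemetry")
spark.sql("SELECT COUNT(*) AS n FROM telemetry").show()
```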
Is there any programmatic way to find out the cluster version (CDH6 or CDP7) from a CDSW session? Could any environment variable give a fool-proof way to determine it?
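I'm not aware of a single environment variable that's guaranteed across both platforms, but one hedged approach is to parse the output of hadoop version, whose build string differs between CDH6 (e.g. 3.0.0-cdh6.x) and CDP7:

```python
import subprocess

# `hadoop version` prints the distribution build string on its first line.
out = subprocess.run(["hadoop", "version"], capture_output=True, text=True).stdout
first_line = out.splitlines()[0] if out else ""

if "cdh6" in first_line.lower():
    print("CDH6 cluster:", first_line)
else:
    print("Likely CDP (inspect the build string):", first_line)
```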
|-- x: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- y: long (nullable = true)
|    |    |-- z: array (nullable = true)
|-- x: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- y: struct (nullable = true)
|    |    |-- z: struct (nullable = true)
I am having a very simple problem with Spark, but there is very little information on the web. I have encountered this problem using both PySpark and Scala.
I want to find the number of unique records based on the myparam value (a Solr distinct query), and I want only certain fields to be listed. There are too many ifs in the distinctVal ...
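A hedged sketch using Solr's JSON Facet API, whose unique() aggregation counts distinct values without any client-side ifs; the host, core, and field names are placeholders:

```python
import requests

url = "http://localhost:8983/solr/mycore/select"   # placeholder core
params = {
    "q": "*:*",
    "rows": 0,                                     # only the aggregate, no documents
    "fl": "id,myparam",                            # fields to list when rows > 0
    "json.facet": '{"distinct_myparam": "unique(myparam)"}',
}
resp = requests.get(url, params=params).json()
print(resp["facets"]["distinct_myparam"])          # number of unique myparam values
```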
I have the following dataframe:
+-------+--------+
|book_id|Chapters|
+-------+--------+
|865731 |[]      |
+-------+--------+
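For reference, a minimal sketch that rebuilds a frame in this shape and flags empty Chapters arrays; the element type is an assumption, since only [] is visible:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import ArrayType, LongType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("book_id", LongType()),
    StructField("Chapters", ArrayType(StringType())),  # assumed element type
])
df = spark.createDataFrame([(865731, [])], schema)

# size() is 0 for empty arrays (null arrays return null on Spark 3.x defaults).
df.withColumn("is_empty", F.size("Chapters") == 0).show()
```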
Recently, I started using the AWS platform, but when trying to use SageMaker I get the following error, and I don't know if it is because of SageMaker or something else.
We are building a reusable data framework using PySpark. As part of this, we had built one big utilities package that hosted all the methods. But now, we are planning to split it up.
Is it possible to ensure that the values at the same index of each collect_set come from a single row of the original dataframe? E.g. ("a", 1), ("b", 2).
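collect_set makes no ordering guarantee, so aligning two independently collected sets by index is unreliable; a common workaround is to collect structs so paired values travel together. A sketch with made-up column names:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("k1", "a", 1), ("k1", "b", 2)], ["key", "letter", "num"]  # made-up columns
)

# Each struct element comes from exactly one source row, unlike two separate
# collect_set columns whose internal orders can disagree.
df.groupBy("key").agg(
    F.collect_set(F.struct("letter", "num")).alias("pairs")
).show(truncate=False)
```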
When I create a new column with F.lit(1), calling printSchema() gives me column_name: integer (nullable = false); the lit function docs are quite unclear about nullability.
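A quick sketch of the behavior: a constant literal can never be null, so Spark marks the column nullable = false, whereas a null literal comes back nullable = true:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(3)

# Constant literal: nullable = false.
df.withColumn("one", F.lit(1)).printSchema()

# Null literal (cast to give it a concrete type): nullable = true.
df.withColumn("maybe", F.lit(None).cast("int")).printSchema()
```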
I have a Spark question: for the input, each entity k has a sequence of probabilities p_i, each with an associated value v_i; for example, the data can look like ...
When I try to use dyF.show(), it returns an empty field, even though I checked the schema and count() and I know the table is populated. I transformed it into a DataFrame as well.
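One common workaround in AWS Glue, assuming dyF is a DynamicFrame: convert it to a plain Spark DataFrame before displaying, since show() on the DataFrame is generally more dependable than DynamicFrame.show():

```python
# Assuming dyF is an AWS Glue DynamicFrame built earlier in the job.
df = dyF.toDF()              # convert to a standard Spark DataFrame
df.show(10, truncate=False)  # display actual rows

# Sanity checks straight on the DynamicFrame.
print(dyF.count())
dyF.printSchema()
```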
I have an ETL job (Spark/Scala). After writing to a table, a message with a "header" must be sent to Kafka. I couldn't add the header to the message. I have a Spark DataFrame ...
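Since Spark 3.0, the Kafka sink picks up an optional headers column typed array<struct<key: string, value: binary>>; a hedged PySpark sketch of the contract (the original job is Scala, but the column shape is the same; broker and topic are placeholders):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("k1", "payload")], ["key", "value"])

# Kafka headers: an array of (key: string, value: binary) structs.
with_headers = df.withColumn(
    "headers",
    F.array(F.struct(
        F.lit("source").alias("key"),
        F.lit("etl-job").cast("binary").alias("value"),
    )),
)

(with_headers
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "headers")
    .write.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
    .option("topic", "my-topic")                       # placeholder
    .save())
```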
We have folders, with year, month, and day subfolders nested inside. How can we get only the last (leaf-level) folder list using the dbutils.fs.ls utility? For example: ...
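A hedged sketch for Databricks: recurse with dbutils.fs.ls and keep only directories that have no sub-directories (dbutils is implicitly available in notebooks; the root path is a placeholder):

```python
def leaf_dirs(path):
    """Yield directories under `path` that contain no sub-directories."""
    subdirs = [f for f in dbutils.fs.ls(path) if f.isDir()]
    if not subdirs:
        yield path                     # no children: this is a leaf folder
    for d in subdirs:
        yield from leaf_dirs(d.path)

# Placeholder root with year/month/day nesting underneath.
for leaf in leaf_dirs("dbfs:/mnt/data/"):
    print(leaf)
```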
By default, the variable DEFAULT_CARDINALITY_THRESHOLD is set to 120 in Deequ. This is very low for our use case. Can anyone please suggest whether we can set this variable to a custom value?