Category "databricks"

Polymorphic data transformation techniques / data lake / big data

Background: We are working on a solution to ingest huge sets of telemetry data from various clients. The data is in XML format and contains multiple independent

Why is writing from a Databricks Spark notebook (Hadoop fileUtils) to a DBFS mount location 13 times slower than writing to the DBFS root location?

A Databricks notebook takes 2 hours to write to /dbfs/mnt (blob storage). The same job takes 8 minutes to write to /dbfs/FileStore. I would like to understand

Get previous row value based on a timestamp for matching IDs

I have a shipping table that has information about the arrival (country and date) at a port. Now I need to extract the country where it departed from usin

How to create a Delta table with invalid characters in column headers

Looking for a workaround to create a Delta table with invalid characters (). Below is the example: CREATE TABLE `validation_log` ( `Error_Description` STRING,

Scala bulk copy not working in Azure Databricks Runtime 7.3 LTS and above

My Scala code that used to work fine with Databricks Runtime 5.5 LTS is not working with Runtime 7.3 LTS and above. I have tried upgrading Microsoft libraries acc

Databricks Dashboard auto refresh

For Data Visualisation purposes, I am using Databricks to create dashboards. This is achieved by creating charts in a notebook and linking those charts to the d

How can we truncate and load documents into a Cosmos DB collection without dropping it in PySpark?

I have a monthly job in Databricks where I want to truncate all records for the previous month and then load the current month into Cosmos DB, so I tried with option("

Time series with Delta time travel in Databricks

I'm storing the prices of products in a Delta table. The schema of the table is like this:
id | price | updated
1 | 3 | 2022-03-21
2 | 4 | 2022-03-20

Issue with display()/collect() on a large DataFrame in PySpark

Getting the following issue in PySpark when performing a display()/collect() operation on a generated DataFrame. The df contains a single column & row (JSON d

Transpose a group of repeating columns in a large horizontal DataFrame into a new vertical DataFrame using Scala or PySpark in Databricks

Although this question may seem to have been answered previously, it has not. All transposing examples seem to relate to one column and pivoting the data in that column. I want to ma

What Happens When a Delta Table is Created in Delta Lake?

With the Databricks Lakehouse platform, it is possible to create 'tables' or, to be more specific, Delta tables using a statement such as the following: DROP TAB

MLflow webhook calling Azure DevOps pipeline - retrieve body

I am using the MLflow webhooks mentioned here. I am using them to queue an Azure DevOps pipeline. However, I can't seem to find a way to retrieve the paylo

How to process JSON data in a column using Python/PySpark?

Trying to process JSON data in a column on Databricks. Below is the sample data from a table (it's weather device record info) JSON_Info {"sampleData":"dataD

Can I iterate through the widgets in a Databricks notebook?

Can I iterate through the widgets in a Databricks notebook? Something like this pseudocode? # NB - not valid inputs = {widget.name: widget.value for widget in

Databricks Runtime 10.4 LTS - AnalysisException: No such struct field id in 0, 1 after upgrading

We are working to migrate to Databricks Runtime 10.4 LTS from 9.1 LTS but we're running into weird behavioral issues. Our existing code works up until runtime

Is it possible to connect to a serverless SQL pool via Azure Databricks?

I'm trying to connect to a Synapse serverless pool via Databricks. I need to create Synapse views and external tables directly in Databricks as part of an existin

ParseException: SQL CTE

result = aml_identity_g.connectedComponents() conn_comps = result.select("id", "component",'type') \ .createOrReplaceTempView("components") display(result)

Sort by key (Month) using RDDs in PySpark

I have this RDD and want to sort it by Month (Jan --> Dec). How can I do it in PySpark? Note: I don't want to use spark.sql or DataFrame. +-----+-----+ |Month|co

How to get list of all leaf folders from ADLS Gen2 path via Scala code?

We have folders and subfolders with year, month, and day folders in them. How can we get only the last leaf-level folder list using the dbutils.fs.ls utility? Exampl

Is it possible to persist .env values in the .whl files when installed on a Databricks cluster? I'd prefer to keep all values in library (.whl)

I have created a project in PyCharm. This project has a .py file with functions, init.py and a .env file with my secret values. I need to be able to run this in
