Category "databricks"

Pyspark select multiple columns from list and filter on different values

I have a table with ~5k columns and ~1 M rows that looks like this: ID Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8 Col9 Col10 Col11 ID1 0 1 0 1 0 2 1 1 2 2 0 ID2 1

Read and group json files by date element using pyspark

I have multiple JSON files (~10 TB) in an S3 bucket, and I need to organize these files by a date element present in every JSON document. What I think that my c

What is the best way to clean up and recreate a Databricks Delta table?

I am trying to clean up and recreate a Databricks Delta table for integration tests. I want to run the tests on a DevOps agent, so I am using JDBC (Simba driver) bu

How to give input to a prompt raised by a cell in a Databricks notebook?

As you can see the library I'm using is asking to make an entry but there's no box/window where I can make the entry. How do I make an entry here amongst y/n/u/

PySpark - Convert a heterogeneous JSON array to a Spark dataframe and flatten it

I have streaming data coming in as a JSON array and I want to flatten it out as a single row in a Spark dataframe using Python. Here is what the JSON data looks like

Azure Storage Account file details in a table in databricks

I am loading data via pipelines into an ADLS Gen2 container. Now I want to create a table that records when the pipeline started running and when it completed. l

Delete multiple rows from a delta table/pyspark data frame given a list of IDs

I need to find a way to delete multiple rows from a delta table/pyspark data frame given a list of IDs to identify the rows. As far as I can tell there isn't a

Invalid labels for classification logistic regression model in pyspark databricks

I am using the Spark ML library for a classification problem using logistic regression. I have vectorized input features and created training dataset and test datas

Error while running Scala code - Databricks 7.3LTS and above

I am running Databricks 7.3LTS and getting errors while trying to use Scala bulk copy. The error is: object sqldb is not a member of package I hav

Azure Databricks keep long-running notebook alive when closing browser

I am working with Azure Databricks jupyter notebooks and have time-consuming jobs (complex queries, model training, loops over many items, etc.). Every time I c

SonarQube Findbugs error and ##[error]java.lang.IllegalStateException: Can not execute Findbugs

I have been struggling with this for months. We are trying to scan Scala code, which is in Databricks, with SonarQube in Azure DevOps. We were getting around 30 errors. But

Azure Data Explorer (ADX) vs Polybase vs Databricks

Today I discovered another Azure service called Azure Data Explorer (ADX). Sorry for such a comparison of services; I have a good understanding of all exc

Airflow DAGS Orchestration

I have three DAGs (say, DAG1, DAG2 and DAG3). I have a monthly scheduler for DAG1. DAG2 and DAG3 must not be run directly (no scheduler for these) and must be r

Compare two tables having the same column name but different date column names

I have table A id1 dt x1 2022-04-10 a2 2022-04-10 a1 2022-04-10 x1 2022-05-10 x2 2022-04-10 y2 2022-04-10 y1 2022-05-10 x1 2022-06-10 Table B id1 dt a1 2022

Py4JJavaError in an Azure Databricks notebook pipeline

I have a curious issue, when launching a databricks notebook from a caller notebook through (I am working in Azure Databricks). One intere

Get the list of loaded files from Databricks Autoloader

We can use Autoloader to track which files have been loaded from an S3 bucket. My question about Autoloader: is there a way to read the Autoloader databa

Spark SQL - org.apache.spark.sql.AnalysisException

The error described below occurs when I run a Spark job on Databricks the second time (the first less often). The SQL query just performs create table as select

Using Databricks/Python3.x ZipFile to extract 7 GB file from zip

I've got a large NPI zipfile which includes a 7.3 GB CSV. (file can be located on NPI site here: -- the Full Replac

How to avoid zipfile error with python-pptx saving files

I am using the python-pptx package to create a number of .pptx files from a series of dataframes. All works well with adding slides and such until it comes time

Issues installing gdal-bin (libmysqlclient21 dependency) on 20.04.3 (databricks job clusters)

I've had, in the past, gdal utilities installed successfully on a Databricks Cluster running 20.04.3 LTS (focal). $ cat /etc/os-release NAME="Ubuntu" VERSION="2