How to create a variable from a SQL statement's output in Databricks

I would like to create a variable that I can use later in multiple SQL queries within the same notebook. The variable itself is simply an integer returned by a SQL statement such as:

    select count(people) from persons_table;

Then, in subsequent SQL command cells, I would simply reference the variable in calculations such as:

    select (count(cars) / $variable) as car_rate from table_x;

The only variables I've used in Databricks have been simple widgets created from a Python script (dbutils.widgets.text('name', 'value')).

I already have a workaround that writes the output to a temp table, but I would rather use a variable if that is possible.
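A rough sketch of what that kind of temp-view workaround looks like (the view name people_count_view is just illustrative, and the exact setup may differ):

    # Store the scalar in a temporary view so later SQL cells can read it.
    # The view name people_count_view is an assumption for illustration.
    spark.sql("""
        CREATE OR REPLACE TEMP VIEW people_count_view AS
        SELECT count(people) AS people_count FROM persons_table
    """)

    # A later query can then use a scalar subquery against the view:
    spark.sql("""
        SELECT count(cars) / (SELECT people_count FROM people_count_view) AS car_rate
        FROM table_x
    """).show()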



Solution 1:[1]

I found this while looking into the same problem. Since your question hasn't been answered yet, I'll post the approach I found for others. I'm not sure it is the best, but it works.

Use a Python notebook. Then you can use Python variables.

    peopleCount = spark.sql("select count(people) from persons_table").collect()[0][0]

spark.sql runs the SQL and returns a PySpark DataFrame; if you stopped there, the variable would just be a DataFrame object. collect materializes the result so the value can be read into the variable, and the [0][0] indexes the first row and first column, which you only need to worry about when the result has more than one row or column.
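A small variant of the same idea, assuming the same table, that avoids the positional [0][0] by aliasing the count and using first():

    # first() returns a single Row; indexing by the alias avoids positional access.
    row = spark.sql("select count(people) as people_count from persons_table").first()
    peopleCount = row["people_count"]  # same value as .collect()[0][0]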

Then, in the next cell:

    %sql
    select (count(cars) / '$peopleCount') as car_rate from table_x;

The %sql magic tells Databricks that this is a SQL cell, and the variable is referenced with a $ prefix inside quotes.
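If the plain $ substitution does not pick up a Python variable in your runtime, one alternative is to push the value into a widget first, which SQL cells can also read with $; the widget name people_count below is an assumption, not part of the original answer:

    # Push the Python value into a text widget so SQL cells can read it as $people_count.
    # The widget name people_count is illustrative; dbutils.widgets.text takes (name, defaultValue).
    dbutils.widgets.text("people_count", str(peopleCount))

A SQL cell can then reference $people_count in the same way as above.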

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: JonB65