'Indexes for Databricks (Spark SQL) tables

Curious as to how indexing works in Databricks. Can you see the partitioning as indexing because it effectively organizes the data in grouped subcategories?



Solution 1:[1]

Yes, partitioning could be seen as kind of index - it allows you to jump directly into necessary data without reading the whole dataset.

For databricks delta there is another feature - Data Skipping. When writing data to Delta, the writer is collecting statistics (for example, min & max values) for first N columns (32 by default) and write that statistics into Delta log, so when we filter data by indexed column, we know if given file may contain given data or not. Another indexing technique for databricks delta is bloom filtering that is shows if the specific value is definitely not in the file, or could be in the file.

Update 14.04.2022: Data Skipping is also available in OSS Delta, starting with version 1.2.0

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1