'Druid - Distinct values of a set of columns
I need distinct values of a set of dimensions from my indexes in Druid, using the REST/JSON API. Currently I am using following aggregate query (though I don't need aggregation) because I couldn't find any other way to fetch distinct values, as all queries mandate aggregate field.
{
"queryType": "groupBy",
"dataSource": "ds1",
"granularity": "all",
"dimensions": ["level1","level2","level3", "val1", "val2"],
"aggregations": [{"type": "count","name": "val2Count","fieldName": "val2"}],
"intervals": ["2015-10-22T00:00:00.000/2015-12-22T23:00:00.000"]
}
I think this is a common scenario. Is there a way to get distinct set of columns without aggregation?
When I'm doing it with aggregation, how much does this aggregate affect performance (other than network IO for extra count data) over the way, if there was one without aggregation?
Solution 1:[1]
You can use select query to do the same. http://druid.io/docs/latest/querying/select-query.html
{
"queryType": "select",
"dataSource": "ds1",
"granularity": "all",
"descending": "true",
"intervals": ["2016-05-01T00:00:00.000/2016-06-01T00:00:00.000"],
"dimensions":["level1","level2","level3", "val1", "val2"],
"pagingSpec":{"pagingIdentifiers": {}, "threshold":5}
}
If you have just one dimension then you may also use Lexicographic TopNMetricSpec. http://druid.io/docs/latest/querying/topnmetricspec.html
Solution 2:[2]
You can use the metadata query. This query will provide a cardinality estimation for every dimension. http://druid.io/docs/latest/querying/segmentmetadataquery.html Although if you want to list all the distinct combination let say for dimension1 and dimension2, you have to use groupBy over dimension1 and dimension2. Adding a count aggregator won't really affect the performance.
Solution 3:[3]
I was looking for the same answer and after doing some testing of my own I have discovered that the following query is more performant than a GROUP BY:
SELECT APPROX_COUNT_DISTINCT(CONCAT(level1, level2, level3, level4, level5))
FROM ds1
WHERE __time >= CURRENT_DATE - INTERVAL '1' DAY
^ this above is far more performant for me than the below
SELECT SUM("count") AS distinct_count
FROM (
SELECT 1 AS "count"
FROM ds1
WHERE __time >= CURRENT_DATE - INTERVAL '1' DAY
GROUP BY level1, level2, level3, level4, level5
)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | ssaurav |
| Solution 2 | |
| Solution 3 | JRob |
