'Range search with KNN on two different dimensions
I've a few million records (which are updated often) with 2 properties:
- Timestamp
- Popularity score
I'm looking for a data structure (maybe some metric tree?) that can do fast range search on 1 dimension (e.g. all records greater than a timestamp value), and locate top K records that fall within that range on the other dimension (i.e. popularity score). In other words, I can phrase this query as "Find top K popular records with timestamp greater than T".
I currently have a naive implementation where I filter the N records in linear time complexity and then identify the top K records using a partial sorting algorithm. But this is not fast enough given the number of concurrent users we need to support.
I'm not super familiar with KD trees, but I see that some popular implementations support both range searches and finding K nearest neighbors, but my requirements are a bit peculiar here -- so I'm wondering if there is a way to do this faster, at the expense of maybe additional indexing overhead.
Solution 1:[1]
If you will invest the initial sorting of a list of tuples (record_name, timestamp) by the timestamp, and create a dictionary with the record name as keys and (popularity_score, timestamp_list_idx) tuples as values you will be able to:
- Perform binary search for a particular timestamp O(logn)
- Extract the greater than values in O(1) since the array is sorted
- Extract the matching popularity vote in O(1) since they are in a dictionary
- Update a record popularity score in O(1) due to the dictionary
- Update a particular timestamp in O(1) via pulling the index of the record from the tuple in the dictionary value
Suppose you have m records with the wanted timestamp range, you can
- generate a max heap from them by popularity, this takes O(m) and then perform
kpops from that heap with O(klogm) since we need to repopulate the root after every pop. This means that the actions you want to do will take O(m + klogm). Assumingk << mthis will run in O(m). - Iterate over the
mrecords with a list in sizekto keep track of thr topkpopular songs. After passing over allmrecords you will have the topkin the list. This takes O(m) as well
Method 1 take a little more time than method 2 in terms of complexity but in case you suddenly want to know the k+1 most popular record, you can just pop abother item from the heap instead of passing over the entire m records again with a k+1 long list.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
