'Range search with KNN on two different dimensions

I've a few million records (which are updated often) with 2 properties:

  • Timestamp
  • Popularity score

I'm looking for a data structure (maybe some metric tree?) that can do fast range search on 1 dimension (e.g. all records greater than a timestamp value), and locate top K records that fall within that range on the other dimension (i.e. popularity score). In other words, I can phrase this query as "Find top K popular records with timestamp greater than T".

I currently have a naive implementation where I filter the N records in linear time complexity and then identify the top K records using a partial sorting algorithm. But this is not fast enough given the number of concurrent users we need to support.

I'm not super familiar with KD trees, but I see that some popular implementations support both range searches and finding K nearest neighbors, but my requirements are a bit peculiar here -- so I'm wondering if there is a way to do this faster, at the expense of maybe additional indexing overhead.



Solution 1:[1]

If you will invest the initial sorting of a list of tuples (record_name, timestamp) by the timestamp, and create a dictionary with the record name as keys and (popularity_score, timestamp_list_idx) tuples as values you will be able to:

  1. Perform binary search for a particular timestamp O(logn)
  2. Extract the greater than values in O(1) since the array is sorted
  3. Extract the matching popularity vote in O(1) since they are in a dictionary
  4. Update a record popularity score in O(1) due to the dictionary
  5. Update a particular timestamp in O(1) via pulling the index of the record from the tuple in the dictionary value

Suppose you have m records with the wanted timestamp range, you can

  1. generate a max heap from them by popularity, this takes O(m) and then perform k pops from that heap with O(klogm) since we need to repopulate the root after every pop. This means that the actions you want to do will take O(m + klogm). Assuming k << m this will run in O(m).
  2. Iterate over the m records with a list in size k to keep track of thr top k popular songs. After passing over all m records you will have the top k in the list. This takes O(m) as well

Method 1 take a little more time than method 2 in terms of complexity but in case you suddenly want to know the k+1 most popular record, you can just pop abother item from the heap instead of passing over the entire m records again with a k+1 long list.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1