'How do I make this function more efficient?

I'm trying to create a tiny database that contains a list of people, their names, ID, and their income as time passed on. Using this database I'd like to find their median income.

The class structure I have in mind for the people is something like this

class Person {
       string name;
       string ID;
       vector<int> Incomes;
 };

And the class structure for the database overall is just a vector containing these incomes.

 class Database{
             vector<Person> somepeople;
 };

Let's say I create a function called median income that returns the median income of all people in this database. The obvious method would be to iterate through every person, and then subsequently iterate through every value in the vector of incomes to create a list that contains all values and then from that finding the median. However, this would be in O(n*k) time. Is there a way to achieve this in O(nlogn) or less, including ways provided by the C++ STL?



Solution 1:[1]

You have a very interesting task! I fully implemented from scratch code of two solutions for you. First solution is much shorter and uses only standard C++ library, second solution is several times longer and more difficult but 3x faster than standard library solution.

You may skip whole description text below and just run C++ code provided at the bottom.

First solution implements class StdMedianFinder for finding median that is capable of doing 3 operations:

  1. Add(value, count) - this allows to add several incomes of different users at a time, income is signified by value and count means how many occurances of this income all users have. You may add income of any user at any point in time, for example if your database is updated from time to time with new users.

  2. Del(value, count) - this allows to delete incomes of a user, for example in case if user is deleted from database. value and count mean same as in Add() operation.

  3. Median() - this method returns median of all current incomes, this is exactly what you wanted in your Question. Method is very fast, returns value immediately no matter how many incomes are there, because main time and computation consumer is Add() and Del().

As you can see above library allows to do all things incrementally, meaning that in different points in time you can update your database by adding or removing some users or parts of their incomes. Incrementality gives very fast speed for small updates, meaning that if you had 1000 incomes that took for example 100 micro-seconds to add them to structure, and later added/removed 10 more incomes then they will take just 1% of initial time - just 1 micro-second, because 10 incomes / 1000 incomes = 1%.

Class above uses only std::map and nothing else. As it is known std::map implements sorted (by keys) container using Red-Black binary search tree. Adding/removing one element to my class takes approximately time of adding/removing one element to std::map plus one advance of iterator (++it; or --it;). std::map for adding/removing one element uses approximately O(Log(N)) time, where N is amount of elements already in container. Hence total time for adding/removing N elements in my class takes O(N * Log(N)) time.

In your Question you noted that you want very fast solution. Both of my solutions give O(N * Log(N)) time for adding N different incomes. If you have n users and k incomes for each user, as you said in Question, then my N = n * k, it means time is O(n * k * Log(n * k)). But there are good news - this is full time when doing only very first computation of Median, later calls to median give immediate result, and as it often happens in all DataBases new users are added or removed later from time to time, and if you add/remove a user that has k incomes then you need only small extra time O(k * Log(n * k)) to update my class and re-compute Median. When adding single user, Important is not to do whole re-computation of all users (whole database), but only to Update class with incomes of this single user.

For second solution that is much bigger, and 3x times faster, I implemented from scratch so-called B-Tree that is not present in standard C++ library, and using this tree I implmented BTreeMedianFinder class that has exactly same interface of first class StdMedianFinder.

Important Note All my code below (and especially B-Tree) is not very well tested, and can't be suited for copy-pasting into Production products. Either you have to copy-paste it into just educational projects or do some extra deep testing before usage in Production.

B-Tree is implemented in such a way that besides (Key, Value) (which are income and count of same incomes) stored in leaf nodes, it also stores inside intermediate nodes maximum value of a sub-tree and total count of incomes in a sub-tree. This allows to do very fast two operation - given a value of income it answers fast what is the position of income inside a sorted list of all incomes, and given desired position of income inside sorted list it answers fast what is the value of income at that position (this second operation is used to find Median, because median is basically a value of income at position equal to half of number of all elements).

Following code has several quite large tests that check correctness and speed of three classes BTree, StdMedianFinder, BTreeMedianFinder. Program shows in console timings of all tests and speed measurement.

If you have Boost.StackTrace library installed in your system it can be used to show stacktrace to make exceptions messages more informative. By default in my code this feature is disabled by define #define USE_BOOST_STACKTRACE 0, which you can change to 1 to enable. Also work of this feature was tested fully only on Windows, for Linux system some small fixups might be needed inside includes or defines. Program binary should be compiled with symbols information included, for this add option /Z7 in MSVC compiler and -g in GCC/CLang. Only this 3rd-party library might be used as an option, besides it only standard C++ library was used in my code.

If you do debug build, don't forget to define _DEBUG macro, if this macro is defined than I do very many extra assertion checks in all classes of my code to do extra testing of code correctness and prevent bugs and crashes.

Code below is fully compilable and working without any modifications. If you run it executes all tests and speed measurements and shows them in console. Also click on Try it online! link below to see live run of this code on GodBolt Linux servers.

Try it online!

SOURCE CODE HERE. Due to StackOveflow limit of 30000 characters for post size, providing source code shared at external PasteBin service. My post with code appeared to be 41000 characters long! Also you may click Try it online! link above, it has same source code duplicated at GodBolt servers.

Output:

'Test BTree' time 1.188 sec
'Test StdMedianFinder' time 0.532 sec
'Test BTreeMedianFinder' time 0.535 sec
'Std_Finder Add' time 1.426 sec
'BTree_Finder Add' time 0.577 sec
'Std_Tree Add' time 1.249 sec
'Std_Finder Median' time 0.001 sec
'BTree_Finder Median' time 0.321 sec
'Std_Finder Del' time 1.252 sec
'BTree_Finder Del' time 0.477 sec
'Std_Tree Del' time 1.117 sec
'Test MedianFinders Speed Total' time 6.907 sec

Solution 2:[2]

Well, only when there are two Incomes vectors, you can find the median of their merger in O(logk). Because, binary search enables that method.

But for more than that, I don't think there is any avoiding a linear traversal.

Think that even if the vectors are sorted beforehand, even then we don't know how their values compare against each other, and middle elements can randomly be in the 2nd or 88th vector.

Solution 3:[3]

Why not use std::nth_element ? This is O(n) If the number of elements is odd => the middle element is the median If the number of elements is even => find the iterator left middle element, then right middle element min() of 2nd half median is *leftMiddle + (*rightMiddle - *leftMiddle) / 2;

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 mr.loop
Solution 3 SJHowe