Tuesday, January 29, 2019

Fast medians over growing data

So, getting the median from a set of numbers. This might look overly complex for such a basic task, so let's explain the (quite particular) use case, which came up in a job interview.
The interviewer on the other side of the phone asked me to come up with a solution for computing the median of a set of data *while inserting data*, i.e. after each insertion, report the new median (when the count is even, average the two middle values). For example, inserting 5, 3, 8 should yield the medians 5, 4 and 5.

The trivial/naive approach would be to insert the number, sort the range, and pick the middle value(s) by index. To exploit O(1) random access for the middle indices, we'd first have to sort the data in O(n log n), and only then get the median in O(1). Doing that after every insert makes the whole thing O(n^2 log n). Pretty bad.
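
For reference, here's a minimal sketch of that naive version (the helper name is mine, just for illustration, not from any library):

#include <algorithm>
#include <iostream>
#include <vector>

// Re-sort after every insert, then index the middle: O(n log n) per insert.
double naive_running_median(std::vector<int>& sorted, int x){
    sorted.push_back(x);
    std::sort(sorted.begin(), sorted.end());
    size_t n = sorted.size();
    return n % 2 ? sorted[n / 2]                              // odd count: middle element
                 : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0; // even count: average the two
}

int main(){
    std::vector<int> v;
    for(int x : {5, 3, 8, 1})
        std::cout << naive_running_median(v, x) << '\n';      // prints 5, 4, 5, 4
}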

Approach #2: use quicksort's partition routine to find the median, a.k.a. quickselect. Since we don't actually need the array sorted, inserts are O(1), but each query has to partition the array repeatedly until it isolates the median, which takes O(n) on average. That's usually acceptable for a one-time median; for the present problem, though, we'd have to do it n times, giving O(n^2) overall, and hope we never hit the worst case. Fast insert, slow query.
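
For a one-shot query, that partition step is essentially what std::nth_element does (a quickselect-style partial ordering), so a sketch could look like this; quickselect_median is my name for it, nothing standard:

#include <algorithm>
#include <iostream>
#include <vector>

// O(1) insert (a plain push_back elsewhere), O(n) average per median query.
double quickselect_median(std::vector<int> v){ // by value: nth_element reorders
    size_t n = v.size();
    std::nth_element(v.begin(), v.begin() + n / 2, v.end());
    int hi = v[n / 2];
    if(n % 2) return hi;
    // Even count: the other middle value is the largest of the lower partition.
    int lo = *std::max_element(v.begin(), v.begin() + n / 2);
    return (lo + hi) / 2.0;
}

int main(){
    std::cout << quickselect_median({3, 5, 8})    << '\n'  // 5
              << quickselect_median({1, 3, 5, 8}) << '\n'; // 4
}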

#3: lacking a computer at the time, I managed to babble something about using some sort of balanced BST (maybe a 2-3 tree, maybe an order-statistics tree on an AVL base, whatever!), and we agreed it could support O(log n) insertion and O(log n) median queries (each time).
However, at this point the interviewer asked whether I could come up with a solution where the query is always O(1), but I was running out of time and had to get going, so we said goodbye.
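
For the curious, approach #3 can be sketched with GCC's policy-tree extension (__gnu_pbds), which augments a red-black tree with subtree sizes. This is compiler-specific and just one way to get an order-statistics tree (the ostree name is mine); values are paired with an insertion counter so that duplicates survive:

#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>
#include <iostream>
using namespace __gnu_pbds;

// A set of (value, insertion id) pairs with O(log n) rank queries.
typedef tree<std::pair<int,int>, null_type, std::less<std::pair<int,int>>,
             rb_tree_tag, tree_order_statistics_node_update> ostree;

int main(){
    ostree t;
    int id = 0;
    for(int x : {5, 3, 8, 1}){
        t.insert({x, id++}); // O(log n)
        size_t n = t.size();
        double med = n % 2
            ? t.find_by_order(n / 2)->first // O(log n) lookup by rank
            : (t.find_by_order(n / 2 - 1)->first +
               t.find_by_order(n / 2)->first) / 2.0;
        std::cout << med << '\n'; // prints 5, 4, 5, 4
    }
}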

Afterwards, I kept thinking about the problem and came up with something like this. It uses two heaps, a min-heap and a max-heap, each holding one half of the data, such that the top elements are invariably the median candidates. Inserting into a binary heap is O(log n), and each new number triggers at most a constant number of push()/pop() operations, so I could finally insert in O(log n) and query in O(1) (as long as there aren't frequent memory reallocations of the underlying storage).
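
To see the invariant in action, feed in 5, 3, 8, 1: after 5 the max-heap holds {5} (median 5); after 3 the two tops are out of order, so the swap step fires and the heaps become {3} | {5} (median 4); after 8 the rebalance step moves 5 down to the max-heap, giving {3,5} | {8} (median 5); after 1 the swap fires again and we end with {1,3} | {5,8} (median 4).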

Here's how; I hope someone finds it useful. Credits to this user, who essentially had the same idea.



#include <iostream>
#include <vector>
#include <queue>
#include <functional> // less/greater
#include <utility>    // move
#include <iomanip>
#include <cstdlib>    // rand/srand
#include <ctime>      // clock

#define INPUT_SIZE 10000000

using namespace std;

// The idea is to keep the data partitioned around the median values.
// We use a max heap and a min heap to split the data in equal parts.
// This can maintain the given time complexity restrictions always,
// provided that there's no frequent memory realloc. to fit more data.

// compile: g++ -O3 -std=c++11 -o fastmedians fastmedians.cpp
//          g++ -O2 -std=c++11 -o fastmedians fastmedians.cpp

int main(){

    vector<int> cont1, cont2;
    vector<double> answers;
    cont1.reserve(10485760); 
    cont2.reserve(10485760);
    answers.reserve(10485760*2);
    int aux;

    srand(42); // seed for reproducibility.
    int *data = new int[INPUT_SIZE]; // let's get serious
    for(int i = 0; i < INPUT_SIZE; i++)
        data[i] = rand() % 20000000;

    // Half of the data in a max-heap, half of it in a min-heap, the median will be on top of them.
    priority_queue<int, vector<int>              > maxheap(less   <int>(), move(cont1));
    priority_queue<int, vector<int>, greater<int>> minheap(greater<int>(), move(cont2));

    //Get the clock running!
    clock_t tick = clock();

    for(int i = 0; i < INPUT_SIZE; i++){
      
        minheap.push(data[i]); // O(log_n).
      
        // Keep both parts balanced, O(log_n).
        if( minheap.size() > maxheap.size() ){
            maxheap.push(minheap.top());
            minheap.pop();
        }
      
        // If the top values aren't ordered, swap them, O(log_n).
        if(!minheap.empty() && minheap.top() < maxheap.top()){
            aux = maxheap.top();
            maxheap.pop();
            maxheap.push(minheap.top());
            minheap.pop();
            minheap.push(aux);
        }
      
        // fetch median, O(1).
        answers.push_back( maxheap.size() == minheap.size()
                               ? ((double)maxheap.top() + minheap.top()) / 2
                               : maxheap.top() );
  
    }
 
    tick = clock() - tick;
    cout << setprecision(12) << "Time taken: " << ((double)tick/CLOCKS_PER_SEC) << endl;
 
    // Uncomment the following to write results to standard output. (LARGE)
    //for(auto m : answers)
    //    cout << m << endl;
 
    delete[] data; // release the input buffer.
    return 0;
}


This shouldn't take more than a couple of seconds for n = 10M on an average laptop nowadays.
Note that for a real-life use case one could simply have gone with #3 and folded computing and caching the median into the insert operation, getting O(log n) inserts and O(1) queries anyway. Hence why I think this item of the interview was rather pointless.
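
A minimal sketch of that variant, assuming a std::multiset with an iterator pinned to the lower-middle element (the update rules are a well-known trick, not something from the interview):

#include <iostream>
#include <iterator>
#include <set>

int main(){
    std::multiset<int> s;
    std::multiset<int>::iterator mid; // always points at the lower-middle element

    for(int x : {5, 3, 8, 1}){
        s.insert(x); // O(log n); equal keys land at the upper bound, i.e. after mid
        if(s.size() == 1)  mid = s.begin();
        else if(x < *mid){ if(s.size() % 2 == 0) --mid; } // new element landed left of mid
        else             { if(s.size() % 2 == 1) ++mid; } // landed at mid or to its right
        double med = s.size() % 2 ? *mid : (*mid + *std::next(mid)) / 2.0;
        std::cout << med << '\n'; // prints 5, 4, 5, 4
    }
}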