'Merge two lists of intervals with priority for one list

I'm currently stuck with an algorithm problem in which I want to optimize the complexity.

I have two lists of intervals S = [[s1, s2], [s3, s4], ..., [sn-1, sn]] and W = [[w1, w2], [w3, w4], ..., [wm-1, wm]] that I want to merge respecting ordinal order, and intervals of S have priority over those of W. (S for strong, W for weak) For example, that priority imply :

  • S = [[5,8]] and W = [[1, 5], [7, 10]] will give : res = [[1, 4, W], [5, 8, S], [9, 10, W]]. Here intervals from W are cropped in priority for intervals of S
  • S = [[5, 8]] and W = [[2, 10]] will give : res = [[2, 4, W], [5, 8, S], [9, 10, W]]. Here the interval of W is split into two parts because S has priority.

While merging those lists, I need to keep track of the strong of weak nature of those intervals by writing a third element beside each interval, that we can call the symbol. that's why the result is something like : [[1, 4, W], [5, 8, S], [9, 10, W]]. Finally, as the union of all interval does not cover all integers in a certain range, we have a third symbol, let's say B for blank which fill missing interval : [[1, 2, W], [5, 8, S], [9, 10, W], [16, 20, S]] will be filled in to become : [1, 2, W], [3, 4, B], [5, 8, S], [9, 10, W], [11, 15, B], [16, 20, S]]

My first attempt was very naive and lazy (because I first wanted it to work) : If the greatest integer covered by these two lists of intervals is M, then I created a list of size M filled with B symbols : res = [B]*M = [B, B, B ..., B] Then I first take interval from W one by one and rewrite elements from res of index in this interval to change its symbol to W. Next, I do the same with intervals of S, and the priority is respected because I overwrite with the symbol S in the last step. It gives something like :

  1. [B, B, B, B, B, B, B, B, B, B, B, B, B, B, B, B, B]
  2. [B, B, B, W, W, W, W, B, W, W, W, W, B, W, W, B, B]
  3. [B, B, S, S, W, W, W, B, S, S, W, W, B, S, W, B, B]

Finally, I go through the big list one last time to factorize and recreate intervals with its corresponding symbols. Previous example gives : [[1, 2, B], [3, 4, S], [5, 7, W], [8, 8, B], [9, 10, S], [11, 12, W], [13, 13, B], [14, 14, S], [15, 15, W], [16, 17, B]]

Unfortunately but predictably, this algorithm is not usable in practice : M is around 1000000 in my application and this algorithm is O(n2) if I'm not mistaken.

So, I would like some advice and directions to solve this algorithmic complexity problem. I'm sure that this problem looks alike a well-known algorithmic problem but I don't know where to go.

My few ideas to improve that for now can be used to optimize the algorithm, but are quite complex to implement so I think there is better ideas. but here they are :

  • Do the same kind of overwrite process to respect priority : in list W, insert intervals of S with overwriting when necessary to respect priority. Then fill in this list to insert missing interval with B symbol. But we would have an heavy use of if to compare intervals because of the great amount of cases.
  • Construct a new list while browsing S and W step by step. In this idea we would have one cursor by list to go from interval to interval until the end of one of the two lists. Again we use a lot of if and cases we insert intervals in the new list with respect to priority. But it raises the same complex problem with the great amount of cases.

I hope I made myself clear, if not I can explain in other way. Please teach me with experience and cleverness :)

Thanks

EDIT: here is my "naive" algorithm code:

def f(W, S, size):

  #We first write one symbol per sample
  int_result = ['B'] * size
  for interval in W:
      for i in range(interval[0], interval[1]+1):
          int_result[i] = 'W'
  for interval in S:
      for i in range(interval[0], interval[1]+1):
          int_result[i] = 'S'

  #we then factorize: we store one symbol for an interval of the same    symbol.    
  symbols_intervals = []
  sym = int_result[0]
  start = 0
  for j in range(len(int_result)):
      if int_result[j] != sym:
          symbols_intervals.append([start, j-1, sym])
          sym = all_symbols[j]
          start = j
      if j == len(int_result)-1:
          symbols_intervals.append([start, j-1, sym])

  return symbols_intervals


Solution 1:[1]

Your naive method sounds very reasonable; I think the time complexity of it is O(NM), where N is the number of intervals you're trying to resolve, and M is the the range over which you're trying to resolve them. The difficulty you might have is that you also have space complexity of O(M), which might use up a fair bit of memory.

Here's a method for merging without building a "master list", which may be faster; because it treats intervals as objects, complexity is no longer tied to M.

I'll represent an interval (or list of intervals) as a set of tuples (a,b,p), each of which indicates the time points from a to b, inclusively, with the integer priority p (W can be 1, and S can be 2). In each interval, it must be the case that a < b. Higher priorities are preferred.

We need a predicate to define the overlap between two intervals:

def has_overlap(i1, i2):
    '''Returns True if the intervals overlap each other.'''
    (a1, b1, p1) = i1
    (a2, b2, p2) = i2
    A = (a1 - a2)
    B = (b2 - a1)
    C = (b2 - b1)
    D = (b1 - a2)
    return max(A * B, D * C, -A * D, B * -C) >= 0

When we find overlaps, we need to resolve them. This method takes care of that, respecting priority:

def join_intervals(i1, i2):
    '''
    Joins two intervals, fusing them if they are of the same priority,
    and trimming the lower priority one if not.

    Invariant: the interval(s) returned by this function will not
    overlap each other.

    >>> join_intervals((1,5,2), (4,8,2))
    {(1, 8, 2)}
    >>> join_intervals((1,5,2), (4,8,1))
    {(1, 5, 2), (6, 8, 1)}
    >>> join_intervals((1,3,2), (4,8,2))
    {(1, 3, 2), (4, 8, 2)}
    '''
    if has_overlap(i1, i2):
        (a1, b1, p1) = i1
        (a2, b2, p2) = i2
        if p1 == p2:
            # UNION
            return set([(min(a1, a2), max(b1, b2), p1)])
        # DIFFERENCE
        if p2 < p1:
            (a1, b1, p1) = i2
            (a2, b2, p2) = i1
        retval = set([(a2, b2, p2)])
        if a1 < a2 - 1:
            retval.add((a1, a2 - 1, p1))
        if b1 > b2 + 1:
            retval.add((b2 + 1, b1, p1))
        return retval
    else:
        return set([i1, i2])

Finally, merge_intervals takes an iterable of intervals and joins them together until there are no more overlaps:

import itertools

def merge_intervals(intervals):
    '''Resolve overlaps in an iterable of interval tuples.'''
    # invariant: retval contains no mutually overlapping intervals
    retval = set()
    for i in intervals:
        # filter out the set of intervals in retval that overlap the
        # new interval to add O(N)
        overlaps = set([i2 for i2 in retval if has_overlap(i, i2)])
        retval -= overlaps
        overlaps.add(i)
        # members of overlaps can potentially all overlap each other;
        # loop until all overlaps are resolved O(N^3)
        while True:
            # find elements of overlaps which overlap each other O(N^2)
            found = False
            for i1, i2 in itertools.combinations(overlaps, 2):
                if has_overlap(i1, i2):
                    found = True
                    break
            if not found:
                break
            overlaps.remove(i1)
            overlaps.remove(i2)
            overlaps.update(join_intervals(i1, i2))
        retval.update(overlaps)
    return retval

I think this has worst-case time complexity of O(N^4), although the average case should be fast. In any case, you may want to time this solution against your simpler method, to see what works better for your problem.

As far as I can see, my merge_intervals works for your examples:

# example 1
assert (merge_intervals({(5, 8, 2), (1, 5, 1), (7, 10, 1)}) ==
        {(1, 4, 1), (5, 8, 2), (9, 10, 1)})

# example 2
assert (merge_intervals({(5, 8, 2), (2, 10, 1)}) ==
        {(2, 4, 1), (5, 8, 2), (9, 10, 1)})

To cover the case with blank (B) intervals, simply add another interval tuple which covers the whole range with priority 0: (1, M, 0):

# example 3 (with B)
assert (merge_intervals([(1, 2, 1), (5, 8, 2), (9, 10, 1),
                         (16, 20, 2), (1, 20, 0)]) ==
        {(1, 2, 1), (3, 4, 0), (5, 8, 2),
         (9, 10, 1), (11, 15, 0), (16, 20, 2)})

Solution 2:[2]

The below solution has O(n + m) complexity, where n and m are the lengths of S and W lists. It assumes that S and W are internally sorted.

def combine(S, W):
    s, w = 0, 0 # indices of S and W
    common = []
    while s < len(S) or w < len(W):
        # only weak intervals remain, so append them to common
        if s == len(S):
            common.append((W[w][0], W[w][1], 'W'))
            w += 1
        # only strong intervals remain, so append them to common
        elif w == len(W):
            common.append((S[s][0], S[s][1], 'S'))
            s += 1
        # assume that the strong interval starts first
        elif S[s][0] <= W[w][0]:
            W[w][0] = max(W[w][0], S[s][1]+1)
            if W[w][0] > W[w][1]: # drop the weak interval
                w += 1
            common.append((S[s][0], S[s][1], 'S'))
            s += 1
        # assume that the weak interval starts first
        elif S[s][0] > W[w][0]:
            # end point of weak interval before the start of the strong
            if W[w][1] < S[s][0]:
                common.append(W[w][0], W[w][1], 'W')
                w += 1
            # end point of the weak interval between a strong interval
            elif S[s][0] <= W[w][1] <= S[s][1]:
                W[w][1] = S[s][0] - 1
                common.append((W[w][0], W[w][1], 'W'))
                w += 1
            # end point of the weak interval after the end point of the strong
            elif W[w][1] > S[s][1]:
                common.append((W[w][0], S[s][0]-1, 'W'))
                W[w][0] = S[s][1] + 1
    return common


print combine(S=[[5,8]], W=[[1, 5],[7, 10]])
print combine(S=[[5,8]], W=[[2,10]])

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2