'Find subset with K elements that are closest to eachother

Given an array of integers size N, how can you efficiently find a subset of size K with elements that are closest to each other?

Let the closeness for a subset (x1,x2,x3,..xk) be defined as:

enter image description here

2 <= N <= 10^5

2 <= K <= N

constraints: Array may contain duplicates and is not guaranteed to be sorted.

My brute force solution is very slow for large N, and it doesn't check if there's more than 1 solution:

N = input()
K = input()
assert 2 <= N <= 10**5
assert 2 <= K <= N
a = []
for i in xrange(0, N):
    a.append(input())
a.sort()

minimum = sys.maxint
startindex = 0

for i in xrange(0,N-K+1):
    last = i + K
    tmp = 0
    for j in xrange(i, last):
        for l in xrange(j+1, last):
            tmp += abs(a[j]-a[l])
            if(tmp > minimum):
                break

    if(tmp < minimum):
        minimum = tmp
        startindex = i #end index = startindex + K?

Examples:

N = 7
K = 3
array = [10,100,300,200,1000,20,30]
result = [10,20,30]

N = 10
K = 4
array = [1,2,3,4,10,20,30,40,100,200]
result = [1,2,3,4]


Solution 1:[1]

itertools to the rescue?

from itertools import combinations

def closest_elements(iterable, K):
    N = set(iterable)
    assert(2 <= K <= len(N) <= 10**5)

    combs = lambda it, k: combinations(it, k)
    _abs = lambda it: abs(it[0] - it[1])
    d = {}
    v = 0

    for x in combs(N, K):
        for y in combs(x, 2):
            v += _abs(y)

        d[x] = v
        v = 0

    return min(d, key=d.get)

>>> a = [10,100,300,200,1000,20,30]
>>> b = [1,2,3,4,10,20,30,40,100,200]
>>> print closest_elements(a, 3); closest_elements(b, 4)
(10, 20, 30) (1, 2, 3, 4)

Solution 2:[2]

This procedure can be done with O(N*K) if A is sorted. If A is not sorted, then the time will be bounded by the sorting procedure.

This is based on 2 facts (relevant only when A is ordered):

  • The closest subsets will always be subsequent
  • When calculating the closeness of K subsequent elements, the sum of distances can be calculated as the sum of each two subsequent elements time (K-i)*i where i is 1,...,K-1.
  • When iterating through the sorted array, it is redundant to recompute the entire sum, we can instead remove K times the distance between the previously two smallest elements, and add K times the distance of the two new largest elements. this fact is being used to calculate the closeness of a subset in O(1) by using the closeness of the previous subset.

Here's the pseudo-code

List<pair> FindClosestSubsets(int[] A, int K)
{
    List<pair> minList = new List<pair>;
    int minVal = infinity;
    int tempSum;
    int N = A.length;

    for (int i = K - 1; i < N; i++)
    {
        tempSum = 0;

        for (int j = i - K + 1; j <= i; j++)
              tempSum += (K-i)*i * (A[i] - A[i-1]);

        if (tempSum < minVal)
        {
              minVal = tempSum;
              minList.clear();
              minList.add(new pair(i-K, i);
        }

        else if (tempSum == minVal)
              minList.add(new pair(i-K, i);
    }

    return minList;
}

This function will return a list of pairs of indexes representing the optimal solutions (the starting and ending index of each solution), it was implied in the question that you want to return all solutions of the minimal value.

Solution 3:[3]

try the following:

N = input()
K = input()
assert 2 <= N <= 10**5
assert 2 <= K <= N
a = some_unsorted_list
a.sort()

cur_diff = sum([abs(a[i] - a[i + 1]) for i in range(K - 1)])
min_diff = cur_diff
min_last_idx = K - 1
for last_idx in range(K,N):
    cur_diff = cur_diff - \
               abs(a[last_idx - K - 1] - a[last_idx - K] + \
               abs(a[last_idx] - a[last_idx - 1])
    if min_diff > cur_diff:
        min_diff = cur_diff
        min_last_idx = last_idx

From the min_last_idx, you can calculate the min_first_idx. I use range to preserve the order of idx. If this is python 2.7, it will take linearly more RAM. This is the same algorithm that you use, but slightly more efficient (smaller constant in complexity), as it does less then summing all.

Solution 4:[4]

After sorting, we can be sure that, if x1, x2, ... xk are the solution, then x1, x2, ... xk are contiguous elements, right?

So,

  1. take the intervals between numbers
  2. sum these intervals to get the intervals between k numbers
  3. Choose the smallest of them

Solution 5:[5]

My initial solution was to look through all the K element window and multiply each element by m and take the sum in that range, where m is initialized by -(K-1) and incremented by 2 in each step and take the minimum sum from the entire list. So for a window of size 3, m is -2 and the values for the range will be -2 0 2. This is because I observed a property that each element in the K window add a certain weight to the sum. For an example if the elements are [10 20 30] the sum is (30-10) + (30-20) + (20-10). So if we break down the expression we have 2*30 + 0*20 + (-2)*10. This can be achieved in O(n) time and the entire operation would be in O(NK) time. However it turns out that this solution is not optimal, and there are certain edge cases where this algorithm fails. I am yet to figure out those cases, but shared the solution anyway if anyone can figure out something useful from it.

for(i = 0 ;i <= n - k;++i)
{
    diff = 0;
    l = -(k-1);
    for(j = i;j < i + k;++j)
    {
        diff += a[j]*l;
        if(min < diff)
            break;
        l += 2;
    }
    if(j == i + k && diff > 0)
    min = diff;
}

Solution 6:[6]

You can do this is O(n log n) time with a sliding window approach (O(n) if the array is already sorted).

First, suppose we've precomputed, at every index i in our array, the sum of distances from A[i] to the previous k-1 elements. The formula for that would be

(A[i] - A[i-1]) + (A[i] - A[i-2]) + ... + (A[i] - A[i-k+1]).

If i is less than k-1, we just compute the sum to the array boundary.

Suppose we also precompute, at every index i in our array, the sum of distances from A[i] to the next k-1 elements. Then we could solve the whole problem with a single pass of a sliding window.

If our sliding window is on [L, L+k-1] with closeness sum S, then the closeness sum for the interval [L+1, L+k] is just S - dist_sum_to_next[L] + dist_sum_to_prev[L+k]. The only changes in the sum of pairwise distances are removing all terms involving A[L] when it leaves our window, and adding all terms involving A[L+k] as it enters our window.

The only remaining part is how to compute, at a position i, the sum of distances between A[i] and the previous k-1 elements (the other computation is totally symmetric). If we know the distance sum at i-1, this is easy: subtract the distance from A[i-1] to A[i-k], and add in the extra distance from A[i-1] to A[i] k-1 times

dist_sum_to_prev[i] =   (dist_sum_to_prev[i - 1] - (A[i - 1] - A[i - k])
                      + (A[i] - A[i - 1]) * (k - 1)

Python code:

def closest_subset(nums: List[int], k: int) -> List[int]:
    """Given a list of n (poss. unsorted and non-unique) integers nums,
     returns a (sorted) list of size k that minimizes the sum of pairwise
     distances between all elements in the list.

     Runs in O(n lg n) time, uses O(n) auxiliary space.
    """

    n = len(nums)
    assert len(nums) == n
    assert 2 <= k <= n

    nums.sort()

    # Sum of pairwise distances to the next (at most) k-1 elements
    dist_sum_to_next = [0] * n

    # Sum of pairwise distances to the last (at most) k-1 elements
    dist_sum_to_prev = [0] * n

    for i in range(1, n):
        if i >= k:
            dist_sum_to_prev[i] = ((dist_sum_to_prev[i - 1] -
                                    (nums[i - 1] - nums[i - k]))
                                   + (nums[i] - nums[i - 1]) * (k - 1))
        else:
            dist_sum_to_prev[i] = (dist_sum_to_prev[i - 1]
                                   + (nums[i] - nums[i - 1]) * i)

    for i in reversed(range(n - 1)):
        if i < n - k:
            dist_sum_to_next[i] = ((dist_sum_to_next[i + 1]
                                    - (nums[i + k] - nums[i + 1]))
                                   + (nums[i + 1] - nums[i]) * (k - 1))
        else:
            dist_sum_to_next[i] = (dist_sum_to_next[i + 1]
                                   + (nums[i + 1] - nums[i]) * (n-i-1))

    best_sum = math.inf
    curr_sum = 0
    answer_right_bound = 0

    for i in range(n):
        curr_sum += dist_sum_to_prev[i]
        if i >= k:
            curr_sum -= dist_sum_to_next[i - k]

        if curr_sum < best_sum and i >= k - 1:
            best_sum = curr_sum
            answer_right_bound = i

    return nums[answer_right_bound - k + 1:answer_right_bound + 1]

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2
Solution 3
Solution 4 Lucas Ribeiro
Solution 5 user2905483
Solution 6 kcsquared