'How to group list of duplicate continuous value in a list with a recursion function?

I want to group consecutive values if it's duplicates and each value is just in one group, let's see my example below:

Note: results is an index of the value in test_list

test_list = ["1","2","1","2","1","1","5325235","2","62623","1","1"]
--->results = [[[0, 1], [2, 3]],
               [[4, 5], [9, 10]]]

test_list = ["1","2","1","1","2","1","5325235","2","62623","1","2","1","236","2388","626236437","1","2","1","236","2388"]
--->results = [[[9, 10, 11, 12, 13], [15, 16, 17, 18, 19]],
               [[0, 1, 2], [3, 4, 5]]]

I build a recursive function:

def group_duplicate_continuous_value(list_label_group):

    # how to know which continuous value is duplicate, I implement take next number minus the previous number
    list_flag_grouping = [str(int(j.split("_")[0]) - int(i.split("_")[0])) +f"_{j}_{i}" for i,j in zip(list_label_group,list_label_group[1:])]

    # I find duplicate value in list_flag_grouping
    counter_elements = Counter(list_flag_grouping)
    list_have_duplicate = [k for k,v in counter_elements.items() if v > 1]

    if len(list_have_duplicate) > 0:
        list_final_index = group_duplicate_continuous_value(list_flag_grouping)
        # To return exactly value, I use index to define
        for k, v in list_final_index.items():
            temp_list = [v[i] + [v[i][-1] + 1] for i in range(0,len(v))]
            list_final_index[k] = temp_list
        check_first_cursive = list_label_group[0].split("_")
    
        # If we have many list grouping duplicate countinous value with different length, we need function below to return exactly results
        if len(check_first_cursive) > 1:
            list_temp_index = find_index_duplicate(list_label_group)
            list_duplicate_index = list_final_index.values()
            list_duplicate_index = [val for sublist in list_duplicate_index for val1 in sublist for val in val1]
            for k,v in list_temp_index.items():
                list_index_v = [val for sublist in v for val in sublist]
                if any(x in list_index_v for x in list_duplicate_index) is False:
                    list_final_index[k] = v
        
                    
        return list_final_index
    else:
        if len(list_label_group) > 0:
            check_first_cursive = list_label_group[0].split("_")
            if len(check_first_cursive) > 1:
                list_final_index = find_index_duplicate(list_label_group)
                return list_final_index
        list_final_index = None
        return list_final_index

Support function:

def find_index_duplicate(list_data):
    dups = defaultdict(list)
    for i, e in enumerate(list_data):
        dups[e].append([i])
    new_dict = {key:val for key, val in dups.items() if len(val) >1}
    return new_dict

But when I run with test_list = [5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,1,2,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,1,2,5,5,5], it's very slow and make out of memory (~6GB). I knew a reason is stack overflow of my recursive function group_duplicate_continuous_value but I don't know how to fix it.



Solution 1:[1]

You can create a dict of lists, where every item from the original list is a key in the dict, and every key is mapped to the list of its indices in the original list. For instance, your list ["1","3","5","5","7","1","3","5"] would result in the dict {"1": [0, 5], "3": [1, 6], "5": [2, 3, 7], "7": [4]}.

Creating a dict of lists in this way is very idiomatic in python, and fast, too: it can be done by iterating just once on the list.

def build_dict(l):
    d = {}
    for i, x in enumerate(l):
        d.setdefault(x, []).append(i)
    return d

l = ["1","3","5","5","7","1","3","5"]
d = build_dict(l)
print(d)
# {'1': [0, 5], '3': [1, 6], '5': [2, 3, 7], '7': [4]}

Then you can iterate on the dict to build two lists of indices:

def build_index_results(l):
    d = build_dict(l)
    idx1, idx2 = [], []
    for v in d.values():
        if len(v) > 1:
            idx1.append(v[0])
            idx2.append(v[1])
    return idx1, idx2

print(build_index_results(l))
# ([0, 1, 2], [5, 6, 3])

Or using zip:

from operator import itemgetter

def build_index_results(l):
    d = build_dict(l)
    return list(zip(*map(itemgetter(0,1), (v for v in d.values() if len(v) > 1))))

print(build_index_results(l))
# [(0, 1, 2), (5, 6, 3)]

I can't resist showcasing more_itertools.map_reduce for this:

from more_itertools import map_reduce
from operator import itemgetter

def build_index_results(l):
    d = map_reduce(enumerate(l),
                   keyfunc=itemgetter(1),
                   valuefunc=itemgetter(0),
                   reducefunc=lambda v: v[:2] if len(v) > 1 else None
    )
    return list(zip(*filter(None, d.values())))

print(build_index_results(l))
# [(0, 1, 2), (5, 6, 3)]

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1