Function to modify a list of lists in order to prevent repeated numbers within sublists is not working completely

Stack Overflow community:

I have a list of lists of lists named dicts, built by randomly sampling values from a DataFrame's index. A value may be repeated across the top-level elements of dicts, but not within a single element dicts[e]. For example:

[[[40, 23, 29, 41, 42], [], [19, 17, 21, 20, 24]],    
 [[3, 9, 43, 44, 17], [], [20, 9, 23, 3, 27], [3, 30, 43]], #wrong because 9,3 and 43 are repeated in the three sublists
 [[2, 26, 42, 29, 44], [], [2, 3, 44, 31, 27]],  #2,44 are repeated
 [[31, 43, 32, 23, 33], [], [44, 9, 27, 23, 29]], #23 is repeated
 [[12, 27, 9, 44, 2], [], [25, 29, 40, 27, 12]]]  #27 and 12 are repeated

As you can see, it doesn't matter that the number 3 appears both in the second top-level element (dicts[1]) and in the third (dicts[2]); only repeats within the same element count. The empty lists don't matter.
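
The rule above can be checked mechanically. The following is a minimal sketch and not part of the original code: the helper name find_repeats is made up for illustration. It flattens each top-level element and reports the values that occur more than once in it.

```python
from collections import Counter
from itertools import chain

def find_repeats(groups):
    """For each top-level element, return the sorted values that occur more than once in it."""
    result = []
    for group in groups:
        # Flatten the sublists of this element; empty sublists contribute nothing.
        counts = Counter(chain.from_iterable(group))
        result.append(sorted(v for v, n in counts.items() if n > 1))
    return result

dicts = [
    [[40, 23, 29, 41, 42], [], [19, 17, 21, 20, 24]],
    [[3, 9, 43, 44, 17], [], [20, 9, 23, 3, 27], [3, 30, 43]],
    [[2, 26, 42, 29, 44], [], [2, 3, 44, 31, 27]],
    [[31, 43, 32, 23, 33], [], [44, 9, 27, 23, 29]],
    [[12, 27, 9, 44, 2], [], [25, 29, 40, 27, 12]],
]

print(find_repeats(dicts))  # [[], [3, 9, 43], [2, 44], [23], [12, 27]]
```

An element is valid exactly when its entry in the result is an empty list.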

I've built a function that "corrects" the repeated values, but apparently it doesn't handle every case. It takes three arguments: the list of lists described above; matrix, the DataFrame whose index supplies the numbers; and cuantosamples, a list of lists that indicates how the final result is partitioned (into unevenly sized lists). Note that the code also includes a step that prevents a value already used as a replacement from being drawn again as a replacement in a later sublist:

import itertools
from itertools import islice

# NOTE: `duplicates` is a helper that is not shown in the question.
def vigilado(list1, matrix, cuantosamples):
    stored = []
    lists = [[] for _ in range(len(list1))]
    vals = list(matrix.index.values)
    for e, g in zip(list1, lists):
        vig = list(itertools.chain(*e))
        dup = list(duplicates(vig))
        lendup = len(dup)
        if lendup > 0:
            # assign new values
            # if a value was repeated in sublist 1, do not draw those values again
            vals = [v for v in vals if v not in dup and v not in vig and v not in stored]
            sample = matrix.loc[vals].sample(len(dup), weights='weights')
            vls = list(sample.index.values)
            # identify values to be replaced
            dups = [i for i, j in enumerate(vig) if j in dup]
            dups2 = dups[lendup:]
            for i in range(len(dups2)):
                vig[dups2[i]] = vls[i]
        g.extend(vig)
        stored.extend(vig)

    l1 = [[] for _ in range(len(lists))]
    for e, g, h in zip(lists, cuantosamples, l1):
        iterate = iter(e)
        l2 = [list(islice(iterate, 0, i)) for i in g]
        h.extend(l2)

    return l1

vigilated = vigilado(dicts, matrix, cuantosamples)
vigilated

This returns the following list of lists. As can be seen, it works in most cases but not in all of them, and I don't know why:

[[[40, 23, 29, 41, 42], [], [19, 17, 21, 20, 24]],
 [[3, 9, 43, 44, 17], [], [20, 9, 23, 16, 27], [33, 30, 14]], #3 and 43 are no longer repeated, BUT 9 IS STILL REPEATED
 [[2, 26, 42, 29, 44], [], [22, 3, 5, 31, 27]], #2 and 44 no longer repeated
 [[31, 43, 32, 23, 33], [], [44, 9, 27, 6, 29]], #23 no longer repeated
 [[12, 27, 9, 44, 2], [], [25, 29, 40, 1, 28]]] #27 and 12 no longer repeated

Can someone please help me? I have no idea why the code doesn't handle every case; I thought it would. Thanks.

Edit: this would be my desired output:

[[[40, 23, 29, 41, 42], [], [19, 17, 21, 20, 24]],
 [[3, 9, 43, 44, 17], [], [20, 10, 23, 16, 27], [33, 30, 14]],  #9 that wasn't replaced before is replaced here with a 10
 [[2, 26, 42, 29, 44], [], [22, 3, 5, 31, 27]], 
 [[31, 43, 32, 23, 33], [], [44, 9, 27, 6, 29]], 
 [[12, 27, 9, 44, 2], [], [25, 29, 40, 1, 28]]] 

As you can see, it's very similar to my resulting list (because my code somehow replaces all of the repeated values except one or two). The only change is that I replaced the 9 in the third sublist of the second element (lists[1][2]) with a 10.
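
One possible explanation for the leftover 9, offered as a sketch rather than a confirmed diagnosis: the helper duplicates is not shown, so the code below assumes it yields a value every time that value reappears (a value seen three times is yielded twice). Under that assumption, the slice dups[lendup:] starts one position too late whenever a value occurs more than twice. The names assumed_duplicates, positions_sliced and positions_of_repeats are hypothetical.

```python
def assumed_duplicates(values):
    """ASSUMPTION: stand-in for the undefined `duplicates`.
    Yields a value again each time it reappears after its first occurrence."""
    seen = set()
    for v in values:
        if v in seen:
            yield v
        seen.add(v)

def positions_sliced(vig):
    """Positions that the slicing logic from the question would overwrite."""
    dup = list(assumed_duplicates(vig))
    dups = [i for i, j in enumerate(vig) if j in dup]
    return dups[len(dup):]

def positions_of_repeats(vig):
    """Positions of every occurrence after the first, tracked with a set."""
    seen, out = set(), []
    for i, v in enumerate(vig):
        if v in seen:
            out.append(i)
        seen.add(v)
    return out

# Second element of the example, flattened.
vig = [3, 9, 43, 44, 17, 20, 9, 23, 3, 27, 3, 30, 43]

print(positions_sliced(vig))      # [8, 10, 12]: position 6 (the second 9) is skipped
print(positions_of_repeats(vig))  # [6, 8, 10, 12]
```

If the real helper behaves like this, the number of repeats (4) exceeds the number of first occurrences (3), which would match the output above, where only the second 9 survives.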



Solution 1:[1]

My response does not point out where the problem in your code is, but offers two approaches to your goal.

Approach 1

Generate dicts so that no index value is repeated within each list of dicts. Explanations are in the code comments.

import numpy as np

index = np.arange(100)
cuantosamples = [[5, 0, 5], [5, 0, 5, 3], [5, 0, 5], [5, 0, 5], [5, 0, 5]]

np.random.seed(0)

dicts = [
    list(map(list, # convert np.array to list
        np.split( # split a list into sublists
            np.random.choice(index, sum(needs), replace=False), # generate random choices without replacement
            np.cumsum(needs)[:-1] # how to split
        )))
    for needs in cuantosamples
]
# print(dicts)
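
For comparison, the same idea can be sketched with only the standard library. This is an illustrative variant, not the code above: the function name sample_without_repeats and its rng parameter are assumptions, and it draws uniformly, so the weights column mentioned in the question is not used.

```python
import random
from itertools import islice

def sample_without_repeats(population, sizes, rng=random):
    """Draw sum(sizes) distinct values and cut them into sublists of the given sizes."""
    # random.sample draws without replacement, so no value can appear twice.
    drawn = iter(rng.sample(list(population), sum(sizes)))
    # Consume the shared iterator piece by piece; a size of 0 gives an empty sublist.
    return [list(islice(drawn, n)) for n in sizes]

cuantosamples = [[5, 0, 5], [5, 0, 5, 3], [5, 0, 5], [5, 0, 5], [5, 0, 5]]

rng = random.Random(0)  # seeded only so that the run is repeatable
dicts = [sample_without_repeats(range(100), needs, rng) for needs in cuantosamples]
print(dicts)
```

Each call samples independently, so a value can still appear in more than one top-level element, which the question allows.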

Approach 2

Replace repeated values with new values. Explanations in code.

dicts = [
    [[40, 23, 29, 41, 42], [], [19, 17, 21, 20, 24]],    
    [[3, 9, 43, 44, 17], [], [20, 9, 23, 3, 27], [3, 30, 43]], 
    [[2, 26, 42, 29, 44], [], [2, 3, 44, 31, 27]],  
    [[31, 43, 32, 23, 33], [], [44, 9, 27, 23, 29]],
    [[12, 27, 9, 44, 2], [], [25, 29, 40, 27, 12]]
]

np.random.seed(0)

# reuses np, index and cuantosamples from Approach 1
new_dicts = []
for lists, needs in zip(dicts, cuantosamples):
    ary = np.array([x for l in lists for x in l]) # flatten lists into an array
    candidates = [x for x in index if x not in ary] # find out what to be replaced with
    values, counts = np.unique(ary, return_counts=True) # find out what to replace

    for v, c in zip(values, counts - 1):
        if c:
            picks = np.random.choice(candidates, c, replace=False) # draw c fresh values
            candidates = [x for x in candidates if x not in picks] # never hand out the same replacement twice in this group
            ary[ary==v] = np.concatenate([[v], picks]) # keep the first occurrence, replace the rest

    new_dicts.append(list(map(list, np.split(ary, np.cumsum(needs)[:-1]))))

new_dicts
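
The replacement step can also be sketched without NumPy. This is a hedged illustration, not the answer's code: the function name replace_repeats and the rng parameter are assumptions, replacements are drawn uniformly (no weights), and nothing stops the same replacement from being reused in a different top-level element.

```python
import random

def replace_repeats(group, population, rng=random):
    """Keep the first occurrence of each value in `group`;
    swap every later repeat for a value not yet present in the group."""
    used = {x for sub in group for x in sub}
    candidates = [x for x in population if x not in used]
    rng.shuffle(candidates)  # random order; pop() then never returns a value twice
    seen, fixed = set(), []
    for sub in group:
        new_sub = []
        for x in sub:
            if x in seen:
                x = candidates.pop()  # raises IndexError if the population runs out
            seen.add(x)
            new_sub.append(x)
        fixed.append(new_sub)  # sublist sizes are preserved, including empty ones
    return fixed

group = [[3, 9, 43, 44, 17], [], [20, 9, 23, 3, 27], [3, 30, 43]]
fixed = replace_repeats(group, range(100), random.Random(0))
print(fixed)
```

Because the group is walked in order, the original shape is kept and no separate split step is needed.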

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Raymond Kwok