numpy take many samples with no replacement by row

I have a really big list. Imagine it looks something like this:

test = ['llama', 'cow', 'horse', 'fish', 'sheep', 'goat', 'cat', 'dog']

I want to sample out of this list many times. I want each sample to be taken without replacement. I want to avoid for loops in this case.

I've seen many solutions on Stack Overflow that are close, but not exactly what I need here. Let's say I want each sample to be of size 3. If I wanted to sample with replacement, this would work:

np.random.choice(test, size=(100, 3))

This would give me 100 rows with a sample of 3 in each row. The problem is that any particular row might have repeats, and I can't ask it to sample without replacement, because 300 > len(test).

Is there a way around this that maintains randomness? I saw potential solutions that use np.argsort, but I'm not sure that they're still actually random, considering sorting is being done.



Solution 1:[1]

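Here's a vectorized approach using the rand + argsort/argpartition trick. Each row of np.random.rand holds independent uniform values, so the positions of the 3 smallest values in a row are a uniformly random set of 3 distinct indices; the sorting only reads off that random ranking.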

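import numpy as np
# Per row: positions of the 3 smallest random values, i.e. 3 distinct indices into test
idx = np.random.rand(100, len(test)).argpartition(3,axis=1)[:,:3]
out = np.take(test, idx)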

Let's verify with some pandas help that every row contains only unique items:

In [51]: idx = np.random.rand(100, len(test)).argpartition(3,axis=1)[:,:3]
    ...: out = np.take(test, idx)

In [52]: import pandas as pd

In [53]: (pd.DataFrame(out).nunique(axis=1).values==3).all()
Out[53]: True
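
The question asked whether the sorting step compromises randomness. The following is a minimal sketch of an empirical check, assuming NumPy's `default_rng` Generator API; the helper name `sample_rows`, the seed, and the row count are illustrative and not part of the original answer. If the trick is unbiased, each item should appear in roughly k/n = 3/8 of the rows. Note that argpartition leaves the order within the first k columns unspecified, so shuffle each row afterwards if the order within a sample matters.

```python
import numpy as np

def sample_rows(items, n_rows, k, rng=None):
    """Draw n_rows samples of k distinct items each (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    items = np.asarray(items)
    # argpartition needs kth < len(items), so k == len(items) is excluded here.
    if not 0 < k < len(items):
        raise ValueError("k must satisfy 0 < k < len(items)")
    # One row of i.i.d. uniform keys per sample; the positions of the k
    # smallest keys in a row are k distinct indices into items.
    keys = rng.random((n_rows, len(items)))
    idx = np.argpartition(keys, k, axis=1)[:, :k]
    return items[idx]

test = ['llama', 'cow', 'horse', 'fish', 'sheep', 'goat', 'cat', 'dog']
out = sample_rows(test, n_rows=20000, k=3, rng=np.random.default_rng(0))

# Every row should hold 3 distinct items.
rows_unique = all(len(set(row)) == 3 for row in out)

# Fraction of rows containing each item; expected to be close to 3/8.
freq = {name: float((out == name).any(axis=1).mean()) for name in test}
print(rows_unique, freq)
```

Passing an explicit `rng` keeps the draws reproducible; omitting it falls back to a freshly seeded generator.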

Solution 2:[2]

You can use random.sample for that. From the documentation:

Return a k length list of unique elements chosen from the population sequence. Used for random sampling without replacement.

And repeat the process n_times using a list comprehension:

import random
n_times = 100
n_sample = 3
[random.sample(test, n_sample) for _ in range(n_times)]

[['llama', 'goat', 'sheep'],
 ['cat', 'horse', 'dog'],
 ['sheep', 'dog', 'goat'],
 ['cat', 'cow', 'llama'],
 ['dog', 'fish', 'horse'],
 ['llama', 'horse', 'cow'],
 ['dog', 'goat', 'cow'],
 ['llama', 'cow', 'sheep'],
 ['fish', 'dog', 'horse'],
 ... 
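
If the result is needed as an array shaped like the `np.random.choice` output in the question, the list of lists converts directly. A minimal sketch, assuming a seeded `random.Random` instance for reproducibility (the seed and the names `rng`, `samples`, and `samples_arr` are illustrative). This still loops in Python once per row, so it does not satisfy the "no for loops" wish; it only trades speed for simplicity.

```python
import random
import numpy as np

test = ['llama', 'cow', 'horse', 'fish', 'sheep', 'goat', 'cat', 'dog']
n_times = 100
n_sample = 3

# A dedicated generator keeps the draws reproducible without touching
# the module-level random state (the seed value is arbitrary).
rng = random.Random(42)

# Each call to sample() returns n_sample distinct elements of test.
samples = [rng.sample(test, n_sample) for _ in range(n_times)]

# Stack into an (n_times, n_sample) array of strings.
samples_arr = np.array(samples)
print(samples_arr.shape)
```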

Solution 3:[3]

You could run np.random.choice without replacement once for each row and collect the results in a matrix. That can be done with this command:

np.array([np.random.choice(test, 3, replace=False) for i in range(100)])
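
For comparison, here is a sketch of the same per-row idea with NumPy's newer `Generator` API, followed by a loop-free variant. The second variant is a different technique from the answer above: it relies on `Generator.permuted`, which, as far as I know, requires NumPy 1.20 or later. The seed and the variable names are illustrative.

```python
import numpy as np

test = np.array(['llama', 'cow', 'horse', 'fish', 'sheep', 'goat', 'cat', 'dog'])
rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Per-row draws, as in the list comprehension above, using Generator.choice.
looped = np.array([rng.choice(test, 3, replace=False) for _ in range(100)])

# Loop-free variant: repeat the list once per row, shuffle each row
# independently, and keep the first 3 columns.
tiled = np.tile(test, (100, 1))
permuted = rng.permuted(tiled, axis=1)[:, :3]

print(looped.shape, permuted.shape)
```

The loop-free variant shuffles all `len(test)` items per row, so for a really big list and a small sample size it does more work and uses more memory than the per-row loop.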

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution      Source
Solution 1
Solution 2
Solution 3    Atnas