Populate Pandas dataframe with random sample from another dataframe if condition is met, when columns to be assigned are not independent

I have two DataFrames, df1 and df2. The information in df1 has to be used to populate cells in df2 if a specific condition is met. This is an example:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A":[1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4],"B":[1, 2, 3, 1, 2, 2, 3, 1, 2, 3, 4],"C":[5, 3, 2,10,11,12, 4, 5, 7, 2, 7], "D":[0.5, 0.3, 0.5, 0.7, 0.5, 0.6, 0.1, 0.6, 0.6, 0.5, 0.6]})
df2 = pd.DataFrame({"A":[5, 5, 6, 6, 6], "B":[1, 2, 1, 2, 3], "C":np.nan, "D":np.nan})

The np.nan entries in df2 are meant to represent the cells that need to be populated. These are empty at the start of the process.

To populate df2, I need to use the values in the column df2['B']. For example, if the value of df2['B'] is 1, I need to draw a random sample, with replacement, from df1[df1['B']==1] for both df1['C'] and df1['D']. Importantly, these two values are not independent, so they must come from the same row: I need to draw one random row from the subset of df1 where df1['B'] is equal to 1 and take its C and D together. I then need to do this for every row in df2, matching on that row's value of B.

df1[df1['B']==1][['C','D']].sample(replace = True) draws one random row for the case where df1['B'] is 1, but I am stuck on two points:

  1. How do I assign the corresponding values to df2?
  2. How do I do this for every row in df2?

I have tried several alternatives with loops, such as

for index, value in df2.iterrows():
    if df2.loc[index,'B'] == 1:
        temp_df = df1[df1['B'] == 1][['C','D']].sample(n = 1, replace = True)

    if df2.loc[index,'B'] == 2:
        temp_df = df1[df1['B'] == 2][['C','D']].sample(n = 1, replace = True)
        
    if df2.loc[index,'B'] == 3:
        temp_df = df1[df1['B'] == 3][['C','D']].sample(n = 1, replace = True)
        
    if df2.loc[index,'B'] == 4:
        temp_df = df1[df1['B'] == 4][['C','D']].sample(n = 1, replace = True)
        
        
    df2.loc[index, 'C'] = temp_df['C']
    df2.loc[index, 'D'] = temp_df['D']

but I get an error message saying

---> 15     df2.loc[index, 'C'] = temp_df['C']
     16     df2.loc[index, 'D'] = temp_df['D']
...
ValueError: Incompatible indexer with Series

where the ... denotes lines from the error message that I skipped.
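
One possible direction, offered as a sketch rather than a definitive answer: the error appears to come from temp_df['C'] being a one-element Series (carrying an index label from df1) rather than a scalar, so it cannot be assigned to the single cell df2.loc[index, 'C']. Taking the scalar with something like temp_df['C'].iloc[0] should avoid that. The loop can also be replaced by sampling once per distinct value of B and writing C and D together, which keeps the two columns paired. The function name fill_from_donor and its parameters key, cols and seed are hypothetical names chosen for illustration, and passing a NumPy Generator as random_state assumes a reasonably recent pandas.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "A": [1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4],
    "B": [1, 2, 3, 1, 2, 2, 3, 1, 2, 3, 4],
    "C": [5, 3, 2, 10, 11, 12, 4, 5, 7, 2, 7],
    "D": [0.5, 0.3, 0.5, 0.7, 0.5, 0.6, 0.1, 0.6, 0.6, 0.5, 0.6],
})
df2 = pd.DataFrame({"A": [5, 5, 6, 6, 6], "B": [1, 2, 1, 2, 3],
                    "C": np.nan, "D": np.nan})


def fill_from_donor(target, donor, key="B", cols=("C", "D"), seed=None):
    """Return a copy of `target` whose `cols` are filled with whole rows
    sampled (with replacement) from the rows of `donor` sharing the same `key`.

    Hypothetical helper for illustration; not part of pandas.
    """
    rng = np.random.default_rng(seed)
    cols = list(cols)
    out = target.copy()
    # .groups maps each distinct key value to the index labels of its rows.
    for value, idx in out.groupby(key).groups.items():
        pool = donor.loc[donor[key] == value, cols]
        if pool.empty:
            continue  # no matching rows in donor: leave the NaNs in place
        # Draw one donor row per target row; C and D stay paired because
        # whole rows are sampled, not individual columns.
        drawn = pool.sample(n=len(idx), replace=True, random_state=rng)
        # to_numpy() drops the donor index so assignment is positional.
        out.loc[idx, cols] = drawn.to_numpy()
    return out


result = fill_from_donor(df2, df1, seed=0)
print(result)
```

Sampling per group avoids the chain of if statements and calls sample once per distinct value of B instead of once per row. The original df2 is left unchanged because the function works on a copy.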



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
