Change values in pandas dataframe to reflect desired output in the max value counts
I have the following dataframe, df (this is a demo; the actual one is very big):
| Idx | X | Y | Z |
|---|---|---|---|
| A | 4.6 | 5.2 | 3.2 |
| B | 4.5 | 5.1 | 3.0 |
| C | 4.6 | 5.3 | 3.2 |
| D | 4.8 | 5.4 | 3.3 |
In df, we see that for each row, Y is the max - the counts are as follows:
df.idxmax(axis = 1).value_counts().reindex(df.columns, fill_value=0)
| | count |
|---|---|
| X | 0 |
| Y | 4 |
| Z | 0 |
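For anyone who wants to reproduce the counts above, here is a minimal sketch that rebuilds the demo frame. It assumes `Idx` is the index rather than a regular column, so that `idxmax` only compares X, Y and Z; the name `max_counts` is just an illustrative choice.

```python
import pandas as pd

# Demo frame from the question; "Idx" is used as the index so that
# idxmax only compares the numeric columns X, Y and Z.
df = pd.DataFrame(
    {"X": [4.6, 4.5, 4.6, 4.8],
     "Y": [5.2, 5.1, 5.3, 5.4],
     "Z": [3.2, 3.0, 3.2, 3.3]},
    index=pd.Index(["A", "B", "C", "D"], name="Idx"),
)

# Name of the column holding each row's maximum, counted per column.
# reindex adds the columns that never win, with a count of 0.
max_counts = df.idxmax(axis=1).value_counts().reindex(df.columns, fill_value=0)
print(max_counts)
```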
I would like to change the above distribution of row-wise maximum counts. The desired distribution is:
X = 0.25 of total count (= 4) (i.e. 1 instead of 0),
Y = 0.5 of total count (i.e. 2 instead of 4),
Z = 0.25 of total count (i.e. 1 instead of 0).
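When the fractions do not divide the row count evenly, the target counts need some rounding rule. One possible sketch is below; the helper name `fractions_to_counts` is hypothetical, and it assumes the fractions sum to 1. Rows lost to flooring go to the columns with the largest fractional remainders.

```python
import numpy as np

def fractions_to_counts(fractions, n_rows):
    """Turn target fractions into integer counts that sum to n_rows.

    Assumes the fractions sum to 1 (hypothetical helper, not a pandas API).
    """
    fractions = np.asarray(fractions, dtype=float)
    raw = fractions * n_rows
    counts = np.floor(raw).astype(int)
    # Hand the rows lost to flooring to the largest fractional remainders.
    shortfall = n_rows - counts.sum()
    order = np.argsort(-(raw - counts), kind="stable")
    counts[order[:shortfall]] += 1
    return counts

target = fractions_to_counts([0.25, 0.5, 0.25], 4)
print(target)  # [1 2 1]
```

For the example in the question this reproduces the counts 1, 2 and 1 for X, Y and Z.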
Hence I would like to change the values of the df (through iteration?) to reflect the distribution I want. How would I do that? I tried it through iteration but I was lost from the start.
I tried something like:
df['X'] = df['X'] + i
df['Y'] = df['Y'] + j
df['Z'] = df['Z'] + k
and finding i, j, k through iteration that would give me the desired max value counts.
Edit: Taking into account "mozway"'s question: I do not want the values to change much. Here is an example in which I found i, j, k manually (in practice I would like this to be automated, as the actual dataframe is quite large with many columns and the target count distribution is not as straightforward as in this example):
df['X'] = df['X'] + 0.6
df['Y'] = df['Y'] - 0.03
df['Z'] = df['Z'] + 2.01
df.idxmax(axis = 1).value_counts().reindex(df.columns, fill_value=0)
With the above, I get the desired output:
| | count |
|---|---|
| X | 1 |
| Y | 2 |
| Z | 1 |
So I would like a way to step i, j, k through many candidate values (positive and negative) until a combination is found that gives the desired counts. Note that i, j, k need not be unique; any combination that produces the desired value counts is acceptable.
I know you want people to show their work, but I am completely lost here. Any help in this is much appreciated. Please let me know if you would like some more explanation. Thanks in advance!
Solution 1:[1]
You could try sampling a specific fraction of the dataset and assigning a new value to 'X'/'Z'. Based on your input data that would be:
sam_x = df.sample(frac=0.25)
sam_z = df[~df.index.isin(sam_x.index)].sample(n=len(sam_x))  # same row count as sam_x; frac=0.25 of the remaining rows would be less than 25% of df
df.loc[sam_x.index,'X'] = df.loc[sam_x.index, 'Y']+0.1
df.loc[sam_z.index,'Z'] = df.loc[sam_z.index, 'Y']+0.1
Output:
X Y Z
Idx
A 4.6 5.2 5.3
B 4.5 5.1 3.0
C 5.4 5.3 3.2
D 4.8 5.4 3.3
These are the percentages:
df.idxmax(axis = 1).value_counts(normalize=True)
Out:
Y 0.50
Z 0.25
X 0.25
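The same idea can be generalized to any number of columns and any target counts. The sketch below is one way to do it, not part of the original answer: `force_max_counts`, `eps` and `seed` are hypothetical names, and `target_counts` is assumed to be aligned with `df.columns` and to sum to the number of rows. Note that, like the sampling approach above, it edits individual cells rather than shifting whole columns by a constant.

```python
import numpy as np
import pandas as pd

def force_max_counts(df, target_counts, eps=0.1, seed=None):
    """Copy df so that column c is the row-wise maximum in target_counts[c] rows.

    target_counts is aligned with df.columns and must sum to len(df).
    """
    rng = np.random.default_rng(seed)
    out = df.copy()
    # One winner label per row: each column name repeated by its target count.
    winners = np.repeat(out.columns.to_numpy(), target_counts)
    rng.shuffle(winners)  # random assignment of winners to rows
    row_max = out.max(axis=1).to_numpy()
    current = out.idxmax(axis=1).to_numpy()
    for col in out.columns:
        # Only touch rows assigned to `col` where it is not already the maximum.
        mask = (winners == col) & (current != col)
        if mask.any():
            out.loc[mask, col] = row_max[mask] + eps
    return out

df = pd.DataFrame(
    {"X": [4.6, 4.5, 4.6, 4.8],
     "Y": [5.2, 5.1, 5.3, 5.4],
     "Z": [3.2, 3.0, 3.2, 3.3]},
    index=pd.Index(["A", "B", "C", "D"], name="Idx"),
)
result = force_max_counts(df, [1, 2, 1], seed=0)
new_counts = result.idxmax(axis=1).value_counts().reindex(df.columns, fill_value=0)
print(new_counts)
```

Which rows end up with X or Z as the maximum depends on the seed, but the counts per column match the targets because every reassigned cell is set strictly above the old row maximum.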
Solution 2:[2]
Here is my code to find i, j and k that satisfy the condition above:
import numpy as np

i_set = set(np.linspace(0., 0.1, 100))
j_set = set(np.linspace(-0.05, -0.1, 100))
k_set = set(np.linspace(2, 2.05, 100))
i_lst = []
j_lst = []
k_lst = []
for i in i_set:
    for j in j_set:
        for k in k_set:
            df = df_originalsample.copy()  # to reset df after each loop
            df['X'] = df['X'] + i
            df['Y'] = df['Y'] + j
            df['Z'] = df['Z'] + k
            counts = df.idxmax(axis=1).value_counts().reindex(df.columns, fill_value=0)  # computed once per combination
            if counts.tolist() == [1, 2, 1]:  # desired counts for X, Y, Z
                print('FOUND the optimum values!')
                i_lst.append(i)
                j_lst.append(j)
                k_lst.append(k)
                break
        else:
            continue  # no match in the k loop; try the next j
        break  # match found; leave the j loop
    else:
        continue  # no match for this i; try the next i
    break  # match found; leave the i loop
print(df.idxmax(axis=1).value_counts())
The big problem with the above code is that it is very time-consuming. I know nested for loops are not the best approach here, but the code works. If anyone has a better idea to speed this up, please share it here. Thanks!
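One possible speed-up is to keep the grid search but evaluate every offset combination at once with NumPy broadcasting instead of copying the dataframe inside three Python loops. The sketch below is only an illustration under stated assumptions: `grid_search_offsets` and `grids` are hypothetical names, the frame `demo` is made-up data (not the frame from the question), and the grid range is arbitrary. Among all combinations that hit the target counts, it returns the one with the smallest total absolute shift, since the values should not change much.

```python
import itertools
import numpy as np
import pandas as pd

def grid_search_offsets(df, target_counts, grids):
    """Return per-column offsets with the smallest total absolute shift
    that reproduce target_counts, or None if no grid point does."""
    values = df.to_numpy(dtype=float)                      # (n_rows, n_cols)
    offsets = np.array(list(itertools.product(*grids)))    # (n_combos, n_cols)
    # Broadcast every offset combination over every row in one step.
    shifted = values[None, :, :] + offsets[:, None, :]     # (n_combos, n_rows, n_cols)
    winners = shifted.argmax(axis=2)                       # (n_combos, n_rows)
    n_cols = values.shape[1]
    # Per combination, count how often each column is the row maximum.
    counts = (winners[:, :, None] == np.arange(n_cols)).sum(axis=1)
    hits = np.flatnonzero((counts == np.asarray(target_counts)).all(axis=1))
    if hits.size == 0:
        return None
    best = hits[np.abs(offsets[hits]).sum(axis=1).argmin()]
    return pd.Series(offsets[best], index=df.columns)

# Made-up data for illustration only (not the frame from the question).
demo = pd.DataFrame(
    {"X": [4.0, 4.6, 4.0, 4.1],
     "Y": [5.0, 5.0, 5.0, 5.1],
     "Z": [3.0, 3.0, 3.8, 3.1]},
    index=list("ABCD"),
)
grid = np.linspace(-1.0, 1.0, 21)  # 21 candidate offsets per column
best = grid_search_offsets(demo, [1, 2, 1], [grid, grid, grid])
found_counts = (demo + best).idxmax(axis=1).value_counts().reindex(demo.columns, fill_value=0)
print(best)
print(found_counts)
```

The intermediate array has one entry per combination, row and column, so for a large dataframe or a fine grid the combinations would need to be processed in chunks. Also note that constant per-column offsets cannot reach every target distribution; in that case the function returns `None`.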
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
