'Adding new customers in a dataframe after calculating their loyalty value
I have a dataset df which includes information on old customers and a list of customers that may be new or old. I would like to assign a loyalty value to new customers, i.e. those customers not evaluated yet in the dataset df.
Please see below an example of what I would like to achieve.
Let's say that I have a list of potential new customers (e.g., customers=['cust1', 'cust76']) whose information on their loyalty might be missing in the df dataset (called here existing_cust_df).
existing_cust_df=pd.read(path.csv)
The dataset has only two fields: one refers to customer's ID and the other one refers to the loyalty values (called just value).
Customer value
cust2 13
cust3 14
cust6 35
cust7 21
cust1 24
I would like to determine the loyalty of the new customer(s) (in this example will be only cust76, as cust1 is already in the dataset and we have information on its value - ie. 24) based on the value of existing customers included in the existing_cust_df dataset.
To determine the value of new customers, I use a distinct dataset, called extra_df, which includes the new customer and a list of potential customers that might be used to determine its loyalty value. The extra_df dataset looks like as follows:
Customer Related_customer
cust76 cust32
cust76 cust2
cust76 cust1
The extra_df can then be used to look at related customers that might be included in the existing_cust_df in order to get their values from there and assign to the new customer by averaging those values.
In the example I provided, cust76 is missing in the existing_cust_df so I look at the extra_df where, out of three customers related to it, there are only two (cust2 and cust1) that are in the existing_cust_df so their values.
By averaging their values (13 and 24), I would like to assign this value (18.5) to the new customer (cust76) and append its ID (cust76) and its calculated value (18.5) to the existing_cust_df.
So at the end I would have:
Customer value
cust2 13
cust3 14
cust6 35
cust7 21
cust1 24
cust76 18.5
I hope the steps above make sense; if they do not, please let me know.
Can you please tell me how to run this check, looking at a customer in the list into the existing_cust_df, then in the extra_df if not included, calculating its average and finally update the value in the existing_cust_df?
I know that to append a row in an existing dataframe I should do:
new_row = {'Customer': cust76, 'value': 18.5}
#append row to the dataframe
existing_cust_df = existing_cust_df.append(new_row, ignore_index=False)
But the difficulties in this case are to look at customers in the list that are not in the existing_cust_df and, then, at extra_df, in order to include them and their values once calculated the average (otherwise, it would be also good to assign a null value).
Any help would be great.
Solution 1:[1]
Let's first build our dataframes :
csvfile = StringIO(
"""Customer\tvalue
cust2\t13
cust3\t14
cust6\t35
cust7\t21
cust1\t24""")
existing_cust_df = pd.read_csv(csvfile, sep = '\t', engine='python')
csvfile = StringIO("""
Customer\tRelated_customer
cust76\tcust32
cust76\tcust2
cust76\tcust1""")
extra_df = pd.read_csv(csvfile, sep = '\t', engine='python')
We want to loop for each input in extra_df, let's groupby :
dg = extra_df.groupby('Customer')
Then we can build a mask for each customer in extra_cust_df : True if a related_customer is in existing_cust_df.
dg.Related_customer.apply(lambda x : existing_cust_df.Customer.isin(x.to_numpy()))
Out[123]:
Customer
cust76 0 True
1 False
2 False
3 False
4 True
Name: Related_customer, dtype: bool
We can then apply this mask on existing_cust_df.value and then average it :
dg.Related_customer.apply(lambda x : existing_cust_df.query("Customer.isin(@x.to_numpy())"))
Out[125]:
Customer value
Customer
cust76 0 cust2 13
4 cust1 24
dg.Related_customer.apply(lambda x : existing_cust_df.query("Customer.isin(@x.to_numpy())").value.mean())
Out[126]:
Customer
cust76 18.5
Name: Related_customer, dtype: float64
Finally we can append it to your existing df
Solution 2:[2]
Here is a way to do what your question asks:
import pandas as pd
import numpy as np
# Objective:
# for each customer in customers not already in existing_cust_df,
# add a record to existing_cust_df with value equal to
# the average of the values from existing_cust_df
# of Related_customer records in extra_df
# where extra_df.Customer == customer and Related_customer is in existing_cust_df.
existing_cust_df = pd.DataFrame({'Customer':['cust2','cust3','cust6','cust7','cust1'], 'value':[13,14,35,21,24]})
print("existing_cust_df:")
print(existing_cust_df)
extra_df = pd.DataFrame({'Customer':['cust76','cust76','cust76', 'cust77', 'cust77'], 'Related_customer':['cust32','cust2','cust1', 'cust2', 'cust6']})
print("extra_df:")
print(extra_df)
customers=['cust1', 'cust76', 'cust77']
new_cust_df = pd.DataFrame({'Customer': [c for c in customers if c not in set(existing_cust_df['Customer'])]})
print("new_cust_df:")
print(new_cust_df)
# create a record for each new_cust containing list of related_cust names
# with matching customer in extra_df where related_cust is a customer in existing_cust_df
existing_related_customer_values = \
pd.merge( \
pd.merge(extra_df, new_cust_df, on= 'Customer'), \
existing_cust_df, left_on='Related_customer', right_on='Customer', suffixes=(None, '_y')) \
[['Customer', 'Related_customer', 'value']]
print("existing_related_customer_values\n (JOIN from new_cust_df and extra_df on Customer [KEEP],\n then JOIN on extra_df.Related_customer and existing_cust_df.Customer [KEEP value])")
print(existing_related_customer_values)
# find the mean of values for existing Related_customer records for new customers
new_cust_df['value'] = new_cust_df.apply(lambda x: np.mean( \
existing_related_customer_values[existing_related_customer_values['Customer'] == x['Customer']]['value']), axis=1)
print("new_cust_df (UPDATED with value column):")
print(new_cust_df)
existing_cust_df = pd.concat([existing_cust_df, new_cust_df], ignore_index=True)
print("existing_cust_df (UPDATED with new customer rows):")
print(existing_cust_df)
Output:
existing_cust_df:
Customer value
0 cust2 13
1 cust3 14
2 cust6 35
3 cust7 21
4 cust1 24
extra_df:
Customer Related_customer
0 cust76 cust32
1 cust76 cust2
2 cust76 cust1
3 cust77 cust2
4 cust77 cust6
new_cust_df:
Customer
0 cust76
1 cust77
existing_related_customer_values
(JOIN from new_cust_df and extra_df on Customer [KEEP],
then JOIN on extra_df.Related_customer and existing_cust_df.Customer [KEEP value])
Customer Related_customer value
0 cust76 cust2 13
1 cust77 cust2 13
2 cust76 cust1 24
3 cust77 cust6 35
new_cust_df (UPDATED with value column):
Customer value
0 cust76 18.5
1 cust77 24.0
existing_cust_df (UPDATED with new customer rows):
Customer value
0 cust2 13.0
1 cust3 14.0
2 cust6 35.0
3 cust7 21.0
4 cust1 24.0
5 cust76 18.5
6 cust77 24.0
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Zelemist |
| Solution 2 |
