Add a new column based on two dataframes and conditions
How can I add a new column based on two dataframes and conditions? For example, if df2['x'] is between df1['x']±2.5 and df2['y'] is between df1['y']±2.5, give 1 otherwise 0.
import pandas as pd
data = {'x': [40.1, 50.1, 60.1, 70.1, 80.1, 90.1, 0, 300.1 ], 'y': [100.1, 110.1, 120.1, 130.1, 140.1, 150.1, 160.1, 400.1], 'year': [2000, 2000, 2001, 2001, 2003, 2003, 2003, 2004]}
df = pd.DataFrame(data)
df
x y year
0 40.1 100.1 2000
1 50.1 110.1 2000
2 60.1 120.1 2001
3 70.1 130.1 2001
4 80.1 140.1 2003
5 90.1 150.1 2003
6 0.0 160.1 2003
7 300.1 400.1 2004
df2
data2 = {'x': [92.2, 30.1, 82.6, 51.1, 39.4, 10.1, 0, 299.1], 'y': [149.3, 100.1, 139.4, 111.1, 100.8, 180.1, 0, 402.5], 'year': [1950, 1951, 1952, 2000, 2000, 1954, 1955, 2004]}
df2 = pd.DataFrame(data2)
df2
x y year
0 92.2 149.3 1950
1 30.1 100.1 1951
2 82.6 139.4 1952
3 51.1 111.1 2000
4 39.4 100.8 2000
5 10.1 180.1 1954
6 0.0 0.0 1955
7 299.1 402.5 2004
Output: df
new_col = []
for i in df.index:
    # Row i of df is compared only with row i of df2.
    if ((df['x'].iloc[i] - 2.5) < df2['x'].iloc[i] < (df['x'].iloc[i] + 2.5) and
            (df['y'].iloc[i] - 2.5) < df2['y'].iloc[i] < (df['y'].iloc[i] + 2.5) and
            df['year'].iloc[i] == df2['year'].iloc[i]):
        out = 1
    else:
        out = 0
    if out == 1:
        new_col.append(1)
    else:
        new_col.append(0)
df['Result'] = new_col
df
x y year Result
0 40.1 100.1 2000 0
1 50.1 110.1 2000 0
2 60.1 120.1 2001 0
3 70.1 130.1 2001 0
4 80.1 140.1 2003 0
5 90.1 150.1 2003 0
6 0.0 160.1 2003 0
7 300.1 400.1 2004 1
But this output is not what I want, because the code only compares the two dataframes row by row. What I want to know is: does the first row of df match any row of df2 under these conditions? In other words, every row in df should be checked against all rows in df2. So the expected output should be as below:
Expected output: df
As you can see, 3 rows satisfy the conditions:
0 in df --> 4 in df2
1 in df --> 3 in df2
7 in df --> 7 in df2
So expected output:
x y year Result
0 40.1 100.1 2000 1
1 50.1 110.1 2000 1
2 60.1 120.1 2001 0
3 70.1 130.1 2001 0
4 80.1 140.1 2003 0
5 90.1 150.1 2003 0
6 0.0 160.1 2003 0
7 300.1 400.1 2004 1
Solution 1:[1]
I have found this code to work but please comment if it does not:
import pandas as pd
data = {'x': [431228.6013, 431233.6013], 'y': [4522094.758, 4522094.758]}
df = pd.DataFrame(data)
data2 = {'x': [431226.7421, 431280.9052], 'y': [4522093.800, 4522060.532]}
df2 = pd.DataFrame(data2)
new_col = []
for i in df.index:
    # A coordinate matches when the two values differ by at most 2.5.
    symbol = 'x'
    if abs(df[symbol].iloc[i] - df2[symbol].iloc[i]) <= 2.5:
        x_out = 1
    else:
        x_out = 0
    symbol = 'y'
    if abs(df[symbol].iloc[i] - df2[symbol].iloc[i]) <= 2.5:
        y_out = 1
    else:
        y_out = 0
    # Both coordinates have to match, not merely agree with each other.
    if x_out == 1 and y_out == 1:
        new_col.append(1)
    else:
        new_col.append(0)
df['Result'] = new_col
With this I got the answers that you expected above. Note that it compares df and df2 row by row (row i against row i), so the two dataframes have to be the same length for this to work.
Hope this helped!
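If the dataframes differ in length, or each row of df should be checked against every row of df2 instead of only the row at the same position, one possible approach is to pair the rows with a merge first. The following is only a sketch under assumptions: it uses the sample data from the question, treats `year` as an exact-match key and uses strict inequalities (both as in the question's loop), and the names `tol`, `pairs`, `close` and `matched` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'x': [40.1, 50.1, 60.1, 70.1, 80.1, 90.1, 0, 300.1],
    'y': [100.1, 110.1, 120.1, 130.1, 140.1, 150.1, 160.1, 400.1],
    'year': [2000, 2000, 2001, 2001, 2003, 2003, 2003, 2004],
})
df2 = pd.DataFrame({
    'x': [92.2, 30.1, 82.6, 51.1, 39.4, 10.1, 0, 299.1],
    'y': [149.3, 100.1, 139.4, 111.1, 100.8, 180.1, 0, 402.5],
    'year': [1950, 1951, 1952, 2000, 2000, 1954, 1955, 2004],
})

tol = 2.5  # assumed tolerance, taken from the question

# Pair every row of df with every row of df2 that has the same year.
# reset_index() keeps the original df row label in a column named 'index'.
pairs = df.reset_index().merge(df2, on='year', suffixes=('', '_2'))

# A pair is a match when both coordinates are within the tolerance.
close = ((pairs['x'] - pairs['x_2']).abs().lt(tol)
         & (pairs['y'] - pairs['y_2']).abs().lt(tol))

# Rows of df that have at least one matching partner in df2.
matched = pairs.loc[close, 'index'].unique()
df['Result'] = df.index.isin(matched).astype(int)
print(df)
```

Because the merge only pairs rows that share a year, the intermediate frame stays small when years rarely coincide. It relies on df having a unique index.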
Solution 2:[2]
One-line solution:
df['Result'] = (df - df2)[['x', 'y']].abs().le(2.5).all(axis=1).astype(int)
Explanation: this relies on most operators and functions on DataFrames and Series being vectorized, and not just arithmetic and logical expressions: .le(), .all()/.any(), .sum() and .apply() all take an optional axis argument (e.g. axis=1).
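As a small illustration of what the axis argument does, here is a sketch on a made-up boolean frame (`flags` is hypothetical and not part of the question's data): `axis=1` reduces across the columns and gives one value per row, while `axis=0` reduces down the rows and gives one value per column.

```python
import pandas as pd

# Hypothetical boolean frame, used only to show the effect of `axis`.
flags = pd.DataFrame({'x': [True, True, True],
                      'y': [True, False, True]})

per_row = flags.all(axis=1)  # one boolean per row: are all columns True?
per_col = flags.all(axis=0)  # one boolean per column: are all rows True?

print(per_row.tolist())
print(per_col.tolist())
```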
First, slice the two columns of interest, vector-subtract them, compare the absolute value of the difference to 2.5 (instead of the three-way comparison -2.5 < ... < 2.5):
(df - df2)[['x','y']].abs().le(2.5)
x y
0 False False
1 False False
2 False False
3 False False
4 False False
5 False False
6 True False
7 True True
Now for each row (..., axis=1) we need to logical-and the columns into a boolean value, which we can then convert to int:
(df - df2)[['x','y']].abs().le(2.5).all(axis=1).astype(int)
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 1
Note:
- vectorization is faster, usually gives clearer, shorter code (avoid all that repetitive clunky df['x'].iloc[i]), and multiple operations/functions can be arbitrarily composed, as we do here.
- in your case, you want to take columns ['x', 'y', 'year'] all from df, then concatenate df['Result']. So essentially everything comes from df and we're just appending one new column. We don't even need to do pd.concat([df, ...], axis=1), we might as well just directly assign df['Result'], it gets appended.
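The expression above still compares row i of df with row i of df2. To check each row of df against all rows of df2, as the question asks, the same vectorized idea can be extended with NumPy broadcasting. This is a minimal sketch under assumptions: it uses the question's sample data, requires `year` to match exactly and uses strict inequalities (as in the question's loop), and the names `tol`, `dx`, `dy`, `same_year` and `match` are invented for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'x': [40.1, 50.1, 60.1, 70.1, 80.1, 90.1, 0, 300.1],
    'y': [100.1, 110.1, 120.1, 130.1, 140.1, 150.1, 160.1, 400.1],
    'year': [2000, 2000, 2001, 2001, 2003, 2003, 2003, 2004],
})
df2 = pd.DataFrame({
    'x': [92.2, 30.1, 82.6, 51.1, 39.4, 10.1, 0, 299.1],
    'y': [149.3, 100.1, 139.4, 111.1, 100.8, 180.1, 0, 402.5],
    'year': [1950, 1951, 1952, 2000, 2000, 1954, 1955, 2004],
})

tol = 2.5  # assumed tolerance, taken from the question

# Each array has shape (len(df), len(df2)):
# entry [i, j] compares row i of df with row j of df2.
dx = np.abs(df['x'].to_numpy()[:, None] - df2['x'].to_numpy()[None, :])
dy = np.abs(df['y'].to_numpy()[:, None] - df2['y'].to_numpy()[None, :])
same_year = df['year'].to_numpy()[:, None] == df2['year'].to_numpy()[None, :]

# A row of df gets 1 if at least one row of df2 satisfies all conditions.
match = (dx < tol) & (dy < tol) & same_year
df['Result'] = match.any(axis=1).astype(int)
print(df)
```

The intermediate arrays hold len(df) × len(df2) entries, so this is convenient for small and medium inputs but can use a lot of memory for very large dataframes.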
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
