'How to create a for loop with multiple constraints while appending to a new df?
I tried to create a for loop with if statements that would append to a new df, but it's not working out. I am very new to this.
This code is my try at an algorithm that decides whether ride is being shared or not, while following a number of constraints.
match = []
for all rows in d12:
if d12.loc[d12['time_min']]-5 < d12['time_min'] < d12['time_min'] +5:
continue
else:
pass
if d12.loc[df11['distance_km']]-1 < d12['distance_km'] < d12['distance_km']+1:
continue
else:
pass
if df12.loc[d12[sum['passenger_count']] <= 5:
match.append()
else:
pass
So, it needs to go through all the rows and find matches based on constrains. Everytime a match is found between 2 rows, both of these rows that matched go into a dataframe called match[] with all of the related columns. If one of the constraints is not meant, it should move on to a next row.
Once it's done, these matches are deleted from the dataframe d12.
Constraints explained in detail:
1. time_min between both trips needs to be less or equal to 5 minutes.
2. distance_km between both trips has to be +/-1km
3. A sum of passenger_count of two trips that are being combined has to be less or equal to 5.
4. A match can only be combined with two rows of the data.
Data example:
| ID | time_min | distance_km | passenger_count |
|---|---|---|---|
| 1 | 450 | 0.3 | 2 |
| 2 | 453 | 0.75 | 1 |
| 3 | 564 | 1.35 | 4 |
| 4 | 600 | 1.25 | 1 |
| 5 | 560 | 0.80 | 1 |
Solution 1:[1]
Try using a custom function to find possible matches. Then filter your DataFrame by whether a match is found or not:
def find_match(row, df):
other = df.drop(row.name)
match = other[(other["time_min"].between(row["time_min"]-5,row["time_min"]+5, inclusive="both")) &
(other["distance_km"].between(row["distance_km"]-1,row["distance_km"]+1, inclusive="both")) &
(other["passenger_count"]+row["passenger_count"]<=5)]
if match.shape[0]>0:
return match["ID"].iat[0]
return None
df["Match ID"] = df.apply(lambda row: find_match(row, df), axis=1)
match = df[df["Match ID"].notnull()].drop("Match ID", axis=1)
singles = df[df["Match ID"].isnull()].drop("Match ID", axis=1)
>>> match
ID time_min distance_km passenger_count
0 1 450 0.30 2
1 2 453 0.75 1
2 3 564 1.35 4
4 5 560 0.80 1
>>> singles
ID time_min distance_km passenger_count
3 4 600 1.25 1
Solution 2:[2]
I approached this in a natural way, and comparison logic can be easily customized:
import pandas as pd
df = pd.DataFrame({
'time_min': [450, 453, 564, 600, 560],
'distance_km': [0.3, 0.75, 1.35, 1.25, 0.8],
'passenger_count': [2, 1, 4, 1, 1]
}, index=[1, 2, 3, 4, 5])
match = pd.DataFrame()
for idx1, first in df.iterrows():
for idx2, second in df.iterrows():
if idx1 <= idx2: # because a,b == b,a
continue
# your comparison logic goes here:
# 1. time_min between both trips needs to be less or equal to 5 minutes.
# 2. distance_km between both trips has to be +/-1km
# 3. A sum of passenger_count of two trips that are being combined has to be less or equal to 5.
# 4. A match can only be combined with two rows of the data.
if (
abs(first['time_min'] - second['time_min']) <= 5
and abs(first['distance_km'] - second['distance_km']) <= 1
and first['passenger_count'] + second['passenger_count'] <= 5
):
result = pd.DataFrame.from_dict([first, second])
match = pd.concat([match, result])
print(match)
Solution 3:[3]
UPDATED:
Here is a way to:
- collect all matching rows with a configurable upper limit (in your original question you specified that a "match can only be combined with two rows of the data", but it's not clear if you meant that only 2 rows in total can be included in a match, or 2 rows in addition to the row under consideration for a total of 3 rows);
- delete these matching rows from the original dataframe.
Notes on the algorithm and test case:
- We first sort the input dataframe by
time_minso that we can use a sliding window to efficiently identify rows matching one of the constraints (rows with times within 5 minutes of one another) without having to do a full brute force comparison of all rows with all other rows. - The test case includes a few more rows than in the question, to show the behavior in the case of more than 2 rows that are possible matches.
Here's the code:
import pandas as pd
d12 = pd.DataFrame([
[1, 450, 0.3, 2],
[2, 453, 0.75, 1],
[3, 564, 1.35, 4],
[4, 600, 1.25, 1],
[5, 560, 0.80, 1],
[6, 700, 1.26, 1],
[7, 701, 1.27, 1],
[8, 702, 1.28, 1],
[9, 703, 1.29, 1]
],
columns=['ID', 'time_min', 'distance_km', 'passenger_count'])print(d12)
df2 = d12.sort_values('time_min').reset_index(drop=True)
candidatesByIdx = []
minMax = [0, 0]
def foo(x):
i, j = minMax
while x.ID != df2.ID.iloc[i] and x.time_min - df2.time_min.iloc[i] > 5:
i += 1
while i >= 0 and x.time_min - df2.time_min.iloc[i] <= 5:
i -= 1
while j < len(df2.index) and df2.time_min.iloc[j] - x.time_min <= 5:
j += 1
minMax[:] = [i + 1, j - 1]
L = []
for k in range(i + 1, j):
if x.ID == df2.ID.iloc[k]:
continue
if abs(x.distance_km - df2.distance_km.iloc[k]) <= 1 and x.passenger_count + df2.passenger_count.iloc[k] <= 5:
L.append(k)
return L
df2['cand_by_index'] = df2.apply(foo, axis=1)
maxMatches = 3
from collections import defaultdict
matchesToIgnore = defaultdict(set)
def baz(x):
if not x.cand_by_index:
return x.cand_by_index
L = x.cand_by_index if x.index_copy not in matchesToIgnore else list(set(x.cand_by_index) - matchesToIgnore[x.index_copy])
if len(L) > maxMatches - 1:
for i in range(maxMatches - 1, len(L)):
matchesToIgnore[L[i]].add(x.index_copy)
return L[:maxMatches - 1]
return L
df2['cand_by_index'] = df2.assign(index_copy=df2.index).apply(baz, axis=1)
match = df2[df2['cand_by_index'].str.len() > 0].drop(columns=['cand_by_index'])
d12 = d12[~d12.ID.isin(match.ID.to_list())]
print(d12)
Here is d12 in its initial state:
ID time_min distance_km passenger_count
0 1 450 0.30 2
1 2 453 0.75 1
2 3 564 1.35 4
3 4 600 1.25 1
4 5 560 0.80 1
5 6 700 1.26 1
6 7 701 1.27 1
7 8 702 1.28 1
8 9 703 1.29 1
Here is match:
ID time_min distance_km passenger_count
0 1 450 0.30 2
1 2 453 0.75 1
2 5 560 0.80 1
3 3 564 1.35 4
5 6 700 1.26 1
6 7 701 1.27 1
7 8 702 1.28 1
Here is d12 with matches deleted:
ID time_min distance_km passenger_count
3 4 600 1.25 1
8 9 703 1.29 1
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | not_speshal |
| Solution 2 | tyson.wu |
| Solution 3 |
