How to create a for loop with multiple constraints while appending to a new df?

I tried to create a for loop with if statements that would append to a new df, but it's not working out. I am very new to this.

This code is my attempt at an algorithm that decides whether a ride is shared or not while following a number of constraints.

match = []
for all rows in d12:
    if d12.loc[d12['time_min']]-5 < d12['time_min'] < d12['time_min']+5:
        continue
    else:
        pass
        if d12.loc[df11['distance_km']]-1 < d12['distance_km'] < d12['distance_km']+1:
            continue
        else:
            pass
            if df12.loc[d12[sum['passenger_count']] <= 5:
                match.append()
            else:
                pass

So, it needs to go through all the rows and find matches based on the constraints. Every time a match is found between two rows, both matched rows go into a dataframe called match with all of the related columns. If one of the constraints is not met, it should move on to the next row.

Once it's done, these matches are deleted from the dataframe d12.

Constraints explained in detail:

1. The difference in time_min between the two trips must be less than or equal to 5 minutes.
2. The difference in distance_km between the two trips must be within +/-1 km.
3. The sum of passenger_count of the two trips being combined must be less than or equal to 5.
4. A match can only combine two rows of the data.

Data example:

ID  time_min  distance_km  passenger_count
 1       450         0.3                 2
 2       453         0.75                1
 3       564         1.35                4
 4       600         1.25                1
 5       560         0.80                1
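For reference, constraints 1-3 can be written as a single pairwise check. This is a sketch, not the asker's code; the helper name `is_match` is made up (constraint 4, that a match combines only two rows, is enforced by the surrounding loop, not by this check):

```python
import pandas as pd

def is_match(a, b):
    """Return True if two trip rows satisfy constraints 1-3."""
    return (
        abs(a["time_min"] - b["time_min"]) <= 5               # constraint 1
        and abs(a["distance_km"] - b["distance_km"]) <= 1     # constraint 2
        and a["passenger_count"] + b["passenger_count"] <= 5  # constraint 3
    )

# Trips 1 and 2 from the sample data satisfy all three constraints.
d12 = pd.DataFrame({
    "ID": [1, 2],
    "time_min": [450, 453],
    "distance_km": [0.3, 0.75],
    "passenger_count": [2, 1],
})
print(is_match(d12.iloc[0], d12.iloc[1]))  # True
```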


Solution 1:[1]

Try using a custom function to find possible matches. Then filter your DataFrame by whether a match is found or not:

def find_match(row, df):
    # Compare the current row against every other row.
    other = df.drop(row.name)

    # Apply all three constraints at once with boolean masks.
    match = other[(other["time_min"].between(row["time_min"]-5, row["time_min"]+5, inclusive="both")) &
                  (other["distance_km"].between(row["distance_km"]-1, row["distance_km"]+1, inclusive="both")) &
                  (other["passenger_count"]+row["passenger_count"] <= 5)]

    # Return the ID of the first matching row, if any.
    if match.shape[0] > 0:
        return match["ID"].iat[0]
    return None

df["Match ID"] = df.apply(lambda row: find_match(row, df), axis=1)

match = df[df["Match ID"].notnull()].drop("Match ID", axis=1)
singles = df[df["Match ID"].isnull()].drop("Match ID", axis=1)

>>> match
   ID  time_min  distance_km  passenger_count
0   1       450         0.30                2
1   2       453         0.75                1
2   3       564         1.35                4
4   5       560         0.80                1

>>> singles
   ID  time_min  distance_km  passenger_count
3   4       600         1.25                1
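The question also asks that matched rows be removed from d12 afterwards. One way is a boolean mask with `isin`; this sketch hard-codes the IDs collected in match above for illustration:

```python
import pandas as pd

d12 = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "time_min": [450, 453, 564, 600, 560],
    "distance_km": [0.3, 0.75, 1.35, 1.25, 0.8],
    "passenger_count": [2, 1, 4, 1, 1],
})

matched_ids = [1, 2, 3, 5]  # the IDs that appear in `match` above
d12 = d12[~d12["ID"].isin(matched_ids)]
print(d12["ID"].tolist())  # [4]
```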

Solution 2:[2]

I approached this in a straightforward way, and the comparison logic can easily be customized:

import pandas as pd

df = pd.DataFrame({
    'time_min': [450, 453, 564, 600, 560],
    'distance_km': [0.3, 0.75, 1.35, 1.25, 0.8],
    'passenger_count': [2, 1, 4, 1, 1]
}, index=[1, 2, 3, 4, 5])

match = pd.DataFrame()

for idx1, first in df.iterrows():
    for idx2, second in df.iterrows():
        if idx1 <= idx2: # because a,b == b,a
            continue

        # your comparison logic goes here:
        # 1. time_min between both trips needs to be less or equal to 5 minutes. 
        # 2. distance_km between both trips has to be +/-1km
        # 3. A sum of passenger_count of two trips that are being combined has to be less or equal to 5. 
        # 4. A match can only be combined with two rows of the data.
        if (
            abs(first['time_min'] - second['time_min']) <= 5
            and abs(first['distance_km'] - second['distance_km']) <= 1
            and first['passenger_count'] + second['passenger_count'] <= 5
        ):
            result = pd.DataFrame([first, second])
            match = pd.concat([match, result])
print(match)
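One caveat with the nested loop: a trip that satisfies the constraints with several other trips is appended once per pair, so match can contain the same row more than once. A possible fix (not part of the original answer) is to drop duplicate index labels:

```python
import pandas as pd

# Hypothetical result where trip 1 matched both trips 2 and 4.
match = pd.DataFrame(
    {"time_min": [450, 453, 450, 451]},
    index=[1, 2, 1, 4],
)
match = match[~match.index.duplicated(keep="first")]
print(match.index.tolist())  # [1, 2, 4]
```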

Solution 3:[3]

UPDATED:

Here is a way to:

  • collect all matching rows with a configurable upper limit (in your original question you specified that a "match can only be combined with two rows of the data", but it's not clear if you meant that only 2 rows in total can be included in a match, or 2 rows in addition to the row under consideration for a total of 3 rows);
  • delete these matching rows from the original dataframe.

Notes on the algorithm and test case:

  • We first sort the input dataframe by time_min so that we can use a sliding window to efficiently identify rows matching one of the constraints (rows with times within 5 minutes of one another) without having to do a full brute force comparison of all rows with all other rows.
  • The test case includes a few more rows than in the question, to show the behavior in the case of more than 2 rows that are possible matches.

Here's the code:

import pandas as pd
d12 = pd.DataFrame([
    [1, 450, 0.3, 2],
    [2, 453, 0.75, 1],
    [3, 564, 1.35, 4],
    [4, 600, 1.25, 1],
    [5, 560, 0.80, 1],
    [6, 700, 1.26, 1],
    [7, 701, 1.27, 1],
    [8, 702, 1.28, 1],
    [9, 703, 1.29, 1],
], columns=['ID', 'time_min', 'distance_km', 'passenger_count'])
print(d12)
df2 = d12.sort_values('time_min').reset_index(drop=True)

candidatesByIdx = []
minMax = [0, 0]
def foo(x):
    # Maintain a sliding window [i, j) over the time-sorted rows so that
    # only rows within 5 minutes of x.time_min are compared.
    i, j = minMax
    while x.ID != df2.ID.iloc[i] and x.time_min - df2.time_min.iloc[i] > 5:
        i += 1
    while i >= 0 and x.time_min - df2.time_min.iloc[i] <= 5:
        i -= 1
    while j < len(df2.index) and df2.time_min.iloc[j] - x.time_min <= 5:
        j += 1
    minMax[:] = [i + 1, j - 1]
    # Within the time window, apply the distance and passenger constraints.
    L = []
    for k in range(i + 1, j):
        if x.ID == df2.ID.iloc[k]:
            continue
        if abs(x.distance_km - df2.distance_km.iloc[k]) <= 1 and x.passenger_count + df2.passenger_count.iloc[k] <= 5:
            L.append(k)
    return L
df2['cand_by_index'] = df2.apply(foo, axis=1)

maxMatches = 3
from collections import defaultdict
matchesToIgnore = defaultdict(set)
def baz(x):
    # Cap each row at maxMatches - 1 partners and record the dropped
    # pairings so the partner row skips them as well.
    if not x.cand_by_index:
        return x.cand_by_index
    L = x.cand_by_index if x.index_copy not in matchesToIgnore else list(set(x.cand_by_index) - matchesToIgnore[x.index_copy])
    if len(L) > maxMatches - 1:
        for i in range(maxMatches - 1, len(L)):
            matchesToIgnore[L[i]].add(x.index_copy)
        return L[:maxMatches - 1]
    return L
    
df2['cand_by_index'] = df2.assign(index_copy=df2.index).apply(baz, axis=1)
match = df2[df2['cand_by_index'].str.len() > 0].drop(columns=['cand_by_index'])
d12 = d12[~d12.ID.isin(match.ID.to_list())]
print(d12)
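A side note on the filtering step above: `Series.str.len` also works on an object column holding Python lists, which is how rows with at least one candidate are kept. A minimal illustration:

```python
import pandas as pd

s = pd.Series([[0, 2], [], [1]])
print(s.str.len().tolist())  # [2, 0, 1]
```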

Here is d12 in its initial state:

   ID  time_min  distance_km  passenger_count
0   1       450         0.30                2
1   2       453         0.75                1
2   3       564         1.35                4
3   4       600         1.25                1
4   5       560         0.80                1
5   6       700         1.26                1
6   7       701         1.27                1
7   8       702         1.28                1
8   9       703         1.29                1

Here is match:

   ID  time_min  distance_km  passenger_count
0   1       450         0.30                2
1   2       453         0.75                1
2   5       560         0.80                1
3   3       564         1.35                4
5   6       700         1.26                1
6   7       701         1.27                1
7   8       702         1.28                1

Here is d12 with matches deleted:

   ID  time_min  distance_km  passenger_count
3   4       600         1.25                1
8   9       703         1.29                1

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution    Source
Solution 1  not_speshal
Solution 2  tyson.wu
Solution 3