'Is there any alternative of pandas' merge_asof when joining on multiple columns?

I need to merge 2 data frames on 3 key columns, one contains strings, one contains integers and the last contains floats. On the string and integer columns, I would need an exact match, while on the float column I want to get the values of the right data frame for the nearest corresponding float key in the left data frame. I thought that pandas' merge_asof function would be good, but unfortunately, it can be used only for a single, numerical column to join on.

I have approached this problem with cross merge, but unfortunately, it won't work for the real problem, as each data frame has ~2 million rows. (and it's not giving the correct values either...)

import pandas as pd
import numpy as np
import itertools

df1 = pd.DataFrame(data = itertools.product(['a', 'b', 'c'],
                                            [0, 1, 2, 3, 4],
                                            [0, 0.25, 0.5, 0.75, 1]), 
                   columns = ['key1', 'key2', 'key3'])

df2 = df1.copy()
df2['key3'] = np.random.rand(len(df2))
df2['values'] = np.random.rand(len(df2))

merged = df1.merge(df2, how = 'cross')
merged['key3_diff'] = np.abs(merged['key3_x'] - merged['key3_y'])

aux = merged[['key1_x', 'key2_x', 
              'key3_x', 'key3_diff']].groupby(['key1_x', 
                                               'key2_x', 
                                               'key3_x']).min().reset_index()

res = merged[merged['key1_x'].isin(aux['key1_x']) &
           merged['key2_x'].isin(aux['key2_x']) &
           merged['key3_diff'].isin(aux['key3_diff'])][['key1_x', 
                                                        'key2_x', 
                                                        'key3_x', 
                                                        'values']].rename(columns = {'key1_x': 'key1',
                                                                                     'key2_x': 'key2',
                                                                                     'key3_x': 'key3'})

Is there any alternative besides implementing this calculation sequentially (maybe with numba and jit)?



Solution 1:[1]

This is a solution using pandas's merge_asof. I didn't realize the 'by' argument. This is equivalent with the accepted answer.

df1 = df1.sort_values(by = 'key3')
df2 = df2.sort_values(by = 'key3')

res3 = pd.merge_asof(df1, df2, by = ['key1', 'key2'], on = 'key3', direction = 'nearest')

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1