'How to make the dataframe faster ? either by using dictionary or numpy?

I am new to data structures and I would like to make my code faster (this is just part of a bigger code). Using dataframes while looking up variables is slowing down the getting the instances. I have two daraframes as

df = pd.DataFrame({ 'id1':['A', 'B', 'C','D','J','K','H'], 'id2':[['K','J'], 'NaN',['A'],['B','C'],'NaN','NaN',['A']],'length':[2,'NaN',1,2,'NaN','NaN',1]})

where length is number of elements in the list (id2), and

df2 = pd.DataFrame({ 'col1':['B', 'C', 'A','K','J','A'], 'col2':['D', 'D', 'C','A','A','H'],'type':['FF', 'SS', 'FS','SF','FS','FF']}) 

the part which is slow is

import time
startTime = time.time()
key='D'
if df.loc[df['id1'] == key, 'id2'].any():
    print('continue')
    for pred in df[df['id1']==key]['id2'].values[0]:
        reltype=df2.loc[(df2['col1'] == pred) & (df2['col2']==key), 'type'].values[0]
        print(reltype)
executionTime = (time.time() - startTime)
print('Execution time in seconds: ' + str(executionTime))

What I did is to convert the df into dictionary and rewrite my code as below but I do not how to rewrite the 'reltype' part and make it faster.

df_dic=df.groupby('id1')[['id2', 'length']].apply(lambda g: g.values.tolist()).to_dict()
key='D'
if ~np.isnan(df_dic[key][0][1]):
    print('continue')
    for pred in df_dic[key][0][0]:
        reltype=df2.loc[(df2['col1'] == pred) & (df2['col2']==key), 'type'].values[0]
        print(reltype)

I am interested in making the code above faster especially the search in finding the 'reltype'. I would appreciate all suggestions. Thanks



Solution 1:[1]

I think your first implementation should be fast. There might be a small improvement if you don't repeat slicing:

for key in ['D']:
    s = df.loc[df['id1'] == key, 'id2']
    if s.any():
        print('continue')
        df_temp = df2.loc[df2['col2'] == key, :]
        for pred in s.values[0]:
            if pred != NaN:
                reltype = df_temp.loc[df_temp['col1'] == pred, 'type'].values[0]
                print(reltype)

depending on your data using df_temp might be slower than using a Boolean mask for the slicing

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1