Create new dask df from variables as a result of boolean comparisons from source dask df

What I am trying to do is create a new ddf whose columns are the Boolean results of comparing columns of the source ddf.

var1=ddf[col1]==ddf[col2], var2=ddf[col3]==ddf[col4],... 

up to var8, then create a new dask df from those var1 to var8.

Calling pd.DataFrame is taking a long while. I expected it to take a while, but it has been 2 hours and it is still less than 25% complete. Is there a way to make it faster?

I have 15 million rows and 60 columns.



Solution 1:[1]

Hard to say without more context, but here's what I would do:

new_ddf = (
    ddf.assign(
        var1=ddf[col1].eq(ddf[col2]),
        var2=ddf[col3].eq(ddf[col4]),
    )
    .loc[:, ["var1", "var2"]]
)

Basically, you assign the new columns to a copy of the dataframe and then drop all of the original columns.

Solution 2:[2]

The idea is to compare both column subsets in one operation. Renaming both subsets to the new column names is necessary because the two dataframes being compared must have the same column names:

import pandas as pd
import dask.dataframe as dd
from dask.dataframe.utils import make_meta

df=pd.DataFrame({'a':[1,2,3],'b':[4,5,6], 'c':[1,2,3],'d':[4,5,6]})

dsk = {('x', 0): df}

meta = make_meta({'a': 'i8', 'b': 'i8', 'c': 'i8', 'd': 'i8'}, index=pd.Index([], 'i8'))
d = dd.DataFrame(dsk, name='x', meta=meta, divisions=[0, 1, 2])
print (d)


cols1 = ['a','b']
cols2 = ['c','d']

new_cols = ['var1','var2']

 
ddf = (d[cols1].rename(columns=dict(zip(cols1, new_cols))) == 
       d[cols2].rename(columns=dict(zip(cols2, new_cols))))
print (ddf)
               var1  var2
npartitions=2            
0              bool  bool
1               ...   ...
2               ...   ...
Dask Name: eq, 11 tasks

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Paul H
Solution 2 jezrael