Create a new dask df from the results of boolean comparisons on a source dask df
What I am trying to do is create a new ddf whose columns are the Boolean results of comparing pairs of columns in a source ddf:
var1=ddf[col1]==ddf[col2], var2=ddf[col3]==ddf[col4],...
and so on up to var8, then create a new dask df from var1 to var8.
Calling pd.DataFrame is taking a long time. I expected it to take a while, but it has been 2 hours and it is still less than 25% complete. Is there a way to make it faster?
I have 15 million rows and 60 columns.
Solution 1:[1]
Hard to say without more context, but here's what I would do:
# col1 ... col4 are assumed to hold the names of the columns being compared
new_ddf = (
    ddf.assign(
        var1=ddf[col1].eq(ddf[col2]),
        var2=ddf[col3].eq(ddf[col4]),
    )
    .loc[:, ["var1", "var2"]]
)
Basically, you assign the new columns to a copy of the dataframe and then drop all of the original columns.
Solution 2:[2]
The idea is to compare both subsets of columns in one operation. Each subset is first renamed to the new column names, which is necessary because the two subsets must have the same column names to be compared:
import pandas as pd
import dask.dataframe as dd
from dask.dataframe.utils import make_meta
df=pd.DataFrame({'a':[1,2,3],'b':[4,5,6], 'c':[1,2,3],'d':[4,5,6]})
dsk = {('x', 0): df}
meta = make_meta({'a': 'i8', 'b': 'i8', 'c': 'i8', 'd': 'i8'}, index=pd.Index([], 'i8'))
d = dd.DataFrame(dsk, name='x', meta=meta, divisions=[0, 1, 2])
print (d)
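# left-hand and right-hand columns of each comparison, in matching order
cols1 = ['a','b']
cols2 = ['c','d']
# names of the resulting boolean columns
new_cols = ['var1','var2']
# rename both subsets to the same names, then compare them element-wise
ddf = (d[cols1].rename(columns=dict(zip(cols1, new_cols))) ==
       d[cols2].rename(columns=dict(zip(cols2, new_cols))))
print(ddf)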
                var1  var2
npartitions=2
0               bool  bool
1                ...   ...
2                ...   ...
Dask Name: eq, 11 tasks
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Paul H |
| Solution 2 | jezrael |
