For Dask, is there something equivalent to ngroup() that is not cumcount()?

I am trying to assign a value for each group in dask:

print(df)

Col1
a
a
a
c
c
c
c
b
b
b
y
u
i

df['Col2'] = df.groupby('Col1', sort=False).ngroup() + 1

print(df)

Col1 Col2
a 1
a 1
a 1
c 2
c 2
c 2
c 2
b 3
b 3
b 3
y 4
u 5
i 6

But dask does not recognize ngroup(). Is there an alternative?

# all the different ways I tried to get this going

df['tariff'] = str(np.random.randint(1, 4, size=len(df), dtype=int))
df
df.groupby(by=["b"]).sum()
df['tariff'] = df.groupby('uid')
df['tariff'] = df.groupby(['uid']).rank()
df['tariff'] = str(np.random.randint(1, 4, size=len(df), dtype=int))
df = df.sort_values('uid')
df['account'] = df.groupby(['uid']).ngroup()
df['account'] = df.groupby(['uid'])['value'].transform('nunique')
df['account'] = df.groupby(['uid']).transform('nunique')
df['account'] = df.groupby('uid').transform('ngroup')
df['account'] = df.groupby('uid').ngroup()
df['account'] = df.groupby(['uid']).cumcount() + 1
df['account'] = df.groupby('uid')['value'].nunique()
df['account'] = df.groupby(['uid']).transform('nunique')
df['account'] = df.map_partitions(pd.rank(), axis="uid")
df['account'] = df.groupby(['uid'], sort=False).ngroup()
df['account'] = '1000000' + df['account'].astype(str)



Solution 1:[1]

Here's one non-ideal option:

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({
    'x': list('aabbcd'),
})
ddf = dd.from_pandas(df, npartitions=2)

nuniq = ddf['x'].nunique().compute()
c = list(range(nuniq))

# note: this relies on the list being popped in group order, which is only
# safe on the single-machine scheduler; meta must match the real dtypes
ddf.groupby("x").apply(
    lambda g: g.assign(y=lambda _: c.pop(0)),
    meta={'x': 'object', 'y': 'int64'},
).compute()

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 pavithraes