'Series' object has no attribute 'columns' in Dask

I have a function para_func that takes a dataframe as input and returns a dataframe. I would like to group the rows of df by root and apply para_func to each group. However, the computation fails with the following error:

distributed.worker - WARNING -  Compute Failed
Function:  execute_task
args:      ((subgraph_callable, 'root', None, <function para_func at 0x000002F044CAC1F0>, 'drop_by_shallow_copy-b7fa028b7c197981e52eea9f870f36c8', (<function _concat at 0x000002F030567430>, [            id parent_id     root  _partitions
0      cqug90j   cqug2sr  cqug2sr            0
1      cqug90k    34fvry   34fvry            0
2      cqug90z   cqu80zb  cqu80zb            0
3      cqug91c   cqtdj4m  cqtdj4m            0
4      cqug91e   cquc4rc  cquc4rc            0
...        ...       ...      ...          ...
99995  cqv8wz8   cqv1gg9   34hylq            0
99996  cqv8wzj    34i1r5   34i1r5            0
99997  cqv8wzv    34jasa   34jasa            0
99998  cqv8wzx   cqv8k2k   34jasa            0
99999  cqv8x08   cquywos   34hywb            0

[100000 rows x 4 columns]], False), ['_partitions'], 'simple-shuffle-ead8404542740024e1572ac449733a42'))
kwargs:    {}
Exception: AttributeError("'Series' object has no attribute 'columns'")

Could you please elaborate on how to solve the error?

import pandas as pd
import networkx as nx
from dask.distributed import Client
import dask.dataframe as dd
client = Client(n_workers=4, threads_per_worker=2, processes=False, memory_limit='20GB')

def para_func(tmp_df):

    siblings = pd.DataFrame({'id': tmp_df['id'], 'num_siblings': tmp_df.groupby('parent_id')['parent_id'].transform('count') - 1})

    children = tmp_df.groupby(by = 'id').size().reindex(tmp_df['id'], fill_value = 0).to_frame().reset_index(level = 0).rename(columns = {0: 'num_children'})

    att_df = siblings.merge(children, how = 'left', on = 'id')
    
    return att_df

path = r'https://raw.githubusercontent.com/leanhdung1994/WebMining/main/sample_df2.csv'
df = dd.read_csv(path, header = 0)

result = df.groupby('root').apply(para_func, meta = object)
computed_result = result.compute()


Solution 1:[1]

You need to pass a DataFrame that describes the output of para_func as the meta argument of apply. It should be an empty DataFrame with the same structure (column names and column dtypes) as the DataFrame returned by para_func. Passing meta = object, as in the question, does not describe a DataFrame.
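
As a minimal sketch of this idea, the snippet below builds an empty DataFrame with the three columns that para_func from the question returns (id, num_siblings, num_children) and passes it as meta. The dtypes are assumptions based on the sample data (strings for id, integer counts for the other two) and may need adjusting to match what para_func actually produces on your data. The helper name apply_per_root is hypothetical and only wraps the groupby call from the question.

```python
import pandas as pd

# Empty DataFrame describing the expected output of para_func.
# Column names come from the question's para_func; the dtypes are
# assumptions and should match what para_func really returns.
meta = pd.DataFrame({
    'id': pd.Series(dtype='object'),
    'num_siblings': pd.Series(dtype='int64'),
    'num_children': pd.Series(dtype='int64'),
})


def apply_per_root(ddf, func):
    """Hypothetical helper: group a Dask DataFrame by 'root' and apply
    func to each group, telling Dask the output structure via meta."""
    return ddf.groupby('root').apply(func, meta=meta)
```

With df and para_func defined as in the question, this would be used as result = apply_per_root(df, para_func) followed by computed_result = result.compute().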

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 denisuspenskiy