'Series' object has no attribute 'columns' in Dask
I have a function para_func that takes a dataframe as input and returns a dataframe. I would like to group the rows of df by root and apply para_func to each group. However, I get the following error:
distributed.worker - WARNING - Compute Failed
Function: execute_task
args: ((subgraph_callable, 'root', None, <function para_func at 0x000002F044CAC1F0>, 'drop_by_shallow_copy-b7fa028b7c197981e52eea9f870f36c8', (<function _concat at 0x000002F030567430>, [ id parent_id root _partitions
0 cqug90j cqug2sr cqug2sr 0
1 cqug90k 34fvry 34fvry 0
2 cqug90z cqu80zb cqu80zb 0
3 cqug91c cqtdj4m cqtdj4m 0
4 cqug91e cquc4rc cquc4rc 0
... ... ... ... ...
99995 cqv8wz8 cqv1gg9 34hylq 0
99996 cqv8wzj 34i1r5 34i1r5 0
99997 cqv8wzv 34jasa 34jasa 0
99998 cqv8wzx cqv8k2k 34jasa 0
99999 cqv8x08 cquywos 34hywb 0
[100000 rows x 4 columns]], False), ['_partitions'], 'simple-shuffle-ead8404542740024e1572ac449733a42'))
kwargs: {}
Exception: AttributeError("'Series' object has no attribute 'columns'")
Could you please explain how to solve this error? My code is below.
import pandas as pd
import networkx as nx
from dask.distributed import Client
import dask.dataframe as dd
client = Client(n_workers=4, threads_per_worker=2, processes=False, memory_limit='20GB')
def para_func(tmp_df):
    siblings = pd.DataFrame({'id': tmp_df['id'], 'num_siblings': tmp_df.groupby('parent_id')['parent_id'].transform('count') - 1})
    children = tmp_df.groupby(by = 'id').size().reindex(tmp_df['id'], fill_value = 0).to_frame().reset_index(level = 0).rename(columns = {0: 'num_children'})
    att_df = siblings.merge(children, how = 'left', on = 'id')
    return att_df
path = r'https://raw.githubusercontent.com/leanhdung1994/WebMining/main/sample_df2.csv'
df = dd.read_csv(path, header = 0)
result = df.groupby('root').apply(para_func, meta = object)
computed_result = result.compute()
Solution 1:[1]
The meta parameter of apply must describe the output of para_func. Since para_func returns a DataFrame, passing meta = object is wrong: pass an empty DataFrame with the same structure (column names and column dtypes) as the DataFrame that para_func returns.
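A minimal sketch of what such a meta could look like for the code in the question. The column order and the dtypes (object for id, int64 for the two counts) are assumptions inferred from what para_func appears to return; the names META and apply_with_meta and the tiny sample frame are made up for illustration, so check them against your real data.

```python
import pandas as pd


def para_func(tmp_df):
    # Same logic as in the question, only reformatted.
    siblings = pd.DataFrame({
        'id': tmp_df['id'],
        'num_siblings': tmp_df.groupby('parent_id')['parent_id'].transform('count') - 1,
    })
    children = (
        tmp_df.groupby(by='id').size()
        .reindex(tmp_df['id'], fill_value=0)
        .to_frame()
        .reset_index(level=0)
        .rename(columns={0: 'num_children'})
    )
    return siblings.merge(children, how='left', on='id')


# Empty DataFrame describing the output of para_func.
# Assumption: these dtypes match what para_func returns on the real data.
META = pd.DataFrame({
    'id': pd.Series(dtype='object'),
    'num_siblings': pd.Series(dtype='int64'),
    'num_children': pd.Series(dtype='int64'),
})


def apply_with_meta(ddf):
    # ddf is expected to be a dask DataFrame, e.g. the result of dd.read_csv.
    return ddf.groupby('root').apply(para_func, meta=META)


# Sanity check with plain pandas on a tiny, made-up sample:
# the real output should have the same columns as META.
sample = pd.DataFrame({
    'id': ['a', 'b', 'c'],
    'parent_id': ['r', 'r', 'a'],
    'root': ['r', 'r', 'r'],
})
sample_out = para_func(sample)
```

With the df from the question, this would be used as result = apply_with_meta(df) followed by computed_result = result.compute(). Running para_func on a small pandas sample first, as above, is a cheap way to confirm that the columns in the meta match before starting the cluster computation.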
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | denisuspenskiy |
