How to filter and clean multiple Dask DataFrames in Python?

After reading multiple .csv files and appending them into a single Dask DataFrame, I am trying to clean the frame by excluding unnecessary rows. This raises a mismatched-dtypes error, even though the code below reports the dtypes correctly. I can neither display the top 5 rows with dfb.head() nor convert the frame to a pandas DataFrame via dfb = dfb.compute().

#########Reading and appending multiple .csv files###############
import pandas as pd
import numpy as np
import glob
import os
import warnings
import seaborn as sns
import matplotlib.pyplot as plt
warnings.filterwarnings('ignore')
from dask import dataframe as dd

path = 'C:\\Nitin Folder\\PYTHON\\Py4\\1800\\AS32\\300\\Input'
files = glob.glob(os.path.join(path, "*.csv"))

data = [] 
for csv in files:
    frame = dd.read_csv(csv)
    frame['filename'] = os.path.basename(csv)
    data.append(frame)

dfb = dd.concat(data, ignore_index=True)
dfb = dfb.repartition(npartitions=1) 
dfb.dtypes

output:
id                 int64
Timestamp         object
student_name      object
country           object
Distance(mts)      int64
cellpower        float64
filename          object
dtype: object

############# filtering daskframe to remove non integer rows ###################
dfb = dfb[~(dfb.id == 'id')]
dfb.head()

dfb['Distance(mts)'] = dfb['Distance(mts)'].astype(int)
dfb['cellpower'] = dfb['cellpower'].astype(int)
dfb['id'] = dfb['id'].astype(int)

As far as I understand, this filter is not being applied to the entire Dask DataFrame. I have even tried converting the dtypes manually, but the same error persists. Any help finding a solution would be appreciated. :)
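One possible reading, assuming the CSV files contain their header line again somewhere in the middle of the data (the 'id' and 'cellpower' literals in the error quoted further down point that way): Dask infers dtypes from a sample near the start of a file, and the filter only runs after each block has been parsed, so the parse fails before the filter gets a chance. The sketch below uses plain pandas and a made-up `csv_text` sample to show how the same column looks numeric at the top of the file but not once a stray header row is included.

```python
import io

import pandas as pd

# Hypothetical sample: the header line appears again in the middle of the data.
csv_text = (
    "id,Distance(mts),cellpower\n"
    "1,120,3.5\n"
    "2,340,4.0\n"
    "id,Distance(mts),cellpower\n"
    "3,560,2.5\n"
)

# Parsing only the clean rows at the top infers numeric dtypes.
head = pd.read_csv(io.StringIO(csv_text), nrows=2)

# Parsing the whole text hits the stray header, so the columns hold strings.
full = pd.read_csv(io.StringIO(csv_text))

head_is_int = pd.api.types.is_integer_dtype(head["id"])
full_is_int = pd.api.types.is_integer_dtype(full["id"])

print(head.dtypes)
print(full.dtypes)
```

If the real files do not contain repeated header lines, this explanation does not apply and the cause lies elsewhere.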

output error message:
ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.

+---------------+--------+----------+
| Column        | Found  | Expected |
+---------------+--------+----------+
| Distance(mts) | object | int64    |
| cellpower     | object | float64  |
| id            | object | int64    |
+---------------+--------+----------+

The following columns also raised exceptions on conversion:

- Distance(mts)
  ValueError("invalid literal for int() with base 10: 'Distance(mts)'")
- cellpower
  ValueError("could not convert string to float: 'cellpower'")
- id
  ValueError("invalid literal for int() with base 10: 'id'")

Usually this is due to dask's dtype inference failing, and
*may* be fixed by specifying dtypes manually by adding:

dtype={'Distance(mts)': 'object',
   'cellpower': 'object',
   'id': 'object'}

to the call to `read_csv`/`read_table`.


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
