How to filter and clean multiple Dask frames in Python?
After reading and appending multiple .csv files into a Dask dataframe, I am trying to clean the frame by excluding unnecessary rows. This throws a mismatched-dtypes error, even though the code below identifies the dtypes correctly. The frame can neither show its top 5 rows [dfb.head()] nor be converted to a pandas dataframe via dfb = dfb.compute().
#########Reading and appending multiple .csv files###############
import pandas as pd
import numpy as np
import glob
import os
import warnings
import seaborn as sns
import matplotlib.pyplot as plt
warnings.filterwarnings('ignore')
from dask import dataframe as dd
path = 'C:\\Nitin Folder\\PYTHON\\Py4\\1800\\AS32\\300\\Input'
files = glob.glob(os.path.join(path, "*.csv"))
data = []
for csv in files:
    frame = dd.read_csv(csv)
    frame['filename'] = os.path.basename(csv)
    data.append(frame)
dfb = dd.concat(data, ignore_index=True)
dfb = dfb.repartition(npartitions=1)
dfb.dtypes
output:
id int64
Timestamp object
student_name object
country object
Distance(mts) int64
cellpower float64
filename object
dtype: object
############# filtering daskframe to remove non integer rows ###################
dfb = dfb[~(dfb.id == 'id')]
dfb.head()
dfb['Distance(mts)'] = dfb['Distance(mts)'].astype(int)
dfb['cellpower'] = dfb['cellpower'].astype(int)
dfb['id'] = dfb['id'].astype(int)
As I understand it, this filter is not being applied to the entire Dask dataframe. I have even tried converting the dtypes manually, but the same error persists. Any help finding a solution would be appreciated. :)
output error message:
ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.
+---------------+--------+----------+
| Column | Found | Expected |
+---------------+--------+----------+
| Distance(mts) | object | int64 |
| cellpower | object | float64 |
| id | object | int64 |
+---------------+--------+----------+
The following columns also raised exceptions on conversion:
- Distance(mts)
ValueError("invalid literal for int() with base 10: 'Distance(mts)'")
- cellpower
ValueError("could not convert string to float: 'cellpower'")
- id
ValueError("invalid literal for int() with base 10: 'id'")
Usually this is due to dask's dtype inference failing, and
*may* be fixed by specifying dtypes manually by adding:
dtype={'Distance(mts)': 'object',
'cellpower': 'object',
'id': 'object'}
to the call to `read_csv`/`read_table`.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
