Difference between `header = None` and `header = 0` in pandas
I was writing code to read a CSV file with pandas and noticed some odd behaviour. My file has column names which I want to ignore, so I use header = 0 or 'infer' instead of None.
When I use None and want a specific column, I just need df[column_index], but when I use 0 or 'infer' I need df.ix[:,column_index] to get the column; otherwise, df[column_index] gives the following error:
Traceback (most recent call last):
  File "/home/sarvagya/anaconda3/envs/tf/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2525, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: column_index
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "", line 1, in
  File "/home/sarvagya/anaconda3/envs/tf/lib/python3.6/site-packages/pandas/core/frame.py", line 2139, in __getitem__
    return self._getitem_column(key)
  File "/home/sarvagya/anaconda3/envs/tf/lib/python3.6/site-packages/pandas/core/frame.py", line 2146, in _getitem_column
    return self._get_item_cache(key)
  File "/home/sarvagya/anaconda3/envs/tf/lib/python3.6/site-packages/pandas/core/generic.py", line 1842, in _get_item_cache
    values = self._data.get(item)
  File "/home/sarvagya/anaconda3/envs/tf/lib/python3.6/site-packages/pandas/core/internals.py", line 3843, in get
    loc = self.items.get_loc(item)
  File "/home/sarvagya/anaconda3/envs/tf/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2527, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: column_index
Can someone help with this? Why is this happening?
Solution 1:[1]
The difference shows up when the data you load has a header row, so let's say the file behind your DataFrame df has one.
header=None: pandas treats the first row of the file (which holds the actual column names) as the first row of data, so your columns no longer have names and are labelled 0, 1, 2, ... instead.
header=0: pandas uses the first row as the column names. If you also pass names = [........] while loading your file, pandas discards the names found in that row and assigns the new ones instead:
read_csv( filepath, header = 0 , names = ['....' , '....' ...])
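The three cases described above can be compared side by side. This is a minimal sketch, assuming a small in-memory CSV (the `csv_text` contents and the replacement names `x`, `y`, `z` are made up for illustration):

```python
from io import StringIO

import pandas as pd

# Hypothetical sample data; the first line holds the column names.
csv_text = "a,b,c\n1,2,3\n4,5,6\n"

# header=None: the name row is kept as data, columns are labelled 0, 1, 2.
no_header = pd.read_csv(StringIO(csv_text), header=None)
print(list(no_header.columns))    # [0, 1, 2]
print(len(no_header))             # 3

# header=0: the first row becomes the column names.
with_header = pd.read_csv(StringIO(csv_text), header=0)
print(list(with_header.columns))  # ['a', 'b', 'c']

# header=0 plus names=[...]: the first row is dropped and the new names are used.
renamed = pd.read_csv(StringIO(csv_text), header=0, names=["x", "y", "z"])
print(list(renamed.columns))      # ['x', 'y', 'z']
print(len(renamed))               # 2
```

Note that only the `header=None` call keeps three rows, because the name row is counted as data there.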
hope it helps!
Solution 2:[2]
Suppose you have a CSV file like this, student.csv, where the column names are in the first row.
   id  class  marks
0  01     10     97
1  02      9     85
2  03     11     70
If you want to read this CSV file, you can do this:
df = pd.read_csv('student.csv')
or
df = pd.read_csv('student.csv', header=0)
Both of these statements give the same output as shown above.
But if you use this:
df = pd.read_csv('student.csv', header=None)
pandas will assume that your file has no column names, will generate its own, and will print the data in this format:
       0      1      2
0     id  class  marks
1     01     10     97
2     02      9     85
3     03     11     70
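To check this without a file on disk, here is a small sketch that assumes student.csv is comma-separated and stands in for it with an in-memory string (`student_csv` and the variable names are illustrative only). It also shows a side effect of `header=None`: because the name row is mixed into the data, the numeric columns are no longer parsed as numbers.

```python
from io import StringIO

import pandas as pd

# Hypothetical comma-separated contents standing in for student.csv.
student_csv = "id,class,marks\n01,10,97\n02,9,85\n03,11,70\n"

default_df = pd.read_csv(StringIO(student_csv))               # header='infer'
explicit_df = pd.read_csv(StringIO(student_csv), header=0)
headerless_df = pd.read_csv(StringIO(student_csv), header=None)

# The default and header=0 produce identical frames for this file.
print(default_df.equals(explicit_df))   # True
print(default_df.shape)                 # (3, 3)

# header=None keeps the name row as the first data row.
print(headerless_df.shape)              # (4, 3)
print(headerless_df.iloc[0].tolist())   # ['id', 'class', 'marks']

# The "marks" values are numeric only when the header row is recognised.
print(pd.api.types.is_numeric_dtype(default_df["marks"]))  # True
print(pd.api.types.is_numeric_dtype(headerless_df[2]))     # False
```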
Solution 3:[3]
It looks like you need two parameters, header=None and skiprows=1, if you want to ignore the original column names and get a default RangeIndex for the columns instead.
If you use only header=None, the original column names end up in the first row of the data.
And header=0 reads the first row into the column names of the DataFrame.
Sample:
from io import StringIO  # pd.compat.StringIO is not available in recent pandas releases
import pandas as pd
temp = """a,b,c
1,2,3
4,5,6"""
# after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), header=0)
print(df)
   a  b  c
0  1  2  3
1  4  5  6
Selecting by position:
print(df.iloc[:, 1])
0    2
1    5
Name: b, dtype: int64
Selecting by column name:
print(df['b'])
0    2
1    5
Name: b, dtype: int64
There is no column named 1, so:
print(df[1])
KeyError: 1
With header=None only, the original column names become the first row of data:
df = pd.read_csv(StringIO(temp), header=None)
print(df)
   0  1  2
0  a  b  c
1  1  2  3
2  4  5  6
With header=None and skiprows=1, the original column names are skipped:
df = pd.read_csv(StringIO(temp), header=None, skiprows=1)
print(df)
   0  1  2
0  1  2  3
1  4  5  6
print(df[1])
0    2
1    5
Name: 1, dtype: int64
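If you would rather keep header=0 and still select a column by its position, the following sketch shows two equivalent ways to do it. The variable `column_index` mirrors the name used in the question and its value here is made up. Note that `.ix`, used in the question, was deprecated and later removed from pandas, so `.iloc` is the replacement for positional access.

```python
from io import StringIO

import pandas as pd

csv_text = "a,b,c\n1,2,3\n4,5,6\n"
df = pd.read_csv(StringIO(csv_text), header=0)

column_index = 1  # hypothetical position of the column we want

# Position-based access works regardless of how the columns are labelled.
by_position = df.iloc[:, column_index]

# Equivalent: translate the position into a label first, then index by label.
by_label = df[df.columns[column_index]]

print(by_position.tolist())          # [2, 5]
print(by_position.equals(by_label))  # True

# df[column_index] looks for a column *labelled* 1, which does not exist here,
# so it raises the KeyError from the question.
try:
    df[column_index]
except KeyError as exc:
    print("KeyError:", exc)
```

The `df[...]` form always looks up labels, which is why it only accepts integers when the columns themselves are labelled with integers, as they are after header=None.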
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Nursnaaz |
| Solution 2 | JATIN |
| Solution 3 |
