Unable to match Excel rows with pandas dataframe row counts
I have tried to handle the Unicode characters in the file that I am loading into a pandas dataframe by specifying an encoding. But the number of unique values that I get from df.column.value_counts() in a Jupyter notebook does not match the row count in Excel for the same file (after removing duplicate values).
How do I fix the issue?
I loaded a tab-separated text file into a pandas dataframe using encoding = 'ISO-8859-1'. The dataframe shows 66370 unique values for one of the columns.
When I applied 'Remove Duplicates' to the same column of the original csv file (I was using MS Excel to read the export file), the number of unique values was 66368.
So there is a difference of 2 between the two counts: the pandas unique count in Jupyter Notebook (66370) and the Excel count (66368).
I understand this could be an encoding issue, but I have not been able to fix it.
Can anyone help please?
import pandas as pd
df = pd.read_csv('csv_file.csv', encoding='ISO-8859-1')
df.column1.value_counts()
I expect the Excel unique row count and df.column1.value_counts() to give the same number.
The actual results differ by 2 between the two methods.
Solution 1:[1]
- It might be that you are reading the header row as well; also note that pandas starts indexing at zero. Could you please retry with the code below and let me know the result?
df = pd.read_csv('rounds2.csv', encoding= 'ISO-8859-1')
print(len(df.column1.unique()))
print(df.shape)
Please let me know the output of both. Also, could you try opening the file in Notepad++ and reconciling the numbers?
Let me know your output and I will then edit my answer accordingly.
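The different ways of counting mentioned above do not always agree, so it helps to compare them side by side. Below is a minimal sketch on a small made-up dataframe (the data is hypothetical, not from the question): unique() keeps a missing value as one of its entries, while nunique() and value_counts() drop missing values by default.

```python
import pandas as pd

# Hypothetical column with one duplicate ("a") and one missing value.
df = pd.DataFrame({"column1": ["a", "b", "a", None, "c"]})

n_rows = df.shape[0]                                # data rows; the header is not counted
n_unique_with_nan = len(df["column1"].unique())     # the missing value counts as an entry
n_unique_no_nan = df["column1"].nunique()           # missing values dropped by default
n_value_counts = len(df["column1"].value_counts())  # missing values dropped by default

print(n_rows, n_unique_with_nan, n_unique_no_nan, n_value_counts)
```

If the column in the real file contains blanks, this alone can account for a difference of one between two counting methods.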
Solution 2:[2]
OK guys, I have finally found the answer! After about 6 hours of struggle, I figured out the right encoding: the right one for my problem was 'ANSI', which Python exposes as the 'mbcs' codec (available on Windows only).
So the only change to my code was the encoding below:
df = pd.read_csv('csv_file.csv', encoding= 'mbcs')
I found the answer by going through this link: Get encoding of a file in Windows
The list of encodings that Python supports is here: https://docs.python.org/3/library/codecs.html#standard-encodings
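To check whether the choice of encoding actually changes the unique count, the same bytes can be read under several candidate encodings and the counts compared. This is only a sketch under assumptions: unique_counts_by_encoding is a hypothetical helper, the sample bytes are made up, and 'cp1252' is used as a portable stand-in because 'mbcs' exists only on Windows.

```python
import io
import pandas as pd

def unique_counts_by_encoding(raw_bytes, column, encodings, sep=","):
    """Hypothetical helper: unique count of `column` under each encoding."""
    counts = {}
    for enc in encodings:
        try:
            df = pd.read_csv(io.BytesIO(raw_bytes), encoding=enc, sep=sep)
            counts[enc] = df[column].nunique()
        except (UnicodeDecodeError, LookupError) as exc:
            # Bytes invalid for this encoding, or codec unknown on this platform.
            counts[enc] = f"failed: {type(exc).__name__}"
    return counts

# Made-up sample data: three rows, two distinct values.
sample = "column1\ncafé\ncafe\ncafé\n".encode("cp1252")
counts = unique_counts_by_encoding(sample, "column1", ["ISO-8859-1", "cp1252"])
print(counts)
```

For the real file, the bytes could be obtained with open('csv_file.csv', 'rb').read(), and sep="\t" passed if the file is tab separated. If the counts differ between encodings, the values that appear under only one of them are the ones to inspect.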
Solution 3:[3]
This problem can also arise when some rows contain hidden \n characters. Editors like vim then show such a row on several lines, but it is actually a single row as far as the dataframe is concerned.
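The effect of embedded newlines can be reproduced with a small in-memory example. This is a sketch on made-up tab-separated data, assuming the newline sits inside a quoted field: the raw text has more lines than the dataframe has rows, and the affected rows can be located with str.contains.

```python
import io
import pandas as pd

# Made-up tab-separated text; the first data row has a newline inside quotes.
text = 'id\tnote\n1\t"line one\nline two"\n2\tplain\n'

raw_line_count = len(text.splitlines())        # lines as a text editor would show them
df = pd.read_csv(io.StringIO(text), sep="\t")  # the quoted newline stays inside one field

# Flag rows whose 'note' value contains an embedded newline.
has_newline = df["note"].str.contains("\n", regex=False)

print(raw_line_count, len(df))
print(df[has_newline])
```

Comparing the editor's line count (minus the header) with len(df) on the real file shows whether embedded newlines explain the mismatch.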
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Puneet Sinha |
| Solution 2 | Prashant Mishra |
| Solution 3 | A.R.K.S |
