How to find out which encoding to use in Pandas

I am trying to open a .CSV file in Pandas, but I keep getting an encoding error. I have tried every encoding that Python supports, but none of them works:

import pandas as pd

encode_list = ['ascii','big5','big5hkscs','cp037','cp273','cp424','cp437','cp500','cp720','cp737','cp775','cp850','cp852','cp855','cp856','cp857','cp858','cp860','cp861','cp862','cp863','cp864','cp865','cp866','cp869','cp874','cp875','cp932','cp949','cp950','cp1006','cp1026','cp1125','cp1140','cp1250','cp1251','cp1252','cp1253','cp1254','cp1255','cp1256','cp1257','cp1258','euc_jp','euc_jis_2004','euc_jisx0213','euc_kr','gb2312','gbk','gb18030','hz','iso2022_jp','iso2022_jp_1','iso2022_jp_2','iso2022_jp_2004','iso2022_jp_3','iso2022_jp_ext','iso2022_kr','latin_1','iso8859_2','iso8859_3','iso8859_4','iso8859_5','iso8859_6','iso8859_7','iso8859_8','iso8859_9','iso8859_10','iso8859_11','iso8859_13','iso8859_14','iso8859_15','iso8859_16','johab','koi8_r','koi8_t','koi8_u','kz1048','mac_cyrillic','mac_greek','mac_iceland','mac_latin2','mac_roman','mac_turkish','ptcp154','shift_jis','shift_jis_2004','shift_jisx0213','utf_32','utf_32_be','utf_32_le','utf_16','utf_16_be','utf_16_le','utf_7','utf_8','utf_8_sig']

# Try each codec in turn and report which ones read the file without an error.
for encode in encode_list:
    try:
        df = pd.read_csv("myFile.csv", encoding=encode)
        print(encode)  # reached only if the read succeeded
    except Exception as e:
        print(f"error: {e}")

Here are all the errors:

error: 'ascii' codec can't decode byte 0x92 in position 15: ordinal not in range(128)
error: 'big5' codec can't decode byte 0x92 in position 15: illegal multibyte sequence
error: 'big5hkscs' codec can't decode byte 0x92 in position 15: illegal multibyte sequence
error: Error tokenizing data. C error: Expected 1 fields in line 9, saw 4

error: Error tokenizing data. C error: Expected 1 fields in line 9, saw 4

error: 'charmap' codec can't decode byte 0x76 in position 12: character maps to <undefined>
error: Error tokenizing data. C error: Expected 1 fields in line 5, saw 3

error: Error tokenizing data. C error: Expected 1 fields in line 9, saw 4

error: Error tokenizing data. C error: Expected 1 fields in line 5, saw 3

error: Error tokenizing data. C error: Expected 1 fields in line 5, saw 3

error: Error tokenizing data. C error: Expected 1 fields in line 5, saw 3

error: Error tokenizing data. C error: Expected 1 fields in line 5, saw 3

error: Error tokenizing data. C error: Expected 1 fields in line 5, saw 3

error: Error tokenizing data. C error: Expected 1 fields in line 5, saw 3

error: 'charmap' codec can't decode byte 0xad in position 49: character maps to <undefined>
error: 'charmap' codec can't decode byte 0xf2 in position 60: character maps to <undefined>
error: Error tokenizing data. C error: Expected 1 fields in line 5, saw 3

error: Error tokenizing data. C error: Expected 1 fields in line 5, saw 3

error: Error tokenizing data. C error: Expected 1 fields in line 5, saw 3

error: Error tokenizing data. C error: Expected 1 fields in line 5, saw 3

error: Error tokenizing data. C error: Expected 1 fields in line 5, saw 3

error: 'charmap' codec can't decode byte 0x9c in position 58: character maps to <undefined>
error: Error tokenizing data. C error: Expected 1 fields in line 5, saw 3

error: Error tokenizing data. C error: Expected 1 fields in line 5, saw 3

error: 'charmap' codec can't decode byte 0x94 in position 50: character maps to <undefined>
error: 'charmap' codec can't decode byte 0x9c in position 58: character maps to <undefined>
error: Error tokenizing data. C error: Expected 1 fields in line 9, saw 4

error: 'cp932' codec can't decode byte 0xf0 in position 22: illegal multibyte sequence
error: 'cp949' codec can't decode byte 0xf0 in position 22: illegal multibyte sequence
error: 'cp950' codec can't decode byte 0x92 in position 15: illegal multibyte sequence
error: Error tokenizing data. C error: Expected 1 fields in line 5, saw 3

error: Error tokenizing data. C error: Expected 1 fields in line 9, saw 4

error: Error tokenizing data. C error: Expected 1 fields in line 5, saw 3

error: Error tokenizing data. C error: Expected 1 fields in line 9, saw 4

error: 'charmap' codec can't decode byte 0x81 in position 116: character maps to <undefined>
error: 'charmap' codec can't decode byte 0x98 in position 145: character maps to <undefined>
error: 'charmap' codec can't decode byte 0x81 in position 116: character maps to <undefined>
error: 'charmap' codec can't decode byte 0x9c in position 58: character maps to <undefined>
error: 'charmap' codec can't decode byte 0x81 in position 116: character maps to <undefined>
error: 'charmap' codec can't decode byte 0x9c in position 58: character maps to <undefined>
error: Error tokenizing data. C error: Expected 1 fields in line 5, saw 3

error: 'charmap' codec can't decode byte 0x9c in position 58: character maps to <undefined>
error: 'charmap' codec can't decode byte 0x9a in position 67: character maps to <undefined>
error: 'euc_jp' codec can't decode byte 0x92 in position 15: illegal multibyte sequence
error: 'euc_jis_2004' codec can't decode byte 0x92 in position 15: illegal multibyte sequence
error: 'euc_jisx0213' codec can't decode byte 0x92 in position 15: illegal multibyte sequence
error: 'euc_kr' codec can't decode byte 0x92 in position 15: illegal multibyte sequence
error: 'gb2312' codec can't decode byte 0x92 in position 15: illegal multibyte sequence
error: 'gbk' codec can't decode byte 0xf0 in position 22: illegal multibyte sequence
error: 'gb18030' codec can't decode byte 0xf0 in position 22: illegal multibyte sequence
error: 'hz' codec can't decode byte 0x92 in position 15: illegal multibyte sequence
error: 'iso2022_jp' codec can't decode byte 0x92 in position 15: illegal multibyte sequence
error: 'iso2022_jp_1' codec can't decode byte 0x92 in position 15: illegal multibyte sequence
error: 'iso2022_jp_2' codec can't decode byte 0x92 in position 15: illegal multibyte sequence
error: 'iso2022_jp_2004' codec can't decode byte 0x92 in position 15: illegal multibyte sequence
error: 'iso2022_jp_3' codec can't decode byte 0x92 in position 15: illegal multibyte sequence
error: 'iso2022_jp_ext' codec can't decode byte 0x92 in position 15: illegal multibyte sequence
error: 'iso2022_kr' codec can't decode byte 0x92 in position 15: illegal multibyte sequence
error: Error tokenizing data. C error: Expected 1 fields in line 5, saw 3

error: Error tokenizing data. C error: Expected 1 fields in line 5, saw 3

error: 'charmap' codec can't decode byte 0xf0 in position 22: character maps to <undefined>
error: Error tokenizing data. C error: Expected 1 fields in line 5, saw 3

error: Error tokenizing data. C error: Expected 1 fields in line 5, saw 3

error: 'charmap' codec can't decode byte 0xb2 in position 17: character maps to <undefined>
error: 'charmap' codec can't decode byte 0xd2 in position 172: character maps to <undefined>
error: 'charmap' codec can't decode byte 0xc3 in position 53: character maps to <undefined>
error: Error tokenizing data. C error: Expected 1 fields in line 5, saw 3

error: Error tokenizing data. C error: Expected 1 fields in line 5, saw 3

error: 'charmap' codec can't decode byte 0xdb in position 104: character maps to <undefined>
error: Error tokenizing data. C error: Expected 1 fields in line 5, saw 3

error: Error tokenizing data. C error: Expected 1 fields in line 5, saw 3

error: Error tokenizing data. C error: Expected 1 fields in line 5, saw 3

error: Error tokenizing data. C error: Expected 1 fields in line 5, saw 3

error: 'johab' codec can't decode byte 0xf0 in position 22: illegal multibyte sequence
error: Error tokenizing data. C error: Expected 1 fields in line 5, saw 3

error: 'charmap' codec can't decode byte 0x9c in position 58: character maps to <undefined>
error: Error tokenizing data. C error: Expected 1 fields in line 5, saw 3

error: 'charmap' codec can't decode byte 0x98 in position 145: character maps to <undefined>
error: Error tokenizing data. C error: Expected 1 fields in line 5, saw 3

error: Error tokenizing data. C error: Expected 1 fields in line 5, saw 3

error: Error tokenizing data. C error: Expected 1 fields in line 5, saw 3

error: Error tokenizing data. C error: Expected 1 fields in line 5, saw 3

error: Error tokenizing data. C error: Expected 1 fields in line 5, saw 3

error: Error tokenizing data. C error: Expected 1 fields in line 5, saw 3

error: Error tokenizing data. C error: Expected 1 fields in line 5, saw 3

error: 'shift_jis' codec can't decode byte 0xf0 in position 22: illegal multibyte sequence
error: 'shift_jis_2004' codec can't decode byte 0xf0 in position 22: illegal multibyte sequence
error: 'shift_jisx0213' codec can't decode byte 0xf0 in position 22: illegal multibyte sequence
error: 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)
error: 'utf-32-be' codec can't decode bytes in position 0-3: code point not in range(0x110000)
error: 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)
error: 'utf-16-le' codec can't decode bytes in position 122-123: illegal UTF-16 surrogate
error: 'utf-16-be' codec can't decode bytes in position 80-81: illegal UTF-16 surrogate
error: 'utf-16-le' codec can't decode bytes in position 122-123: illegal UTF-16 surrogate
error: 'utf7' codec can't decode byte 0x92 in position 15: unexpected special character
error: 'utf-8' codec can't decode byte 0x92 in position 15: invalid start byte
error: 'utf-8' codec can't decode byte 0x92 in position 15: invalid start byte
 

If I open this particular .CSV file in Notepad, the data is all gibberish, but if I open it in Excel or Gnumeric, the data displays perfectly as a table.

The file contains client information, so unfortunately I cannot share it.

How do I open this file as a pandas dataframe?



Solution 1:[1]

If you look at the documentation for read_csv(), you'll find the encoding_errors argument (available since pandas 1.3). Passing encoding_errors='ignore' makes the decoder skip bytes it cannot decode instead of raising an error, so the import can proceed. This lets you open the file with the most plausible codec even when a few bytes do not fit it.
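
As a rough illustration of that argument, the sketch below writes a tiny CSV containing the same 0x92 byte reported in the errors and reads it back with two error handlers. The sample data, column names, and temporary file are made up for the example; they are not the asker's file, and utf-8 is only an assumed codec.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data: byte 0x92 is not valid UTF-8
# (in cp1252 it would be a curly apostrophe).
raw = b"name,city\nO\x92Brien,Dublin\nSmith,London\n"

with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # With the default encoding_errors="strict", this read would raise
    # UnicodeDecodeError. "ignore" silently drops the undecodable byte.
    df_ignore = pd.read_csv(path, encoding="utf-8", encoding_errors="ignore")
    # "replace" substitutes U+FFFD, so the damaged spot stays visible.
    df_replace = pd.read_csv(path, encoding="utf-8", encoding_errors="replace")
finally:
    os.remove(path)

print(df_ignore["name"].tolist())
print(df_replace["name"].tolist())
```

Note that 'ignore' discards data without any warning, so 'replace' may be the safer choice while you are still working out which codec the file really uses.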

Other valid values for this argument, such as 'replace' and 'backslashreplace', are listed under the error handlers in the Python codecs documentation.
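
To see what those handlers do without involving pandas, here is a minimal sketch that decodes one byte string with Python's built-in bytes.decode(). The sample bytes are invented for the example, and the final cp1252 decode only shows how this particular sample behaves, not what the asker's file is.

```python
# Hypothetical sample: 0x92 is an invalid start byte in UTF-8.
raw = b"It\x92s here"

results = {}
for handler in ("ignore", "replace", "backslashreplace"):
    # Each handler decides what happens to the byte that cannot be decoded.
    results[handler] = raw.decode("utf-8", errors=handler)
    print(f"{handler:>16}: {results[handler]!r}")

# The default handler, "strict", raises instead of producing text.
try:
    raw.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print(f"{'strict':>16}: {exc}")

# For comparison: cp1252 maps 0x92 to a right single quotation mark.
as_cp1252 = raw.decode("cp1252")
print(f"{'cp1252':>16}: {as_cp1252!r}")
```

The same handler names can be passed to read_csv() through encoding_errors.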

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1