'How to read a column of csv as dtype list using pandas?
I have a csv file with 3 columns, wherein each row of Column 3 has list of values in it. As you can see from the following table structure
Col1,Col2,Col3
1,a1,"['Proj1', 'Proj2']"
2,a2,"['Proj3', 'Proj2']"
3,a3,"['Proj4', 'Proj1']"
4,a4,"['Proj3', 'Proj4']"
5,a5,"['Proj5', 'Proj2']"
Whenever I try to read this csv, Col3 is getting read as str object and not as list. I tried to alter the dtype of that column to list but got "Attribute Error" as below
df = pd.read_csv("inputfile.csv")
df.Col3.dtype = list
AttributeError Traceback (most recent call last)
<ipython-input-19-6f9ec76b1b30> in <module>()
----> 1 df.Col3.dtype = list
C:\Python27\lib\site-packages\pandas\core\generic.pyc in __setattr__(self, name, value)
1953 object.__setattr__(self, name, value)
1954 except (AttributeError, TypeError):
-> 1955 object.__setattr__(self, name, value)
1956
1957 #----------------------------------------------------------------------
AttributeError: can't set attribute
It would be really great if you can guide me how to go about it.
Solution 1:[1]
You could use the ast lib:
from ast import literal_eval
df.Col3 = df.Col3.apply(literal_eval)
print(df.Col3[0][0])
Proj1
You can also do it when you create the dataframe from the csv, using converters:
df = pd.read_csv("in.csv",converters={"Col3": literal_eval})
If you are sure the format is he same for all strings, stripping and splitting will be a lot faster:
df = pd.read_csv("in.csv",converters={"Col3": lambda x: x.strip("[]").split(", ")})
But you will end up with the strings wrapped in quotes
Solution 2:[2]
Adding a replace to Cunninghams answer:
df = pd.read_csv("in.csv",converters={"Col3": lambda x: x.strip("[]").replace("'","").split(", ")})
Solution 3:[3]
If you have the option to write the file -
you can use pd.to_parquet and pd.read_parquet (instead of csv).
It will properly parse this column.
Solution 4:[4]
I have a different approach for this, which can be used for string representations of other data types, besides just lists.
You can use the json library and apply json.loads() to the desired column. e.g
import json
df.my_column = df.my_column.apply(json.loads)
For this to work, however, your input strings must be enclosed in double quotations.
Solution 5:[5]
@Padraic Cunningham's answer will not work if you have to parse lists of strings that do not have quotes. For example, literal_eval will successfully parse "['a', 'b', 'c']", but not "[a, b, c]". To load strings like this, use the PyYAML library.
import io
import pandas as pd
data = '''
A,B,C
"[1, 2, 3]",True,"[a, b, c]"
"[4, 5, 6]",False,"[d, e, f]"
'''
df = pd.read_csv(io.StringIO(data), sep=',')
df
A B C
0 [1, 2, 3] True [a, b, c]
1 [4, 5, 6] False [d, e, f]
df['C'].tolist()
# ['[a, b, c]', '[d, e, f]']
import yaml
df[['A', 'C']] = df[['A', 'C']].applymap(yaml.safe_load)
df['C'].tolist()
# [['a', 'b', 'c'], ['d', 'e', 'f']]
yaml can be installed using pip install pyyaml.
Solution 6:[6]
Another solution for array-like columns in your csv:
df = pd.read_csv("inputfile.csv")
df["Col3"] = df["Col3"].fillna("[]").apply(lambda x: eval(x))
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | 5norre |
| Solution 3 | theletz |
| Solution 4 | Ricardo |
| Solution 5 | |
| Solution 6 | matt525252 |
