Extract dictionary value from column in data frame
I'm looking for a way to optimize my code.
I have entry data in this form:
import pandas as pn
a=[{'Feature1': 'aa1','Feature2': 'bb1','Feature3': 'cc2' },
{'Feature1': 'aa2','Feature2': 'bb2' },
{'Feature1': 'aa1','Feature2': 'cc1' }
]
b=['num1','num2','num3']
df= pn.DataFrame({'num':b, 'dic':a })
I would like to extract the element 'Feature3' from the dictionaries in column 'dic' (if it exists) in the data frame above. So far I have been able to solve it, but I don't know if this is the fastest way; it seems a little overcomplicated.
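Feature3 = []
# Walk the 'dic' column; collect 'Feature3' where present, otherwise None
for idx, row in df['dic'].items():  # iteritems() was removed in pandas 2.0
    l = row.keys()
    if 'Feature3' in l:
        Feature3.append(row['Feature3'])
    else:
        Feature3.append(None)
df['Feature3'] = Feature3
print(df)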
Is there a better/faster/simpler way to extract this Feature3 into a separate column in the dataframe?
Thank you in advance for help.
Solution 1:[1]
You can use a list comprehension to extract feature 3 from each row in your dataframe, returning a list.
feature3 = [d.get('Feature3') for d in df.dic]
If a dictionary has no 'Feature3' key, dict.get returns None by default.
You don't even need pandas, as you can again use a list comprehension to extract the feature from your original list of dictionaries a.
feature3 = [d.get('Feature3') for d in a]
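As a minimal sketch of how the comprehension can feed a new column, the snippet below rebuilds the question's data (with pandas imported under the conventional alias pd) and assigns the result back. The 'missing' fallback value is purely illustrative and not part of the original answer; it only shows that dict.get accepts a second argument as the default.

```python
import pandas as pd

# Data from the question
a = [
    {'Feature1': 'aa1', 'Feature2': 'bb1', 'Feature3': 'cc2'},
    {'Feature1': 'aa2', 'Feature2': 'bb2'},
    {'Feature1': 'aa1', 'Feature2': 'cc1'},
]
b = ['num1', 'num2', 'num3']
df = pd.DataFrame({'num': b, 'dic': a})

# dict.get returns None when the key is absent
df['Feature3'] = [d.get('Feature3') for d in df['dic']]

# A second argument replaces None as the default ('missing' is a made-up placeholder)
feature3_filled = [d.get('Feature3', 'missing') for d in df['dic']]
```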
Solution 2:[2]
If you apply pn.Series to the column, each dictionary is expanded into a row of a new DataFrame, with one column per key:
>>> df.dic.apply(pn.Series)
Feature1 Feature2 Feature3
0 aa1 bb1 cc2
1 aa2 bb2 NaN
2 aa1 cc1 NaN
From this point, you can just use regular pandas operations.
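One such regular operation, sketched here under the assumption that you want the expanded features next to the 'num' column, is to drop the original 'dic' column and join the expanded frame back on the index. The names expanded and result are illustrative only.

```python
import pandas as pd

# Data from the question
a = [
    {'Feature1': 'aa1', 'Feature2': 'bb1', 'Feature3': 'cc2'},
    {'Feature1': 'aa2', 'Feature2': 'bb2'},
    {'Feature1': 'aa1', 'Feature2': 'cc1'},
]
b = ['num1', 'num2', 'num3']
df = pd.DataFrame({'num': b, 'dic': a})

# Expand each dictionary into a row; missing keys become NaN
expanded = df['dic'].apply(pd.Series)

# Replace the dictionary column with the expanded columns (join aligns on the index)
result = df.drop(columns='dic').join(expanded)
```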
Solution 3:[3]
df['Feature3'] = df['dic'].apply(lambda x: x.get('Feature3'))
Agree with maxymoo. Consider changing the format of your dataframe.
(Sidenote: pandas is generally imported as pd)
Solution 4:[4]
I think you can first create a new DataFrame with a list comprehension and then create the new column from it:
df1 = pd.DataFrame([x for x in df['dic']])
print(df1)
Feature1 Feature2 Feature3
0 aa1 bb1 cc2
1 aa2 bb2 NaN
2 aa1 cc1 NaN
df['Feature3'] = df1['Feature3']
print(df)
dic num Feature3
0 {u'Feature2': u'bb1', u'Feature3': u'cc2', u'F... num1 cc2
1 {u'Feature2': u'bb2', u'Feature1': u'aa2'} num2 NaN
2 {u'Feature2': u'cc1', u'Feature1': u'aa1'} num3 NaN
Or in one line:
df['Feature3'] = pd.DataFrame([x for x in df['dic']])['Feature3']
print(df)
dic num Feature3
0 {u'Feature2': u'bb1', u'Feature3': u'cc2', u'F... num1 cc2
1 {u'Feature2': u'bb2', u'Feature1': u'aa2'} num2 NaN
2 {u'Feature2': u'cc1', u'Feature1': u'aa1'} num3 NaN
Timings:
len(df) = 3:
In [24]: %timeit pd.DataFrame([x for x in df['dic']])
The slowest run took 4.63 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 596 µs per loop
In [25]: %timeit df.dic.apply(pn.Series)
1000 loops, best of 3: 1.43 ms per loop
len(df) = 3000:
In [27]: %timeit pd.DataFrame([x for x in df['dic']])
100 loops, best of 3: 3.16 ms per loop
In [28]: %timeit df.dic.apply(pn.Series)
1 loops, best of 3: 748 ms per loop
Solution 5:[5]
I think you're thinking about the data structures slightly wrong. It's better to create the data frame with the features as columns from the start; pandas is actually smart enough to do this by default:
In [240]: pd.DataFrame(a)
Out[240]:
Feature1 Feature2 Feature3
0 aa1 bb1 cc2
1 aa2 bb2 NaN
2 aa1 cc1 NaN
You would then add on your "num" column in a separate step, since the data is in a different orientation, either with
df['num'] = b
or
df = df.assign(num = b)
(I prefer the second option since it's got a more functional flavour).
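Put together, the approach might look like the sketch below: build the frame directly from the list of dictionaries, then attach the 'num' values with assign. The intermediate name features is illustrative.

```python
import pandas as pd

# Data from the question
a = [
    {'Feature1': 'aa1', 'Feature2': 'bb1', 'Feature3': 'cc2'},
    {'Feature1': 'aa2', 'Feature2': 'bb2'},
    {'Feature1': 'aa1', 'Feature2': 'cc1'},
]
b = ['num1', 'num2', 'num3']

# Each dictionary becomes a row; keys become columns, missing keys become NaN
features = pd.DataFrame(a)

# assign returns a new frame with the extra column and leaves `features` untouched
df = features.assign(num=b)
```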
Solution 6:[6]
df = pd.concat([df, pd.DataFrame(list(df['dic']))], axis=1)
Then do whatever you want with the result; wherever a key was missing, you will get NaN.
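One caveat worth sketching: pd.concat with axis=1 aligns rows on the index, and a frame built from a plain list gets a fresh 0..n-1 index. If df does not use the default index, passing index=df.index keeps the rows lined up. The custom index values below are made up for illustration, as is the choice to drop the 'dic' column.

```python
import pandas as pd

a = [
    {'Feature1': 'aa1', 'Feature2': 'bb1', 'Feature3': 'cc2'},
    {'Feature1': 'aa2', 'Feature2': 'bb2'},
    {'Feature1': 'aa1', 'Feature2': 'cc1'},
]
b = ['num1', 'num2', 'num3']

# Hypothetical non-default index to show the alignment issue
df = pd.DataFrame({'num': b, 'dic': a}, index=[10, 20, 30])

# Reuse df's index so concat matches rows correctly
expanded = pd.DataFrame(list(df['dic']), index=df.index)
out = pd.concat([df.drop(columns='dic'), expanded], axis=1)
```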
Solution 7:[7]
There is now a vectorized method: you can use the str accessor:
df['dic'].str['Feature3']
Or with str.get:
df['dic'].str.get('Feature3')
Output:
0 cc2
1 None
2 None
Name: dic, dtype: object
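For completeness, here is a self-contained sketch of both spellings on the question's data. It assumes the column holds plain dictionaries in an object-dtype Series, which is the case the str accessor handles here; the variable names are illustrative.

```python
import pandas as pd

# Data from the question
a = [
    {'Feature1': 'aa1', 'Feature2': 'bb1', 'Feature3': 'cc2'},
    {'Feature1': 'aa2', 'Feature2': 'bb2'},
    {'Feature1': 'aa1', 'Feature2': 'cc1'},
]
b = ['num1', 'num2', 'num3']
df = pd.DataFrame({'num': b, 'dic': a})

# Indexing through the accessor and calling .str.get are equivalent lookups
by_getitem = df['dic'].str['Feature3']
by_get = df['dic'].str.get('Feature3')

# Store the extracted values as a new column
df['Feature3'] = by_get
```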
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Ami Tavory |
| Solution 3 | as133 |
| Solution 4 | |
| Solution 5 | maxymoo |
| Solution 6 | hk_03 |
| Solution 7 | mozway |
