Handle missing data when flattening a nested array field in a pandas DataFrame
We need to flatten this into a standard 2D DataFrame:
import pandas as pd

arr = [
    [{ 'id': 3, 'abbr': 'ORL', 'record': { 'win': 3, 'loss': 7 }},
     { 'id': 5, 'abbr': 'ATL', 'record': { 'win': 3, 'loss': 7 }}],
    [{ 'id': 7, 'abbr': 'NYK', 'record': { 'win': 3, 'loss': 7 }},
     { 'id': 9, 'abbr': 'BOS', 'record': { 'win': 3, 'loss': 7 }}]
]
mydf = pd.DataFrame(data = {'name': ['nick', 'tom'], 'arr': arr })
Here's our code, which works fine for this dummy example:
output_list = []
for i in range(len(mydf)):
    # Each 'arr' cell is assumed to hold exactly two team dicts
    team1 = mydf['arr'][i][0]
    team2 = mydf['arr'][i][1]
    zed = { 't1': team1['abbr'], 't2': team2['abbr'] }
    output_list.append(zed)
output_df = pd.DataFrame(output_list)
final_df = pd.concat([mydf, output_df], axis=1)
final_df.pop('arr')
final_df
   name   t1   t2
0  nick  ORL  ATL
1   tom  NYK  BOS
Our source of data is not reliable and may have missing values, and our code seems fraught with structural weaknesses. In particular, errors are thrown when either of these is the raw data (missing dict, missing field):
# missing dict
arr = [
[{ 'id': 3, 'abbr': 'ORL', 'record': { 'win': 3, 'loss': 7 }}],
[{ 'id': 7, 'abbr': 'NYK', 'record': { 'win': 3, 'loss': 7 }},
{ 'id': 9, 'abbr': 'BOS', 'record': { 'win': 3, 'loss': 7 }}]
]
mydf = pd.DataFrame(data = {'name': ['nick', 'tom'], 'arr': arr })
# missing "abbr" field
arr = [
[{ 'id': 3, 'record': { 'win': 3, 'loss': 7 }},
{ 'id': 5, 'abbr': 'ATL', 'record': { 'win': 3, 'loss': 7 }}],
[{ 'id': 7, 'abbr': 'NYK', 'record': { 'win': 3, 'loss': 7 }},
{ 'id': 9, 'abbr': 'BOS', 'record': { 'win': 3, 'loss': 7 }}]
]
mydf = pd.DataFrame(data = {'name': ['nick', 'tom'], 'arr': arr })
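To make the two failure modes concrete, here is a minimal sketch that wraps the loop above in a helper and records which exception each input raises. The names `flatten_loop`, `error_name`, and `rec` are hypothetical and exist only for this illustration. The missing dict should fail on the `[1]` index lookup, and the missing field on the `['abbr']` key lookup.

```python
import pandas as pd

def flatten_loop(mydf):
    # Hypothetical wrapper around the original loop, for illustration only
    output_list = []
    for i in range(len(mydf)):
        team1 = mydf['arr'][i][0]
        team2 = mydf['arr'][i][1]  # fails when the list holds only one dict
        # fails when a dict has no 'abbr' key
        output_list.append({'t1': team1['abbr'], 't2': team2['abbr']})
    return pd.DataFrame(output_list)

def error_name(mydf):
    # Return the name of the exception raised, or None if the loop succeeds
    try:
        flatten_loop(mydf)
    except (IndexError, KeyError) as exc:
        return type(exc).__name__
    return None

rec = {'win': 3, 'loss': 7}
full_row = [{'id': 7, 'abbr': 'NYK', 'record': rec},
            {'id': 9, 'abbr': 'BOS', 'record': rec}]

missing_dict = pd.DataFrame({'name': ['nick', 'tom'], 'arr': [
    [{'id': 3, 'abbr': 'ORL', 'record': rec}],
    full_row,
]})
missing_field = pd.DataFrame({'name': ['nick', 'tom'], 'arr': [
    [{'id': 3, 'record': rec}, {'id': 5, 'abbr': 'ATL', 'record': rec}],
    full_row,
]})

errors = {
    'missing dict': error_name(missing_dict),
    'missing field': error_name(missing_field),
}
print(errors)
```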
Is it possible to (a) replace the for-loop with a more structurally sound approach (apply), and (b) handle the missing data concerns?
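One possible direction, offered as a sketch rather than a definitive answer: move the per-row logic into a small function that never indexes blindly, and run it with `Series.apply`. The helper name `team_abbrs` and its `n_teams` parameter are assumptions made for this example. It assumes each cell should hold a list of up to two dicts and that a missing dict or missing `'abbr'` key should become a null value in the output.

```python
import pandas as pd

def team_abbrs(teams, n_teams=2):
    """Return {'t1': ..., 't2': ...} for one row, using None where data is missing."""
    # Tolerate cells that are not lists at all (e.g. None or NaN)
    teams = teams if isinstance(teams, list) else []
    row = {}
    for pos in range(n_teams):
        # Guard against a missing dict at this position
        team = teams[pos] if pos < len(teams) else None
        # dict.get returns None instead of raising when 'abbr' is absent
        row[f't{pos + 1}'] = team.get('abbr') if isinstance(team, dict) else None
    return row

rec = {'win': 3, 'loss': 7}
arr = [
    [{'id': 3, 'record': rec}],                   # no 'abbr' and no second dict
    [{'id': 7, 'abbr': 'NYK', 'record': rec},
     {'id': 9, 'abbr': 'BOS', 'record': rec}],
]
mydf = pd.DataFrame({'name': ['nick', 'tom'], 'arr': arr})

# apply yields one dict per row; building the frame on mydf.index keeps rows aligned
abbr_df = pd.DataFrame(mydf['arr'].apply(team_abbrs).tolist(), index=mydf.index)
final_df = mydf.drop(columns='arr').join(abbr_df)
print(final_df)
```

Building `abbr_df` on `mydf.index` and combining with `join` avoids relying on positional alignment, which matters if `mydf` has been filtered or re-indexed before this step.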
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
