'create a pivot table from a dataset with duplicates

I am trying to create a pivot table and a heat map from this dataset (gender pay)

My code is:

df = df.pivot('Seniority', 'TotalPay', 'Gender')

ax = sns.heatmap(df)

But then I get an error:

KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/base.py in get_loc(self, 
key, method, tolerance)
    3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:

6 frames
pandas/_libs/hashtable_class_helper.pxi in 
pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in 
pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'Gender'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/base.py in get_loc(self, 
key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364 
   3365         if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: 'Gender'

Can anyone help me with this? I tried to removed duplicates using drop_duplicates(), but still does not work. Many thanks in advance.



Solution 1:[1]

My guess is that I created the code with the understanding that I want to create a heatmap of monthly and bonus totals by year and gender. I am posting this as the questioner has given me an opportunity to respond. The grouping of the SENIORITY to be indexed, which is the cause of the error in the pivot, is eliminated. Pivot and transform the data for the heatmap.

import pandas as pd
import seaborn as sns

df = pd.read_csv('/content/Glassdoor Gender Pay Gap.csv', sep=';')
df['TotalPay'] = df['BasePay'] + df['Bonus']

df.head()
JobTitle    Gender  Age     PerfEval    Education   Dept    Seniority   BasePay     Bonus   TotalPay
0   Graphic Designer    Female  18  5   College     Operations  2   42363   9938    52301
1   Software Engineer   Male    21  5   College     Management  5   108476  11128   119604
2   Warehouse Associate     Female  19  4   PhD     Administration  5   90208   9268    99476
3   Software Engineer   Male    20  5   Masters     Sales   4   108080  10154   118234
4   Graphic Designer    Male    26  5   Masters     Engineering     5   99464   9319    108783

df = df.groupby(['Seniority', 'Gender'])['TotalPay'].sum().to_frame('TotalPay').reset_index()

df.head()
    Seniority   Gender  TotalPay
0   1   Female  6267525
1   1   Male    9892202
2   2   Female  8738918
3   2   Male    10224439
4   3   Female  10359925

df = df.pivot('Seniority', 'Gender', 'TotalPay')
df
Gender  Female  Male
Seniority       
1   6267525     9892202
2   8738918     10224439
3   10359925    11775112
4   8483675     11932083
5   11273034    11992901

sns.heatmap(df, annot=True, fmt=',')

enter image description here

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 r-beginners