Dynamically create a grand total row in pandas/PySpark?

I currently have a pandas dataframe that looks like this:

location     count  qty  approved_count
Phoenix         24  300              15
Dallas          18  403              14

I would like to append a row to the bottom of the dataframe that sums each column, with the value "Grand Total" in the 'location' column. The resulting dataset should look like this:

location     count  qty  approved_count
Phoenix         24  300              15
Dallas          18  403              14
Grand Total     42  703              29

I am currently able to get this result this way:

    df = df.append({'location': 'Grand Total',
                    'count': df['count'].sum(),
                    'qty': df['qty'].sum(),
                    'approved_count': df['approved_count'].sum()},
                   ignore_index=True)

However, I would like to iterate over the columns dynamically and sum them, excluding the 'location' column. Is this possible with pandas or PySpark?
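For reference, the hard-coded version above can be made fully dynamic with a dict comprehension (a sketch, not taken from the answers below; note that `DataFrame.append` was removed in pandas 2.0, so `pd.concat` is used instead):

```python
import pandas as pd

df = pd.DataFrame({"location": ["Phoenix", "Dallas"],
                   "count": [24, 18],
                   "qty": [300, 403],
                   "approved_count": [15, 14]})

# Sum every column except 'location', without naming each column explicitly.
total = {c: df[c].sum() for c in df.columns if c != "location"}
total["location"] = "Grand Total"

# DataFrame.append was removed in pandas 2.0; concat a one-row frame instead.
df = pd.concat([df, pd.DataFrame([total])], ignore_index=True)
print(df)
```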



Solution 1:[1]

Try:

df = df.set_index("location")
df.loc["Grand Total"] = df.sum()
df = df.reset_index()

>>> df
      location  count  qty  approved_count
0      Phoenix     24  300              15
1       Dallas     18  403              14
2  Grand Total     42  703              29

Or in one line using concat:

>>> pd.concat([df.set_index("location"), df.drop("location",axis=1).sum().rename("Grand Total").to_frame().T]).reset_index()
         index  count  qty  approved_count
0      Phoenix     24  300              15
1       Dallas     18  403              14
2  Grand Total     42  703              29
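Note that the one-liner labels the restored column `index` rather than `location`; a trailing `rename` (a small addition, not part of the original answer) restores the original name:

```python
import pandas as pd

df = pd.DataFrame({"location": ["Phoenix", "Dallas"],
                   "count": [24, 18],
                   "qty": [300, 403],
                   "approved_count": [15, 14]})

out = (pd.concat([df.set_index("location"),
                  df.drop("location", axis=1).sum().rename("Grand Total").to_frame().T])
         .reset_index()
         .rename(columns={"index": "location"}))  # restore the original column name
print(out)
```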

Solution 2:[2]

Using PySpark as below.

Input Data

df = spark.createDataFrame(
    [('Phoenix', 24, 300, 15), ('Dallas', 18, 403, 14)],
    schema=["location", "count", "qty", "approved_count"],
)
df.show()

+--------+-----+---+--------------+
|location|count|qty|approved_count|
+--------+-----+---+--------------+
| Phoenix|   24|300|            15|
|  Dallas|   18|403|            14|
+--------+-----+---+--------------+

Create a new dataframe holding the column sums:

from pyspark.sql.functions import lit, sum as sum_  # alias avoids shadowing the builtin sum

df1 = (df.withColumn("location", lit("Grand Total"))
         .groupBy("location")
         .agg(*[sum_(c).alias(c) for c in df.columns if c != "location"]))
df1.show()

+-----------+-----+---+--------------+
|   location|count|qty|approved_count|
+-----------+-----+---+--------------+
|Grand Total|   42|703|            29|
+-----------+-----+---+--------------+

Union the two dataframes to append the grand-total row:

df.unionByName(df1).show()

+-----------+-----+---+--------------+
|   location|count|qty|approved_count|
+-----------+-----+---+--------------+
|    Phoenix|   24|300|            15|
|     Dallas|   18|403|            14|
|Grand Total|   42|703|            29|
+-----------+-----+---+--------------+

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 DKNY