'dynamically create a grand total row in pandas/pyspark?
I currently have a pandas dataframe that looks like this:
| location | count | qty | approved_count |
|---|---|---|---|
| Phoenix | 24 | 300 | 15 |
| Dallas | 18 | 403 | 14 |
I would like to append a row to the dataframe that iterates over the columns and sums them, and then appends a new row to the bottom, including the value "Grand Total" in the 'location' column. The resulting dataset should look like this:
| location | count | qty | approved_count |
|---|---|---|---|
| Phoenix | 24 | 300 | 15 |
| Dallas | 18 | 403 | 14 |
| Grand Total | 42 | 703 | 29 |
I am currently able to get this result this way:
df = df.append({'location' : 'Grand Total', 'count' :
df['count'].sum(), 'qty' : df['qty'].sum(),
'approved_count' : df['approved_count'].sum()}, ignore_index = True)
however I would like to be able to dynamically iterate over the columns and sum, excluding the 'location' column from the sum process. Is this possible with Pandas or Pyspark?
Solution 1:[1]
Try:
df = df.set_index("location")
df.loc["Grand Total"] = df.sum()
df = df.reset_index()
>>> df
location count qty approved_count
0 Phoenix 24 300 15
1 Dallas 18 403 14
2 Grand Total 42 703 29
Or in one line using concat:
>>> pd.concat([df.set_index("location"), df.drop("location",axis=1).sum().rename("Grand Total").to_frame().T]).reset_index()
index count qty approved_count
0 Phoenix 24 300 15
1 Dallas 18 403 14
2 Grand Total 42 703 29
Solution 2:[2]
Using Pyspark as below -
Input Data
df = spark.createDataFrame([('Phoenix', 24, 300, 15), ('Dallas', 18, 403, 14)], schema = ["location", "count", "qty", "approved_count"])
df.show()
+--------+-----+---+--------------+
|location|count|qty|approved_count|
+--------+-----+---+--------------+
| Phoenix| 24|300| 15|
| Dallas| 18|403| 14|
+--------+-----+---+--------------+
Creating a new dataframe to find the summation of columns
df1 = df.withColumn("location", lit("Grand Total")).groupBy("location").agg(*[sum(c).alias(c) for c in df.columns if c != "location"])
df1.show()
+-----------+-----+---+--------------+
| location|count|qty|approved_count|
+-----------+-----+---+--------------+
|Grand Total| 42|703| 29|
+-----------+-----+---+--------------+
Union both the dataframes to append the rows
df.unionByName(df1).show()
+-----------+-----+---+--------------+
| location|count|qty|approved_count|
+-----------+-----+---+--------------+
| Phoenix| 24|300| 15|
| Dallas| 18|403| 14|
|Grand Total| 42|703| 29|
+-----------+-----+---+--------------+
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | DKNY |
