Can I make a dataframe that summarises/aggregates data from a much larger one? [duplicate]
I've got a horrendously large dataframe (covering hundreds of days' worth of data) that contains data of the following pattern:
df = data.frame(date = c('2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09'),
category = c('UKS', 'USD', 'UKS', 'UKS', 'USD', 'USD', 'UKS', 'USD', 'UKZ', 'UKY'),
time = c('07:59:53', '08:00:03', '08:00:03', '08:00:03', '08:00:03', '08:00:04', '08:00:08', '08:00:11', '08:00:14', '08:00:15'),
quantity = c(0.001, 0.003, 0.018, 0.010, 0.043, 0.005, 0.023, 0.005, 0.001, 0.008),
cumvol = c(0.001, 0.004, 0.022, 0.032, 0.075, 0.080, 0.103, 0.108, 0.109, 0.117),
type = c('TSV', 'OSN', 'TSS', 'TSV', 'TSS', 'TSS', 'OSN', 'TSV', 'OSN', 'TSS'))
This dataframe cannot be changed. However, what I would like to do is create a 'summary' dataframe from it that sums the total quantity for each category and type per day, and also gives the total quantity for that day.
So using the above example:
For 2021-01-09
Total Quantity = 0.117
Total UKS = 0.052
Total USD = 0.056
Total UKZ = 0.001
Total UKY = 0.008
Does anyone have any advice on how to achieve this for all the days I have data for?
Solution 1:[1]
Here's one way you could do it using sqldf. I think SQL is fairly easy for beginners to understand, and it gives you another general-purpose tool. The UNION here is the key to combining the 'total' row with the per-category rows. The data.frame below includes the quoting and commas needed for it to run.
df = data.frame(date = c('2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09'),
category = c('UKS', 'USD', 'UKS', 'UKS', 'USD', 'USD', 'UKS', 'USD', 'UKZ', 'UKY'),
time = c('07:59:53', '08:00:03', '08:00:03', '08:00:03', '08:00:03', '08:00:04', '08:00:08', '08:00:11', '08:00:14', '08:00:15'),
quantity = c(0.001, 0.003, 0.018, 0.010, 0.043, 0.005, 0.023, 0.005, 0.001, 0.008),
cumvol = c(0.001, 0.004, 0.022, 0.032, 0.075, 0.080, 0.103, 0.108, 0.109, 0.117),
type = c('TSV', 'OSN', 'TSS', 'TSV', 'TSS', 'TSS', 'OSN', 'TSV', 'OSN', 'TSS'))
library('sqldf')
sqldf("select Date, 'ALL' as [Category],
sum(quantity) as [Quantity]
from df
group by Date
UNION
select Date, Category,
sum(quantity) as [Quantity]
from df
group by Date, category
order by sum(Quantity) desc")
OUTPUT:
date Category Quantity
1 2021-01-09 ALL 0.117
2 2021-01-09 USD 0.056
3 2021-01-09 UKS 0.052
4 2021-01-09 UKY 0.008
5 2021-01-09 UKZ 0.001
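If you'd rather avoid an extra dependency, the same per-category sums plus a per-day total can be sketched in base R. This is not part of the original answer; `aggregate()` computes the grouped sums and `ave()` attaches the daily total to each row:

```r
# Base R sketch (assumes the df defined above):
# per-date/per-category sums
per_cat <- aggregate(quantity ~ date + category, data = df, FUN = sum)
# attach the total quantity for each date to every row of that date
per_cat$Total_date <- ave(per_cat$quantity, per_cat$date, FUN = sum)
# sort largest category first, as in the sqldf output
per_cat[order(-per_cat$quantity), ]
```

This avoids the UNION entirely because the daily total lives in its own column rather than in an extra 'ALL' row.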
Solution 2:[2]
Although your question title includes "datatable", your input is a data.frame. Therefore I'll contribute a tidyverse approach here.
library(tidyverse)
df %>% group_by(date, category) %>%
summarize(Total_quantity = sum(quantity), .groups = "drop") %>%
group_by(date) %>%
mutate(Total_date = sum(Total_quantity))
# A tibble: 4 x 4
# Groups: date [1]
date category Total_quantity Total_date
<chr> <chr> <dbl> <dbl>
1 2021-01-09 UKS 0.052 0.117
2 2021-01-09 UKY 0.008 0.117
3 2021-01-09 UKZ 0.001 0.117
4 2021-01-09 USD 0.056 0.117
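Since the question mentions data.table, the equivalent can also be sketched there; this is an addition, not part of either original answer. The first expression does the grouped sum, and `:=` then adds the per-day total by reference:

```r
# data.table sketch (assumes the df defined above):
library(data.table)
dt <- as.data.table(df)
# total quantity per date and category
per_cat <- dt[, .(Total_quantity = sum(quantity)), by = .(date, category)]
# add the per-day total as a column, computed within each date
per_cat[, Total_date := sum(Total_quantity), by = date]
per_cat
```

For hundreds of days of data this should scale well, since data.table's grouped aggregation is designed for large tables.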
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
