Can I make a dataframe that summarises/aggregates data from a much larger one? [duplicate]

I've got a horrendously large dataframe (covering hundreds of days worth of data) that contains data of the following pattern:

df = data.frame(date = c('2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09'),
                category = c('UKS', 'USD', 'UKS', 'UKS', 'USD', 'USD', 'UKS', 'USD', 'UKZ', 'UKY'),
                time = c('07:59:53', '08:00:03', '08:00:03', '08:00:03', '08:00:03', '08:00:04', '08:00:08', '08:00:11', '08:00:14', '08:00:15'),
                quantity = c(0.001, 0.003, 0.018, 0.010, 0.043, 0.005, 0.023, 0.005, 0.001, 0.008),
                cumvol = c(0.001, 0.004, 0.022, 0.032, 0.075, 0.080, 0.103, 0.108, 0.109, 0.117),
                type = c('TSV', 'OSN', 'TSS', 'TSV', 'TSS', 'TSS', 'OSN', 'TSV', 'OSN', 'TSS'))

This dataframe cannot be changed. However, I would like to create a 'summary' dataframe from it that sums the total quantity for each category and type per day, as well as giving the overall total quantity for that day.

So using the above example:

For 2021-01-09 
Total Quantity = 0.117
Total UKS = 0.052
Total USD = 0.056
Total UKZ = 0.001
Total UKY = 0.008

Does anyone have any advice on how to achieve this for all the days I have data for?

r


Solution 1:[1]

Here's one way you could do it using sqldf. SQL is fairly easy for beginners to read and gives you another general-purpose tool. The UNION is the key here: it combines the 'total' row with the per-category rows. BTW, I've quoted the strings and added the missing commas so your data.frame runs.

df = data.frame(date = c('2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09'),
                category = c('UKS', 'USD', 'UKS', 'UKS', 'USD', 'USD', 'UKS', 'USD', 'UKZ', 'UKY'),
                time = c('07:59:53', '08:00:03', '08:00:03', '08:00:03', '08:00:03', '08:00:04', '08:00:08', '08:00:11', '08:00:14', '08:00:15'),
                quantity = c(0.001, 0.003, 0.018, 0.010, 0.043, 0.005, 0.023, 0.005, 0.001, 0.008), 
                cumvol = c(0.001, 0.004, 0.022, 0.032, 0.075, 0.080, 0.103, 0.108, 0.109, 0.117),
                type = c('TSV', 'OSN', 'TSS', 'TSV', 'TSS', 'TSS', 'OSN', 'TSV', 'OSN', 'TSS'))

library('sqldf')

sqldf("select Date, 'ALL' as [Category], 
             sum(quantity) as [Quantity]
      from df
      group by Date
      UNION
      select Date, Category,
             sum(quantity) as [Quantity]
      from df
      group by Date, category
      order by sum(Quantity) desc")

OUTPUT:

        date Category Quantity
1 2021-01-09      ALL    0.117
2 2021-01-09      USD    0.056
3 2021-01-09      UKS    0.052
4 2021-01-09      UKY    0.008
5 2021-01-09      UKZ    0.001
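For comparison, the same summary can be sketched in base R with aggregate(), with no extra packages. This is a sketch using the question's data (only the columns the summary needs), mirroring the UNION above by stacking a per-day 'ALL' row onto the per-category rows:

```r
# Example data (from the question, with quoting fixed; time/cumvol/type omitted)
df <- data.frame(
  date = rep("2021-01-09", 10),
  category = c("UKS","USD","UKS","UKS","USD","USD","UKS","USD","UKZ","UKY"),
  quantity = c(0.001, 0.003, 0.018, 0.010, 0.043, 0.005, 0.023, 0.005, 0.001, 0.008)
)

# Per-day, per-category totals
by_cat <- aggregate(quantity ~ date + category, data = df, FUN = sum)

# Per-day grand totals, labelled 'ALL' to mirror the UNION row above
by_day <- aggregate(quantity ~ date, data = df, FUN = sum)
by_day$category <- "ALL"

# Stack the two summaries into one dataframe
res <- rbind(by_day[, c("date", "category", "quantity")], by_cat)
res
```

Because both pieces are ordinary data.frames with the same columns, rbind() is all that's needed to combine them.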

Solution 2:[2]

Although your question mentions "datatable", your input is a data.frame, so I'll contribute a tidyverse approach here.

library(tidyverse)

df %>% group_by(date, category) %>% 
  summarize(Total_quantity = sum(quantity), .groups = "drop") %>% 
  group_by(date) %>% 
  mutate(Total_date = sum(Total_quantity)) 

# A tibble: 4 x 4
# Groups:   date [1]
  date       category Total_quantity Total_date
  <chr>      <chr>             <dbl>      <dbl>
1 2021-01-09 UKS               0.052      0.117
2 2021-01-09 UKY               0.008      0.117
3 2021-01-09 UKZ               0.001      0.117
4 2021-01-09 USD               0.056      0.117
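And since "datatable" comes up in the question, here is a rough data.table equivalent of the same two-step aggregation (a sketch, again building only the columns the summary needs):

```r
library(data.table)

# Example data (from the question, with quoting fixed; time/cumvol/type omitted)
dt <- data.table(
  date = rep("2021-01-09", 10),
  category = c("UKS","USD","UKS","UKS","USD","USD","UKS","USD","UKZ","UKY"),
  quantity = c(0.001, 0.003, 0.018, 0.010, 0.043, 0.005, 0.023, 0.005, 0.001, 0.008)
)

# Sum quantity per day and category...
res <- dt[, .(Total_quantity = sum(quantity)), by = .(date, category)]
# ...then append the daily total as a new column, grouped by day
res[, Total_date := sum(Total_quantity), by = date]
res
```

The `:=` assignment adds Total_date by reference, so no join back to the per-category table is needed.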

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
