'Summarizing with data.table R - multiple mathematical operations and conditions

I want to summarize a table creating new columns using different mathematical operations and conditions.

I am using data.table because I am used to this package but I accept recommendations on different ones if any (maybe dplyr?).

this is an example of data frame:

id <- c(rep("A", 6), rep("B", 6), rep("C",6))
lat <- c(rep(45, 6), rep(50, 6), rep(-30,6))
lon <- c(rep(0, 6), rep(180, 6), rep(270,6))
hight <- c(rep(seq(0,100, 20),3))
var1 <- rnorm(18, 50, 50)

df <- data.frame(id, lat, lon, hight, var1)

setDT(df)

beside the typical mathematical operations, such as mean, sd, and median, I would like to create a new column showing the value of var1 at a specific condition, such as hight == 0, 100, etc..

df.new <- df[, .(
  "var1_avg" = mean(var1, na.rm = T),
  "var1_sd" = sd(var1, na.rm = T),
  "var1_median" = median(var1, na.rm = T),
  "var1_min" = min(var1),
  
  #here I have the problems:
  "var1_0" =df[which(hight == 0),
           "var1"],
  "var1_100" =df[which(hight == 100),
               "var1"]
  
), by = c("lat", "lon")] 

I understand the concept of the error:

Error in `[.data.table`(df, , .(var1_avg = mean(var1, na.rm = T), var1_sd = sd(var1,  : 
  All items in j=list(...) should be atomic vectors or lists. If you are trying something like j=list(.SD,newcol=mean(colA)) then use := by group instead (much quicker), or cbind or merge afterwards.

But I do not find an efficient solution to get my df.new

r


Solution 1:[1]

Here is a data.table version that seems more efficient than the proposed tidyverse approach:

library(data.table)
set.seed(123)
id <- c(rep("A", 6), rep("B", 6), rep("C",6))
lat <- c(rep(45, 6), rep(50, 6), rep(-30,6))
lon <- c(rep(0, 6), rep(180, 6), rep(270,6))
hight <- c(rep(seq(0,100, 20),3))
var1 <- rnorm(18, 50, 50)
df <- data.table(id, lat, lon, hight, var1, key=c("lat", "lon"))

df[, .(
  "var1_avg" = mean(var1, na.rm = T),
  "var1_sd" = sd(var1, na.rm = T),
  "var1_median" = median(var1, na.rm = T),
  "var1_min" = min(var1),
  "var1_0"= var1[hight==0],
  "var1_100"= var1[hight==100]
), by = c("lat", "lon")]
#>    lat lon var1_avg  var1_sd var1_median  var1_min   var1_0  var1_100
#> 1: -30 270 52.28133 62.36118    62.78635 -48.33086 70.03857 -48.33086
#> 2:  45   0 72.35764 47.75012    54.99490  21.97622 21.97622 135.75325
#> 3:  50 180 47.06030 45.22337    47.85380 -13.25306 73.04581  67.99069

Created on 2022-04-04 by the reprex package (v2.0.1)

Solution 2:[2]

This will calculate the summary statistics e.g. mean or sd for every point (lat, lon) regardless of hight:

library(tidyverse)

id <- c(rep("A", 6), rep("B", 6), rep("C", 6))
lat <- c(rep(45, 6), rep(50, 6), rep(-30, 6))
lon <- c(rep(0, 6), rep(180, 6), rep(270, 6))
hight <- c(rep(seq(0, 100, 20), 3))
var1 <- rnorm(18, 50, 50)
df <- data.frame(id, lat, lon, hight, var1)

df %>%
  group_by(lat, lon) %>%
  summarise(
    var1_avg = mean(var1, na.rm = TRUE),
    var1_sd = sd(var1, na.rm = TRUE),
    var1_median = median(var1, na.rm = TRUE)
  ) %>%
  left_join(
    df %>% filter(hight == 100) %>% transmute(lat, lon, var1_100 = var1)
  ) %>%
  left_join(
    df %>% filter(hight == 0) %>% transmute(lat, lon, var1_0 = var1)
  )
#> `summarise()` has grouped output by 'lat'. You can override using the `.groups`
#> argument.
#> Joining, by = c("lat", "lon")
#> Joining, by = c("lat", "lon")
#> # A tibble: 3 × 7
#> # Groups:   lat [3]
#>     lat   lon var1_avg var1_sd var1_median var1_100  var1_0
#>   <dbl> <dbl>    <dbl>   <dbl>       <dbl>    <dbl>   <dbl>
#> 1   -30   270     90.6    67.0        81.6    181.    5.51 
#> 2    45     0     43.3    40.5        49.6     36.6 -30.1  
#> 3    50   180     34.9    47.0        25.3     24.6   0.705

Created on 2022-04-04 by the reprex package (v2.0.0)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 user12728748
Solution 2 danlooo