'Summarizing with data.table R - multiple mathematical operations and conditions
I want to summarize a table creating new columns using different mathematical operations and conditions.
I am using data.table because I am used to this package but I accept recommendations on different ones if any (maybe dplyr?).
this is an example of data frame:
id <- c(rep("A", 6), rep("B", 6), rep("C",6))
lat <- c(rep(45, 6), rep(50, 6), rep(-30,6))
lon <- c(rep(0, 6), rep(180, 6), rep(270,6))
hight <- c(rep(seq(0,100, 20),3))
var1 <- rnorm(18, 50, 50)
df <- data.frame(id, lat, lon, hight, var1)
setDT(df)
beside the typical mathematical operations, such as mean, sd, and median, I would like to create a new column showing the value of var1 at a specific condition, such as hight == 0, 100, etc..
df.new <- df[, .(
"var1_avg" = mean(var1, na.rm = T),
"var1_sd" = sd(var1, na.rm = T),
"var1_median" = median(var1, na.rm = T),
"var1_min" = min(var1),
#here I have the problems:
"var1_0" =df[which(hight == 0),
"var1"],
"var1_100" =df[which(hight == 100),
"var1"]
), by = c("lat", "lon")]
I understand the concept of the error:
Error in `[.data.table`(df, , .(var1_avg = mean(var1, na.rm = T), var1_sd = sd(var1, :
All items in j=list(...) should be atomic vectors or lists. If you are trying something like j=list(.SD,newcol=mean(colA)) then use := by group instead (much quicker), or cbind or merge afterwards.
But I do not find an efficient solution to get my df.new
Solution 1:[1]
Here is a data.table version that seems more efficient than the proposed tidyverse approach:
library(data.table)
set.seed(123)
id <- c(rep("A", 6), rep("B", 6), rep("C",6))
lat <- c(rep(45, 6), rep(50, 6), rep(-30,6))
lon <- c(rep(0, 6), rep(180, 6), rep(270,6))
hight <- c(rep(seq(0,100, 20),3))
var1 <- rnorm(18, 50, 50)
df <- data.table(id, lat, lon, hight, var1, key=c("lat", "lon"))
df[, .(
"var1_avg" = mean(var1, na.rm = T),
"var1_sd" = sd(var1, na.rm = T),
"var1_median" = median(var1, na.rm = T),
"var1_min" = min(var1),
"var1_0"= var1[hight==0],
"var1_100"= var1[hight==100]
), by = c("lat", "lon")]
#> lat lon var1_avg var1_sd var1_median var1_min var1_0 var1_100
#> 1: -30 270 52.28133 62.36118 62.78635 -48.33086 70.03857 -48.33086
#> 2: 45 0 72.35764 47.75012 54.99490 21.97622 21.97622 135.75325
#> 3: 50 180 47.06030 45.22337 47.85380 -13.25306 73.04581 67.99069
Created on 2022-04-04 by the reprex package (v2.0.1)
Solution 2:[2]
This will calculate the summary statistics e.g. mean or sd for every point (lat, lon) regardless of hight:
library(tidyverse)
id <- c(rep("A", 6), rep("B", 6), rep("C", 6))
lat <- c(rep(45, 6), rep(50, 6), rep(-30, 6))
lon <- c(rep(0, 6), rep(180, 6), rep(270, 6))
hight <- c(rep(seq(0, 100, 20), 3))
var1 <- rnorm(18, 50, 50)
df <- data.frame(id, lat, lon, hight, var1)
df %>%
group_by(lat, lon) %>%
summarise(
var1_avg = mean(var1, na.rm = TRUE),
var1_sd = sd(var1, na.rm = TRUE),
var1_median = median(var1, na.rm = TRUE)
) %>%
left_join(
df %>% filter(hight == 100) %>% transmute(lat, lon, var1_100 = var1)
) %>%
left_join(
df %>% filter(hight == 0) %>% transmute(lat, lon, var1_0 = var1)
)
#> `summarise()` has grouped output by 'lat'. You can override using the `.groups`
#> argument.
#> Joining, by = c("lat", "lon")
#> Joining, by = c("lat", "lon")
#> # A tibble: 3 × 7
#> # Groups: lat [3]
#> lat lon var1_avg var1_sd var1_median var1_100 var1_0
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 -30 270 90.6 67.0 81.6 181. 5.51
#> 2 45 0 43.3 40.5 49.6 36.6 -30.1
#> 3 50 180 34.9 47.0 25.3 24.6 0.705
Created on 2022-04-04 by the reprex package (v2.0.0)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | user12728748 |
| Solution 2 | danlooo |
