Problem with non-standard evaluation in disk.frame objects using data.table syntax
Problem
I'm trying to write a function that filters rows of a disk.frame object using regular expressions. Unfortunately, I run into problems with how my search string is evaluated in the filtering call. My idea was to pass a regular expression as a string in a function argument (e.g. storm_name) and then use that argument in the filter. I used the %like% function from {data.table} to filter rows.
My problem is that the storm_name object gets evaluated inside the disk.frame. However, since the storm_name is only included in the function environment, but not in the disk.frame object, I get the following error:
Error in .checkTypos(e, names_x) :
Object 'storm_name' not found amongst name, year, month, day, hour and 8 more
I already tried to evaluate the storm_name object in the parent frame using eval(storm_name, env = parent.env()), but that didn't help either. Interestingly, this problem only occurs with {disk.frame} objects but not with {data.table} objects.
For now I found a solution using {dplyr} instead. However, I would be grateful for any ideas on how this problem could be solved with {data.table}.
Reproducible Example
# Load packages
library(data.table)
library(disk.frame)
# Create data table and diskframe object of storm data
storms_df <- as.disk.frame(storms)
storms_dt <- as.data.table(storms)
# Create search function
grep_storm_name <- function(dfr, storm_name){
  dfr[name %like% storm_name]
}
# Check function with data.table object
grep_storm_name(storms_dt, "^A")
# Check function with diskframe object
grep_storm_name(storms_df, "^A")
Session Info
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)
Matrix products: default
locale:
[1] LC_COLLATE=English_Sweden.1252 LC_CTYPE=English_Sweden.1252 LC_MONETARY=English_Sweden.1252
[4] LC_NUMERIC=C LC_TIME=English_Sweden.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] disk.frame_0.5.0 purrr_0.3.4 dplyr_1.0.7 data.table_1.14.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.7 benchmarkmeData_1.0.4 pryr_0.1.4 pillar_1.6.4
[5] compiler_4.1.0 iterators_1.0.13 tools_4.1.0 digest_0.6.27
[9] bit_4.0.4 jsonlite_1.7.2 lifecycle_1.0.1 tibble_3.1.6
[13] lattice_0.20-44 pkgconfig_2.0.3 rlang_0.4.12 Matrix_1.3-3
[17] foreach_1.5.1 rstudioapi_0.13 DBI_1.1.1 parallel_4.1.0
[21] bigassertr_0.1.4 bigreadr_0.2.4 httr_1.4.2 stringr_1.4.0
[25] globals_0.14.0 generics_0.1.1 fs_1.5.0 vctrs_0.3.8
[29] bit64_4.0.5 grid_4.1.0 tidyselect_1.1.1 glue_1.6.0
[33] listenv_0.8.0 R6_2.5.1 future.apply_1.7.0 parallelly_1.25.0
[37] fansi_1.0.0 magrittr_2.0.1 codetools_0.2-18 ellipsis_0.3.2
[41] fst_0.9.4 assertthat_0.2.1 future_1.21.0 benchmarkme_1.0.7
[45] utf8_1.2.2 stringi_1.7.6 doParallel_1.0.16 crayon_1.4.2
Solution 1:[1]
While I don't know the exact cause of this, it has to do with environments, search path, etc. For instance, these work:
storms_df[name %like% "^A"]
nm <- "^A"
storms_df[name %like% nm]
grep1 <- function(dfr, storm_name) { dfr[name %like% "^A"]; }
grep1(storms_df)
But this does not:
grep2 <- function(dfr, storm_name) { dfr[name %like% storm_name]; }
grep2(storms_df, "^A")
# Error in .checkTypos(e, names_x) :
# Object 'storm_name' not found amongst name, year, month, day, hour and 8 more
We can work around this with eval(substitute(..)).
grep3 <- function(dfr, storm_name) {
  eval(substitute(dfr[name %like% storm_name], list(storm_name = storm_name)))
}
grep3(storms_df, "^A")
# name year month day hour lat long status category wind pressure ts_diameter hu_diameter
# <char> <num> <num> <int> <num> <num> <num> <char> <ord> <int> <int> <num> <num>
# 1: Amy 1975 6 27 0 27.5 -79.0 tropical depression -1 25 1013 NA NA
# 2: Amy 1975 6 27 6 28.5 -79.0 tropical depression -1 25 1013 NA NA
# 3: Amy 1975 6 27 12 29.5 -79.0 tropical depression -1 25 1013 NA NA
# ...
(and grep3(storms_dt, "^A") works too)
This works by replacing the symbol storm_name inside the [-expression with its value, the literal string. Since the replacement is done on the unevaluated expression, no lookup of storm_name is needed when the expression is finally evaluated, so it no longer matters which environments are searched.
If you check it manually:
debug(grep3)
grep3(storms_df, "^A")
# debugging in: grep3(storms_df, "^A")
# debug at #1: {
# eval(substitute(dfr[name %like% storm_name], list(storm_name = storm_name)))
# }
# Browse[2]>
substitute(dfr[name %like% storm_name], list(storm_name = storm_name))
# dfr[name %like% "^A"]
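The same substitution can be inspected outside the debugger with base R alone. The following is a minimal sketch that only builds the unevaluated call and never touches disk.frame or data.table; it also shows bquote() as an alternative way to splice the value in, which is my addition and not something used above.

```r
# Base R only: `dfr`, `name` and `%like%` are never evaluated here,
# we only build and inspect the unevaluated call.
storm_name <- "^A"

# Replace the symbol `storm_name` with its value via substitute()
call_subst <- substitute(dfr[name %like% storm_name],
                         list(storm_name = storm_name))

# Alternative: splice the value in with bquote() and .()
call_bquote <- bquote(dfr[name %like% .(storm_name)])

print(call_subst)
print(call_bquote)
```

Both calls contain the literal string instead of the symbol. Wrapping eval() around the bquote() version inside a function would presumably behave like grep3, but that is untested here.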
I think it's something to do with how disk.frame is affecting the environment within [ and the calling/parent environments. Interestingly (to me), you can see that the search path for variables is not empty, it's just not what we would expect:
grep2 <- function(dfr, storm_name) { dfr[name %like% storm_name]; }
grep2(storms_df, "^A")
# Error in .checkTypos(e, names_x) :
# Object 'storm_name' not found amongst name, year, month, day, hour and 8 more
### but let's pre-define `storm_name` outside of the function,
### then re-define the function (no change)
storm_name <- "^A"
grep2 <- function(dfr, storm_name) { dfr[name %like% storm_name]; }
head(grep2(storms_df, "^A"), 2)
# name year month day hour lat long status category wind pressure ts_diameter hu_diameter
# <char> <num> <num> <int> <num> <num> <num> <char> <ord> <int> <int> <num> <num>
# 1: Amy 1975 6 27 0 27.5 -79 tropical depression -1 25 1013 NA NA
# 2: Amy 1975 6 27 6 28.5 -79 tropical depression -1 25 1013 NA NA
This seems to work, but it is actually using the global storm_name rather than the function argument: the returned names still start with A even though the argument is now "^B".
head(grep2(storms_df, "^B"), 2)
# name year month day hour lat long status category wind pressure ts_diameter hu_diameter
# <char> <num> <num> <int> <num> <num> <num> <char> <ord> <int> <int> <num> <num>
# 1: Amy 1975 6 27 0 27.5 -79 tropical depression -1 25 1013 NA NA
# 2: Amy 1975 6 27 6 28.5 -79 tropical depression -1 25 1013 NA NA
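The behaviour above is what you would see if the expression were evaluated in an environment that does not include the function's own frame. The sketch below reproduces that effect in base R only; it is an illustration of the mechanism and not disk.frame's actual code, and outer_env and lookup_demo are hypothetical names made up for this example.

```r
# An environment standing in for "wherever the expression ends up
# being evaluated"; it holds its own value of storm_name.
outer_env <- new.env()
outer_env$storm_name <- "^A"

lookup_demo <- function(storm_name) {
  expr <- quote(storm_name)  # unevaluated symbol
  list(
    # evaluated in the function's frame: finds the argument
    in_function = eval(expr, envir = environment()),
    # evaluated elsewhere: finds the other binding instead
    elsewhere = eval(expr, envir = outer_env)
  )
}

res <- lookup_demo("^B")
print(res)
```

The same symbol resolves to "^B" in the function's frame and to "^A" in the other environment, which matches the pattern seen with grep2 once a global storm_name exists.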
Frankly, I don't understand enough of disk.frame's internals to know if this is a bug or a necessity due to what it must do for non-standard data.table-like evaluation of a not-totally-in-memory dataset.
If you're concerned with performance (fair question), the eval(substitute(..)) method does not appear to suffer much:
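# setup assumed for the benchmark: both names must exist at top level
dfr <- storms_df
storm_name <- "^A"
bench::mark(
  raw = dfr[name %like% "^A"],
  subst = eval(substitute(dfr[name %like% storm_name], list(storm_name = storm_name))),
  iterations = 1000
)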
# # A tibble: 2 x 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 raw 12.9ms 16.8ms 55.2 1.69MB 3.97 933 67 16.9s <data.table [990 x 13]> <Rprofmem [669 x 3]> <bench_tm [1,000]> <tibble [1,000 x 3]>
# 2 subst 12.8ms 15.8ms 60.5 1.69MB 3.25 949 51 15.7s <data.table [990 x 13]> <Rprofmem [669 x 3]> <bench_tm [1,000]> <tibble [1,000 x 3]>
In repeated benchmarks, I've actually seen subst slightly faster, suggesting that a portion of the performance difference is unrelated to the addition of eval(substitute(..)). This difference (55.2 to 60.5 `itr/sec`) is the worst I've seen it ... a repeat just now had 57.1 and 57.5, so I suggest that performance-degradation is not a concern.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
