Create and fill new columns based on range information from two other columns
I have the following data:
df <- data.frame(group = c(1, 1, 1, 2, 2, 2),
start = c(2, 2, 2, 7, 7, 7),
stop = c(4, 7, 8, 7, 8, 9),
unstop = c(5, 7, 10, 7, 9, 10))
I now want to do the following:
- Create new columns named "week_1", "week_2", …, "week_n" (here, n = 10).
- Within each group, for the FIRST ROW, I check which weeks the row was "active" in; e.g. if it started in week 2 and stopped in week 4, the row was active in weeks 2, 3, and 4. I then populate the respective week columns with a 1.
- Within each group, for ALL OTHER ROWS EXCEPT THE LAST, I do the same check, but now populate based on the unstop value of that row and the stop value of the next row.
- Within each group, for the LAST ROW, I do the same check, but now populate based on the range from unstop to 10 (which is the last week in my case).
I have a theoretical way to do this. The problem is that my real-life data has 80k rows (in 60k groups) and I'd need to create ~200 such week columns. Even on just 10 rows, the code below takes ~30 s.
So I'm looking for a more elegant/smarter/FASTER solution.
Expected outcome:
# A tibble: 6 × 14
# Groups:   group [2]
  group start  stop unstop week_1 week_2 week_3 week_4 week_5 week_6 week_7 week_8 week_9 week_10
  <dbl> <dbl> <dbl>  <dbl>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>   <int>
1     1     2     4      5      0      1      1      1      0      0      0      0      0       0
2     1     2     7      7      0      0      0      0      0      0      1      1      0       0
3     1     2     8     10      0      0      0      0      0      0      0      0      0       1
4     2     7     7      7      0      0      0      0      0      0      1      0      0       0
5     2     7     8      9      0      0      0      0      0      0      0      1      1       0
6     2     7     9     10      0      0      0      0      0      0      0      0      0       1
Below is how I would generally approach it (of course not with manually defining each row_number). Apart from that, the code is also wrong: it does not give the expected 0/1 values and it throws many warnings. Finally, it already runs for a few seconds on this small test data; it would run for a month on my 80k-row/200-column data set.
library(tidyverse)

add_weeks <- as_tibble(as.list(setNames(rep(0L, 10),
                                        paste0("week_", 1:10))))
df |>
  bind_cols(add_weeks) |>
  group_by(group) |>
  mutate(across(num_range("week_", 1:10),
                ~ if_else(row_number() == 1 & str_extract(cur_column(), "\\d+$") %in% start:stop,
                          1L,
                          .)),
         across(num_range("week_", 1:10),
                ~ if_else(row_number() == 2 & str_extract(cur_column(), "\\d+$") %in% unstop:lead(stop),
                          1L,
                          .)),
         across(num_range("week_", 1:10),
                ~ if_else(row_number() == 3 & str_extract(cur_column(), "\\d+$") %in% unstop:10,
                          1L,
                          .)))
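A side note on why this attempt misbehaves (my reading, worth verifying): inside mutate(), unstop and lead(stop) are whole group columns, and the : operator silently uses only the first element of a vector argument, so the wrong ranges get built and every evaluation emits a warning. A minimal illustration:

x <- c(5, 7, 10)
1:x
#> [1] 1 2 3 4 5
#> Warning message:
#> In 1:x : numerical expression has 3 elements: only the first used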
Solution 1:[1]
Now tested code. It implements the strategy described in my comment:
I'd make a matrix with named columns and assign with row and column indices. You can then either attach it as a matrix or convert it to a data frame.
maxwk <- 10  # 200 for the real case
Mat <- matrix(0, nrow(df), maxwk)
colnames(Mat) <- paste0("week", 1:maxwk)
# Add an extra column that identifies which of the three cases applies.
# If every group has exactly 3 rows, rep(1:3, ngrps) is enough:
df$cond <- rep(1:3, length = nrow(df))
for (r in 1:nrow(df)) {
  # first row in group: start:stop
  if (df$cond[r] == 1) {
    Idx <- paste0("week", df$start[r]:df$stop[r])
    Mat[r, Idx] <- 1; next }
  # second row: stop:unstop
  if (df$cond[r] == 2) {
    Idx <- paste0("week", df$stop[r]:df$unstop[r])
    Mat[r, Idx] <- 1; next }
  # third row: unstop:maxwk
  if (df$cond[r] == 3) {
    Idx <- paste0("week", df$unstop[r]:maxwk)
    Mat[r, Idx] <- 1; next }
}
df
  group start stop unstop cond
1     1     2    4      5    1
2     1     2    7      7    2
3     1     2    8     10    3
4     2     7    7      7    1
5     2     7    8      9    2
6     2     7    9     10    3
> Mat
     week1 week2 week3 week4 week5 week6 week7 week8 week9 week10
[1,]     0     1     1     1     0     0     0     0     0      0
[2,]     0     0     0     0     0     0     1     0     0      0
[3,]     0     0     0     0     0     0     0     0     0      1
[4,]     0     0     0     0     0     0     1     0     0      0
[5,]     0     0     0     0     0     0     0     1     1      0
[6,]     0     0     0     0     0     0     0     0     0      1
You could cbind these.
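For example, a minimal way of attaching it (using the df and Mat objects from above):

df_wide <- cbind(df, Mat)  # data.frame + matrix -> data.frame with the week columns appended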
There might be further performance improvements possible. One could use switch(cond, ...) to dispatch to the correct branch rather than the if (cond == .) { ...; next } pattern. This should be much faster than code that uses ifelse or if_else; the alternate implementation is shown below.
I ran a benchmark after setting up both methods as functions (do_first() presumably wrapping the question's dplyr code, do_second() the matrix loop above) for a 100-week maximum. The warnings come from the code in the question:
> perf_results <- microbenchmark(
+ first.method = do_first(df), sec.method=do_second(df), times=10)
There were 50 or more warnings (use warnings() to see the first 50)
> perf_results
Unit: microseconds
         expr         min        lq         mean       median          uq        max neval
 first.method 4385001.123 4416568.8 4581549.9624 4450691.5455 4615753.753 5350416.80    10
   sec.method     146.432     149.6     181.6137     188.2125     193.307     243.47    10
I wanted to see if a switch method of selecting the proper algorithm for a row would improve performance. It did, to a degree that surprised me. The switch function is analogous to the case construct in Pascal and many other languages. It has two forms whose behavior differs depending on whether the first argument, EXPR, is numeric or character. Here the "dispatch" version is chosen because the cond column is numeric.
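A minimal illustration of the two forms (not from the original answer):

# numeric EXPR: switch() dispatches by position, like a case statement
switch(2, "first", "second", "third")
#> [1] "second"

# character EXPR: switch() matches by name; a trailing unnamed element is the default
switch("b", a = "apple", b = "banana", "neither")
#> [1] "banana"

Here is the switch-based version used in the benchmark below: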
do_third <- function(df) {
  maxwk <- 100                             # 200 for the real case
  Mat <- matrix(0, nrow(df), maxwk)
  colnames(Mat) <- paste0("week", 1:maxwk)
  df$cond <- rep(1:3, length = nrow(df))   # assume all groups have exactly 3 rows
  for (r in 1:nrow(df)) {
    switch(df[r, "cond"],
      { # first row in each group of 3: start:stop
        Idx <- paste0("week", df$start[r]:df$stop[r])
        Mat[r, Idx] <- 1 },
      { # second row in group: stop:unstop
        Idx <- paste0("week", df$stop[r]:df$unstop[r])
        Mat[r, Idx] <- 1 },
      { # third row: unstop:maxwk
        Idx <- paste0("week", df$unstop[r]:maxwk)
        Mat[r, Idx] <- 1 })
  }
  Mat  # return the filled matrix
}
New microbenchmark:
perf_results
Unit: nanoseconds
         expr        min         lq         mean     median         uq        max neval cld
 first.method 4304901359 4351893534 4387626725.8 4372151785 4416247096 4543314742    10   b
   sec.method     162803     173855    2588492.1     215309     216878   24081195    10  a
   third.meth         34         53        610.6        877        940        963    10  a
Solution 2:[2]
FWIW, I'm posting my own solution as well. Apparently, adding 200 columns to a 60k-row data frame based on some conditions is extremely slow. So what I did instead is:
- Add one character column with the info about the weeks via str_c.
- Create a smaller data set that has just the grouping variable and this new info.
- Then use separate_rows on this week info to get a long-format data set.
- Then use pivot_wider and combine this info with the original data set.
Note that this approach works because I actually want to summarize the week info per group, so in the end I want one row per group. In the interest of keeping my question simple, I didn't mention this in the initial post.
Having said that, the solution of @IRTFM is still considerably faster, by about a factor of 3.
df2 <- df |>
  group_by(group) |>
  mutate(lead_stop = lead(stop, default = 0),
         n_rows = n(),
         row_number = row_number()) |>
  ungroup() |>
  rowwise() |>
  mutate(split_weeks = case_when(
    n_rows == 1 & row_number == 1 ~ str_c(start:stop, collapse = ","),
    n_rows > 1  & row_number == 1 ~ str_c(c(start:stop, unstop:lead_stop), collapse = ","),
    row_number == n_rows          ~ str_c(unstop:10, collapse = ","),
    TRUE                          ~ str_c(unstop:lead_stop, collapse = ",")
  ))
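For the example data, split_weeks should come out like this (derived by hand from the case_when rules above, so worth double-checking):

#   group start stop unstop lead_stop n_rows row_number split_weeks
# 1     1     2    4      5         7      3          1 2,3,4,5,6,7
# 2     1     2    7      7         8      3          2 7,8
# 3     1     2    8     10         0      3          3 10
# 4     2     7    7      7         8      3          1 7,7,8
# 5     2     7    8      9         9      3          2 9
# 6     2     7    9     10         0      3          3 10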
df3 <- df2 |>
  group_by(group) |>
  summarize(split_weeks = unique(str_c(split_weeks, collapse = ","))) |>
  separate_rows(split_weeks, sep = ",", convert = TRUE) |>
  distinct() |>
  mutate(value = 1L) |>
  full_join(y = data.frame(split_weeks = 1:10)) |>
  pivot_wider(names_from = split_weeks,
              names_prefix = "week_",
              values_from = value,
              values_fill = 0L,
              names_expand = TRUE) |>
  filter(!is.na(group))
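With the example data, df3 should then contain one row per group (again derived by hand, so worth double-checking):

#   group week_1 week_2 week_3 week_4 week_5 week_6 week_7 week_8 week_9 week_10
# 1     1      0      1      1      1      1      1      1      1      0       1
# 2     2      0      0      0      0      0      0      1      1      1       1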
df4 <- df2 |>
  ungroup() |>
  select(-split_weeks, -n_rows) |>
  pivot_wider(names_from = row_number, values_from = -group) |>
  bind_cols(x = df3 |> select(-group), y = _)
Solution 3:[3]
library(tidyverse)
periods <- tibble(
group = c(1, 1, 1, 2, 2, 2),
start = c(2, 2, 2, 7, 7, 7),
stop = c(4, 7, 8, 7, 8, 9),
unstop = c(5, 7, 10, 7, 9, 10)
)
LAST <- 10
I think it makes sense to recode the intragroup start/stop/unstop logic to a single start/stop for each row. Let’s call them rstart/rstop. Using your rules, they can be created like this:
(periods <- periods %>%
  group_by(group) %>%
  transmute(
    period = row_number(),
    rstart = if_else(period == 1L, start, unstop),
    rstop  = if_else(period == 1L, stop, lead(stop, default = LAST))
  ) %>%
  ungroup()
)
#> # A tibble: 6 x 4
#>   group period rstart rstop
#>   <dbl>  <int>  <dbl> <dbl>
#> 1     1      1      2     4
#> 2     1      2      7     8
#> 3     1      3     10    10
#> 4     2      1      7     7
#> 5     2      2      9     9
#> 6     2      3     10    10
Now we can generate the active stretches via group_by -> summarise. Here we also add an indicator column, active, to show that the given weeks are active:
(periods <- periods %>%
  group_by(group, period) %>%
  summarise(
    weeks = rstart:rstop,
    active = 1L,
    .groups = "drop"
  )
)
#> # A tibble: 9 x 4
#>   group period weeks active
#>   <dbl>  <int> <int>  <int>
#> 1     1      1     2      1
#> 2     1      1     3      1
#> 3     1      1     4      1
#> 4     1      2     7      1
#> 5     1      2     8      1
#> 6     1      3    10      1
#> 7     2      1     7      1
#> 8     2      2     9      1
#> 9     2      3    10      1
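A side note that is not part of the original answer: dplyr 1.1.0 deprecated returning multiple rows per group from summarise(), so on current dplyr versions the same step would be written with reframe(), which always returns an ungrouped tibble:

periods %>%
  group_by(group, period) %>%
  reframe(weeks = rstart:rstop, active = 1L)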
To have the non-observed weeks present in the output after pivot_wider, we can convert the weeks column to a factor and add the missing levels with fct_expand. I've also added fct_inseq to make sure the columns are ordered as expected in the output. Once that is done, we can use pivot_wider to get the wide format. Note the names_expand = TRUE argument, which gives us the levels we added to the weeks column.
periods %>%
  mutate(
    weeks = as_factor(weeks) %>%
      fct_expand(as.character(1:LAST)) %>%
      fct_inseq()
  ) %>%
  pivot_wider(
    names_from = weeks,
    names_expand = TRUE,
    values_from = active,
    values_fill = 0L,
    names_prefix = "week"
  )
#> # A tibble: 6 x 12
#>   group period week1 week2 week3 week4 week5 week6 week7 week8 week9 week10
#>   <dbl>  <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>  <int>
#> 1     1      1     0     1     1     1     0     0     0     0     0      0
#> 2     1      2     0     0     0     0     0     0     1     1     0      0
#> 3     1      3     0     0     0     0     0     0     0     0     0      1
#> 4     2      1     0     0     0     0     0     0     1     0     0      0
#> 5     2      2     0     0     0     0     0     0     0     0     1      0
#> 6     2      3     0     0     0     0     0     0     0     0     0      1
Created on 2022-05-09 by the reprex package (v2.0.1)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | IRTFM |
| Solution 2 | deschen |
| Solution 3 | Peter H. |
