'Conditionally mutate dataframe based on multiple conditions R

I have seen some similar questions, but none of them was exactly the same as the thing I want to do - which is why I am asking.

I have a dataframe (dummy_data) which contains indices of some observations (obs) regarding given subjects (ID). The dataframe consists only the meaningful data (in other words: the desired conditions are met). The last column in this example data contains the total number of observations (total_obs).

ID <-c(rep("item_001",5),rep("item_452",8),rep("item_0001",7),rep("item_31",9),rep("item_007",5))
obs <- c(1,2,3,5,6,3,4,5,7,8,9,12,16,1,2,4,5,6,7,8,2,4,6,7,8,10,13,14,15,3,4,6,7,11)
total_obs <- c(rep(6,5),rep(16,8),rep(9,7),rep(18,9),rep(11,5))

dummy_data <- data.frame(ID, obs, total_obs)

I would like to create a new column (interval) with 3 possible values: "start", "center", "end" based on following condition(s):

it should split total number of observations (total_obs) into 3 groups (based on indices - from 1st to the last - which is the value stored in the total_obs column) and assign the interval value according to the indices stored in obs column.

Here is the expected output:

ID <- c(rep("item_001",5),rep("item_452",8),rep("item_0001",7),rep("item_31",9),rep("item_007",5))
segment <- c(1,2,3,5,6,  3,4,5,7,8,9,12,16, 1,2,4,5,6,7,8, 2,4,6,7,8,10,13,14,15, 3,4,6,7,11)
total_segments <- c(rep(6,5),rep(16,8),rep(9,7),rep(18,9),rep(11,5))
interval <- c("start","start","center","end","end","start","start","start","center","center","center","end","end","start","start","center","center","center","end","end","start","start","start","center","center","center","end","end","end", "start","start","center","center","end")

wanted_data <- data.frame(ID, segment, total_segments, interval)

I would like to use use dplyr::ntile() with dplyr::mutate() and dplyr::case_when() but I could not make my code function properly. Any solutions?



Solution 1:[1]

You just need dplyr::mutate() and dplyr::case_when(). The following should give you something to work off of.

dummy_data %>%
    mutate(interval = case_when(obs < (total_obs/3) ~ "start",
                                obs < 2*(total_obs/3) ~ "center",
                                TRUE ~ "end"))
# TRUE ~ "end" is the 'else' case when everything else is false

Which gives slightly different results. I think more careful deliberation should be made regarding where the endpoints are for each interval, but if you know what you are doing, using a combination of <=, %/%, and ceil() should give you the result you desire.

Solution 2:[2]

First, because dummy_data$obs is identical withwanted_data$segment, and dummy_data$total_obs is identical with wanted_data$total_segments, you just need to rename these columns.

For the interval column, here is one approach of creating it:

  • group the data based on segment column
  • create a column, say tile, and fill it with ntile(segment) results.
  • create interval column, and use case_when to fill it with the category labels created from tile. It means, fill interval with "start" when tile = 1, "center" when 2, and "end" when 3.
  • drop the tile column.
wanted_data <- dummy_data %>% 
  rename(segment = obs, total_segments = total_obs) %>%
  group_by(total_segments) %>% 
  mutate(tile = ntile(segment, 3)) %>% 
  mutate(interval = case_when(tile == 1~"start", 
                              tile == 2~"center", 
                              tile == 3~"end")) %>% 
  select(-tile)

wanted_data 

# A tibble: 34 × 4
# Groups:   total_segments [5]
   ID       segment total_segments interval
   <chr>      <dbl>          <dbl> <chr>   
 1 item_001       1              6 start   
 2 item_001       2              6 start   
 3 item_001       3              6 center  
 4 item_001       5              6 center  
 5 item_001       6              6 end     
 6 item_452       3             16 start   
 7 item_452       4             16 start   
 8 item_452       5             16 start   
 9 item_452       7             16 center  
10 item_452       8             16 center  
# … with 24 more rows

It's slightly different from wanted_data$interval that you showed because based on your comment, you said that the division into categories is just as dplyr::ntile() does.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 philiptomk
Solution 2 Abdur Rohman