Check dataframe with different functions (dplyr)

I am trying to write some functions that take a dataframe and check whether certain variables fulfill certain criteria. For each check, I would like to create a new variable prefixed with "check_" that holds the result of the check. So far I have not managed to get it right. Can someone help me?

# Some sample data
dat <- data.frame(Q1_1 = c(1, 1, 2, 5, 2, 1),
                  Q1_2 = c(1, 2, 3, 5, 1, 3),
                  Q1_3 = c(4, 3, 3, 5, 1, 3),
                  Q1_4 = c(4, 2, 2, 5, 1, 2),
                  Q1_5 = c(2, 2, 1, 5, 5, 4),
                  Q2_1 = c(1, 2, 1, 2, 1, 2),
                  Q2_2 = c(2, 1, 1, 1, 2, 1),
                  Q2_3 = c(1, 1, 1, 2, 2, 1),
                  age = c(22,36,20,27,13, 9))


# Some checker-functions

check_age <- function(.df, agevar = "age"){
  #' Function should check if the age value is within a certain range
  #' and create a new variable "check_age" giving the result of the check

  .df %>% mutate(check_age = ifelse(age > 100, FALSE, TRUE),
                 check_age = ifelse(age < 4, FALSE, TRUE))
  ???
}

check_sameAnswers <- function(.df, varname = "Q1_"){
  #' Function should check whether all sub-items of a question (e.g. Q1_1 to Q1_5) have the
  #' same values and create a new variable "check_sameAnswers" giving the result of the check.
  #' It should be TRUE if Q1_1, Q1_2, ... have the value 5 for example, otherwise FALSE
  
  ???
}


# Apply checker functions to dataframe in "dplyr-style"
dat <- dat %>% 
          check_age(agevar = "age") %>%
          check_sameAnswers(varname = "Q1_")


Solution 1:[1]

You can embrace the argument with {{ }} so that a bare column name passed to your function is resolved through data masking.

Functions

library(dplyr)

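check_age <- function(data, age_var, start = 0, end = 0){
  data %>% 
    # {{age_var}} forwards the bare column name into mutate();
    # the comparison already yields TRUE/FALSE, so no ifelse() is needed
    mutate(between = {{age_var}} >= start & {{age_var}} <= end)
}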

check_sameAnswers <- function(data, cols){
  data %>% 
    rowwise() %>% 
    mutate(same = length(unique(c_across(starts_with(cols)))) == 1) %>% 
    ungroup()
}

Use

dat %>% 
  check_age(age, 30, 40) %>% 
  check_sameAnswers(cols="Q1")
# A tibble: 6 × 11
   Q1_1  Q1_2  Q1_3  Q1_4  Q1_5  Q2_1  Q2_2  Q2_3   age between same 
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>   <lgl>
1     1     1     4     4     2     1     2     1    22 FALSE   FALSE
2     1     2     3     2     2     2     1     1    36 TRUE    FALSE
3     2     3     3     2     1     1     1     1    20 FALSE   FALSE
4     5     5     5     5     5     2     1     2    27 FALSE   TRUE 
5     2     1     1     1     5     1     2     2    13 FALSE   FALSE
6     1     3     3     2     4     2     1     1     9 FALSE   FALSE
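
The question passes the column name as a string (agevar = "age"), whereas embracing expects a bare name. As a minimal sketch, assuming you want to keep the string interface, the .data pronoun can look the column up by name. The function name check_age_str and the defaults start = 4 and end = 100 are illustrative choices taken from the thresholds in the question, not part of the original answer.

```r
library(dplyr)

# Hypothetical variant of check_age that takes the column name as a string.
check_age_str <- function(data, agevar = "age", start = 4, end = 100) {
  data %>%
    # .data[[agevar]] selects the column whose name is stored in agevar
    mutate(check_age = .data[[agevar]] >= start & .data[[agevar]] <= end)
}

# Small example using the age values from the question
ages <- data.frame(age = c(22, 36, 20, 27, 13, 9))
check_age_str(ages, agevar = "age", start = 10, end = 30)
```

With these bounds, only the rows with age 36 and age 9 fall outside the range and get check_age = FALSE.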

Solution 2:[2]

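I think the problem is in your ifelse statements: the second check_age = ... overwrites the first, so only the age < 4 test survives. Combine both conditions with | instead. Try this: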

library(dplyr)

check_age <- function(.df, agevar = "age"){
  #' Function should check if the age value is within a certain range
  #' and create a new variable "check_age" giving the result of the check
  
  # .data[[agevar]] looks up the column named by the string in agevar
  .df %>% mutate(check_age = ifelse(.data[[agevar]] > 100 | .data[[agevar]] < 4, FALSE, TRUE))

}

check_sameAnswers <- function(.df, varname = "Q1_"){
  #' Function should check whether all sub-items of a question (e.g. Q1_1 to Q1_5) have the
  #' same values and create a new variable "check_sameAnswers" giving the result of the check.
  #' It should be TRUE if Q1_1, Q1_2, ... all have the value 5 for example, otherwise FALSE
  .df %>%
    rowwise() %>%   # evaluate the check within each row, not down a whole column
    mutate(check_sameAnswers = length(unique(c_across(starts_with(varname)))) == 1) %>%
    ungroup()
}


# Apply checker functions to dataframe in "dplyr-style"
dat <- dat %>% 
  check_age(agevar = "age") %>%
  check_sameAnswers(varname = "Q1_")
dat
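
The same-answers check has to compare values within each row. rowwise() is one way to do that, but it can be slow on large dataframes. Below is a minimal sketch of a vectorized alternative, assuming the sub-item columns share one type and at least one column matches the prefix. The helper name check_same_vectorized and the toy data are hypothetical and only serve as illustration.

```r
library(dplyr)

# Hypothetical helper: flags rows where all sub-items hold the same value,
# without using rowwise().
check_same_vectorized <- function(.df, varname = "Q1_") {
  # Matrix of the sub-item columns whose names start with the prefix
  items <- as.matrix(select(.df, starts_with(varname)))
  # Per row, count how many sub-items differ from the first sub-item
  n_diff <- rowSums(items != items[, 1])
  mutate(.df, check_sameAnswers = n_diff == 0)
}

# Toy data (assumed for illustration): only the second row is constant
toy <- data.frame(Q1_1 = c(1, 5, 2),
                  Q1_2 = c(1, 5, 3),
                  Q1_3 = c(4, 5, 2),
                  age  = c(22, 27, 13))
check_same_vectorized(toy)
```

Missing values would propagate as NA in this sketch, so rows with NA answers need separate handling if they can occur.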

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Andre Wildberg
Solution 2