Problem with `replace_na()` from the tidyr package

I wrote a function that has five arguments to calculate random numbers from a normal distribution. It has two steps:

  1. replace NA with 0 in tibble column
  2. replace 0 with a random number

My problems are:

  1. line three doesn't replace NA value with 0
  2. line five doesn't replace 0 with a random number

I get this error:

! Must subset columns with a valid subscript vector.
x Subscript `col` has the wrong type `function`.
 It must be logical, numeric, or character.

Here is my code:

whithout=function(col,min,max,mean,sd){
  for(i in 1:4267){
      continuous_dataset=continuous_dataset %>% replace_na(continuous_dataset[,col]=0)
      if(is.na(continuous_dataset[,col])){
         continuous_dataset[i,col]=round(rtruncnorm(1,min,max,mean,sd))    
    }
  }
}


Solution 1:[1]

There's no need to write a function that loops across both columns and observations.

I assume you have no zeroes in your dataset to begin with, in which case I can skip replacing NA with 0 and go straight to generating the replacement value.
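For reference, if the intermediate NA-to-0 step were wanted after all, a minimal sketch of how `replace_na()` is normally called on a data frame might look like the following. The tibble `toy` and its column `age` are made up for illustration and are not taken from the question's `continuous_dataset`.

```r
library(tidyr)
library(tibble)

# Hypothetical stand-in for one column of the question's dataset
toy <- tibble(age = c(21, NA, 35, NA))

# For a data frame, the second argument is a named list:
# column name on the left, fill value on the right
filled <- replace_na(toy, list(age = 0))

# When the column name arrives as a string (as `col` would inside a function),
# the named list can be built programmatically
col <- "age"
filled_by_name <- replace_na(toy, setNames(list(0), col))
```

Both calls return a new tibble rather than modifying `toy` in place, so the result has to be assigned to something.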

My solution is based on the tidyverse.

First, generate some test data.

library(tidyverse)

set.seed(123)
df <- tibble(x=runif(5), y=runif(5), z=runif(5))
df$x[3] <- NA
df$y[4] <- NA
df$z[5] <- NA
df
# A tibble: 5 × 3
       x       y      z
   <dbl>   <dbl>  <dbl>
1  0.288  0.0456  0.957
2  0.788  0.528   0.453
3 NA      0.892   0.678
4  0.883 NA       0.573
5  0.940  0.457  NA    

Now solve the problem.

df %>% 
  mutate(
    across(
      everything(), 
      function(.x, mean, sd) ifelse(is.na(.x), rnorm(length(.x), mean, sd), .x), 
      mean=500, 
      sd=100
    )
  )
# A tibble: 5 × 3
        x        y       z
    <dbl>    <dbl>   <dbl>
1   0.288   0.0456   0.957
2   0.788   0.528    0.453
3 669.      0.892    0.678
4   0.883 629.       0.573
5   0.940   0.457  467.   

By avoiding looping through columns and rows, the code is more compact, more robust and (though I've not tested) faster.
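One caveat: passing extra arguments such as `mean` and `sd` through the `...` of `across()` is deprecated as of dplyr 1.1.0. A sketch of the same idea that puts the parameters inside an anonymous function instead, and so should run without that deprecation warning, is shown here. It rebuilds the same test data so that it can be run on its own, and the name `filled` is only illustrative.

```r
library(dplyr)

# Same test data as above
set.seed(123)
df <- tibble(x = runif(5), y = runif(5), z = runif(5))
df$x[3] <- NA
df$y[4] <- NA
df$z[5] <- NA

filled <- df %>%
  mutate(
    across(
      everything(),
      # mean and sd are fixed inside the function rather than passed via `...`
      function(v) ifelse(is.na(v), rnorm(length(v), mean = 500, sd = 100), v)
    )
  )
```

Here `rnorm()` draws one candidate value per row, and `ifelse()` keeps a candidate only where the original value is NA.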

If you don't want to process every column, simply replace `everything()` with a vector of the columns that you do want to process. For example:

df %>% 
  mutate(
    across(
      c(x, y), 
      function(.x, mean, sd) .x <- ifelse(is.na(.x), rnorm(length(.x), mean, sd), .x), 
      mean=500, 
      sd=100
    )
  )
# A tibble: 5 × 3
        x        y      z
    <dbl>    <dbl>  <dbl>
1   0.288   0.0456  0.957
2   0.788   0.528   0.453
3 669.      0.892   0.678
4   0.883 629.      0.573
5   0.940   0.457  NA    
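The question draws from a truncated normal and rounds the result, whereas the examples here use plain `rnorm()`. A rough sketch of how the same pattern could be wrapped in a function with the question's five arguments follows. It assumes the truncnorm package is installed and that columns are given as a character vector; the helper name `fill_na_truncnorm` and the tibble `toy` are hypothetical.

```r
library(dplyr)
library(truncnorm)

# Hypothetical helper, not part of the original question or answer
fill_na_truncnorm <- function(data, cols, min, max, mean, sd) {
  # Defined outside mutate() so min, max, mean and sd always refer to
  # the arguments above, never to columns of `data`
  fill_one <- function(v) {
    # One rounded candidate per row, kept only where v is NA
    draws <- round(rtruncnorm(length(v), a = min, b = max, mean = mean, sd = sd))
    ifelse(is.na(v), draws, v)
  }
  data %>% mutate(across(all_of(cols), fill_one))
}

# Usage on made-up data
set.seed(123)
toy <- tibble(x = c(1, NA, 3), y = c(NA, 5, NA))
filled <- fill_na_truncnorm(toy, c("x", "y"), min = 0, max = 10, mean = 5, sd = 2)
```

Unlike the loop in the question, the function returns the modified data, so the caller has to assign the result.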

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Limey