'replacing missing values in r

I need help in replacing missing values in the following dummy file. The following rule need to be followed when replacing a missing value.

  1. If the value is the same on both sides of the column where the cell has a missing value, the missing value should be replaced with the value on either side.

  2. If the value is the same on both sides of a column where two adjacent cells have a missing value, the missing value should be replaced with the value on either side.

  3. If the value same on both sides of the column where 3, 4 or more adjacent cells have missing value, the missing value should be replaced with the value on either side of it

  4. If the value in 2007 Colum missing, then it should be replaced with the value of 2008 and 2009 if they are the same

  5. If a value in the 2017 Column is missing, it should be replaced with the values from 2016 and 2015, if they are the same.

  6. If the value is not the same on both sides of the column containing the missing value, the missing value should be replaced with the most frequently occurring value between 2007 and 2017 columns.

  7. If 2007 and 2008 missing, replace both missing value with 2009 if 2009==2010==2011

  8. If 2007, 2008 and 2009 missing, replace all three-missing value with 2010 if 2010==2011==2012

  9. If 2007, 2008, 2009 and 2010 missing, replace all four-missing value with 2011 if 2011==2012==2013

  10. If 2017 and 2016 missing, replace both missing value with 2015 if 2015==2014==2013

  11. If 2017, 2016 and 2015 missing, replace all three-missing value with 2014 if 2014==2013==2012

  12. If 2017, 2016, 2015 and 2014 missing, replace all four-missing value with 2013 if 2013==2012==2011

  13. create new variable of count of unique value during 2007 and 2017 for every case

dummy data is below

dput(gb)
structure(list(ID = 1:20, X2007 = c("a1", "v1", "", "e1", "d1", 
"g1", "t1", "b2w", "p1", "q1", "sd1", "fr4", "fr6", "gt7", "", 
"ju8", "ki9", "lo9", "", "i88"), X2008 = c("a1", "v1", "c1", 
"e1", "d1", "", "t1", "b2w", "", "", "", "", "", "", "", "", 
"", "", "", ""), X2009 = c("a1", "", "c1", "", "", "d1", "t1", 
"", "p1", "", "sd1", "", "fr6", "", "hj7", "ju8", "ki9", "lo9", 
"k99", "i88"), X2010 = c("a1", "", "", "e1", "", "d1", "", "", 
"p1", "", "sd1", "", "fr6", "gt7", "hj7", "", "ki9", "", "k99", 
""), X2011 = c("", "v1", "", "", "", "d1", "", "b2w", "p1", "q1", 
"sd1", "", "fr6", "gt7", "hj7", "", "ki9", "", "k99", ""), X2012 = c("a1", 
"v1", "c1", "e1", "", "", "", "b2w", "p1", "q1", "sd1", "", "fr6", 
"gt7", "hj7", "ju8", "ki9", "lo9", "k99", ""), X2013 = c("b1 ", 
"", "c1", "e1", "d1", "", "t1", "", "p1", "q1", "sd1", "fr4", 
"fr6", "gt7", "hj7", "ju8", "ki9", "lo9", "k99", ""), X2014 = c("", 
"v1", "", "", "d1", "g1", "t1", "", "", "q1", "", "fr4", "", 
"gt7", "", "ju8", "", "lo9", "", "i88"), X2015 = c("b3", "b6", 
"", "", "d1", "g1", "t1", "", "", "q1", "", "fr4", "", "", "", 
"ju8", "", "lo9", "", "i88"), X2016 = c("b4", "b6", "", "", "d1", 
"g1", "t1", "b2w", "", "", "", "fr4", "", "", "", "", "", "lo9", 
"", "i88"), X2017 = c("b5", "b6", "c1", "e1", "d1", "g1", "", 
"", "", "", "", "fr4", "", "", "", "", "", "lo9", "", "i88")), class = "data.frame", row.names = c(NA, 
-20L))


Solution 1:[1]

Here is a possible approach:

  1. Pivot longer
  2. change "" to NA, and trim white space
  3. group by ID
  4. Within each group:
  • use zoo::locf, get the "adjacent" values for any sequence of consecutive missing
  • get the most common value (I create a little helper function for this)
  • apply the rules using case_when()
  1. Pivot back to wide
most_common <- function(v) {
  tv = table(v[!is.na(v)])
  names(tv)[which.max(tv)]
}
gb %>% 
  pivot_longer(-ID,names_prefix = "X",names_transform = as.integer, names_to="year") %>% 
  mutate(value=if_else(value=="", as.character(NA), trimws(value))) %>% 
  group_by(ID) %>% 
  mutate(
    prevv=zoo::na.locf(value, na.rm=F),
    nextv=zoo::na.locf(value, na.rm=F, fromLast=T),
    mostf = most_common(value),
    # Now replace the NA
    nvalue = case_when(
      !is.na(value)~value, 
      prevv==nextv~prevv,
      is.na(value) & year==2007 & lead(value,1) == lead(value,2)~lead(value,1),
      is.na(value) & year==2017 & lag(value,1) == lag(value,2)~lag(value,1),
      TRUE~mostf
    )) %>% 
  pivot_wider(ID, names_from=year,names_prefix = "X", values_from=nvalue)

Output:

      ID X2007 X2008 X2009 X2010 X2011 X2012 X2013 X2014 X2015 X2016 X2017
   <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
 1     1 a1    a1    a1    a1    a1    a1    b1    a1    b3    b4    b5   
 2     2 v1    v1    v1    v1    v1    v1    v1    v1    b6    b6    b6   
 3     3 c1    c1    c1    c1    c1    c1    c1    c1    c1    c1    c1   
 4     4 e1    e1    e1    e1    e1    e1    e1    e1    e1    e1    e1   
 5     5 d1    d1    d1    d1    d1    d1    d1    d1    d1    d1    d1   
 6     6 g1    g1    d1    d1    d1    g1    g1    g1    g1    g1    g1   
 7     7 t1    t1    t1    t1    t1    t1    t1    t1    t1    t1    t1   
 8     8 b2w   b2w   b2w   b2w   b2w   b2w   b2w   b2w   b2w   b2w   b2w  
 9     9 p1    p1    p1    p1    p1    p1    p1    p1    p1    p1    p1   
10    10 q1    q1    q1    q1    q1    q1    q1    q1    q1    q1    q1   
11    11 sd1   sd1   sd1   sd1   sd1   sd1   sd1   sd1   sd1   sd1   sd1  
12    12 fr4   fr4   fr4   fr4   fr4   fr4   fr4   fr4   fr4   fr4   fr4  
13    13 fr6   fr6   fr6   fr6   fr6   fr6   fr6   fr6   fr6   fr6   fr6  
14    14 gt7   gt7   gt7   gt7   gt7   gt7   gt7   gt7   gt7   gt7   gt7  
15    15 hj7   hj7   hj7   hj7   hj7   hj7   hj7   hj7   hj7   hj7   hj7  
16    16 ju8   ju8   ju8   ju8   ju8   ju8   ju8   ju8   ju8   ju8   ju8  
17    17 ki9   ki9   ki9   ki9   ki9   ki9   ki9   ki9   ki9   ki9   ki9  
18    18 lo9   lo9   lo9   lo9   lo9   lo9   lo9   lo9   lo9   lo9   lo9  
19    19 k99   k99   k99   k99   k99   k99   k99   k99   k99   k99   k99  
20    20 i88   i88   i88   i88   i88   i88   i88   i88   i88   i88   i88  

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 langtang