'How to replace NA values with average of precedent and following values, in R

I currently have a dataset that has more or less the following characteristics:

Country <- rep(c("Honduras", "Belize"),each=6)
Year <- rep(c(2010,2011,2012,2014,2015,2016),2)
Observation <- c(2, 5,NA, NA,2,3,NA, NA,2,3,1,NA)
df <- data.frame(Country, Year, Observation)

What I would like to do is find a command/write a function that fills only the NAs for each country with:

  1. if NA Observation is for the first year (2010) fills it with the next non-NA Observation;

  2. if NA Observation is for the last year (2014) fills it with the previous available period's Observation.

3.1 if NA Observation is for years between the first and last fills is with the average of the 2 closest periods.

3.2 However, if there are 2 or more consecutive NAs, (let's take 2 as an example) first fill the first with the preceding Observation and the second with the same method as (3.1)

As an illustration, the previous dataset should finally be:

Observation2 <- c(2, 5, 5, 3.5 ,2,3,2, 2,2,3,1,1)
df2 <- data.frame(Country, Year, Observation2)

I hope I was sufficiently clear. It is very specific but I hope someone can help.

Feel free to ask any questions about it if you do not understand.



Solution 1:[1]

Input. There is some question of whether alternation of country names as mentioned in the comments under the question and shown in the Note at the end was intended but at any rate assume that each subsequence of increasing years is a separate group and group by them, grp. (If it was intended that the first 6 entries in Country be Honduras the last 6 be Belize then we could replace the group_by(...) with group_by(Country) in the code below.)

Clarification of Question. We assume that the question is asking that within group:

  1. Leading NAs are to be replaced with the first non-NA.
  2. Trailing NAs are to be replaced with the last non-NA.
  3. If there is one consecutive NA surrounded by non-NAs it is replaced by the prior non-NA.
  4. If there are two consecutive NA's then the first is replaced with the prior non-NA and the second is filled in with the average of the prior non-NA and next non-NA.
  5. The question does not address the situation of 3+ consecutive NAs so maybe this never occurs but just in case it does what the code should do is fill in the first NA with the prior non-NA and the remainder should be filled in using linear interpolation.

Code. Now for each group, replace any NA with the prior value. Then use linear interpolation on what is left via na.approx using rule=2 to extend the ends. Finally only keep desired columns.

dplyr clashes. Note that lag and filter in dplyr collide in an incompatible way with the functions of the same name in base R so we exclude them and use dplyr:: prefix if we want to access them.

library(dplyr, exclude = c("lag", "filter"))
library(zoo)

df2 <- df %>%
  # group_by(Country) %>%
  group_by(grp = cumsum(c(TRUE, diff(Year) < 0))) %>%
  mutate(Observation2 = coalesce(Observation, dplyr::lag(Observation)) %>%
                          na.approx(rule = 2)) %>%
  ungroup %>%
  select(Country, Year, Observation2)

identical(df2$Observation2, Observation2)
## [1] TRUE

Note

We used this input taken from the question.

Country <- rep(c("Honduras", "Belize"),6)
Year <- rep(c(2010,2011,2012,2014,2015,2016),2)
Observation <- c(2, 5,NA, NA,2,3,NA, NA,2,3,1,NA)
df <- data.frame(Country, Year, Observation)

df

giving:

    Country Year Observation
1  Honduras 2010           2
2    Belize 2011           5
3  Honduras 2012          NA
4    Belize 2014          NA
5  Honduras 2015           2
6    Belize 2016           3
7  Honduras 2010          NA
8    Belize 2011          NA
9  Honduras 2012           2
10   Belize 2014           3
11 Honduras 2015           1
12   Belize 2016          NA

Added

In a comment the poster added another example. We run it here. This is the same code incorporating the simplification to group_by discussed in the first paragraph above. (That does not change the result.)

Country <- rep(c("Honduras", "Belize"),each=6) 
Year <- rep(c(2010,2011,2012,2014,2015,2016),2) 
Observation <- c(2, 5, NA, NA,2,3, NA, NA,2, NA,1,NA)
df <- data.frame(Country, Year, Observation)

df2 <- df %>%
  group_by(Country) %>%
  mutate(Observation2 = coalesce(Observation, dplyr::lag(Observation)) %>%
                          na.approx(rule = 2)) %>%
  ungroup %>%
  select(Country, Year, Observation2)

df2

giving:

# A tibble: 12 x 3
   Country   Year Observation2
   <chr>    <dbl>        <dbl>
 1 Honduras  2010          2  
 2 Honduras  2011          5  
 3 Honduras  2012          5  
 4 Honduras  2014          3.5
 5 Honduras  2015          2  
 6 Honduras  2016          3  
 7 Belize    2010          2  
 8 Belize    2011          2  
 9 Belize    2012          2  
10 Belize    2014          2  
11 Belize    2015          1  
12 Belize    2016          1  

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1