'How to replace NA values with average of precedent and following values, in R
I currently have a dataset that has more or less the following characteristics:
Country <- rep(c("Honduras", "Belize"),each=6)
Year <- rep(c(2010,2011,2012,2014,2015,2016),2)
Observation <- c(2, 5,NA, NA,2,3,NA, NA,2,3,1,NA)
df <- data.frame(Country, Year, Observation)
What I would like to do is find a command/write a function that fills only the NAs for each country with:
if NA Observation is for the first year (2010) fills it with the next non-NA Observation;
if NA Observation is for the last year (2014) fills it with the previous available period's Observation.
3.1 if NA Observation is for years between the first and last fills is with the average of the 2 closest periods.
3.2 However, if there are 2 or more consecutive NAs, (let's take 2 as an example) first fill the first with the preceding Observation and the second with the same method as (3.1)
As an illustration, the previous dataset should finally be:
Observation2 <- c(2, 5, 5, 3.5 ,2,3,2, 2,2,3,1,1)
df2 <- data.frame(Country, Year, Observation2)
I hope I was sufficiently clear. It is very specific but I hope someone can help.
Feel free to ask any questions about it if you do not understand.
Solution 1:[1]
Input. There is some question of whether alternation of country names as mentioned in the comments under the question and shown in the Note at the end was intended but at any rate assume that each subsequence of increasing years is a separate group and group by them, grp. (If it was intended that the first 6 entries in Country be Honduras the last 6 be Belize then we could replace the group_by(...) with group_by(Country) in the code below.)
Clarification of Question. We assume that the question is asking that within group:
- Leading NAs are to be replaced with the first non-NA.
- Trailing NAs are to be replaced with the last non-NA.
- If there is one consecutive NA surrounded by non-NAs it is replaced by the prior non-NA.
- If there are two consecutive NA's then the first is replaced with the prior non-NA and the second is filled in with the average of the prior non-NA and next non-NA.
- The question does not address the situation of 3+ consecutive NAs so maybe this never occurs but just in case it does what the code should do is fill in the first NA with the prior non-NA and the remainder should be filled in using linear interpolation.
Code. Now for each group, replace any NA with the prior value. Then use linear interpolation on what is left via na.approx using rule=2 to extend the ends. Finally only keep desired columns.
dplyr clashes. Note that lag and filter in dplyr collide in an incompatible way with the functions of the same name in base R so we exclude them and use dplyr:: prefix if we want to access them.
library(dplyr, exclude = c("lag", "filter"))
library(zoo)
df2 <- df %>%
# group_by(Country) %>%
group_by(grp = cumsum(c(TRUE, diff(Year) < 0))) %>%
mutate(Observation2 = coalesce(Observation, dplyr::lag(Observation)) %>%
na.approx(rule = 2)) %>%
ungroup %>%
select(Country, Year, Observation2)
identical(df2$Observation2, Observation2)
## [1] TRUE
Note
We used this input taken from the question.
Country <- rep(c("Honduras", "Belize"),6)
Year <- rep(c(2010,2011,2012,2014,2015,2016),2)
Observation <- c(2, 5,NA, NA,2,3,NA, NA,2,3,1,NA)
df <- data.frame(Country, Year, Observation)
df
giving:
Country Year Observation
1 Honduras 2010 2
2 Belize 2011 5
3 Honduras 2012 NA
4 Belize 2014 NA
5 Honduras 2015 2
6 Belize 2016 3
7 Honduras 2010 NA
8 Belize 2011 NA
9 Honduras 2012 2
10 Belize 2014 3
11 Honduras 2015 1
12 Belize 2016 NA
Added
In a comment the poster added another example. We run it here. This is the same code incorporating the simplification to group_by discussed in the first paragraph above. (That does not change the result.)
Country <- rep(c("Honduras", "Belize"),each=6)
Year <- rep(c(2010,2011,2012,2014,2015,2016),2)
Observation <- c(2, 5, NA, NA,2,3, NA, NA,2, NA,1,NA)
df <- data.frame(Country, Year, Observation)
df2 <- df %>%
group_by(Country) %>%
mutate(Observation2 = coalesce(Observation, dplyr::lag(Observation)) %>%
na.approx(rule = 2)) %>%
ungroup %>%
select(Country, Year, Observation2)
df2
giving:
# A tibble: 12 x 3
Country Year Observation2
<chr> <dbl> <dbl>
1 Honduras 2010 2
2 Honduras 2011 5
3 Honduras 2012 5
4 Honduras 2014 3.5
5 Honduras 2015 2
6 Honduras 2016 3
7 Belize 2010 2
8 Belize 2011 2
9 Belize 2012 2
10 Belize 2014 2
11 Belize 2015 1
12 Belize 2016 1
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
