'Extracting the Latest Available Year Data
I have a fairly large annual dataset in long format and with very large amount of missing values. I am trying to extract the data points for each column for the latest available year.
Input:
| ID | Year | x | y |
|---|---|---|---|
| 1 | 2017 | 1 | NA |
| 1 | 2018 | NA | NA |
| 1 | 2019 | 3 | NA |
| 1 | 2020 | NA | c |
data.frame(ID=c(1,1,1,1),
Year =c(2017, 2018, 2019, 2020),
x=c(1, NA, 3, NA),
y=c(NA, NA, NA, "c")
)
Output:
| ID | x | y |
|---|---|---|
| 1 | 3 | c |
data.frame(ID=1,
x=3,
y="c"
)
Many thanks in advance for your help.
Solution 1:[1]
1) Assuming the rows are sorted by year within ID which is the case in the question's example -- if not sorted then sort it first using arrange(ID, Year) -- remove Year, group by ID, fill in each remaining column and take the last row of the group.
library(dplyr, exclude = c("lag", "filter"))
library(tidyr)
DF %>%
select(-Year) %>%
group_by(ID) %>%
fill %>%
slice_tail %>%
ungroup
giving:
# A tibble: 1 x 3
ID x y
<dbl> <dbl> <chr>
1 1 3 c
2) na.locf0 from zoo would also work and gives the same result.
library(dplyr, exclude = c("lag", "filter"))
library(zoo)
DF %>%
select(-Year) %>%
group_by(ID) %>%
mutate(across(.fns = na.locf0)) %>%
slice_tail %>%
ungroup
Solution 2:[2]
You can try:
library(dplyr)
library(tidyr)
dfx %>%
pivot_longer(-c(ID, Year),
values_transform = list(value = as.character)) %>%
na.omit() %>%
group_by(ID, name) %>%
filter(Year == max(Year)) %>%
dplyr::select(-Year) %>%
pivot_wider(values_from = value, names_from = name)
# # A tibble: 1 x 3
# ID x y
# <dbl> <chr> <chr>
# 1 1 3 c
Solution 3:[3]
You may want to address a few things to your question to generate an appropriate response.
Separate the logic of the question into a Reprex. The question is a little unclear in how you want to get your output without selecting them manually.
Show/explain ways that you have attempted the problem as well so people don't waste their time or feel like you haven't tried. From what you have there i'll give it a go to see if anything helps you.
df <- data.frame(ID=c(1,1,1,1),
Year =c(2017, 2018, 2019, 2020),
x=c(1, NA, 3, NA),
y=c(NA, NA, NA, "c")
)
# Remove year like in example?
df <- df %>%
select(., -Year) %>%
filter(, !is.na(y))
# Get values you want?
> df.x <- df %>%
select(x) %>%
na.omit() %>%
as.double()
# Put together
df[2] <- df.x
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | AlexB |
| Solution 3 |
