'Create new variable in from subset of existing columns and if value present create duplicate rows in R
I have a list of columns where three of them relate to year, which identifies the occurrence of that data being collected for that year (1 or -1 means to be collected). I would like to create / manipulate the dataframe to have 'Year' as a new variable, which would replicate the other columns if it is to be collected for multiple years. What is the most efficient way to do this using dplyr / tidyverse?
I am aware of perhaps using pivot_longer but I assume I made need also to use case_when or similar. Any help appreciated.
Programme <- c(1, 2, 2, 3, 4)
Function <- c("Inv", "Inv", "Mac", "Inv", "Inv")
Year2020 <- c(1, 1, 1, -1, 1)
Year2021 <- c(1, 1, *NA*, *NA*, *NA*)
Year2022 <- c(*NA*, *NA*, *NA*, -1, -1)
df <- data.frame(Programme, Function, Year2020, Year2021, Year2022)
So what I am trying to produce is this:
Year <- c(2020, 2021, 2020, 2021, 2020, 2020, 2022, 2020, 2022)
Programme <- c(1, 1, 2, 2, 2, 3, 3, 4, 4)
Function <- c("Inv", "Inv", "Inv", "Inv", "Mac", "Inv", "Inv", "Inv", "Inv")
df <- data.frame(Year, Programme, Function)
Solution 1:[1]
Update see comments Darren Tsai:"Adding names_transform = list(Year = parse_number) into pivot_longer() can save you the mutate line. Besides, in tidyr there is a more flexible function than na.omit() to treat NA, i.e. drop_na()"
library(tidyr)
library(dplyr)
df %>%
pivot_longer(
cols = starts_with("Year"),
names_to = "Year",
names_transform = list(Year = parse_number)
) %>%
drop_na() %>%
dplyr::select(Year, Programme, Function)
Alternative pivot method using names_pattern and regex "(\\D+)(\\d+)":
library(tidyr)
library(dplyr)
df %>%
pivot_longer(
cols = starts_with("Year"),
names_to = c(".value", "Year1"),
names_pattern = "(\\D+)(\\d+)"
) %>%
na.omit() %>%
dplyr::select(Year = Year1, Programme, Function)
OR using parse_number from readr package:
library(tidyverse)
df %>%
pivot_longer(
cols = starts_with("Year"),
names_to = "Year"
) %>%
mutate(Year = parse_number(Year)) %>%
na.omit() %>%
dplyr::select(Year, Programme, Function)
Year Programme Function
<chr> <dbl> <chr>
1 2020 1 Inv
2 2021 1 Inv
3 2020 2 Inv
4 2021 2 Inv
5 2020 2 Mac
6 2020 3 Inv
7 2022 3 Inv
8 2020 4 Inv
9 2022 4 Inv
Solution 2:[2]
Using dplyr, tidyr and stringr packages:
library(tidyverse)
df |>
pivot_longer(3:5,
names_to = "Year") |>
filter(value == 1 | value == -1) |>
mutate(Year = str_remove(Year,"^Year")) |>
select(Year, Programme, Function)
Output:
# A tibble: 9 x 3
Year Programme Function
<chr> <dbl> <chr>
1 2020 1 Inv
2 2021 1 Inv
3 2020 2 Inv
4 2021 2 Inv
5 2020 2 Mac
6 2020 3 Inv
7 2022 3 Inv
8 2020 4 Inv
9 2022 4 Inv
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | MonJeanJean |


