'Create new column based on presence/absence of string in other column by group
I have this dataset about vessels locations, where the same "id" can correspond to two levels. Corresponds to a defined category, such as "fishing" and may also appear as "unspecified". I would like to create a new column, where whenever an "id" appears as "unspecified" and also as another category (in other rows), "unspecified" is replaced by that category.
#dataset example
library(dplyr)
levels <- c("passenger", "passenger", "unspecified", "passenger", "passenger",
"passenger", "passenger", "passenger", "passenger", "passenger",
"unspecified", "passenger", "fishing", "unspecified", "fishing",
"fishing", "fishing","unspecified", "fishing", "fishing",
"unspecified","fishing", "fishing", "fishing", "unspecified",
"unspecified", "unspecified")
id <- c("844", "844", "844", "844", "844","844", "844", "844", "844", "844",
"844", "844", "845", "845", "845", "845", "845","845", "845", "845",
"845","845", "845", "845", "825", "825", "825")
lat <- c(-30.6456, -29.5648, -27.6667, -31.5587, -30.6934, -29.3147, -23.0538,
-26.5877, -26.6923, -23.40865, -23.1143, -23.28331, -31.6456, -24.5648,
-27.6867, -31.4587, -30.6784, -28.3447, -23.0466, -27.5877, -26.8524,
-23.8855, -24.1143, -23.5874, -23.5259, -22.8788, -22.1324)
long <- c(-50.4879, -49.8715, -51.8716, -50.4456, -50.9842, -51.9787, -41.2343,
-40.2859, -40.19599, -41.64302, -41.58042, -41.55057, -50.4576, -48.8715,
-51.4566, -51.4456, -50.4477, -50.9937, -41.4789, -41.3859, -40.2536,
-41.6502, -40.5442, -41.4057, -40.4058, -42.4877, -41.4545)
df <- tibble(levels = as.factor(levels), id = as.factor(id), lat, long)
Here is my expected output:
> output %>% print(n = 27)
# A tibble: 27 x 5
levels id lat long new_colum
<fct> <fct> <dbl> <dbl> <fct>
1 passenger 844 -30.6 -50.5 passenger
2 passenger 844 -29.6 -49.9 passenger
3 unspecified 844 -27.7 -51.9 passenger
4 passenger 844 -31.6 -50.4 passenger
5 passenger 844 -30.7 -51.0 passenger
6 passenger 844 -29.3 -52.0 passenger
7 passenger 844 -23.1 -41.2 passenger
8 passenger 844 -26.6 -40.3 passenger
9 passenger 844 -26.7 -40.2 passenger
10 passenger 844 -23.4 -41.6 passenger
11 unspecified 844 -23.1 -41.6 passenger
12 passenger 844 -23.3 -41.6 passenger
13 fishing 845 -31.6 -50.5 fishing
14 unspecified 845 -24.6 -48.9 fishing
15 fishing 845 -27.7 -51.5 fishing
16 fishing 845 -31.5 -51.4 fishing
17 fishing 845 -30.7 -50.4 fishing
18 unspecified 845 -28.3 -51.0 fishing
19 fishing 845 -23.0 -41.5 fishing
20 fishing 845 -27.6 -41.4 fishing
21 unspecified 845 -26.9 -40.3 fishing
22 fishing 845 -23.9 -41.7 fishing
23 fishing 845 -24.1 -40.5 fishing
24 fishing 845 -23.6 -41.4 fishing
25 unspecified 825 -23.5 -40.4 unspecified
26 unspecified 825 -22.9 -42.5 unspecified
27 unspecified 825 -22.1 -41.5 unspecified
It's for when the same id is unspecified but it's also belongs some other level. When it's just unspecified it stays that way, when it also belongs to other level replaces unspecified with this one.
Solution 1:[1]
You can replace "unspecified" with NA, fill these NA with the previous value in the same id group, and then replace the remaining NA back to "unspecified".
library(tidyverse)
df %>%
mutate(new_colum = na_if(levels, "unspecified")) %>%
group_by(id) %>%
fill(new_colum) %>%
ungroup() %>%
replace_na(list(new_colum = "unspecified"))
# A tibble: 27 × 5
levels id lat long new_colum
<fct> <fct> <dbl> <dbl> <fct>
1 passenger 844 -30.6 -50.5 passenger
2 passenger 844 -29.6 -49.9 passenger
3 unspecified 844 -27.7 -51.9 passenger
4 passenger 844 -31.6 -50.4 passenger
5 passenger 844 -30.7 -51.0 passenger
6 passenger 844 -29.3 -52.0 passenger
7 passenger 844 -23.1 -41.2 passenger
8 passenger 844 -26.6 -40.3 passenger
9 passenger 844 -26.7 -40.2 passenger
10 passenger 844 -23.4 -41.6 passenger
11 unspecified 844 -23.1 -41.6 passenger
12 passenger 844 -23.3 -41.6 passenger
13 fishing 845 -31.6 -50.5 fishing
14 unspecified 845 -24.6 -48.9 fishing
15 fishing 845 -27.7 -51.5 fishing
16 fishing 845 -31.5 -51.4 fishing
17 fishing 845 -30.7 -50.4 fishing
18 unspecified 845 -28.3 -51.0 fishing
19 fishing 845 -23.0 -41.5 fishing
20 fishing 845 -27.6 -41.4 fishing
21 unspecified 845 -26.9 -40.3 fishing
22 fishing 845 -23.9 -41.7 fishing
23 fishing 845 -24.1 -40.5 fishing
24 fishing 845 -23.6 -41.4 fishing
25 unspecified 825 -23.5 -40.4 unspecified
26 unspecified 825 -22.9 -42.5 unspecified
27 unspecified 825 -22.1 -41.5 unspecified
Solution 2:[2]
You can use an ifelse statement to specify condition where there is only one type of levels AND at the same time that levels equals to "unspecified", do not change these records. Otherwise, change it to another levels that is not "unspecified".
library(dplyr)
df %>%
group_by(id) %>%
mutate(new_column = ifelse(n_distinct(levels) == 1 & levels == "unspecified",
as.character(levels),
as.character(levels)[levels != "unspecified"]))
# A tibble: 27 × 5
# Groups: id [3]
levels id lat long new_column
<fct> <fct> <dbl> <dbl> <chr>
1 passenger 844 -30.6 -50.5 passenger
2 passenger 844 -29.6 -49.9 passenger
3 unspecified 844 -27.7 -51.9 passenger
4 passenger 844 -31.6 -50.4 passenger
5 passenger 844 -30.7 -51.0 passenger
6 passenger 844 -29.3 -52.0 passenger
7 passenger 844 -23.1 -41.2 passenger
8 passenger 844 -26.6 -40.3 passenger
9 passenger 844 -26.7 -40.2 passenger
10 passenger 844 -23.4 -41.6 passenger
11 unspecified 844 -23.1 -41.6 passenger
12 passenger 844 -23.3 -41.6 passenger
13 fishing 845 -31.6 -50.5 fishing
14 unspecified 845 -24.6 -48.9 fishing
15 fishing 845 -27.7 -51.5 fishing
16 fishing 845 -31.5 -51.4 fishing
17 fishing 845 -30.7 -50.4 fishing
18 unspecified 845 -28.3 -51.0 fishing
19 fishing 845 -23.0 -41.5 fishing
20 fishing 845 -27.6 -41.4 fishing
21 unspecified 845 -26.9 -40.3 fishing
22 fishing 845 -23.9 -41.7 fishing
23 fishing 845 -24.1 -40.5 fishing
24 fishing 845 -23.6 -41.4 fishing
25 unspecified 825 -23.5 -40.4 unspecified
26 unspecified 825 -22.9 -42.5 unspecified
27 unspecified 825 -22.1 -41.5 unspecified
Solution 3:[3]
I believe this is the simplest/most intuitive solution:
The case_when statement checks each group - if all levels are "unspecified", then new_column becomes "unspecified", otherwise if it contains either "fishing" or "passenger", it becomes one of those.
library(dplyr)
df %>%
group_by(id) %>%
mutate(new_column = case_when(
all(levels == "unspecified") ~ "unspecified",
any(levels == "fishing") ~ "fishing",
any(levels == "passenger") ~ "passenger"
)
)
# A tibble: 27 × 5
# Groups: id [3]
levels id lat long new_column
<fct> <fct> <dbl> <dbl> <chr>
1 passenger 844 -30.6 -50.5 passenger
2 passenger 844 -29.6 -49.9 passenger
3 unspecified 844 -27.7 -51.9 passenger
4 passenger 844 -31.6 -50.4 passenger
5 passenger 844 -30.7 -51.0 passenger
6 passenger 844 -29.3 -52.0 passenger
7 passenger 844 -23.1 -41.2 passenger
8 passenger 844 -26.6 -40.3 passenger
9 passenger 844 -26.7 -40.2 passenger
10 passenger 844 -23.4 -41.6 passenger
11 unspecified 844 -23.1 -41.6 passenger
12 passenger 844 -23.3 -41.6 passenger
13 fishing 845 -31.6 -50.5 fishing
14 unspecified 845 -24.6 -48.9 fishing
15 fishing 845 -27.7 -51.5 fishing
16 fishing 845 -31.5 -51.4 fishing
17 fishing 845 -30.7 -50.4 fishing
18 unspecified 845 -28.3 -51.0 fishing
19 fishing 845 -23.0 -41.5 fishing
20 fishing 845 -27.6 -41.4 fishing
21 unspecified 845 -26.9 -40.3 fishing
22 fishing 845 -23.9 -41.7 fishing
23 fishing 845 -24.1 -40.5 fishing
24 fishing 845 -23.6 -41.4 fishing
25 unspecified 825 -23.5 -40.4 unspecified
26 unspecified 825 -22.9 -42.5 unspecified
27 unspecified 825 -22.1 -41.5 unspecified
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Darren Tsai |
| Solution 2 | |
| Solution 3 |
