'Summarising movements of individuals spread over several rows
I am a newly self-taught user of R and require assistance.
I am working with a dataset that has captured location of residence and whether the locality is metropolitan, regional or rural over 7 years (2015-2021) for a subset of a population. Each individual has a unique ID and each year is on a new row (ie. each ID has 7 rows). I am trying to figure out how many individuals have remained in the same location, how many have moved and where they moved to.
I am really struggling to figure out what I need to do to get the required outputs, but I assume there is a way to get a summary table that has number of individuals who havent moved (+- where they are located) and number of individuals that have moved (+- where they have moved to).
Your assistance would be greatly appreciated.
Dummy dataset:
stack <- tribble(
~ID, ~Year, ~Residence, ~Locality,
#--/--/--/----
"a", "2015", "Sydney", "Metro",
"a", "2016", "Sydney", "Metro",
"a", "2017", "Sydney", "Metro",
"a", "2018", "Sydney", "Metro",
"a", "2019", "Sydney", "Metro",
"a", "2020", "Sydney", "Metro",
"a", "2021", "Sydney", "Metro",
"b", "2015", "Sydney", "Metro",
"b", "2016", "Orange", "Regional",
"b", "2017", "Orange", "Regional",
"b", "2018", "Orange", "Regional",
"b", "2019", "Orange", "Regional",
"b", "2020", "Broken Hill", "Rural",
"b", "2021", "Sydney", "Metro",
"c", "2015", "Dubbo", "Regional",
"c", "2016", "Dubbo", "Regional",
"c", "2017", "Dubbo", "Regional",
"c", "2018", "Dubbo", "Regional",
"c", "2019", "Dubbo", "Regional",
"c", "2020", "Dubbo", "Regional",
"c", "2021", "Dubbo", "Regional",
)
Cheers in advance.
Solution 1:[1]
You can use the lead function to add columns containing the persons' location in the following year. Using mutate across, you can apply the lead to two columns simultaneously. You can then make a row-wise comparisons and look for moves before summarising.
#Group by individual before applying the lead function
#Apply the lead function to the two listed columns and add "nextyear" as a suffix
#Add a logical column which returns TRUE if any change of residence or locality is detected.
#summarise the date by individual by retaining the location with the max year.
stack%>%
unite(col="Location", c(Residence, Locality), sep="-")%>%
group_by(ID)%>%
mutate(across(c("Year", "Location"), list(nextyear= lead)),
Move=Location!=Location_nextyear)%>%
filter(!is.na(Year_nextyear))%>%
mutate(nb.of.moves=sum(Move, na.rm=TRUE))%>%
slice_max(Year)%>%
select(ID, last.location=Location_nextyear, nb.of.moves)
# A tibble: 3 x 3
# Groups: ID [3]
ID last.location nb.of.moves
<chr> <chr> <int>
1 a Sydney-Metro 0
2 b Sydney-Metro 3
3 c Dubbo-Regional 0
Solution 2:[2]
Here is another tidyverse option and using cumsum. We can get the cumulative sum to show how many times each person moves (if they do). Then, we can slice the last row, and get the count of each location. The change column indicates how many times they moved. However, it's unclear what you want the final product to look like.
library(tidyverse)
stack %>%
group_by(ID) %>%
mutate(
change = cumsum(case_when(
paste0(Residence, Locality) != lag(paste0(Residence, Locality)) ~ TRUE,
TRUE ~ FALSE
))
) %>%
slice(n()) %>%
ungroup %>%
count(Residence, Locality, change)
Output
Residence Locality change n
<chr> <chr> <int> <int>
1 Dubbo Regional 0 1
2 Sydney Metro 0 1
3 Sydney Metro 3 1
Solution 3:[3]
Using data.table.
library(data.table)
setDT(stack) # convert to data.table
setorder(stack, ID, Year) # assure rows are in correct order
stack[, rle(paste(Residence, Locality, sep=', ')), by=.(ID)]
## ID lengths values
## 1: a 7 Sydney, Metro
## 2: b 1 Sydney, Metro
## 3: b 4 Orange, Regional
## 4: b 1 Broken Hill, Rural
## 5: b 1 Sydney, Metro
## 6: c 7 Dubbo, Regional
So a stayed in Sydney for 7 years, b stayed in Sydney for 1 year then moved to Orange for 4 years, then moved to Broken Hill for 1 year, then moved back to Sydney for 1 year.
To determine how many times each person moved:
result <- stack[, rle(paste(Residence, Locality, sep=', ')), by=.(ID)]
result[, .(N=.N-1), by=.(ID)]
## ID N
## 1: a 0
## 2: b 3
## 3: c 0
So a and c did not move at all, and b moved 3 times.
Solution 4:[4]
Similar to what @Dealec did, I used the lag function from dplyr instead.
library(tidyverse)
library(janitor)
#>
#> Attaching package: 'janitor'
#> The following objects are masked from 'package:stats':
#>
#> chisq.test, fisher.test
stack <- tribble(
~ID, ~Year, ~Residence, ~Locality,
#--/--/--/----
"a", "2015", "Sydney", "Metro",
"a", "2016", "Sydney", "Metro",
"a", "2017", "Sydney", "Metro",
"a", "2018", "Sydney", "Metro",
"a", "2019", "Sydney", "Metro",
"a", "2020", "Sydney", "Metro",
"a", "2021", "Sydney", "Metro",
"b", "2015", "Sydney", "Metro",
"b", "2016", "Orange", "Regional",
"b", "2017", "Orange", "Regional",
"b", "2018", "Orange", "Regional",
"b", "2019", "Orange", "Regional",
"b", "2020", "Broken Hill", "Rural",
"b", "2021", "Sydney", "Metro",
"c", "2015", "Dubbo", "Regional",
"c", "2016", "Dubbo", "Regional",
"c", "2017", "Dubbo", "Regional",
"c", "2018", "Dubbo", "Regional",
"c", "2019", "Dubbo", "Regional",
"c", "2020", "Dubbo", "Regional",
"c", "2021", "Dubbo", "Regional",
) %>%
clean_names()
results <- stack %>%
mutate(location = paste(residence, locality, sep = "_")) %>%
arrange(id, year) %>%
group_by(id) %>%
mutate(
row = row_number(),
movement = case_when(
row == 1 ~ NA_character_,
location == lag(location, n = 1) ~ "no_movement",
TRUE ~ location
)
) %>%
ungroup() %>%
select(-row)
results
#> # A tibble: 21 x 6
#> id year residence locality location movement
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 a 2015 Sydney Metro Sydney_Metro <NA>
#> 2 a 2016 Sydney Metro Sydney_Metro no_movement
#> 3 a 2017 Sydney Metro Sydney_Metro no_movement
#> 4 a 2018 Sydney Metro Sydney_Metro no_movement
#> 5 a 2019 Sydney Metro Sydney_Metro no_movement
#> 6 a 2020 Sydney Metro Sydney_Metro no_movement
#> 7 a 2021 Sydney Metro Sydney_Metro no_movement
#> 8 b 2015 Sydney Metro Sydney_Metro <NA>
#> 9 b 2016 Orange Regional Orange_Regional Orange_Regional
#> 10 b 2017 Orange Regional Orange_Regional no_movement
#> # ... with 11 more rows
results %>%
count(year, movement) %>%
pivot_wider(names_from = movement,
values_from = n) %>%
clean_names()
#> # A tibble: 7 x 6
#> year na no_movement orange_regional broken_hill_rural sydney_metro
#> <chr> <int> <int> <int> <int> <int>
#> 1 2015 3 NA NA NA NA
#> 2 2016 NA 2 1 NA NA
#> 3 2017 NA 3 NA NA NA
#> 4 2018 NA 3 NA NA NA
#> 5 2019 NA 3 NA NA NA
#> 6 2020 NA 2 NA 1 NA
#> 7 2021 NA 2 NA NA 1
#tracking movement from a location
from_location <- stack %>%
mutate(location = paste(residence, locality, sep = "_")) %>%
arrange(id, year) %>%
group_by(id) %>%
mutate(
row = row_number(),
movement_from = case_when(
row == 1 ~ NA_character_,
location == lag(location, n = 1) ~ "no_movement",
TRUE ~ lag(location, n = 1)
)
) %>%
ungroup() %>%
select(-row)
from_location %>%
count(year, movement_from) %>%
pivot_wider(names_from = movement_from,
names_prefix = "from_",
values_from = n) %>%
clean_names()
#> # A tibble: 7 x 6
#> year from_na from_no_movement from_sydney_metro from_orange_regional
#> <chr> <int> <int> <int> <int>
#> 1 2015 3 NA NA NA
#> 2 2016 NA 2 1 NA
#> 3 2017 NA 3 NA NA
#> 4 2018 NA 3 NA NA
#> 5 2019 NA 3 NA NA
#> 6 2020 NA 2 NA 1
#> 7 2021 NA 2 NA NA
#> # ... with 1 more variable: from_broken_hill_rural <int>
Created on 2022-04-28 by the reprex package (v2.0.1)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | AndrewGB |
| Solution 3 | jlhoward |
| Solution 4 |
