'Summarising movements of individuals spread over several rows

I am a newly self-taught user of R and require assistance.

I am working with a dataset that has captured location of residence and whether the locality is metropolitan, regional or rural over 7 years (2015-2021) for a subset of a population. Each individual has a unique ID and each year is on a new row (ie. each ID has 7 rows). I am trying to figure out how many individuals have remained in the same location, how many have moved and where they moved to.

I am really struggling to figure out what I need to do to get the required outputs, but I assume there is a way to get a summary table that has number of individuals who havent moved (+- where they are located) and number of individuals that have moved (+- where they have moved to).

Your assistance would be greatly appreciated.

Dummy dataset:

stack <-  tribble(
    ~ID, ~Year, ~Residence, ~Locality,
    #--/--/--/----
    "a", "2015", "Sydney", "Metro",
    "a", "2016", "Sydney", "Metro",
    "a", "2017", "Sydney", "Metro",
    "a", "2018", "Sydney", "Metro",
    "a", "2019", "Sydney", "Metro",
    "a", "2020", "Sydney", "Metro",
    "a", "2021", "Sydney", "Metro",
    "b", "2015", "Sydney", "Metro",
    "b", "2016", "Orange", "Regional",
    "b", "2017", "Orange", "Regional",
    "b", "2018", "Orange", "Regional",
    "b", "2019", "Orange", "Regional",
    "b", "2020", "Broken Hill", "Rural",
    "b", "2021", "Sydney", "Metro",
    "c", "2015", "Dubbo", "Regional",
    "c", "2016", "Dubbo", "Regional",
    "c", "2017", "Dubbo", "Regional",
    "c", "2018", "Dubbo", "Regional",
    "c", "2019", "Dubbo", "Regional",
    "c", "2020", "Dubbo", "Regional",
    "c", "2021", "Dubbo", "Regional",
  )

Cheers in advance.

r


Solution 1:[1]

You can use the lead function to add columns containing the persons' location in the following year. Using mutate across, you can apply the lead to two columns simultaneously. You can then make a row-wise comparisons and look for moves before summarising.

#Group by individual before applying the lead function
#Apply the lead function to the two listed columns and add "nextyear" as a suffix
#Add a logical column which returns TRUE if any change of residence or locality is detected.
#summarise the date by individual by retaining the location with the max year.

stack%>%
  unite(col="Location", c(Residence, Locality), sep="-")%>%
  group_by(ID)%>%
  mutate(across(c("Year", "Location"), list(nextyear= lead)),
         Move=Location!=Location_nextyear)%>%
  filter(!is.na(Year_nextyear))%>%
  mutate(nb.of.moves=sum(Move, na.rm=TRUE))%>%
  slice_max(Year)%>%
  select(ID, last.location=Location_nextyear, nb.of.moves)

# A tibble: 3 x 3
# Groups:   ID [3]
  ID    last.location  nb.of.moves
  <chr> <chr>                <int>
1 a     Sydney-Metro             0
2 b     Sydney-Metro             3
3 c     Dubbo-Regional           0

Solution 2:[2]

Here is another tidyverse option and using cumsum. We can get the cumulative sum to show how many times each person moves (if they do). Then, we can slice the last row, and get the count of each location. The change column indicates how many times they moved. However, it's unclear what you want the final product to look like.

library(tidyverse)

stack %>% 
  group_by(ID) %>% 
  mutate(
    change = cumsum(case_when(
      paste0(Residence, Locality) != lag(paste0(Residence, Locality)) ~ TRUE,
      TRUE ~ FALSE
    ))
  ) %>% 
  slice(n()) %>% 
  ungroup %>% 
  count(Residence, Locality, change)

Output

  Residence Locality change     n
  <chr>     <chr>     <int> <int>
1 Dubbo     Regional      0     1
2 Sydney    Metro         0     1
3 Sydney    Metro         3     1

Solution 3:[3]

Using data.table.

library(data.table)
setDT(stack)               # convert to data.table
setorder(stack, ID, Year)  # assure rows are in correct order
stack[, rle(paste(Residence, Locality, sep=', ')), by=.(ID)]
##    ID lengths             values
## 1:  a       7      Sydney, Metro
## 2:  b       1      Sydney, Metro
## 3:  b       4   Orange, Regional
## 4:  b       1 Broken Hill, Rural
## 5:  b       1      Sydney, Metro
## 6:  c       7    Dubbo, Regional

So a stayed in Sydney for 7 years, b stayed in Sydney for 1 year then moved to Orange for 4 years, then moved to Broken Hill for 1 year, then moved back to Sydney for 1 year.

To determine how many times each person moved:

result <- stack[, rle(paste(Residence, Locality, sep=', ')), by=.(ID)]
result[, .(N=.N-1), by=.(ID)]
##    ID N
## 1:  a 0
## 2:  b 3
## 3:  c 0

So a and c did not move at all, and b moved 3 times.

Solution 4:[4]

Similar to what @Dealec did, I used the lag function from dplyr instead.

library(tidyverse)
library(janitor)
#> 
#> Attaching package: 'janitor'
#> The following objects are masked from 'package:stats':
#> 
#>     chisq.test, fisher.test

stack <-  tribble(
  ~ID, ~Year, ~Residence, ~Locality,
  #--/--/--/----
  "a", "2015", "Sydney", "Metro",
  "a", "2016", "Sydney", "Metro",
  "a", "2017", "Sydney", "Metro",
  "a", "2018", "Sydney", "Metro",
  "a", "2019", "Sydney", "Metro",
  "a", "2020", "Sydney", "Metro",
  "a", "2021", "Sydney", "Metro",
  "b", "2015", "Sydney", "Metro",
  "b", "2016", "Orange", "Regional",
  "b", "2017", "Orange", "Regional",
  "b", "2018", "Orange", "Regional",
  "b", "2019", "Orange", "Regional",
  "b", "2020", "Broken Hill", "Rural",
  "b", "2021", "Sydney", "Metro",
  "c", "2015", "Dubbo", "Regional",
  "c", "2016", "Dubbo", "Regional",
  "c", "2017", "Dubbo", "Regional",
  "c", "2018", "Dubbo", "Regional",
  "c", "2019", "Dubbo", "Regional",
  "c", "2020", "Dubbo", "Regional",
  "c", "2021", "Dubbo", "Regional",
) %>% 
  clean_names()

results <- stack %>%
  mutate(location = paste(residence, locality, sep = "_")) %>%
  arrange(id, year) %>%
  group_by(id) %>%
  mutate(
    row = row_number(),
    movement = case_when(
      row == 1 ~ NA_character_,
      location == lag(location, n = 1) ~ "no_movement",
      TRUE ~ location
    )
  ) %>% 
  ungroup() %>% 
  select(-row)

results
#> # A tibble: 21 x 6
#>    id    year  residence locality location        movement       
#>    <chr> <chr> <chr>     <chr>    <chr>           <chr>          
#>  1 a     2015  Sydney    Metro    Sydney_Metro    <NA>           
#>  2 a     2016  Sydney    Metro    Sydney_Metro    no_movement    
#>  3 a     2017  Sydney    Metro    Sydney_Metro    no_movement    
#>  4 a     2018  Sydney    Metro    Sydney_Metro    no_movement    
#>  5 a     2019  Sydney    Metro    Sydney_Metro    no_movement    
#>  6 a     2020  Sydney    Metro    Sydney_Metro    no_movement    
#>  7 a     2021  Sydney    Metro    Sydney_Metro    no_movement    
#>  8 b     2015  Sydney    Metro    Sydney_Metro    <NA>           
#>  9 b     2016  Orange    Regional Orange_Regional Orange_Regional
#> 10 b     2017  Orange    Regional Orange_Regional no_movement    
#> # ... with 11 more rows

results %>% 
  count(year, movement) %>% 
  pivot_wider(names_from = movement,
              values_from = n) %>% 
  clean_names()
#> # A tibble: 7 x 6
#>   year     na no_movement orange_regional broken_hill_rural sydney_metro
#>   <chr> <int>       <int>           <int>             <int>        <int>
#> 1 2015      3          NA              NA                NA           NA
#> 2 2016     NA           2               1                NA           NA
#> 3 2017     NA           3              NA                NA           NA
#> 4 2018     NA           3              NA                NA           NA
#> 5 2019     NA           3              NA                NA           NA
#> 6 2020     NA           2              NA                 1           NA
#> 7 2021     NA           2              NA                NA            1

#tracking movement from a location

from_location <- stack %>%
  mutate(location = paste(residence, locality, sep = "_")) %>%
  arrange(id, year) %>%
  group_by(id) %>%
  mutate(
    row = row_number(),
    movement_from = case_when(
      row == 1 ~ NA_character_,
      location == lag(location, n = 1) ~ "no_movement",
      TRUE ~ lag(location, n = 1)
    )
  ) %>% 
  ungroup() %>% 
  select(-row)


from_location %>% 
  count(year, movement_from) %>% 
  pivot_wider(names_from = movement_from,
              names_prefix = "from_",
              values_from = n) %>% 
  clean_names()
#> # A tibble: 7 x 6
#>   year  from_na from_no_movement from_sydney_metro from_orange_regional
#>   <chr>   <int>            <int>             <int>                <int>
#> 1 2015        3               NA                NA                   NA
#> 2 2016       NA                2                 1                   NA
#> 3 2017       NA                3                NA                   NA
#> 4 2018       NA                3                NA                   NA
#> 5 2019       NA                3                NA                   NA
#> 6 2020       NA                2                NA                    1
#> 7 2021       NA                2                NA                   NA
#> # ... with 1 more variable: from_broken_hill_rural <int>

Created on 2022-04-28 by the reprex package (v2.0.1)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 AndrewGB
Solution 3 jlhoward
Solution 4