Can I vectorise this Hamming distance calculation between rows of a dataframe?
I have a dataframe which contains data on employees from 2010 to 2017. I want to compute, for each year and employee, the Hamming distance (i.e. the number of mismatches) between the data in that row and the same employee's row for an arbitrary year, x.
test <- data.frame(
  name = c("A", "A", "A", "A", "A", "A", "A", "A"),
  year = seq(2010, 2017),
  favourite_colour = sample(c("Blue", "Green", "Red"), 8, TRUE),
  favourite_fruit = sample(c("Apple", "Banana"), 8, TRUE)
)
For example, for employee A, I want to create a new column, distance, which gives the Hamming distance between the employee's favourite_colour and favourite_fruit in each year and those values in 2017.
I understand that I can achieve this by creating a separate dataset, containing only the values for 2017, and then left joining - then I can do a manual, column-by-column comparison - but I have a lot of variables to compare, and it seems like there should be a better way.
EDIT FOR CLARIFICATION:
I mean, for example, that if the rows were:
2010 / Blue / Apple
2011 / Green / Apple
...
2017 / Green / Banana
The distance score for 2010 should be 2, as neither Blue nor Apple matches its respective value in 2017.
The distance score for 2011 should be 1, as Apple does not match the respective value in 2017, but Green does.
Solution 1:[1]
How about this:
library(tidyverse)
set.seed(123)
test <- data.frame(
  name = c("A", "A", "A", "A", "A", "A", "A", "A"),
  year = seq(2010, 2017),
  favourite_colour = sample(c("Blue", "Green", "Red"), 8, TRUE),
  favourite_fruit = sample(c("Apple", "Banana"), 8, TRUE)
)
test %>%
  pivot_longer(favourite_colour:favourite_fruit, names_to = "var", values_to = "vals") %>%
  group_by(name, var) %>%
  mutate(comp = vals[which(year == 2017)]) %>%
  ungroup() %>%
  group_by(name, year) %>%
  summarise(dist = sum(comp != vals))
#> `summarise()` has grouped output by 'name'. You can override using the
#> `.groups` argument.
#> # A tibble: 8 × 3
#> # Groups: name [1]
#> name year dist
#> <chr> <int> <int>
#> 1 A 2010 1
#> 2 A 2011 1
#> 3 A 2012 2
#> 4 A 2013 1
#> 5 A 2014 2
#> 6 A 2015 0
#> 7 A 2016 1
#> 8 A 2017 0
Created on 2022-05-12 by the reprex package (v2.0.1)
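The numbers won't match yours exactly, because the data in the question were generated without a random number seed and I can't reproduce them, but the approach should work: each row's values are compared with the same employee's 2017 values, and the mismatches are counted.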
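If you would rather avoid the pivot, the same count can probably be obtained in base R by comparing the data frame with a copy of the 2017 rows aligned to it. This is only a minimal sketch, not a tested drop-in: it uses the three rows from the question's clarification as hypothetical data, and it assumes each employee has exactly one 2017 row and that every column other than name and year should be compared. The names vars, ref_rows and ref_idx are made up for illustration.

```r
# Hypothetical data: the three rows spelled out in the question's clarification.
test <- data.frame(
  name = c("A", "A", "A"),
  year = c(2010, 2011, 2017),
  favourite_colour = c("Blue", "Green", "Green"),
  favourite_fruit = c("Apple", "Apple", "Banana")
)

# Columns to compare (assumption: everything except the identifiers).
vars <- setdiff(names(test), c("name", "year"))

# Reference rows for 2017, and for each row of `test` the position of the
# matching employee in the reference rows (assumes one 2017 row per employee).
ref_rows <- test[test$year == 2017, ]
ref_idx <- match(test$name, ref_rows$name)

# Comparing two data frames of equal shape gives a logical matrix;
# rowSums() then counts the mismatches in each row.
mismatch <- test[vars] != ref_rows[ref_idx, vars, drop = FALSE]
test$distance <- unname(rowSums(mismatch))

test$distance
# 2 1 0
```

An employee with no 2017 row gets NA here rather than an error, which may or may not be what you want.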
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | DaveArmstrong |
