Convert each row into a vector with c() in R, then use the vectors to calculate cosine similarity [duplicate]
Hello, I have a very large data frame; here is a small part of it:
v1 <- c('i1', 'i10', 'i11')
v2 <- c(0.11, 0.07, 0.114)
v3 <- c(0.07, 0.08, 0.03)
df <- data.frame(cbind(v1, v2, v3))
How can I write code to convert each row into a vector created with c(), i.e. x <- c(...)?
That is, my expected output should look like the following, with the variable names taken from column v1:
i1 <- c(0.11014318, 0.07302843, 0.01360761, 0.10619829, 0.14513045)
i10 <- c(0.07360007, 0.08013833, 0.13104657, 0.13174247, 0.14256615)
i11 <- c(0.11418245, 0.03300573, 0.11425297, 0.13686428, 0.03367279)
After converting each row into a vector, I need to compute the cosine similarity among these vectors. That is why I need to split each row and save it as a vector named after its value in the first column, v1.
library(lsa)
cosine(i1, i10)
cosine(i1, i11)
cosine(i10, i11)
Follow-up question
Hello SamR. Thanks for your kind help, but I do not know why it does not work after adding two more columns, v4 and v5, and one more row with the ID i12. Thanks so much for your patience and help.
data_matrix <- function(df){
  data_matrix <- tail(t(df), -1) |>
    sapply(as.numeric) |>
    matrix(
      nrow = ncol(df)-1,
      ncol = nrow(df),
      dimnames = list(
        seq_len(nrow(df)-1), # rows
        df[,1] # columns
      )
    )
}
v1 <- c('i1', 'i10', 'i11', 'i12')
v2 <- c(0.11, 0.07, 0.114, 0.67)
v3 <- c(0.07, 0.08, 0.03, 0.87)
v4 <- c(0.12, 0.13, 0.14, 0.18)
v5 <- c(0.19, 0.21, 0.22, 0.22)
df <- data.frame(cbind(v1, v2, v3, v4, v5))
df
data_matrix(df)
It just returns the error:
Error in matrix(sapply(tail(t(df), -1), as.numeric), nrow = ncol(df) - :
length of 'dimnames' [1] not equal to array extent
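A likely explanation, offered as a sketch rather than a confirmed answer: the matrix is built with ncol(df) - 1 rows, but the row names are generated with seq_len(nrow(df) - 1). In the original 3 x 3 data frame both expressions equal 2, which hides the mismatch; with 4 rows and 5 columns they are 3 and 4, so the dimnames no longer fit. The version below uses ncol(df) - 1 for the row names, as the code in Solution 1 does. It assumes R 4.1 or later for the |> pipe, and the helper name n_values is only illustrative.

```r
# Sketch of a corrected helper: row names now follow the number of value columns.
data_matrix <- function(df) {
  n_values <- ncol(df) - 1            # value columns become matrix rows
  tail(t(df), -1) |>                  # drop the ID row after transposing
    sapply(as.numeric) |>             # character -> numeric, column-major order
    matrix(
      nrow = n_values,
      ncol = nrow(df),
      dimnames = list(
        seq_len(n_values),            # rows: one per value column
        df[, 1]                       # columns: IDs from the first column
      )
    )
}

v1 <- c("i1", "i10", "i11", "i12")
v2 <- c(0.11, 0.07, 0.114, 0.67)
v3 <- c(0.07, 0.08, 0.03, 0.87)       # assumes 0.87 was intended in the question
v4 <- c(0.12, 0.13, 0.14, 0.18)
v5 <- c(0.19, 0.21, 0.22, 0.22)
df <- data.frame(cbind(v1, v2, v3, v4, v5))

m <- data_matrix(df)
dim(m)        # 4 rows (v2..v5) by 4 columns (i1..i12)
```

The resulting matrix can then be passed to lsa::cosine() in the same way as in Solution 1.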
Solution 1:[1]
Another approach would be to use apply over each row, which allows you to set the environment directly:
apply(df, 1, function(x) assign(x[1], tail(x, -1), envir = globalenv()))
However, I agree with @danlooo's comment: I can't think of any reason why you would want to do this.
Edit: how to calculate cosine similarity matrix (following comment)
If you want to calculate a cosine similarity matrix, it's better to start with a matrix than to clutter up your global environment and then have to do a potentially large number of pairwise calculations.
First get the data into the right format, a numeric matrix with column names which are the first column of your data frame:
data_matrix <- tail(t(df), -1) |>
  sapply(as.numeric) |>
  matrix(
    nrow = ncol(df) - 1,
    ncol = nrow(df),
    dimnames = list(
      seq_len(ncol(df)-1), # rows
      df[,1] # columns
    )
  )
data_matrix
# i1 i10 i11
# 1 0.11 0.07 0.114
# 2 0.07 0.08 0.030
Then it is straightforward to calculate the cosine similarity:
library(lsa)
cosine(data_matrix)
# i1 i10 i11
# i1 1.0000000 0.9595950 0.9525148
# i10 0.9595950 1.0000000 0.8283488
# i11 0.9525148 0.8283488 1.0000000
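As a cross-check that does not depend on lsa, the same matrix can be computed in base R from the definition of cosine similarity: the dot product of two columns divided by the product of their Euclidean norms. This is a minimal sketch; the names m, col_norms and cos_sim are illustrative, and the matrix is typed in by hand from the example data.

```r
# Columns are the vectors to compare, as in data_matrix above.
m <- cbind(
  i1  = c(0.11, 0.07),
  i10 = c(0.07, 0.08),
  i11 = c(0.114, 0.03)
)

col_norms <- sqrt(colSums(m^2))                  # Euclidean norm of each column
cos_sim <- crossprod(m) / outer(col_norms, col_norms)  # t(m) %*% m, scaled by norms

round(cos_sim, 7)
```

For this data the result should agree with the cosine(data_matrix) output shown above.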
Solution 2:[2]
You can use split or asplit to split the rows, setNames to name the list elements after your first column, and then list2env to add the elements of the list to the global environment:
l <- setNames(split(df[-1], seq(nrow(df))), df[,1])
# $i1
# v2 v3
# 1 0.11 0.07
#
# $i10
# v2 v3
# 2 0.07 0.08
#
# $i11
# v2 v3
# 3 0.114 0.03
list2env(l, .GlobalEnv)
Other splitting options include asplit, split with row, and transposing the data frame:
asplit(df[-1], 1)
split(df[-1], row(df[-1])[, 1])
as.list(as.data.frame(t(df[, -1])))
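Note that split() returns one-row data frames, and that the question's df holds character columns because it was built with cbind(). If plain numeric vectors are wanted, one possible follow-up step is sketched below. The names vecs and m are illustrative, and stringsAsFactors = FALSE is set explicitly so the sketch does not depend on the R version's default.

```r
v1 <- c("i1", "i10", "i11")
v2 <- c(0.11, 0.07, 0.114)
v3 <- c(0.07, 0.08, 0.03)
df <- data.frame(cbind(v1, v2, v3), stringsAsFactors = FALSE)

# Named list of one-row data frames, as in the answer above.
l <- setNames(split(df[-1], seq(nrow(df))), df[, 1])

# Flatten each one-row data frame and convert the character values to numeric.
vecs <- lapply(l, function(r) as.numeric(unlist(r)))
vecs$i1

# The list can also be bound back into a matrix with one column per ID.
m <- do.call(cbind, vecs)
m
```

Keeping the vectors in a named list (or the matrix m) avoids creating one global variable per row.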
Solution 3:[3]
You can go through all rows with lapply() and index your df.
After this, you can use the list2env function from @Maël to save the elements in the list to the global environment.
setNames(lapply(1:nrow(df), function(x) df[x, -1]), df[, 1])
$i1
v2 v3
1 0.11 0.07
$i10
v2 v3
2 0.07 0.08
$i11
v2 v3
3 0.114 0.03
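The list2env step mentioned above is not shown in this answer, so here is a minimal sketch of it. As an assumption made for tidiness, it writes into a fresh environment (env, an illustrative name) instead of the global environment; passing .GlobalEnv instead would create the variables globally.

```r
v1 <- c("i1", "i10", "i11")
v2 <- c(0.11, 0.07, 0.114)
v3 <- c(0.07, 0.08, 0.03)
df <- data.frame(cbind(v1, v2, v3), stringsAsFactors = FALSE)

# One list element per row, named after the first column.
rows <- setNames(lapply(seq_len(nrow(df)), function(x) df[x, -1]), df[, 1])

# Copy the list elements into an environment as separate objects.
env <- new.env()
list2env(rows, envir = env)

ls(env)      # names of the objects created
env$i10      # still a one-row data frame, not a vector
```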
Solution 4:[4]
Another variation of previous answers:
lapply(seq_len(nrow(df)), \(.) assign(df$v1[.], unlist(df[.,-1]), envir = .GlobalEnv))
That is, for each (lapply) row (seq_len(nrow(df)), \(.)), transform all the columns except the first into a vector (unlist(df[.,-1])), and then assign that vector to the name given by the first column (df$v1[.]) in the global environment (envir = .GlobalEnv).
A faster variant, which also improves on @SamR's solution (where converting the df to an array turns all numeric data into character):
list2env(setNames(apply(df[-1], 1, identity, simplify = FALSE), nm = df$v1), .GlobalEnv)
However, it is not faster than @Maël's solutions:
v1 <- paste0("i", 1:1e+3)
lapply(2:200, \(.) assign(paste0("v", .), rnorm(1e+3), envir = .GlobalEnv))
df <- do.call("data.frame", args = sapply(ls(pattern = "^v\\d+$"), get, envir = .GlobalEnv, simplify = FALSE))
microbenchmark::microbenchmark(
list2env(setNames(as.list(as.data.frame(t(df[, -1]))), df[, 1]), .GlobalEnv),
list2env(setNames(asplit(df[-1], 1), df[, 1]), .GlobalEnv),
list2env(setNames(apply(df[-1], 1, identity, simplify = FALSE), nm = df$v1), .GlobalEnv),
check = "equal")
Unit: milliseconds
expr min lq mean median uq max neval
list2env(setNames(as.list(as.data.frame(t(df[, -1]))), df[, 1]), .GlobalEnv) 5.548269 5.731607 9.444446 5.864418 6.114002 37.83762 100
list2env(setNames(asplit(df[-1], 1), df[, 1]), .GlobalEnv) 7.421431 7.568999 9.336666 7.639897 7.800458 31.90791 100
list2env(setNames(apply(df[-1], 1, identity, simplify = FALSE), nm = df$v1), .GlobalEnv) 8.031275 8.201781 9.796997 8.332828 8.512478 34.35403 100
The other solutions by @Maël (using split(df[-1], seq(nrow(df))) and split(df[-1], row(df[-1])[, 1])) and the solution by @benson23 setNames(lapply(1:nrow(df), function(x) df[x, -1]), df[, 1]) produce data.frame outputs instead of vectors.
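To illustrate the coercion point made earlier in this answer: apply() first converts its input to a matrix, so applying it to a data frame that includes the ID column yields character vectors, while applying it to the value columns only keeps them numeric. The sketch below assumes a hypothetical data frame, df_num, built without cbind() so that its value columns are numeric to begin with, and R 4.1 or later for the simplify argument of apply().

```r
# Hypothetical data frame with genuinely numeric value columns.
df_num <- data.frame(
  v1 = c("i1", "i10", "i11"),
  v2 = c(0.11, 0.07, 0.114),
  v3 = c(0.07, 0.08, 0.03)
)

# Whole data frame: the ID column forces a character matrix.
whole <- apply(df_num, 1, function(x) tail(x, -1), simplify = FALSE)
class(whole[[1]])

# Value columns only: the matrix, and therefore each row vector, stays numeric.
values_only <- setNames(
  apply(df_num[-1], 1, identity, simplify = FALSE),
  nm = df_num$v1
)
class(values_only$i1)
```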
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
| Solution 3 | benson23 |
| Solution 4 | |

