Convert each row into a vector with c() in R, then use the vectors to calculate the cosine similarity [duplicate]

Hello, I have a very large data frame; this is a small part of it:

v1 <- c('i1', 'i10', 'i11')
v2 <- c(0.11, 0.07, 0.114)
v3 <- c(0.07, 0.08, 0.03)
df <- data.frame(cbind(v1, v2, v3))

How can I write code that converts each row into a vector built with c(), as in x <- c(...)?

That is, my expected output should look like the following, with the variable names taken from column v1:

i1 <- c(0.11014318, 0.07302843, 0.01360761, 0.10619829, 0.14513045)
i10 <- c(0.07360007, 0.08013833, 0.13104657, 0.13174247, 0.14256615)
i11 <- c(0.11418245, 0.03300573, 0.11425297, 0.13686428, 0.03367279)

After converting each row into a vector, I need to compute the cosine similarity among these vectors. That is why I need to split each row and save it as a vector named after the corresponding value in the first column, v1.

library(lsa)
cosine(i1, i10)
cosine(i1, i11)
cosine(i10, i11)

Follow-up question

Hello SamR. Thanks for your kind help, but I do not know why it stops working when I add more columns (v4 and v5) and one more row with the ID i12. Thanks so much for your patience and help.

data_matrix <- function(df){
  data_matrix  <- tail(t(df), -1) |>
    sapply(as.numeric) |>
    matrix(
        nrow = ncol(df)-1, 
        ncol = nrow(df), 
        dimnames = list(
            seq_len(nrow(df)-1), # rows
            df[,1] # columns
        )
    ) 
}

v1 <- c('i1', 'i10', 'i11', 'i12')
v2 <- c(0.11, 0.07, 0.114, 0.67)
v3 <- c(0.07, 0.08, 0.03, 0.87)
v4 <- c(0.12, 0.13, 0.14, 0.18)
v5 <- c(0.19, 0.21, 0.22, 0.22)
df <- data.frame(cbind(v1, v2, v3, v4, v5))
df

data_matrix(df)

It just returns the error:

Error in matrix(sapply(tail(t(df), -1), as.numeric), nrow = ncol(df) -  : 
  length of 'dimnames' [1] not equal to array extent
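
A likely cause, sketched under the assumption that the matrix should have one row per value column and one column per ID: the row names are built with seq_len(nrow(df)-1), which has length 3 for this data, while the matrix has ncol(df)-1 = 4 rows. With the original 3 x 3 data frame both expressions equal 2, which hides the mismatch. The name data_matrix_fixed is hypothetical, and the last value of v3 is assumed to be 0.87.

```r
# Hypothetical corrected version of the function from the question.
data_matrix_fixed <- function(df) {
  n_rows <- ncol(df) - 1   # one matrix row per value column (v2, v3, ...)
  n_cols <- nrow(df)       # one matrix column per ID (i1, i10, ...)
  tail(t(df), -1) |>
    sapply(as.numeric) |>
    matrix(
      nrow = n_rows,
      ncol = n_cols,
      dimnames = list(seq_len(n_rows), df[, 1])  # row names now match n_rows
    )
}

v1 <- c("i1", "i10", "i11", "i12")
v2 <- c(0.11, 0.07, 0.114, 0.67)
v3 <- c(0.07, 0.08, 0.03, 0.87)   # assumption: 0.87 was intended
v4 <- c(0.12, 0.13, 0.14, 0.18)
v5 <- c(0.19, 0.21, 0.22, 0.22)
df <- data.frame(cbind(v1, v2, v3, v4, v5))

m <- data_matrix_fixed(df)
dim(m)   # 4 rows (v2..v5) by 4 columns (i1..i12)
```

The only change from the function in the question is that the row dimnames use the same count as nrow.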

r dataframe cosine-similarity

Solution 1:^[1]

Another approach would be to use apply over each row, which allows you to set the environment directly:

apply(df, 1, function(x) assign(x[1], tail(x, -1), envir = globalenv()))

However, I agree with @danlooo's comment: I can't think of any reason why you would want to do this.

Edit: how to calculate cosine similarity matrix (following comment)

If you want to calculate a cosine similarity matrix, it's better to start with a matrix than to clutter up your global environment and then have to run a potentially large number of pairwise calculations.

First get the data into the right format, a numeric matrix with column names which are the first column of your data frame:

data_matrix  <- tail(t(df), -1) |>
    sapply(as.numeric) |>
    matrix(
        nrow = ncol(df) - 1, 
        ncol = nrow(df), 
        dimnames = list(
            seq_len(ncol(df)-1), # rows
            df[,1] # columns
        )
    ) 

data_matrix
#     i1  i10   i11
# 1 0.11 0.07 0.114
# 2 0.07 0.08 0.030

Then it is straightforward to calculate the cosine similarity:


library(lsa)
cosine(data_matrix)

#            i1       i10       i11
# i1  1.0000000 0.9595950 0.9525148
# i10 0.9595950 1.0000000 0.8283488
# i11 0.9525148 0.8283488 1.0000000
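
As a cross-check that does not depend on lsa, the same matrix can be computed in base R. This is a minimal sketch, assuming (as above) that each column of the matrix is one vector to compare; cosine_base is a hypothetical helper name, not a function from any package.

```r
# Columns are the vectors to compare, as in the matrix above.
data_matrix <- matrix(
  c(0.11, 0.07,    # i1
    0.07, 0.08,    # i10
    0.114, 0.03),  # i11
  nrow = 2,
  dimnames = list(1:2, c("i1", "i10", "i11"))
)

# Hypothetical helper, not part of lsa.
cosine_base <- function(m) {
  dots  <- crossprod(m)        # t(m) %*% m: all pairwise dot products
  norms <- sqrt(diag(dots))    # Euclidean length of each column
  dots / outer(norms, norms)   # divide each dot product by both lengths
}

res <- cosine_base(data_matrix)
round(res, 7)
```

The rounded result should agree with the output of cosine(data_matrix) shown above.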

Solution 2:^[2]

You can use split or asplit to split the rows, with setNames to name the list elements after your first column, and then use list2env to add the elements of the list to the global environment:

l <- setNames(split(df[-1], seq(nrow(df))), df[,1])

# $i1
#     v2   v3
# 1 0.11 0.07
# 
# $i10
#     v2   v3
# 2 0.07 0.08
# 
# $i11
#      v2   v3
# 3 0.114 0.03

list2env(l, .GlobalEnv)

Other splitting options include asplit, row, and transposing the data frame:

asplit(df[-1], 1)
split(df[-1], row(df[-1])[, 1])
as.list(as.data.frame(t(df[, -1])))
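
Because the df in the question is built with cbind, every column is character, so the split pieces are character too. The sketch below converts the value columns to numeric first and, as an assumption about what is wanted, stores the vectors in a separate environment rather than .GlobalEnv. The names values, vecs and env are hypothetical.

```r
v1 <- c("i1", "i10", "i11")
v2 <- c(0.11, 0.07, 0.114)
v3 <- c(0.07, 0.08, 0.03)
df <- data.frame(cbind(v1, v2, v3))   # as in the question: all columns are character

# Convert the value columns to numeric, then split by row into a named list.
values <- sapply(df[-1], as.numeric)            # numeric matrix, one row per ID
vecs   <- setNames(asplit(values, 1), df[, 1])

# Keep the vectors in their own environment instead of the global one.
env <- list2env(vecs, envir = new.env())
get("i10", envir = env)
```

The vectors can then be looked up by name with vecs[["i10"]] or env$i10 without creating one global variable per row.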

Solution 3:^[3]

You can go through all rows with lapply() and index your df.

After this, you can use the list2env function from @Maël to save the elements in the list to the global environment.

setNames(lapply(1:nrow(df), function(x) df[x, -1]), df[, 1])

$i1
    v2   v3
1 0.11 0.07

$i10
    v2   v3
2 0.07 0.08

$i11
     v2   v3
3 0.114 0.03

Solution 4:^[4]

Another variation of previous answers:

lapply(seq_len(nrow(df)), \(.) assign(df$v1[.], unlist(df[.,-1]), envir = .GlobalEnv))

That is, for each row (lapply over seq_len(nrow(df)), with \(.) as the anonymous function), convert all columns except the first into a vector (unlist(df[.,-1])), and then assign that vector to the name given in the first column (assign(df$v1[.], ...)) in the global environment (envir = .GlobalEnv).

A faster variant, which also improves on @SamR's solution (there, converting the df to an array turns all numeric data into character):

list2env(setNames(apply(df[-1], 1, identity, simplify = FALSE), nm = df$v1), .GlobalEnv)

But it is not faster than @Maël's solutions:

v1 <- paste0("i", 1:1e+3)
lapply(2:200, \(.) assign(paste0("v", .), rnorm(1e+3), envir = .GlobalEnv))
df <- do.call("data.frame", args = sapply(ls(pattern = "^v\\d+$"), get, envir = .GlobalEnv, simplify = FALSE))
microbenchmark::microbenchmark(
    list2env(setNames(as.list(as.data.frame(t(df[, -1]))), df[, 1]), .GlobalEnv), 
    list2env(setNames(asplit(df[-1], 1), df[, 1]), .GlobalEnv), 
    list2env(setNames(apply(df[-1], 1, identity, simplify = FALSE), nm = df$v1), .GlobalEnv), 
    check = "equal")
Unit: milliseconds
                                                                                          expr      min       lq     mean   median       uq      max neval
             list2env(setNames(as.list(as.data.frame(t(df[, -1]))), df[, 1]), .GlobalEnv) 5.548269 5.731607 9.444446 5.864418 6.114002 37.83762   100
                               list2env(setNames(asplit(df[-1], 1), df[, 1]), .GlobalEnv) 7.421431 7.568999 9.336666 7.639897 7.800458 31.90791   100
 list2env(setNames(apply(df[-1], 1, identity, simplify = FALSE), nm = df$v1), .GlobalEnv) 8.031275 8.201781 9.796997 8.332828 8.512478 34.35403   100

The other solutions by @Maël (using split(df[-1], seq(nrow(df))) and split(df[-1], row(df[-1])[, 1])) and the solution by @benson23 setNames(lapply(1:nrow(df), function(x) df[x, -1]), df[, 1]) produce data.frame outputs instead of vectors.
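
The difference can be checked directly. This is a small sketch that assumes a df built with data.frame() so that the value columns are numeric, which differs from the cbind construction in the question; by_split, by_asplit and by_split_vec are hypothetical names.

```r
df <- data.frame(v1 = c("i1", "i10", "i11"),
                 v2 = c(0.11, 0.07, 0.114),
                 v3 = c(0.07, 0.08, 0.03))

by_split  <- setNames(split(df[-1], seq(nrow(df))), df[, 1])
by_asplit <- setNames(asplit(df[-1], 1), df[, 1])

is.data.frame(by_split$i1)    # TRUE: a one-row slice of df
is.data.frame(by_asplit$i1)   # FALSE: a plain numeric vector

# unlist() turns each one-row data frame into a named numeric vector.
by_split_vec <- lapply(by_split, unlist)
```

If the split-based versions are preferred, the unlist step brings their output in line with the vector-producing solutions.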

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1
Solution 2
Solution 3	benson23
Solution 4
