Scraping Oxford5000 words and obtaining two equivalent word lists

I'd like to get two lists of 25 nouns each. The lists have to be balanced in average word length and in average semantic distance. Ideally, each list would also contain a balance of categories (i.e., people, places, things, concepts).

So far I have been able to get a list of 25 nouns at CEFR level A1 from the Oxford5000 and to obtain the semantic distances between each pair of selected words (see below).

I need some help in adjusting this so that it gives me two lists of different words, and so that:

  1. The lists have a comparable average word length
  2. The lists have a comparable average semantic distance (i.e., the values in lsa.mat)
  3. All semantic distances are greater than 0.7

It would also be great if the words in the two lists had:

  1. A comparable average frequency ranking (i.e., how common the words are)
  2. A balance of categories (i.e., people, places, things, concepts)

...though I don't know where I could find data on that. For criteria 1–3, at least, a sketch of the splitting step I have in mind follows.
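The idea is to draw repeated random splits of 50 nouns into two halves of 25 and keep the first split whose halves clear the 0.7 distance floor and match on mean word length and mean semantic distance within a tolerance. This is only a sketch, not tested: pair_dist() is a hypothetical helper built on LSAfun::multicos() (which, if I read the docs right, returns a word-by-word cosine matrix), and a1nouns and TASA are the objects constructed in the code further down.

# Hypothetical helper: all pairwise 1 - cosine distances for a set of words.
pair_dist <- function(ws) {
  sims <- multicos(tolower(ws), tvectors = TASA)
  1 - sims[lower.tri(sims)]  # one distance per unordered pair
}

# Draw random splits until both halves clear the distance floor and
# agree on mean word length and mean semantic distance.
balanced_split <- function(words, n_per_list = 25, min_dist = 0.7,
                           len_tol = 0.5, dist_tol = 0.02, max_tries = 5000) {
  for (try in seq_len(max_tries)) {
    pool  <- sample(words, 2 * n_per_list)
    listA <- pool[1:n_per_list]
    listB <- pool[(n_per_list + 1):(2 * n_per_list)]
    dA <- pair_dist(listA)
    dB <- pair_dist(listB)
    if (all(c(dA, dB) > min_dist) &&
        abs(mean(nchar(listA)) - mean(nchar(listB))) < len_tol &&
        abs(mean(dA) - mean(dB)) < dist_tol) {
      return(list(A = listA, B = listB))
    }
  }
  stop("No balanced split found; try loosening the tolerances.")
}

# e.g. lists <- balanced_split(a1nouns$Word)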

Here's what I have so far:

knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(rvest)
library(purrr)
library(magrittr)
library(dplyr)
library(tidyr)
library(LSAfun)

url <- "https://www.oxfordlearnersdictionaries.com/wordlists/oxford3000-5000" 

# Scrape every word, its part-of-speech tag, and its CEFR badge
# from the list page as one flat character vector.
words_typesA1 <- url %>%
  read_html() %>%
  html_nodes(".belong-to , .pos , a") %>%
  html_text()
wordtypes.tbl <- words_typesA1 %>% tibble()
# These cases are hidden in the online list and don't have an associated
# CEFR level; drop each word together with its POS row so the
# word/POS/CEFR triplets stay aligned.
missing  <- which(wordtypes.tbl$. == "accounting")
missing2 <- which(wordtypes.tbl$. == "angrily")
missing3 <- which(wordtypes.tbl$. == "cleaning")
missing4 <- which(wordtypes.tbl$. == "feeding")
missing5 <- which(wordtypes.tbl$. == "major")

wordtypes.tbl <- wordtypes.tbl[-c(missing,  missing  + 1,
                                  missing2, missing2 + 1,
                                  missing3, missing3 + 1,
                                  missing4, missing4 + 1,
                                  missing5[2], missing5[2] + 1), ]

# Locate the span of the list proper *after* the removals above, so the
# row indices still line up; "zone" + 2 includes its POS and CEFR rows.
full.first <- which(wordtypes.tbl$. == "a")
full.last  <- which(wordtypes.tbl$. == "zone") + 2

word.types.full <- wordtypes.tbl[full.first:full.last, ]
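The reshaping below assumes the remaining rows form strict word/POS/CEFR triplets, so a cheap assertion catches any drift in the scraped page:

stopifnot(nrow(word.types.full) %% 3 == 0)  # must be whole triplets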

# Reshape the flat word/POS/CEFR vector into a three-column table.
oxford5000 <- word.types.full %>%
  mutate(ind = rep(1:3, length.out = n())) %>%
  group_by(ind) %>%
  mutate(id = row_number()) %>%
  ungroup() %>%
  dplyr::rename("word" = ".") %>%
  pivot_wider(names_from = ind, values_from = word) %>%
  select(-id) %>%
  rename("Word" = "1", "Type" = "2", "CEFR" = "3")

oxford5000$WordLength <- nchar(oxford5000$Word)
a1nouns <- oxford5000 %>% filter(Type == "noun" & CEFR == "a1") %>% select(Word)

a1nouns

set.seed(42)  # fix the draw so the cue sample is reproducible (any seed works)
cues <- sample(a1nouns$Word, 25, replace = FALSE)
# All ordered pairs of distinct cues (25 * 24 = 600 rows).
cues.mat <- expand.grid(cues, cues)[, c(2, 1)] %>%
  filter(Var2 != Var1) %>%
  mutate(case = as.factor(row_number()))

cuenames <- rep(cues, each = 24)  # each cue repeated once per pairing

cues.tib <- cues.mat %>%
  mutate(name = paste0(Var2, "_", Var1))

names <- cues.tib$name

# Spread the 600 pairs into a single wide row of 1200 columns
# (word, partner, word, partner, ...) for the loop below.
cues.tib <- cues.tib %>%
  select(Var2, Var1) %>%
  pivot_longer(everything()) %>%
  mutate(id = paste0(name, ".", 1:n())) %>%
  select(-name) %>%
  pivot_wider(values_from = value, names_from = id)
  
# TASA corpus downloaded from https://sites.google.com/site/fritzgntr/software-resources/semantic_spaces

load(file = "../data/TASA.rda")
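One check worth doing right after loading (assuming the space stores words as row names, which is how the LSAfun-distributed spaces come): make sure every cue actually has a vector in TASA, since an out-of-vocabulary word would break Cosine() below.

setdiff(tolower(cues), rownames(TASA))  # should come back character(0)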

n <- cues.tib %>% nrow()

# One row per row of cues.tib (here just 1), one column per word pair.
mat <- matrix(ncol = 600, nrow = n)
k <- 1
for (i in seq(from = 1, to = 1199, by = 2)) {
  inc <- i + 1
  x <- cues.tib[, i:inc]            # the two columns holding one pair
  for (j in 1:nrow(x)) {
    word1 <- tolower(x[[j, 1]])
    word2 <- tolower(x[[j, 2]])
    output <- Cosine(word1, word2, tvectors = TASA)
    mat[j, k] <- 1 - output         # convert cosine similarity to distance
  }
  k <- k + 1
}
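(Side note: if I read the LSAfun documentation correctly, multicos() computes all pairwise cosines in one call, so the double loop above could presumably be collapsed to the three lines below; I've kept the loop since I know it runs.)

sims <- multicos(tolower(cues), tvectors = TASA)  # 25 x 25 cosine matrix
dist.full <- 1 - sims       # the same 1 - cosine distances as mat
diag(dist.full) <- NA       # ignore the self-pairs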

# Semantic distances between each pair of selected words:

lsa.mat <- mat %>% as.data.frame()

names(lsa.mat) <- names

lsa.mat
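
A quick check of criterion 3, plus the average that the two final lists will need to match:

all(lsa.mat > 0.7, na.rm = TRUE)     # criterion 3: every distance > 0.7
mean(unlist(lsa.mat), na.rm = TRUE)  # average semantic distance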

