Category "nlp"

Tokenization of Compound Words not Working in Quanteda

I'm trying to create a dataframe containing specific keywords-in-context using the kwic() function, but unfortunately, I'm running into some error when attempti

How are the TokenEmbeddings in BERT created?

In the paper describing BERT, there is this paragraph about WordPiece Embeddings. We use WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocab

How do I know the order of the classes in a CatBoost classifier weights?

This is a pretty dumb question, but I couldn't find the answer anywhere, so I will take my chances here... I'm building a classifier using CatBoost. Since this is an NLP

TypeError: "hypothesis" expects pre-tokenized hypothesis (Iterable[str]):

I am trying to calculate the Meteor score for the following: print (nltk.translate.meteor_score.meteor_score( ["this is an apple", "that is an apple"], "an
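
The error message in the title says the hypothesis must be an iterable of tokens rather than a raw string. A minimal sketch of that input shaping, assuming plain whitespace splitting is an acceptable tokenizer; the `pretokenize` helper and the hypothesis sentence are hypothetical, since the original call is cut off:

```python
def pretokenize(sentence):
    """Split a sentence on whitespace (a stand-in for a real tokenizer)."""
    return sentence.split()

references = ["this is an apple", "that is an apple"]
hypothesis = "an apple is here"  # hypothetical example sentence

# Each reference becomes a list of tokens, and so does the hypothesis.
tokenized_references = [pretokenize(r) for r in references]
tokenized_hypothesis = pretokenize(hypothesis)

# With NLTK installed and its WordNet data downloaded, the call would then be:
# from nltk.translate.meteor_score import meteor_score
# score = meteor_score(tokenized_references, tokenized_hypothesis)
```

The NLTK call is left as a comment so the snippet runs without extra downloads.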

NLP textEmbed function

I am trying to run the textEmbed function in R. Set up needed: require(quanteda) require(quanteda.textstats) require(udpipe) require(reticulate) #udpi

How to vectorize a Python function

I have made a resume parser but to parse my resumes, I am using a for loop to run my parse function over each resume. Is there a way to vectorize this approach?
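
An arbitrary Python parse function usually cannot be vectorized in the NumPy sense, but the loop can be spread over a worker pool. A rough sketch, assuming each resume is parsed independently; `parse_resume` here is a hypothetical stand-in for the actual parser:

```python
from concurrent.futures import ThreadPoolExecutor

def parse_resume(text):
    """Hypothetical parser: returns the first line and the word count."""
    lines = text.strip().splitlines()
    return {"first_line": lines[0] if lines else "", "n_words": len(text.split())}

def parse_all(resumes, max_workers=4):
    """Apply parse_resume to every resume, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(parse_resume, resumes))

resumes = ["Jane Doe\nPython, NLP", "John Roe\nR, statistics, SQL"]
parsed = parse_all(resumes)
```

Threads mostly help when parsing waits on I/O such as reading files. For CPU-bound parsing, `ProcessPoolExecutor` offers the same `map` interface and may be the better fit.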

How to store Bag of Words or Embeddings in a Database

I would like to store vector features, like Bag-of-Words or Word-Embedding vectors of a large number of texts, in a dataset, stored in a SQL Database. What're t
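
One possible approach, sketched here with SQLite from the standard library, is to serialize each vector to bytes and keep it in a BLOB column. The table name, column names, and helper functions are assumptions made for illustration:

```python
import sqlite3
from array import array

def vector_to_blob(vector):
    """Pack a list of floats into bytes (8-byte doubles)."""
    return array("d", vector).tobytes()

def blob_to_vector(blob):
    """Unpack bytes written by vector_to_blob back into a list of floats."""
    values = array("d")
    values.frombytes(blob)
    return values.tolist()

conn = sqlite3.connect(":memory:")  # in-memory database for the example
conn.execute(
    "CREATE TABLE text_features (text_id INTEGER PRIMARY KEY, embedding BLOB NOT NULL)"
)
conn.execute(
    "INSERT INTO text_features (text_id, embedding) VALUES (?, ?)",
    (1, vector_to_blob([0.25, -1.5, 3.0])),
)
conn.commit()

row = conn.execute(
    "SELECT embedding FROM text_features WHERE text_id = ?", (1,)
).fetchone()
restored = blob_to_vector(row[0])
```

A BLOB keeps the table compact but the database cannot query inside the vector. Sparse Bag-of-Words features might fit a `(text_id, term, count)` table better.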

R: Correct Way to Calculate Cosine Similarity?

I am working with the R programming language. I have the following data: text = structure(list(id = 1:8, reviews = c("I guess the employee decided to buy their
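
For reference, cosine similarity between two texts can be computed from their word-count vectors as the dot product divided by the product of the vector norms. A small sketch in Python (not R), assuming lowercased whitespace tokens; the example sentences are made up and are not taken from the data above:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between the bag-of-words count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(counts_a[term] * counts_b[term] for term in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)

same = cosine_similarity("good food", "good food")
partial = cosine_similarity("good food", "good service")
disjoint = cosine_similarity("good food", "slow service")
```

Identical texts score close to 1, texts with no shared words score 0, and the pair sharing one of two words lands near 0.5.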

Error 'power iteration failed to converge within 100 iterations' when I tried to summarize a text document using Python networkx

I got a PowerIterationFailedConvergence:(PowerIterationFailedConvergence(...), 'power iteration failed to converge within 100 iterations') when I tried to summ

Continual pre-training vs. Fine-tuning a language model with MLM

I have some custom data I want to use to further pre-train the BERT model. I’ve tried the following two approaches so far: Starting with a pre-trained BER

How to get up and running with spaCy for Vietnamese?

I succeeded with English: python -m spacy download en_core_web_lg python -m spacy download en_core_web_sm python -m spacy download en I read https://spacy.io/mod

Definition of downstream tasks in NLP

What does the term "downstream tasks" mean in NLP? I have seen it used in several articles, but I can't understand the idea behind it.

How to fix LDA model coherence score runtime Error?

text='Alice is a student.She likes studying.Teachers are giving a lot of homewok.' I am trying to get topics from a simple text (like above) with coherence scor

Follow-up question regarding a Keras model issue

So about a week ago I posted this question: Issues running a Keras model with custom layers. The suggestion there was to try to make this question smaller and t

Extracting names from a text file using spaCy

I have a text file which contains lines as shown below: Electronically signed : Wes Scott, M.D.; Jun 26 2010 11:10AM CST The patient was referred by Dr. J

How do I remove nonsensical or incomplete words from a corpus?

I am using some text for some NLP analyses. I have cleaned the text taking steps to remove non-alphanumeric characters, blanks, duplicate words and stopwords, a
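
One common way to drop nonsensical or incomplete words is to keep only tokens found in a reference word list. A minimal sketch, assuming such a list is available; the tiny `vocabulary` set and the sample tokens below are hypothetical:

```python
def keep_known_words(tokens, vocabulary):
    """Keep only tokens whose lowercase form appears in the vocabulary."""
    return [token for token in tokens if token.lower() in vocabulary]

vocabulary = {"the", "model", "learns", "quickly"}  # hypothetical word list
tokens = ["the", "modl", "model", "lea", "learns", "quickly"]
cleaned = keep_known_words(tokens, vocabulary)
```

In practice the word list would come from a dictionary or a large corpus, and domain terms missing from it would be filtered out too, so the list may need extending.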

Training spaCy NER using multiprocessing

I am trying to train a custom NER model using spaCy. Currently, I have more than 2k records for training and each text consists of more than 100 words, at least

Tokenizing an HTML document

I have an HTML document and I'd like to tokenize it using spaCy while keeping HTML tags as a single token. Here's my code: import spacy from spacy.symbols impo

Embedding 3D data in PyTorch

I want to implement character-level embedding. This is the usual word embedding. Word Embedding Input: [ [‘who’, ‘is’, ‘this’

Tensorflow-addons seq2seq - start and end tokens in BaseDecoder or BasicDecoder

I am writing code inspired by https://www.tensorflow.org/addons/api_docs/python/tfa/seq2seq/BasicDecoder. In the translation/generation we instantiate a Basic