Should I leave periods in the text when training a fastText model?

I have a dataset:

    text
    Market regularisation. Researching the marketplace and recommending the most appropriate products
    Advising clients on investments, taxes, estate planning. Meeting with clients to establish their needs
    ...

I want to get an embedding for the text in each row using fastText. Before doing that, I do some preprocessing (lemmatisation, lowercasing, ...) and then I join the sentences in each row together with a space. However, I'm not sure whether the model would train better if I left the periods between the sentences in each row.
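For reference, here is a rough sketch of the pipeline I have in mind (shown with gensim's FastText; the column name, parameters, and the averaging step are only illustrative, and lemmatisation is left out for brevity):

    import numpy as np
    import pandas as pd
    from gensim.models import FastText
    from gensim.utils import simple_preprocess

    df = pd.DataFrame({"text": [
        "Market regularisation. Researching the marketplace and recommending the most appropriate products",
        "Advising clients on investments, taxes, estate planning. Meeting with clients to establish their needs",
    ]})

    # simple_preprocess lowercases and strips punctuation, so the periods disappear here
    corpus = [simple_preprocess(row) for row in df["text"]]

    model = FastText(vector_size=100, window=5, min_count=1)
    model.build_vocab(corpus)
    model.train(corpus, total_examples=len(corpus), epochs=10)

    # one embedding per row: average that row's word vectors
    row_vectors = [np.mean([model.wv[tok] for tok in tokens], axis=0) for tokens in corpus]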



Solution 1:[1]

Questions like this are best answered by trying it both ways, & seeing which one scores better on your evaluations.

The answer might vary based on your data, what you're hoping your model will usefully reflect, & the exact details of your particular punctuation-handling choices – which you haven't described in detail.
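As a concrete starting point, the comparison can be as simple as training on both variants of the corpus and scoring each with whatever evaluation matches your end-task. The evaluate() stub below is a placeholder you'd replace with your own metric, and the toy corpora are only there to keep the sketch runnable:

    from gensim.models import FastText

    def train(corpus):
        model = FastText(vector_size=100, window=5, min_count=1)
        model.build_vocab(corpus)
        model.train(corpus, total_examples=len(corpus), epochs=10)
        return model

    def evaluate(model):
        # Stand-in for your own downstream metric (classification accuracy,
        # retrieval quality, similarity judgements, ...). Returning a constant
        # just keeps the sketch runnable.
        return 0.0

    corpus_with_periods = [["advising", "clients", "."], ["meeting", "clients", "."]]
    corpus_without_periods = [["advising", "clients"], ["meeting", "clients"]]

    scores = {
        "with_periods": evaluate(train(corpus_with_periods)),
        "without_periods": evaluate(train(corpus_without_periods)),
    }
    print(scores)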

Much published word-vector work doesn't bother doing much with punctuation except ensuring it's not left as cruft attached to true word-tokens, & that's the step I'd expect to have the biggest positive effect.

Often, the punctuation is kept as pseudo-words, which receive their own vectors during training. But other times, it's stripped entirely so the training texts are true words only.
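To make those two treatments concrete, here's one way they might look with a simple regex tokenizer (the regex itself is just an illustration, not a recommendation):

    import re

    sentence = "Advising clients on investments, taxes, estate planning."

    # Option A: keep punctuation as standalone pseudo-words that get their own vectors
    re.findall(r"\w+|[.,;:!?]", sentence.lower())
    # ['advising', 'clients', 'on', 'investments', ',', 'taxes', ',', 'estate', 'planning', '.']

    # Option B: strip punctuation entirely, so only true word-tokens remain
    re.findall(r"\w+", sentence.lower())
    # ['advising', 'clients', 'on', 'investments', 'taxes', 'estate', 'planning']

Either way, the important part is that a token like "planning." never survives with the period glued on.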

I've not noticed a strong consensus practice, which is why I wouldn't be surprised if the benefits of either choice are small, & project-dependent.

Separately: lemmatization is often superfluous, or even a bad idea, if you have plenty of training data. With enough data, each variant of related words can get good, and even distinctly-useful, vectors without the added complication of coalescing related words into shared tokens.

Algorithms like word2vec & FastText inherently need lots of training data, so if your data is so thin that lemmatization helps, you might already be set up for bigger problems. (Getting more data is usually a better goal than more tricky preprocessing, like lemmatization, on meager data.)

And, FastText specifically tries to learn from word substrings, which gives it a leg up on understanding variant forms of words, and even unseen variants of known words that differ by only a few characters, or typos.

But that subword learning depends on many variants of similarly-written words providing patterns-of-subword-meaning to observe. So lemmatization could be especially problematic for FastText, hiding inflections that FastText could learn from.
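A quick way to see that subword behaviour (sketched here with gensim's FastText; the tiny corpus and parameters are only for illustration):

    from gensim.models import FastText

    corpus = [
        ["advising", "clients", "on", "investments"],
        ["meeting", "with", "clients", "to", "establish", "their", "needs"],
        ["researching", "the", "marketplace"],
    ]

    model = FastText(vector_size=50, window=3, min_count=1, min_n=3, max_n=6)
    model.build_vocab(corpus)
    model.train(corpus, total_examples=len(corpus), epochs=20)

    print("advised" in model.wv.key_to_index)          # False: never seen in training
    print(model.wv["advised"][:5])                     # still gets a vector, via shared character n-grams
    print(model.wv.similarity("advising", "advised"))  # similarity computed even for the unseen inflection

Lemmatizing "advising" down to "advise" before training would remove exactly the kind of surface variation this mechanism relies on.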

I'd suggest leaving lemmatization out by default, then only adding it back if a test shows it helps your end results.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1