Best way to handle dialectal speech with Spacy NLP

It seems Spacy NLP is having issues with non-standard verb conjugation and spoken text/slang vocabulary.

For example, I'm working with captions from telenovela series from Colombia. There is a lot of usage of 'vos' and its conjugations, which I assume did not exist in the training data. 'Vos' and verbs conjugated to this form are getting incorrectly tagged as proper nouns.

This is the case with a lot of local slang as well.

I can't train a custom model right now due to a lack of tagged data. I don't have the resources or linguistics expertise to manually annotate things.

I know I can just make a custom rule for 'vos' and set it to be a pronoun instead of a proper noun. As far as I can tell, though, Spacy is currently unable to retroactively correct its tokenization and dependency guesses based on a manual POS correction for a subset of words (in which 'vos' would be included).
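For reference, the custom rule I mean would look roughly like the sketch below. It uses a blank Spanish pipeline (tokenizer only) so it runs without downloading a model, which is an assumption made for illustration; with a trained pipeline that already includes an `attribute_ruler` component, I would presumably fetch it with `nlp.get_pipe("attribute_ruler")` instead of adding a new one. The helper name `build_pipeline` is hypothetical.

```python
import spacy


def build_pipeline():
    """Return a blank Spanish pipeline with one POS override for 'vos'."""
    # Blank pipeline: tokenizer only, no trained components (assumption
    # for this sketch; a real setup would load a trained Spanish model).
    nlp = spacy.blank("es")
    ruler = nlp.add_pipe("attribute_ruler")
    # Match the token 'vos' case-insensitively and force its coarse POS.
    ruler.add(patterns=[[{"LOWER": "vos"}]], attrs={"POS": "PRON"})
    return nlp


nlp = build_pipeline()
doc = nlp("Vos es que me quieres matar al niño o qué?")

# Collect the POS assigned to every occurrence of 'vos'.
vos_tags = [(token.text, token.pos_) for token in doc if token.lower_ == "vos"]
print(vos_tags)
```

This only overwrites the tag on the matched token. It does not re-run the parser, so the dependency guesses around 'vos' stay whatever the model originally predicted, which is exactly the limitation described above.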

What solutions can I use to work around this issue? My application requires an accurate detection of verbs and their conjugation forms.

Example of issue:

Vos es que me quieres matar al niño o qué?

Edit: This seems to have been resolved by using the transformer model for Spanish. I was using es_core_news_lg at the time of posting. What causes this difference?

Thanks!



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
