Best practice for dealing with NLP input in multiple languages for combined text analysis?
As part of a university research project, I scraped job posts for 4 professions in Germany. Because I could not collect enough job posts in a single language within my time frame, I decided to scrape both English and German posts.
I have already run the whole NLP workflow on both the English and the German text (tokenization, lemmatization, POS tagging, stopword removal, ...), using different tools for each language.
Now I would need to extract the most common skills required for each profession and differences between them.
I realize that this is a problem I should have anticipated, but I now have two corpora in two different languages that have to be analyzed together.
What do you suggest is the best way to reach a scientifically sound end result with input data in two languages?
So far, I have not come up with a good solution. The options I have considered are:
- translate the German input to English and process it together with the rest
- translate the German output word by word after processing
- manually map English and German words to each other
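The third option (a manual mapping) can be sketched in a few lines. Everything below is illustrative: the mapping dictionary and the toy corpora are made up, and in practice the mapping would have to be built by hand or from a bilingual lexicon, which is exactly the labor-intensive part.

```python
from collections import Counter

# Hypothetical manual German->English lemma mapping (illustrative only).
de_to_en = {
    "teamfähigkeit": "teamwork",
    "kommunikation": "communication",
    "python": "python",  # tool names and loanwords often need no mapping
}

def normalize(tokens, mapping):
    """Map German lemmas to their English equivalents; pass unknowns through."""
    return [mapping.get(t, t) for t in tokens]

# Toy lemmatized corpora: one token list per job post.
en_posts = [["python", "communication"], ["teamwork", "python"]]
de_posts = [["python", "teamfähigkeit"], ["kommunikation", "python"]]

# Count skill mentions over the combined, normalized corpus.
combined = Counter()
for post in en_posts:
    combined.update(post)
for post in de_posts:
    combined.update(normalize(post, de_to_en))

print(combined.most_common(3))
```

Any token missing from the mapping silently stays German, so in a real run you would also want to log unmapped tokens to see how much of the vocabulary the mapping actually covers.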
Solution 1:
I work at a company that analyses news agency data in various languages. All our analytics process English text only; foreign-language input is machine translated, and this gives good results.
I would suggest that this should also work for job adverts, as they are a very restricted domain. You're not looking at literature or poetry, where translation would cause real problems.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Oliver Mason |
