Best practice for dealing with NLP input in multiple languages for combined text analysis?

As part of a university research project, I scraped job posts for 4 professions in Germany. Because I could not get enough job posts in a single language within my time frame, I decided to scrape both English and German posts.

I already went through the whole NLP workflow with both the English and the German text (tokenization, lemmatization, POS tagging, stop-word removal, ...), using different tools for each language.

Now I need to extract the most common skills required for each profession, as well as the differences between professions.

I realize this is a problem I should have anticipated, but now I have two corpora in two different languages which have to be analyzed together.

What do you suggest is the best way to reach a scientifically sound end result with input data in two languages?

So far, no good solution has come to mind:

  • translate the German input to English and process it together with the rest
  • translate the already-processed German output word by word
  • manually map English and German words to each other
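To illustrate the last option, here is a minimal sketch of manually mapping German lemmas onto an English skill vocabulary and then counting both corpora together. All lemmas and mapping entries below are invented examples, not data from the actual project:

```python
from collections import Counter

# Hypothetical lemmas as they would come out of the two existing
# per-language NLP pipelines (illustrative examples only).
english_lemmas = ["python", "sql", "teamwork", "python"]
german_lemmas = ["Python", "Datenbank", "Teamarbeit"]

# Hand-built German -> English skill mapping (the "manual map" option).
# These entries are assumptions for the sketch, not an exhaustive list.
de_to_en = {
    "Python": "python",
    "Datenbank": "sql",
    "Teamarbeit": "teamwork",
}

# Normalize German lemmas into the English vocabulary; unmapped terms
# are kept as-is so they can be reviewed and added to the map later.
normalized_german = [de_to_en.get(w, w) for w in german_lemmas]

# Merge both corpora and count skill frequencies jointly.
combined = Counter(english_lemmas + normalized_german)
print(combined.most_common(3))
```

The obvious cost is that the mapping has to be curated by hand, which only scales if the skill vocabulary in job adverts is fairly small.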


Solution 1:[1]

I work at a company that analyses news agency data in various languages. All our analytics process English text only. Foreign-language input is machine translated, and this gives good results.

I would suggest that this should also work for job adverts, as it is a very restricted domain. You're not looking at literature or poetry, where translation would cause real problems.
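Once everything is in one language, the per-profession comparison the question asks for becomes straightforward. A minimal sketch, assuming the skill counts per profession have already been extracted (the counts and skill names below are made up for illustration):

```python
from collections import Counter

# Hypothetical per-profession skill counts after both corpora have been
# translated and normalized into a single English vocabulary.
dev_skills = Counter({"python": 30, "sql": 20, "teamwork": 10})
analyst_skills = Counter({"sql": 25, "excel": 15, "teamwork": 12})

def relative(counts):
    """Convert raw counts to relative frequencies, so professions with
    different numbers of scraped posts remain comparable."""
    total = sum(counts.values())
    return {skill: n / total for skill, n in counts.items()}

dev_rel = relative(dev_skills)
analyst_rel = relative(analyst_skills)

# Skills over-represented in developer posts relative to analyst posts.
diff = {s: dev_rel.get(s, 0) - analyst_rel.get(s, 0)
        for s in set(dev_rel) | set(analyst_rel)}
top = max(diff, key=diff.get)
print(top)
```

Comparing relative rather than raw frequencies is the key design choice here, since the two scraped corpora will almost certainly differ in size.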

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Oliver Mason