How to sed replace UTF-8 characters with HTML entities in another file?

I'm running Cygwin under Windows 10.

I have a dictionary file (1-dictionary.txt) that looks like this:

labelling   labeling
flavour flavor
colour  color
organisations   organizations
végétales   v&#x00E9;g&#x00E9;tales
contrôlée   contr&#x00F4;l&#x00E9;e
"   &#x0022;

The separators between the two columns are TABs (\t).

The dictionary file is encoded as UTF-8.

I want to replace the words and symbols in the first column with the words and HTML entities in the second column.

My source file (2-source.txt) contains the target UTF-8 and ASCII symbols. The source file is also encoded as UTF-8.

Sample text looks like this:

Cultivar was coined by Bailey and it is generally regarded as a portmanteau of "cultivated" and "variety" ... The International Union for the Protection of New Varieties of Plants (UPOV - French: Union internationale pour la protection des obtentions végétales) offers legal protection of plant cultivars ...Terroir is the basis of the French wine appellation d'origine contrôlée (AOC) system

I run the following sed one-liner in a shell script (./3-script.sh):

sed -f <(sed -E 's_(.+)\t(.+)_s/\1/\2/g_' 1-dictionary.txt) 2-source.txt > 3-translation.txt
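
To make the one-liner easier to follow, here is a minimal sketch that prints the script the inner sed generates. The two-entry dictionary is a hypothetical sample written to a scratch file, and the sketch assumes bash and GNU sed (\t in the regex is a GNU extension).

```shell
#!/usr/bin/env bash
# Hypothetical two-entry dictionary in a scratch file (TAB-separated).
dict="$(mktemp)"
printf 'colour\tcolor\n"\t&#x0022;\n' > "$dict"

# Same inner command as in the one-liner: each dictionary line
# becomes one command of the form s/<column 1>/<column 2>/g.
generated="$(sed -E 's_(.+)\t(.+)_s/\1/\2/g_' "$dict")"
printf '%s\n' "$generated"

rm -f "$dict"
```

The outer sed then runs these generated commands, in dictionary order, against 2-source.txt.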

The substitution of English (en-GB) words with American (en-US) words in 3-translation.txt is successful.

However, the substitution of ASCII symbols, such as the quote symbol, and of UTF-8 words produces this result:

vvégétales#x00E9;gvégétales#x00E9;tales)
contrcontrôlée#x00F4;lcontrôlée#x00E9;e (AOC)

If I use only the specific symbol (not the full word) in the dictionary, I get results like this:

vé#x00E9;gé#x00E9;tales
"#x0022cultivated"#x0022
contrô#x00F4;lé#x00E9;e

The ASCII quote symbol is appended with &#x0022; - it is not replaced.

Similarly, the UTF-8 symbol is appended with its HTML entity - not replaced with the HTML entity.

The expected output would look like this:

v&#x00E9;g&#x00E9;tales
&#x0022;cultivated&#x0022;
contr&#x00F4;l&#x00E9;e

How can I modify the sed script so that the target ASCII and UTF-8 symbols are replaced with their HTML entity equivalents as defined in the dictionary file?
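
One possible explanation for the appended output, offered as a sketch rather than a confirmed diagnosis: in a sed replacement, an unescaped & stands for the whole matched text, so an entity such as &#x0022; re-inserts the match and then adds #x0022;. The variant below escapes every & in each dictionary line before the s command is built. It assumes bash (for process substitution), GNU sed (for \t), and a first column that contains no &, / or regex metacharacters. The sample file contents are hypothetical.

```shell
#!/usr/bin/env bash
# Work in a scratch directory so no real files are overwritten.
cd "$(mktemp -d)" || exit 1

# Hypothetical sample data, TAB-separated like 1-dictionary.txt.
printf 'colour\tcolor\n' > 1-dictionary.txt
printf 'végétales\tv&#x00E9;g&#x00E9;tales\n' >> 1-dictionary.txt
printf '"\t&#x0022;\n' >> 1-dictionary.txt
printf '%s\n' 'des obtentions végétales) and "cultivated" colour' > 2-source.txt

# First expression: turn every & into \& so the outer sed treats it literally.
# Second expression: build s/<column 1>/<column 2>/g exactly as before.
sed -f <(sed -E 's_&_\\&_g; s_(.+)\t(.+)_s/\1/\2/g_' 1-dictionary.txt) \
    2-source.txt > 3-translation.txt

result="$(cat 3-translation.txt)"
printf '%s\n' "$result"
```

With this sample data the script prints: des obtentions v&#x00E9;g&#x00E9;tales) and &#x0022;cultivated&#x0022; color. Dictionary entries containing / or a backslash would need the same kind of escaping, which this sketch leaves out.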

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
