Translating XLIFF files using BeautifulSoup

I am translating an XLIFF file using the BeautifulSoup and googletrans packages. I managed to extract all the strings, translate them, and replace them by creating a new tag with the translation, e.g.

<trans-unit id="100890::53706_004">
<source>Continue in store</source>
<target>Kontynuuj w sklepie</target>
</trans-unit>

The problem appears when the source tag has other tags inside.

e.g.

<source><x ctype="x-htmltag" equiv-text="&lt;b&gt;" id="html_tag_191"/>Choose your product\
<x ctype="x-htmltag" equiv-text="&lt;/b&gt;" id="html_tag_192"/>From a list: </source>

The number of these tags varies, and so does the position of the text between them, e.g. <source> text1 <x /> <x/> text2 <x/> text3 </source>. Each x tag is unique, with its own id and attributes.

Is there a way to modify the text inside the tag without having to create a new tag? I was thinking I could extract the x tags and their attributes, but the order of strings and x tags differs so much from one line to the next that I'm not sure how to do that. Maybe there is another package better suited for translating XLIFF files?



Solution 1:[1]

You can use a for-loop to work through all the children of <source>.
You can duplicate each child with copy.copy(child) and append the copy to <target>.
At the same time, you can check whether the child is a NavigableString and, if so, replace its text with the translation.


text = '''<source><x ctype="x-htmltag" equiv-text="&lt;b&gt;" id="html_tag_191"/>Choose your product\
<x ctype="x-htmltag" equiv-text="&lt;/b&gt;" id="html_tag_192"/>From a list: </source>'''

conversions = {
    'Choose your product': 'Wybierz swój produkt',
    'From a list: ': 'Z listy: ',
}

from bs4 import BeautifulSoup as BS
from bs4.element import NavigableString
import copy

#soup = BS(text, 'html.parser')  # does not parse this correctly: HTML treats <source> as a void element
#soup = BS(text, 'html5lib')     # has the same problem
soup = BS(text, 'lxml')

# create `<target>`
target = soup.new_tag('target')

# add `<target>` after `<source>`
source = soup.find('source')
source.insert_after(target)

# work with children in `<source>`
for child in source:
    print('type:', type(child))

    # duplicate child and add to `<target>`
    child = copy.copy(child)
    target.append(child)

    # translate the text and replace it in the copy inside `<target>`
    if isinstance(child, NavigableString):
        new_text = conversions[str(child)]
        child.replace_with(new_text)

print('--- target ---')
print(target)
print('--- source ---')
print(source)
print('--- soup ---')
print(soup)

Result (slightly reformatted to make it more readable):

type: <class 'bs4.element.Tag'>
type: <class 'bs4.element.NavigableString'>
type: <class 'bs4.element.Tag'>
type: <class 'bs4.element.NavigableString'>

--- target ---

<target>
  <x ctype="x-htmltag" equiv-text="&lt;b&gt;" id="html_tag_191"></x>
  Wybierz swój produkt
  <x ctype="x-htmltag" equiv-text="&lt;/b&gt;" id="html_tag_192"></x>
  Z listy: 
</target>

--- source ---

<source>
  <x ctype="x-htmltag" equiv-text="&lt;b&gt;" id="html_tag_191"></x>
  Choose your product
  <x ctype="x-htmltag" equiv-text="&lt;/b&gt;" id="html_tag_192"></x>
  From a list: 
</source>

--- soup ---

<html><body>
<source>
  <x ctype="x-htmltag" equiv-text="&lt;b&gt;" id="html_tag_191"></x>
  Choose your product
  <x ctype="x-htmltag" equiv-text="&lt;/b&gt;" id="html_tag_192"></x>
  From a list: 
</source>
<target>
  <x ctype="x-htmltag" equiv-text="&lt;b&gt;" id="html_tag_191"></x>
  Wybierz swój produkt
  <x ctype="x-htmltag" equiv-text="&lt;/b&gt;" id="html_tag_192"></x>
  Z listy: 
</target>
</body></html>
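
The question also asks whether the text can be changed without building the new tag piece by piece. Here is a minimal sketch of one way to do it, not a definitive implementation: copy the whole <source> tag, rename the copy to <target>, and replace only its text nodes. It assumes the 'xml' parser (which needs lxml) instead of the HTML one. The names translate, add_target and LOOKUP are hypothetical, and the lookup table only stands in for a real translation call such as googletrans.

```python
import copy

from bs4 import BeautifulSoup

xliff = (
    '<trans-unit id="100890::53706_004">'
    '<source><x ctype="x-htmltag" equiv-text="&lt;b&gt;" id="html_tag_191"/>'
    'Choose your product'
    '<x ctype="x-htmltag" equiv-text="&lt;/b&gt;" id="html_tag_192"/>'
    'From a list: </source>'
    '</trans-unit>'
)

# Hypothetical stand-in for a real translation call (for example googletrans).
LOOKUP = {
    'Choose your product': 'Wybierz swój produkt',
    'From a list:': 'Z listy:',
}


def translate(text):
    # Fall back to the original text when no translation is known.
    return LOOKUP.get(text, text)


def add_target(source, translate):
    """Copy <source> into a sibling <target> and translate its text nodes."""
    target = copy.copy(source)  # copies the tag together with its children
    target.name = 'target'
    # find_all returns a list, so replacing nodes while looping is safe.
    for node in target.find_all(string=True):
        core = node.strip()
        if not core:
            continue  # whitespace-only node, nothing to translate
        start = node.index(core)
        lead, trail = node[:start], node[start + len(core):]
        # Keep the surrounding whitespace, translate only the core text.
        node.replace_with(lead + translate(core) + trail)
    source.insert_after(target)
    return target


soup = BeautifulSoup(xliff, 'xml')  # assumption: lxml is installed
target = add_target(soup.find('source'), translate)
print(target)
```

Because every <x/> tag is copied along with its attributes, the number and order of inline tags does not matter; only the strings between them change, and <source> itself is left untouched.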

Solution 2:[2]

To extract the two text entries from within <source>, you could use the following approach:

from bs4 import BeautifulSoup

html = """<source><x ctype="x-htmltag" equiv-text="&lt;b&gt;" id="html_tag_191"/>Choose your product\
<x ctype="x-htmltag" equiv-text="&lt;/b&gt;" id="html_tag_192"/>From a list: </source>"""

soup = BeautifulSoup(html, 'lxml')
print(list(soup.source.stripped_strings))

Giving you:

['Choose your product', 'From a list:']
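
Note that stripped_strings drops the trailing space of 'From a list: ', so the stripped values no longer match the text nodes exactly. The sketch below is only an illustration, assuming the 'xml' parser (which needs lxml): it compares .strings with .stripped_strings and splits each string into leading whitespace, core text, and trailing whitespace, so the whitespace can be put back after translating. The name parts is hypothetical.

```python
from bs4 import BeautifulSoup

xml = (
    '<source><x ctype="x-htmltag" equiv-text="&lt;b&gt;" id="html_tag_191"/>'
    'Choose your product'
    '<x ctype="x-htmltag" equiv-text="&lt;/b&gt;" id="html_tag_192"/>'
    'From a list: </source>'
)

soup = BeautifulSoup(xml, 'xml')  # assumption: lxml is installed

# .strings keeps each text node exactly as it appears in the document.
raw = [str(s) for s in soup.source.strings]
# .stripped_strings removes leading and trailing whitespace.
stripped = list(soup.source.stripped_strings)

print(raw)       # ['Choose your product', 'From a list: ']
print(stripped)  # ['Choose your product', 'From a list:']

# Split every raw string into (leading whitespace, core text, trailing whitespace).
parts = []
for s in raw:
    core = s.strip()
    start = s.index(core)
    parts.append((s[:start], core, s[start + len(core):]))

print(parts)
```

Sending only the core text to the translator and re-attaching the saved whitespace afterwards keeps the spacing around the inline tags as it was in the source.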

Solution 3:[3]

I would recommend not parsing XLIFF files with a generic XML parser. Instead, try to find a specialized XLIFF toolkit. There are a few Python projects around, but I don't have experience with them (I'm mostly a Java guy).

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 Martin Evans
Solution 3