Extract attributes from span

I have an XML file and I need to extract some content from it. The file structure looks like this:

<span class="ocr_line" id="line_1_1" title="bbox 394 185 1993 247">
      <span class="ocrx_word" id="word_1_1" title="bbox 394 191 535 242; x_entity company_name 0 ; baseline 394 242.21 535 242.21; x_height 208.14; x_style sansSerif bold none">1908</span>
        
  

I only want the <span class="ocrx_word"> elements. I already get those with:

from bs4 import BeautifulSoup as bs

with open("/home/neichfel/Documents/test.xml", "r") as file:
    # Read the whole file into a single string
    content = file.read()

# Parse the markup with the lxml parser
bs_content = bs(content, "lxml")
# Collect every <span class="ocrx_word"> element
ocrx_words = bs_content.find_all("span", {"class": "ocrx_word"})
print(ocrx_words)

I have been struggling with the rest for days. From each element in this list (ocrx_words) I need the x_entity value from the "title" attribute and the text inside the span. Sometimes x_entity is empty and sometimes it has a value. I already get the span text with:

lines_structure = []

for line in ocrx_words:
    line_text = line.text.replace("\n", " ").strip()
    lines_structure.append(line_text)

print(lines_structure)

What I want in the end is a list of rows like

x_entity | text

Afterwards I convert it into a DataFrame, which I already know how to do. The only missing piece is extracting x_entity :(

Sorry if this is a bit messy, I'm new to programming. Maybe you can help me out. Thanks!
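
One possible approach, as a minimal sketch rather than a definitive answer: treat the "title" attribute as a list of fields separated by ";" and pick out the field whose first word is x_entity. The function name extract_x_entity is made up for illustration. The sketch assumes every title follows the layout shown in the sample above, and it returns everything after the x_entity key (here "company_name 0"), since the question does not say what the trailing number means. The second sample title is invented to show the empty case.

```python
def extract_x_entity(title):
    """Return the value of the x_entity field in a title string.

    Assumes fields are separated by ';' and each field starts with its name.
    Returns "" when x_entity is missing or has no value.
    """
    for field in title.split(";"):
        parts = field.split(None, 1)  # field name, then the rest of the field
        if parts and parts[0] == "x_entity":
            return parts[1].strip() if len(parts) > 1 else ""
    return ""


# The first title is copied from the question; the second is a made-up
# example of a title whose x_entity field is empty.
titles = [
    "bbox 394 191 535 242; x_entity company_name 0 ; baseline 394 242.21 535 242.21; x_height 208.14; x_style sansSerif bold none",
    "bbox 10 20 30 40; x_entity ; baseline 10 40 30 40",
]
entities = [extract_x_entity(t) for t in titles]
print(entities)
```

With the ocrx_words list from the question, the x_entity | text rows could then be built with something like [(extract_x_entity(w.get("title", "")), w.text.strip()) for w in ocrx_words], where w.get("title", "") reads the attribute from each BeautifulSoup tag and falls back to an empty string if it is absent.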



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
