Parsing HTML into sentences - how to handle tables/lists/headings/etc?

How do you go about parsing an HTML page with free text, lists, tables, headings, etc., into sentences?

Take this Wikipedia page, for example: it contains free text, lists, tables, headings, and so on.

After messing around with Python's NLTK, I want to test out all of these different corpus annotation methods (from http://nltk.googlecode.com/svn/trunk/doc/book/ch11.html#deciding-which-layers-of-annotation-to-include):

  • Word Tokenization: The orthographic form of text does not unambiguously identify its tokens. A tokenized and normalized version, in addition to the conventional orthographic version, may be a very convenient resource.
  • Sentence Segmentation: As we saw in Chapter 3, sentence segmentation can be more difficult than it seems. Some corpora therefore use explicit annotations to mark sentence segmentation.
  • Paragraph Segmentation: Paragraphs and other structural elements (headings, chapters, etc.) may be explicitly annotated.
  • Part of Speech: The syntactic category of each word in a document.
  • Syntactic Structure: A tree structure showing the constituent structure of a sentence.
  • Shallow Semantics: Named entity and coreference annotations, semantic role labels.
  • Dialogue and Discourse: Dialogue act tags, rhetorical structure.

Once you break a document into sentences, it seems pretty straightforward. But how do you go about breaking down something like the HTML from that Wikipedia page? I am very familiar with using HTML/XML parsers and traversing the tree, and I have tried just stripping the HTML tags to get the plain text, but because punctuation is missing after the HTML is removed, NLTK doesn't parse things like table cells, or even lists, correctly.
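
To make the problem concrete, here is a tiny illustration (the HTML snippet and the naive regex are made up purely for demonstration): once the tags are stripped, there is no punctuation between the cells, so NLTK's sentence tokenizer lumps them together.

import re
from nltk.tokenize import sent_tokenize  # needs the punkt model downloaded

html = "<table><tr><td>First cell</td><td>Second cell</td></tr></table>"
plain = re.sub(r'<[^>]+>', ' ', html)   # naive tag stripping
print(sent_tokenize(plain.strip()))     # both cells come back as one "sentence"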

Is there some best practice or strategy for parsing that kind of content with NLP, or do you just have to hand-write a parser specific to each individual page?

I'm just looking for some pointers in the right direction; I really want to give NLTK a try!



Solution 1:[1]

Sounds like you're stripping all HTML and generating a flat document, which confuses the parser since the loose pieces are stuck together. Since you are experienced with XML, I suggest mapping your inputs to a simple XML structure that keeps the pieces separate. You can make it as simple as you want, but perhaps you'll want to retain some information. E.g., it may be useful to flag titles, section headings etc. as such. When you've got a workable XML tree that keeps the chunks separate, use XMLCorpusReader to import it into the NLTK universe.
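
A minimal sketch of that pipeline, assuming lxml on the HTML side; the simple tag names (<heading>, <para>) and the file names are just placeholders, not anything NLTK requires:

import lxml.html
from lxml import etree
from nltk.corpus.reader import XMLCorpusReader

# map a few HTML block tags onto a deliberately simple XML structure
# (this ignores nesting, titles, etc. - refine as needed)
html = lxml.html.parse('page.html').getroot()
doc = etree.Element('document')
for node in html.iter():
    if node.tag in ('h1', 'h2', 'h3'):
        el = etree.SubElement(doc, 'heading')
    elif node.tag in ('p', 'li', 'td'):
        el = etree.SubElement(doc, 'para')
    else:
        continue
    el.text = ' '.join(node.itertext()).strip()

etree.ElementTree(doc).write('page.xml', encoding='utf-8')

# XMLCorpusReader treats the text content of the XML file as a token stream
reader = XMLCorpusReader('.', ['page.xml'])
print(reader.words('page.xml')[:20])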

Solution 2:[2]

I had to write rules specific to the XML docs I was analyzing.

What I did was build a mapping from HTML tags to segments. The mapping was based on studying several documents/pages and deciding what each HTML tag represents, e.g. <h1> is a phrase segment, <li> is a paragraph, and <td> is a token.

If you want to work with XML, you can represent the new mappings as tags, e.g. <h1> becomes <phrase>, <li> becomes <paragraph>, and <td> becomes <token>.

If you want to work with plain text, you can represent the mappings as marker strings (e.g. [PHRASESTART] and [PHRASEEND]), just like POS or EOS labeling.
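
A rough sketch of the plain-text variant; the mapping dictionary and the marker strings below are purely illustrative (a real mapping would come from studying your own pages):

import lxml.html

# illustrative tag -> segment mapping
TAG_MAP = {'h1': 'PHRASE', 'h2': 'PHRASE', 'li': 'PARAGRAPH', 'td': 'TOKEN'}

root = lxml.html.parse('page.html').getroot()

lines = []
for node in root.iter():
    segment = TAG_MAP.get(node.tag)
    if segment:
        text = ' '.join(node.itertext()).strip()
        lines.append('[{0}START] {1} [{0}END]'.format(segment, text))

print('\n'.join(lines))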

Solution 3:[3]

You can use tools like python-goose, which aims at extracting the article text from HTML pages.

Otherwise, I wrote the following small program, which gives reasonably good results:

import html5lib


with open('page.html') as f:
    doc = html5lib.parse(f.read(), treebuilder='lxml', namespaceHTMLElements=False)

html = doc.getroot()
body = html.xpath('//body')[0]


def sanitize(element):
    """Return all the text contained in an element as a single line.
    This must only be called on blocks whose children are all inline
    elements.
    """
    # join all the text nodes and drop newlines
    out = ' '.join(element.itertext()).replace('\n', ' ')
    # collapse runs of whitespace into a single space
    out = ' '.join(out.split())
    return out


def extract_blocks(element):
    # these elements can contain other blocks nested inside them
    if element.tag in ['div', 'li', 'a', 'body', 'ul']:
        if element.text is None or element.text.isspace():
            for child in element:
                yield from extract_blocks(child)
        else:
            yield sanitize(element)
    # these elements are "guaranteed" to contain only inline children
    elif element.tag in ['p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
        yield sanitize(element)
    else:
        try:
            print('> ignored', element.tag)
        except Exception:
            pass


for e in filter(lambda x: len(x) > 80, extract_blocks(body)):
    print(e)
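
From there, each block can be split into sentences; a minimal follow-up using NLTK's sentence tokenizer (assuming the punkt model has been downloaded) might look like this:

from nltk.tokenize import sent_tokenize

for block in filter(lambda x: len(x) > 80, extract_blocks(body)):
    for sentence in sent_tokenize(block):
        print(sentence)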

Solution 4:[4]

As alexis answered, python-goose may be a good option.

There is also HTML Sentence Tokenizer, a newer library that aims to solve exactly this problem. Its syntax is very simple: in one line, parsed_sentences = HTMLSentenceTokenizer().feed(example_html_one), you get the sentences of an HTML page stored in the list parsed_sentences.

Solution 5:[5]

I had the same problem and ended up solving it like this:

import lxml.html
import spacy
nlp = spacy.load("en_core_web_lg")

text = """
<div>
    text here
    <h1>some header</h1>
    <p>a paragraph with something <span>bold</span> in it. Another sentence here. </p>
    <div>another div 
        <div>with a div inside</div>
    </div>
    more text here
</div>
"""


results = []

def convert(item):
    # collect any text that appears directly at the start of this element
    if item.text and item.text.strip():
        results.append(item.text.strip())

    for child in item:
        if child.tag in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p']:
            # headings and paragraphs contain only inline content,
            # so flatten each one into a single string
            results.append(''.join(child.itertext()).strip())
        else:
            # recurse into other containers (div, span, ...)
            convert(child)

        # pick up text that follows the child inside its parent
        if child.tail and child.tail.strip():
            results.append(child.tail.strip())


root = lxml.html.fromstring(text)
convert(root)

sentences = []
for result in results:
    doc = nlp(result)
    sentences.extend(sent.text for sent in doc.sents)

This gives the following output:

['text here',
 'some header',
 'a paragraph with something bold in it.',
 'Another sentence here.',
 'another div',
 'with a div inside',
 'more text here']

Thanks to @furas for their help with the recursion! See also: Extracting text with parent tag type from HTML using Python.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: alexis
Solution 2: ezio808
Solution 3: amirouche
Solution 4: BlueOxile
Solution 5: Muriel