In Python, how do I extract multiple blocks of text that begin with the same pattern but have no distinct end?

Given a test string:

teststr= 'chapter 1 Here is a block of text from chapter one.  chapter 2 Here is another block of text from the second chapter.  chapter 3 Here is the third and final block of text.'

I want to create a list of results like this:

result=['chapter 1 Here is a block of text from chapter one.','chapter 2 Here is another block of text from the second chapter.','chapter 3 Here is the third and final block of text.']

Using re.findall('chapter [0-9]',teststr)

I get ['chapter 1', 'chapter 2', 'chapter 3']

That's fine if all I wanted were the chapter numbers, but I want the chapter number plus all the text up to the next chapter number. In the case of the last chapter, I want to get the chapter number and the text all the way to the end.

Trying re.findall('chapter [0-9].*',teststr) yields the greedy result: ['chapter 1 Here is a block of text from chapter one. chapter 2 Here is another block of text from the second chapter. chapter 3 Here is the third and final block of text.']

I'm not great with regular expressions so any help would be appreciated.



Solution 1:[1]

In general, an extraction regex looks like

(?s)pattern.*?(?=pattern|$)

Or, if the pattern is at the start of a line,

(?sm)^pattern.*?(?=\npattern|\Z)
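As a rough illustration of the line-anchored variant, the sketch below runs it on a made-up multi-line string (the sample text and variable names are assumptions, not part of the question). The (?s) flag lets . cross line breaks and (?m) lets ^ match at the start of every line.

```python
import re

# Hypothetical multi-line input in which every block starts on its own line.
text = "chapter 1\nFirst block.\nMore of it.\nchapter 2\nSecond block.\n"

# (?s): . also matches newlines; (?m): ^ matches at the start of each line.
pattern = r"(?sm)^chapter [0-9].*?(?=\nchapter [0-9]|\Z)"

# Strip each match to drop the trailing newline left on the last block.
blocks = [block.strip() for block in re.findall(pattern, text)]
print(blocks)
# => ['chapter 1\nFirst block.\nMore of it.', 'chapter 2\nSecond block.']
```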

Here, you could use

re.findall(r'chapter [0-9].*?(?=chapter [0-9]|\Z)', teststr)

Details:

  • chapter [0-9] - the literal text chapter, a space, and one digit
  • .*? - zero or more characters (other than line breaks, unless re.S or (?s) is used), as few as possible
  • (?=chapter [0-9]|\Z) - a positive lookahead that matches a location immediately followed by chapter, a space, and a digit, or by the end of the whole string.
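Applied to the test string from the question, the findall approach might look like the sketch below. Each match keeps the whitespace that precedes the next chapter, so the matches are stripped afterwards; the names matches and result are only illustrative.

```python
import re

teststr = 'chapter 1 Here is a block of text from chapter one.  chapter 2 Here is another block of text from the second chapter.  chapter 3 Here is the third and final block of text.'

# Lazy match from each "chapter N" up to the next "chapter N" or the end of the string.
matches = re.findall(r'chapter [0-9].*?(?=chapter [0-9]|\Z)', teststr)

# The lookahead does not consume the separating spaces, so strip them here.
result = [m.strip() for m in matches]
print(result)
# => ['chapter 1 Here is a block of text from chapter one.', 'chapter 2 Here is another block of text from the second chapter.', 'chapter 3 Here is the third and final block of text.']
```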

Here, since the text starts with the keyword, you may use

import re
teststr= 'chapter 1 Here is a block of text from chapter one.  chapter 2 Here is another block of text from the second chapter.  chapter 3 Here is the third and final block of text.'
my_result = [x.strip() for x in re.split(r'(?!^)(?=chapter \d)', teststr)]
print( my_result )
# => ['chapter 1 Here is a block of text from chapter one.', 'chapter 2 Here is another block of text from the second chapter.', 'chapter 3 Here is the third and final block of text.']

The (?!^)(?=chapter \d) regex means:

  • (?!^) - find a location that is not at the start of the string and
  • (?=chapter \d) - is immediately followed by chapter, a space, and any digit.

The pattern splits the string at those locations without consuming any characters, so the whitespace between chapters stays in the pieces. The list comprehension strips it from each result.
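Splitting on a pattern that can match an empty string requires Python 3.7 or later. As a possible alternative that does not rely on zero-width splits, the sketch below swaps in a different technique: it collects the start offset of every match with re.finditer and slices the string between consecutive offsets. The helper name split_at_starts and its start_pattern parameter are hypothetical, and unlike re.split this version drops any text that precedes the first match.

```python
import re

def split_at_starts(text, start_pattern=r'chapter \d'):
    """Slice text into blocks that each begin where start_pattern matches."""
    # Offsets at which a new block begins.
    starts = [m.start() for m in re.finditer(start_pattern, text)]
    # Append the end of the text so the last block runs to the end.
    bounds = starts + [len(text)]
    # Pair each start with the next boundary; no matches yields an empty list.
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]

teststr = 'chapter 1 Here is a block of text from chapter one.  chapter 2 Here is another block of text from the second chapter.  chapter 3 Here is the third and final block of text.'
print(split_at_starts(teststr))
# => ['chapter 1 Here is a block of text from chapter one.', 'chapter 2 Here is another block of text from the second chapter.', 'chapter 3 Here is the third and final block of text.']
```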

Solution 2:[2]

If you don't have to use a regex, try this:

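def split(text):
    chapters = []

    this_chapter = ""
    for i, c in enumerate(text):
        # A new chapter starts wherever "chapter " is followed by a digit.
        # Slicing (not indexing) avoids an IndexError if the text ends with "chapter ".
        if text.startswith("chapter ", i) and text[i + 8:i + 9].isdigit():
            if this_chapter.strip():
                chapters.append(this_chapter.strip())
            this_chapter = c
        else:
            this_chapter += c

    # Flush the last chapter, which has no following heading to trigger the append.
    chapters.append(this_chapter.strip())

    return chapters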

print(split('chapter 1 Here is a block of text from chapter one.  chapter 2 Here is another block of text from the second chapter.  chapter 3 Here is the third and final block of text.'))

Output:

['chapter 1 Here is a block of text from chapter one.', 'chapter 2 Here is another block of text from the second chapter.', 'chapter 3 Here is the third and final block of text.']

Solution 3:[3]

You're looking for re.split. Assuming up to 99 chapters (note that the split consumes each chapter N heading, so the headings are missing from the output):

import re
teststr= 'chapter 1 Here is a block of text from chapter one.  chapter 2 Here is another block of text from the second chapter.  chapter 3 Here is the third and final block of text.'

chapters = [i.strip() for i in re.split(r'chapter \d{1,2}', teststr)[1:]]

Output:

['Here is a block of text from chapter one.',
 'Here is another block of text from the second chapter.',
 'Here is the third and final block of text.']
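If the headings should stay attached, as in the result the question asks for, one option is to wrap the pattern in a capturing group: re.split then returns the captured separators along with the text between them. The sketch below is one way to stitch them back together; the names parts, headings, bodies and chapters_with_headings are only illustrative.

```python
import re

teststr = 'chapter 1 Here is a block of text from chapter one.  chapter 2 Here is another block of text from the second chapter.  chapter 3 Here is the third and final block of text.'

# With a capturing group, the matched headings are kept in the result list.
parts = re.split(r'(chapter \d{1,2})', teststr)

# parts[0] is whatever precedes the first heading (empty here);
# after that, headings and bodies alternate.
headings = parts[1::2]
bodies = parts[2::2]

chapters_with_headings = [(h + b).strip() for h, b in zip(headings, bodies)]
print(chapters_with_headings)
# => ['chapter 1 Here is a block of text from chapter one.', 'chapter 2 Here is another block of text from the second chapter.', 'chapter 3 Here is the third and final block of text.']
```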

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 Ed Ward
Solution 3