What is a regular expression to find three asterisks with whitespace on either side: " *** "?
My goal is to use a regular expression in order to discard the header and footer information from a Project Gutenberg UTF-8 encoded text file.
Each book contains a 'start line' like so:
[...]
Character set encoding: UTF-8
Produced by: Emma Dudding, John Bickers, Dagny and David Widger
*** START OF THE PROJECT GUTENBERG EBOOK GRIMMS’ FAIRY TALES ***
Grimms’ Fairy Tales
By Jacob Grimm and Wilhelm Grimm
[...]
The footers look pretty similar:
Taylor, who made the first English translation in 1823, selecting about
fifty stories ‘with the amusement of some young friends principally in
view.’ They have been an essential ingredient of children’s reading ever
since.
*** END OF THE PROJECT GUTENBERG EBOOK GRIMMS’ FAIRY TALES ***
Updated editions will replace the previous one--the old editions will
be renamed.
My idea is to use these triple asterisk markers to discard headers and footers, since such an operation is useful for any Gutenberg release.
What is a good way to do this with regex?
Solution 1:[1]
I found that this pattern suited my purposes: it matches both the start and end marker lines, and I don't anticipate any collisions in the body of the novels:
Solution
/^\*\*\*.*\*\*\*$/m
Explanation
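The ^ matches the start of a line. Each \* escapes an asterisk, which would otherwise act as a quantifier. To keep it simple, .* matches anything in the middle, and the $ matches the end of the line. The m flag enables multiline mode, so ^ and $ match at every line boundary instead of only at the start and end of the whole file; that is needed here because the markers sit on their own lines inside a long text with regularly spaced \n newlines.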
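For anyone wanting to apply this outside a /.../m regex literal, here is a minimal sketch in Python, where re.MULTILINE plays the role of the m flag. The helper name strip_gutenberg_boilerplate and the sample text are made up for illustration, and the sketch assumes the file uses plain \n line endings and that the first and last matching lines are the start and end markers.

```python
import re

# Same pattern as above: a line that starts and ends with three asterisks.
MARKER = re.compile(r"^\*\*\*.*\*\*\*$", re.MULTILINE)


def strip_gutenberg_boilerplate(text: str) -> str:
    """Return the text between the first and last marker lines.

    Hypothetical helper: if fewer than two marker lines are found,
    the input is returned unchanged.
    """
    markers = list(MARKER.finditer(text))
    if len(markers) < 2:
        return text
    start = markers[0].end()    # just after the start marker line
    end = markers[-1].start()   # just before the end marker line
    return text[start:end].strip()


# Shortened sample modelled on the excerpts in the question.
sample = (
    "Character set encoding: UTF-8\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK GRIMMS’ FAIRY TALES ***\n"
    "Grimms’ Fairy Tales\n"
    "By Jacob Grimm and Wilhelm Grimm\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK GRIMMS’ FAIRY TALES ***\n"
    "Updated editions will replace the previous one\n"
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```

Slicing between the first and last match, rather than splitting on every match, means a stray matching line between the two markers does not cut the book short.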
Caveat
I can imagine that if any text had a line such as:
Gadzooks!
******
I woke up in a daze...
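we would end up with an accidental match, because .* can match nothing at all and the six asterisks satisfy both ends of the pattern. There is room for improvement, but for now tightening the pattern may be premature optimization.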
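If that ever becomes a problem, one possible tightening is to require the START OF / END OF wording and a space next to the asterisks. This is only a sketch in Python, based on the two marker lines quoted in the question; I have not checked it against other Gutenberg releases, and the name STRICT_MARKER is just for illustration.

```python
import re

# Stricter variant: three asterisks, at least one space or tab, the words
# "START OF" or "END OF", then anything, a space or tab, and three asterisks.
STRICT_MARKER = re.compile(
    r"^\*{3}[ \t]+(?:START|END) OF\b.*[ \t]\*{3}$",
    re.MULTILINE,
)

lines = [
    "*** START OF THE PROJECT GUTENBERG EBOOK GRIMMS’ FAIRY TALES ***",
    "*** END OF THE PROJECT GUTENBERG EBOOK GRIMMS’ FAIRY TALES ***",
    "******",  # the decorative line from the caveat
]
matches = [bool(STRICT_MARKER.search(line)) for line in lines]
print(matches)  # [True, True, False]
```

Using [ \t] rather than \s keeps the match on a single line, since \s would also accept a newline.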
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Kiteration |
