How would I match all "quote blocks" in plaintext e-mail in PHP PCRE?
I'm trying to match all the quotes in the following example e-mail message:
> Don't forget to buy eggiweggs on the way home.
I shall not.
> Also remember to brush your shoes.
Will do.
> > > And clean up after the pigs.
> > But I have no pigs.
> Yes, you do. Your kids.
I see what you mean. They sure make a mess.
That means I want to match these three strings:
> Don't forget to buy eggiweggs on the way home.
And:
> Also remember to brush your shoes.
And:
> > > And clean up after the pigs.
> > But I have no pigs.
> Yes, you do. Your kids.
I don't understand how I can do this, since if I use the s flag to span multiple lines, which is required for this, I cannot refer to ^ and $ to mean "beginning of line" and "end of line" -- instead, they mean "beginning of string" and "end of string".
So if I do: #^(> .+?)$#us, it will match everything after/with the first quote.
And if I do: #^(> .+?)$#um, it will match only the first quote's first line and nothing else.
This is frustrating. I really have no idea how to solve it. I've searched online before asking and found zero even remotely relevant pages as usual.
Solution 1:[1]
My idea is to split the string on line breaks and test each line separately. Maybe this will help you?
foreach (explode("\n", $string) as $val) {
    // Print only the lines that start with ">"
    if (preg_match('/^(>.*)$/', $val, $match)) {
        echo $match[1] . PHP_EOL;
    }
}
output:
> Don't forget to buy eggiweggs on the way home.
> Also remember to brush your shoes.
> > > And clean up after the pigs.
> > But I have no pigs.
> Yes, you do. Your kids.
Edit: I tried something else, but it is not perfect:
preg_match_all("/(>[^\n]+)/sm", $string, $match);
print_r($match[1]);
output:
Array
(
[0] => > Don't forget to buy eggiweggs on the way home.
[1] => > Also remember to brush your shoes.
[2] => > > > And clean up after the pigs.
[3] => > > But I have no pigs.
[4] => > Yes, you do. Your kids.
)
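Both snippets above return each quoted line on its own, whereas the question asks for whole blocks. As a rough sketch of how the line-splitting idea could be extended, the following groups consecutive quoted lines together. The function name quoteBlocks and the choice to treat any line starting with ">" as quoted are assumptions made for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper (not from the original answer): collects runs of
// consecutive lines that start with ">" into one string per block.
function quoteBlocks(string $text): array
{
    $blocks  = [];
    $current = [];

    // \R matches \n, \r\n and \r, so the split works for any line ending.
    foreach (preg_split('/\R/', $text) as $line) {
        if (strpos($line, '>') === 0) {
            $current[] = $line;                  // still inside a quote block
        } elseif ($current !== []) {
            $blocks[] = implode("\n", $current); // a non-quoted line ends the block
            $current  = [];
        }
    }

    if ($current !== []) {
        $blocks[] = implode("\n", $current);     // message ended inside a quote
    }

    return $blocks;
}

$string = <<<MAIL
> Don't forget to buy eggiweggs on the way home.
I shall not.
> Also remember to brush your shoes.
Will do.
> > > And clean up after the pigs.
> > But I have no pigs.
> Yes, you do. Your kids.
I see what you mean. They sure make a mess.
MAIL;

$blocks = quoteBlocks($string);
echo count($blocks), PHP_EOL; // 3
print_r($blocks);
```

Because the grouping happens in plain PHP rather than in the pattern, a quote block at the very end of the message is picked up as well.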
Solution 2:[2]
Explicitly match the end of the quote
Rather than attempting to tweak greedy/non-greedy behavior, it's easiest to solve regex problems if you can match something unambiguous at the start/end of the match.
When a quoted block continues over multiple lines, each line break is followed by another > ([newline] + >); when the block ends, the line break is followed by anything else ([newline] + [not >]). This pattern makes it possible to find whole quoted blocks.
From this logic we arrive at:
/(> .+?)\n[^>]/s
This is:
/       # Regex start delimiter
(       # Start of capturing group
>       # Literal >
        # Literal space
.       # Any character, including newlines
+?      # At least once, non-greedy match
)       # End of capturing group
\n      # Newline
[^>]    # Anything _except_ a literal >
/       # Regex end delimiter
s       # PCRE_DOTALL flag (makes . also match newlines)
The use of a non-greedy match here prevents the regex from matching from the start of the first quote all the way to the end of the last quote.
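To make the effect of the quantifier concrete, here is a small sketch that runs the pattern twice on the example message, once with +? and once with +. The variable names are illustrative only.

```php
<?php
$input = <<<STUFF
> Don't forget to buy eggiweggs on the way home.
I shall not.
> Also remember to brush your shoes.
Will do.
> > > And clean up after the pigs.
> > But I have no pigs.
> Yes, you do. Your kids.
I see what you mean. They sure make a mess.
STUFF;

// Same pattern twice; only the quantifier differs.
$lazy   = "/(> .+?)\n[^>]/s";
$greedy = "/(> .+)\n[^>]/s";

$lazyCount   = preg_match_all($lazy, $input, $lazyMatches);
$greedyCount = preg_match_all($greedy, $input, $greedyMatches);

// The lazy version stops at the first place a block can end.
// The greedy version runs on to the last such place in the message.
echo "lazy: $lazyCount, greedy: $greedyCount", PHP_EOL; // lazy: 3, greedy: 1
```

With the greedy quantifier the single capture runs from the first quoted line through "> Yes, you do. Your kids.", swallowing the replies in between.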
Choosing between (or combining) PCRE_DOTALL and PCRE_MULTILINE depends on the strategy employed - here the intent is only to modify the behavior of the dot (.). More info is in the PHP manual's page on pattern modifiers.
If the source text comes from Windows, you may wish to use \R, which matches any line-break sequence including \r\n (as noted in a different answer).
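As one possible variation that uses \R, the sketch below matches a line starting with > followed by any further lines that also start with >. It is not the pattern from this answer: it is an assumed alternative that avoids consuming the first character of the next line and also matches a quote block at the very end of the message. The sample input is made up for the example.

```php
<?php
// Made-up sample with Windows (\r\n) line endings; the last block
// runs to the end of the input with no reply after it.
$input = "> first quote\r\nreply\r\n> second quote\r\n> continues here";

// ^>[^\r\n]*          a line starting with ">" (m flag: ^ matches at each line)
// (?:\R>[^\r\n]*)*    any following lines that also start with ">"
$regex = '/^>[^\r\n]*(?:\R>[^\r\n]*)*/m';

preg_match_all($regex, $input, $matches);

echo count($matches[0]), PHP_EOL; // 2
print_r($matches[0]);
```

Using [^\r\n] instead of . keeps a trailing \r out of the match, since . would otherwise match a carriage return under PCRE's default newline setting.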
Demonstration
<?php
$input = <<<STUFF
> Don't forget to buy eggiweggs on the way home.
I shall not.
> Also remember to brush your shoes.
Will do.
> > > And clean up after the pigs.
> > But I have no pigs.
> Yes, you do. Your kids.
I see what you mean. They sure make a mess.
STUFF;
$regex = "/(> .+?)\n[^>]/s";
preg_match_all($regex, $input, $matches);
print_r($matches);
Results in:
Array
(
    [0] => Array
        (
            [0] => > Don't forget to buy eggiweggs on the way home.
I
            [1] => > Also remember to brush your shoes.
W
            [2] => > > > And clean up after the pigs.
> > But I have no pigs.
> Yes, you do. Your kids.
I
        )

    [1] => Array
        (
            [0] => > Don't forget to buy eggiweggs on the way home.
            [1] => > Also remember to brush your shoes.
            [2] => > > > And clean up after the pigs.
> > But I have no pigs.
> Yes, you do. Your kids.
        )

)
The full matches in [0] include the newline and the first character of the following line, because \n[^>] is part of the pattern; the quote blocks themselves are in the capturing group, [1].
Why don't the attempts in the question work?
So if I do:
#^(> .+?)$#us
The s modifier (PCRE_DOTALL) only affects the dot (.), letting it match newlines.
It has no effect on ^ or $, which still anchor to the start and end of the whole string - so, as noted in the question, this can produce at most one match, running from the first quote to the end of the message.
And if I do:
#^(> .+?)$#um
The m modifier (PCRE_MULTILINE) only affects ^ and $.
This regex is anchored to the start and end of each line (it does a non-greedy match, but the . will not match a newline anyway) - hence it matches each quoted line individually.
Flags are not mutually exclusive, and can be used in combination.
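To see how the flags behave alone and together, the sketch below runs both patterns from the question through preg_match_all, plus a combined variant. The third pattern and the variable names are illustrative assumptions, not part of the original answer.

```php
<?php
$input = <<<STUFF
> Don't forget to buy eggiweggs on the way home.
I shall not.
> Also remember to brush your shoes.
Will do.
> > > And clean up after the pigs.
> > But I have no pigs.
> Yes, you do. Your kids.
I see what you mean. They sure make a mess.
STUFF;

// s only: "." spans newlines, but ^ and $ still mean start/end of string.
$dotallCount = preg_match_all('#^(> .+?)$#us', $input, $dotall);

// m only: ^ and $ match at every line, but "." stops at a newline.
$multilineCount = preg_match_all('#^(> .+?)$#um', $input, $multiline);

// m and s together (assumed variant): start at a line beginning with "> ",
// span newlines, and stop before a newline that is not followed by ">",
// or at the end of the input.
$bothCount = preg_match_all('#^> .+?(?=\n(?!>)|\z)#ums', $input, $both);

echo "s: $dotallCount, m: $multilineCount, ms: $bothCount", PHP_EOL; // s: 1, m: 5, ms: 3
```

The lookahead in the third pattern checks for the end of the block without consuming anything, so the matches in $both[0] are the three quote blocks with no trailing characters.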
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
