Findall vs search for overwriting groups in Python
I found the topic "Capturing group with findall?", but unfortunately it is more basic and covers only groups that do not overwrite themselves.
Let's take a look at the following example:
import re
S = "abcabc" # string used for all the cases below
1. Findall - no groups
print(re.findall(r"abc", S)) # ['abc', 'abc']
General idea: No groups here so I expect findall to return a list of all matches - please confirm.
In this case: Findall is looking for abc, finds it, returns it, then goes on and finds the second one.
2. Findall - one explicit group
print(re.findall(r"(abc)", S)) # ['abc', 'abc']
General idea: Some groups here so I expect findall to return a list of all groups - please confirm.
In this case: Why two results while there is only one group? I understand it this way:
findall is looking for abc, finds it,
places it in the group memory buffer,
returns it,
findall starts to look for abc again, and so on...
Is this reasoning correct?
3. Findall - overwriting groups
print(re.findall(r"(abc)+", S)) # ['abc']
This looks similar to the above yet returns only one abc. I understand it this way:
findall is looking for abc, finds it,
places it in the group memory buffer,
does not return it because the RE itself demands to go on,
finds another abc, places it in the group memory buffer (overwrites previous abc),
string ends so searching ends as well.
Is this reasoning correct? I am very specific here so if there is anything wrong (even tiny detail) then please let me know.
4. Search - overwriting groups
Search scans through a string looking for a single match, so re.search(r"abc", S) and re.search(r"(abc)", S) rather obviously return only one abc, so let me get right to:
m = re.search(r"(abc)+", S)
print(m.group()) # abcabc
print(m.groups()) # ('abc',)
a) Of course the whole match is abcabc, but we still have groups here, so can I conclude that groups are irrelevant (despite the name) for m.group()? And is that why nothing gets overwritten for this method?
In fact, this grouping feature of parentheses is completely unnecessary here - in such cases I just want to use parentheses to stress what needs to be taken together when repeating things without creating any regex groups.
b) Can anyone explain the mechanism behind returning abcabc (in terms of buffers and so on), similar to what I did in bullet 3?
Solution 1:[1]
First, let me state some facts:
- A match value (match.group()) is the (sub)text that meets the whole pattern defined in a regular expression. Matches can contain zero or more capture groups.
- A capture value (match.group(1..n)) is a part of the match (that can also be equal to the whole match if the whole pattern is enclosed into a capture group) that is matched with a parenthesized pattern part (a part of the pattern enclosed into a pair of unescaped parentheses).
- Some languages can provide access to the capture collection, i.e. all the values that were captured with a quantified capture group like (\w{3})+. In Python, it is possible with the PyPI regex module, in .NET, with a CaptureCollection, etc.
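To make the difference between a match value and a capture value concrete, here is a minimal sketch using only the standard re module. The pattern a(bc) and the re-scan at the end are illustrative assumptions, not part of the original question; the re-scan only works here because every repetition is exactly three characters long.

```python
import re

S = "abcabc"

# Match value vs. capture value for a pattern with one capture group.
m = re.search(r"a(bc)", S)
whole = m.group()      # text matched by the whole pattern
captured = m.group(1)  # text matched by the parenthesized part only

# The stdlib `re` keeps only the LAST capture of a quantified group.
# Assumed workaround: re-scan the matched text in 3-character pieces
# to recover every repetition.
m2 = re.search(r"(\w{3})+", "abcedf")
all_repetitions = re.findall(r"\w{3}", m2.group())

print(whole)            # abc
print(captured)         # bc
print(all_repetitions)  # ['abc', 'edf']
```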
1: No groups here so I expect findall to return a list of all matches - please confirm.
- True. re.findall returns a list of captured submatches only if capturing groups are defined in the pattern. In the case of abc, re.findall returns a list of matches.
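A short sketch of how the shape of the re.findall result depends on the number of capture groups. The two-group pattern (a)(bc) is an added illustration, not one of the cases from the question.

```python
import re

S = "abcabc"

no_group = re.findall(r"abc", S)        # no groups: list of whole matches
one_group = re.findall(r"(abc)", S)     # one group: list of group 1 values
two_groups = re.findall(r"(a)(bc)", S)  # several groups: list of tuples

print(no_group)    # ['abc', 'abc']
print(one_group)   # ['abc', 'abc']
print(two_groups)  # [('a', 'bc'), ('a', 'bc')]
```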
2: Why two results while there is only one group?
- There are two matches, re.findall(r"(abc)", S) finds two matches in abcabc, and each match has one submatch, or captured substring, so the resulting array has 2 elements (abc and abc).
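One way to see the two separate matches is to iterate with re.finditer, which yields one match object per match. This is only a sketch; the variable name matches is chosen for illustration.

```python
import re

S = "abcabc"

# Each match object carries its own position and its own group 1 value.
matches = [(m.span(), m.group(1)) for m in re.finditer(r"(abc)", S)]

print(matches)  # [((0, 3), 'abc'), ((3, 6), 'abc')]
```

The group is filled once per match, so two matches give two list elements.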
3: Is this reasoning correct?
- The re.findall(r"(abc)+", S) is looking for a match in the form abcabcabc and so on. It will match it as a whole and will keep the last abc in the capture group 1 buffer. So, I think your reasoning is correct. "The RE itself demands to go on" can be stated more precisely: the matching is not yet complete, as there are still characters for the regex engine to test for a match.
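The following sketch compares the quantified capturing group with a non-capturing group (?:...), which groups for repetition without creating a capture. The non-capturing variant and the second test string are added here as an illustration of the question's wish to repeat things "without creating any regex groups".

```python
import re

S = "abcabc"

# Capturing group under +: one match, findall reports the LAST capture.
capturing = re.findall(r"(abc)+", S)

# Non-capturing group under +: no groups, findall reports the whole match.
non_capturing = re.findall(r"(?:abc)+", S)

# With a gap in the input there are two separate matches.
S2 = "abcabc abc"
capturing_2 = re.findall(r"(abc)+", S2)
non_capturing_2 = re.findall(r"(?:abc)+", S2)

print(capturing)        # ['abc']
print(non_capturing)    # ['abcabc']
print(capturing_2)      # ['abc', 'abc']
print(non_capturing_2)  # ['abcabc', 'abc']
```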
4: the whole match is abcabc, but we still have groups here, so can I conclude that groups are irrelevant (despite name) for m.group()?
- No, the last group value is kept in this case. If you change your regex to (\w{3})+ and the string to abcedf you will see the difference, as the captured value for that case will be edf.
And that is why nothing gets overwritten for this method?
- No, that is not correct: the preceding capture group value is overwritten by the following ones.
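The (\w{3})+ example can be checked directly. A minimal sketch, where the variable names are chosen for illustration:

```python
import re

m = re.search(r"(\w{3})+", "abcedf")

whole_match = m.group()    # the whole match, never overwritten
last_capture = m.group(1)  # group 1 holds only the last repetition
all_groups = m.groups()    # tuple with one entry per capture group
capture_span = m.span(1)   # where the surviving capture sits

print(whole_match)   # abcedf
print(last_capture)  # edf
print(all_groups)    # ('edf',)
print(capture_span)  # (3, 6)
```

The first repetition, abc, is part of the whole match but is no longer available through group 1.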
5: Can anyone explain the mechanism behind returning abcabc (in terms of buffers and so on), similar to what I did in bullet 3?
The re.search(r"(abc)+", S) will match abcabc (match, not capture) because
- abcabc is searched for abc from left to right. RE finds abc at the start and tries to find another abc right from the location after the first c. RE puts the abc into Capture group buffer 1.
- RE finds the 2nd abc, rewrites the capture group #1 buffer with it. Tries to find another abc.
- No more abc is found - return the matched value found: abcabc.
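The steps above can be imitated with a hand-written loop. This is a simplified sketch, not how the re engine is actually implemented: the helper trace_repeated_group is hypothetical, it only tries a match starting at position 0, and it ignores backtracking.

```python
import re

def trace_repeated_group(text, unit="abc"):
    """Hypothetical helper: mimic `(unit)+` anchored at the start of text."""
    piece = re.compile(re.escape(unit))
    pos = 0
    buffer = None  # stands in for the capture group 1 buffer
    history = []   # every value the buffer has held, in order
    while True:
        m = piece.match(text, pos)  # try to match one more unit at pos
        if m is None:
            break                   # no further repetition: stop
        buffer = m.group()          # overwrite the buffer
        history.append(buffer)
        pos = m.end()               # continue right after this unit
    if buffer is None:
        return None                 # `+` needs at least one repetition
    return text[:pos], buffer, history

whole, last, history = trace_repeated_group("abcabc")
print(whole)    # abcabc
print(last)     # abc
print(history)  # ['abc', 'abc']

# The real engine agrees on the whole match and on the last capture.
real = re.match(r"(abc)+", "abcabc")
print(real.group() == whole, real.group(1) == last)  # True True
```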
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Community |
