Findall vs search for overwriting groups in Python

I found topic Capturing group with findall? but unfortunately it is more basic and covers only groups that do not overwrite themselves.

Please let's take a look at the following example:

import re
S = "abcabc"  # string used for all the cases below

1. Findall - no groups

print re.findall(r"abc", S) # ['abc', 'abc']

General idea: No groups here so I expect findall to return a list of all matches - please confirm.

In this case: Findall is looking for abc, finds it, returns it, then goes on and finds the second one.

2. Findall - one explicit group

print re.findall(r"(abc)", S) # ['abc', 'abc']

General idea: Some groups here so I expect findall to return a list of all groups - please confirm.

In this case: Why two results while there is only one group? I understand it this way:

  • findall is looking for abc,

  • finds it,

  • places it in the group memory buffer,

  • returns it,

  • findall starts to look for abc again, and so on...

Is this reasoning correct?

3. Findall - overwriting groups

print re.findall(r"(abc)+", S) # ['abc']

This looks similar to the above yet returns only one abc. I understand it this way:

  • findall is looking for abc,

  • finds it,

  • places it in the group memory buffer,

  • does not return it because the RE itself demands to go on,

  • finds another abc,

  • places it in the group memory buffer (overwrites previous abc),

  • string ends so searching ends as well.

Is this reasoning correct? I am very specific here so if there is anything wrong (even tiny detail) then please let me know.

4. Search - overwriting groups

Search scans through a string looking for a single match, so re.search(r"abc", S) and re.search(r"(abc)", S) rather obviously return only one abc. So let me get right to:

m = re.search(r"(abc)+", S)  # keep the match object so it can be inspected
print m.group()  # abcabc
print m.groups() # ('abc',)

a) Of course the whole match is abcabc, but we still have groups here, so can I conclude that groups are irrelevant (despite name) for m.group()? And that is why nothing gets overwritten for this method?

In fact, this grouping feature of parentheses is completely unnecessary here - in such cases I just want to use parentheses to stress what needs to be taken together when repeating things without creating any regex groups.

b) Can anyone explain the mechanism behind returning abcabc (in terms of buffers and so on), similar to what I did in bullet 3?



Solution 1:[1]

First, let me state some facts:

  • A match value (match.group()) is the (sub)text that meets the whole pattern defined in a regular expression. Matches can contain zero or more capture groups.
  • A capture value (match.group(1..n)) is the part of the match that is matched by a parenthesized part of the pattern (a part enclosed in a pair of unescaped parentheses). It can equal the whole match if the whole pattern is enclosed in a capture group.
  • Some languages provide access to the capture collection, i.e. all the values that were captured with a quantified capture group like (\w{3})+. In Python, this is possible with the PyPI regex module; in .NET, with a CaptureCollection; etc.

1: No groups here so I expect findall to return a list of all matches - please confirm.

  • True. re.findall returns a list of captured submatches only if capturing groups are defined in the pattern. In the case of abc there are none, so re.findall returns a list of matches.

2: Why two results while there is only one group?

  • There are two matches: re.findall(r"(abc)", S) finds two matches in abcabc, and each match has one submatch (captured substring), so the resulting list has 2 elements (abc and abc).
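
To make points 1 and 2 concrete, here is a small sketch (Python 3 syntax, so print is a function). The third pattern, (a)(bc), is not from the question; it is only an illustrative assumption to show what findall does with more than one group.

```python
import re

S = "abcabc"

# No capturing group: each list item is a whole match.
no_group = re.findall(r"abc", S)

# One capturing group: each list item is that group's text, one item per match.
one_group = re.findall(r"(abc)", S)

# Two capturing groups (illustrative pattern): each list item is a tuple
# holding the groups of one match.
two_groups = re.findall(r"(a)(bc)", S)

print(no_group)    # ['abc', 'abc']
print(one_group)   # ['abc', 'abc']
print(two_groups)  # [(a, bc) tuples, one per match]
```

In all three cases the list has one entry per match; the groups only decide what each entry contains.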

3: Is this reasoning correct?

  • re.findall(r"(abc)+", S) is looking for a match of the form abc, abcabc, abcabcabc and so on. It will match the run as a whole and will keep only the last abc in the capture group 1 buffer. So, I think your reasoning is correct. "The RE itself demands to go on" can be stated more precisely: the matching is not yet complete, as there are still characters for the regex engine to test for a match.
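
One way to check this is to swap findall for re.finditer, which yields match objects instead of strings, so the whole match and group 1 can be inspected side by side. This is only a sketch in Python 3 syntax; the variable names are mine.

```python
import re

S = "abcabc"

# finditer yields match objects, so both the whole match and group 1 are visible.
matches = list(re.finditer(r"(abc)+", S))
m = matches[0]

whole = m.group(0)       # text of the entire match
last_capture = m.group(1)  # what is left in the group 1 buffer
capture_span = m.span(1)   # where that last capture sits in S

print(len(matches))   # 1
print(whole)          # abcabc
print(last_capture)   # abc
print(capture_span)   # (3, 6)
print(re.findall(r"(abc)+", S))  # ['abc']
```

There is a single match covering the whole string, and the span of group 1 shows that the surviving abc is the second one, not the first.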

4: the whole match is abcabc, but we still have groups here, so can I conclude that groups are irrelevant (despite name) for m.group()?

  • No. m.group() returns the whole match, while the capture group keeps only its last value. If you change your regex to (\w{3})+ and the string to abcedf, you will see the difference, as group 1 for that case will be edf. "And that is why nothing gets overwritten for this method?" So, you are wrong there: each preceding capture group value is overwritten by the following one.
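
A short sketch of the abcedf case (Python 3 syntax). The second half uses a non-capturing group, (?:...), which the answer above does not mention; I add it as one possible way to get the "parentheses only for repetition" behaviour the question asks about.

```python
import re

# Capturing group under a quantifier: group 1 ends up holding the last repetition.
m = re.search(r"(\w{3})+", "abcedf")
whole = m.group()      # whole match
last = m.group(1)      # last value written to group 1
groups = m.groups()    # tuple of all capture groups

print(whole)   # abcedf
print(last)    # edf
print(groups)  # ('edf',)

# Non-capturing group: it groups for repetition but creates no capture.
n = re.search(r"(?:abc)+", "abcabc")
n_whole = n.group()
n_groups = n.groups()
all_matches = re.findall(r"(?:abc)+", "abcabc")

print(n_whole)      # abcabc
print(n_groups)     # ()
print(all_matches)  # ['abcabc']
```

With no capturing group in the pattern, findall goes back to returning whole matches.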

5: Can anyone explain a mechanism behind returning abcabc (in terms of buffers and so on) similarly like I did in bullet 3?

The re.search(r"(abc)+", S) will match abcabc (match, not capture) because

  1. abcabc is searched for abc from left to right. RE finds abc at the start, puts it into capture group buffer 1, and tries to find another abc right after the first c.
  2. RE finds the 2nd abc and overwrites the capture group #1 buffer with it. It then tries to find another abc.
  3. No more abc is found, so the matched value found so far is returned: abcabc.
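
The three steps above can be imitated with a toy function. This is a hand-written model of the idea, not how the re engine is actually implemented; match_repeated_group and its history list are hypothetical names used for illustration, and the function tries only one start position, whereas re.search would also scan forward.

```python
import re

def match_repeated_group(text, unit="abc", start=0):
    """Toy model of matching (unit)+ at position `start`."""
    pos = start
    group1 = None   # the single "buffer" for capture group 1
    history = []    # every value the buffer held, kept only for illustration
    while text.startswith(unit, pos):
        group1 = text[pos:pos + len(unit)]  # overwrite the buffer
        history.append(group1)
        pos += len(unit)                    # try the next repetition
    if group1 is None:
        return None                         # "+" needs at least one repetition
    return text[start:pos], group1, history

result = match_repeated_group("abcabc")
print(result)  # whole match, final buffer value, and the buffer history

# Compare with the real engine: same whole match and same final group 1.
m = re.search(r"(abc)+", "abcabc")
print((m.group(), m.group(1)) == result[:2])  # True
```

The history list plays the role of a capture collection: the values were all seen, but the standard re module only exposes the last one.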

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Community