Lua pattern matching: When can anchors be safely omitted?
The reference manual describes patterns and anchors as follows:
A pattern is a sequence of pattern items. A '^' at the beginning of a pattern anchors the match at the beginning of the subject string. A '$' at the end of a pattern anchors the match at the end of the subject string. At other positions, '^' and '$' have no special meaning and represent themselves.
Clearly, if a pattern ends with .* or .+ (whether or not it is inside a capture group), a trailing $ anchor may be safely omitted, as the entire remaining sequence will be matched either way by the last greedy quantifier. For .-, though, the anchor may not be omitted, as without it nothing forces .- to match all characters up to the end.
Now for the beginning-of-string anchor, the same seems to hold: ^.* and ^.+ can simply be converted into .* and .+ respectively. Surprisingly, however, it seems that this time (perhaps due to the way patterns are implemented) ^.- can also be simplified to .-, at least in my testing, even though the docs state:
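As a minimal sketch of the trailing-anchor case (the subject string "key=value" is made up for illustration), the following compares a greedy and a lazy final item with and without $:

```lua
local s = "key=value"  -- hypothetical sample input

-- Trailing greedy item: `.*` already runs to the end, so `$` adds nothing.
assert(s:match("=.*")  == "=value")
assert(s:match("=.*$") == "=value")

-- Trailing lazy item: without `$`, `.-` is satisfied with zero characters.
assert(s:match("=.-")  == "=")
-- With `$`, `.-` has to keep expanding until the end of the subject.
assert(s:match("=.-$") == "=value")

print("ok")
```

The asserts pass silently, so the script only prints ok if all four comparisons hold.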
a single character class followed by '-', which also matches 0 or more repetitions of characters in the class. Unlike '*', these repetition items will always match the shortest possible sequence;
If the pattern isn't anchored, matching could start at a later position and thus produce a shorter match for .-, yet this isn't happening:
$ lua
Lua 5.3.4 Copyright (C) 1994-2017 Lua.org, PUC-Rio
> ("00000000000000000000000001"):match".-1"
00000000000000000000000001
> ("00000000000000000000000001"):match"^.-1"
00000000000000000000000001
>
Is this guaranteed or specified behavior, or is it effectively "undefined" behavior, so that the ^ anchor should still be used to stay on the safe side in case the implementation changes?
Solution 1:[1]
There are two things you need to bear in mind when using Lua patterns (and any patterns in general):
- There are pattern strings that are used to match specific texts
- There are libraries, methods or functions in programming languages that parse the pattern strings and extract/replace/remove/split the input strings based on the incoming pattern logic.
Thus, please make sure you understand what your pattern does and how a specific function/method uses the pattern.
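If you use match with ^.-1, the result is a substring that starts at the beginning of the string (^) and then contains as few characters as possible (zero or more) up to the leftmost occurrence of 1. The ^ guarantees that matching can only start at the beginning of the string. However, match returns only a single match (it is not gmatch), trying start positions from left to right and stopping at the first one that succeeds. Since . in Lua patterns matches any char (including line break chars), .-1 already succeeds at the first position whenever the string contains a 1 at all; "shortest possible" applies only to the repetition for a given start position, not to the choice of start position. Thus, .-1 with match will yield the same match.
Once you use gmatch to find multiple matches, the ^.-1 and .-1 patterns will start to behave differently.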
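To make the role of the start position concrete, here is a small sketch using string.find, which also reports where the match begins. The second subject, "a001", and the pattern 0-1 are made-up counter-examples: there the leading item cannot consume the first character, so dropping the anchor does change the result.

```lua
local s = "00000001"

-- Both patterns match the same span, starting at index 1.
print(s:find(".-1"))   -- 1  8
print(s:find("^.-1"))  -- 1  8

-- Counter-case: `0-` can only consume zeros, so it cannot step over "a".
local t = "a001"
print(t:match("0-1"))   -- 001 (a later start position succeeds)
print(t:match("^0-1"))  -- nil (the attempt at position 1 fails)
```

This suggests the equivalence of ^.-1 and .-1 under match comes from the leading . accepting any character, rather than from .- as such.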
If you use it in a replacing/removing context, the difference will be visible at once, too, since by default, these methods - and string.gsub is not an exception - replace all found matches: "Its basic use is to substitute the replacement string for all occurrences of the pattern inside the subject string" (see 20.1 – Pattern-Matching Functions).
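A short sketch of the replacement case, assuming the made-up subject "0101" and replacement "X": string.gsub also returns the number of substitutions, which makes the effect of the anchor easy to see.

```lua
local s = "0101"  -- hypothetical sample input

-- Unanchored: every occurrence of `.-1` is replaced.
local all, n_all = s:gsub(".-1", "X")
print(all, n_all)      -- XX   2

-- Anchored: only a match starting at the beginning is replaced.
local first, n_first = s:gsub("^.-1", "X")
print(first, n_first)  -- X01  1
```

gmatch is left out of this sketch on purpose: as far as I recall, the reference manual notes that a leading ^ does not work as an anchor in gmatch, so it is worth checking the manual for your Lua version before relying on it there.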
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Wiktor Stribiżew |
