regex - capture type specifiers in format string
Consider a printf-like function's format argument. Something like:
printf("Hello %s, your rating is %i%%", name, percentage);
I want to use regex to capture all the type specifiers (in the above case, %s and %i and not %%).
I've started with a naïve (%[^%]) pattern, but it wrongly captures the %f inside %%f instead of treating the %% as an escape. Of course, %%%f should be interpreted as an escaped "%" followed by a specifier.
I figured I need some more complex pattern (maybe lookbehind?), but could not sort it out. Any suggestions?
Side note: I know my pattern does not handle length specifiers and other formatting flags such as %2f etc., but that's fine with me since my goal is mainly to enumerate and count the format specifiers.
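To make the failure concrete, here is a minimal sketch of the naïve pattern in action. Python is assumed only because it is easy to run, and the helper name naive_specifiers is illustrative:

```python
import re

NAIVE = re.compile(r'(%[^%])')  # the naive pattern from the question

def naive_specifiers(fmt):
    """Return everything the naive pattern captures (illustrative helper)."""
    return NAIVE.findall(fmt)

print(naive_specifiers('Hello %s, your rating is %i%%'))  # ['%s', '%i']
print(naive_specifiers('100%%f'))                         # ['%f']
```

The first call looks right, but the second should ideally return an empty list, because the only % characters in that string form an escaped %%.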
Solution 1:[1]
You can't let the regex skip characters freely without accidentally catching things like %%f, so every match has to be anchored: either use match or put a ^ (caret) at the beginning of your regex. An anchored pattern, however, can't be used with findall, and since there's no matchall function, the simplest approach is to write your own loop:
import re

# A run of (non-% or %%), followed by (% and then a non-%).
REG = re.compile(r'([^%]|%%)*(%[^%])')

def find_type_specifiers(st):
    retval = []
    pos = 0  # where to start matching next time
    while True:
        match = REG.match(st, pos)
        if match is None:
            return retval
        retval.append(match.group(2))
        pos = match.end()
Of course, you can change what you append to retval if, for example, you're also interested in the locations of the specifiers, or switch to a counter if you only want the number of specifiers.
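As a rough sketch of that variation, the loop below records where each specifier starts as well as its text. The function name find_specifiers_with_positions is hypothetical, and the pattern is the same as in the answer except that the first group is made non-capturing, so the specifier becomes group 1:

```python
import re

# Skip runs of non-% characters or escaped %%, then capture one
# "% followed by a non-%" specifier.
SPEC = re.compile(r'(?:[^%]|%%)*(%[^%])')

def find_specifiers_with_positions(fmt):
    """Return (index, specifier) pairs in order of appearance."""
    found = []
    pos = 0  # where the next anchored match starts
    while True:
        m = SPEC.match(fmt, pos)
        if m is None:
            return found
        found.append((m.start(1), m.group(1)))
        pos = m.end()

specs = find_specifiers_with_positions('Hello %s, your rating is %i%%')
print(specs)       # [(6, '%s'), (25, '%i')]
print(len(specs))  # 2
```

Taking len() of the result gives the count directly, so a separate counter is only worth it if the list itself is never needed.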
Solution 2:[2]
I think this is the general-case PCRE pattern, based on the "official" documentation:
\%[0 #+-]?[0-9*]*\.?\d*[hl]{0,2}[jztL]?[diuoxXeEfgGaAcpsSn%]
Technically [hl]{0,2} cannot precede [jztL], so if you want to be stricter, use this:
\%[0 #+-]?[0-9*]*\.?\d*([hl]{0,2}|[jztL])?[diuoxXeEfgGaAcpsSn%]
Also, not all length specifiers (hljztL) work with all format specifiers, so the regex is a bit looser than necessary, but it is still sufficient in most cases. For example, %Ls is at least meaningless, possibly invalid.
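A hedged sketch of how the stricter pattern might be applied in Python follows. The answer targets PCRE, so using Python's re module is an assumption, and the printf_specifiers helper name is illustrative. Because the final character class includes %, the pattern also matches the literal %%; consuming it is what keeps %%f from being misread, but those matches are filtered out here so that only real specifiers remain:

```python
import re

# The stricter pattern from above, copied verbatim.
PRINTF_SPEC = re.compile(
    r'\%[0 #+-]?[0-9*]*\.?\d*([hl]{0,2}|[jztL])?[diuoxXeEfgGaAcpsSn%]'
)

def printf_specifiers(fmt):
    """Return full specifiers in order, skipping literal-percent matches."""
    # finditer + group(0) yields the whole match; findall would return
    # only the contents of the length-modifier group.
    return [m.group(0) for m in PRINTF_SPEC.finditer(fmt)
            if not m.group(0).endswith('%')]

print(printf_specifiers('Hello %s, your rating is %i%%'))  # ['%s', '%i']
print(printf_specifiers('%5.2f and %-10s and %lld'))       # ['%5.2f', '%-10s', '%lld']
print(printf_specifiers('100%%f'))                         # []
```

Unlike the naïve pattern, this version keeps width, precision, and length modifiers as part of each reported specifier.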
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Hetzroni |
| Solution 2 | |
