regex - capture type specifiers in format string

Consider a printf-like function's format argument. Something like:

printf("Hello %s, your rating is %i%%", name, percentage);

I want to use regex to capture all the type specifiers (in the above case, %s and %i and not %%).

I started with a naïve (%[^%]) pattern, but it wrongly captures the %f in %%f instead of treating %% as an escaped percent sign. Of course, %%%f should still be interpreted as an escaped "%" followed by a specifier.
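To reproduce the problem, here is a minimal sketch in Python (the language is an assumption, since the question does not name one; NAIVE is just an illustrative name):

```python
import re

# The naive pattern from the question: a % followed by any non-% character.
NAIVE = re.compile(r'(%[^%])')

print(NAIVE.findall("Hello %s, your rating is %i%%"))  # ['%s', '%i']
# Here %% is an escaped percent sign followed by a plain "f",
# yet the pattern still reports a specifier:
print(NAIVE.findall("literal %%f"))  # ['%f']
```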

I figured I need some more complex pattern (maybe lookbehind?), but could not sort it out. Any suggestions?

Side note: I know my pattern does not handle length specifiers and other formatting flags such as %2f etc., but that's fine with me since my goal is mainly to enumerate and count the format specifiers.



Solution 1:[1]

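On one hand, a regex search may skip characters and start matching anywhere, so it can begin at the second % of %%f and wrongly report %f. To prevent that, each match has to start exactly where the previous one ended, which means you either use match or put a ^ (caret) at the beginning of your regex. On the other hand, once the pattern is anchored like this you can't use findall. Since there's no matchall function, the simplest option is to write your own loop: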

import re

REG = re.compile(r'([^%]|%%)*(%[^%])')  # a bunch of (non-% or %%), and then (% followed by non-%).
def find_type_specifiers(st):
    retval = []
    pos = 0  # where to start searching for next time
    while True:
        match = REG.match(st, pos)
        if match is None:
            return retval
        retval.append(match.group(2))
        pos = match.end()

Of course, you can change what you append to retval if, for example, you are also interested in the locations of the specifiers, or replace the list with a counter if you only want their number.
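As a rough sketch of that variation, the loop below records where each specifier starts as well as its text. The name find_specifiers_with_positions is hypothetical, and the prefix group is made non-capturing so that the specifier becomes group 1; otherwise the pattern is the same as above.

```python
import re

# Same idea as REG, but the prefix group is non-capturing,
# so the specifier itself is group 1.
SPEC = re.compile(r'(?:[^%]|%%)*(%[^%])')

def find_specifiers_with_positions(fmt):
    """Return a list of (start_index, specifier) pairs."""
    found = []
    pos = 0  # each match must begin where the previous one ended
    while True:
        match = SPEC.match(fmt, pos)
        if match is None:
            return found
        found.append((match.start(1), match.group(1)))
        pos = match.end()

print(find_specifiers_with_positions("Hello %s, your rating is %i%%"))
# [(6, '%s'), (25, '%i')]
print(len(find_specifiers_with_positions("%%%f")))  # 1
```

If only the count matters, taking len() of the result (as in the last line) is probably enough; a separate counter would save the list allocation.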

Solution 2:[2]

I think this is the general-case PCRE pattern, based on the "official" documentation:

\%[0 #+-]?[0-9*]*\.?\d*[hl]{0,2}[jztL]?[diuoxXeEfgGaAcpsSn%]

Technically, [hl]{0,2} cannot precede [jztL], so if you want to be stricter, use this:

\%[0 #+-]?[0-9*]*\.?\d*([hl]{0,2}|[jztL])?[diuoxXeEfgGaAcpsSn%]

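Also, not all length specifiers (hljztL) work with all format specifiers, so the regex is a bit looser than necessary, but still sufficient in most cases. For example, %Ls is at least meaningless, possibly invalid. In the other direction, [0 #+-]? accepts only a single flag character, so a specifier that combines several flags, such as %-+d, will not match; use [0 #+-]* if you need those.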
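For the original goal of enumerating and counting specifiers, the stricter pattern could be used roughly as follows in Python. This is a sketch under a few assumptions: list_specifiers is a hypothetical helper name, the group is made non-capturing, and matches ending in % are filtered out because the final character class includes %, so the pattern also matches the escape %% itself.

```python
import re

# Stricter pattern from above; the backslash before % is dropped because
# % needs no escaping in Python's re module.
PRINTF_SPEC = re.compile(
    r'%[0 #+-]?[0-9*]*\.?\d*(?:[hl]{0,2}|[jztL])?[diuoxXeEfgGaAcpsSn%]'
)

def list_specifiers(fmt):
    """Return the full specifiers in order, skipping literal-percent matches."""
    return [m.group() for m in PRINTF_SPEC.finditer(fmt)
            if not m.group().endswith('%')]

print(list_specifiers("Hello %s, your rating is %i%%"))  # ['%s', '%i']
print(len(list_specifiers("%5.2f and %ld and %%d")))     # 2
```

Because %% is consumed as a match of its own, the scan never restarts in the middle of an escaped percent sign, so no lookbehind is needed here.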

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Hetzroni
Solution 2