Is it possible to write a Python regexp with something like an AND operator?

I can't find a nice way to combine several regexps into one so that the input string is checked against all sub-regexps, like this:

def match(input_str: str, regexp: str) -> bool:
    ...

print(match('abaaca', '.*aba.*<AND>.*aca.*'))  # True
print(match('abaca', '.*aba.*<AND>.*aca.*'))  # True, it doesn't matter that one letter a is shared
print(match('abac', '.*aba.*<AND>.*aca.*'))  # False

Is there a better way to do this than parsing the regexp for <AND>, splitting it into several sub-regexps, and matching each one in a loop?

UPD: To be clear, I am looking for a way to use it as a full-featured operator, in cases like ((a<AND>b)|(c<AND>d))<AND>e, which should match all of the strings abe, bae, cde and dce. Not just one <AND> but several, mixed with parentheses.



Solution 1:[1]

A regex solution using positive lookahead groups (?=<sub>), which assert that a sub-pattern matches without consuming any characters:

import re

def match(input_str: str, regexp: str) -> bool:
    return re.search("".join([f"(?={sub})" for sub in re.split('<AND>', regexp)]), input_str) is not None

print(match('abaaca', '.*aba.*<AND>.*aca.*'))  # True
print(match('abaca', '.*aba.*<AND>.*aca.*'))  # True, it doesn't matter that one letter a is shared
print(match('abac', '.*aba.*<AND>.*aca.*')) # False

=>

True
True
False

The oneliner is equivalent to

def match(input_str: str, regexp: str):
    subs = re.split('<AND>', regexp)             # getting the sub patterns

    # next 3 lines create a pattern from the sub patterns
    pattern = ""
    for sub in subs:
        pattern = pattern + "(?=" + sub +  ")"   # positive lookahead syntax

    matches = re.search(pattern, input_str)
    return matches is not None

For the example pattern '.*aba.*<AND>.*aca.*' the modified pattern is (?=.*aba.*)(?=.*aca.*)
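
One caveat: all lookaheads are tested at the same starting position, so sub-patterns written without a leading .* (for example aba<AND>aca) would have to match at the same offset. Below is a minimal sketch of a variant that adds the .* prefix itself, assuming the literal <AND> separator from the question. The name match_anywhere is hypothetical and not part of the original answer.

```python
import re


def match_anywhere(input_str: str, regexp: str) -> bool:
    # Split on the literal <AND> token; a plain str.split is enough here.
    subs = regexp.split("<AND>")
    # ".*" lets each sub-pattern start anywhere in the string; "(?:...)" keeps
    # a top-level "|" inside a sub-pattern from swallowing the ".*" prefix.
    pattern = "".join(f"(?=.*(?:{sub}))" for sub in subs)
    # Every lookahead is evaluated at position 0, so re.match is sufficient.
    return re.match(pattern, input_str) is not None


print(match_anywhere("abaca", "aba<AND>aca"))  # True
print(match_anywhere("abac", "aba<AND>aca"))   # False
print(match_anywhere("cab", "a|x<AND>b"))      # True
```

Note that . does not match newlines by default, so multi-line input would likely need the re.DOTALL flag.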

Solution 2:[2]

You can use a function to check whether all patterns match the string:

import re

def matchall(patterns, string):
    return all([re.search(pattern, string) for pattern in patterns])

print(matchall([".*aba.*", ".*aca.*"], "abaaca"))  # True

Edit: 10.06.2022

Using regex lookahead

(?=.*aba.*)(?=.*aca.*).*

Explanation

  • (?= Lookahead assertion: assert that the following regex matches at the current position
    • .*aba.* Match .*aba.* without consuming any characters
  • ) Close lookahead
  • (?= Lookahead assertion: assert that the following regex matches at the current position
    • .*aca.* Match .*aca.* without consuming any characters
  • ) Close lookahead
  • .* Match the whole string once both lookaheads have succeeded
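
As a quick check, the pattern explained above can be run against the three inputs from the question. This is a small sketch; has_both is a hypothetical helper name used only for illustration.

```python
import re

# The lookahead pattern from the explanation above, compiled once.
pattern = re.compile(r"(?=.*aba.*)(?=.*aca.*).*")


def has_both(s: str) -> bool:
    # re.match anchors at the start, which is where both lookaheads are tested.
    return pattern.match(s) is not None


for s in ("abaaca", "abaca", "abac"):
    print(s, "=>", has_both(s))
# abaaca => True
# abaca => True
# abac => False
```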


Solution 3:[3]

"it doesn't matter that one letter a is shared"

No, you can't do this with only one regex. From the documentation for match():

If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object.

In other words, the entire regex must match at the beginning of the string. Even if you changed to re.search(), you will still have to match the entire regex somewhere in the input string. And re.findall() searches for non-overlapping matches.
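
The non-overlapping behaviour of re.findall() can be seen in a short sketch. The second call, which wraps the pattern in a capturing lookahead so that nothing is consumed, is an addition for comparison and not part of this answer.

```python
import re

text = "abaca"

# Ordinary matching consumes "aba", so the overlapping "aca" is not reported.
plain = re.findall(r"a.a", text)
print(plain)  # ['aba']

# A zero-width lookahead with a capture group consumes nothing, so the scan
# advances one position at a time and overlapping hits are reported too.
overlapping = re.findall(r"(?=(a.a))", text)
print(overlapping)  # ['aba', 'aca']
```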

Solution 4:[4]

Basically, the regex checks strings for aba.*aca OR aca.*aba. The lookbehind is necessary because there might be an a that is part of both sub-patterns.

import re

regex = r"aba.*(?<=a)ca|aca.*(?<=a)ba"
for s in ['abaaca', 'abaca', 'abac', 'aaacacabbaba', 'abababaca', 'abbbacaaaba']:
    print(s, '=>', bool(re.search(regex, s)))

output:

abaaca => True
abaca => True
abac => False
aaacacabbaba => True
abababaca => True
abbbacaaaba => True

Solution 5:[5]

Building on Artyom Vancyan's answer, I would iterate over a list of compiled regular expressions, as this can improve performance when the function is called many times (the re module already caches recently used patterns, so the gain depends on the workload).

import re

# Compile each sub-pattern once, up front
expressions = [re.compile(r'aba'), re.compile(r'aca')]

def match_expressions(expressions, string_to_match):
    return all(expression.search(string_to_match) for expression in expressions)
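
A usage sketch for the precompiled approach, assuming the sub-patterns are aba and aca from the question. The EXPRESSIONS and results names are illustrative only.

```python
import re

# Assumed sub-patterns, taken from the question's example.
EXPRESSIONS = [re.compile(r"aba"), re.compile(r"aca")]


def match_expressions(expressions, string_to_match):
    # all() stops at the first sub-pattern that is not found.
    return all(e.search(string_to_match) for e in expressions)


results = {s: match_expressions(EXPRESSIONS, s) for s in ("abaaca", "abaca", "abac")}
print(results)  # {'abaaca': True, 'abaca': True, 'abac': False}
```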

Solution 6:[6]

import re


def match(input_str: str, regexp: str) -> bool:
    pattern = "".join(
        [f"(?={condition})" for condition in regexp.split("<AND>")]
    )

    return bool(re.findall(pattern, input_str))

print(match("abaaca", ".*aba.*<AND>.*aca.*"))  # True
print(match("abaca", ".*aba.*<AND>.*aca.*"))  # True, it doesn't matter that one letter a is shared
print(match("abac", ".*aba.*<AND>.*aca.*"))  # False
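
Splitting on <AND> handles a flat list of terms. For the nested case in the question's UPD, ((a<AND>b)|(c<AND>d))<AND>e, the lookaheads can be nested by hand. The sketch below assumes that each letter means "this letter occurs somewhere in the string"; it is a hand-written translation, not a general <AND> parser, and NESTED and match_nested are hypothetical names.

```python
import re

# (a AND b) OR (c AND d), then AND e. Each term is a zero-width lookahead,
# so the alternation chooses between two groups of assertions.
NESTED = re.compile(r"(?:(?=.*a)(?=.*b)|(?=.*c)(?=.*d))(?=.*e)")


def match_nested(s: str) -> bool:
    # All assertions are checked from position 0.
    return NESTED.match(s) is not None


for s in ("abe", "bae", "cde", "dce", "ace", "ab"):
    print(s, "=>", match_nested(s))
# abe => True
# bae => True
# cde => True
# dce => True
# ace => False
# ab => False
```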

Solution 7:[7]

The following hand-written patterns cover almost all cases for the aba/aca example.

import re

# If order is important, i.e. aba must come before aca
ordered_pattern = r'.*ab(a.*a|a)ca.*'
# If order is not important, i.e. the match may start with either aba or aca
pattern = r'.*a(b(a.*a|a)c|c(a.*a|a)b)a.*'

# Inputs that should not match
strings = ['abac', 'aba_ca', 'acab', '_ab_ca_', 'acab', 'aca ba', '_ababa_test_aba_']
for s in strings:
    print(s, '=>', bool(re.search(pattern, s)))  # False for each
# Inputs that should match
strings = ["abaca", 'acaba', 'aca_test_aba', '_aba_test_aca_', 'acaaba', 'abaaca']
for s in strings:
    print(s, '=>', bool(re.search(pattern, s)))  # True for each

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Diego Queiroz
Solution 2
Solution 3 Code-Apprentice
Solution 4
Solution 5 Joaquim Procopio
Solution 6
Solution 7