'regex for combining length, inclusion and exclusion?
A search on SO with just [regex]
gave me 249'446 hits and a search with [regex] inclusion exclusion
gave me 47 hits but I guess none of the latter (maybe some of the former?) fit my case.
I am also aware, e.g. about this regex page https://www.regular-expressions.info/refquick.html, but I guess there might be a regex concept which I am not yet familiar with and would be grateful for hints.
Here is a minimal example of what I am trying to do with a given list of strings.
Find all items which:
- have a fixed defined number of characters, i.e. length
- must include all characters from a certain list (doesn't matter at what position and if multiple times)
- must NOT include any characters from a certain list
Constructs like: [ei^no]{4}
, ((?![no])[ei]){4}
and a lot of other more complex trials didn't give the desired results.
Hence, I currently implemented this as a 3 step process with checking the length, doing a search and a match. This looks pretty cumbersome and inefficient to me.
Is there a more efficient way to do this?
Script:
import re
items = ['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten', 'eleven', 'twelve']
count = 4
mustContain = 'ei' # all of these charactes at least once
mustNotContain = 'no' # none of those chars
hits1 = []
for item in items:
if len(item)==count:
hits1.append(item)
print("Hits1:",hits1)
hits2 = []
for hit in hits1:
regex = '[{}]'.format(mustContain)
if re.search(regex,hit):
hits2.append(hit)
print("Hits2:", hits2)
hits3 = []
for hit in hits2:
regex = '[{}]'.format(mustNotContain)
if re.match(regex,hit):
hits3.append(hit)
print("Hits3:", hits3)
Result:
Hits1: ['four', 'five', 'nine']
Hits2: ['five', 'nine']
Hits3: ['five']
Solution 1:[1]
If you are interested in a regex approach, you can create a single dynamic pattern that looks like:
^(?=.{4}$)(?![^no\n]*[no])(?=[^e\n]*e)[^i\n]*i.*$
Explanation
^
Start of string(?=.{4}$)
Assert 4 characters(?![^no\n]*[no])
Assert no occurrence ofn
oro
to the right using a leading negated character class(?=[^e\n]*e)
Assert ane
char to the right[^i\n]*i
Match any char excepti
and then matchi
.*
Match the rest of the line$
end of string
See a regex demo and a Python demo.
Example
import re
items = ['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten', 'eleven', 'twelve', 'tree']
hits = [item for item in items if re.match(r"(?=.{4}$)(?![^no\n]*[no])(?=[^e\n]*e)[^i\n]*i.*$", item)]
print(hits)
Output
['five']
Using a variation of all
and a list comprehension:
items = ['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten', 'eleven', 'twelve', 'tree']
count = 4
mustContain = ["e", "i"] # all of these characters at least once
mustNotContain = ["n", "o"] # none of those chars
hits = [
item for item in items if
len(item) == count and
all([c in item for c in mustContain]) and
all([c not in item for c in mustNotContain])
]
print(hits)
Output
['five']
See a Python demo.
Solution 2:[2]
Apparently, the "trick" which I was missing was the "Positive lookahead" (?=regex)
.
I guess the regex in @Thefourthbird's solution can be shortened,
unless I overlooked something and somebody will prove me wrong.
The regex for the included characters can be generated dynamically.
The regex for the original minimal example of the question would be:
^(?=.{4}$)(?!.*[no])(?=.*e)(?=.*i)
Script: (dynamically generated regex)
import re
items = ['one', 'two', 'three', 'four', 'five', 'six',
'seven', 'eight', 'nine', 'ten', 'eleven', 'twelve',
'tree', 'mean', 'mine', 'fine', 'dime', 'eire']
count = 4
mustContain = 'ei' # all of these characters at least once
mustNotContain = 'no' # none of those chars
hits = []
regex1 = '^(?=.{' + str(count) + '}$)' # limit number of chars
regex2 = '(?!.*[' + mustNotContain + '])' if mustNotContain else '' # excluded chars
regex3 = ''.join(['(?=.*{})'.format(c) for c in mustContain]) # included chars
regex = regex1 + regex2 + regex3
for item in items:
if re.match(regex,item,re.IGNORECASE):
hits.append(item)
print("Hits:", hits)
Result:
Hits: ['five', 'dime', 'eire']
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | theozh |