How to find an upper-case sentence pattern
I have a text file like the following:
FAKE ET FAKE
1, rue Fake - 99567 FAKE
Tél, :00 99 89 22 34 © [email protected]
FAKE-ET-FAKE.fr
FAKE AGAIN
2, rue Fake - 99567 FAKE
Tél, :00 99 89 22 34 © [email protected]
FAKE-AGAIN.fr
STILL FAKE AGAIN ANOTHER
2, rue Fake - 99567 FAKE
Tél, :00 99 89 22 34 © [email protected]
STILL-FAKE-AGAIN-ANOTHER.fr
With a regex, I want to extract the header of each paragraph.
I know that each header consists of upper-case words separated by spaces, but the number of words (and therefore of spaces) varies from one header to the next.
I have tried the following, but I cannot make it work for an arbitrary number of words in the "UPPER UPPER UPPER ..." pattern.
Here is what I have tried:
regex = r'[A-Z]+\s[A-Z]+'
re.findall(regex, text)
With this, I only find "FAKE AGAIN" in my example.
I have also tried
regex = r'([A-Z]+\s[A-Z]+)+'
to say that the UPPER\s pattern can repeat, but it did not work.
Solution 1:[1]
With x.txt containing your text, the following worked fine for me:
egrep -e '^[A-Z][A-Z ]*$' x.txt
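Since the question uses Python's re module, the same line-anchored idea can be sketched there as well. This is only an illustration, assuming a trimmed copy of the sample text (the e-mail part of each "Tél" line is left out); the names text, header_line and headers are made up for the example. re.MULTILINE makes ^ and $ match at the start and end of every line, which is what egrep does implicitly.

```python
import re

# Trimmed copy of the sample from the question (assumption: e-mail part omitted).
text = """FAKE ET FAKE
1, rue Fake - 99567 FAKE
Tél, :00 99 89 22 34
FAKE-ET-FAKE.fr
FAKE AGAIN
2, rue Fake - 99567 FAKE
Tél, :00 99 89 22 34
FAKE-AGAIN.fr
STILL FAKE AGAIN ANOTHER
2, rue Fake - 99567 FAKE
Tél, :00 99 89 22 34
STILL-FAKE-AGAIN-ANOTHER.fr
"""

# A whole line made of upper-case letters and spaces, starting with a letter.
header_line = re.compile(r'^[A-Z][A-Z ]*$', re.MULTILINE)

headers = header_line.findall(text)
print(headers)
# ['FAKE ET FAKE', 'FAKE AGAIN', 'STILL FAKE AGAIN ANOTHER']
```

Note that this pattern also accepts a line consisting of a single upper-case word, which may or may not be wanted.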
Solution 2:[2]
Use
regex = r'[A-Z]+(?:[^\S\n][A-Z]+)+'
See regex proof.
EXPLANATION
- [A-Z]+ - matches a character in the range between A (index 65) and Z (index 90) (case sensitive), between one and unlimited times, as many times as possible, giving back as needed (greedy)
- (?:[^\S\n][A-Z]+)+ - non-capturing group, repeated between one and unlimited times, as many times as possible, giving back as needed (greedy)
  - [^\S\n] - matches a single character not present in the list below, that is, any whitespace character except a newline
    - \S - matches any non-whitespace character (equivalent to [^\r\n\t\f\v ])
    - \n - matches a line-feed (newline) character (ASCII 10)
  - [A-Z]+ - matches a character in the range between A (index 65) and Z (index 90) (case sensitive), between one and unlimited times, as many times as possible, giving back as needed (greedy)
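As a rough check of the explanation above, the pattern can be run with re.findall on a trimmed copy of the sample (an assumption: the e-mail part of each "Tél" line is omitted, and the variable names are only illustrative). The short two-line string at the end shows the difference between \s, which also matches a newline, and [^\S\n], which does not.

```python
import re

# Trimmed copy of the sample from the question (assumption: e-mail part omitted).
text = """FAKE ET FAKE
1, rue Fake - 99567 FAKE
Tél, :00 99 89 22 34
FAKE-ET-FAKE.fr
FAKE AGAIN
2, rue Fake - 99567 FAKE
Tél, :00 99 89 22 34
FAKE-AGAIN.fr
STILL FAKE AGAIN ANOTHER
2, rue Fake - 99567 FAKE
Tél, :00 99 89 22 34
STILL-FAKE-AGAIN-ANOTHER.fr
"""

# Two or more upper-case words separated by non-newline whitespace.
regex = r'[A-Z]+(?:[^\S\n][A-Z]+)+'
headers = re.findall(regex, text)
print(headers)
# ['FAKE ET FAKE', 'FAKE AGAIN', 'STILL FAKE AGAIN ANOTHER']

# \s also matches a newline, so the pattern from the question can join
# the end of one line to the start of the next; [^\S\n] cannot.
two_lines = "FAKE\nTEL"
across_lines = re.findall(r'[A-Z]+\s[A-Z]+', two_lines)
same_line_only = re.findall(regex, two_lines)
print(across_lines, same_line_only)
# ['FAKE\nTEL'] []
```

The group is non-capturing on purpose: with a capturing group, re.findall returns the group's content instead of the whole match.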
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Ronald |
| Solution 2 | Ryszard Czech |
