How can I convert `A_B_C_DEF` to `ABC_DEF`?
I have strings of this form:
A_B_CDEF_GHI
A_B_C_DEF_G_H_I
ABC_D_E_F_GHI
ABCDEFG_H_I
A_B_C
I need to convert those to the following:
AB_CDEF_GHI
ABC_DEF_GHI
ABC_DEF_GHI
ABCDEFG_HI
ABC
So the rules are:
- `(._){2,}` should be converted to `XXX_` if it's not at the end of the string.
- If `(_.){2,}` occurs at the end of a string, it should be converted to `_XXX`.
- If `(_.){2,}.` is the entire string, all underscores should be removed.
I've gotten to (((.)_){2,}), which does match the first rule, but how can I replace it with the non-underscore characters it found?
The python tag is present because that's where the code is, and I know regex dialects depend on the language.
Solution 1:[1]
The dot in your example code matches any character including an underscore. You can make the pattern a bit more specific instead.
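To see the effect, here is a small sketch comparing the dot-based pattern with a stricter one. The `[A-Z]` variant and the sample string are assumptions chosen only for illustration.

```python
import re

# The question's idea: any character followed by an underscore, twice or more.
loose = re.compile(r"(._){2,}")
# Hypothetical stricter variant: only an uppercase letter may precede the underscore.
strict = re.compile(r"([A-Z]_){2,}")

sample = "____"  # nothing but underscores

# "." also matches "_" itself, so the loose pattern accepts the whole string.
loose_match = loose.fullmatch(sample)
# The stricter pattern rejects it because "_" is not in [A-Z].
strict_match = strict.fullmatch(sample)

print(loose_match is not None)
print(strict_match is not None)
```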
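You can first match every run of two or more A-Z characters to get those out of the way, and capture each single A-Z followed by one or more repetitions of _ and a single A-Z in a group.
Then, whenever the capture group took part in the match, replace each _ in it with an empty string.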
_?[A-Z]{2,}_?|([A-Z](?:_[A-Z](?![A-Z]))+)
- `_?[A-Z]{2,}_?` Match 2 or more occurrences of A-Z surrounded by optional underscores
- `|` or
- `(` Capture group 1
  - `[A-Z]` Match a single A-Z
  - `(?:_[A-Z](?![A-Z]))+` Repeat 1+ times `_` and A-Z, asserting not A-Z to the right
- `)` Close group 1
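The breakdown above can also be kept next to the pattern itself by compiling it with `re.VERBOSE`, which ignores whitespace in the pattern and allows `#` comments. This is only a sketch; the names `PATTERN` and `collapse` are made up for this example.

```python
import re

PATTERN = re.compile(
    r"""
      _?[A-Z]{2,}_?            # run of 2+ letters with optional underscores around it
    |                          # or
      (                        # capture group 1
        [A-Z]                  # a single letter
        (?:_[A-Z](?![A-Z]))+   # 1+ times: "_" and a letter not followed by a letter
      )                        # close group 1
    """,
    re.VERBOSE,
)


def collapse(s: str) -> str:
    # Strip underscores only when group 1 took part in the match;
    # otherwise put the matched text back unchanged.
    return PATTERN.sub(
        lambda m: m.group(1).replace("_", "") if m.group(1) else m.group(),
        s,
    )


print(collapse("A_B_C_DEF_G_H_I"))
```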
See a regex demo and a Python demo
For example:
import re
pattern = r'_?[A-Z]{2,}_?|([A-Z](?:_[A-Z](?![A-Z]))+)'
s = ("A_B_CDEF_GHI\n"
     "A_B_C_DEF_G_H_I\n"
     "ABC_D_E_F_GHI\n"
     "ABCDEFG_H_I\n"
     "A_B_C")
res = re.sub(pattern, lambda x: x.group(1).replace("_", "") if x.group(1) else x.group(), s)
print(res)
Output
AB_CDEF_GHI
ABC_DEF_GHI
ABC_DEF_GHI
ABCDEFG_HI
ABC
A somewhat broader alternative to A-Z is a negated character class that matches any character except a whitespace character or an underscore:
_?[^_\s]{2,}_?|([^_\s](?:_[^_\s](?![^_\s]))+)
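As a rough sketch of how the broader pattern behaves, the snippet below applies it to lowercase letters and digits. The helper name `collapse_any` and the sample inputs are assumptions for illustration, not part of the original answer.

```python
import re

# Same structure as the A-Z pattern, with [^_\s] in place of [A-Z].
broad = re.compile(r"_?[^_\s]{2,}_?|([^_\s](?:_[^_\s](?![^_\s]))+)")


def collapse_any(s: str) -> str:
    # Remove underscores only inside matches where group 1 participated.
    return broad.sub(
        lambda m: m.group(1).replace("_", "") if m.group(1) else m.group(),
        s,
    )


print(collapse_any("a_b_cdef_1_2"))  # ab_cdef_12
print(collapse_any("x_y_z"))         # xyz
```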
Solution 2:[2]
Here's a solution without regular expressions:
from typing import Iterator


def convert(s: str) -> str:
    """ https://stackoverflow.com/q/71578300 """

    def _get_combined_parts() -> Iterator[str]:
        """
        yields the ``_``-separated parts of ``s``
        where subsequent single-character parts have been combined
        """
        combined_part = ""
        for part in s.split("_"):
            if len(part) <= 1:
                # Single-character (or empty) part: keep accumulating.
                combined_part += part
            else:
                # Longer part: flush what has accumulated, then emit it as-is.
                if combined_part:
                    yield combined_part
                yield part
                combined_part = ""
        if combined_part:
            yield combined_part

    return "_".join(_get_combined_parts())
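The same grouping idea can probably also be expressed with `itertools.groupby`, which collects consecutive parts sharing the same "short or not" key. This is an alternative sketch rather than the answer's own code; `convert_groupby` is a hypothetical name, and the checks use the strings from the question.

```python
from itertools import groupby


def convert_groupby(s: str) -> str:
    out = []
    # Group consecutive "_"-separated parts by whether they are at most one character long.
    for is_short, group in groupby(s.split("_"), key=lambda part: len(part) <= 1):
        if is_short:
            # A run of single-character parts is glued together without underscores.
            joined = "".join(group)
            if joined:
                out.append(joined)
        else:
            # Longer parts are kept unchanged.
            out.extend(group)
    return "_".join(out)


samples = {
    "A_B_CDEF_GHI": "AB_CDEF_GHI",
    "A_B_C_DEF_G_H_I": "ABC_DEF_GHI",
    "ABC_D_E_F_GHI": "ABC_DEF_GHI",
    "ABCDEFG_H_I": "ABCDEFG_HI",
    "A_B_C": "ABC",
}
for source, expected in samples.items():
    print(source, "->", convert_groupby(source), convert_groupby(source) == expected)
```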
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
