How to remove numbers from string but keep specific groups of numbers?
I want to use a Python regular expression to remove numbers from a string but keep the numbers 754 and 1231, because they refer to tax section code 754 and section code 1231. For example, I have the text data below:
test="""Dividends 9672
Dividends 9680
Interest Income
Ordinary Dividends
Royalties
Capital Gain Distributions
Income from Blackstone
Ordinary Income
Rental Income
Long Term Capital Gain
Short Term Capital Gain
1231 Gain
Section 754 Stock Basis Adjustment - 2015
M-1 Section 754 Stock Basis Adjustment - 2015
Section 754 Stock Basis Adjustment - 2018
M-1 Section 754 Stock basis adjustment - 2018
"""
and I want the output to be:
Dividends
Dividends
Interest Income
Ordinary Dividends
Royalties
Capital Gain Distributions
Income from Blackstone
Ordinary Income
Rental Income
Long Term Capital Gain
Short Term Capital Gain
1231 Gain
Section 754 Stock Basis Adjustment
M- Section 754 Stock Basis Adjustment
Section 754 Stock Basis Adjustment
M- Section 754 Stock basis adjustment
My attempt so far is:
test=re.sub(r'[^(754)(1231)A-Za-z]','',test)
print(test)
However, it doesn't treat 754 or 1231 as whole groups, and it only removes the digits 6, 8, and 9.
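A small sketch of why the attempt behaves this way: inside square brackets, parentheses and digits are literal single characters, so the negated class keeps letters, '(' , ')' and the individual digits 1, 2, 3, 4, 5 and 7, and removes everything else (including spaces, hyphens and newlines). The two-line sample string is made up for illustration.

```python
import re

# Hypothetical two-line sample taken from the data in the question.
sample = "Section 754 Stock Basis Adjustment - 2015\nDividends 9672"

# [^(754)(1231)A-Za-z] is a negated set of single characters, not groups:
# it removes anything that is not a letter, '(' , ')' or one of 1 2 3 4 5 7.
result = re.sub(r'[^(754)(1231)A-Za-z]', '', sample)
print(result)  # Section754StockBasisAdjustment215Dividends72
```

Note that "2015" becomes "215" and "9672" becomes "72", because only the digits 0, 6, 8 and 9 fall outside the set.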
Solution 1:[1]
You can use
re.sub(r'(754|1231)|[^A-Za-z\s]', r'\1', text)
Here, (754|1231) matches a 754 or 1231 digit sequence and captures it into Group 1, while the alternative [^A-Za-z\s] matches any character other than an ASCII letter or any Unicode whitespace. Each match is replaced with the Group 1 value, so a captured 754 or 1231 remains in the string, and anything matched by the second alternative is replaced with an empty string.
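As a rough check, here is the pattern applied to three lines from the question's data. This is only a sketch: it assumes Python 3.5 or later, where a reference to a group that did not participate in the match is replaced with an empty string. The `lines` list and the `strip()` call are additions for display, since the substitution leaves trailing spaces where digits were removed.

```python
import re

# Three lines taken from the question's sample data.
test = ("Dividends 9672\n"
        "1231 Gain\n"
        "M-1 Section 754 Stock Basis Adjustment - 2015\n")

# Keep 754/1231 via Group 1; drop every other non-letter, non-whitespace char.
cleaned = re.sub(r'(754|1231)|[^A-Za-z\s]', r'\1', test)

# Trim the trailing spaces left behind where numbers and hyphens were removed.
lines = [line.strip() for line in cleaned.splitlines()]
print(lines)
# ['Dividends', '1231 Gain', 'M Section 754 Stock Basis Adjustment']
```

Note that the hyphen in "M-1" is removed as well, because it is neither a letter nor whitespace.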
Note: if 754 and 1231 must be matched only as whole numbers (not as part of a longer number such as 7541), use digit boundaries:
re.sub(r'(?<!\d)(754|1231)(?!\d)|[^A-Za-z\s]', r'\1', text)
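To illustrate the difference the digit boundaries make, the sketch below runs both patterns on a made-up string in which 754 and 1231 also appear inside longer numbers. The input string and the names `loose` and `strict` are assumptions for the example only.

```python
import re

# Hypothetical input: 754 and 1231 also occur inside longer numbers.
text = "Code 7541 and 754 and 12310"

# Without boundaries, the 754 inside 7541 and the 1231 inside 12310 are kept.
loose = re.sub(r'(754|1231)|[^A-Za-z\s]', r'\1', text)

# With digit boundaries, only the standalone 754 survives.
strict = re.sub(r'(?<!\d)(754|1231)(?!\d)|[^A-Za-z\s]', r'\1', text)

print(repr(loose))   # 'Code 754 and 754 and 1231'
print(repr(strict))  # 'Code  and 754 and '
```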
Solution 2:[2]
You could write the following.
rgx = r' *-? *(?<!\d)(?!(?:754|1231)(?!\d))\d+'
re.sub(rgx, '', test)
Note that this removes the unwanted spaces and hyphens that precede the digits as well as the digits themselves, and that a longer number such as '7541' is matched in full and replaced with an empty string.
The regular expression can be broken down as follows (I've replaced the initial space with a character class containing a space so that it is visible.)
[ ]*-? *          # match >= 0 spaces, optionally followed by a hyphen,
                  # followed by >= 0 spaces
(?<!\d)           # negative lookbehind asserts that the preceding character is
                  # not a digit
(?!               # begin negative lookahead
  (?:754|1231)    # match '754' or '1231'
  (?!\d)          # negative lookahead asserts that the next character is
                  # not a digit
)                 # end negative lookahead
\d+               # match >= 1 digits
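The breakdown above can also be written directly in Python with the re.VERBOSE flag, which ignores unescaped whitespace and allows comments inside the pattern. This is a sketch rather than part of the original answer: both space runs are written as [ ]* because a bare space would be ignored in verbose mode, and the last input line ("Code 7541") is a made-up addition to show the longer-number case.

```python
import re

rgx = re.compile(r"""
    [ ]*-?[ ]*          # >= 0 spaces, an optional hyphen, >= 0 spaces
    (?<!\d)             # the preceding character is not a digit
    (?!                 # do not match if what follows is ...
        (?:754|1231)    # ... 754 or 1231
        (?!\d)          # ... that is not itself followed by a digit
    )
    \d+                 # the digits to remove
""", re.VERBOSE)

# Lines from the question plus one hypothetical line with a longer number.
test = ("Dividends 9672\n"
        "1231 Gain\n"
        "M-1 Section 754 Stock Basis Adjustment - 2015\n"
        "Code 7541\n")

result = rgx.sub('', test)
print(result)
```

With this input the result keeps "1231 Gain" and "Section 754" intact, turns "M-1" into "M", and removes " 9672", " - 2015" and " 7541" together with their leading spaces and hyphen.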
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Wiktor Stribiżew |
| Solution 2 |
