Splitting string in multiple places Python

I am trying to get a set of numbers out of a string. The numbers are nestled between characters.

Here is an example: NC123456Sarah Von Winkle

  • NC is the only part of the string that is a guarantee
  • 123456 is the number I want to extract
  • Sarah Von Winkle is the name, it can be anything

So I cannot just split at 'S' and 'C' to try and grab the digits.

Code

Nothing tried so far.

Problem

I have no idea how to approach this.

How can I split the string to get only the digits in the middle?



Solution 1:[1]

You can use Regex for this:

import re

s = 'NC123456Sarah Von Winkle'
# findall returns the text captured by the group; join it into one string
m = ''.join(re.findall(r'NC(\d+).*', s))
print(int(m))

Solution 2:[2]

You can try the re module, which is part of Python's standard library.

import re

sample_string = "NC123456Sarah Von Winkle"
result_digits = re.findall(r"\d+", sample_string, flags=0)

Then your result should be ['123456']. If you want just an integer instead of a string, you can convert it with int(result_digits[0]).
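As a minimal sketch of that conversion, the snippet below wraps the lookup in a small helper that also handles a string without any digits. The helper name first_number is hypothetical and only used for illustration.

```python
import re

def first_number(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    digit_runs = re.findall(r"\d+", text)
    # An empty list means no digits were found, so indexing [0] would fail.
    return int(digit_runs[0]) if digit_runs else None

print(first_number("NC123456Sarah Von Winkle"))  # 123456
print(first_number("NCSarah Von Winkle"))        # None
```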

Solution 3:[3]

Use the re module:

import re
s = "NC123456Sarah Von Winkle"
t = re.findall("[0-9]+",s)
print(t)

This will give:

['123456']

The regular-expression (pattern) is composed of:

  • the character range [0-9] matches any single digit from 0 to 9 in the string s
  • the quantifier + means that we are searching for one or more occurrences of the preceding pattern (here [0-9]).
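One consequence of this pattern is that it matches every run of digits, not only the one after NC. The sketch below assumes a name that happens to contain a digit (the sample string is made up) and shows how anchoring the pattern to the guaranteed NC prefix narrows the result.

```python
import re

# Hypothetical input where the name part also contains a digit.
s = "NC123456Sarah Von Winkle 3rd"

# The bare digit pattern returns every run of digits in the string.
runs = re.findall("[0-9]+", s)
print(runs)  # ['123456', '3']

# Anchoring to the "NC" prefix at the start keeps only the wanted number.
anchored = re.findall("^NC([0-9]+)", s)
print(anchored)  # ['123456']
```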

Solution 4:[4]

To match and capture (= extract) the number, you can use a regular-expression.

TL;DR: I would recommend re.match(r'NC(\d+)', s).group(1) (details in the last section).

Regex to match a number

To match a number with a minimum length of 1 digit, use the regular expression (pattern) \d+ for one or more digits, optionally inside a capturing group as (\d+), where:

  1. \d is a character class (meta-character) for digits (of range 0-9)
  2. + is a quantifier matching if at least one occurrence of preceding pattern was found
  3. ( and ) form a capturing-group of the enclosed sub-regex
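To illustrate what the capturing group changes, here is a small sketch using re.search: group(0) is the whole match, while group(1) is only the part enclosed in parentheses.

```python
import re

s = 'NC123456Sarah Von Winkle'
m = re.search(r'NC(\d+)', s)

whole = m.group(0)   # everything the pattern matched, including the prefix
number = m.group(1)  # only the text captured by (\d+)

print(whole)   # NC123456
print(number)  # 123456
```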

Test your regex on regex101 or regexplanet and choose the right flavor/language/engine (here: Python).

In Python, use the built-in regex module re. Define the regex as a raw string like r'\d+'.

Find to extract only the number or empty list

Either use the function re.findall to get a list of occurrences:

import re

s = 'NC123456Sarah Von Winkle'
pattern = r'\d+'
occurrences = re.findall(pattern, s)

print(occurrences)

Prints:

['123456']

If the list is not empty, its first element occurrences[0] is the number you want:

if not occurrences:
    print('no number found in: ' + s)
else:
    number = occurrences[0]

Split to get all parts

Or use the function re.split to split the string into parts:

import re

s = 'NC123456Sarah Von Winkle'
pattern = r'(\d+)'
parts = re.split(pattern, s)

print(parts)

Prints:

['NC', '123456', 'Sarah Von Winkle']

Note: without the capture-group (i.e. without parentheses ()) the output would be just: ['NC', 'Sarah Von Winkle'] (excluding the splitter-pattern)

Here the number is the second part, parts[1], as long as a non-numeric prefix like "NC" is guaranteed and followed by a number.
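If the name may itself contain digits, limiting the split to the first match keeps the result at three parts. This is a sketch using the maxsplit parameter of re.split; the sample string is made up, and the unpacking raises ValueError when the string contains no digits at all.

```python
import re

# Hypothetical input where the name part also contains a digit.
s = 'NC123456Sarah Von Winkle 2nd'

# maxsplit=1 stops after the first run of digits, so the rest stays intact.
prefix, number, name = re.split(r'(\d+)', s, maxsplit=1)

print(prefix)  # NC
print(number)  # 123456
print(name)    # Sarah Von Winkle 2nd
```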

Extract with a capturing-group

Use the group method of the match object together with a regex containing a capturing group:

import re

s = 'NC123456Sarah Von Winkle'
capture_number_pattern = re.compile(r'NC(\d+)')
extracted = capture_number_pattern.match(s).group(1)

print(extracted)

Prints:

123456

Note: re.compile returns a compiled pattern. This can improve performance when the pattern is reused multiple times, and it can make the code more readable.

Pay attention: to make your matching robust and defensive, test whether there is a match. Otherwise an error is raised at runtime, as this Python shell session shows:

>>> extracted = capture_number_pattern.match('NCHelloWorld2022').group(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'

You can test whether a match was found, or fail fast if the match is None:

s = 'NCHelloWorld2022'
match = capture_number_pattern.match(s)
if not match:
    print('No number found in:' + s)
else:
    print(match.group(1))

Prints:

No number found in:NCHelloWorld2022
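Putting the compiled pattern and the None check together, a sketch for processing several strings could look like this. The list of records and the constant name NUMBER_AFTER_NC are assumptions made up for the example.

```python
import re

# Compile once, reuse for every record.
NUMBER_AFTER_NC = re.compile(r'NC(\d+)')

# Hypothetical sample data; the last entry has no digits right after "NC".
records = ['NC123456Sarah Von Winkle', 'NC42Bob', 'NCHelloWorld2022']

numbers = []
for record in records:
    match = NUMBER_AFTER_NC.match(record)
    if match:  # match is None when the record does not start with NC + digits
        numbers.append(int(match.group(1)))

print(numbers)  # [123456, 42]
```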

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Wasif
Solution 2 Dharman
Solution 3 hc_dev
Solution 4