How to remove the space between a number and the unit or dimension that follows it?

Here are the input strings:

string1 = 0.9% SODIUM CHLORIDE 8290306544 FLUSH 0.9 % SYRINGE 10 ML
string2 = 0.9% SODIUM CHLORIDE 8290-3071-44 FLUSH 0.9 % SYRINGE 10 MM
string3 = 0.9% SODIUM CHLORIDE 290306544 FLUSH 0.9 % SYRINGE 10 cm

These are three strings that I'm working on. First, I want to remove the space between a number and the unit/dimension/measurement (or %) that follows it, e.g. 10 ML => 10ML, but a result like 8290306544FLUSH is wrong. Second, if there is a 10-digit number, format it as 4 digits - 4 digits - 2 digits, e.g. 8290-3065-44. If there is a 9-digit number, add a leading zero first and then apply the same format, e.g. 290306544 => 0290306544 => 0290-3065-44.

I want output like this:

string1 = 0.9% SODIUM CHLORIDE 8290-3065-44 FLUSH 0.9% SYRINGE 10ML
string2 = 0.9% SODIUM CHLORIDE 8290-3071-44 FLUSH 0.9% SYRINGE 10MM
string3 = 0.9% SODIUM CHLORIDE 0290-3065-44 FLUSH 0.9% SYRINGE 10cm

How can I write a Python function for this?



Solution 1:[1]

This code may help you.

# pip install quantities
from quantities import units
string1 ='0.9% SODIUM CHLORIDE 8290306544 FLUSH 0.9 % SYRINGE 10 ML'
string2 = '0.9% SODIUM CHLORIDE 8290-3071-44 FLUSH 0.9 % SYRINGE 10 MM'
string3 = '0.9% SODIUM CHLORIDE 290306544 FLUSH 0.9 % SYRINGE 10 cm'

def string_formater(string):
    # Lower-cased symbols of every unit defined by quantities, so that 'ML', 'ml' and 'mL' all match.
    unit_symbols = {u.symbol.lower() for u in units.__dict__.values() if isinstance(u, type(units.deg))}

    def is_number(token):
        return token.replace('.', '', 1).isdigit() # True for '10' and '0.9'

    def number_formater(num):
        num = num.zfill(10) # a 9-digit number gets a leading zero
        return f'{num[:4]}-{num[4:8]}-{num[8:]}' # 4 digits - 4 digits - 2 digits

    # Build a new list instead of deleting items from the list being iterated over.
    tokens = []
    for a in string.split(): # split() also drops leading, trailing and repeated spaces
        if tokens and a.lower() in unit_symbols and is_number(tokens[-1]):
            tokens[-1] += a # a unit is joined to the number before it: '10', 'cm' becomes '10cm'
        elif a.isdigit() and len(a) in (9, 10):
            tokens.append(number_formater(a))
        else:
            tokens.append(a)

    return ' '.join(tokens)



print(string_formater(string1)) # 0.9% SODIUM CHLORIDE 8290-3065-44 FLUSH 0.9% SYRINGE 10ML
print(string_formater(string2)) # 0.9% SODIUM CHLORIDE 8290-3071-44 FLUSH 0.9% SYRINGE 10MM
print(string_formater(string3)) # 0.9% SODIUM CHLORIDE 0290-3065-44 FLUSH 0.9% SYRINGE 10cm
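
If installing quantities is not an option, the same token-merging idea can be sketched with a hand-written unit set. The names UNITS, format_code, is_number and normalize are hypothetical, and the unit set is an assumption that only covers the units from the question, so extend it for real data.

```python
# Hypothetical unit set (an assumption): only the units from the question.
UNITS = {"ml", "mm", "cm", "%"}


def format_code(token):
    """Format a 9- or 10-digit token as 4-4-2 digits; return others unchanged."""
    if token.isdigit() and len(token) in (9, 10):
        token = token.zfill(10)  # a 9-digit code gets one leading zero
        return f"{token[:4]}-{token[4:8]}-{token[8:]}"
    return token


def is_number(token):
    """True for plain integers and decimals such as '10' or '0.9'."""
    return token.replace(".", "", 1).isdigit()


def normalize(text):
    merged = []
    for token in text.split():
        # Glue a unit onto the number right before it: '10', 'ML' -> '10ML'.
        if merged and token.lower() in UNITS and is_number(merged[-1]):
            merged[-1] += token
        else:
            merged.append(format_code(token))
    return " ".join(merged)


print(normalize("0.9% SODIUM CHLORIDE 290306544 FLUSH 0.9 % SYRINGE 10 cm"))
# 0.9% SODIUM CHLORIDE 0290-3065-44 FLUSH 0.9% SYRINGE 10cm
```

Because a unit is only merged when the previous token is a number, a word such as FLUSH is never glued onto the code in front of it.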

Solution 2:[2]

One other way:

import re
string1 = '0.9% SODIUM CHLORIDE 8290306544 FLUSH 0.9 % SYRINGE 10 ML'
string2 = '0.9% SODIUM CHLORIDE 8290-3071-44 FLUSH 0.9 % SYRINGE 10 MM'
string3 = '0.9% SODIUM CHLORIDE 290306544 FLUSH 0.9 % SYRINGE 10 cm'

def repl(x):
    s = x.group(1)
    if s is not None:
        t = s.zfill(10) # 9 digits: add a leading zero
        return f'{t[:4]}-{t[4:8]}-{t[8:]}' # 4 digits - 4 digits - 2 digits
    # otherwise group 2 matched: a number, a space and a unit (or %)
    return x.group(2).replace(' ', '')

def my_fun(string):
    # \b after the letters keeps '44 FLUSH' from being read as a number plus the unit 'FL'
    return re.sub(r'(\b\d{9,10}\b)|(\d{1,3} (?:%|[a-zA-Z]{1,2}\b))', repl, string)

my_fun(string1)
Out[]: '0.9% SODIUM CHLORIDE 8290-3065-44 FLUSH 0.9% SYRINGE 10ML'

my_fun(string2)
Out[]: '0.9% SODIUM CHLORIDE 8290-3071-44 FLUSH 0.9% SYRINGE 10MM'

my_fun(string3)
Out[]: '0.9% SODIUM CHLORIDE 0290-3065-44 FLUSH 0.9% SYRINGE 10cm'
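
A unit pattern that accepts any one or two letters after the space will also accept the first letters of an ordinary word, which is how a result like 44FLUSH can appear. The sketch below compares such a pattern with one that requires a word boundary after the letters; the names loose, strict and squeeze are hypothetical and only serve this illustration.

```python
import re

text = "8290-3071-44 FLUSH 0.9 % SYRINGE 10 MM"

# Without a trailing boundary, '44 FL' counts as a number plus a two-letter unit.
loose = re.compile(r"\d{1,3} [%a-zA-Z]{1,2}")
# With \b after the letters, only whole one- or two-letter words qualify.
strict = re.compile(r"\d{1,3} (?:%|[a-zA-Z]{1,2}\b)")


def squeeze(m):
    """Drop the space inside a matched 'number unit' pair."""
    return m.group().replace(" ", "")


print(loose.sub(squeeze, text))   # 8290-3071-44FLUSH 0.9% SYRINGE 10MM
print(strict.sub(squeeze, text))  # 8290-3071-44 FLUSH 0.9% SYRINGE 10MM
```

The % sign needs its own alternative because it is not a word character, so a \b placed directly after it would fail at the end of the string or before a space.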

Solution 3:[3]

You could use a pattern that either captures a 9- or 10-digit number in three capture groups, or matches digits followed by whitespace and a percentage sign or a unit.

Then you can make use of re.sub with a callback function that checks whether the capture groups took part in the match. If they did, return the number formatted with the hyphens, else remove the whitespace chars from the match.

(?i)\b(\d{3,4})(\d{4})(\d{2})\b|\b\d+\s+(?:M[ML]|cm|%)

Explanation

  • (?i) Inline modifier for a case insensitive match
  • \b(\d{3,4}) A word boundary to prevent a partial word match, and capture 3-4 digits in group 1 (3 digits when the number has 9 digits)
  • (\d{4})(\d{2}) Capture 4 digits in group 2 and 2 digits in group 3
  • \b A word boundary
  • | Or
  • \b\d+ A word boundary, then match 1+ digits
  • \s+(?:M[ML]|cm|%) Match 1+ whitespace chars followed by either a unit or a percentage sign (You can extend the alternation of the units with the ones you want to allow)

Example code

import re

pattern = r"(?i)\b(\d{3,4})(\d{4})(\d{2})\b|\b\d+\s+(?:M[ML]|cm|%)"

s = ("0.9% SODIUM CHLORIDE 8290306544 FLUSH 0.9 % SYRINGE 10 ML\n"
     "0.9% SODIUM CHLORIDE 8290-3071-44 FLUSH 0.9 % SYRINGE 10 MM\n"
     "0.9% SODIUM CHLORIDE 290306544 FLUSH 0.9 % SYRINGE 10 cm\n")

def replacement(m):
    if m.group(1):
        # zfill(4) adds the leading zero when group 1 holds only 3 digits
        return "-".join((m.group(1).zfill(4), m.group(2), m.group(3)))
    return re.sub(r"\s+", "", m.group())

print(re.sub(pattern, replacement, s))

Output

0.9% SODIUM CHLORIDE 8290-3065-44 FLUSH 0.9% SYRINGE 10ML
0.9% SODIUM CHLORIDE 8290-3071-44 FLUSH 0.9% SYRINGE 10MM
0.9% SODIUM CHLORIDE 0290-3065-44 FLUSH 0.9% SYRINGE 10cm
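
When the list of allowed units grows, writing the alternation by hand gets error-prone. One possible approach is to build it from a list, as in the sketch below. The unit list and the names UNITS, unit_alt, unit_pattern and join_units are hypothetical, and unlike the pattern above this version also captures a decimal part and requires a word boundary after each unit.

```python
import re

# Hypothetical unit list (an assumption); add the units your data uses.
UNITS = ["ml", "mm", "cm", "mg", "mcg", "l"]

# Longest first so 'mcg' is tried before shorter alternatives; re.escape
# keeps any special characters in a unit literal.
unit_alt = "|".join(re.escape(u) for u in sorted(UNITS, key=len, reverse=True))

# Units need a trailing word boundary; '%' is not a word character, so it
# is handled as a separate alternative without one.
unit_pattern = re.compile(
    rf"\b(\d+(?:\.\d+)?)\s+((?:{unit_alt})\b|%)", re.IGNORECASE
)


def join_units(text):
    """Remove the whitespace between a number and a known unit or '%'."""
    return unit_pattern.sub(r"\1\2", text)


print(join_units("FLUSH 0.9 % SYRINGE 10 ML 5 mcg 3 LITERS"))
# FLUSH 0.9% SYRINGE 10ML 5mcg 3 LITERS
```

The trailing \b is what leaves "3 LITERS" untouched even though "l" is in the list.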


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2
Solution 3 The fourth bird