Parsing fractions with Python
I'm working with a dataframe of medicinal products. I have to extract the dosage from the product name (a string) and then replace it in the name with the reduced form of the dosage.
Example of what I have:
Name
'Prenoxad 2mg/2ml solution for injection pre-filled syringes'
I want to have, stored in a new column:
Name_reduced
'Prenoxad 1mg/ml solution for injection pre-filled syringes'
I asked for help doing this here, and got:
import re
from fractions import Fraction

def replace_ratio(ratio):
    # Reduce the captured mg/ml ratio to lowest terms
    fraction = Fraction(int(ratio.group(1)), int(ratio.group(2)))
    numerator = fraction.numerator
    # Drop a denominator of 1 so that "1mg/1ml" reads "1mg/ml"
    denominator = "" if fraction.denominator == 1 else fraction.denominator
    return f"{numerator}mg/{denominator}ml"

def process_text(text):
    return re.sub(r"(\d+)mg/(\d+)ml", replace_ratio, text)

print(process_text("Prenoxad 2mg/2ml solution for injection pre-filled syringes"))
# -> Prenoxad 1mg/ml solution for injection pre-filled syringes
print(process_text("Prenoxad 10mg/3ml solution for injection pre-filled syringes"))
# -> Prenoxad 10mg/3ml solution for injection pre-filled syringes
print(process_text("Prenoxad 120mg/6ml solution for injection pre-filled syringes"))
# -> Prenoxad 20mg/ml solution for injection pre-filled syringes
This works pretty well except for very specific cases, such as:
- Floats are not well dealt with:
'Oxybutynin 2.5mg/5ml oral solution sugar free' is transformed into 'Oxybutynin 2.1mg/ml oral solution sugar free'
This happens because \d+ cannot match the decimal point, so the regex only matches '5mg/5ml' (the digits after the point) instead of the whole float.
Ideally, the output should be 'Oxybutynin 2.5mg/5ml oral solution sugar free'.
- Some products have dosage information for more than one substance:
'Co-amoxiclav 250mg/62mg/5ml oral suspension sugar free' means that 5 ml reconstituted suspension contain 250 mg of amoxicillin and 62 mg of clavulanic acid. For this particular product the function outputs 'Co-amoxiclav 250mg/62mg/5ml oral suspension sugar free', but if the product was named 'Co-amoxiclav 250mg/60mg/5ml oral suspension sugar free' it would return 'Co-amoxiclav 250mg/12mg/ml oral suspension sugar free' because it only recognizes the last 2 values.
To solve this, maybe the best option is to add an exception so that every time the product has information for more than one substance (measured in mg), the function leaves that product unchanged.
- The last one is for cases where there's a blank space around the slash:
print(process_text("Prenoxad 120mg/6ml solution for injection pre-filled syringes"))
# -> Prenoxad 20mg/ml solution for injection pre-filled syringes
print(process_text("Prenoxad 120mg /6ml solution for injection pre-filled syringes"))
# -> Prenoxad 120mg /6ml solution for injection pre-filled syringes
print(process_text("Prenoxad 120mg/ 6ml solution for injection pre-filled syringes"))
# -> Prenoxad 120mg/ 6ml solution for injection pre-filled syringes
How can I solve these 3 problems?
Solution 1:[1]
Your task is all about parsing text, and you are on the right track with regular expressions, but you should modify your regex a bit.
The expression r"(\d+)mg/(\d+)ml" is good, but it is not enough.
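To see why the pattern falls short, it can help to print what it actually captures for the problem names from the question. This is only a small inspection sketch; the helper name `captured` and the `samples` list are made up for illustration.

```python
import re

# The pattern from the question, unchanged.
ORIGINAL = re.compile(r"(\d+)mg/(\d+)ml")

def captured(pattern, text):
    """Return the captured groups of the first match, or None if nothing matches."""
    match = pattern.search(text)
    return match.groups() if match else None

samples = [
    "Oxybutynin 2.5mg/5ml oral solution sugar free",                   # ('5', '5'): "2." is skipped
    "Prenoxad 120mg /6ml solution for injection pre-filled syringes",  # None: the space blocks the match
    "Co-amoxiclav 250mg/60mg/5ml oral suspension sugar free",          # ('60', '5'): first strength ignored
]

for text in samples:
    print(captured(ORIGINAL, text), "<-", text)
```

Each line of output corresponds to one of the three problems: the decimal part is lost, the spaced ratio is not matched at all, and only the last two values of a multi-substance name are seen.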
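To cope with whitespace, add \s* on either side of the slash.
To cope with decimal points, use [\d\.]+ rather than just \d+.
So the modified regex should be something like
([\d\.]+)mg\s*/\s*([\d\.]+)ml
With decimals in the match, replace_ratio has to build the Fraction from the matched strings directly (Fraction accepts a string such as "2.5"), because int("2.5") raises a ValueError.
What about complicated portions like "1mg/1mg/1mg"? A match object has no length, so len(ratio) raises a TypeError. Instead, check before substituting whether the name contains more than one mg value, for example by counting the results of re.findall, and leave such names unchanged.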
There is a nice site regex101.com to debug regexes.
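Putting the suggestions above together, a minimal sketch might look like the following. It is not a definitive implementation and rests on a few assumptions: `NUMBER` is a stricter stand-in for `[\d\.]+` (so that strings like "1.2.3" are never passed to `Fraction`), `MULTI_STRENGTH` is a hypothetical guard pattern for names with more than one mg strength, and such names are simply returned unchanged, as the question proposes.

```python
import re
from fractions import Fraction

# Assumption: digits with an optional decimal part, stricter than [\d\.]+.
NUMBER = r"\d+(?:\.\d+)?"

# mg/ml ratio, tolerating whitespace on either side of the slash.
RATIO = re.compile(rf"({NUMBER})mg\s*/\s*({NUMBER})ml")

# Hypothetical guard: two mg strengths in a row, e.g. "250mg/60mg".
MULTI_STRENGTH = re.compile(rf"{NUMBER}mg\s*/\s*{NUMBER}mg")

def replace_ratio(match):
    # Fraction parses decimal strings such as "2.5" directly.
    volume = Fraction(match.group(2))
    if volume == 0:
        return match.group(0)  # leave a zero volume untouched
    fraction = Fraction(match.group(1)) / volume
    denominator = "" if fraction.denominator == 1 else fraction.denominator
    return f"{fraction.numerator}mg/{denominator}ml"

def process_text(text):
    # Skip products that list more than one substance strength.
    if MULTI_STRENGTH.search(text):
        return text
    return RATIO.sub(replace_ratio, text)

print(process_text("Prenoxad 120mg /6ml solution for injection pre-filled syringes"))
# -> Prenoxad 20mg/ml solution for injection pre-filled syringes
print(process_text("Co-amoxiclav 250mg/60mg/5ml oral suspension sugar free"))
# -> Co-amoxiclav 250mg/60mg/5ml oral suspension sugar free
print(process_text("Oxybutynin 2.5mg/5ml oral solution sugar free"))
# -> Oxybutynin 1mg/2ml oral solution sugar free
```

Because the sketch keeps the Fraction-based reduction, a decimal dosage such as 2.5mg/5ml is reduced to 1mg/2ml. If decimal dosages should instead stay exactly as written, returning `match.group(0)` whenever a captured number contains a "." would do that.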
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | ????????? ? |
