Parse ½ as 0.5 in Python 2.7

I am scraping a page with BeautifulSoup4.

I am parsing the page HTML like this:

page = BeautifulSoup(page.replace('ISO-8859-1', 'utf-8'),"html5lib")

The page contains values like these: -4 -115 (separated by -)

I want both values in a list, so I am using this regex:

value = re.findall(r'[+-]?\d+', value)

It works perfectly, except for values like +2½ -102, where I only get [-102].

To work around this, I also tried:

value = value.replace("½","0.5")
value = re.findall(r'[+-]?\d+', value)

but this gives me an encoding error saying I have to set the encoding of my file.

I also tried setting encoding=utf-8 at the top of the file, but it still gives the same error.

How do I convert ½ to 0.5?



Solution 1:[1]

To embed Unicode literals like ½ in your Python 2 script you need to use a special comment at the top of your script that lets the interpreter know how the Unicode has been encoded. If you want to use UTF-8 you will also need to tell your editor to save the file as UTF-8. And if you want to print Unicode text make sure your terminal is set to use UTF-8, too.

Here's a short example, tested on Python 2.6.6

# -*- coding: utf-8 -*-

value = "a string with fractions like 2½ in it"
value = value.replace("½",".5")
print(value)

output

a string with fractions like 2.5 in it

Note that I'm using ".5" as the replacement string; using "0.5" would convert "2½" to "20.5", which would not be correct.


Actually, those strings should be marked as Unicode strings, like this:

# -*- coding: utf-8 -*-

value = u"a string with fractions like 2½ in it"
value = value.replace(u"½", u".5")
print(value)

For further information on using Unicode in Python, please see Pragmatic Unicode, which was written by SO veteran Ned Batchelder.
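If you would rather not depend on how the source file is saved, one option is to write the character as a \u escape. The snippet below is a minimal sketch of that idea and of the ".5" versus "0.5" point made above. The names HALF, good and bad are only illustrative, and the code is written so that it should run unchanged on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# U+00BD written as an escape, so the source file itself stays pure ASCII.
HALF = u"\u00bd"

value = u"+2" + HALF + u" -102"

good = value.replace(HALF, u".5")    # turns "2" + HALF into "2.5"
bad = value.replace(HALF, u"0.5")    # glues "0.5" onto the "2", giving "20.5"

print(good)   # +2.5 -102
print(bad)    # +20.5 -102
```

The escape form is handy when you cannot control the editor or terminal encoding, at the cost of being less readable than the literal ½.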


I should also mention that you will need to change your regex pattern so that it allows a decimal point in numbers. For example:

# -*- coding: utf-8 -*-
from __future__ import print_function
import re

pat = re.compile(r'[-+]?(?:\d*?[.])?\d+', re.U) 

data = u"+2½ -105 -2½ -115 +2½ -105 -2½ -115 +2½ -102 -2½ -114"
print(data)
print(pat.findall(data.replace(u"½", u".5")))

output

+2½ -105 -2½ -115 +2½ -105 -2½ -115 +2½ -102 -2½ -114
[u'+2.5', u'-105', u'-2.5', u'-115', u'+2.5', u'-105', u'-2.5', u'-115', u'+2.5', u'-102', u'-2.5', u'-114']
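The matches above are still strings. If you need actual numbers, a small follow-up step can convert them. This is a sketch only, assuming Python 3 (where str is already Unicode); parse_values is a hypothetical helper name, not something from the question.

```python
import re

# Same pattern as above; Python 3 str patterns are Unicode-aware by default.
NUMBER = re.compile(r'[-+]?(?:\d*?[.])?\d+')


def parse_values(text):
    """Hypothetical helper: return every signed number in text as a float."""
    normalized = text.replace("\u00bd", ".5")
    return [float(match) for match in NUMBER.findall(normalized)]


print(parse_values("+2\u00bd -102"))   # [2.5, -102.0]
print(parse_values("-\u00bd -115"))    # [-0.5, -115.0]
```

Note that a bare ½ with no leading digit becomes ".5", which float() accepts.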

Solution 2:[2]

Unicode has more vulgar fractions than just ½. Here's some code to parse them all:

# coding=utf8

# curl -s "http://www.unicode.org/Public/UNIDATA/extracted/DerivedNumericValues.txt" | grep "VULGAR FRACTION"
fractions = {
    0x2189: 0.0,  # ; ; 0 # No       VULGAR FRACTION ZERO THIRDS
    0x2152: 0.1,  # ; ; 1/10 # No       VULGAR FRACTION ONE TENTH
    0x2151: 0.11111111,  # ; ; 1/9 # No       VULGAR FRACTION ONE NINTH
    0x215B: 0.125,  # ; ; 1/8 # No       VULGAR FRACTION ONE EIGHTH
    0x2150: 0.14285714,  # ; ; 1/7 # No       VULGAR FRACTION ONE SEVENTH
    0x2159: 0.16666667,  # ; ; 1/6 # No       VULGAR FRACTION ONE SIXTH
    0x2155: 0.2,  # ; ; 1/5 # No       VULGAR FRACTION ONE FIFTH
    0x00BC: 0.25,  # ; ; 1/4 # No       VULGAR FRACTION ONE QUARTER
    0x2153: 0.33333333,  # ; ; 1/3 # No       VULGAR FRACTION ONE THIRD
    0x215C: 0.375,  # ; ; 3/8 # No       VULGAR FRACTION THREE EIGHTHS
    0x2156: 0.4,  # ; ; 2/5 # No       VULGAR FRACTION TWO FIFTHS
    0x00BD: 0.5,  # ; ; 1/2 # No       VULGAR FRACTION ONE HALF
    0x2157: 0.6,  # ; ; 3/5 # No       VULGAR FRACTION THREE FIFTHS
    0x215D: 0.625,  # ; ; 5/8 # No       VULGAR FRACTION FIVE EIGHTHS
    0x2154: 0.66666667,  # ; ; 2/3 # No       VULGAR FRACTION TWO THIRDS
    0x00BE: 0.75,  # ; ; 3/4 # No       VULGAR FRACTION THREE QUARTERS
    0x2158: 0.8,  # ; ; 4/5 # No       VULGAR FRACTION FOUR FIFTHS
    0x215A: 0.83333333,  # ; ; 5/6 # No       VULGAR FRACTION FIVE SIXTHS
    0x215E: 0.875,  # ; ; 7/8 # No       VULGAR FRACTION SEVEN EIGHTHS
}

rx = r'(?u)([+-])?(\d*)(%s)' % '|'.join(map(unichr, fractions))

test = u'15? and ¼ and +212½ and -?'

import re

for sign, d, f in re.findall(rx, test):
    sign = -1 if sign == '-' else 1
    d = int(d) if d else 0
    number = sign * (d + fractions[ord(f)])
    print 'found', number
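The code above targets Python 2 (unichr, print statement) and hard-codes rounded values. As a rough Python 3 sketch of the same approach, the table can be built from the interpreter's own Unicode database, and fractions.Fraction can keep the values exact. The names vulgar, pattern and find_mixed_numbers are illustrative assumptions, and limit_denominator(100) is just a convenient bound that recovers the small denominators used by these characters.

```python
import re
import sys
import unicodedata
from fractions import Fraction

# Collect every character named "VULGAR FRACTION ..." known to this Python build.
vulgar = {}
for code in range(sys.maxunicode + 1):
    ch = chr(code)
    if unicodedata.name(ch, "").startswith("VULGAR FRACTION"):
        # numeric() returns a float; limit_denominator recovers e.g. 1/3 exactly.
        vulgar[ch] = Fraction(unicodedata.numeric(ch)).limit_denominator(100)

# Optional sign, optional whole part, then exactly one fraction character.
pattern = re.compile(r'([+-])?(\d*)([%s])' % ''.join(vulgar))


def find_mixed_numbers(text):
    """Hypothetical helper: yield each mixed number in text as an exact Fraction."""
    for sign, digits, frac in pattern.findall(text):
        magnitude = int(digits or 0) + vulgar[frac]
        yield -magnitude if sign == '-' else magnitude


print(list(find_mixed_numbers("\u00bc and +212\u00bd and -\u2154")))
# [Fraction(1, 4), Fraction(425, 2), Fraction(-2, 3)]
```

Call float() on a result if you need a decimal value rather than an exact ratio.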

Solution 3:[3]

If you really need a regex, you can match the Unicode character directly, as below. The character is VULGAR FRACTION ONE HALF (U+00BD); see the linked character reference for details.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

txt = u'-½ -103+½ -113-½ -105+½ -115-½ -105+½ -115 My test for Fraction -1½ -115'

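# \d* matches the optional whole part; the original [\d+]? was a character class matching one digit or a literal '+'
print ''.join(re.findall(ur'[+-]?\d*\u00BD?', txt))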

#for replacing

print re.sub(ur'\u00BD',ur'.5',txt)

Output-

-½-103+½-113-½-105+½-115-½-105+½-115-1½-115
-.5 -103+.5 -113-.5 -105+.5 -115-.5 -105+.5 -115 My test for Fraction -1.5 -115

N.B. You can adapt the script as you like, but to handle other fractions you will need to substitute the code point of the corresponding VULGAR FRACTION character; those code points are listed on the site referenced above.
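To handle more than one fraction with re.sub, one possible approach is a character class plus a lookup table. This is only a sketch, assuming Python 3 and assuming you only care about the three Latin-1 fractions; DECIMALS, FRACTION and to_decimal_text are hypothetical names.

```python
import re

# Assumed mapping for the three Latin-1 fractions; extend it as needed.
DECIMALS = {"\u00bc": ".25", "\u00bd": ".5", "\u00be": ".75"}
FRACTION = re.compile("[\u00bc\u00bd\u00be]")


def to_decimal_text(text):
    """Replace each fraction character with its decimal suffix."""
    return FRACTION.sub(lambda m: DECIMALS[m.group()], text)


print(to_decimal_text("-1\u00bd -115 +\u00be"))   # -1.5 -115 +.75
```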

Solution 4:[4]

For a more general solution I used unicodedata.numeric(character). This can convert any Unicode character that has a numeric value, such as ⅑, to its numeric form (0.1111111).

The solution is a bit long but I thought someone might find it useful.

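import unicodedata


def has_vulgar_fraction(digits: dict):
    # A fraction character has a value strictly between 0 and 1;
    # the lower bound keeps the digit "0" from being mistaken for a fraction.
    return any(0 < value < 1 for value in digits.values())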

def get_the_string_value_pair(value_digits):
    # Assumes exactly one fraction character; the rest are decimal digits
    value_digits_len = len(value_digits) - 1

    # Do the maths: merge all numbers found
    ten_multiplier = 10**(value_digits_len - 1) # 10^(n - 1)
    total_sum_product = 0
    full_digit_string = ""
    for key, value in value_digits.items():
        if 0 < value < 1:
            total_sum_product += value
        else:
            total_sum_product += value*ten_multiplier
            ten_multiplier = ten_multiplier/10
        full_digit_string += key

    return full_digit_string, total_sum_product

def convert_all_vulgar_fractions(string_value):
    result = {}
    value_digits = {}
    for character in string_value:
        try:
            # The heart of the solution is here
            value_digits[character] = unicodedata.numeric(character)
        except ValueError:
            # Non-numeric character: if the run so far has no vulgar fraction, e.g. 1.25, don't try to parse it
            if not has_vulgar_fraction(value_digits):
                value_digits = {}
                continue

            # exclude the vulgar fraction value
            key, value = get_the_string_value_pair(value_digits)
            result[key] = value
            value_digits = {}

    # Sometimes the string has the fraction at the end
    if has_vulgar_fraction(value_digits):
        key, value = get_the_string_value_pair(value_digits)
        result[key] = value

    return result


if __name__ == "__main__":
    # Nonsense ingredient
    ingredient = "1.25 teaspoon 423½ ground 4½ cayenne pepper 15⅑"
    items_to_replace = convert_all_vulgar_fractions(ingredient)
    # items_to_replace = {'423½': 423.5, '4½': 4.5, '15⅑': 15.11111111111111}

    # Then we replace them in the original string
    for key, value in items_to_replace.items():
        ingredient = ingredient.replace(key, str(value))

    print(ingredient)

Summary: You can use unicodedata.numeric(character) to convert any numeric character to a float.
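As a quick illustration of that summary, here is a minimal Python 3 sketch of unicodedata.numeric on its own. The wrapper name numeric_or_none is hypothetical; it relies on the optional default argument that unicodedata.numeric accepts instead of raising ValueError.

```python
import unicodedata


def numeric_or_none(character):
    """Hypothetical wrapper: numeric value of one character, or None if it has none."""
    return unicodedata.numeric(character, None)


for ch in ("\u00bd", "\u2151", "7", "x"):
    print(repr(ch), numeric_or_none(ch))
```

Ordinary digits also have a numeric value, which is why the longer solution above has to tell digits and fractions apart.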

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 georg
Solution 3
Solution 4 Lebohang Mbele