Parse ½ as 0.5 in Python 2.7
I am scraping this link with BeautifulSoup4 and parsing the page HTML like this:
page = BeautifulSoup(page.replace('ISO-8859-1', 'utf-8'),"html5lib")
The page contains values like -4 -115 (separated by -).
I want both values in a list, so I am using this regex:
value = re.findall(r'[+-]?\d+', value)
It works perfectly, except for values like +2½ -102, where I only get [-102].
To tackle this, I also tried:
value = value.replace("½","0.5")
value = re.findall(r'[+-]?\d+', value)
but this raises an encoding error saying I have to declare the encoding of my file.
I also tried setting the encoding to utf-8 at the top of the file, but I still get the same error.
How do I convert ½ to 0.5?
Solution 1:[1]
To embed Unicode literals like ½ in your Python 2 script you need to use a special comment at the top of your script that lets the interpreter know how the Unicode has been encoded. If you want to use UTF-8 you will also need to tell your editor to save the file as UTF-8. And if you want to print Unicode text make sure your terminal is set to use UTF-8, too.
Here's a short example, tested on Python 2.6.6
# -*- coding: utf-8 -*-
value = "a string with fractions like 2½ in it"
value = value.replace("½",".5")
print(value)
output
a string with fractions like 2.5 in it
Note that I'm using ".5" as the replacement string; using "0.5" would convert "2½" to "20.5", which would not be correct.
Actually, those strings should be marked as Unicode strings, like this:
# -*- coding: utf-8 -*-
value = u"a string with fractions like 2½ in it"
value = value.replace(u"½", u".5")
print(value)
For further information on using Unicode in Python, please see Pragmatic Unicode, which was written by SO veteran Ned Batchelder.
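To illustrate why the encoding has to be consistent, here is a minimal sketch. It assumes Python 3 (not the asker's Python 2.7), where byte strings and text strings are separate types, and the sample string is made up for the demonstration. It shows that ½ is two bytes in UTF-8 but one byte in ISO-8859-1, and that a byte-level replace with the wrong encoding corrupts the data.

```python
half = "\u00bd"  # VULGAR FRACTION ONE HALF

utf8_bytes = half.encode("utf-8")         # two bytes
latin1_bytes = half.encode("iso-8859-1")  # one byte
print(utf8_bytes, latin1_bytes)           # b'\xc2\xbd' b'\xbd'

# Hypothetical page content, stored as UTF-8 bytes.
page_bytes = "+2\u00bd -102".encode("utf-8")

# Replacing with the wrong encoding leaves a stray \xc2 byte behind.
print(page_bytes.replace(latin1_bytes, b".5"))  # b'+2\xc2.5 -102'

# Replacing with the matching encoding works.
print(page_bytes.replace(utf8_bytes, b".5"))    # b'+2.5 -102'

# Decoding to text first avoids the problem altogether.
print(page_bytes.decode("utf-8").replace(half, ".5"))  # +2.5 -102
```

Working on decoded text, as in the Unicode-string version above, is usually the less error-prone choice.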
I should also mention that you will need to change your regex pattern so that it allows a decimal point in numbers. For example:
# -*- coding: utf-8 -*-
from __future__ import print_function
import re
pat = re.compile(r'[-+]?(?:\d*?[.])?\d+', re.U)
data = u"+2½ -105 -2½ -115 +2½ -105 -2½ -115 +2½ -102 -2½ -114"
print(data)
print(pat.findall(data.replace(u"½", u".5")))
output
+2½ -105 -2½ -115 +2½ -105 -2½ -115 +2½ -102 -2½ -114
[u'+2.5', u'-105', u'-2.5', u'-115', u'+2.5', u'-105', u'-2.5', u'-115', u'+2.5', u'-102', u'-2.5', u'-114']
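If actual numbers are needed rather than strings, the matches can be passed to float(). The following is a rough sketch assuming Python 3; parse_values is a hypothetical helper name, not part of the original answer. It reuses the pattern above and also copes with a bare ½ that has no leading digit.

```python
import re

# Same pattern as above: optional sign, optional "digits." prefix, digits.
PAT = re.compile(r'[-+]?(?:\d*?[.])?\d+')

def parse_values(text):
    """Return the signed numbers in text as floats, reading ½ as .5."""
    cleaned = text.replace("\u00bd", ".5")
    return [float(token) for token in PAT.findall(cleaned)]

print(parse_values("+2\u00bd -102"))  # [2.5, -102.0]
print(parse_values("-\u00bd -103"))   # [-0.5, -103.0]
```

float() accepts strings such as "-.5", so the missing leading zero is not a problem at this stage.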
Solution 2:[2]
There are more vulgar fractions in Unicode than just ½, here's some code to parse 'em all:
# coding=utf8
# curl -s "http://www.unicode.org/Public/UNIDATA/extracted/DerivedNumericValues.txt" | grep "VULGAR FRACTION"
fractions = {
0x2189: 0.0, # ; ; 0 # No VULGAR FRACTION ZERO THIRDS
0x2152: 0.1, # ; ; 1/10 # No VULGAR FRACTION ONE TENTH
0x2151: 0.11111111, # ; ; 1/9 # No VULGAR FRACTION ONE NINTH
0x215B: 0.125, # ; ; 1/8 # No VULGAR FRACTION ONE EIGHTH
0x2150: 0.14285714, # ; ; 1/7 # No VULGAR FRACTION ONE SEVENTH
0x2159: 0.16666667, # ; ; 1/6 # No VULGAR FRACTION ONE SIXTH
0x2155: 0.2, # ; ; 1/5 # No VULGAR FRACTION ONE FIFTH
0x00BC: 0.25, # ; ; 1/4 # No VULGAR FRACTION ONE QUARTER
0x2153: 0.33333333, # ; ; 1/3 # No VULGAR FRACTION ONE THIRD
0x215C: 0.375, # ; ; 3/8 # No VULGAR FRACTION THREE EIGHTHS
0x2156: 0.4, # ; ; 2/5 # No VULGAR FRACTION TWO FIFTHS
0x00BD: 0.5, # ; ; 1/2 # No VULGAR FRACTION ONE HALF
0x2157: 0.6, # ; ; 3/5 # No VULGAR FRACTION THREE FIFTHS
0x215D: 0.625, # ; ; 5/8 # No VULGAR FRACTION FIVE EIGHTHS
0x2154: 0.66666667, # ; ; 2/3 # No VULGAR FRACTION TWO THIRDS
0x00BE: 0.75, # ; ; 3/4 # No VULGAR FRACTION THREE QUARTERS
0x2158: 0.8, # ; ; 4/5 # No VULGAR FRACTION FOUR FIFTHS
0x215A: 0.83333333, # ; ; 5/6 # No VULGAR FRACTION FIVE SIXTHS
0x215E: 0.875, # ; ; 7/8 # No VULGAR FRACTION SEVEN EIGHTHS
}
import re
rx = r'(?u)([+-])?(\d*)(%s)' % '|'.join(map(unichr, fractions))
test = u'15? and ¼ and +212½ and -?'
for sign, d, f in re.findall(rx, test):
    sign = -1 if sign == '-' else 1
    d = int(d) if d else 0
    number = sign * (d + fractions[ord(f)])
    print 'found', number
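The same idea can be sketched for Python 3, where unichr and the print statement no longer exist. This version is an assumption-laden variant rather than the answer's own code: instead of a hand-written table it builds the lookup from the standard unicodedata module, and FRACTIONS and find_mixed_numbers are names invented for the example.

```python
import re
import sys
import unicodedata

# Collect every character whose Unicode name contains "VULGAR FRACTION",
# mapped to its numeric value as reported by unicodedata.
FRACTIONS = {}
for code_point in range(sys.maxunicode + 1):
    ch = chr(code_point)
    if "VULGAR FRACTION" in unicodedata.name(ch, ""):
        FRACTIONS[ch] = unicodedata.numeric(ch)

# Optional sign, optional whole-number digits, then one fraction character.
PATTERN = re.compile(r"([+-])?(\d*)([%s])" % "".join(FRACTIONS))

def find_mixed_numbers(text):
    """Return every (optionally signed) mixed number in text as a float."""
    numbers = []
    for sign, digits, frac in PATTERN.findall(text):
        whole = int(digits) if digits else 0
        value = whole + FRACTIONS[frac]
        numbers.append(-value if sign == "-" else value)
    return numbers

print(find_mixed_numbers("¼ and +212½ and -1¾"))  # [0.25, 212.5, -1.75]
```

Building the table at import time costs one pass over the code space, but it stays in sync with whatever Unicode version the interpreter ships.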
Solution 3:[3]
If you really need a regex, you can match the character by its code point. Its Unicode name is VULGAR FRACTION ONE HALF (U+00BD).
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
txt = u'-½ -103+½ -113-½ -105+½ -115-½ -105+½ -115 My test for Fraction -1½ -115'
print ''.join(re.findall(u'[+-]?[\d+]?\u00BD?',txt))
#for replacing
print re.sub(ur'\u00BD',ur'.5',txt)
Output-
-½-103+½-113-½-105+½-115-½-105+½-115-1½-115
-.5 -103+.5 -113-.5 -105+.5 -115-.5 -105+.5 -115 My test for Fraction -1.5 -115
N.B. You can adapt the script as you like; to handle other fractions, swap in the code point of the relevant VULGAR FRACTION character.
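Note that the ur'' prefix is not valid in Python 3. As a hedged sketch of the replacement step for Python 3, the variant below uses a callback so that a bare ½ becomes 0.5 while a mixed number such as 1½ becomes 1.5. The name replace_half is hypothetical and the leading-zero behaviour is a design choice of this sketch, not of the answer above.

```python
import re

HALF = "\u00bd"  # VULGAR FRACTION ONE HALF

def _half_to_decimal(match):
    digit = match.group(1)
    # A digit before the fraction means a mixed number ("1½" -> "1.5");
    # otherwise emit a leading zero ("-½" -> "-0.5").
    return digit + ".5" if digit else "0.5"

def replace_half(text):
    """Replace each ½ in text with its decimal form."""
    return re.sub(r"(\d?)" + HALF, _half_to_decimal, text)

print(replace_half("-½ -103+½ -113 Fraction -1½ -115"))
# -0.5 -103+0.5 -113 Fraction -1.5 -115
```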
Solution 4:[4]
For a more general solution I used unicodedata.numeric(character). It converts any numeric Unicode character, such as ⅑, to its numeric value (0.1111111...).
The solution is a bit long but I thought someone might find it useful.
import unicodedata

def has_vulgar_fraction(digits: dict):
    # A vulgar fraction has a value strictly between 0 and 1,
    # so the plain digit "0" is not mistaken for one
    return any(0 < value < 1 for value in digits.values())

def get_the_string_value_pair(value_digits):
    # Expects whole-number digits followed by a single fraction character
    value_digits_len = len(value_digits) - 1
    ten_multiplier = 10**(value_digits_len - 1) # 10^(n - 1)
    total_sum_product = 0
    full_digit_string = ""
    # Do the maths: merge all numbers found
    for key, value in value_digits.items():
        if 0 < value < 1:
            total_sum_product += value
        else:
            total_sum_product += value*ten_multiplier
            ten_multiplier = ten_multiplier/10
        full_digit_string += key
    return full_digit_string, total_sum_product

def convert_all_vulgar_fractions(string_value):
    result = {}
    # Keyed by character, so a repeated digit (e.g. "44½") collapses into one entry
    value_digits = {}
    for character in string_value:
        try:
            # The heart of the solution is here
            value_digits[character] = unicodedata.numeric(character)
        except ValueError:
            # Non-numeric character: the current run of digits has ended.
            # If the run has no vulgar fraction, e.g. 1.25, don't try to parse it
            if not has_vulgar_fraction(value_digits):
                value_digits = {}
                continue
            # Otherwise convert the run and record it
            key, value = get_the_string_value_pair(value_digits)
            result[key] = value
            value_digits = {}
    # Sometimes the string has the fraction at the end
    if has_vulgar_fraction(value_digits):
        key, value = get_the_string_value_pair(value_digits)
        result[key] = value
    return result

if __name__ == "__main__":
    # Nonsense ingredient
    ingredient = "1.25 teaspoon 423½ ground 4½ cayenne pepper 15⅑"
    items_to_replace = convert_all_vulgar_fractions(ingredient)
    # items_to_replace = {'423½': 423.5, '4½': 4.5, '15⅑': 15.11111111111111}
    # Then we replace them in the original string
    for key, value in items_to_replace.items():
        ingredient = ingredient.replace(key, str(value))
    print(ingredient)
Summary: You can use unicodedata.numeric(character) to convert any numeric character to a float.
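Because the approach above stores digits in a dict keyed by character, inputs with repeated digits need extra care. One possible shorter variant, offered only as a sketch, pairs unicodedata.numeric with a regex substitution. It assumes Python 3 and assumes that the vulgar fractions of interest are the ones at U+00BC to U+00BE, U+2150 to U+215E and U+2189; replace_vulgar_fractions is a hypothetical name.

```python
import re
import unicodedata

# Optional ASCII digits followed by exactly one vulgar fraction character.
FRACTION_RUN = re.compile(r"([0-9]*)([\u00bc-\u00be\u2150-\u215e\u2189])")

def replace_vulgar_fractions(text):
    """Rewrite runs such as '423½' as decimal strings such as '423.5'."""
    def to_decimal(match):
        whole = int(match.group(1) or 0)
        return str(whole + unicodedata.numeric(match.group(2)))
    return FRACTION_RUN.sub(to_decimal, text)

print(replace_vulgar_fractions("1.25 teaspoon 423½ ground 44½ cayenne"))
# 1.25 teaspoon 423.5 ground 44.5 cayenne
```

Since the whole-number part is parsed with int() instead of digit by digit, repeated digits and zeros are handled without special cases.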
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | georg |
| Solution 3 | |
| Solution 4 | Lebohang Mbele |
