Escape character inconsistencies in contractions in strings
I have text that I'm trying to process. Here are 2 examples:
Example 1: <p>An alternative way with <code>*</code>:</p>

<pre><code>puts ["Toronto", "Maple Leafs"] * ', '
#Toronto, Maple Leafs
#=> nil
</code></pre>

<p>But I don't think anyone uses this notation, so as recommended in another answer use <code>join</code>.</p>

Example 2: the thing is that I don't know what's the best way to solve it.
I am using BeautifulSoup and repr to process the text. After cleaning, the two examples come out as:
Example 1: An alternative way with <code>*</code>:\n<code>puts ["Toronto", "Maple Leafs"] * \', \'\n#Toronto, Maple Leafs\n#=> nil\n</code>\nBut I don\'t think anyone uses this notation, so as recommended in another answer use <code>join</code>.\n
Example 2: the thing is that I don't know what's the best way to solve it.
My issue is with the escape character before the apostrophe. Why is the don't in example 1 processed as don\'t, while the don't in example 2 is processed as don't, without the escape character? How can I get them to be processed the same way?
Here is my code for processing the text:
from bs4 import BeautifulSoup
import html

def text_preprocessing(post):
    soup = BeautifulSoup(post, 'lxml')
    # Unwrap every tag except <code>, keeping the text it contains
    for e in soup.find_all():
        if e.name not in ['code']:
            e.unwrap()
    returnString = str(soup)
    post = html.unescape(returnString)
    # repr() turns newlines into \n; the slice strips the surrounding quotes
    returnString = repr(post)
    returnString = returnString[1:-1]
    return returnString
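A likely explanation, offered as a sketch rather than a definitive answer: repr() chooses the quote character that surrounds its result. When a string contains both ' and " (as Example 1 does, because of "Toronto"), it wraps the result in single quotes and has to escape every apostrophe. When the string contains only ' (as Example 2 does), it switches to double quotes and nothing needs escaping. The snippet below shows this with two made-up sample strings. escape_without_quotes is a hypothetical helper name, not part of any library, that undoes the apostrophe escaping so both cases come out the same way.

```python
def escape_without_quotes(text):
    """Hypothetical helper: repr-style escaping that leaves quotes unescaped."""
    quoted = repr(text)
    body = quoted[1:-1]  # drop the surrounding quote characters
    if quoted[0] == "'":
        # repr chose single quotes, so every apostrophe inside was escaped
        body = body.replace("\\'", "'")
    return body


# Sample strings invented for illustration
with_both = 'say "hi", don\'t'   # contains both " and '
only_single = "don't"            # contains only '

print(repr(with_both))    # 'say "hi", don\'t'  (single quotes, apostrophe escaped)
print(repr(only_single))  # "don't"             (double quotes, no escaping)

print(escape_without_quotes(with_both))    # say "hi", don't
print(escape_without_quotes(only_single))  # don't
```

If this matches what is happening, replacing the repr(post)[1:-1] step in text_preprocessing with a call like escape_without_quotes(post) should keep the \n escaping while treating apostrophes the same in both examples.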
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
