Escape character inconsistencies in contractions in strings

I have text that I'm trying to process. Here are 2 examples:

Example 1: <p>An alternative way with <code>*</code>:</p>&#xA;&#xA;<pre><code>puts ["Toronto", "Maple Leafs"] * ', '&#xA;#Toronto, Maple Leafs&#xA;#=&gt; nil&#xA;</code></pre>&#xA;&#xA;<p>But I don't think anyone uses this notation, so as recommended in another answer use <code>join</code>.</p>&#xA;

Example 2: the thing is that I don't know what's the best way to solve it.

I am using BeautifulSoup and repr to process the text. The two examples come out cleaned as:

Example 1:An alternative way with <code>*</code>:\n<code>puts ["Toronto", "Maple Leafs"] * \', \'\n#Toronto, Maple Leafs\n#=> nil\n</code>\nBut I don\'t think anyone uses this notation, so as recommended in another answer use <code>join</code>.\n

Example 2: the thing is that I don't know what's the best way to solve it.

My issue is with the escape character before the '. Why is the don't in Example 1 being processed as don\'t, while the don't in Example 2 is processed as don't without the escape character? How would I get them to be processed the same way?
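A likely explanation is the way repr chooses its quote style: when a string contains a single quote and no double quote, repr wraps it in double quotes and leaves the apostrophe alone; when it contains both kinds, repr wraps it in single quotes and has to escape each apostrophe. Example 1 contains "Toronto" in double quotes, Example 2 does not. The snippet below is a minimal sketch of that behaviour; the variable names and shortened strings are made up for illustration.

```python
# Shortened stand-ins for the two examples (illustrative, not the full posts)
only_single = "I don't know"               # contains ' but no "
both_quotes = 'puts ["Toronto"] don\'t'    # contains both ' and "

# repr() wraps this one in double quotes, so the apostrophe needs no escape
print(repr(only_single))

# repr() wraps this one in single quotes, so the apostrophe is escaped
print(repr(both_quotes))
```

Slicing off the first and last character afterwards removes the surrounding quotes but keeps whatever escaping repr applied inside them, which is why the two results differ.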

Here is my code for processing the text:

from bs4 import BeautifulSoup
import html

def text_preprocessing(post):
    # Parse the post and unwrap every tag except <code>
    soup = BeautifulSoup(post, 'lxml')
    for e in soup.find_all():
        if e.name not in ['code']:
            e.unwrap()

    # Decode any HTML entities left in the serialized markup
    post = html.unescape(str(soup))

    # repr() escapes newlines; the slice drops the surrounding quotes
    returnString = repr(post)[1:-1]
    return returnString
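One possible way to get both examples escaped consistently is to avoid repr for the escaping step, since repr decides per string whether apostrophes need a backslash. The sketch below assumes the goal is only to turn newlines and other control characters into visible escapes; it uses the standard unicode_escape codec, which does not escape quotes. The helper name escape_without_quotes is hypothetical, and the sample strings are shortened versions of the two examples.

```python
def escape_without_quotes(text: str) -> str:
    """Escape newlines and control characters, but leave quotes untouched.

    Hypothetical helper: a stand-in for the repr(...)[1:-1] step.
    """
    return text.encode("unicode_escape").decode("ascii")


# Shortened stand-ins for the two examples after tag stripping
example_1 = 'puts ["Toronto", "Maple Leafs"] * \', \'\nBut I don\'t think so.\n'
example_2 = "the thing is that I don't know what's the best way to solve it."

print(escape_without_quotes(example_1))
print(escape_without_quotes(example_2))
```

Two caveats with this approach: unlike repr, the codec also rewrites non-ASCII characters as \x, \u or \U escapes, and a literal backslash in the input is doubled. If that is not acceptable, replacing the two-character sequence \' with ' in the existing result may be a simpler fix.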

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
