How do I stop HTML Purifier from automatically decoding HTML entities?

I have a strange issue. I use CKEditor 4 to collect formatted text from users as HTML. The HTML content is then filtered on the server using HTML Purifier.

When the user types curly quotes, CKEditor converts them into HTML entities like &rdquo;, &rsquo;, and &ldquo;, which is fine. The issue is that when I filter the content using HTML Purifier, these entities get automatically decoded. This prevents the content from being presented to the user for later editing, because the decoded quotes (such as “) then show up in strange, garbled ways.

How do I fix this? I think that if I could stop HTML Purifier from automatically decoding entities, this would work, but I am new to HTML Purifier and can't find a way.

I have tried using htmlentities() before passing the content to HTML Purifier, but that encodes the whole HTML, markup included, which stops HTML Purifier from purifying the HTML at all.
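
To make the problem with that attempt concrete, here is a minimal sketch of what htmlentities() does to a fragment. The sample markup is made up for illustration. The tags get escaped along with the quotes, so nothing is left for HTML Purifier to treat as HTML:

```php
<?php
// Sample input: a paragraph with curly quotes, written with \u{...} escapes (PHP 7+)
// so the example does not depend on the encoding of this source file.
$html = "<p>\u{201C}Hello\u{201D}</p>";

// htmlentities() converts every character that has a named entity,
// including the angle brackets of the markup itself.
$encoded = htmlentities($html, ENT_QUOTES | ENT_HTML401, 'UTF-8');

echo $encoded, PHP_EOL;
// &lt;p&gt;&ldquo;Hello&rdquo;&lt;/p&gt;
```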



Solution 1:[1]

After CBroe's comment, I found out that my application is not using UTF-8 all the way through.

And I can't fix that either. For those in a similar situation, I found a workaround: HTML Purifier supports a configuration option that escapes all non-ASCII characters, with some trade-offs. That's fine in my case (I think).

You can enable the HTML Purifier config option Core.EscapeNonASCIICharacters like so:

$config->set('Core.EscapeNonASCIICharacters', true);

which did the trick for me.
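
As a rough illustration of why this helps, the sketch below uses plain PHP (the mbstring extension) rather than HTML Purifier itself. It assumes, purely as an example, that some step in the application reads UTF-8 bytes as Windows-1252. mb_encode_numericentity() is used here only as a stand-in for the effect of the option; it is not a claim about how HTML Purifier implements it.

```php
<?php
// A left curly quote (U+201C) as UTF-8, written as an escape (PHP 7+).
$quote = "\u{201C}";

// Assumed failure mode: the UTF-8 bytes are misread as Windows-1252
// and re-encoded, which turns one character into three.
$garbled = mb_convert_encoding($quote, 'UTF-8', 'Windows-1252');
echo $garbled, PHP_EOL; // â€œ

// Escaping non-ASCII characters as decimal numeric entities leaves
// only ASCII bytes in the string. The convmap covers U+0080..U+10FFFF.
$escaped = mb_encode_numericentity($quote, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');
echo $escaped, PHP_EOL; // &#8220;

// An ASCII-only string passes through the same misreading unchanged.
$survives = mb_convert_encoding($escaped, 'UTF-8', 'Windows-1252');
echo $survives, PHP_EOL; // &#8220;
```

The trade-off is that the stored HTML is harder to read by eye, since every non-ASCII character becomes a numeric entity, not just the quotes.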


This is the full function:

/**
 * Purifies dirty html
 *
 * @param string $dirty_html
 * @return string
 */
function purifyHtml($dirty_html)
{
    $config = HTMLPurifier_Config::createDefault();
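    $config->set('Core.Encoding', 'UTF-8');
    // Output non-ASCII characters (e.g. curly quotes) as numeric entities.
    $config->set('Core.EscapeNonASCIICharacters', true);
    $config->set('HTML.Doctype', 'HTML 4.01 Transitional');
    // getStoragePath() is my application's own helper, not part of HTML Purifier.
    $config->set('Cache.SerializerPath', getStoragePath('cache/html-purifier'));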

    $htmlPurifier = new HTMLPurifier($config);
    return $htmlPurifier->purify($dirty_html);
}

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1