In DOMDocument, why does an en dash in a title tag break Unicode strings?
Why does an en dash (–) in the title tag break Unicode strings in DOMDocument? This code
<?php
$html = <<<'HTML'
<!DOCTYPE html>
<html><head>
<title>example.org – example.org - example.org</title>
<meta charset="utf-8" />
</head>
<body>Trädgård</body>
</html>
HTML;
$domd = new DOMDocument("1.0", "UTF-8");
@$domd->loadHTML($html);
$xp = new DOMXPath($domd);
$interesting = $domd->getElementsByTagName("body")->item(0)->textContent;
var_dump($interesting, bin2hex($interesting));
prints garbage (the body text comes out double-encoded, as the hex dump shows):
string(14) "TrÃ¤dgÃ¥rd"
string(28) "5472c383c2a46467c383c2a57264"
However, if we just remove the en dash from the title on line 5, changing it to
<title>example.org example.org - example.org</title>
it prints
string(10) "Trädgård"
string(20) "5472c3a46467c3a57264"
So why does the en dash break Unicode strings in DOMDocument?
(It took me a long time to track down that the en dash was the cause x.x )
Solution 1:[1]
I don't know exactly why, but the key here seems to be that any non-ASCII character occurring before the UTF-8 declaration will confuse the parser, meaning:
<!DOCTYPE html>
<html><head>
<title>æøå</title>
<meta charset="utf-8" />
</head>
<body>Trädgård</body>
</html>
will confuse it, while
<!DOCTYPE html>
<html><head>
<meta charset="utf-8" />
<title>æøå</title>
</head>
<body>Trädgård</body>
</html>
works fine. @Tino Didriksen also found this quote from https://www.w3.org/International/questions/qa-html-encoding-declarations:
so it's best to put it immediately after the opening head tag.
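To check the ordering effect yourself, here is a minimal sketch that parses both variants and dumps the bytes of the body text. The helper name bodyTextHex is hypothetical, not part of DOMDocument, and the result for the title-first variant may depend on the libxml2 version PHP is built against. The correct UTF-8 bytes for "Trädgård" are 5472c3a46467c3a57264, which is what the meta-first variant should give.

```php
<?php
// Hypothetical helper: parse $html and return the hex bytes of the <body> text.
function bodyTextHex(string $html): string
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // @ silences libxml parse warnings, as in the question
    $body = $doc->getElementsByTagName('body')->item(0);
    return bin2hex(trim($body->textContent));
}

// Non-ASCII title BEFORE the charset declaration (the confusing order).
$titleFirst = <<<'HTML'
<!DOCTYPE html>
<html><head>
<title>æøå</title>
<meta charset="utf-8" />
</head>
<body>Trädgård</body>
</html>
HTML;

// Charset declaration immediately after the opening head tag.
$metaFirst = <<<'HTML'
<!DOCTYPE html>
<html><head>
<meta charset="utf-8" />
<title>æøå</title>
</head>
<body>Trädgård</body>
</html>
HTML;

echo "title first: ", bodyTextHex($titleFirst), "\n";
echo "meta first:  ", bodyTextHex($metaFirst), "\n";
```

If the first line shows c383c2a4 where c3a4 should be, the body text was double-encoded as in the question.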
Also, as the top-rated comment in the loadHTML documentation mentions, a quick-and-dirty workaround is
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
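Applied to the document from the question, the workaround might look like the sketch below. It assumes the input really is UTF-8; the prefix is only a hint to the parser and does not convert anything. The trim() call is an addition of mine to ignore any stray whitespace around the body text.

```php
<?php
// Same document as in the question: the en dash precedes the charset declaration.
$html = <<<'HTML'
<!DOCTYPE html>
<html><head>
<title>example.org – example.org - example.org</title>
<meta charset="utf-8" />
</head>
<body>Trädgård</body>
</html>
HTML;

$doc = new DOMDocument();
// Prepend the encoding hint so the parser starts out treating the input as UTF-8.
@$doc->loadHTML('<?xml encoding="UTF-8">' . $html);

$text = trim($doc->getElementsByTagName('body')->item(0)->textContent);
var_dump($text, bin2hex($text));
```

The hex dump should now match the correct output from the question, 5472c3a46467c3a57264, even though the en dash still comes before the meta tag.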
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | hanshenrik |
