Convert all French accents into HTML character format
For example, I have a bunch of HTML pages like this:
<!DOCTYPE html>
<html>
<head><title>Table des matières</title>
<meta http-equiv="Content-Type" content="text/html; charset="utf-8"" />
<meta name="generator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<meta name="originator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<!-- 3,html,xhtml,charset="utf-8" -->
<meta name="src" content="content_final.tex" />
<link rel="stylesheet" type="text/css" href="content_final.css" />
<script type="text/javascript" src="./jquery.js">
</script>
<script type="text/javascript">
$(document).ready(function() {
function capitalizeFirstLetter(string) {
return string.charAt(0).toUpperCase() + string.slice(1).toLowerCase();
}
$('div.caption span.id').each(function() { var result = $(this).text().replace(':','');
result=capitalizeFirstLetter(result);
$(this).text(result);
});
});
</script>
</head><body
>
<!--l. 125--><div class="crosslinks"><p class="noindent">[<a
href="chapter1.html" >next</a>] [<a
href="#tailcontent.html">tail</a>] [<a
href="/sciences/index.html" >up</a>] </p></div>
<h2 class="likechapterHead"><a
id="x2-1000"></a>Table des matières</h2>
<div class="tableofcontents">
However, I cannot convert the French accents in these HTML pages into HTML entities. For example, in
"Table des matières" above, the raw "è" appears instead of "&egrave;".
I tried two things:
for i in $(ls *.html); do iconv -f iso-8859-1 -t utf8 $i > $i"_new"; mv -f $i"_new" $i; done
=> the accents are not converted
for i in $(ls *.html); do recode ..html $i; done
=> I get the following errors:
recode: section5.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
recode: section6.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
recode: section7.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
recode: section8.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
recode: section9.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
recode: table_of_contents.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
...
I don't know how to convert all these French accents.
Does anyone have an idea or suggestion for converting all possible French accents? I would like to use the iconv, recode, or sed commands.
UPDATE 1: taking a basic example, here is the message I get for a single file:
$ recode ..html table_of_contents.html
recode: table_of_contents.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
What's wrong?
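The "Invalid input" message suggests that some bytes in the file are not valid in the encoding recode assumed for its input. One way to check whether a page really is UTF-8, as its meta tag claims, is to ask iconv to decode it. This is only a minimal sketch, assuming iconv is installed; the function name is_valid_utf8 is hypothetical, and ISO-8859-1 is just a guess for files that fail the check.

```shell
# Hypothetical helper: succeeds (exit 0) only if the file decodes cleanly as UTF-8.
# iconv exits non-zero when it meets a byte sequence that is invalid in the
# declared source encoding.
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Read-only report for every page in the current directory.
for f in *.html; do
  [ -e "$f" ] || continue   # skip the literal pattern when nothing matches
  if is_valid_utf8 "$f"; then
    echo "$f: valid UTF-8"
  else
    echo "$f: not valid UTF-8 (possibly ISO-8859-1)"
  fi
done
```

Files reported as not valid UTF-8 would need their real source charset named explicitly in any later conversion.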
UPDATE 2: here is what file -i reports for one of my original HTML pages:
$ file -i index.html
index.html: text/x-tex; charset=iso-8859-1
and here is the head of index.html:
<!DOCTYPE html>
<html>
<head><title>Table des matières</title>
<meta http-equiv="Content-Type" content="text/html; charset="utf-8"" />
<meta name="generator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<meta name="originator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<!-- 3,html,xhtml,charset="utf-8" -->
<meta name="src" content="content_final.tex" />
<link rel="stylesheet" type="text/css" href="content_final.css" />
<script type="text/javascript" src="./jquery.js">
</script>
<script type="text/javascript">
$(document).ready(function() {
function capitalizeFirstLetter(string) {
return string.charAt(0).toUpperCase() + string.slice(1).toLowerCase();
}
$('div.caption span.id').each(function() { var result = $(this).text().replace(':','');
result=capitalizeFirstLetter(result);
$(this).text(result);
If I apply the following command:
$ recode -vfd u8..html index.html
Request: UTF-8..:libiconv:..ISO-10646-UCS-2..HTML_4.0
Shrunk to: UTF-8..ISO-10646-UCS-2..HTML_4.0
Recoding index.html... done
and the resulting file begins with:
<!DOCTYPE html>
<html>
<head><title>Table des matires</title>
<meta http-equiv="Content-Type" content="text/html; charset="utf-8"" />
<meta name="generator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<meta name="originator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<!-- 3,html,xhtml,charset="utf-8" -->
<meta name="src" content="content_final.tex" />
<link rel="stylesheet" type="text/css" href="content_final.css" />
<script type="text/javascript" src="./jquery.js">
</script>
<script type="text/javascript">
$(document).ready(function() {
function capitalizeFirstLetter(string) {
return string.charAt(0).toUpperCase() + string.slice(1).toLowerCase();
}
$('div.caption span.id').each(function() { var result = $(this).text().replace(':','');
result=capitalizeFirstLetter(result);
$(this).text(result);
});
});
</script>
As you can see, the "è" has disappeared.
What can I do?
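One possible direction, sketched under the assumption that the pages really are ISO-8859-1 as file -i reports: the recode call above declared the input as UTF-8 (u8), so the single Latin-1 byte for "è" was probably treated as invalid and dropped because of -f. Naming ISO-8859-1 as the source and mapping letters to named entities with sed avoids that. The function name latin1_to_entities is hypothetical, the script is assumed to be saved as UTF-8, and only a few lowercase French letters are covered here.

```shell
# Hypothetical helper: read ISO-8859-1 text on stdin, write ASCII text with
# named HTML entities on stdout. The mapping is deliberately incomplete
# (no uppercase letters, ligatures, or guillemets) and would need extending.
latin1_to_entities() {
  # Convert to UTF-8 first so the sed patterns can be written literally.
  # LC_ALL=C makes sed match each accented letter as a plain byte sequence.
  iconv -f ISO-8859-1 -t UTF-8 |
  LC_ALL=C sed \
    -e 's/à/\&agrave;/g' -e 's/â/\&acirc;/g' \
    -e 's/ç/\&ccedil;/g' \
    -e 's/é/\&eacute;/g' -e 's/è/\&egrave;/g' \
    -e 's/ê/\&ecirc;/g'  -e 's/ë/\&euml;/g' \
    -e 's/î/\&icirc;/g'  -e 's/ï/\&iuml;/g' \
    -e 's/ô/\&ocirc;/g' \
    -e 's/ù/\&ugrave;/g' -e 's/û/\&ucirc;/g' -e 's/ü/\&uuml;/g'
}
```

It could be applied per file by redirecting a page into the function and writing the result to a new file, for example latin1_to_entities < index.html > index.html.new. The markup itself is left untouched because only the listed letters are rewritten. Running this on a file that is already UTF-8 would garble it, so the source encoding should be checked first.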
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
