What is the best way to get a target substring through regex in a bash script?

I am creating a script to automate extracting and processing a large number of text files. Currently, my problem is getting the target id from .html files, for example:

 \ \ <body id="some_id" class="calibre2">

My script's job is to get "some_id" and check that it is valid (an ID is not allowed to start with a number); otherwise it fixes the id in the .html file and in other related files (toc.ncx, content.opf, etc.). The main command I use is sed, but I think my method is cumbersome. The script is below:

#!/bin/bash
for var in ./*
do
        if [[ $var =~ .*.html ]]
        then
                if grep -q -E '<body id="[0-9]+' $var
                then
                        ID="$(sed -n -E 's/\ \ <body id="[0-9]+(.*?)"\ .*/\1/gp' $var)"
                        echo $ID
                        sed -i -E 's/<body\ id="([0-9]+)/<body id="id\1/g' $var
                        sed -i -E "s/$ID/id$ID/g" ./../toc.ncx
                        echo $var
                fi
        fi
done

That is, I don't know the ID of each html file in advance, but I know the rule for IDs. For example:

\ \ <body id="123char" class="calibre2">

"123char" is invalid because an ID is not allowed to start with a number, so I need to fix the ID so that it starts with prefix characters instead, like "idchar". The html then becomes:

\ \ <body id="idchar" class="calibre2">

At the same time I need to update the id in other files (changing "123char" to "idchar"), such as the .ncx file:

<content src="Text/xxx1.html#123char"/>
<!-- the id needs to be changed as follows -->
<content src="Text/xxx1.html#idchar"/>

PS: as shown above, this script is aimed at fixing .epub files that can't pass an epub validator; many e-book converters from mobi to epub have this type of bug (calibre, convertio, etc.)

html linux bash sed

Solution 1:^[1]

This has been repeated here countless times already; it's a really bad idea to parse/edit HTML with regex! An HTML parser like xidel would be better suited. In fact, with its integrated EXPath File module one single call could be all you need:

$ xidel -se '
  for $x in file:list(.,false(),"*.html")
  where matches(doc($x)//body/@id,"^\d")
  return
  file:write(
    $x,
    x:replace-nodes(
      doc($x)//body/@id,
      function($x){attribute {name($x)} {replace($x,"^\d+","id")}}
    ),
    {"method":"html","indent":true()}
  )
'

file:list(.,false(),"*.html") returns all HTML-files in the current dir.
matches(doc($x)//body/@id,"^\d") restricts that to only those HTML-files whose body id attribute value starts with a digit.
x:replace-nodes( [...] ) replaces the leading digits of that value with the string "id".
file:write( [...] ) replaces the original HTML-file.
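
The string rule that the xidel call applies to each id (replace a leading run of digits with "id") can be illustrated in plain bash. This is only a sketch of the rule itself, not of the HTML handling; fix_id is a hypothetical helper name that does not come from xidel or from the original posts.

```shell
#!/bin/bash

# fix_id (hypothetical helper): print the id with any leading digits
# replaced by "id"; ids that do not start with a digit are printed unchanged.
fix_id() {
    local id=$1
    if [[ $id =~ ^[0-9]+(.*)$ ]]; then
        # BASH_REMATCH[1] holds whatever follows the leading digits
        printf 'id%s\n' "${BASH_REMATCH[1]}"
    else
        printf '%s\n' "$id"
    fi
}

# Usage:
#   fix_id 123char   # prints: idchar
#   fix_id some_id   # prints: some_id
```

The capture group in the [[ ... =~ ... ]] test is what bash offers natively for pulling a substring out with a regex; the matched groups land in the BASH_REMATCH array.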

Solution 2:^[2]

Parsing HTML with regex is not easy, nor is regex the right tool for the job.
You can use pup, which is an HTML parser.

input

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title></title>
</head>
<body>
    <h1>Here is h1 tag</h1>
</body>
</html>

test

pup 'h1 text{}' < index.html

output

Here is h1 tag

If for any reason you prefer to use regex, perl is much more suitable than bash. Given this as an input:

sample 1

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title></title>
</head>
<body id="some_id" class="calibre2">
    <h1>this is h1 tag</h1>
</body>
</html>

with this perl one-liner

perl -lne '/<body\s+id="\K[^"]+/ && print $&' index.html

the output would be:

some_id

sample 2

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title></title>
</head>
<body id="some_id" class="calibre2">
    <h1 id="number-1">this is h1 tag</h1>
    <h1 id="number-2">this is h1 tag</h1>
    <h1 id="number-3">this is h1 tag</h1>
</body>
</html>

Perl one-liner

perl -lne '/<h1\s+id="\K[^"]+/ && print $&' index.html

output

number-1
number-2
number-3

And if you prefer to use grep, you can use the -P option to apply PCRE (Perl Compatible Regular Expressions):

grep -oP '<h1\s+id="\K[^"]+'  index.html

# output
number-1
number-2
number-3
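
Where grep -P is not available (for example, the grep shipped with macOS does not support it), the same extraction can be sketched with bash's own regex operator. The function name extract_id is made up for illustration, and the sketch assumes the tag and its id attribute sit on one line, as in the samples above.

```shell
#!/bin/bash

# extract_id (hypothetical helper): print the id of the first <tag id="...">
# found in a file; return 1 if no such tag is found.
extract_id() {
    local tag=$1 file=$2 line
    # POSIX ERE: \s and \K are not available here, so use [[:space:]]
    # and a capture group instead.
    local re="<${tag}[[:space:]]+id=\"([^\"]+)\""

    while IFS= read -r line || [[ -n $line ]]; do
        if [[ $line =~ $re ]]; then
            printf '%s\n' "${BASH_REMATCH[1]}"
            return 0
        fi
    done < "$file"
    return 1
}

# Usage:
#   extract_id body index.html
```

Keeping the pattern in a variable and expanding it unquoted on the right-hand side of =~ avoids the quoting pitfalls of writing the regex inline.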

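Using a bash function to get the value of a tag's id:

#!/bin/bash

# Print the id of the given tag found in a file (needs a grep with -P support).
# Status messages go to stderr so that only the id reaches stdout.
function match_html_id(){
    local tag=$1
    local filename=$2
    local regex="<${tag}\s+id=\"\K[^\"]+"
    local result=''

    {
        if grep -qP "$regex" "$filename" 2>/dev/null; then
            result=$(grep -oP "$regex" "$filename")
            echo 'match found'
        else
            echo 'match not found'
        fi
    } >&2

    echo "$result"
}

# Assign first, then mark read-only, so the function's exit status is not masked
r=$(match_html_id body index.html)
declare -r r
echo r: "'$r'"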

output for sample 2 or 1 on body tag

match found
r: 'some_id'
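
For the concrete case in the question (fix a body id that starts with a digit, then update the matching reference in toc.ncx), the grep pattern above can be combined with sed roughly as follows. This is a sketch under several assumptions: GNU grep and GNU sed are available, each HTML file has a single body tag on one line, ids contain only letters, digits, "_" and "-", and references in the .ncx file look like the example in the question. The names fix_epub_ids, html_dir and ncx_file are invented for illustration. As both answers point out, a real HTML parser is the more robust choice.

```shell
#!/bin/bash

# fix_epub_ids (hypothetical helper): for every *.html file in html_dir whose
# <body id="..."> starts with a digit, replace the leading digits with "id"
# in the HTML file and in the references inside ncx_file.
fix_epub_ids() {
    local html_dir=$1 ncx_file=$2
    local file old_id new_id

    for file in "$html_dir"/*.html; do
        [[ -f $file ]] || continue      # the glob matched nothing

        # Value of the first <body id="..."> in the file, if there is one
        old_id=$(grep -oP -m1 '<body\s+id="\K[^"]+' "$file") || continue

        # Only ids that begin with a digit need fixing
        [[ $old_id == [0-9]* ]] || continue
        new_id=$(printf '%s\n' "$old_id" | sed -E 's/^[0-9]+/id/')

        # Rewrite the attribute in the HTML file
        sed -i -E "s/(<body[[:space:]]+id=\")${old_id}\"/\1${new_id}\"/" "$file"

        # Rewrite references such as Text/xxx1.html#123char in the NCX file
        sed -i -E "s/(${file##*/}#)${old_id}\"/\1${new_id}\"/g" "$ncx_file"

        printf '%s: %s -> %s\n' "$file" "$old_id" "$new_id"
    done
}
```

Called as fix_epub_ids . ../toc.ncx from the directory holding the HTML files, this mirrors the layout used in the question's script. Anchoring the NCX replacement on the file name plus "#" keeps it from touching unrelated text that happens to contain the same characters; content.opf could be handled with one more sed call of the same shape.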

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	Reino
Solution 2
