'How to extract data from html table in shell script?
I am trying to create a BASH script what would extract the data from HTML table. Below is the example of table from where I need to extract data:
<table border=1>
<tr>
<td><b>Component</b></td>
<td><b>Status</b></td>
<td><b>Time / Error</b></td>
</tr>
<tr><td>SAVE_DOCUMENT</td><td>OK</td><td>0.406 s</td></tr>
<tr><td>GET_DOCUMENT</td><td>OK</td><td>0.332 s</td></tr>
<tr><td>DVK_SEND</td><td>OK</td><td>0.001 s</td></tr>
<tr><td>DVK_RECEIVE</td><td>OK</td><td>0.001 s</td></tr>
<tr><td>GET_USER_INFO</td><td>OK</td><td>0.143 s</td></tr>
<tr><td>NOTIFICATIONS</td><td>OK</td><td>0.001 s</td></tr>
<tr><td>ERROR_LOG</td><td>OK</td><td>0.001 s</td></tr>
<tr><td>SUMMARY_STATUS</td><td>OK</td><td>0.888 s</td></tr>
</table>
And I want the BASH script to output it like so:
SAVE_DOCUMENT OK 0.475 s
GET_DOCUMENT OK 0.345 s
DVK_SEND OK 0.002 s
DVK_RECEIVE OK 0.001 s
GET_USER_INFO OK 4.465 s
NOTIFICATIONS OK 0.001 s
ERROR_LOG OK 0.002 s
SUMMARY_STATUS OK 5.294 s
How to do it?
So far I have tried using the sed, but I don't know how to use it quite well. The header of the table(Component, Status, Time/Error) I excluded with grep using grep "<tr><td>, so only lines starting with <tr><td> will be selected for next parsing (sed).
This is what I used: sed 's@<\([^<>][^<>]*\)>\([^<>]*\)</\1>@\2@g'
But then <tr> tags still remain and also it wont separate the strings. In other words the result of this script is:
<tr>SAVE_DOCUMENTOK0.406 s</tr>
The full command of the script I'm working on is:
cat $FILENAME | grep "<tr><td>" | sed 's@<\([^<>][^<>]*\)>\([^<>]*\)</\1>@\2@g'
Solution 1:[1]
You can use bash xpath (XML::XPath perl module) to accomplish that task very easily:
xpath -e '//tr[position()>1]' test_input1.xml 2> /dev/null | sed -e 's/<\/*tr>//g' -e 's/<td>//g' -e 's/<\/td>/ /g'
Solution 2:[2]
You may use html2text command and format the columns via column, e.g.:
$ html2text table.html | column -ts'|'
Component Status Time / Error
SAVE_DOCUMENT OK 0.406 s
GET_DOCUMENT OK 0.332 s
DVK_SEND OK 0.001 s
DVK_RECEIVE OK 0.001 s
GET_USER_INFO OK 0.143 s
NOTIFICATIONS OK 0.001 s
ERROR_LOG OK 0.001 s
SUMMARY_STATUS OK 0.888 s
then parse it further from there (e.g. cut, awk, ex).
In case you'd like to sort it first, you can use ex, see the example here or here.
Solution 3:[3]
There are a lot of ways of doing this but here's one:
grep '^<tr><td>' < $FILENAME \
| sed \
-e 's:<tr>::g' \
-e 's:</tr>::g' \
-e 's:</td>::g' \
-e 's:<td>: :g' \
| cut -c2-
You could use more sed(1) (-e 's:^ ::') instead of the cut -c2- to remove the leading space but cut(1) doesn't get as much love as it deserves. And the backslashes are just there for formatting, you can remove them to get a one liner or leave them in and make sure that they're immediately followed by a newline.
The basic strategy is to slowly pull the HTML apart piece by piece rather than trying to do it all at once with a single incomprehensible pile of regex syntax.
Parsing HTML with a shell pipeline isn't the best idea ever but you can do it if the HTML is known to come in a very specific format. If there will be variation then you'd be better with with a real HTML parser in Perl, Ruby, Python, or even C.
Solution 4:[4]
A solution based on multi-platform web-scraping CLI xidel and XPath:
Tip of the hat to Reino for providing the simpler XPath equivalent to the original XQuery solution.[1]
xidel -s -e '//tr[position() > 1]/join(td)' file
With the sample input, this yields:
SAVE_DOCUMENT OK 0.406 s
GET_DOCUMENT OK 0.332 s
DVK_SEND OK 0.001 s
DVK_RECEIVE OK 0.001 s
GET_USER_INFO OK 0.143 s
NOTIFICATIONS OK 0.001 s
ERROR_LOG OK 0.001 s
SUMMARY_STATUS OK 0.888 s
Explanation:
//tr[position() > 1]matches thetrelements starting with the 2nd one, so as to skip the header row), andjoin(td)joins the values of the matching elements' childtdelements with an implied single space as the separator.-smakesxidelsilent (suppresses output of status information).
While html2text is convenient for display of the extracted data, providing machine-parseable output is non-trivial, unfortunately:
html2text file | awk -F' *\\|' 'NR>2 {gsub(/^\||.\b/, ""); $1=$1; print}'
The Awk command removes the hidden \b-based (backspace-based) sequences that html2text outputs by default, and parses the lines into fields by |, and then outputs them with a space as the separator (a space is Awk's default output field separator; to change it to a tab, for instance, use -v OFS='\t').
Note: Use of -nobs to suppress backspace sequences at the source is not an option, because you then won't be able to distinguish between the hidden-by-default _ instances used for padding and actual _ characters in the data.
Note: Given that html2text seemingly invariably uses | as the column separator, the above will only work robustly if the are no | instances in the data being extracted.
[1] xidel -s --xquery 'for $tr in //tr[position()>1] return join($tr/td, " ")' file
Solution 5:[5]
You can parse the file using Ex editor (part of Vim) by removing HTML tags, e.g.:
$ ex -s +'%s/<[^>]\+>/ /g' +'v/0/d' +'wq! /dev/stdout' table.html
SAVE_DOCUMENT OK 0.406 s
GET_DOCUMENT OK 0.332 s
DVK_SEND OK 0.001 s
DVK_RECEIVE OK 0.001 s
GET_USER_INFO OK 0.143 s
NOTIFICATIONS OK 0.001 s
ERROR_LOG OK 0.001 s
SUMMARY_STATUS OK 0.888 s
Here is shorter version by printing the whole file without HTML tags:
$ ex +'%s/<[^>]\+>/ /g|%p' -scq! table.html
Explanation:
%s/<[^>]\+>/ /g- Substitute all HTML tags into empty space.v/0/d- Deletes all lines without0.wq! /dev/stdout- Quits editor and writes the buffer to the standard output.
Solution 6:[6]
For the sake of completeness, pandoc does a good job when you have extracted the HTML table. For example,
pandoc --from html --to plain table.txt
---------------- -------- --------------
Component Status Time / Error
SAVE_DOCUMENT OK 0.406 s
GET_DOCUMENT OK 0.332 s
DVK_SEND OK 0.001 s
DVK_RECEIVE OK 0.001 s
GET_USER_INFO OK 0.143 s
NOTIFICATIONS OK 0.001 s
ERROR_LOG OK 0.001 s
SUMMARY_STATUS OK 0.888 s
---------------- -------- --------------
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Emiliano Poggi |
| Solution 2 | Community |
| Solution 3 | mu is too short |
| Solution 4 | |
| Solution 5 | kenorb |
| Solution 6 | Robert Smith |
