Sed: ignoring potential characters at the end of the match group
I have the following text:
<h2 id="title"> ABC A BBBBB0 </h2>
<h2 id="title">ABC A BBBBB1 </h2>
<h2 id="title">ABC A BBBBB2</h2>
<h2 id="title"> ABC A BBBBB3 </h2>
and want to extract the following from it:
ABC A BBBBB0
ABC A BBBBB1
ABC A BBBBB2
ABC A BBBBB3
I am currently running the following command:
sed -n "s/.*\"title\">[[:space:]]*\(.*\)<.*/\1/p" ./file.txt
but I get lines with spaces at the end:
ABC A BBBBB0[space][space][space][space]
ABC A BBBBB1[space]
ABC A BBBBB2
ABC A BBBBB3[space]
I understand how to skip possible spaces at the beginning of the match, but I do not understand how to ignore possible spaces at the end. Can somebody give me a clear example?
Solution 1:[1]
The last character in the group must not be a space; after the group, any number of spaces may follow.
sed -n 's/.*"title">[[:space:]]*\(.*[^[:space:]]\)[[:space:]]*<.*/\1/p' ./file.txt
"I can not understand the concept"
.* is greedy: it first matches everything up to the end of the line. The regex engine then has to match <, so it gives characters back one at a time, moving from right to left, until a < can match, and then it continues matching the rest of the pattern.
You have to put something in the pattern so that, when the engine backs up from the end of the string, it stops at the place you want. "Not a space" ([^[:space:]]) works here. This process of going back is called "backtracking".
I can recommend https://www.regular-expressions.info/engine.html
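The difference between the two patterns is easier to see when the capture is wrapped in brackets, since trailing spaces are otherwise invisible. The following is a minimal sketch, assuming a POSIX shell and sed; the function names are made up for illustration and the input line is the first sample from the question with four trailing spaces.

```shell
#!/bin/sh
# Greedy capture: \(.*\) runs to the end of the line and backtracks only
# as far as the last "<", so spaces before "<" stay inside the group.
extract_greedy() {
    sed -n 's/.*"title">[[:space:]]*\(.*\)<.*/[\1]/p'
}

# Anchored capture: the group must end in a non-space character, so the
# engine keeps backtracking until it reaches the last non-space before "<".
extract_trimmed() {
    sed -n 's/.*"title">[[:space:]]*\(.*[^[:space:]]\)[[:space:]]*<.*/[\1]/p'
}

line='<h2 id="title"> ABC A BBBBB0    </h2>'
printf '%s\n' "$line" | extract_greedy    # prints [ABC A BBBBB0    ]
printf '%s\n' "$line" | extract_trimmed   # prints [ABC A BBBBB0]
```

The brackets are only there for inspection; drop them from the replacement to get the bare text.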
Solution 2:[2]
Using sed
$ sed 's/[^>]*>[[:space:]]*\?\([[:alnum:][:space:]]*\)[[:space:]]\?<.*/\1/' file
ABC A BBBBB0
ABC A BBBBB1
ABC A BBBBB2
ABC A BBBBB3
$ sed -E 's/[^>]*> *?([A-Z0-9 ]*) ?<.*/\1/' file
ABC A BBBBB0
ABC A BBBBB1
ABC A BBBBB2
ABC A BBBBB3
When using sed's grouping and backreferences, you can exclude any character, including spaces, by leaving it outside the grouping parentheses.
[^>]*> - Skip everything up to and including the next >. As this is outside the parentheses, it is excluded.
 *? - Any spaces after the > are excluded in the same way. The ? makes the preceding item optional; combined with * this amounts to zero or more spaces.
([A-Z0-9 ]*) - Everything within the parentheses is captured: uppercase letters, digits and spaces.
 ?<.* - Match a single optional space before <, then the rest of the line. Note that the group's bracket expression also accepts spaces and is greedy, so trailing spaces can still end up inside the capture.
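Trailing whitespace is invisible in terminal output, so a quick way to check what the group really captured is to wrap \1 in brackets. The sketch below compares the ERE from this answer (written with ` *` rather than ` *?`, which should be equivalent here) against a variant that is not part of the original answer: its group must end in a letter or digit. The function names are hypothetical, and `sed -E` is assumed to be available (GNU and BSD sed both support it).

```shell
#!/bin/sh
# Pattern as in the answer: the group accepts spaces and is greedy,
# so it swallows the spaces before "<" and " ?" matches nothing.
extract_loose() {
    sed -E 's/[^>]*> *([A-Z0-9 ]*) ?<.*/[\1]/'
}

# Variant (an assumption, not from the answer): the group has to end on
# a letter or digit, which pushes trailing spaces outside the capture.
extract_strict() {
    sed -E 's/[^>]*> *([A-Z0-9 ]*[A-Z0-9]) *<.*/[\1]/'
}

line='<h2 id="title"> ABC A BBBBB0    </h2>'
printf '%s\n' "$line" | extract_loose    # prints [ABC A BBBBB0    ]
printf '%s\n' "$line" | extract_strict   # prints [ABC A BBBBB0]
```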
Solution 3:[3]
I'd just use awk:
$ awk -F'> *| *<' '{print $3}' file
ABC A BBBBB0
ABC A BBBBB1
ABC A BBBBB2
ABC A BBBBB3
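To see why the text lands in $3, it can help to print every field awk produces with that separator. This is a small sketch for inspection only; the show_fields name is made up, and the input is the first sample line with four trailing spaces.

```shell
#!/bin/sh
# The field separator is either ">" followed by any spaces, or any
# spaces followed by "<", so surrounding spaces belong to the separator.
show_fields() {
    awk -F'> *| *<' '{ for (i = 1; i <= NF; i++) printf "$%d=[%s]\n", i, $i }'
}

printf '%s\n' '<h2 id="title"> ABC A BBBBB0    </h2>' | show_fields
# prints:
# $1=[]
# $2=[h2 id="title"]
# $3=[ABC A BBBBB0]
# $4=[/h2]
# $5=[]
```

The leading < starts the line, so $1 is empty, the opening tag's contents become $2, and the title is $3 with the spaces on both sides consumed by the separators.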
Solution 4:[4]
This might work for you (GNU sed):
sed -nE 's/<h2 id="title">\s*(.*\S)\s*<\/h2>/\1/p' file
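The -n option suppresses automatic printing and the p flag prints only the lines where the substitution succeeded, so only the required strings are returned.
N.B. \s matches a whitespace character and \S matches any non-whitespace character (both are GNU extensions). Thus (.*\S) captures text that ends in a non-whitespace character, and any trailing spaces are left to the \s* that follows.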
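Where \s and \S are not available, the same idea can be sketched with POSIX character classes. The following assumes a sed that accepts -E; the function name and the extra non-matching input line are made up for illustration.

```shell
#!/bin/sh
# [[:space:]] stands in for \s and [^[:space:]] for \S.
# -n plus the p flag prints only lines where the substitution matched.
extract_h2_title() {
    sed -nE 's/<h2 id="title">[[:space:]]*(.*[^[:space:]])[[:space:]]*<\/h2>/\1/p'
}

printf '%s\n' \
    '<p>not a title</p>' \
    '<h2 id="title"> ABC A BBBBB3 </h2>' | extract_h2_title
# prints ABC A BBBBB3
```

The first input line does not match the pattern, so it is dropped rather than echoed back unchanged.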
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
| Solution 3 | Ed Morton |
| Solution 4 | potong |
