Match timestamps in WebVTT files with sed
I have the following PCRE2 regex that works to match and remove timestamp lines in a .webVTT subtitle file (the default for YouTube):
^[0-9].:[0-9].:[0-9].+$
This changes this:
00:00:00.126 --> 00:00:10.058
How are you today?
00:00:10.309 --> 00:00:19.272
Not bad, you?
00:00:19.559 --> 00:00:29.365
Been better.
To this:
How are you today?
Not bad, you?
Been better.
How would I convert this PCRE2 regex to an idiomatic (read: sane-looking) equivalent for sed's flavour of regex?
Solution 1:[1]
Using your regex with sed
$ sed -En '/^[0-9].:[0-9].:[0-9].+$/!p' file
How are you today?
Not bad, you?
Been better.
Or print only the lines that do not end with a digit:
$ sed -n '/[0-9]$/!p' file
How are you today?
Not bad, you?
Been better.
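One caveat with the trailing-digit filter: it also drops any subtitle line that happens to end in a digit. Below is a minimal sketch of a stricter filter, assuming every timing line has the HH:MM:SS.mmm shape followed by the --> separator, as in the sample above. The temporary file and the line "See you at 5" are made up for illustration.

```shell
# Build a small sample file; "See you at 5" is a made-up line ending in a digit.
tmp=$(mktemp)
printf '%s\n' \
  '00:00:00.126 --> 00:00:10.058' \
  'How are you today?' \
  '00:00:10.309 --> 00:00:19.272' \
  'See you at 5' > "$tmp"

# Trailing-digit filter: removes the timing lines, but also "See you at 5".
loose=$(sed -n '/[0-9]$/!p' "$tmp")

# Stricter filter (assumed timing shape): delete only lines that start with
# HH:MM:SS.mmm followed by the --> separator.
strict=$(sed -E '/^[0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]{3} --> /d' "$tmp")

printf '%s\n' "$strict"
rm -f "$tmp"
```

If the file can contain timings without an hours field, the stricter pattern would need adjusting; it is only meant to show the idea of anchoring on the separator rather than on the last character.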
Solution 2:[2]
Your pattern is not specific to PCRE2. In sed without -E, the only change needed is to escape the + as \+
so that it acts as a quantifier for one or more repetitions.
At the positions where you use a dot to match any character, the example data always has a digit.
You could make the pattern a bit more specific and omit the quantifier altogether. Just prevent the line from printing if the pattern matches.
sed -n '/^[0-9][0-9]:[0-9][0-9]:[0-9]/!p' file
-n prevents the default printing in sed
!p prints the line if the pattern does not match
Output
How are you today?
Not bad, you?
Been better.
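To make the escaping point concrete, here is a small sketch comparing the basic and extended forms of the pattern, plus a variant that uses the d command instead of -n with !p. It assumes GNU sed, where \+ is accepted in basic regular expressions as an extension. The sample text is taken from the question, and the temporary file is only for illustration.

```shell
# Sample input taken from the question.
tmp=$(mktemp)
printf '%s\n' \
  '00:00:00.126 --> 00:00:10.058' \
  'How are you today?' \
  '00:00:10.309 --> 00:00:19.272' \
  'Not bad, you?' > "$tmp"

# Basic regex (the default): + must be written \+ to act as a quantifier
# (a GNU sed extension).
bre=$(sed -n '/^[0-9].:[0-9].:[0-9].\+$/!p' "$tmp")

# Extended regex (-E): + is a quantifier without escaping.
ere=$(sed -En '/^[0-9].:[0-9].:[0-9].+$/!p' "$tmp")

# Same effect without -n: delete matching lines and let sed print the rest.
del=$(sed '/^[0-9][0-9]:[0-9][0-9]:[0-9]/d' "$tmp")

printf '%s\n' "$del"
rm -f "$tmp"
```

All three commands should leave only the two dialogue lines; which form to prefer is mostly a matter of taste.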
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | HatLess |
Solution 2 | The fourth bird |