Correct Syntax For Escaping Double Quotes in Regex Pattern Match?

I'm trying to extract the 2nd double-quoted substring from each of the variables string and string2.

I think the problem is the way I'm trying to escape the double quotes.

What is the correct syntax for this:

#!/bin/bash

# Example strings.

string='"name": "Bash scripting cheatsheet",'
string2='"url": "https://devhints.io/bash"'

# I'm trying to get the 2nd substring between " "

# desired matches:
# string_name_match='Bash scripting cheatsheet'
# string2_url_match='https://devhints.io/bash'

# Attempts: using a pattern var with double quotes escaped.

pattern='\".*\"'  # Is the " char escaped correctly?
echo "$string" | awk "/$pattern/{print $2}" # Is the $pattern var used correctly?
echo "$string2" | awk "/$pattern/{print $2}" 

# 2nd pattern match using the name/url to parse:

name_pattern='^\"name:\"[:space:].*[^\",]'
url_pattern='^\"url\"[:space:]\"^url:.*[^"]'
echo "$string" | awk "/$name_pattern/{print $0}"
echo "$string2" | awk "/$url_pattern/{print $0}"


Solution 1:[1]

To address the immediate issue of passing a regex to awk: because of the various problems with escape sequences, it's usually easier to pass the regex in an awk variable (-v) than to hard-code it in the awk script, and then test the entire line ($0) against that variable ($0 ~ pattern_variable), e.g.:

string='"name": "Bash scripting cheatsheet",'
string2='"url": "https://devhints.io/bash"'
pattern='"([^"]*)".*"([^"]*)"'

$ awk -v ptn="${pattern}" '$0 ~ ptn {print $2}' <<< "${string}"
"Bash

$ awk -v ptn="${pattern}" '$0 ~ ptn {print $2}' <<< "${string2}"
"https://devhints.io/bash"

OK, so awk now works with the regex, but we're not getting quite what we wanted because awk splits fields on white space by default. We can tell awk to use the double quote as the field delimiter instead (-F'"'). The line then starts with an empty field before the first quote, so the value between the 2nd pair of double quotes is field 4 ($4):

$ awk -v ptn="${pattern}" -F'"' '$0 ~ ptn {print $4}' <<< "${string}"
Bash scripting cheatsheet

$ awk -v ptn="${pattern}" -F'"' '$0 ~ ptn {print $4}' <<< "${string2}"
https://devhints.io/bash
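
To see why the value lands in field 4, it can help to list every field awk produces when the double quote is the delimiter. A minimal sketch; show_fields is a hypothetical helper name, not part of the original answer:

```shell
#!/bin/bash
# Hypothetical helper (illustration only): print each field awk sees
# when the double quote is used as the field separator.
show_fields() {
    awk -F'"' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }' <<< "$1"
}

fields_out=$(show_fields '"name": "Bash scripting cheatsheet",')
printf '%s\n' "$fields_out"
```

The quoted values end up in the even-numbered fields, so the 2nd quoted value is $4.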

Of course, this requires spawning a subprocess each time we want to parse a string.

There are a few (better) ways to parse a string in bash without the overhead of spawning subprocess calls ...


One idea using some basic bash regex matching:

string='"name": "Bash scripting cheatsheet",'
string2='"url": "https://devhints.io/bash"'
pattern='"([^"]*)".*"([^"]*)"'

If bash finds a match it will populate the BASH_REMATCH[] array with info about the match(es), with each capture group (the part of the pattern inside a set of parens) making up a separate entry in the array.

Consider:

$ [[ "${string}" =~ ${pattern} ]] && string_name_match="${BASH_REMATCH[2]}"
$ typeset -p BASH_REMATCH string_name_match
declare -ar BASH_REMATCH=([0]="\"name\": \"Bash scripting cheatsheet\"" [1]="name" [2]="Bash scripting cheatsheet")
declare -- string_name_match="Bash scripting cheatsheet"

$ echo "${string_name_match}"
Bash scripting cheatsheet



$ [[ "${string2}" =~ ${pattern} ]] && string2_url_match="${BASH_REMATCH[2]}"
$ typeset -p BASH_REMATCH string2_url_match
declare -ar BASH_REMATCH=([0]="\"url\": \"https://devhints.io/bash\"" [1]="url" [2]="https://devhints.io/bash")
declare -- string2_url_match="https://devhints.io/bash"

$ echo "${string2_url_match}"
https://devhints.io/bash
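
The two matches above can be wrapped in a small function so the pattern lives in one place. This is a sketch only: the name second_quoted is hypothetical, and it assumes the input holds exactly two quoted substrings, as in the sample strings.

```shell
#!/bin/bash
# Hypothetical helper (illustration only): print the text inside the
# 2nd pair of double quotes, or return 1 if the input does not match.
second_quoted() {
    # Same pattern as above; kept in a variable and used unquoted
    # on the right-hand side of =~ so bash treats it as a regex.
    local pattern='"([^"]*)".*"([^"]*)"'
    if [[ "$1" =~ $pattern ]]; then
        printf '%s\n' "${BASH_REMATCH[2]}"
    else
        return 1
    fi
}

string_name_match=$(second_quoted '"name": "Bash scripting cheatsheet",')
string2_url_match=$(second_quoted '"url": "https://devhints.io/bash"')
echo "${string_name_match}"
echo "${string2_url_match}"
```

The non-zero return status lets a caller tell "no match" apart from an empty quoted value.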

Solution 2:[2]

With your shown samples, please try the following grep code, written and tested with GNU grep.

echo "$string" | grep -oP '.*?"[^"]*".*?"\K[^"]*'
Bash scripting cheatsheet

echo "$string2" | grep -oP '.*?"[^"]*".*?"\K[^"]*'
https://devhints.io/bash

Explanation: This uses GNU grep. echo prints the value of each string and pipes it to grep as standard input. grep's -P option enables Perl-compatible regexes and -o prints only the matched part; the regex .*?"[^"]*".*?"\K[^"]* (explained below) produces the required output.

Explanation of the regex (.*?"[^"]*".*?"\K[^"]*):

.*?"    ## Lazy match (available with grep -P) up to and including the 1st occurrence of ".
[^"]*"  ## Then match everything up to and including the next (2nd) occurrence of ".
.*?"    ## Lazy match up to and including the next occurrence of ", which is the 3rd ".
\K      ## \K (a PCRE feature) discards everything matched so far, so it is not printed.
[^"]*   ## Match everything before the 4th occurrence of ", which is the required output.
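
For comparison, the same left-to-right walk (skip three quotes, then keep everything up to the fourth) can be mirrored with bash parameter expansion, which needs neither GNU grep nor a subprocess. This swaps the regex for prefix/suffix stripping and is only a sketch on the sample string; rest and value are illustrative variable names.

```shell
#!/bin/bash
string='"name": "Bash scripting cheatsheet",'

rest=${string#*\"}    # drop everything through the 1st "  (compare: .*?")
rest=${rest#*\"}      # drop everything through the 2nd "  (compare: [^"]*")
rest=${rest#*\"}      # drop everything through the 3rd "  (compare: .*?" then \K)
value=${rest%%\"*}    # keep only what precedes the 4th "  (compare: [^"]*)

echo "$value"
```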

Solution 3:[3]

You can use a Bash regex:

$ [[ $string =~ ^([^\"]*\"){4} ]] && echo "${BASH_REMATCH[1]%\"}"
Bash scripting cheatsheet

$ [[ $string2 =~ ^([^\"]*\"){4} ]] && echo "${BASH_REMATCH[1]%\"}"
https://devhints.io/bash

Or the same method with sed:

sed -E 's/^([^"]*\"){4}/\1/; s/".*//' <<<"$string"
Bash scripting cheatsheet

sed -E 's/^([^"]*\"){4}/\1/; s/".*//' <<<"$string2"
https://devhints.io/bash

(Escaping the " is not required with sed, though.)
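
The Bash regex above relies on a capture group repeated with {4} keeping only the text of its last repetition. A small sketch that makes this visible, assuming the regex is stored in a variable (which also avoids the backslash escapes); re, whole, last and value are illustrative names:

```shell
#!/bin/bash
string='"name": "Bash scripting cheatsheet",'

# Four repetitions of "any non-quote characters followed by a quote".
re='^([^"]*"){4}'

if [[ $string =~ $re ]]; then
    whole=${BASH_REMATCH[0]}   # everything up to and including the 4th "
    last=${BASH_REMATCH[1]}    # text matched by the 4th repetition only
    value=${last%\"}           # strip the trailing "
fi

echo "$value"
```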

Solution 4:[4]

Here is another simple solution:

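This uses gawk (GNU awk), which many Linux distributions install as awk. Its FPAT variable is a regexp that describes the contents of each field, rather than the separator between fields.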

echo '"url": "https://devhints.io/bash"' |awk -vFPAT='[^\"]*' '{print $4}'
https://devhints.io/bash

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 RavinderSingh13
Solution 3
Solution 4 Dudi Boy