Only get alphanumeric characters in capture group using sed

Input:

x.y={aaa b .c}

Note that the content within {} is only an example; in reality it could be any value.

Problem: I would like to keep only the alphanumeric characters within the {}.

So it would become:

x.y={aaabc}

Trial 0

$ echo 'x.y={aaa b .c}' | sed 's/[^[:alnum:]]\+//g'
xyaaabc

This is great, but I'd like to only modify the part within {}. So I thought this may need capture groups, hence I went ahead and tried these:

Trial 1

$ echo 'x.y={aaa b .c}' | sed -E 's/x.y=\{(.*)\}/x.y={\1}/'
x.y={aaa b .c}

Here I have captured the content I want to modify (aaa b .c) correctly, but I need a way to somehow do s/[^[:alnum:]]\+//g only on \1.

Instead, I tried capturing all alphanumeric characters only (to \1) like this:

Trial 2

$ echo 'x.y={aaa b .c}' | sed -E 's/x.y=\{([[:alnum:]]+)\}/x.y={\1}/'
x.y={aaa b .c}

Of course, it doesn't work, because the pattern expects only alnums followed immediately by a literal }. I didn't tell it to ignore the non-alnums. I.e., this part:

s/x.y=\{([[:alnum:]]+)\}/x.y={\1}/
      ^^^^^^^^^^^^^^^^^^   

It literally matches: an open brace, some alnum's, and a closing brace -- which is not what I want. I'd like it to match everything, but only capture the alnum's.
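
To make the failure in Trial 2 concrete, here is a small sketch that tests the same pattern with grep -E instead of sed. The helper name check is made up for illustration. When the pattern does not match, sed's s command has nothing to replace, so the line passes through unchanged.

```shell
# Same ERE as Trial 2; check is a hypothetical helper used only for this demo.
pattern='x.y=\{([[:alnum:]]+)\}'
check() {
  if printf '%s\n' "$1" | grep -Eq "$pattern"; then
    echo 'match'
  else
    echo 'no match'
  fi
}
r1=$(check 'x.y={aaa b .c}')   # spaces and a dot sit between the braces
r2=$(check 'x.y={aaabc}')      # only alnums between the braces
printf '%s\n' "$r1" "$r2"
```

The first line reports no match and the second reports a match, which is why Trial 2 only "works" on input that is already clean.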


Example of input/output:

x.y={aaa b .c} blah
blah
x.y={1 2 3 def} blah
blah

to

x.y={aaabc} blah
blah
x.y={123def} blah
blah

I searched the web before finally giving up and posting the question, but I didn't find anything helpful, as I didn't see anyone with a problem similar to mine. I would appreciate some help with this, as I'd love to have a better understanding of variables in regex/sed. Thanks!



Solution 1:[1]

With sed (tested on GNU sed, syntax may vary for other implementations):

$ sed -E ':a s/(\{[[:alnum:]]*)[^[:alnum:]]+([^}]*})/\1\2/; ta' ip.txt
x.y={aaabc} blah
blah
x.y={123def} blah
blah
  • :a marks that location as label a (used to jump using ta as long as the substitution succeeds)
  • (\{[[:alnum:]]*) matches { followed by zero or more alnum characters
  • [^[:alnum:]]+ matches one or more non-alnum characters
  • ([^}]*}) matches till the next } character
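
To see what the :a ... ta loop does on each pass, the sketch below runs the same substitution once per iteration from a shell loop and records every intermediate result. The shell loop and the variable names (s, next, trace) are only for illustration; the sed one-liner above does all of this internally.

```shell
# One substitution per pass, so each intermediate state is visible.
s='x.y={aaa b .c}'
trace=$s
while :; do
  next=$(printf '%s\n' "$s" | sed -E 's/(\{[[:alnum:]]*)[^[:alnum:]]+([^}]*})/\1\2/')
  # Stop when a pass changes nothing (this is where ta would stop jumping).
  if [ "$next" = "$s" ]; then
    break
  fi
  s=$next
  trace="$trace -> $s"
done
printf '%s\n' "$trace"
```

For this input the trace is x.y={aaa b .c} -> x.y={aaab .c} -> x.y={aaabc}: each pass removes the first run of non-alnum characters that follows the leading alnums.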


If perl is okay:

$ perl -pe 's/\{\K[^}]+(?=\})/$&=~s|[^a-z\d]+||gir/e' ip.txt
x.y={aaabc} blah
blah
x.y={123def} blah
blah
  • \{\K[^}]+(?=\}) matches the text between { and } (assuming } cannot occur in between)
    • \{\K and (?=\}) are used to keep the braces out of the matched portion
  • e flag allows you to use Perl code in replacement portion, in this case another substitute command
  • $&=~s|[^a-z\d]+||gir here, $& refers to entire matched portion, gi flags are used for global/case-insensitive and r flag is used to return the value of this substitution instead of modifying $&
    • [^a-z\d]+ matches non-alphanumeric characters (assuming ASCII, you can also use [^[:alnum:]]+)
    • use \W+ if you want to preserve underscores as well

For both solutions, you can add x\.y= prefix if needed to narrow the scope of matching.
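
A possible way to add the prefix to the sed version is sketched below. It also makes one change that is not in the answer above: } is excluded from the removed class. As far as I can tell, [^[:alnum:]]+ can itself match }, so on a line with a second {...} group the loop may continue past the first closing brace. The function name fix and the sample lines are made up, and the script is split into separate -e pieces so the label does not depend on GNU's one-line syntax.

```shell
# fix is a hypothetical wrapper; it only cleans the {...} that follows x.y=
fix() {
  sed -E -e ':a' \
         -e 's/(x\.y=\{[[:alnum:]]*)[^[:alnum:]}]+([^}]*})/\1\2/' \
         -e 'ta'
}
a=$(printf '%s\n' 'x.y={aaa b .c} blah {keep me}' | fix)
b=$(printf '%s\n' 'z={a b}' | fix)
printf '%s\n' "$a" "$b"
```

Here the first line becomes x.y={aaabc} blah {keep me}, and the second line is left alone because it has no x.y= prefix.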

Solution 2:[2]

Here is another solution, using GNU awk's FPAT:

s='x.y={aaa b .c}'
awk -v OFS= -v FPAT='{[^}]+}|[^{}]+' '
{
   for (i=1; i<=NF; ++i)
      if ($i ~ /^{/) $i = "{" gensub(/[^[:alnum:]]+/, "", "g", $i) "}"
} 1' <<< "$s"

x.y={aaabc}
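
FPAT and gensub are GNU awk extensions. If only a POSIX-style awk is available, the same idea can be sketched with match, substr and gsub. This is a swapped-in technique, not the approach above, and it assumes ASCII input, using [^A-Za-z0-9] in case the awk at hand lacks [:alnum:] support.

```shell
out=$(printf '%s\n' 'x.y={aaa b .c} blah' | awk '{
  result = ""; rest = $0
  # Walk through each {...} group, cleaning only the text between the braces.
  while (match(rest, /\{[^}]*\}/)) {
    inner = substr(rest, RSTART + 1, RLENGTH - 2)
    gsub(/[^A-Za-z0-9]+/, "", inner)
    result = result substr(rest, 1, RSTART - 1) "{" inner "}"
    rest = substr(rest, RSTART + RLENGTH)
  }
  print result rest
}')
printf '%s\n' "$out"
```

For this sample line the result is x.y={aaabc} blah; text outside the braces is copied through untouched.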

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 anubhava