'multi replace in postgresql query using perl

I'm cleaning some text directly in my query, and rather than using nested replace functions, I found this bit of code that uses perl to perform multiple replacements at once: multi-replace with perl

CREATE FUNCTION pg_temp.multi_replace(string text, orig text[], repl text[]) 
RETURNS text 
AS $BODY$ 
  my ($string, $orig, $repl) = @_;
  my %subs;

  if (@$orig != @$repl) {
     elog(ERROR, "array sizes mismatch");
  } 
  if (ref @$orig[0] eq 'ARRAY' || ref @$repl[0] eq 'ARRAY') {
     elog(ERROR, "array dimensions mismatch");
  } 

  @subs{@$orig} = @$repl;
  
  my $re = join "|",
     sort { (length($b) <=> length($a)) } keys %subs;
  $re = qr/($re)/;

  $string =~ s/$re/$subs{$1}/g;

  return $string;
$BODY$ language plperl strict immutable;

Example query:

select

name as original_name, 
multi_replace(name, '{-,&,LLC$}', '{_,and,business}') as cleaned_name

from some_table

The function finds the pattern LLC at the end of the name string but removes it instead of replacing it with "business."

How can I make this work as intended?



Solution 1:[1]

When the strings in @$orig are to be matched literally, I'd actually use this:

my ($string, $orig, $repl) = @_;

# Argument checks here.

my %subs; @subs{ @$orig } = @$repl;

my $pat =
   join "|",
      map quotemeta,
         sort { length($b) <=> length($a) }
            @$orig;

return $string =~ s/$re/$subs{$&}/gr;

In particular, map quotemeta, was missing.

(By the way, the sort line isn't needed if you ensure that xy comes before x in @$orig when you want to replace both x(?!y) and xy.)


But you want the strings in @$orig to be treated as regex patterns. For that, you can use the following:

# IMPORTANT! Only provide strings from trusted sources in
# `@$orig` as it allows execution of arbitrary Perl code.

my ($string, $orig, $repl) = @_;

# Argument checks here.

my $re =
   join "|",
      map "(?:$orig->[$_])(?{ $_ })",
         0..$#$orig;

{
   use re qw( eval );
   $re = qr/$re/;
}

return $string =~ s/$re/$repl->[$^R]/gr;

However, in your environment, I have doubts about the availability of use re qw( eval ); and (?{ }), so the above may be an unviable solution for you.

my ($string, $orig, $repl) = @_;

# Argument checks here.

my $re =
   join "|",
      map "(?<_$_>$orig->[$_])",
         0..$#$orig;

$re = qr/$re/;

return
   $string =~ s{$re}{
      my ( $n ) =
         map substr( $_, 1 ),
            grep { $-{$_} && defined( $-{$_}[0] ) }
               grep { /^_\d+\z/aa }
                  keys( %- );

      $repl->[$n]
   }egr;

Solution 2:[2]

While the regexp tests for LLC$ with the special meaning of the $, what gets captured into $1 is just the string LLC and so it doesn't find the look-up value to replace.

If the only thing you care about is $, then you could fix it by changing the map-building lines to:

 @subs{map {my $t=$_; $t=~s/\$$//; $t} @$orig} = @$repl;

 my $re = join "|",
    sort { (length($b) <=> length($a)) } @$orig;

But it will be very hard to make it work more generally for every possible feature of regex.

Solution 3:[3]

The purpose of this plperl function in the linked blog post is to find/replace strings, not regular expressions. LLC being found with LLC$ as a search term does not happen in the original code, as the search terms go through quotemeta before being included into $re (as also sugggested in ikegami's answer)

The effect of removing the quotemeta transformation is that LLC at the end of a string is matched, but since as a key it's not found in $subs (because the key there isLLC$), then it's getting replaced by an empty string.

So how to make this work with regular expressions in the orig parameter?

The solution proposed by @ikegami does not seem usable from plperl, as it complains with this error: Unable to load re.pm into plperl.

I thought of an alternative implementation without the (?{ code }) feature: each match from the main alternation regexp can be rechecked against each regexp in orig, in a code block run with /ge. On the first match, the corresponding string in repl is selected as the replacement. Code:

CREATE or replace FUNCTION pg_temp.multi_replace(string text, orig text[], repl text[]) 
RETURNS text AS
$BODY$
  my ($string, $orig, $repl) = @_;
  my %subs;

  if (@$orig != @$repl) {
     elog(ERROR, "array sizes mismatch");
  } 
  if (ref @$orig[0] eq 'ARRAY' || ref @$repl[0] eq 'ARRAY') {
     elog(ERROR, "array dimensions mismatch");
  } 

  @subs{@$orig} = @$repl;
  
  my $re = join "|", keys %subs;
  $re = qr/($re)/;

  # on each match, recheck the match individually against each regexp
  # to find the corresponding replacement string
  $string =~ s/$re/{ my $r; foreach (@$orig) { if ($1 =~ $_) {$r=$subs{$_}; last;} } $r;}/ge;

  return $string;
$BODY$ language plperl strict immutable;

Test

=> select pg_temp.multi_replace(
    'bar foo - bar & LLC',
    '{^bar,-,&,LLC$}',
    '{baz,_,and,business}'
   );

       multi_replace        
----------------------------
 baz foo _ bar and business

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 jjanes
Solution 3