'multi replace in postgresql query using perl
I'm cleaning some text directly in my query, and rather than using nested replace functions, I found this bit of code that uses perl to perform multiple replacements at once: multi-replace with perl
CREATE FUNCTION pg_temp.multi_replace(string text, orig text[], repl text[])
RETURNS text
AS $BODY$
my ($string, $orig, $repl) = @_;
my %subs;
if (@$orig != @$repl) {
elog(ERROR, "array sizes mismatch");
}
if (ref @$orig[0] eq 'ARRAY' || ref @$repl[0] eq 'ARRAY') {
elog(ERROR, "array dimensions mismatch");
}
@subs{@$orig} = @$repl;
my $re = join "|",
sort { (length($b) <=> length($a)) } keys %subs;
$re = qr/($re)/;
$string =~ s/$re/$subs{$1}/g;
return $string;
$BODY$ language plperl strict immutable;
Example query:
select
name as original_name,
multi_replace(name, '{-,&,LLC$}', '{_,and,business}') as cleaned_name
from some_table
The function finds the pattern LLC at the end of the name string but removes it instead of replacing it with "business."
How can I make this work as intended?
Solution 1:[1]
When the strings in @$orig are to be matched literally, I'd actually use this:
my ($string, $orig, $repl) = @_;
# Argument checks here.
my %subs; @subs{ @$orig } = @$repl;
my $pat =
join "|",
map quotemeta,
sort { length($b) <=> length($a) }
@$orig;
return $string =~ s/$re/$subs{$&}/gr;
In particular, map quotemeta, was missing.
(By the way, the sort line isn't needed if you ensure that xy comes before x in @$orig when you want to replace both x(?!y) and xy.)
But you want the strings in @$orig to be treated as regex patterns. For that, you can use the following:
# IMPORTANT! Only provide strings from trusted sources in
# `@$orig` as it allows execution of arbitrary Perl code.
my ($string, $orig, $repl) = @_;
# Argument checks here.
my $re =
join "|",
map "(?:$orig->[$_])(?{ $_ })",
0..$#$orig;
{
use re qw( eval );
$re = qr/$re/;
}
return $string =~ s/$re/$repl->[$^R]/gr;
However, in your environment, I have doubts about the availability of use re qw( eval ); and (?{ }), so the above may be an unviable solution for you.
my ($string, $orig, $repl) = @_;
# Argument checks here.
my $re =
join "|",
map "(?<_$_>$orig->[$_])",
0..$#$orig;
$re = qr/$re/;
return
$string =~ s{$re}{
my ( $n ) =
map substr( $_, 1 ),
grep { $-{$_} && defined( $-{$_}[0] ) }
grep { /^_\d+\z/aa }
keys( %- );
$repl->[$n]
}egr;
Solution 2:[2]
While the regexp tests for LLC$ with the special meaning of the $, what gets captured into $1 is just the string LLC and so it doesn't find the look-up value to replace.
If the only thing you care about is $, then you could fix it by changing the map-building lines to:
@subs{map {my $t=$_; $t=~s/\$$//; $t} @$orig} = @$repl;
my $re = join "|",
sort { (length($b) <=> length($a)) } @$orig;
But it will be very hard to make it work more generally for every possible feature of regex.
Solution 3:[3]
The purpose of this plperl function in the linked blog post is to find/replace strings, not regular expressions. LLC being found with LLC$ as a search term does not happen in the original code, as the search terms go through quotemeta before being included into $re (as also sugggested in ikegami's answer)
The effect of removing the quotemeta transformation is that LLC at the end of a string is matched, but since as a key it's not found in $subs (because the key there isLLC$), then it's getting replaced by an empty string.
So how to make this work with regular expressions in the orig parameter?
The solution proposed by @ikegami does not seem usable from plperl, as it complains with this error: Unable to load re.pm into plperl.
I thought of an alternative implementation without the (?{ code }) feature: each match from the main alternation regexp can be rechecked against each regexp in orig, in a code block run with /ge. On the first match, the corresponding string in repl is selected as the replacement.
Code:
CREATE or replace FUNCTION pg_temp.multi_replace(string text, orig text[], repl text[])
RETURNS text AS
$BODY$
my ($string, $orig, $repl) = @_;
my %subs;
if (@$orig != @$repl) {
elog(ERROR, "array sizes mismatch");
}
if (ref @$orig[0] eq 'ARRAY' || ref @$repl[0] eq 'ARRAY') {
elog(ERROR, "array dimensions mismatch");
}
@subs{@$orig} = @$repl;
my $re = join "|", keys %subs;
$re = qr/($re)/;
# on each match, recheck the match individually against each regexp
# to find the corresponding replacement string
$string =~ s/$re/{ my $r; foreach (@$orig) { if ($1 =~ $_) {$r=$subs{$_}; last;} } $r;}/ge;
return $string;
$BODY$ language plperl strict immutable;
Test
=> select pg_temp.multi_replace(
'bar foo - bar & LLC',
'{^bar,-,&,LLC$}',
'{baz,_,and,business}'
);
multi_replace
----------------------------
baz foo _ bar and business
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | jjanes |
| Solution 3 |
