Clean up a comma-separated list by regex

I want to clean up a comma-separated tag list by removing empty tags and extra spaces. I came up with

$str='first , second ,, third, ,fourth   suffix';
echo preg_replace('#[,]{2,}#',',',preg_replace('#\s*,+\s*#',',',preg_replace('#\s+#s',' ',$str)));

which works well so far, but is it possible to do it in one replacement?
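
To make the three nested passes easier to follow, here is a minimal sketch that unrolls them into separate steps. It is a JavaScript port for illustration only; the variable names are assumptions, not part of the original code, and the `s` flag from the PHP pattern is dropped because it only affects `.`.

```javascript
const input = 'first , second ,, third, ,fourth   suffix';

// Pass 1: collapse every whitespace run to a single space.
const collapsed = input.replace(/\s+/g, ' ');
// Pass 2: turn each comma run, plus surrounding whitespace, into one comma.
const trimmed = collapsed.replace(/\s*,+\s*/g, ',');
// Pass 3: merge commas that pass 2 left adjacent (here, from ", ,").
const cleaned = trimmed.replace(/,{2,}/g, ',');

console.log(collapsed); // first , second ,, third, ,fourth suffix
console.log(trimmed);   // first,second,third,,fourth suffix
console.log(cleaned);   // first,second,third,fourth suffix
```

Pass 3 is needed because pass 2 treats `, ,` as two separate matches, which leaves `,,` behind.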

Solution 1:^[1]

You can use

preg_replace('~\s*(?:(,)\s*)+|(\s)+~', '$1$2', $str)

Merging the two alternatives into one results in

preg_replace('~\s*(?:([,\s])\s*)+~', '$1', $str)

See the regex demo and the PHP demo. Details of the first regex:

\s*(?:(,)\s*)+ - zero or more whitespaces and then one or more occurrences of a comma (captured into Group 1 ($1)) and then zero or more whitespaces
| - or
(\s)+ - one or more whitespaces while capturing the last one into Group 2 ($2).
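
The breakdown above can be checked by listing each match with its two groups. This is a small sketch in JavaScript, where a group that did not participate in the match is `undefined` and is replaced by an empty string in `'$1$2'`; the variable names are illustrative.

```javascript
const str = 'first , second ,, third, ,fourth   suffix';
const re = /\s*(?:(,)\s*)+|(\s)+/g;

// Collect [matched text, Group 1, Group 2] for every match.
const hits = [...str.matchAll(re)].map(m => [m[0], m[1], m[2]]);

for (const [text, g1, g2] of hits) {
  console.log(JSON.stringify(text), '| $1 =', g1, '| $2 =', JSON.stringify(g2));
}
// " , "  | $1 = , | $2 = undefined
// " ,, " | $1 = , | $2 = undefined
// ", ,"  | $1 = , | $2 = undefined
// "   "  | $1 = undefined | $2 = " "
```

Exactly one of the two groups is set per match, so `'$1$2'` yields either a single comma or a single whitespace character.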

In the second regex, ([,\s]) captures a single comma or a whitespace character.

The second regex matches:

\s* - zero or more whitespaces
(?:([,\s])\s*)+ - one or more occurrences of
- ([,\s]) - Group 1 ($1): a comma or a whitespace
- \s* - zero or more whitespaces
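
One consequence of the merged pattern is worth seeing in action: for a run made only of whitespace, the leading `\s*` has to give back one character, so Group 1 ends up holding the last whitespace character of the run, which is not necessarily a space. A short sketch (the `clean` helper name is hypothetical):

```javascript
const merged = /\s*(?:([,\s])\s*)+/g;
const clean = s => s.replace(merged, '$1');

console.log(JSON.stringify(clean('a \t b')));  // "a b"   (run ends in a space)
console.log(JSON.stringify(clean('a \nb')));   // "a\nb"  (run ends in a newline)
console.log(JSON.stringify(clean('a ,\n b'))); // "a,b"   (a comma in the run wins)
```

If the input can contain tabs or newlines and a plain space is always wanted, the captured character may need normalizing in a further step.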

See the PHP demo:

<?php
 
$str='first , second ,, third, ,fourth   suffix';
echo preg_replace('~\s*(?:(,)\s*)+|(\s)+~', '$1$2', $str) . PHP_EOL;
echo preg_replace('~\s*(?:([,\s])\s*)+~', '$1', $str);
// => first,second,third,fourth suffix
//    first,second,third,fourth suffix

BONUS

This solution uses only basic constructs, so it is portable to other NFA regex flavors. Here is a JavaScript demo:

const str = 'first , second ,, third, ,fourth   suffix';
console.log(str.replace(/\s*(?:(,)\s*)+|(\s)+/g, '$1$2'));
console.log(str.replace(/\s*(?:([,\s])\s*)+/g, '$1'));

It can even be adjusted for use in POSIX tools like sed:

sed -E 's/[[:space:]]*(([,[:space:]])[[:space:]]*)+/\2/g' file > outputfile

See the online demo.

Solution 2:^[2]

You can use:

\h*([,\h])[,\h]*

See an online demo. Or alternatively:

\h*([,\h])(?1)*

See an online demo. Details:

\h* - 0+ (Greedy) horizontal-whitespace chars;
([,\h]) - A 1st capture group to match a comma or horizontal-whitespace;
[,\h]* - Option 1: 0+ (Greedy) commas or horizontal-whitespace chars;
(?1)* - Option 2: Recurse the 1st subpattern 0+ (Greedy) times.

Replace with the 1st capture group:

$str='first , second ,, third, ,fourth   suffix';
echo preg_replace('~\h*([,\h])[,\h]*~', '$1', $str) . PHP_EOL;
echo preg_replace('~\h*([,\h])(?1)*~', '$1', $str);

Both print:

first,second,third,fourth suffix
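
For comparison with the JavaScript demo in Solution 1, here is a rough port of Option 1. JavaScript has neither `\h` nor the `(?1)` subroutine call, so this sketch assumes `[ \t]` as a narrower stand-in for `\h` (which in PCRE also covers other horizontal spaces such as the no-break space); the `cleanTags` name is hypothetical.

```javascript
// ASSUMPTION: [ \t] approximates PCRE's \h (space and tab only).
const horizontal = /[ \t]*([, \t])[, \t]*/g;
const cleanTags = s => s.replace(horizontal, '$1');

console.log(cleanTags('first , second ,, third, ,fourth   suffix'));
// first,second,third,fourth suffix

// Unlike the \s-based patterns, line breaks are left alone.
console.log(JSON.stringify(cleanTags('a , b\nc ,, d'))); // "a,b\nc,d"
```

Restricting the match to horizontal whitespace is the main behavioral difference from Solution 1: a multi-line tag list keeps its line structure.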

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1
Solution 2
