Replace Comma Outside Double Quote - Python - Regex

I want to open a CSV file using open() and read it line by line. For various reasons, I'm not using Pandas.

I want to replace each comma , with _XXX_, but I want to avoid replacing commas inside double quotes ", because those commas are not separators. So I can't simply use:

string_ = string_.replace(',', '_XXX_')

How can I do this? Maybe with a regex?

I've found "replace comma inside quotation" and "Python regex: find and replace commas between quotation marks", but I need to replace commas OUTSIDE the quotation marks.



Solution 1:[1]

You may use re.sub with a simple "[^"]*" regex (or (?s)"[^"\\]*(?:\\.[^"\\]*)*" if you also need to handle escape sequences between the double quotes) to match strings between double quotes, capture this pattern into Group 1, and match a comma in all other contexts. Then pass a callable as the replacement argument: it receives the match object and returns Group 1 unchanged when a quoted string was matched, or the replacement text when a bare comma was matched.

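import re

# Simple pattern: a quoted string is captured in Group 1 and returned unchanged;
# a bare comma is replaced (here with an empty string; use '_XXX_' if needed).
print( re.sub(r'("[^"]*")|,',
    lambda x: x.group(1) if x.group(1) else x.group().replace(",", ""),
    '1,2,"test,3,7","4, 5,6, ... "') )
# => 12"test,3,7""4, 5,6, ... "

# Escape-aware pattern: also copes with \" inside the quoted strings.
print( re.sub(r'(?s)("[^"\\]*(?:\\.[^"\\]*)*")|,',
    lambda x: x.group(1) if x.group(1) else x.group().replace(",", ""),
    r'1,2,"test, \"a,b,c\" ,03","4, 5,6, ... "') )
# => 12"test, \"a,b,c\" ,03""4, 5,6, ... "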

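Applied to the question's case, where every comma outside double quotes should become _XXX_, the same idea might be sketched as follows. The helper name replace_outside_quotes and the sample line are assumptions made for illustration; only the pattern comes from the answer.

```python
import re

# Simple pattern from the answer: Group 1 is a double-quoted string,
# the second alternative is a bare comma.
PATTERN = re.compile(r'("[^"]*")|,')

def replace_outside_quotes(line, replacement="_XXX_"):
    """Replace commas that are not inside double quotes (hypothetical helper)."""
    # Group 1 matched -> a quoted string, return it unchanged.
    # Otherwise the match is a bare comma -> return the replacement.
    return PATTERN.sub(lambda m: m.group(1) if m.group(1) else replacement, line)

print(replace_outside_quotes('1,2,"test,3,7","4, 5,6, ... "'))
# => 1_XXX_2_XXX_"test,3,7"_XXX_"4, 5,6, ... "
```

When reading the file with open(), the helper could be called on each line inside the usual `for line in f:` loop. Note that a quoted field spanning several physical lines would not be handled by a purely line-by-line approach.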

Regex details

  • ("[^"]*")|,:
    • ("[^"]*") - Capturing group 1: a ", then any 0 or more chars other than " and then a "
    • | - or
    • , - a comma

The second pattern consists of:

  • (?s) - the inline version of a re.S / re.DOTALL flag
  • ("[^"\\]*(?:\\.[^"\\]*)*") - Group 1: a ", then any 0 or more chars other than " and \, then 0 or more sequences of a \ followed by any one char and then 0 or more chars other than " and \, and finally a closing "
  • | - or
  • , - comma.
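To see where the two patterns differ, here is a small comparison on made-up input. The names SIMPLE, ESCAPED and sub_commas and the sample lines are assumptions for illustration only. With backslash-escaped quotes the simple pattern ends the quoted string too early and touches a comma inside it, while the escape-aware pattern does not. With doubled quotes (the more common CSV convention) the simple pattern still protects the inner commas, because the field is just matched as several adjacent quoted pieces.

```python
import re

SIMPLE = re.compile(r'("[^"]*")|,')
ESCAPED = re.compile(r'(?s)("[^"\\]*(?:\\.[^"\\]*)*")|,')

def sub_commas(pattern, text, replacement="_XXX_"):
    # Keep quoted strings (Group 1) as they are, replace bare commas.
    return pattern.sub(lambda m: m.group(1) if m.group(1) else replacement, text)

backslash_line = r'1,"a \"b,c\" d",2'   # quotes escaped with a backslash
doubled_line = '1,"a ""b,c"" d",2'      # quotes escaped by doubling

print(sub_commas(SIMPLE, backslash_line))
# => 1_XXX_"a \"b_XXX_c\" d"_XXX_2   (inner comma wrongly replaced)

print(sub_commas(ESCAPED, backslash_line))
# => 1_XXX_"a \"b,c\" d"_XXX_2

print(sub_commas(SIMPLE, doubled_line))
# => 1_XXX_"a ""b,c"" d"_XXX_2
```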

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Wiktor Stribiżew