'Creating a file format to deal with a comma in a string

I'm trying to import a csv file into Snowflake. I created a file format as usual but I quickly found out that one of the values had a comma which causing an error.

For example,

1234, John, male, Stack,
1235, Kati, female, Overflow,
1236, Rudy, male, Stack,Overflow

Which results to this

| ID | Name | String1 | String 2| {NULL} |
|----------------------------------------|
|1234| John |   male  |  Stack  |        |
|1235| Kati |  female | Overflow|        |
|1236| Rudy |   male  |  Stack  |Overflow|

But I want is this:

| ID | Name | String1 |   String 2   |
|------------------------------------|
|1234| John |   male  |     Stack    |
|1235| Kati |  female |   Overflow   |
|1236| Rudy |   male  |Stack,Overflow|

I created a file format like this below:

CREATE OR REPLACE FILE FORMAT format_example 
    TYPE = CSV 
    COMPRESSION = AUTO 
    FIELD_DELIMITER = ',' 
    SKIP_HEADER = 1 
    DATE_FORMAT = "DD-MON-YYYY" 
    FIELD_OPTIONALLY_ENCLOSED_BY = '"' 
    EMPTY_FIELD_AS_NULL = TRUE

What should I do? Is there a way for me get the desired outcome? Or is there a way to wrap the strings in quotations in Snowflake?



Solution 1:[1]

Since this is a problem with a malformed CSV, if you can't change it just ingest the whole row and then split inside Snowflake:

with data as (
    select '1236, Rudy, male, Stack,Overflow' x
    union all select '1234, John, male, Stack,'
    union all select '1235, Kati, female, Overflow,'
)

select regexp_substr(x, '[^,]*')
    , regexp_substr(x, '[^,]+', 1, 2)
    , regexp_substr(x, '[^,]+', 1, 3)
    , regexp_substr(x, '[^,]*,[^,]*,[^,]*,(.*[^,])', 1,1,'',1)
from data

enter image description here

For more complex cases you could use a CSV parser in JS or Java.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Felipe Hoffa