'Hide String characters based on column in different table

I've written two queries below with results that look similar to this:

table : messages

message_body
Are you going to get the items from Walmart?
I am going to get the items from Target.
I picked up the items from target.
I am getting a computer from walmart.
I am going grocery shopping at Whole Foods.

table: companies

company_name
Walmart
Sprouts
Target
Whole Foods

I need to hide the names of the companies that show up in the companies table

My desired results are a query like this

message_body
I got the items from ***?
I am going to get the items from ***.
I picked up the items from ***.
I am getting a computer from ***.
I am going grocery shopping at ***.

Because the string values in messages can have punctuation I would need to remove all characters that are not alphanumeric to match properly with companies.

I am working in Snowsight, but I know MS server, so I can convert functions/procedures written in MS server to Snowflake. Unfortunately, I cannot do this in Python. Any ideas to get me going? Thanks



Solution 1:[1]

WITH message_body(msg) as (
    select * from values
        ('Are you going to get the items from Walmart?'),
        ('I am going to get the items from Target.'),
        ('I picked up the items from target.'),
        ('I am getting a computer from walmart.'),
        ('I am going grocery shopping at Whole Foods.')
), company_name(name) as (
    select * from values
        ('Walmart'),
        ('Sprouts'),
        ('Target'),
        ('Whole Foods')
)
select 
    regexp_replace(m.msg, f.name, '***', 1, 1, 'i')
from message_body as m
left join company_name as f
    on m.msg ilike '%'||f.name||'%'
qualify row_number() over (partition by m.msg order by length(f.name) desc) = 1;

gives:

REGEXP_REPLACE(M.MSG, F.NAME, '***', 1, 1, 'I')
Are you going to get the items from ***?
I am getting a computer from ***.
I am going grocery shopping at ***.
I am going to get the items from ***.
I picked up the items from ***.

Solution 2:[2]

To replace all occurrences, please try below:

WITH message_body(msg) as (
    select * from values
        ('Are you going to get the items from Walmart?'),
        ('I am going to get the items from Target.'),
        ('I picked up the items from target.'),
        ('I am getting a computer from walmart.'),
        ('I am going grocery shopping at Whole Foods.'),
        ('Walmart & Sprouts & Target')
), 
company_name(name) as (
    select * from values
        ('Walmart'),
        ('Sprouts'),
        ('Target'),
        ('Whole Foods')
),
company_name_list(name) as (
    select 
        listagg(name, '|')
    from 
         company_name
)
select 
    regexp_replace(m.msg, l.name, '***', 1, 0, 'i') as masked_name
from message_body as m,
company_name_list as l;

Solution 3:[3]

Based on the message that you are getting, the problem seems related to the project being owned by a gmail.com account (you can check the ownership in IAM & Admin >> IAM).

Internal projects would only work for the domain you are working on. I remember you do have an option to whitelist other domains to have access to the resources and run it, but I am not sure if you can use it to whitelist the gmail.com domain.

If you want to leave it as internal, I would suggest changing the ownership of the project to the Google Workspace account and test using the option to whitelist domains.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Simeon Pilgrim
Solution 2 Eric Lin
Solution 3 Fernando Lara