'I want to de-dupe records in BigQuery with max column value on specific column with expression

company  | email | phone | website | address
Amar CO LLC | [email protected] | 123 | NULL | India
Amar CO | [email protected] | NULL | NULL | IND
Stacks CO | [email protected] | 910 | stacks.com | United Kingdom
Stacks CO LLC | [email protected] | NULL | NULL | UK

I want to drop the company name with CO LLC instead want to keep Amar CO but want all the columns from Amar CO LLC as it has minimum NULL values or maximum column data.

In short: De-dupe the records, remove the company name with 'ending with or matching with LLC' (case insensitive), but keep the values from both of the record which has maximum Information column.

Expected output

Amar CO | [email protected] | 123 | NULL | India
Stacks CO | [email protected] | 910 | stacks.com | United Kingdom


Solution 1:[1]

You need group by and replace as follows:

select replace(company,' LLC','') as company, max(email) as email, max(phone) as phone,
       max(website) as website, max(address) as address
  from your_table t
group by replace(company,' LLC','')

I can see that you need all data of the both rows but precedence should be given to LLC record (India, IND --> India) then you can use it as follows:

select t.company, 
       coalesce(tt.email,t.emial) as email, 
       coalesce(tt.phone,t.phone) as phone
       coalesce(tt.website,t.website) as website,
       coalesce(tt.address,t.address) as address
  from your_table t join your_table tt 
    on concat(t.company,' LLC') = tt.company

If you want to update the data and then drop the record itself, I would suggest the following delete and update.

delete from your_table where t.company = 'Amar CO';

update your_table t
set t.comapny = replace(company,' LLC','') -- or use 'Amar CO'
where t.company = 'Amar CO LLC';

-- Update

You want to give precedence to the record having minimum null values then you can use the following query:

select t.company,
       case when tt_nulls > t_nulls then ttemail else temail end as email,
       case when tt_nulls > t_nulls then ttphone else tphone end as phone,
       case when tt_nulls > t_nulls then ttwebsite else twebsite end as website,
       case when tt_nulls > t_nulls then taddress else taddress end as address
from    
(select t.company, 
        count(case when t.email IS NULL THEN 1 end) over (partition by t.company) 
        + count(case when t.phone IS NULL THEN 1 end) over (partition by t.company) 
        + count(case when t.website IS NULL THEN 1 end) over (partition by t.company) 
        + count(case when t.address IS NULL THEN 1 end) over (partition by t.company)  
        as t_nulls,
        count(case when tt.email IS NULL THEN 1 end) over (partition by t.company) 
        + count(case when tt.phone IS NULL THEN 1 end) over (partition by t.company) 
        + count(case when tt.website IS NULL THEN 1 end) over (partition by t.company) 
        + count(case when tt.address IS NULL THEN 1 end) over (partition by t.company)  
        as tt_nulls
        t.email as temail, 
        t.phone as tphone,
        t.website as twebsite,
        t.address as taddress,
        tt.email as ttemail, 
        tt.phone as ttphone,
        tt.website as ttwebsite,
        tt.address as ttaddress
   from your_table t join your_table tt 
     on concat(t.company,' LLC') = tt.company) t

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1