'I want to de-dupe records in BigQuery with max column value on specific column with expression
company | email | phone | website | address
Amar CO LLC | [email protected] | 123 | NULL | India
Amar CO | [email protected] | NULL | NULL | IND
Stacks CO | [email protected] | 910 | stacks.com | United Kingdom
Stacks CO LLC | [email protected] | NULL | NULL | UK
I want to drop the company name with CO LLC instead want to keep Amar CO but want all the columns from Amar CO LLC as it has minimum NULL values or maximum column data.
In short: De-dupe the records, remove the company name with 'ending with or matching with LLC' (case insensitive), but keep the values from both of the record which has maximum Information column.
Expected output
Amar CO | [email protected] | 123 | NULL | India
Stacks CO | [email protected] | 910 | stacks.com | United Kingdom
Solution 1:[1]
You need group by and replace as follows:
select replace(company,' LLC','') as company, max(email) as email, max(phone) as phone,
max(website) as website, max(address) as address
from your_table t
group by replace(company,' LLC','')
I can see that you need all data of the both rows but precedence should be given to LLC record (India, IND --> India) then you can use it as follows:
select t.company,
coalesce(tt.email,t.emial) as email,
coalesce(tt.phone,t.phone) as phone
coalesce(tt.website,t.website) as website,
coalesce(tt.address,t.address) as address
from your_table t join your_table tt
on concat(t.company,' LLC') = tt.company
If you want to update the data and then drop the record itself, I would suggest the following delete and update.
delete from your_table where t.company = 'Amar CO';
update your_table t
set t.comapny = replace(company,' LLC','') -- or use 'Amar CO'
where t.company = 'Amar CO LLC';
-- Update
You want to give precedence to the record having minimum null values then you can use the following query:
select t.company,
case when tt_nulls > t_nulls then ttemail else temail end as email,
case when tt_nulls > t_nulls then ttphone else tphone end as phone,
case when tt_nulls > t_nulls then ttwebsite else twebsite end as website,
case when tt_nulls > t_nulls then taddress else taddress end as address
from
(select t.company,
count(case when t.email IS NULL THEN 1 end) over (partition by t.company)
+ count(case when t.phone IS NULL THEN 1 end) over (partition by t.company)
+ count(case when t.website IS NULL THEN 1 end) over (partition by t.company)
+ count(case when t.address IS NULL THEN 1 end) over (partition by t.company)
as t_nulls,
count(case when tt.email IS NULL THEN 1 end) over (partition by t.company)
+ count(case when tt.phone IS NULL THEN 1 end) over (partition by t.company)
+ count(case when tt.website IS NULL THEN 1 end) over (partition by t.company)
+ count(case when tt.address IS NULL THEN 1 end) over (partition by t.company)
as tt_nulls
t.email as temail,
t.phone as tphone,
t.website as twebsite,
t.address as taddress,
tt.email as ttemail,
tt.phone as ttphone,
tt.website as ttwebsite,
tt.address as ttaddress
from your_table t join your_table tt
on concat(t.company,' LLC') = tt.company) t
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
