'Best practice to find emails and phones related in a customers table

Let say I have the following customers table:

row email Phone group
1 [email protected] +135698887445 1
2 [email protected] +145698834549 2

I would like to group customers that have the same email or same phone (or both).
The tricky part is that with incremental loading once a new customer row is been added, earlier identities that grouped to different groupIds could now be joined to the same group like so:

row email Phone group
1 [email protected] +135698887445 1
2 [email protected] +145698834549 2--changed to group 1
3 [email protected] +145698834549 1

In the last example you can notice that the third row has email from the first row and cellphone from the second row so it links the above rows to the same group.

How should this be done in TSQL? And is it possible to keep the groupids numbers while changing early group number only when needed?



Solution 1:[1]

First, define that if multiple rows are actually of the same group, the smallest group will prevail. We then just need to loop updates until no group decrease is found:

declare @rc bigint=1 -- to go through first iteration check
declare @counter_safety_max=100
declare @counter=1

while(@rc>0 and @counter<@counter_safety_max)
begin
    set @counter+=1
    
    update current
        set group=lowest.group
    from
        customer current
        cross apply
        (   
            select top 1 group
            from customer cl
            where (cl.phone=current.phone or cl.email=current.email) and cl.group<current.group
            order by group asc
        ) as lowest
        
    set @rc=@@rowcount
    
end

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 George Menoutis