'Finding duplicate values in multiple colums in a SQL table and count for chars
referring to this question:
Finding duplicate values in multiple colums in a SQL table and count
I have the following table structure:
id name1 name2 name3 ...
1 Hans Peter Frank
2 Hans Frank Peter
3 Hans Peter Frank
4 Paul Peter Hans
.
.
.
I use the following command to display the duplicates and the counts:
SELECT COUNT(name1), name1, name2, name3
FROM table
GROUP BY name1, name2, name3
HAVING (COUNT(name1) > 1) AND (COUNT(name2) > 1) AND (COUNT(name3) > 1)
This command gives me a count of 2. I would like to know how the second line could also be counted as a dublicate.
Unfortunately, the solution to the original question (Finding duplicate values in multiple colums in a SQL table and count) does not work for char
Solution 1:[1]
First normalize the table with UNION ALL
in a CTE to get each of the 3 names in a separate row.
Then with ROW_NUMBER()
window function you can rank alphabetically the 3 names so that you can group by them:
WITH cte(id, name) AS (
SELECT id, name1 FROM tablename
UNION ALL
SELECT id, name2 FROM tablename
UNION ALL
SELECT id, name3 FROM tablename
)
SELECT COUNT(*) count, name1, name2, name3
FROM (
SELECT id,
MAX(CASE WHEN rn = 1 THEN name END) name1,
MAX(CASE WHEN rn = 2 THEN name END) name2,
MAX(CASE WHEN rn = 3 THEN name END) name3
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY name) rn
FROM cte
)
GROUP BY id
)
GROUP BY name1, name2, name3
HAVING COUNT(*) > 1;
Another way to do it, that uses similar logic to your previous question with numeric values, with string function REPLACE()
instead of window functions, but works only if the 3 names in each row are different:
SELECT COUNT(*) count,
MIN(name1, name2, name3) name_1,
REPLACE(
REPLACE(
REPLACE(name1 || ',' || name2 || ',' || name3, MIN(name1, name2, name3), ''),
MAX(name1, name2, name3), ''), ',', ''
) name_2,
MAX(name1, name2, name3) name_3
FROM tablename
GROUP BY name_1, name_2, name_3
HAVING COUNT(*) > 1;
See the demo.
Solution 2:[2]
Not particularly pretty but a different approach to pivot the columns and then aggregate them as a string and counting duplicates. Slightly fiddly as unfortunately in SQL Lite the group_concat function can't specify any ordering within the group, forcing another level of nesting and a row_number. I guess that's part of being SQL lite!
You could remove the Seq and filtering criteria if you wanted to show all the duplicate variations.
with cte as (
select Duplicates, name1, name2, name3,
Row_Number() over(partition by Duplicates order by name1,name2,name3) Seq
from (
select count(*) over(partition by allnames) Duplicates, name1, name2, name3
from t
left join (
select Id, group_concat(Dname) allNames
from (
select Id, Dname, row_number() over (partition by Id order by Dname) seq
from (
select id, name1 Dname from t union all
select id, name2 from t union all
select id, name3 from t
)x
)x
group by Id
order by seq
)d on d.id=t.id
)d
)
select Duplicates, name1, name2, name3
from cte
where duplicates>1 and seq=1
Solution 3:[3]
You just need to sort rows by all columns, group them, count the duplicates, and find the record with the most duplicates. It is rather a hassle to get this written in SQL, which uses a nested query containing window function. The SQL statement is long and difficult to read. An alternative is to export data out of database and handle it in Python or SPL. SPL, the open-source Java package, is easier to be integrated into a Java program and generate much simpler code. It finishes the task with only three lines of code:
A | |
---|---|
1 | =sqlite.query("select name1,name2,name3 from names") |
2 | =A1.group([#1,#2,#3].sort();~.len()).maxp(#2) |
3 | =create(count,name1,name2,name3).record([A2.#2,A2.#1(1),A2.#1(2),A2.#1(3)]) |
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | |
Solution 3 |