'Trying to find duplicates in the same table with different load dates. and same records. Cannot figure out if query is returning duplicates or not
This is my first post and I apologize if I mess up any format. Please let me know. I'll get down to the problem.
'X' sends files to us that are concatenated to a file 'InHouse' that we create. The entries in 'InHouse' are uploaded to our databases to a table 'ABC'.
Then the number of records is sent to 'X'. Now, 'X' says that records from the 17th have also been uploaded on the 18th and 21st.
I have been entering different queries to see if I could weed out the duplicates if there are any. I'm going in with the assumption that there are duplicates and using queries or comparing excel files with data from different dates.
I ran select queries for 17th, 18th, and 21st individually including 'X's id in the where condition. After exporting the data to an excel file, I tried comparing the files using beyond compare but the files were being placed one after the other. The main issue was that I was not able to edit those files from beyond compare.
Since the table has no primary keys, I tried using a mixture of different fields like id, location code and amount, etc to find any recurring duplicates without mentioning the load date. Even though I'm returned with a set of records, not able to tell if the records are duplicates or not.
Below are the different queries that i used:
-
Select * from ABC where a in (select a, b,c,d,e from ABC group by a,b,c,d,e having count(*)>1) and a = 1234 and f in('031722','031822'); -
select distinct a,b,c,d,e from ABC where a=1234 and f in('031722','031822') group by a,b,c,d,e; -
select a,b,c,d,e,set_id from ABC s where 1 <( select count(*) from ABC i where i.a = s.a and i.b = s.b and i.c=s.c and i.d=s.d and i.e = s.e) and f in ('031722','031822') and a=1234;
Solution 1:[1]
If you want to find duplicates on the A, B, C, D and E columns when F is a DATE data type and is either 2022-03-17, 2022-03-18 or 2022-03-21 and A is 1234 then you can use the COUNT analytic function:
SELECT *
FROM (
SELECT a, b, c, d, e, f,
COUNT(*) OVER (PARTITION BY a, b, c, d, e) AS num_duplicates
FROM ABC
WHERE f IN (DATE '2022-03-17', DATE '2022-03-18', DATE '2022-03-21')
AND a = 1234
)
WHERE num_duplicates > 1
In Oracle, a DATE data type has both date and time and if your date value have a non-midnight time component then you will want to filter on a date range:
SELECT *
FROM (
SELECT a, b, c, d, e, f,
COUNT(*) OVER (PARTITION BY a, b, c, d, e) AS num_duplicates
FROM ABC
WHERE ( (f >= DATE '2022-03-17' AND f < DATE '2022-03-19')
OR (f >= DATE '2022-03-21' AND f < DATE '2022-03-22'))
AND a = 1234
)
WHERE num_duplicates > 1
As for your queries:
Should not compile as you are checking whether
WHERE a IN (SELECT a, b, c, d, e ...)which is comparing a single value to a set of 5 values. If you fix it toWHERE (a, b, c, d, e) IN (SELECT a, b, c, d, e ...then it may work but you do not filter the sub-query to be within your range of dates so you would also count duplicates that are outside your range and you are likely to get false-positives.It also has the issue that
'031822'is a string literal and not a date literal so Oracle will have to perform an implicit string-to-date conversion (assuming you are following best practice and actually storing dates asDATEdata types).Will get all matching values; it does not check for duplicates.
Also, using
DISTINCTandGROUP BYin the same query is not necessary as they have similar effects so one, or the other, is redundant.Has similar problems to #1 in that you do not restrict the sub-query to a date range so you will get false positives and you are using string literals for date values.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
