'row_number is not unique for duplicate records
I am trying to find the latest update of a particular row from a bunch of rows per uuid.
For that we use row_number() over a partition as shown below,
"row_number"() OVER (
PARTITION BY "uuid"
ORDER BY "uuid" ASC,
"status"."sequence" DESC,
"modifiedon" DESC
) "row_id_ranked",
After this, anything other than 'row_id_ranked = 1' are discarded. But when a retry/re-transmission happens, "status"."sequence" and "modifiedon" will be exactly same. This creates mutiple rows with row_id_ranked=1. Making it difficult to remove them.
As per my understanding per the documentation, the row_number should be a unique value. But in case of these duplicates, it clearly is not.
row_number() → bigint#
Returns a unique, sequential number for each row, starting with one, according to the ordering of rows within the window partition.
How do I de-duplicate in this case?
Solution 1:[1]
Assuming that there does not exist some third column beyond the status sequence and modified on date to break a potential tie, and also assuming you don't care which single record gets retained, you may use RANDOM() here:
ROW_NUMBER() OVER (PARTITION BY uuid
ORDER BY "status"."sequence" DESC, modifiedon DESC, RANDOM())
AS row_id_ranked
Solution 2:[2]
If you use "Partition by uuid" in the "over" clause of row number you will get a row 1 for each distinct uuid. That is you have partitioned your data by uuid and the numbering will restart for each partition, as in the definition you have posted.
I think that you just want to use order by, which will return 1 row number 1.
"row_number"() OVER (
ORDER BY "uuid" ASC,
"status"."sequence" DESC,
"modifiedon" DESC
) "row_id_ranked",
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Tim Biegeleisen |
| Solution 2 |
