'Compare whether incident is new or already exists
There is an algorithm that identifies issues/incidents in a network. After that it writes all the cases into a DB. Lets say it looks like (simplified):
| ID | Date | Case |
|---|---|---|
| A1 | 2022-01-01 | 1 |
| B1 | 2022-01-01 | 2 |
| C1 | 2022-01-01 | 3 |
| A1 | 2022-01-02 | NA |
| C1 | 2022-01-02 | NA |
| A1 | 2022-01-03 | NA |
| B1 | 2022-01-03 | NA |
| C1 | 2022-01-03 | NA |
Each row represents an incident.
Now I want to identify if an incident already existed the last time we ran this script. For this it should check the actual Date and compare it with the last existing date in the table.
Note: it can occur that the last day was not yesterday, it might be that there is a difference up to 7 days.
So the logical is:
- compare the second highest
datevalue in thisIDgroup with the second highestDatefrom fulldf - If it's the same it means, case already exists. Then take it's last
Casenumber. If it's new, create a newCasenumber (max(Case) + 1)
Update 11.05.2022 - 17:02:
- It should consider existing
Casevalues and not overwrite them. Or with other words, it should overwrite/fill theNAs. There will never beNAs in between. The existing cases always have a number, the new ones doesn't. 100%.
Expected result:
| ID | Date | Case | Comment |
|---|---|---|---|
| A1 | 2022-01-01 | 1 | |
| B1 | 2022-01-01 | 2 | |
| C1 | 2022-01-01 | 3 | |
| A1 | 2022-01-02 | 1 | |
| C1 | 2022-01-02 | 3 | |
| A1 | 2022-01-03 | 1 | |
| B1 | 2022-01-03 | 4 | New case, as there wasn't B1 on 2022-01-02 |
| C1 | 2022-01-03 | 3 |
I was able to identify the second highest Dates:
> df[, nth(unique(Date),length(unique(Date))-1), ID]
ID V1
1: A1 2022-01-02 ## TRUE, as it's the second highest Date
2: B1 2022-01-01 ## FALSE, as it's not the second highest Date
3: C1 2022-01-02 ## TRUE, as it's the second highest Date
> df[, nth(unique(Date),length(unique(Date))-1)]
[1] "2022-01-02" ## Second highest Date in df
But now I'm struggling with creating a new column with this condition. Can somebody please help? data.table solution prefered, but dplyr is great as well.
MWE
library(data.table)
df = data.table(ID=c("A1", "B1", "C1", "A1", "C1", "A1", "B1", "C1"),
Date=as.Date(c("2022-01-01","2022-01-01","2022-01-01","2022-01-02","2022-01-02","2022-01-03", "2022-01-03", "2022-01-03")),
Case = NA)
Goal = data.table(ID=c("A1", "B1", "C1", "A1", "C1", "A1", "B1", "C1"),
Date=as.Date(c("2022-01-01","2022-01-01","2022-01-01","2022-01-02","2022-01-02","2022-01-03", "2022-01-03", "2022-01-03")),
Case=c(1,2,3,1,3,1,4,3))
Solution 1:[1]
How about this:
df[order(Date), d:=c(1,diff(Date)), by = ID][
order(d,ID),case:=rleid(ID,d)][
,d:=NULL]
Output:
ID Date case
1: A1 2022-01-01 1
2: B1 2022-01-01 2
3: C1 2022-01-01 3
4: A1 2022-01-02 1
5: C1 2022-01-02 3
6: A1 2022-01-03 1
7: B1 2022-01-03 4
8: C1 2022-01-03 3
If you actually want the comment column, you can refine the above, like this:
df[order(Date), d:=c(1,diff(Date)), by = ID][
order(d,ID),`:=`(
case=rleid(ID,d),
comment=fifelse(d!=1,paste0("New case, as there was no ", ID, " on ",Date-1),""))][
,d:=NULL][]
Output:
ID Date case comment
1: A1 2022-01-01 1
2: B1 2022-01-01 2
3: C1 2022-01-01 3
4: A1 2022-01-02 1
5: C1 2022-01-02 3
6: A1 2022-01-03 1
7: B1 2022-01-03 4 New case, as there was no B1 on 2022-01-02
8: C1 2022-01-03 3
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | langtang |
