'data.table count observations close in distance and time of current observation
I am looking to calculate a new column "congestion" by counting the number of times values are within sec +/- 5 and within x +/- 5 and within y +/- 5 of the current row. Essentially I am wanting to find observations that occur within a close distance (x,y) and time period (sec) of the current observation which is just a big count ifelse statement. All values are numerical.
current data.table
data <- data.table(x = c(1,3,10,15,6),
y = c(5,5,11,14,19),
sec=c(1,3,5,6,9))
desired output
data <- data.table(x = c(1,3,10,15,6),
y = c(5,5,11,14,6),
sec=c(1,3,5,6,7),
congestion = c(1,2,1,1,2)
preferable solution in data.table but happy to work in dplyr.
Solution 1:[1]
I think your "desired output" is incorrect given the criteria you've specified.
However, if your data is small enough you can do a full-join on the data with itself, and filter out invalid combinations
library(data.table)
data <- data.table(x = c(1,3,10,15,6),
y = c(5,5,11,14,19),
sec=c(1,3,5,6,9))
data[, join_key := 1L ] ## specify a key on which to join
data[
data
, on = .(join_key) ## Full Join to put all possible combinations together
, allow.cartesian = TRUE
][
(x >= i.x * 5 * -1 & x <= i.x * 5) & ## Filter the valid combinations
(y >= i.y * 5 * -1 & y <= i.y * 5) &
(sec >= i.sec - 5 & sec <= i.sec + 5)
, .(
congestion = .N
)
, by = .(x, y, sec)
]
# x y sec congestion
# 1: 1 5 1 4
# 2: 3 5 3 4
# 3: 10 11 5 4
# 4: 15 14 6 4
# 5: 6 19 9 3
A slightly more efficient approach might be to do a by = .EACHI join (borrowing the concept from this answer
data[, row_idx := 1L]
data[
data
, {
idx = (x >= i.x * 5 * -1 & x <= i.x * 5) &
(y >= i.y * 5 * -1 & y <= i.y * 5) &
(sec >= i.sec - 5 & sec <= i.sec + 5)
.(
x = x[ idx ]
, y = y[ idx ]
, sec = sec[ idx ]
)
}
, on = .(row_idx)
, by = .EACHI
][
, .(congestion = .N)
, by = .(x, y, sec)
]
# x y sec congestion
# 1: 1 5 1 4
# 2: 3 5 3 4
# 3: 10 11 5 4
# 4: 15 14 6 4
# 5: 6 19 9 3
Solution 2:[2]
You could define the limits and join on them:
data[,`:=`(x_high = x +5,
x_low = x - 5,
y_high = y + 5,
y_low = y - 5,
sec_high = sec +5,
sec_low = sec - 5)]
data[data,.(x,y,sec,x.x,x.y,x.sec),
on=.(x>=x_low,
x<=x_high,
y>=y_low,
y<=y_high,
sec>=sec_low,
sec<=sec_high)][
!(x==x.x&y==x.y&sec==x.sec),.(congestion=.N),by=.(x,y,sec)]
x y sec congestion
<num> <num> <num> <int>
1: 1 5 1 1
2: 3 5 3 1
3: 10 11 5 1
4: 15 14 6 1
According to the +/- 5 rule, I find less congestions than your expected result. If I understood correctly the constraints, this seems correct to me.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 |
