Fastest way to find nearest value in vector

I have two integer/posixct vectors:

a <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15) #has > 2 mil elements
b <- c(4,6,10,16) # 200000 elements

Now my resulting vector c should contain, for each element of vector a, the nearest element of b:

c <- c(4,4,4,4,4,6,6,...)

I tried it with apply and which.min(abs(a - b)), but it's very slow.

Is there any more clever way to solve this? Is there a data.table solution?



Solution 1:[1]

library(data.table)

# build the two tables and duplicate the value column as a join key
a <- data.table(Value = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15))
a[, merge := Value]

b <- data.table(Value = c(4, 6, 10, 16))
b[, merge := Value]

setkeyv(a, c('merge'))
setkeyv(b, c('merge'))

# rolling join: match each row of b to the nearest keyed value in a
Merge_a_b <- a[b, roll = 'nearest']

In data.table, when two keyed tables are joined you can pass roll = 'nearest', which matches each row of the table inside the brackets to the row of the outer table with the nearest key value. The resulting table therefore has as many rows as the table inside the brackets (here b). As usual, the join requires a common key.
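
Since the question asks for the nearest element of b for each element of a, a minimal sketch of the same rolling join with the tables swapped (reusing the keyed tables from above; this reversed direction is an addition, not part of the original answer) would be:

# one result row per element of a, each matched to the nearest key in b
res <- b[a, roll = 'nearest']
res$Value # the nearest element of b for every element of a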

Solution 2:[2]

As presented in a related answer, you can do either:

which(abs(x - your.number) == min(abs(x - your.number)))

or

which.min(abs(x - your.number))

where x is your vector and your.number is the value you are looking for. If you have a matrix or data frame, simply convert it to a numeric vector in an appropriate way and then try this on the resulting numeric vector.

For example:

x <- 1:100
your.number <- 21.5
which(abs(x - your.number) == min(abs(x - your.number)))

would output:

[1] 21 22
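
Applied per element to the vectors from the question, a minimal sketch (essentially the sapply loop of Solution 4 below; vapply is used here only so that the result type is fixed) looks like this:

a <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
b <- c(4, 6, 10, 16)

# for each element of a, take the element of b at the smallest absolute distance
# (which.min returns the first minimum, so exact ties go to the smaller value of b)
vapply(a, function(x) b[which.min(abs(x - b))], numeric(1))
# [1]  4  4  4  4  4  6  6  6 10 10 10 10 10 16 16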

Solution 3:[3]

Not quite sure how it will behave with your data volume, but cut is quite fast.

The idea is to cut your vector a at the midpoints between the elements of b.

Note that I am assuming the elements in b are strictly increasing!

Something like this:

a <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15) #has > 2 mil elements
b <- c(4,6,10,16) # 200000 elements

cuts <- c(-Inf, b[-1]-diff(b)/2, Inf)
# Will yield: c(-Inf, 5, 8, 13, Inf)

cut(a, breaks=cuts, labels=b)
# [1] 4  4  4  4  4  6  6  6  10 10 10 10 10 16 16
# Levels: 4 6 10 16

It is even faster to use a lower-level function like findInterval (which, again, assumes that the breakpoints are non-decreasing).

findInterval(a, cuts)
# [1] 1 1 1 1 2 2 2 3 3 3 3 3 4 4 4

So of course you can do something like:

index = findInterval(a, cuts)
b[index]
# [1]  4  4  4  4  6  6  6 10 10 10 10 10 16 16 16

Note that you can choose what happens to elements of a that are equidistant from an element of b by passing the relevant arguments to cut (or findInterval); see their help pages.
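
For instance, a small sketch of that tie handling with the cuts defined above (right is the relevant argument for cut, left.open for findInterval):

# a value of 5 is equidistant from b = 4 and b = 6
cut(5, breaks = cuts, labels = b)                  # intervals (lo, hi] by default -> 4
cut(5, breaks = cuts, labels = b, right = FALSE)   # intervals [lo, hi)            -> 6
b[findInterval(5, cuts)]                           # intervals [lo, hi) by default -> 6
b[findInterval(5, cuts, left.open = TRUE)]         # intervals (lo, hi]            -> 4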

Solution 4:[4]

For those who would be satisfied with the slow solution:

# for each element of a, return the element of b with the smallest absolute difference
sapply(a, function(x, b) b[which.min(abs(x - b))], b)

Solution 5:[5]

Late to the party, but there is now a function in the DescTools package called Closest which does almost exactly what you want (it just doesn't handle multiple values at once).

To get around this, we can lapply over your vector a and find the closest element of b for each value.

library(DescTools)

lapply(a, function(i) Closest(x = b, a = i))

You might notice that more values can be returned than there are elements in a. This is because Closest returns both candidates if the value you are testing is exactly halfway between two elements of b (e.g. 5 is exactly between 4 and 6, so both 4 and 6 would be returned).

To get around this, put either min or max around the result:

lapply(a, function(i) min(Closest(x = b, a = i)))
lapply(a, function(i) max(Closest(x = b, a = i)))

Then unlist the result to get a plain vector :)
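
A minimal end-to-end sketch (using sapply instead of lapply plus unlist, and min to break ties toward the smaller candidate; the expected output assumes the tie behaviour described above):

library(DescTools)

a <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
b <- c(4, 6, 10, 16)

# sapply simplifies the result straight to a plain numeric vector
sapply(a, function(i) min(Closest(x = b, a = i)))
# [1]  4  4  4  4  4  6  6  6 10 10 10 10 10 16 16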

Solution 6:[6]

Here might be a simple base R option, using max.col + outer:

b[max.col(-abs(outer(a,b,"-")))]

which gives

> b[max.col(-abs(outer(a,b,"-")))]
 [1]  4  4  4  4  6  6  6 10 10 10 10 10 16 16 16
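
Note that max.col breaks ties at random by default (see ?max.col), so equidistant elements of a may land on either neighbour; a sketch with deterministic tie breaking (this variant is an addition to the original answer) would be:

# ties.method = "first" resolves equidistant cases toward the smaller element of b
b[max.col(-abs(outer(a, b, "-")), ties.method = "first")]
# [1]  4  4  4  4  4  6  6  6 10 10 10 10 10 16 16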

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1
Solution 2
Solution 3
Solution 4: Katarzyna Paczkowska
Solution 5: morgan121
Solution 6: ThomasIsCoding