Round to nearest arbitrary number from list - r

I'm basically looking for a way to do a variation of this Ruby script in R.
I have an arbitrary list of numbers (steps of a moderator for a regression plot in this case) which have unequal distances from each other, and I'd like to round values which are within a range around these numbers to the nearest number in the list.
The ranges don't overlap.
arbitrary.numbers <- c(4,10,15) / 10
numbers <- c(16:1 / 10, 0.39, 1.45)
range <- 0.1
Expected output:
numbers
## 1.6 1.5 1.4 1.3 1.2 1.1 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.39 1.45
round_to_nearest_neighbour_in_range(numbers,arbitrary.numbers,range)
## 1.5 1.5 1.5 1.3 1.2 1.0 1.0 1.0 0.8 0.7 0.6 0.4 0.4 0.4 0.2 0.1 0.4 1.5
I've got a little helper function that might do for my specific problem, but it's not very flexible and it contains a loop. I can post it here, but I think a real solution would look completely different.
The different answers timed for speed (on a million numbers)
> numbers = rep(numbers,length.out = 1000000)
> system.time({ mvg.round(numbers,arbitrary.numbers,range) })[3]
elapsed
0.067
> system.time({ rinker.loop.round(numbers,arbitrary.numbers,range) })[3]
elapsed
0.289
> system.time({ rinker.round(numbers,arbitrary.numbers,range) })[3]
elapsed
1.403
> system.time({ nograpes.round(numbers,arbitrary.numbers,range) })[3]
elapsed
1.971
> system.time({ january.round(numbers,arbitrary.numbers,range) })[3]
elapsed
16.12
> system.time({ shariff.round(numbers,arbitrary.numbers,range) })[3]
elapsed
15.833
> system.time({ mplourde.round(numbers,arbitrary.numbers,range) })[3]
elapsed
9.613
> system.time({ kohske.round(numbers,arbitrary.numbers,range) })[3]
elapsed
26.274
MvG's function is the fastest, about 5 times faster than Tyler Rinker's second function.

A vectorized solution, without any apply family functions or loops:
The key is findInterval, which finds the "space" in arbitrary.numbers where each element in numbers is "between". So, findInterval(6,c(2,4,7,8)) returns 2, because 6 is between the 2nd and 3rd index of c(2,4,7,8).
# arbitrary.numbers is assumed to be sorted.
# find the index of the number just below each number, and just above.
# So for 6 in c(2,4,7,8) we would find 2 and 3.
low<-findInterval(numbers,arbitrary.numbers) # find index of number just below
high<-low+1 # find the corresponding index just above.
# Find the actual absolute difference between the arbitrary number above and below.
# So for 6 in c(2,4,7,8) we would find 2 and 1.
# (The absolute differences to 4 and 7).
low.diff<-numbers-arbitrary.numbers[ifelse(low==0,NA,low)]
high.diff<-arbitrary.numbers[ifelse(high==0,NA,high)]-numbers
# Find the minimum difference.
# In the example we would find that 6 is closest to 7,
# because the difference is 1.
mins<-pmin(low.diff,high.diff,na.rm=T)
# For each number, pick the arbitrary number with the minimum difference.
# So for 6 pick out 7.
pick<-ifelse(!is.na(low.diff) & mins==low.diff,low,high)
# Compare the actual minimum difference to the range.
ifelse(mins<=range+.Machine$double.eps,arbitrary.numbers[pick],numbers)
# [1] 1.5 1.5 1.5 1.3 1.2 1.0 1.0 1.0 0.8 0.7 0.6 0.4 0.4 0.4 0.2 0.1 0.4 1.5

Yet another solution using findInterval:
arbitrary.numbers<-sort(arbitrary.numbers) # need them sorted
range <- range*1.000001 # avoid rounding issues
nearest <- findInterval(numbers, arbitrary.numbers - range) # index of nearest
nearest <- c(-Inf, arbitrary.numbers)[nearest + 1] # value of nearest
diff <- numbers - nearest # compute errors
snap <- diff <= range # only snap near numbers
numbers[snap] <- nearest[snap] # snap values to nearest
print(numbers)
The nearest in the above code is not really mathematically the nearest number. Instead, it is the largest arbitrary number such that nearest[i] - range <= numbers[i], or equivalently nearest[i] <= numbers[i] + range. So in one go we find the largest arbitrary number which is either in the snapping range for a given input number, or still too small for that. For this reason, we only need to check one way for snap. No absolute value required, and even the squaring from a previous revision of this post was unneccessary.
Thanks to Interval search on a data frame for the pointer at findInterval, as I found it there before recognizing it in the answer by nograpes.
If, in contrast to your original question, you had overlapping ranges, you could write things like this:
arbitrary.numbers<-sort(arbitrary.numbers) # need them sorted
range <- range*1.000001 # avoid rounding issues
nearest <- findInterval(numbers, arbitrary.numbers) + 1 # index of interval
hi <- c(arbitrary.numbers, Inf)[nearest] # next larger
nearest <- c(-Inf, arbitrary.numbers)[nearest] # next smaller
takehi <- (hi - numbers) < (numbers - nearest) # larger better than smaller
nearest[takehi] <- hi[takehi] # now nearest is really nearest
snap <- abs(nearest - numbers) <= range # only snap near numbers
numbers[snap] <- nearest[snap] # snap values to nearest
print(numbers)
In this code, nearestreally ends up being the nearest number. This is achieved by considering both endpoints of every interval. In spirit, this is very much like the version by nograpes, but it avoids using ifelse and NA, which should benefit performance as it reduces the number of branching instructions.

Is this what you want?
> idx <- abs(outer(arbitrary.numbers, numbers, `-`)) <= (range+.Machine$double.eps)
> rounded <- arbitrary.numbers[apply(rbind(idx, colSums(idx) == 0), 2, which)]
> ifelse(is.na(rounded), numbers, rounded)
[1] 1.5 1.5 1.5 1.3 1.2 1.0 1.0 1.0 0.8 0.7 0.6 0.4 0.4 0.4 0.2 0.1 0.4 1.5

Please note that due to rounding errors (most likely), I use range = 0.1000001 to achieve the expected effect.
range <- range + 0.0000001
blah <- rbind( numbers, sapply( numbers, function( x ) abs( x - arbitrary.numbers ) ) )
ff <- function( y ) { if( min( y[-1] ) <= range + 0.000001 ) arbitrary.numbers[ which.min( y[ -1 ] ) ] else y[1] }
apply( blah, 2, ff )

This is still shorter:
sapply(numbers, function(x) ifelse(min(abs(arbitrary.numbers - x)) >
range + .Machine$double.eps, x, arbitrary.numbers[which.min
(abs(arbitrary.numbers - x))] ))
Thanks #MvG

Another option:
arb.round <- function(numbers, arbitrary.numbers, range) {
arrnd <- function(x, ns, r){
ifelse(abs(x - ns) <= range +.00000001, ns, x)
}
lapply(1:length(arbitrary.numbers), function(i){
numbers <<- arrnd(numbers, arbitrary.numbers[i], range)
}
)
numbers
}
arb.round(numbers, arbitrary.numbers, range)
Yields:
> arb.round(numbers, arbitrary.numbers, range)
[1] 1.5 1.5 1.5 1.3 1.2 1.0 1.0 1.0 0.8 0.7 0.6 0.4 0.4 0.4 0.2 0.1 0.4 1.5
EDIT: I removed the return call at the end of the function as it's not necessary adn can burn time.
EDIT: I think a loop will be even faster here:
loop.round <- function(numbers, arbitrary.numbers, range) {
arrnd <- function(x, ns, r){
ifelse(abs(x - ns) <= range +.00000001, ns, x)
}
for(i in seq_along(arbitrary.numbers)){
numbers <- arrnd(numbers, arbitrary.numbers[i], range)
}
numbers
}

Related

Entropy calculation gives NaN - is applying na.omit a valid tweak?

By definition, the entropy is defined as:
entropy <- function (p) sum(-p * log(p))
I'm performing LCA using the poLCA package and trying to calculate entropy, which for some of my models are outputting NaN.
error_prior <- entropy(lca2$P) # Class proportions model 2
error_post <- mean(apply(lca2$posterior, 1, entropy), na.rm = TRUE)
results[2,8] <- round(((error_prior - error_post) / error_prior), 3)
From the answer to this question: Entropy output is NaN for some class solutions and not others, I learnt that it is caused by zeros in p and it can be resolved by adding na.omit to the function as follows:
entropy <- function (p) sum(na.omit(-p * log(p)))
My question is - is this technical tweak mathematically valid without affecting the integrity of the calculation?
In my case, around 1/3 of the values in p are zeros. I'm really unsure if I should use na.omit or find another way to resolve this problem.
It is valid, but not transparent at first glance. The reason is that the mathematical limit of xlog(x) as x -> 0 is 0 (we can prove this using L'Hospital Rule). In this regard, the most robust definition of the function should be
entropy.safe <- function (p) {
if (any(p > 1 | p < 0)) stop("probability must be between 0 and 1")
log.p <- numeric(length(p))
safe <- p != 0
log.p[safe] <- log(p[safe])
sum(-p * log.p)
}
But simply dropping p = 0 cases gives identical results, because the result at p = 0 is 0 and contributes nothing to the sum anyway.
entropy.brutal <- function (p) {
if (any(p > 1 | p < 0)) stop("probability must be between 0 and 1")
log.p <- log(p)
## as same as sum(na.omit(-p * log.p))
sum(-p * log.p, na.rm = TRUE)
}
## p has a single 0
( p <- seq(0, 1, by = 0.1) )
#[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
entropy.brutal(p)
#[1] 2.455935
entropy.safe(p)
#[1] 2.455935
## half of p are zeros
p[1:5] <- 0
p
#[1] 0.0 0.0 0.0 0.0 0.0 0.5 0.6 0.7 0.8 0.9 1.0
entropy.brutal(p)
#[1] 1.176081
entropy.safe(p)
#[1] 1.176081
In conclusion, we can use either entropy.brutal or entropy.safe.

How to create random number in ifelse function?

I would like to implement random numbers for the time values equal to 0 (time == 0) and keep other time values as given.
set.seed(123)
df$time.new <- ifelse(df$time == 0, sample(0.2:0.8, replace=F), df$time)
Using the formula only 0.2 is created.
I will fill the blanks in the comment that answered the question. This is your code:
set.seed(123)
df$time.new <- ifelse(df$time == 0, sample(0.2:0.8, replace=F), df$time)
The key to understand why you are getting always 0.2 is to run:
0.2:0.8
This just yields: [1] 0.2 and that's the reason you are always getting 0.2 The seq() command lets you make sequences that have more elements by specifying shorter increments:
> seq(0.2, 0.8, by = 0.1)
[1] 0.2 0.3 0.4 0.5 0.6 0.7 0.8
If I remember correctly the default increment for a:b is one unit. Let's check a toy example:
> a <- 1; b <- 7
> a:b
[1] 1 2 3 4 5 6 7
If we do this with a <- 0.2 and b <- 0.8 the resulting vector would consist of just the value 0.2 hence, your code just detects such value.

How does R (v3.6.1) determine rounding with odd floor values? [duplicate]

Yes I know why we always round to the nearest even number if we are in the exact middle (i.e. 2.5 becomes 2) of two numbers. But when I want to evaluate data for some people they don't want this behaviour. What is the simplest method to get this:
x <- seq(0.5,9.5,by=1)
round(x)
to be 1,2,3,...,10 and not 0,2,2,4,4,...,10.
Edit: To clearify: 1.4999 should be 1 after rounding. (I thought this would be obvious)
This is not my own function, and unfortunately, I can't find where I got it at the moment (originally found as an anonymous comment at the Statistically Significant blog), but it should help with what you need.
round2 = function(x, digits) {
posneg = sign(x)
z = abs(x)*10^digits
z = z + 0.5 + sqrt(.Machine$double.eps)
z = trunc(z)
z = z/10^digits
z*posneg
}
x is the object you want to round, and digits is the number of digits you are rounding to.
An Example
x = c(1.85, 1.54, 1.65, 1.85, 1.84)
round(x, 1)
# [1] 1.8 1.5 1.6 1.8 1.8
round2(x, 1)
# [1] 1.9 1.5 1.7 1.9 1.8
(Thanks #Gregor for the addition of + sqrt(.Machine$double.eps).)
If you want something that behaves exactly like round except for those xxx.5 values, try this:
x <- seq(0, 1, 0.1)
x
# [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
floor(0.5 + x)
# [1] 0 0 0 0 0 1 1 1 1 1 1
As #CarlWitthoft said in the comments, this is the IEC 60559 standard as mentioned in ?round:
Note that for rounding off a 5, the IEC 60559 standard is expected to be used, ‘go to the even digit’. Therefore round(0.5) is 0 and round(-1.5) is -2. However, this is dependent on OS services and on representation error (since e.g. 0.15 is not represented exactly, the rounding rule applies to the represented number and not to the printed number, and so round(0.15, 1) could be either 0.1 or 0.2).
An additional explanation by Greg Snow:
The logic behind the round to even rule is that we are trying to
represent an underlying continuous value and if x comes from a truly
continuous distribution, then the probability that x==2.5 is 0 and the
2.5 was probably already rounded once from any values between 2.45 and 2.54999999999999..., if we use the round up on 0.5 rule that we learned in grade school, then the double rounding means that values
between 2.45 and 2.50 will all round to 3 (having been rounded first
to 2.5). This will tend to bias estimates upwards. To remove the
bias we need to either go back to before the rounding to 2.5 (which is
often impossible to impractical), or just round up half the time and
round down half the time (or better would be to round proportional to
how likely we are to see values below or above 2.5 rounded to 2.5, but
that will be close to 50/50 for most underlying distributions). The
stochastic approach would be to have the round function randomly
choose which way to round, but deterministic types are not
comforatable with that, so "round to even" was chosen (round to odd
should work about the same) as a consistent rule that rounds up and
down about 50/50.
If you are dealing with data where 2.5 is likely to represent an exact
value (money for example), then you may do better by multiplying all
values by 10 or 100 and working in integers, then converting back only
for the final printing. Note that 2.50000001 rounds to 3, so if you
keep more digits of accuracy until the final printing, then rounding
will go in the expected direction, or you can add 0.000000001 (or
other small number) to your values just before rounding, but that can
bias your estimates upwards.
This appears to work:
rnd <- function(x) trunc(x+sign(x)*0.5)
Ananda Mahto's response seems to do this and more - I am not sure what the extra code in his response is accounting for; or, in other words, I can't figure out how to break the rnd() function defined above.
Example:
seq(-2, 2, by=0.5)
# [1] -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0
round(x)
# [1] -2 -2 -1 0 0 0 1 2 2
rnd(x)
# [1] -2 -2 -1 -1 0 1 1 2 2
Depending on how comfortable you are with jiggling your data, this works:
round(x+10*.Machine$double.eps)
# [1] 1 2 3 4 5 6 7 8 9 10
This method:
round2 = function(x, n) {
posneg = sign(x)
z = abs(x)*10^n
z = z + 0.5
z = trunc(z)
z = z/10^n
z*posneg
}
does not seem to work well when we have numbers with many digits. E.g. doing round2(2436.845, 2) will give us 2436.84. The issue seems to occur with the trunc(z) function.
Overall, I think it has something to do with the way R stores numbers and thus the trunc and float function doesn't always work. I was able to get around it in not the most elegant way:
round2 = function(x, n) {
posneg = sign(x)
z = abs(x)*10^n
z = z + 0.5
z = trunc(as.numeric(as.character(z)))
z = z/10^n
(z)*posneg
}
This mimics the rounding away from zero at .5:
round_2 <- function(x, digits = 0) {
x = x + abs(x) * sign(x) * .Machine$double.eps
round(x, digits = digits)
}
round_2(.5 + -2:4)
-2 -1 1 2 3 4 5

R: What is wrong with rounding? [duplicate]

Yes I know why we always round to the nearest even number if we are in the exact middle (i.e. 2.5 becomes 2) of two numbers. But when I want to evaluate data for some people they don't want this behaviour. What is the simplest method to get this:
x <- seq(0.5,9.5,by=1)
round(x)
to be 1,2,3,...,10 and not 0,2,2,4,4,...,10.
Edit: To clearify: 1.4999 should be 1 after rounding. (I thought this would be obvious)
This is not my own function, and unfortunately, I can't find where I got it at the moment (originally found as an anonymous comment at the Statistically Significant blog), but it should help with what you need.
round2 = function(x, digits) {
posneg = sign(x)
z = abs(x)*10^digits
z = z + 0.5 + sqrt(.Machine$double.eps)
z = trunc(z)
z = z/10^digits
z*posneg
}
x is the object you want to round, and digits is the number of digits you are rounding to.
An Example
x = c(1.85, 1.54, 1.65, 1.85, 1.84)
round(x, 1)
# [1] 1.8 1.5 1.6 1.8 1.8
round2(x, 1)
# [1] 1.9 1.5 1.7 1.9 1.8
(Thanks #Gregor for the addition of + sqrt(.Machine$double.eps).)
If you want something that behaves exactly like round except for those xxx.5 values, try this:
x <- seq(0, 1, 0.1)
x
# [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
floor(0.5 + x)
# [1] 0 0 0 0 0 1 1 1 1 1 1
As #CarlWitthoft said in the comments, this is the IEC 60559 standard as mentioned in ?round:
Note that for rounding off a 5, the IEC 60559 standard is expected to be used, ‘go to the even digit’. Therefore round(0.5) is 0 and round(-1.5) is -2. However, this is dependent on OS services and on representation error (since e.g. 0.15 is not represented exactly, the rounding rule applies to the represented number and not to the printed number, and so round(0.15, 1) could be either 0.1 or 0.2).
An additional explanation by Greg Snow:
The logic behind the round to even rule is that we are trying to
represent an underlying continuous value and if x comes from a truly
continuous distribution, then the probability that x==2.5 is 0 and the
2.5 was probably already rounded once from any values between 2.45 and 2.54999999999999..., if we use the round up on 0.5 rule that we learned in grade school, then the double rounding means that values
between 2.45 and 2.50 will all round to 3 (having been rounded first
to 2.5). This will tend to bias estimates upwards. To remove the
bias we need to either go back to before the rounding to 2.5 (which is
often impossible to impractical), or just round up half the time and
round down half the time (or better would be to round proportional to
how likely we are to see values below or above 2.5 rounded to 2.5, but
that will be close to 50/50 for most underlying distributions). The
stochastic approach would be to have the round function randomly
choose which way to round, but deterministic types are not
comforatable with that, so "round to even" was chosen (round to odd
should work about the same) as a consistent rule that rounds up and
down about 50/50.
If you are dealing with data where 2.5 is likely to represent an exact
value (money for example), then you may do better by multiplying all
values by 10 or 100 and working in integers, then converting back only
for the final printing. Note that 2.50000001 rounds to 3, so if you
keep more digits of accuracy until the final printing, then rounding
will go in the expected direction, or you can add 0.000000001 (or
other small number) to your values just before rounding, but that can
bias your estimates upwards.
This appears to work:
rnd <- function(x) trunc(x+sign(x)*0.5)
Ananda Mahto's response seems to do this and more - I am not sure what the extra code in his response is accounting for; or, in other words, I can't figure out how to break the rnd() function defined above.
Example:
seq(-2, 2, by=0.5)
# [1] -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0
round(x)
# [1] -2 -2 -1 0 0 0 1 2 2
rnd(x)
# [1] -2 -2 -1 -1 0 1 1 2 2
Depending on how comfortable you are with jiggling your data, this works:
round(x+10*.Machine$double.eps)
# [1] 1 2 3 4 5 6 7 8 9 10
This method:
round2 = function(x, n) {
posneg = sign(x)
z = abs(x)*10^n
z = z + 0.5
z = trunc(z)
z = z/10^n
z*posneg
}
does not seem to work well when we have numbers with many digits. E.g. doing round2(2436.845, 2) will give us 2436.84. The issue seems to occur with the trunc(z) function.
Overall, I think it has something to do with the way R stores numbers and thus the trunc and float function doesn't always work. I was able to get around it in not the most elegant way:
round2 = function(x, n) {
posneg = sign(x)
z = abs(x)*10^n
z = z + 0.5
z = trunc(as.numeric(as.character(z)))
z = z/10^n
(z)*posneg
}
This mimics the rounding away from zero at .5:
round_2 <- function(x, digits = 0) {
x = x + abs(x) * sign(x) * .Machine$double.eps
round(x, digits = digits)
}
round_2(.5 + -2:4)
-2 -1 1 2 3 4 5

Computing sparse pairwise distance matrix in R

I have a NxM matrix and I want to compute the NxN matrix of Euclidean distances between the M points. In my problem, N is about 100,000. As I plan to use this matrix for a k-nearest neighbor algorithm, I only need to keep the k smallest distances, so the resulting NxN matrix is very sparse. This is in contrast to what comes out of dist(), for example, which would result in a dense matrix (and probably storage problems for my size N).
The packages for kNN that I've found so far (knnflex, kknn, etc) all appear to use dense matrices. Also, the Matrix package does not offer a pairwise distance function.
Closer to my goal, I see that the spam package has a nearest.dist() function that allows one to only consider distances less than some threshold, delta. In my case, however, a particular value of delta may produce too many distances (so that I have to store the NxN matrix densely) or too few distances (so that I can't use kNN).
I have seen previous discussion on trying to perform k-means clustering using the bigmemory/biganalytics packages, but it doesn't seem like I can leverage these methods in this case.
Does anybody know a function/implementation that will compute a distance matrix in a sparse fashion in R? My (dreaded) backup plan is to have two for loops and save results in a Matrix object.
Well, we can't have you resorting to for-loops, now can we :)
There is of course the question of how to represent the sparse matrix. A simple way is to have it only contain the indices of the points that are closest (and recalculate as needed). But in the solution below, I put both distance ('d1' etc) and index ('i1' etc) in a single matrix:
sparseDist <- function(m, k) {
m <- t(m)
n <- ncol(m)
d <- vapply( seq_len(n-1L), function(i) {
d<-colSums((m[, seq(i+1L, n), drop=FALSE]-m[,i])^2)
o<-sort.list(d, na.last=NA, method='quick')[seq_len(k)]
c(sqrt(d[o]), o+i)
}, numeric(2*k)
)
dimnames(d) <- list(c(paste('d', seq_len(k), sep=''),
paste('i', seq_len(k), sep='')), colnames(m)[-n])
d
}
Trying it out on 9 2d-points:
> m <- matrix(c(0,0, 1.1,0, 2,0, 0,1.2, 1.1,1.2, 2,1.2, 0,2, 1.1,2, 2,2),
9, byrow=TRUE, dimnames=list(letters[1:9], letters[24:25]))
> print(dist(m), digits=2)
a b c d e f g h
b 1.1
c 2.0 0.9
d 1.2 1.6 2.3
e 1.6 1.2 1.5 1.1
f 2.3 1.5 1.2 2.0 0.9
g 2.0 2.3 2.8 0.8 1.4 2.2
h 2.3 2.0 2.2 1.4 0.8 1.2 1.1
i 2.8 2.2 2.0 2.2 1.2 0.8 2.0 0.9
> print(sparseDist(m, 3), digits=2)
a b c d e f g h
d1 1.1 0.9 1.2 0.8 0.8 0.8 1.1 0.9
d2 1.2 1.2 1.5 1.1 0.9 1.2 2.0 NA
d3 1.6 1.5 2.0 1.4 1.2 2.2 NA NA
i1 2.0 3.0 6.0 7.0 8.0 9.0 8.0 9.0
i2 4.0 5.0 5.0 5.0 6.0 8.0 9.0 NA
i3 5.0 6.0 9.0 8.0 9.0 7.0 NA NA
And trying it on a larger problem (10k points). Still, on 100k points and more dimensions it will take a long time (like 15-30 minutes).
n<-1e4; m<-3; m=matrix(runif(n*m), n)
system.time( d <- sparseDist(m, 3) ) # 9 seconds on my machine...
P.S. Just noted that you posted an answer as I was writing this: the solution here is roughly twice as fast because it doesn't calculate the same distance twice (the distance between points 1 and 13 is the same as between points 13 and 1).
For now I am using the following, inspired by this answer. The output is a n x k matrix where element (i,k) is the index of the data point that is the kth closest to i.
n <- 10
d <- 3
x <- matrix(rnorm(n * d), ncol = n)
min.k.dists <- function(x,k=5) {
apply(x,2,function(r) {
b <- colSums((x - r)^2)
o <- order(b)
o[1:k]
})
}
min.k.dists(x) # first row should be 1:ncol(x); these points have distance 0
dist(t(x)) # can check answer against this
If one is worried about how ties are handled and whatnot, perhaps rank() should be incorporated.
The above code seems somewhat fast, but I'm sure it could be improved (though I don't have time to go the C or fortran route). So I'm still open to fast and sparse implementations of the above.
Below I include a parallelized version that I ended up using:
min.k.dists <- function(x,k=5,cores=1) {
require(multicore)
xx <- as.list(as.data.frame(x))
names(xx) <- c()
m <- mclapply(xx,function(r) {
b <- colSums((x - r)^2)
o <- order(b)
o[1:k]
},mc.cores=cores)
t(do.call(rbind,m))
}
If you want to keep the logic of your min.k.dist function and return duplicate distances, you might want to consider modifying it a bit. It seems pointless to return the first line with 0 distance, right? ...and by incorporating some of the tricks in my other answer, you can speed up your version by some 30%:
min.k.dists2 <- function(x, k=4L) {
k <- max(2L, k + 1L)
apply(x, 2, function(r) {
sort.list(colSums((x - r)^2), na.last=NA, method='quick')[2:k]
})
}
> n<-1e4; m<-3; m=matrix(runif(n*m), n)
> system.time(d <- min.k.dists(t(m), 4)) #To get 3 nearest neighbours and itself
user system elapsed
17.26 0.00 17.30
> system.time(d <- min.k.dists2(t(m), 3)) #To get 3 nearest neighbours
user system elapsed
12.7 0.0 12.7

Resources