R vectorization - subtracting vectors

Given a Euclidean distance function:
eDistance <- function(q, m) {
  apply(m, 1, function(x) {
    sqrt(sum((q - x)^2))
  })
}
And the following matrix:
t = cbind(c(3,4,1,6,3),c(1,9,1,2,1))
I want to find a vectorized way to do the following:
r <- rep(NA, nrow(t))
for (i in 1:nrow(t)) {
  tmp <- eDistance(t[i, ], t)
  r[i] <- sum(tmp)
}
My main concern is performance (run time).
Thank you.

Use the dist() function. Since the distance matrix is symmetric, you want its column sums, which are exactly the totals the loop computes:
colSums(as.matrix(dist(t)))
# 1 2 3 4 5
#13.22454 31.94863 17.64302 18.70368 13.22454
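As a quick sanity check (a sketch reusing the question's eDistance function and matrix), the column sums agree with the loop's result:
eDistance <- function(q, m) apply(m, 1, function(x) sqrt(sum((q - x)^2)))
t <- cbind(c(3, 4, 1, 6, 3), c(1, 9, 1, 2, 1))
loop <- rep(NA, nrow(t))
for (i in 1:nrow(t)) loop[i] <- sum(eDistance(t[i, ], t))
all.equal(unname(colSums(as.matrix(dist(t)))), loop)
# [1] TRUE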

Related

Algorithm that gives you any number n in base 3 in R

I need to write an algorithm that gives you any number n in base 3 in R.
So far I have written this:
NameOfTheFunction <- function(n) {
  while (n != 0) {
    q <- n %/% 3
    r <- n %% 3
    return(r)
    q <- n
  }
}
My problem is that I now need to store every r in a vector. I've never done that and don't quite know how to handle it. I tried to find something on the internet but did not find anything really relevant to this particular situation.
After your function, use:
sapply(vector, FUN=function(n) return(NameOfTheFunction(n)))
What sapply does is take a vector of your choice and repeat the function NameOfTheFunction(n), using every element of your vector in place of n. The result, in this case, is a vector containing the output for each element of your input vector.
For example:
vector <- c(10, 100, 1000, 10000)
NameOfTheFunction <- function(n) {
  while (n != 0) {
    q <- n %/% 3
    r <- n %% 3
    return(r)
    q <- n
  }
}
sapply(vector, NameOfTheFunction)
[1] 1 1 1 1
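Note that the function as written returns after the first remainder, which is why every result above is a single digit. A minimal corrected sketch that actually stores every remainder in a vector (toBase3 is just a hypothetical name) would be:
toBase3 <- function(n) {
  if (n == 0) return(0)
  digits <- integer(0)
  while (n != 0) {
    digits <- c(n %% 3, digits)  # prepend each remainder so the most significant digit ends up first
    n <- n %/% 3
  }
  digits
}
toBase3(10)
# [1] 1 0 1   (10 = 1*9 + 0*3 + 1)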

Closest pair for any of a huge number of points

We are given a huge set of points in the 2D plane. We need to find, for each point, the closest other point within the set. For instance, suppose the initial set is as follows:
foo <- data.frame(x=c(1,2,4,4,10),y=c(1,2,4,4,10))
The output should be like this:
ClosesPair(foo)
2
1
4
3
3 # (could be 4 also)
Any idea?
The traditional approach is to preprocess the data and put it in a data structure, often a k-d tree, for which the "nearest point" query is very fast.
There is an implementation in the nnclust package.
library(nnclust)
foo <- cbind(x=c(1,2,4,4,10),y=c(1,2,4,4,10))
i <- nnfind(foo)$neighbour
plot(foo)
arrows( foo[,1], foo[,2], foo[i,1], foo[i,2] )
Here is an example, all wrapped into a single function. You might want to split it up a bit for optimization.
ClosesPair <- function(foo) {
  dist <- function(i, j) {
    sqrt((foo[i, 1] - foo[j, 1])^2 + (foo[i, 2] - foo[j, 2])^2)
  }
  foo <- as.matrix(foo)
  ClosestPoint <- function(i) {
    indices <- (1:nrow(foo))[-i]
    distances <- sapply(indices, dist, i = i, USE.NAMES = TRUE)
    indices[which.min(distances)]
  }
  sapply(1:nrow(foo), ClosestPoint)
}
ClosesPair(foo)
# [1] 2 1 4 3 3
Of course, it does not handle ties very well.
Use the package spatstat. It has built-in functions to do this sort of thing.
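For example, a sketch using its nearest-neighbour index function nnwhich(); this assumes the default method accepts coordinate vectors, so check ?nnwhich for the exact interface in your spatstat version:
library(spatstat)
foo <- data.frame(x = c(1, 2, 4, 4, 10), y = c(1, 2, 4, 4, 10))
nnwhich(foo$x, foo$y)  # index of each point's nearest neighbour; ties may go either way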

How to use apply instead of for loop for stringMatch function?

I'm trying to calculate the number of pairwise differences between a long list of sequences, and put it back into a matrix form. This is a toy example of what I want to do.
library(MiscPsycho)
b <- c("-BC", "ACB", "---") # Toy example of sequences
workb <- expand.grid(b,b)
new <- c(1:9)
# Need to get rid of this for loop somehow
for (i in 1:9) {
  new[i] <- stringMatch(workb[i, 1], workb[i, 2], normalize = "NO")
}
workb <- cbind(workb, new)
newmat <- reShape(workb$new, id=workb$Var1, colvar=workb$Var2)
a <- c("Subject1", "Subject2", "Subject3") #Relating it back to the subject ID
colnames(newmat) <- a
rownames(newmat) <- a
newmat
I'm not very familiar with the apply functions, but I'd like to use them to replace the for loop, which will probably get slow given that I have a large number of sequences. (The stringMatch function is from MiscPsycho.) Please let me know how to make it more efficient!
Thank you very much!
To get those "pairwise distances" I would have done something like:
Vm <- Vectorize(stringMatch)
nex <- outer(b,b, FUN=Vm, normalize = "NO")
nex
[,1] [,2] [,3]
[1,] 0 3 2
[2,] 3 0 3
[3,] 2 3 0
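If you also want the subject labels from the question on the result, you can add them afterwards:
a <- c("Subject1", "Subject2", "Subject3")
dimnames(nex) <- list(a, a)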
To replace the loop
new <- apply(workb, 1, function(x) stringMatch(x[[1]],x[[2]], normalize="NO"))
I would make a function that takes your index, i, and returns new[i].
myfun <- function(i) {
  stringMatch(workb[i, 1], workb[i, 2], normalize = "NO")
}
Then you can lapply over the row indices (here new happens to be 1:9, so it works as the index vector too):
workb$new <- unlist(lapply(seq_len(nrow(workb)), myfun))
In general, you are using a for loop correctly in R: you have allocated the vector new beforehand and are filling it rather than growing it.

tapply on matrices of data and indices

I am calculating the sums of matrix columns for each group, where the corresponding group labels are contained in the columns of a second matrix. At the moment I am using a loop as follows:
index <- matrix(c("A","A","B","B","B","B","A","A"),4,2)
x <- matrix(1:8,4,2)
for (i in 1:2) {
  tapply(x[,i], index[,i], sum)
}
At the end of the day I need the following result:
1 2
A 3 15
B 7 11
Is there a way to do this using matrix operations, without a loop? On top of that, the real data is large (e.g. 500 x 10000), so it has to be fast.
Thanks in advance.
Here are a couple of solutions:
# 1
ag <- aggregate(c(x), data.frame(index = c(index), col = c(col(x))), sum)
xt <- xtabs(x ~., ag)
# 2
m <- mapply(rowsum, as.data.frame(x), as.data.frame(index))
dimnames(m) <- list(levels(factor(index)), 1:ncol(index))
The second only works if every column of index contains at least one of each level, and it also requires that there be at least 2 levels; however, it's faster.
This is ugly and works but there's a much better way to do it that is more generalizable. Just getting the ball rolling.
data.frame("col1"=as.numeric(table(rep(index[,1], x[,1]))),
"col2"=as.numeric(table(rep(index[,2], x[,2]))),
row.names=names(table(index)))
I still suspect there's a better option, but this seems reasonably fast actually:
index <- matrix(sample(LETTERS[1:4], size = 500 * 10000, replace = TRUE), 500, 10000)
x <- matrix(sample(1:10, 500 * 10000, replace = TRUE), 500, 10000)
rs <- matrix(NA, 4, 10000)
rownames(rs) <- LETTERS[1:4]
for (i in LETTERS[1:4]) {
  tmp <- x
  tmp[index != i] <- 0
  rs[i, ] <- colSums(tmp)
}
It runs in ~0.8 seconds on my machine. I upped the number of categories to four and scaled it up to the size of data you have. But I don't like having to copy x each time.
You can get clever with matrix multiplication, but I think you still have to do one row or column at a time.
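For a single column, for instance, the matrix-multiplication version looks roughly like this (a sketch, not benchmarked):
f <- factor(index[, 1])
ind <- model.matrix(~ 0 + f)   # 0/1 indicator matrix, one column per group
crossprod(ind, x[, 1])         # group sums for column 1: A = 3, B = 7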
You already used tapply; if you add mapply, you can complete your objective. It does the same thing as the for loop.
index <- matrix(c("A","A","B","B","B","B","A","A"),4,2)
x <- matrix(1:8,4,2)
mapply(function(i) tapply(x[, i], index[, i], sum), 1:2)
result:
[,1] [,2]
A 3 15
B 7 11
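Since the anonymous function only uses its index argument, sapply gives the same result here:
sapply(1:2, function(i) tapply(x[, i], index[, i], sum))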

Calculating all distances between one point and a group of points efficiently in R

First of all, I am new to R (I started yesterday).
I have two groups of points, data and centers, the first one of size n and the second of size K (for instance, n = 3823 and K = 10), and for each i in the first set, I need to find j in the second with the minimum distance.
My idea is simple: for each i, let dist[j] be the distance between i and j; then I only need to use which.min(dist) to find what I am looking for.
Each point is an array of 64 doubles, so
> dim(data)
[1] 3823 64
> dim(centers)
[1] 10 64
I have tried with
for (i in 1:n) {
  for (j in 1:K) {
    d[j] <- sqrt(sum((centers[j, ] - data[i, ])^2))
  }
  S[i] <- which.min(d)
}
which is extremely slow (with n = 200, it takes more than 40s!!). The fastest solution that I wrote is
distance <- function(point, group) {
  return(dist(t(array(c(point, t(group)), dim = c(ncol(group), 1 + nrow(group)))))[1:nrow(group)])
}
for (i in 1:n) {
  d <- distance(data[i, ], centers)
  which.min(d)
}
Even if it does a lot of computation that I don't use (because dist(m) computes the distances between all rows of m), it is way faster than the other one (can anyone explain why?), but it is not fast enough for what I need, because it will not be used only once. Also, the distance code is very ugly. I tried to replace it with
distance <- function(point, group) {
  return(dist(rbind(point, group))[1:nrow(group)])
}
but this seems to be twice as slow. I also tried to use dist for each pair, but that is also slower.
I don't know what to do now. It seems like I am doing something very wrong. Any idea on how to do this more efficiently?
PS: I need this to implement k-means by hand (and I do need to do it; it is part of an assignment). I believe I will only need Euclidean distance, but I am not yet sure, so I would prefer code where the distance computation can be replaced easily. stats::kmeans does all the computation in less than one second.
Rather than iterating across data points, you can just condense that to a matrix operation, meaning you only have to iterate across K.
# Generate some fake data.
n <- 3823
K <- 10
d <- 64
x <- matrix(rnorm(n * d), ncol = n)
centers <- matrix(rnorm(K * d), ncol = K)
system.time(
  dists <- apply(centers, 2, function(center) {
    colSums((x - center)^2)
  })
)
Runs in:
   user  system elapsed
  0.100   0.008   0.108
on my laptop.
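dists is then an n x K matrix (points in rows, centers in columns), so the assignment the question needs is just the per-row minimum, e.g.:
nearest <- apply(dists, 1, which.min)  # or max.col(-dists), as another answer suggests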
rdist() is an R function from the fields package that can quickly calculate the distances between two sets of points given as matrices.
https://www.image.ucar.edu/~nychka/Fields/Help/rdist.html
Usage:
library(fields)
#generating fake data
n <- 10
m <- 5
d <- 3
x <- matrix(rnorm(n * d), ncol = d)
y <- matrix(rnorm(m * d), ncol = d)
rdist(x, y)
[,1] [,2] [,3] [,4] [,5]
[1,] 1.512383 3.053084 3.1420322 4.942360 3.345619
[2,] 3.531150 4.593120 1.9895867 4.212358 2.868283
[3,] 1.925701 2.217248 2.4232672 4.529040 2.243467
[4,] 2.751179 2.260113 2.2469334 3.674180 1.701388
[5,] 3.303224 3.888610 0.5091929 4.563767 1.661411
[6,] 3.188290 3.304657 3.6668867 3.599771 3.453358
[7,] 2.891969 2.823296 1.6926825 4.845681 1.544732
[8,] 2.987394 1.553104 2.8849988 4.683407 2.000689
[9,] 3.199353 2.822421 1.5221291 4.414465 1.078257
[10,] 2.492993 2.994359 3.3573190 6.498129 3.337441
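Applied to the question's data and centers, the nearest center for each point would then be (a sketch):
S <- apply(rdist(data, centers), 1, which.min)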
You may want to have a look at the apply functions.
For instance, this code
for (j in 1:K) {
  d[j] <- sqrt(sum((centers[j, ] - data[i, ])^2))
}
can easily be substituted by something like
dt <- data[i, ]
d <- apply(centers, 1, function(x) { sqrt(sum((x - dt)^2)) })
You can definitely optimise it more, but you get the point, I hope.
dist works fast because it's vectorized and calls internal C functions.
Your loop code could be vectorized in many ways.
For example, to compute the distances between data and centers you could use outer:
diff_ij <- function(i,j) sqrt(rowSums((data[i,]-centers[j,])^2))
X <- outer(seq_len(n), seq_len(K), diff_ij)
This gives you an n x K matrix of distances, and it should be way faster than the loop.
Then you could use max.col to find the maximum in each row (see the help; there are some nuances when there are ties). X must be negated because we are searching for the minimum.
CL <- max.col(-X)
To be efficient in R you should vectorize as much as possible. Loops can in many cases be replaced by a vectorized substitute. Check the help for rowSums (which also describes colSums, rowMeans and colMeans), pmax, and cumsum. You could search SO, e.g.
https://stackoverflow.com/search?q=[r]+avoid+loop (copy & paste this link; I don't know how to make it clickable) for some examples.
My solution:
# data is a matrix where each row is a point
# point is a vector of values
euc.dist <- function(data, point) {
  apply(data, 1, function(row) sqrt(sum((point - row)^2)))
}
You can try it like this:
x <- matrix(rnorm(25), ncol=5)
euc.dist(x, x[1,])
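Applied to the original question (a sketch using the question's data and centers), the nearest center for each point would be:
S <- sapply(seq_len(nrow(data)), function(i) which.min(euc.dist(centers, data[i, ])))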
