Alternative way to loop for faster computing - r

Let's say I have an n*p data frame.
I have computed a list of n matrices of dimension p*p (named listMat in the R script below), in which each matrix is the distance matrix between the p variables for one of the n respondents.
I want to compute an n*n matrix called normMat, in which each element is the norm of the difference between a pair of distance matrices. For example, normMat[1,2] is the norm of a matrix named diffMat, where diffMat is the difference between the 1st and the 2nd distance matrices in listMat.
I wrote the following script, which works fine, but I'm wondering whether there is a more efficient way to write it that avoids the loops (using, for example, lapply) and makes the script run faster.
# example with n = 3 distance matrices between p = 5 variables
x <- abs(matrix(rnorm(25),5,5))
y <- abs(matrix(rnorm(25),5,5))
z <- abs(matrix(rnorm(25),5,5))
listMat <- list(x, y, z)
n <- length(listMat) # number of respondents
normMat <- matrix(NA,n,n)
for (numRow in 1:n){
  for (numCol in 1:n){
    # Frobenius norm of the difference between two distance matrices
    diffMat <- listMat[[numRow]] - listMat[[numCol]]
    normMat[numRow, numCol] <- norm(diffMat, type="F")
  }
}
Thanks for your help.

Try:
normMat <- function(x, y) {
norm(x-y, type="F")
}
sapply(listMat, function(x) sapply(listMat, function(y) normMat(x,y)))
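Another option, offered as a sketch and not as the only way to do it: the Frobenius norm of a difference of two matrices equals the Euclidean distance between the two matrices flattened into vectors, so base R's dist can produce the whole n*n result in one call. The random matrices and the names flat, normMatDist and normMatRef below are assumptions made for illustration.

```r
set.seed(1)
# Three hypothetical 5 x 5 matrices standing in for listMat
listMat <- replicate(3, abs(matrix(rnorm(25), 5, 5)), simplify = FALSE)

# Flatten each matrix into one row: the result is n x (p*p)
flat <- t(sapply(listMat, as.vector))

# Euclidean distance between flattened rows = Frobenius norm of the difference
normMatDist <- as.matrix(dist(flat))

# Reference result from the nested sapply approach
normMatRef <- sapply(listMat, function(x)
  sapply(listMat, function(y) norm(x - y, type = "F")))

# Compare the two results, ignoring the dimnames that dist() adds
all.equal(unname(normMatDist), normMatRef)
```

Because dist computes each pair only once, this also avoids evaluating both normMat[i,j] and normMat[j,i].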

Related

Implementing KNN with different distance metrics using R

I am working on a dataset in order to compare the effect of different distance metrics. I am using the KNN algorithm.
The KNN algorithm in R uses the Euclidean distance by default, so I wrote my own version. I would like to find the number of correct class label matches between the nearest neighbor and the target.
I prepared the data first (the result is called wdbc_n) and chose K=1. I used the Euclidean distance as a test.
library(philentropy)
knn <- function(xmat, k,method){
n <- nrow(xmat)
if (n <= k) stop("k can not be more than n-1")
neigh <- matrix(0, nrow = n, ncol = k)
for(i in 1:n) {
ddist<- distance(xmat, method)
neigh[i, ] <- order(ddist)[2:(k + 1)]
}
return(neigh)
}
wdbc_nn <-knn(wdbc_n ,1,method="euclidean")
Hoping to get a similar result to the paper ("on the surprising behavior of distance metrics in high dimensional space") (https://bib.dbvis.de/uploadedFiles/155.pdf, page 431, table 3).
My question is
Am I right or wrong with the codes?
Any suggestions or reference that will guide me will be highly appreciated.
EDIT
My data (breast-cancer-wisconsin)(wdbc) dimension is
569 32
After normalizing and removing the id and target column the dimension is
dim(wdbc_n)
569 30
The train and test split is given by
wdbc_train<-wdbc_n[1:469,]
wdbc_test<-wdbc_n[470:569,]
Am I right or wrong with the codes?
Your code is wrong.
The call to the distance function took about 3 seconds every time on my rather recent PC, so I only did the first 30 rows for k=3 and noticed that every row of the neigh matrix was identical. Why is that? Take a look at this line:
ddist<- distance(xmat, method)
Each iteration feeds the whole xmat matrix to the distance function, then uses only the first values of the resulting matrix. This calculates the distances between all the rows of the data set, and does that n times, discarding everything except the start of the result each time. That is not what you want to do. The knn algorithm is supposed to calculate, for each row in the test set, the distance to each row in the training set.
Let's take a look at the documentation for the distance function:
distance(x, method = "euclidean", p = NULL, test.na = TRUE, unit =
"log", est.prob = NULL)
x a numeric data.frame or matrix (storing probability vectors) or a
numeric data.frame or matrix storing counts (if est.prob is
specified).
(...)
in case nrow(x) = 2 : a single distance value. in case nrow(x) > 2 :
a distance matrix storing distance values for all pairwise probability
vector comparisons.
In your specific case (knn classification), you want to use the 2 row version.
One last thing: you used order, which returns the positions of the k smallest distances in the ddist vector (order sorts in increasing order by default). I think what you want is the distances themselves, so you need to use sort instead of order.
Based on your code and the example in Lantz (2013) that your code seemed to be based on, here is a complete working solution. I took the liberty to add a few lines to make a standalone program.
Standalone working solution(s)
library(philentropy)
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
knn <- function(train, test, k, method){
n.test <- nrow(test)
n.train <- nrow(train)
if (n.train + n.test <= k) stop("k can not be more than n-1")
neigh <- matrix(0, nrow = n.test, ncol = k)
ddist <- NULL
for(i in 1:n.test) {
for(j in 1:n.train) {
xmat <- rbind(test[i,], train[j,]) # 2-row matrix combining the current test and train rows
ddist[j] <- distance(as.data.frame(xmat), method = method) # distance between the two rows, stored at position j (k must not be passed here: the third argument of distance is p)
}
neigh[i, ] <- sort(ddist)[2:(k + 1)]
}
return(neigh)
}
wbcd <- read.csv("https://resources.oreilly.com/examples/9781784393908/raw/ac9fe41596dd42fc3877cfa8ed410dd346c43548/Machine%20Learning%20with%20R,%20Second%20Edition_Code/Chapter%2003/wisc_bc_data.csv")
rownames(wbcd) <- wbcd$id
wbcd$id <- NULL
wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))
wbcd_train<-wbcd_n[1:469,]
wbcd_test<-wbcd_n[470:569,]
wbcd_nn <-knn(wbcd_train, wbcd_test ,3, method="euclidean")
Do note that this solution might be slow because of the numerous (100 times 469) calls to the distance function. However, since we are only feeding 2 rows at a time into the distance function, it makes the execution time manageable.
Now does that work?
The two first test rows using the custom knn function:
          [,1]      [,2]      [,3]
[1,] 0.3887346 0.4051762 0.4397497
[2,] 0.2518766 0.2758161 0.2790369
Let us compare with the equivalent function in the FNN package:
library(FNN)
alt.class <- get.knnx(wbcd_train, wbcd_test, k=3, algorithm = "brute")
alt.class$nn.dist
          [,1]      [,2]      [,3]
[1,] 0.3815984 0.3887346 0.4051762
[2,] 0.2392102 0.2518766 0.2758161
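Conclusion: not too shabby. Note that the custom results are shifted by one column relative to FNN: the test rows are not part of the training set, so sort(ddist)[2:(k + 1)] skips the nearest neighbour, and sort(ddist)[1:k] would reproduce the FNN distances.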
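For the Euclidean case specifically, the row-by-row loop can also be replaced by matrix algebra in base R. The following is a minimal sketch on hypothetical random data (train, test, dmat, nn_index and nn_dist are illustrative names, not part of the code above); it does not cover the other metrics offered by philentropy.

```r
set.seed(42)
train <- matrix(runif(20 * 4), nrow = 20)  # hypothetical training rows
test  <- matrix(runif(5 * 4),  nrow = 5)   # hypothetical test rows
k <- 3

# Squared Euclidean distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * a.b
d2 <- outer(rowSums(test^2), rowSums(train^2), "+") - 2 * test %*% t(train)
dmat <- sqrt(pmax(d2, 0))  # clamp tiny negative values caused by rounding

# For each test row: indices and distances of the k nearest training rows
nn_index <- t(apply(dmat, 1, function(d) order(d)[1:k]))
nn_dist  <- t(apply(dmat, 1, function(d) sort(d)[1:k]))
```

Here order gives the neighbour positions (useful for looking up class labels) and sort gives the distances themselves, both taken from position 1 because the test rows are not in the training set.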

Calculate an n-by-n matrix using values in 2 vectors (of length n) in R

I'm trying to calculate an n-by-n matrix in R using the values from two vectors of length n.
For example, let's say I have the following vectors.
formula f(x,y)=x+y
x<-c(1,2,3)
y<-c(8,9,10)
z should be a 3-by-3 matrix where z[1,1] is f(x[1], y[1]), z[1,2] is f(x[1], y[2]), and so on. Is there any way to perform such a calculation in R?
You can try outer
outer(x, y, FUN= f)
where
f <- function(x,y) x+y
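As a small illustration with the vectors from the question: outer calls FUN once on the fully expanded argument vectors, so FUN has to be vectorized. The function g below is a hypothetical non-vectorized example, wrapped with Vectorize so that outer can use it.

```r
x <- c(1, 2, 3)
y <- c(8, 9, 10)

f <- function(x, y) x + y
z <- outer(x, y, FUN = f)   # z[i, j] is f(x[i], y[j])

# A hypothetical scalar-only function: `if` cannot handle vectors,
# so it is wrapped with Vectorize() before being passed to outer()
g <- function(a, b) if (a > 1) a + b else b
z2 <- outer(x, y, FUN = Vectorize(g))
```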

Matrix computation with for loop

I am newcomer to R, migrated from GAUSS because of the license verification issues.
I want to speed up the following code, which creates an n×k matrix A. Given the n×1 vector x and the parameter vectors mu and sig (both k-dimensional), A is created as A[i,j]=dnorm(x[i], mu[j], sig[j]). The following code works fine for small values such as n=40, k=4, but slows down significantly when n is around 10^6 and k is about the same size as n^{1/3}.
I am running a simulation experiment to verify bootstrap validity, so I need to compute the matrix A repeatedly, #ofsimulation × #bootstrap times, and this becomes rather time-consuming because I want to experiment with many different values of n and k. I vectorized the code as much as I could (thanks to the vector arguments of dnorm), but is a further speed-up possible?
Preemptive thanks for any help.
x = rnorm(40)
mu = c(-1,0,4,5)
sig = c(2^2,0.5^2,2^2,3^2)
n = length(x)
k = length(mu)
A = matrix(NA,n,k)
for(j in 1:k){
A[,j]=dnorm(x,mu[j],sig[j])
}
Your method can be put into a function like this
A.fill <- function(x,mu,sig) {
k <- length(mu)
n <- length(x)
A <- matrix(NA,n,k)
for(j in 1:k) A[,j] <- dnorm(x,mu[j],sig[j])
A
}
and it's clear that you are filling the matrix A column by column.
R stores the entries of a matrix columnwise (just like Fortran).
This means that the matrix can be filled with a single call of dnorm using suitable repetitions of x, mu, and sig. The vector z will have the columns of the desired matrix stacked, and the matrix to be returned can then be formed from that vector just by specifying the number of rows and columns. See the following function
B.fill <- function(x,mu,sig) {
k <- length(mu)
n <- length(x)
z <- dnorm(rep(x,times=k),rep(mu,each=n),rep(sig,each=n))
B <- matrix(z,nrow=n,ncol=k)
B
}
Let's make an example with your data and test this as follows:
N <- 40
set.seed(11)
x <- rnorm(N)
mu <- c(-1,0,4,5)
sig <- c(2^2,0.5^2,2^2,3^2)
A <- A.fill(x,mu,sig)
B <- B.fill(x,mu,sig)
all.equal(A,B)
# [1] TRUE
Note that n does not need to be a multiple of k: rep expands x, mu, and sig to length n*k in every case.
Addition
As noted in the comments B.fill is quite slow for large values of n.
The reason lies in the construct rep(...,each=...).
So is there a way to speed up A.fill?
I tested this function:
C.fill <- function(x,mu,sig) {
k <- length(mu)
n <- length(x)
sapply(1:k,function(j) dnorm(x,mu[j],sig[j]), simplify=TRUE)
}
This function is about 20% faster than A.fill.
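Two further base R spellings of the same computation, given only as a sketch: outer over the column index, and vapply with a declared result length. Whether either is faster than A.fill or C.fill depends on n and k and is not measured here, so it is worth timing on the real problem size. The names D and E are assumptions for illustration.

```r
set.seed(11)
x   <- rnorm(40)
mu  <- c(-1, 0, 4, 5)
sig <- c(2^2, 0.5^2, 2^2, 3^2)

# Reference: column-by-column loop, as in A.fill
A <- matrix(NA_real_, length(x), length(mu))
for (j in seq_along(mu)) A[, j] <- dnorm(x, mu[j], sig[j])

# outer() expands x against the column index j and calls dnorm once
D <- outer(x, seq_along(mu), function(xi, j) dnorm(xi, mu[j], sig[j]))

# vapply() builds the columns one by one and checks each has length n
E <- vapply(seq_along(mu), function(j) dnorm(x, mu[j], sig[j]),
            numeric(length(x)))

all.equal(A, D)
all.equal(A, E)
```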

How to combine data from different columns, e.g. mean of surrounding columns for a given column

I am trying to smooth a matrix by attributing the mean value of a window covering n columns around a given column. I've managed to do it but I'd like to see how would be 'the R way' of doing it as I am making use of for loops. Is there a way to get this using apply or some function of the same family?
Example:
# create a toy matrix
mat <- matrix(ncol=200);
for(i in 1:100){ mat <- rbind(mat,sample(1:200, 200) )}
# quick visualization
image(t(mat))
This is the matrix before smoothing:
I wrote the function smooth_row_mat, which takes a matrix and the length of the smoothing kernel:
smooth_row_mat <- function(k, k.d=5){
k.range <- (k.d + 2):(ncol(k) - k.d - 1)
k.smooth <- matrix(nrow=nrow(k))
for( i in k.range){
if (i %% 10 == 0) cat('\r',round(i/length(k.range), 2))
k.smooth <- cbind( k.smooth, rowMeans(k[,c( (i-1-k.d):(i-1) ,i, (i+1):(i + 1 + k.d) )]) ) # window from i-1-k.d to i+1+k.d
}
return(k.smooth)
}
Now we use smooth_row_mat() with mat
mat.smooth <- smooth_row_mat(mat)
And we have successfully smoothed, on a row basis, the content of the matrix.
This is the matrix after:
This method is fine for such a small matrix. My real matrices are around 40,000 x 400; it still works on those, but I'd like to improve my R skills.
Thanks!
You can apply a filter (running mean) across each row of your matrix as follows:
apply(k, 1, filter, rep(1/k.d, k.d))
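A minimal sketch of how this might be used, on a hypothetical 2 x 5 toy matrix. Two details are worth keeping in mind: apply returns one column per input row, so the result needs to be transposed, and calling stats::filter explicitly avoids picking up a filter function from another attached package. The helper name smooth_rows is an assumption for illustration.

```r
mat <- matrix(1:10, nrow = 2, byrow = TRUE)  # hypothetical 2 x 5 toy matrix
w <- 3                                       # window width (odd, so it is centred)

smooth_rows <- function(m, width) {
  # Centred running mean along each row; the edges become NA
  res <- apply(m, 1, function(r)
    as.numeric(stats::filter(r, rep(1 / width, width), sides = 2)))
  t(res)  # apply() returned rows as columns, so transpose back
}

sm <- smooth_rows(mat, w)
```

With this input, the first row 1, 2, 3, 4, 5 becomes NA, 2, 3, 4, NA, because the window cannot be centred on the first and last columns.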
Here's how I'd do it, with the raster package.
First, create a matrix filled with random data and coerce it to a raster object.
library(raster)
r <- raster(matrix(sample(200, 200*200, replace=TRUE), nc=200))
plot(r)
Then use the focal function to calculate a neighbourhood mean for a neighbourhood of n cells either side of the focal cell. The values in the matrix of weights you provide to the focal function determine how much the value of each cell contributes to the focal summary. For a mean, we say we want each cell to contribute 1/n, so we fill a matrix of n columns, with values 1/n. Note that n must be an odd number, and the cell in the centre of the matrix is considered the focal cell.
n <- 3
smooth_r <- focal(r, matrix(1/n, nc=n))
plot(smooth_r)

compute samples variance without loops

Here is what I want to do:
I have a time series data frame with let us say 100 time-series of
length 600 - each in one column of the data frame.
I want to pick up 10 of the time-series randomly and then assign them
random weights that sum up to one. Using those I want to compute the
variance of the sum of the 10 weighted time series variables (e.g.
convex combination).
The df is in the form
v1,v2,v2.....v100
1,5,6,.......9
2,4,6,.......10
3,5,8,.......6
2,2,8,.......2
etc
I can compute it inside a loop, but R is vector-oriented and a loop is not efficient.
ntrials = 10000
ts.sd = NULL
for (x in 1:ntrials)
{
temp = t(weights[,x]) %*% cov(df[, samples[, x]]) %*% weights[, x]
ts.sd = cbind(ts.sd, temp)
}
Not sure what type of "random" you want for your weights... so I'll use a uniform distribution, scaled so that the weights sum to one:
x=as.data.frame(matrix(sample(1:20, 100*600, replace=TRUE), ncol=100))
myfun <- function(inc, DF=x) {
w = runif(10)
w = w / sum(w)
t(w) %*% cov(DF[, sample(seq_along(DF), 10)]) %*% w
}
lapply(1:ntrials, myfun)
However, this isn't really avoiding loops per se, since lapply is just an efficient looping construct. That said, for loops in R aren't inherently bad or inefficient. Growing a data structure, like you're doing with cbind, however, is.
But in this case, since you're only growing it by appending a single element, it really won't change things much. The "correct" version would be to pre-allocate your vector ts.sd using ntrials.
ts.sd = vector(mode='numeric', length=ntrials)
Then, in your loop, assign into it using the loop index x:
for (x in 1:ntrials)
{
temp = t(weights[,x]) %*% cov(df[, samples[, x]]) %*% weights[, x]
ts.sd[x] = temp
}
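A self-contained sketch of the same idea using vapply, which pre-allocates the result and checks that each trial returns a single number. The simulated data frame, the helper one_trial, the name ts.var and the reduced number of trials are all assumptions made for illustration.

```r
set.seed(123)
# Hypothetical data: 100 series of length 600, one per column
df <- as.data.frame(matrix(rnorm(600 * 100), ncol = 100))
ntrials <- 1000  # smaller than in the question, to keep the sketch quick
nsel <- 10       # number of series picked per trial

one_trial <- function(i) {
  idx <- sample(ncol(df), nsel)       # pick 10 columns at random
  w <- runif(nsel)
  w <- w / sum(w)                     # random weights summing to one
  drop(t(w) %*% cov(df[, idx]) %*% w) # variance of the weighted sum
}

ts.var <- vapply(seq_len(ntrials), one_trial, numeric(1))
```

The quadratic form is the variance of the weighted sum, so for any fixed idx and w it agrees with var(as.matrix(df[, idx]) %*% w).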

Resources