How to speed up an apply function in too many loops - r
Firstly, I’d like to apologize. I’m learning R on my own, so I couldn’t simplify my problem and decided to just write a short version of my real variables here.
I’m trying to implement a variant of the Maximum Likelihood classifier in R. I have some variables for each class stored in vectors and lists (each position refers to one class), and I want to apply a function to the rows of a matrix that contains the data I want to classify. The problem is that I need the results of that function separated by class. So far, I’m doing this:
cc<-numeric(2)
mm<-vector("list",2) # list(length=2) would create a one-element list named "length"
ii<-vector("list",2)
temp1<-matrix(nrow=16,ncol=6)
temp1<-as.data.frame(temp1)
temp1[]<-c(256,235,194,235,215,173,215,215,194,215,215,215,194,173,152,215,
430,388,388,388,388,430,430,430,388,346,346,388,388,388,346,388,
283,317,283,283,248,283,283,283,214,214,248,283,214,283,214,248,
3701,3450,3576,3826,3534,3450,3868,4035,3450,3493,3450,3701,3534,3242,3032,3116,
1646,1589,1589,1646,1646,1589,1646,1732,1560,1475,1589,1589,1675,1532,1503,1418,
474,556,556,515,556,556,597,637,556,515,515,515,515,515,434,434)
temp2<- matrix(nrow=11,ncol=6)
temp2<-as.data.frame(temp2)
temp2[]<-c(422,463,462,483,546,525,483,566,546,483,546,
770,812,770,812,854,854,812,939,939,854,981,
1038,1175,1004,1141,1209,1209,1038,1311,1311,1175,1311,
2359,2359,2275,2359,2359,2359,2359,2401,2359,2401,2401,
2445,2531,2417,2588,2759,2617,2388,2674,2730,2645,2731,
1413,1413,1373,1495,1618,1535,1413,1535,1659,1535,1618)
cc[1]<-det(cov(temp1))
cc[2]<-det(cov(temp2))
mm[[1]]<-as.numeric(sapply(temp1,"mean"))
mm[[2]]<-as.numeric(sapply(temp2,"mean"))
ii[[1]]<-solve(cov(temp1))
ii[[2]]<-solve(cov(temp2))
data<-matrix(nrow=10,ncol=6)
data<-as.data.frame(data)
data[]<-c(181,203,224,203,203,224,181,181,161,161,
338,338,338,338,296,296,338,381,338,296,
208,242,208,208,208,208,208,242,208,173,
3164,2954,2660,2787,2744,2787,2534,3457,2870,2912,
1476,1505,1391,1332,1304,1391,1132,1591,1448,1304,
474,474,474,515,392,432,432,556,515,474)
for (k in 1:2){
Pxi<-apply(data,1,function(x)1/(2*pi^(6/2)*cc[k]^(1/2))*exp(-1/2*t(as.numeric(x-mm[[k]]))%*%ii[[k]]%*%(as.numeric(x-mm[[k]]))))
if (k==1) {rule<-Pxi} else {rule<-cbind(rule,Pxi)}
}
This is what I get:
rule
rule Pxi
[1,] 4.316396e-13 0.000000e+00
[2,] 6.835553e-15 7.970888e-284
[3,] 8.674921e-21 2.687251e-145
[4,] 5.923777e-19 8.020048e-189
[5,] 5.627127e-16 8.064007e-184
[6,] 2.495667e-17 5.738550e-209
[7,] 6.311390e-22 8.913098e-97
[8,] 1.413893e-12 0.000000e+00
[9,] 5.521715e-15 1.619401e-221
[10,] 5.212091e-17 5.810407e-254
Well, as you can imagine, data is actually much bigger than in my example, and this last loop takes a very long time when the number of classes k is large. Any suggestions on how to make it faster?
It should be faster if you work with matrices. Here is a suggestion to replace the for loop:
data <- as.matrix(data)
const <- 2*pi^(6/2)
do.call(cbind, lapply(1L:2L, function(k) {
m <- sweep(data, 2L, mm[[k]])
#1/(const*cc[k]^(1/2))* exp(-1/2 * diag(m %*% ii[[k]] %*% t(m)))
1/(const*cc[k]^(1/2))* exp(-1/2 * rowSums((m %*% ii[[k]]) * m))
}))
The rowSums() form, which replaces the original diag(m %*% ii[[k]] %*% t(m)), comes from the question "compute only diagonals of matrix multiplication in R".
output:
[,1] [,2]
[1,] 4.316396e-13 0.000000e+00
[2,] 6.835553e-15 7.970888e-284
[3,] 8.674921e-21 2.687251e-145
[4,] 5.923777e-19 8.020048e-189
[5,] 5.627127e-16 8.064007e-184
[6,] 2.495667e-17 5.738550e-209
[7,] 6.311390e-22 8.913098e-97
[8,] 1.413893e-12 0.000000e+00
[9,] 5.521715e-15 1.619401e-221
[10,] 5.212091e-17 5.810407e-254
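The rowSums() shortcut used in this answer can be checked on a small example. The sketch below uses made-up random matrices (m and A are illustrative names, not objects from the question) and compares the full diag() form with the row-wise form:

```r
set.seed(42)
m <- matrix(rnorm(12), nrow = 4)         # 4 rows, 3 columns (made-up data)
A <- crossprod(matrix(rnorm(9), 3))      # a symmetric 3 x 3 matrix (made-up)

full  <- diag(m %*% A %*% t(m))          # builds a 4 x 4 matrix, keeps only the diagonal
short <- rowSums((m %*% A) * m)          # computes just the diagonal entries

all.equal(full, short)
```

The first form allocates an n x n matrix for n rows, which is why it becomes expensive when data has many rows; the second never builds it.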
Using cbind() in a loop is very expensive. Instead, you should assign the intermediate loop results to a list and then do.call(cbind, rule) at the end:
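A minimal sketch of that pattern, assuming a hypothetical per-class function score_class() and toy data x that stand in for the real computation in the question:

```r
# Hypothetical stand-in for the per-class computation; returns one value per row of x.
score_class <- function(k, x) rowSums(x) * k

x <- matrix(1:6, nrow = 3)               # toy data, 3 rows
n_classes <- 2

rule <- vector("list", n_classes)        # pre-allocate one slot per class
for (k in seq_len(n_classes)) {
  rule[[k]] <- score_class(k, x)         # store the result, no cbind() inside the loop
}
rule <- do.call(cbind, rule)             # bind all columns once, after the loop
dim(rule)
```

Growing a matrix with cbind() copies everything accumulated so far on each iteration, whereas filling a pre-allocated list and binding once avoids those repeated copies.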
Regarding why the apply() statement is slow, there are a lot of operations to go through for each row of data looped through. Instead, it is better to try to do matrix operations (or a function) all at once.
This uses the mahalanobis() function to simplify what's in the exp() call. It turns out that the function uses exactly the same approach as @chinsoon12's answer.
1 / (2*pi^(6/2)*det(cov(temp1))^(1/2))*exp(-1 / 2 * mahalanobis(data, colMeans(temp1), cov(temp1)))
mahalanobis
#function (x, center, cov, inverted = FALSE, ...)
#{
# x <- if (is.vector(x))
# matrix(x, ncol = length(x))
# else as.matrix(x)
# if (!isFALSE(center))
# x <- sweep(x, 2L, center)
# if (!inverted)
# cov <- solve(cov, ...)
# setNames(rowSums(x %*% cov * x), rownames(x))
#}
#<bytecode: 0x000000000c217d80>
#<environment: namespace:stats>
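As a quick sanity check, the sketch below compares mahalanobis() with a row-by-row quadratic form like the one in the question's apply() call. The data are made up (x, train, mu and S are illustrative names), so this only shows that the two forms agree, not anything about the original data:

```r
set.seed(1)
x     <- matrix(rnorm(20), nrow = 5)     # 5 observations to classify (made-up)
train <- matrix(rnorm(80), nrow = 20)    # made-up training sample, 4 variables
mu <- colMeans(train)
S  <- cov(train)

# Row-by-row quadratic form, as in the question's apply() call
manual <- apply(x, 1, function(r) drop(t(r - mu) %*% solve(S) %*% (r - mu)))

# Vectorised equivalent from the stats package
fast <- mahalanobis(x, mu, S)

all.equal(manual, fast)
```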
I would approach this by first making a list() of your temp data.frames and then using lapply() to loop through them:
tmps <- list(temp1, temp2)
do.call(cbind,
lapply(tmps,
function(tmp) {
n = length(tmp)
cov_tmp <- cov(tmp)
1 / (2*pi^(n/2)*det(cov_tmp)^(1/2))*exp(-1 / 2 * mahalanobis(data, colMeans(tmp), cov_tmp))
}
)
)
[,1] [,2]
[1,] 4.316396e-13 0.000000e+00
[2,] 6.835553e-15 7.970888e-284
[3,] 8.674921e-21 2.687251e-145
[4,] 5.923777e-19 8.020048e-189
[5,] 5.627127e-16 8.064007e-184
[6,] 2.495667e-17 5.738550e-209
[7,] 6.311390e-22 8.913098e-97
[8,] 1.413893e-12 0.000000e+00
[9,] 5.521715e-15 1.619401e-221
[10,] 5.212091e-17 5.810407e-254
Reference: http://sar.kangwon.ac.kr/etc/rs_note/rsnote/cp11/cp11-7.htm
Related
Populate matrix rows by looping over vector where loop interval is larger than 1
I'm trying to populate a matrix via a loop. Given a vector of values and an empty matrix:
n <- c(seq(10,100,10))
out <- matrix(ncol=1, nrow=length(n))
running this simple loop as an example:
for(i in n){
  dostuff <- i*2
  out[i,1] <- dostuff
}
gives the error message
Error in `[<-`(`*tmp*`, i, 1, value = dostuff) : subscript out of bounds
as the interval within the vector that the loop is based on is larger than 1 and therefore does not fit the 1:10 index of the matrix rows. Removing i from the out row index only repeats the result of the last iteration:
for(i in n){
  dostuff <- i*2
  out[,1] <- dostuff
}
There is obviously something fundamental about loops that I don't understand. I have looked, e.g. here and here, but have not been able to find a good solution. This is the result I'm looking for:
      [,1]
 [1,]   20
 [2,]   40
 [3,]   60
 [4,]   80
 [5,]  100
 [6,]  120
 [7,]  140
 [8,]  160
 [9,]  180
[10,]  200
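One possible way to get the desired result, sketched here as an assumption about what the asker wants rather than as an answer from the thread, is to loop over the positions of n instead of its values, so that the index always stays within 1:10:

```r
n <- seq(10, 100, 10)
out <- matrix(nrow = length(n), ncol = 1)

for (i in seq_along(n)) {                # i is the position 1..10, not the value 10..100
  out[i, 1] <- n[i] * 2
}

# Vectorised alternative without a loop
out2 <- matrix(n * 2, ncol = 1)
all.equal(out, out2)
```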
Calculating distance between two points for multiple records for matching rows - loop over rows of two matrices
I have two matrices with coordinates and I am trying to compute distances between points in matching rows, i.e. between row 1 in the first matrix and row 1 in the second matrix. What I am getting is the computed distance between row 1 and all the other rows. This is creating memory issues as I have 800,000 rows. Does anyone know how to ask for that? I am using
dist1 <- distm(FareStageMatrix[1:25000,], LSOACentroidMatrix[1:25000,], fun=distHaversine)
I am trying to create something like this, but it doesn't seem to work:
for(i in 1:nrow(FareStageMatrix)) {
  for(j in 1:nrow(LSOACentroidMatrix)) {
    my_matrix[i] <- my_matrix[distm(FareStageMatrix[i], LSOACentroidMatrix[i], fun=distHaversine)]
  }
}
changed to
for (i in 1:nrow(FareStageMatrix)){
  for (i in 1:nrow(LSOACentroidMatrix)){
    r1<-FareStageMatrix[i,]
    r2<-LSOACentroidMatrix[i,]
    results[i]<-distm(r1, r2, fun=distHaversine)
  }
}
Is that something that should be working?
It seems I have managed to find a solution to that:
results<-matrix(NA,nrow(FareStageMatrix))
for (i in 1:nrow(FareStageMatrix)){ ## one pass over matching rows; no inner loop is needed
  r1<-FareStageMatrix[i,]
  r2<-LSOACentroidMatrix[i,]
  results[i]<-distm(r1, r2, fun=distHaversine) ## Example function
}
where FareStageMatrix and LSOACentroidMatrix are matrices with coordinates. It calculates one distance for each pair of matching rows.
I've adapted geosphere's distGeo function (geodesic distance) for this purpose.
library(geosphere)
source("https://raw.githubusercontent.com/RomanAbashin/distGeo_v/master/distGeo_v.R")
Data
set.seed(1702)
m1 <- matrix(runif(20000, -10, 10), ncol = 2)
m2 <- matrix(runif(20000, -10, 10), ncol = 2)
Code
result <- distGeo_v(m1[, 1], m1[, 2], m2[, 1], m2[, 2])
Result
> head(m1)
          [,1]      [,2]
[1,]  8.087152  9.227607
[2,]  9.528334  9.103403
[3,]  5.637921 -2.213228
[4,] -2.473758 -9.812986
[5,] -2.844036 -5.245779
[6,] -4.824615 -4.330890
> head(m2)
           [,1]       [,2]
[1,]  0.1673027  0.6483745
[2,] -2.5033184  0.1386050
[3,]  4.8589785  5.1996968
[4,]  8.3239454 -8.9810949
[5,]  0.8280422 -7.8272613
[6,] -6.2633738 -5.8725562
> head(result)
[1] 1292351.3 1661739.3  824260.0 1189476.4  496403.2  233480.2
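If sourcing a remote script is not an option, a row-wise great-circle distance can also be sketched in base R. The function below is a hypothetical helper (haversine_rows is not part of geosphere); it assumes column 1 is longitude and column 2 is latitude in degrees, and the default radius of 6378137 m is likewise an assumption. It uses the haversine formula, so its results will differ slightly from the geodesic distances above:

```r
# Haversine distance in metres between matching rows of two lon/lat matrices.
# Assumption: column 1 = longitude, column 2 = latitude, both in degrees.
haversine_rows <- function(p1, p2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (p2[, 2] - p1[, 2]) * to_rad
  dlon <- (p2[, 1] - p1[, 1]) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(p1[, 2] * to_rad) * cos(p2[, 2] * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))         # pmin() guards against rounding above 1
}

p1 <- rbind(c(0, 0), c(0, 0))            # toy points on the equator
p2 <- rbind(c(0, 0), c(90, 0))
d <- haversine_rows(p1, p2)              # one distance per matching row
d
```

Because every operation is vectorised over rows, the result has one entry per row pair and no n x n distance matrix is ever built.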
R Loop and Matrices
I am trying to get this simple 'for loop' to work. I can't get dim(F4) to be a 6848x2 matrix. I just want to divide the row entries of two matrices. Here's what I have...
> dim(F3)
[1] 6848 2
> head(F3)
         [,1]     [,2]
[1,] 140.9838 516.0239
[2,] 140.9838 516.0239
[3,] 140.9838 516.0239
[4,] 140.9838 516.0239
[5,] 140.9838 516.0239
[6,] 175.5093 515.2280
> dim(scale)
[1] 6848 1
F4 <- matrix(, nrow = nrow(F1), ncol = 1)
for (i in 1:t){
F4[i,]<-(F3[i]/scale[i])} #ONLY WANT F3(i) ROW TO BE DIVIDED BY SCALE(i) ROW
> dim(F4) #DOESN'T GIVE ME 6848x2 Matrix
[1] 6848 1
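No need to use a for loop here. Here is a vectorized solution:
F3/as.vector(scale) ## BAD! use of built-in function "scale" as a variable!
Example:
mat <- matrix(1:8,4,2)
sx <- matrix(1:4,4,1)
mat/as.vector(sx)
as.vector() drops the dimensions of the one-column matrix, so R recycles it down the columns of mat and divides each row by its own scale value instead of stopping with a non-conformable arrays error.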
How to use some apply function to solve what requires two for-loops in R
I have a matrix, named "mat", and a smaller matrix, named "center".
temp = c(1.8421,5.6586,6.3526,2.904,3.232,4.6076,4.8,3.2909,4.6122,4.9399)
mat = matrix(temp, ncol=2)
       [,1]   [,2]
[1,] 1.8421 4.6076
[2,] 5.6586 4.8000
[3,] 6.3526 3.2909
[4,] 2.9040 4.6122
[5,] 3.2320 4.9399
center = matrix(c(3, 6, 3, 2), ncol=2)
     [,1] [,2]
[1,]    3    3
[2,]    6    2
I need to compute the distance between each row of mat and every row of center. For example, the distance between mat[1,] and center[1,] can be computed as
diff = mat[1,]-center[1,]
t(diff)%*%diff
        [,1]
[1,] 3.92511
Similarly, I can find the distance between mat[1,] and center[2,]
diff = mat[1,]-center[2,]
t(diff)%*%diff
         [,1]
[1,] 24.08771
Repeating this process for each row of mat, I will end up with
          [,1]      [,2]
[1,]  3.925110 24.087710
[2,] 10.308154  7.956554
[3,] 11.324550  1.790750
[4,]  2.608405 16.408805
[5,]  3.817036 16.304836
I know how to implement it with for-loops. I was really hoping someone could tell me how to do it with some kind of an apply() function, maybe mapply() I guess. Thanks
apply(center, 1, function(x) colSums((x - t(mat)) ^ 2))
#           [,1]      [,2]
# [1,]  3.925110 24.087710
# [2,] 10.308154  7.956554
# [3,] 11.324550  1.790750
# [4,]  2.608405 16.408805
# [5,]  3.817036 16.304836
If you want the apply for expressiveness of code that's one thing but it's still looping, just different syntax. This can be done without any loops, or with a very small one across center instead of mat. I'd just transpose first because it's wise to get into the habit of getting as much as possible out of the apply statement. (The BrodieG answer is pretty much identical in function.) These are working because R will automatically recycle the smaller vector along the matrix and do it much faster than apply or for.
tm <- t(mat)
apply(center, 1, function(m){
  colSums((tm - m)^2)
})
Use dist and then extract the relevant submatrix:
ix <- 1:nrow(mat)
as.matrix( dist( rbind(mat, center) )^2 )[ix, -ix]
#           6         7
# 1  3.925110 24.087710
# 2 10.308154  7.956554
# 3 11.324550  1.790750
# 4  2.608405 16.408805
# 5  3.817036 16.304836
REVISION: simplified slightly.
You could use outer as well:
d <- function(i, j) sum((mat[i, ] - center[j, ])^2)
outer(1:nrow(mat), 1:nrow(center), Vectorize(d))
This will solve it:
t(apply(mat,1,function(row){
  d1<-sum((row-center[1,])^2)
  d2<-sum((row-center[2,])^2)
  return(c(d1,d2))
}))
Result:
          [,1]      [,2]
[1,]  3.925110 24.087710
[2,] 10.308154  7.956554
[3,] 11.324550  1.790750
[4,]  2.608405 16.408805
[5,]  3.817036 16.304836
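The remark in one of the answers that this can be done without any loops can be sketched with the expansion |a - b|^2 = |a|^2 + |b|^2 - 2 a.b. This is an alternative formulation, not code from the thread; it reuses the question's mat and center and compares the result against the apply() answer:

```r
temp <- c(1.8421, 5.6586, 6.3526, 2.904, 3.232, 4.6076, 4.8, 3.2909, 4.6122, 4.9399)
mat <- matrix(temp, ncol = 2)
center <- matrix(c(3, 6, 3, 2), ncol = 2)

# Squared distances between all rows of mat and all rows of center, no explicit loop:
# outer() adds the squared row norms, tcrossprod() gives mat %*% t(center).
d2 <- outer(rowSums(mat^2), rowSums(center^2), "+") - 2 * tcrossprod(mat, center)

# Reference result from the apply() answer
ref <- apply(center, 1, function(x) colSums((x - t(mat))^2))
all.equal(unname(d2), unname(ref))
```

The expansion can lose a little precision when the points are far from the origin relative to their spacing, so for such data the apply() or dist() versions may be the safer choice.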
Joining data by 'rbind' as a loop in R
I have two equally long datasets, 'vpXmin' and 'vpXmax', created from 'vp':
> head(vpXmin)
            vp
[1,] 253641 2621722
[2,] 253641 2622722
[3,] 253641 2623722
[4,] 253641 2624722
[5,] 253641 2625722
[6,] 253641 2626722
> head(vpXmax)
            vp
[1,] 268641 2621722
[2,] 268641 2622722
[3,] 268641 2623722
[4,] 268641 2624722
[5,] 268641 2625722
[6,] 268641 2626722
I want to join each of the rows from these datasets using 'rbind' and create a separate matrix for each pair, e.g.
l1<-rbind(vpXmax[1,],vpXmin[1,])
l2<-rbind(vpXmax[2,],vpXmin[2,])
... ...
Even though I'm not familiar with R loops, I want to handle such large data with a loop, but I failed while trying this:
for (i in 1:length(vp)){rbind(vpXmax[i,],vpXmin[i,])}
Any idea why? Also, please gimme some good references for learning different kinds of loops using R, if any. Thanks in advance.
Maybe something like:
vpXmax <- matrix(1:10,ncol=2)
vpXmin <- matrix(11:20,ncol=2)
l <- lapply(1:nrow(vpXmin),function(i) rbind(vpXmax[i,],vpXmin[i,]) )
Then, instead of l1, l2 etc. you have
l[[1]]
#     [,1] [,2]
#[1,]    1    6
#[2,]   11   16
l[[2]]
#     [,1] [,2]
#[1,]    2    7
#[2,]   12   17
And although it is probably not ideal, there is one major thing wrong with your initial loop. You aren't assigning your output, so you need to use assign or <- in some way to actually make an object. However, using assign is pretty much a flag to set off alarm bells that there is a better way to do things, and <- would require pre-allocating or other stuffing around. Nevertheless, it will work, albeit polluting your work space with l1 l2... ln objects:
for (i in 1:nrow(vpXmax)) {assign(paste0("l",i), rbind(vpXmax[i,],vpXmin[i,]) )}
> l1
#     [,1] [,2]
#[1,]    1    6
#[2,]   11   16
> l2
#     [,1] [,2]
#[1,]    2    7
#[2,]   12   17
As @ToNoy indicates, it is not obvious what kind of output you want. The easiest way to proceed would be to create a list in which each element is the result of rbind-ing the matching rows of the two original data frames.
A <- data.frame("a" = runif(100, -1, 0), "b" = runif(100, 0, 1))
Z <- data.frame("a" = runif(100, -2, -1), "b" = runif(100, 1, 2))
output <- vector("list", nrow(A))
for (i in 1:nrow(A)) {
  output[[i]] <- rbind(A[i, ], Z[i, ])
}