I wrote a program using an unsupervised K-means algorithm to try to compress images. It works now, but compared to Python it's incredibly slow! Specifically, it's the rowNorms computation that's slow. The array X has 350,000+ elements.
This is the particular function:
find_closest_centroids <- function(X, centroids) {
  m <- nrow(X)
  c <- integer(m)
  for (i in 1:m) {
    distances <- rowNorms(sweep(centroids, 2, X[i, ]))
    c[i] <- which.min(distances)
  }
  return(c)
}
In Python I am able to do it like this:
def find_closest_centroids(X, centroids):
    m = len(X)
    c = np.zeros(m)
    for i in range(m):
        distances = np.linalg.norm(X[i] - centroids, axis=1)
        c[i] = np.argmin(distances)
    return c
Any recommendations?
Thanks.
As dvd280 has noted in his comment, R tends to do worse than many other languages in terms of performance. If you are content with the performance of your code in Python but need the function available from R, you might want to look into the reticulate package, which provides an interface to Python much like the Rcpp package mentioned by dvd280 does for C++.
If you still want to implement this natively in R, be mindful of the data structures you use. For row-wise operations, data frames are a poor choice because they are stored as lists of columns. I'm not sure which data structures your code uses, but rowNorms() appears to be a matrix method. You might get more mileage out of a list-of-rows structure.
If you feel like getting into dplyr, you could find this vignette on row-wise operations helpful. Make sure you have the latest version of the package, as the vignette is based on dplyr 1.0.
The data.table package tends to yield the best performance for large data sets in R, but I'm not familiar with it, so I can't give you any further directions on that.
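If you do stay in base R, another option (a minimal sketch of my own, not something from the answer above; it assumes X and centroids are plain numeric matrices with matching column counts) is to vectorise the distance computation itself instead of looping over the rows of X:

find_closest_centroids_vec <- function(X, centroids) {
  # ||x - c||^2 = ||x||^2 - 2 * x.c + ||c||^2; the ||x||^2 term is the same for
  # every centroid, so it can be dropped before taking the per-row minimum.
  cross <- X %*% t(centroids)                        # m x k matrix of dot products
  d2 <- sweep(-2 * cross, 2, rowSums(centroids^2), "+")  # m x k: ||c||^2 - 2 * x.c
  max.col(-d2, ties.method = "first")                # index of the nearest centroid per row
}

For m around 350,000 and a handful of centroids this replaces hundreds of thousands of small calls with a few large matrix operations, which is usually where R's speed lives.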
Related
Is there an equivalent to numpy's apply_along_axis() (or R's apply()) in Julia? I've got a 3D array and I would like to apply a custom function to each pair of coordinates along dimensions 1 and 2. The results should be in a 2D array.
Obviously, I could do two nested for loops iterating over the first and second dimension and then reshape, but I'm worried about performance.
This example produces the output I desire (I am aware this is slightly pointless for sum(); it's just a dummy here):
test = reshape(collect(1:250), 5, 10, 5)
a = []
for i in 1:5
    for j in 1:10
        push!(a, sum(test[i, j, :]))
    end
end
println(reshape(a, 5, 10))
Any suggestions for a faster version?
Cheers
Julia has the mapslices function which should do exactly what you want. But keep in mind that Julia is different from other languages you might know: library functions are not necessarily faster than your own code, because they may be written to a level of generality higher than what you actually need, and in Julia loops are fast. So it's quite likely that just writing out the loops will be faster.
That said, a couple of tips:
Read the performance tips section of the manual. From that you'd learn to put everything in a function, and to not use untyped arrays like a = [].
The slice or sub function (view / @view in current Julia) can avoid making a copy of the data.
How about
f = sum # your function here
Int[f(test[i, j, :]) for i in 1:5, j in 1:10]
The last line is a two-dimensional array comprehension.
The Int in front is to guarantee the type of the elements; this should not be necessary if the comprehension is inside a function.
Note that you should (almost) never use untyped (Any) arrays, like your a = [], since this will be slow. You can write a = Int[] instead to create an empty array of Ints.
EDIT: Note that in Julia, loops are fast. The need for creating functions like that in Python and R comes from the inherent slowness of loops in those languages. In Julia it's much more common to just write out the loop.
Hello everyone and thanks in advance! I've had a bit of an interesting journey with this problem. Here I figured out how to create a file-backed big matrix using the bigmemory package. This 7062 row by 364520 column matrix is the constraint matrix in a linear programming problem I'm trying to solve using the Rsymphony package. The code is below and the constraint matrix is called mat:
Rsymph <- Rsymphony_solve_LP(obj,
                             mat[1:nrow(mat), 1:ncol(mat)],
                             dir,
                             rhs,
                             types = "B", max = FALSE, write_lp = TRUE)
Unfortunately, when I run this, Rsymphony tries to bring the file-backed matrix into memory and I don't have enough RAM. The only reason I created the big matrix with bigmemory in the first place was to use as little RAM as possible. Is there any way, with this code or with another linear programming function, to solve this within the amount of memory I have available? Thanks.
This was my concern before. By running mat[...] you are converting the big.matrix into a regular matrix. The function would need to be rewritten to be compatible with big.matrix objects. If you look at the source code for Rsymphony_solve_LP, you will find the following call:
out <- .C("R_symphony_solve",
as.integer(nc),
as.integer(nr),
as.integer(mat$matbeg),
as.integer(mat$matind),
as.double(mat$values),
as.double(col_lb),
as.double(col_ub),
as.integer(int),
if(max) as.double(-obj) else as.double(obj),
obj2 = double(nc),
as.character(paste(row_sense, collapse = "")),
as.double(rhs),
double(),
objval = double(1L),
solution = double(nc),
status = integer(1L),
verbosity = as.integer(verbosity),
time_limit = as.integer(time_limit),
node_limit = as.integer(node_limit),
gap_limit = as.double(gap_limit),
first_feasible = as.integer(first_feasible),
write_lp = as.integer(write_lp),
write_mps = as.integer(write_mps))
This C function will need to be rewritten for it to be compatible with big.matrix objects. If the use of this function is critically important to you, there are some examples of how to access big.matrix objects on the Rcpp Gallery website using Rcpp and RcppArmadillo. I am sorry to say there is no easy solution beyond this right now. You either need to get more RAM or start writing some more code.
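If it helps to see the conversion concretely, here is a quick check (with a deliberately tiny stand-in matrix, not your 7062 x 364520 one) showing that subsetting a big.matrix with [ produces an ordinary in-memory matrix, which is exactly what mat[1:nrow(mat), 1:ncol(mat)] triggers:

library(bigmemory)
bm <- big.matrix(3, 3, init = 1)          # small stand-in for your file-backed matrix
class(bm)                                 # "big.matrix" -- the data live outside R's heap
class(bm[1:nrow(bm), 1:ncol(bm)])         # an ordinary matrix -- the subset is copied into RAM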
I have a function that calculates an index in R for a matrix of binary data. The goal of this function is to compute a person-fit index for binary response data called HT. It divides the covariance between the response vectors of two respondents (e.g. persons i and j) by the maximum possible covariance between the two response patterns, which can be calculated using the means of the response vectors (e.g. Bi). The function is:
fit <- function(Data){
  N <- dim(Data)[1]
  L <- dim(Data)[2]
  r <- rowSums(Data)
  p.cor.n <- (r / L)                 # proportion correct for each response pattern
  sig.ij <- var(t(Data), t(Data))    # covariance of response patterns
  diag(sig.ij) <- 0
  H.num <- apply(sig.ij, 1, sum)
  H.denom1 <- matrix(p.cor.n, N, 1) %*% matrix(1 - p.cor.n, 1, N)   # Bi(1-Bj)
  H.denom2 <- matrix(1 - p.cor.n, N, 1) %*% matrix(p.cor.n, 1, N)   # (1-Bi)Bj
  H.denomm <- ifelse(H.denom1 > H.denom2, H.denom2, H.denom1)
  diag(H.denomm) <- 0
  H.denom <- apply(H.denomm, 1, sum)
  HT <- H.num / H.denom
  return(HT)
}
This function works fine with small matrices (e.g. 1000 by 20), but when I increased the number of rows (e.g. to 10000) I ran into a memory limitation problem. The source of the problem is this line in the function:
H.denomm <- ifelse(H.denom1 > H.denom2, H.denom2, H.denom1)
which selects the denominator for each response pattern. Is there any other way to rewrite this line that demands less memory?
P.S.: you can try data <- matrix(rbinom(200000, 1, .7), 10000, 20).
Thanks.
Well, here is one way you could shave a little time off. Overall I still think there might be a better theoretical answer in terms of the approach you take... but here goes. I wrote an Rcpp function that specifically implements ifelse in the sense you use it above. It only works for square matrices like in your example. By the way, I wasn't really trying to optimize R's ifelse, because I'm pretty sure it already calls internal C functions; I was just curious whether a C++ function designed to do exactly what you are trying to do, and nothing more, would be faster. I shaved 11 seconds off. (This selects the larger value.)
C++ Function:
library(Rcpp)
library(inline)

code <- "
Rcpp::NumericMatrix x(xs);
Rcpp::NumericMatrix y(ys);
Rcpp::NumericMatrix ans(x.nrow(), y.ncol());
int ii, jj;
for (ii = 0; ii < x.nrow(); ii++) {
    for (jj = 0; jj < x.ncol(); jj++) {
        if (x(ii, jj) < y(ii, jj)) {
            ans(ii, jj) = y(ii, jj);
        } else {
            ans(ii, jj) = x(ii, jj);
        }
    }
}
return(ans);"

matIfelse <- cxxfunction(signature(xs = "numeric", ys = "numeric"),
                         plugin = "Rcpp",
                         body = code)
Now if you replace ifelse in your function above with matIfelse you can give it a try. For example:
H.denomm <- matIfelse(H.denom1, H.denom2)

# Time for the old version to run with the matrix you suggested above,
# matrix(rbinom(200000, 1, .7), 10000, 20):
#    user  system elapsed
#   37.78    3.36   41.30
# Time to run with the dedicated Rcpp function:
#    user  system elapsed
#   28.25    0.96   30.22
Not bad, roughly 36% faster. Again, though, I don't claim that this is generally faster than ifelse, just in this very specific instance. Cheers
P.S. I forgot to mention that to use Rcpp you need to have Rtools installed, and during the install make sure the environment path variables are added for Rtools and gcc. On my machine those look like: c:\Rtools\bin;c:\Rtools\gcc-4.6.3\bin
Edit:
I just noticed that you were running into memory problems... I'm not sure whether you are running a 32- or 64-bit machine, but you probably just need to allow R to use more RAM. I'll assume you are on 32-bit to be safe, so you should be able to let R take at least 2 GB. Give this a try: memory.limit(size=1900). The size is in megabytes, so I went for 1.9 GB just to be safe; I'd imagine this is plenty of memory for what you need.
Do you actually intend to do NxL independent ifelse(H.denom1 > H.denom2, ...) operations?
H.denomm <- ifelse(H.denom1 > H.denom2, H.denom2, H.denom1)
If you really do, look for a library or, alternatively, a better decomposition.
If you told us in general terms what this code is trying to do, it would help us answer it.
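For what it's worth, one simple decomposition (not something either answer above tried) is to note that ifelse(a > b, b, a) is just the element-wise minimum, which base R computes directly with pmin():

# Element-wise minimum of the two denominator matrices; equivalent to the
# original ifelse() line but without ifelse's intermediate logical matrix.
H.denomm <- pmin(H.denom1, H.denom2)

It still needs H.denom1 and H.denom2 to exist, so it only trims peak memory modestly; avoiding those two N x N outer products altogether would be the bigger win.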
I have the following piece of code:
Y.hat.tr <- array(0, c(nXtr, 2))
for (i in 1:nXtr) {
  # print(i)
  Y.hat.tr[i, 2] <- ktr[, i] %*% solve(K + a*In) %*% Ytr
  # Y.hat.tr[i, 2] <- ktr[, i] %*% chol2inv(chol(K + a*In)) %*% Ytr
}
Y.hat.tr[, 1] <- Ytr
My problem is that nXtr = 300 and ktr is a 300 x 300 matrix. This routine takes approximately 30 seconds to run in R version 3.0.1. I have tried various approaches to reduce the run time, but to no avail.
Any ideas would be gratefully received. If any other information is required please let me know
I have now taken solve(K + a*In) %*% Ytr out of the loop, which has helped, but I was hoping to somehow vectorise this piece of code. Having thought about this for a while, and after looking through various posts, I cannot see how it can be done.
Maybe I am missing something (and without sample or simulated data to test on it is harder to check), but isn't your loop equivalent to:
Y.hat.tr[,2] <- t(ktr) %*% solve(K + a*In) %*% Ytr
?
Removing the loop altogether and using internal vectorized code may speed things up.
Also, you are using solve with one argument; often you can speed things up by using solve with two arguments (fewer internal calculations), something like:
t(ktr) %*% solve( K + a*In, Ytr )
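Putting those two pieces together (a sketch using the object names from your question): do the linear solve against Ytr once, then a single matrix product replaces the whole loop.

# Solve once instead of inverting; one matrix-vector product replaces the loop.
w <- solve(K + a * In, Ytr)
Y.hat.tr[, 2] <- as.vector(t(ktr) %*% w)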
Your loop is also of the type called embarrassingly parallel, which means that if you want to keep the loop and are working on a computer with more than one core (or have easy access to a cluster), you could use the parallel package (perhaps most simply via the foreach package) to run the calculations in parallel, which can sometimes greatly speed up the process.
When playing with large objects the memory and speed implications of pass-by-value can be substantial.
R has several ways to pass by reference:
Reference Classes
R.oo
C/C++/other external languages
Environments (a small sketch of this approach follows the list)
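To make the last item concrete, here is a tiny sketch (made-up names, not benchmarked): an environment is passed by reference, so a function can read a large object stored inside one without a copy of that object being made for the call.

make_store <- function(big) {
  e <- new.env()
  e$big <- big                     # the large object lives inside the environment
  e
}
col_mean <- function(store, col) {
  mean(store$big[[col]])           # read through the reference; 'big' is not copied
}

store <- make_store(data.frame(x = runif(1e6)))
col_mean(store, "x")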
However, many of them require considerable overhead (in terms of code complexity and programmer time).
In particular, I'm envisioning something like what you could use constant references for in C++: pass a large object, compute on it without modifying it, and return the results of that computation.
Since R does not have a concept of constants, I suspect if this happens anywhere, it's in compiled R functions, where the compiler could see that the formal argument was not modified anywhere in the code and pass it by reference.
Does the R compiler pass-by-reference if an argument is not modified? If not, are there any technical barriers to it doing so or has it just not been implemented yet?
Example code:
n <- 10^7
bigdf <- data.frame(x = runif(n), y = rnorm(n), z = rt(n, 5))
myfunc <- function(dat) invisible(with(dat, x^2 + mean(y) + sqrt(exp(z))))
library(compiler)
mycomp <- cmpfun(myfunc)
tracemem(bigdf)
myfunc(bigdf)
# No object was copied! Question is not necessary
This may be way off base for what you need, but what about wrapping the object in a closure? This function makes a function that knows about the object given to its parent; here I use the tiny volcano dataset to do a very simple job.
mkFun <- function(x) {
  function(rownumbers) {
    rowSums(x[rownumbers, , drop = FALSE])
  }
}

fun <- mkFun(volcano)
fun(2)    ## [1] 6493
fun(2:3)  ## [1] 6493 6626
Now fun can get passed around by worker functions to do its job as it likes.