R function Mclust slow

I used the Mclust function from the mclust package for EM clustering of a vector of about 27,000 entries into two clusters:
Mclust(data_vector, G = 2)
Another program that uses OpenCV for EM clustering is about three times faster than Mclust (even if I reduce the maximum number of iterations in Mclust to, e.g., 4). In the mclust source it looks like the function is implemented in Fortran.
How can it be slower than the OpenCV implementation?

Try running both with the exact same:
initial conditions
model (with/without covariance, etc.)
I believe Mclust does a fairly expensive initialization. If OpenCV starts from a random sample as its initialization, it is no wonder it is faster.
So for starters, give both the exact same starting point; one way to control the starting point in mclust is sketched below.
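A minimal sketch of controlling Mclust's initialization cost via its initialization argument (the toy data and the subset size are my assumptions, not from the original post):

library(mclust)

set.seed(1)
x <- c(rnorm(13500, 0), rnorm(13500, 4))  # toy stand-in for the 27,000-entry vector

# By default Mclust runs model-based hierarchical clustering on all points to
# initialize EM; restricting that step to a random subset makes its startup
# cost closer to a random initialization
sub <- sample(seq_along(x), 1000)
fit <- Mclust(x, G = 2, initialization = list(subset = sub))
summary(fit)

Timing this call with and without the initialization argument should show how much of the runtime gap is initialization rather than the EM iterations themselves.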

Related

Fast NMF in R on sparse matrices

I'm looking for a fast NMF implementation for sparse matrices in R.
The R NMF package offers a number of algorithms, none of which is impressive in terms of computational time.
NNLM::nnmf() seems to be the state of the art in R at the moment, specifically with method = "scd" and loss = "mse", implemented as alternating least squares solved by sequential coordinate descent. However, this method is quite slow on very large, very sparse matrices.
The rsparse::WRMF function is extremely fast, but only because just the positive values in A are used for the row-wise computation of W and H.
Is there any reasonable implementation for solving NMF on a sparse matrix?
Is there an equivalent to scikit-learn in R? See this question.
There are various worker functions in R, such as fnnls and tsnnls, none of which surpasses nnls::nnls (written in Fortran). I have been unable to build any of these functions into a faster NMF framework.
Forgot I even posted this question, but one year later...
I wrote a very fast implementation of NMF in RcppEigen, see the RcppML R package on CRAN.
install.packages("RcppML")
# for the development version
devtools::install_github("zdebruine/RcppML")
?RcppML::nmf
It's at least an order of magnitude faster than NNLM::nnmf; for comparison, RcppML::nmf rivals the runtime of an irlba::irlba SVD (although it's an altogether different algorithm).
I've successfully applied my implementation to 1.3 million single cells with 26,000 genes in a 96% sparse matrix for a rank-100 factorization in 1 minute. I think that's very reasonable.
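A minimal usage sketch on a toy sparse matrix (the argument and field names follow my reading of the CRAN documentation; check ?RcppML::nmf for your installed version):

library(Matrix)
library(RcppML)

set.seed(1)
# toy non-negative sparse matrix, ~96% sparse
A <- rsparsematrix(1000, 500, density = 0.04, rand.x = runif)

model <- RcppML::nmf(A, k = 10)  # rank-10 factorization
str(model)                       # w, d, and h factors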

Inconsistent results between dqrng and the R API for PRNG in Rcpp

I am attempting to implement a particle filter in Rcpp and use OpenMP to parallelise the transition step. I am using dqrng to create a thread-safe RNG, together with the Boost distribution functions, as described here.
The code using the R API can be found here, and the version introducing dqrng here.
The issue I am having is that, using the R API, I get correct results, verified against alternative implementations, with the density of the estimator being roughly normal as expected. For the dqrng version, however, the density of the estimator does not appear correct, and differing results are obtained. The density plots can be seen below.
Does anyone have any understanding of why this might be the case?
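For reference, the thread-safe pattern from the dqrng parallel-usage vignette gives every OpenMP thread its own jumped-ahead copy of the generator; a sketch of that setup, stripped of the particle-filter specifics (the exported function here is hypothetical):

#include <Rcpp.h>
#include <vector>
// [[Rcpp::depends(dqrng, BH)]]
#include <xoshiro.h>
#include <boost/random/normal_distribution.hpp>
// [[Rcpp::plugins(openmp)]]
#include <omp.h>

// [[Rcpp::export]]
Rcpp::NumericVector parallel_normals(int n, int ncores, int seed) {
  std::vector<double> out(n);
  dqrng::xoshiro256plus rng(seed);               // master generator

  #pragma omp parallel num_threads(ncores)
  {
    dqrng::xoshiro256plus lrng(rng);             // thread-local copy of the master
    lrng.long_jump(omp_get_thread_num() + 1);    // non-overlapping substream per thread
    boost::random::normal_distribution<double> dist(0.0, 1.0);

    #pragma omp for
    for (int i = 0; i < n; ++i) out[i] = dist(lrng);
  }
  return Rcpp::wrap(out);
}

If the thread-local copies are not jumped apart like this, the per-thread streams overlap and the estimator's distribution comes out wrong, which is one plausible cause of the discrepancy described above.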

How matrix inversion is done in the "krige" function of the gstat package in R

I am midway through understanding how the gstat package in R implements kriging. I have understood the calculation of the empirical semivariogram and the fitting of semivariogram models, but I have not understood how it implements the matrix inversion needed to calculate the weights of the kriging estimator. I have a large data set containing 50,000 lat-long-precipitation triplets. Theoretically, a matrix of size 50000x50000 must be inverted in order to get the weights, and such a large matrix takes several GB of main memory, which is impractical.
My question is: how does the krige function do all this within a second?
You didn't say what your computing environment is, but I believe it is safe to say that it didn't solve a 50,000-point kriging problem in a second. In order to understand what it did, please provide more information, e.g. the commands you used and the output gstat gave.
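For what it's worth, gstat only builds the full kriging system if you ask it to: the krige function's nmax/maxdist arguments restrict each prediction to a local neighbourhood, so only small systems are ever solved. A sketch (obs and grd are hypothetical stand-ins for the poster's data):

library(gstat)
library(sp)

# 'obs': SpatialPointsDataFrame with a 'prec' column; 'grd': prediction locations
vg  <- variogram(prec ~ 1, obs)
fit <- fit.variogram(vg, vgm("Sph"))

# With nmax, each prediction solves a small (here 50x50) kriging system built
# from the nearest observations, never the full 50000x50000 one
pred <- krige(prec ~ 1, obs, grd, model = fit, nmax = 50)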

Is there an equivalent to Matlab's rcond() function in Julia?

I'm porting some Matlab code that uses rcond() to test for singularity, as also recommended here (for Matlab singularity testing).
I see that there is a cond() function in Julia (as in Matlab), but rcond() doesn't appear to be available by default:
ERROR: rcond not defined
I'd assume that rcond(), like the Matlab version, is more efficient than 1/cond(). Is there such a function in Julia, perhaps in an add-on module?
Julia calculates the condition number as the ratio of the largest to the smallest singular value (got to love open source, no more MATLAB black boxes!).
Julia doesn't have an rcond function in Base, and I'm unaware of one in any package. If it did, it would just be the ratio of the smallest to the largest value instead. I'm not sure why it's efficient in MATLAB, but it's quite possible that whatever the reason is, it doesn't carry through to Julia.
Matlab's rcond is an optimization based on the fact that it's an estimate of the condition number for square matrices. In my testing, and given that its help page mentions LAPACK's 1-norm estimator, it appears to use LAPACK's dgecon.f. In fact, this is exactly what Julia does when you ask for the condition number of a square matrix with the 1- or Inf-norm.
So you can simply define
rcond(A::StridedMatrix) = 1/cond(A,1)
You can save Julia from twice-inverting LAPACK's result by manually combining cond(::StridedMatrix) and cond(::LU), but the savings there will almost certainly be immeasurable. Where there is a measurable saving, however, is in taking norm(A) directly instead of reconstructing a matrix similar to A from its LU factorization:
rcond(A::StridedMatrix) = LAPACK.gecon!('1', lufact(A).factors, norm(A, 1))
In my tests, this behaves identically to Matlab's rcond (2014b), and provides a decent speedup.

parallelize process in missForest package

I am using a package called missForest to estimate the missing values in my data set.
My question is: how can we parallelize this process to shorten the time that it takes to get the results?
Please refer to this example (from missForest package):
data(iris)
summary(iris)
The data contains four continuous and one categorical variable.
Artificially produce missing values using the prodNA function:
set.seed(81)
iris.mis <- prodNA(iris, noNA = 0.2)
summary(iris.mis)
Impute the missing values, providing the complete matrix for illustration. Use verbose = TRUE to see what happens between iterations:
iris.imp <- missForest(iris.mis, xtrue = iris, verbose = TRUE)
Yesterday I submitted version 1.4 of missForest to CRAN; the Windows and Linux packages are ready, and the Mac version will follow soon.
The new function has an additional argument "parallelize" which allows you either to grow the individual forests in parallel (parallelize = "forests") or to compute forests on several variables at the same time (parallelize = "variables"). The default is no parallel computing (parallelize = "no").
Do not forget to register a suitable parallel backend, e.g. using the package "doParallel", before trying it for the first time; a minimal example follows. The "doParallel" vignette gives an illustrative example in Section 4.
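A minimal sketch of that setup (four cores assumed), continuing the iris example above:

library(doParallel)
registerDoParallel(cores = 4)  # register the backend first

library(missForest)
iris.imp <- missForest(iris.mis, parallelize = "forests", verbose = TRUE)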
Due to some other details I had to temporarily remove the "missForest" vignette from the package. But I will resolve this in due course and release it as version 1.4-1.
It's a bit tricky to do a good job of parallelizing the missForest function. There seem to be two basic ways to do it:
Create the randomForest model objects in parallel;
Execute multiple randomForest operations (create model and predict) in parallel for each of the columns of the data frame that contain NA's.
Method 1 is rather easy to implement (a sketch follows), except that you have to compute the error estimates yourself, since the randomForest combine function doesn't compute them for you. However, if the randomForest objects don't take that long to compute and there are many columns containing NA's, you may get little if any speed-up, even though the operations in aggregate take a long time to compute.
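A minimal sketch of method 1, growing one forest as several smaller ones in parallel (the iris formula and core count are illustrative stand-ins):

library(foreach)
library(doParallel)
library(randomForest)
registerDoParallel(cores = 4)

# Grow 500 trees as four forests of 125 each and merge them; note that
# randomForest::combine does not aggregate the OOB error estimates, so
# those must be recomputed by hand, as mentioned above
rf <- foreach(ntree = rep(125, 4), .combine = randomForest::combine,
              .packages = "randomForest") %dopar%
  randomForest(Species ~ ., data = iris, ntree = ntree)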
Method 2 is a bit harder to implement because the sequential algorithm updates the columns of the xmis data frame after each randomForest operation. I think the right way to parallelize it is to process n columns at a time (where n is the number of worker processes), which requires another loop over groups of n columns in order to cover all the columns of the data frame. My experiments suggest that unless this is done, the outer loop takes longer to converge, losing the benefit of executing in parallel.
In general, to get a performance improvement you will need to implement both of these methods and choose between them based on your input data. If you have just a few columns with NA's but the randomForest models take a long time to compute, choose method 1. If you have many columns with NA's, you should probably choose method 2, even if the individual randomForest models take a long time to compute, because it can be done more efficiently, although it may still require an extra iteration of the outer while loop.
In the process of experimenting with missForest, I eventually developed a parallel version of the package. I put the modified version of library.R on GitHub Gist; however, it isn't trivial to use in that form, especially without documentation. So I contacted the author of missForest, and he is very interested in incorporating at least some of my modifications into the official package, so hopefully the next version of missForest posted to CRAN will support parallel execution.
