A framework for comparing the time performance of Expectation Maximization - r

I have my own implementation of the Expectation Maximization (EM) algorithm based on this paper, and I would like to compare its performance with that of another implementation. For the tests, I am using k centroids with 1 GB of text data, and I am just measuring the time it takes to compute the new centroids in one iteration. I tried it with an EM implementation in R, but I couldn't: the result is plotted in a graph, and it gets stuck when there is a large amount of text data. I was following the examples here.
Does anybody know of an EM implementation whose performance can be measured this way, or how to do it with R?

Fair benchmarking of EM is hard. Very hard.
The initialization usually involves randomness and can differ a lot between implementations. As far as I know, the R implementation by default uses hierarchical clustering to find the initial clusters, which comes at an O(n^2) memory cost and most likely an O(n^3) runtime cost. In my benchmarks, R would run out of memory because of this. I assume there is a way to specify initial cluster centers/models; a random-objects initialization will of course be much faster. In practice, k-means++ is probably a good way to choose initial centers (see the sketch after this list).
EM theoretically never terminates; at some point it simply stops changing much, so you set a threshold to stop. However, the exact definition of the stopping threshold varies between implementations.
There exist all kinds of model variations. A method using only fuzzy assignments, such as fuzzy c-means, will of course be much faster than an implementation using multivariate Gaussian mixture models with full covariance matrices, in particular with higher dimensionality.
Covariance matrices also need O(k * d^2) memory, and inverting them takes O(k * d^3) time, so they are clearly not appropriate for text data.
Data may or may not be appropriate. If you run EM on a data set that actually has Gaussian clusters, it will usually work much better than on a data set that doesn't provide a good fit at all. When there is no good fit, you will see a high variance in runtime even with the same implementation.
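For reference, here is a minimal k-means++ seeding sketch in base R, in case your implementation lets you pass explicit initial centers (X is a numeric data matrix and k >= 2 is the number of clusters; both names are placeholders):
kmeanspp_centers <- function(X, k) {
  n <- nrow(X)
  centers <- matrix(NA_real_, nrow = k, ncol = ncol(X))
  # First center: a data point chosen uniformly at random.
  centers[1, ] <- X[sample.int(n, 1), ]
  # Squared distance of every point to its nearest chosen center so far.
  d2 <- rowSums(sweep(X, 2, centers[1, ])^2)
  if (k >= 2) {
    for (i in 2:k) {
      # Next center: sampled with probability proportional to d2.
      centers[i, ] <- X[sample.int(n, 1, prob = d2), ]
      d2 <- pmin(d2, rowSums(sweep(X, 2, centers[i, ])^2))
    }
  }
  centers
}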
For a start, try running your own algorithm several times with different initializations, and check the runtime for variance. How large is the variance compared to the total runtime?
You can try benchmarking against the EM implementation in ELKI. But I doubt it will work well with sparse data such as text - that data just is not Gaussian, so it is not a proper benchmark; most likely it will not be able to process the data at all because of this. This is expected, and can be explained from theory. Try to find data sets that are dense and that can be expected to have multiple Gaussian clusters (sorry, I can't give you many recommendations here; the classic Iris and Old Faithful data sets are too small to be useful for benchmarking).
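A minimal sketch of that variance check; my_em() is a stand-in for your own implementation (here stubbed with a single kmeans iteration so the sketch runs as written), and the synthetic matrix is a placeholder for your real data:
set.seed(1)
data <- matrix(rnorm(1e5 * 10), ncol = 10)   # placeholder for your data
k    <- 5
my_em <- function(x, centers) kmeans(x, centers = centers, iter.max = 1)  # stub; swap in your EM step

runtimes <- replicate(10, {
  init <- data[sample(nrow(data), k), ]      # random-objects initialization
  system.time(my_em(data, centers = init))["elapsed"]
})
c(mean = mean(runtimes), sd = sd(runtimes), cv = sd(runtimes) / mean(runtimes))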

Related

Time complexity of nlm-package in R?

I'm estimating a non-linear system (via seemingly unrelated regressions - SUR) using the systemfit package (nlsystemfit() function), with 4 equations, 32 parameters to estimate (!) and 412 observations. But my code is taking forever (my laptop is not a particularly powerful one, though); so far, the process has been running for 13 hours. I'm not an expert in computational matters, but someone explained to me some time ago the concept of the time complexity of algorithms (big-O): the time to compute an algorithm can depend on a specific functional relation to the number of observations and/or coefficients.
Hence, I'm thinking of just stopping my process, simplifying the model (temporarily), and running something simpler, just to check whether the estimated parameters make sense so far, and then running the whole model.
But all this only makes sense if I can change key elements of my model that reduce the processing time significantly. That's why I was searching Google for the time complexity of the nlm package (nlsystemfit() relies on nlm), but without success. So, this is my question: does anybody know where I can find that information, or at least have advice on how to test non-linear systems before running the whole model?
Since you didn't provide any substantial information about your model, or any code, it's hard to suggest a concrete improvement for your situation.
From what you said:
Hence, I'm thinking of just stopping my process, simplifying the model (temporarily), and running something simpler, just to check whether the estimated parameters make sense so far, and then running the whole model.
It seems that what you need is benchmarking, i.e. measuring the time your code takes to execute (although benchmarking can also cover memory usage or other performance metrics).
There are quite a few ways to benchmark code in R: calling Sys.time() or system.time() just before and right after your algorithm/function executes, or using packages such as rbenchmark (a simple wrapper around system.time), tictoc, bench and microbenchmark.
Among these, the last two are preferable options: the bench package includes system_time(), a higher-precision alternative to system.time(), and microbenchmark is known to be reliable for accurately measuring and comparing the execution time of R expressions/algorithms.
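A minimal timing sketch; fit_small and fit_full are placeholders for your own nlsystemfit() calls on a simplified and on the full model (the Sys.sleep() bodies are stand-ins so the sketch runs as written):
library(microbenchmark)

fit_small <- function() Sys.sleep(0.05)   # e.g. fewer equations/parameters
fit_full  <- function() Sys.sleep(0.20)   # e.g. all 4 equations

# One-off wall-clock measurement:
system.time(fit_full())

# Repeated, higher-precision comparison of the two candidates:
microbenchmark(small = fit_small(), full = fit_full(), times = 5)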

OpenMDAO efficiency when using multiple components

I recently read this sentence in a paper:
One important feature of OpenMDAO is the ability to subdivide a
problem into components that have a small number of inputs and outputs
and contain relatively simple analyses.
Moreover, looking at the examples in the manual, each component has only a small number of inputs and outputs.
Would that mean it is more efficient to use an ExecComp that takes two inputs from an ExplicitComponent and outputs a constraint, instead of doing everything within the ExplicitComponent? I try to illustrate with an example:
x1, x2 --> ExplicitComp --> y1
y1 --> ExecComp --> constraint
OR
x1, x2 --> ExplicitComp --> y1, constraint
What the comment in that paper is referring to is not computational efficiency, but rather the benefit to the user in terms of making models more modular and maintainable. Additionally, when you have smaller components with fewer inputs, it is much easier to compute analytic derivatives for them.
The idea is that by breaking your calculation up into smaller steps, the partial derivatives are then simpler for you to compute by hand. OpenMDAO will then compute the total derivatives across the model for you.
So in a sense, you're leaning on OpenMDAO's ability to compute derivatives across large models to lessen your work load.
From a computational cost perspective, there is some cost associated with having more components versus fewer. Taken to the extreme, if you had one component for each line of code in a huge calculation, then the framework overhead could become a problem. There are some features in OpenMDAO that can help mitigate some of this cost, specifically the in-memory assembly of Jacobians for serial models.
With regard to the ExecComp specifically, that component is meant for simple and inexpensive calculations. It computes its derivatives using complex step, which can be costly if large array inputs are involved. It's there to make simple steps like adding variables easier, but you shouldn't use it for expensive calculations.
In your specific case, I would suggest that you consider whether it is hard to propagate the derivatives from x1, x2 through to the constraint yourself. If the chain rule is not hard to handle, then I would probably just lump it all into one calculation. If for some reason the derivatives are nasty when you combine all the calculations, then split them up.

Train SVM on a very large dataset stored on hard drive

I have a very large self-collected dataset of size [2000000 x 12672], where the rows are the instances and the columns the features. This dataset occupies ~60 GB on the local hard disk. I want to train a linear SVM on it. The problem is that I have only 8 GB of RAM, so I cannot load all the data at once. Is there any solution for training the SVM on this large dataset? Generating the dataset was my own choice, and it is currently in HDF5 format.
Thanks
Welcome to machine learning! One of the hard things about working in this space is the compute requirements. There are two main kinds of algorithms, on-line and off-line.
Online: supports feeding in examples one at a time, each one improving the model slightly
Offline: supports feeding in the entire dataset at once, achieving higher accuracy than an on-line model
Many typical algorithms have both on-line and off-line implementations, but an SVM is not one of them. To the best of my knowledge, SVMs are traditionally an off-line-only algorithm. The reason for this lies in a lot of the fine details around "shattering" the dataset. I won't go too far into the math here, but if you read up on it, it should become apparent.
It's also worth noting that the training complexity of an SVM is somewhere between O(n^2) and O(n^3), meaning that even if you could load everything into memory, it would take ages to actually train the model. It's very typical to test with a much smaller portion of your dataset before moving to the full dataset.
When moving to the full dataset you would have to run this on a much larger machine than your own; AWS should have something large enough for you. That said, at your data size I highly advise using something other than an SVM: at large data sizes, neural-net approaches really shine and can be trained in a more realistic amount of time.
As alluded to in the comments, there's also the concept of an out-of-core algorithm that can operate directly on objects stored on disk. The only group I know with a good offering of out-of-core algorithms is dato. It's a commercial product, but might be your best solution here.
A stochastic gradient descent approach to SVM could help, as it scales well and avoids the n^2 problem. An implementation available in R is RSofia, which was created by a team at Google and is discussed in Large Scale Learning to Rank. In the paper, they show that compared to a traditional SVM, the SGD approach significantly decreases the training time (this is due to (1) the pairwise learning method and (2) the fact that only a subset of the observations ends up being used to train the model).
Note that RSofia is a little more bare bones than some of the other SVM packages available in R; for example, you need to do your own centering and scaling of features.
As to your memory problem, it'd be a little surprising if you needed the entire dataset - I would expect that you'd be fine reading in a sample of your data and then training your model on that. To confirm this, you could train multiple models on different samples and then estimate performance on the same holdout set - the performance should be similar across the different models.
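A minimal sketch of that sample-and-compare idea, assuming the data sit in an HDF5 file "data.h5" with a feature matrix "X" and a label vector "y" (the file and dataset names, the sample sizes, and the use of rhdf5 plus LiblineaR are all assumptions to adapt):
library(rhdf5)
library(LiblineaR)

# Note: depending on how the HDF5 file was written (e.g. from Python), the
# matrix may come back transposed; check dim() on a small read first.
n_rows  <- 2000000
holdout <- sort(sample.int(n_rows, 2000))                       # fixed holdout set
x_hold  <- h5read("data.h5", "X", index = list(holdout, NULL))
y_hold  <- h5read("data.h5", "y", index = list(holdout))

for (i in 1:3) {
  train <- sort(sample(setdiff(seq_len(n_rows), holdout), 10000))  # ~1 GB of doubles per sample
  x_tr  <- h5read("data.h5", "X", index = list(train, NULL))
  y_tr  <- h5read("data.h5", "y", index = list(train))

  # Center/scale yourself, using the training sample's statistics only
  # (watch out for zero-variance columns in sparse text features).
  x_tr <- scale(x_tr)
  x_te <- scale(x_hold, center = attr(x_tr, "scaled:center"),
                        scale  = attr(x_tr, "scaled:scale"))

  fit  <- LiblineaR(data = x_tr, target = y_tr, type = 2)   # L2-regularized L2-loss linear SVM
  pred <- predict(fit, x_te)$predictions
  cat("run", i, "holdout accuracy:", mean(pred == y_hold), "\n")
}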
You don't say why you want a linear SVM, but if you can consider another model that often gives superior results, check out the hpelm Python package. It can read an HDF5 file directly; you can find it at https://pypi.python.org/pypi/hpelm. It trains on segmented data, which can even be pre-loaded asynchronously to speed up reading from slow hard disks.

How to improve randomForest performance?

I have a training set of size 38 MB (12 attributes and 420,000 rows). I am running the R snippet below to train a model using randomForest, and it is taking hours.
rf.model <- randomForest(
  Weekly_Sales ~ .,
  data = newdata,
  keep.forest = TRUE,
  importance = TRUE,
  ntree = 200,
  do.trace = TRUE,
  na.action = na.roughfix
)
I think it is taking a long time to execute because of na.roughfix; there are many NAs in the training set.
Could someone let me know how I can improve the performance?
My system configuration is:
Intel(R) Core i7 CPU @ 2.90 GHz
RAM - 8 GB
HDD - 500 GB
64 bit OS
(The tl;dr is: you should (a) increase nodesize to >> 1, and (b) exclude very low-importance feature columns, maybe even (say) 80% of your columns. Your issue is almost surely not na.roughfix, but if you suspect it, run na.roughfix separately as a standalone step before calling randomForest, and get that red herring out of the way first.)
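For example, a quick way to rule it out (a sketch; newdata is the training frame from the question): impute once, up front, and time that step separately from the forest itself.
library(randomForest)

system.time(newdata.fixed <- na.roughfix(newdata))   # median/mode imputation, timed on its own

rf.model <- randomForest(
  Weekly_Sales ~ .,
  data  = newdata.fixed,   # already imputed, so no na.action needed
  ntree = 200
)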
Now, all of the following advice only applies until you blow out your memory limits, so measure your memory usage and make sure you're not exceeding it. (Start with ridiculously small parameters, then scale them up, measure the runtime, and keep checking that it didn't increase disproportionately.)
The main parameters affecting the performance of randomForest are:
mtry (less is faster)
ntrees
number of features/cols in data - more is quadratically slower, or worse! See below
number of observations/rows in data
ncores (more is faster - as long as parallel option is being used)
some performance boost by setting importance=F and proximity=F (don't compute proximity matrix)
Never ever use the insane default nodesize=1 for classification! In Breiman's package you can't directly set maxdepth, but use nodesize as a proxy for that, and also read all the good advice at CrossValidated: "Practical questions on tuning Random Forests".
So here your data has 4.2e+5 rows; if each node shouldn't be smaller than ~0.1% of that, try nodesize=42. (First try nodesize=420 (1%), see how fast it is, then rerun, adjusting nodesize down; empirically determine a good nodesize for this dataset.)
runtime is proportional to ~ 2^D_max, i.e. polynomial to (-log1p(nodesize))
optionally, you can also speed things up by using sampling; see the strata and sampsize arguments (a sketch of nodesize and sampsize tuning follows below)
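A sketch of that experiment (newdata.fixed is the pre-imputed frame from the na.roughfix step above; the small ntree and the 50% sampsize are placeholder values for the tuning phase):
library(randomForest)

t_coarse <- system.time(
  randomForest(Weekly_Sales ~ ., data = newdata.fixed,
               ntree = 50, nodesize = 420,                 # ~1% of the rows
               sampsize = floor(0.5 * nrow(newdata.fixed)))
)
t_finer <- system.time(
  randomForest(Weekly_Sales ~ ., data = newdata.fixed,
               ntree = 50, nodesize = 42,                  # ~0.1% of the rows
               sampsize = floor(0.5 * nrow(newdata.fixed)))
)
rbind(coarse = t_coarse, finer = t_finer)                  # compare elapsed times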
Then a first-order estimate of runtime, denoting mtry=M, ntrees=T, ncores=C, nfeatures=F, nrows=R, maxdepth=D_max, is:
Runtime proportional to: T * F^2 * (R^1.something) * 2^D_max / C
(Again, all bets are off if you exceed memory. Also, try running on only one core, then 2, then 4, and verify you actually do get a linear speedup, not a slowdown.)
(The effect of large R is worse than linear, maybe quadratic, since tree partitioning has to consider all partitions of the data rows; certainly it's somewhat worse than linear. Check that by using sampling or indexing to give it only, say, 10% of the rows.)
Tip: keeping lots of low-importance junk features quadratically increases runtime, for a sublinear increase in accuracy. This is because at each node we must consider every possible feature selection (or however many mtry allows), and within each tree we must consider all (F-choose-mtry) possible combinations of features.
So here's my methodology for "fast-and-dirty feature selection for performance" (a code sketch follows these steps):
train a forest normally (slow), but use a sane nodesize=42 or larger
look at rf$importance or randomForest::varImpPlot(). Pick only the top-K features, where you choose K; for a silly-fast example, choose K=3. Save the entire ranking for future reference.
now rerun the forest, but only give it newdata[, importantCols]
confirm that it is (roughly quadratically) faster, and that the OOB error is not much worse
once you know your variable importances, you can turn importance computation off (importance=FALSE)
tweak mtry and nodesize (tweak one at a time), rerun and measure speed improvement
plot your performance results on logarithmic axes
post us the results! Did you corroborate the above? Any comments on memory usage?
(Note that the above is not a statistically valid procedure for actual feature selection; do not rely on it for that. Read up on the randomForest package for proper methods of RF-based feature selection.)
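A sketch of those steps (newdata.fixed, the small ntree, and the top-K cutoff are all placeholders to adapt to your data):
library(randomForest)

rf_full <- randomForest(Weekly_Sales ~ ., data = newdata.fixed,
                        ntree = 50, nodesize = 42, importance = TRUE)
varImpPlot(rf_full)

# Pick the top-K features by %IncMSE (this is a regression forest).
K   <- 3
imp <- importance(rf_full, type = 1)
importantCols <- rownames(imp)[order(imp[, 1], decreasing = TRUE)][1:K]

# Rerun on the reduced data and compare speed and OOB error.
rf_small <- randomForest(x = newdata.fixed[, importantCols],
                         y = newdata.fixed$Weekly_Sales,
                         ntree = 50, nodesize = 42)
c(full = tail(rf_full$mse, 1), reduced = tail(rf_small$mse, 1))   # final OOB MSE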
I suspect do.trace might also consume time. Instead of do.trace = TRUE, you can use do.trace = 5 (to print progress only every 5 trees), just to get some feel for the errors. For a large dataset, do.trace = TRUE can take up a lot of time as well.
Another thing I noticed: the argument is spelled ntree, not ntrees, and its default in the randomForest package is 500.
Another option is to use more recent packages that are purpose-built for high-dimensional / high-volume data sets. They run their code in lower-level languages (C++ and/or Java) and in certain cases use parallelization.
I'd recommend taking a look into these three:
ranger (implemented in C++)
randomForestSRC (implemented in C++)
h2o (implemented in Java - needs Java version 8 or higher)
Also, some additional reading here to give you more to go off on which package to choose: https://arxiv.org/pdf/1508.04409.pdf
Page 8 shows benchmarks comparing the performance of ranger against randomForest as the data size grows - ranger is WAY faster because its runtime grows roughly linearly, rather than non-linearly as for randomForest, with rising tree/sample/split/feature sizes.
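For example, a minimal ranger sketch along the lines of the question's call (num.threads = 4 is an assumption about the quad-core i7, and the NA handling via na.roughfix is one choice among several):
library(ranger)
library(randomForest)   # only for na.roughfix

newdata.fixed <- na.roughfix(newdata)   # ranger does not impute NAs itself

rf.ranger <- ranger(
  Weekly_Sales ~ .,
  data        = newdata.fixed,
  num.trees   = 200,
  importance  = "impurity",
  num.threads = 4
)
rf.ranger$prediction.error              # OOB MSE for a regression forest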
Good Luck!

hclust size limit?

I'm new to R. I'm trying to run hclust() on about 50K items: I have 10 columns to compare and 50K rows of data. When I try to compute the distance matrix, I get: "Cannot allocate vector of 5GB".
Is there a size limit to this? If so, how do I go about doing a cluster of something this large?
EDIT
I ended up increasing the memory limit and upgrading the machine's memory to 8 GB, and that seems to have fixed it.
Classic hierarchical clustering approaches are O(n^3) in runtime and O(n^2) in memory, so yes, they scale incredibly badly to large data sets. Obviously, anything that requires materializing the distance matrix is in O(n^2) or worse.
Note that there are some specializations of hierarchical clustering such as SLINK and CLINK that run in O(n^2), and depending on the implementation may also only need O(n) memory.
You might want to look into more modern clustering algorithms. Anything that runs in O(n log n) or better should work for you. There are plenty of good reasons to not use hierarchical clustering: usually it is rather sensitive to noise (i.e. it doesn't really know what to do with outliers) and the results are hard to interpret for large data sets (dendrograms are nice, but only for small data sets).
The size limit is being set by your hardware and software, and you have not given enough specifics to say much more. On a machine with adequate resources you would not be getting this error. Why not try a 10% sample before diving into the deep end of the pool? Perhaps starting with:
reduced <- full[ sample(1:nrow(full), nrow(full)/10 ) , ]
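Continuing that idea (a sketch; the scaling step, the linkage method and k = 8 are placeholders to adapt):
d      <- dist(scale(reduced))        # ~5K x 5K distances fit comfortably in memory
hc     <- hclust(d, method = "ward.D2")
groups <- cutree(hc, k = 8)
table(groups)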
