I've noticed that my R programs often run slowly, especially when the code grows a list inside a loop. It is not at all obvious what slows the program down so dramatically during the looping.
Specifically, I'm using caret to train a gbm model. After tuning the hyperparameters, I need to run LOOCV to obtain the test error, which requires training the model n times (n = number of samples). All I store in the list is the prediction result, yet the list grows more slowly as the loop progresses.
Can you offer some general advice for diagnosing memory issues in R programs?
First create an empty list or vector of size n, then fill it in by index. That way R does not have to copy and reallocate the whole object every time it adds one more value.
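A minimal sketch of the preallocation idea applied to a LOOCV loop. It uses lm on the built-in mtcars data as a stand-in for the caret/gbm model (an assumption, chosen only so the example runs without extra packages); the variable names are illustrative.

```r
# Leave-one-out cross-validation with a preallocated result vector.
# lm() on mtcars stands in for the gbm model from the question.
n <- nrow(mtcars)

# Allocate the full-length vector once, before the loop.
loocv_pred <- numeric(n)

for (i in seq_len(n)) {
  # Fit on all rows except i, then predict the held-out row.
  fit <- lm(mpg ~ wt + hp, data = mtcars[-i, ])
  loocv_pred[i] <- predict(fit, newdata = mtcars[i, , drop = FALSE])
}

# LOOCV estimate of the test error (root mean squared error).
loocv_rmse <- sqrt(mean((mtcars$mpg - loocv_pred)^2))
print(loocv_rmse)
```

The same pattern works for lists: create the container with vector("list", n) and assign with results[[i]] <- value instead of appending.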
I'm estimating a non-linear system (via seemingly unrelated regressions, SUR) using the nlsystemfit() function from the systemfit package, with 4 equations, 32 parameters to estimate (!) and 412 observations. But my code is taking forever (my laptop is not a very powerful one, though). So far, the process has been running for 13 hours. I'm not an expert in computational matters, but some time ago someone explained to me the concept of the time complexity of algorithms (big-O): the time needed to run an algorithm can follow a specific functional relationship with the number of observations and/or coefficients.
Hence, I'm thinking of just stopping my process, temporarily simplifying the model, and running something simpler, only to check whether the estimated parameters make sense so far. Then I would run the whole model.
But all this only makes sense if I can change key elements of my model in a way that significantly reduces the processing time. That's why I searched Google for the time complexity of nlm (nlsystemfit() relies on nlm), but without success. So this is my question: does anybody know where I can find that information, or can you at least give me advice on how to test non-linear systems before running a whole model?
Since you didn't provide much information about your model, or any code, it's hard to suggest specific improvements for your situation.
From what you said:
Hence, I'm thinking of just stopping my process, temporarily simplifying the model, and running something simpler, only to check whether the estimated parameters make sense so far. Then I would run the whole model.
It seems you need benchmarking, i.e. measuring the time your code takes to execute (benchmarking can also cover memory usage or other performance metrics).
There are quite a few ways to benchmark code in R. You can call Sys.time() just before and right after your algorithm or function executes, wrap the call in system.time(), or use packages such as rbenchmark (a simple wrapper around system.time()), tictoc, bench and microbenchmark.
Among these, the last two are the preferable options: the bench package provides bench::mark() along with bench::system_time(), a higher-precision alternative to system.time(), and microbenchmark is known as a reliable way to accurately measure and compare the execution time of R expressions.
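A small sketch of the timing approaches mentioned above. The function slow_sum is a hypothetical example workload, not something from your model; the microbenchmark part only runs if that package happens to be installed.

```r
# Hypothetical workload: an explicit loop versus the vectorised sum().
slow_sum <- function(x) {
  total <- 0
  for (v in x) total <- total + v
  total
}
x <- runif(1e5)

# Option 1: Sys.time() before and after the call.
t0 <- Sys.time()
s1 <- slow_sum(x)
elapsed <- Sys.time() - t0
print(elapsed)

# Option 2: system.time() wraps the expression and reports
# user, system and elapsed time.
timing <- system.time(s2 <- sum(x))
print(timing)

# Option 3: microbenchmark repeats each expression several times
# and summarises the distribution (only if the package is available).
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(slow_sum(x), sum(x), times = 10))
}
```

For a long-running fit, timing a reduced version of the model first (fewer equations or observations) gives a rough idea of what the full run will cost.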
I have a classification task that I managed to train with the mlr package using LDA ("classif.lda") in a few seconds. However, when I trained it using "classif.rpart", the training never finished.
Is any different setup needed for the different methods?
My training data is here if needed to replicate the problem. I tried to train it simply with
pred.bin.task <- makeClassifTask(id="CountyCrime", data=dftrain, target="count.bins")
train("classif.rpart", pred.bin.task)
In general, you don't need to change anything about the setup when switching learners -- one of the main points of mlr is to make this easy! This does not mean that it'll always work though, as different learning methods do different things under the hood.
It looks like in this particular case the model simply takes a long time to train, so you probably didn't wait long enough for it to complete. You have quite a large data frame.
Looking at your data, you seem to have an interval of values in count.bins. This is treated as a factor by R (i.e. intervals are only the same if the string matches completely), which is probably not what you want here. You could encode start and end as separate (numerical) features.
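A minimal sketch of that encoding, assuming the labels look like the output of cut(), e.g. "(10,20]". The example labels are made up, and your actual count.bins format may differ, in which case the regular expression needs adjusting.

```r
# Hypothetical interval labels in the style produced by cut().
count_bins <- c("(0,10]", "(10,20]", "(20,50]")

# Pull every number out of each label.
bounds <- regmatches(count_bins, gregexpr("-?[0-9.]+", count_bins))

# First number is the start of the interval, second is the end.
bin_start <- vapply(bounds, function(b) as.numeric(b[1]), numeric(1))
bin_end   <- vapply(bounds, function(b) as.numeric(b[2]), numeric(1))

print(data.frame(count_bins, bin_start, bin_end))
```

The two numeric columns can then replace the factor column as features (or the task can be recast as regression on one of them).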
I have a very large, self-collected dataset of size [2000000 12672], where the rows are instances and the columns are features. The dataset occupies ~60 gigabytes on the local hard disk. I want to train a linear SVM on it. The problem is that I have only 8 gigabytes of RAM, so I cannot load all the data at once. Is there any way to train the SVM on this large dataset? I control how the dataset is generated, and it is currently in HDF5 format.
Thanks
Welcome to machine learning! One of the hard things about working in this space is the compute requirements. There are two main kinds of algorithms, online and offline:
Online: supports feeding in examples one at a time, each one improving the model slightly
Offline: supports feeding in the entire dataset at once, often achieving higher accuracy than an online model
Many typical algorithms have both online and offline implementations, but an SVM is not one of them. To the best of my knowledge, SVMs are traditionally an offline-only algorithm. The reason for this lies in a lot of the fine details around "shattering" the dataset. I won't go too far into the math here, but if you read into it, it should become apparent.
It's also worth noting that the training complexity of an SVM is somewhere between n^2 and n^3 in the number of examples, meaning that even if you could load everything into memory, it would take ages to actually train the model. It's very typical to test with a much smaller portion of your dataset before moving to the full dataset.
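One way to act on this is to time the fit on a few small subsets and extrapolate. The sketch below is only illustrative: it uses dist() as a stand-in workload with roughly quadratic cost (an assumption, not an SVM), and time_at_size is a hypothetical helper. With real code you would replace the body with a model fit on n rows. Timings are noisy, so treat the estimated exponent as a rough guide.

```r
# Time a stand-in workload on n rows; swap in your own model fit here.
time_at_size <- function(n) {
  x <- matrix(rnorm(n * 20), nrow = n)
  system.time(dist(x))[["elapsed"]]
}

sizes <- c(500, 1000, 2000)
secs <- vapply(sizes, time_at_size, numeric(1))

# Guard against timings below the clock resolution before taking logs.
secs <- pmax(secs, 1e-3)

# If time ~ c * n^k, then log(time) is linear in log(n) with slope k.
exponent <- unname(coef(lm(log(secs) ~ log(sizes)))[2])

# Extrapolate from the largest measured size to a larger target size.
target_n <- 20000
projected <- secs[length(secs)] * (target_n / sizes[length(sizes)])^exponent

print(exponent)
print(projected)
```

If the projected time for the full dataset comes out in days or weeks, that is a strong hint to subsample or switch algorithms before committing to the full run.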
When moving to the full dataset, you would have to run this on a much larger machine than your own. AWS should have something large enough for you, though at your data size I highly advise using something other than an SVM. At large data sizes, neural net approaches really shine, and they can be trained in a more realistic amount of time.
As alluded to in the comments, there's also the concept of an out-of-core algorithm, which can operate directly on objects stored on disk. The only group I know of with a good offering of out-of-core algorithms is Dato. It's a commercial product, but it might be your best solution here.
A stochastic gradient descent (SGD) approach to SVM could help, as it scales well and avoids the n^2 problem. An implementation available in R is RSofia, which was created by a team at Google and is discussed in Large Scale Learning to Rank. In the paper, they show that, compared to a traditional SVM, the SGD approach significantly decreases the training time (this is due to two things: first, the pairwise learning method, and second, only a subset of the observations ends up being used to train the model).
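To make the idea concrete, here is a rough base-R sketch of a linear SVM trained by SGD on the hinge loss, in the style of the Pegasos algorithm. This is not RSofia's implementation or API; it is a toy version on simulated data, with no bias term, and the chunk loop only imitates reading blocks of rows from disk (in practice each chunk would come from the HDF5 file). The regularisation value lambda and the number of epochs are arbitrary choices.

```r
set.seed(1)

# Simulated, linearly separable data; labels are in {-1, +1}.
n <- 2000
p <- 5
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
w_true <- c(2, -1, 0.5, 0, 0)
y <- ifelse(as.vector(X %*% w_true) > 0, 1, -1)

lambda <- 0.01      # regularisation strength (arbitrary choice)
w <- numeric(p)     # weight vector, no bias term in this sketch
step <- 0

# Row indices grouped into chunks of 200, mimicking chunked reads from disk.
chunks <- split(seq_len(n), ceiling(seq_len(n) / 200))

for (epoch in 1:5) {
  for (idx in chunks) {
    for (i in sample(idx)) {
      step <- step + 1
      eta <- 1 / (lambda * step)          # decaying learning rate
      margin <- y[i] * sum(w * X[i, ])    # computed with the current weights
      w <- (1 - eta * lambda) * w         # shrink (regularisation step)
      if (margin < 1) {
        w <- w + eta * y[i] * X[i, ]      # hinge-loss subgradient step
      }
    }
  }
}

train_acc <- mean(sign(as.vector(X %*% w)) == y)
print(train_acc)
```

Only one chunk needs to be in memory at a time, and each update touches a single row, which is why this style of training fits datasets that are larger than RAM.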
Note that RSofia is a little more bare bones than some of the other SVM packages available in R; for example, you need to do your own centering and scaling of features.
As to your memory problem, it'd be a little surprising if you needed the entire dataset - I would expect that you'd be fine reading in a sample of your data and then training your model on that. To confirm this, you could train multiple models on different samples and then estimate performance on the same holdout set - the performance should be similar across the different models.
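A sketch of that check on simulated data. It uses logistic regression via glm as a stand-in for the SVM (an assumption made so the example needs no extra packages); the sample sizes and the data-generating model are made up.

```r
set.seed(42)

# Simulated binary classification data.
n <- 5000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(1.5 * dat$x1 - dat$x2))

# One fixed holdout set shared by all models.
holdout_idx <- sample(n, 1000)
holdout <- dat[holdout_idx, ]
pool <- dat[-holdout_idx, ]

# Train five models on different random samples of the remaining data
# and score each one on the same holdout set.
acc <- replicate(5, {
  s <- pool[sample(nrow(pool), 500), ]
  fit <- glm(y ~ x1 + x2, data = s, family = binomial)
  pred <- predict(fit, newdata = holdout, type = "response") > 0.5
  mean(pred == holdout$y)
})

print(acc)
print(diff(range(acc)))  # spread across samples
```

If the accuracies are close to each other, a sample of that size is probably enough; if they vary a lot, try larger samples.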
You don't say why you want a linear SVM, but if you can consider another model that often gives superior results, check out the hpelm Python package. It can read an HDF5 file directly. You can find it here: https://pypi.python.org/pypi/hpelm
It trains on segmented data, which can even be pre-loaded (called async) to speed up reading from slow hard disks.
I am using the library e1071. In particular, I'm using the svm function. My dataset has 270 fields and 800,000 rows. I've been running this program for 24+ hours now, and I have no idea if it's hung or still running properly. The command I issued was:
svmmodel <- svm(V260 ~ ., data=traindata);
I'm using Windows, and according to Task Manager, the status of Rgui.exe is "Not Responding". Did R crash already? Are there any tips or tricks to better gauge what's happening inside R or the SVM learning process?
If it helps, here are some additional things I noticed using Resource Monitor (in Windows):
CPU usage is at 13% (stable)
Number of threads is at 3 (stable)
Memory usage is at 10,505.9 MB +/- 1 MB (fluctuates)
As I'm writing this thread, I also see "similar questions" and am clicking on them. It seems that SVM training is quadratic or cubic. But still, after 24+ hours, if it's reasonable to wait, I will wait, but if not, I will have to eliminate SVM as a viable predictive model.
As mentioned in the answer to this question, "SVM training can be arbitrary long" depending on the parameters selected.
If I remember correctly from my ML class, running time is roughly proportional to the square of the number of training examples, so for 800k examples you probably do not want to wait.
Also, as an anecdote, I once ran e1071 in R for more than two days on a smaller data set than yours. It eventually completed, but the training took too long for my needs.
Keep in mind that most ML algorithms, including SVM, will usually not achieve the desired result out of the box. Therefore, when you are thinking about how fast you need it to run, keep in mind that you will have to pay the running time every time you tweak a tuning parameter.
Of course you can reduce this running time by sampling down to a smaller training set, with the understanding that you will be learning from less data.
By default, the svm function from e1071 uses a radial basis kernel, which makes SVM training computationally expensive. You might want to consider a linear kernel (argument kernel="linear") or a specialized library like LiblineaR, which is built for large datasets. But your dataset is really large, and if a linear kernel does not do the trick, then, as suggested by others, you can use a subset of your data to fit the model.
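A sketch combining both suggestions: a linear kernel fitted on a random subset, with the fit timed. The data here is simulated and tiny; the column name V260 is borrowed from the question, while traindata's contents and the subset size are made up. The fit only runs if e1071 is installed.

```r
set.seed(7)

# Simulated stand-in for the real training data.
n <- 1000
traindata <- data.frame(V1 = rnorm(n), V2 = rnorm(n))
traindata$V260 <- factor(ifelse(traindata$V1 + traindata$V2 > 0, "a", "b"))

# Start with a small random subset and grow it once timings look acceptable.
train_subset <- traindata[sample(nrow(traindata), 200), ]

svmmodel <- NULL
if (requireNamespace("e1071", quietly = TRUE)) {
  fit_time <- system.time(
    svmmodel <- e1071::svm(V260 ~ ., data = train_subset, kernel = "linear")
  )
  print(fit_time)
}
```

Timing a few subset sizes this way also answers the "is it hung or still running?" question, since it gives a basis for estimating how long the full fit should take.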
I am using a package called missForest to estimate the missing values in my data set.
My question is: how can we parallelize this process to shorten the time that it takes to get the results?
Please refer to this example (from missForest package):
data(iris)
summary(iris)
The data contains four continuous and one categorical variable.
Artificially produce missing values using the prodNA function:
set.seed(81)
iris.mis <- prodNA(iris, noNA = 0.2)
summary(iris.mis)
Impute missing values providing the complete matrix for illustration. Use 'verbose' to see what happens between iterations:
iris.imp <- missForest(iris.mis, xtrue = iris, verbose = TRUE)
Yesterday I submitted version 1.4 of missForest to CRAN; the Windows and Linux packages are ready, the Mac version will follow soon.
The new function has an additional argument "parallelize", which allows you either to compute the single forests in a parallel fashion (parallelize="forests") or to compute several forests on multiple variables at the same time (parallelize="variables"). The default setting is without parallel computing (parallelize="no").
Do not forget to register a suitable parallel backend, e.g. using the package "doParallel", before trying it for the first time. The "doParallel" vignette gives an illustrative example in Section 4.
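A short sketch of how the pieces might fit together, reusing the iris example from the question. The choice of 2 cores is arbitrary, and the code only runs if both missForest (version 1.4 or later) and doParallel are installed.

```r
imp <- NULL
if (requireNamespace("missForest", quietly = TRUE) &&
    requireNamespace("doParallel", quietly = TRUE)) {
  library(missForest)

  # Register a parallel backend with 2 workers (arbitrary choice).
  doParallel::registerDoParallel(cores = 2)

  # Same artificial missingness as in the question.
  set.seed(81)
  iris.mis <- prodNA(iris, noNA = 0.2)

  # Build each forest in parallel; use "variables" to parallelise
  # across columns instead.
  imp <- missForest(iris.mis, parallelize = "forests")

  # Release the workers registered above.
  doParallel::stopImplicitCluster()

  print(imp$OOBerror)
}
```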
Due to some other details I had to temporarily remove the "missForest" vignette from the package. But I will resolve this in due course and release it as version 1.4-1.
It's a bit tricky to do a good job of parallelizing the missForest function. There seem to be two basic ways to do it:
1. Create the randomForest model objects in parallel;
2. Execute multiple randomForest operations (create model and predict) in parallel for each of the columns of the data frame that contain NA's.
Method 1 is rather easy to implement, except that you have to compute the error estimates yourself since the randomForest combine function doesn't compute them for you. However, if the randomForest objects don't take that long to compute and there are many columns containing NA's, you may get very little if any speed up, even though the operations in aggregate take a long time to compute.
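As a rough illustration of method 1 in isolation (not the code from my modified package), the sketch below grows four small forests on separate workers with the base parallel package and merges them with randomForest's combine function. The iris data, the 2 workers and the 4 x 25 tree split are arbitrary choices, and it only runs if randomForest is installed.

```r
library(parallel)

combined <- NULL
if (requireNamespace("randomForest", quietly = TRUE)) {
  cl <- makeCluster(2)
  clusterSetRNGStream(cl, 123)  # reproducible random streams per worker

  # Each task grows a 25-tree forest on the same data.
  forests <- parLapply(
    cl, rep(25, 4),
    function(ntree, x, y) {
      randomForest::randomForest(x = x, y = y, ntree = ntree)
    },
    x = iris[, 1:4], y = iris$Species
  )
  stopCluster(cl)

  # Merge the sub-forests into one randomForest object. The merged
  # object has no out-of-bag error estimates, so those would need to
  # be computed separately.
  combined <- do.call(randomForest::combine, forests)
  print(combined$ntree)
}
```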
Method 2 is a bit harder to implement because the sequential algorithm updates the columns of the xmis data frame after each randomForest operation. I think the right way to parallelize this is to process n columns in parallel at a time (where n is the number of worker processes), thus requiring another loop around the n columns in order to process all of the columns of the data frame. My experiments suggest that unless this is done, the outer loop takes longer to converge, thus losing the benefit of executing in parallel.
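The blocking scheme for method 2 can be sketched as follows. This only shows how the columns would be grouped, not the imputation itself; the column names and the worker count are hypothetical.

```r
# Hypothetical names of the columns that contain NA's.
na_cols <- c("a", "b", "c", "d", "e", "f", "g")
n_workers <- 3

# Group the columns into consecutive blocks of at most n_workers columns.
blocks <- split(na_cols, ceiling(seq_along(na_cols) / n_workers))

for (block in blocks) {
  # In a real implementation, the columns in this block would be modelled
  # and predicted concurrently (one per worker), and xmis would be updated
  # with all of their imputed values before moving on to the next block.
  message("processed together: ", paste(block, collapse = ", "))
}
```

Updating xmis between blocks keeps the behaviour close to the sequential algorithm, where each column's model sees the latest imputations of the columns processed before it.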
In general, to get a performance improvement you will need to implement both of these methods and choose which to use based on your input data. If you just have a few columns with NA's but the randomForest models take a long time to compute, you should choose method 1. If you have many columns with NA's, you should probably choose method 2, even if the individual randomForest models take a long time to compute, because this can be done more efficiently, although it's possible that it will still require an extra iteration of the outer while loop.
In the process of experimenting with missForest, I eventually developed a parallel version of the package. I put the modified version of library.R on GitHub Gist, however it isn't trivial to use in that form, especially without documentation. So I contacted the author of missForest, and he is very interested in incorporating at least some of my modifications into the official package, so hopefully the next version of missForest that is posted to CRAN will support parallel execution.