Profiling SVM (e1071) in R

I am new to R and SVMs and I am trying to profile the svm function from the e1071 package. However, I can't find a dataset large enough to give me a good range of profiling results as I vary the size of the input data. Does anyone know how to give svm a workout? Which dataset should I use? Are there any particular parameters to svm that make it work harder?
Below are the commands I am using to test the performance. They probably show what I am trying to do more clearly:
#loading libraries
library(class)
library(e1071)
#I've been using golubEsets (more examples availables)
library(golubEsets)
#get the data: matrix 7129x38
data(Golub_Train)
n <- exprs(Golub_Train)
#duplicate rows(to make the dataset larger)
n<-rbind(n,n)
#take training samples as a vector
samplelabels <- as.vector(Golub_Train@phenoData@data$ALL.AML)
#calculate svm and profile it
Rprof('svm.out')
svmmodel1 <- svm(x=t(n), y=samplelabels, type='C', kernel="radial", cross=10)
Rprof(NULL)
I keep increasing the dataset by duplicating rows and columns, but I reach the memory limit instead of making svm work harder...

In terms of "working SVM out" - what will make SVM work "harder" is a more complex model which is not easily separated, higher dimensionality and a larger, denser dataset.
SVM performance degrades with:
Dataset size increases (number of data points)
Sparsity decreases (fewer zeros)
Dimensionality increases (number of attributes)
Non-linear kernels are used (and kernel parameters can make the kernel evaluation more complex)
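A minimal sketch of how the dataset-size effect could be measured, assuming synthetic data is acceptable for profiling. The helper make_data and the sizes used are hypothetical choices for illustration, not something from the question; only svm() and system.time() are real calls.

```r
library(e1071)

# Hypothetical helper: two overlapping Gaussian classes in p dimensions.
make_data <- function(n, p, shift = 1) {
  y <- factor(rep(c("a", "b"), length.out = n))
  x <- matrix(rnorm(n * p), nrow = n, ncol = p)
  # Move class "b" so the class means are `shift` apart in total.
  x[y == "b", ] <- x[y == "b", ] + shift / sqrt(p)
  list(x = x, y = y)
}

set.seed(42)
sizes <- c(250, 500, 1000, 2000)   # assumed sizes; increase as memory allows
timings <- sapply(sizes, function(n) {
  d <- make_data(n, p = 20)
  # Elapsed seconds for one radial-kernel fit.
  system.time(svm(d$x, d$y, kernel = "radial"))[["elapsed"]]
})
print(data.frame(n = sizes, seconds = timings))
```

Generating fresh points at each size, rather than duplicating rows, keeps the problem from collapsing into repeated copies of the same support vectors.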
Varying Parameters
There are parameters you can change to make SVM take longer. Of course, these parameters also affect the quality of the solution you get, and some settings may not make sense to use in practice.
Using C-SVM, varying C will result in different runtimes. (The similar parameter in nu-SVM is nu.) If the dataset is reasonably separable, making C smaller will result in a longer runtime because the SVM will allow more training points to become support vectors. If the dataset is not very separable, making C bigger will cause longer runtimes because you are essentially telling SVM you want a narrow-margin solution that fits tightly to the data, and that takes much longer to compute when the data doesn't easily separate.
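The effect of C could be checked with a sketch like the following. In e1071 the C parameter is passed as cost, and the fitted object stores the total number of support vectors in tot.nSV. The data and the cost values are assumptions chosen only to produce overlapping classes.

```r
library(e1071)

set.seed(1)
n <- 1000
# Overlapping classes: the class means differ by one standard deviation
# in the first of ten features.
y <- factor(rep(c(-1, 1), each = n / 2))
x <- matrix(rnorm(n * 10), ncol = 10)
x[y == "1", 1] <- x[y == "1", 1] + 1

costs <- c(0.01, 1, 100)   # assumed values for illustration
res <- do.call(rbind, lapply(costs, function(C) {
  elapsed <- system.time(m <- svm(x, y, kernel = "radial", cost = C))[["elapsed"]]
  data.frame(cost = C, seconds = elapsed, support_vectors = m$tot.nSV)
}))
print(res)
```

Comparing the seconds and support_vectors columns across rows shows which direction of C is expensive for a given dataset.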
Often you find when doing a parameter search that there are parameters that will increase computation time with no appreciable increase in accuracy.
The other parameters are kernel parameters, and if you vary them to increase the complexity of calculating the kernel then naturally the SVM runtime will increase. The linear kernel is simple and will be the fastest; non-linear kernels will of course take longer. Some parameters may not increase the calculation complexity of the kernel but will force a much more complex model, for which SVM may take much longer to find the optimal solution.
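As a rough sketch of comparing kernels and kernel parameters, the following times several fits on the same data. The XOR-style dataset and the gamma and degree values are assumptions for illustration; which kernel comes out fastest depends on the data, so no ordering is asserted here.

```r
library(e1071)

set.seed(7)
n <- 1500
x <- matrix(runif(n * 2), ncol = 2)
# XOR-style labels: not linearly separable.
y <- factor((x[, 1] > 0.5) != (x[, 2] > 0.5))

# Hypothetical helper: elapsed seconds for one fit with the given settings.
time_fit <- function(...) system.time(svm(x, y, ...))[["elapsed"]]

kernel_times <- c(
  linear       = time_fit(kernel = "linear"),
  radial_g1    = time_fit(kernel = "radial", gamma = 1),
  radial_g100  = time_fit(kernel = "radial", gamma = 100),
  polynomial_3 = time_fit(kernel = "polynomial", degree = 3)
)
print(kernel_times)
```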
Datasets to Use:
The UCI Machine Learning Repository is a great source of datasets.
The MNIST handwriting recognition dataset is a good one to use - you can randomly select subsets of the data to create increasingly large datasets. Keep in mind that the data at the link contains all digits. SVM is inherently binary, so you would have to reduce the data to just two digits or use some kind of multi-class SVM.
You can easily generate datasets as well. To generate a linear dataset, randomly select a normal vector to a hyperplane, then generate a datapoint and determine which side of the hyperplane it falls on to label it. Add some randomness to allow points within a certain distance of the hyperplane to sometimes be labeled differently. Increase the complexity by increasing that overlap between classes. Or generate some number of clusters of normally distributed points, labeled either 1 or -1, so that the distributions overlap at the edges. The classic non-linear example is a checkerboard: generate points and label them in a checkerboard pattern. To make it more difficult, increase the number of squares, the number of dimensions and the number of datapoints. You will have to use a non-linear kernel for that, of course.
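The two generators described above can be sketched in base R as follows. The function names, the noise width, the 50% flip rate and the unit-square domain are all assumptions made for illustration.

```r
# Linear data: label by side of a random hyperplane, with label noise near it.
make_linear <- function(n, p, noise_width = 0.2) {
  w <- rnorm(p)
  w <- w / sqrt(sum(w^2))                  # random unit normal vector
  x <- matrix(runif(n * p, -1, 1), ncol = p)
  d <- drop(x %*% w)                       # signed distance to the hyperplane
  y <- ifelse(d > 0, 1, -1)
  near <- abs(d) < noise_width
  flip <- near & (runif(n) < 0.5)          # flip about half of the near points
  y[flip] <- -y[flip]
  list(x = x, y = factor(y))
}

# Checkerboard labels on the unit square with `squares` cells per side.
make_checkerboard <- function(n, squares = 4) {
  x <- matrix(runif(n * 2), ncol = 2)
  cell <- floor(x * squares)               # integer cell index per coordinate
  y <- ifelse((cell[, 1] + cell[, 2]) %% 2 == 0, 1, -1)
  list(x = x, y = factor(y))
}

set.seed(123)
lin <- make_linear(1000, p = 5)
chk <- make_checkerboard(1000, squares = 4)
print(table(lin$y))
print(table(chk$y))
```

Raising noise_width increases the class overlap, and raising squares makes the checkerboard harder for a non-linear kernel.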

Related

Speed up estimation of overlapping additive models (mgcv)

I have some set of variables and I'm fitting many (hundreds of thousands) additive models, each of which includes a subset of all the variables. The dependent variable is the same in every case, and some of the models overlap or are nested. Not all of the independent variables have to enter the model nonparametrically. For clarity, I might have a set of variables {x1,x2,x3,x4,x5} and estimate:
a) y=c+f(x1)+f(x2),
b) y=c+x1+f(x2),
c) y=c+f(x1)+f(x2)+x3, etc.
I'm wondering if there is anything I can do to speed up the gam estimation in this case? Is there anything that is being calculated over and over again that I could calculate once and supply to the function?
What I have already tried:
Memoization since the models repeat exactly from time to time.
Reluctantly switched from thin plate regression splines to cubic regression splines (quite a significant improvement).
The mgcv guide says:
The user can retain most of the advantages of the t.p.r.s. approach by supplying a reduced set of covariate values from which to obtain the basis - typically the number of covariate values used will be substantially smaller than the number of data, and substantially larger than the basis dimension, k.
This caused quite a noticeable improvement with smaller models, e.g. 5 smooths, but not with larger models, e.g. 10 smooths. In fact, in the latter case, it often caused the estimation to take (potentially much) longer.
What I'd like to try but don't know if it's possible:
One obvious thing that repeats itself in both, say, y=c+f(x1)+f(x2) and y=c+x1+f(x2), is the calculation of the basis for f(x2). If I were to use the same knots every time, how (if it's possible at all) could I precalculate the basis for every variable and then supply that to mgcv? Would you expect this to bring a significant time improvement?
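One possible way to do this, offered only as a sketch: mgcv exports smoothCon(), which builds a smooth's basis and penalty matrices outside of gam(), and gam() accepts penalized parametric terms through its paraPen argument. The toy data and the names sm2, X2 and S2 are assumptions, and whether this actually saves time compared with letting gam() rebuild the basis would need to be measured.

```r
library(mgcv)

set.seed(1)
n <- 500
dat <- data.frame(x1 = runif(n), x2 = runif(n))
dat$y <- sin(2 * pi * dat$x1) + dat$x2^2 + rnorm(n, sd = 0.2)

# Build the cubic regression spline basis for x2 once.
sm2 <- smoothCon(s(x2, bs = "cr", k = 10), data = dat, absorb.cons = TRUE)[[1]]
X2 <- sm2$X        # basis matrix with the identifiability constraint absorbed
S2 <- sm2$S[[1]]   # matching penalty matrix

# Reuse the same precomputed basis in two different models via paraPen.
fit_a <- gam(y ~ s(x1, bs = "cr") + X2, data = dat,
             paraPen = list(X2 = list(S2)))
fit_b <- gam(y ~ x1 + X2, data = dat,
             paraPen = list(X2 = list(S2)))
```

A term supplied this way is treated as a penalized parametric term, so it will not appear as a smooth in summary() or plot() output.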
Is there anything else you'd recommend?

R h2o model sizes on disk

I am using the h2o package to train a GBM for a churn prediction problem.
All I want to know is what influences the size of the fitted model saved on disk (via h2o.saveModel()), but unfortunately I wasn't able to find an answer anywhere.
More specifically, when I tune the GBM to find the optimal hyperparameters (via h2o.grid()) on 3 non-overlapping rolling windows of the same length, I obtain models whose sizes are not comparable (i.e. 11 MB, 19 MB and 67 MB). The hyperparameter grid is the same, and the training set sizes are comparable.
Naturally the resulting optimized hyperparameters are different across the 3 intervals, but I cannot see how this can produce such a difference in the model sizes.
Moreover, when I train the actual models based on those hyperparameter sets, I end up with models of different sizes as well.
Any help is appreciated!
Thank you.
PS: I'm sorry, but I cannot share any dataset to make it reproducible (due to privacy restrictions).
It’s the two things you would expect: the number of trees and the depth.
But it also depends on your data. For GBM, the trees can be cut short depending on the data.
What I would do is export MOJOs and then visualize them as described in the document below to get more details on what was really produced:
http://docs.h2o.ai/h2o/latest-stable/h2o-genmodel/javadoc/index.html
Note the 60 MB range does not seem overly large, in general.
If you look at the model info you will find out things about the number of trees, their average depth, and so on. Comparing those between the three best models should give you some insight into what is making the models large.
From R, if m is your model, just printing it gives you most of that information. str(m) gives you all the information that is held.
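As a sketch of pulling just the size-related figures out of a model, assuming a running H2O cluster (which requires Java) and using iris as a stand-in for the real data: for a GBM the summary table in m@model$model_summary is expected to carry the number of trees, depth statistics and the model size in bytes. The column names below are what a GBM summary normally prints, but treat them as an assumption and check names(ms) on your version.

```r
library(h2o)
h2o.init()

# Toy stand-in for the real data; iris ships with R.
train <- as.h2o(iris)
m <- h2o.gbm(x = 1:4, y = 5, training_frame = train,
             ntrees = 20, max_depth = 5, seed = 1)

# Summary table: number of trees, depth statistics, size in bytes.
ms <- m@model$model_summary
print(ms[, c("number_of_trees", "model_size_in_bytes",
             "min_depth", "max_depth", "mean_depth", "mean_leaves")])
```

Running the same extraction on the best model from each of the three windows puts the tree counts and depths side by side with the sizes.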
I think it is worth investigating. The cause is probably that two of those data windows are relatively clear-cut, and only a few fields can define the trees, whereas the third window of data is more chaotic (in the mathematical sense), and you get some deep trees being made as it tries to split that apart into decision trees.
Looking into that third window more deeply might suggest some data engineering you could do, that would make it easier to learn. Or, it might be a difference in your data. E.g. one column is all NULL in your 2016 and 2017 data, but not in your 2018 data, because 2018 was the year you started collecting it, and it is that extra column that allows/causes the trees to become deeper.
Finally, maybe the grid hyperparameters are unimportant as regards performance, and this is a difference due to noise. E.g. you have max_depth as a hyperparameter, but its influence on MSE is minor, and noise is a large factor. These random differences could allow your best model to go to depth 5 for two of your data sets (where the 2nd best model was 0.01% worse but went to depth 20), but go to depth 30 for your third data set (where the 2nd best model was 0.01% worse but only went to depth 5).
(If I understood your question correctly, you've eliminated this as a possibility, as you then trained all three data sets on the same hyperparameters? But I thought I'd include it, anyway.)

Can you calculate the size of a RandomForest before you run it?

When running RandomForest, is there a way to use the number of rows and columns from the input data, plus the options of the forest (trees and trys) to calculate the size of the forest (in bytes) before it's run?
The specific issue I'm having is when running my final RandomForest (as opposed to exploratory), I want as robust a model as possible. I want to run right up to my memory limit without hitting it. Right now, I'm just doing trial and error, but I'm looking for a more precise way.
"I want to run right up to my memory limit without hitting it."
Why do you want to do that? Instead of pushing your resources to the limit, you should just use whatever resources are required to build a good random forest model. In my experience, I have rarely run into memory limit problems when running random forests. This is because I train on a reasonably sized subset of the actual data set.
The randomForest function (from the randomForest package) has two parameters which influence how large the forest will become. The first is ntree, which is the number of trees to be used when building the forest. The fewer the trees, the smaller the size of the model. Another parameter is nodesize, which sets the minimum size of the terminal (leaf) nodes of each tree. The smaller the node size, the more splitting has to be done in each tree, and the larger the forest model.
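Rather than deriving the size analytically, one option is to measure it on small runs and extrapolate. The sketch below fits forests on synthetic data for a few ntree and nodesize settings and reports object.size() for each; the data, the grid values and the helper forest_bytes are assumptions for illustration.

```r
library(randomForest)

set.seed(1)
n <- 2000
x <- data.frame(matrix(rnorm(n * 10), ncol = 10))
y <- factor(x[[1]] + x[[2]] + rnorm(n) > 0)

# Hypothetical helper: size in bytes of a forest fitted with these settings.
forest_bytes <- function(ntree, nodesize) {
  rf <- randomForest(x, y, ntree = ntree, nodesize = nodesize)
  as.numeric(object.size(rf))
}

grid <- expand.grid(ntree = c(50, 200), nodesize = c(1, 50))
grid$bytes <- mapply(forest_bytes, grid$ntree, grid$nodesize)
print(grid)
```

Measuring on a small ntree and scaling up gives a rough estimate for the final run. Note that object.size() covers only the stored model, not the working memory used during training.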
You should experiment with these parameters, and also train on a reasonably-sized training set. The metric for a good model is not how close you come to maxing out your memory limit, but rather how robust a model you build.

calibration and liftchart with caret R package

I am comparing various predictive models on a binary classification task using the caret R package with respect to their predictive performance (liftChart) and prediction accuracy (calibration plot). I found the following issues:
1. Sometimes the lift function is very slow when the number of observations is large or there are several competing classifiers. In addition, I wonder whether it is possible to manually define the cuts of the calibration plot. I have a severely imbalanced problem (the average probability is 5%) and the calibration plot function assumes evenly spaced cuts.
The lift plot does the calculation for every unique probability value (much like an ROC curve), which is why it is slow.
Neither of those options is available right now. You can add two issues to the GitHub page. I'm fairly swamped right now, but those shouldn't be a big deal to change (you could always contribute solutions too).
Max
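Until such an option exists, a calibration table with hand-picked cuts can be sketched in base R, outside of caret. The function calibration_table, the simulated probabilities and the break points are all assumptions for illustration.

```r
# Observed event rate per user-defined probability bin.
calibration_table <- function(prob, observed, breaks) {
  bin <- cut(prob, breaks = breaks, include.lowest = TRUE)
  data.frame(
    bin       = levels(bin),
    n         = as.vector(table(bin)),
    mean_pred = as.vector(tapply(prob, bin, mean)),
    obs_rate  = as.vector(tapply(observed, bin, mean))
  )
}

set.seed(10)
prob <- rbeta(5000, 1, 19)             # mean predicted probability of 5%
observed <- rbinom(5000, 1, prob)      # 1 = event, simulated as well calibrated
breaks <- c(0, 0.01, 0.02, 0.05, 0.1, 0.2, 1)   # finer bins where the data are
ct <- calibration_table(prob, observed, breaks)
print(ct)
```

Plotting obs_rate against mean_pred gives the calibration curve. For the lift computation, rounding the probabilities beforehand (for example to three decimals) should reduce the number of unique values it has to process, though that is an untested assumption here.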

Running regression tree on large dataset in R

I am working with a dataset of roughly 1.5 million observations. I am finding that running a regression tree (I am using the mob()* function from the party package) on more than a small subset of my data is taking extremely long (I can't run on a subset of more than 50k obs).
I can think of two main problems that are slowing down the calculation
The splits are being calculated at each step using the whole dataset. I would be happy with results that chose the variable to split on at each node based on a random subset of the data, as long as it continues to replenish the size of the sample at each subnode in the tree.
The operation is not being parallelized. It seems to me that as soon as the tree has made its first split, it ought to be able to use two processors, so that by the time there are 16 splits each of the processors in my machine would be in use. In practice it seems like only one is getting used.
Does anyone have suggestions on either alternative tree implementations that work better for large datasets or for things I could change to make the calculation go faster**?
* I am using mob(), since I want to fit a linear regression at the bottom of each node, to split up the data based on their response to the treatment variable.
** One thing that seems to be slowing down the calculation a lot is that I have a factor variable with 16 types. Calculating which subset of the variable to split on seems to take much longer than other splits (since there are so many different ways to group them). This variable is one that we believe to be important, so I am reluctant to drop it altogether. Is there a recommended way to group the types into a smaller number of values before putting it into the tree model?
My response comes from a class I took that used these slides (see slide 20).
The statement there is that there is no easy way to deal with categorical predictors with a large number of categories. Also, I know that decision trees and random forests will automatically prefer to split on categorical predictors with a large number of categories.
A few recommended solutions:
Bin your categorical predictor into fewer bins (that are still meaningful to you).
Order the predictor according to means (slide 20). This is my Prof's recommendation. In R, this would lead me to use an ordered factor.
Finally, you need to be careful about the influence of this categorical predictor. For example, one thing I know that you can do with the randomForest package is to set the randomForest parameter mtry to a lower number. This controls the number of variables that the algorithm looks through for each split. When it's set lower, your categorical predictor will be considered in fewer splits relative to the rest of the variables. This will speed up estimation, and the decorrelation that the randomForest method provides helps ensure you don't overfit to your categorical variable.
Finally, I'd recommend looking at the MARS or PRIM methods. My professor has some slides on that here. I know that PRIM is known for being low in computational requirement.
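The ordering-by-means suggestion can be sketched in base R as below. The simulated 16-level factor and response are assumptions for illustration. The motivation is that an unordered factor with 16 levels has 2^15 - 1 = 32767 possible binary partitions, while an ordered one has only 15 cut points. Whether mob() takes advantage of the ordering in the same way should be checked against the party documentation.

```r
set.seed(5)
n <- 1000
type <- factor(sample(LETTERS[1:16], n, replace = TRUE))
y <- as.integer(type) * 0.1 + rnorm(n)   # toy response

# Reorder the levels by mean response, then mark the factor as ordered.
level_means <- tapply(y, type, mean)
type_ord <- factor(type, levels = names(sort(level_means)), ordered = TRUE)

# Optional coarser binning: group the ordered levels into four bins.
type_binned <- cut(as.integer(type_ord), breaks = c(0, 4, 8, 12, 16),
                   labels = c("low", "mid_low", "mid_high", "high"))
print(table(type_binned))
```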
