Train SVM on a very large dataset stored on hard drive - r

There exist a very large own-collected dataset of size [2000000 12672] where the rows shows the number of instances and the columns, the number of features. This dataset occupies ~60 Gigabyte on the local hard disk. I want to train a linear SVM on this dataset. The problem is that I have only 8 Gigabyte of RAM! so I cannot load all data once. Is there any solution to train the SVM on this large dataset? Generating the dataset is on my own desire, and currently are is HDF5 format.
Thanks

Welcome to machine learning! One of the hard things about working in this space is the compute requirements. There are two main kinds of algorithms, on-line and off-line.
Online: supports feeding in examples one at a time, each one improving the model slightly
Offline: supports feeding in the entire dataset at once, achieving higher accuracy than an On-line model
Many typical algorithms have both on-line, and off-line implementations, but an SVM is not one of them. To the best of my knowledge, SVMs are traditionally an off-line only algorithm. The reason for this is a lot of the fine details around "shattering" the dataset. I won't go too far into the math here, but if you read into it it should become apparent.
It's also worth noting that the complexity of an SVM is somewhere between n^2 and n^3, meaning that even if you could load everything into memory it would take ages to actually train the model. It's very typical to test with a much smaller portion of your dataset before moving to the full dataset.
When moving to the full dataset you would have to run this on a much larger machine than your own, but AWS should have something large enough for you, though at your size of data I highly advise using something other than an SVM. At large data sizes, neural net approaches really shine, and can be trained in a more realistic amount of time.
As alluded to in the comments, there's also the concept of an out-of-core algorithm that can operate directly on objects stored on disk. The only group I know with a good offering of out-of-core algorithms is dato. It's a commercial product, but might be your best solution here.

A stochastic gradient descent approach to SVM could help, as it scales well and avoids the n^2 problem. An implementation available in R is RSofia, which was created by a team at Google and is discussed in Large Scale Learning to Rank. In the paper, they show that compared to a traditional SVM, the SGD approach significantly decreases the training time (this is due to 1, the pairwise learning method and 2, only a subset of the observations end up being used to train the model).
Note that RSofia is a little more bare bones than some of the other SVM packages available in R; for example, you need to do your own centering and scaling of features.
As to your memory problem, it'd be a little surprising if you needed the entire dataset - I would expect that you'd be fine reading in a sample of your data and then training your model on that. To confirm this, you could train multiple models on different samples and then estimate performance on the same holdout set - the performance should be similar across the different models.

You don't say why you want Linear SVM, but if you can consider another model that often gives superior results then check out the hpelm python package. It can read an HDF5 file directly. You can find it here https://pypi.python.org/pypi/hpelm It trains on segmented data, that can even be pre-loaded (called async) to speed up reading from slow hard disks.

Related

Machine learning project: split training/test sets before or after exploratory data analysis?

Is it best to split your data into training and test sets before doing any exploratory data analysis, or do all exploration based solely on training data?
I'm working on my first full machine learning project (a recommendation system for a course capstone project) and am looking for clarification on order of operations. My rough outline is to import and clean, do exploratory analysis, train my model, and then evaluate on a test set.
I am doing exploratory data analysis now - nothing special initially, just starting with variable distributions and whatnot. But I am not sure: should I split my data into training and test sets before or after exploratory analysis?
I don't want to potentially contaminate algorithm training by inspecting the test set. However, I also don't want to miss visual trends that might reflect real signal that my poor human eye might not see after filtering, and thus potentially miss investigating an important and relevant direction while designing my algorithm.
I checked other threads, like this, but the ones I found seem to ask more about things like regularization or actual manipulation of the original data. The answers I found were mixed but prioritized splitting first. However, I don't plan to do any actual manipulation of the data before splitting it (beyond inspecting distributions and potentially doing some factor conversions).
What do you do in your own work and why?
Thanks for helping a new programmer!
To answer this question, we should remind ourselves of why, in machine learning, we split data into training, validation and testing sets (see also this question).
Training sets are used for model development. We often carefully explore this data to get ideas for feature engineering and the general structure of the machine learning model. We then train the model using the training data set.
Usually, our goal is to generate models that will perform well not only on the training data, but also on previously unseen data. Therefore, we want to avoid models that capture the peculiarities of the data we have available now rather than the general structure of the data we will see in the future ("overfitting"). To do so, we assess the quality of the models we're training by evaluating their performance on a different set of data, the validation data, and choose the model that performs best on the validation data.
Having trained our final model, we often want to have an unbiased estimate of its performance. Since we have already used the validation data in the process of model development (we chose the model that performed best on the validation data), we cannot be sure that our model will perform equally well on unseen data. So, to assess model quality, we test performance unsing a new batch of data, the testing data.
This discussion gives the answer your question: We should not use the testing (or validation) data set for exploratory data analysis. Because if we did, we would run the risk of overfitting the model to the peculiarities of the data we have, for example by engineering features that work well for the testing data. At the same time, we would lose the ability of getting an unbiased estimate of our model's performance.
I would take the problem the other way round; is it bad to use the test set ?
The objective of modeling is to end up with a model with low variance (and small bias): that's why the test set is keeping a bunch of data aside to assess how your model behaves with new data (i.e. its variance). If you use the test set during modeling you are left with nothing to do that, and you are overfitting your data.
The objective of EDA is to understand the data you're working with; the distributions of features, their relationships, their dynamics, etc ... If you leave your test set in the data, is there a risk of "overfitting" your understanding of data ? If that was the case, you would observe on say 70% of your data some properties that are not valid for the 30% remaining (test set) ... knowing that the split is random, this is impossible, or you have been extremely unlucky.
From my understanding in Machine Learning Pipeline is exploratory data analysis should be done before splitting the data into train and test.
Here are my reasons:
The data may not be cleaned in the beginning. It might have missing values, mismatch datatypes and outliers.
Need to understand every features with the target variable in the dataset. This will help to understand the importance of every features with respect to the business problem and will help to derive the additional features as well.
The data visualization will also help to get the insights information from the dataset.
Once the above operations done, then we can split the dataset into train and test. Because the features must be similar in both train and test.

R h2o model sizes on disk

I am using the h2o package to train a GBM for a churn prediction problem.
all I wanted to know is what influences the size of the fitted model saved on disk (via h2o.saveModel()), but unfortunately I wasn't able to find an answer anywhere.
more specifically, when I tune the GBM to find the optimal hyperparameters (via h2o.grid()) on 3 non-overlapping rolling windows of the same length, I obtain models whose sizes are not comparable (i.e. 11mb, 19mb and 67mb). the hyperparameters grid is the same, and also the train set sizes are comparable.
naturally the resulting optimized hyperparameters are different across the 3 intervals, but I cannot see how this can produces such a difference in the model sizes.
moreover, when I train the actual models based on those hyperparameters sets, I end up with models with different sizes as well.
any help is appreciated!
thank you
ps. I'm sorry but I cannot share any dataset to make it reproducible (due to privacy restrictions)
It’s the two things you would expect: the number of trees and the depth.
But it also depends on your data. For GBM, the trees can be cut short depending on the data.
What I would do is export MOJOs and then visualize them as described in the document below to get more details on what was really produced:
http://docs.h2o.ai/h2o/latest-stable/h2o-genmodel/javadoc/index.html
Note the 60 MB range does not seem overly large, in general.
If you look at the model info you will find out things about the number of trees, their average depth, and so on. Comparing those between the three best models should give you some insight into what is making the models large.
From R, if m is your model, just printing it gives you most of that information. str(m) gives you all the information that is held.
I think it is worth investigating. The cause is probably that two of those data windows are relatively clear-cut, and only a few fields can define the trees, whereas the third window of data is more chaotic (in the mathematical sense), and you get some deep trees being made as it tries to split that apart into decision trees.
Looking into that third window more deeply might suggest some data engineering you could do, that would make it easier to learn. Or, it might be a difference in your data. E.g. one column is all NULL in your 2016 and 2017 data, but not in your 2018 data, because 2018 was the year you started collecting it, and it is that extra column that allows/causes the trees to become deeper.
Finally, maybe the grid hyperparameters are unimportant as regards performance, and this a difference due to noise. E.g. you have max_depth as a hyperparameter, but the influence on MSE is minor, and noise is a large factor. These random differences could allow your best model to go to depth 5 for two of your data sets (but 2nd best model was 0.01% worse but went to depth 20), but go to depth 30 for your third data set (but 2nd best model was 0.01% worse but only went to depth 5).
(If I understood your question correctly, you've eliminated this as a possibility, as you then trained all three data sets on the same hyperparameters? But I thought I'd include it, anyway.)

A framework for comparing the time performance of Expectation Maximization

I have my own implementation of the Expectation Maximization (EM) algorithm based on this paper, and I would like to compare this with the performance of another implementation. For the tests, I am using k centroids with 1 Gb of txt data and I am just measuring the time it takes to compute the new centroids in 1 iteration. I tried it with an EM implementation in R, but I couldn't, since the result is plotted in a graph and gets stuck when there's a large number of txt data. I was following the examples in here.
Does anybody know of an implementation of EM that can measure its performance or know how to do it with R?
Fair benchmarking of EM is hard. Very hard.
the initialization will usually involve random, and can be very different. For all I know, the R implementation by default uses hierarchical clustering to find the initial clusters. Which comes at O(n^2) memory and most likely at O(n^3) runtime cost. In my benchmarks, R would run out of memory due to this. I assume there is a way to specify initial cluster centers/models. A random-objects initialization will of course be much faster. Probably k-means++ is a good way to choose initial centers in practise.
EM theoretically never terminates. It just at some point does not change much anymore, and thus you can set a threshold to stop. However, the exact definition of the stopping threshold varies.
There exist all kinds of model variations. A method only using fuzzy assignments such as Fuzzy-c-means will of course be much faster than an implementation using multivariate Gaussian Mixture Models with a covaraince matrix. In particular with higher dimensionality.
Covariance matrixes also need O(k * d^2) memory, and the inversion will take O(k * d^3) time, and thus is clearly not appropriate for text data.
Data may or may not be appropriate. If you run EM on a data set that actually has Gaussian clusters, it will usually work much better than on a data set that doesn't provide a good fit at all. When there is no good fit, you will see a high variance in runtime even with the same implementation.
For a starter, try running your own algorithm several times with different initialization, and check your runtime for variance. How large is the variance compared to the total runtime?
You can try benchmarking against the EM implementation in ELKI. But I doubt the implementation will work with sparse data such as text - that data just is not Gaussian, it is not proper to benchmark. Most likely it will not be able to process the data at all because of this. This is expected, and can be explained from theory. Try to find data sets that are dense and that can be expected to have multiple gaussian clusters (sorry, I can't give you many recommendations here. The classic Iris and Old Faithful data sets are too small to be useful for benchmarking.

How can I tell if R is still estimating my SVM model or has crashed?

I am using the library e1071. In particular, I'm using the svm function. My dataset has 270 fields and 800,000 rows. I've been running this program for 24+ hours now, and I have no idea if it's hung or still running properly. The command I issued was:
svmmodel <- svm(V260 ~ ., data=traindata);
I'm using windows, and using the task manager, the status of Rgui.exe is "Not Responding". Did R crash already? Are there any other tips / tricks to better gauge to see what's happening inside R or the SVM learning process?
If it helps, here are some additional things I noticed using resource monitor (in windows):
CPU usage is at 13% (stable)
Number of threads is at 3 (stable)
Memory usage is at 10,505.9 MB +/- 1 MB (fluctuates)
As I'm writing this thread, I also see "similar questions" and am clicking on them. It seems that SVM training is quadratic or cubic. But still, after 24+ hours, if it's reasonable to wait, I will wait, but if not, I will have to eliminate SVM as a viable predictive model.
As mentioned in the answer to this question, "SVM training can be arbitrary long" depending on the parameters selected.
If I remember correctly from my ML class, running time is roughly proportional to the square of the number training examples, so for 800k examples you probably do not want to wait.
Also, as an anecdote, I once ran e1071 in R for more than two days on a smaller data set than yours. It eventually completed, but the training took too long for my needs.
Keep in mind that most ML algorithms, including SVM, will usually not achieve the desired result out of the box. Therefore, when you are thinking about how fast you need it to run, keep in mind that you will have to pay the running time every time you tweak a tuning parameter.
Of course you can reduce this running time by sampling down to a smaller training set, with the understanding that you will be learning from less data.
By default the function "svm" from e1071 uses radial basis kernel which makes svm induction computationally expensive. You might want to consider using a linear kernel (argument kernel="linear") or use a specialized library like LiblineaR built for large datasets. But your dataset is really large and if linear kernel does not do the trick then as suggested by others you can use a subset of your data to generate the model.

SVM modeling with BIG DATA

For modeling with SVM in R, I have used kernlab package (ksvm method)with Windows Xp operating system and 2 GB RAM. But having more data rows as 201497, I can'nt able to provide more memory for processing of data modeling (getting issue : can not allocate vector size greater than 2.7 GB).
Therefore, I have used Amazon micro and large instance for SCM modeling. But, it have same issue as local machine (can not allocate vector size greater than 2.7 GB).
Can any one suggest me the solution of this problem with BIG DATA modeling or Is there something wrong with this.
Without a reproducible example it is hard to say if the dataset is just too big, or if some parts of your script are suboptimal. A few general pointers:
Take a look at the High Performance Computing Taskview, this lists the main R packages relevant for working with BigData.
You use your entire dataset for training your model. You could try to take a subset (say 10%) and fit your model on that. Repeating this procedure a few times will yield insight into if the model fit is sensitive to which subset of the data you use.
Some analysis techniques, e.g. PCA analysis, can be done by processing the data iteratively, i.e. in chunks. This makes analyses possible on very big datasets possible (>> 100 gb). I'm not sure if this is possible with kernlab.
Check if the R version you are using is 64 bit.
This earlier question might be of interest.

Resources