I am new to R and data analysis.
I am hitting the wall as my hardware is not able to process the whole training set for computing a model.
I was thinking by using the caret package, am I able to train the model by breaking the training data in batches i.e. training the model with the first 1000 rows, followed by the next 1000 rows and so on and so forth? I then will be able to trim the model at every stage to save memory.
Will the model be “updated” with every feed of the batch of training data?
I know this method is known as sequential training but wasn’t able to find a practical example/case study.
Hope to get some guidance on this. Thanks.
Related
I'm using caret's timeslice method to do step-ahead cross validation on time series data. I was surprised to find that:
the 'best' hyperparameters chosen by caret are those with the best average performance across all train/test splits, and similarly the reported performance is the average across all train/test splits based on these hyperparam values, and
caret trains a final model using all of the data available - which makes sense when fixedWindow = TRUE but perhaps not otherwise.
My data are non-stationary, so I'd like hyperparameter tuning, performance reporting and final model training to be based on recent data so that:
optimal hyperparameter values can change as underlying relationships change
reported performance can account for the fact that the best hyperparam values may change across splits, and
recent changes in underlying relationships are best captured when training the final model.
I'm wondering if the best approach for my non-stationary data would follow an approach something like:
split each training fold into a training subset and validation subset - using the validation subset to pick hyperparam values within each training fold.
train on the entire training fold using the hyperparam values selected in (1)
report performance based on whatever hyperparameter values were selected in (1), even though these may change from fold to fold
The final model is trained, and hyperparameter values selected, based on steps (1) and (2) using the most recent data only.
Having typed this up I've realised that I'm just describing nested CV for time series. My questions:
is this best practice for training time series models when data are non-stationary?
can caret, or another R package, do this?
Is it best to split your data into training and test sets before doing any exploratory data analysis, or do all exploration based solely on training data?
I'm working on my first full machine learning project (a recommendation system for a course capstone project) and am looking for clarification on order of operations. My rough outline is to import and clean, do exploratory analysis, train my model, and then evaluate on a test set.
I am doing exploratory data analysis now - nothing special initially, just starting with variable distributions and whatnot. But I am not sure: should I split my data into training and test sets before or after exploratory analysis?
I don't want to potentially contaminate algorithm training by inspecting the test set. However, I also don't want to miss visual trends that might reflect real signal that my poor human eye might not see after filtering, and thus potentially miss investigating an important and relevant direction while designing my algorithm.
I checked other threads, like this, but the ones I found seem to ask more about things like regularization or actual manipulation of the original data. The answers I found were mixed but prioritized splitting first. However, I don't plan to do any actual manipulation of the data before splitting it (beyond inspecting distributions and potentially doing some factor conversions).
What do you do in your own work and why?
Thanks for helping a new programmer!
To answer this question, we should remind ourselves of why, in machine learning, we split data into training, validation and testing sets (see also this question).
Training sets are used for model development. We often carefully explore this data to get ideas for feature engineering and the general structure of the machine learning model. We then train the model using the training data set.
Usually, our goal is to generate models that will perform well not only on the training data, but also on previously unseen data. Therefore, we want to avoid models that capture the peculiarities of the data we have available now rather than the general structure of the data we will see in the future ("overfitting"). To do so, we assess the quality of the models we're training by evaluating their performance on a different set of data, the validation data, and choose the model that performs best on the validation data.
Having trained our final model, we often want to have an unbiased estimate of its performance. Since we have already used the validation data in the process of model development (we chose the model that performed best on the validation data), we cannot be sure that our model will perform equally well on unseen data. So, to assess model quality, we test performance unsing a new batch of data, the testing data.
This discussion gives the answer your question: We should not use the testing (or validation) data set for exploratory data analysis. Because if we did, we would run the risk of overfitting the model to the peculiarities of the data we have, for example by engineering features that work well for the testing data. At the same time, we would lose the ability of getting an unbiased estimate of our model's performance.
I would take the problem the other way round; is it bad to use the test set ?
The objective of modeling is to end up with a model with low variance (and small bias): that's why the test set is keeping a bunch of data aside to assess how your model behaves with new data (i.e. its variance). If you use the test set during modeling you are left with nothing to do that, and you are overfitting your data.
The objective of EDA is to understand the data you're working with; the distributions of features, their relationships, their dynamics, etc ... If you leave your test set in the data, is there a risk of "overfitting" your understanding of data ? If that was the case, you would observe on say 70% of your data some properties that are not valid for the 30% remaining (test set) ... knowing that the split is random, this is impossible, or you have been extremely unlucky.
From my understanding in Machine Learning Pipeline is exploratory data analysis should be done before splitting the data into train and test.
Here are my reasons:
The data may not be cleaned in the beginning. It might have missing values, mismatch datatypes and outliers.
Need to understand every features with the target variable in the dataset. This will help to understand the importance of every features with respect to the business problem and will help to derive the additional features as well.
The data visualization will also help to get the insights information from the dataset.
Once the above operations done, then we can split the dataset into train and test. Because the features must be similar in both train and test.
I am relatively new to the machine learning ocean, please excuse me if some of my questions are really basic.
Current situation: The overall goal was trying to improve some code for h2o package in r running on the supercomputer cluster. However, since the data is too large that single node with h2o really takes more than a day, therefore, we have decided to use multiple nodes to run the model. I came up with an idea:
(1) Distribute each node to build (nTree/num_node) trees and saved into a model;
(2) running on the cluster at each node for (nTree/num_node) number of trees in the forest;
(3) Merging the trees back together and reform the original forest, and using the measurement results in average.
I later realized this could be risky. But I cannot find the actual support or against statement since I am not machine learning focused programmer.
Questions:
if this way of handling random forest will result in some risk, please reference me the link so I can have a basic idea why this is not right.
If this way is actually an "ok" way to do so. What should I be do to merge the trees, is there a package or method I can borrow from?
If this is actually a solved problem, please reference me the link, I may have searched the wrong keywords, and thank you!
The real number-involved example I can present here is:
I have a random forest task with 80k rows and 2k columns and wanted the number of trees are 64. What I have done is put 16 trees on each node running with the whole dataset, and each one of four nodes come up with an RF model. I am now trying to merge the trees from each model into this one big RF model and average the measurements (from each of those four models).
There is no need to merge the models. Unlike with boosting methods, every tree in a Random Forest is grown independently (just don't set the same seed prior to kicking off RF on each node!).
You are basically doing what Random Forest does on its own, which is to grow X independent trees and then average across the votes. Many packages provide an option to specify the number of cores or threads, in order to take advantage of this feature of RF.
In your case, since you have the same number of trees per node, you'll get 4 "models" back, but those are really just collections of 16 trees. To use it, I'd just keep the 4 models separate and when you want a prediction, average the prediction from each of the 4 models. Assuming you're going to be doing that more than once, you could write a small wrapper function to predict with the 4 models and average the output.
10,000 rows by 1,000 columns is not overly large and should not take that long to train an RF model.
It sound like something unexpected is happening.
While you can try to average models if you know what you are doing, I don't think it should be necessary in this case.
I'm very new to all this and I have a bit of a mental block on the logic of the process. I am trying to predict customer churn using a database of current and already churned customers. So far I have
1) Taken complete customer database of current customers and already churned customers along with customer service variables etc to use to predict on.
2) Split the data set randomly 70/30 into train and test
3) Using R, I have trained a random forest model to predict make predictions and then compared to the actual status using a confusion matrix.
4) I have ran that model using the test data to check accuracy for identifying the churners
I'm now a bit confused. What I want to do now is take all of our current customers and predict which ones will churn. Have I done this all wrong as alot of the current customers I need to predict if will churn have already been seen by the model as they appear in the training set?
Was I somehow supposed to use a training and test set that will not be part of the dataset I need to make predictions on?
Many thanks for any help.
As far as I have understood your question, I feel you want to know if you've done the right thing by using overlapping examples in your training and test set. You first need to understand that you need to keep your training set separate from your test set. Since your model parameters have been computed based on your training set, for similar examples in the test set, the model will give you the correct prediction, so your accuracy will definitely be positively impacted for those common training and test set examples but that is not the correct thing to do. Your test set should always contain previously unseen examples in order to properly evaluate the performance of your algorithm.
If your current customers (on which you want to test your model) are already there in the training set, you would want to leave them out in the testing process. I'd suggest you perform a check between the training set customers and the current set of customers based on some unique identifier (if present) such as the Customer ID and leave common customers out of your fresh batch of unseen test examples.
It looks to me that you have the standard training-test-validation set problem. If I understood correctly, you want to test the performance of your model (Random Forest) to all the data you have.
Standard classroom way to do this is indeed what you already did: Split the dataset for example 70% training and 30% test/validation set, train the model with training set and test with test set.
Better way to test (and predict for all of the data) is to use Cross-Validation to perform the analysis (https://en.wikipedia.org/wiki/Cross-validation_(statistics)). One example for cross-validation is 10-fold cross-validation: You split your data to 10 equal size blocks, loop over all the blocks and for every iteration use the remaining 9 blocks to train your model and the test the model on the specific block.
What you end up with cross-validation is a more comprehensive knowledge of the performance of your model, as well as the results for all of the customers in your database. Cross-validation mitigates the errors in analysis due to random selection of the test set.
Hope this helps!
I'm trying to use knn in R (used several packages(knnflex, class)) to predict the probability of default based on 8 variables. The dataset is about 100k lines of 8 columns, but my machine seems to be having difficulty with a sample of 10k lines. Any suggestions for doing knn on a dataset > 50 lines (ie iris)?
EDIT:
To clarify there are a couple issues.
1) The examples in the class and knnflex packages are a bit unclear and I was curious if there was some implementation similar to the randomForest package where you give it the variable you want to predict and the data you want to use to train the model:
RF <- randomForest(x, y, ntree, type,...)
then turn around and use the model to predict data using the test data set:
pred <- predict(RF, testData)
2) I'm not really understanding why knn wants training AND test data for building the model. From what I can tell, the package creates a matrix ~ to nrows(trainingData)^2 which also seems to be an upper limit on the size of the predicted data. I created a model using 5000 rows (above that # I got memory allocation errors) and was unable to predict test sets > 5000 rows. Thus I would need either:
a) find a way to use > 5000 lines in a training set
or
b) find a way to use the model on the full 100k lines.
The reason that knn (in class) asks for both the training and test data is that if it didn't, the "model" it would return would simply be the training data itself.
The training data is the model.
To make predictions, knn calculates the distance between a test observation and each training observation (although I suppose there are some fancy versions for insanely large data sets that don't check every distance). So until you have test observations, there isn't really a model to build.
The ipred package provides functions that appear structured as you describe, but if you look at them, you'll see that there is basically nothing happening in the "training" function. All the work is in the "predict" function. And those are really intended as wrappers to be used for error estimation using cross validation.
As far as limitations on the number of cases, that will be dependent on how much physical memory you have. If you're getting memory allocation errors, then you either need to reduce your RAM usage elsewhere (close apps, etc), buy more RAM, buy a new computer, etc.
The knn function in class runs fine for me with training and test data sets of 10k rows or more, although I have 8gb of RAM. Also, I suspect that knn in class will be faster than in knnflex, but I haven't done extensive testing.