Does the seed value affect the result of training data in R?

I am trying to create a model with my data divided into training (70%), validation (15%) and testing (15%) sets. After running the model I get an accuracy (ROC) figure and values for my confusion matrix. But every time I change the seed value, the output changes. How do I address this? Is this the expected behavior? If so, how do I decide which value to report as the final output?

set.seed() defines a starting point for the generation of pseudo-random values. Running an analysis with the same seed should return the same result; using a different seed can give different output. In your case the difference probably comes from a different split into training, validation and testing sets.
If the differences are acceptably small, then your model is robust to different splits into training, validation and testing sets. If the differences are large, then your model is not robust and should not be trusted. You will have to change the way the data is split (stratification might help) or revise the model.
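
One way to judge whether the differences are "acceptably small" is to repeat the split over several seeds and look at the spread of the metric. The sketch below is only an illustration under assumptions: it uses a two-class subset of the built-in iris data and a logistic regression in place of your data and model, and accuracy_for_seed is a made-up helper name.

```r
# Hypothetical helper: split the data 70/15/15 with a given seed, fit a simple
# classifier on the training part and return the accuracy on the test part.
# (The validation part is carved out but not used in this small sketch.)
accuracy_for_seed <- function(seed) {
  set.seed(seed)
  d <- iris[iris$Species != "setosa", ]          # stand-in data, two classes
  d$y <- as.integer(d$Species == "virginica")
  n <- nrow(d)
  idx <- sample(n)                               # random permutation of the rows
  n_train <- floor(0.70 * n)
  n_valid <- floor(0.15 * n)
  train <- d[idx[1:n_train], ]
  test  <- d[idx[(n_train + n_valid + 1):n], ]
  fit  <- glm(y ~ Sepal.Length + Sepal.Width, data = train, family = binomial)
  pred <- as.integer(predict(fit, newdata = test, type = "response") > 0.5)
  mean(pred == test$y)
}

seeds <- 1:20
acc <- sapply(seeds, accuracy_for_seed)
summary(acc)   # range of test accuracy across the 20 splits
sd(acc)        # a large spread suggests the result depends heavily on the split
```

Reporting the mean and spread over many seeds is usually more informative than picking the result of any single seed.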

Related

When to use Train Validation Test sets

I know this question is quite common but I have looked at all the questions that have been asked before and I still can't understand why we also need a validation set.
I know sometimes people only use a train set and a test set, so why do we also need a validation set?
And how do we use it?
For example, in order to impute missing data, do I impute these 3 different sets separately or not?
Thank you!
I will try to answer with an example.
If I'm training a neural network or doing linear regression, and I'm using only train and test data I can check my test data loss for each iteration and stop when my test data loss begins to grow or get a snapshot of the model with the lowest test loss.
In some sense this is "overfitting" to my test data, since I decide when to stop based on it.
If I were using train, validation and test data, I could follow the same process as above with the validation data instead of the test data. Then, after I decide that my model is done training, I can evaluate it on the never-before-seen test data to get a less biased score for my model's predictions.
For the second part of the question, I would suggest treating at least the test data as independent and imputing its missing data separately, but it depends on the situation and the data.
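
The role of the validation set can be sketched in a few lines of R. This is only an illustration under assumptions: the data are simulated, and the choice being made on the validation set is the degree of a polynomial regression rather than the stopping point of a neural network.

```r
set.seed(42)
n <- 200
dat <- data.frame(x = runif(n, -3, 3))
dat$y <- sin(dat$x) + rnorm(n, sd = 0.3)        # simulated data (assumption)

# 70% / 15% / 15% split
idx   <- sample(n)
train <- dat[idx[1:140], ]
valid <- dat[idx[141:170], ]
test  <- dat[idx[171:200], ]

rmse <- function(fit, newdata) {
  sqrt(mean((newdata$y - predict(fit, newdata))^2))
}

# Fit one candidate model per polynomial degree, using the training set only.
degrees <- 1:8
fits <- lapply(degrees, function(d) {
  lm(as.formula(sprintf("y ~ poly(x, %d)", d)), data = train)
})

# Choose the candidate with the lowest error on the validation set ...
valid_rmse <- sapply(fits, rmse, newdata = valid)
best <- which.min(valid_rmse)

# ... and only then look at the test set, once, for the final score.
test_rmse <- rmse(fits[[best]], test)
c(best_degree = degrees[best], test_rmse = test_rmse)
```

Because the test set played no part in choosing the degree, test_rmse is not biased by that choice, whereas the winning validation error is.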

R: training random forest using PCA data

I have a data set called Data, with 30 scaled and centered features and 1 outcome with column name OUTCOME, covering 700k records and stored in data.table format. I computed its PCA and observed that its first 8 components account for 95% of the variance. I want to train a random forest in h2o, so this is what I do:
Data.pca=prcomp(Data,retx=TRUE) # compute the PCA of Data
Data.rotated=as.data.table(Data.pca$x)[,c(1:8)] # keep only first 8 components
Data.dump=cbind(Data.rotated,subset(Data,select=c(OUTCOME))) # PCA dataset plus outcomes for training
This way I have a dataset Data.dump where I have 8 features that are rotated on the PCA components, and at each record I associated its outcome.
First question: is this rational? Or do I have to permute the outcomes vector somehow? Or are the two things unrelated?
Then I split Data.dump into two sets, Data.train for training and Data.test for testing, both converted with as.h2o. Then I feed them to a random forest:
rf=h2o.randomForest(training_frame=Data.train,x=1:8,y=9,stopping_rounds=2,
ntrees=200,score_each_iteration=T,seed=1000000)
rf.pred=as.data.table(h2o.predict(rf,Data.test))
What happens is that rf.pred is not very similar to the original outcomes Data.test$OUTCOME. I also tried to train a neural network, and it did not even converge, crashing R.
Second question: is it because I am carrying over some mistake from the PCA treatment? Or because I set up the random forest badly? Or am I just dealing with annoying data?
I do not know where to start, as I am new to data science, but the workflow seems correct to me.
Thanks a lot in advance.
The answer to your second question (i.e. "is it the data, or did I do something wrong") is hard to know. This is why you should always try to make a baseline model first, so you have an idea of how learnable the data is.
The baseline could be h2o.glm(), and/or it could be h2o.randomForest(), but either way without the PCA step. (You didn't say if you are doing a regression or a classification, i.e. if OUTCOME is a number or a factor, but both glm and random forest will work either way.)
Going to your first question: yes, it is a reasonable thing to do, and no you don't have to (in fact, should not) involve the outcomes vector.
Another way to answer your first question is: no, it is unreasonable. It may be that a random forest can see all the relations itself without needing you to use a PCA. Remember that when you use a PCA to reduce the number of input dimensions you are also throwing away a bit of signal. You said that the 8 components capture only 95% of the variance. So you are throwing away some signal in return for having fewer inputs, which means you are optimizing for complexity at the expense of prediction quality.
By the way, concatenating the original inputs and your 8 PCA components is another approach: you might get a better model by giving it this hint about the data. (But you might not, which is why getting some baseline models first is essential before trying these more exotic ideas.)
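
The baseline-versus-PCA comparison can be sketched as follows. This is an illustration under assumptions, not the h2o workflow from the question: it uses base R only, the built-in mtcars data with mpg as the outcome, and a linear model instead of a random forest. The point it shows is that the PCA is fitted on the features alone, the outcome is attached afterwards, and the same rotation is reused on the test rows.

```r
set.seed(1)
features <- setdiff(names(mtcars), "mpg")       # everything except the outcome
idx   <- sample(nrow(mtcars), 24)
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

# Baseline: all original features, no PCA step.
base_fit  <- lm(mpg ~ ., data = train)
base_rmse <- sqrt(mean((test$mpg - predict(base_fit, test))^2))

# PCA fitted on the training features only; the outcome is not part of it.
pca <- prcomp(train[, features], center = TRUE, scale. = TRUE)
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k <- which(cum_var >= 0.95)[1]                  # components needed for 95% variance

# Attach the outcome to the rotated features; rotate the test rows with the same PCA.
train_pc <- data.frame(pca$x[, 1:k, drop = FALSE], mpg = train$mpg)
test_pc  <- data.frame(predict(pca, test[, features])[, 1:k, drop = FALSE],
                       mpg = test$mpg)

pca_fit  <- lm(mpg ~ ., data = train_pc)
pca_rmse <- sqrt(mean((test_pc$mpg - predict(pca_fit, test_pc))^2))

c(baseline = base_rmse, pca = pca_rmse, components = k)
```

Whichever error is lower on held-out data tells you whether the PCA step is helping or costing signal for this particular data set.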

different values by fitting a boosted tree twice

I use the R package adabag to fit boosted trees to a (large) data set (140 observations with 3 845 predictors).
I ran this method twice with the same parameters and the same data set, and each time it returned a different accuracy (I defined a simple function which gives the accuracy for a given data set).
Did I make a mistake, or is it usual that each fit returns a different accuracy? Is this problem caused by the data set being large?
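Function which returns the accuracy given the predicted values and the true test set values:
# Return the absolute and relative accuracy of predictions against true values.
err <- function(pred_d, test_d) {
  abs.acc <- sum(pred_d == test_d)      # number of correct predictions
  rel.acc <- abs.acc / length(test_d)   # share of correct predictions
  c(abs.acc, rel.acc)
}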
New edit (9.1.2017):
An important follow-up question in the above context.
As far as I can see I do not use any "pseudo randomness objects" (such as generating random numbers etc.) in my code, because I essentially fit trees (using the R package rpart) and boosted trees (using the R package adabag) to a large data set. Can you explain to me where "pseudo randomness" enters when I execute my code?
Edit 1: Similar phenomenon happens also with tree (using the R-package rpart).
Edit 2: Similar phenomenon did not happen with trees (using rpart) on the data set iris.
There's no reason you should expect to get the same results if you didn't set your seed (with set.seed()).
It doesn't matter what seed you set if you're doing statistics rather than information security. You might run your model with several different seeds to check its sensitivity. You just have to set it before anything involving pseudo randomness. Most people set it at the beginning of their code.
This is ubiquitous in statistics; it affects all probabilistic models and processes across all languages.
Note that in the case of information security it's important to have a (pseudo) random seed which cannot be easily guessed by brute force attacks, because (in a nutshell) knowing a seed value used internally by a security program paves the way for it to be hacked. In science and statistics it's the opposite - you and anyone you share your code/research with should be aware of the seed to ensure reproducibility.
https://en.wikipedia.org/wiki/Random_seed
http://www.grasshopper3d.com/forum/topics/what-are-random-seed-values
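
As a minimal sketch of where pseudo randomness can enter without any explicit random-number call in your own script: fitting routines often resample rows or assign cross-validation folds internally (to my understanding, boosting() in adabag draws a bootstrap sample by default and rpart assigns its cross-validation folds at random). The helper draw_bootstrap below is a made-up stand-in for such an internal step.

```r
# Stand-in for a resampling step hidden inside a fitting function.
draw_bootstrap <- function(n) sample(n, replace = TRUE)

# Without a seed, two calls will generally draw different rows.
a <- draw_bootstrap(10)
b <- draw_bootstrap(10)

# With the same seed set before each call, the draws are identical.
set.seed(123)
c1 <- draw_bootstrap(10)
set.seed(123)
c2 <- draw_bootstrap(10)

identical(c1, c2)   # TRUE
```

Setting the seed once before each fit has the same effect on whatever resampling the fitting function performs internally.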

Struggling to understand complete predictive model process in R

I'm very new to all this and I have a bit of a mental block on the logic of the process. I am trying to predict customer churn using a database of current and already churned customers. So far I have
1) Taken complete customer database of current customers and already churned customers along with customer service variables etc to use to predict on.
2) Split the data set randomly 70/30 into train and test
3) Using R, I have trained a random forest model to make predictions and then compared them to the actual status using a confusion matrix.
4) I have run that model on the test data to check its accuracy in identifying the churners
I'm now a bit confused. What I want to do now is take all of our current customers and predict which ones will churn. Have I done this all wrong, given that a lot of the current customers I need predictions for have already been seen by the model because they appear in the training set?
Was I somehow supposed to use a training and test set that will not be part of the dataset I need to make predictions on?
Many thanks for any help.
As far as I have understood your question, you want to know if you've done the right thing by using overlapping examples in your training and test sets. You first need to understand that you must keep your training set separate from your test set. Your model parameters have been computed from your training set, so for examples that also appear in the test set the model will tend to give the correct prediction. Your accuracy will be inflated by those shared examples, and that is not the correct way to evaluate. Your test set should always contain previously unseen examples in order to properly evaluate the performance of your algorithm.
If your current customers (on which you want to test your model) are already there in the training set, you would want to leave them out in the testing process. I'd suggest you perform a check between the training set customers and the current set of customers based on some unique identifier (if present) such as the Customer ID and leave common customers out of your fresh batch of unseen test examples.
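
The check by unique identifier could look like this in R. The ID vectors are invented for illustration; in practice they would come from a customer ID column in your own tables.

```r
# Hypothetical customer IDs (assumption: each customer has a unique ID)
train_ids   <- c(101, 102, 103, 104)   # customers used to train the model
current_ids <- c(103, 104, 105, 106)   # customers you want to evaluate on

# Customers the model has already seen during training
already_seen <- current_ids %in% train_ids

# Keep only the customers the model has never seen
unseen_ids <- setdiff(current_ids, train_ids)
unseen_ids   # 105 106
```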
It looks to me like you have the standard training-test-validation set problem. If I understood correctly, you want to test the performance of your model (random forest) on all the data you have.
The standard classroom way to do this is indeed what you already did: split the dataset into, for example, a 70% training set and a 30% test/validation set, train the model on the training set and test it on the test set.
A better way to test (and to predict for all of the data) is to use cross-validation to perform the analysis (https://en.wikipedia.org/wiki/Cross-validation_(statistics)). One example is 10-fold cross-validation: you split your data into 10 equal-size blocks, loop over all the blocks, and in every iteration use the remaining 9 blocks to train your model and then test the model on the held-out block.
What you end up with from cross-validation is a more comprehensive picture of the performance of your model, as well as out-of-sample results for all of the customers in your database. Cross-validation mitigates the errors in analysis due to random selection of the test set.
Hope this helps!
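
A minimal sketch of 10-fold cross-validation in base R, under assumptions: a two-class subset of the built-in iris data and a logistic regression stand in for the churn data and the random forest.

```r
set.seed(2024)
d <- iris[iris$Species != "setosa", ]            # stand-in data, two classes
d$y <- as.integer(d$Species == "virginica")

k <- 10
fold <- sample(rep(1:k, length.out = nrow(d)))   # random fold label for every row
pred <- numeric(nrow(d))

for (i in 1:k) {
  held_out <- fold == i
  # Train on the other 9 blocks ...
  fit <- glm(y ~ Sepal.Length + Sepal.Width,
             data = d[!held_out, ], family = binomial)
  # ... and predict only the block that was left out.
  pred[held_out] <- predict(fit, newdata = d[held_out, ], type = "response")
}

# Every row now has a prediction from a model that never saw it.
fold_acc <- tapply((pred > 0.5) == d$y, fold, mean)
mean(fold_acc)   # cross-validated accuracy
sd(fold_acc)     # variation between folds
```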

Random Forest optimization with tuning and cross-validation

I'm working with a large data set, so hope to remove extraneous variables and tune for an optimal m variables per branch. In R, there are two methods, rfcv and tuneRF, that help with these two tasks. I'm attempting to combine them to optimize parameters.
rfcv works roughly as follows:
create random forest and extract each variable's importance;
while (nvar > 1) {
remove the k (or k%) least important variables;
run random forest with remaining variables, reporting cverror and predictions
}
Presently, I've recoded rfcv to work as follows:
create random forest and extract each variable's importance;
while (nvar > 1) {
remove the k (or k%) least important variables;
tune for the best m for reduced variable set;
run random forest with remaining variables, reporting cverror and predictions;
}
This, of course, increases the run time by an order of magnitude. My question is how necessary this is (it's been hard to get an idea using toy datasets), and whether any other way could be expected to work roughly as well in far less time.
As always, the answer is it depends on the data. On one hand, if there aren't any irrelevant features, then you can just totally skip feature elimination. The tree building process in the random forest implementation already tries to select predictive features, which gives you some protection against irrelevant ones.
Leo Breiman gave a talk where he introduced 1000 irrelevant features into some medical prediction task that had only a handful of real features from the input domain. When he eliminated 90% of the features using a single filter on variable importance, the next iteration of random forest didn't pick any irrelevant features as predictors in its trees.
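
One cheaper arrangement, offered here as an assumption rather than something the answer above states, is to run rfcv once with its default rule for m and tune m only once, on the variable set it favors. The sketch below assumes the randomForest package is installed and uses the built-in iris data as a toy example; whether it matches the fully nested version on real data would need checking.

```r
library(randomForest)

set.seed(7)
x <- iris[, 1:4]
y <- iris$Species

# Step 1: cross-validated error for a shrinking number of variables,
# using rfcv's default rule for mtry at each size.
cv <- rfcv(trainx = x, trainy = y, cv.fold = 5, step = 0.5)
n_best <- cv$n.var[which.min(cv$error.cv)]

# Step 2: keep the n_best most important variables from a single forest.
imp  <- importance(randomForest(x, y))[, "MeanDecreaseGini"]
keep <- names(sort(imp, decreasing = TRUE))[seq_len(n_best)]

# Step 3: tune mtry once, on the reduced variable set only.
if (n_best > 1) {
  tuned <- tuneRF(x[, keep, drop = FALSE], y, ntreeTry = 100,
                  stepFactor = 2, trace = FALSE, plot = FALSE)
  print(tuned)   # out-of-bag error for each mtry tried
}
```

This runs the tuning step once instead of once per elimination round, which is where the extra order of magnitude in run time comes from.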
