Why does predict() in R have to be done on test data?

In the code below, should I follow Approach 1 or Approach 2? I am confused about why the test data should be used in predict() under Approach 1.
It would be great if someone could explain this in detail.
train_idx <- sample(nrow(sales), nrow(sales) * 0.6)
train <- sales[train_idx, ]
test <- sales[-train_idx, ]
Approach 1
fit <- lm(y ~ x, data = train)
predict(fit, newdata = test)
Instead, can't I do it this way?
Approach 2
fit <- lm(y ~ x, data = train)
predict(fit, newdata = train)
fit1 <- lm(y ~ x, data = test)
predict(fit1, newdata = test)

Speaking generally, predict() with a model trained on the training data, applied back to that same training data, can only be used for introspection about the model as trained. Applying it to (ideally independent) test data makes sense either as a validation of the trained model or as a genuine use of the model to predict new cases.
In other words, they're not different approaches to the same thing; they accomplish completely different things.
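As a concrete sketch of that difference (reusing the train/test split from the question, and assuming sales has columns y and x), you can compare the in-sample and out-of-sample error of a single trained model:
fit <- lm(y ~ x, data = train)
# In-sample predictions: only useful for inspecting the fit itself
rmse_train <- sqrt(mean((train$y - predict(fit, newdata = train))^2))
# Out-of-sample predictions: an honest estimate of predictive error on unseen rows
rmse_test <- sqrt(mean((test$y - predict(fit, newdata = test))^2))
c(train = rmse_train, test = rmse_test)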

Related

LASSO analysis (glmnet package). Can I loop the analysis and the results extraction?

I'm using the package glmnet. I need to run several LASSO analyses for the calibration of a large number of variables (% reflectance for each wavelength throughout the spectrum) against one dependent variable. I have a couple of doubts about the procedure and the results that I would like to clear up. I show my provisional code below:
First, I split my data into training (70% of n) and testing sets.
smp_size <- floor(0.70 * nrow(mydata))
set.seed(123)
train_ind <- sample(seq_len(nrow(mydata)), size = smp_size)
train <- mydata[train_ind, ]
test <- mydata[-train_ind, ]
Then I separate the target trait (y) and the independent variables (x) for each set as follows:
vars.train <- train[3:2153]
vars.test <- test[3:2153]
x.train <- data.matrix(vars.train)
x.test <- data.matrix(vars.test)
y.train <- train$X1
y.test <- test$X1
Afterwards, I run a cross-validated LASSO model on the training set and extract and write the non-zero coefficients for lambda.min. This is because one of my concerns here is to note which variables (wavebands of the reflectance spectrum) are selected by the model.
install.packages("glmnet")
library(glmnet)
cv.lasso.1 <- cv.glmnet(y = y.train, x = x.train, family = "gaussian",
                        nfolds = 5, standardize = TRUE, alpha = 1)
coef(cv.lasso.1, s = cv.lasso.1$lambda.min) # Using lambda.min.
cv.lasso.1
install.packages("broom")
library(broom)
coefs <- tidy(coef(cv.lasso.1, s = "lambda.min"))
write.csv(coefs, file = "results.csv")
Finally, I use predict() to apply the object cv.lasso.1 (the model obtained previously) to the variables of the testing set (x.test) in order to get predictions, and I run the correlation between the predicted and the actual values of y for the testing set.
predict.1.2 <- predict(cv.lasso.1, newx = x.test, type = "response",
                       s = "lambda.min")
cor.test(x = c(predict.1.2), y = c(y.test))
This is simplified code and it has worked without problems so far. The point is that I would like to loop the whole procedure (one hundred repetitions) and, for each repetition, collect the non-zero coefficients of the cross-validated model as well as the correlation coefficient between the predicted and actual values for the testing set. I've tried but couldn't get any clear results. Can someone give me a hint?
thanks!
In general, running repeated analyses of the same type over and over on the same data can be tricky, and in your case it may not be necessary in the way you have outlined.
If you are trying to find the most predictive variables, you can use PCA (Principal Component Analysis) to select the variables carrying the most variation within and between variables. Note that PCA does not consider your outcome at all, so with a poor model design it will pick the least correlated variables in your repository even if they are not predictive; you should therefore be very aware of all the variables in the set. This is a way of reducing the dimensionality of your data before a linear or logistic regression of some sort.
You can read about it here
yourPCA <- prcomp(yourData, center = TRUE, scale. = TRUE)
Scaling and centering are essential to making this work properly: they put your variables on a common footing by setting means to 0 and standard deviations to 1. Unless you know what you are doing, I would leave those settings as they are. If you have skewed or kurtotic data, you might need to address this prior to PCA. Run this ONLY on your predictors; keep your target/outcome variable out of the data set.
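Continuing from the yourPCA object above, a rough sketch of carrying the component scores forward into a regression; the outcome vector, its name, and the choice of five components are placeholders, not part of the original code:
summary(yourPCA)                            # proportion of variance per component
scores <- as.data.frame(yourPCA$x[, 1:5])   # keep, say, the first 5 component scores
# Regress the separately stored outcome on the component scores
pcr.fit <- lm(outcome ~ ., data = cbind(outcome = outcome, scores))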
If you have a classification problem and a lot of data, try LDA (Linear Discriminant Analysis), which reduces variables by optimizing the variance of each predictor with respect to the OUTCOME variable; that is, it specifically considers your outcome.
require(MASS)
yourLDA <- lda(formula = outcome ~ ., data = yourdata)
You can also set the prior probabilities in LDA if you know the global probability for each class; if you leave it out, lda() will use the observed class proportions from the training set. You can read about that here:
LDA from MASS package
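A minimal sketch of supplying the priors yourself; the two-class split and the 30%/70% figures are placeholders:
# Explicit priors for, say, two classes known to occur 30% / 70% of the time
# (the order follows the factor levels of the outcome)
yourLDA <- lda(outcome ~ ., data = yourdata, prior = c(0.3, 0.7))
# Without `prior`, lda() falls back to the observed class proportions in yourdata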
So this gets you headed in the right direction for reducing the complexity of your data via feature selection in a computationally solid way. Building the most robust model via repeated model fitting on resampled data is known as cross-validation. There is a cv.glm function in the boot package which can help you take care of this safely.
You can use the following as a rough guide:
require(boot)
# cv.glm() from boot cross-validates a fitted glm, so fit the model first
yourGLM <- glm(outcomeVariable ~ ., data = yourData, family = "gaussian")
yourCVGLM <- cv.glm(data = yourData, glmfit = yourGLM, K = 100)
Here K = 100 specifies 100-fold cross-validation, i.e. 100 models fit on randomly sampled subsets of your data OBSERVATIONS, not your variables.
So the process is twofold: reduce variables using one of the two methods above, then use cross-validation to build a single model from repeated trials without cumbersome loops!
Read about cv.glm here
Try starting on page 41, but look over the whole thing. The repeated sampling you are after is called bootstrapping; it is powerful and available for many different model types.
Not as much code as you might have hoped for, but it points you in a decent direction.

Classification with One Class SVM in R

I am trying to code an SVM for classification using a training data set that contains only one class. So, I want to predict whether some data is different from my data set or not.
I used the same data set for training and for predicting but, unfortunately, the SVM is not predicting well.
library(e1071)
# Data set
high <- c(10,5,14,12,20)
temp <- c(12,15,20,15,9)
x <- cbind(high,temp)
# Create SVM
model <- svm(x,y=NULL,type='one-classification',kernel='linear')
# Predict training data-set
pred <- predict(model,x)
pred
It returns:
TRUE TRUE FALSE FALSE TRUE
It should be TRUE for all of them.
I am working on a similar problem. From reading the vignettes that the e1071 authors provide on CRAN, I believe that by definition the SVM is going to draw a hyperplane that separates the data into two groups, so the points flagged FALSE are simply the ones it treats as most likely to be outliers. A one-class SVM will essentially always flag some fraction of the training data as outliers (in e1071 this is controlled by the nu parameter, which defaults to 0.5).
I'm not sure traditional supervised learning techniques, such as SVMs, are well suited to training data where you only have 1 class. There's nothing in the data to inform the model how to differentiate between class A and class B.
I think the best you can do with your 1-class training data is to learn a probability density/mass function from the data, and then find how likely a new instance is under the learned probability density. For some more info see the wikipedia article on one-class classification.
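As a rough sketch of that density idea on the toy data from the question (the Gaussian assumption and the 90% cut-off are arbitrary choices, not part of the original post):
# Estimate a Gaussian density from the one-class training data
mu <- colMeans(x)
sigma <- cov(x)
# Squared Mahalanobis distance of each point from the fitted Gaussian
d2 <- mahalanobis(x, center = mu, cov = sigma)
# Flag points whose distance exceeds the 90% chi-squared quantile as outliers
threshold <- qchisq(0.90, df = ncol(x))
d2 <= threshold   # TRUE = looks like the training class, FALSE = likely outlier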

Random forest evaluation in R

I am a newbie in R and I am trying to do my best to create my first model. I am working on a two-class random forest project and so far I have programmed the model as follows:
library(randomForest)
set.seed(2015)
randomforest <- randomForest(as.factor(goodkit) ~ ., data=training1, importance=TRUE,ntree=2000)
varImpPlot(randomforest)
prediction <- predict(randomforest, test,type='prob')
print(prediction)
I am not sure why I don't get the overall prediction for my model; I must be missing something in my code. I get the OOB error and the per-case predictions for the test set, but not the overall prediction of the model.
library(pROC)
auc <-roc(test$goodkit,prediction)
print(auc)
This doesn't work at all.
I have been through the pROC manual but I cannot get to understand everything. It would be very helpful if anyone can help with the code or post a link to a good practical sample.
Using the ROCR package, the following code should work for calculating the AUC:
library(ROCR)
predictedROC <- prediction(prediction[, 2], as.factor(test$goodkit))
as.numeric(performance(predictedROC, "auc")@y.values)
Your problem is that predict() on a randomForest object with type='prob' returns two columns of predictions: each column contains the probability of belonging to one of the classes (for binary classification).
You have to decide which of these columns to use to build the ROC curve. Fortunately, for binary classification they carry the same information (the two columns are just complements of each other):
auc1 <-roc(test$goodkit, prediction[,1])
print(auc1)
auc2 <-roc(test$goodkit, prediction[,2])
print(auc2)

R: Limit/Set values of predicted results from linear model

New to R.
Looking to limit the range of values that can be predicted.
df.Train <- data.frame(S=c(1,2,2,2,1),L=c(1,2,3,3,1),M=c(400,450,400,700,795),V=c(423,400,555,600,800),G=c(4,3.2,2,2.7,3.4), stringsAsFactors=FALSE)
m.Train <- lm(G~S+L+M+V,data=df.Train)
df.Test <- data.frame(S=c(1,2,1,2,1),L=c(1,2,3,1,1),M=c(400,450,500,800,795),V=c(423,475,555,600,555), stringsAsFactors=FALSE)
round(predict(m.Train, df.Test, type="response"),digits=1)
#seq(0,4,.1) #Predicted values should fall in this range
I've experimented with the predict() options but no luck.
Is there an option in predict? Should I be limiting it in the model?
Thank you
There are ways to transform your response variable (G in this case), but there needs to be a good reason to do so. For example, if you want the output to be probabilities between 0 and 1 and your response variable is binary (0/1), then you need a logistic regression.
It all comes down to what data you have and whether a model / transformation of the response variable would be appropriate. In your example you do not specify what the data is and therefore we cannot say anything about which model or which transformation to use.
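For instance, if the response really is bounded between 0 and 4, one possible transformation (a sketch only, not necessarily the right model for your data) is to rescale G to the unit interval, fit a quasibinomial GLM with a logit link, and rescale the predictions back, which keeps every prediction inside (0, 4):
# Rescale the bounded response to (0, 1) and fit on the logit scale
m.Scaled <- glm(I(G / 4) ~ S + L + M + V, data = df.Train, family = quasibinomial)
# Predictions come back in (0, 1); multiply by 4 to return to the original scale
round(4 * predict(m.Scaled, df.Test, type = "response"), digits = 1)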
Setting the above aside, if you really care about the prediction and do not care about the model or the transformation (but why wouldn't you care?), it looks like your data could use a quasi-Poisson generalised linear model, which might provide the output you need:
df.Train <- data.frame(S=c(1,2,2,2,1),L=c(1,2,3,3,1),M=c(400,450,400,700,795),V=c(423,400,555,600,800),G=c(4,3.2,2,2.7,3.4), stringsAsFactors=FALSE)
m.Train <- glm(G~S+L+M+V,data=df.Train, family=quasipoisson)
df.Test <- data.frame(S=c(1,2,1,2,1),L=c(1,2,3,1,1),M=c(400,450,500,800,795),V=c(423,475,555,600,555), stringsAsFactors=FALSE)
predict(m.Train, df.Test, type="response")
       1        2        3        4        5 
4.000000 2.840834 3.062754 3.615447 4.573276 
# probably not as good as you want
The model uses a log link by default, which ensures the predicted values are positive. There is no guarantee that the model will not predict values greater than 4, but since you fed it values of less than 4 (your G variable), chances are that most of the predictions will stay in that range (although, as the 4.57 above shows, not all of them). You might then need to consider how to treat predictions that go above 4.
In general you should consider carefully which model and which response transformation to choose. The Poisson model above, for example, is usually used for count data. However, you should never manipulate predictions by hand, so if you choose the lm model in the end, make sure you use the predictions it gives.
EDIT
It looks like in your case a non-linear regression might be what you need. The problem with a linear model like lm is that predictions can be greater than the maximum of the observed cases or less than the minimum, in which case linear regression might not be appropriate. There are algorithms that will never predict a value greater than the max or less than the min of the observed response, and one of these might be better suited to your case. One such algorithm is k-nearest neighbours, for example:
library(FNN)
> knn.reg(df.Train[1:4], test=df.Test[1:4], y=df.Train[5], k=3)
Prediction:
[1] 3.066667 3.066667 3.066667 2.700000 3.100000
As you can see, the predictions will never go above 4. That said, knn is a local-averaging algorithm, so again you need to research whether this is a good approach for your problem and your data; in terms of the predictions, though, it definitely satisfies your constraint. Knn is a very easy-to-understand algorithm that relies on distances between points to calculate predictions.
Hope it helps :)

Regression evaluation in R

Are there any utilities/packages for showing various performance metrics of a regression model on some labeled test data? Basic stuff I can easily write like RMSE, R-squared, etc., but maybe with some extra utilities for visualization, or reporting the distribution of prediction confidence/variance, or other things I haven't thought of. This is usually reported in most training utilities (like caret's train), but only over the training data (AFAICT). Thanks in advance.
This question is really quite broad and should be focused a bit, but here's a small subset of functions written to work with linear models:
x <- rnorm(100)
y <- rnorm(100)
model <- lm(x ~ y)
#general summary
summary(model)
#Visualize some diagnostics
plot(model)
#Coefficient values
coef(model)
#Confidence intervals
confint(model)
#predict values
predict(model)
#predict new values
predict(model, newdata = data.frame(y = 1:10))
#Residuals
resid(model)
#Standardized residuals
rstandard(model)
#Studentized residuals
rstudent(model)
#AIC
AIC(model)
#BIC
BIC(model)
#Cook's distance
cooks.distance(model)
#DFFITS
dffits(model)
#lots of measures related to model fit
influence.measures(model)
Bootstrap confidence intervals for model parameters can be computed using the recommended package boot. It is a very general package: you write a simple wrapper function that fits the model to the supplied data and returns the parameter of interest (say, one of the model coefficients), and boot takes care of the rest, doing the resampling and computing the intervals.
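A minimal sketch of that wrapper-function pattern, reusing the toy x, y and lm(x ~ y) model from above (the choice of coefficient and number of replicates are arbitrary):
library(boot)
# Statistic: refit the model on a bootstrap resample and return the slope on y
coef_fun <- function(data, indices) {
  fit <- lm(x ~ y, data = data[indices, ])
  coef(fit)[["y"]]
}
boot_out <- boot(data = data.frame(x = x, y = y), statistic = coef_fun, R = 999)
boot.ci(boot_out, type = "perc")   # percentile bootstrap confidence interval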
Consider also the caret package, which is a wrapper around a large number of modelling functions, but also provides facilities to compare model performance using a range of metrics using an independent test set or a resampling of the training data (k-fold, bootstrap). caret is well documented and quite easy to use, though to get the best out of it, you do need to be familiar with the modelling function you want to employ.
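As a rough sketch of the caret route (the data frames trainData and testData and the outcome column y are hypothetical), test-set metrics can be computed directly with postResample():
library(caret)
# Fit a model with 10-fold cross-validation on the training data
fit <- train(y ~ ., data = trainData, method = "lm",
             trControl = trainControl(method = "cv", number = 10))
# Evaluate on the independent test set: returns RMSE, R-squared and MAE
preds <- predict(fit, newdata = testData)
postResample(pred = preds, obs = testData$y)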
