The Validation Set Approach - r

I am studying the book ISLR on my own:
library(ISLR)
set.seed(1)
train=sample(392,196)
lm.fit=lm(mpg~horsepower, data= Auto, subset=train)
attach(Auto)
mean((mpg-predict(lm.fit,Auto))[-train]^2)
23.26601
If I don't use attach():
library(ISLR)
set.seed(1)
train=sample(392,196)
lm.fit=lm(mpg~horsepower, data= Auto, subset=train)
mean((mpg-predict(lm.fit,data=Auto))[-train]^2)
97.06483
Why does the result change so substantially?
Also, I don't understand the syntax of mean((mpg-predict(lm.fit,data=Auto))[-train]^2);
what does [] represent?
Also, why is it mpg-predict? We usually use ~ for a formula.
I tried ?mean but it didn't answer these questions.
Many thanks in advance for the help.

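The second argument to predict.lm is not data; it is newdata. In the first set of instructions, the Auto data frame was matched by position to the newdata argument. In the second set, data=Auto matches no named argument, so it is absorbed by ... and silently ignored; predict then returns the fitted values for the 196 training rows, and those are recycled against the 392 values of mpg. If you run the second set of instructions with newdata as the parameter name, you get the same result as the first: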
mean((mpg-predict(lm.fit,newdata=Auto))[-train]^2)
[1] 23.26601
When you execute mpg-predict(lm.fit,newdata=Auto) you are getting the residuals from the model: the difference between each observed value of mpg and the prediction for that value. (It's just a minus sign between the two expressions, mpg - predict(...), not a formula operator.)
The next part of the code, [-train], drops the training rows from consideration, leaving the residuals for the held-out (validation) observations. The analogous quantity in bagging is often called the "out-of-bag" residuals.
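To see the difference between data= and newdata= directly, here is a minimal sketch. It uses the built-in mtcars data as a stand-in for Auto (an assumption, so that it runs without the ISLR package); the split size of 16 is also arbitrary.

```r
# Stand-in data: mtcars replaces Auto so the sketch runs without ISLR.
set.seed(1)
train <- sample(nrow(mtcars), 16)
fit <- lm(mpg ~ hp, data = mtcars, subset = train)

# 'data' is not an argument of predict.lm: it falls into '...' and is ignored,
# so this returns the fitted values for the 16 training rows only.
p_wrong <- predict(fit, data = mtcars)

# 'newdata' is the real argument: one prediction per row of mtcars.
p_right <- predict(fit, newdata = mtcars)

length(p_wrong)  # number of training rows
length(p_right)  # number of rows in mtcars

# Validation-set MSE, computed on the held-out rows only
val_mse <- mean((mtcars$mpg - p_right)[-train]^2)
```

Because p_wrong is shorter than mpg, subtracting it would recycle the fitted values, which is why the misspelled argument gives a number rather than an error.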

Instead of attaching, use with:
with(Auto, mean((mpg-predict(lm.fit,Auto))[-train]^2))
#[1] 23.26601
The - applied to train inside [ (see ?Extract) subsets the vector by dropping the positions that sample() selected, before the remaining elements are squared and averaged. The vector being subset is:
with(Auto, (mpg-predict(lm.fit,Auto)))
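As a small illustration of negative indexing on its own, here is a toy sketch; the vector x and the positions in train are made up for the example.

```r
x <- c(10, 20, 30, 40, 50)
train <- c(2, 4)           # hypothetical "training" positions

x[train]    # keeps positions 2 and 4: 20 40
x[-train]   # drops positions 2 and 4: 10 30 50

# Indexing happens before ^, so this squares only the kept elements
held_out_mean_sq <- mean(x[-train]^2)
```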
If we use neither attach nor with, the mpg object is not found in the global environment, so the call results in an error:
mean((mpg-predict(lm.fit,data=Auto))[-train]^2)
Error in mean((mpg - predict(lm.fit, data = Auto))[-train]^2) :
object 'mpg' not found
If the OP got a different value instead of an error, then mpg must have come from somewhere else, such as a data frame attached earlier in the session.
Regarding the use of ~ in a formula, according to ?lm:
Models for lm are specified symbolically. A typical model has the form response ~ terms where response is the (numeric) response vector and terms is a series of terms which specifies a linear predictor for response. A terms specification of the form first + second indicates all the terms in first together with all the terms in second with duplicates removed. A specification of the form first:second indicates the set of terms obtained by taking the interactions of all terms in first with all terms in second. The specification first*second indicates the cross of first and second. This is the same as first + second + first:second.
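One way to check what the quoted operators expand to is to look at the design matrix. This is only a sketch on a made-up data frame d; the column names a and b are hypothetical.

```r
d <- data.frame(a = c(1, 2, 3, 4), b = c(2, 1, 4, 3))

# Inside a formula, * means "cross": main effects plus interaction
m_star <- model.matrix(~ a * b, data = d)
m_long <- model.matrix(~ a + b + a:b, data = d)
colnames(m_star)

# Outside a formula, - is ordinary arithmetic, as in mpg - predict(...)
diff_ab <- d$a - d$b
```

The two design matrices have the same columns, whereas d$a - d$b is just an elementwise subtraction of two numeric vectors.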

Related

Why is predict in R taking Train data instead of Test data? [duplicate]

Working in R to develop regression models, I have something akin to this:
c_lm = lm(trainingset$dependent ~ trainingset$independent)
c_pred = predict(c_lm,testset$independent)
and every single time, I get a mysterious error from R:
Warning message:
'newdata' had 34 rows but variables found have 142 rows
which essentially translates into R not being able to find the independent column of the testset data.frame. This is simply because the exact name from the right-hand side of the formula in lm must be there in predict. To fix it, I can do this:
tempset = trainingset
c_lm = lm(trainingset$dependent ~ tempset$independent)
tempset = testset
c_pred = predict(c_lm,tempset$independent)
or some similar variation, but this is really sloppy, in my opinion.
Is there another way to clean up the translation between the two so that the independent variables' data frame does not have to have the exact same name in predict as it does in lm?
No, No, No, No, No, No! Do not use the formula interface in the way you are doing if you want all the other sugar that comes with model formulas. You wrote:
c_lm = lm(trainingset$dependent ~ trainingset$independent)
You repeat trainingset twice, which is a waste of fingers and time, redundant, and, not least, the cause of the problem you are hitting. When you now call predict, it will look for a variable in testset that has the name trainingset$independent, which of course doesn't exist. Instead, use the data argument in your call to lm(). For example, this fits the same model as your formula but is efficient and also works properly with predict():
c_lm = lm(dependent ~ independent, data = trainingset)
Now when you call predict(c_lm, newdata = testset), you only need to have a data frame with a variable whose name is independent (or whatever you have in the model formula).
An additional reason to write formulas as I show them is legibility. Getting the object name out of the formula lets you see more easily what the model is.
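Here is a self-contained sketch of both styles side by side. The data are simulated, and the 35/15 split is an arbitrary assumption; only the names trainingset, testset, dependent and independent come from the question.

```r
set.seed(42)
dat <- data.frame(independent = runif(50))
dat$dependent <- 2 + 3 * dat$independent + rnorm(50, sd = 0.1)
trainingset <- dat[1:35, ]
testset <- dat[36:50, ]

# Recommended: bare variable names in the formula, data frame via 'data'
c_lm <- lm(dependent ~ independent, data = trainingset)
c_pred <- predict(c_lm, newdata = testset)
length(c_pred)    # one prediction per test row

# For contrast: the $-style formula cannot be matched against newdata,
# so predict() warns and falls back to the training values.
bad_lm <- lm(trainingset$dependent ~ trainingset$independent)
bad_pred <- suppressWarnings(predict(bad_lm, newdata = testset))
length(bad_pred)  # one value per training row, not per test row
```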

R Cross Validation lm predict function [duplicate]

I am trying to convert Absorbance (Abs) values to Concentration (ng/mL), based on an established linear model & standard curve. I planned to do this by using the predict() function. I am having trouble getting predict() to return the desired results. Here is a sample of my code:
Standards<-data.frame(ng_mL=c(0,0.4,1,4),
Abs550nm=c(1.7535,1.5896,1.4285,0.9362))
LM.2<-lm(log(Standards[['Abs550nm']])~Standards[['ng_mL']])
Abs<-c(1.7812,1.7309,1.3537,1.6757,1.7409,1.7875,1.7533,1.8169,1.753,1.6721,1.7036,1.6707,
0.3903,0.3362,0.2886,0.281,0.3596,0.4122,0.218,0.2331,1.3292,1.2734)
predict(object=LM.2,
newdata=data.frame(Concentration=Abs[1]))#using Abs[1] as an example, but I eventually want predictions for all values in Abs
Running those last lines gives this output:
> predict(object=LM.2,
+ newdata=data.frame(Concentration=Abs[1]))
1 2 3 4
0.5338437 0.4731341 0.3820697 -0.0732525
Warning message:
'newdata' had 1 row but variables found have 4 rows
This does not seem to be the output I want. I am trying to get a single predicted Concentration value for each Absorbance (Abs) entry. It would be nice to be able to predict all of the entries at once and add them to an existing data frame, but I can't even get it to give me a single value correctly. I've read many threads on here, webpages found on Google, and all of the help files, and for the life of me I cannot understand what is going on with this function. Any help would be appreciated, thanks.
You must have a variable in newdata that has the same name as that used in the model formula used to fit the model initially.
You have two errors:
You don't use a variable in newdata with the same name as the covariate used to fit the model, and
You make the problem much more difficult to resolve because you abuse the formula interface.
Don't fit your model like this:
mod <- lm(log(Standards[['Abs550nm']])~Standards[['ng_mL']])
Fit your model like this:
mod <- lm(log(Abs550nm) ~ ng_mL, data = Standards)
Isn't that so much more readable?
To predict you would need a data frame with a variable ng_mL:
predict(mod, newdata = data.frame(ng_mL = c(0.5, 1.2)))
Now you may have a third error. You appear to be trying to predict with new values of Absorbance, but the way you fitted the model, Absorbance is the response variable. You would need to supply new values for ng_mL.
The behaviour you are seeing is what happens when R can't find a correctly-named variable in newdata; it returns the fitted values from the model or the predictions at the observed data.
This makes me think you have the formula back to front. Did you mean:
mod2 <- lm(ng_mL ~ log(Abs550nm), data = Standards)
If so, you'd need, say,
predict(mod2, newdata = data.frame(Abs550nm = c(1.7812,1.7309)))
Note you don't need to include the log() bit in the name. R recognises that as a function and applies it to the variable Abs550nm for you.
If the model really is log(Abs550nm) ~ ng_mL and you want to find values of ng_mL for new values of Abs550nm you'll need to invert the fitted model in some way.
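One simple way to invert the fitted line, sketched here under the assumption that a plain algebraic inversion is acceptable: since log(Abs550nm) = a + b * ng_mL, solve for ng_mL = (log(Abs550nm) - a) / b. The helper invert_curve is hypothetical, not part of any package, and it ignores the uncertainty of the inverse estimate.

```r
Standards <- data.frame(ng_mL = c(0, 0.4, 1, 4),
                        Abs550nm = c(1.7535, 1.5896, 1.4285, 0.9362))
mod <- lm(log(Abs550nm) ~ ng_mL, data = Standards)

# Hypothetical helper: solve log(abs) = a + b * conc for conc
invert_curve <- function(abs, coefs) {
  (log(abs) - coefs[["(Intercept)"]]) / coefs[["ng_mL"]]
}

Abs <- c(1.7812, 1.7309, 1.3537)   # first few readings from the question
est <- invert_curve(Abs, coef(mod))
data.frame(Abs550nm = Abs, ng_mL_est = est)
```

Readings outside the range of the standards are extrapolations, so estimates for them (including negative concentrations) should be treated with caution.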

Getting "Variable Importance" from rpart

I'm performing a tree analysis using rpart, and I need to access the values of "Variable importance" as shown when the rpart object is printed.
Is there a way to do that?
Thanks!
@rawr indicated it in the comments; I'll just make it an answer:
You can extract the variable importance from a rpart object using:
fit$variable.importance
Just adding details to @user7779's answer: you can also access the information you need in the following way:
library(rpart)
my.tree = rpart(y ~ X, data = dta, method = "anova") # I am assuming regression tree.
summary(my.tree)
In the output, among the first lines, you find variable importance. Notice though that here everything is rescaled, thus you will get the relative importance (i.e., numbers are going to sum up to one hundred).
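For a runnable version of both approaches, here is a sketch on the built-in mtcars data (an assumption; the original post does not show its data). The rescaling to percentages is done by hand so that the values sum to one hundred, as described above.

```r
library(rpart)

# Regression tree on a built-in data set (stand-in for the poster's data)
fit <- rpart(mpg ~ ., data = mtcars, method = "anova")

# Raw importance scores: a named numeric vector, largest first
imp <- fit$variable.importance

# Relative importance, rescaled so the values sum to 100
rel <- 100 * imp / sum(imp)
round(rel)
```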

What is the meaning of "~ -1 + ." in R?

I am trying to understand an R script and I came upon this line:
train <- cbind(train[,c(1,2)],model.matrix(~ -1 + .,train[,-c(1,2)]))
train is a data.frame. I think it is trying to combine the first two columns of train with all the other columns after they have been through some sort of matrix manipulation. However, I cannot understand exactly what the model formula(?) is doing. From the comment in the script, its purpose is to turn all the other columns into 0's and 1's, but I'm not sure how. If someone could clarify, that would be great. Thanks!
From ?formula:
The - operator removes the specified terms... [i]t can also used to remove the intercept term: when fitting a linear model y ~ x - 1 specifies a line through the origin.
Further:
There are two special interpretations of . in a formula. The usual one is in the context of a data argument of model fitting functions and means ‘all columns not otherwise in the formula’
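So, you have a one-sided formula (there is no response on the left of ~) whose terms are all the variables in train[,-c(1,2)], with the intercept removed. model.matrix then builds the design matrix for those terms, which expands each factor column into 0/1 indicator columns and omits the column of ones that an intercept would add.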
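A toy example may make the effect clearer. The data frame df and its columns color and size are made up for this sketch.

```r
df <- data.frame(color = factor(c("red", "green", "red")),
                 size = c(1, 2, 3))

# '.' expands to every column of df; '-1' drops the intercept column
mm <- model.matrix(~ -1 + ., data = df)
colnames(mm)   # "colorgreen" "colorred" "size"
mm
```

The factor color becomes one 0/1 column per level, while the numeric column size is passed through unchanged.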
