Difference in linear regression code - R

I am self-teaching R from "An Introduction to Statistical Learning: With Applications in R". I am sure I should get the same mean squared error from both code chunks below, but I get drastically different results. Can someone please help me figure out why I am not getting the same MSE? It looks like my first chunk is the wrong one. Both chunks use the Auto data set, and the training indices are the same, yet my predictions differ from the book's.
First Chunk (my code)
set.seed(1)
train_index = sample (392, 196)
Auto$index = c(1:nrow(Auto))
train_df = Auto[train_index,]
test_df = anti_join(Auto, train_df, by="index")
attach(train_df)
lm.fit = lm(mpg ~ horsepower)
predictions = predict(lm.fit, horsepower = test_df$horsepower)
mean((test_df$mpg - predictions)^2)
Second Chunk (book's code - An Introduction to Statistical Learning: With Applications in R)
set.seed(1)
train = sample(392, 196)
lm.fit = lm(mpg ~ horsepower, data = Auto, subset = train)
attach(Auto)
mean((mpg - predict(lm.fit, Auto))[-train]^2)

In your code, you’re not specifying the test data correctly in predict(). predict() takes a dataframe containing predictor variables, passed to the newdata argument; instead, you include horsepower = test_df$horsepower, which just gets absorbed by ... and has no effect.
If you instead pass the whole test_df dataframe to newdata, you get the same result as the text.
library(ISLR)
library(dplyr)
set.seed(1)
# OP’s code with change to predict()
train_index = sample(392, 196)
Auto$index = c(1:nrow(Auto))
train_df = Auto[train_index,]
test_df = anti_join(Auto, train_df, by="index")
attach(train_df)
lm.fit = lm(mpg ~ horsepower)
predictions = predict(lm.fit, newdata = test_df)
mean((test_df$mpg - predictions)^2)
# 23.26601
# ISLR code
set.seed(1)
train = sample(392, 196)
lm.fit = lm(mpg ~ horsepower, data = Auto, subset = train)
attach(Auto)
mean((mpg - predict(lm.fit, Auto))[-train]^2)
# 23.26601
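As a quick check on the explanation above (a small sketch using the objects already created; bad_pred is just a throwaway name), the misplaced horsepower = ... argument is swallowed by ..., so the call simply returns the fitted values for the 196 training rows rather than test-set predictions:
# the unused argument is absorbed by '...', so this is equivalent to predict(lm.fit)
bad_pred <- predict(lm.fit, horsepower = test_df$horsepower)
all.equal(unname(bad_pred), unname(fitted(lm.fit)))  # TRUE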

Related

Creating function to run k-fold cross validation on glmer object (Leave One Out Cross-Validation)

I am trying to create a function to run a k-fold cross validation on a glmer object.
This is just data I found online (my own dataset is quite large), so the model isn't the best, but if I can get this working on this data I should be able to switch it to my dataset quite easily.
I want to do LOOCV (Leave-One-Out Cross-Validation).
"LOOCV(Leave One Out Cross-Validation) is a type of cross-validation approach in which each observation is considered as the validation set and the rest (N-1) observations are considered as the training set."
The outline I got was from Caroline's answer on this researchgate thread.
https://www.researchgate.net/post/Does_R_code_for_k-fold_cross_validation_of_a_nested_glmer_model_exist
#load libraries
library(tidyverse)
library(optimx)
library(lme4)
#add example data
Data <- read.csv("https://stats.idre.ucla.edu/stat/data/hdp.csv")
Data <- select(Data, remission, IL6, CRP, DID)
Data
Data$remission<- as.factor(Data$remission)
Data$DID<- as.factor(Data$DID)
#add ROW column
Data <- Data %>% mutate(ROW = row_number())
head(Data)
PTOT=NULL
for (i in 1:8825) { # i in total number of observations in dataset
##Data that will be predicted
DataC1=Data[unique(Data$ROW)==i,]
###To train the model
DataCV=Data[unique(DataC1$ROW)!=i,]
M1 <- glmer(remission ~ 1 + IL6 + CRP + ( 1 | DID ), data = DataCV, family = binomial, control = glmerControl(optimizer ='optimx', optCtrl=list(method='L-BFGS-B')))
P1=predict(M1, DataC1)
names(P1)=NULL
P1
PTOT= c(PTOT, P1)
}
R2cv=1-(sum((remission-PTOT)^2)/(length(PTOT))/(var(remission)))
This is the error I get
"Error: Invalid grouping factor specification, DID"
DataCV is empty.
For example:
i <- 1 ## first time through the loop
DataCV=Data[unique(DataC1$ROW)!=i,]
I think that should have been Data$ROW, not DataC1$ROW.
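A quick way to see why DataCV comes out empty (a small check using the Data built above):
i <- 1
DataC1 <- Data[unique(Data$ROW) == i, ]   # a single row, so unique(DataC1$ROW) is just 1
unique(DataC1$ROW) != i                   # FALSE
nrow(Data[unique(DataC1$ROW) != i, ])     # 0 rows, so glmer() has nothing to fit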
A few other comments: a more compact version of your code would look something like this:
## fit the full model to the complete data
M1 <- glmer(remission ~ 1 + IL6 + CRP + (1 | DID), data = Data,
            family = binomial,
            control = glmerControl(optimizer = 'optimx',
                                   optCtrl = list(method = 'L-BFGS-B')))
res <- numeric(nrow(Data))
for (i in 1:nrow(Data)) {
    ## refit without observation i, then predict it on the probability scale
    new_fit <- update(M1, data = Data[-i, ])
    p_i     <- predict(new_fit, newdata = Data[i, ], type = "response")
    res[i]  <- (p_i - as.numeric(as.character(Data$remission[i])))^2
}
For a well-specified model LOOCV is asymptotically equivalent to AIC, so you might be doing a lot of work to get something that's not very different from the AIC (which you can get directly from a single model fit) ...
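For reference, that single-fit comparison is just the information criterion of the full-data model above (a minimal illustration):
AIC(M1)   # one number from one fit, no leave-one-out loop required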

linear regression in R

I'm trying to predict automobile prices based on a bunch of independent variables using linear regression. The only attributes in my data set that are chr are Fuel and Colour; the rest are either num or int. I omitted Fuel because it only has one level.
here is my code:
# Loading Data
car_data = read.csv("Car_Data (1).csv",header =TRUE)
car_data$Fuel <-NULL
car_data$Colour<- as.factor(car_data$Colour)
str(car_data)
set.seed(123)
indx <- sample(2, nrow(car_data), replace = T, prob = c(0.8, 0.2))
train <- car_data[indx == 1, ]
test <- car_data[indx == 2, ]
lmModel <- lm(Price ~ ., data = train)
summary(lmModel)
When I run summary(lmModel), it shows all NAs for the Std. Error, t value, and Pr(>|t|) columns.
Can someone help...
It's possible that your dataset has too few observations for the number of features you are trying to fit. It would help reproducibility if you could supply your dataset (or a minimal working example of a similar dataset). You could also try running a simpler regression specification to see if that teases out the error.
lmModelSimple <- lm(Price ~ Colour, data = train)
summary(lmModelSimple)
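A couple of quick checks along those lines (a sketch, assuming lmModel has been fit as above):
nrow(train)                                  # observations available for fitting
length(coef(lmModel))                        # coefficients the model is trying to estimate
names(coef(lmModel))[is.na(coef(lmModel))]   # terms that could not be estimated
alias(lmModel)                               # perfectly collinear (aliased) terms, if any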

Depth and OOB error of a randomForest and randomForestSRC

Here is my code for randomForest and rfsrc in R. Is there any way to include n_estimators and max_depth, as in the sklearn version, in my R code? Also, how can I plot OOB error vs. number of trees, like this?
set.seed(2234)
tic("Time to train RFSRC fast")
fast.o <- rfsrc.fast(Label ~ ., data = train[(1:50000),],forest=TRUE)
toc()
print(fast.o)
#print(vimp(fast.o)$importance)
set.seed(2367)
tic("Time to test RFSRC fast ")
#data(breast, package = "randomForestSRC")
fast.pred <- predict(fast.o, test[(1:50000),])
toc()
print(fast.pred)
set.seed(3)
tic("RF model fitting without Parallelization")
rf <-randomForest(Label~.,data=train[(1:50000),])
toc()
print(rf)
plot(rf)
varImp(rf,sort = T)
varImpPlot(rf, sort=T, n.var= 10, main= "Variable Importance", pch=16)
rf_pred <- predict(rf, newdata=test[(1:50000),])
confMatrix <- confusionMatrix(rf_pred,test[(1:50000),]$Label)
confMatrix
I appreciate your time.
You need to set block.size = 1, and also take note that the sampling is without replacement; see the vignette for rfsrc:
Unlike Breiman's random forests, the default action here is sampling
without replacement. Thus out-of-bag (OOB) technically means
out-of-sample, but for legacy reasons we retain the term OOB.
So using an example dataset,
library(mlbench)
library(randomForestSRC)
data(Sonar)
set.seed(911)
trn = sample(nrow(Sonar),150)
rf <- rfsrc(Class ~ ., data = Sonar[trn,],ntree=500,block.size=1,importance=TRUE)
pred <- predict(rf,Sonar[-trn,],block.size=1)
plot(rf$err.rate[,1], type = "l", col = "steelblue", xlab = "ntrees", ylab = "err.rate",
     ylim = c(0, 0.5))
lines(pred$err.rate[,1], col = "orange")
legend("topright", fill = c("steelblue", "orange"), c("OOB.train", "test"))
In randomForest:
library(randomForest)
rf <- randomForest(Class ~ ., data = Sonar[trn,],ntree=500)
pred <- predict(rf,Sonar[-trn,],predict.all=TRUE)
I am not sure whether there is an easier way to get the error as a function of the number of trees:
## cumulative majority vote over the first i trees, for each test observation
err_by_tree = sapply(1:ncol(pred$individual), function(i){
  apply(pred$individual[, 1:i, drop = FALSE], 1,
        function(x) with(rle(sort(x)), values[which.max(lengths)]))
})
err_by_tree = colMeans(err_by_tree != Sonar$Class[-trn])
Then plot:
plot(rf$err.rate[,1], type = "l", col = "steelblue", xlab = "ntrees", ylab = "err.rate",
     ylim = c(0, 0.5))
lines(err_by_tree, col = "orange")
legend("topright", fill = c("steelblue", "orange"), c("OOB.train", "test"))

sjt.lmer displaying incorrect p-values

I've just noticed that sjt.lmer tables are displaying incorrect p-values, e.g., p-values that do not reflect the model summary. This appears to be a new-ish issue, as this worked fine last month?
Using the data and code provided in the package vignette:
library(sjPlot)
library(sjmisc)
library(sjlabelled)
library(lme4)
library(sjstats)
# load sample data
data(efc)
# prepare grouping variables
efc$grp = as.factor(efc$e15relat)
levels(x = efc$grp) <- get_labels(efc$e15relat)
efc$care.level <- rec(efc$n4pstu, rec = "0=0;1=1;2=2;3:4=4",
                      val.labels = c("none", "I", "II", "III"))
# data frame for fitted model
mydf <- data.frame(
  neg_c_7 = efc$neg_c_7,
  sex = to_factor(efc$c161sex),
  c12hour = efc$c12hour,
  barthel = efc$barthtot,
  education = to_factor(efc$c172code),
  grp = efc$grp,
  carelevel = to_factor(efc$care.level)
)
# fit sample models
fit1 <- lmer(neg_c_7 ~ sex + c12hour + barthel + (1 | grp), data = mydf)
summary(fit1)
p_value(fit1, p.kr =TRUE)
[screenshot: model summary]
[screenshot: p_value() summary]
Why does the sjt.lmer() output not show these p-values?
Note that the first summary comes from a model fitted with lmerTest, which computes p-values with df based on Satterthwaite approximation (see first line in output).
p_value(), however, with p.kr = TRUE, uses the Kenward-Roger approximation from package pbkrtest, which is a bit more conservative.
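A minimal sketch to compare the two approximations side by side (assuming lmerTest, and pbkrtest for the Kenward-Roger option, are installed; fit1_lt is just a name chosen here to avoid clobbering fit1 above):
library(lmerTest)
fit1_lt <- lmer(neg_c_7 ~ sex + c12hour + barthel + (1 | grp), data = mydf)   # refit so lmerTest's summary() method is used
summary(fit1_lt, ddf = "Satterthwaite")$coefficients   # what the model summary reports
summary(fit1_lt, ddf = "Kenward-Roger")$coefficients   # the more conservative p-values that p_value(p.kr = TRUE) uses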
Your output from sjt.lmer() seems to be messed up somehow, and I can't reproduce it with your example; my output (table not shown here) looks fine.

How to stack machine learning models in R

I am new to machine learning and R.
I know that there is an R package called caretEnsemble, which can conveniently stack models in R. However, this package seems to have some problems when dealing with multi-class classification tasks.
For now, I wrote some code to try to stack the models manually, and here is the example I worked on:
library(caret)
set.seed(123)
library(AppliedPredictiveModeling)
data(AlzheimerDisease)
adData = data.frame(diagnosis, predictors)
inTrain = createDataPartition(adData$diagnosis, p = 3 / 4)[[1]]
training = adData[inTrain,]
testing = adData[-inTrain,]
set.seed(62433)
modelFitRF <- train(diagnosis ~ ., data = training, method = "rf")
modelFitGBM <- train(diagnosis ~ ., data = training, method = "gbm",verbose=F)
modelFitLDA <- train(diagnosis ~ ., data = training, method = "lda")
predRF <- predict(modelFitRF,newdata=testing)
predGBM <- predict(modelFitGBM, newdata = testing)
prefLDA <- predict(modelFitLDA, newdata = testing)
confusionMatrix(predRF, testing$diagnosis)$overall[1]
#Accuracy
#0.7682927
confusionMatrix(predGBM, testing$diagnosis)$overall[1]
#Accuracy
#0.7926829
confusionMatrix(prefLDA, testing$diagnosis)$overall[1]
#Accuracy
#0.7682927
Now I've got three models: modelFitRF, modelFitGBM and modelFitLDA, and three predicted vectors corresponding to such three models based on the test set.
Then I will create a data frame to contain these predicted vectors and the original dependent variable in the test set:
predDF <- data.frame(predRF, predGBM, prefLDA, diagnosis = testing$diagnosis, stringsAsFactors = F)
And then, I just used such data frame as a new train set to create a stacked model:
modelStack <- train(diagnosis ~ ., data = predDF, method = "rf")
combPred <- predict(modelStack, predDF)
confusionMatrix(combPred, testing$diagnosis)$overall[1]
#Accuracy
#0.804878
Considering that stacking models usually improves prediction accuracy, I'd like to believe this might be the right way to stack the models. However, I am doubtful because predDF is created from the three models' predictions on the test set.
I am not sure whether I should use results from the test set and then apply the stacked model back to that same test set to get the final predictions.
(I am referring to this block below:)
predDF <- data.frame(predRF, predGBM, prefLDA, diagnosis = testing$diagnosis, stringsAsFactors = F)
modelStack <- train(diagnosis ~ ., data = predDF, method = "rf")
combPred <- predict(modelStack, predDF)
confusionMatrix(combPred, testing$diagnosis)$overall[1]
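One common way to address that doubt (a sketch, not a definitive recipe; the oof() helper and the 5-fold setup are my own choices) is to train the stacker on out-of-fold predictions from the training set, and touch the test set only once at the end:
ctrl <- trainControl(method = "cv", number = 5, savePredictions = "final")
set.seed(62433)
modelFitRF  <- train(diagnosis ~ ., data = training, method = "rf",  trControl = ctrl)
modelFitGBM <- train(diagnosis ~ ., data = training, method = "gbm", trControl = ctrl, verbose = FALSE)
modelFitLDA <- train(diagnosis ~ ., data = training, method = "lda", trControl = ctrl)
# out-of-fold predictions, put back in the original row order
oof <- function(fit) fit$pred[order(fit$pred$rowIndex), "pred"]
stackTrain <- data.frame(predRF = oof(modelFitRF),
                         predGBM = oof(modelFitGBM),
                         predLDA = oof(modelFitLDA),
                         diagnosis = training$diagnosis)
modelStack <- train(diagnosis ~ ., data = stackTrain, method = "rf")
# at prediction time, feed the stacker the base models' test-set predictions
stackTest <- data.frame(predRF = predict(modelFitRF, newdata = testing),
                        predGBM = predict(modelFitGBM, newdata = testing),
                        predLDA = predict(modelFitLDA, newdata = testing))
combPred <- predict(modelStack, newdata = stackTest)
confusionMatrix(combPred, testing$diagnosis)$overall[1]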
