Leave-One-Out CV implementation for lin. regression - r

I am building a linear regression on the cafe dataset and I want to validate the results by calculating Leave-One-Out Cross-Validation.
I wrote my own function for that, which works if I fit lm() on all the data, but when I use a subset of columns (from stepwise regression), I get an error. Consider the following code:
cafe <- read.table("C:/.../cafedata.txt", header=T)
cafe$Date <- as.Date(cafe$Date, format="%d/%m/%Y")
#Delete row 34
cafe <- cafe[-c(34), ]
#wont need date
cafe <- cafe[,-1]
library(DAAG)
#center the data
cafe.c <- data.frame(lapply(cafe[,2:15], function(x) scale(x, center = FALSE, scale = max(x, na.rm = TRUE))))
cafe.c$Day.of.Week <- cafe$Day.of.Week
cafe.c$Sales <- cafe$Sales
#Leave-One-Out CrossValidation function
LOOCV <- function(fit, dataset){
# Attributes:
#------------------------------
# fit: Fit of the model
# dataset: Dataset to be used
# -----------------------------
# Returns mean of squared errors for each fold - MSE
MSEP_=c()
for (idx in 1:nrow(dataset)){
train <- dataset[-c(idx),]
test <- dataset[idx,]
MSEP_[idx]<-(predict(fit, newdata = test) - dataset[idx,]$Sales)^2
}
return(mean(MSEP_))
}
Then when I fit the simple linear model and call the function, it works:
#----------------Simple Linear regression with all attributes-----------------
fit.all.c <- lm(cafe.c$Sales ~., data=cafe.c)
#MSE:258
LOOCV(fit.all.c, cafe.c)
However, when I fit the same lm() with only a subset of columns, I get a much larger value and warnings:
#-------------------------Linear Stepwise regression--------------------------
step <- stepAIC(fit.all.c, direction="both")
fit.step <- lm(cafe.c$Sales ~ cafe.c$Bread.Sand.Sold + cafe.c$Bread.Sand.Waste
+ cafe.c$Wraps.Waste + cafe.c$Muffins.Sold
+ cafe.c$Muffins.Waste + cafe.c$Fruit.Cup.Sold
+ cafe.c$Chips + cafe.c$Sodas + cafe.c$Coffees
+ cafe.c$Day.of.Week,data=cafe.c)
LOOCV(fit.step, cafe.c)
5495.069
There were 50 or more warnings (use warnings() to see the first 50)
If I look closer:
idx <- 1
train <- cafe.c[-c(idx)]
test <- cafe.c[idx]
(predict(fit.step, newdata = test) -cafe.c[idx]$Sales)^2
I get squared errors for all rows and a warning:
Warning message:
'newdata' had 1 row but variables found have 47 rows
EDIT
I have found this question about the error, which says that this error occurs when I give different names to the columns, however this is not the case.

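Refer to the columns by name in the formula and let data= supply them. With cafe.c$Sales ~ cafe.c$... every term is hard-wired to the full cafe.c columns, so predict() ignores newdata and returns one prediction per training row, which is what the warning reports. Change your code as follows: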
fit.step <- lm(Sales ~ Bread.Sand.Sold + Bread.Sand.Waste
+ Wraps.Waste + Muffins.Sold
+ Muffins.Waste + Fruit.Cup.Sold
+ Chips + Sodas + Coffees
+ Day.of.Week,data=cafe.c)
LOOCV(fit.step, cafe.c)
# [1] 278.8984
idx <- 1
train <- cafe.c[-c(idx),]
test <- cafe.c[idx,] # need to select the row, not the column
(predict(fit.step, newdata = test) -cafe.c[idx,]$Sales)^2
# 1
# 51.8022
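The effect of the two formula styles can be reproduced on a built-in dataset. The sketch below uses mtcars rather than the cafe data, and the object names (fit_dollar, fit_named, one_row) are only illustrative:

```r
# Formula with df$col terms: the terms always evaluate to the full columns.
fit_dollar <- lm(mtcars$mpg ~ mtcars$wt, data = mtcars)
# Formula with bare column names: the terms are looked up in newdata.
fit_named <- lm(mpg ~ wt, data = mtcars)

one_row <- mtcars[1, ]

# predict() warns that 'newdata' had 1 row but the variables found have 32;
# the warning is suppressed here only to keep the output clean.
pred_dollar <- suppressWarnings(predict(fit_dollar, newdata = one_row))
pred_named <- predict(fit_named, newdata = one_row)

length(pred_dollar)  # 32: newdata was ignored
length(pred_named)   # 1: newdata was used
```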
Also, your LOOCV implementation is not correct. You must fit a new model every time, on the training data of each leave-one-out fold. Right now you train the model once on the entire dataset and use that single model to compute the squared error for each held-out row, so no observation is ever actually left out of the fit. Each fold should have its own model trained on its own training set.
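A minimal sketch of a LOOCV that refits on every fold is shown below. It takes a formula instead of a fitted model; the function name loocv_refit is hypothetical, and mtcars stands in for the cafe data, which is not available here:

```r
# Hypothetical helper: refit the model on each leave-one-out training set.
loocv_refit <- function(formula, dataset) {
  response <- all.vars(formula)[1]          # name of the response column
  sq_err <- numeric(nrow(dataset))
  for (idx in seq_len(nrow(dataset))) {
    train <- dataset[-idx, , drop = FALSE]  # all rows except idx
    test <- dataset[idx, , drop = FALSE]    # the single held-out row
    fit <- lm(formula, data = train)        # refit without the held-out row
    sq_err[idx] <- (predict(fit, newdata = test) - test[[response]])^2
  }
  mean(sq_err)
}

loocv_mse <- loocv_refit(mpg ~ wt + hp, mtcars)
loocv_mse
```

With the cafe data the call would presumably look like loocv_refit(Sales ~ Bread.Sand.Sold + Chips, cafe.c). If a factor such as Day.of.Week has a level that occurs in only one row, predict() will fail for that fold because the level is unseen in the training data.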

Related

Creating function to run k-fold cross validation on glmer object (Leave One Out Cross-Validation)

I am trying to create a function to run a k-fold cross validation on a glmer object.
This is just data I got online (my dataset is quite large) so the model isn't the best but if I can get this to work using this data I should be able to switch it to my dataset quite easily.
I want to do a LOOCV(Leave One Out Cross-Validation)
"LOOCV(Leave One Out Cross-Validation) is a type of cross-validation approach in which each observation is considered as the validation set and the rest (N-1) observations are considered as the training set."
The outline I got was from Caroline's answer on this researchgate thread.
https://www.researchgate.net/post/Does_R_code_for_k-fold_cross_validation_of_a_nested_glmer_model_exist
#load libraries
library(tidyverse)
library(optimx)
library(lme4)
#add example data
Data <- read.csv("https://stats.idre.ucla.edu/stat/data/hdp.csv")
Data <- select(Data, remission, IL6, CRP, DID)
Data
Data$remission<- as.factor(Data$remission)
Data$DID<- as.factor(Data$DID)
#add ROW column
Data <- Data %>% mutate(ROW = row_number())
head(Data)
PTOT=NULL
for (i in 1:8825) { # i in total number of observations in dataset
##Data that will be predicted
DataC1=Data[unique(Data$ROW)==i,]
###To train the model
DataCV=Data[unique(DataC1$ROW)!=i,]
M1 <- glmer(remission ~ 1 + IL6 + CRP + ( 1 | DID ), data = DataCV, family = binomial, control = glmerControl(optimizer ='optimx', optCtrl=list(method='L-BFGS-B')))
P1=predict(M1, DataC1)
names(P1)=NULL
P1
PTOT= c(PTOT, P1)
}
R2cv=1-(sum((remission-PTOT)^2)/(length(PTOT))/(var(remission)))
This is the error I get
"Error: Invalid grouping factor specification, DID"
DataCV is empty.
For example:
i <- 1 ## first time through the loop
DataCV=Data[unique(DataC1$ROW)!=i,]
I think that should have been Data$ROW, not DataC1$ROW.
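The indexing problem can be seen on a small made-up data frame (toy and its columns are illustrative, not the hdp data):

```r
toy <- data.frame(ROW = 1:5, value = c(10, 20, 30, 40, 50))
i <- 1

held_out <- toy[toy$ROW == i, ]          # the single row to predict

# Buggy version: held_out$ROW is just i, so the test is FALSE
# and indexing with a single FALSE selects no rows at all.
buggy_train <- toy[unique(held_out$ROW) != i, ]

# Fixed version: compare every row of the full data against i.
fixed_train <- toy[toy$ROW != i, ]

nrow(buggy_train)  # 0
nrow(fixed_train)  # 4
```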
A few other comments: a more compact version of your code would look something like this:
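## fit the full model
M1 <- glmer(remission ~ 1 + IL6 + CRP + ( 1 | DID ), data = Data,
family = binomial, control = glmerControl(optimizer ='optimx', optCtrl=list(method='L-BFGS-B')))
## remission was converted to a factor above; recover the 0/1 values
obs <- as.numeric(as.character(Data$remission))
res <- numeric(nrow(Data))
for (i in 1:nrow(Data)) {
new_fit <- update(M1, data = Data[-i,])
## type = "response" gives a predicted probability rather than the log-odds
res[i] <- (predict(new_fit, newdata = Data[i,], type = "response") - obs[i])^2
}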
For a well-specified model LOOCV is asymptotically equivalent to AIC, so you might be doing a lot of work to get something that's not very different from the AIC (which you can get directly from a single model fit) ...

How to loop over columns to evaluate different fixed effects in consecutive lme4 mixed models and extract the coefficients and P values?

I am new to R and am trying to loop a mixed model across 90 columns in a dataset.
My dataset looks like the following one but has 90 predictors instead of 7 that I need to evaluate as fixed effects in consecutive models.
I then need to store the model output (coefficients and P values) to finally construct a figure summarizing the effect size of each predictor. I am aware of the debate about P value estimates from lme4 mixed models.
For example:
library(dplyr) # provides tibble(), %>% and arrange()
set.seed(101)
mydata <- tibble(id = rep(1:32, times=25),
time = sample(1:800),
experiment = rep(1:4, times=200),
Y = sample(1:800),
predictor_1 = runif(800),
predictor_2 = rnorm(800),
predictor_3 = sample(1:800),
predictor_4 = sample(1:800),
predictor_5 = seq(1:800),
predictor_6 = sample(1:800),
predictor_7 = runif(800)) %>% arrange (id, time)
The model to iterate across the N predictors is:
library(lme4)
library(lmerTest) # To obtain p values
mixed.model <- lmer(Y ~ predictor_1 + time + (1|id) + (1|experiment), data = mydata)
summary(mixed.model)
My coding skills are far from being able to set a loop to repeat the model across the N predictors in my dataset and store the coefficients and P values in a dataframe.
I have been able to iterate across all the predictors fitting linear models instead of mixed models using lapply. But I have failed to apply this strategy with mixed models.
varlist <- names(mydata)[5:11]
lm_models <- lapply(varlist, function(x) {
lm(substitute(Y ~ i, list(i = as.name(x))), data = mydata)
})
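One option is to update the formula of a restricted model (without the predictor) in an lapply loop over the predictors. Then summarize the resulting list and subset the coefficient matrix using a vectorized function. Row 3 of each coefficient matrix is the added predictor (after the intercept and time), and columns 1 and 5 are the estimate and the P value.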
library(lmerTest)
mixed.model <- lmer(Y ~ time + (1|id) + (1|experiment), data = mydata)
preds <- grep('pred', names(mydata), value=TRUE)
fits <- lapply(preds, \(x) update(mixed.model, paste('. ~ . + ', x)))
extract_coef_p <- Vectorize(\(x) x |> summary() |> coef() |> {\(.) .[3, c(1, 5)]}())
res <- `rownames<-`(t(extract_coef_p(fits)), preds)
res
#                 Estimate  Pr(>|t|)
# predictor_1 -7.177579138 0.8002737
# predictor_2 -5.010342111 0.5377551
# predictor_3 -0.013030513 0.7126500
# predictor_4 -0.041702039 0.2383835
# predictor_5 -0.001437124 0.9676346
# predictor_6  0.005259293 0.8818644
# predictor_7 31.304496255 0.2511275
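The same update-in-a-loop pattern can be tried without lme4. The sketch below swaps lmer() for lm() on the built-in mtcars data purely so that it runs in base R; the predictor names are illustrative, and with lm() the P value sits in the column named Pr(>|t|) rather than in column 5:

```r
# Restricted model without the predictor of interest.
base_fit <- lm(mpg ~ wt, data = mtcars)

# Candidate predictors to add one at a time (illustrative choice).
preds <- c("hp", "qsec", "drat")

# Refit once per predictor by updating the formula.
fits <- lapply(preds, function(p) update(base_fit, paste(". ~ . +", p)))

# Row 3 is the added predictor: (Intercept), wt, then the new term.
res <- t(sapply(fits, function(f) {
  coef(summary(f))[3, c("Estimate", "Pr(>|t|)")]
}))
rownames(res) <- preds
res
```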

Loop linear regression different predictor and outcome variables

I'm new to R but am slowly learning it to analyse a data set.
Let's say I have a data frame which contains 8 variables and 20 observations. Of the 8 variables, V1 - V3 are predictors and V4 - V8 are outcomes.
B = matrix(c(1:160),
nrow = 20,
ncol = 8)
df <- as.data.frame(B)
Using the car package, to perform a simple linear regression, display summary and confidence intervals is:
fit <- lm(V4 ~ V1, data = df)
summary(fit)
confint(fit)
How can I write code (loop or apply) so that R regresses each predictor on each outcome individually and extracts the coefficients and confidence intervals? I realise I'm probably trying to run before I can walk but any help would be really appreciated.
You could wrap your lines in a lapply call and train a linear model for each of your predictors (excluding the target, of course).
my.target <- 4
my.predictors <- (1:8)[-my.target] # parentheses needed: 1:8[-my.target] is parsed as 1:(8[-my.target])
lapply(my.predictors, (function(i){
fit <- lm(df[,my.target] ~ df[,i])
list(summary= summary(fit), confint = confint(fit))
}))
You obtain a list of lists.
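To regress every outcome on every predictor, one possible approach is to loop over column names and build each formula with reformulate(). The sketch below uses random toy data, because the 1:160 matrix above gives perfectly collinear columns; the names toy, predictors, outcomes and results are illustrative:

```r
set.seed(42)
# 20 observations, 8 variables named V1..V8 by as.data.frame().
toy <- as.data.frame(matrix(rnorm(160), nrow = 20, ncol = 8))

predictors <- paste0("V", 1:3)
outcomes <- paste0("V", 4:8)

# Every predictor/outcome combination, one row per model.
pairs <- expand.grid(predictor = predictors, outcome = outcomes,
                     stringsAsFactors = FALSE)

# Fit one simple regression per pair and keep coefficients and intervals.
results <- Map(function(p, o) {
  fit <- lm(reformulate(p, response = o), data = toy)
  list(coef = coef(fit), confint = confint(fit))
}, pairs$predictor, pairs$outcome)
names(results) <- paste(pairs$outcome, "~", pairs$predictor)

results[["V4 ~ V1"]]
```

Working with column names rather than sub-data-frames also sidesteps indexing a tibble with another tibble.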
Applied to my own data, the code that returns the error is:
my.target <- metabdata[c(34)]
my.predictors <- metabdata[c(18 : 23)]
lapply(my.predictors, (function(i){
fit <- lm(metabdata[, my.target] ~ metabdata[, i])
list(summary = summary(fit), confint = confint(fit))
}))
Returns:
Error: Unsupported index type: tbl_df

multiple linear regression: error in user defined function

I have written my own function for MLR. However, there seems to be an issue with the output (see the examples at the end).
But when I run the code, line by line, the output is correct.
mlr <- function(dependentvar, dataset) {
x <- model.matrix(dependentvar ~., dataset) # Design Matrix for x
y <- dependentvar # dependent variable
betas <- solve(crossprod(x))%*%crossprod(x,y) # beta values
SST <- t(y)%*%y - (sum(y)^2/dim(dataset)[1]) # total sum of squares
SSres <- t(y)%*%y -(t(betas)%*%crossprod(x,y)) # sum of squares of residuals
SSreg <- SST - SSres # regression sum of squares
sigmasqr <- SSres/(length(y) - dim(dataset)[2]) # variance or (MSE)
varofbeta <- sigmasqr[1]*solve( crossprod(x)) # variance of beta
cat("SST:", SST,"SSresiduals:", SSres,"SSregression:", SSreg, sep = "\n", append = FALSE)
return(betas)
}
To see the problem, try
mlr(trees$Height, trees)
I get the same problem even if I get rid of $
Height <- trees$Height
mlr(Height, trees)
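Inside the function, dependentvar ~ . treats dependentvar as a variable in its own right, so the dot expands to every column of dataset, including the response itself. Use the following instead: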
x <- model.matrix(reformulate(".", dependentvar), dataset)
y <- dataset[[dependentvar]]
and pass in dependentvar as a string.
Example:
mlr("Height", trees)
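Putting the two replaced lines into a cut-down version of the function gives something like the sketch below. The name mlr_sketch is hypothetical, and only the coefficient step is kept so the result can be compared with lm():

```r
# Hypothetical reduced version of mlr(): the response is passed as a string.
mlr_sketch <- function(dependentvar, dataset) {
  # Build "Height ~ ." style formula from the string.
  x <- model.matrix(reformulate(".", response = dependentvar), dataset)
  y <- dataset[[dependentvar]]
  # Solve the normal equations (X'X) b = X'y.
  betas <- solve(crossprod(x), crossprod(x, y))
  drop(betas)  # return a named vector instead of a one-column matrix
}

betas <- mlr_sketch("Height", trees)
betas
coef(lm(Height ~ ., data = trees))  # should agree up to rounding
```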

five-fold cross-validation with the use of linear regression

I would like to perform a five-fold cross validation for a regression model of degree 1
lm(y ~ poly(x, degree=1), data).
I generated 100 observations with the following code
set.seed(1)
GenData <- function(n){
x <- seq(-2,2,length.out=n)
y <- -4 - 3*x + 1.5*x^2 + 2*x^3 + rnorm(n,0,0.5)
return(cbind(x,y))
}
GenData(100)
D<-GenData(100)
and my code for this goal is
ind<-sample(1:100)
re<-NULL
k<-20
teams<- 5
t<-NULL
for (i in 1:teams) {
te<- ind[ ((i-1)*k+1):(i*k)]
train <- D[-te,1:2]
test <- D[te,1:2]
cl <- D[-te,2]
lm1 <- lm(cl ~train[,1] , data=train)
pred <- predict(lm1,test)
t<- c(t, sum(D[te,2] == pred) /dim(test)[1])
}
re<-c(re,mean(t))
where I split my data into training and test sets. With the training data I fit a regression in order to make predictions and compare them with my test data. But I get the following error:
"Error in predict(mult, test)$class :
$ operator is invalid for atomic vectors
In addition: Warning message:
'newdata' had 20 rows but variables found have 80 rows "
So I understand that I have to change something on the line
pred<-predict(lm1,test)
but I don't know what.
Thanks in advance!
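lm requires a data frame as input data. Also, trying to validate the model by checking whether each prediction exactly equals the observed value will not work for a continuous response: the simulated y contains normal noise (the irreducible error), so exact matches essentially never occur. Use an error measure such as the mean absolute error instead.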
Here is the updated code:
ind<-sample(1:100)
re<-NULL
k<-20
teams<- 5
t<-NULL
for (i in 1:teams) {
te<- ind[ ((i-1)*k+1):(i*k)]
train <- data.frame(D[-te,1:2])
test <- data.frame(D[te,1:2])
lm1 <- lm(y~x , data=train)
pred <- predict(lm1,test)
t<- c(t, sum(abs(D[te,2] - pred)) /dim(test)[1])
}
re<-c(re,mean(t))
In the lm() function, your y variable is cl, a vector not included in the data = argument:
cl <- D[-te,2]
lm1 <- lm(cl ~train[,1] , data=train)
No need to include the cl at all. Rather, simply specify x and y by their names in the dataset train, in this case the names are x and y:
names(train)
[1] "x" "y"
So your for loop would then look like:
for (i in 1:teams) {
te<- ind[ ((i-1)*k+1):(i*k)]
train <- D[-te,1:2]
test <- D[te,1:2]
lm1 <- lm(y ~x , data=train)
pred <- predict(lm1,test)
t[i]<- sum(D[te,2] == pred)/dim(test)[1]
}
Also, note that I have added the for loop index i so that values can be added to the object. Lastly, I had to make the D object a dataframe in order for the code to work:
D<-as.data.frame(GenData(100))
Your re object ends up being 0 because your model does not predict any numbers correctly. I would suggest using RMSE as a performance measure for continuous data.
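A possible five-fold version with RMSE as the performance measure is sketched below. It regenerates the data inline as a data frame, and the fold assignment via rep() and sample() is one choice among several; the names folds, rmse and cv_rmse are illustrative:

```r
set.seed(1)
n <- 100
x <- seq(-2, 2, length.out = n)
y <- -4 - 3 * x + 1.5 * x^2 + 2 * x^3 + rnorm(n, 0, 0.5)
D <- data.frame(x = x, y = y)

# Randomly assign each observation to one of five equally sized folds.
folds <- sample(rep(1:5, length.out = n))

rmse <- numeric(5)
for (i in 1:5) {
  train <- D[folds != i, ]
  test <- D[folds == i, ]
  fit <- lm(y ~ x, data = train)             # degree-1 model
  pred <- predict(fit, newdata = test)
  rmse[i] <- sqrt(mean((test$y - pred)^2))   # RMSE on the held-out fold
}

cv_rmse <- mean(rmse)
cv_rmse
```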
