five-fold cross-validation with linear regression - r

I would like to perform five-fold cross-validation for a regression model of degree 1:
lm(y ~ poly(x, degree=1), data).
I generated 100 observations with the following code
set.seed(1)
GenData <- function(n){
x <- seq(-2,2,length.out=n)
y <- -4 - 3*x + 1.5*x^2 + 2*x^3 + rnorm(n,0,0.5)
return(cbind(x,y))
}
D <- GenData(100)
and my code for this goal is
ind<-sample(1:100)
re<-NULL
k<-20
teams<- 5
t<-NULL
for (i in 1:teams) {
  te <- ind[((i-1)*k+1):(i*k)]
  train <- D[-te, 1:2]
  test <- D[te, 1:2]
  cl <- D[-te, 2]
  lm1 <- lm(cl ~ train[,1], data = train)
  pred <- predict(lm1, test)
  t <- c(t, sum(D[te, 2] == pred) / dim(test)[1])
}
re<-c(re,mean(t))
where I split my data into training and test sets. With the training data I run a regression in order to make predictions and compare them with my test data. But I get the following error
"Error in predict(mult, test)$class :
$ operator is invalid for atomic vectors
In addition: Warning message:
'newdata' had 20 rows but variables found have 80 rows "
So I understand that I have to change something on the line
pred<-predict(lm1,test)
but I don't know what.
Thanks in advance!

lm requires a data frame as input data. Also, trying to validate the model by checking whether the predictions exactly match the expected values will not work: you are simulating the irreducible error with normal noise, so exact matches will essentially never occur.
Here is the updated code:
ind <- sample(1:100)
re <- NULL
k <- 20
teams <- 5
t <- NULL
for (i in 1:teams) {
  te <- ind[((i-1)*k+1):(i*k)]
  train <- data.frame(D[-te, 1:2])   # lm() needs a data frame
  test <- data.frame(D[te, 1:2])
  lm1 <- lm(y ~ x, data = train)
  pred <- predict(lm1, test)
  t <- c(t, sum(abs(D[te, 2] - pred)) / dim(test)[1])  # mean absolute error on this fold
}
re <- c(re, mean(t))

In the lm() function, your y variable is cl, a vector not included in the data = argument:
cl <- D[-te,2]
lm1 <- lm(cl ~train[,1] , data=train)
There is no need to include cl at all. Instead, simply refer to the variables by their names in the dataset train; in this case the names are x and y:
names(train)
[1] "x" "y"
So your for loop would then look like:
for (i in 1:teams) {
  te <- ind[((i-1)*k+1):(i*k)]
  train <- D[-te, 1:2]
  test <- D[te, 1:2]
  lm1 <- lm(y ~ x, data = train)
  pred <- predict(lm1, test)
  t[i] <- sum(D[te, 2] == pred) / dim(test)[1]
}
Also, note that I have added the for loop index i so that values can be assigned into the object by position. Lastly, I had to make the D object a data frame in order for the code to work:
D<-as.data.frame(GenData(100))
Your re object ends up being 0 because your model never predicts any value exactly. I would suggest using RMSE as a performance measure for continuous data.
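As a rough sketch of that suggestion, assuming D has been converted to a data frame as above, the fold loop could accumulate the RMSE of each held-out fold instead of counting exact matches:
# Sketch only: RMSE per fold instead of an exact-match rate
t <- NULL
for (i in 1:teams) {
  te <- ind[((i-1)*k + 1):(i*k)]
  train <- D[-te, 1:2]
  test <- D[te, 1:2]
  lm1 <- lm(y ~ x, data = train)
  pred <- predict(lm1, test)
  t[i] <- sqrt(mean((test$y - pred)^2))  # root mean squared error on the held-out fold
}
re <- mean(t)  # average RMSE over the five folds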

Related

fitting linear regression models with different predictors using loops

I want to fit regression models using a single predictor variable at a time. In total I have 7 predictors and 1 response variable. I want to write a chunk of code that picks a predictor variable from the data frame and fits a model. I would further like to extract the regression coefficient (not the intercept) and its sign, and store them in 2 vectors. Here's my code:
for (x in 1:7) {
  fit <- lm(distance ~ FAA_unique_with_duration_filtered[x], data = FAA_unique_with_duration_filtered)
  coeff_values <- summary(fit)$coefficients[,1]
  coeff_value <- coeff_values[2]
  append(coeff_value_vector, coeff_value, after = length(coeff_value_vector))
  append(RCs_sign_vector, sign(coeff_values[2]), after = length(RCs_sign_vector))
}
Here, x will use the first column, then the 2nd, and so on. However, I am getting the following error.
Error in model.frame.default(formula = distance ~ FAA_unique_with_duration_filtered[x], :
invalid type (list) for variable 'FAA_unique_with_duration_filtered[x]'
Is there a way to do this using loops?
You don't really need loops for this.
Suppose we want to regress y1, the 5th column of the built-in anscombe dataset, separately on each of the first 4 columns.
Then:
a <- anscombe
reg <- function(i) coef(lm(y1 ~., a[c(5, i)]))[[2]] # use lm
coefs <- sapply(1:4, reg)
signs <- sign(coefs)
# or
a <- anscombe
reg <- function(i) cov(a$y1, a[[i]]) / var(a[[i]]) # use formula for slope
coefs <- sapply(1:4, reg)
signs <- sign(coefs)
Alternatively, use the following, where reg is either of the reg definitions above:
a <- anscombe
coefs <- numeric(4)
for(i in 1:4) coefs[i] <- reg(i)
signs <- sign(coefs)
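If you do want to keep a loop over your own data frame, one possible sketch (assuming distance is the response and the 7 predictors are the first 7 columns; the exact column positions are an assumption here) builds each formula with reformulate() and assigns by index, since append() does not modify its argument in place:
df <- FAA_unique_with_duration_filtered   # assumed name of your data frame, as in the question
coeff_value_vector <- numeric(7)
RCs_sign_vector <- numeric(7)
for (x in 1:7) {
  f <- reformulate(names(df)[x], response = "distance")  # e.g. distance ~ <predictor x>
  fit <- lm(f, data = df)
  coeff_value_vector[x] <- coef(fit)[2]      # slope, not the intercept
  RCs_sign_vector[x] <- sign(coef(fit)[2])
}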

"non-conforming parameters in function :" in simple linear regression using JAGS

I am super new to JAGS and Bayesian statistics, and have simply been trying to follow Chapter 22 on Bayesian statistics in Crawley's R Book (2nd edition). I copied the code exactly as it appears in the book for the simple linear model growth = a + b * tannin, where there are 9 rows of two continuous variables, growth and tannin. The data and packages are:
install.packages("R2jags")
library(R2jags)
growth <- c(12,10,8,11,6,7,2,3,3)
tannin <- c(0,1,2,3,4,5,6,7,8)
N <- c(1,2,3,4,5,6,7,8,9)
bay.df <- data.frame(growth,tannin,N)
The ASCII file looks like this:
model{
  for(i in 1:N) {
    growth[i] ~ dnorm(mu[i], tau)
    mu[i] <- a + b*tannin[i]
  }
  a ~ dnorm(0.0, 1.0E-4)
  b ~ dnorm(0.0, 1.0E-4)
  sigma <- 1.0/sqrt(tau)
  tau ~ dgamma(1.0E-3, 1.0E-3)
}
But then, when I use this code:
> practicemodel <- jags(data=data.jags,parameters.to.save = c("a","b","tau"),
+ n.iter=100000, model.file="regression.bugs.txt", n.chains=3)
I get an error message that says:
module glm loaded
Compiling model graph
Resolving undeclared variables
Deleting model
Error in jags.model(model.file, data = data, inits = init.values, n.chains = n.chains, :
RUNTIME ERROR:
Non-conforming parameters in function :
The problem has been solved!
Basically the change is from N <- c(1,2,...,9) to N <- 9, but there is one other solution as well, where no N is defined beforehand: you can supply N inside the data.jags list as the number of rows in the data frame, data.jags = list(growth=bay.df$growth, tannin=bay.df$tannin, N=nrow(bay.df)).
Here is the new code:
# Make the data frame
growth <- c(12,10,8,11,6,7,2,3,3)
tannin <- c(0,1,2,3,4,5,6,7,8)
# CHANGED: this is so the JAGS code knows there are 9 rows of data
N <- 9
bay.df <- data.frame(growth,tannin)
library(R2jags)
# Now, write the Bugs model and save it in a text file
sink("regression.bugs.txt") #tell R to put the following into this file
cat("
model{
  for(i in 1:N) {
    growth[i] ~ dnorm(mu[i], tau)
    mu[i] <- a + b*tannin[i]
  }
  a ~ dnorm(0.0, 1.0E-4)
  b ~ dnorm(0.0, 1.0E-4)
  sigma <- 1.0/sqrt(tau)
  tau ~ dgamma(1.0E-3, 1.0E-3)
}
", fill=TRUE)
sink() #tells R to stop putting things into this file.
#tell jags the names of the variables containing the data
data.jags <- list("growth","tannin","N")
# run the jags() function to fit the model:
practicemodel <- jags(data=data.jags,parameters.to.save = c("a","b","tau"),
n.iter=100000, model.file="regression.bugs.txt", n.chains=3)
# inspect the model output. Important to note that the output will
# be different every time because there's a stochastic element to the model
practicemodel
# plots the information nicely, can visualize the error
# margin for each parameter and deviance
plot(practicemodel)
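For the second option mentioned above, where N is not defined beforehand, the data list could instead be built like this (a minimal sketch, assuming the same model file):
# Alternative: supply N inside the data list as the number of rows
data.jags <- list(growth = bay.df$growth, tannin = bay.df$tannin, N = nrow(bay.df))
practicemodel <- jags(data = data.jags, parameters.to.save = c("a","b","tau"),
                      n.iter = 100000, model.file = "regression.bugs.txt", n.chains = 3)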
Thanks for the help! I hope this helps others.

Leave-One-Out CV implementation for lin. regression

I am building a linear regression on the cafe dataset and I want to validate the results by calculating leave-one-out cross-validation.
I wrote my own function for that, which works if I fit lm() on all columns, but when I use a subset of columns (from stepwise regression), I get an error. Consider the following code:
cafe <- read.table("C:/.../cafedata.txt", header=T)
cafe$Date <- as.Date(cafe$Date, format="%d/%m/%Y")
#Delete row 34
cafe <- cafe[-c(34), ]
#wont need date
cafe <- cafe[,-1]
library(DAAG)
#center the data
cafe.c <- data.frame(lapply(cafe[,2:15], function(x) scale(x, center = FALSE, scale = max(x, na.rm = TRUE))))
cafe.c$Day.of.Week <- cafe$Day.of.Week
cafe.c$Sales <- cafe$Sales
#Leave-One-Out CrossValidation function
LOOCV <- function(fit, dataset){
  # Attributes:
  # ------------------------------
  # fit: fit of the model
  # dataset: dataset to be used
  # ------------------------------
  # Returns the mean of squared errors over the folds (MSE)
  MSEP_ <- c()
  for (idx in 1:nrow(dataset)){
    train <- dataset[-c(idx), ]
    test <- dataset[idx, ]
    MSEP_[idx] <- (predict(fit, newdata = test) - dataset[idx, ]$Sales)^2
  }
  return(mean(MSEP_))
}
Then when I fit the simple linear model and call the function, it works:
#----------------Simple Linear regression with all attributes-----------------
fit.all.c <- lm(cafe.c$Sales ~., data=cafe.c)
#MSE:258
LOOCV(fit.all.c, cafe.c)
However, when I fit the same lm() with only a subset of columns, I get an error:
#-------------------------Linear Stepwise regression--------------------------
step <- stepAIC(fit.all.c, direction="both")
fit.step <- lm(cafe.c$Sales ~ cafe.c$Bread.Sand.Sold + cafe.c$Bread.Sand.Waste
+ cafe.c$Wraps.Waste + cafe.c$Muffins.Sold
+ cafe.c$Muffins.Waste + cafe.c$Fruit.Cup.Sold
+ cafe.c$Chips + cafe.c$Sodas + cafe.c$Coffees
+ cafe.c$Day.of.Week,data=cafe.c)
LOOCV(fit.step, cafe.c)
5495.069
There were 50 or more warnings (use warnings() to see the first 50)
If I look closer:
idx <- 1
train <- cafe.c[-c(idx)]
test <- cafe.c[idx]
(predict(fit.step, newdata = test) -cafe.c[idx]$Sales)^2
I get squared errors for all rows and a warning:
Warning message:
'newdata' had 1 row but variables found have 47 rows
EDIT
I have found this question about the error, which says that this error occurs when I give different names to the columns, however this is not the case.
Change your code to the following:
fit.step <- lm(Sales ~ Bread.Sand.Sold + Bread.Sand.Waste
+ Wraps.Waste + Muffins.Sold
+ Muffins.Waste + Fruit.Cup.Sold
+ Chips + Sodas + Coffees
+ Day.of.Week,data=cafe.c)
LOOCV(fit.step, cafe.c)
# [1] 278.8984
idx <- 1
train <- cafe.c[-c(idx),]
test <- cafe.c[idx,] # need to select the row, not the column
(predict(fit.step, newdata = test) -cafe.c[idx,]$Sales)^2
# 1
# 51.8022
Also, your LOOCV implementation is not correct. You must fit a new model every time on the training data of each leave-one-out fold. Right now you are training the model once on the entire dataset and using that single model to compute the MSE on the held-out observation of every fold, but you should train a different model on each training dataset.
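A minimal sketch of such a LOOCV, refitting within each fold (assuming Sales is the response, as in your data), could look like this:
LOOCV_refit <- function(formula, dataset){
  # Refit the model on the n-1 training rows of every leave-one-out fold
  # and return the mean squared prediction error on the held-out rows.
  MSEP_ <- numeric(nrow(dataset))
  for (idx in 1:nrow(dataset)){
    train <- dataset[-idx, ]
    test <- dataset[idx, ]
    fit <- lm(formula, data = train)   # new model for this fold
    MSEP_[idx] <- (predict(fit, newdata = test) - test$Sales)^2
  }
  mean(MSEP_)
}
# e.g. LOOCV_refit(Sales ~ Bread.Sand.Sold + Wraps.Waste + Coffees, cafe.c)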

Potential bug in R's `polr` function when run from a function environment?

I may have found some sort of bug in the polr function (ordinal / polytomous regression) of the MASS library in R. The problem seems to be related to the use of coef() on the summary object, but maybe it is not.
The problem occurs in a function of type:
pol_me <- function(d){
  some_x <- d[,1]
  mod <- polr(some_x ~ d[,2])
  pol_sum <- summary(mod)
  return(pol_sum)
}
To illustrate, I simulate some data for an ordinal regression model.
set.seed(2016)
n=1000
x1 <- rnorm(n)
x2 <- 2*x1 +rnorm(n)
make_ord <- function(y){
  y_ord <- y
  y_ord[y < (-1)] <- 1
  y_ord[y >= (-1) & y <= (1)] <- 2
  y_ord[y >= 1] <- 3
  y_ord <- as.factor(y_ord)
}
x1 <- make_ord(x1)
dat <- data.frame(x1,x2)
When we now call the function:
library(MASS)
pol_me(d = dat)
We get the error
Error in eval(expr, envir, enclos) : object 'some_x' not found
I do not think this should logically happen at this point. In fact, when we define an alternative function in which the model call is replaced by a linear model lm() on a numeric dependent variable, i.e.
mod <- lm(as.numeric(some_x) ~ d[,2])
The resulting function works fine.
Is this really a bug or a programming problem in my code, and how can I get pol_me to run?
summary(polr(dat[,1] ~ dat[,2])) prints the message "Re-fitting to get Hessian", and that re-fit is the cause of the error. polr's argument Hess = TRUE will solve your problem. (?polr says Hess: logical for whether the Hessian (the observed information matrix) should be returned. Use this if you intend to call summary or vcov on the fit.)
pol_me <- function(d){
  some_x <- d[,1]
  mod <- polr(some_x ~ d[,2], Hess = TRUE)  # modified: compute the Hessian at fit time
  pol_sum <- summary(mod)
  return(pol_sum)
}
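With Hess = TRUE the Hessian is computed at fit time, so summary() no longer needs to re-fit the model inside the function environment, and the call from the question should now run:
library(MASS)
pol_me(d = dat)   # now returns the polr summary instead of the 'some_x' not found error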

Why is leave-one-out cross-validation of GLM model (package=boot) failing when data contains NaN's?

This is a fairly simple procedure: refitting a GLM model on a subset of the data (the training set) and calculating the accuracy of the prediction on the remaining data. I am trying to run a "leave-one-out" strategy on a data set (i.e. the training subset has length n-1) using the cv.glm function of the boot package.
Am I doing something wrong, or is it really the case that the function doesn't handle NAs? I'm guessing that this is fairly easy to program on my own, but I would appreciate any advice on whether there is some other mistake that I am making. Cheers.
Example:
require(boot)
#create data
n <- 100
x <- runif(n)
e <- rnorm(n, sd=100)
a <- 5
b <- 3
y <- exp(a + b*x) + e
plot(y ~ x)
plot(y ~ x, log="y")
#make some y's NaN
set.seed(1)
y[sample(n, 0.1*n)] <- NaN
#fit glm model
df <- data.frame(y=y, x=x)
glm.fit <- glm(y ~ x, data=df, family=gaussian(link="log"))
summary(glm.fit)
#calculate mean error of prediction (leave-one-out cross-validation)
cv.res <- cv.glm(df, glm.fit)
cv.res$delta
[1] NA NA
You're right. The function is not set up to handle NAs. The various options for the na.action argument of the glm() function don't really help, either. The easiest way to deal with it is to remove the NAs from the data frame at the outset.
sub <- df[!is.na(df$y), ]
glm.fit <- glm(y ~ x, data=sub, family=gaussian(link="log"))
summary(glm.fit)
# calculate mean error of prediction (leave-one-out cross-validation)
cv.res <- cv.glm(sub, glm.fit)
cv.res$delta
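Equivalently, since y is the only variable with missing values in this example, na.omit() drops the affected rows in one step (a small variation on the code above):
sub <- na.omit(df)   # removes every row containing an NA/NaN; here only y has NaNs
glm.fit <- glm(y ~ x, data = sub, family = gaussian(link = "log"))
cv.res <- cv.glm(sub, glm.fit)
cv.res$delta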
