Calculating RMSE for Simulated Linear Regression in R

I am trying to calculate the RMSE for simulated data, but the output gives NaN for the RMSE. Below is the code I am using.
library(caret)

RMSE <- function(x, y) sqrt(mean((x - y)^2))

sim.regression <- function(n.obs = 200, coefficients = c(3, 1.5, 0, 0, 2, 0, 0, 0), s.deviation = .1){
  n.var = length(coefficients)
  M = matrix(0, ncol = n.var, nrow = n.obs)
  beta = as.matrix(coefficients)
  for (i in 1:n.var){
    M[, i] = rnorm(n.obs, 0, 1)
  }
  y = M %*% beta + rnorm(n.obs, 0, s.deviation)
  train.data <- y[1:150]
  train.data <- data.frame(train.data)
  test.data <- y[151:200]
  test.data <- data.frame(test.data)
  prediction <- predict(lm(y ~ M), test.data)
  RMSE.data <- RMSE(prediction, test.data$y)
  return(list(x = M, y = y, coeff = coefficients, RMSE = RMSE.data))
}
set.seed(2000)
sim.regression(100)

Welcome to SO. There were a few issues in your code:
Assuming that you are trying to learn/predict y based on M, you have to combine M and y into a single data frame.
Only after that should you split off the first 150 rows for training and the remaining rows for testing.
Then you train on train.data and predict on test.data.
Also, since you have hardcoded [1:150] and [151:200] for the train-test split, you have to pass 200 observations, as in sim.regression(200).
Corrected code below:
library(caret)

RMSE <- function(x, y) sqrt(mean((x - y)^2))

sim.regression <- function(n.obs = 200, coefficients = c(3, 1.5, 0, 0, 2, 0, 0, 0), s.deviation = .1){
  n.var = length(coefficients)
  M = matrix(0, ncol = n.var, nrow = n.obs)
  beta = as.matrix(coefficients)
  for (i in 1:n.var){
    M[, i] = rnorm(n.obs, 0, 1)
  }
  y = M %*% beta + rnorm(n.obs, 0, s.deviation)
  # combine predictors and response in one data frame, then split
  data <- data.frame(M, y)
  train.data <- data[1:150, ]
  test.data <- data[151:200, ]
  # fit on the training rows only, predict on the held-out rows
  prediction <- predict(lm(y ~ ., data = train.data), test.data)
  RMSE.data <- RMSE(prediction, test.data$y)
  return(list(x = M, y = y, coeff = coefficients, RMSE = RMSE.data))
}
set.seed(2000)
sim.regression(200)
Prints:
$RMSE
0.0755869850491716
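Incidentally, caret also ships a built-in RMSE(pred, obs) helper, which the hand-rolled definition above masks; a quick equivalence check (a minimal sketch with arbitrary vectors):
# caret's built-in RMSE matches the hand-rolled definition above
p <- rnorm(10)
o <- rnorm(10)
all.equal(caret::RMSE(p, o), sqrt(mean((p - o)^2)))  # TRUE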

Related

How to convert one-fold cross-validation to K-fold cross-validation in R

I have a GAM model for which I would like to calculate AUC, TSS (True Skill Statistic) and RMSE through 5-fold cross-validation in R. Unfortunately, the caret package does not support GAM and therefore cannot be used. As I didn't find any alternative, I tried to build the cross-validation code myself, and it works well, with the only problem that it is effectively one-fold: a single random train/test split. Could anybody help me make this 5-fold? Sorry if this is an elementary question; I am new to R.
library(mgcv)   # gam
library(pROC)   # roc, auc
library(caret)  # RMSE, confusionMatrix

sample <- sample(c(TRUE, FALSE), nrow(DF), replace = TRUE, prob = c(0.8, 0.2))
train <- DF[sample, ]
test <- DF[!sample, ]
GAM <- gam(Y ~ X1 + X2 + X3 + X4 + s(X5, k = 3), data = train, family = "binomial")
predicted <- predict(GAM, test, type = "response")
# Calculating RMSE
RMSE(test$Y, predicted)
# Calculating AUC
auc(test$Y, predicted)
roc.curve <- roc(test$Y, predicted, ci = TRUE)
plot(roc.curve)
threshold <- 0.1
CM <- confusionMatrix(factor(predicted > threshold), factor(test$Y == 1), positive = "TRUE")
CM <- CM$byClass
Sensitivity <- CM[['Sensitivity']]
Specificity <- CM[['Specificity']]
# Calculating TSS
TSS <- Sensitivity + Specificity - 1
TSS
I have come across precisely this problem with GAM in the past. My approach was to create a vector that splits the data randomly into parts as equally sized as possible, then loop over the fold ids as follows:
k <- 5
FoldID <- rep(1:k, ceiling(nrow(modelData)/k))
length(FoldID) <- nrow(modelData)  # truncate to exactly one id per row
FoldID <- sample(FoldID, replace = FALSE)

for (fold in 1:k) {
  train_data <- modelData[FoldID != fold, ]
  val_data <- modelData[FoldID == fold, ]
  # Create training model and predictions
  # Calculate RMSE data etc.
  # Add a line with fold validation results to a dataframe
}
# Calculate column means of your validation results frame
I will leave you to fill in the gaps to suit your own requirements; one possible fill-in is sketched below. It would also be a good idea to add an outer loop (outside the FoldID creation) for repeats.
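For concreteness, here is one way the gaps could be filled in for the GAM above. This is a minimal sketch, assuming modelData holds the columns Y, X1..X5 from your question; the 0.1 threshold and the name fold_results are illustrative, not prescribed:
library(mgcv)
library(pROC)
library(caret)

k <- 5
FoldID <- rep(1:k, ceiling(nrow(modelData)/k))
length(FoldID) <- nrow(modelData)
FoldID <- sample(FoldID, replace = FALSE)

fold_results <- data.frame()
for (fold in 1:k) {
  train_data <- modelData[FoldID != fold, ]
  val_data <- modelData[FoldID == fold, ]
  # Create training model and predictions
  fit <- gam(Y ~ X1 + X2 + X3 + X4 + s(X5, k = 3),
             data = train_data, family = "binomial")
  pred <- predict(fit, newdata = val_data, type = "response")
  # Calculate RMSE, AUC and TSS for this fold
  fold_rmse <- RMSE(pred, val_data$Y)
  fold_auc <- as.numeric(auc(val_data$Y, pred))
  cm <- confusionMatrix(factor(pred > 0.1, levels = c(FALSE, TRUE)),
                        factor(val_data$Y == 1, levels = c(FALSE, TRUE)),
                        positive = "TRUE")
  fold_tss <- cm$byClass[["Sensitivity"]] + cm$byClass[["Specificity"]] - 1
  # Add a line with this fold's validation results to the dataframe
  fold_results <- rbind(fold_results,
                        data.frame(fold = fold, RMSE = fold_rmse,
                                   AUC = fold_auc, TSS = fold_tss))
}
# Column means of the validation results (drop the fold id column)
colMeans(fold_results[-1])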

Two methods of recovering fitted values from a Bayesian Structural Time Series model yield different results

Two conceptually plausible methods of retrieving in-sample predictions (or "conditional expectations") of y[t] given y[t-1] from a bsts model yield different results, and I don't understand why.
One method uses the prediction errors returned by bsts (defined as e=y[t] - E(y[t]|y[t-1]); source: https://rdrr.io/cran/bsts/man/one.step.prediction.errors.html):
library(bsts)
get_yhats1 <- function(fit){
  # One-step prediction errors are defined as e = y[t] - yhat
  # (source: https://rdrr.io/cran/bsts/man/one.step.prediction.errors.html)
  # Recover yhat as y - e
  bsts.pred.errors <- bsts.prediction.errors(fit, burn = SuggestBurn(0.1, fit))$in.sample
  predictions <- t(apply(bsts.pred.errors, 1, function(e){fit$original.series - e}))
  return(predictions)
}
Another sums the contributions of all model components at time t:
get_yhats2 <- function(fit){
burn <- SuggestBurn(0.1, fit)
X <- fit$state.contributions
niter <- dim(X)[1]
ncomp <- dim(X)[2]
nobs <- dim(X)[3]
# initialize final fit/residuals matrices with zeros
predictions <- matrix(data = 0, nrow = niter - burn, ncol = nobs)
p0 <- predictions
comps <- seq_len(ncomp)
for (comp in comps) {
# pull out the state contributions for this component and transpose to
# a niter x (nobs - burn) array
compX <- X[-seq_len(burn), comp, ]
# accumulate the predictions across each component
predictions <- predictions + compX
}
return(predictions)
}
Fit a model:
## Air passengers data
data("AirPassengers")
# 11 years, monthly data (timestep=monthly) --> 132 observations
Y <- stats::window(AirPassengers, start=c(1949,1), end=c(1959,12))
y <- log(Y)
ss <- AddLocalLinearTrend(list(), y)
ss <- AddSeasonal(ss, y, nseasons=12, season.duration=1)
bsts.model <- bsts(y, state.specification=ss, niter=500, family='gaussian')
Compute and compare the predictions from each of the two functions:
p1 <- get_yhats1(bsts.model)
p2 <- get_yhats2(bsts.model)
# Compare predictions for t=1:5, first MCMC iteration:
p1[1,1:5]; p2[1,1:5]
I'm the author of bsts.
The 'prediction errors' in bsts come from the filtering distribution. That is, they come from p(state | past data). The state contributions come from the smoothing distribution, i.e. p(state | all data). The filtering distribution looks backward in time, while the smoothing distribution looks both forward and backward. One typically needs the filtering distribution while using a fitted model, and the smoothing distribution while fitting the model in the first place.
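The numerical consequence can be seen directly from the p1 and p2 objects computed above (a minimal sketch; in-sample, the smoothed fit typically tracks the observed series more closely than the filtered one):
# Posterior-mean fitted values, one per time point, under each definition
filtered_fit <- colMeans(p1)  # from one-step (filtering) prediction errors
smoothed_fit <- colMeans(p2)  # from summed state contributions (smoothing)
# Mean in-sample absolute error of each fit against the observed series
mean(abs(filtered_fit - as.numeric(y)))
mean(abs(smoothed_fit - as.numeric(y)))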

How to recover fitted values from BSTS poisson model (in R)?

I am trying to recover in-sample predictions (fitted values) from a bsts model with a specified Poisson response, using the bsts package in R. The following results in an error: "Prediction errors are not supported for Poisson or logit models."
data("AirPassengers")
# 11 years, monthly data (timestep=monthly) --> 132 observations
Y <- stats::window(AirPassengers, start=c(1949,1), end=c(1959,12))
y <- log10(Y)
ss <- AddLocalLinearTrend(list(), y)
ss <- AddSeasonal(ss, y, nseasons=12, season.duration=1)
bsts.model <- bsts(Y, state.specification=ss, niter=150, family='poisson')
bsts.prediction.errors(bsts.model)
Is there a way to retrieve predictions on model-training data with a poisson model in bsts?
One way to do it is to extract the contributions of all model components at time t and sum them:
get_yhats2 <- function(fit){
  burn <- SuggestBurn(0.1, fit)
  X <- fit$state.contributions
  niter <- dim(X)[1]
  ncomp <- dim(X)[2]
  nobs <- dim(X)[3]
  # initialize the prediction matrix with zeros
  predictions <- matrix(data = 0, nrow = niter - burn, ncol = nobs)
  for (comp in seq_len(ncomp)) {
    # state contributions for this component with burn-in draws dropped:
    # a (niter - burn) x nobs matrix
    compX <- X[-seq_len(burn), comp, ]
    # accumulate the predictions across components
    predictions <- predictions + compX
  }
  return(predictions)
}
get_yhats2(bsts.model)
But see the previous question above, where I showed that this method didn't necessarily match the expectations I had even in the Gaussian case.
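One caveat, which is an assumption on my part rather than something the bsts docs spell out: with family='poisson' the state is modelled on the log-mean scale, so the summed contributions are log fitted values and should be exponentiated to get fitted counts. A sketch:
# Assumption: Poisson bsts uses a log link, so the summed state
# contributions are log(fitted mean); exponentiate for fitted counts
yhat.log <- get_yhats2(bsts.model)
yhat.counts <- exp(yhat.log)
colMeans(yhat.counts)[1:5]  # posterior-mean fitted counts, first 5 time points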

Bootstrapping residuals of a linear model

Suppose I want to assess the goodness of a linear model before and after leaving out a covariate, and I want to implement some kind of bootstrapping.
I tried bootstrapping the sum of residuals of both models and then applied the Kolmogorov-Smirnov test to assess whether the two samples come from the same distribution.
A minimal working example:
library(boot)

lm.statistic.resid <- function(data, i){
  d <- data[i, ]
  r.gressor <- colnames(data)[1]
  c.variates <- colnames(data)[-1]
  # regress the first column on the remaining covariates
  lm.boot <- lm(reformulate(c.variates, response = r.gressor), data = d)
  out <- sum(resid(lm.boot))
  return(out)
}

df.restricted <- mtcars[, names(mtcars) != "wt"]
classical.lm <- lm(mpg ~ ., data = mtcars)
restricted.lm <- lm(mpg ~ ., data = df.restricted)

boot.regression.full <- boot(mtcars,
                             statistic = lm.statistic.resid,
                             R = 1000)
boot.regression.restricted <- boot(df.restricted,
                                   statistic = lm.statistic.resid,
                                   R = 1000)

x <- boot.regression.restricted$t
y <- boot.regression.full$t
ks.test(x, y)
However, I get roughly the same result whether I remove wt (which is statistically significant) or am (which is not).
I would expect a smaller p-value when I remove wt.

Five-fold cross-validation with linear regression

I would like to perform five-fold cross-validation for a regression model of degree 1:
lm(y ~ poly(x, degree=1), data).
I generated 100 observations with the following code:
set.seed(1)
GenData <- function(n){
  x <- seq(-2, 2, length.out = n)
  y <- -4 - 3*x + 1.5*x^2 + 2*x^3 + rnorm(n, 0, 0.5)
  return(cbind(x, y))
}
D <- GenData(100)
and my code for this goal is
ind <- sample(1:100)
re <- NULL
k <- 20
teams <- 5
t <- NULL
for (i in 1:teams) {
  te <- ind[((i-1)*k + 1):(i*k)]
  train <- D[-te, 1:2]
  test <- D[te, 1:2]
  cl <- D[-te, 2]
  lm1 <- lm(cl ~ train[,1], data = train)
  pred <- predict(lm1, test)
  t <- c(t, sum(D[te,2] == pred) / dim(test)[1])
}
re <- c(re, mean(t))
where I split my data into training and test sets. With the training data I run a regression in order to make a prediction and compare it with my test data. But I get the following error:
"Error in predict(mult, test)$class :
$ operator is invalid for atomic vectors
In addition: Warning message:
'newdata' had 20 rows but variables found have 80 rows "
So I understand that I have to change something in the line
pred <- predict(lm1, test)
but I don't know what.
Thanks in advance!
lm requires a data frame as its input data. Also, trying to validate the model by checking whether predictions exactly match the observed values will not work: you are simulating the irreducible error with normal noise, so a continuous prediction will essentially never equal the observation.
Here is the updated code:
ind <- sample(1:100)
re <- NULL
k <- 20
teams <- 5
t <- NULL
for (i in 1:teams) {
  te <- ind[((i-1)*k + 1):(i*k)]
  train <- data.frame(D[-te, 1:2])
  test <- data.frame(D[te, 1:2])
  lm1 <- lm(y ~ x, data = train)
  pred <- predict(lm1, test)
  # mean absolute error on the held-out fold
  t <- c(t, sum(abs(D[te,2] - pred)) / dim(test)[1])
}
re <- c(re, mean(t))
In the lm() call, your response variable is cl, a vector that is not a column of the data = argument:
cl <- D[-te, 2]
lm1 <- lm(cl ~ train[,1], data = train)
There is no need to create cl at all. Rather, simply refer to x and y by their column names in the dataset train:
names(train)
[1] "x" "y"
So your for loop would then look like:
for (i in 1:teams) {
  te <- ind[((i-1)*k + 1):(i*k)]
  train <- D[-te, 1:2]
  test <- D[te, 1:2]
  lm1 <- lm(y ~ x, data = train)
  pred <- predict(lm1, test)
  t[i] <- sum(D[te,2] == pred) / dim(test)[1]
}
Also, note that I have added the for-loop index i so that each fold's value is stored by position. Lastly, I had to make the D object a data frame in order for the code to work:
D <- as.data.frame(GenData(100))
Your re object ends up being 0 because continuous predictions essentially never equal the observed values exactly. I would suggest using RMSE as a performance measure for continuous data, as sketched below.
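For instance, a minimal sketch of the same loop with per-fold RMSE (the standard square root of the mean squared error; the object name rmse is illustrative):
rmse <- numeric(teams)
for (i in 1:teams) {
  te <- ind[((i-1)*k + 1):(i*k)]
  train <- D[-te, 1:2]
  test <- D[te, 1:2]
  lm1 <- lm(y ~ x, data = train)
  pred <- predict(lm1, test)
  # root mean squared error on the held-out fold
  rmse[i] <- sqrt(mean((test$y - pred)^2))
}
mean(rmse)  # five-fold CV estimate of RMSE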
