R: Why does lapply() double up my results?

I'm writing a function for getting diagnostics and test error from a series of linear regression models.
My input is a list of lists. Each list carries the information for its own model.
model.1 <- list("medv","~.","Boston_Ready")
names(model.1) <- c("response", "input","dataset")
model.2 <- list("medv","~lstat","Boston_Ready")
names(model.2) <- c("response", "input","dataset")
models <- list(model.1,model.2)
My function calculates regression diagnostics when given one list that has the dataframe, response variable and inputs.
TestError <- function(model){
  library('boot')
  df <- get(model$dataset)
  formula <- paste(model$response, model$input)
  response <- model$response
  ## Diagnostics
  fit <- lm(formula, data = df)
  fit_summ <- summary(fit)
  F_Stat <- fit_summ$fstatistic[1]
  Adj_R_Sq <- fit_summ$adj.r.squared
  RSS <- with(fit_summ, df[2] * sigma^2)
  AIC <- AIC(fit)
  BIC <- BIC(fit)
  ## Cross-validation
  # 5-fold cross-validation
  glm.fit <- glm(formula, data = df)
  cv.err <- cv.glm(df, glm.fit, K = 5)
  Five.Fold_MSE <- cv.err$delta[1]
  # 10-fold cross-validation
  glm.fit <- glm(formula, data = df)
  cv.err <- cv.glm(df, glm.fit, K = 10)
  Ten.Fold_MSE <- cv.err$delta[1]
  # LOOCV
  glm.fit <- glm(formula, data = df)
  cv.err <- cv.glm(df, glm.fit)
  LOOCV_MSE <- cv.err$delta[1]
  # Output
  label <- c("lm", "formula =", paste(model$response, model$input), "data= ", model$dataset)
  print(paste(label))
  Results <- c(LOOCV_MSE, Five.Fold_MSE, Ten.Fold_MSE, F_Stat, Adj_R_Sq, RSS, AIC, BIC)
  names(Results) <- c("LOOCV MSE", "5-Fold MSE", "10-Fold MSE", "F-Stat", "Adjusted R^2", "RSS", "AIC", "BIC")
  print(Results)
}
For some reason, the output shows the same results twice:
> lapply(models,TestError)
[1] "lm" "formula =" "medv ~." "data= " "Boston_Ready"
LOOCV MSE 5-Fold MSE 10-Fold MSE F-Stat Adjusted R^2 RSS AIC BIC
0.3250332 0.3288020 0.3251508 114.3744328 0.6918372 152.5405737 853.2181335 903.9365735
[1] "lm" "formula =" "medv ~lstat" "data= " "Boston_Ready"
LOOCV MSE 5-Fold MSE 10-Fold MSE F-Stat Adjusted R^2 RSS AIC BIC
0.4597660 0.4622565 0.4593045 601.6178711 0.5432418 230.2061197 1043.4596316 1056.1392416
[[1]]
LOOCV MSE 5-Fold MSE 10-Fold MSE F-Stat Adjusted R^2 RSS AIC BIC
0.3250332 0.3288020 0.3251508 114.3744328 0.6918372 152.5405737 853.2181335 903.9365735
[[2]]
LOOCV MSE 5-Fold MSE 10-Fold MSE F-Stat Adjusted R^2 RSS AIC BIC
0.4597660 0.4622565 0.4593045 601.6178711 0.5432418 230.2061197 1043.4596316 1056.1392416
Is that due to a quirk with lapply()?

Because at the end of your function you call print(Results), the function prints each model's results itself, and lapply() then returns those same values again as elements of its list.
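A minimal sketch of the fix, reusing the names from the function above: drop the final print() and return the vector, so each model's diagnostics show up only once when lapply() collects them.
# Inside TestError(), replace print(Results) with the value itself:
Results                              # last expression = return value; nothing is printed inside the function
# Then:
out <- lapply(models, TestError)     # a list of named vectors
out[[1]]                             # diagnostics for model.1
If you still want the label displayed as each model runs, keep print(label); only the final print(Results) causes the duplication.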

Related

Displaying RMSE in summary when running multiple univariate linear regressions

I wrote a function to run univariate linear regressions for multiple variables at a time. However, in the summary table, I noticed that the RMSE is missing. How do I also display the RMSE for each of these regressions?
Here is my code and here is what my output looks like:
my.data <- read.csv("filename.csv", header=TRUE)
variables <- names(my.data[1:30])
my.list <- lapply(variables, function(var){
  formula <- as.formula(paste("gene ~", var))
  res.linear <- lm(formula, data = my.data)
  summary(res.linear)
})
lapply(my.list, coefficients)
[[1]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.367075060 4.46417498 5.2343547 3.017975e-06
variable1 0.008312962 0.04747918 0.1750865 8.616917e-01
[[2]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.347246142 1.48314397 3.605345 0.0006984638
variable2 0.008342116 0.01577413 0.528848 0.5991611451
We can extract the residuals from the summary output, take the mean of their squares, take the square root, and cbind the resulting RMSE with the extracted coefficients:
my.list <- lapply(variables, function(var){
  formula <- as.formula(paste("gene ~", var))
  res.linear <- lm(formula, data = my.data)
  smry <- summary(res.linear)
  RMSE <- sqrt(mean(smry$residuals^2))
  cbind(coef(smry), RMSE = RMSE)
})
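To keep track of which predictor each table belongs to, a small follow-up (a sketch reusing my.list and variables from above):
names(my.list) <- variables      # label each element by its predictor
my.list[["variable1"]]           # coefficients plus RMSE for one predictor
do.call(rbind, my.list)          # or stack all models into a single matrix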

How can I use test data to calculate the MSE for a training model in R?

library(caret)   # createDataPartition()
library(dplyr)   # %>%
set.seed(1234)
training.samples <- RealEstate$Y.house.price.of.unit.area %>%
  createDataPartition(p = 0.75, list = FALSE)
train.data <- RealEstate[training.samples, ]
test.data <- RealEstate[-training.samples, ]
Price.Model1 <- lm(Y.house.price.of.unit.area ~ factor(X1.transaction.date) +
                     X2.house.age +
                     X3.distance.to.the.nearest.MRT.station +
                     X4.number.of.convenience.stores +
                     X5.latitude +
                     X6.longitude,
                   data = train.data)
Would this be correct?
mean((test.data$Y.house.price.of.unit.area-predict(Price.Model1))^2)
I'm getting this warning so I'm not sure if I'm doing it right:
Warning in test.data$Y.house.price.of.unit.area - predict(Price.Model1) :
longer object length is not a multiple of shorter object length
Mean squared error is defined as
MSE = (1/n) * sum((y_i - yhat_i)^2),
where y_i are the observed test values and yhat_i the corresponding predictions.
To calculate it in R:
1. Fit the model with the training data.
2. Use the test data to get predictions with the predict() function, passing it via newdata (the warning you saw appears because predict(Price.Model1) without newdata returns fitted values for the training rows, whose length does not match test.data).
3. Use the predicted and actual values of the test data to calculate the MSE.
Using some fake data...
library(dplyr)   # for %>% and select()
test_ix <- floor(runif(nrow(mtcars) * 0.2, 1, nrow(mtcars)))
train <- mtcars[-test_ix, ]
X_test <- mtcars[test_ix, ] %>%
  select(!mpg)
y_test <- mtcars[test_ix, "mpg"]
fit <- lm(mpg ~ ., data = train)
yhat <- predict(fit, X_test)
mse <- mean((y_test - yhat)^2)
To obtain the RMSE, take the square root of the MSE.
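Applied to the objects in the question (a sketch, assuming the column names shown above), the key point is passing newdata = test.data to predict():
yhat <- predict(Price.Model1, newdata = test.data)             # predictions for the held-out rows
mse <- mean((test.data$Y.house.price.of.unit.area - yhat)^2)   # test MSE
rmse <- sqrt(mse)                                              # test RMSE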

How do you resample LDA in R?

I thought that using the bootstrap would resample my LDA, but I am not sure. Additionally, if the bootstrap does work, I am not sure how to code it in R.
Here is my LDA code:
library('MASS')
n <- nrow(iris)
train <- sample(n, size = floor(n * 0.75), replace = FALSE)
train.species <- iris$Species[train]
test.species <- iris$Species[-train]
lda.fit <- lda(Species ~ ., data = iris, subset = train)
The code below uses the boot package to bootstrap the iris data with LDA and obtain standard errors for the LD1 and LD2 coefficients. The first part of the code shows the LDA fit without the bootstrap, with the same coefficients.
# Library
library(MASS)
library(boot)
# Get data
data(iris)
names(iris) <- gsub("\\.", "", names(iris)) #remove dots from column names
# Split data into train and test sets
train_index <- sample(seq(nrow(iris)), floor(dim(iris)[1] * 0.75))
train <- iris[train_index,]
test <- iris[-train_index,]
test_Y <- test[, c('Species')]
test_X <- subset(test, select=-c(Species))
#### LDA without bootstrap:
# Fit LDA to train data:
lda.fit = lda(Species ~ . , data=train)
lda.fit
# Predict test_Y based on lda.fit above
lda.pred <- predict(lda.fit, test_X)
lda.class <- lda.pred$class
# Confusion matrix
table(lda.class, test_Y)
#### LDA with bootstrap:
# Bootstrap the LDA fit on the train data to get standard errors for the coefficients
set.seed(1)
boot.fn <- function(data, index){
  return(coefficients(lda(Species ~ SepalLength + SepalWidth + PetalLength + PetalWidth,
                          data = data, subset = index)))
}
# Call boot(): This returns LD1 and LD2 for each predictor
boot(train, boot.fn, 1000)
# NOTE: Here, in Bootstrap Statistics output, t1* to t4* are LD1 coefficients and t5* to t8* are LD2 coefficients
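If the goal is to resample the whole LDA fit rather than just its coefficients, the same boot() pattern can be pointed at a different statistic. A sketch, reusing train, test_X and test_Y from above (the choice of test error rate as the statistic is my own, not part of the original answer):
boot.err <- function(data, index){
  fit <- lda(Species ~ ., data = data[index, ])   # refit LDA on the bootstrap resample
  pred <- predict(fit, test_X)$class              # classify the fixed test set
  mean(pred != test_Y)                            # misclassification rate for this resample
}
boot(train, boot.err, R = 200)                    # bootstrap distribution of the test error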

How to compare a model with no random effects to a model with a random effect using lme4?

I can use gls() from the nlme package to build mod1 with no random effects.
I can then compare mod1 using AIC to mod2 built using lme() which does include a random effect.
mod1 = gls(response ~ fixed1 + fixed2, method="REML", data)
mod2 = lme(response ~ fixed1 + fixed2, random = ~1 | random1, method="REML",data)
AIC(mod1,mod2)
Is there something similar to gls() for the lme4 package which would allow me to build mod3 with no random effects and compare it to mod4 built using lmer() which does include a random effect?
mod3 = ???(response ~ fixed1 + fixed2, REML=T, data)
mod4 = lmer(response ~ fixed1 + fixed2 + (1|random1), REML=T, data)
AIC(mod3,mod4)
With modern (>1.0) versions of lme4 you can make a direct comparison between lmer fits and the corresponding lm model, but you have to use ML: it's hard to come up with a sensible analogue of the "REML criterion" for a model without random effects (because it would involve a linear transformation of the data that sets all of the fixed effects to zero ...)
You should be aware that there are theoretical issues with information-theoretic comparisons between models with and without variance components: see the GLMM FAQ for more information.
library(lme4)
fm1 <- lmer(Reaction~Days+(1|Subject),sleepstudy, REML=FALSE)
fm0 <- lm(Reaction~Days,sleepstudy)
AIC(fm1,fm0)
## df AIC
## fm1 4 1802.079
## fm0 3 1906.293
I prefer output in this format (delta-AIC rather than raw AIC values):
bbmle::AICtab(fm1,fm0)
## dAIC df
## fm1 0.0 4
## fm0 104.2 3
To test, let's simulate data with no random effect (I had to try a couple of random-number seeds to get an example where the among-subject std dev was actually estimated as zero):
rr <- simulate(~Days+(1|Subject),
newparams=list(theta=0,beta=fixef(fm1),
sigma=sigma(fm1)),
newdata=sleepstudy,
family="gaussian",
seed=103)[[1]]
ss <- transform(sleepstudy,Reaction=rr)
fm1Z <- update(fm1,data=ss)
VarCorr(fm1Z)
## Groups Name Std.Dev.
## Subject (Intercept) 0.000
## Residual 29.241
fm0Z <- update(fm0,data=ss)
all.equal(c(logLik(fm0Z)),c(logLik(fm1Z))) ## TRUE
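For completeness, the AIC comparison on these simulated data comes out as expected (a quick check using the fits above): the log likelihoods are equal, so the lm fit is preferred by exactly 2 AIC units, reflecting its one fewer parameter.
AIC(fm0Z, fm1Z)   # equal logLik, df 3 vs 4, so fm0Z is lower by 2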
While I agree with Ben that the simplest solution is to set REML=FALSE, the maximum REML likelihood for a model without random effects is well defined and fairly straightforward to compute via the well-known relation between the ordinary profile likelihood and the restricted likelihood,
logLik_REML(sigma^2) = logLik(beta.hat, sigma^2) + (p/2) * log(2*pi) - (1/2) * (log det(X'X) - p * log(sigma^2)),
where p is the number of fixed-effect parameters and X is the fixed-effect design matrix.
The following code simulates data for which the estimated variance of the random intercept of an LMM ends up at 0, so that the maximum restricted log likelihood of the LMM should equal the restricted likelihood of the model without any random effects.
The restricted likelihood of the LM is computed via the above formula and evaluates to the same value as that of the LMM.
An even simpler alternative is to use glmmTMB:
library(lme4)
#> Loading required package: Matrix
# simulate some toy data for which the LMM ends up at the boundary
set.seed(5)
n <- 100 # the sample size
x <- rnorm(n)
y <- rnorm(n)
group <- factor(rep(1:10,10))
# fit the LMM via REML
mod1 <- lmer(y ~ x + (1|group), REML=TRUE, control=lmerControl(boundary.tol=1e-8))
#> boundary (singular) fit: see ?isSingular
logLik(mod1)
#> 'log Lik.' -147.8086 (df=4)
# fit a model without random effects and compute its maximum REML log likelihood
mod0 <- lm(y ~ x)
p <- length(coef(mod0)) # number of fixed effect parameters
X <- model.matrix(mod0) # the fixed effect design matrix
sigma.REML <- summary(mod0)$sigma # REMLE of sigma
# the maximum ordinary log likelihood evaluated at the REML estimates
logLik.lm.at.REML <- sum(dnorm(residuals(mod0), 0, sigma.REML, log=TRUE))
# the restricted log likelihood of the model without random effects (via above formula)
logLik.lm.at.REML + p/2*log(2*pi) - 1/2*(- p*log(sigma.REML^2) + determinant(crossprod(X))$modulus)
#> [1] -147.8086
#> attr(,"logarithm")
#> [1] TRUE
library(glmmTMB)
data <- data.frame(y,x,group)
logLik(glmmTMB(y~x, family = gaussian(), data=data, REML=TRUE))
#> 'log Lik.' -147.8086 (df=3)
logLik(glmmTMB(y~x+(1|group), family = gaussian(), data=data, REML=TRUE))
#> 'log Lik.' -147.8086 (df=4)
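To answer the original question on the AIC scale with glmmTMB (a sketch reusing data from above; both fits use REML, subject to the caveats about boundary fits mentioned earlier):
m0 <- glmmTMB(y ~ x, family = gaussian(), data = data, REML = TRUE)               # no random effects
m1 <- glmmTMB(y ~ x + (1|group), family = gaussian(), data = data, REML = TRUE)   # random intercept
AIC(m0, m1)   # here the REML log likelihoods coincide, so m0 wins on df alone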

Comparing GLM models using predict

Suppose I have two models created by calling glm() on the same data but with different formulas and/or families. Now I want to compare which model is better by predicting on an unknown data. Something like this:
mod1 <- glm(formula1, family1, data)
mod2 <- glm(formula2, family2, data)
mu1 <- predict(mod1, newdata, type = "response")
mu2 <- predict(mod2, newdata, type = "response")
How can I tell which of the predictions mu1 or mu2 is better?
Is there some simple command to compute the log likelihood of a prediction?
It would be easier to answer this with a reproducible example.
It often makes more sense to choose a family a priori rather than according to goodness of fit: for example, if you have count (non-negative integer) responses with no obvious upper bound, your only real choice that lies strictly within the exponential family is Poisson.
set.seed(101)
x <- runif(1000)
mu <- exp(1+2*x)
y <- rgamma(1000,shape=3,scale=mu/3)
d <- data.frame(x,y)
New data:
nd <- data.frame(x=runif(100))
nd$y <- rgamma(100,shape=3,scale=exp(1+2*nd$x)/3)
Fit Gamma and Gaussian:
mod1 <- glm(y~x,family=Gamma(link="log"),data=d)
mod2 <- glm(y~x,family=gaussian(link="log"),data=d)
Predictions:
mu1 <- predict(mod1, newdata=nd, type="response")
mu2 <- predict(mod2, newdata=nd, type="response")
Extract shape/scale parameters:
sigma <- sqrt(summary(mod2)$dispersion)
shape <- MASS::gamma.shape(mod1)$alpha
Root mean squared error:
rmse <- function(x1,x2) sqrt(mean((x1-x2)^2))
rmse(mu1,nd$y) ## 5.845
rmse(mu2,nd$y) ## 5.842
Negative log likelihoods:
-sum(dgamma(nd$y,shape=shape,scale=mu1/shape,log=TRUE)) ## 276.84
-sum(dnorm(nd$y,mean=mu2,sd=sigma,log=TRUE)) ## 318.4
