I am working on an assignment where I have to evaluate a predictive model by its RMSE (root mean squared error) on the test data. I have already built a linear regression model on the training data that predicts wine quality (numeric) from all available predictor variables. Below is my current code. The full error is:
Error: Problem with `mutate()` column `regression1`.
i `regression1 = predict(regression1, newdata = my_type_test)`.
x no applicable method for 'predict' applied to an object of class "c('double', 'numeric')"
install.packages("rsample")
library(rsample)
my_type_split <- initial_split(my_type, prop = 0.7)
my_type_train <- training(my_type_split)
my_type_test <- testing(my_type_split)
my_type_train
regression1 <- lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
density + pH + sulphates + alcohol, data = my_type_train)
summary(regression1)
regression1
install.packages("caret")
library(caret)
install.packages("yardstick")
library(yardstick)
library(tidyverse)
my_type_test <- my_type_test %>%
mutate(regression1 = predict(regression1, newdata = my_type_test)) %>%
rmse(my_type_test, price, regression1)
Many of the steps you take are probably unnecessary. A minimal example that should achieve the same thing:
# Set seed for reproducibility
set.seed(42)
# Take the internal 'mtcars' dataset
data <- mtcars
# Get a random 80/20 train/test split over the rows of data
split <- sample(
  x = c(TRUE, FALSE),
  size = nrow(data),
  replace = TRUE,
  prob = c(0.8, 0.2)
)
# Split the data into train and test sets
train <- data[split, ]
test <- data[!split, ]
# Train a linear model
fit <- lm(mpg ~ disp + hp + wt + qsec + am + gear, data = train)
# Predict mpg in test set
prediction <- predict(fit, test)
Result:
> caret::RMSE(prediction, test$mpg)
[1] 4.116142
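For completeness, the pipeline in the question fails for two reasons: the new column is given the same name as the model object (so once it exists, predict() is applied to the numeric column instead of the lm fit), and the piped data frame is passed to rmse() a second time while the truth column should be quality, not price. A sketch of a corrected version, assuming the regression1 and my_type_test objects from the question:
library(tidyverse)
library(yardstick)
my_type_test %>%
  mutate(pred = predict(regression1, newdata = .)) %>%  # keep model and column names distinct
  rmse(truth = quality, estimate = pred)                # yardstick::rmse(data, truth, estimate)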
Related
I'm a graduate student using a linear regression (count) model to understand drivers of fish movement into and out of tidal wetlands. I am currently trying to generate a publication-worthy model summary table in R, and the sel.table approach has been working well for this purpose.
However, I've been unable to generate a column that contains the individual model formulas. Below is my code, which is based on some nice instructions for using the MuMIn package: https://sites.google.com/site/rforfishandwildlifegrads/home/mumin_usage_examples
So to recap, my question pertains to the last line of code below: how can I insert model formulas into a model selection table?
install.packages("MuMIn")
library(MuMIn)
data = mtcars
models = list(
model1 <- lm(mpg ~ cyl, data = data),
model2 <- lm(mpg ~ cyl + hp, data = data),
model3 <- lm(mpg ~ cyl * hp, data = data)
)
#create an object “out.put” that contains all of the model selection information
out.put <- model.sel(models)
#coerce the object out.put into a data frame
sel.table <-as.data.frame(out.put)[6:10]
#add a column for model names
sel.table$Model <- rownames(sel.table)
#replace model name with formulas
for(i in 1:nrow(sel.table)) sel.table$Model[i]<- as.character(formula(paste(sel.table$Model[i])))[3]
Any help on this topic would be greatly appreciated!
UPDATED CODE
My method of pulling out model names is pretty clunky but otherwise this code seems to generate what I intended (a complete model selection table). Also, I'm not sure if the model coefficients are displayed properly but I hope to follow up on this for my final answer.
library(MuMIn)
library(dplyr)
library(flextable)
data = mtcars
#write linear models
models = list(
model1 <- lm(mpg ~ cyl, data = data),
model2 <- lm(mpg ~ cyl + hp, data = data),
model3 <- lm(mpg ~ cyl * hp + disp, data = data),
model4 <- lm(mpg ~ cyl * hp + disp + wt + drat, data = data)
)
#create an object “out.put” that contains all of the model selection information
out.put <- model.sel(models)
#coerce the object out.put into a data frame
sel.table <-as.data.frame(out.put)
#slightly rename intercept column
names(sel.table)[1]="Intercept"
#select variables to display in model summary table
sel.table <- sel.table %>%
select(Intercept,cyl,hp,disp,wt,drat,df,logLik,AICc,delta)
#round numerical columns
sel.table[,1:6]<- round(sel.table[,1:6],2)
sel.table[,8:10]<-round(sel.table[,8:10],2)
#add a column for model (row) names
sel.table$Model <- rownames(sel.table)
#extract model formulas
form <- data.frame(name = as.character(lapply(models, `[[`, c(10,2))))
#generate a column with model (row) numbers (beside associated model formulas)
form <- form %>%
mutate(Model=(1:4))
#merge model table and model formulas
sum_table <- merge (form,sel.table,by="Model")
#rename model equation column
names(sum_table)[2]="Formula"
print <- flextable(head(sum_table))
print <- autofit(print)
print
6/1/20 UPDATE:
Two issues remain with this code (I've found a workaround to the first but am still investigating the second):
1. Models end up being misnumbered.
2. Model formula columns are being generated for each model.
I believe part of the code is missing from the example you followed, which is why your code does not work.
The easiest way to generate formula-like strings is simply to deparse the right-hand side of each model formula (i.e. the 3rd element):
sapply(get.models(out.put, TRUE), function(mo) deparse(formula(mo)[[3]], width.cutoff = 500))
or, if you want A*B's expanded into A + B + A:B:
sapply(get.models(out.put, TRUE), function(mo) deparse(terms(formula(mo), simplify = TRUE)[[3]], width.cutoff = 500))
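For a single model, the idea looks like this (an illustrative sketch, not part of the original code):
f <- formula(lm(mpg ~ cyl * hp, data = mtcars))
deparse(f[[3]])                          # "cyl * hp"
deparse(terms(f, simplify = TRUE)[[3]])  # "cyl + hp + cyl:hp"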
Update: the original example code improved and simplified:
library(MuMIn)
data <- mtcars
#! Feed the models directly to `model.sel`. No need to create a separate list of
#! models.
gm <- lm(mpg ~ cyl, data = data)
out.put <- model.sel(
model1 = gm,
model2 = update(gm, . ~. + hp),
model3 = update(gm, . ~ . * hp + disp),
model4 = update(gm, . ~ . * hp + disp + wt + drat)
)
sel.table <- out.put
sel.table$family <- NULL
sel.table <- round(sel.table, 2)
#! Use `get.models` to get the list of models in the same order as in the
#! selection table
sel.table <- cbind(
Model =
#! Update (2): model number according to their original order, use:
attr(out.put, "order"),
#! otherwise: seq(nrow(sel.table)),
#!
#! Update (2): add a large `width.cutoff` to `deparse` so that the result is
#! always a single string and `sapply` returns a character vector
#! rather than a list.
#! For oversize formulas, use `paste0(deparse(...), collapse = "")`
formula = sapply(get.models(out.put, TRUE),
function(mo) deparse(formula(mo)[[3]], width.cutoff = 500)),
#!
sel.table
)
#slightly rename intercept column
colnames(sel.table)[3] <- 'Intercept'
# #select summary columns for model selection table
# sel.table <- sel.table %>%
# select(Model,formula,Intercept,df,logLik,AICc,delta,weight)
library(flextable)
print <- flextable(head(sel.table))
print <- autofit(print)
print
Since your question isn't reproducible, I'll try with something else; maybe that's what you're looking for:
data = mtcars
models = list(
model1 = lm(mpg ~ cyl, data = data),
model2 = lm(mpg ~ cyl + hp, data = data)
)
data.frame(name = as.character(lapply(models, `[[`, c(10,2))),
other.column = NA)
#> name other.column
#> 1 mpg ~ cyl NA
#> 2 mpg ~ cyl + hp NA
Created on 2020-05-28 by the reprex package (v0.3.0)
The formula (call) of an lm object is at position 10 of the list; you can count the components yourself by typing model1$ and looking at the completions. You could use rownames() instead of a separate column, but that's not recommended.
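A position-independent alternative (a small sketch, not from the original answer) is to extract the formula with formula() and deparse it:
m <- lm(mpg ~ cyl + hp, data = mtcars)
m[[c(10, 2)]]        # the formula pulled out by list position, as above
deparse(formula(m))  # "mpg ~ cyl + hp", without relying on position 10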
EDIT AFTER REPRODUCIBLE EXAMPLE
library(MuMIn)
data = mtcars
models = list(
model1 <- lm(mpg ~ cyl, data = data),
model2 <- lm(mpg ~ cyl + hp, data = data),
model3 <- lm(mpg ~ cyl * hp, data = data)
)
# create an object that contains all of the model selection information
out.put <- model.sel(models)
#coerce the object out.put into a data frame
sel.table <-as.data.frame(out.put)[6:10]
# formulas as names
sel.table$name = as.character(lapply(models, `[[`, c(10,2)))
# reordering
sel.table = sel.table[, c(6,1,2,3,4,5)]
sel.table
#> name df logLik AICc delta weight
#> 3 mpg ~ cyl 5 -78.14329 168.5943 0.000000 0.5713716
#> 1 mpg ~ cyl + hp 3 -81.65321 170.1636 1.569298 0.2607054
#> 2 mpg ~ cyl * hp 4 -80.78092 171.0433 2.449068 0.1679230
Created on 2020-05-31 by the reprex package (v0.3.0)
I am trying to obtain the first three coefficients of Cauchy's dispersion equation for silicon. Using a CSV containing the refractive index at various wavelengths (which you can find here), I try to fit the following model:
library(readr)
library(tidyverse)
library(magrittr)
library(modelr)
library(broom)
library(splines)
# CSV parsing
RefractiveIndexINFO <- read_csv("./silicon-index.csv")
# Cleaning the output of the csv-parsing
indlong = tibble(RefractiveIndexINFO$`Wavelength. µm`,RefractiveIndexINFO$n)
names(indlong) = c('w','n')
# Remove some wavelengths that might not fit
indlong_non_uv = indlong %>% filter(indlong$w >= 0.4)
# Renaming variables
w = indlong_non_uv$w
n = indlong_non_uv$n
# Creating the non linear model
model = nls(n ~ a + b*ns(w,-2) + c*ns(w,-4), data = indlong_non_uv)
# Gathering information on the fitted model
cor(indlong_non_uv$n,predict(model))
tidy(model)
Which gives the following error :
Error in c * ns(w, -4) : non-numeric argument to binary operator
How can I circumvent this situation and get the three coefficients (a, b, c) in a row?
Obviously, using model = nls(n ~ a + b*ns(w,-2), data = indlong_non_uv) does not give an error.
The problem is that ns() from the splines package builds a natural-spline basis (its second argument is the degrees of freedom); it does not raise w to a power, so the formula is not Cauchy's equation n = a + b/w^2 + c/w^4 and nls cannot evaluate it. Use plain powers of w instead, and give nls starting values:
library(readr)
library(tidyverse)
library(magrittr)
library(modelr)
library(broom)
library(splines)
# CSV parsing
RefractiveIndexINFO <- read_csv("aspnes.csv")
RefractiveIndexINFO <- RefractiveIndexINFO[1:46,]
RefractiveIndexINFO <- as.data.frame(apply(RefractiveIndexINFO,2,as.numeric))
names(RefractiveIndexINFO) <- c('w','n')
indlong_non_uv = RefractiveIndexINFO %>% filter(RefractiveIndexINFO$w >= 0.4)
# Creating the nonlinear model
model <- nls(n ~ a + b*w^(-2) + c*w^(-4), data = indlong_non_uv,
start=list(a=1, b=1, c=1))
# Gathering information on the fitted model
cor(indlong_non_uv$n,predict(model))
# [1] 0.9991006
tidy(model)
# term estimate std.error statistic p.value
# 1 a 3.65925186 0.039368851 92.947896 9.686805e-20
# 2 b -0.04981151 0.024099580 -2.066904 5.926046e-02
# 3 c 0.05282668 0.003306895 15.974707 6.334197e-10
Alternatively, you can use linear regression:
model2 <- lm(n ~ I(w^(-2)) + I(w^(-4)), data = indlong_non_uv)
summary(model2)
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 3.659252 0.039369 92.948 < 2e-16 ***
# I(w^(-2)) -0.049812 0.024100 -2.067 0.0593 .
# I(w^(-4)) 0.052827 0.003307 15.975 6.33e-10 ***
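If the goal is the three coefficients in a single row, coef() on either fit can be reshaped directly (a small sketch using the objects above):
# one row with the Cauchy coefficients from the nls fit
as.data.frame(t(coef(model)))   # columns a, b, c (roughly 3.659, -0.050, 0.053)
# the same from the lm fit, renaming the I(w^(-2)) / I(w^(-4)) terms
setNames(as.data.frame(t(coef(model2))), c("a", "b", "c"))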
I am trying to find the model with the lowest AIC. Models are returned from two for loops that build all possible combinations of columns. I am unable to make the function return the model with the lowest AIC. The code below shows where I got stuck:
rm(list = ls())
data <- iris
data <- data[data$Species %in% c("setosa", "virginica"),]
data$Species = ifelse(data$Species == 'virginica', 0, 1)
mod_headers <- names(data[1:ncol(data)-1])
f <- function(mod_headers){
for(i in 1:length(mod_headers)){
tab <- combn(mod_headers,i)
for(j in 1:ncol(tab)){
tab_new <- c(tab[,j])
mod_tab_new <- c(tab_new, "Species")
model <- glm(Species ~., data=data[c(mod_tab_new)], family = binomial(link = "logit"))
}
}
best_model <- model[which(AIC(model)[order(AIC(model))][1])]
print(best_model)
}
f(mod_headers)
Any suggestions? Thanks!
I replaced your for loops with vectorised alternatives:
library(tidyverse)
library(iterators)
# Column names you want to use in glm model, saved as list
whichcols <- Reduce("c", map(1:length(mod_headers), ~lapply(iter(combn(mod_headers,.x), by="col"),function(y) c(y))))
# glm model results using selected column names, saved as list
models <- map(1:length(whichcols), ~glm(Species ~., data=data[c(whichcols[[.x]], "Species")], family = binomial(link = "logit")))
# selects model with lowest AIC
best <- models[[which.min(sapply(1:length(models),function(x)AIC(models[[x]])))]]
Output
Call: glm(formula = Species ~ ., family = binomial(link = "logit"),
data = data[c(whichcols[[.x]], "Species")])
Coefficients:
(Intercept) Petal.Length
55.40 -17.17
Degrees of Freedom: 99 Total (i.e. Null); 98 Residual
Null Deviance: 138.6
Residual Deviance: 1.208e-09 AIC: 4
Using your loop, just put all the models in one list, then compute the AIC of each model, and finally return the model with the minimum AIC:
f <- function(mod_headers) {
models <- list()
k <- 1
for (i in 1:length(mod_headers)) {
tab <- combn(mod_headers, i)
for(j in 1:ncol(tab)) {
mod_tab_new <- c(tab[, j], "Species")
models[[k]] <- glm(Species ~ ., data = data[mod_tab_new],
family = binomial(link = "logit"))
k <- k + 1
}
}
models[[which.min(sapply(models, AIC))]]
}
glm() uses an iteratively reweighted least squares algorithm. With your data the algorithm reaches the maximum number of iterations before it converges; raising that limit helps in your case:
glm(Species ~., data=data[mod_tab_new], family = binomial(link = "logit"), control = list(maxit = 50))
There was another issue with your use of which(); I replaced it with an if after each model fit that keeps track of the lowest AIC so far. However, I think there are better solutions than this for-loop approach.
f <- function(mod_headers){
lowest_aic <- Inf # added
best_model <- NULL # added
for(i in 1:length(mod_headers)){
tab <- combn(mod_headers,i)
for(j in 1:ncol(tab)){
tab_new <- tab[, j]
mod_tab_new <- c(tab_new, "Species")
model <- glm(Species ~., data=data[mod_tab_new], family = binomial(link = "logit"), control = list(maxit = 50))
if(AIC(model) < lowest_aic){ # added
lowest_aic <- AIC(model) # added
best_model <- model # added
}
}
}
return(best_model)
}
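A quick usage check (assuming data and mod_headers are defined as in the question):
best <- f(mod_headers)
formula(best)  # predictors retained by the lowest-AIC model
AIC(best)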
I am interested in calculating estimates and standard errors for linear combinations of coefficients after a linear regression in R. For example, suppose I have the regression and test:
data(mtcars)
library(multcomp)
lm1 <- lm(mpg ~ cyl + hp, data = mtcars)
summary(glht(lm1, linfct = 'cyl + hp = 0'))
This will estimate the value of the sum of the coefficients on cyl and hp, and provide the standard error based on the covariance matrix produced by lm.
But, suppose I want to cluster my standard errors, on a third variable:
data(mtcars)
library(multcomp)
library(lmtest)
library(multiwayvcov)
lm1 <- lm(mpg ~ cyl + hp, data = mtcars)
vcv <- cluster.vcov(lm1, cluster = mtcars$am)
ct1 <- coeftest(lm1,vcov. = vcv)
ct1 contains the SEs for my clustering by am. However, if I try to use the ct1 object in glht, I get an error saying:
Error in modelparm.default(model, ...) :
no ‘coef’ method for ‘model’ found!
Any advice on how to do the linear hypothesis with the clustered variance covariance matrix?
Thanks!
glht(ct1, linfct = 'cyl + hp = 0') won't work, because ct1 is not a glht object and cannot be coerced to one via as.glht. I don't know whether there is a package or an existing function to do this, but it is not a difficult job to work out ourselves. The following small function does it:
LinearCombTest <- function (lmObject, vars, .vcov = NULL) {
## if `.vcov` missing, use the one returned by `lm`
if (is.null(.vcov)) .vcov <- vcov(lmObject)
## estimated coefficients
beta <- coef(lmObject)
## sum of `vars`
sumvars <- sum(beta[vars])
## get standard errors for sum of `vars`
se <- sum(.vcov[vars, vars]) ^ 0.5
## perform t-test on `sumvars`
tscore <- sumvars / se
pvalue <- 2 * pt(abs(tscore), lmObject$df.residual, lower.tail = FALSE)
## return a matrix
matrix(c(sumvars, se, tscore, pvalue), nrow = 1L,
dimnames = list(paste0(paste0(vars, collapse = " + "), " = 0"),
c("Estimate", "Std. Error", "t value", "Pr(>|t|)")))
}
Let's have a test:
data(mtcars)
lm1 <- lm(mpg ~ cyl + hp, data = mtcars)
library(multiwayvcov)
vcv <- cluster.vcov(lm1, cluster = mtcars$am)
If we leave .vcov unspecified in LinearCombTest, it is the same as multcomp::glht:
LinearCombTest(lm1, c("cyl","hp"))
# Estimate Std. Error t value Pr(>|t|)
#cyl + hp = 0 -2.283815 0.5634632 -4.053175 0.0003462092
library(multcomp)
summary(glht(lm1, linfct = 'cyl + hp = 0'))
#Linear Hypotheses:
# Estimate Std. Error t value Pr(>|t|)
#cyl + hp == 0 -2.2838 0.5635 -4.053 0.000346 ***
If we provide the clustered covariance matrix, it does what you want:
LinearCombTest(lm1, c("cyl","hp"), vcv)
# Estimate Std. Error t value Pr(>|t|)
#cyl + hp = 0 -2.283815 0.7594086 -3.00736 0.005399071
Remark
LinearCombTest is extended in Get p-value for group mean difference without refitting linear model with a new reference level, where it can test any linear combination with coefficients alpha:
alpha[1] * vars[1] + alpha[2] * vars[2] + ... + alpha[k] * vars[k]
rather than just the sum
vars[1] + vars[2] + ... + vars[k]
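A minimal sketch of that generalisation (this is not the exact code from the linked answer, and the helper name LinearCombTest2 is made up here): the estimate is alpha' beta[vars] and its variance is alpha' V[vars, vars] alpha.
LinearCombTest2 <- function(lmObject, vars, alpha, .vcov = NULL) {
  ## sketch only: weighted linear combination alpha' * beta[vars]
  if (is.null(.vcov)) .vcov <- vcov(lmObject)
  beta <- coef(lmObject)
  est <- sum(alpha * beta[vars])                              # alpha' beta
  se  <- sqrt(drop(t(alpha) %*% .vcov[vars, vars] %*% alpha)) # sqrt(alpha' V alpha)
  tstat <- est / se
  pval <- 2 * pt(abs(tstat), lmObject$df.residual, lower.tail = FALSE)
  c(Estimate = est, `Std. Error` = se, `t value` = tstat, `Pr(>|t|)` = pval)
}
## e.g. test 2 * cyl + 1 * hp = 0 with the clustered vcov from above
LinearCombTest2(lm1, c("cyl", "hp"), alpha = c(2, 1), vcv)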
When I run a clustered standard error panel specification with plm and lfe, I get results that differ at the second significant figure. Does anyone know why they differ in their calculation of the SEs?
set.seed(572015)
library(lfe)
library(plm)
library(lmtest)
# clustering example
x <- c(sapply(sample(1:20), rep, times = 1000)) + rnorm(20*1000, sd = 1)
y <- 5 + 10*x + rnorm(20*1000, sd = 10) + c(sapply(rnorm(20, sd = 10), rep, times = 1000))
facX <- factor(sapply(1:20, rep, times = 1000))
mydata <- data.frame(y=y,x=x,facX=facX, state=rep(1:1000, 20))
model <- plm(y ~ x, data = mydata, index = c("facX", "state"), effect = "individual", model = "within")
plmTest <- coeftest(model,vcov=vcovHC(model,type = "HC1", cluster="group"))
lfeTest <- summary(felm(y ~ x | facX | 0 | facX))
data.frame(lfeClusterSE=lfeTest$coefficients[2],
plmClusterSE=plmTest[2])
lfeClusterSE plmClusterSE
1 0.06746538 0.06572588
The difference is in the degrees-of-freedom adjustment. This is the usual first guess when looking for differences in supposedly similar standard errors (see e.g., Different Robust Standard Errors of Logit Regression in Stata and R). Here, the problem can be illustrated when comparing the results from (1) plm+vcovHC, (2) felm, (3) lm+cluster.vcov (from package multiwayvcov).
First, I refit all models:
m1 <- plm(y ~ x, data = mydata, index = c("facX", "state"),
effect = "individual", model = "within")
m2 <- felm(y ~ x | facX | 0 | facX, data = mydata)
m3 <- lm(y ~ facX + x, data = mydata)
All lead to the same coefficient estimates. For m3 the fixed effects are explicitly reported while they are not for m1 and m2. Hence, for m3 only the last coefficient is extracted with tail(..., 1).
all.equal(coef(m1), coef(m2))
## [1] TRUE
all.equal(coef(m1), tail(coef(m3), 1))
## [1] TRUE
The non-robust standard errors also agree.
se <- function(object) tail(sqrt(diag(object)), 1)
se(vcov(m1))
## x
## 0.07002696
se(vcov(m2))
## x
## 0.07002696
se(vcov(m3))
## x
## 0.07002696
And when comparing the clustered standard errors we can now show that felm uses the degrees-of-freedom correction while plm does not:
se(vcovHC(m1))
## x
## 0.06572423
m2$cse
## x
## 0.06746538
library(multiwayvcov)
se(cluster.vcov(m3, mydata$facX))
## x
## 0.06746538
se(cluster.vcov(m3, mydata$facX, df_correction = FALSE))
## x
## 0.06572423
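The factor relating the two is the usual small-sample cluster adjustment G/(G-1) * (N-1)/(N-K). A quick check with the objects defined above (a sketch; the numbers are approximate):
G <- nlevels(mydata$facX)    # 20 clusters
N <- nrow(mydata)            # 20000 observations
K <- length(coef(m3))        # 21 estimated parameters (incl. the fixed effects)
dfc <- G / (G - 1) * (N - 1) / (N - K)
se(vcovHC(m1)) * sqrt(dfc)   # ~0.0675, reproducing the felm / cluster.vcov value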