can not use Non-standard evaluation in self-define function in r - r

I want to write a function that extracts some information from gam model.
I can do this without self-define function (df is what I wanted):
library(mgcv)
library(tidyverse)
model = gam(mpg ~ cyl, data = mtcars)
result = summary(model)$p.table
estimate = result[2,1]
se = result[2,2]
df = data.frame(estimate = estimate, se = se)
df
Then I wrapped it with a self-define function:
my_gam <- function(y, x, data){
model = gam(y ~ x, data = data)
result = summary(model)$p.table
estimate = result[2,1]
se = result[2,2]
df = data.frame(estimate = estimate, se = se)
df
}
But I can not use my function correctly.
my_gam(y = mpg, x = cyl, data = mtcars)
Error in eval(predvars, data, env) : object 'cyl' not found
my_gam(y = 'mpg', x = 'cyl', data = mtcars)
Error in gam(y ~ x, data = data) :
Not enough (non-NA) data to do anything meaningful
Is that a way I can get the df just as the first code block when I run my_gam(y = mpg, x = cyl, data = mtcars).
Any help will be highly appreciated!!

You can use reformulate/as.formula to construct the formula.
library(mgcv)
my_gam <- function(y, x, data){
model = gam(reformulate(x, y), data = data)
result = summary(model)$p.table
estimate = result[2,1]
se = result[2,2]
df = data.frame(estimate = estimate, se = se)
df
}
my_gam(y = 'mpg', x = 'cyl', data = mtcars)
# estimate se
#1 -2.876 0.3224

We can construct a formula with paste which would be fast
my_gam <- function(y, x, data){
model <- gam(as.formula(paste(y, "~", x)), data = data)
result <- summary(model)$p.table
estimate <- result[2,1]
se <- result[2,2]
df <- data.frame(estimate = estimate, se = se)
df
}
my_gam(y = 'mpg', x = 'cyl', data = mtcars)
# estimate se
#1 -2.87579 0.3224089
Or another option is to pass a formula as argument
my_gam <- function(fmla, data){
model <- gam(fmla, data = data)
result <- summary(model)$p.table
estimate <- result[2,1]
se <- result[2,2]
df <- data.frame(estimate = estimate, se = se)
df
}
my_gam(mpg ~ cyl, data = mtcars)
# estimate se
# 1 -2.87579 0.3224089

Related

Passing glm family into function and compare two fitted models using a likelihood-ratio test

I have a function that imputes data, fits models, and compares models. I would like to be able to pass the glm family into the function. However, that does not work.
library(mice)
fun_1 <- function(my_data, my_family) {
imp <- mice(my_data, seed = 51009, print = FALSE)
mi1 <- with(data = imp, expr = glm(bmi ~ age + hyp + chl, family = my_family))
mi0_1 <- with(data = imp, expr = glm(bmi ~ age + hyp, family = my_family))
mi0_2 <- with(data = imp, expr = glm(bmi ~ age + chl, family = my_family))
lapply(list(mi0_1, mi0_2), function(x) mice::D3(fit1 = mi1, fit0 = x))
}
fun_1(my_data = nhanes2, my_family = "gaussian")
When I run that function I get an error:
Error in glm(formula = bmi ~ 1, family = my_family, data = cbind(data, :
object 'my_family' not found
I was surprised since passing the family into that function here works well:
fun_2 <- function(my_data, my_family) {
glm(bmi ~ age + hyp + chl, family = my_family, data = my_data)
}
fun_2(my_data = nhanes2, my_family = "gaussian")
And if I only pass the data into fun_1 then everything runs well
fun_3 <- function(my_data) {
imp <- mice(my_data, seed = 51009, print = FALSE)
mi1 <- with(data = imp, expr = glm(bmi ~ age + hyp + chl, family = "gaussian"))
mi0_1 <- with(data = imp, expr = glm(bmi ~ age + hyp, family = "gaussian"))
mi0_2 <- with(data = imp, expr = glm(bmi ~ age + chl, family = "gaussian"))
lapply(list(mi0_1, mi0_2), function(x) mice::D3(fit1 = mi1, fit0 = x))
}
fun_3(my_data = nhanes2)
Strangely enough it is not the regressions that fail in fun_1 but it is the likelihood ratio test (D3) later on.
What am I doing wrong? How can I make this work?

Histogram of AIC for each models

Hello How can I create a histogram for the difference of the AICs of each models to the AIC of the full model.?
#AIC of the full model
Y <- modelTT$aic
#AICs for each of the n models.
X <- lapply(listOfModels,function(xx) xx$aic)
so basically I want to do the X - Y first. Then I need to create the histogram of each of the difference values from largest to smallest.
Another alternative using broom()
df = data.frame(a = sample(1:10, replace = TRUE, 24),
b = sample(25:40, replace = TRUE, 24),
c = sample(0:1, replace = TRUE, 24))
model1 = lm(a ~ b + c, df)
model2 = lm(b ~ c, df )
model3 = lm(a ~ c, df)
library(broom)
library(ggplot2)
library(dplyr)
mod1 = glance(model1) %>% mutate(model = "m1")
mod2 = glance(model2) %>% mutate(model = "m2")
mod3 = glance(model3) %>% mutate(model = "m3")
models = bind_rows(mod1, mod2, mod3)
models %>% ggplot(aes(model,AIC)) + geom_bar(stat = "identity")
Gives the following
A generic data.frame
db<-data.frame(y=c(1,2,3,4,5,6,7,8,9),x1=c(9,8,7,6,5,4,3,2,1),x2=c(9,9,7,7,5,5,3,3,1))
A list of lm models
LM_modesl<-NULL
LM_modesl[[1]]<-lm(y ~ x1+x2 , data = db)
LM_modesl[[2]] <- lm(y ~ x1 , data = db)
LM_modesl[[3]] <- lm(y ~ x2 , data = db)
AIC calculation
AIC<-lapply(LM_modesl,AIC)
Decreasing plot
plot(sort(unlist(AIC),decreasing = T),type="h")

Using MAE as the error function for a linear model

I'd like to perform linear regression, however instead of using RMSE as my error function, I'd like to use MAE (Mean Absolute Error).
Is there a package that would allow me to do this?
You may use caret and Metrics packages.
library(caret)
data("mtcars")
maeSummary <- function (data,
lev = NULL,
model = NULL) {
require(Metrics)
out <- mae(data$obs, data$pred)
names(out) <- "MAE"
out
}
mControl <- trainControl(summaryFunction = maeSummary)
set.seed(123)
lm_model <- train(mpg ~ wt,
data = mtcars,
method = "lm",
metric = "MAE",
maximize = FALSE,
trControl = mControl)
> lm_model$metric
[1] "MAE"
Probably late to the party, but here is a solution using CVXR package for optimisation.
library(CVXR)
# defining variables to be tuned during optimisation
coefficient <- Variable(1)
intercept <- Variable(1)
# defining the objective i.e. minimizing the sum af absolute differences (MAE)
objective <- Minimize(sum(abs(mtcars$disp - (mtcars$hp * coefficient) - intercept)))
# optimisation
problem <- Problem(objective)
result <- solve(problem)
# result
result$status
mae_coefficient <- result$getValue(coefficient)
mae_intercept <- result$getValue(intercept)
lm_coeff_intrc <- lm(formula = disp ~ hp, data = mtcars)$coefficients
library(tidyverse)
ggplot(mtcars, aes(hp, disp)) +
geom_point() +
geom_abline(
slope = lm_coeff_intrc["hp"],
intercept = lm_coeff_intrc["(Intercept)"],
color = "red"
) +
geom_abline(
slope = mae_coefficient,
intercept = mae_intercept,
color = "blue"
)
df <- mtcars %>%
select(disp, hp) %>%
rownames_to_column() %>%
mutate(
mae = disp - hp * mae_coefficient - mae_intercept,
lm = disp - hp * lm_coeff_intrc["hp"] - lm_coeff_intrc["(Intercept)"]
)
df %>%
select(mae, lm) %>%
pivot_longer(cols = 1:2) %>%
group_by(name) %>%
summarise(
mae = sum(abs(value))
)

Calling Variables in a Formula in r

I want to generalize my function in. I have defined form = y~x, and want to call y so that I use it within the function. y should be user (dynamic) defined and that is why I need a way to call it from the form. This is part of the code I tried and newdata is calculated within the function.
form = y ~ x
newdata = y
trial = function(form, x){
y = newdata
reg = lm(form, data = data.frame(x, newdata))
reg
}
If you want your function to work you could try :
form = y ~ x
newdata = 3*(1:100)+2
trial = function(f=form, x){
y = newdata
reg = lm(f, data = data.frame(x, y=newdata))
reg
}
trial(x=1:100) # or trial(form,1:100)
Call:
lm(formula = f, data = data.frame(x, y = newdata))
Coefficients:
(Intercept) x
2 3 # as expected
The think is, in your previous function, the form was an argument and R did not understand that it was the form you defined. Using f with a default value of form solves this issue.
Was that what you wanted ?
Note that if you want to call the function and define y in the call you could do :
form = y ~ x
trial = function(f=form, x, y){
reg = lm(f, data = data.frame(x, y))
reg
}
trial(x=1:100,y=3*(1:100)+2)
Call:
lm(formula = f, data = data.frame(x, y))
Coefficients:
(Intercept) x
2 3
From what I understood from your comments this might be closer to what you expected :
form = y ~ x
trial = function(f=form, x, y){
y = 5*y+2
reg = lm(f, data = data.frame(x, y))
reg
}
trial(x=1:100,y=3*(1:100))
Call:
lm(formula = f, data = data.frame(x, y))
Coefficients:
(Intercept) x
2 15
You call the function specifying x and giving a "first value" for y. Then, in the function, y is transformed (quite an easy transformation here) and then the regression is done.
Note that when you call the function, y is not the variable y but the argument in the function. If you have a variable z equal to 3*(1:100) you can do trial(x=1:100,y=z)
I tried this out and it turned out to be what I needed
form = y ~ x
trial = function(form, x){
.
.
.
newdata = something
.
.
new.form <- as.formula(call("~", form[[2]], form[[3]]))
assign(deparse(form[[2]]), newdata)
reg = lm(new.form, data = data.frame(x, newdata))
reg
}
Something like this?
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)
lm(weight ~ group)
form <- as.formula(y ~ x)
trial = function(form,y, x){
reg = lm(form, data = data.frame(x=x, y=y))
reg
}
trial(form,weight,group)

How to graph my multiple linear regression model (caret)?

I have created an multiple linear regression model and would now like to plot it. But I can't seem to figure it out. Any help would be greatly appreciated! I used baruto to find the feature attributes and then used train() to get the model. When I try to plot model_lm I get the error:
There are no tuning parameters with more than 1 value.
Here is my code at what I have attempted so far:
rt_train <- rttotal2
rt_train$year <- NULL
#rt_train$box_office <- NULL
#impute na and address multicoliniearity
preproc <- preProcess(rt_train, method = c("knnImpute","center",
"scale"))
rt_proc <- predict(preproc, rt_train)
rt_proc$box_office <- rt_train$box_office
sum(is.na(rt_proc))
titles <- rt_proc$titles
rt_proc$titles <- NULL
#rt_train$interval <- as.factor(rt_train$interval)
dmy <- dummyVars(" ~ .", data = rt_proc,fullRank = T)
rt_transform <- data.frame(predict(dmy, newdata = rt_proc))
index <- createDataPartition(rt_transform$interval, p =.75, list = FALSE)
train_m <- rt_transform[index, ]
rt_test <- rt_transform[-index, ]
str(rt_train)
y_train <- train_m$box_office
y_test <-rt_test$box_office
train_m$box_office <- NULL
rt_test$box_office <- NULL
#selected feature attributes
boruta.train <- Boruta(interval~., train_m, doTrace =1)
#graph to see most important var to interval
lz<-lapply(1:ncol(boruta.train$ImpHistory),function(i)
boruta.train$ImpHistory[is.finite(boruta.train$ImpHistory[,i]),i])
names(lz) <- colnames(boruta.train$ImpHistory)
plot(boruta.train, xlab = "", xaxt = "n")
Labels <- sort(sapply(lz,median))
axis(side = 1,las=2,labels = names(Labels),
at = 1:ncol(boruta.train$ImpHistory), cex.axis = 0.7)
#get most important attributes
final.boruta <- TentativeRoughFix(boruta.train)
print(final.boruta)
getSelectedAttributes(final.boruta, withTentative = F)
boruta.rt_df <- attStats(final.boruta)
boruta.rt_df
boruta.rt_df <- setDT(boruta.rt_df, keep.rownames = TRUE)[]
predictors <- boruta.rt_df %>%
filter(., decision =="Confirmed") %>%
select(., rn)
predictors <- unlist(predictors)
control <- trainControl(method="repeatedcv",
number=10,
repeats=6)
#look at residuals
#p-value is very small so reject H0 that predictors have no effect so
#we can use rotten tomatoes to predict box_office ranges
train_m$interval <- NULL
model_lm <- train(train_m[,predictors],
y_train, method='lm',
trControl = control, tuneLength = 10)
model_lm #.568
#
plot(model_lm)
plot(model_lm)
z <- varImp(object=model_lm)
z <- setDT(z, keep.rownames = TRUE)
z$model <- NULL
z$calledFrom <- NULL
row.names(z)
plot(varImp(object=model_lm),main="Linear Model Variable Importance")
predictions<-predict.train(object=model_lm,rt_test[,predictors],type="raw")
table(predictions)
#get coeff
interc <- coef(model_lm$finalModel)
slope <- coef(model_lm$finalModel)
ggplot(data = rt_train, aes(y = box_office)) +
geom_point() +
geom_abline(slope = slope, intercept = interc, color = 'red')
This is what some of my input looks like. Thank you!!
Here is an example using the inbuilt data set cars:
data(cars, package = "datasets")
library(caret)
build the model
control <- trainControl(method = "repeatedcv",
number = 10,
repeats = 6)
model_lm <- train(dist ~ speed, data = cars, method='lm',
trControl = control, tuneLength = 10)
I will assume you would like to plot the final model.
You can use the caret predict.train function to get the predictions from the model and plot them:
pred <- predict(model_lm, cars)
pred <- data.frame(pred = pred, speed = cars$speed)
additionally you can provide the cars data set to geom point and plot the observations:
library(ggplot2)
ggplot(data = pred)+
geom_line(aes(x = speed, y = pred))+
geom_point(data = cars, aes(x=speed, y = dist))
if you would like to obtain the confidence or prediction interval you can use the predict.lm function on model_lm$finalModel:
Here is an example for the prediction interval:
pred <- predict(model_lm$finalModel, cars, se.fit = TRUE, interval = "prediction")
pred <- data.frame(pred = pred$fit[,1], speed = cars$speed, lwr = pred$fit[,2], upr = pred$fit[,3])
pred_int <- ggplot(data = pred)+
geom_line(aes(x = speed, y = pred))+
geom_point(data = cars, aes(x = speed, y = dist)) +
geom_ribbon(aes(ymin = lwr, ymax = upr, x = speed), alpha = 0.2)
or the confidence interval:
pred <- predict(model_lm$finalModel, cars, se.fit = TRUE, interval = "confidence")
pred <- data.frame(pred = pred$fit[,1], speed = cars$speed, lwr = pred$fit[,2], upr = pred$fit[,3])
pred_conf <- ggplot(data = pred)+
geom_line(aes(x = speed, y = pred))+
geom_point(data = cars, aes(x = speed, y = dist)) +
geom_ribbon(aes(ymin = lwr, ymax = upr, x = speed), alpha = 0.2)
plotting them side by side:
library(cowplot)
plot_grid(pred_int, pred_conf)
to plot the linear dependence on two variables you can use a 3D plot, for more than 3 it will be a problem.

Resources