How to regularize the intercept with glmnet in R

I know that glmnet does not regularize the intercept by default, but I would like to do it anyway. I was looking at this question and tried to do what whuber suggested (adding a constant column and setting the intercept parameter to FALSE), but as a result glmnet does not fit the added constant either.
library(dplyr)
library(glmnet)
X <- mtcars %>%
  mutate(intercept = 1) %>%
  select(-c(mpg)) %>%
  as.matrix()
y <- mtcars %>%
  select(mpg) %>%
  as.matrix()
model <- glmnet(X, y, intercept = FALSE, alpha = 0, lambda = 0)
coef(model)
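A possible explanation, offered here as an assumption rather than something stated in the question, is glmnet's internal standardization: the added constant column has zero variance, so with the default standardize = TRUE it cannot be scaled and ends up with a zero coefficient. A minimal sketch of the workaround under that assumption:
# Sketch (assumption): disable standardization so the constant column survives.
# With lambda = 0 this should reproduce the ordinary least-squares fit,
# with the "intercept" column of X playing the role of the intercept.
model_unstd <- glmnet(X, y, intercept = FALSE, standardize = FALSE,
                      alpha = 0, lambda = 0)
coef(model_unstd)  # check that the row named "intercept" is now non-zero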

Related

How to use augment with a model on new data

It is fairly straightforward to use the augment function from the broom package in R to add predictions back into a tibble. Viz.
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
df <- iris %>%
  nest(data = everything()) %>%
  mutate(model = map(data, function(x) lm(Sepal.Length ~ Sepal.Width, data = x)),
         pred = map2(model, data, ~augment(.x, newdata = .y))) %>%
  unnest(pred)
However, when I take a linear model trained on one set of data and try to predict on new data, I receive the following error.
mod <- lm(Sepal.Length ~ Sepal.Width, data = iris)
df2 <- iris %>%
  mutate(Sepal.Width = Sepal.Width + rnorm(1)) %>%
  nest(data = everything()) %>%
  mutate(pred = map2(mod, data, ~augment(.x, newdata = .y)))
# Error: Problem with `mutate()` input `pred`.
# x No augment method for objects of class numeric
# i Input `pred` is `map2(mod, data, ~augment(.x, newdata = .y))`.
How should I use augment to fit new data? Is using an external model object (in the example above this is mod) the best practice or is there a more elegant way?
Since there is only one model, we can do this without using map.
library(dplyr)
df1 <- iris %>%
  mutate(Sepal.Width = Sepal.Width + rnorm(1)) %>%
  tidyr::nest(data = everything()) %>%
  summarise(pred = broom::augment(mod, newdata = data[[1]]),
            mod = list(mod),
            data = data)
Having just posted the question, I think I have an answer. I won't accept it for 48 hours, just in case someone contradicts it or provides a more comprehensive one.
In the example, map2 expects mod to be a vector or list, but it is a model object. Putting mod into the tibble as a list-column removes the error and correctly calculates the predictions.
mod <- lm(Sepal.Length ~ Sepal.Width, data = iris)
df2 <- iris %>%
  mutate(Sepal.Width = Sepal.Width + rnorm(1)) %>%
  nest(data = everything()) %>%
  mutate(mod = list(mod)) %>% #! this is the additional step
  mutate(pred = map2(mod, data, ~augment(.x, newdata = .y))) %>%
  unnest(pred)
Alternatively, wrap the external model object in a list inline.
...
mutate(pred = map2(list(mod), data, ~augment(.x, newdata = .y))) %>%
...

Fit models with robust standard errors

I am using the following R code to run several linear regression models and extract the results to a data frame:
library(tidyverse)
library(broom)
data <- mtcars
outcomes <- c("wt", "mpg", "hp", "disp")
exposures <- c("gear", "vs", "am")
models <- expand.grid(outcomes, exposures) %>%
  group_by(Var1) %>% rowwise() %>%
  summarise(frm = paste0(Var1, "~factor(", Var2, ")")) %>%
  group_by(model_id = row_number(), frm) %>%
  do(tidy(lm(.$frm, data = data))) %>%
  mutate(lci = estimate - (1.96 * std.error),
         uci = estimate + (1.96 * std.error))
How can I modify my code to use robust standard errors similar to STATA?
* example of using robust standard errors in STATA
regress y x, robust
There is a comprehensive discussion of robust standard errors for lm models on Stack Exchange.
You can update your code in the following way:
library(sandwich)
models <- expand.grid(outcomes, exposures) %>%
  group_by(Var1) %>% rowwise() %>%
  summarise(frm = paste0(Var1, "~factor(", Var2, ")")) %>%
  group_by(model_id = row_number(), frm) %>%
  do(cbind(
    tidy(lm(.$frm, data = data)),
    robSE = sqrt(diag(vcovHC(lm(.$frm, data = data), type = "HC1")))
  )) %>%
  mutate(
    lci = estimate - (1.96 * std.error),
    uci = estimate + (1.96 * std.error),
    lciR = estimate - (1.96 * robSE),
    uciR = estimate + (1.96 * robSE)
  )
The important line is this:
sqrt(diag(vcovHC(lm(.$frm, data = data), type = "HC1")))
The vcovHC function returns the heteroskedasticity-consistent covariance matrix of the coefficient estimates. You extract the variances from its diagonal with diag and take their square root with sqrt to get the robust standard errors.
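For a single model, the same recipe looks like this (a minimal standalone sketch; the coeftest step from the lmtest package is optional and simply reports the coefficient table with the robust covariance plugged in):
library(sandwich)
library(lmtest)
fit <- lm(mpg ~ factor(gear), data = mtcars)
# robust (HC1) covariance matrix of the coefficient estimates
V <- vcovHC(fit, type = "HC1")
# robust standard errors: square roots of the diagonal (the variances)
robSE <- sqrt(diag(V))
robSE
# optional: coefficient table computed with the robust covariance matrix
coeftest(fit, vcov. = V)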

Using MAE as the error function for a linear model

I'd like to perform linear regression; however, instead of using RMSE as my error function, I'd like to use MAE (mean absolute error).
Is there a package that would allow me to do this?
You can use the caret and Metrics packages.
library(caret)
data("mtcars")
maeSummary <- function(data, lev = NULL, model = NULL) {
  require(Metrics)
  out <- mae(data$obs, data$pred)
  names(out) <- "MAE"
  out
}
mControl <- trainControl(summaryFunction = maeSummary)
set.seed(123)
lm_model <- train(mpg ~ wt,
                  data = mtcars,
                  method = "lm",
                  metric = "MAE",
                  maximize = FALSE,
                  trControl = mControl)
> lm_model$metric
[1] "MAE"
Probably late to the party, but here is a solution using the CVXR package for optimisation.
library(CVXR)
# defining the variables to be tuned during optimisation
coefficient <- Variable(1)
intercept <- Variable(1)
# defining the objective, i.e. minimizing the sum of absolute differences (MAE)
objective <- Minimize(sum(abs(mtcars$disp - (mtcars$hp * coefficient) - intercept)))
# optimisation
problem <- Problem(objective)
result <- solve(problem)
# result
result$status
mae_coefficient <- result$getValue(coefficient)
mae_intercept <- result$getValue(intercept)
lm_coeff_intrc <- lm(formula = disp ~ hp, data = mtcars)$coefficients
library(tidyverse)
ggplot(mtcars, aes(hp, disp)) +
  geom_point() +
  geom_abline(
    slope = lm_coeff_intrc["hp"],
    intercept = lm_coeff_intrc["(Intercept)"],
    color = "red"
  ) +
  geom_abline(
    slope = mae_coefficient,
    intercept = mae_intercept,
    color = "blue"
  )

df <- mtcars %>%
  select(disp, hp) %>%
  rownames_to_column() %>%
  mutate(
    mae = disp - hp * mae_coefficient - mae_intercept,
    lm = disp - hp * lm_coeff_intrc["hp"] - lm_coeff_intrc["(Intercept)"]
  )

df %>%
  select(mae, lm) %>%
  pivot_longer(cols = 1:2) %>%
  group_by(name) %>%
  summarise(
    mae = sum(abs(value))
  )

Predict for wider range

I would like to fit the data and predict y values over a wider x range.
Let's assume I have the 'iris' data set and use the following code for prediction, taken from this post:
library(dplyr)
library(minpack.lm)  # for nlsLM and nls.lm.control
cc <- iris %>%
  group_by(Species) %>%
  do({
    mod <- nlsLM(Sepal.Length ~ k * Sepal.Width / 2 + U, start = c(k = 10, U = 5),
                 data = ., trace = FALSE, control = nls.lm.control(maxiter = 100))
    pred <- predict(mod, newdata = .["Sepal.Width"])
    data.frame(., pred)
  })
This is the fitting plot
I want to fit this data over a wider Sepal.Width range, such that
new.range <- data.frame(x = seq(2, 10, length.out = 20))
and modify the script to
pred <- predict(mod, newdata = new.range)
To plot the fit over new.range:
library(ggplot2)
ggplot(cc, aes(y = Sepal.Length, x = Sepal.Width, col = factor(Species))) +
  geom_point() +
  facet_wrap(~Species) +
  geom_line(aes(x = new.range, y = pred), size = 1)
Error in (function (..., row.names = NULL, check.rows = FALSE,
check.names = TRUE, : arguments imply differing number of rows:
20, 150
I cannot understand why I am getting this error. I suppose that pred is calculated from new.range, so they should have the same length?
similar posts
using-predict-in-nls
trouble-with-predict-function-in-r
predict-maybe-im-not-understanding-it?
Here is something that achieves what you want. The cause of your original problem is that in your regression the predictor's name is Sepal.Width, not x, and your prediction doesn't use new.range at all, so you have to do something like new.range <- data.frame(Sepal.Width = seq(2, 10, length.out = 50)) to make predictions on new.range.
Another problem is that new.range has to have length 50, so that pred and new.range fit alongside the original data in the data.frame call.
Then you can draw the plot you want; note that the new.range column becomes Sepal.Width.1 in cc.
library(dplyr)
library(minpack.lm)
cc <- iris %>%
  group_by(Species) %>%
  do({
    mod <- nlsLM(Sepal.Length ~ k * Sepal.Width / 2 + U, start = c(k = 10, U = 5),
                 data = ., trace = FALSE, control = nls.lm.control(maxiter = 100))
    new.range <- data.frame(Sepal.Width = seq(2, 10, length.out = 50))
    pred <- predict(mod, newdata = new.range)
    # pred <- predict(mod, newdata = .["Sepal.Width"])
    data.frame(., new.range, pred)
  })
library(ggplot2)
ggplot(cc, aes(y = Sepal.Length, x = Sepal.Width, col = factor(Species))) +
  geom_point() +
  facet_wrap(~Species) +
  geom_line(aes(x = Sepal.Width.1, y = pred), size = 1)

dplyr, do(), extracting parameters from model without losing grouping variable

A slightly changed example from the R help for do():
library(dplyr)
by_cyl <- group_by(mtcars, cyl)
models <- by_cyl %>% do(mod = lm(mpg ~ disp, data = .))
coefficients <- models %>% do(data.frame(coef = coef(.$mod)[[1]]))
The data frame coefficients contains the first coefficient of the linear model for each cyl group. My question is: how can I produce a data frame that contains not only a column with the coefficients but also a column with the grouping variable?
Edit: I extend the example to try to make my problem clearer.
Let's suppose that I want to extract the coefficients of the model and some prediction. I can do this:
by_cyl <- group_by(mtcars, cyl)
getpars <- function(df) {
  fit <- lm(mpg ~ disp, data = df)
  data.frame(intercept = coef(fit)[1], slope = coef(fit)[2])
}
getprediction <- function(df) {
  fit <- lm(mpg ~ disp, data = df)
  x <- df$disp
  y <- predict(fit, data.frame(disp = x), type = "response")
  data.frame(x, y)
}
pars <- by_cyl %>% do(getpars(.))
prediction <- by_cyl %>% do(getprediction(.))
The problem is that the code is redundant because I am fitting the model twice. My idea was to build a function that returns a list with all the information:
getAll <- function(df) {
  results <- list()
  fit <- lm(mpg ~ disp, data = df)
  x <- df$disp
  y <- predict(fit, data.frame(disp = x), type = "response")
  results$pars <- data.frame(intercept = coef(fit)[1], slope = coef(fit)[2])
  results$prediction <- data.frame(x, y)
  results
}
The problem is that I don't know how to use do() with the getAll function to obtain, for example, just a data frame with the parameters (like the data frame pars).
Like this?
coefficients <- models %>% do(data.frame(coef = coef(.$mod)[[1]], group = .[[1]]))
yielding
      coef group
1 40.87196     4
2 19.08199     6
3 22.03280     8
Using the approach of Hadley Wickham in this video:
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

fitmodel <- function(d) lm(mpg ~ disp, data = d)

by_cyl <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  mutate(mod = map(data, fitmodel),
         pars = map(mod, tidy),
         pred = map(mod, augment))

pars <- by_cyl %>% unnest(pars)
prediction <- by_cyl %>% unnest(pred)
