I'm running an elastic net on a generalized linear model with the glmnet and caret packages in R.
My response variable is cost (where cost > $0), and hence I'd like to specify a Gaussian family with a log link for my GLM. However, glmnet doesn't seem to allow me to specify link="log", as follows:
> lasso_fit <- glmnet(x, y, alpha=1, family="gaussian"(link="log"), lambda.min.ratio=.001)
I've tried different variants, with and without quotations, but no luck. The glmnet documentation doesn't discuss how to include a log link.
Am I missing something? Does family="gaussian" already implicitly assume a log link?
It is a bit confusing, but the family arguments in glmnet and glm are quite different. In glm, you can specify either a character string like "gaussian" or a family function with arguments, like gaussian(link="log"). In glmnet, you can only specify the family as a character string, like "gaussian", and there is no way to set the link through that argument.
The default link for gaussian is the identity function, that is, no transformation at all. But, remember that a link function is just a transformation of your y variable; you can just specify it yourself:
glmnet(x, log(y), family="gaussian")
Also note that the default link for the poisson family is log, but the objective function will change. See the Details under ?glmnet in the first couple of paragraphs.
Your comments have led me to rethink my answer; I have evidence that it is not correct.
As you point out, there is a difference between E[log(Y)] and log(E[Y]). I think what the above code does is to fit E[log(Y)], which is not what you want. Here is some code to generate data and confirm what you noted in the comments:
# Generate data
set.seed(1)
x <- replicate(3,runif(1000))
y <- exp(2*x[,1] + 3*x[,2] + x[,3] + runif(1000))
df <- data.frame(y,x)
# Run the model you *want*
glm(y~., family=gaussian(link="log"), data=df)$coef
# (Intercept) X1 X2 X3
# 0.4977746 2.0449443 3.0812333 0.9451073
# Run the model you *don't want* (in two ways)
glm(log(y)~., family=gaussian(link='identity'), data=df)$coef
# (Intercept) X1 X2 X3
# 0.4726745 2.0395798 3.0167274 0.9957110
lm(log(y)~.,data=df)$coef
# (Intercept) X1 X2 X3
# 0.4726745 2.0395798 3.0167274 0.9957110
# Run the glmnet code that I suggested - getting what you *don't want*.
library(glmnet)
glmnet.model <- glmnet(x,log(y),family="gaussian", thresh=1e-8, lambda=0)
c(glmnet.model$a0, glmnet.model$beta[,1])
# s0 V1 V2 V3
# 0.4726745 2.0395798 3.0167274 0.9957110
I know that this is an old question, but in the current version of glmnet (4.0-2) it is possible to use glm family functions as arguments to "family" instead of a character string, so you could use:
glmnet(x, y, family=gaussian(link="log"))
Note that glmnet is faster with the built-in character-string families, since the family-function interface uses a more general (and slower) fitting algorithm.
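As a quick sanity check, the family-function interface with an effectively unpenalized fit should reproduce glm's log-link estimates. This sketch reuses the simulated data from the earlier answer (the lambda = 0 and thresh settings are my choices to mimic an unpenalized fit):
library(glmnet)
set.seed(1)
x <- replicate(3, runif(1000))
y <- exp(2*x[,1] + 3*x[,2] + x[,3] + runif(1000))
glm(y ~ x, family=gaussian(link="log"))$coef                  # reference fit
log_fit <- glmnet(x, y, family=gaussian(link="log"), lambda=0, thresh=1e-8)
c(log_fit$a0, as.vector(log_fit$beta))                        # should agree closely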
Reference:
https://glmnet.stanford.edu/articles/glmnetFamily.html
I'm currently trying to run a Granger causality analysis in R/RStudio. I am receiving errors about aliased coefficients when using the function grangertest(). From my understanding, this occurs because there is perfect multicollinearity between the variables.
Because I have a very large number of pairwise comparisons (200+), I would like the Granger test to simply run with the aliased coefficients as normal rather than returning an error. According to one answer here, the solution is (or was) to set singular.ok=TRUE, but either I am doing it incorrectly or the answer is out of date. I've tried checking the documentation, but have come up empty. Any help would be appreciated.
library(lmtest)
x <- c(0,1,2,3)
y <- c(0,3,6,9)
grangertest(x,y,1) # I want this to run successfully even if there are aliased coefficients.
grangertest(x,y,1, singular.ok=TRUE) # this also doesn't work
"Error in waldtest.lm(fm, 2, ...) :
there are aliased coefficients in the model"
Additionally, is there a way to flag that x and y are actually aliased variables? There seem to be some answers (like here), but I'm having issues getting it to work properly.
alias(x ~ y)
Thanks in advance.
After some investigation and emailing the maintainer of the lmtest package (which provides grangertest), they sent me this solution. It runs on aliased variables where grangertest() fails, and when the variables are not aliased it gives the same values as the normal Granger test.
library(lmtest)
library(dynlm)
# Some data that is multicollinear
x <- c(0,1,2,3,4)
y <- c(0,3,6,9,12)
# Some data that is not multicollinear
# x <- c(0,125,200,230,777)
# y <- c(0,3,6,9,200)
# Convert to time series (this is an important step)
x <- ts(x)
y <- ts(y)
# This will run even when the data is multicollinear (but also when it is not),
# and is functionally the same as running the Granger test (which by default
# uses a Wald test)
m1 <- dynlm(x ~ L(x, 1:1) + L(y, 1:1))
m2 <- dynlm(x ~ L(x, 1:1))
result <- anova(m1, m2, test="F")
# This will fail if the data is multicollinear or aliased, but should otherwise
# give the same results as the anova (F value, p value, etc.)
#grangertest(y, x, 1)
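A hedged sketch of wrapping the comparison in a helper so it can be applied across the 200+ pairs the question mentions (the function name granger_anova and the fixed lag of 1 are my additions, not part of the emailed solution):
library(dynlm)
granger_anova <- function(x, y) {
  x <- ts(x); y <- ts(y)
  m1 <- dynlm(x ~ L(x, 1) + L(y, 1))   # unrestricted: own lag plus lag of y
  m2 <- dynlm(x ~ L(x, 1))             # restricted: own lag only
  anova(m1, m2, test = "F")            # tolerates aliased coefficients
}
granger_anova(c(0,1,2,3,4), c(0,3,6,9,12))   # runs even though x and y are aliased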
I want to persist an lm object to a file and reload it into another program. I know I can do this by writing/reading a binary file via saveRDS/readRDS, but I'd like to have an ASCII file instead of a binary file. More generally, I'd like to know why my idioms for reading dput output are not behaving as I'd expect.
Below are examples of making a simple fit, and successful and unsuccessful recreations of the model:
dat_train <- data.frame(x=1:4, z=c(1, 2.1, 2.9, 4))
fit <- lm(z ~ x, dat_train)
rm(dat_train) # just to make sure fit does not depend on dat_train's existence
dat_score <- data.frame(x=c(1.5, 3.5))
## This works (of course)
predict(fit, dat_score)
# 1 2
# 1.52 3.48
Saving to binary file works:
## http://stackoverflow.com/questions/5118074/reusing-a-model-built-in-r
saveRDS(fit, "model.RDS")
fit2 <- readRDS("model.RDS")
predict(fit2, dat_score)
# 1 2
# 1.52 3.48
So does this (dput-ing it within the R session, not to a file):
fit2 <- eval(dput(fit))
predict(fit2, dat_score)
# 1 2
# 1.52 3.48
But if I persist the fit to disk, I cannot figure out how to get it back into usable shape:
dput(fit, file = "model.R")
fit3 <- source("model.R")$value
# Error in is.data.frame(data): object 'dat_train' not found
predict(fit3, dat_score)
# Error in predict(fit3, dat_score): object 'fit3' not found
Trying to be explicit with the eval does not work either:
## http://stackoverflow.com/questions/9068397/import-text-file-as-single-character-string
dput(fit, file="model.R")
fit4 <- eval(parse(text=paste(readLines("model.R"), collapse=" ")))
# Error in is.data.frame(data): object 'dat_train' not found
predict(fit4, dat_score)
# Error in predict(fit4, dat_score): object 'fit4' not found
In both cases above, I expect fit3 and fit4 to work, but they are not reconstructed into an lm object that I can use with predict().
Can anyone advise me on how to persist a model to an ASCII file as a structure(...) representation, and then read it back in as an lm object I can use with predict()? And why are my current methods not working?
Step 1:
You need to control de-parsing options:
dput(fit, control = c("quoteExpressions", "showAttributes"), file = "model.R")
You can read more on all possible options in ?.deparseOpts.
The "quoteExpressions" wraps all calls / expressions / languages with quote, so that they are not evaluated when you later re-parse it. Note:
source is doing parsing;
call field in your fitted "lm" object is a call:
fit$call
# lm(formula = z ~ x, data = dat_train)
So, without "quoteExpressions", R will try to evaluate lm call during parsing. And if we evaluate it, it is fitting a linear model, and R will aim to find dat_train, which will not exist in your new R session.
The "showAttributes" is another mandatory option, as "lm" object has class attributes. You certainly don't want to discard all class attributes and only export a plain "list" object, right? Moreover, many elements in a "lm" object, like model (the model frame), qr (the compact QR matrix) and terms (terms info), etc all have attributes. You want to keep them all.
If you don't set control, the default setting with:
control = c("keepNA", "keepInteger", "showAttributes")
will be used. As you can see, there is no "quoteExpressions", so you will get into trouble.
You can also specify "keepInteger" and "keepNA", though I don't see a strict need for them with an "lm" object.
------
Step 2:
The above step will get source working correctly. You can recover your model:
fit1 <- source("model.R")$value
However, it is not yet ready for generic functions like summary and predict to work. Why?
The critical issue is that the terms object in fit1 is not really a "terms" object, but only a formula (actually, it is not even a formula, but only a "language" object without the "formula" class!). Just compare fit$terms and fit1$terms, and you will see the difference. Don't be surprised; we set "quoteExpressions" earlier. While that is definitely helpful in preventing evaluation of the call, it has a side effect on terms. So we need to reconstruct terms as best as we can.
Fortunately, it is sufficient to do:
fit1$terms <- terms.formula(fit1$terms)
Though this still does not recover all the information in fit$terms (for example, the variable classes are missing), it is a valid "terms" object.
Why is a "terms" object critical? Because all generic functions rely on it. You may not need to know more on this, as it is really technical, so I will stop here.
Once this is done, we can successfully use predict (and summary, too):
predict(fit1) ## no `newdata` given, using model frame `fit1$model`
# 1 2 3 4
#1.03 2.01 2.99 3.97
predict(fit1, dat_score) ## with `newdata`
# 1 2
#1.52 3.48
-------
Conclusion remark:
Although I have shown you how to make this work, I don't really recommend doing it in general. An "lm" object can be pretty large when you fit a model to a large dataset: for example, residuals and fitted.values are long vectors, and qr and model are huge matrices / data frames. So think about this.
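If size is a concern, a quick way to see which components dominate (my addition, not part of the original answer):
sapply(fit, object.size)   # approximate bytes used by each component of the lm object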
This is an important update!
As mentioned in the previous answer, the most challenging part is to recover $terms as best as we can. The suggested method using terms.formula works for the OP's example, but not for the following one with bs() and poly():
dat <- data.frame(x1 = runif(20), x2 = runif(20), x3 = runif(20), y = rnorm(20))
library(splines)
fit <- lm(y ~ bs(x1, df = 3) + poly(x2, degree = 3) + x3, data = dat)
rm(dat)
If we follow the previous answer:
dput(fit, control = c("quoteExpressions", "showAttributes"), file = "model.R")
fit1 <- source("model.R")$value
fit1$terms <- terms.formula(fit1$terms)
We will see that summary.lm and anova.lm work correctly, but predict.lm does not:
predict(fit1, newdata = data.frame(x1 = 0.5, x2 = 0.5, x3 = 0.5))
Error in bs(x1, df = 3) : could not find function "bs"
This is because ".Environment" attribute of $terms is missing. We need
environment(fit1$terms) <- .GlobalEnv
Now running the above predict again, we see a different error:
Error in poly(x2, degree = 3) :
'degree' must be less than number of unique points
This is because we are missing the "predvars" attribute, which is needed for safe / correct prediction with bs() and poly().
The remedy is to dput this special attribute separately:
dput(attr(fit$terms, "predvars"), control = "quoteExpressions", file = "predvars.R")
then read it back and add it:
attr(fit1$terms, "predvars") <- source("predvars.R")$value
Now predict works correctly.
Note that "dataClass" attribute of $terms is also missing, but this does not seem to cause any problem for any generic functions.
I'm trying to predict from an lm object using predict.lm. However, I would like to use manually inserted coefficients.
To do this I tried:
model$coefficients <- coeff
(where "coeff" is a vector of correct coefficients)
which would indeed modify the coefficients as I want. Nevertheless, when I execute
predict.lm(model, new.data)
I just get predictions calculated with the "old" parameters. Is there a way I could force predict.lm to use the new ones?
Post Scriptum: I need to do this to fit a bin smooth (also called a regressogram).
In addition, when I predict "by hand" (i.e., using matrix multiplication) the results are fine, so I'm quite sure the problem lies in predict.lm not recognizing my new coefficients.
Thanks in advance for the help!
Hacking the $coefficients element does indeed seem to work. Can you show what doesn't work for you?
dd <- data.frame(x=1:5,y=1:5)
m1 <- lm(y~x,dd)
m1$coefficients <- c(-2,1)
m1
## Call:
## lm(formula = y ~ x, data = dd)
##
## Coefficients:
## [1] -2 1
predict(m1,newdata=data.frame(x=7)) ## 5 = -2+1*7
predict.lm(...) gives the same results.
I would be very careful with this approach, checking each time you do something different with the hacked model.
It would be nice if predict and simulate methods took a newparams argument, but in general they don't ...
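As a cross-check, the "by hand" matrix-multiplication route that the question mentions is easy to do with model.matrix (a minimal sketch using the toy model above):
coeff <- c(-2, 1)                                   # the replacement coefficients
X <- model.matrix(~ x, data = data.frame(x = 7))    # design row for the new data
drop(X %*% coeff)                                   # 5, matching predict(m1, ...) above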
I'm trying to learn about ridge regression, and I am using R. From what I understand, beta.r1 and beta.r2 in the code below should be the same:
library(MASS)
n=50
v1=runif(n)
v2=v1+2
V=cbind(1,v1,v2)
w=3+v1+v2
I=diag(3)
lambda=2 #arbitrarily chosen
beta.r1=solve(t(V)%*%V+lambda*I)%*%t(V)%*%w
#Using library(MASS)
fit=lm.ridge(w~v1+v2,lambda=2, Inter=FALSE)
beta.r2=coef(fit)
#Shouldn't beta.r1 and beta.r2 be the same?
I think it is the variable scaling performed inside lm.ridge (you can view the code by typing lm.ridge into your R console) that likely causes the differences. The code scales each variable by its root-mean-square value:
Xscale <- drop(rep(1/n, n) %*% X^2)^0.5
X <- X/rep(Xscale, rep(n, p))
Your code does not perform any variable scaling.
The variable scaling is hinted at on the ?lm.ridge help page, in the description of what lm.ridge returns:
scales: scalings used on the X matrix.
Therefore you can access the scaling used by lm.ridge:
fit$scales
# v1 v2
# 0.2650311 0.2650311
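To confirm, here is a sketch that mimics lm.ridge's internal centering and scaling by hand, reusing v1, v2, and w from the question (this assumes an intercept is present, as the w ~ v1 + v2 formula implies; the resulting slopes should match the v1 and v2 entries of coef(fit)):
X <- cbind(v1, v2)
Xc <- scale(X, center = TRUE, scale = FALSE)        # center the predictors
wc <- w - mean(w)                                   # center the response
Xscale <- sqrt(colMeans(Xc^2))                      # root-mean-square scaling, as in lm.ridge
Xs <- sweep(Xc, 2, Xscale, "/")
beta.scaled <- solve(t(Xs) %*% Xs + 2 * diag(2), t(Xs) %*% wc)
drop(beta.scaled) / Xscale                          # compare with coef(fit)[c("v1", "v2")]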
How can I get the values of the Z statistics as a vector from a glm object?
For example, I have
fit <- glm(y ~ 0 + x,binomial)
How can I access the Pr(>|z|) column the same way I get the coefficient estimates with fit$coef?
I believe that
coef(summary(fit))[,"Pr(>|z|)"]
will get you what you want. (summary.glm() returns an object that has a coef() method that returns the coefficient table.) (By the way, if accessor methods exist it's better to use them than to directly access the components of the fitted model -- e.g. coef(fit) is better than fit$coef.)
pull out p-values and r-squared from a linear regression gives a similar answer.
I would suggest methods(class="summary.glm") to find available accessor methods, but it's actually a little bit trickier than that because the default methods (in this case coef.default()) may also be relevant ...
PS if you want the Z values, coef(summary(fit))[,"z value"] should do it (your question is a little ambiguous: usually when people say "Z statistic" they mean they want the value of the test statistic, rather than the p-value)
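A quick self-contained illustration (the simulated data is my own, just to make the example runnable):
set.seed(42)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(x))                 # Bernoulli response
fit <- glm(y ~ 0 + x, binomial)
coef(summary(fit))[, "z value"]                # the Z statistic(s)
coef(summary(fit))[, "Pr(>|z|)"]               # the p-value(s)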
You can access the info you want by doing:
utils::data(anorexia, package="MASS") # Some data
anorex.1 <- glm(Postwt ~ Prewt + Treat + offset(Prewt),
family = gaussian, data = anorexia) # a glm model
summary(anorex.1)
summary(anorex.1)$coefficients[,3] # vector of t-values
(Intercept) Prewt TreatCont TreatFT
3.716770 -3.508689 -2.163761 2.138933
summary(anorex.1)$coefficients[,4] # vector of p-values
(Intercept) Prewt TreatCont TreatFT
0.0004101067 0.0008034250 0.0339993147 0.0360350847
summary(anorex.1)$coefficients[,3:4] # matrix
t value Pr(>|t|)
(Intercept) 3.716770 0.0004101067
Prewt -3.508689 0.0008034250
TreatCont -2.163761 0.0339993147
TreatFT 2.138933 0.0360350847
The str function will show you where each element within an object is located, and [[ accessors (better than $, as pointed out by @DWin and @Ben Bolker) will extract the info for you. Try str(summary(anorex.1)) to see what this function does.
I use summary syntax like summary(my_model)[[1]][[2]]. You can try different combinations of numbers in [[ ]] to extract the data you need. Of course, that's if I correctly understood your question :)