LASSO-type regressions with a non-negative continuous dependent variable - r

I am using the "glmnet" package (in R), mostly to perform regularized linear regression.
However, I am wondering whether it can perform LASSO-type regressions with a non-negative continuous (dependent) outcome variable.
I can use family = poisson, but the outcome variable is not specifically a "count" variable. It is just a continuous variable with a lower limit of 0.
I am aware of the "lower.limits" argument, but I believe it applies to the covariates (independent variables). (Please correct me if my understanding of this argument is not right.)
I look forward to hearing from you all! Thanks :-)

You are right that setting lower limits in glmnet is meant for the covariates. Poisson enforces a lower limit of zero because you exponentiate the linear predictor to get back the "counts".
Going along those lines, it will most likely work if you transform your response variable. One quick way is to take the log of the response, do the fit, and transform back; this ensures predictions are always positive, though you do have to deal with zeros.
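A minimal sketch of that idea, assuming x is your predictor matrix and y your non-negative response (the 1e-6 offset used to handle exact zeros is an arbitrary choice):
eps = 1e-6
fit_log = glmnet(x, log(y + eps))
pred = exp(predict(fit_log, x)) - eps  # exp() keeps predictions essentially non-negative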
An alternative is a power transformation. There's a lot to think about, and since you did not provide your data, I can only try a two-parameter Box-Cox on an example dataset:
library(glmnet)
library(mlbench)  # for the BostonHousing data
library(geoR)     # for boxcoxfit()
data(BostonHousing)
data = BostonHousing
data$chas = as.numeric(data$chas)
# rescale the response to min 0 and max 1
data$medv = (data$medv - min(data$medv)) / diff(range(data$medv))
Then I use a quick approximation via PCA (rather than fitting all the variables) to get suitable values of lambda and lambda2:
bcfit = boxcoxfit(object = data[,14],
                  xmat = prcomp(data[,-14], scale = TRUE, center = TRUE)$x[,1:2],
                  lambda2 = TRUE)
bcfit
Fitted parameters:
lambda lambda2 beta0 beta1 beta2 sigmasq
0.42696313 0.00001000 -0.83074178 -0.09876102 0.08970137 0.05655903
Convergence code returned by optim: 0
Check lambda2: it is the one that's critical for deciding whether you get a negative value, and it should be rather small.
Create the functions to power transform:
# two-parameter Box-Cox transform and its inverse
bct = function(y, l1, l2) { ((y + l2)^l1 - 1) / l1 }
bctinverse = function(y, l1, l2) { (y * l1 + 1)^(1/l1) - l2 }
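A quick round-trip check that the two functions invert each other:
all.equal(bctinverse(bct(data$medv, bcfit$lambda[1], bcfit$lambda[2]),
                     bcfit$lambda[1], bcfit$lambda[2]),
          data$medv)  # should be TRUE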
Now we transform the response:
data$medv_trans = bct(data$medv,bcfit$lambda[1],bcfit$lambda[2])
And fit glmnet:
fit = glmnet(x=as.matrix(data[,1:13]),y=data$medv_trans,nlambda=500)
Get predictions over all lambdas, and you can see there are no negative predictions once you transform back:
pred = predict(fit,as.matrix(data[,1:13]))
range(bctinverse(pred,bcfit$lambda[1],bcfit$lambda[2]))
[1] 0.006690685 0.918473356
And let's say we do a fit with cv:
fit = cv.glmnet(x=as.matrix(data[,1:13]),y=data$medv_trans)
pred = predict(fit,as.matrix(data[,1:13]))
pred_transformed = bctinverse(pred, bcfit$lambda[1], bcfit$lambda[2])
plot(data$medv,pred_transformed,xlab="orig response",ylab="predictions")

Related

How to specify random effects names in a newdata data.frame used in predict() function? - lme4

I have a problem using the predict() function in lme4.
More precisely, I am not very clear on how to declare the names of the random effects in the newdata data frame that I feed to the predict() function in order to get predictions.
I will try and describe my problem in detail.
Data
The data I am working with are longitudinal. I have 119 observations, and for each of them I have several (6-7) measurements, which represent the size of proteins that aggregate over time and grow bigger (let's call the protein LDL).
Model
The model used to describe this process is a Richards curve (generalized logistic function), which can be written as
LDL(t) = 15 + (alpha - 15) / (1 + exp((gamma - t) / delta))
Now, I fit a separate curve to the group of measurements of each observation, with the following fixed effects, random effects, and variables:
alpha_fix - a fixed effect for alpha
alpha|Obs - a random effect for alpha, which varies among observations
gamma_fix - a fixed effect for gamma
gamma|Obs - a random effect for gamma, which varies among observations
delta_f - a fixed effect
Time - a continuous variable, time in hours
LDL - response variable, continuous, representing size of proteins at time point t.
Predictions
Once I fit the model, I want to use it to predict the value of LDL at a specific time point for each observation. To do this, I need to use the predict() function and supply a data frame as newdata. Reading through the documentation here, it says the following:
If any random effects are included in re.form (see below), newdata
must contain columns corresponding to all of the grouping variables
and random effects used in the original model, even if not all are
used in prediction; however, they can be safely set to NA in this case
Now, the way I understand this, I need a newdata data frame which in my case contains the following columns: "Time", "Obs", "alpha_f", "gamma_f", "delta_f", as well as two columns for the random effects of alpha and gamma, respectively. However, I am not sure how these two random-effect columns should be named for the predict() function to understand them. I tried "alpha|Obs" and "gamma|Obs", as well as "Obs$alpha" and "Obs$gamma", but both throw the error
Error in FUN(X[[i]], ...) : random effects specified in re.form
that were not present in original model.
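For concreteness, one of those attempts looked something like this (the time value and observation level are hypothetical placeholders):
newdata = data.frame(Time = 2, Obs = "Obs1",
                     "alpha|Obs" = NA, "gamma|Obs" = NA,
                     check.names = FALSE)
pred = predict(fit, newdata = newdata)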
I was wondering whether anyone has any idea what the correct way to do this is.
For completeness, the code used to fit the model is provided below:
library(lme4)
ModelFunction = function(alpha, gamma, delta, Time) {
  15 + (alpha - 15) / (1 + exp((gamma - Time) / delta))
}
# derivatives of the model body with respect to the three parameters
ModelGradient = deriv(body(ModelFunction)[[2]],
                      namevec = c("alpha", "gamma", "delta"),
                      function.arg = ModelFunction)
starting_conditions = c(alpha = 5000, gamma = 1.5, delta = 0.2)  # based on visual observation
fit = nlmer(
  LDL ~ ModelGradient(alpha, gamma, delta, Time) ~ (gamma | Obs) + (alpha | Obs),
  start = starting_conditions,
  control = nlmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 100000)),
  data = ldlData)
I would really appreciate it if someone could give me some advice.
Thanks for your time!

mgcv: Error "Model has more coefficients than data", related to the by argument in gam()

In the first case, the gam() code from the mgcv R package works well.
library(mgcv)
dat <- gamSim(1,n=400,dist="normal",scale=2)
num_knots = nrow(dat)  # 400
fit <- gam(y~s(x0, bs = "cr", k = num_knots, m=2),data=dat)
summary(fit)
But after I add the by argument in gam(), it reports the error "Model has more coefficients than data".
fit <- gam(y~s(x0, bs = "cr", k = num_knots, m=2, by = x1),data=dat)
The error confuses me because I thought adding the by argument to create an interaction between the smooth term and the parametric term should not increase the number of unknown coefficients, but it turns out I am wrong. Where did I go wrong?
When you pass a continuous variable to by, what you get is a varying-coefficient model, where the effect of x1 varies as a smooth function of x0.
What is happening in the first case is that, because of the identifiability constraint applied to the basis expansion for x0, you requested num_knots basis functions but actually got num_knots - 1. When you add the intercept you get num_knots coefficients, which is OK to fit with this model because it is a penalised spline (though you probably want method = 'REML'). The identifiability constraint is applied because there is a basis function (or combination of basis functions) that is confounded with the model intercept, and you can't fit two constant terms in the model and have them be uniquely identified.
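You can check the coefficient count directly; with k = 400 you end up with an intercept plus 399 basis functions:
length(coef(fit))  # 400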
In the second case, the varying-coefficient model, the same issue doesn't arise, so when you ask for num_knots basis functions plus an intercept you are trying to fit a model with 401 coefficients to 400 observations, which isn't going to work.
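If you do want this varying-coefficient model on these data, one hedged fix is to request fewer basis functions so the coefficient count no longer exceeds the number of observations, e.g.:
fit <- gam(y ~ s(x0, bs = "cr", k = num_knots - 1, m = 2, by = x1),
           data = dat, method = "REML")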

Scenario development with GAM models

I'm working with a mgcv::gam model in R to generate predictions in which the relationship between time (year) and the outcome variable (out) varies. For example, in one scenario, I'd like to force time to affect the outcome variable in a linear manner, in another a marginally decreasing manner, and in another, I'd like to specify specific slopes of the time-outcome interaction. I'm unsure how to force the prediction to treat the interaction between time and the outcome variable in a specific manner:
res <- gam(out ~ s(time) + s(GEOID, bs='re'), data = df, method = "REML")
pred <- predict(res, newdata = ndf, type = "response", se.fit = TRUE)
There isn't an interaction between time and out here; rather, time has a potentially non-linear effect on out.
Are we talking about trying to force certain shapes for the function of time? If so, you will need to estimate different models. Use time as a plain linear term if you want a linear effect:
res_lin <- gam(out ~ time + s(GEOID, bs='re'), data = df, method = "REML")
and look at shape-constrained P-splines to enforce monotonicity or concave/convex relationships.
The scam package has these sorts of constraints; it builds on mgcv and uses GCV smoothness selection to fit the shape-constrained models.
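In code, that might look like the following sketch (bs = "mpd" gives a monotone-decreasing P-spline, "mpi" a monotone-increasing one; whether the unconstrained bs = 're' term mixes in exactly as shown is an assumption worth checking against the scam documentation):
library(scam)
res_mono <- scam(out ~ s(time, bs = "mpd") + s(GEOID, bs = "re"),
                 data = df)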
As for specifying a specific slope for the linear effect of time, I think you'll need to include time as an offset in the model. Say the slope you want is 0.5: I think you need + offset(I(0.5*time)), because an offset by definition has a coefficient of 1. I would double-check this though, as I might have messed up my thinking here.
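In code, that suggestion (with the same caveat) would be something like:
res_slope <- gam(out ~ offset(I(0.5 * time)) + s(GEOID, bs = "re"),
                 data = df, method = "REML")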

What are the differences between directly plotting the fit function and plotting the predicted values (same shape but different ranges)?

I am trying to learn gam() in R for a logistic regression using a spline on a predictor. The two plotting methods in my code give the same shape but different ranges of the response on the logit scale; it seems like an intercept is missing in one. Both are supposed to be correct, so why the difference in ranges?
library(ISLR)
attach(Wage)
library(gam)
gam.lr = gam(I(wage >250) ~ s(age), family = binomial(link = "logit"), data = Wage)
agelims = range(age)
age.grid = seq(from = agelims[1], to = agelims[2])
pred=predict(gam.lr, newdata = list(age = age.grid), type = "link")
par(mfrow = c(2,1))
plot(gam.lr)
plot(age.grid, pred)
I expected that both methods would give exactly the same plot. plot(gam.lr) plots the additive effect of each component, and since there is only one component here, it should give the predicted logit function. The predict method also gives me estimates on the link scale. But the actual outputs are on different ranges: the minimum value from the first method is about -4, while that from the second is less than -7.
The first plot shows the estimated smooth function s(age) only. Smooths are subject to identifiability constraints because, in the basis expansion used to parametrise the smooth, there is a function (or combination of functions) that is entirely confounded with the intercept. As such, you can't fit the smooth and an intercept in the same model: you could subtract some value from the intercept and add it back to the smooth to get the same fit but different coefficients, and as you can add and subtract an infinity of such values, you have an infinite supply of equivalent models, which isn't helpful.
Hence identifiability constraints are applied to the basis expansion, and the most useful one ensures that the smooth sums to zero over the range of the covariate. This centres the smooth at 0, with the intercept then representing the overall mean of the response (here, on the link scale).
So the first plot shows the smooth subject to this sum-to-zero constraint, which is why it straddles 0. The intercept in this model is:
> coef(gam.lr)[1]
(Intercept)
-4.7175
If you add this intercept to the values in the first plot, you get the values in the second plot, which is the application of the full model, intercept + f(age), to the data you supplied.
This is all happening on the link scale, the log-odds scale, hence all the negative values.
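You can confirm this numerically: subtracting the intercept from the link-scale predictions recovers the centred smooth drawn by plot(gam.lr).
plot(age.grid, pred - coef(gam.lr)[1])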

Difference between glmnet() and cv.glmnet() in R?

I'm working on a project that would show the potential influence a group of events have on an outcome. I'm using the glmnet() package, specifically using the Poisson feature. Here's my code:
# de <- data imported from sql connection
x <- model.matrix(~.,data = de[,2:7])
y <- (de[,1])
reg <- cv.glmnet(x,y, family = "poisson", alpha = 1)
reg1 <- glmnet(x,y, family = "poisson", alpha = 1)
Co <- coef(?reg or reg1?, s = ???)
summ <- summary(Co)
c <- data.frame(Name = rownames(Co)[summ$i],
                Lambda = summ$x)
c2 <- c[with(c, order(-Lambda)), ]
The beginning imports a large amount of data from my database in SQL. I then put it in matrix format and separate the response from the predictors.
This is where I'm confused: I can't figure out exactly what the difference is between the glmnet() function and the cv.glmnet() function. I realize that the cv.glmnet() function is a k-fold cross-validation of glmnet(), but what exactly does that mean in practical terms? They provide the same value for lambda, but I want to make sure I'm not missing something important about the difference between the two.
I'm also unclear as to why it runs fine when I specify alpha=1 (supposedly the default), but not if I leave it out?
Thanks in advance!
glmnet is an R package that fits regularized regression models such as ridge and lasso. The alpha argument determines which type of model is fit: alpha = 0 gives a ridge model and alpha = 1 gives a lasso model (values in between give the elastic net).
cv.glmnet() performs cross-validation, by default 10-fold, which can be adjusted using nfolds. A 10-fold CV randomly divides your observations into 10 non-overlapping folds of approximately equal size; each fold is used once as the validation set while the model is fit on the remaining 9 folds. The bias-variance trade-off is usually the motivation behind such model-validation methods, and in the case of lasso and ridge models, CV helps choose the value of the tuning parameter lambda.
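For example, to use 5-fold instead of the default 10-fold CV:
reg <- cv.glmnet(x, y, family = "poisson", alpha = 1, nfolds = 5)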
In your example, you can run plot(reg) or look at reg$lambda.min to see the value of lambda that results in the smallest CV error, and you can then compute the test MSE for that value of lambda. By default, glmnet() performs ridge or lasso regression over an automatically selected range of lambda values, which may not give the lowest test MSE. Hope this helps!
Between reg$lambda.min and reg$lambda.1se: lambda.min will give you the lowest cross-validated error; however, depending on how flexible you can be with the error, you may want to choose reg$lambda.1se, as this value further shrinks the number of predictors. You may also choose the mean of reg$lambda.min and reg$lambda.1se as your lambda value.
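Applied to your code, that resolves the coef() line: extract coefficients from the cross-validated fit and pass the chosen lambda via s, e.g.:
Co <- coef(reg, s = "lambda.min")  # or s = "lambda.1se" for a sparser model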
