I want to do a ridge regression in R by using glmnet or lm.ridge.
I need to do this regression with log(Y)
cost ~ size + weight ⇒ log(cost) ~ size + weight
However, I found that there is no link argument, as there is for glm, for glmnet or lm.ridge.
Any ideas for this issue?
Use the alpha input parameter (set to 0) of the glmnet function. As the documentation says:
alpha=1 is the lasso penalty, and alpha=0 the ridge penalty.
Try something like the following:
glmnet(x=cbind(size, weight), y=log(cost), alpha=0, family='gaussian')
or maybe with a Poisson regression:
glmnet(x=cbind(size, weight), y=cost, alpha=0, family='poisson')
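Either way you will need a value for lambda; as noted below, that is usually chosen by cross-validation, for example with cv.glmnet. A minimal sketch, reusing the size/weight/cost variables from the question:
library(glmnet)
cvfit <- cv.glmnet(x = cbind(size, weight), y = log(cost), alpha = 0, family = "gaussian")
cvfit$lambda.min              # lambda with the smallest cross-validated error
coef(cvfit, s = "lambda.min") # ridge coefficients at that lambda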
If your input data are not too large, you can also compute the ridge-regression weights from the training data directly via the closed-form solution solve(t(X) %*% X + lambda * I) %*% (t(X) %*% y), where X is the matrix of input variables, y is the response variable, and I is the identity matrix. The best value of lambda can then be chosen by cross-validation on a held-out dataset.
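A minimal sketch of that closed-form solution (the helper name ridge_coef is made up; note that, unlike glmnet, this penalizes every column, including any intercept column you add):
ridge_coef <- function(X, y, lambda) {
  # (X'X + lambda * I)^(-1) X'y
  p <- ncol(X)
  solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)
}
# e.g. ridge_coef(cbind(1, size, weight), log(cost), lambda = 0.1)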
I am currently trying to fit a generalized additive model (GAM) in R using a response variable and three predictor variables. One of the predictors is linear, and the dataset consists of 298 observations.
I have run the following code to generate a basic GAM:
GAM <- gam(response ~ linearpredictor + s(predictor2) + s(predictor3), data = data[2:5])
This produces a model with 18 degrees of freedom and seems to substantially overfit the data. I'm wondering how I might generate a GAM that maximizes smoothness and minimizes predictive error. I realize that each of these goals comes at the expense of the other, but is there a good way to find an optimal model that doesn't overfit?
Additionally, I need to perform leave-one-out cross-validation (LOOCV), and I am not sure how to make gam() do this in the {mgcv} package. Any help on either of these problems would be greatly appreciated. Thank you.
I have also generated 1,000,000 GAMs with varying combinations of smoothing parameters, and I varied the maximum degrees of freedom allowed from 10 (as shown in the code below) to 19. The variable combinations2 holds all 1,000,000 combinations of smoothing parameters I selected. The code below tries to balance degrees of freedom against AIC. It does run, but I'm not sure it will actually find the optimal model, and I also cannot tell how to make sure that it uses LOOCV.
BestGAM <- gam(response~ linearpredictor+ predictor2+ predictor3, data = data[2:5])
for (i in seq_len(nrow(combinations2))) {
  # refit with the i-th pair of candidate smoothing parameters
  PotentialGAM <- gam(response ~ linearpredictor + s(predictor2) + s(predictor3),
                      data = data[2:5],
                      sp = c(combinations2[i, ]$Var1, combinations2[i, ]$Var2))
  # keep the candidate if it uses <= 10 degrees of freedom and has lower AIC
  aic_tab <- AIC(PotentialGAM, BestGAM)
  if (aic_tab$df[1] <= 10 & aic_tab$AIC[1] < aic_tab$AIC[2]) {
    BestGAM <- PotentialGAM
    listNumber <- i
  }
}
You are fitting your GAM using generalised cross-validation (GCV) smoothness selection. GCV is a way to get around the invariance problem of ordinary cross-validation (OCV; what you call LOOCV) when estimating GAMs. GCV is equivalent to OCV on a rotated version of the fitting problem (rotating y - Xβ by Q, any orthogonal matrix); when fitting with GCV, {mgcv} doesn't actually need to do the rotation, and the expected GCV score isn't affected by it, so in that sense GCV is just OCV (Wood 2017, p. 260).
It has been shown that GCV can undersmooth (resulting in wigglier models) because the objective function (the GCV profile) can become flat around the optimum. It is therefore preferable to estimate GAMs (with penalized smooths) using REML or ML smoothness selection; add method = "REML" (or "ML") to your gam() call.
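With the model from your question, that would be something like:
library("mgcv")
GAM <- gam(response ~ linearpredictor + s(predictor2) + s(predictor3),
           data = data[2:5], method = "REML")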
If the REML or ML fit is as wiggly as the GCV one with your data, then I would presume gam() is not overfitting, but rather that there is something about your response data that hasn't been explained here (are the data ordered in time, for example?).
As to your question
how I might generate a GAM that maximizes smoothness and minimizes predictive error,
you are already doing that, using GCV smoothness selection and a particular definition of "smoothness" (in this case the squared second derivatives of the estimated smooths, integrated over the range of the covariates and summed over smooths).
If you want GCV but smoother models, you can increase the gamma argument above 1; gamma = 1.4 is often used, for example, which means that each effective degree of freedom (EDF) costs 40% more in the GCV criterion.
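For example (again with the model from your question):
GAM <- gam(response ~ linearpredictor + s(predictor2) + s(predictor3),
           data = data[2:5], gamma = 1.4)  # GCV selection, but each EDF costs 40% more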
FWIW, you can get the LOOCV (OCV) score for your model without actually fitting 298 GAMs, through the use of the influence (hat) matrix A. Here's a reproducible example using my {gratia} package:
library("gratia")
library("mgcv")
df <- data_sim("eg1", seed = 1)
m <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = df, method = "REML")
A <- influence(m)                    # diagonal of the influence (hat) matrix
r <- residuals(m, type = "response")
ocv_score <- mean((r / (1 - A))^2)   # OCV/LOOCV score via the leave-one-out identity
I fitted a non-linear regression using the nls() function.
power<- nls(formula= agw~a*area^b, data=calibration_6, start=list(a=1, b=1))
summary(power)
I have heard that R-squared is not valid for non-linear models, and that instead of R-squared we usually report the residual standard error, which R provides.
However, I would still like to know what the R-squared is. Is it possible to get an R-squared for an nls fit?
Many thanks!
I found a solution. This method might not be correct statistically (since R^2 is not valid for non-linear models), but I just want to see the overall goodness of fit of my non-linear model.
Step 1> Transform the data with log (common logarithm)
When I use the non-linear model, I can't check R^2:
nls(formula= agw~a*area^b, data=calibration, start=list(a=1, b=1))
Therefore, I transform my data to the log scale:
x1 <- log10(calibration$area)
y1 <- log10(calibration$agw)
cal <- data.frame(x1, y1)
Step 2> Fit the linear regression
logdata <- lm(formula = y1 ~ x1, data = cal)
summary(logdata)
This model gives y1 = -0.122 + 1.42 x1 (on the log10 scale).
But I want to force the intercept to zero, therefore:
Step 3> Force the intercept to zero
logdata2 <- lm(formula = y1 ~ 0 + x1, data = cal)
summary(logdata2)
Now the equation is y1 = 1.322 x1, which means log10(y) = 1.322 log10(x),
so y = x^1.322.
In this power-curve model the intercept is forced to zero, and the R^2 is 0.9994.
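As an aside, if you only want a rough goodness-of-fit number on the original scale, a pseudo-R^2 can be computed directly from the nls fit in the question (power and calibration_6 as defined there); this is a sketch, not a statistically rigorous measure:
# 1 - RSS/TSS on the original (untransformed) scale
rss <- sum(residuals(power)^2)
tss <- sum((calibration_6$agw - mean(calibration_6$agw))^2)
pseudo_r2 <- 1 - rss / tss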
I am using glmnet to train the logistic regression model and then try to obtain the coefficients with the specific lambda. I used the simple example here:
load("BinomialExample.RData")
fit = glmnet(x, y, family = "binomial")
coef(fit, s = c(0.05,0.01))
I have checked the values of fit$lambda; however, I could not find the specific values 0.05 or 0.01 in fit$lambda. So how can coef return coefficients for a lambda that is not in the fit$lambda vector?
This is explained in the help for coef.glmnet, specifically the exact argument:
exact
This argument is relevant only when predictions are made at values of s (lambda) different from those used in the fitting of the original model. If exact=FALSE (default), then the predict function uses linear interpolation to make predictions for values of s (lambda) that do not coincide with those used in the fitting algorithm. While this is often a good approximation, it can sometimes be a bit coarse. With exact=TRUE, these different values of s are merged (and sorted) with object$lambda, and the model is refit before predictions are made.
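So, to have the model refit exactly at the requested lambda values rather than interpolated, something like the following should work (recent glmnet versions require the original x and y to be supplied again when exact = TRUE):
coef(fit, s = c(0.05, 0.01), exact = TRUE, x = x, y = y)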
Hi, I'm looking for some clarification here.
Context: I want to draw a line through a scatterplot that doesn't assume a parametric form, so I am using geom_smooth() in a ggplot. It automatically reports geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method. I gather gam stands for generalized additive model, and that a cubic spline is used.
Are the following perceptions correct?
- Loess estimates the response at specific values.
- Splines are approximations that connect different piecewise functions fitted to the data (which make up the generalized additive model), and cubic splines are the specific type of spline used here.
Lastly, when should splines be used, and when should loess be used?
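For reference, both smoothers can be requested explicitly; a minimal sketch, with a hypothetical data frame dat containing columns x and y:
library(ggplot2)
library(mgcv)  # needed when method = "gam"

p <- ggplot(dat, aes(x = x, y = y)) + geom_point()
p + geom_smooth(method = "loess")                               # local regression
p + geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs"))  # penalized cubic regression spline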
I'm interested in applying a jackknife analysis in order to quantify the uncertainty of the coefficients estimated by a logistic regression. I'm using glm(family='binomial') because my response variable is in 0/1 format.
My dataset has 76,000 observations, and I'm using 7 independent variables plus an offset. The idea is to split the data into, say, 5 random subsets and then obtain the 7 estimated parameters by dropping one subset at a time from the dataset. From those fits I can then estimate the uncertainty of the parameters.
I understand the procedure but I'm unable to do it in R.
This is the model that I'm fitting:
glm(f_ocur ~ altitud + UTM_X + UTM_Y + j_sin + j_cos + temp_res + pp +
offset(log(1/off)), data = mydata, family = 'binomial')
Does anyone have an idea of how can I make this possible?
Jackknifing a logistic regression model is incredibly inefficient, but an easy, if time-intensive, approach would be something like this:
Formula <- f_ocur ~ altitud + UTM_X + UTM_Y + j_sin + j_cos + temp_res + pp +
  offset(log(1/off))

coefs <- sapply(1:nrow(mydata), function(i)
  coef(glm(Formula, data = mydata[-i, ], family = 'binomial'))
)
This is your matrix of leave-one-out coefficient estimates. The covariance matrix of this matrix estimates the covariance matrix of the parameter estimates.
A significant speed-up could be had by using glm's workhorse function, glm.fit. You can go even further by linearizing the model (use one-step estimation, limiting the Newton-Raphson algorithm to a single iteration); jackknife SEs based on the one-step estimators are still robust, unbiased, the whole bit...
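A minimal sketch of the glm.fit variant, assuming mydata, the predictors, and the offset column off exist as in the question (coefs comes out as a p x n matrix, one column per left-out observation):
X  <- model.matrix(~ altitud + UTM_X + UTM_Y + j_sin + j_cos + temp_res + pp,
                   data = mydata)
y  <- mydata$f_ocur
os <- log(1 / mydata$off)

coefs <- sapply(seq_len(nrow(mydata)), function(i)
  glm.fit(x = X[-i, , drop = FALSE], y = y[-i],
          offset = os[-i], family = binomial())$coefficients
)

# jackknife estimate of the covariance matrix of the coefficients
n        <- nrow(mydata)
centered <- coefs - rowMeans(coefs)
jack_cov <- (n - 1) / n * tcrossprod(centered)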