glm switches coefficient names for interactions - r

I use the R code:
dat<-data.frame(p1=c(0,1,1,0,0), GAMMA.1=c(1,2,3,4,3), VAR1=c(2,2,1,3,4), GAMMA.2=c(1,1,3,4,1))
form <- p1 ~ GAMMA.1:VAR1 + GAMMA.2:VAR1
mod <- glm(formula=form, data=dat, family=binomial)
(coef <- coefficients(mod))
# (Intercept) GAMMA.1:VAR1 VAR1:GAMMA.2
# 1.7974974 -0.2563667 -0.2181079
As we can see the names of coef for the interaction GAMMA.2:VAR1 is not in the same order as in form (we have VAR1:GAMMA.2 instead). For several reasons, I need the output
# (Intercept) GAMMA.1:VAR1 GAMMA.2:VAR1
# 1.7974974 -0.2563667 -0.2181079
without changing the names of the coefficients afterwards. Specifically, I want the same names for the coefficients as I used in the form object (without switching as in the code above). Can I tell glm() not to switch the names of the interactions?

The answer is no, not without a lot of rewriting functions. The order of the label of interaction terms is determined by the terms.formula function, which itself is determined by the termsform function buried deep in the C code. There are no parameters that you can pass termsform that give you the behaviour that you want (although keep.order looked promising, it does not do what you want).
You would have to rewrite the terms.formula function to "swap back" the names after output from termsform, and then override the terms.formula function with your patched version, but are you sure you want that? It will be far easier to change the names of the coefficients afterwards.
You could also use terms.formula preemptively, and determine how your formula would be reordered, using and then create a mapping vector.
dat<-data.frame(p1=c(0,1,1,0,0), GAMMA.1=c(1,2,3,4,3), VAR1=c(2,2,1,3,4), GAMMA.2=c(1,1,3,4,1))
form <- p1 ~ GAMMA.1:VAR1 + GAMMA.2:VAR1
new.names<-labels(terms(form,data=dat,keep.order=TRUE))
names(new.names)<-as.character(form[[3]][-1])
new.names
# GAMMA.1:VAR1 GAMMA.2:VAR1
# "GAMMA.1:VAR1" "VAR1:GAMMA.2"
You could use that vector to map names if you had the need later on.

I have two possible workarounds for you.
One workaround comes from the observation that terms within interaction labels are ordered according to their order of appearance in the formula. In your example, the order is GAMMA.1, VAR1, GAMMA.2. You may be able to rewrite your formula with a different ordering so that the formula and coefficient names match:
form <- p1 ~ VAR1:GAMMA.1 + VAR1:GAMMA.2
mod <- glm(formula=form, data=dat, family=binomial)
coefficients(mod)
# (Intercept) VAR1:GAMMA.1 VAR1:GAMMA.2
# 1.7974974 -0.2563667 -0.2181079
Another workaround is to rename the coefficients according to the formula when you pull them out:
rhs_terms <- c("(Intercept)",as.character(form[[3]][2:length(form[[3]])]))
(coef <- setNames(coefficients(mod), rhs_terms))
# (Intercept) GAMMA.1:VAR1 GAMMA.2:VAR1
# 1.7974974 -0.2563667 -0.2181079

Related

Combining grep() family of functions with a conditional if statment

I'm using LASSO as a variable selection method for my analysis, but there's one particular variable that I wish to ensure is contained in the final formula. I have automated the entire process to return the variables that LASSO selects and spits them into a character string formula e.g. formula = y~x1+x2+x3+... However there is one variable in particular I would like to keep in the formula even if LASSO does not select it. Now I could easily manually add this variable to the formula after the fact, but in the interest of improving my R skills I'm trying to automate the entire process.
My thoughts to achieve my goal so far was nesting the grep() function inside an ifelse() statement e.g. ifelse(grep("variable I'm concerned with",formula)!=1, formula=formula,formula=paste0(formula,'variable I'm concerned with',collapse="+")) but this has not done the trick.
Am I on the right track or can anyone think of alternative routes to take?
According to documentation
penalty.factor - Separate penalty factors can be applied to each
coefficient. This is a number that multiplies lambda to allow
differential shrinkage. Can be 0 for some variables, which implies no
shrinkage, and that variable is always included in the model. Default
is 1 for all variables (and implicitly infinity for variables listed
in exclude). Note: the penalty factors are internally rescaled to sum
to nvars, and the lambda sequence will reflect this change.
So apply this as an argument to glmnet using a penalty factor of 0 for your "key coefficient" and 1 elsewhere.
Formula is not a character object, but you might want to explore terms.formula if your goal is to edit formulas directly based on character output. terms objects are really powerful ways of doing variable subset and selection. But you really need to explore it because the formula language was not really meant to be automated easily, rather it was meant to be a convenient and readable way to specify model fits (look at how difficult SAS is by comparison).
f <- y ~ x1 +x2
t <- terms(f)
## drop 'x2'
i.x2 <- match('x2', attr(t, 'term.labels'))
t <- t[, -i.x2] ## drop the variable
## t is still a "terms" object but `lm` and related functions have implicit methods for interpreting as a "formula" object.
lm(t)
Currently, you are attempting to adjust character value of formula to a formula object which will not work given the different types. Instead, consider stats::update which will not add any terms not already included as a term:
lasso_formula <- as.formula("y ~ x1 + x2 + x3")
# EXISTING TERM
lasso_formula <- update(lasso_formula, ~ . + x3)
lasso_formula
# y ~ x1 + x2 + x3
# NEEDED VARIABLE
lasso_formula <- update(lasso_formula, ~ . + myTerm)
lasso_formula
# y ~ x1 + x2 + x3 + myTerm
Should formula be a character string, be sure to use grepl (not grep) in ifelse. And do not assign with = inside ifelse as it is a function itself returning a value itself and not to be confused with if...else:
lasso_formula <- "y ~ x1 + x2 + x3"
lasso_formula <- ifelse(grepl("myterm", lasso_formula),
lasso_formula,
paste(lasso_formula, "+ myterm"))
lasso_formula
# [1] "y ~ x1 + x2 + x3 + myterm"

Computing slope of changing data

In R, I have a dataset of (x, y) points that is constantly being updated via simulation (values are appended to the end of the dataset).
I would like to compute the slope (via a linear model) of the line created by the data using only the last 10 listed datapoints.
The confusion here arises from the fact that the data are changing, and so I suspect a loop may be needed to iterate over the indices of the datapoints.
In R, one usually does something like
linreg <- lm(y ~ x, data = d) # set up linear model
summary.linreg <- summary(linreg) # output summary of model
beta1 <- coef(summary.linreg)[2] # extract slope
The change that is needed in my case is in linreg, specifically
linreg <- lm(y[?] ~ x[?], data = d) # subset response and predictor
For a non-changing dataset of 10 x-y points, one simply does [?] = [1:10] and the problem is solved. In my case though, I am at a standstill as to the best way to proceed efficiently.
Any thoughts?
No, don't subset inside the formula. Subset the data.frame. Inside your loop, after each database update, do this:
linreg <- lm(y ~ x, data = tail(d, 10))
If you want to loop over a data.frame rows, do this:
linreg <- lm(y ~ x, data = d[i:(i+9),])
If your data.frame is large and you only need the slope, you should use the more low-level function lm.fit for better performance. There might also be packages that provide functions for rolling regression.

R: Finding solutions for new x values with nlmrt

Good day,
I have tried to figure this out, but I really can't!! I'll supply an example of my data in R:
x <- c(36,71,106,142,175,210,246,288,357)
y <- c(19.6,20.9,19.8,21.2,17.6,23.6,20.4,18.9,17.2)
table <- data.frame(x,y)
library(nlmrt)
curve <- "y~ a + b*exp(-0.01*x) + (c*x)"
ones <- list(a=1, b=1, c=1)
Then I use wrapnls to fit the curve and to find a solution:
solve <- wrapnls(curve, data=table, start=ones, trace=FALSE)
This is all fine and works for me. Then, using the following, I obtain a prediction of y for each of the x values:
predict(solve)
But how do I find the prediction of y for new x values? For instance:
new_x <- c(10, 30, 50, 70)
I have tried:
predict(solve, new_x)
predict(solve, 10)
It just gives the same output as:
predict(solve)
I really hope someone can help! I know if I use the values of 'solve' for parameters a, b, and c and substitute them into the curve formula with the desired x value that I would be able to this, but I'm wondering if there is a simpler option. Also, without plotting the data first.
Predict requires the new data to be a data.frame with column names that match the variable names used in your model (whether your model has one or many variables). All you need to do is use
predict(solve, data.frame(x=new_x))
# [1] 18.30066 19.21600 19.88409 20.34973
And that will give you a prediction for just those 4 values. It's somewhat unfortunate that any mistakes in specifying the new data results in the fitted values for the original model being returned. An error message probably would have been more useful, but oh well.

Extracting the Model Object in R from str()

I have a logit model object fit using glm2. The predictors are continuous and time varying so I am using basis splines. When I predict(FHlogit, foo..,) the model object it provides a prediction. All is well.
Now, what I would like to do is extract the part of FHLogit and the basis matrix the provides the prediction. I do not want to extract information about the model from str(FHLogit) I am trying to extract the part that says Beta * Predictor = 2. So, I can manipulate the basis matrix for each predictor
I don't think using basis splines will affect this. If so, please provide a reproducible example.
Here's a simple case:
df1 <- data.frame(y=c(0,1,0,1),
x1=seq(4),
x2=c(1,3,2,6))
library(glm2)
g1 <- glm2(y ~ x1 + x2, data=df1)
### default for type is "link"
> stats::predict.glm(g1, type="link")
1 2 3 4
0.23809524 0.66666667 -0.04761905 1.14285714
Now, being unsure how these no.s were arrived at we can look at the source for the above, with predict.glm. We can see that type="link" is the simplest case, returning
pred <- object$fitted.values # object is g1 in this case
These values are the predictions resulting from the original data * the coefficients, which we can verify with e.g.
all.equal(unname(predict.glm(g1, type="link")[1]),
unname(coef(g1)[1] + coef(g1)[2]*df1[1, 2] + coef(g1)[3]*df1[1, 3]))

Create function to automatically create plots from summary(fit <- lm( y ~ x1 + x2 +... xn))

I am running the same regression with small alterations of x variables several times. My aim is after having determined the fit and significance of each variable for this linear regression model to view all all major plots. Instead of having to create each plot one by one, I want a function to loop through my variables (x1...xn) from the following list.
fit <-lm( y ~ x1 + x2 +... xn))
The plots I want to create for all x are
1) 'x versus y' for all x in the function above
2) 'x versus predicted y
3) x versus residuals
4) x versus time, where time is not a variable used in the regression but provided in the dataframe the data comes from.
I know how to access the coefficients from fit, however I am not able to use the coefficient names from the summary and reuse them in a function for creating the plots, as the names are characters.
I hope my question has been clearly described and hasn't been asked already.
Thanks!
Create some mock data
dat <- data.frame(x1=rnorm(100), x2=rnorm(100,4,5), x3=rnorm(100,8,27),
x4=rnorm(100,-6,0.1), t=(1:100)+runif(100,-2,2))
dat <- transform(dat, y=x1+4*x2+3.6*x3+4.7*x4+rnorm(100,3,50))
Make the fit
fit <- lm(y~x1+x2+x3+x4, data=dat)
Compute the predicted values
dat$yhat <- predict(fit)
Compute the residuals
dat$resid <- residuals(fit)
Get a vector of the variable names
vars <- names(coef(fit))[-1]
A plot can be made using this character representation of the name if you use it to build a string version of a formula and translate that. The four plots are below, and the are wrapped in a loop over all the vars. Additionally, this is surrounded by setting ask to TRUE so that you get a chance to see each plot. Alternatively you arrange multiple plots on the screen, or write them all to files to review later.
opar <- par(ask=TRUE)
for (v in vars) {
plot(as.formula(paste("y~",v)), data=dat)
plot(as.formula(paste("yhat~",v)), data=dat)
plot(as.formula(paste("resid~",v)), data=dat)
plot(as.formula(paste("t~",v)), data=dat)
}
par(opar)
The coefficients are stored in the fit objects as you say, but you can access them generically in a function by referring to them this way:
x <- 1:10
y <- x*3 + rnorm(1)
plot(x,y)
fit <- lm(y~x)
fit$coefficient[1] # intercept
fit$coefficient[2] # slope
str(fit) # a lot of info, but you can see how the fit is stored
My guess is when you say you know how to access the coefficients you are getting them from summary(fit) which is a bit harder to access than taking them directly from the fit. By using fit$coeff[1] etc you don't have to have the name of the variable in your function.
Three options to directly answer what I think was the question: How to access the coefficients using character arguments:
x <- 1:10
y <- x*3 + rnorm(1)
fit <- lm(y~x)
# 1
fit$coefficient["x"]
# 2
coefname <- "x"
fit$coefficient[coefname]
#3
coef(fit)[coefname]
If the question was how to plot the various functions then you should supply a sufficiently complex construction (in R) to allow demonstration of methods with a well-specified set of objects.

Resources