Converting R formula format to mathematical equation

When we fit a statistical model in R, say
lm(y ~ x, data=dat)
we use R's special formula syntax: y ~ x.
Is there something that converts from such a formula to the corresponding equation? In this case it could be written as:
y = B0 + B1*x
This would be very useful! First, with more complicated formulae I don't trust my own translation. Second, in scientific papers written with R/Sweave/knitr, the model sometimes has to be reported in equation form, and for fully reproducible research we'd like to do that in an automated fashion.

Just had a quick play and got this working:
# define a function to take a linear regression
# (anything that supports coef() and terms() should work)
expr.from.lm <- function(fit) {
  # the terms we're interested in
  con <- names(coef(fit))
  # current expression (built from the inside out)
  expr <- quote(epsilon)
  # prepend expressions, working from the last symbol backwards
  for (i in length(con):1) {
    if (con[[i]] == '(Intercept)')
      expr <- bquote(beta[.(i-1)] + .(expr))
    else
      expr <- bquote(beta[.(i-1)] * .(as.symbol(con[[i]])) + .(expr))
  }
  # add in the response
  expr <- bquote(.(terms(fit)[[2]]) == .(expr))
  # convert to an expression (for easy plotting)
  as.expression(expr)
}
# generate and fit dummy data
df <- data.frame(iq=rnorm(10), sex=runif(10) < 0.5, weight=rnorm(10), height=rnorm(10))
f <- lm(iq ~ sex + weight + height, df)
# plot with our expression as the title
plot(resid(f), main=expr.from.lm(f))
There's obviously lots of freedom in what the variables are called, and in whether you actually want the coefficient estimates in there as well, but it seems good for a start.
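As a sketch of that last idea, here's a hypothetical variant (the function name, the digits argument, and the rounding are my own assumptions, not part of the answer above) that splices the fitted coefficient estimates into the expression in place of the symbolic betas:
expr.from.lm.coefs <- function(fit, digits = 2) {
  cf <- coef(fit)
  con <- names(cf)
  expr <- quote(epsilon)
  for (i in length(con):1) {
    # splice in the rounded estimate; note that a negative estimate
    # will render as '+ -0.75 * x' in plotmath
    b <- round(cf[[i]], digits)
    if (con[[i]] == '(Intercept)')
      expr <- bquote(.(b) + .(expr))
    else
      expr <- bquote(.(b) * .(as.symbol(con[[i]])) + .(expr))
  }
  expr <- bquote(.(terms(fit)[[2]]) == .(expr))
  as.expression(expr)
}
plot(resid(f), main=expr.from.lm.coefs(f))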

Related

Best way to tell if a formula contains a random effect?

I have a list of formulas that I would like to fit in a loop using a function. Some of these formulas are random effects models and others are straightforward linear models. I want the function to detect whether the model contains a random effect and if so, use lmer() to fit the model. Otherwise, it should use lm(). Any suggestions on how to check this condition (other than converting the formula to a string and checking for parentheses)? At this stage, they have the same class so I can't just check that. I could also use error handling to catch when lmer() returns an error from a model without a random effect and reroute towards regular lm(), but this also seems unnecessarily messy.
Example below:
fit_models <- function(formula_list) {
  models <- list()
  for (ii in seq_along(formula_list)) {
    if (formula_list[[ii]] is lmer) { # Enter condition here
      print("lmer")
    } else {
      print("lm")
    }
  }
}
f1 <- formula(y ~ x)
f2 <- formula(y ~ 1 + x + (1 + x | z))
formulas <- c(f1, f2)
fit_models(formulas)
I would say
length(lme4::findbars(f))>0
should reliably detect formulas containing a random-effects component (in the lme4 sense).
From the documentation for findbars:
From the right-hand side of a formula for a mixed-effects model, determine the pairs of expressions that are separated by the vertical bar operator.
This is (implicitly) the test that's done in the lme4 code, here ...
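For example, applied to the formulas from the question:
library(lme4)
f1 <- formula(y ~ x)
f2 <- formula(y ~ 1 + x + (1 + x | z))
length(findbars(f1)) > 0
# [1] FALSE
length(findbars(f2)) > 0
# [1] TRUE
And a sketch of how the check could fill in the question's placeholder (the data argument is my own assumption):
fit_models <- function(formula_list, data) {
  lapply(formula_list, function(f) {
    if (length(findbars(f)) > 0) lmer(f, data = data) else lm(f, data = data)
  })
}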
The symbols in formulas don't have inherent meanings. A function can reinterpret the symbols to mean whatever it likes, so just because there is a "|", that doesn't necessarily mean the formula has a random effect. That's just how lmer chose to interpret that symbol.
Given that formulas are basically just ordered collections of unevaluated symbols, there's not much more you can do than a basic equality check for a symbol, operating on just the formula itself. Rather than a straight-up character conversion, you could use all.names. So something like
f2 <- formula(y ~ 1 + x + (1 + x | z))
all.names(f2)
# [1] "~" "y" "+" "+" "x" "(" "|" "+" "x" "z"
"|" %in% all.names(f2)
# [1] TRUE
This won't be fooled if you have something like formula(`a|b` ~ x) where a|b is a (terrible) column name.
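A quick illustration of that claim:
f3 <- formula(`a|b` ~ x)
all.names(f3)
# [1] "~"   "a|b" "x"
"|" %in% all.names(f3)
# [1] FALSE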
You can just convert the formula to a character and look for the pipe operator |:
f1 <- formula(y ~ x)
f2 <- formula(y ~ 1 + x + (1 + x | z))
formulas <- c(f1, f2)
sapply(formulas, function(x) any(grepl("\\|", as.character(x))))
#> [1] FALSE TRUE

R: interaction between continuous and categorical vars in 'isat' regression ('gets' package)

I want to calculate the differential response of y to x (continuous) depending on the categorical variable z.
In the standard lm setup:
lm(y~ x:z)
However, I want to do this while allowing for Impulse Indicator Saturation (IIS) in the 'gets' package. The following syntax produces an error:
isat(y, mxreg=x:z, iis=TRUE)
The error message is of the form:
Error in solve.qr(out, tol = tol, LAPACK = LAPACK) :
  singular matrix 'a' in 'solve'
In addition: Warning messages:
1: In x:z : numerical expression has 96 elements: only the first used
2: In x:z : numerical expression has 96 elements: only the first used
How should I modify the syntax?
Thank you!
At the moment, alas, isat doesn't provide the same functionality as lm on categorical/character variables, nor on using * and :. We hope to address that in a future release.
In the meantime you'll have to create distinct variables in your dataset representing the interaction. I guess something like the following...
library(gets)
N <- 100
x <- rnorm(N)
z <- c(rep("A",N/4), rep("B",N/4), rep("C",N/4), rep("D",N/4))
e <- rnorm(N)
y <- 0.5*x*as.numeric(z=="A") + 1.5*x*as.numeric(z=="B") -
  0.75*x*as.numeric(z=="C") + 5*x*as.numeric(z=="D") + e
lm.reg <- lm(y ~ x:z)
arx.reg.0 <- arx(y, mxreg=x:z)  # the problematic call: x:z is not an interaction here
# build the interaction dummies by hand
data <- data.frame(y, x, z, stringsAsFactors=FALSE)
for (i in unique(z)) {
  data[[paste("Zx", i, sep=".")]] <- data$x * as.numeric(data$z==i)
}
arx.reg.1 <- arx(data$y, mxreg=data[,c("x","Zx.A","Zx.B","Zx.C")])
isat.1 <- isat(data$y, mc=TRUE, mxreg=data[,c("x","Zx.A","Zx.B","Zx.C")],
               max.block.size=20)
Note that since you'll be creating dummies for each category, there's a chance those dummies will cause singularity of your matrix of explanatory variables (if, as in my example, isat automatically uses 4 blocks). Using the max.block.size argument enables you to avoid this problem.
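As an aside, a hypothetical shortcut (my own suggestion, not part of the answer above): model.matrix() can build the same interaction columns that lm() would create, which avoids the manual loop. Something like
# assumption: 'data' is the data frame constructed above
mm <- model.matrix(~ x:z - 1, data=data)
colnames(mm)
# [1] "x:zA" "x:zB" "x:zC" "x:zD"
isat.2 <- isat(data$y, mc=TRUE, mxreg=mm, max.block.size=20)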
Let me know if I haven't addressed your particular point.

Error in Gradient Descent Calculation

I tried to write a function to calculate gradient descent for a linear regression model. However, the answers I was getting do not match the answers I get using the normal equation method.
My sample data is:
df <- data.frame(c(1,5,6),c(3,5,6),c(4,6,8))
with c(4,6,8) being the y values.
lm_gradient_descent <- function(df, learning_rate, y_col=length(df), scale=TRUE) {
  n_features <- length(df)  # n_features is the number of features in the data set
  # using mean normalization to scale features
  if (scale==TRUE) {
    for (i in 1:n_features) {
      df[,i] <- (df[,i]-mean(df[,i]))/sd(df[,i])
    }
  }
  y_data <- df[,y_col]
  df[,y_col] <- NULL
  par <- rep(1,n_features)
  df <- merge(1,df)
  data_mat <- data.matrix(df)
  # we need a temp_arr to store each iteration of parameter values so that
  # we can do a simultaneous update
  temp_arr <- rep(0,n_features)
  diff <- 1
  while (diff>0.0000001) {
    for (i in 1:n_features) {
      temp_arr[i] <- par[i]-learning_rate*sum((data_mat%*%par-y_data)*df[,i])/length(y_data)
    }
    diff <- par[1]-temp_arr[1]
    print(diff)
    par <- temp_arr
  }
  return(par)
}
Running this function,
lm_gradient_descent(df,0.0001,,0)
the results I got were
c(0.9165891,0.6115482,0.5652970)
when I use the normal equation method, I get
c(2,1,0).
Hope someone can shed some light on where I went wrong in this function.
You used the stopping criterion
old parameters - new parameters <= 0.0000001
First of all I think there's an abs() missing if you want to use this criterion (though my ignorance of R may be at fault).
But even if you use
abs(old parameters - new parameters) <= 0.0000001
this is not a good stopping criterion: it only tells you that progress has slowed down, not that it's already sufficiently accurate. Try instead simply to iterate for a fixed number of iterations. Unfortunately it's not that easy to give a good, generally applicable stopping criterion for gradient descent here.
It seems that you have not implemented a bias term. In a linear model like this, you always want to have an additional additive constant, i.e., your model should be like
w_0 + w_1*x_1 + ... + w_n*x_n.
Without the w_0 term, you usually won't get a good fit.
I know this is a couple of weeks old at this point, but I'm going to take a stab at it for several reasons, namely:
1) I'm relatively new to R, so deciphering your code and rewriting it is good practice for me;
2) I'm working on a different gradient descent problem, so this is all fresh to me;
3) I need the Stack Overflow points; and
4) as far as I can tell you never got a working answer.
First, regarding your data structures: you start with a dataframe, rename a column, strip out a vector, then strip out a matrix. It would be a lot easier to just start with an X matrix (capitalized, since its component 'features' are referred to as x_i) and a y solution vector.
X <- cbind(c(1,5,6),c(3,5,6))
y <- c(4,6,8)
We can easily see what the desired solutions are, with and without scaling, by fitting a linear model. (Note: we only scale X/features, not y/solutions.)
> lm(y~X)

Call:
lm(formula = y ~ X)

Coefficients:
(Intercept)           X1           X2
         -4           -1            3

> lm(y~scale(X))

Call:
lm(formula = y ~ scale(X))

Coefficients:
(Intercept)    scale(X)1    scale(X)2
      6.000       -2.646        4.583
With regards to your code, one of the beauties of R is that it can perform matrix multiplication which is significantly faster than using loops.
lm_gradient_descent <- function(X, y, learning_rate, scale=TRUE) {
  if (scale==TRUE) { X <- scale(X) }
  X <- cbind(1, X)          # add the bias/intercept column
  theta <- rep(0, ncol(X))  # your old temp_arr
  diff <- 1
  old.error <- sum((X %*% theta - y)^2) / (2*length(y))
  while (diff > 0.000000001) {
    # simultaneous update of all parameters via matrix multiplication
    theta <- theta - learning_rate * t(X) %*% (X %*% theta - y) / length(y)
    new.error <- sum((X %*% theta - y)^2) / (2*length(y))
    diff <- abs(old.error - new.error)
    old.error <- new.error
  }
  return(theta)
}
And to show it works...
> lm_gradient_descent(X, y, .01, 0)
           [,1]
[1,] -3.9360685
[2,] -0.9851775
[3,]  2.9736566
versus the expected (-4, -1, 3).
For what it's worth, while I agree with @cfh that I would prefer a loop with a defined number of iterations, I'm actually not sure you need the abs() function: if diff < 0, your function is not converging.
Finally, rather than using something like old.error and new.error, I'd suggest using a vector that records all errors. You can then plot that vector to see how quickly your function converges.
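A quick sketch of that idea (my own illustration, combining @cfh's fixed-iteration suggestion with an error trace; the function name and default iteration count are assumptions):
lm_gd_trace <- function(X, y, learning_rate, iters = 5000) {
  X <- cbind(1, X)          # bias column, as above
  theta <- rep(0, ncol(X))
  errors <- numeric(iters)  # record the cost at every iteration
  for (k in seq_len(iters)) {
    theta <- theta - learning_rate * t(X) %*% (X %*% theta - y) / length(y)
    errors[k] <- sum((X %*% theta - y)^2) / (2*length(y))
  }
  plot(errors, type="l", xlab="iteration", ylab="cost")  # convergence curve
  theta
}
lm_gd_trace(X, y, 0.01)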

Use variable in GLM quasi specification

I'm fitting a GLM to some data, using a quasi-likelihood approach (family=quasi(...)).
I'd like to use a variable, p in the variance specification, like so:
family = quasi(link=log, variance=mu^p)
This, however, doesn't work (it no longer recognises mu).
Is there any way to get R to just insert the value of p into the expression before it is evaluated, so I can use p instead of a number?
Here's an example that doesn't work:
set.seed(1)
x <- runif(100)
y <- x^2+2*x+sin(2*pi*x) + rnorm(100)
fitModel <- function(x, y, p) {
  model <- glm(y~x, family=quasi(link=log, variance=mu^p))
  return(model)
}
fitModel(x,y,2)
Thanks!
The family function does fancy parsing, which means the paste0 solution suggested in the comments won't work without jumping through considerable hoops. Also, the following fails if any of the y values are <= 0, so I changed the example a little bit (if you do have non-positive response values you'll have to think about what you want to do about this ...)
set.seed(1)
x <- seq(2,10,length=100)
y <- x^2+2*x+sin(2*pi*x) + rnorm(100)
What I did was to create a quasi family object, then modify its variance function on the fly.
pfamily <- quasi(link="log", variance="mu")
fitModel <- function(x, y, p) {
  # swap in a variance function that closes over p
  pfamily[["variance"]] <- function(mu) mu^p
  model <- glm(y~x, family=pfamily)
  model
}
fitModel(x, y, 2)
fitModel(x, y, 1)
For what it's worth, this variant can handle arbitrary values of p, so e.g. you can draw a curve of the deviance over the variance power:
dfun <- function(p) {
  deviance(fitModel(x, y, p))
}
pvec <- seq(0.1, 3, by=0.1)
dvec <- sapply(pvec, dfun)
par(las=1, bty="l")
plot(pvec, dvec, type="b", xlab="variance power", ylab="deviance")

Create function to automatically create plots from summary(fit <- lm( y ~ x1 + x2 +... xn))

I am running the same regression, with small alterations of the x variables, several times. My aim, after having determined the fit and significance of each variable for this linear regression model, is to view all major plots. Instead of creating each plot one by one, I want a function that loops through my variables (x1 ... xn) from the following fit:
fit <- lm(y ~ x1 + x2 + ... + xn)
The plots I want to create for all x are:
1) x versus y, for all x in the model above
2) x versus predicted y
3) x versus residuals
4) x versus time, where time is not a variable used in the regression but is provided in the dataframe the data comes from.
I know how to access the coefficients from fit; however, I am not able to take the coefficient names from the summary and reuse them in a function for creating the plots, since the names are characters.
I hope my question has been clearly described and hasn't been asked already.
Thanks!
Create some mock data
dat <- data.frame(x1=rnorm(100), x2=rnorm(100,4,5), x3=rnorm(100,8,27),
                  x4=rnorm(100,-6,0.1), t=(1:100)+runif(100,-2,2))
dat <- transform(dat, y=x1+4*x2+3.6*x3+4.7*x4+rnorm(100,3,50))
Make the fit
fit <- lm(y~x1+x2+x3+x4, data=dat)
Compute the predicted values
dat$yhat <- predict(fit)
Compute the residuals
dat$resid <- residuals(fit)
Get a vector of the variable names
vars <- names(coef(fit))[-1]
A plot can be made using this character representation of the name if you use it to build a string version of a formula and translate that. The four plots are below, wrapped in a loop over all the vars. Additionally, this is surrounded by setting ask to TRUE so that you get a chance to see each plot. Alternatively, you can arrange multiple plots on the screen, or write them all to files to review later.
opar <- par(ask=TRUE)
for (v in vars) {
  plot(as.formula(paste("y~",v)), data=dat)
  plot(as.formula(paste("yhat~",v)), data=dat)
  plot(as.formula(paste("resid~",v)), data=dat)
  plot(as.formula(paste("t~",v)), data=dat)
}
par(opar)
The coefficients are stored in the fit objects as you say, but you can access them generically in a function by referring to them this way:
x <- 1:10
y <- x*3 + rnorm(1)
plot(x,y)
fit <- lm(y~x)
fit$coefficient[1] # intercept
fit$coefficient[2] # slope
str(fit) # a lot of info, but you can see how the fit is stored
My guess is that when you say you know how to access the coefficients, you are getting them from summary(fit), which is a bit harder to work with than taking them directly from the fit. By using fit$coefficient[1] etc. you don't have to have the name of the variable in your function.
Three options to directly answer what I think was the question: How to access the coefficients using character arguments:
x <- 1:10
y <- x*3 + rnorm(1)
fit <- lm(y~x)
# 1
fit$coefficient["x"]
# 2
coefname <- "x"
fit$coefficient[coefname]
# 3
coef(fit)[coefname]
If the question was how to plot the various functions, then you should supply a sufficiently complex construction (in R) to allow demonstration of the methods with a well-specified set of objects.
