Quadratic GLM in R with interactions?

So I have a question about using quadratic (second-order) predictors with GLMs in R. Basically I have three predictor variables (x, y, z) and a response variable (let's call it ozone).
x, y, and z are not quadratic predictors yet, so I square them:
x2 <- x^2 (same for y and z)
Now I understand that if I wanted to model ozone based on these predictor variables I would use the poly() or polym() function.
However, when it comes to using interaction terms between these three variables, that's where I get lost. For example, if I wanted to model the interaction between the quadratic predictors of x and y, I believe I would be typing in something like this:
ozone ~ x + x2 + y + y2 + x*y + x2*y + x*y2 + x2*y2 (I hope this is right)
My question is: is there an easier way of inputting this (with three variables that's a lot of typing)? My other question is why the quadratic predictor flips signs in the coefficients: when I just run the predictor variable x the coefficient is positive, but when I use a quadratic predictor the coefficient almost always ends up negative.
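One way to cut down the typing (a sketch, not from the original thread; the data frame name dat is an assumption) is to let poly() with raw = TRUE generate the linear and squared columns and cross them in the formula, since the * operator expands to all products of the columns:

# Minimal sketch, assuming a data frame `dat` with columns ozone, x, y, z
# and a response for which the default gaussian family is appropriate.
# poly(x, 2, raw = TRUE) supplies x and x^2 as raw (untransformed) columns;
# crossing two such terms with * gives all products of their columns, i.e.
# x, x^2, y, y^2, x:y, x^2:y, x:y^2, x^2:y^2 -- the same terms as above.
fit_xy <- glm(ozone ~ poly(x, 2, raw = TRUE) * poly(y, 2, raw = TRUE),
              data = dat)

# All three predictors with their pairwise quadratic interactions:
fit_xyz <- glm(ozone ~ (poly(x, 2, raw = TRUE) + poly(y, 2, raw = TRUE) +
                          poly(z, 2, raw = TRUE))^2,
               data = dat)

On the sign question: without raw = TRUE, poly() uses orthogonal polynomials, whose coefficients are not on the same scale as x and x^2, and even with raw powers the linear coefficient can change sign once the correlated x^2 column enters the model, so a flip on its own is not necessarily a problem.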

Related

Why do calculated residuals differ between R functions `lm()` and `lm.fit()`

I'm trying to switch from lm() to the faster lm.fit() to speed up calculating r² values from large matrices. (I don't think I can use cor(), per Function to calculate R2 (R-squared) in R, when x is a matrix.)
Why do lm() and lm.fit() calculate different fitted values and residuals?
set.seed(0)
x <- matrix(runif(50), 10)
y <- 1:10
lm(y ~ x)$residuals
lm.fit(x, y)$residuals
I wasn't able to penetrate the lm() source code to figure out what could be contributing to the difference...
From ?lm.fit, x "should be design matrix of dimension n * p", where p is the number of coefficients. You therefore have to pass a column of ones for the intercept to get the same model.
Thus estimating
lm.fit(cbind(1,x), y)
will give the same parameters as
lm(y ~ x)
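As a quick check (a sketch, not from the original answer, reusing the x and y simulated above), the two calls agree once the column of ones is added:

fit1 <- lm(y ~ x)
fit2 <- lm.fit(cbind(1, x), y)

# Coefficients and residuals now match (names aside); both return TRUE
all.equal(unname(coef(fit1)), unname(fit2$coefficients))
all.equal(unname(residuals(fit1)), unname(fit2$residuals))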

Polynomial fitting with R using poly vs. I function

I'm trying to understand polynomial fitting with R. From my research on the internet, there seem to be two methods. Assuming I want to fit a cubic curve ax^3 + bx^2 + cx + d to some dataset, I can either use:
lm(dataset, formula = y ~ poly(x, 3))
or
lm(dataset, formula = y ~ x + I(x^2) + I(x^3))
However, as I try them in R, I end up with two different curves with completely different intercepts and coefficients. Is there anything about polynomials I'm not getting right here?
This comes down to what the different functions do. poly generates orthonormal polynomials. Compare the values of poly(dataset$x, 3) to I(dataset$x^3). Your coefficients will be different because the values being passed directly into the linear model (as opposed to indirectly, through either the I or poly function) are different.
As 42 pointed out, your predicted values will be fairly similar. If a is your first linear model and b is your second, b$fitted.values - a$fitted.values should be fairly close to 0 at all points.
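To see both points side by side, here is a minimal sketch with made-up data (x, y, and the coefficient values below are illustrative only, not from the question):

set.seed(1)
x <- runif(100)
y <- 1 + 2*x - 3*x^2 + 0.5*x^3 + rnorm(100, sd = 0.1)

a <- lm(y ~ poly(x, 3))            # orthogonal polynomial basis
b <- lm(y ~ x + I(x^2) + I(x^3))   # raw powers

coef(a)                            # different coefficients...
coef(b)
all.equal(fitted(a), fitted(b))    # ...but numerically identical fitted values

# poly() with raw = TRUE reproduces the raw-power coefficients:
coef(lm(y ~ poly(x, 3, raw = TRUE)))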
I got it now. There seems to be a difference between R's computation of raw polynomials vs. orthogonal polynomials. Thanks, everyone, for the help.

How can multivariate linear regression be linear in nature?

As per my limited knowledge, linear functions have only two variables which define them, namely x and y.
However, as per multivariate linear regression,
h(x) = theta^T * x
where theta is an (n+1)x1 vector of parameters
and x is the vector of input variables x0, x1, x2, ..., xn
There are multiple variables involved. Does it not change the nature of the graph and consequently the nature of the function itself?
linear functions have only two variables which define them, namely x and y
This is not accurate; the definition of a linear function is a function that is linear in its independent variables.
What you refer to is simply the special case of only one independent variable x, where
y = a*x + b
and the plot in the (x, y) axes is a straight line, hence the historical origin of the term "linear" itself.
In the general case of k independent variables x1, x2, ..., xk, the linear function equation is written as
y = a1*x1 + a2*x2 + ... + ak*xk + b
whose form you can immediately recognize as the same as the multiple linear regression equation.
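To make the correspondence concrete, here is a small illustration (not part of the original answer; the data and coefficient values are made up):

# y = b + a1*x1 + a2*x2 + noise, with k = 2 independent variables
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1.5 + 2*x1 - 0.7*x2 + rnorm(n, sd = 0.3)

coef(lm(y ~ x1 + x2))   # recovers (b, a1, a2) up to sampling noise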
Notice that your use of the term multivariate is also wrong - you actually mean multivariable, i.e. multiple independent variables (x's); the first term means multiple dependent variables (y's):
Note that multivariate regression is distinct from multivariable
regression, which has only one dependent variable.
(source)

Mixed model starting values for lme4

I am trying to fit a mixed model using the lmer function from the lme4 package. However, I do not understand what should be input to the start parameter.
My aim is to first fit a simple linear regression and then use the coefficients estimated there as starting values for the mixed model.
Let's say that my model is the following:
linear_model = lm(y ~ x1 + x2 + x3, data = data)
coef = summary(linear_model)$coefficients[- 1, 1] #I remove the intercept
result = lmer(y ~ x1 + x2 + x3 | x1 + x2 + x3, data = data, start = coef)
This example is an oversimplified version of what I am doing since I won't be able to share my data.
Then I get the following kind of error:
Error during wrapup: incorrect number of theta components (!=105) #105 is the value I get from the real regression I am trying to fit.
I have tried many different solutions, trying to provide a list and name those values theta like I saw suggested on some forums.
Also, the GitHub code tests whether the length is appropriate, but I can't find what it refers to:
# Assign the start value to theta
if (is.numeric(start)) {
  theta <- start
}
# Check the length of theta
length(theta) != length(pred$theta)
However, I can't find where pred$theta is defined, so I don't understand where that value 105 is coming from.
Any help?
A few points:
lmer doesn't in fact fit any of the fixed-effect coefficients explicitly; these are profiled out so that they are solved for implicitly at each step of the nonlinear estimation process. The estimation involves only a nonlinear search over the variance-covariance parameters. This is detailed (rather technically) in one of the lme4 vignettes (eqs. 30-31, p. 15). Thus providing starting values for the fixed-effect coefficients is impossible, and useless ...
glmer does fit fixed-effect coefficients explicitly as part of the nonlinear optimization (as @G.Grothendieck discusses in comments), if nAGQ>0 ...
it's admittedly rather obscure, but the starting values for the theta parameters (the only ones that are explicitly optimized in lmer fits) are 0 for the off-diagonal elements of the Cholesky factor, 1 for the diagonal elements: this is coded here
ll$theta[] <- is.finite(ll$lower) # initial values of theta are 0 off-diagonal, 1 on
... where you need to know further that, upstream, the values of the lower vector have been coded so that elements of the theta vector corresponding to diagonal elements have a lower bound of 0, off-diagonal elements have a lower bound of -Inf; this is equivalent to starting with an identity matrix for the scaled variance-covariance matrix (i.e., the variance-covariance matrix of the random-effects parameters divided by the residual variance), or a random-effects variance-covariance matrix of (sigma^2 I).
If you have several random effects and big variance-covariance matrices for each, things can get a little hairy. If you want to recover the starting values that lmer will use by default you can use lFormula() as follows:
library(lme4)
ff <- lFormula(Reaction~Days+(Days|Subject),sleepstudy)
(lwr <- ff$reTrms$lower)
## [1] 0 -Inf 0
ifelse(lwr==0,1,0) ## starting values
## [1] 1 0 1
For this model, we have a single 2x2 random-effects variance-covariance matrix. The theta parameters correspond to the lower-triangle Cholesky factor of this matrix, in column-wise order, so the first and third elements are diagonal, and the second element is off-diagonal.
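If you do want to pass start explicitly, lmer() accepts a list with a theta component whose length and ordering must match reTrms$lower shown above; a minimal sketch for the same sleepstudy model (the values here are just the default identity start, to show the syntax):

# Continues the example above (lme4 and sleepstudy already loaded);
# theta = (diag, off-diag, diag) of the lower-triangle Cholesky factor.
m <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy,
          start = list(theta = c(1, 0, 1)))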
The fact that you have 105 theta parameters worries me; fitting such a large random-effects model will be extremely slow and take an enormous amount of data to fit reliably. (If you know your model makes sense and you have enough data you might want to look into faster options, such as using Doug Bates's MixedModels package for Julia or possibly glmmTMB, which might scale better than lme4 for problems with large theta vectors ...)
your model formula, y ~ x1 + x2 + x3 | x1 + x2 + x3, seems very odd. I can't figure out any context in which it would make sense to have the same variables as random-effect terms and grouping variables in the same model!

Two-stage least squares regression with binomial response variable

Hi, I'd like to run a two-stage least squares regression with a binomial response variable.
For a continuous response variable, I use the tsls() function from the R package "sem".
Here are my commands; I want to know whether I'm doing this right.
(x: endogenous variable, z: instrumental variable, y: response variable (0 or 1))
xhat<-lm(x~z)$fitted.values
R<-glm(y~xhat, family=binomial)
R$residuals<-c(y - R$coef[1]+x*R$coef[2])
Thank you
This isn't quite right. For the glm() function in R, the family=binomial option defaults to a logistic regression. So, your residual calculation does not transform the residuals as you want.
You can use the residuals.glm() function to automatically generate the residuals for a generalized linear model, and the residuals() function for a linear model. Similarly, you can use the predict() function, instead of lm(x~z)$fitted.values, to get the xhats. However, since in this case you have xhat and x (not the same variable), use code close to your solution above (but using the plogis() function for the inverse-logit transformation):
stageOne <- lm(x ~ z)
xhat <- predict(stageOne)
stageTwo <- glm(y ~ xhat, family=binomial)
residuals <- y - plogis(stageTwo$coef[1] + stageTwo$coef[2]*x)
A nice feature of the predict() and residuals() (or residuals.glm()) functions is that they can be reused with other models and, via predict()'s newdata argument, applied to other datasets.
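Putting the whole two-stage procedure together on simulated data (a sketch under made-up assumptions: z is a valid instrument, x is continuous, y is binary; all numbers are illustrative):

set.seed(123)
n <- 500
z <- rnorm(n)                                     # instrument
u <- rnorm(n)                                     # unobserved confounder
x <- 0.8*z + 0.5*u + rnorm(n)                     # endogenous regressor
y <- rbinom(n, 1, plogis(-0.2 + 1.0*x + 0.7*u))   # binary response

stageOne <- lm(x ~ z)
xhat     <- predict(stageOne)
stageTwo <- glm(y ~ xhat, family = binomial)

# Residuals on the response scale, evaluated at the observed x as above:
res <- y - plogis(stageTwo$coef[1] + stageTwo$coef[2] * x)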
