Why do calculated residuals differ between R functions `lm()` and `lm.fit()` - r

I'm trying to switch from lm() to the faster lm.fit() in order to calculate r² values from large matrices more quickly. (I don't think I can use cor(), per Function to calculate R2 (R-squared) in R, when x is a matrix.)
Why do lm() and lm.fit() calculate different fitted values and residuals?
set.seed(0)
x <- matrix(runif(50), 10)
y <- 1:10
lm(y ~ x)$residuals
lm.fit(x, y)$residuals
I wasn't able to penetrate the lm() source code to figure out what could be contributing to the difference...

From ?lm.fit, x "should be design matrix of dimension n * p", where p is the number of coefficients. lm() adds an intercept column to the design matrix automatically, but lm.fit() does not, so you have to pass a column of ones yourself to get the same model.
Thus estimating
lm.fit(cbind(1,x), y)
will give the same parameters as
lm(y ~ x)
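A quick check with the setup from the question confirms that the fitted values and residuals then agree as well:
set.seed(0)
x <- matrix(runif(50), 10)
y <- 1:10
all.equal(unname(lm(y ~ x)$residuals),
          unname(lm.fit(cbind(1, x), y)$residuals))
## [1] TRUE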

Related

Estimating bias in linear regression and linear mixed model in R simulation

I want to run simulations to estimate the bias in a linear model and a linear mixed model. The bias is E(beta_hat) - beta, where beta is the true association between my X and Y.
I generated my X variable from a normal distribution and Y from a multivariate normal distribution.
I understand how I can calculate E(beta_hat) from the simulations: it is the mean of the beta estimates across all simulations. What I am not sure about is how to determine the true beta.
This is how I generated meanY (betaV is the effect size), which is then used to generate the multivariate Y outcome, as shown below:
meanY <- meanY + X*betaV
Y[jj,] <- rnorm(nRep, mean=meanY[jj], sd=sqrt(varY))
From my limited understanding, the true beta is not obtained from the data but from the simulation setting, where I set a fixed beta value.
Based on how I generated my data, how can I estimate the true beta?
There are a couple of ways to estimate bias by simulation. I'll take an easy example using a linear model. A linear mixed model could use a similar approach, although I am not certain it would carry over cleanly to a generalized linear mixed model.
A simple method for estimating bias with a linear model is to 'choose' the true model against which the bias will be estimated. Let's say, for example, Y = 3 + 4 * X + e. I have chosen beta <- c(3, 4), so I only need to simulate my data. For a linear model, the model assumptions are:
Observations are independent
Observations are normally distributed
The mean can be described by the linear predictor
Using these 3 assumptions, simulating a fixed design is simple.
set.seed(1)
xseq <- seq(-10,10)
xlen <- length(xseq)
nrep <- 100
#Simulate X from a uniform distribution over the grid (a normal distribution would likely work fine as well)
X <- sample(xseq, size = xlen * nrep, replace = TRUE)
beta <- c(3, 4)
esd <- 1
emu <- 0
e <- rnorm(xlen * nrep, emu, esd)
Y <- cbind(1, X) %*% beta + e
fit <- lm(Y ~ X)
bias <- coef(fit) - beta
> bias
 (Intercept)            X
0.0121017239 0.0001369908
which indicates a small bias. To test whether this bias is significant, we could perform a Wald test or t-test (or replicate the process 1000 times and check the distribution of outcomes).
#Simulate linear model many times
model_frame <- cbind(1,X)
emany <- matrix(rnorm(xlen * nrep * 1000, emu, esd),ncol = 1000)
#add simulated noise. Sweep adds X %*% beta across all columns of emany
Ymany <- sweep(emany, 1, model_frame %*% beta, "+")
#fit many models simultaneously (lm is awesome!)
manyFits <- lm(Ymany ~ X)
#Plot density of fitted parameters
par(mfrow=c(1,2))
plot(density(coef(manyFits)[1,]), main = "Density of intercept")
plot(density(coef(manyFits)[2,]), main = "Density of beta")
#Calculate bias; here I use sweep to subtract beta across all rows of my coefficients
biasOfMany <- rowMeans(sweep(coef(manyFits), 1, beta, "-"))
> biasOfMany
  (Intercept)             X
 5.896473e-06 -1.710337e-04
Here we see that the bias is reduced quite a bit, and it has changed sign for beta_X, giving reason to believe the bias is insignificant.
Changing the design would allow one to look into bias of interactions, outliers and other stuff using the same method.
For linear mixed models one could use the same approach, but here you would also have to design the random variables, which requires some more work, and the implementation of lmer, as far as I know, does not fit a model across all columns of Y.
However, b (the random effects) can be simulated, and so can any noise parameters. Note, though, that because b is a single vector containing one draw of the simulation (often from a multivariate normal distribution), you have to re-fit the model for each simulation of b. This increases the number of times the model-fitting procedure has to be run in order to get a good estimate of the bias; see the sketch below.
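For what it's worth, a minimal sketch of that loop (purely illustrative, not part of the original answer; it assumes a single random intercept per group and modest simulation settings) could look like this:
library(lme4)
set.seed(2)
ngrp <- 30                              # number of groups
nper <- 10                              # observations per group
beta <- c(3, 4)                         # true fixed effects
sdb  <- 2                               # sd of the random intercepts
sde  <- 1                               # residual sd
X    <- runif(ngrp * nper, -10, 10)
grp  <- factor(rep(seq_len(ngrp), each = nper))
nsim <- 200
est <- replicate(nsim, {
  b <- rnorm(ngrp, 0, sdb)              # new random effects in every run
  Y <- beta[1] + beta[2] * X + b[as.integer(grp)] + rnorm(ngrp * nper, 0, sde)
  fixef(lmer(Y ~ X + (1 | grp)))
})
rowMeans(est) - beta                    # simulated bias of the fixed effects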

How does glmnet standardize variables when weights are present?

glmnet allows the user to input a vector of observation weights through the weights argument. glmnet also standardizes (by default) the predictor variables to have zero mean and unit variance. My question is: when weights is provided, does glmnet standardize the predictors using the weighted mean (and standard deviation) of each column or the unweighted mean (and standard deviation)?
There's a description of glmnet's standardization at Link
In that post you can see the Fortran code snippet from glmnet's source that computes the standardization (the "Proof" paragraph, second bullet).
I'm not familiar with Fortran, but to me it looks very much like it is in fact using the weighted mean and sd.
Edit: From the glmnet vignette:
"weights is for the observation weights. Default is 1 for each
observation. (Note: glmnet rescales the weights to sum to N, the
sample size.)"
With w in the Fortran code being the rescaled weights, this seems to be consistent with weighted mean standardization.
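To make this concrete, here is a rough sketch (my own illustration, not glmnet's actual code) of what standardization with the rescaled weights would look like:
set.seed(1)
xm <- matrix(rnorm(100 * 5), 100)          # example predictor matrix
wt <- runif(100, 0.5, 2)                   # example observation weights
wr <- wt * nrow(xm) / sum(wt)              # rescale weights to sum to N
wmean <- colSums(wr * xm) / nrow(xm)       # weighted column means
xc <- sweep(xm, 2, wmean, "-")             # center
wsd <- sqrt(colSums(wr * xc^2) / nrow(xm)) # weighted column sds (1/N convention)
xm_std <- sweep(xc, 2, wsd, "/")           # standardized predictors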
For what it's worth, consistent with the accepted answer, the weights in glmnet are sampling weights, and not inverse variance weights. For example if you have many more observations than unique observations, you can compress your dataset and get the same coefficient estimates:
library(glmnet)
n <- 50
m <- 5
y_norm <- rnorm(n)
y_bool <- rbinom(n,1,.5)
x <- matrix(rnorm(n*m),n)
w <- rpois(n,3) + 1 # weights
w_indx <- rep(1:n,times=w) # weights index
m1 = glmnet(x, y_norm, weights = w)
m2 = glmnet(x[w_indx, ], y_norm[w_indx])
all.equal(coef(m1, s = .1),
          coef(m2, s = .1))
## [1] TRUE
M1 = glmnet(x, y_bool, weights = w, family = "binomial")
M2 = glmnet(x[w_indx, ], y_bool[w_indx], family = "binomial")
all.equal(coef(M1, s = .1),
          coef(M2, s = .1))
## [1] TRUE
Of course, a bit more care is needed when using weights with cv.glmnet, since the weights of aggregated records should be spread across folds using a multinomial distribution; see the sketch below.
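As one pragmatic workaround (my own sketch, assuming the expanded index w_indx from the example above), you can simply cross-validate on the expanded records, so that each duplicated observation is assigned to a fold on its own:
# CV on the expanded data; each repeated record can land in a different fold
cv_expanded <- cv.glmnet(x[w_indx, ], y_norm[w_indx])
cv_expanded$lambda.min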

Quadratic GLM in R with interactions?

So I have a question about using quadratic (second-order) predictors with GLMs in R. Basically I have three predictor variables (x, y, z) and a response variable (let's call it ozone).
x, y, and z are not quadratic predictors yet, so I square them:
x2 <- x^2 # same for y and z
Now I understand that if I wanted to model ozone based on these predictor variables I could use the poly() or polym() function.
However, when it comes to using interaction terms between these three variables, that's where I get lost. For example, if I wanted to model the interaction between the quadratic predictors of x and y, I believe I would type something like this:
ozone ~ x + x2 + y + y2 + x*y + x2*y + x*y2 + x2*y2 (I hope this is right)
My question is: is there an easier way of inputting this (with three variables that's a lot of typing)? My other question is: why does the quadratic predictor flip signs in the coefficients? When I just use the predictor variable x the coefficient is positive, but when I add a quadratic predictor the coefficient almost always ends up being negative.
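Purely as an illustration of the kind of shorthand being asked about (this is not from the question; dat is a made-up data frame), formula operators can generate the quadratic and interaction terms automatically:
# made-up example data so that the formulas below run
set.seed(1)
dat <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100))
dat$ozone <- with(dat, 1 + x - 0.5 * x^2 + y + rnorm(100))
# full second-order surface in x and y, written out with I()
fit_long <- glm(ozone ~ x + I(x^2) + y + I(y^2) + x:y + I(x^2):y + x:I(y^2) + I(x^2):I(y^2),
                data = dat)
# the same model, generated automatically with poly(..., raw = TRUE)
fit_poly <- glm(ozone ~ poly(x, 2, raw = TRUE) * poly(y, 2, raw = TRUE), data = dat)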

Mixed model starting values for lme4

I am trying to fit a mixed model using the lmer function from the lme4 package. However, I do not understand what should be input to the start parameter.
My aim is to fit a simple linear regression first and use the coefficients estimated there as starting values for the mixed model.
Let's say that my model is the following:
linear_model = lm(y ~ x1 + x2 + x3, data = data)
coef = summary(linear_model)$coefficients[- 1, 1] #I remove the intercept
result = lmer(y ~ x1 + x2 + x3 | x1 + x2 + x3, data = data, start = coef)
This example is an oversimplified version of what I am doing since I won't be able to share my data.
Then I get the following kind of error:
Error during wrapup: incorrect number of theta components (!=105) #105 is the value I get from the real regression I am trying to fit.
I have tried many different solutions, trying to provide a list and name those values theta like I saw suggested on some forums.
Also, the GitHub code tests whether the length is appropriate, but I can't find what it refers to:
# Assign the start value to theta
if (is.numeric(start)) {
theta <- start
}
# Check the length of theta
length(theta)!=length(pred$theta)
However I can't find where pred$theta is defined and so I don't understand where that value 105 is coming from.
Any help?
A few points:
lmer doesn't in fact fit any of the fixed-effect coefficients explicitly; these are profiled out so that they are solved for implicitly at each step of the nonlinear estimation process. The estimation involves only a nonlinear search over the variance-covariance parameters. This is detailed (rather technically) in one of the lme4 vignettes (eqs. 30-31, p. 15). Thus providing starting values for the fixed-effect coefficients is impossible, and useless ...
glmer does fit fixed-effects coefficients explicitly as part of the nonlinear optimization (as #G.Grothendieck discusses in comments), if nAGQ>0 ...
it's admittedly rather obscure, but the starting values for the theta parameters (the only ones that are explicitly optimized in lmer fits) are 0 for the off-diagonal elements of the Cholesky factor, 1 for the diagonal elements: this is coded here
ll$theta[] <- is.finite(ll$lower) # initial values of theta are 0 off-diagonal, 1 on
... where you need to know further that, upstream, the values of the lower vector have been coded so that elements of the theta vector corresponding to diagonal elements have a lower bound of 0, off-diagonal elements have a lower bound of -Inf; this is equivalent to starting with an identity matrix for the scaled variance-covariance matrix (i.e., the variance-covariance matrix of the random-effects parameters divided by the residual variance), or a random-effects variance-covariance matrix of (sigma^2 I).
If you have several random effects and big variance-covariance matrices for each, things can get a little hairy. If you want to recover the starting values that lmer will use by default you can use lFormula() as follows:
library(lme4)
ff <- lFormula(Reaction~Days+(Days|Subject),sleepstudy)
(lwr <- ff$reTrms$lower)
## [1] 0 -Inf 0
ifelse(lwr==0,1,0) ## starting values
## [1] 1 0 1
For this model, we have a single 2x2 random-effects variance-covariance matrix. The theta parameters correspond to the lower-triangle Cholesky factor of this matrix, in column-wise order, so the first and third elements are diagonal, and the second element is off-diagonal.
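For completeness (not part of the original answer): if you do want to supply starting values explicitly, lmer accepts them for theta through the start argument, e.g. reproducing the defaults recovered above:
start_theta <- ifelse(ff$reTrms$lower == 0, 1, 0)   # the default starting values
m <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy,
          start = list(theta = start_theta))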
The fact that you have 105 theta parameters worries me; fitting such a large random-effects model will be extremely slow and take an enormous amount of data to fit reliably. (If you know your model makes sense and you have enough data you might want to look into faster options, such as using Doug Bates's MixedModels package for Julia or possibly glmmTMB, which might scale better than lme4 for problems with large theta vectors ...)
your model formula, y ~ x1 + x2 + x3 | x1 + x2 + x3, seems very odd. I can't figure out any context in which it would make sense to have the same variables as random-effect terms and grouping variables in the same model!

Residuals and plots in ordered multinomial regression

I need to make a binned residual plot of fitted versus residual values from an ordered multinomial logit regression.
How can I extract residuals when using polr? Is there any other function that fits an ordered multinomial logit from which residuals can be extracted?
This is the code I used
options(contrasts = c("contr.treatment", "contr.poly"))
mod1 <- polr(as.ordered(y) ~ x1 + x2 + x3, data, method='logistic')
fit <- mod1$fitted.values
res <- residuals(mod1)
binnedplot(fit, res)
The problem is that the object res is NULL.
Thanks
For a start, can you tell us how residuals would be defined in principle for a model with categorical responses? fitted.values is a matrix of probabilities. You could define residuals in terms of correct prediction (defining the most likely outcome as the prediction, as in the default predict method for polr objects) -- or you could compute a cross-tabulation (confusion matrix) of observed versus predicted categories. Alternatively you could reduce the ordinal data back to an integer scale and compute a mean outcome as the prediction ... but I can't see that there's any unique way to define the residuals in the first place.
polr() does not provide a function that returns residuals; you have to calculate them manually from whatever definition you choose.
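For instance (my own illustration, not part of the original answer; mod1 is the polr fit from the question), you could treat the observed category as a 0/1 indicator matrix and subtract the fitted probabilities:
probs <- mod1$fitted.values                    # n x K matrix of category probabilities
obs   <- model.response(model.frame(mod1))     # observed ordered factor
ind   <- diag(ncol(probs))[as.integer(obs), ]  # 0/1 indicator of the observed category
res_mat <- ind - probs                         # one "residual" column per category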
There are actually plenty of ways to get residuals from an ordinal probit/logit. Although polr does not provide any residuals, vglm provides several. See ?residualsvglm from the VGAM package (see also below).
NOTE: However, for a Control Function/2SRI approach, Wooldridge (2014) suggests using the generalised residuals as described in Vella (1993). These are, as far as I know, not currently available in R (although I am working on that), but they are available in Stata (using predict gr, score).
Residuals in VGLM
Surrogate residuals for polr
You can use the package sure (link) to calculate surrogate residuals with resids(). The package is based on this paper in the Journal of the American Statistical Association.
library(sure) # for residual function and sample data sets
library(MASS) # for polr function
df1 <- df1
df1$x1 <- df1$x
df1$x <- NULL
df1$y <- df2$y
df1$x2 <- df2$x
df1$x3 <- df3$x
options(contrasts = c("contr.treatment", "contr.poly"))
mod1 <- polr(as.ordered(y) ~ x1 + x2 + x3, data=df1, method='probit')
fit <- mod1$fitted.values
res <- resids(mod1)
EDIT: One big issue is the following (from ?resids):
"Note: Surrogate residuals require sampling from a continuous distribution; consequently, the result will be different with every call to resids. The internal functions used for sampling from truncated distributions when method = "latent" are based on modified versions of rtrunc and qtrunc."
Even when running resids(mod1, nsim=1000, method="latent"), the result does not converge to a stable outcome across calls.
