Regression with weights: Less standardized residuals then observations - r

I modelled a multiple Regression based on the Mincer-Wage-Equation and I added a weighting-factor to make it representative for the whole population.
But when I'm adding the weights function into my modell, R calculates less standardized residuals than I have observations.
Here's my modell:
lm(log(earings) ~ Gender + Age + Age^2 + Education, weights= phrf)
So I got problems to analyze the residuals because when I'm trying to plot the rstandard against the fitted.values R is telling: Different Variable Length in rstandard() found.
This Problem ist only by rstandard and rstudent, when I'm plotting the normal resid() against fitted.values there is no problem.
And when I'm leaving out the weights function I have not problems, too.

In the help file for rstudent():
Note that cases with weights == 0 are dropped from all these functions, but that if a linear model has been fitted with na.action = na.exclude, suitable values are filled in for the cases excluded during fitting.
A simple example to demonstrate:
set.seed(123)
x <- 1:100
y <- x + rnorm(100)
w <- runif(100)
w[44] <- 0
fit <- lm(y ~ x, weights=w)
length(fitted(fit))
length(rstudent(fit))
Gives:
> length(fitted(fit))
[1] 100
> length(rstudent(fit))
[1] 99
And this makes sense. If you have a weight of 0, the theoretical variance is 0 which is an infinite studentized or standardized residual.
Since you are effectively deleting those observations, you can subset the call to lm with subset=w!=0 or you can use that flag for the fitted values:
plot(fitted(fit)[w!=0], rstudent(fit))

Related

Identifying lead/lags using multivariate regression analysis

I have three time-series variables (x,y,z) measured in 3 replicates. x and z are the independent variables. y is the dependent variable. t is the time variable. All the three variables follow diel variation, they increase during the day and decrease during the night. An example with a simulated dataset is below.
library(nlme)
library(tidyverse)
n <- 100
t <- seq(0,4*pi,,100)
a <- 3
b <- 2
c.unif <- runif(n)
amp <- 2
datalist = list()
for(i in 1:3){
y <- 3*sin(b*t)+rnorm(n)*2
x <- 2*sin(b*t+2.5)+rnorm(n)*2
z <- 4*sin(b*t-2.5)+rnorm(n)*2
data = as_tibble(cbind(y,x,z))%>%mutate(t = 1:100)%>% mutate(replicate = i)
datalist[[i]] <- data
}
df <- do.call(rbind,datalist)
ggplot(df)+
geom_line(aes(t,x),color='red')+geom_line(aes(t,y),color='blue')+
geom_line(aes(t,z),color = 'green')+facet_wrap(~replicate, nrow = 1)+theme_bw()
I can identify the lead/lag of y with respect to x and z individually. This can be done with ccf() function in r. For example
ccf(x,y)
ccf(z,y)
But I would like to do it in a multivariate regression approach. For example, nlme package and lme function indicates y and z are negatively affecting x
lme = lme(data = df, y~ x+ z , random=~1|replicate, correlation = corCAR1( form = ~ t| replicate))
It is impossible (in actual data) that x and z can negatively affect y.
I need the time-lead/lag and also I would like to get the standardized coefficient (t-value to compare the effect size), both from the same model.
Is there any multivariate model available that can give me the lead/lag and also give me regression coefficient?
We might be considering the " statistical significance of Cramer Rao estimation of a lower bound". In order to find Xbeta-Xinfinity, taking the expectation of Xbeta and an assumed mean neu; will yield a variable, neu^squared which can replace Xinfinity. Using the F test-likelihood ratio, the degrees of freedom is p2-p1 = n-p2.
Put it this way, the estimates are n=(-2neu^squared/neu^squared+n), phi t = y/Xbeta and Xbeta= (y-betazero)/a.
The point estimate is derived from y=aXbeta + b: , Xbeta. The time lead lag is phi t and the standardized coefficient is n. The regression generates the lower bound Xbeta, where t=beta.
Spectral analysis of the linear distribution indicates a point estimate beta zero = 0.27 which is a significant peak of
variability. Scaling Xbeta by Betazero would be an appropriate idea.

Categorical Regression with Centered Levels

R's standard way of doing regression on categorical variables is to select one factor level as a reference level and constraining the effect of that level to be zero. Instead of constraining a single level effect to be zero, I'd like to constrain the sum of the coefficients to be zero.
I can hack together coefficient estimates for this manually after fitting the model the standard way:
x <- lm(data = mtcars, mpg ~ factor(cyl))
z <- c(coef(x), "factor(cyl)4" = 0)
y <- mean(z[-1])
z[-1] <- z[-1] - y
z[1] <- z[1] + y
z
## (Intercept) factor(cyl)6 factor(cyl)8 factor(cyl)4
## 20.5021645 -0.7593074 -5.4021645 6.1614719
But that leaves me without standard error estimates for the former reference level that I just added as an explicit effect, and I need to have those as well.
I did some searching and found the constrasts functions, and tried
lm(data = mtcars, mpg ~ C(factor(cyl), contr = contr.sum))
but this still only produces two effect estimates. Is there a way to change which constraint R uses for linear regression on categorical variables properly?
Think I've figured it out. Using contrasts actually is the right way to go about it, you just need to do a little work to get the results into a convenient looking form. Here's the fit:
fit <- lm(data = mtcars, mpg ~ C(factor(cyl), contr = contr.sum))
Then the matrix cs <- contr.sum(factor(cyl)) is used to get the effect estimates and the standard error.
The effect estimates just come from multiplying the contrast matrix by the effect estimates lm spits out, like so:
cs %*% coef(fit)[-1]
The standard error can be calculated using the contrast matrix and the variance-covariance matrix of the coefficients, like so:
diag(cs %*% vcov(fit)[-1,-1] %*% t(cs))

How to unscale the coefficients from an lmer()-model fitted with a scaled response

I fitted a model in R with the lmer()-function from the lme4 package. I scaled the dependent variable:
mod <- lmer(scale(Y)
~ X
+ (X | Z),
data = df,
REML = FALSE)
I look at the fixed-effect coefficients with fixef(mod):
> fixef(mod)
(Intercept) X1 X2 X3 X4
0.08577525 -0.16450047 -0.15040043 -0.25380073 0.02350007
It is quite easy to calculate the means by hand from the fixed-effects coefficients. However, I want them to be unscaled and I am unsure how to do this exactly. I am aware that scaling means substracting the mean from every Y and deviding by the standard deviation. But both, mean and standard deviation, were calculated from the original data. Can I simply reverse this process after I fitted an lmer()-model by using the mean and standard deviation of the original data?
Thanks for any help!
Update: The way I presented the model above seems to imply that the dependent variable is scaled by taking the mean over all responses and dividing by the standard deviation of all the responses. Usually, it is done differently. Rather than taking the overall mean and standard deviation the responses are standardized per subject by using the mean and standard deviation of the responses of that subject. (This is odd in an lmer() I think as the random intercept should take care of that... Not to mention the fact that we are talking about calculating means on an ordinal scale...) The problem however stays the same: Once I fitted such a model, is there a clean way to rescale the coefficients of the fitted model?
Updated: generalized to allow for scaling of the response as well as the predictors.
Here's a fairly crude implementation.
If our original (unscaled) regression is
Y = b0 + b1*x1 + b2*x2 ...
Then our scaled regression is
(Y0-mu0)/s0 = b0' + (b1'*(1/s1*(x1-mu1))) + b2'*(1/s2*(x2-mu2))+ ...
This is equivalent to
Y0 = mu0 + s0((b0'-b1'/s1*mu1-b2'/s2*mu2 + ...) + b1'/s1*x1 + b2'/s2*x2 + ...)
So bi = s0*bi'/si for i>0 and
b0 = s0*b0'+mu0-sum(bi*mui)
Implement this:
rescale.coefs <- function(beta,mu,sigma) {
beta2 <- beta ## inherit names etc.
beta2[-1] <- sigma[1]*beta[-1]/sigma[-1]
beta2[1] <- sigma[1]*beta[1]+mu[1]-sum(beta2[-1]*mu[-1])
beta2
}
Try it out for a linear model:
m1 <- lm(Illiteracy~.,as.data.frame(state.x77))
b1 <- coef(m1)
Make a scaled version of the data:
ss <- scale(state.x77)
Scaled coefficients:
m1S <- update(m1,data=as.data.frame(ss))
b1S <- coef(m1S)
Now try out rescaling:
icol <- which(colnames(state.x77)=="Illiteracy")
p.order <- c(icol,(1:ncol(state.x77))[-icol])
m <- colMeans(state.x77)[p.order]
s <- apply(state.x77,2,sd)[p.order]
all.equal(b1,rescale.coefs(b1S,m,s)) ## TRUE
This assumes that both the response and the predictors are scaled.
If you scale only the response and not the predictors, then you should submit (c(mean(response),rep(0,...)) for m and c(sd(response),rep(1,...)) for s (i.e., m and s are the values by which the variables were shifted and scaled).
If you scale only the predictors and not the response, then submit c(0,mean(predictors)) for m and c(1,sd(predictors)) for s.

'predict' gives different results than using manually the coefficients from 'summary'

Let me state my confusion with the help of an example,
#making datasets
x1<-iris[,1]
x2<-iris[,2]
x3<-iris[,3]
x4<-iris[,4]
dat<-data.frame(x1,x2,x3)
dat2<-dat[1:120,]
dat3<-dat[121:150,]
#Using a linear model to fit x4 using x1, x2 and x3 where training set is first 120 obs.
model<-lm(x4[1:120]~x1[1:120]+x2[1:120]+x3[1:120])
#Usig the coefficients' value from summary(model), prediction is done for next 30 obs.
-.17947-.18538*x1[121:150]+.18243*x2[121:150]+.49998*x3[121:150]
#Same prediction is done using the function "predict"
predict(model,dat3)
My confusion is: the two outcomes of predicting the last 30 values differ, may be to a little extent, but they do differ. Whys is it so? should not they be exactly same?
The difference is really small, and I think is just due to the accuracy of the coefficients you are using (e.g. the real value of the intercept is -0.17947075338464965610... not simply -.17947).
In fact, if you take the coefficients value and apply the formula, the result is equal to predict:
intercept <- model$coefficients[1]
x1Coeff <- model$coefficients[2]
x2Coeff <- model$coefficients[3]
x3Coeff <- model$coefficients[4]
intercept + x1Coeff*x1[121:150] + x2Coeff*x2[121:150] + x3Coeff*x3[121:150]
You can clean your code a bit. To create your training and test datasets you can use the following code:
# create training and test datasets
train.df <- iris[1:120, 1:4]
test.df <- iris[-(1:120), 1:4]
# fit a linear model to predict Petal.Width using all predictors
fit <- lm(Petal.Width ~ ., data = train.df)
summary(fit)
# predict Petal.Width in test test using the linear model
predictions <- predict(fit, test.df)
# create a function mse() to calculate the Mean Squared Error
mse <- function(predictions, obs) {
sum((obs - predictions) ^ 2) / length(predictions)
}
# measure the quality of fit
mse(predictions, test.df$Petal.Width)
The reason why your predictions differ is because the function predict() is using all decimal points whereas on your "manual" calculations you are using only five decimal points. The summary() function doesn't display the complete value of your coefficients but approximate the to five decimal points to make the output more readable.

Testing differences in coefficients including interactions from piecewise linear model

I'm running a piecewise linear random coefficient model testing the influence of a covariate on the second piece. Thereby, I want to test whether the coefficient of the second piece under the influence of the covariate (piece2 + piece2:covariate) differs from the coefficient of the first piece (piece1), hence whether the growth rate differs.
I set up some exemplary data:
set.seed(100)
# set up dependent variable
temp <- rep(seq(0,23),50)
y <- c(rep(seq(0,23),50)+rnorm(24*50), ifelse(temp <= 11, temp + runif(1200), temp + rnorm(1200) + (temp/sqrt(temp))))
# set up ID variable, variables indicating pieces and the covariate
id <- sort(rep(seq(1,100),24))
piece1 <- rep(c(seq(0,11), rep(11,12)),100)
piece2 <- rep(c(rep(0,12), seq(1,12)),100)
covariate <- c(rep(0,24*50), rep(c(rep(0,12), rep(1,12)), 50))
# data frame
example.data <- data.frame(id, y, piece1, piece2, covariate)
# run piecewise linear random effects model and show results
library(lme4)
lmer.results <- lmer(y ~ piece1 + piece2*covariate + (1|id) , example.data)
summary(lmer.results)
I came across the linearHypothesis() command from the car package to test differences in coefficients. However, I could not find an example on how to use it when including interactions.
Can I even use linearHypothesis() to test this or am I aiming for the wrong test?
I appreciate your help.
Many thanks in advance!
Mac
Assuming your output looks like this
Estimate Std. Error t value
(Intercept) 0.26293 0.04997 5.3
piece1 0.99582 0.00677 147.2
piece2 0.98083 0.00716 137.0
covariate 2.98265 0.09042 33.0
piece2:covariate 0.15287 0.01286 11.9
if I understand correctly what you want, you are looking for the contrast:
piece1-(piece2+piece2:covariate)
or
c(0,1,-1,0,-1)
My preferred tool for this is function estimable in gmodels; you could also do it by hand or with one of the functions in Frank Harrel's packages.
library(gmodels)
estimable(lmer.results,c(0,1,-1,0,-1),conf.int=TRUE)
giving
Estimate Std. Error p value Lower.CI Upper.CI
(0 1 -1 0 -1) -0.138 0.0127 0 -0.182 -0.0928

Resources