In R, comparing two regression coefficients from the same model

Using R, I want to statistically compare two coefficients from the same regression. In Stata there is the command test B1 = B2. What is the equivalent in R? I checked several posts, but none of them answered this question:
https://stats.stackexchange.com/questions/33013/what-test-can-i-use-to-compare-slopes-from-two-or-more-regression-models
SPSS: Comparing regression coefficient from multiple models
Comparing regression models in R
Here are some simulated data.
library('MASS')
library('magrittr')  # provides the %>% pipe used below
mu <- c(0,0,0)
Sigma <- matrix(.5, nrow=3, ncol=3) + diag(3)*0.3
MyData <- mvrnorm(n=10000, mu=mu, Sigma=Sigma) %>%
  as.data.frame()
names(MyData) = c('v1', 'v2', 'y')
MyModel = lm(y ~ v1 * v2, data = MyData)
summary(MyModel)
I want to compare the estimate for v1 with the one for v2, so that if v1 and v2 are manipulated I can say something like "the influence of v1 on y is significantly higher than the influence of v2 on y".

You can try the multcomp package. If you look at the coefficients of your model:
coefficients(MyModel)
(Intercept) v1 v2 v1:v2
0.006961219 0.373547048 0.394760005 -0.012167754
You want to find the difference between the 2nd and 3rd term, so your contrast matrix is:
# yes it looks a bit weird at first
ctr = rbind("v1-v2"=c(0,1,-1,0))
And we can apply this using glht():
library(multcomp)
summary(glht(MyModel, ctr))
Simultaneous Tests for General Linear Hypotheses
Fit: lm(formula = y ~ v1 * v2, data = MyData)
Linear Hypotheses:
Estimate Std. Error t value Pr(>|t|)
v1-v2 == 0 -0.02121 0.01640 -1.294 0.196
(Adjusted p values reported -- single-step method)
This works for most general linear models. In your summary() output, the significance of each term is based on the estimate divided by its standard error; the glht() function does something similar for the contrast. One exception I can think of, for logistic regression, is when you have complete separation.
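If you prefer not to add a dependency, the same Wald-type test can be computed by hand from coef() and vcov(); a minimal sketch, assuming the MyModel fit from above:
# manual version of the v1 - v2 contrast (same idea as glht)
ctr  <- c(0, 1, -1, 0)                                # (Intercept), v1, v2, v1:v2
est  <- sum(ctr * coef(MyModel))                      # estimated difference v1 - v2
se   <- drop(sqrt(t(ctr) %*% vcov(MyModel) %*% ctr))  # standard error of the difference
tval <- est / se
pval <- 2 * pt(-abs(tval), df = df.residual(MyModel))
c(estimate = est, std.error = se, t.value = tval, p.value = pval)
For a single contrast this gives the same t value and p value as the glht() output above, since no multiplicity adjustment is involved.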

Related

Too many coefficients with lm

I'm studying the relationship between expenditure per student and performance on PISA (a standardized test). I know that this regression can't give me a ceteris paribus relationship, but that is the point of my exercise: I have to explain why it will not work.
I ran the regression in R with the basic code:
lm1=lm(a~b)
but the problem is that R reports 32 coefficients, which is the number of units in my population, whereas I should only get the slope and the intercept, given that this is a simple regression.
This is the output that R gives me:
Call:
lm(formula = a ~ b)
Coefficients:
(Intercept) b10167.3 b10467.8 b10766.4 b10863.4 b10960.1 b11.688.4 b11028.1 b11052 b11207.3 b11855.9 b12424.3 b13930.8
522.9936 5.9561 0.3401 -20.6884 -14.8603 -15.0777 -3.5752 -23.0459 -27.1021 -42.2692 -20.4485 -35.3906 -30.7468
b14353.3 b2.997.9 b20450.9 b3714.8 b4996.3 b5291.6 b5851.7 b6190.7 b6663.3 b6725.3 b6747.2 b7074.9 b8189.1
-18.4412 -107.2872 -39.6793 -98.2315 -80.2505 -36.2202 -48.6179 -64.2414 1.3887 -19.0389 -59.9734 -32.0751 -31.5962
b8406.2 b8533.5 b8671.1 b8996.3 b9265.7 b9897.2
-13.4219 -26.0155 -13.9045 -37.9996 -17.0271 -27.2954
As you can see there are 32 coefficients while I should get only two. It seems that R is treating each unit of the population as a separate level, even though the dataset is, as usual, arranged with one observation per row. I can't figure out what the problem is.
It's not a problem with the lm function. It appears that R is treating $b$ as a categorical variable.
I made a small dataset with 5 observations, $a$ (a numeric variable) and $b$ (a categorical variable).
When I fit the model you will see output similar to yours (5 estimated coefficients).
data = data.frame(a = 1:5, b = as.factor(rnorm(5)))
lm(a~b, data)
Call:
lm(formula = a ~ b, data = data)
Coefficients:
(Intercept) b-0.16380292500502 b0.213340249988902 b0.423891299272316 b0.63738307939327
4 -3 -1 1 -2
To correct this you need to convert $b$ into a numerical vector.
data$b = as.numeric(as.character(data$b))
lm(a~b, data)
Call:
lm(formula = a ~ b, data = data)
Coefficients:
(Intercept) b
2.9580 0.2772
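A likely reason b ended up as text in the first place is number formatting in the source file (the coefficient names like b11.688.4 and b2.997.9 hint at thousands separators). A small sketch of how one might diagnose and fix this at import time; the data frame name, file name, and separator/decimal arguments below are assumptions, not taken from the question:
str(mydata)   # check how each column was imported; 'mydata' is a placeholder name

# if the spending column came in as character/factor because of
# "1.234,5"-style numbers, tell read.csv about the separators when re-reading
mydata <- read.csv("pisa_spending.csv", sep = ";", dec = ",")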

Regression with weights: fewer standardized residuals than observations

I modelled a multiple regression based on the Mincer wage equation, and I added a weighting factor to make it representative of the whole population.
But when I add the weights argument to my model, R calculates fewer standardized residuals than I have observations.
Here's my model:
lm(log(earnings) ~ Gender + Age + I(Age^2) + Education, weights = phrf)
So I have trouble analyzing the residuals, because when I try to plot rstandard() against the fitted values, R reports: different variable lengths found in rstandard().
This problem occurs only with rstandard() and rstudent(); when I plot the plain resid() against the fitted values there is no problem.
And when I leave out the weights argument there are no problems either.
In the help file for rstudent():
Note that cases with weights == 0 are dropped from all these functions, but that if a linear model has been fitted with na.action = na.exclude, suitable values are filled in for the cases excluded during fitting.
A simple example to demonstrate:
set.seed(123)
x <- 1:100
y <- x + rnorm(100)
w <- runif(100)
w[44] <- 0
fit <- lm(y ~ x, weights=w)
length(fitted(fit))
length(rstudent(fit))
Gives:
> length(fitted(fit))
[1] 100
> length(rstudent(fit))
[1] 99
And this makes sense: if an observation has weight 0, its theoretical variance is 0, which would make the studentized or standardized residual infinite.
Since you are effectively deleting those observations, you can subset the call to lm() with subset = w != 0, or you can apply that condition when plotting against the fitted values:
plot(fitted(fit)[w!=0], rstudent(fit))
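For completeness, a small sketch of the subset route mentioned above, reusing x, y and w from the example:
# refit without the zero-weight case so that fitted values and
# studentized residuals have the same length
fit2 <- lm(y ~ x, weights = w, subset = w != 0)
length(fitted(fit2))    # 99
length(rstudent(fit2))  # 99
plot(fitted(fit2), rstudent(fit2))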

Is a model-based recursive partitioning model from the family of the mixed effect models?

I was wondering whether it is correct to say that a model-based recursive partitioning model (mob, package partykit) belongs to the family of mixed-effects models.
My point is that a mixed-effects model provides different parameters for each level of a random effect, and that is also what a mob model does. The main difference I see is that a mob finds the partition of the random effect itself.
Here is an example:
Here is an example:
library(partykit); library(lme4)
set.seed(321)
##### Random data
V1 <- runif(100); V2 <- sample(1:3, 100, replace=T)
V3 <- jitter(ifelse(V2 == 1, 2*V1+3, ifelse(V2==2, -1*V1+2, V1)), amount=.2)
##### Mixed-effect model
me <- lmer(V3 ~ V1 + (1 + V1|V2))
coef(me) #linear model coefficients from the mixel effect model
#$V2
# (Intercept) V1
#1 2.99960082 1.9794378
#2 1.96874586 -0.8992926
#3 0.01520725 1.0255424
##### MOB
fit <- function(y, x, start = NULL, weights = NULL, offset = NULL) lm(y ~ x)
mo <- mob(V3 ~ V1|V2, fit=fit) #equivalent to lmtree
coef(mo) #linear model (same) coefficients from the mob
# (Intercept) x(Intercept) xV1
#2 2.99928854 NA 1.9804084
#4 1.97185661 NA -0.9047805
#5 0.01333292 NA 1.0288309
No, the kind of linear regression-based MOB (lmtree) is not a mixed-effects type of model. However, you used the MOB tree to estimate an interaction model (or nested effect) and indeed mixed-effects models can also be used to do so.
Your data-generating process implements a different intercept and V1 slope for every level of V2. If this interaction is known it can be easily recovered by a suitable linear regression with interaction effect (but V2 should be a categorical factor variable for this).
V2 <- factor(V2)
mi <- lm(V3 ~ 0 + V2 / V1)
matrix(coef(mi), ncol = 2)
## [,1] [,2]
## [1,] 2.99928854 1.9804084
## [2,] 1.97185661 -0.9047805
## [3,] 0.01333292 1.0288309
Note that the model fit is equivalent to lm(V3 ~ V1 * V2) but uses a different contrast coding for the coefficients.
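A quick way to check that equivalence (a small sketch continuing the example above):
mi2 <- lm(V3 ~ V1 * V2)             # same model, default contrast coding
all.equal(fitted(mi), fitted(mi2))  # TRUE: only the coefficient coding differs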
The estimates obtained above are exactly identical to the lmtree() output (or manually using mob() + lm() as you did in your post):
coef(lmtree(V3 ~ V1 | V2))
## (Intercept) V1
## 2 2.99928854 1.9804084
## 4 1.97185661 -0.9047805
## 5 0.01333292 1.0288309
The main difference is that you had to tell lm() exactly which interaction to consider. lmtree(), on the other hand, "learned" the interaction in a data-driven way. Admittedly, in this case there is not so much to learn...but lmtree() could have decided without any split or with two splits instead of performing all possible splits.
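To illustrate that point, a small sketch (assuming V1 and V2 from the simulated data are still in the workspace; the V3b variable is made up here so that there are no group differences):
# the same V1 effect in every V2 group, so there is nothing to split on
V3b <- jitter(2 * V1 + 3, amount = .2)
lmtree(V3b ~ V1 | V2)   # typically a single root node, i.e. no splits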
Finally, your lmer(V3 ~ V1 + (1 + V1 | V2)) specification also estimates a nested (or interaction) effect. However, it uses a different estimation technology with random effects instead of full fixed effects. Also, here you have to prespecify the interaction.
In short: lmtree() can be considered as a way to find interaction effects in a data-driven way. But these interactions are not estimated with random effects, hence not a mixed-effects model.
P.S.: It is possible to combine lmtree() and lmer(), but that's a different story. If you are interested, see the package https://CRAN.R-project.org/package=glmertree and the accompanying paper.

Calculating R^2 for a nonlinear least squares fit

Suppose I have x values, y values, and expected y values f (from some nonlinear best fit curve).
How can I compute R^2 in R? Note that this function is not a linear model, but a nonlinear least squares (nls) fit, so not an lm fit.
You just use the lm function to fit a linear model:
x = runif(100)
y = runif(100)
spam = summary(lm(x~y))
> spam$r.squared
[1] 0.0008532386
Note that R squared is not defined for non-linear models, or at least it is very tricky. Quoting from R-help:
There is a good reason that an nls model fit in R does not provide
r-squared - r-squared doesn't make sense for a general nls model.
One way of thinking of r-squared is as a comparison of the residual
sum of squares for the fitted model to the residual sum of squares for
a trivial model that consists of a constant only. You cannot
guarantee that this is a comparison of nested models when dealing with
an nls model. If the models aren't nested this comparison is not
terribly meaningful.
So the answer is that you probably don't want to do this in the first
place.
If you want peer-reviewed evidence, see this article for example; it's not that you can't compute the R^2 value, it's just that it may not mean the same thing/have the same desirable properties as in the linear-model case.
Sounds like f holds your predicted values. So take the sum of squared distances from them to the actual values, divided by n times the variance of y; something like
1-sum((y-f)^2)/(length(y)*var(y))
should give you a quasi R-squared value, as long as your model is reasonably close to a linear model and n is fairly big.
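For instance, a small sketch of that formula applied to an nls() fit (the mtcars model below is only an illustration, not from the question):
fit <- nls(mpg ~ a / wt + b, data = mtcars, start = list(a = 40, b = 4))
f   <- predict(fit)       # predicted values from the nonlinear fit
y   <- mtcars$mpg
1 - sum((y - f)^2) / (length(y) * var(y))   # quasi R-squared, roughly 1 - RSS/TSS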
As a direct answer to the question asked (rather than arguing that R2 / pseudo-R2 values aren't useful): the nagelkerke function in the rcompanion package reports several pseudo R2 values for nonlinear least squares (nls) models, as proposed by McFadden, Cox and Snell, and Nagelkerke, e.g.
library(rcompanion)   # provides nagelkerke() and the BrendonSmall data
data(BrendonSmall)
quadplat = function(x, a, b, clx) {
ifelse(x < clx, a + b * x + (-0.5*b/clx) * x * x,
a + b * clx + (-0.5*b/clx) * clx * clx)}
model = nls(Sodium ~ quadplat(Calories, a, b, clx),
data = BrendonSmall,
start = list(a = 519,
b = 0.359,
clx = 2304))
nullfunct = function(x, m){m}
null.model = nls(Sodium ~ nullfunct(Calories, m),
data = BrendonSmall,
start = list(m = 1346))
nagelkerke(model, null=null.model)
The soilphysics package also reports Efron's pseudo R2 and adjusted pseudo R2 value for nls models as 1 - RSS/TSS:
pred <- predict(model)
n <- length(pred)
res <- resid(model)
w <- weights(model)
if (is.null(w)) w <- rep(1, n)
rss <- sum(w * res ^ 2)
resp <- pred + res
center <- weighted.mean(resp, w)
r.df <- summary(model)$df[2]
int.df <- 1
tss <- sum(w * (resp - center)^2)
r.sq <- 1 - rss/tss
adj.r.sq <- 1 - (1 - r.sq) * (n - int.df) / r.df
out <- list(pseudo.R.squared = r.sq,
adj.R.squared = adj.r.sq)
which is also the pseudo R2 as calculated by the accuracy function in the rcompanion package. Basically, this R2 measures how much better your fit is compared to just drawing a flat horizontal line through the data. This can make sense for nls models if your null model is an intercept-only model, and also for particular other nonlinear models. E.g. for a scam model that uses strictly increasing splines (bs="mpi" in the spline term), the fitted model for the worst possible scenario (e.g. where your data were strictly decreasing) would be a flat line, and hence would result in an R2 of zero. The adjusted R2 then also penalizes models with a higher number of fitted parameters. Using the adjusted R2 value would already address a lot of the criticisms of the paper linked above, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2892436/ (besides, if one swears by using information criteria for model selection, the question becomes which one to use: AIC, BIC, EBIC, AICc, QIC, etc.).
Just using
r.sq <- max(cor(y,yfitted),0)^2
adj.r.sq <- 1 - (1 - r.sq) * (n - int.df) / r.df
would, I think, also make sense if you have normal Gaussian errors: i.e. the squared correlation between the observed and fitted y values (clipped at zero, so that a negative relationship implies zero predictive power), adjusted for the number of fitted parameters in the adjusted version. If y and yfitted go in the same direction, this is exactly the R2 and adjusted R2 that would be reported for a regular linear model. To me this makes perfect sense, so I don't agree with outright rejecting the usefulness of pseudo R2 values for nls models, as the answer above seems to imply.
For non-normal error structures (e.g. if you were using a GAM with non-normal errors) the McFadden pseudo R2 is defined analogously as
1-residual deviance/null deviance
See here and here for some useful discussion.
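For example, a minimal sketch of that deviance-based definition for a model with non-normal errors (a logistic glm() on mtcars, purely as an illustration):
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
1 - fit$deviance / fit$null.deviance   # McFadden-style pseudo R-squared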
Another quasi-R-squared for non-linear models is to square the correlation between the actual y-values and the predicted y-values. For linear models this is the regular R-squared.
As an alternative approach to this problem, I have used the following procedure several times:
compute a fit on the data with the nls function;
using the resulting model, make predictions;
plot the data against the values predicted by the model (if the model is good, the points should lie near the bisector);
compute the R2 of that linear regression.
Best wishes to all. Patrick.
With the modelr package, rsquare() takes the fitted model and the data:
nls_model <- nls(mpg ~ a / wt + b, data = mtcars, start = list(a = 40, b = 4))
modelr::rsquare(nls_model, mtcars)
# 0.794
This gives essentially the same result as the longer way described by Tom from the rcompanion resource.
Longer way with nagelkerke function
nullfunct <- function(x, m){m}
null_model <- nls(mpg ~ nullfunct(wt, m),
data = mtcars,
start = list(m = mean(mtcars$mpg)))
nagelkerke(nls_model, null_model)[2]
# 0.794 or 0.796
Lastly, using predicted values
lm(mpg ~ predict(nls_model), data = mtcars) %>% broom::glance()
# 0.795
Like they say, it's only an approximation.

Anova Type 2 and Contrasts

The study design of the data I have to analyse is simple. There is one control group (CTRL) and
two different treatment groups (TREAT_1 and TREAT_2). The data also include two covariates, COV1 and COV2. I have been asked to check whether there is a linear or quadratic treatment effect in the data.
I created a dummy data set to explain my situation:
df1 <- data.frame(
Observation = c(rep("CTRL",15), rep("TREAT_1",13), rep("TREAT_2", 12)),
COV1 = c(rep("A1", 30), rep("A2", 10)),
COV2 = c(rep("B1", 5), rep("B2", 5), rep("B3", 10), rep("B1", 5), rep("B2", 5), rep("B3", 10)),
Variable = c(3944133, 3632461, 3351754, 3655975, 3487722, 3644783, 3491138, 3328894,
3654507, 3465627, 3511446, 3507249, 3373233, 3432867, 3640888,
3677593, 3585096, 3441775, 3608574, 3669114, 4000812, 3503511, 3423968,
3647391, 3584604, 3548256, 3505411, 3665138,
4049955, 3425512, 3834061, 3639699, 3522208, 3711928, 3576597, 3786781,
3591042, 3995802, 3493091, 3674475)
)
plot(Variable ~ Observation, data = df1)
As you can see from the plot, there is a linear relationship between the control and the treatment groups. To check whether this linear effect is statistically significant, I change the contrasts using the contr.poly() function and fit a linear model like this:
df1$Observation <- factor(df1$Observation)  # needed in R >= 4.0, where data.frame() no longer creates factors by default
contrasts(df1$Observation) <- contr.poly(levels(df1$Observation))
lm1 <- lm(log(Variable) ~ Observation, data = df1)
summary.lm(lm1)
From the summary we can see that the linear effect is statistically significant:
Observation.L 0.029141 0.012377 2.355 0.024 *
Observation.Q 0.002233 0.012482 0.179 0.859
However, this first model does not include any of the two covariates. Including them results in a non-significant p-value for the linear relationship:
lm2 <- lm(log(Variable) ~ Observation + COV1 + COV2, data = df1)
summary.lm(lm2)
Observation.L 0.04116 0.02624 1.568 0.126
Observation.Q 0.01003 0.01894 0.530 0.600
COV1A2 -0.01203 0.04202 -0.286 0.776
COV2B2 -0.02071 0.02202 -0.941 0.354
COV2B3 -0.02083 0.02066 -1.008 0.320
So far so good. However, I have been told to conduct a Type II ANOVA rather than Type I. To do this I used the Anova() function from the car package.
Anova(lm2, type="II")
Anova Table (Type II tests)
Response: log(Variable)
Sum Sq Df F value Pr(>F)
Observation 0.006253 2 1.4651 0.2453
COV1 0.000175 1 0.0820 0.7763
COV2 0.002768 2 0.6485 0.5292
Residuals 0.072555 34
The problem with using Type II here is that you do not get a p-value for the linear and quadratic effects, so I do not know whether the effect is linear and/or quadratic.
I found out that the following code produces the same p-value for Observation as the Anova() function. But the result also does not include any p-values for the linear or quadratic effect:
lm2 <- lm(log(Variable) ~ Observation + COV1 + COV2, data = df1)
lm3 <- lm(log(Variable) ~ COV1 + COV2, data = df1)
anova(lm2, lm3)
Does anybody know how to combine a Type II ANOVA with the contrasts function in order to obtain the p-values for the linear and quadratic effects?
Help would be very much appreciated.
Best
Peter
I found one partial workaround for this, but it may require further correction. The documentation for drop1() from the stats package indicates that this function produces Type II sums of squares (although this page, http://www.statmethods.net/stats/anova.html, states that drop1() produces Type III sums of squares, and I didn't spend too much time poring over this, http://afni.nimh.nih.gov/sscc/gangc/SS.html, to cross-check the sums-of-squares calculations). You could use it to calculate everything manually, but I suspect you're asking this question because it would be nice if someone had already worked through it.
Anyway, I added a second vector to the dummy data called Observation2, and set it up with just the linear contrasts (you can only specify one set of contrasts for a given vector at a given time):
df1[,"Observation2"]<-df1$Observation
contrasts(df1$Observation2, how.many=1)<-contr.poly
Then created a third linear model:
lm3<-lm(log(Variable)~Observation2+COV1+COV2, data=df1)
And conducted F tests with drop1 to compare F statistics from Type II ANOVAs between the two models:
lm2, which contains both the linear and quadratic terms:
drop1(lm2, test="F")
lm3, which contains just the linear contrasts:
drop1(lm3, test="F")
This doesn't include a direct comparison of the models against each other, although the F statistic is higher (and p value accordingly lower) for the linear model, which would lead one to rely upon it instead of the quadratic model.
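If a direct comparison of the two models is wanted, one option (a sketch reusing the lm2 and lm3 objects defined in this answer) is a nested-model F test, which here amounts to testing the quadratic contrast given the covariates:
# lm3 (linear contrast only) is nested in lm2 (linear + quadratic contrasts),
# so this F test is a test of the quadratic term given COV1 and COV2
anova(lm3, lm2)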
