R Regression with different null hypothesis

I have a series of regressions in which I would like to test different null hypotheses within the same regression.
Specifically, I would like to test whether one coefficient is equal to 1 and the others are equal to 0.
netew3 <- summary(lm(ewvw[,3] - factors$RF ~ factors$Mkt.RF + factors$SMB + factors$HML + factors$MOM, data = ewvw, na.action = na.exclude))
I would like to test whether the coefficient on the first variable (factors$Mkt.RF) is equal to 1 and the coefficients on the others (SMB, HML, and MOM) are equal to zero.
Thank you in advance for your help.
Best
PL

summary() of an lm object gives you p-values for all coefficients under the null hypothesis that each coefficient equals 0. However, it also gives you all the information you need to conduct your own test with a different null hypothesis, e.g. that a coefficient equals 1.
This is one of many places where the t-test of regression coefficients is explained in detail. Essentially, you get the t-value by calculating (estimate - reference) / SE, where SE is the standard error and reference is the assumed value of the coefficient under the null hypothesis (usually 0). So all you have to do is change that reference value from 0 to 1 and you have your t-value.
I automated this in the function below. h0.value is the assumed value under the null hypothesis. You can check that it works properly with your data/model by running it with h0.value = 0 and comparing the result to the output of summary(). If the results match, use it with h0.value = 1.
estim_test <- function(lm.mod, h0.value = 0) {
  coefm <- as.data.frame(summary(lm.mod)$coefficients)
  coefm$`t value` <- (coefm$Estimate - h0.value) / coefm$`Std. Error`
  coefm$`Pr(>|t|)` <- 2 * pt(-abs(coefm$`t value`), df = lm.mod$df.residual)
  coefm
}
# Testing the function
data("swiss")
mod1 <- lm(Fertility ~ Agriculture + Education + Catholic, data=swiss)
summary(mod1)
estim_test(mod1, h0.value=0)
estim_test(mod1, h0.value=1)
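If you would rather test all four restrictions jointly (Mkt.RF = 1 and SMB = HML = MOM = 0 at once), car::linearHypothesis() can do this from a fitted lm object. A minimal sketch, assuming the model is refit with a data frame so the coefficients have plain names (the data frame dat and the excess-return column exret here are hypothetical):
library(car)
mod <- lm(exret ~ Mkt.RF + SMB + HML + MOM, data = dat)  # hypothetical refit
# joint F test of all four restrictions at once
linearHypothesis(mod, c("Mkt.RF = 1", "SMB = 0", "HML = 0", "MOM = 0"))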

Related

Quasi-Poisson mixed-effect model on overdispersed count data from multiple imputed datasets in R

I'm dealing with a problem that has three parts. I can solve each part separately, but now I need to solve them together:
extremely skewed, over-dispersed dependent count variable (the number of incidents while doing something),
necessity to include random effects,
lots of missing values -> multiple imputation -> 10 imputed datasets.
To solve the first two parts, I chose a quasi-Poisson mixed-effect model. Since stats::glm can't include random effects (or I haven't figured out how) and lme4::glmer doesn't support the quasi-families, I worked with glmer(family = "poisson") and then adjusted the standard errors, z statistics and p-values as recommended here and discussed here. So I basically turn a Poisson mixed-effect regression into a quasi-Poisson mixed-effect regression "by hand".
This is all good with one dataset. But I have 10 of them.
I roughly understand the procedure for analyzing multiply imputed datasets – 1. imputation, 2. model fitting, 3. pooling results (I'm using the mice library). I can do these steps for a Poisson regression but not for a quasi-Poisson mixed-effect regression. Is it even possible to A) pool across models based on a quasi-distribution, and B) get residuals from a pooled object (class "mipo")? I'm not sure. I'm also not sure how to interpret the pooled results for mixed models (I don't see the random effects in the pooled output, although I've found this page, which I'm currently trying to work through).
Can I get some help, please? Any suggestions on how to complete the analysis (addressing all three issues above) would be highly appreciated.
Example of data is here (repre_d_v1 and repre_all_data are stored in there) and below is a crucial part of my code.
library(dplyr); library(tidyr); library(tidyverse); library(lme4); library(broom.mixed); library(mice)
# please download "qP_data.RData" from the last link above and load it
## ===========================================================================================
# quasi-Poisson mixed model from single data set (this is OK)
# first run Poisson regression on df "repre_d_v1", then turn it into quasi-Poisson
modelSingle = glmer(Y ~ Gender + Age + Xi + Age:Xi + (1|Country) + (1|Participant_ID),
                    family = "poisson",
                    data = repre_d_v1)
# I know there are some warnings but it's because I share only a modified subset of data with you (:
printCoefmat(coef(summary(modelSingle))) # unadjusted coefficient table
# define quasi-likelihood adjustment function
quasi_table = function(model, ctab = coef(summary(model))) {
  phi = sum(residuals(model, type = "pearson")^2) / df.residual(model)
  qctab = within(as.data.frame(ctab), {
    `Std. Error` = `Std. Error` * sqrt(phi)
    `z value` = Estimate / `Std. Error`
    `Pr(>|z|)` = 2 * pnorm(abs(`z value`), lower.tail = FALSE)
  })
  return(qctab)
}
printCoefmat(quasi_table(modelSingle)) # done, makes sense
## ===========================================================================================
# now let's work with more than one data set
# object "repre_all_data" of class "mids" contains 10 imputed data sets
# fit model using with() function, then pool()
modelMultiple = with(data = repre_all_data,
                     expr = glmer(Y ~ Gender + Age + Xi + Age:Xi + (1|Country) + (1|Participant_ID),
                                  family = "poisson"))
summary(pool(modelMultiple)) # class "mipo" ("mipo.summary")
# this has quite similar structure as coef(summary(someGLM))
# but I don't see where are the random effects?
# and more importantly, I wanted a quasi-Poisson model, not just Poisson model...
# ...but here it is not possible to use quasi_table function (defined earlier)...
# ...and that's because I can't compute "phi"
This seems reasonable, with the caveat that I'm only thinking about the computation, not about whether it makes statistical sense. What I'm doing here is computing the dispersion for each of the individual fits and then applying it to the pooled summary table, using a variant of the machinery posted above.
## compute dispersion values
phivec <- vapply(modelMultiple$analyses,
                 function(model) sum(residuals(model, type = "pearson")^2) / df.residual(model),
                 FUN.VALUE = numeric(1))
phi_mean <- mean(phivec)
ss <- summary(pool(modelMultiple)) # class "mipo" ("mipo.summary")
## adjust
qctab <- within(as.data.frame(ss), {
  std.error <- std.error * sqrt(phi_mean)
  statistic <- estimate / std.error
  p.value <- 2 * pnorm(abs(statistic), lower.tail = FALSE)
})
The results look weird (dispersion < 1, all model results identical), but I'm assuming that's because you gave us a weird subset as a reproducible example ...
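As for the random effects missing from the pooled output: pool() only pools the fixed effects. A rough workaround, sketched here without any claim that it is a formally justified pooling rule, is to collect the random-effect standard deviations from each imputed fit with broom.mixed::tidy() and average them:
library(broom.mixed)
## one column per imputed fit, one row per variance component
sd_mat <- sapply(modelMultiple$analyses,
                 function(m) tidy(m, effects = "ran_pars")$estimate)
rownames(sd_mat) <- tidy(modelMultiple$analyses[[1]], effects = "ran_pars")$group
rowMeans(sd_mat)  # average SD of each random effect across imputations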

Interpreting standard errors of parameter estimates in PGLS regression with phylolm

I am using the phylolm function (in package phylolm) to conduct PGLS phylogenetic analysis and am having some trouble interpreting the model output.
I am running a phylolm model with a continuous (log-transformed) response variable and one predictor variable, which is a factor with two groups. When I change the reference group (from condition A to B) and rerun the same model, the estimates change accordingly but the standard errors do not seem to. The standard error for the new reference group remains very high - high enough that I don't see how the difference between groups can be significant (which the p-value indicates it is). I was under the impression that phylolm standard errors can be interpreted in the same way as for ordinary linear regression - am I mistaken?
Since you have a binary variable, changing the reference category should only reverse the sign of the beta estimate - should it not? This is what happens in your models.
It might help to think of the coefficient for the conditions as the difference between the group means. The difference between the means will be the same size, but the sign will change depending on whether you are comparing condition A to condition B or condition B to condition A.
set.seed(1)
n = 100
# Create a binary variable
x = sample(c(0, 1), n, replace = TRUE)
# and create the opposite variable (i.e. changing the reference level)
x.rev = +(!x)
# add some error to the model
error = runif(n)
# create the continuous response variable
y = 2 + 2 * x + error
df = data.frame(y, x)
# look at the group means of the two conditions;
# the two differences are equal in size but opposite in sign
group_means = tapply(y, x, mean)
group_means[1] - group_means[2]
group_means[2] - group_means[1]
# compare those differences to the coefficients in the models
summary(lm(y ~ x))
summary(lm(y ~ x.rev))

lme4: How to specify random slopes while constraining all correlations to 0?

Due to an interesting turn of events, I'm trying to use the lme4 package in R to fit a model in which the random slopes are not allowed to correlate with each other or with the random intercept. Effectively, I want to estimate the variance parameter for each random slope, but none of the correlations/covariances. From the reading I've done so far, I think what I want is effectively a diagonal variance/covariance structure for the random effects.
An answer to a similar question here provides a workaround to specify a model where slopes are correlated with intercepts, but not with each other. I also know the || syntax in lme4 makes slopes that are correlated with each other, but not with the intercepts. Neither of these seems to fully accomplish what I'm looking to do.
Borrowing the example from the earlier post, if my model is:
m1 <- lmer (Y ~ A + B + (1+A+B|Subject), data=mydata)
is there a way to specify the model such that I estimate variance parameters for A and B while constraining all three correlations to 0? I would like to achieve a result that looks something like this:
VarCorr(m1)
##  Groups   Name        Std.Dev. Corr
##  Subject  (Intercept) 1.41450
##           A           1.49374  0.000
##           B           2.47895  0.000 0.000
##  Residual             0.96617
I'd prefer a solution that could achieve this for an arbitrary number of random slopes. For example, if I were to add a random effect for a third variable C, there would be 6 correlation parameters to fix at 0 rather than 3. However, anything that could get me started in the right direction would be extremely helpful.
Edit:
On asking this question, I misunderstood what the || syntax does in lme4. Struck through the incorrect statement above to avoid misleading anyone in the future.
This is exactly what the double-bar notation does. However, note that the || in lme4 does not work as one might expect for factor variables. It does work 'properly' in glmmTMB, and the afex::mixed() function is a wrapper for [g]lmer which does implement a fully functional version of ||. (I have meant to import this into lme4 for years but just haven't gotten around to it yet ...)
simulated example
library(lme4)
set.seed(101)
dd <- data.frame(A = runif(500), B = runif(500),
                 Subject = factor(rep(1:25, 20)))
dd$Y <- simulate(~ A + B + (1 + A + B | Subject),
                 newdata = dd,
                 family = gaussian,
                 newparams = list(beta = rep(1, 3), theta = rep(1, 6), sigma = 1))[[1]]
solution
summary(m <- lmer(Y ~ A + B + (1 + A + B || Subject), data = dd))
The correlations aren't listed because they are structurally absent (internally, the random effects term is expanded to (1|Subject) + (0 + A|Subject) + (0+B|Subject), which is also why the groups are listed as Subject, Subject.1, Subject.2).
Random effects:
 Groups    Name        Variance Std.Dev.
 Subject   (Intercept) 0.8744   0.9351
 Subject.1 A           2.0016   1.4148
 Subject.2 B           2.8718   1.6946
 Residual              0.9456   0.9724
Number of obs: 500, groups:  Subject, 25
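Since || is just shorthand for that expansion, you can also write it out by hand; this fits the identical model and makes the generalization to more slopes explicit:
## equivalent to the || model above; a random slope for a third
## variable C would just add another (0 + C | Subject) term
m2 <- lmer(Y ~ A + B + (1 | Subject) + (0 + A | Subject) + (0 + B | Subject),
           data = dd)
VarCorr(m2)
(In glmmTMB the same structure can be requested with its diag() covariance structure, e.g. glmmTMB(Y ~ A + B + diag(1 + A + B | Subject), data = dd), which also handles factor variables correctly.)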

R Wald test for cluster robust se's

I would like to test for the significance of my model. I have read that because I am using a cluster-robust model the F-test doesn't hold and instead I should use a Wald test.
My script currently looks like this, and all of these different options give me the corrected cluster-robust SEs:
Option 1:
m <- lm(y_var ~ var1 + poly(var2, 2) + quartier, data = df)
m_robust_clustered <- coeftest(m, vcov = vcovCL,
                               type = "HC1",
                               df = 9,               # there are 10 quartiers, so 10 - 1 = 9
                               cluster = ~ quartier) # retrieve cluster-robust SEs
Option 2: (using miceadds)
m <- lm.cluster(y_var ~ var1 + poly(var2, 2) + quartier,
                cluster = 'quartier',
                data = df)
Option 3: (using estimatr)
m <- lm_robust(y_var ~ var1 + poly(var2, 2) + quartier, cluster = quartier, data = df)
My issue is that from here I cannot figure out how to perform a Wald test. I have looked at both the waldtest() and Wald_test() functions, but neither works:
waldtest(m)
Wald_test(m)
==> What am I missing here? Which syntax should I use for the Wald test with each of the regression approaches above?
Thanks for the help!
The details of functions with names like "Wald test" can differ among packages. Some are designed for testing nested models, and wouldn't work for single models as shown in the question (which doesn't seem to specify which packages provided these waldtest() or Wald_test() functions).
A safe (if not the easiest to use) choice would be the wald.test() function in the R aod package. Usage is:
wald.test(Sigma, b, Terms = NULL, L = NULL, H0 = NULL, df = NULL, verbose = FALSE)
where Sigma is the variance-covariance matrix from the model, b is the vector of model coefficients, and you choose between Terms and L to specify what to test. It doesn't accept a model; you specify the coefficient vector and the variance-covariance matrix directly. Use Terms for a joint test on a set of coefficients; use L to test linear combinations of coefficients. The default null hypothesis H0 is all equal to 0. For the overall model Wald test, you would specify Terms as an integer vector specifying all of the coefficients, for example 1:length(coef(model)) if coef(model) returns a vector.
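For example, combined with Option 1 above, an overall model test could look like the following sketch (it assumes the intercept is the first coefficient; in your model, 2:length(b) covers var1, the two poly(var2, 2) terms, and the quartier dummies):
library(aod)
library(sandwich)
V <- vcovCL(m, cluster = ~ quartier, type = "HC1")  # cluster-robust vcov
b <- coef(m)
# joint Wald test that all coefficients except the intercept are 0
wald.test(Sigma = V, b = b, Terms = 2:length(b))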

Linear predictions always the same regardless of features in model

I'm using the Ames, Iowa housing prices data set.
I have a train set and test set. The test set is missing the dependent variable SalePrice. (No column for SalePrice exists).
I have done a linear model and now am trying to predict the Sale Price values on the test set. But when doing so, I always get these same predicted values for SalePrice, regardless of the model used.
Then when trying to calculate RMSE, I get NA.
Here is my model:
lm2 <- lm(SalePrice ~
            GarageCars +
            Neighborhood +
            I(OverallQual^2) + OverallQual +
            OverallQual*GrLivArea +
            log2(LotArea) +
            log2(GrLivArea) +
            KitchenQual +
            I(TotalBsmtSF^2) +
            TotalBsmtSF,
          data = train)
# Add an empty column to the test set,
# to be later filled in by predictions
# (Is this even necessary?):
test[, "SalePrice"] <- NA
# My predictions:
predictions <- predict(lm2, newdata = test)
head(predictions)
       1        2        3        4        5        6
121093.5 170270.7 170029.5 187012.1 239359.2 172962.1
I always get these same values regardless of the model used. I suspect I'm just not understanding predict(). I suspect I am only getting the predicted values based on my train set rather than on my test set.
I know that the variable names need to match exactly those used in the model, but what other aspect of predict am I not understanding? Do I need to perform the same predictor variable transformations in the test set? Must I create variables to hold them?
Then I calculate the RMSE:
# Formula function for calculating RMSE:
rmse <- function(actual, pred) sqrt(mean((actual-pred)^2))
# Calculate rmse on test set:
rmse(test$SalePrice, predictions)
[1] NA
Could you please tell me what I'm doing wrong? Let me know if you need to see the data.
