When I attempt to run wald.test (from the aod package) on a categorical variable in my linear model, I get the following error:
Error in L %*% V : non-conformable arguments
The code that I'm having trouble with:
m1 <- glm(comment_count ~ factor(has_conflicts) + factor(base_repo_id) + **snip**, data = mydata)
summary(m1) # shows that base_repo_id's factors are coefficients 3 through 12
# Determine whether base_repo_id matters
wald.test(b = coef(m1), Sigma = vcov(m1), Terms = 3:12)
As I understand it, wald.test's b parameter takes the regression coefficients, Sigma takes the model's variance-covariance matrix, and Terms selects the coefficients I want to run the Wald test on. So why am I getting the error?
In principle your code looks OK, so it must be something about the particular fit to your data that did not work. Perhaps there are non-identified parameters or a singular covariance matrix, or something like that?
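A quick way to check for both problems, using the m1 from the question (a sketch): non-identified coefficients show up as NA in coef(), and vcov() silently drops them, which is exactly what makes b and Sigma non-conformable.
any(is.na(coef(m1)))                # TRUE if some coefficients are non-identified
length(coef(m1)) == nrow(vcov(m1))  # FALSE means wald.test() will fail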
If I create a random data set with the variables above, then everything runs smoothly:
set.seed(1)
mydata <- data.frame(
comment_count = rpois(500, 3),
has_conflicts = sample(0:1, 500, replace = TRUE),
base_repo_id = sample(1:11, 500, replace = TRUE)
)
m1 <- glm(comment_count ~ factor(has_conflicts) + factor(base_repo_id),
data = mydata)
The test can then be carried out with base R's anova(), whose Chisq test is equivalent to the Wald test in the Gaussian case:
m0 <- update(m1, . ~. - factor(base_repo_id))
anova(m0, m1, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: comment_count ~ factor(has_conflicts)
## Model 2: comment_count ~ factor(has_conflicts) + factor(base_repo_id)
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 498 1426.1
## 2 488 1389.2 10 36.91 0.2256
Or you can use aod:
library("aod")
wald.test(b = coef(m1), Sigma = vcov(m1), Terms = 3:12)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 13.0, df = 10, P(> X2) = 0.23
Or lmtest:
library("lmtest")
waldtest(m1, "factor(base_repo_id)", test = "Chisq")
## Wald test
##
## Model 1: comment_count ~ factor(has_conflicts) + factor(base_repo_id)
## Model 2: comment_count ~ factor(has_conflicts)
## Res.Df Df Chisq Pr(>Chisq)
## 1 488
## 2 498 -10 12.966 0.2256
Or car:
library("car")
linearHypothesis(m1, names(coef(m1))[3:12])
## Linear hypothesis test
##
## Hypothesis:
## factor(base_repo_id)2 = 0
## factor(base_repo_id)3 = 0
## factor(base_repo_id)4 = 0
## factor(base_repo_id)5 = 0
## factor(base_repo_id)6 = 0
## factor(base_repo_id)7 = 0
## factor(base_repo_id)8 = 0
## factor(base_repo_id)9 = 0
## factor(base_repo_id)10 = 0
## factor(base_repo_id)11 = 0
##
## Model 1: restricted model
## Model 2: comment_count ~ factor(has_conflicts) + factor(base_repo_id)
##
## Res.Df Df Chisq Pr(>Chisq)
## 1 498
## 2 488 10 12.966 0.2256
I ran into the same problem.
The error means that the dimensions of the two matrices L and V do not conform.
Check whether there are NA elements among your coefficients: vcov() drops NA coefficients automatically, which shrinks the covariance matrix so that its size no longer matches the length of the coefficient vector.
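A sketch of a fix, reusing m1 from the question: drop the NA coefficients so that b and Sigma conform again, and recompute the Terms indices against the reduced vector (here via grep on the coefficient names).
b <- coef(m1)
b <- b[!is.na(b)]                        # vcov(m1) only covers these coefficients
terms <- grep("base_repo_id", names(b))  # new positions of the factor's terms
wald.test(b = b, Sigma = vcov(m1), Terms = terms)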
I want to fit a mixed model with data containing missing values.
The imputation is performed with mice.
How can I compare the model fit on the original data to the one based on mice?
Example code:
## dummy data
set.seed(123)
DF <- data.frame(countryname = rep(LETTERS[1:10], each = 10),
                 x1 = sample(10, 100, replace = TRUE),
                 x2 = sample(5, 100, replace = TRUE),
                 y = sample(10, 100, replace = TRUE))
# introduce NAs
DF[sample(100,10),c("x1")] <- NA
DF[sample(100,10),c("x2")] <- NA
DF[sample(100,10),c("y")] <- NA
#
library(mice)
imp = mice(data = DF, m = 10, printFlag = FALSE)
fit = with(imp, expr=lme4::lmer(y~ x1+x2+ (1 | countryname)))
library(broom.mixed)
pool(fit)
summary(fit)
## fit to original data
fitor= lme4::lmer(y~ x1+x2+ (1 | countryname),data=DF)
## how to compare model estimates for fit and fitor?
## example output
##
## =======================================
## base w/SES
## ---------------------------------------
## (Intercept) 0.105 -0.954 ***
## (0.058) (0.085)
## x1 -0.497 *** -0.356 ***
## (0.058) (0.054)
## x2 -0.079 -0.102 *
## (0.043) (0.040)
## ---------------------------------------
## R2 0.039 0.157
## Nobs 4073 4073
## =======================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
###
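One way to get such a side-by-side table (a sketch; the column names assume mice's summary.mipo and broom.mixed's tidy() output conventions):
pooled <- summary(pool(fit))[, c("term", "estimate", "std.error")]  # pooled mice estimates
orig <- broom.mixed::tidy(fitor, effects = "fixed")[, c("term", "estimate", "std.error")]
merge(pooled, orig, by = "term", suffixes = c(".mice", ".original"))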
Consider the following dataset
Quantity <- c(25,39,45,57,70,85,89,100,110,124,137,150,177)
Sales <- c(1000,1250,2600,3000,3500,4500,5000,4700,4405,4000,3730,3400,3300)
df <- data.frame(Quantity,Sales)
df
Plotting the data, the relationship is clearly non-linear, with a likely breakpoint around Quantity = 89 (I skip the plot here). Therefore, I built a joint piecewise linear model as follows
df$Xbar <- ifelse(df$Quantity>89,1,0)
df$diff <- df$Quantity - 89
reg <- lm(Sales ~ Quantity + I(Xbar * (Quantity - 89)), data = df)
summary(reg)
or simply
df$X <- df$diff*df$Xbar
reg <- lm(Sales ~ Quantity + X, data = df)
summary(reg)
However, according to this parametrization, the coefficient of X represents the change in the slope from the preceding interval.
How can I parametrize the relevant coefficient to rather represent the slope for the second interval?
I did some research but I was unable to find the desired specification, apart from some automation in Stata (see the 'marginal' option here: https://www.stata.com/manuals13/rmkspline.pdf).
Any help is much appreciated. Thank you!
Acknowledgement:
the workable example is retrieved from
https://towardsdatascience.com/unraveling-spline-regression-in-r-937626bc3d96
The key here is to use a logical variable is.right which is TRUE for the points to the right of 89 and FALSE otherwise.
In the output shown below, 60.88 is the slope to the left of 89 and -19.97 is the slope to the right. The two lines intersect at Quantity = 89, Sales = 4817.30.
is.right <- df$Quantity > 89
fm <- lm(Sales ~ diff : is.right, df)
fm
## Call:
## lm(formula = Sales ~ diff:is.right, data = df)
##
## Coefficients:
## (Intercept) diff:is.rightFALSE diff:is.rightTRUE
## 4817.30 60.88 -19.97
Alternatives
Alternatively, if you want to use Xbar from the question, do it this way. It gives the same coefficients as fm.
fm2 <- lm(Sales ~ diff : factor(Xbar), df)
or
fm3 <- lm(Sales ~ I(Xbar * diff) + I((1 - Xbar) * diff), df)
Double check with nls
We can double-check these using nls with the following formulation, which exploits the fact that if both lines are extended, the one that applies at any Quantity is the lower of the two.
st <- list(a = 0, b1 = 1, b2 = -1)
fm4 <- nls(Sales ~ a + pmin(b1 * (Quantity - 89), b2 * (Quantity - 89)), start = st)
fm4
## Nonlinear regression model
## model: Sales ~ a + pmin(b1 * (Quantity - 89), b2 * (Quantity - 89))
## data: parent.frame()
## a b1 b2
## 4817.30 60.88 -19.97
## residual sum-of-squares: 713120
##
## Number of iterations to convergence: 1
## Achieved convergence tolerance: 2.285e-09
This would also work:
fm5 <- nls(Sales ~ a + ifelse(Quantity > 89, b2, b1) * diff, df, start = st)
Plot
Here is a plot:
plot(Sales ~ Quantity, df)
lines(fitted(fm) ~ Quantity, df)
Model matrix
And here is the model matrix for the linear regression:
> model.matrix(fm)
(Intercept) diff:is.rightFALSE diff:is.rightTRUE
1 1 -64 0
2 1 -50 0
3 1 -44 0
4 1 -32 0
5 1 -19 0
6 1 -4 0
7 1 0 0
8 1 0 11
9 1 0 21
10 1 0 35
11 1 0 48
12 1 0 61
13 1 0 88
If you know the breakpoint, then you almost have the model. It should be:
fit <- lm(Sales ~ Quantity * Xbar, data = df)  # i.e. Quantity + Xbar + Quantity:Xbar
If you don't introduce a new intercept for the second segment (the Xbar main effect), that segment is forced to start from the intercept already in the model, which will not work. We can plot it:
plot(df$Quantity,df$Sales)
newdata = data.frame(Quantity=seq(40,200,by=5))
newdata$Xbar= ifelse(newdata$Quantity>89,1,0)
lines(newdata$Quantity,predict(fit,newdata))
The coefficients are:
summary(fit)
Call:
lm(formula = Sales ~ Quantity * Xbar, data = df)
Residuals:
Min 1Q Median 3Q Max
-527.9 -132.2 -15.1 148.1 464.7
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -545.435 327.977 -1.663 0.131
Quantity 59.572 5.746 10.367 2.65e-06 ***
Xbar 7227.288 585.933 12.335 6.09e-07 ***
Quantity:Xbar -80.133 6.856 -11.688 9.64e-07 ***
And the slope of the second segment is 59.572 + (-80.133) = -20.561.
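You can also compute that slope and its standard error directly from the fitted object; a sketch using the coefficient covariance matrix (the variance of a sum is the sum of the variances plus twice the covariance):
b <- coef(fit); V <- vcov(fit)
slope2 <- b["Quantity"] + b["Quantity:Xbar"]
se2 <- sqrt(V["Quantity", "Quantity"] + V["Quantity:Xbar", "Quantity:Xbar"] +
            2 * V["Quantity", "Quantity:Xbar"])
c(estimate = unname(slope2), std.error = se2)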
I have a list of formulas I want to fit to data. Rather than running a loop, I'd like to do this all at once, for performance's sake. The estimations should still be separate; I'm not trying to estimate a SUR or anything.
The following code does what I want
x <- matrix(rnorm(300),ncol=3)
y <- x %*% c(1,2,3)+rnorm(100)
formulae <- list(y ~ x[,1],
                 y ~ x[,2],
                 y ~ x[,1] + x[,2])
lapply(formulae,lm)
Unfortunately this gets somewhat slow as the length of formulae increases. Is there a way to truly vectorize this?
If it is any help, the only results of lm I care about are the coefficients and some standard errors.
As I said in my comment, what you really need is a fitting routine that is more efficient than lm() yet still numerically stable. Here is a well-tested one I wrote myself, called lm.chol(). It takes a formula and data, and returns:
a coefficient summary table, as you normally see in summary(lm(...))$coef;
Pearson estimate of residual standard error, as you get from summary(lm(...))$sigma;
adjusted-R.squared, as you get from summary(lm(...))$adj.r.squared.
## linear model estimation based on pivoted Cholesky factorization with Jacobi preconditioner
lm.chol <- function(formula, data) {
## stage0: get response vector and model matrix
## we did not follow the normal route: match.call, model.frame, model.response, model matrix, etc
y <- data[[as.character(formula[[2]])]]
X <- model.matrix(formula, data)
n <- nrow(X); p <- ncol(X)
## stage 1: XtX and Jacobi diagonal preconditioner
XtX <- crossprod(X)
D <- 1 / sqrt(diag(XtX))
## stage 2: pivoted Cholesky factorization
R <- suppressWarnings(chol(t(D * t(D * XtX)), pivot = TRUE))
piv <- attr(R, "pivot")
r <- attr(R, "rank")
if (r < p) {
warning("Model is rank-deficient!")
piv <- piv[1:r]
R <- R[1:r, 1:r]
}
## stage 3: solve linear system for coefficients
D <- D[piv]
b <- D * crossprod(X, y)[piv]
z <- forwardsolve(t(R), b)
RSS <- sum(y * y) - sum(z * z)
sigma <- sqrt(RSS / (n - r))
para <- D * backsolve(R, z)
beta.hat <- rep(NA, p)
beta.hat[piv] <- para
## stage 4: get standard error
Rinv <- backsolve(R, diag(r))
se <- rep(NA, p)
se[piv] <- D * sqrt(rowSums(Rinv * Rinv)) * sigma
## stage 5: t-statistic and p-value
t.statistic <- beta.hat / se
p.value <- 2 * pt(-abs(t.statistic), df = n - r)
## stage 6: construct coefficient summary matrix
coefficients <- matrix(c(beta.hat, se, t.statistic, p.value), ncol = 4L)
colnames(coefficients) <- c("Estimate", "Std. Error", "t value", "Pr(>|t|)")
rownames(coefficients) <- colnames(X)
## stage 7: compute adjusted R.squared
adj.R2 <- 1 - sigma * sigma / var(y)
## return model fitting results
attr(coefficients, "sigma") <- sigma
attr(coefficients, "adj.R2") <- adj.R2
coefficients
}
Here are three examples.
Example 1: full rank linear model
We take R's built-in dataset trees as an example.
# using `lm()`
summary(lm(Height ~ Girth + Volume, trees))
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 83.2958 9.0866 9.167 6.33e-10 ***
#Girth -1.8615 1.1567 -1.609 0.1188
#Volume 0.5756 0.2208 2.607 0.0145 *
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 5.056 on 28 degrees of freedom
#Multiple R-squared: 0.4123, Adjusted R-squared: 0.3703
#F-statistic: 9.82 on 2 and 28 DF, p-value: 0.0005868
## using `lm.chol()`
lm.chol(Height ~ Girth + Volume, trees)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 83.2957705 9.0865753 9.166905 6.333488e-10
#Girth -1.8615109 1.1566879 -1.609346 1.187591e-01
#Volume 0.5755946 0.2208225 2.606594 1.449097e-02
#attr(,"sigma")
#[1] 5.056318
#attr(,"adj.R2")
#[1] 0.3702869
The results are exactly the same!
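You can confirm that programmatically; a quick sketch comparing the two coefficient tables up to numerical tolerance (as.vector() strips the extra attributes from the lm.chol() result):
all.equal(as.vector(coef(summary(lm(Height ~ Girth + Volume, trees)))),
          as.vector(lm.chol(Height ~ Girth + Volume, trees)))  # should be TRUE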
Example 2: rank-deficient linear model
## toy data
set.seed(0)
dat <- data.frame(y = rnorm(100), x1 = runif(100), x2 = rbeta(100,3,5))
dat$x3 <- with(dat, (x1 + x2) / 2)
## using `lm()`
summary(lm(y ~ x1 + x2 + x3, dat))
#Coefficients: (1 not defined because of singularities)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.2164 0.2530 0.856 0.394
#x1 -0.1526 0.3252 -0.469 0.640
#x2 -0.3534 0.5707 -0.619 0.537
#x3 NA NA NA NA
#Residual standard error: 0.8886 on 97 degrees of freedom
#Multiple R-squared: 0.0069, Adjusted R-squared: -0.01358
#F-statistic: 0.337 on 2 and 97 DF, p-value: 0.7147
## using `lm.chol()`
lm.chol(y ~ x1 + x2 + x3, dat)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.2164455 0.2529576 0.8556595 0.3942949
#x1 NA NA NA NA
#x2 -0.2007894 0.6866871 -0.2924030 0.7706030
#x3 -0.3051760 0.6504256 -0.4691944 0.6399836
#attr(,"sigma")
#[1] 0.8886214
#attr(,"adj.R2")
#[1] -0.01357594
#Warning message:
#In lm.chol(y ~ x1 + x2 + x3, dat) : Model is rank-deficient!
Here, lm.chol(), based on Cholesky factorization with complete pivoting, and lm(), based on QR factorization with partial pivoting, have shrunk different coefficients to NA. But the two estimates are equivalent, with the same fitted values and residuals.
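That equivalence is easy to verify; a sketch that rebuilds fitted values from the lm.chol() coefficients (treating the dropped NA coefficient as zero) and compares them with lm()'s:
b <- lm.chol(y ~ x1 + x2 + x3, dat)[, "Estimate"]
b[is.na(b)] <- 0                                   # the dropped column contributes nothing
f1 <- drop(model.matrix(y ~ x1 + x2 + x3, dat) %*% b)
f2 <- fitted(lm(y ~ x1 + x2 + x3, dat))
all.equal(unname(f1), unname(f2))                  # should be TRUE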
Example 3: performance for large linear models
n <- 10000; p <- 300
set.seed(0)
dat <- as.data.frame(setNames(replicate(p, rnorm(n), simplify = FALSE), paste0("x",1:p)))
dat$y <- rnorm(n)
## using `lm()`
system.time(lm(y ~ ., dat))
# user system elapsed
# 3.212 0.096 3.315
## using `lm.chol()`
system.time(lm.chol(y ~ ., dat))
# user system elapsed
# 1.024 0.028 1.056
lm.chol() is 3 to 4 times faster than lm(). The reason is that forming X'X and Cholesky-factorizing it costs roughly half the floating-point operations of the QR factorization of X that lm() performs.
Remark
I have focused on improving performance of the computational kernel. You can take one step further, by using Ben Bolker's parallelism suggestion. If my approach gives a 3-fold speedup, and parallel computing gives a 3-fold speedup on 4 cores, you end up with a 9-fold speedup!
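For example (a sketch; mclapply() uses forking and is not available on Windows, and it assumes the variables referenced by formulae are visible to the workers, e.g. stored in a data frame df):
library(parallel)
## fit each formula on its own core with the fast routine
fits <- mclapply(formulae, function(f) lm.chol(f, data = df), mc.cores = 4)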
There's not really an easy way to vectorize this, but the pdredge function from the MuMIn package gives you a pretty easy way to parallelize it (this assumes you have multiple cores on your machine, or that you can set up a local cluster in one of the ways supported by the parallel package).
library(parallel)
clust <- makeCluster(2, type = "PSOCK")
library(MuMIn)
Construct data:
set.seed(101)
x <- matrix(rnorm(300),ncol=3)
y <- x %*% c(1,2,3)+rnorm(100)
It will be easier to do this with a named data frame rather than an anonymous matrix:
df <- setNames(data.frame(y,x),c("y",paste0("x",1:3)))
The cluster nodes all need access to the data set:
clusterExport(clust,"df")
Fit the full model (you could use y~. to fit all variables)
full <- lm(y~x1+x2,data=df,na.action=na.fail)
Now fit all submodels (see ?MuMIn::dredge for many more options to control which submodels are fitted)
p <- pdredge(full,cluster=clust)
coef(p)
## (Intercept) x1 x2
## 3 -0.003805107 0.7488708 2.590204
## 2 -0.028502039 NA 2.665305
## 1 -0.101434662 1.0490816 NA
## 0 -0.140451160 NA NA
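When you are done, remember to shut the worker processes down:
stopCluster(clust)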
I'm encountering an issue with predictInterval() from merTools. The predictions seem to be out of order when compared to the data and to the midpoint predictions from lme4's standard predict() method. I can't reproduce the problem with simulated data, so the best I can do is show the lmerMod object and some of my data.
> # display input data to the model
> head(inputData)
id y x z
1 calibration19 1.336 0.531 001
2 calibration20 1.336 0.433 001
3 calibration22 0.042 0.432 001
4 calibration23 0.042 0.423 001
5 calibration16 3.300 0.491 001
6 calibration17 3.300 0.465 001
> sapply(inputData, class)
id y x z
"factor" "numeric" "numeric" "factor"
>
> # fit mixed effects regression with random intercept on z
> lmeFit = lmer(y ~ x + (1 | z), inputData)
>
> # display lmerMod object
> lmeFit
Linear mixed model fit by REML ['lmerMod']
Formula: y ~ x + (1 | z)
Data: inputData
REML criterion at convergence: 444.245
Random effects:
Groups Name Std.Dev.
z (Intercept) 0.3097
Residual 0.9682
Number of obs: 157, groups: z, 17
Fixed Effects:
(Intercept) x
-0.4291 5.5638
>
> # display new data to predict in
> head(predData)
id x z
1 29999900108 0.343 001
2 29999900207 0.315 001
3 29999900306 0.336 001
4 29999900405 0.408 001
5 29999900504 0.369 001
6 29999900603 0.282 001
> sapply(predData, class)
id x z
"factor" "numeric" "factor"
>
> # estimate fitted values using predict()
> set.seed(1)
> preds_mid = predict(lmeFit, newdata=predData)
>
> # estimate fitted values using predictInterval()
> set.seed(1)
> preds_interval = predictInterval(lmeFit, newdata=predData, n.sims=1000) # wrong order
>
> # estimate fitted values just for the first observation to confirm that it should be similar to preds_mid
> set.seed(1)
> preds_interval_first_row = predictInterval(lmeFit, newdata=predData[1,], n.sims=1000)
>
> # display results
> head(preds_mid) # correct prediction
1 2 3 4 5 6
1.256860 1.101074 1.217913 1.618505 1.401518 0.917470
> head(preds_interval) # incorrect order
fit upr lwr
1 1.512410 2.694813 0.133571198
2 1.273143 2.521899 0.009878347
3 1.398273 2.785358 0.232501376
4 1.878165 3.188086 0.625161201
5 1.605049 2.813737 0.379167003
6 1.147415 2.417980 -0.108547846
> preds_interval_first_row # correct prediction
fit upr lwr
1 1.244366 2.537451 -0.04911808
> preds_interval[round(preds_interval$fit,3)==round(preds_interval_first_row$fit,3),] # the correct prediction ends up as observation 1033
fit upr lwr
1033 1.244261 2.457012 -0.0001299777
>
To put this into words, the first observation of my data frame predData should have a fitted value around 1.25 according to the predict() method, but it has a value around 1.5 using the predictInterval() method. This does not seem to be simply due to differences in the prediction approaches, because if I restrict the newdata argument to the first row of predData, the resulting fitted value is around 1.25, as expected.
The fact that I can't reproduce the problem with simulated data leads me to believe it has to do with an attribute of my input or prediction data. I've tried reclassifying the factor variables as character, and enforcing the order of the rows both before fitting the model and between fitting and predicting, but with no success.
Is this a known issue? What can I do to avoid it?
I have attempted to make a minimal reproducible example of this issue, but have been unsuccessful.
library(merTools)
d <- data.frame(x = rnorm(1000), z = sample(1:25L, 1000, replace=TRUE),
id = sample(LETTERS, 1000, replace = TRUE))
d$z <- as.factor(d$z)
d$id <- factor(d$id)
d$y <- simulate(~x+(1|z),family = gaussian,
newdata=d,
newparams=list(beta=c(2, -1.1), theta=c(.25),
sigma = c(.23)), seed =463)[[1]]
lmeFit <- lmer(y ~ x + (1|z), data = d)
predData <- data.frame(x = rnorm(25), z = sample(1:25L, 25, replace=TRUE),
id = sample(LETTERS, 25, replace = TRUE))
predData$z <- as.factor(predData$z)
predData$id <- factor(predData$id)
predict(lmeFit, predData)
predictInterval(lmeFit, predData)
predictInterval(lmeFit, predData[1, ])
But playing around with this code, I was not able to recreate the error observed above. Can you post a synthetic example, or see whether you can modify this one to reproduce the problem?
Alternatively, can you test the issue by first coercing the factors to characters and checking whether the same re-ordering occurs?
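In the meantime, here is a quick way to spot the misalignment on your own data (a sketch): the predictInterval() midpoints should track the predict() values row by row, so points far from the identity line flag the re-ordering.
mid <- predict(lmeFit, newdata = predData)
intv <- predictInterval(lmeFit, newdata = predData, n.sims = 1000)
plot(mid, intv$fit); abline(0, 1)  # points should hug the 45-degree line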
Is there an easy way to run follow-up mathematical calculations on elements of a summary? I have log-transformed data that is run through an ANOVA. I would like to calculate the antilog of the summary output.
I have the following code:
require(multcomp)
inc <- log(Inc)
myanova <- aov(inc ~ educ)
tukey <- glht(myanova, linfct = mcp(educ = "Tukey"))
summary(tukey)
Which produces an output as follows:
Estimate Std. Error t value Pr(>|t|)
12 - under12 == 0 0.32787 0.08493 3.861 0.00104 **
13to15 - under12 == 0 0.49187 0.08775 5.606 < 0.001 ***
16 - under12 == 0 0.89775 0.09217 9.740 < 0.001 ***
over16 - under12 == 0 0.99856 0.09316 10.719 < 0.001 ***
13to15 - 12 == 0 0.16400 0.04674 3.509 0.00394 **
etc.
How can I easily execute an antilog calculation on the Estimate values?
This is a bit of a hack, so I'd recommend further checking, but if all you want is to see exponentiated estimates and standard errors, I think something like the following will work (I used different data).
> amod <- aov(breaks ~ tension, data = warpbreaks)
> tukey = glht(amod, linfct = mcp(tension = "Tukey"))
> tsum = summary(tukey)
> tsum$test$coefficients = exp(tsum$test$coefficients)  # tsum$test (element [[10]]) holds the estimates
> tsum$test$sigma = exp(tsum$test$sigma)
> tsum
If you want to use coef(tukey) to give you the estimates then you would reverse transform with:
exp(coef(tukey))
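If you also want simultaneous confidence intervals on the original scale, you can back-transform the interval limits the same way (a sketch; confint() on a glht object stores the interval matrix in its confint component):
ci <- confint(tukey)  # simultaneous CIs on the log scale
exp(ci$confint)       # estimates and limits on the antilog scale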
I think this should work:
coef(tukey)
to get the estimated values. Here is an example:
amod <- aov(breaks ~ tension, data = warpbreaks)
tukey <- glht(amod, linfct = mcp(tension = "Tukey"))
Now, if you want to get all the Tukey summary elements, apply head() (or tail()) to the summary to get a named list of its elements.
head(summary(tukey))
$model
Call:
aov(formula = breaks ~ tension, data = warpbreaks)
Terms:
tension Residuals
Sum of Squares 2034.259 7198.556
Deg. of Freedom 2 51
Residual standard error: 11.88058
Estimated effects may be unbalanced
$linfct
(Intercept) tensionM tensionH
M - L 0 1 0
H - L 0 0 1
H - M 0 -1 1
attr(,"type")
[1] "Tukey"
$rhs
[1] 0 0 0
$coef
(Intercept) tensionM tensionH
36.38889 -10.00000 -14.72222
$vcov
(Intercept) tensionM tensionH
(Intercept) 7.841564 -7.841564 -7.841564
tensionM -7.841564 15.683128 7.841564
tensionH -7.841564 7.841564 15.683128
$df
[1] 51