Based on the link below, I created a code to run regression on subsets of my data based on a variable.
Loop linear regression and saving coefficients
In this example I created a DUMMY (0 or 1) to create the subsets (in reality I have 3000 subsets)
res <- do.call(rbind, lapply(split(mydata, mydata$DUMMY),function(x){
fit <- lm(y~x1 + x2, data=x)
res <- data.frame(DUMMY=unique(x$DUMMY), coeff=coef(fit))
res
}))
This results in the following dataset
DUMMY coeff
0.(Intercept) 0 22.8419956
0.x1 0 -11.5623064
0.x2 0 2.1006948
1.(Intercept) 1 4.2020874
1.x1 1 -0.4924303
1.x2 1 1.0917668
What I would like however is one row per regression, and the variables in the columns. I also need the p values and standard errors included.
DUMMY interceptx1 coeffx1 p-valuex1 SEx1 coeffx2 p-valuex2 SEx2
0 22.84 -11.56 0.04 0.15 2.10 0.80 0.90
1 4.20 -0.49 0.10 0.60 1.09 0.60 1.20
Any idea how to do this?
While your desired output is (IMHO) not really tidy data, here is an approach using data.table and a custom-built extraction-function. It has an option to return a wide or long form of the results.
The extractor-function takes in a lm-object, and returns estimates, p-values and standard errors for all variables.
extractor <- function(model, return_wide = F){
#get datatable with coefficient, se and p-value
model_summary <- as.data.table(summary(model)$coefficients[,-3])
model_summary[,variable:=names(coef(model))]
#do some reshaping
step2 <- melt(model_summary, id.var="variable",variable.name="measure")
if(!return_wide){
return(step2)
}
step3 <- dcast(step2, 1~variable+measure,value.var="value")
return(step3)
}
Demonstration:
res_wide <- dat[,extractor(lm(y~x1 + x2), return_wide = T), by = dummy]
> res_wide
# dummy . (Intercept)_Estimate (Intercept)_Std. Error (Intercept)_Pr(>|t|) x1_Estimate x1_Std. Error x1_Pr(>|t|) x2_Estimate x2_Std. Error x2_Pr(>|t|)
# 1: 0 . 0.04314707 0.04495702 0.3376461 -0.054364406 0.04441204 0.2214895 0.01333804 0.04620999 0.7729757
# 2: 1 . -0.04137086 0.04471550 0.3553164 0.009864255 0.04533808 0.8278539 0.05272257 0.04507189 0.2426726
res_long <- dat[,extractor(lm(y~x1 + x2)), by = dummy]
# dummy variable measure value
# 1: 0 (Intercept) Estimate 0.043147072
# 2: 0 x1 Estimate -0.054364406
# 3: 0 x2 Estimate 0.013338043
# 4: 0 (Intercept) Std. Error 0.044957023
# 5: 0 x1 Std. Error 0.044412037
# 6: 0 x2 Std. Error 0.046209987
# 7: 0 (Intercept) Pr(>|t|) 0.337646052
# 8: 0 x1 Pr(>|t|) 0.221489530
Data used:
library(data.table)
set.seed(123)
nobs = 1000
dat <- data.table(
dummy = sample(0:1,nobs,T),
x1 = rnorm(nobs),
x2 = rnorm(nobs),
y = rnorm(nobs))
Related
I have the following data (dat)
I have the following data(dat)
V W X Y Z
1 8 89 3 900
1 8 100 2 800
0 9 333 4 980
0 9 560 1 999
I wish to perform TukeysHSD pairwise test to the above data set.
library(reshape2)
dat1 <- gather(dat) #convert to long form
pairwise.t.test(dat1$key, dat1$value, p.adj = "holm")
However, every time I try to run it, it keeps running and does not yield an output. Any suggestions on how to correct this?
I would also like to perform the same test using the function TukeyHSD(). However, when I try to use the wide/long format, I run into a error that says
" Error in UseMethod("TukeyHSD") :
no applicable method for 'TukeyHSD' applied to an object of class "data.frame"
We need 'x' to be dat1$value as it is not specified the first argument is taken as 'x' and second as 'g'
pairwise.t.test( dat1$value, dat1$key, p.adj = "holm")
#data: dat1$value and dat1$key
# V W X Y
#W 1.000 - - -
#X 0.018 0.018 - -
#Y 1.000 1.000 0.018 -
#Z 4.1e-08 4.1e-08 2.8e-06 4.1e-08
#P value adjustment method: holm
Or we specify the argument and use in any order we wanted
pairwise.t.test(g = dat1$key, x= dat1$value, p.adj = "holm")
Regarding the TukeyHSD
TukeyHSD(aov(value~key, data = dat1), ordered = TRUE)
#Tukey multiple comparisons of means
# 95% family-wise confidence level
# factor levels have been ordered
#Fit: aov(formula = value ~ key, data = dat1)
#$key
# diff lwr upr p adj
#Y-V 2.00 -233.42378 237.4238 0.9999999
#W-V 8.00 -227.42378 243.4238 0.9999691
#X-V 270.00 34.57622 505.4238 0.0211466
#Z-V 919.25 683.82622 1154.6738 0.0000000
#W-Y 6.00 -229.42378 241.4238 0.9999902
#X-Y 268.00 32.57622 503.4238 0.0222406
#Z-Y 917.25 681.82622 1152.6738 0.0000000
#X-W 262.00 26.57622 497.4238 0.0258644
#Z-W 911.25 675.82622 1146.6738 0.0000000
#Z-X 649.25 413.82622 884.6738 0.0000034
Consider the following table :
DB <- data.frame(
Y =rnorm(6),
X1=c(T, T, F, T, F, F),
X2=c(T, F, T, F, T, T)
)
Y X1 X2
1 1.8376852 TRUE TRUE
2 -2.1173739 TRUE FALSE
3 1.3054450 FALSE TRUE
4 -0.3476706 TRUE FALSE
5 1.3219099 FALSE TRUE
6 0.6781750 FALSE TRUE
I'd like to explain my quantitative variable Y by two binary variables (TRUE or FALSE) without intercept.
The argument of this choice is that, in my study, we can't observe X1=FALSE and X2=FALSE at the same time, so it doesn't make sense to have a mean, other than 0, for this level.
With intercept
m1 <- lm(Y~X1+X2, data=DB)
summary(m1)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.9684 1.0590 -1.859 0.1600
X1TRUE 0.7358 0.9032 0.815 0.4749
X2TRUE 3.0702 0.9579 3.205 0.0491 *
Without intercept
m0 <- lm(Y~0+X1+X2, data=DB)
summary(m0)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
X1FALSE -1.9684 1.0590 -1.859 0.1600
X1TRUE -1.2325 0.5531 -2.229 0.1122
X2TRUE 3.0702 0.9579 3.205 0.0491 *
I can't explain why two coefficients are estimated for the variable X1. It seems to be equivalent to the intercept coefficient in the model with intercept.
Same results
When we display the estimation for all the combinations of variables, the two models are the same.
DisplayLevel <- function(m){
R <- outer(
unique(DB$X1),
unique(DB$X2),
function(a, b) predict(m,data.frame(X1=a, X2=b))
)
colnames(R) <- paste0('X2:', unique(DB$X2))
rownames(R) <- paste0('X1:', unique(DB$X1))
return(R)
}
DisplayLevel(m1)
X2:TRUE X2:FALSE
X1:TRUE 1.837685 -1.232522
X1:FALSE 1.101843 -1.968364
DisplayLevel(m0)
X2:TRUE X2:FALSE
X1:TRUE 1.837685 -1.232522
X1:FALSE 1.101843 -1.968364
So the two models are equivalent.
Question
My question is : can we just estimate one coefficient for the first effect ? Can we force R to assign a 0 value to the combinations X1=FALSE and X2=FALSE ?
Yes, we can, by
DB <- as.data.frame(data.matrix(DB))
## or you can do:
## DB$X1 <- as.integer(DB$X1)
## DB$X2 <- as.integer(DB$X2)
# Y X1 X2
# 1 -0.5059575 1 1
# 2 1.3430388 1 0
# 3 -0.2145794 0 1
# 4 -0.1795565 1 0
# 5 -0.1001907 0 1
# 6 0.7126663 0 1
## a linear model without intercept
m0 <- lm(Y ~ 0 + X1 + X2, data = DB)
DisplayLevel(m0)
# X2:1 X2:0
# X1:1 0.15967744 0.2489237
# X1:0 -0.08924625 0.0000000
I have explicitly coerced your TRUE/FALSE binary into numeric 1/0, so that no contrast is handled by lm().
The data appeared in my answer are different to yours, because you did not use set.seed(?) before rnorm() for reproducibility. But this is not a issue here.
I'm encountering an issue with predictInterval() from merTools. The predictions seem to be out of order when compared to the data and midpoint predictions using the standard predict() method for lme4. I can't reproduce the problem with simulated data, so the best I can do is show the lmerMod object and some of my data.
> # display input data to the model
> head(inputData)
id y x z
1 calibration19 1.336 0.531 001
2 calibration20 1.336 0.433 001
3 calibration22 0.042 0.432 001
4 calibration23 0.042 0.423 001
5 calibration16 3.300 0.491 001
6 calibration17 3.300 0.465 001
> sapply(inputData, class)
id y x z
"factor" "numeric" "numeric" "factor"
>
> # fit mixed effects regression with random intercept on z
> lmeFit = lmer(y ~ x + (1 | z), inputData)
>
> # display lmerMod object
> lmeFit
Linear mixed model fit by REML ['lmerMod']
Formula: y ~ x + (1 | z)
Data: inputData
REML criterion at convergence: 444.245
Random effects:
Groups Name Std.Dev.
z (Intercept) 0.3097
Residual 0.9682
Number of obs: 157, groups: z, 17
Fixed Effects:
(Intercept) x
-0.4291 5.5638
>
> # display new data to predict in
> head(predData)
id x z
1 29999900108 0.343 001
2 29999900207 0.315 001
3 29999900306 0.336 001
4 29999900405 0.408 001
5 29999900504 0.369 001
6 29999900603 0.282 001
> sapply(predData, class)
id x z
"factor" "numeric" "factor"
>
> # estimate fitted values using predict()
> set.seed(1)
> preds_mid = predict(lmeFit, newdata=predData)
>
> # estimate fitted values using predictInterval()
> set.seed(1)
> preds_interval = predictInterval(lmeFit, newdata=predData, n.sims=1000) # wrong order
>
> # estimate fitted values just for the first observation to confirm that it should be similar to preds_mid
> set.seed(1)
> preds_interval_first_row = predictInterval(lmeFit, newdata=predData[1,], n.sims=1000)
>
> # display results
> head(preds_mid) # correct prediction
1 2 3 4 5 6
1.256860 1.101074 1.217913 1.618505 1.401518 0.917470
> head(preds_interval) # incorrect order
fit upr lwr
1 1.512410 2.694813 0.133571198
2 1.273143 2.521899 0.009878347
3 1.398273 2.785358 0.232501376
4 1.878165 3.188086 0.625161201
5 1.605049 2.813737 0.379167003
6 1.147415 2.417980 -0.108547846
> preds_interval_first_row # correct prediction
fit upr lwr
1 1.244366 2.537451 -0.04911808
> preds_interval[round(preds_interval$fit,3)==round(preds_interval_first_row$fit,3),] # the correct prediction ends up as observation 1033
fit upr lwr
1033 1.244261 2.457012 -0.0001299777
>
To put this into words, the first observation of my data frame predData should have a fitted value around 1.25 according to the predict() method, but it has a value around 1.5 using the predictInterval() method. This does not seem to be simply due to differences in the prediction approaches, because if I restrict the newdata argument to the first row of predData, the resulting fitted value is around 1.25, as expected.
The fact that I can't reproduce the problem with simulated data leads me to believe it has to do with an attribute of my input or prediction data. I've tried reclassifying the factor variable as character, enforcing the order of the rows prior to fitting the model, between fitting the model and predicting, but found no success.
Is this a known issue? What can I do to avoid it?
I have attempted to make a minimal reproducible example of this issue, but have been unsuccessful.
library(merTools)
d <- data.frame(x = rnorm(1000), z = sample(1:25L, 1000, replace=TRUE),
id = sample(LETTERS, 1000, replace = TRUE))
d$z <- as.factor(d$z)
d$id <- factor(d$id)
d$y <- simulate(~x+(1|z),family = gaussian,
newdata=d,
newparams=list(beta=c(2, -1.1), theta=c(.25),
sigma = c(.23)), seed =463)[[1]]
lmeFit <- lmer(y ~ x + (1|z), data = d)
predData <- data.frame(x = rnorm(25), z = sample(1:25L, 25, replace=TRUE),
id = sample(LETTERS, 25, replace = TRUE))
predData$z <- as.factor(predData$z)
predData$id <- factor(predData$id)
predict(lmeFit, predData)
predictInterval(lmeFit, predData)
predictInterval(lmeFit, predData[1, ])
But, playing around with this code I was not able to recreate the error observed above. Can you post a synthetic example or see if you can create a synthetic example?
Or can you test the issue first coercing the factors to characters and seeing if you see the same re-ordering issue?
When I attempt to run wald.test (from the aod package) on a categorical variable on my linear model, I get the following error:
Error in L %*% V : non-conformable arguments
The code that I'm having trouble with:
m1 <- glm(comment_count ~ factor(has_conflicts) + factor(base_repo_id) + **snip**, data = mydata)
summary(m1) # shows that base_repo_id's factors are coefficients 3 through 12
# Determine whether base_repo_id matters
wald.test(b = coef(m1), Sigma = vcov(m1), Terms = 3:12)
As I understand it, wald.test's b parameter is the linear regression's coefficients, Sigma is the regression's variances, and Terms selects the variables I want to run the Wald test on. So why am I getting the error?
In principle your code looks ok but it must be something about the particular fit to your data that did not work. Maybe there have been problems with non-identified parameters or a singular covariance matrix or something like that?
If I create a random data set with the variables above, then everything runs smoothly:
set.seed(1)
mydata <- data.frame(
comment_count = rpois(500, 3),
has_conflicts = sample(0:1, 500, replace = TRUE),
base_repo_id = sample(1:11, 500, replace = TRUE)
)
m1 <- glm(comment_count ~ factor(has_conflicts) + factor(base_repo_id),
data = mydata)
The Wald test can then be carried out by base R's anova() (which in the Gaussian case is equivalent to the Wald test):
m0 <- update(m1, . ~. - factor(base_repo_id))
anova(m0, m1, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: comment_count ~ factor(has_conflicts)
## Model 2: comment_count ~ factor(has_conflicts) + factor(base_repo_id)
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 498 1426.1
## 2 488 1389.2 10 36.91 0.2256
Or you can use aod:
library("aod")
wald.test(b = coef(m1), Sigma = vcov(m1), Terms = 3:12)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 13.0, df = 10, P(> X2) = 0.23
Or lmtest:
library("lmtest")
waldtest(m1, "factor(base_repo_id)", test = "Chisq")
## Wald test
##
## Model 1: comment_count ~ factor(has_conflicts) + factor(base_repo_id)
## Model 2: comment_count ~ factor(has_conflicts)
## Res.Df Df Chisq Pr(>Chisq)
## 1 488
## 2 498 -10 12.966 0.2256
Or car:
library("car")
linearHypothesis(m1, names(coef(m1))[3:12])
## Linear hypothesis test
##
## Hypothesis:
## factor(base_repo_id)2 = 0
## factor(base_repo_id)3 = 0
## factor(base_repo_id)4 = 0
## factor(base_repo_id)5 = 0
## factor(base_repo_id)6 = 0
## factor(base_repo_id)7 = 0
## factor(base_repo_id)8 = 0
## factor(base_repo_id)9 = 0
## factor(base_repo_id)10 = 0
## factor(base_repo_id)11 = 0
##
## Model 1: restricted model
## Model 2: comment_count ~ factor(has_conflicts) + factor(base_repo_id)
##
## Res.Df Df Chisq Pr(>Chisq)
## 1 498
## 2 488 10 12.966 0.2256
I have met the same problem.
The error shows that the size of two matrix L and V does not match.
Please check whether there are NA elements in your coefficients.
vcov() removes NA elements automatically which changes the size of the matrix so that their sizes do not match.
I would like to run the dependent variable of a logistic regression (in my data set it's : dat$admit) with all available variables, pairs and trios(3 Independent vars), each regression with a different Independent variables vs dependent variable. The outcome that I would like to get back is a list of each regression summary in a row: coeff,p-value ,AUC,CI 95%. Using the data set submitted below there should be 7 regressions:
dat$admit vs dat$female
dat$admit vs dat$apcalc
dat$admit vs dat$num
dat$admit vs dat$female + dat$apcalc
dat$admit vs dat$female + dat$num
dat$admit vs dat$apcalc + dat$num
dat$admit vs dat$female + dat$apcalc + dat$num
Here is a sample data set (where dat$admit is the logistic regression dependent variable) :
dat <- read.table(text = " female apcalc admit num
0 0 0 7
0 0 1 1
0 1 0 3
0 1 1 7
1 0 0 5
1 0 1 1
1 1 0 0
1 1 1 6",header = TRUE)
Per #marek comment, the output should be like this (for female alone and from female & apcalc ):
# Intercept Estimate P-Value (Intercept) P-Value (Estimate) AUC
# female 0.000000e+00 0.000000e+00 1 1 0.5
female+apcalc 0.000000e+00 0.000000e+00 1 1 0.5
There is a good code that #David Arenburg wrote that produces the stats but with no models creations of pairs and trios so I would like to know how can add the models creations.
Here is David Arenburg's code?
library(caTools)
ResFunc <- function(x) {
temp <- glm(reformulate(x,response="admit"), data=dat,family=binomial)
c(summary(temp)$coefficients[,1],
summary(temp)$coefficients[,4],
colAUC(predict(temp, type = "response"), dat$admit))
}
temp <- as.data.frame(t(sapply(setdiff(names(dat),"admit"), ResFunc)))
colnames(temp) <- c("Intercept", "Estimate", "P-Value (Intercept)", "P-Value (Estimate)", "AUC")
temp
# Intercept Estimate P-Value (Intercept) P-Value (Estimate) AUC
# female 0.000000e+00 0.000000e+00 1 1 0.5
# apcalc 0.000000e+00 0.000000e+00 1 1 0.5
# num 5.177403e-16 -1.171295e-16 1 1 0.5
Any idea how to create this list? Thanks, Ron
Simple solution is to make the list of models by hand:
results <- list(
"female" = glm(admit~female , family=binomial, dat)
,"apcalc" = glm(admit~apcalc , family=binomial, dat)
,"num" = glm(admit~num , family=binomial, dat)
,"female + apcalc" = glm(admit~female + apcalc, family=binomial, dat)
,"female + num" = glm(admit~female + num , family=binomial, dat)
,"apcalc + num" = glm(admit~apcalc + num , family=binomial, dat)
,"all" = glm(admit~female + apcalc + num, family=binomial, dat)
)
Then you could check models by lapplying over the list of models:
lapply(results, summary)
Or more advanced (coefficient statistics):
require(plyr)
ldply(results, function(m) {
name_rows(as.data.frame(summary(m)$coefficients))
})
In similar way you could extract every information you want. Just write function to extract statistics you want, which takes glm model as argument:
get_everything_i_want <- function(model) {
#... do what i want ...
# eg:
list(AIC = AIC(model))
}
and then apply to each model:
lapply(results, get_everything_i_want)
# $female
# $female$AIC
# [1] 15.0904
# $apcalc
# $apcalc$AIC
# [1] 15.0904
# $num
# $num$AIC
# [1] 15.0904
# $`female + apcalc`
# $`female + apcalc`$AIC
# [1] 17.0904
# $`female + num`
# $`female + num`$AIC
# [1] 17.0904
# $`apcalc + num`
# $`apcalc + num`$AIC
# [1] 17.0904
# $all
# $all$AIC
# [1] 19.0904