I am trying to reproduce output from the PROC MIXED procedure using the Satterwaithe approximation in SAS using the lmerTest package in R.
This is my data:
Participant Condition Data
1 0 -1,032941629
1 0 0,869267841
1 0 -1,636722191
1 0 -1,15451393
1 0 0,340454836
1 0 -0,399315906
1 1 0,668983169
1 1 1,937817592
1 1 3,110013393
1 1 3,23409718
2 0 0,806881925
2 1 2,71020911
2 1 3,406864275
2 1 1,494288182
2 1 0,741827047
2 1 2,532062685
2 1 3,702118917
2 1 1,825046681
2 1 4,37167021
2 1 1,85125279
3 0 0,288743786
3 0 1,024396121
3 1 2,051281876
3 1 0,24543851
3 1 3,349677964
3 1 1,565395822
3 1 3,077031712
3 1 1,087494708
3 1 1,546150033
3 1 0,440249347
Using the following statement in SAS:
proc mixed data=mbd;
class participant;
model data = condition / solution ddfm=sat;
random intercept condition / sub=participant;
run;
I get this output:
My problem is that I can't seem to reproduce these results using lmerTest in R.
I thought that lmer(Data ~ Condition + (1 | Participant) + (Condition | Participant), REML=TRUE) was the equivalent statement of what I did in SAS but this gives different results. Note that the degrees of freedom are way off from the SAS output so I think I'm estimating parameters in R that I'm not estimating in SAS. I tried several other statements in R but I didn't manage to get the exact same output. However this should be possible as the lmer() function from the lmerTest package also uses the Satterwaithe approximation and should be exactly the same as the SAS PROC MIXED procedure.
Does anybody know what I'm doing wrong in R?
Thanks a lot!
Bart
You don't specify the same random effects as in your SAS example. (Condition | Participant) is expanded internally to (1 + Condition | Participant), which fits a random intercept, a random slope and the covariance between them [1]. So, you have two additional parameters (an intercept variance and the covariance) in your model. Uncorrelated random effects can be specified using || in lme4 syntax. Note how the formula gets expanded in the summary output.
library(lmerTest)
fit <- lmer(Data ~ Condition + (Condition || Participant), REML=TRUE, data = DF)
summary(fit)
#Linear mixed model fit by REML
#t-tests use Satterthwaite approximations to degrees of freedom ['lmerMod']
#Formula: Data ~ Condition + ((1 | Participant) + (0 + Condition | Participant))
# Data: DF
#
#REML criterion at convergence: 90.6
#
#Scaled residuals:
# Min 1Q Median 3Q Max
#-1.58383 -0.78970 -0.06993 0.87801 1.91237
#
#Random effects:
# Groups Name Variance Std.Dev.
# Participant (Intercept) 0.00000 0.000
# Participant.1 Condition 0.07292 0.270
# Residual 1.20701 1.099
#Number of obs: 30, groups: Participant, 3
#
#Fixed effects:
# Estimate Std. Error df t value Pr(>|t|)
#(Intercept) -0.09931 0.36621 26.50400 -0.271 0.788363
#Condition 2.23711 0.46655 12.05700 4.795 0.000432 ***
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Correlation of Fixed Effects:
# (Intr)
#Condition -0.785
Related
I have discrete count data (trap_catch) for two groups withing the variable in_tree (1 = trap in a tree, or 0 = trap not in a tree), and I want to see if counts were different between these two groups. The data is overdispersed and there are many zeroes, so I have come to the conclusion that I need a hurdle model. Is this OK?
trap_id trap_catch in_tree
1 0 0
2 10 1
3 0 0
4 0 1
5 9 1
6 3 0
Here is an example of how the data is set up. My code is as follows:
mod.hurdle <- hurdle(trap_catch~in_tree, data=data,dist="negbin")
summary(mod.hurdle)
The results I get are as follows and seem so different to any examples I have read:
Pearson residuals:
Min 1Q Median 3Q Max
-0.8986 -0.6635 -0.2080 0.2474 6.8513
Count model coefficients (truncated negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.2582 0.1285 9.793 < 2e-16 ***
in_tree 1.3722 0.3100 4.426 9.58e-06 ***
Log(theta) -0.2056 0.2674 -0.769 0.442
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.5647 0.1944 8.049 8.32e-16 ***
in_tree 16.0014 1684.1379 0.010 0.992
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta: count = 0.8142
Number of iterations in BFGS optimization: 8
Log-likelihood: -513.7 on 5 Df
I am confused as to how to interpret these results.
I apologise in advance for my lack of understanding - I am very new to this type of analysis.
I have binary outcome variable and 4 predictors: 2 binary one and 2 continuous (truncated to whole numbers). I have some 1158 observations and the objective of the analysis is to predict the probability of the binary outcome (infection), check Goodness of fit and predictive quality of the final model.
> str(data)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1158 obs. of 5 variables:
$ age : num 25 49 41 19 55 37 30 31 52 37 ...
$ gender: num 1 1 1 0 0 0 1 0 1 1 ...
$ var1 : num 0 0 0 0 0 0 0 0 0 0 ...
$ y : num 1 0 0 1 1 0 1 1 0 1 ...
$ var2 : num 26 33 25 30 28 20 28 21 17 25 ...
I have seen that the data is sometimes split in 2: testing and training data set, but not always. I assume this depends on the original sample size? Is it advisable to split the data for my analysis?
For now, I have not split the data. I conducted varius variable selection procedures:
manual LRT based backward selection manual
LRT based forward selection automated
LRT based backward selection
AIC backward selection procedure
AIC forward selection procedure
And the all lead to the same results: Only age and gender should be included in my model.
Deviance Residuals:
Min 1Q Median 3Q Max
-1.2716 -0.8767 -0.7361 1.3008 1.9353
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.785753 0.238634 3.293 0.000992 ***
age -0.031504 0.004882 -6.453 1.1e-10 ***
gender -0.223195 0.129774 -1.720 0.085455 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1444.9 on 1157 degrees of freedom
Residual deviance: 1398.7 on 1155 degrees of freedom
AIC: 1404.7
Now, i want to see if any interactions or polynomials are significant. The dot (.) denotes the full model with 4 predictors.
full.twoway <- glm(y ~ (.)^2 , family = binomial, data=data) # includes 2-way interactions
summary(full.twoway)
model.aic.backward_2w <- step(full.twoway, direction = "backward", trace = 1)
summary(model.aic.backward_2w)
full.treeway <- glm(y ~ (.)^3 , family = binomial, data=data) # includes 3-way interactions
summary(full.treeway)
full.treeway <- glm(reject ~ (.)^3 , family = binomial, data=renal) # includes 3-way interactions
summary(full.treeway)
# significant interaction: age:male:cardio at 0.5
model.aic.backward_3w <- step(full.treeway, direction = "backward", trace = 1)
summary(model.aic.backward_3w)
# polynomials
model.polynomial <- glm(y ~ age + gender + I(age^2), family = binomial, data=data)
# only age, gender significant
Also only age and gender are significant. This seems very strange to me. I would have expected some interaction or polynomial term to be significant. Am I doing something wrong? Are there some other variable selection techniques?
EDIT:
I have partitioned the dataset in training and testing. Training dataset consist of 868 observations. The results of the selection procedure indicate that only the variable age is significant now...
I'm trying to fit a general linear model (GLM) on my data using R. I have a Y continuous variable and two categorical factors, A and B. Each factor is coded as 0 or 1, for presence or absence.
Even if just looking at the data I see a clear interaction between A and B, the GLM says that p-value>>>0.05. Am I doing something wrong?
First of all I create the data frame including my data for the GLM, which consists on a Y dependent variable and two factors, A and B. These are two level factors (0 and 1). There are 3 replicates per combination.
A<-c(0,0,0,1,1,1,0,0,0,1,1,1)
B<-c(0,0,0,0,0,0,1,1,1,1,1,1)
Y<-c(0.90,0.87,0.93,0.85,0.98,0.96,0.56,0.58,0.59,0.02,0.03,0.04)
my_data<-data.frame(A,B,Y)
Let’s see how it looks like:
my_data
## A B Y
## 1 0 0 0.90
## 2 0 0 0.87
## 3 0 0 0.93
## 4 1 0 0.85
## 5 1 0 0.98
## 6 1 0 0.96
## 7 0 1 0.56
## 8 0 1 0.58
## 9 0 1 0.59
## 10 1 1 0.02
## 11 1 1 0.03
## 12 1 1 0.04
As we can see just looking on the data, there is a clear interaction between factor A and factor B, as the value of Y dramatically decreases when A and B are present (that is A=1 and B=1). However, using the glm function I get no significant interaction between A and B, as p-value>>>0.05
attach(my_data)
## The following objects are masked _by_ .GlobalEnv:
##
## A, B, Y
my_glm<-glm(Y~A+B+A*B,data=my_data,family=binomial)
## Warning: non-integer #successes in a binomial glm!
summary(my_glm)
##
## Call:
## glm(formula = Y ~ A + B + A * B, family = binomial, data = my_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.275191 -0.040838 0.003374 0.068165 0.229196
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.1972 1.9245 1.142 0.254
## A 0.3895 2.9705 0.131 0.896
## B -1.8881 2.2515 -0.839 0.402
## A:B -4.1747 4.6523 -0.897 0.370
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7.86365 on 11 degrees of freedom
## Residual deviance: 0.17364 on 8 degrees of freedom
## AIC: 12.553
##
## Number of Fisher Scoring iterations: 6
While you state Y is continuous, the data shows that Y is rather a fraction. Hence, probably the reason you tried to apply GLM in the first place.
To model fractions (i.e. continuous values bounded by 0 and 1) can be done with logistic regression if certain assumptions are fullfilled. See the following cross-validated post for details: https://stats.stackexchange.com/questions/26762/how-to-do-logistic-regression-in-r-when-outcome-is-fractional. However, from the data description it is not clear that those assumptions are fullfilled.
An alternative to model fractions are beta regression or fractional repsonse models.
See below how to apply those methods to your data. The results of both methods are consistent in terms of signs and significance.
# Beta regression
install.packages("betareg")
library("betareg")
result.betareg <-betareg(Y~A+B+A*B,data=my_data)
summary(result.betareg)
# Call:
# betareg(formula = Y ~ A + B + A * B, data = my_data)
#
# Standardized weighted residuals 2:
# Min 1Q Median 3Q Max
# -2.7073 -0.4227 0.0682 0.5574 2.1586
#
# Coefficients (mean model with logit link):
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 2.1666 0.2192 9.885 < 2e-16 ***
# A 0.6471 0.3541 1.828 0.0676 .
# B -1.8617 0.2583 -7.206 5.76e-13 ***
# A:B -4.2632 0.5156 -8.268 < 2e-16 ***
#
# Phi coefficients (precision model with identity link):
# Estimate Std. Error z value Pr(>|z|)
# (phi) 71.57 29.50 2.426 0.0153 *
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Type of estimator: ML (maximum likelihood)
# Log-likelihood: 24.56 on 5 Df
# Pseudo R-squared: 0.9626
# Number of iterations: 62 (BFGS) + 2 (Fisher scoring)
# ----------------------------------------------------------
# Fractional response model
install.packages("frm")
library("frm")
frm(Y,cbind(A, B, AB=A*B),linkfrac="logit")
*** Fractional logit regression model ***
# Estimate Std. Error t value Pr(>|t|)
# INTERCEPT 2.197225 0.157135 13.983 0.000 ***
# A 0.389465 0.530684 0.734 0.463
# B -1.888120 0.159879 -11.810 0.000 ***
# AB -4.174668 0.555642 -7.513 0.000 ***
#
# Note: robust standard errors
#
# Number of observations: 12
# R-squared: 0.992
The family=binomial implies Logit (Logistic) Regression, which is itself produces a binary result.
From Quick-R
Logistic Regression
Logistic regression is useful when you are predicting a binary outcome
from a set of continuous predictor variables. It is frequently
preferred over discriminant function analysis because of its less
restrictive assumptions.
The data shows an interaction. Try to fit a different model, logistic is not appropriate.
with(my_data, interaction.plot(A, B, Y, fixed = TRUE, col = 2:3, type = "l"))
An analysis of variance shows clear significance for all factors and interaction.
fit <- aov(Y~(A*B),data=my_data)
summary(fit)
Df Sum Sq Mean Sq F value Pr(>F)
A 1 0.2002 0.2002 130.6 3.11e-06 ***
B 1 1.1224 1.1224 732.0 3.75e-09 ***
A:B 1 0.2494 0.2494 162.7 1.35e-06 ***
Residuals 8 0.0123 0.0015
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
I have little experience with panel data in R, and am trying to run a simple panel regression with the plm-package. When converting my dataframe to a pdata.frame, however, my time index-variable is transformed to a factor variable. This means that if I want to regress a dependent variable as a function of time, the regression generates a long list of dummy-variables for time and calculates individual coefficients for each. I just want the average effect per time unit (ie. average monthly increase/decrease in points).
Example dataframe:
ID Date Points
1 1/11/2014 2
1 1/12/2014 4
1 1/1/2015 6
1 1/2/2015 8
2 1/11/2014 1
2 1/12/2014 2
2 1/1/2015 3
2 1/2/2015 4
Say the example dataframe structure is ID = int, Date = POSIXct, Points = int.
I then convert it to a pdata.frame with index ID and Date:
panel <- pdata.frame(dataframe, c("ID", "Date"))
And run a plm fixed effects regression:
fixed <- plm(Points ~ Date, data=panel, model="within")
summary(fixed)
The resulting coefficients are then broken down by each month as dummies.
I want to treat my time-variable as a continuous variable, so I get only one coefficient for Date. How can I do this? Is there a way to avoid formatting the time index-variable as a factor in panel dataframes?
I think you need to create a separate clock or time counter from panel$Date to use in your model. For example:
library(dplyr)
dataframe <- dataframe %>%
group_by(ID) %>%
mutate(clock = seq_along(ID))
panel <- pdata.frame(dataframe, c("ID", "Date"))
That produces these data:
ID Date Points clock
1-2014-11-01 1 2014-11-01 2 1
1-2014-12-01 1 2014-12-01 4 2
1-2015-01-01 1 2015-01-01 6 3
1-2015-02-01 1 2015-02-01 8 4
2-2014-11-01 2 2014-11-01 1 1
2-2014-12-01 2 2014-12-01 2 2
2-2015-01-01 2 2015-01-01 3 3
2-2015-02-01 2 2015-02-01 4 4
That produces this output:
> fixed <- plm(Points ~ clock, data=panel, model="within")
> summary(fixed)
Oneway (individual) effect Within Model
Call:
plm(formula = points ~ clock, data = panel, model = "within")
Balanced Panel: n=2, T=4, N=8
Residuals :
Min. 1st Qu. Median 3rd Qu. Max.
-0.750 -0.375 0.000 0.375 0.750
Coefficients :
Estimate Std. Error t-value Pr(>|t|)
clock 1.50000 0.22361 6.7082 0.001114 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 25
Residual Sum of Squares: 2.5
R-Squared : 0.9
Adj. R-Squared : 0.5625
F-statistic: 45 on 1 and 5 DF, p-value: 0.0011144
Suppose I have to estimate coefficients a,b in regression:
y=a*x+b*z+c
I know in advance that y is always in range y>=0 and y<=x, but regression model produces sometimes y outside of this range.
Sample data:
mydata<-data.frame(y=c(0,1,3,4,9,11),x=c(1,3,4,7,10,11),z=c(1,1,1,9,6,7))
round(predict(lm(y~x+z,data=mydata)),2)
1 2 3 4 5 6
-0.87 1.79 3.12 4.30 9.34 10.32
First predicted value is <0.
I tried model without intercept: all predictions are >0, but third prediction of y is >x (4.03>3)
round(predict(lm(y~x+z-1,data=mydata)),2)
1 2 3 4 5 6
0.76 2.94 4.03 4.67 8.92 9.68
I also considered to model proportion y/x instead of y:
mydata$y2x<-mydata$y/mydata$x
round(predict(lm(y2x~x+z,data=mydata)),2)
1 2 3 4 5 6
0.15 0.39 0.50 0.49 0.97 1.04
round(predict(lm(y2x~x+z-1,data=mydata)),2)
1 2 3 4 5 6
0.08 0.33 0.46 0.47 0.99 1.07
But now sixth prediction is >1, but proportion should be in range [0,1].
I also tried to apply method where glm is used with offset option: Regression for a Rate variable in R
and
http://en.wikipedia.org/wiki/Poisson_regression#.22Exposure.22_and_offset
but this was not successfull.
Please note, in my data dependent variable: proportion y/x is both zero-inflated and one-inflated.
Any idea, what is suitable approach to build a model in R ('glm','lm')?
You're on the right track: if 0 ≤ y ≤ x then 0 ≤ (y/x) ≤ 1. This suggests fitting y/x to a logistic model in glm(...). Details are below, but considering that you've only got 6 points, this is a pretty good fit.
The major concern is that the model is not valid unless the error in (y/x) is Normal with constant variance (or, equivalently, the error in y increases with x). If this is true then we should get a (more or less) linear Q-Q plot, which we do.
One nuance: the interface to the glm logistic model wants two columns for y: "number of successes (S)" and "number of failures (F)". It then calculates the probability as S/(S+F). So we have to provide two columns which mimic this: y and x-y. Then glm(...) will calculate y/(y+(x-y)) = y/x.
Finally, the fit summary suggests that x is important and z may or may not be. You might want to try a model that excludes z and see if that improves AIC.
fit = glm(cbind(y,x-y)~x+z, data=mydata, family=binomial(logit))
summary(fit)
# Call:
# glm(formula = cbind(y, x - y) ~ x + z, family = binomial(logit),
# data = mydata)
# Deviance Residuals:
# 1 2 3 4 5 6
# -0.59942 -0.35394 0.62705 0.08405 -0.75590 0.81160
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -2.0264 1.2177 -1.664 0.0961 .
# x 0.6786 0.2695 2.518 0.0118 *
# z -0.2778 0.1933 -1.437 0.1507
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# (Dispersion parameter for binomial family taken to be 1)
# Null deviance: 13.7587 on 5 degrees of freedom
# Residual deviance: 2.1149 on 3 degrees of freedom
# AIC: 15.809
par(mfrow=c(2,2))
plot(fit) # residuals, Q-Q, Scale-Location, and Leverage Plots
mydata$pred <- predict(fit, type="response")
par(mfrow=c(1,1))
plot(mydata$y/mydata$x,mydata$pred,xlim=c(0,1),ylim=c(0,1), xlab="Actual", ylab="Predicted")
abline(0,1, lty=2, col="blue")