Extract individual coefficients from plm model (R) - r

I would like to ask if it is possible to extract coefficients for individual observations for a given variable in panel data using plm in R, for instance using the example data:
wi <- plm(inv ~ value + capital, data = Grunfeld, model = "within", effect = "twoways")
In other words, in this example a coefficient for each of the firms in the sample.

It is not clear to me what you are looking for:
1) "coefficients for individual observations"
vs.
2) "a coefficient for each of the firms"
.
These are two different things and I believe you cannot have neither.
1) Asking for more than one coefficient for one observation does not work.
2) Asking for one coefficient per firm but trying to estimate two (value and capital) does not seem plausible.
Maybe this answer gives what you are looking for:
Extract all individual slope coefficient from pooled OLS estimation in R

I think you're looking for the function fixef in the plm package, which extracts the "fixed effects" (firm coefficients in this case). In your example:
library(plm)
data("Grunfeld")
wi <- plm(inv ~ value + capital, data = Grunfeld, model = "within", effect = "twoways")
and then run:
> fixef(wi)
1 2 3 4 5 6 7 8 9
-134.227709 72.826531 -269.458508 -38.873866 -139.666304 -31.339066 -82.761099 -66.737194 -104.010153
10
-7.390586

Related

Calculate indirect effect of 1-1-1 (within-person, multilevel) mediation analyses

I have data from an Experience Sampling Study, which consists of 8140 observations nested in 106 participants. I want to test if there is a mediation, in which I also want to compare the predictors (X1= socialInteraction_tech, X2= socialInteraction_ftf, M = MPEE_int, Y= wellbeing). X1, X2, and M are person-mean centred in order to obtain the within-person effects. To account for the autocorrelation I have fit a model with an ARMA(2,1) structure. We control for time with the variable "obs".
This is the final model including all variables of interest:
fit_mainH1xmy <- lme(fixed = wellbeing ~ 1 + obs # Controls
+ MPEE_int_centred + socialInteraction_tech_centred + socialInteraction_ftf_centred,
random = ~ 1 + obs | ID, correlation = corARMA(form = ~ obs | ID, p = 2, q = 1),
data = file, method = "ML", na.action=na.exclude)
summary(fit_mainH1xmy)
The mediation is partial, as my predictor X still significantly predicts Y after adding M.
However, I can't find a way to calculate c'(cprime), the indirect effect.
I have found the mlma package, but it looks weird and requires me to do transformations to my data.
I have tried melting the data in a long format and using lmer() to fit the model (following https://quantdev.ssri.psu.edu/sites/qdev/files/ILD_Ch07_2017_Within-PersonMedationWithMLM.html), but lmer() does not let me take into account the moving average (MA-part of the ARMA(2,1) structure).
Does anyone know how I could now obtain the indirect effect?

Syntax for diagonal variance-covariance matrix for non-linear mixed effects model in nlme

I am analysing routinely collected substance use data during the first 12 months' of treatment in a large sample of outpatients attending drug and alcohol treatment services. I am interested in whether differing levels of methamphetamine use (no use, low use, and high use) at the outset of treatment predicts different levels after a year in treatment, but the data is very irregular, with different clients measured at different times and different numbers of times during their year of treatment.
The data for the high and low use group seem to suggest that drug use at outset reduces during the first 3 months of treatment and then asymptotes. Hence I thought I would try a non-linear exponential decay model.
I started with the following nonlinear generalised least squares model using the gnls() function in the nlme package:
fitExp <- gnls(outcome ~ C*exp(-k*yearsFromStart),
params = list(C ~ atsBase_fac, k ~ atsBase_fac),
data = dfNL,
start = list(C = c(nsC[1], lsC[1], hsC[1]),
k = c(nsC[2], lsC[2], hsC[2])),
weights = varExp(-0.8, form = ~ yearsFromStart),
control = gnlsControl(nlsTol = 0.1))
where outcome is number of days of drug use in the 28 days previous to measurement, atsBase_fac is a three-level categorical predictor indicating level of amphetamine use at baseline (noUse, lowUse, and highUse), yearsFromStart is a continuous predictor indicating time from start of treatment in years (baseline = 0, max - 1), C is a parameter indicating initial level of drug use, and k is the rate of decay in drug use. The starting values of C and k are taken from nls models estimating these parameters for each group. These are the results of that model
Generalized nonlinear least squares fit
Model: outcome ~ C * exp(-k * yearsFromStart)
Data: dfNL
AIC BIC logLik
27672.17 27725.29 -13828.08
Variance function:
Structure: Exponential of variance covariate
Formula: ~yearsFromStart
Parameter estimates:
expon
0.7927517
Coefficients:
Value Std.Error t-value p-value
C.(Intercept) 0.130410 0.0411728 3.16738 0.0015
C.atsBase_faclow 3.409828 0.1249553 27.28839 0.0000
C.atsBase_fachigh 20.574833 0.3122500 65.89218 0.0000
k.(Intercept) -1.667870 0.5841222 -2.85534 0.0043
k.atsBase_faclow 2.481850 0.6110666 4.06150 0.0000
k.atsBase_fachigh 9.485155 0.7175471 13.21886 0.0000
So it looks as if there are differences between groups in initial rate of drug use and in rate of reduction in drug use. I would like to go a step further and fit a nonlinear mixed effects model.I tried consulting Pinhiero and Bates' book accompanying the nlme package but the only models I could find that used irregular, sparse data like mine used a self-starting function, and my model does not do that.
I tried to adapt the gnls() model to nlme like so:
fitNLME <- nlme(model = outcome ~ C*exp(-k*yearsFromStart),
data = dfNL,
fixed = list(C ~ atsBase_fac, k ~ atsBase_fac),
random = pdDiag(yearsFromStart ~ id),
groups = ~ id,
start = list(fixed = c(nsC[1], lsC[1], hsC[1], nsC[2], lsC[2], hsC[2])),
weights = varExp(-0.8, form = ~ yearsFromStart),
control = nlmeControl(optim = "optimizer"))
bit I keep getting error message, I presume through errors in the syntax specifying the random effects.
Can anyone give me some tips on how the syntax for the random effects works in nlme?
The only dataset in Pinhiero and Bates that resembled mine used a diagonal variance-covariance matrix. Can anyone filled me in on the syntax of this nlme function, or suggest a better one?
p.s. I wish I could provide a reproducible example but coming up with synthetic data that re-creates the same errors is way beyond my skills.

What is the correct way to use weights in a logistic regression in R?

My data includes survey data of car buyers. My data has a weight column that i used in SPSS to get sample sizes. Weight column is affected by demographic factors & vehicle sales. Now i am trying to put together a logistic regression model for a car segment which includes a few vehicles. I want to use the weight column in the logistic regression model & i tried to do so using "weights" in glm function. But the results are horrific. Deviances are too high, McFadden Rsquare too low. My dependent variable is binary, independent variables are on 1 to 5 scale. Weight column is numerical, ranging from 32 to 197. Could that be a reason that results are poor? Do i need to have values in weight column below 1?
Format of input file to R is -
WGT output I1 I2 I3 I4 I5
67 1 1 3 1 5 4
I1, I2, I3 being independent variables
logr<-glm(output~1,data=data1,weights=WGT,family="binomial")
logrstep<-step(logr,direction = "both",scope = formula(data1))\
logr1<-glm(output~ (formula from final iteration),weights = WGT,data=data1,family="binomial")
hl <- hoslem.test(data1$output,fitted(logr1),g=10)
I want a logistic regression model with better accuracy & gain a better understanding of using weights with logistic regression
I would check out the survey package. This will allow you to specify weights for the survey design using the svydesign function. Additionally, you can use the svyglm function to perform your weighted logistic regression. See http://r-survey.r-forge.r-project.org/survey/
Something like the following assuming your data is in a dataframe called df
my_svy <- svydesign(df, ids = ~1, weights = ~WGT)
Then you can do the following:
my_fit <- svyglm(output ~1, my_svy, family = "binomial")
For a full reprex check out the below example
library(survey)
# Generate Some Random Weights
mtcars$wts <- rnorm(nrow(mtcars), 50, 5)
# Make vs a factor just for illustrative purposes
mtcars$vs <- as.factor(mtcars$vs)
# Build the Complete survey Object
svy_df <- svydesign(data = mtcars, ids = ~1, weights = ~wts)
# Fit the logistic regression
fit <- svyglm(vs ~ gear + disp, svy_df, family = "binomial")
# Store the summary object
(fit_sumz <- summary(fit))
# Look at the AIC if desired
AIC(fit)
# Pull out the deviance if desired
fit_sumz$deviance
As far as the stepwise regression, this typically isn't a great methodology for a statistical point of view. It results in a higher R2 and some other issues regarding inference (see https://www.stata.com/support/faqs/statistics/stepwise-regression-problems/).

Analyze longitudinal data with a mixed effects model in R

I try to analyze some simulated longitudinal data in R using a mixed-effects model (lme4 package).
Simulated data: 25 subjects have to perform 2 tasks at 5 consecutive time points.
#Simulate longitudinal data
N <- 25
t <- 5
x <- rep(1:t,N)
#task1
beta1 <- 4
e1 <- rnorm(N*t, mean = 0, sd = 1.5)
y1 <- 1 + x * beta1 + e1
#task2
beta2 <- 1.5
e2 <- rnorm(N*t, mean = 0, sd = 1)
y2 <- 1 + x * beta2 + e2
data1 <- data.frame(id=factor(rep(1:N, each=t)), day = x, y = y1, task=rep(c("task1"),length(y1)))
data2 <- data.frame(id=factor(rep(1:N, each=t)), day = x, y = y2, task=rep(c("task2"),length(y2)))
data <- rbind(data1, data2)
Question1: How to analyze how a subject learns each task?
library(lme4)
m1 <- lmer(y ~ day + (1 | id), data=data1)
summary(m1)
...
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 1.2757 0.3561 123.0000 3.582 0.000489 ***
day 3.9299 0.1074 123.0000 36.603 < 2e-16 ***
With ranef(m1) I get the random intercept for each subject, which I think reflects the baseline value for each subject at day = 1. But I don't understand how I can tell how an individual learns a task, or whether subjects differ in the way how they learn the task.
Question2: How can I analyze whether the way subjects learn differ between task1 and task2.
I expanded on your example to answer your questions briefly, but I can recommend reading chapter 15 of Snijders & Bosker (2012) or the book by Singer & Willet (2003) for a better explanation. Day is treated as a continuous variable in your model, seeing as you have panel data (i.e. everyone is measured at the same day) and day has no meaning apart from indicating the different measurement occasions, it may be better to treat day as a factor (i.e. use dummy variables).
However, for now I will continue with your example
Your first model (I think you want data instread of data1) gives a fixed linear slope (i.e. average slope, no difference in the tasks, no difference between individuals). The fixed intercept is the performance when day is 0, which has no meaning so you may want to consider centering the effect of day for a better interpretation (or indeed use dummies). The random effect gives the individual deviance from this intercept which has an estimated variance of 0.00 in your example so individuals hardly differ from each other in their starting position.
m1 <- lmer(y ~ day + (1 | id), data=data)
summary(m1)
Random effects:
Groups Name Variance Std.Dev.
id (Intercept) 0.00 0.000
Residual 18.54 4.306
Number of obs: 250, groups: id, 25
We can extend this model by adding an interaction with task. Meaning that the fixed slope is different for task1 and task2 which answers question 2 I believe (you can also use update() to update your model)
m2 <- lmer(y ~ day*task + (1|id), data = data)
summary(m2)
The effect of day in this model is the fixed slope of your reference category (task1) and the interaction is the difference between the slope of task1 and task2. The fixed effect of task is the difference in intercept.
model fit can be assessed with a deviance test, read Snijders & Boskers (2012) for an explanation of ML and REML estimates.
anova(m1,m2)
To add a random effect for the growth of individuals we can update the model again, which answers question 1
m3 <- lmer(y ~ day*task + (day|id), data = data)
summary(m3)
ranef(m3)
The random effects indicate the individual deviations in slope and intercept. A summary of the distribution of you random effects is included in the model summary (same as for m1).
Finally I think you could add a random effect on the day-task interaction to assess whether individuals differ in their performance growth on task1 and task2. But this depends very much on your data and the performance of the previous models.
m4 <- lmer(y ~ day*task + (day*task|id), data = data)
summary(m4)
ranef(m4)
Hope this helps. The books I recommended certainly should. Both provide excellent examples and explanation of theory (no R examples unfortunately). If you decide on a fixed occasion model (effect of day expressed by dummies) the nlme package provides excellent options to control the covariance structure of random effects. Good documentation of the package is provided by Pinheiro & Bates (2000).

Converting Repeated Measures mixed model formula from SAS to R

There are several questions and posts about mixed models for more complex experimental designs, so I thought this more simple model would help other beginners in this process as well as I.
So, my question is I would like to formulate a repeated measures ancova in R from sas proc mixed procedure:
proc mixed data=df1;
FitStatistics=akaike
class GROUP person day;
model Y = GROUP X1 / solution alpha=.1 cl;
repeated / type=cs subject=person group=GROUP;
lsmeans GROUP;
run;
Here is the SAS output using the data created in R (below):
. Effect panel Estimate Error DF t Value Pr > |t| Alpha Lower Upper
Intercept -9.8693 251.04 7 -0.04 0.9697 0.1 -485.49 465.75
panel 1 -247.17 112.86 7 -2.19 0.0647 0.1 -460.99 -33.3510
panel 2 0 . . . . . . .
X1 20.4125 10.0228 7 2.04 0.0811 0.1 1.4235 39.4016
Below is how I formulated the model in R using 'nlme' package, but am not getting similar coefficient estimates:
## create reproducible example fake panel data set:
set.seed(94); subject.id = abs(round(rnorm(10)*10000,0))
set.seed(99); sds = rnorm(10,15,5);means = 1:10*runif(10,7,13);trends = runif(10,0.5,2.5)
this = NULL; set.seed(98)
for(i in 1:10) { this = c(this,rnorm(6, mean = means[i], sd = sds[i])*trends[i]*1:6)}
set.seed(97)
that = sort(rep(rnorm(10,mean = 20, sd = 3),6))
df1 = data.frame(day = rep(1:6,10), GROUP = c(rep('TEST',30),rep('CONTROL',30)),
Y = this,
X1 = that,
person = sort(rep(subject.id,6)))
## use package nlme
require(nlme)
## run repeated measures mixed model using compound symmetry covariance structure:
summary(lme(Y ~ GROUP + X1, random = ~ +1 | person,
correlation=corCompSymm(form=~day|person), na.action = na.exclude,
data = df1,method='REML'))
Now, the output from R, which I now realize is similar to the output from lm():
Value Std.Error DF t-value p-value
(Intercept) -626.1622 527.9890 50 -1.1859379 0.2413
GROUPTEST -101.3647 156.2940 7 -0.6485518 0.5373
X1 47.0919 22.6698 7 2.0772934 0.0764
I believe I'm close as to the specification, but not sure what piece I'm missing to make the results match (within reason..). Any help would be appreciated!
UPDATE: Using the code in the answer below, the R output becomes:
> summary(model2)
Scroll to bottom for the parameter estimates -- look! identical to SAS.
Linear mixed-effects model fit by REML
Data: df1
AIC BIC logLik
776.942 793.2864 -380.471
Random effects:
Formula: ~GROUP - 1 | person
Structure: Diagonal
GROUPCONTROL GROUPTEST Residual
StdDev: 184.692 14.56864 93.28885
Correlation Structure: Compound symmetry
Formula: ~day | person
Parameter estimate(s):
Rho
-0.009929987
Variance function:
Structure: Different standard deviations per stratum
Formula: ~1 | GROUP
Parameter estimates:
TEST CONTROL
1.000000 3.068837
Fixed effects: Y ~ GROUP + X1
Value Std.Error DF t-value p-value
(Intercept) -9.8706 251.04678 50 -0.0393178 0.9688
GROUPTEST -247.1712 112.85945 7 -2.1900795 0.0647
X1 20.4126 10.02292 7 2.0365914 0.0811
Please try below:
model1 <- lme(
Y ~ GROUP + X1,
random = ~ GROUP | person,
correlation = corCompSymm(form = ~ day | person),
na.action = na.exclude, data = df1, method = "REML"
)
summary(model1)
I think random = ~ groupvar | subjvar option with R lme provides similar result of repeated / subject = subjvar group = groupvar option with SAS/MIXED in this case.
Edit:
SAS/MIXED
R (a revised model2)
model2 <- lme(
Y ~ GROUP + X1,
random = list(person = pdDiag(form = ~ GROUP - 1)),
correlation = corCompSymm(form = ~ day | person),
weights = varIdent(form = ~ 1 | GROUP),
na.action = na.exclude, data = df1, method = "REML"
)
summary(model2)
So, I think these covariance structures are very similar (σg1 = τg2 + σ1).
Edit 2:
Covariate estimates (SAS/MIXED):
Variance person GROUP TEST 8789.23
CS person GROUP TEST 125.79
Variance person GROUP CONTROL 82775
CS person GROUP CONTROL 33297
So
TEST group diagonal element
= 125.79 + 8789.23
= 8915.02
CONTROL group diagonal element
= 33297 + 82775
= 116072
where diagonal element = σk1 + σk2.
Covariate estimates (R lme):
Random effects:
Formula: ~GROUP - 1 | person
Structure: Diagonal
GROUP1TEST GROUP2CONTROL Residual
StdDev: 14.56864 184.692 93.28885
Correlation Structure: Compound symmetry
Formula: ~day | person
Parameter estimate(s):
Rho
-0.009929987
Variance function:
Structure: Different standard deviations per stratum
Formula: ~1 | GROUP
Parameter estimates:
1TEST 2CONTROL
1.000000 3.068837
So
TEST group diagonal element
= 14.56864^2 + (3.068837^0.5 * 93.28885 * -0.009929987) + 93.28885^2
= 8913.432
CONTROL group diagonal element
= 184.692^2 + (3.068837^0.5 * 93.28885 * -0.009929987) + (3.068837 * 93.28885)^2
= 116070.5
where diagonal element = τg2 + σ1 + σg2.
Oooh, this is going to be a tricky one, and if it's even possible using standard nlme functions, is going to take some serious study of Pinheiro/Bates.
Before you spend the time doing that though, you should make absolutely sure that this is exact model you need. Perhaps there's something else that might fit the story of your data better. Or maybe there's something R can do more easily that is just as good, but not quite the same.
First, here's my take on what you're doing in SAS with this line:
repeated / type=cs subject=person group=GROUP;
This type=cs subject=person is inducing correlation between all the measurements on the same person, and that correlation is the same for all pairs of days. The group=GROUP is allowing the correlation for each group to be different.
In contrast, here's my take on what your R code is doing:
random = ~ +1 | person,
correlation=corCompSymm(form=~day|person)
This code is actually adding almost the same effect in two different ways; the random line is adding a random effect for each person, and the correlation line is inducing correlation between all the measurements on the same person. However, these two things are almost identical; if the correlation is positive, you get the exact same result by including either of them. I'm not sure what happens when you include both, but I do know that only one is necessary. Regardless, this code has the same correlation for all individuals, it's not allowing each group to have their own correlation.
To let each group have their own correlation, I think you have to build a more complicated correlation structure up out of two different pieces; I've never done this but I'm pretty sure I remember Pinheiro/Bates doing it.
You might consider instead adding a random effect for person and then letting the variance be different for the different groups with weights=varIdent(form=~1|group) (from memory, check my syntax, please). This won't quite be the same but tells a similar story. The story in SAS is that the measurements on some individuals are more correlated than the measurements on other individuals. Thinking about what that means, the measurements for individuals with higher correlation will be closer together than the measurements for individuals with lower correlation. In contrast, the story in R is that the variability of measurements within individuals varies; thinking about that, measurements with higher variability with have lower correlation. So they do tell similar stories, but come at it from opposite sides.
It is even possible (but I would be surprised) that these two models end up being different parameterizations of the same thing. My intuition is that the overall measurement variability will be different in some way. But even if they aren't the same thing, it would be worth writing out the parameterizations just to be sure you understand them and to make sure that they are appropriately describing the story of your data.

Resources