I have little experience with panel data in R and am trying to run a simple panel regression with the plm package. When I convert my data frame to a pdata.frame, however, my time index variable is converted to a factor. As a result, if I regress a dependent variable on time, the regression generates a long list of time dummies and estimates a separate coefficient for each. I just want the average effect per time unit (i.e., the average monthly increase or decrease in points).
Example dataframe:
ID Date Points
1 1/11/2014 2
1 1/12/2014 4
1 1/1/2015 6
1 1/2/2015 8
2 1/11/2014 1
2 1/12/2014 2
2 1/1/2015 3
2 1/2/2015 4
Say the example dataframe structure is ID = int, Date = POSIXct, Points = int.
I then convert it to a pdata.frame with index ID and Date:
panel <- pdata.frame(dataframe, c("ID", "Date"))
And run a plm fixed effects regression:
fixed <- plm(Points ~ Date, data=panel, model="within")
summary(fixed)
The resulting coefficients are then broken down by each month as dummies.
I want to treat my time-variable as a continuous variable, so I get only one coefficient for Date. How can I do this? Is there a way to avoid formatting the time index-variable as a factor in panel dataframes?
I think you need to create a separate clock or time counter from panel$Date to use in your model. For example:
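library(dplyr)
library(plm)
dataframe <- dataframe %>%
  arrange(ID, Date) %>%   # the counter assumes rows are in time order within each ID
  group_by(ID) %>%
  mutate(clock = seq_along(ID)) %>%
  ungroup() %>%
  as.data.frame()         # drop the grouping and tibble class before calling pdata.frame
panel <- pdata.frame(dataframe, c("ID", "Date"))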
That produces these data:
ID Date Points clock
1-2014-11-01 1 2014-11-01 2 1
1-2014-12-01 1 2014-12-01 4 2
1-2015-01-01 1 2015-01-01 6 3
1-2015-02-01 1 2015-02-01 8 4
2-2014-11-01 2 2014-11-01 1 1
2-2014-12-01 2 2014-12-01 2 2
2-2015-01-01 2 2015-01-01 3 3
2-2015-02-01 2 2015-02-01 4 4
That produces this output:
> fixed <- plm(Points ~ clock, data=panel, model="within")
> summary(fixed)
Oneway (individual) effect Within Model
Call:
plm(formula = Points ~ clock, data = panel, model = "within")
Balanced Panel: n=2, T=4, N=8
Residuals :
Min. 1st Qu. Median 3rd Qu. Max.
-0.750 -0.375 0.000 0.375 0.750
Coefficients :
Estimate Std. Error t-value Pr(>|t|)
clock 1.50000 0.22361 6.7082 0.001114 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 25
Residual Sum of Squares: 2.5
R-Squared : 0.9
Adj. R-Squared : 0.5625
F-statistic: 45 on 1 and 5 DF, p-value: 0.0011144
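The seq_along() counter assumes every ID has exactly one row per month with no gaps. As a minimal sketch under that caveat, the counter can instead be derived from Date itself, and the within slope cross-checked with base R's lm() using ID dummies (the least-squares dummy-variable form of the fixed-effects estimator). The month arithmetic and the name month_index are illustrative choices, not part of plm:

```r
# Example data from the question
dataframe <- data.frame(
  ID     = rep(1:2, each = 4),
  Date   = as.POSIXct(rep(c("2014-11-01", "2014-12-01",
                            "2015-01-01", "2015-02-01"), 2), tz = "UTC"),
  Points = c(2, 4, 6, 8, 1, 2, 3, 4)
)

# Months on a continuous scale: POSIXlt stores years since 1900 and months 0-11
lt <- as.POSIXlt(dataframe$Date)
month_index <- lt$year * 12 + lt$mon

# Clock = months elapsed since each ID's first observation, starting at 1
dataframe$clock <- month_index - ave(month_index, dataframe$ID, FUN = min) + 1

# LSDV regression: ID dummies absorb the individual effects,
# so the clock coefficient is the within (fixed-effects) slope
lsdv <- lm(Points ~ clock + factor(ID), data = dataframe)
coef(lsdv)["clock"]
```

The clock coefficient from this regression is 1.5, the same estimate as in the plm output, and a missing month would show up as a jump in clock rather than being silently renumbered.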
I am trying to use the lme4 package in R and function lmer() to fit a model for my split-split plot design. I would have used a repeated measures ANOVA if I did not have a small number of observations missing, but the missing data should be no problem with a linear mixed effects model.
My data frame (data) has a simple structure with four factors and a numeric outcome variable called all_vai. Note that in this example data frame, not all levels of all factors are crossed even though they would be in my real data (except for the missing observations). It shouldn't matter for my question, which is an attempt to fix problematic syntax.
collected_vai <- rnorm(125, mean = 6, sd = 1)
missing <- rep(NA, times = 3)
all_vai <- c(collected_vai, missing)
year1 <- rep(2018, times = 32)
year2 <- rep(2019, times = 32)
year3 <- rep(2020, times = 32)
year4 <- rep(2021, times = 32)
year <- c(year1, year2, year3, year4)
disturbance_severity <- rep(c(0,45,65,85), each = 32)
treatment <- rep(c("B" , "T"), each = 64)
replicate <- rep(c("A", "B", "C", "D"), each = 32)
data = data.frame(all_vai, year, disturbance_severity, treatment, replicate)
data$year <- as.factor(data$year)
data$disturbance_severity <- as.factor(data$disturbance_severity)
data$treatment <- as.factor(data$treatment)
data$replicate <- as.factor(data$replicate)
Here is the model I ran for an identical data set with a different (normally distributed) numeric outcome and no missing observations -- i.e., this is the model I would be running if I didn't have unbalanced repeated measures now due to missing data:
VAImodel1 <- aov(all_vai ~ disturbance_severity*treatment*year + Error(replicate/disturbance_severity/treatment/year), data = data)
summary(VAImodel1)
When I run this, I get the error message: "Warning message:
In aov(mean_vai ~ disturbance_severity * treatment * Year + Error(Replicate/disturbance_severity/treatment/Year), :
Error() model is singular"
I have observations from different years nested within different treatments, which are nested within different disturbance severities, and all of this nested within replicates (which are experimental blocks). So I tried using this structure in lme4:
library(lme4)
library(lmerTest)
VAImodel2 <- lmer(all_vai ~ (year|replicate:disturbance_severity:treatment) + disturbance_severity*treatment*year, data = data)
summary(VAImodel2)
And this is the error message I get: "Error: number of observations (=125) <= number of random effects (=128) for term (Year | Replicate:disturbance_severity:treatment); the random-effects parameters and the residual variance (or scale parameter) are probably unidentifiable"
Next I tried simplifying my model so that I was not running out of degrees of freedom, by removing the treatment variable and interaction term, like so:
VAImodel3 <- lmer(all_vai ~ (year|replicate:disturbance_severity) + disturbance_severity*year, data = data)
summary(VAImodel3)
This time I get a different error: "boundary (singular) fit: see ?isSingular
Warning message:
Model failed to converge with 1 negative eigenvalue: -1.2e-01 "
Thank you in advance for any help.
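The problem lies in how the data are prepared. In your example, year, disturbance_severity and replicate all change together in blocks of 32 rows, so the factors are completely confounded instead of crossed.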
Let's start by defining values for your variables year, disturbance_severity, treatment, replicate.
library(tidyverse)
set.seed(123)
yars = 2018:2021
disturbances = c(0,45,65,85)
treatments = c("B" , "T")
replicates = c("A", "B", "C", "D")
n = length(yars)*length(disturbances)*length(treatments)*length(replicates)*1
nNA=3
Note that I first created the variables yars, disturbances, treatments and replicates, each holding all the allowed values of one factor.
Then I computed the number of observations n (you can increase the last factor in the product from 1 to, e.g., 10 to get more observations per cell) and set the number of missing values in nNA.
The key step is the call expand.grid(yars, disturbances, treatments, replicates), which returns a table containing every combination of the factor values.
Look at the first few rows of what expand.grid returns.
Var1 Var2 Var3 Var4
1 2018 0 B A
2 2019 0 B A
3 2020 0 B A
4 2021 0 B A
5 2018 45 B A
6 2019 45 B A
7 2020 45 B A
8 2021 45 B A
9 2018 65 B A
10 2019 65 B A
11 2020 65 B A
12 2021 65 B A
13 2018 85 B A
14 2019 85 B A
15 2020 85 B A
16 2021 85 B A
17 2018 0 T A
18 2019 0 T A
This is crucial here.
The next step is straightforward: we build a tibble and pass it to the aov function.
data = tibble(sample(c(rnorm(n-nNA, mean = 6, sd = 1), rep(NA, nNA)), n)) %>%
mutate(expand.grid(yars, disturbances, treatments, replicates)) %>%
rename_with(~c("all_vai", "year", "disturbance_severity", "treatment", "replicate"))
VAImodel1 <- aov(all_vai ~ disturbance_severity*treatment*year +
Error(replicate/disturbance_severity/treatment/year), data = data)
summary(VAImodel1)
output
Error: replicate
Df Sum Sq Mean Sq F value Pr(>F)
disturbance_severity 1 0.1341 0.1341 0.093 0.811
treatment 1 0.0384 0.0384 0.027 0.897
Residuals 1 1.4410 1.4410
Error: replicate:disturbance_severity
Df Sum Sq Mean Sq F value Pr(>F)
disturbance_severity 1 0.1391 0.1391 0.152 0.763
treatment 1 0.1819 0.1819 0.199 0.733
year 1 1.4106 1.4106 1.545 0.431
Residuals 1 0.9129 0.9129
Error: replicate:disturbance_severity:treatment
Df Sum Sq Mean Sq F value Pr(>F)
treatment 1 0.4647 0.4647 0.698 0.491
year 1 0.8127 0.8127 1.221 0.384
Residuals 2 1.3311 0.6655
Error: replicate:disturbance_severity:treatment:year
Df Sum Sq Mean Sq F value Pr(>F)
treatment 1 2.885 2.8846 3.001 0.144
year 1 0.373 0.3734 0.388 0.560
treatment:year 1 0.002 0.0015 0.002 0.970
Residuals 5 4.806 0.9612
Error: Within
Df Sum Sq Mean Sq F value Pr(>F)
treatment 1 0.03 0.031 0.039 0.8430
year 1 1.29 1.292 1.662 0.2002
treatment:year 1 4.30 4.299 5.532 0.0206 *
Residuals 102 79.26 0.777
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Now the "Error() model is singular" warning no longer appears.
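A quick way to see the difference between the two data layouts is to cross-tabulate the design factors. The sketch below uses only base R; the names orig and grid are illustrative. It rebuilds the design columns as in the question and with expand.grid, then counts which combinations of replicate and disturbance severity actually occur:

```r
# Design columns as built in the question: every factor changes in blocks of 32 or 64 rows
orig <- data.frame(
  year                 = rep(2018:2021, each = 32),
  disturbance_severity = rep(c(0, 45, 65, 85), each = 32),
  treatment            = rep(c("B", "T"), each = 64),
  replicate            = rep(c("A", "B", "C", "D"), each = 32)
)
# Only the 4 diagonal cells are filled: replicate and severity are completely aliased
table(orig$replicate, orig$disturbance_severity)

# Fully crossed design: one row per combination of the four factors
grid <- expand.grid(
  year                 = 2018:2021,
  disturbance_severity = c(0, 45, 65, 85),
  treatment            = c("B", "T"),
  replicate            = c("A", "B", "C", "D")
)
# All 16 replicate x severity cells are filled, with 8 rows each
table(grid$replicate, grid$disturbance_severity)

# Convert the design columns to factors, as the question does, before modelling
grid[] <- lapply(grid, factor)
str(grid)
```

With the question's layout a term such as replicate:disturbance_severity carries no information beyond replicate itself, which is one way an Error() stratum ends up singular.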
I am trying to obtain Bonferroni simultaneous confidence intervals in R. I have the following data set that I made up for practice:
df2 <- read.table(textConnection(
'group value
1 25
2 36
3 42
4 50
1 27
2 35
3 49
4 57
1 22
2 37
3 45
4 51'), header = TRUE)
I have tried
aov(formula = value ~ group, data = df2)
However, this doesn't output simultaneous confidence intervals. Using SAS, the calculations should come out as:
There seem to be some conceptual and coding mistakes.
df2$group needs to be a categorical variable (a factor) for your ANOVA to work. At the moment it is numeric.
You want to perform what's called a post-hoc analysis, which adjusts the pairwise group comparisons for multiple testing.
Here is an example using the R package DescTools, based on the sample data you give:
# Step 1: Make sure that group is a factor
df2$group <- as.factor(df2$group)
# Step 2: Perform ANOVA
res <- aov(formula = value ~ group, data = df2)
# Step 3: Perform post-hoc analysis
library(DescTools)
PostHocTest(res, method = "bonferroni")
#
# Posthoc multiple comparisons of means : Bonferroni
# 95% family-wise confidence level
#
#$group
# diff lwr.ci upr.ci pval
#2-1 11.333333 3.0519444 19.61472 0.00855 **
#3-1 20.666667 12.3852778 28.94806 0.00014 ***
#4-1 28.000000 19.7186111 36.28139 1.5e-05 ***
#3-2 9.333333 1.0519444 17.61472 0.02648 *
#4-2 16.666667 8.3852778 24.94806 0.00067 ***
#4-3 7.333333 -0.9480556 15.61472 0.09062 .
#
#---
#Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The reported differences between the group means and confidence intervals match the SAS numbers you give.
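For readers without DescTools, the same kind of interval can be sketched in base R. This is an illustrative reconstruction of the textbook Bonferroni formula (difference of means plus or minus a t quantile at level 1 - alpha/(2m), times the pooled standard error), not DescTools' own code; the names mse, tcrit and ci are made up for the example:

```r
# Data from the question, with group stored as a factor
df2 <- data.frame(
  group = factor(rep(1:4, times = 3)),
  value = c(25, 36, 42, 50, 27, 35, 49, 57, 22, 37, 45, 51)
)
fit <- aov(value ~ group, data = df2)

# Pooled error variance and per-group summaries
mse   <- deviance(fit) / df.residual(fit)
means <- sapply(split(df2$value, df2$group), mean)
n     <- sapply(split(df2$value, df2$group), length)

# All pairwise comparisons; m = 6 for four groups
pairs_idx <- combn(levels(df2$group), 2)
m <- ncol(pairs_idx)

# Bonferroni critical value for 95% family-wise confidence
tcrit <- qt(1 - 0.05 / (2 * m), df.residual(fit))

ci <- t(apply(pairs_idx, 2, function(p) {
  d  <- means[[p[2]]] - means[[p[1]]]
  se <- sqrt(mse * (1 / n[[p[1]]] + 1 / n[[p[2]]]))
  c(diff = d, lwr = d - tcrit * se, upr = d + tcrit * se)
}))
rownames(ci) <- paste(pairs_idx[2, ], pairs_idx[1, ], sep = "-")
ci
```

The rows are labelled in the same "2-1", "3-1", ... order as the PostHocTest table, so the two can be compared line by line.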
I am trying to reproduce output from PROC MIXED in SAS, which uses the Satterthwaite approximation, with the lmerTest package in R.
This is my data:
Participant Condition Data
1 0 -1,032941629
1 0 0,869267841
1 0 -1,636722191
1 0 -1,15451393
1 0 0,340454836
1 0 -0,399315906
1 1 0,668983169
1 1 1,937817592
1 1 3,110013393
1 1 3,23409718
2 0 0,806881925
2 1 2,71020911
2 1 3,406864275
2 1 1,494288182
2 1 0,741827047
2 1 2,532062685
2 1 3,702118917
2 1 1,825046681
2 1 4,37167021
2 1 1,85125279
3 0 0,288743786
3 0 1,024396121
3 1 2,051281876
3 1 0,24543851
3 1 3,349677964
3 1 1,565395822
3 1 3,077031712
3 1 1,087494708
3 1 1,546150033
3 1 0,440249347
Using the following statement in SAS:
proc mixed data=mbd;
class participant;
model data = condition / solution ddfm=sat;
random intercept condition / sub=participant;
run;
I get this output:
My problem is that I can't seem to reproduce these results using lmerTest in R.
I thought that lmer(Data ~ Condition + (1 | Participant) + (Condition | Participant), REML=TRUE) was the equivalent of what I did in SAS, but it gives different results. Note that the degrees of freedom are way off from the SAS output, so I think I'm estimating parameters in R that I'm not estimating in SAS. I tried several other statements in R but didn't manage to get exactly the same output. This should be possible, however, since the lmer() function from the lmerTest package also uses the Satterthwaite approximation and should give the same results as PROC MIXED in SAS.
Does anybody know what I'm doing wrong in R?
Thanks a lot!
Bart
You don't specify the same random effects as in your SAS example. (Condition | Participant) is expanded internally to (1 + Condition | Participant), which fits a random intercept, a random slope and the covariance between them [1]. So, you have two additional parameters (an intercept variance and the covariance) in your model. Uncorrelated random effects can be specified using || in lme4 syntax. Note how the formula gets expanded in the summary output.
library(lmerTest)
fit <- lmer(Data ~ Condition + (Condition || Participant), REML=TRUE, data = DF)
summary(fit)
#Linear mixed model fit by REML
#t-tests use Satterthwaite approximations to degrees of freedom ['lmerMod']
#Formula: Data ~ Condition + ((1 | Participant) + (0 + Condition | Participant))
# Data: DF
#
#REML criterion at convergence: 90.6
#
#Scaled residuals:
# Min 1Q Median 3Q Max
#-1.58383 -0.78970 -0.06993 0.87801 1.91237
#
#Random effects:
# Groups Name Variance Std.Dev.
# Participant (Intercept) 0.00000 0.000
# Participant.1 Condition 0.07292 0.270
# Residual 1.20701 1.099
#Number of obs: 30, groups: Participant, 3
#
#Fixed effects:
# Estimate Std. Error df t value Pr(>|t|)
#(Intercept) -0.09931 0.36621 26.50400 -0.271 0.788363
#Condition 2.23711 0.46655 12.05700 4.795 0.000432 ***
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Correlation of Fixed Effects:
# (Intr)
#Condition -0.785
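The answer does not show how DF was created. Since the data in the question use decimal commas, one plausible way to load them is read.table() with dec = ","; this is an assumption about the setup, not something stated in the thread. The fit is only attempted if lmerTest happens to be installed:

```r
# Data from the question, pasted as text; note the decimal commas
Lines <- "Participant Condition Data
1 0 -1,032941629
1 0 0,869267841
1 0 -1,636722191
1 0 -1,15451393
1 0 0,340454836
1 0 -0,399315906
1 1 0,668983169
1 1 1,937817592
1 1 3,110013393
1 1 3,23409718
2 0 0,806881925
2 1 2,71020911
2 1 3,406864275
2 1 1,494288182
2 1 0,741827047
2 1 2,532062685
2 1 3,702118917
2 1 1,825046681
2 1 4,37167021
2 1 1,85125279
3 0 0,288743786
3 0 1,024396121
3 1 2,051281876
3 1 0,24543851
3 1 3,349677964
3 1 1,565395822
3 1 3,077031712
3 1 1,087494708
3 1 1,546150033
3 1 0,440249347"

# dec = "," makes R parse "-1,03" as the number -1.03 instead of a string
DF <- read.table(text = Lines, header = TRUE, dec = ",")
str(DF)

# Uncorrelated random intercept and slope, as in the answer
if (requireNamespace("lmerTest", quietly = TRUE)) {
  fit <- lmerTest::lmer(Data ~ Condition + (Condition || Participant),
                        REML = TRUE, data = DF)
  print(summary(fit))
}
```

Without dec = ",", the Data column would be read as character (or a factor in older R versions) and lmer() would not treat it as a numeric response.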
I'm trying to run anova() in R and running into some difficulty. This is what I've done up to now to help shed some light on my question.
Here is the str() of my data to this point.
str(mhw)
'data.frame': 500 obs. of 5 variables:
$ r : int 1 2 3 4 5 6 7 8 9 10 ...
$ c : int 1 1 1 1 1 1 1 1 1 1 ...
$ grain: num 3.63 4.07 4.51 3.9 3.63 3.16 3.18 3.42 3.97 3.4 ...
$ straw: num 6.37 6.24 7.05 6.91 5.93 5.59 5.32 5.52 6.03 5.66 ...
$ Quad : Factor w/ 4 levels "NE","NW","SE",..: 2 2 2 2 2 2 2 2 2 2 ...
Column r is a numerical value indicating the row of the field in which an individual plot lies
Column c is a numerical value indicating the column of the field in which an individual plot lies
Column Quad gives the quadrant of the field in which each plot lies
Quad <- ifelse(mhw$c > 13 & mhw$r < 11, "NE",ifelse(mhw$c < 13 & mhw$r < 11,"NW", ifelse(mhw$c < 13 & mhw$r >= 11, "SW","SE")))
mhw <- cbind(mhw, Quad)
I have fit a lm() as follows
nov.model <- lm(grain ~ straw, data = mhw)
anova(nov.model)
This is an anova() for the entire field, testing grain yield against straw yield across all plots in the dataset.
My trouble is that I want to run a separate anova() for each level of the Quad column, to test grain yield against straw yield within each quadrant.
Perhaps with() might do that. I have never used it before and am still learning R. Any help would be greatly appreciated.
I think you are looking for the by() facility in R.
fit <- with(mhw, by(mhw, Quad, function (dat) lm(grain ~ straw, data = dat)))
Since you have 4 levels in Quad, you end up with 4 linear models in fit, i.e., fit is a "by" class object (a type of "list") of length 4.
To get the coefficients of each model, you can use
sapply(fit, coef)
To produce model summary, use
lapply(fit, summary)
To export ANOVA table, use
lapply(fit, anova)
As a reproducible example, I am taking the example from ?by:
tmp <- with(warpbreaks,
by(warpbreaks, tension,
function(x) lm(breaks ~ wool, data = x)))
class(tmp)
# [1] "by"
mode(tmp)
# [1] "list"
sapply(tmp, coef)
# L M H
#(Intercept) 44.55556 24.000000 24.555556
#woolB -16.33333 4.777778 -5.777778
lapply(tmp, anova)
#$L
#Analysis of Variance Table
#
#Response: breaks
# Df Sum Sq Mean Sq F value Pr(>F)
#wool 1 1200.5 1200.50 5.6531 0.03023 *
#Residuals 16 3397.8 212.36
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#$M
#Analysis of Variance Table
#
#Response: breaks
# Df Sum Sq Mean Sq F value Pr(>F)
#wool 1 102.72 102.722 1.2531 0.2795
#Residuals 16 1311.56 81.972
#
#$H
#Analysis of Variance Table
#
#Response: breaks
# Df Sum Sq Mean Sq F value Pr(>F)
#wool 1 150.22 150.222 2.3205 0.1472
#Residuals 16 1035.78 64.736
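If the goal is one compact table instead of a list of printed ANOVA tables, the per-group fits can be summarised with split() and sapply(). This is a sketch in base R on the same warpbreaks example; the names fits and tab are illustrative:

```r
# One lm per tension level; split() + lapply() is equivalent to the by() call above
fits <- lapply(split(warpbreaks, warpbreaks$tension),
               function(d) lm(breaks ~ wool, data = d))

# Collect coefficients plus the F statistic and p-value of the wool term
tab <- t(sapply(fits, function(f) {
  a <- anova(f)
  c(coef(f), F = a[["F value"]][1], p = a[["Pr(>F)"]][1])
}))
tab
```

Each row of tab corresponds to one tension level, so the same pattern applied to mhw would give one row per quadrant.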
I was aware of this option, but not familiar with it. Thanks to @Roland for providing code for the above reproducible example:
library(nlme)
lapply(lmList(breaks ~ wool | tension, data = warpbreaks), anova)
For your data I think it would be
fit <- lmList(grain ~ straw | Quad, data = mhw)
lapply(fit, anova)
You don't need to install nlme; it comes with R as one of the recommended packages.
Sometimes your research may predict that the size of a regression coefficient may vary across groups. For example, you might believe that the regression coefficient of height predicting weight would differ across three age groups (young, middle age, senior citizen). Below, we have a data file with 3 fictional young people, 3 fictional middle age people, and 3 fictional senior citizens, along with their height and their weight. The variable age indicates the age group and is coded 1 for young people, 2 for middle aged, and 3 for senior citizens.
So, how can I compare regression coefficients (slope mainly) across three (or more) groups using R?
Sample data:
age height weight
1 56 140
1 60 155
1 64 143
2 56 117
2 60 125
2 64 133
3 74 245
3 75 241
3 82 269
There is an elegant answer to this on Cross Validated.
But briefly,
require(emmeans)
data <- data.frame(age = factor(c(1,1,1,2,2,2,3,3,3)),
height = c(56,60,64,56,60,64,74,75,82),
weight = c(140,155,142,117,125,133,245,241,269))
model <- lm(weight ~ height*age, data)
anova(model) #check the results
Analysis of Variance Table
Response: weight
Df Sum Sq Mean Sq F value Pr(>F)
height 1 25392.3 25392.3 481.5984 0.0002071 ***
age 2 2707.4 1353.7 25.6743 0.0129688 *
height:age 2 169.0 84.5 1.6027 0.3361518
Residuals 3 158.2 52.7
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
slopes <- emtrends(model, 'age', var = 'height') #gets each slope
slopes
age height.trend SE df lower.CL upper.CL
1 0.25 1.28 3 -3.84 4.34
2 2.00 1.28 3 -2.09 6.09
3 3.37 1.18 3 -0.38 7.12
Confidence level used: 0.95
pairs(slopes) #gets their comparisons two by two
contrast estimate SE df t.ratio p.value
1 - 2 -1.75 1.82 3 -0.964 0.6441
1 - 3 -3.12 1.74 3 -1.790 0.3125
2 - 3 -1.37 1.74 3 -0.785 0.7363
P value adjustment: tukey method for comparing a family of 3 estimates
To determine whether the regression coefficients "differ across three age groups", we can use the anova function in R. For example, using the data from the question (shown reproducibly in Note 4 at the end):
fm1 <- lm(weight ~ height, DF)
fm3 <- lm(weight ~ age/(height - 1), DF)
Comparing the two models with anova gives the following, which is significant at the 2.7% level, so we would conclude that the regression coefficients differ between the groups at a 5% cutoff but not at a 1% cutoff. The age/(height - 1) in the formula for fm3 means that height is nested in age and the overall intercept is omitted, so the model estimates a separate intercept and slope for each age group. This is equivalent to age + age:height - 1.
> anova(fm1, fm3)
Analysis of Variance Table
Model 1: weight ~ height
Model 2: weight ~ age/(height - 1)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 7 2991.57
2 3 149.01 4 2842.6 14.307 0.02696 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
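The equivalence claimed for the nested formula can be checked directly. The sketch below, using the question's data, confirms that age/(height - 1) and age + age:height - 1 give the same fit and that the coefficients match separate per-group regressions. The final comparison against a common-slope model (called parallel here, a name introduced for illustration) is an extra not in the answer:

```r
DF <- data.frame(
  age    = factor(rep(1:3, each = 3)),
  height = c(56, 60, 64, 56, 60, 64, 74, 75, 82),
  weight = c(140, 155, 143, 117, 125, 133, 245, 241, 269)
)

fm1  <- lm(weight ~ height, DF)                  # one line for everyone
fm3  <- lm(weight ~ age/(height - 1), DF)        # nested formula from the answer
fm3b <- lm(weight ~ age + age:height - 1, DF)    # the expanded form

# Same model, just written differently
all.equal(fitted(fm3), fitted(fm3b))

# Separate regressions per age group: intercepts and slopes agree with coef(fm3)
by_group <- sapply(split(DF, DF$age),
                   function(d) coef(lm(weight ~ height, data = d)))
by_group
coef(fm3)

# Slopes only: separate intercepts but a common slope, compared with fm3
parallel <- lm(weight ~ age + height, DF)
anova(parallel, fm3)
```

The fm1 versus fm3 comparison tests intercepts and slopes jointly, whereas parallel versus fm3 asks only whether the slopes differ once each group has its own intercept.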
Note 1: Above fm3 has 6 coefficients, an intercept and slope for each group. If you want 4 coefficients, a common intercept and separate slopes, then use
lm(weight ~ age:height, DF)
Note 2: We can also compare a model in which subsets of levels are the same. For example, we can compare a model in which ages 1 and 2 are the same to models in which they are all the same (fm1) and all different (fm3):
fm2 <- lm(weight ~ age/(height - 1), transform(DF, age = factor(c(1, 1, 3)[age])))
anova(fm1, fm2, fm3)
If you run a large number of tests, some will be significant just by chance, so you will want to lower the cutoff for p values.
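One common way to tighten the cutoff is to adjust the p-values themselves with base R's p.adjust(). The sketch below applies it to the two p-values from the three-model comparison; the choice of Bonferroni and Holm is an illustration, not something the answer prescribes:

```r
DF <- data.frame(
  age    = factor(rep(1:3, each = 3)),
  height = c(56, 60, 64, 56, 60, 64, 74, 75, 82),
  weight = c(140, 155, 143, 117, 125, 133, 245, 241, 269)
)

fm1 <- lm(weight ~ height, DF)
fm2 <- lm(weight ~ age/(height - 1),
          transform(DF, age = factor(c(1, 1, 3)[age])))  # ages 1 and 2 merged
fm3 <- lm(weight ~ age/(height - 1), DF)

# Sequential comparisons: fm1 vs fm2, then fm2 vs fm3
tab   <- anova(fm1, fm2, fm3)
p_raw <- tab[["Pr(>F)"]][-1]   # drop the NA in the first row

# Adjusted p-values can be compared against the usual 0.05 cutoff
p_bonf <- p.adjust(p_raw, method = "bonferroni")
p_holm <- p.adjust(p_raw, method = "holm")
cbind(p_raw, p_bonf, p_holm)
```

Holm's method is never more conservative than Bonferroni, which is why it is often preferred when only the family-wise error rate needs to be controlled.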
Note 3: There are some notes on lm formulas here: https://sites.google.com/site/r4naturalresources/r-topics/fitting-models/formulas
Note 4: We used this as the input:
Lines <- "age height weight
1 56 140
1 60 155
1 64 143
2 56 117
2 60 125
2 64 133
3 74 245
3 75 241
3 82 269"
DF <- read.table(text = Lines, header = TRUE)
DF$age <- factor(DF$age)