How to estimate residuals of subgroup with lme4 in R

I'd like to reproduce the results reported in Hoffman & Rovine's work (Multilevel models for the experimental psychologist: foundations and illustrative examples) with the lme4 package in R.
In their first example they compared reaction time between older and younger adults. Each participant completed many task trials, so at the individual level (level 1), reaction times were affected by variables describing the trial manipulations. At the second level, participants' age and age group affected their reaction times.
In Hoffman's model 2B, they estimate the first-level residuals for older and younger adults separately, using two dummy variables, one for the young group and one for the old group.
Hoffman's level-1 equation is shown as an image in the original post; it specifies a separate level-1 residual term for each age group.
I'd like to know how to estimate the two residual variances with the lme4 package.
Hoffman's article and example data can be found on Hoffman's website.
I've successfully replicated their result for model 2A, where the residual variances of the young and old groups are assumed to be equal, with the following code.
lmer(lg_rt ~ c_mean + c_sal + (1|Item) + oldage + yrs65 + (1|id), Ex1, REML = F)

You can handle heteroscedasticity in lme4 using the modular fitting functions. Here is an example with two groups, which should extend to other types of heteroscedasticity. Note that although the weights are estimated, the uncertainty about the weights is not taken into account in the standard errors of the parameters in the final fit. It should be possible to address this with the delta method; see, e.g., the first equation in Section 2.3.3 of https://doi.org/10.3102/1076998611417628.
set.seed(1234)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)
library(lme4)
#> Loading required package: Matrix
#>
#> Attaching package: 'Matrix'
#> The following objects are masked from 'package:tidyr':
#>
#> expand, pack, unpack
n <- 100 # number of level-2 units
m <- 3 # number of repeated observations per unit
sd_b <- .3 # random intercept standard deviation
sd_eps1 <- .1 # residual standard deviation in group 1
sd_eps2 <- .3 # residual standard deviation in group 2
# Simulate data
dat <- tibble(
  # unique ID
  id = seq_len(n),
  # explanatory variable, constant over repetitions
  x = runif(n),
  # random intercept
  b = rnorm(n, sd = sd_b),
  # group membership
  grp = sample(1:2, n, replace = TRUE)
) %>%
  uncount(3) %>%
  mutate(
    # residual
    eps = rnorm(nrow(.), sd = c(sd_eps1, sd_eps2)[grp]),
    # response, fixed effect is beta=1
    y = x + b + eps
  )
# now optimize over residual weights, fixing the group 1 weight to 1.
# optimize() would be sufficient, but I show it with optim() because it
# then can be directly extended to a larger number of groups
opt <- optim(
  # initial value for group 2 residual weight relative to group 1
  par = 1,
  fn = function(weight){
    # Compute weights from group variable
    df <- dat %>%
      mutate(weight = c(1, weight)[grp])
    ## 1. Parse the data and formula:
    lmod <- lFormula(y ~ x + (1|id), data = df, weights = df$weight)
    ## 2. Create the deviance function to be optimized:
    devfun <- do.call(mkLmerDevfun, lmod)
    ## 3. Optimize the deviance function:
    opt <- optimizeLmer(devfun)
    # return the deviance
    opt$fval
  },
  # Use a method that allows box constraints
  method = "L-BFGS-B",
  # Weight cannot be negative
  lower = 0.01
)
# The weight estimates the following ratio, and it is pretty close
sd_eps1^2/sd_eps2^2
#> [1] 0.1111111
opt$par
#> [1] 0.1035914
# We can now fit the final model at the chosen weights
df <- dat %>%
mutate(weight = c(1, opt$par)[grp])
mod <- lmer(y ~ x + (1|id), data = df, weights = df$weight)
# Our estimate of sd_eps1
sigma(mod)
#> [1] 0.09899687
# True value
sd_eps1
#> [1] 0.1
# Our estimate of sd_eps2
sigma(mod) * sqrt(1/opt$par)
#> [1] 0.307581
# True value
sd_eps2
#> [1] 0.3
Created on 2021-02-10 by the reprex package (v1.0.0)
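As a quick check back on the original question (group-specific residuals), and not part of the answer above, the group-wise standard deviations of the raw residuals from the final fit should land close to the simulated values; a minimal sketch:
# Sketch: group-wise SD of raw residuals (approximate, since the residuals
# are conditional on the shrunken random effects)
df %>%
  mutate(res = resid(mod)) %>%
  group_by(grp) %>%
  summarise(res_sd = sd(res))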

Related

R squared in gls regression

I want to create a gls regression table that includes the R-squared value and the number of observations where the "log likelihood" etc. values currently appear. The p-values should be below the coefficients in the table. Here is an example of the code:
# import the necessary packages
library(nlme)
library(dplyr)
library(stargazer)
# create a new subset that only includes observations with a value in the "Price.Book.Value" column
dotcom_subset_MBV <- dotcom_subset %>% filter(!is.na(Price.Book.Value))
financial_subset_MBV <- financial_subset %>% filter(!is.na(Price.Book.Value))
covid_subset_MBV <- covid_subset %>% filter(!is.na(Price.Book.Value))
# Hypothesis 2: Fit GLS models
dotcom_model_MBV <- gls(X1.Month.Equity.Premium ~ crisis*Price.Book.Value, data = dotcom_subset_MBV, method = "ML")
financial_model_MBV <- gls(X1.Month.Equity.Premium ~ crisis*Price.Book.Value, data = financial_subset_MBV, method = "ML")
covid_model_MBV <- gls(X1.Month.Equity.Premium ~ crisis*Price.Book.Value, data = covid_subset_MBV, method = "ML")
library(stargazer)
stargazer(dotcom_model_MBV, financial_model_MBV, covid_model_MBV, type = "text",column.labels = c("Dotcom","Financial","Covid"),report=('vc*p'))
The only problem with the code above is that it shows the Log Likelihood, Akaike Inf. Crit. and Bayesian Inf. Crit. instead of the R squared values. The rest would be okay.
I tried the following:
omit.stat = c("ll", "AIC", "BIC")
and it works. However, it still doesn't show me the R squared. Then I tried:
add.lines = list(c(paste0("R-squared = ", round(r2_dotcom, 2)
and it includes a line that is called "R Squared" but without any values.
Here's a working example using the mtcars data:
data(mtcars)
library(nlme)
library(stargazer)
#>
#> Please cite as:
#> Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
#> R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
m1 <- gls(qsec ~ cyl + wt, data=mtcars)
m2 <- gls(mpg ~ cyl + wt, data=mtcars)
r2 <- c(cor(fitted(m1), mtcars$qsec)^2,
        cor(fitted(m2), mtcars$mpg)^2)
stargazer(m1, m2,
          type = "text",
          omit.stat = c("ll", "AIC", "BIC"),
          add.lines = list(c("R-squared", sprintf("%.2f", r2))))
#>
#> =========================================
#>                  Dependent variable:
#>              ----------------------------
#>                   qsec           mpg
#>                   (1)            (2)
#> -----------------------------------------
#> cyl            -1.173***      -1.508***
#>                 (0.197)        (0.415)
#>
#> wt              1.356***      -3.191***
#>                 (0.360)        (0.757)
#>
#> Constant       20.743***      39.686***
#>                 (0.815)        (1.715)
#>
#> -----------------------------------------
#> R-squared         0.56           0.83
#> Observations       32             32
#> =========================================
#> Note:         *p<0.1; **p<0.05; ***p<0.01
Created on 2023-02-01 by the reprex package (v2.0.1)
The add.lines option expects a list, and each vector in the list will be appended to the table with a different element in each column. So you need a vector whose first element is the label "R-squared", followed by each of the R-squared values printed as a string (that's what sprintf() does).
Also note that I've calculated the R-squared as the squared correlation between the observed and fitted values, but make no claim that this is statistically sound (though it is one way of calculating R-squared in the OLS model).
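As an aside (my addition, not part of the original answer), another common convention is a pseudo R-squared computed as one minus the residual sum of squares over the total sum of squares; a minimal sketch using the models above:
# Sketch: pseudo R-squared as 1 - RSS/TSS (just one of several conventions)
r2_alt <- c(1 - sum(resid(m1)^2) / sum((mtcars$qsec - mean(mtcars$qsec))^2),
            1 - sum(resid(m2)^2) / sum((mtcars$mpg - mean(mtcars$mpg))^2))
round(r2_alt, 2)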

How to plot an interaction effect within an ordinal regression model?

I have an ordinal model for predicting anxiety severity, using the clm() function. Within the model there is a significant interaction effect between two of the variables. I am looking for a way to demonstrate the interaction effect visually, so I can interpret the effect with greater clarity. I have tried looking for different options, but I can't find any that work with the clm() function for modelling ordinal regressions. The error I get for the example below is "Error in UseMethod("family") :
no applicable method for 'family' applied to an object of class 'clm'". Modelling the ordinal regression with the polr function gives the same response.
m <- clm(anxiety_levels ~ pred_a * pred_b + pred_c, data, link = "logit")
interact_plot(m, pred = pred_a, modx = pred_b)
Any suggestions on how to plot the interaction from an ordinal regression would be greatly appreciated.
The marginaleffects package supports clm models. (Disclaimer: I am the author.)
However, note that the development version includes an important bug fix. You can install it like this:
library(remotes)
install_github("vincentarelbundock/marginaleffects")
Make sure you restart R completely; then we can estimate, summarize, and plot the results. Note that we often use facet_wrap(~group) to present results for each outcome level separately in the plots:
library(ordinal)
library(ggplot2)
library(marginaleffects)
dat <- data.frame(
  y = factor(sample(letters[1:5], 100, replace = TRUE)),
  x1 = rnorm(100),
  x2 = rnorm(100))
mod <- clm(y ~ x1 * x2, data = dat)
Adjusted predictions
plot_cap(mod, condition = "x1") + facet_wrap(~group)
plot_cap(mod, condition = c("x1", "x2")) + facet_wrap(~group)
Marginal effects (slope)
mfx <- marginaleffects(mod)
summary(mfx)
#> Average marginal effects
#> Group Term Effect Std. Error z value Pr(>|z|) 2.5 % 97.5 %
#> 1 a x1 -0.0066902 0.021233 -0.31508 0.75270 -0.04831 0.034927
#> 2 a x2 -0.0018825 0.021849 -0.08616 0.93134 -0.04471 0.040941
#> 3 b x1 -0.0058553 0.017050 -0.34343 0.73128 -0.03927 0.027561
#> 4 b x2 -0.0022098 0.017486 -0.12637 0.89944 -0.03648 0.032062
#> 5 c x1 -0.0017640 0.004803 -0.36729 0.71341 -0.01118 0.007649
#> 6 c x2 -0.0015586 0.004723 -0.33002 0.74138 -0.01081 0.007698
#> 7 d x1 0.0031011 0.010273 0.30186 0.76276 -0.01703 0.023236
#> 8 d x2 0.0006885 0.010593 0.06499 0.94818 -0.02007 0.021449
#> 9 e x1 0.0112084 0.031057 0.36089 0.71818 -0.04966 0.072080
#> 10 e x2 0.0049625 0.031719 0.15645 0.87568 -0.05721 0.067130
#>
#> Model type: clm
#> Prediction type: response
plot(mfx) + facet_wrap(~group)
plot_cme(mod, effect = "x1", condition = "x2") + facet_wrap(~group)
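A side note, not from the original answer: in later releases of marginaleffects these functions were renamed. Assuming a recent version, the equivalent calls look roughly like this (exact arguments may differ slightly between versions):
# Sketch, assuming a recent marginaleffects release with the renamed functions
plot_predictions(mod, condition = "x1")                # replaces plot_cap()
avg_slopes(mod)                                        # replaces summary(marginaleffects(mod))
plot_slopes(mod, variables = "x1", condition = "x2")   # replaces plot_cme()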

including non linearity in fixed effects model in plm

I am trying to build a fixed effects regression with the plm package in R. I am using country level panel data with year and country fixed effects.
My problem concerns 2 explanatory variables. One is an interaction term of two variables and one is a squared term of one of the variables.
The model is basically:
y = x1 + x1^2 + x2 + x1*x2 + ... + xn, with all of the variables being in log form
It is central to the model to include the squared term, but when I run the regression it always gets excluded because of "singularities", as x1 and x1^2 are obviously correlated.
Meaning the regression works and I get estimates for my variables, just not for x1^2 and x1*x2.
How do I circumvent this?
library(plm)
fe_reg <- plm(log(y) ~ log(x1) + log(x2) + log(x2^2) + log(x1*x2) + dummy,
              data = df,
              index = c("country", "year"),
              model = "within",
              effect = "twoways")
summary(fe_reg)
# I have tried defining the interaction and squared terms as vectors, which helped with the
# interaction term but not the squared term.
df1.pd <- df1 %>% mutate_at(c('x1'), ~(scale(.) %>% as.vector))
df1.pd <- df1 %>% mutate_at(c('x2'), ~(scale(.) %>% as.vector))
I am pretty new to R, so apologies if this not a very well structured question.
You just found two properties of the logarithm function:
log(x^2) = 2 * log(x)
log(x*y) = log(x) + log(y)
Then, obviously, log(x) is collinear with 2*log(x) and one of the two collinear variables is dropped from the estimation. Same for log(x*y) and log(x) + log(y).
So, the model you want to estimate is not estimable by linear regression methods. You might want to consider data transformations other than the log, or use the original variables (if what you actually want is a quadratic in log(x1), see the sketch after the example below).
See also the reproducible example below, where I just use log(x^2) = 2*log(x). Linear dependence can be detected, e.g., via the function detect.lindep from package plm (also shown below). Coefficients being dropped from the estimation also hints at collinear columns in the model matrix. At times, linear dependence appears only after the data transformations involved in the estimation functions; for an example involving the within transformation, see the Examples section of the help page ?detect.lindep.
library(plm)
data("Grunfeld")
pGrun <- pdata.frame(Grunfeld)
pGrun$lvalue <- log(pGrun$value) # log(x)
pGrun$lvalue2 <- log(pGrun$value^2) # log(x^2) == 2 * log(x)
mod <- plm(inv ~ lvalue + lvalue2 + capital, data = pGrun, model = "within")
summary(mod)
#> Oneway (individual) effect Within Model
#>
#> Call:
#> plm(formula = inv ~ lvalue + lvalue2 + capital, data = pGrun,
#> model = "within")
#>
#> Balanced Panel: n = 10, T = 20, N = 200
#>
#> Residuals:
#> Min. 1st Qu. Median 3rd Qu. Max.
#> -186.62916 -20.56311 -0.17669 20.66673 300.87714
#>
#> Coefficients: (1 dropped because of singularities)
#> Estimate Std. Error t-value Pr(>|t|)
#> lvalue 30.979345 17.592730 1.7609 0.07988 .
#> capital 0.360764 0.020078 17.9678 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Total Sum of Squares: 2244400
#> Residual Sum of Squares: 751290
#> R-Squared: 0.66525
#> Adj. R-Squared: 0.64567
#> F-statistic: 186.81 on 2 and 188 DF, p-value: < 2.22e-16
detect.lindep(mod) # run on the model
#> [1] "Suspicious column number(s): 1, 2"
#> [1] "Suspicious column name(s): lvalue, lvalue2"
detect.lindep(pGrun) # run on the data
#> [1] "Suspicious column number(s): 6, 7"
#> [1] "Suspicious column name(s): lvalue, lvalue2"

Plotting Dose Response Curve from Survival Data

I would like to make a dose-response curve using library(drc), and am stuck on how to prepare my dataset properly in order to make the plot. In particular, I'm struggling with how to get my y-axis ready.
I made up a dataframe (df) to help clarify what I would like to do.
df <- read.table("https://pastebin.com/raw/TZdjp2JX", header=T)
Open necessary libraries for today's exercise
library(drc)
library(ggplot2)
Let's pretend I like hummingbirds and run an experiment with different concentrations of sugar, with the goal of seeing which concentration is ideal for hummingbirds. I run the experiment in a closed setting (here, column "room"), with 4 different sugar concentrations (column "concentration"), and 10 individual birds per concentration. I also run each experiment with 4 replicates in parallel, which is why there are 4 "rooms". After 36 hours (column "time"), I go into the room and check how many birds survived, creating a "yes/no" variable, or 1 & 0 (here, my column "status"), where 1 == survived and 0 == died.
With this dataset, I specifically made it so that most survived at concentration 0, 50% survived concentration 1, 25% survived concentration 2, and only 10% survived concentration 3.
My first issue is: how can I turn my y-axis, generated from my "status" column, into a percentage? I have done this for Kaplan-Meier survival curves, but that does not work here unfortunately. Obviously, this column should go from 0% - 100% (we could call the column "mortality"). After I succeed at doing this, I would like to make a dose-response curve that looks like the following (I found this example online and will copy it directly here as an example; it uses the ryegrass dataset included with drc).
ryegrass.LL.4 <- drm(rootl ~ conc, data = ryegrass, fct = LL.3())
I must admit, the next steps of code are a little confusing for me.
# new dose levels as support for the line
newdata <- expand.grid(conc=exp(seq(log(0.5), log(100), length=100)))
# predictions and confidence intervals
pm <- predict(ryegrass.LL.4, newdata=newdata, interval="confidence")
# new data with predictions
newdata$p <- pm[,1]
newdata$pmin <- pm[,2]
newdata$pmax <- pm[,3]
# plot curve
# need to shift conc == 0 a bit up, otherwise there are problems with coord_trans
ryegrass$conc0 <- ryegrass$conc
ryegrass$conc0[ryegrass$conc0 == 0] <- 0.5
# plotting the curve
ggplot(ryegrass, aes(x = conc0, y = rootl)) +
  geom_point() +
  geom_ribbon(data=newdata, aes(x=conc, y=p, ymin=pmin, ymax=pmax), alpha=0.2) +
  geom_line(data=newdata, aes(x=conc, y=p)) +
  coord_trans(x="log") +
  xlab("Ferulic acid (mM)") + ylab("Root length (cm)")
In the end, I would like to generate a similar curve, but with mortality on the y-axis, from 0-100 (starting low, going high) and also display the confidence intervals in a shaded grey area around the regression line. Meaning, my first step of code should like something like the following:
model <- drm(mortality ~ Concentration, data = df, fct = LL.3())
But I'm lost on the "mortality" creation part, and a little bit on the next step with ggplot.
Could anyone help me achieve this? From the example from ryegrass, I'm perplexed how to translate this to be helpful for my pretend dataset. I hope someone here is able to help me solve this issue! Many thanks, and I appreciate any feedback if there are other ways I should have my dataset structured, etc.
-Andy
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(ggplot2)
library(drc)
#> Loading required package: MASS
#>
#> Attaching package: 'MASS'
#> The following object is masked from 'package:dplyr':
#>
#> select
#>
#> 'drc' has been loaded.
#> Please cite R and 'drc' if used for a publication,
#> for references type 'citation()' and 'citation('drc')'.
#>
#> Attaching package: 'drc'
#> The following objects are masked from 'package:stats':
#>
#> gaussian, getInitial
df <- read.table("https://pastebin.com/raw/sH5hCr2J", header=T)
Calculating the mortality, or as I do here, the survival, can be done easily with the dplyr package. It seems that you are interested in calculating the percent survival for each Concentration across your four rooms (or replicates), so the first step is to group the data by these columns and then calculate that statistic.
df_calc <- df %>%
  group_by(Concentration, room) %>%
  summarise(surv = sum(Status)/n())
#> `summarise()` has grouped output by 'Concentration'. You can override using the `.groups` argument.
I don't know if Concentration represents arbitrary concentration levels, so I'm moving forward with the following assumption: 1 == higher levels of sugar, 2 == lower levels of sugar. The concentrations were coded in log space, so I convert to linear space:
df_calc <- mutate(df_calc, conc = exp(-Concentration))
Just to be clear, the conc variable is just my attempt at having something close to the true known concentrations of the experiment. If your data has the true concentrations, then don't mind this calculation.
df_calc
#> # A tibble: 12 x 4
#> # Groups: Concentration [3]
#> Concentration room surv conc
#> <int> <int> <dbl> <dbl>
#> 1 1 1 0.5 0.368
#> 2 1 2 0.4 0.368
#> 3 1 3 0.5 0.368
#> 4 1 4 0.6 0.368
#> 5 2 1 0 0.135
#> 6 2 2 0.4 0.135
#> 7 2 3 0.2 0.135
#> 8 2 4 0.4 0.135
#> 9 3 1 0.2 0.0498
#> 10 3 2 0 0.0498
#> 11 3 3 0 0.0498
#> 12 3 4 0.2 0.0498
mod <- drm(surv ~ conc, data = df_calc, fct = LL.3())
Make new conc data points
newdata <- data.frame(conc = exp(seq(log(0.01), log(10), length = 100)))
EDIT
To respond to your comment I'll explain the above code chunk. Again the conc variable is expected to be the unit concentration. In this hypothetical case, we have three concentration levels c(0.049, 0.135, 0.368). For brevity, lets assume the units are mg of sugar/ml of water. Our model was fit on these three dose levels with 4 data points per dose level. If we wanted, we could have just plotted the curve between these levels of c(0.049, 0.368), but in this example, I chose c(0.01, 10) mg/ml as the domain to plot on. This was just so that we could visualize where the curve would end up based on the model fit. In short, you choose the range that you are interested in most. As I show later - even though we can choose data points outside the range of the experimental data, the confidence intervals are extremely large indicating the model will be unhelpful for those points.
The reason behind casting these values with the log() function is to ensure that we are sampling points that look evenly distributed on a log10 scale (most dose-response curves are plotted with this transformation). Once we get the sequence of 100 points, we use exp() to return to the linear space (which our model was fit on). These values are then used in the predict function as the new dose levels, in conjunction with the fitted model.
All this is saved into newdata variable which allows for the plotting of the line and the confidence intervals.
Use the model and the generated data points to predict a new surv value as well as the upper and lower bounds:
newdata <- cbind(newdata,
                 suppressWarnings(predict(mod, newdata = newdata, interval = "confidence")))
plot with ggplot2
ggplot(df_calc, aes(conc)) +
  geom_point(aes(y = surv)) +
  geom_ribbon(aes(ymin = Lower, ymax = Upper), data = newdata, alpha = 0.2) +
  geom_line(aes(y = Prediction), data = newdata) +
  scale_x_log10() +
  coord_cartesian(ylim = c(0, 1))
As you may notice, the confidence intervals increase greatly when we try to predict over ranges that have no data.
Created on 2021-10-27 by the reprex package (v1.0.0)
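As an aside (my addition, not part of the original answer): drc can also fit the raw 0/1 data directly as a binomial response, which avoids pre-aggregating to proportions. A rough sketch, assuming the Status column is the 0/1 survival indicator and conc is the numeric concentration derived as above:
# Sketch: fit the quantal data directly; type = "binomial" treats the 0/1
# response as binomial (columns Status and conc are assumed as above)
df_bin <- mutate(df, conc = exp(-Concentration))
mod_bin <- drm(Status ~ conc, data = df_bin, fct = LL.3(), type = "binomial")
summary(mod_bin)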

Understanding the Output Coefficients from a Linear Model Regression in R

I'm reading a fairly simple hypothesis-testing textbook at the moment. It explains that the coefficients from a linear model, where the independent variables are two categorical variables with 2 and 3 levels respectively and the dependent variable is continuous, should be interpreted as the difference between the overall mean of the dependent variable (the mean across all categories) and the mean of the dependent variable at a given level of one of the categorical variables. I hope that's understandable.
However, when I try to reproduce the example in the book, I do not get the same coefficients, std. err., T- or P-values.
I created a reproducible example using the ToothGrowth dataset, where the same is the case:
library(tidyverse)
# Transforming Data to a Tibble and Change Variable 'dose' to a Factor:
tooth_growth_reprex <- ToothGrowth %>%
  as_tibble() %>%
  mutate(dose = as.factor(dose))
# Creating Linear Model of Variables in ToothGrowth (tg):
tg_lm <- lm(formula = len ~ supp * dose, data = tooth_growth_reprex)
# Extracting suppVC coefficient:
(coef_supp_vc <- tg_lm$coefficients["suppVC"])
#> suppVC
#> -5.25
# Calculating Mean Difference between Overall Mean and Supplement VC Mean:
## Overall Mean:
(overall_summary <- tooth_growth_reprex %>%
summarise(Mean = mean(len)))
#> # A tibble: 1 x 1
#> Mean
#> <dbl>
#> 1 18.8
## Supp VC Mean:
(supp_vc_summary <- tooth_growth_reprex %>%
  group_by(supp) %>%
  summarise(Mean = mean(len))) %>%
  filter(supp == "VC")
#> # A tibble: 1 x 2
#> supp Mean
#> <fct> <dbl>
#> 1 VC 17.0
## Difference between Overall Mean and Supp VC Mean:
(mean_dif_overall_vc <- overall_summary$Mean - supp_vc_summary$Mean[2])
#> [1] 1.85
# Testing if supp_VC coefficient and difference between Overall Mean and Supp VC Mean is near identical:
near(coef_supp_vc, mean_dif_overall_vc)
#> suppVC
#> FALSE
Created on 2021-02-23 by the reprex package (v1.0.0)
My questions:
Am I understanding the interpretation of the coefficient values completely wrong?
What is lm() actually calculating for the coefficients?
Are there any functions in R that can calculate what I'm interested in, without me having to do it manually?
I hope this is enough information. If not, please don't hesitate to ask me!
The lm() function uses dummy (treatment) coding by default, so all the coefficients in your model are differences relative to the reference group's mean. The reference group here is the combination of the first levels of your factors, so supp = OJ and dose = 0.5.
You can then do this verification like so:
coef(tg_lm)["(Intercept)"] + coef(tg_lm)["suppVC"] == mean_table %>% filter(supp=='VC' & dose==0.5) %>% pull(M)
(coef(tg_lm)["(Intercept)"] + coef(tg_lm)["suppVC"] + coef(tg_lm)["dose1"] + coef(tg_lm)["suppVC:dose1"]) == mean_table %>% filter(supp=='VC' & dose==1) %>% pull(M)
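Note that mean_table is not defined in the answer as shown; presumably it holds the cell means of len for each supp-by-dose combination. A minimal sketch of how it could be built (my assumption):
# Sketch (assumption): mean_table = cell means of len by supp and dose
mean_table <- tooth_growth_reprex %>%
  group_by(supp, dose) %>%
  summarise(M = mean(len), .groups = "drop")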
You can read into the differences here
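As an aside (my addition, not part of the original answer): if what you want are coefficients expressed as deviations from the grand mean, as the textbook describes, sum-to-zero (deviation) coding gives that directly; a rough sketch:
# Sketch: with sum-to-zero contrasts the intercept is the (unweighted) grand mean
# of the cell means, and the other coefficients are deviations from it
tg_lm_sum <- lm(len ~ supp * dose, data = tooth_growth_reprex,
                contrasts = list(supp = contr.sum, dose = contr.sum))
coef(tg_lm_sum)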
