I have put together an R code for split plot analysis by exploring the internet (thanks to data scientists for making our lives easier). However, the problem now is I have more than a 100 variables to test. Inside this code there are a few instances where I need to change the column/variable name each time, axis names for each figure, and output figure name. Also, when the Anova results are printed on the screen I manually copy the p-values and lsd scores before moving forward. Is there a way to just automate everything to save many hours of manual editing? The code I am using is as below:
data <- read.csv("file.csv", header = TRUE)
data$Genotype <- as.factor(data$Genotype)
data$N.level <- as.factor(data$N.level)
library(agricolae)
attach(data)
Here before running the model I change response variables name each time, also the model output prints anova results which I am copying manually before moving on
model <- sp.plot(block = Block,
pplot = Genotype,
splot = N.level,
Y = GPC.30DAP)
Edf_a <- model$gl.a
Edf_b <- model$gl.b
EMS_a <- model$Ea
EMS_b <- model$Eb
Again I change Y for LSD test each time, and from LSD output manually copy the results of each test 1,2,3
out1 <- LSD.test(y = GPC.30DAP,
trt = Genotype,
DFerror = Edf_a,
MSerror = EMS_a,
alpha = 0.05,
#p.adj = "bonferroni",
group = TRUE,
console = TRUE)
out2 <- LSD.test(y = GPC.30DAP,
trt = N.level,
DFerror = Edf_b,
MSerror = EMS_b,
alpha = 0.05,
#p.adj = "bonferroni",
group = TRUE,
console = TRUE)
out3 <- LSD.test(y = GPC.30DAP,
trt = Genotype:N.level,
DFerror = Edf_b,
MSerror = EMS_b,
alpha = 0.05,
#p.adj = "bonferroni",
group = TRUE,
console = TRUE)
library(dplyr)
ascend_AB = out3$groups %>%
group_by(rownames(out3$groups)) %>%
arrange(rownames(out3$groups))
print(ascend_AB)
CHANGE VARIABLE NAME AGAIN
MeanSD_AB = data %>%
group_by(Genotype, N.level) %>%
summarise(avg_AB = mean(GPC.30DAP),
sd = sd(GPC.30DAP))
print(MeanSD_AB)
attach(MeanSD_AB)
##------ Plotting "G*N interaction"
library(ggplot2)
## Create plotting object
p3 = ggplot(MeanSD_AB, aes(x = N.level,
y = avg_AB,
fill = factor(Genotype)))
print(p3)
## Plotting bars
plotA = p3 +
geom_bar(stat = "identity",
color = "black",
position = position_dodge(width=0.9))
print(plotA)
## Adding error bars
plotC = plotA +
geom_errorbar(aes(ymax = avg_AB + sd,
ymin = avg_AB - sd),
position = position_dodge(width=0.9),
width = 0.25)
print(plotC)
Changing main title, X and Y labels and legend title
plotD = plotC +
labs(title = "",
x = "",
y = "GPC (%) 30 days after anthesis",
fill = "Genotype")
print(plotD)
## Adding lettering from test applied
plotE = plotD +
geom_text(aes(x = N.level,
y = avg_AB + sd,
label = as.matrix(ascend_AB$groups)),
position = position_dodge(width = 0.9),
vjust = -(0.5))
print(plotE)
figure title change to avoid overwriting
ggsave("GPC 30 days after anthesis.jpeg", width = 10, height = 10, units = c("cm"), dpi = 300)
There’s a few things going on here which I think would be useful to go over
in some detail. Before diving in, let’s load the example data directly from pastebin:
library(agricolae)
data <- read.delim("https://pastebin.com/raw/kkjZVxsW")
data$Genotype <- as.factor(data$Genotype)
data$N.level <- as.factor(data$N.level)
data
#> Block Genotype N.level GPC.15DAP GPC.23DAP GPC.30DAP
#> 1 1 1290H Low N 4.9 2.7 2.6
#> 2 2 1290H Low N 4.6 3.3 2.3
#> 3 3 1290H Low N 5.1 2.7 3.7
#> 4 1 1290W Low N 5.0 2.4 1.9
#> 5 2 1290W Low N 4.4 3.2 2.7
#> 6 3 1290W Low N 4.5 3.0 2.4
#> 7 1 1290H High N 8.2 3.6 4.1
#> 8 2 1290H High N 7.6 4.4 3.4
#> 9 3 1290H High N 7.3 4.3 4.5
#> 10 1 1290W High N 5.5 3.9 2.8
#> 11 2 1290W High N 6.1 3.3 2.0
#> 12 3 1290W High N 5.8 3.8 2.4
Looping
You might hope to loop over the variable names along these lines:
for (variable in c("GPC.15DAP", "GPC.30DAP")) {
sp.plot(Block, Genotype, N.level, variable)
}
This won’t work because variable now contains the name of the variable in
a string, rather than the actual value of the variable. Instead, you can
get() the value given a name:
variable <- "GPC.30DAP"
with(data, {
sp.plot(Block, Genotype, N.level, get(variable))
})
#>
#> ANALYSIS SPLIT PLOT: get(variable)
#> Class level information
#>
#> Genotype : 1290H 1290W
#> N.level : Low N High N
#> Block : 1 2 3
#>
#> Number of observations: 12
#>
#> Analysis of Variance Table
#>
#> Response: get(variable)
#> Df Sum Sq Mean Sq F value Pr(>F)
#> Block 2 0.8600 0.4300 NaN NaN
#> Genotype 1 3.4133 3.4133 9.3945 0.09199 .
#> Ea 2 0.7267 0.3633 NaN NaN
#> N.level 1 1.0800 1.0800 5.6348 0.07651 .
#> Genotype:N.level 1 0.8533 0.8533 4.4522 0.10249
#> Eb 4 0.7667 0.1917 NaN NaN
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> cv(a) = 20.8 %, cv(b) = 15.1 %, Mean = 2.9
This gives the correct result, but there’s something not quite right.
The name of the response variable given in the output is less than useful:
get(variable). That happens because sp.plot() captures the code that was
used to pass the arguments, and extracts the name to use for the response variable from that.
To get the name of the response variable to show up,
we need some metaprogramming to modify our code before
running it. I’ll use a toy example to introduce the
concepts, and then we’ll see how they apply to the problem at hand.
The fist component is str2lang(), which takes a string and turns
it into a “language object” – some unevaluated code:
code <- str2lang("1 + 1")
code
#> 1 + 1
Next, bquote() can insert our piece of code into another, using
a special .() syntax:
bquote(2 * .(code))
#> 2 * (1 + 1)
Finally we have eval(), which takes a piece of code and runs it:
eval(bquote(2 * .(code)))
#> [1] 4
Let’s put the pieces together:
first create a language object from the variable name in a string with
str2lang(), then insert that into the sp.plot() code with bquote(),
and finally eval() the result to run the analysis:
variable <- str2lang("GPC.30DAP")
code <- bquote({
sp.plot(Block, Genotype, N.level, .(variable))
})
with(data, eval(code))
#>
#> ANALYSIS SPLIT PLOT: GPC.30DAP
#> Class level information
#>
#> Genotype : 1290H 1290W
#> N.level : Low N High N
#> Block : 1 2 3
#>
#> Number of observations: 12
#>
#> Analysis of Variance Table
#>
#> Response: GPC.30DAP
#> Df Sum Sq Mean Sq F value Pr(>F)
#> Block 2 0.8600 0.4300 NaN NaN
#> Genotype 1 3.4133 3.4133 9.3945 0.09199 .
#> Ea 2 0.7267 0.3633 NaN NaN
#> N.level 1 1.0800 1.0800 5.6348 0.07651 .
#> Genotype:N.level 1 0.8533 0.8533 4.4522 0.10249
#> Eb 4 0.7667 0.1917 NaN NaN
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> cv(a) = 20.8 %, cv(b) = 15.1 %, Mean = 2.9
You could use this directly, but it might be helpful to create a little
helper function to encapsulate the process. Then you won’t have to think
about what’s happening in this fairly complex block of code when you
actually call it:
fit_model <- function(variable) {
variable <- str2lang(variable)
code <- bquote({
sp.plot(Block, Genotype, N.level, .(variable))
})
eval(code, parent.frame())
}
with(data, fit_model("GPC.30DAP"))
Now we have the tools to execute the loop we envisioned in the beginning:
for (variable in c("GPC.15DAP", "GPC.30DAP")) {
with(data, fit_model(variable))
}
#>
#> ANALYSIS SPLIT PLOT: GPC.15DAP
#> Class level information
#>
#> Genotype : 1290H 1290W
#> N.level : Low N High N
#> Block : 1 2 3
#>
#> Number of observations: 12
#>
#> Analysis of Variance Table
#>
#> Response: GPC.15DAP
#> Df Sum Sq Mean Sq F value Pr(>F)
#> Block 2 0.1350 0.0675 NaN NaN
#> Genotype 1 3.4133 3.4133 67.147 0.01457 *
#> Ea 2 0.1017 0.0508 NaN NaN
#> N.level 1 12.0000 12.0000 68.900 0.00115 **
#> Genotype:N.level 1 2.0833 2.0833 11.962 0.02585 *
#> Eb 4 0.6967 0.1742 NaN NaN
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> cv(a) = 3.9 %, cv(b) = 7.3 %, Mean = 5.75
#>
#>
#> ANALYSIS SPLIT PLOT: GPC.30DAP
#> Class level information
#>
#> Genotype : 1290H 1290W
#> N.level : Low N High N
#> Block : 1 2 3
#>
#> Number of observations: 12
#>
#> Analysis of Variance Table
#>
#> Response: GPC.30DAP
#> Df Sum Sq Mean Sq F value Pr(>F)
#> Block 2 0.8600 0.4300 NaN NaN
#> Genotype 1 3.4133 3.4133 9.3945 0.09199 .
#> Ea 2 0.7267 0.3633 NaN NaN
#> N.level 1 1.0800 1.0800 5.6348 0.07651 .
#> Genotype:N.level 1 0.8533 0.8533 4.4522 0.10249
#> Eb 4 0.7667 0.1917 NaN NaN
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> cv(a) = 20.8 %, cv(b) = 15.1 %, Mean = 2.9
Capturing results
Unfortunately sp.plot() simultaneously both fits the model and prints the results.
This is not ideal, as when fitting hundreds of models you don’t want the output
to just overwhelm your console. We need some additional tools to work around that:
# Helper to discard printed output when evaluating `expr`
silence.output <- function(expr) {
sink(nullfile())
on.exit(sink())
expr
}
Now lapply() is a convenient way to loop and collect the results in a list:
variables <- c("GPC.15DAP", "GPC.30DAP")
models <- lapply(variables, \(variable) {
with(data, fit_model(variable)) |> silence.output()
})
Inspecting the result, we see we’ve saved all the relevant components:
str(models, 2)
#> List of 2
#> $ :List of 5
#> ..$ ANOVA:Classes 'anova' and 'data.frame': 6 obs. of 5 variables:
#> .. ..- attr(*, "heading")= chr [1:2] "Analysis of Variance Table\n" "Response: GPC.15DAP"
#> ..$ gl.a : int 2
#> ..$ gl.b : int 4
#> ..$ Ea : num 0.0508
#> ..$ Eb : num 0.174
#> $ :List of 5
#> ..$ ANOVA:Classes 'anova' and 'data.frame': 6 obs. of 5 variables:
#> .. ..- attr(*, "heading")= chr [1:2] "Analysis of Variance Table\n" "Response: GPC.30DAP"
#> ..$ gl.a : int 2
#> ..$ gl.b : int 4
#> ..$ Ea : num 0.363
#> ..$ Eb : num 0.192
The ANOVA table is also accessible, which you can use directly instead of
copying the results manually:
models[[1]]$ANOVA
#> Analysis of Variance Table
#>
#> Response: GPC.15DAP
#> Df Sum Sq Mean Sq F value Pr(>F)
#> Block 2 0.1350 0.0675 NaN NaN
#> Genotype 1 3.4133 3.4133 67.147 0.01457 *
#> Ea 2 0.1017 0.0508 NaN NaN
#> N.level 1 12.0000 12.0000 68.900 0.00115 **
#> Genotype:N.level 1 2.0833 2.0833 11.962 0.02585 *
#> Eb 4 0.6967 0.1742 NaN NaN
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
data.frame(
term = rownames(models[[1]]$ANOVA),
pval = models[[1]]$ANOVA$`Pr(>F)`
)
#> term pval
#> 1 Block NaN
#> 2 Genotype 0.014567943
#> 3 Ea NaN
#> 4 N.level 0.001150332
#> 5 Genotype:N.level 0.025851402
#> 6 Eb NaN
Let’s give LSD.test() the same treatment. An additional complication here
is that we need to pass more arguments than just the variable name now. To
handle that with ..., we swap bquote() for substitute():
lsd_test <- function(variable, ...) {
variable <- str2lang(variable)
code <- substitute({
LSD.test(variable, ...)
})
eval(code, parent.frame())
}
with(data, lsd_test("GPC.30DAP", Genotype, 2, 0.050833, console = TRUE))
#>
#> Study: GPC.30DAP ~ Genotype
#>
#> LSD t Test for GPC.30DAP
#>
#> Mean Square Error: 0.050833
#>
#> Genotype, means and individual ( 95 %) CI
#>
#> GPC.30DAP std r LCL UCL Min Max
#> 1290H 3.433333 0.8524475 6 3.037298 3.829368 2.3 4.5
#> 1290W 2.366667 0.3614784 6 1.970632 2.762702 1.9 2.8
#>
#> Alpha: 0.05 ; DF Error: 2
#> Critical Value of t: 4.302653
#>
#> least Significant Difference: 0.560078
#>
#> Treatments with the same letter are not significantly different.
#>
#> GPC.30DAP groups
#> 1290H 3.433333 a
#> 1290W 2.366667 b
Now the code to analyse one variable would be:
with(data, {
model <- fit_model("GPC.30DAP") |> silence.output()
test1 <- lsd_test("GPC.30DAP", Genotype, model$gl.a, model$Ea)
test2 <- lsd_test("GPC.30DAP", N.level, model$gl.b, model$Eb)
test3 <- lsd_test("GPC.30DAP", Genotype:N.level, model$gl.b, model$Eb)
})
Again, a function to encapsulate this would be helpful:
analyze_variable <- function(data, variable, ...) {
with(data, {
model <- silence.output(fit_model(variable))
test1 <- lsd_test(variable, Genotype, model$gl.a, model$Ea, ...)
test2 <- lsd_test(variable, N.level, model$gl.b, model$Eb, ...)
test3 <- lsd_test(variable, Genotype:N.level, model$gl.b, model$Eb, ...)
list(variable = variable, model = model, tests = list(test1, test2, test3))
})
}
analysis <- analyze_variable(data, "GPC.30DAP")
str(analysis, 2)
#> List of 3
#> $ variable: chr "GPC.30DAP"
#> $ model :List of 5
#> ..$ ANOVA:Classes 'anova' and 'data.frame': 6 obs. of 5 variables:
#> .. ..- attr(*, "heading")= chr [1:2] "Analysis of Variance Table\n" "Response: GPC.30DAP"
#> ..$ gl.a : int 2
#> ..$ gl.b : int 4
#> ..$ Ea : num 0.363
#> ..$ Eb : num 0.192
#> $ tests :List of 3
#> ..$ :List of 5
#> .. ..- attr(*, "class")= chr "group"
#> ..$ :List of 5
#> .. ..- attr(*, "class")= chr "group"
#> ..$ :List of 5
#> .. ..- attr(*, "class")= chr "group"
Processing results
The analysis object that we return contains all the relevant information
for further processing. We could for example produce a similar interaction
plot to what you had from this result:
library(tidyverse)
plot_analysis <- function(analysis) {
means <- analysis$tests[[3]]$means
groups <- analysis$tests[[3]]$groups
means |>
cbind(groups = groups[, 2]) |>
rename(mean = 1) |>
rownames_to_column("term") |>
separate(term, into = c("Genotype", "N.level"), sep = ":") |>
ggplot(aes(N.level, mean, color = Genotype, group = Genotype)) +
geom_errorbar(
aes(ymin = LCL, ymax = UCL),
position = position_dodge(0.1),
width = 0.1
) +
geom_line(position = position_dodge(0.1)) +
geom_point(position = position_dodge(0.1), size = 3) +
geom_text(
aes(y = UCL, label = groups),
position = position_dodge(0.1),
vjust = -0.5,
show.legend = FALSE
) +
labs(y = analysis$variable, x = NULL)
}
plot_analysis(analysis)
Or we can make a helper to print the result summaries:
print_analysis <- function(analysis) {
cat("Split plot ANOVA of", analysis$variable, "\n")
print(analysis$model$ANOVA)
test_rows <- list()
for (test in analysis$tests) {
row <- cbind(test$parameters, test$statistics)
test_rows <- c(test_rows, list(row))
}
cat("\n")
print(do.call("rbind", test_rows))
invisible(analysis)
}
print_analysis(analysis)
#> Split plot ANOVA of GPC.30DAP
#> Analysis of Variance Table
#>
#> Response: GPC.30DAP
#> Df Sum Sq Mean Sq F value Pr(>F)
#> Block 2 0.8600 0.4300 NaN NaN
#> Genotype 1 3.4133 3.4133 9.3945 0.09199 .
#> Ea 2 0.7267 0.3633 NaN NaN
#> N.level 1 1.0800 1.0800 5.6348 0.07651 .
#> Genotype:N.level 1 0.8533 0.8533 4.4522 0.10249
#> Eb 4 0.7667 0.1917 NaN NaN
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> test p.ajusted name.t ntr alpha MSerror Df Mean CV
#> Fisher-LSD none Genotype 2 0.05 0.3633333 2 2.9 20.78522
#> 1 Fisher-LSD none N.level 2 0.05 0.1916667 4 2.9 15.09647
#> 2 Fisher-LSD none Genotype:N.level 4 0.05 0.1916667 4 2.9 15.09647
#> t.value LSD
#> 4.302653 1.4973671
#> 1 2.776445 0.7017812
#> 2 2.776445 0.9924686
Now you can write one loop to do all the analyses, and then continue working
with the list of results:
# Analyze all but the 3 first variables in data
variables <- tail(names(data), -3)
variables
#> [1] "GPC.15DAP" "GPC.23DAP" "GPC.30DAP"
analyses <- lapply(variables, \(variable) {
analyze_variable(data, variable)
})
For example, we can save the results of the plots and summaries using the
functions we created:
output_dir <- tempdir()
for (analysis in analyses) {
# Save summaries
output_file <- file.path(output_dir, paste0(analysis$variable, ".txt"))
sink(output_file)
print_analysis(analysis)
sink()
# Save plots
output_file <- file.path(output_dir, paste0(analysis$variable, ".jpeg"))
jpeg(output_file, width = 10, height = 10, units = "cm", res = 300)
plot_analysis(analysis) |> print()
dev.off()
}
list.files(output_dir, pattern = "^GPC")
#> [1] "GPC.15DAP.jpeg" "GPC.15DAP.txt" "GPC.23DAP.jpeg" "GPC.23DAP.txt"
#> [5] "GPC.30DAP.jpeg" "GPC.30DAP.txt"
I'm trying to use ggeffects::ggpredict to make some effects plots for my model. I find that the standard errors and confidence limits are missing for many of the results. I can reproduce the problem with some simulated data. It seems specifically for observations where the standard error puts the predicted probability close to 0 or 1.
I tried to get predictions on the link scale to diagnose if it's a problem with the translation from link to response, but I don't believe this is supported by the package.
Any ideas how to address this? Many thanks.
library(tidyverse)
library(lme4)
library(ggeffects)
# number of simulated observations
n <- 1000
# simulated data with a numerical predictor x, factor predictor f, response y
# the simulated effects of x and f are somewhat weak compared to the noise, so expect high standard errors
df <- tibble(
x = seq(-0.1, 0.1, length.out = n),
g = floor(runif(n) * 3),
f = letters[1 + g] %>% as.factor(),
y = pracma::sigmoid(x + (runif(n) - 0.5) + 0.1 * (g - mean(g))),
z = if_else(y > 0.5, "high", "low") %>% as.factor()
)
# glmer model
model <- glmer(z ~ x + (1 | f), data = df, family = binomial)
print(summary(model))
#> Generalized linear mixed model fit by maximum likelihood (Laplace
#> Approximation) [glmerMod]
#> Family: binomial ( logit )
#> Formula: z ~ x + (1 | f)
#> Data: df
#>
#> AIC BIC logLik deviance df.resid
#> 1373.0 1387.8 -683.5 1367.0 997
#>
#> Scaled residuals:
#> Min 1Q Median 3Q Max
#> -1.3858 -0.9928 0.7317 0.9534 1.3600
#>
#> Random effects:
#> Groups Name Variance Std.Dev.
#> f (Intercept) 0.0337 0.1836
#> Number of obs: 1000, groups: f, 3
#>
#> Fixed effects:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.02737 0.12380 0.221 0.825
#> x -4.48012 1.12066 -3.998 6.39e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Correlation of Fixed Effects:
#> (Intr)
#> x -0.001
# missing standard errors
ggpredict(model, c("x", "f")) %>% print()
#> Data were 'prettified'. Consider using `terms="x [all]"` to get smooth plots.
#> # Predicted probabilities of z
#>
#> # f = a
#>
#> x | Predicted | 95% CI
#> --------------------------------
#> -0.10 | 0.62 | [0.54, 0.69]
#> 0.00 | 0.51 |
#> 0.10 | 0.40 |
#>
#> # f = b
#>
#> x | Predicted | 95% CI
#> --------------------------------
#> -0.10 | 0.62 | [0.56, 0.67]
#> 0.00 | 0.51 |
#> 0.10 | 0.40 |
#>
#> # f = c
#>
#> x | Predicted | 95% CI
#> --------------------------------
#> -0.10 | 0.62 | [0.54, 0.69]
#> 0.00 | 0.51 |
#> 0.10 | 0.40 |
ggpredict(model, c("x", "f")) %>% as_tibble() %>% print(n = 20)
#> Data were 'prettified'. Consider using `terms="x [all]"` to get smooth plots.
#> # A tibble: 9 x 6
#> x predicted std.error conf.low conf.high group
#> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 -0.1 0.617 0.167 0.537 0.691 a
#> 2 -0.1 0.617 0.124 0.558 0.672 b
#> 3 -0.1 0.617 0.167 0.537 0.691 c
#> 4 0 0.507 NA NA NA a
#> 5 0 0.507 NA NA NA b
#> 6 0 0.507 NA NA NA c
#> 7 0.1 0.396 NA NA NA a
#> 8 0.1 0.396 NA NA NA b
#> 9 0.1 0.396 NA NA NA c
Created on 2022-04-12 by the reprex package (v2.0.1)
I think this may be due to the singular model fit.
I dug down into the guts of the code as far as here, where there appears to be a mismatch between the dimensions of the covariance matrix of the predictions (3x3) and the number of predicted values (15).
I further suspect that the problem may happen here:
rows_to_keep <- as.numeric(rownames(unique(model_matrix_data[
intersect(colnames(model_matrix_data), terms)])))
Perhaps the function is getting confused because the conditional modes/BLUPs for every group are the same (which will only be true, generically, when the random effects variance is zero) ... ?
This seems worth opening an issue on the ggeffects issues list ?
I ran some code below that looks at running Cox regression across multiple outcome types (stroke, cancer, respiratory) that appear in separate columns. purrr seems to do this quite well. But I would also like to
print the name of each outcome type above the corresponding regression model and
print the coefficients as hazard ratios with 95% CIs.
I know this is quite a big ask but is important since my real dataset has almost 20 outcome types. Any help would be much appreciated!
library(survival)
library(purrr)
mydata <- read.table(header=T,
text="age Sex survival stroke cancer respiratory
51 2 1.419178082 2 1 1
60 1 5 1 2 2
49 2 1.082191781 2 2 2
83 1 0.038356164 1 1 2
68 2 0.77260274 2 1 2
44 2 2.336986301 1 2 1
76 1 1.271232877 1 2 2")
outcomes <- names(mydata[4:6])
purrr::map(outcomes, ~coxph(as.formula(paste("Surv(survival,", .x, ") ~ Sex + age")),
mydata))
I'm not quite sure if this is what you are looking for, but if you run the following code:
result <- purrr::map(outcomes, function(x) {
f <- as.formula(paste("Surv(survival,", x, ") ~ Sex + age"))
model <- coxph(f, mydata)
model$call$formula <- f
s <- summary(model)
cat(x, ':\n', paste0(apply(s$coefficients, 1,
function(x) {
paste0("HR : ", round(exp(x[1]), 2),
' (95% CI ', round(exp(x[1] - 1.96 * x[3]), 2),
' - ', round(exp(x[1] + 1.96 * x[3]), 2), ')')}),
collapse = '\n'), '\n\n', sep = '')
invisible(model)
})
It will print out:
#> stroke:
#> HR : 650273590159.06 (95% CI 0 - Inf)
#> HR : 1.36 (95% CI 0.75 - 2.49)
#>
#> cancer:
#> HR : 1121.58 (95% CI 0 - 770170911.09)
#> HR : 1.33 (95% CI 0.78 - 2.28)
#>
#> respiratory:
#> HR : 24.1 (95% CI 0.31 - 1884.85)
#> HR : 1.2 (95% CI 0.99 - 1.45)
And your list of models will be stored with the correct call above them:
result
#> [[1]]
#> Call:
#> coxph(formula = Surv(survival, stroke) ~ Sex + age, data = mydata)
#>
#> coef exp(coef) se(coef) z p
#> Sex 2.720e+01 6.503e+11 2.111e+04 0.001 0.999
#> age 3.105e-01 1.364e+00 3.066e-01 1.013 0.311
#>
#> Likelihood ratio test=6.52 on 2 df, p=0.03834
#> n= 7, number of events= 3
#>
#> [[2]]
#> Call:
#> coxph(formula = Surv(survival, cancer) ~ Sex + age, data = mydata)
#>
#> coef exp(coef) se(coef) z p
#> Sex 7.0225 1121.5843 6.8570 1.024 0.306
#> age 0.2870 1.3325 0.2739 1.048 0.295
#>
#> Likelihood ratio test=2.58 on 2 df, p=0.2753
#> n= 7, number of events= 4
#>
#> [[3]]
#> Call:
#> coxph(formula = Surv(survival, respiratory) ~ Sex + age, data = mydata)
#>
#> coef exp(coef) se(coef) z p
#> Sex 3.18232 24.10259 2.22413 1.431 0.1525
#> age 0.18078 1.19815 0.09772 1.850 0.0643
#>
#> Likelihood ratio test=5.78 on 2 df, p=0.05552
#> n= 7, number of events= 5
Study background: I want to see if mean caterpillar abundance for a given year and within a given population, can explain differences in bird(blue tit) density. The blue tits breed in nest boxes, so I calculated density as the total number of occupied nest boxes/the number of unoccupied nest boxes within a given year and population.
Below I show the structure of the data, my model and the error messages.
Model:
model1.1 <-glmer(cbind(data.density$number.nest.boxes.occupied.that.year,
data.density$number_of_nest.boxes)~population*year*caterpillar+(1|site),
data = data.density, family=binomial)
The first error is:
fixed-effect model matrix is rank deficient so dropping 27 columns / coefficients
I think this this is due to not having enough combinations of caterpillars with population x year.
The second error is
boundary (singular) fit: see ?isSingular
I'm just not sure how to go about fixing this.
I also don't understand what the other error means and how to fix it.
I appreciate any advice.
#loading data density data
data.density<-read.csv ("nest_box_caterpillar_density.csv")
View(data.density)
str(data.density)
#> 'data.frame': 63 obs. of 8 variables:
#> $ year : int 2011 2012 2013 2014 2015 2016 2017 2018 2019 2011 ...
#> $ number.nest.boxes.occupied.that.year: int 17 13 12 16 16 16 15 17 12 17 ...
#> $ number_of_nest.boxes : int 20 20 20 20 20 20 20 20 20 30 ...
#> $ proportion_occupied_boxes : num 0.85 0.65 0.6 0.8 0.8 0.8 0.75 0.85 0.6 0.57 ...
#> $ site : Factor w/ 7 levels "ari","ava","fel",..: 5 5 5 5 5 5 5 5 5 1 ...
#> $ population : Factor w/ 3 levels "D-Muro","E-Muro",..: 2 2 2 2 2 2 2 2 2 2 ...
#> $ mean_yearly_frass : num 295 231 437 263 426 ...
#> $ site_ID : Factor w/ 63 levels "2011_ari_","2011_ava_",..: 5 12 19 26 33 40 47 54 61 1 ...
data.density$year<-factor (data.density$year)# making year a factor (categorical variable)
str(data.density) # now we see year as a factor in the data.
#> 'data.frame': 63 obs. of 8 variables:
#> $ year : Factor w/ 9 levels "2011","2012",..: 1 2 3 4 5 6 7 8 9 1 ...
#> $ number.nest.boxes.occupied.that.year: int 17 13 12 16 16 16 15 17 12 17 ...
#> $ number_of_nest.boxes : int 20 20 20 20 20 20 20 20 20 30 ...
#> $ proportion_occupied_boxes : num 0.85 0.65 0.6 0.8 0.8 0.8 0.75 0.85 0.6 0.57 ...
#> $ site : Factor w/ 7 levels "ari","ava","fel",..: 5 5 5 5 5 5 5 5 5 1 ...
#> $ population : Factor w/ 3 levels "D-Muro","E-Muro",..: 2 2 2 2 2 2 2 2 2 2 ...
#> $ mean_yearly_frass : num 295 231 437 263 426 ...
#> $ site_ID : Factor w/ 63 levels "2011_ari_","2011_ava_",..: 5 12 19 26 33 40 47 54 61 1 ...
density<-data.density$proportion_occupied_boxes # making a new object called density
caterpillar<-data.density$mean_yearly_frass # making new object called caterpillar
model1.1<-glmer(cbind(data.density$number.nest.boxes.occupied.that.year,data.density$number_of_nest.boxes)~population*year*caterpillar+(1|site),data = data.density, family=binomial)
#> fixed-effect model matrix is rank deficient so dropping 27 columns / coefficients
#> boundary (singular) fit: see ?isSingular
summary(model1.1)
#> Generalized linear mixed model fit by maximum likelihood (Laplace
#> Approximation) [glmerMod]
#> Family: binomial ( logit )
#> Formula:
#> cbind(data.density$number.nest.boxes.occupied.that.year, data.density$number_of_nest.boxes) ~
#> population * year * caterpillar + (1 | site)
#> Data: data.density
#>
#> AIC BIC logLik deviance df.resid
#> 343.7 403.7 -143.8 287.7 35
#>
#> Scaled residuals:
#> Min 1Q Median 3Q Max
#> -1.1125 -0.1379 0.0000 0.2264 0.6778
#>
#> Random effects:
#> Groups Name Variance Std.Dev.
#> site (Intercept) 0 0
#> Number of obs: 63, groups: site, 7
#>
#> Fixed effects:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.4054532 0.2454754 -1.652 0.0986 .
#> populationE-Muro -0.1158123 0.2301030 -0.503 0.6147
#> populationE-Pirio -0.4945158 0.2932707 -1.686 0.0918 .
#> year2012 0.0905137 0.2109513 0.429 0.6679
#> year2013 -0.1223076 0.2160367 -0.566 0.5713
#> year2014 -0.0703760 0.2304236 -0.305 0.7600
#> year2015 -0.0507882 0.2127083 -0.239 0.8113
#> year2016 -0.0562139 0.2077616 -0.271 0.7867
#> year2017 -0.0994962 0.2070464 -0.481 0.6308
#> year2018 0.0977751 0.2192755 0.446 0.6557
#> year2019 -0.2312869 0.2133430 -1.084 0.2783
#> caterpillar 0.0004598 0.0005432 0.846 0.3973
#> populationE-Muro:year2012 -0.1217344 0.3294773 -0.369 0.7118
#> populationE-Pirio:year2012 -0.3121173 0.2912256 -1.072 0.2838
#> populationE-Muro:year2013 -0.0682892 0.3600992 -0.190 0.8496
#> populationE-Pirio:year2013 -0.3345701 0.3051039 -1.097 0.2728
#> populationE-Muro:year2014 0.1604636 0.3383121 0.474 0.6353
#> populationE-Pirio:year2014 -0.1074231 0.3171972 -0.339 0.7349
#> populationE-Muro:year2015 0.0838557 0.3491699 0.240 0.8102
#> populationE-Pirio:year2015 -0.0640988 0.2943189 -0.218 0.8276
#> populationE-Muro:year2016 0.0679017 0.3333771 0.204 0.8386
#> populationE-Pirio:year2016 -0.0899343 0.2919975 -0.308 0.7581
#> populationE-Muro:year2017 0.1643493 0.3300491 0.498 0.6185
#> populationE-Pirio:year2017 0.0338824 0.2730344 0.124 0.9012
#> populationE-Muro:year2018 0.0315607 0.3264224 0.097 0.9230
#> populationE-Pirio:year2018 -0.4196974 0.3180515 -1.320 0.1870
#> populationE-Muro:year2019 -0.0587593 0.3619408 -0.162 0.8710
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Correlation matrix not shown by default, as p = 27 > 12.
#> Use print(x, correlation=TRUE) or
#> vcov(x) if you need it
#> fit warnings:
#> fixed-effect model matrix is rank deficient so dropping 27 columns / coefficients
#> optimizer (Nelder_Mead) convergence code: 0 (OK)
#> boundary (singular) fit: see ?isSingular
Created on 2022-03-21 by the reprex package (v2.0.1)
I tried removing caterpillars from the model, and the error first error goes away. But the point of my model is to see how caterpillar effects density. I also still get the "boundary (singular) fit: see ?isSingular" error.
There are a few problems with your model specification.
binomial responses should be specified as cbind(n_success, n_failure), not cbind(n_success, n_total) (in your case cbind(boxes_occupied, total_boxes-boxes_occupied). (I actually find it clearer to use the alternative specification boxes_occupied/total_boxes with the additional argument weights=total_boxes ...)
there are few absolutely hard and fast rules, but it probably makes sense to estimate the effects of at least year as a random effect, and possibly population within site (this will depend on how much you are interested in the detailed differences between populations)
with a total of 63 observations, you need to be parsimonious with your model; a reasonable rule of thumb is that you shouldn't try to estimate more than at most 6 parameters
so I would recommend something like
glmer(boxes_occupied/total_boxes ~ caterpillar + population + (1|year) + (1|site) + caterpillar,
data = data.density,
weights = total_boxes,
family = binomial)
As for singularity, this is a fundamental issue in mixed models, and has been discussed in lots of places. There's a lot of information provided at ?lme4::isSingular (i.e., look up the help page for the isSingular() function), and at the GLMM FAQ, or see some of the existing answers on Stack Overflow about singular fits ...
I am trying to write a .csv file that appends the important information from the summary of a glmer analysis (from the package lme4).
I have been able to isolate the coefficients, AIC, and random effects , but I have not been able to isolate the scaled residuals (Min, 1Q, Median, 3Q, Max).
I have tried using $residuals, but I get a very long output, not the information shown in the summary.
> library(lme4)
> setwd("C:/Users/Arthur Scully/Dropbox/! ! ! ! PHD/Chapter 2 Lynx Bobcat BC/ResourceSelection")
> #simple vectors
>
> x <- c("a","b","b","b","b","d","b","c","c","a")
>
> y <- c(1,1,0,1,0,1,1,1,1,0)
>
>
> # Simple data frame
>
> aes.samp <- data.frame(x,y)
> aes.samp
x y
1 a 1
2 b 1
3 b 0
4 b 1
5 b 0
6 d 1
7 b 1
8 c 1
9 c 1
10 a 0
>
> # Simple glmer
>
> aes.glmer <- glmer(y~(1|x),aes.samp,family ="binomial")
boundary (singular) fit: see ?isSingular
>
> summary(aes.glmer)
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
Family: binomial ( logit )
Formula: y ~ (1 | x)
Data: aes.samp
AIC BIC logLik deviance df.resid
16.2 16.8 -6.1 12.2 8
I can isolate information above by using the call summary(aes.glmer)$AIC
Scaled residuals:
Min 1Q Median 3Q Max
-1.5275 -0.9820 0.6546 0.6546 0.6546
I do not know the call to isolate the above information
Random effects:
Groups Name Variance Std.Dev.
x (Intercept) 0 0
Number of obs: 10, groups: x, 4
I can isolate this information using the ranef function
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.8473 0.6901 1.228 0.22
And I can isolate the information above using summary(aes.glmer)$coefficient
convergence code: 0
boundary (singular) fit: see ?isSingular
>
> #Pull important
> ##write call to select important output
> aes.glmer.coef <- summary(aes.glmer)$coefficient
> aes.glmer.AIC <- summary(aes.glmer)$AIC
> aes.glmer.ran <-ranef(aes.glmer)
>
> ##
> data.frame(c(aes.glmer.coef, aes.glmer.AIC, aes.glmer.ran))
X0.847297859077025 X0.690065555425105 X1.22785125618255 X0.219502810378876 AIC BIC logLik deviance df.resid X.Intercept.
a 0.8472979 0.6900656 1.227851 0.2195028 16.21729 16.82246 -6.108643 12.21729 8 0
b 0.8472979 0.6900656 1.227851 0.2195028 16.21729 16.82246 -6.108643 12.21729 8 0
c 0.8472979 0.6900656 1.227851 0.2195028 16.21729 16.82246 -6.108643 12.21729 8 0
d 0.8472979 0.6900656 1.227851 0.2195028 16.21729 16.82246 -6.108643 12.21729 8 0
If anyone knows what call I can use to isolate the "scaled residuals" I would be very greatful.
I haven't got your data, so we'll use example data from the lme4 vignette.
library(lme4)
library(lattice)
library(broom)
gm1 <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
data = cbpp, family = binomial)
This is for the residuals. tidy from the broom package puts it in to a tibble, which you can then export to a csv.
x <- tidy(quantile(residuals(gm1, "pearson", scaled = TRUE)))
x
# A tibble: 5 x 2
names x
<chr> <dbl>
1 0% -2.38
2 25% -0.789
3 50% -0.203
4 75% 0.514
5 100% 2.88
Also here are some of the other bits that you might find useful, using glance from broom.
y <- glance(gm1)
y
# A tibble: 1 x 6
sigma logLik AIC BIC deviance df.residual
<dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 1 -92.0 194. 204. 73.5 51
And
z <- tidy(gm1)
z
# A tibble: 5 x 6
term estimate std.error statistic p.value group
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 (Intercept) -1.40 0.231 -6.05 1.47e-9 fixed
2 period2 -0.992 0.303 -3.27 1.07e-3 fixed
3 period3 -1.13 0.323 -3.49 4.74e-4 fixed
4 period4 -1.58 0.422 -3.74 1.82e-4 fixed
5 sd_(Intercept).herd 0.642 NA NA NA herd