Box-Cox back transformation with emmeans not working properly - r

I am fitting a linear regression to data that needs a transformation. I am using a Box-Cox power transformation, followed by back-transformation so that I can report results on the original scale. I have been trying to do this with the emmeans package, following the steps described in its vignette. However, the summary results for the estimated means are not at all similar to those from the untransformed data. In fact, the output is not back-transformed at all.
Here is a reproducible example using the examples from the emmeans package:
require(emmeans)
# Fit a model using an oddball transformation:
bctran <- make.tran("boxcox", 0.368)
warp.bc <- with(bctran,
lm(linkfun(breaks) ~ wool * tension, data = warpbreaks))
# Obtain back-transformed LS means:
emmeans(warp.bc, ~ tension | wool, type = "response")
# Fit a model without transformation:
warp <- lm(breaks ~ wool * tension, data = warpbreaks)
# Obtain LS means:
emmeans(warp, ~ tension | wool)
which returns:
> emmeans(warp.bc, ~ tension | wool, type = "response")
wool = A:
tension emmean SE df lower.CL upper.CL
L 8.07 0.419 48 7.23 8.92
M 5.91 0.419 48 5.07 6.75
H 5.94 0.419 48 5.10 6.79
wool = B:
tension emmean SE df lower.CL upper.CL
L 6.45 0.419 48 5.61 7.29
M 6.53 0.419 48 5.69 7.37
H 5.22 0.419 48 4.38 6.07
Confidence level used: 0.95
> emmeans(warp, ~ tension | wool)
wool = A:
tension emmean SE df lower.CL upper.CL
L 44.6 3.65 48 37.2 51.9
M 24.0 3.65 48 16.7 31.3
H 24.6 3.65 48 17.2 31.9
wool = B:
tension emmean SE df lower.CL upper.CL
L 28.2 3.65 48 20.9 35.6
M 28.8 3.65 48 21.4 36.1
H 18.8 3.65 48 11.4 26.1
Confidence level used: 0.95
when in fact the estimated mean for tension:L should be 42.37, as calculated using the formula:
> origin + (1 + param * pmax(eta))^(1/param)
> 0 + (1 + 0.368 * pmax(8.07))^(1/0.368)
[1] 42.37179
Is there something I am missing or not understanding properly?

Hmmmm. I reproduced this problem. I'm not sure what's wrong, but so far I can tell that bctran itself is in order:
> emm = as.data.frame(emmeans(warp.bc, ~tension|wool))
> emm
tension wool emmean SE df lower.CL upper.CL
1 L A 8.074761 0.4192815 48 7.231739 8.917783
2 M A 5.911710 0.4192815 48 5.068688 6.754732
3 H A 5.942335 0.4192815 48 5.099313 6.785357
4 L B 6.449869 0.4192815 48 5.606847 7.292891
5 M B 6.531085 0.4192815 48 5.688063 7.374107
6 H B 5.224939 0.4192815 48 4.381917 6.067961
> bctran$linkinv(emm$emmean)
[1] 42.42263 23.10060 23.32407 27.22827 27.88877 18.43951
So these back-transformed EMMs are in order. I'll trace the code and see why the results aren't back-transformed.
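For a check that does not depend on emmeans at all, the Box-Cox transform and its inverse can be written out by hand. This is a minimal sketch assuming the plain form with origin 0; the helper names boxcox_fwd and boxcox_inv are made up for illustration and are not part of emmeans.

```r
# Hand-written Box-Cox pair (hypothetical helpers, origin assumed to be 0)
lambda <- 0.368

boxcox_fwd <- function(y, lambda) (y^lambda - 1) / lambda
boxcox_inv <- function(eta, lambda) (1 + lambda * eta)^(1 / lambda)

# Back-transform the EMMs that were reported on the Box-Cox scale above
emm <- c(8.074761, 5.911710, 5.942335, 6.449869, 6.531085, 5.224939)
back <- boxcox_inv(emm, lambda)
round(back, 2)

# Round trip: transforming and then back-transforming recovers the input
y <- c(10, 25, 42)
all.equal(boxcox_inv(boxcox_fwd(y, lambda), lambda), y)
```

The values in back should agree with bctran$linkinv(emm$emmean) shown above.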
Update
I found a logic error from a revision a few months ago whereby if a transformation is character (e.g., "log") it works fine, but if it is a list (e.g., your bctran) it is ignored.
I have fixed that error in the next version to be pushed to the GitHub site (version >= 1.3.3.0999902), and the fix will be in the next CRAN update (version > 1.3.3).
> emmeans(warp.bc, ~ tension | wool)
wool = A:
tension emmean SE df lower.CL upper.CL
L 8.07 0.419 48 7.23 8.92
M 5.91 0.419 48 5.07 6.75
H 5.94 0.419 48 5.10 6.79
wool = B:
tension emmean SE df lower.CL upper.CL
L 6.45 0.419 48 5.61 7.29
M 6.53 0.419 48 5.69 7.37
H 5.22 0.419 48 4.38 6.07
Results are given on the Box-Cox (lambda = 0.368) (not the response) scale.
Confidence level used: 0.95
> emmeans(warp.bc, ~ tension | wool, type = "response")
wool = A:
tension response SE df lower.CL upper.CL
L 42.4 4.48 48 34.0 52.0
M 23.1 3.05 48 17.5 29.8
H 23.3 3.07 48 17.7 30.0
wool = B:
tension response SE df lower.CL upper.CL
L 27.2 3.38 48 20.9 34.6
M 27.9 3.44 48 21.5 35.3
H 18.4 2.65 48 13.6 24.3
Confidence level used: 0.95
Intervals are back-transformed from the Box-Cox (lambda = 0.368) scale
Notice that even without back-transforming, there is an annotation of that fact. The fact that no annotation at all is present in your results was a tip-off.
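To see whether an installed copy of emmeans already includes the fix, its version can be compared with the CRAN threshold mentioned above. A small sketch: has_tran_fix is a hypothetical helper name, and the comparison only encodes the "version > 1.3.3" statement from this answer.

```r
# Hypothetical helper: TRUE if a version string is newer than 1.3.3
has_tran_fix <- function(version) {
  package_version(version) > package_version("1.3.3")
}

has_tran_fix("1.3.3")   # FALSE
has_tran_fix("1.3.4")   # TRUE

# Check the installed copy, if emmeans is available
if (requireNamespace("emmeans", quietly = TRUE)) {
  has_tran_fix(as.character(utils::packageVersion("emmeans")))
}
```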

Related

Convert a list from data frame (emmGrid class)

I would like to convert a list to a data frame (the list was shown as a picture in the original post).
I tried do.call(rbind.data.frame, contrast), but I got this error: Error in xi[[j]] : this S4 class is not subsettable. I can still read the elements separately. Does anyone know what is going on?
I got this list from running the ART ANOVA test with the ARTool package.
Update
This is my original code for fitting the model.
Organism_df_posthoc <- bird_metrics_long_new %>%
rbind(plant_metrics_long_new) %>%
mutate(Type = factor(Type, levels = c("Forest", "Jungle rubber", "Rubber", "Oil palm"))) %>%
mutate(Category = factor(Category)) %>%
group_by(Category) %>%
mutate_at(c("PD"), ~(scale(.) %>% as.vector())) %>%
ungroup() %>%
nest_by(n1) %>%
mutate(fit = list(art.con(art(PD ~ Category + Type + Category:Type, data = data),
"Category:Type",adjust = "tukey", interaction = T)))
The output of fit is what I showed above.
Use rbind instead of rbind.data.frame. emmeans provides an rbind method for 'emmGrid' objects, so plain rbind dispatches to the correct method based on the class of its arguments:
do.call(rbind, contrast)
-output
wool tension emmean SE df lower.CL upper.CL
A L 44.6 3.65 48 33.6 55.5
A M 24.0 3.65 48 13.0 35.0
A H 24.6 3.65 48 13.6 35.5
B L 28.2 3.65 48 17.2 39.2
B M 28.8 3.65 48 17.8 39.8
B H 18.8 3.65 48 7.8 29.8
A L 44.6 3.65 48 33.6 55.5
A M 24.0 3.65 48 13.0 35.0
A H 24.6 3.65 48 13.6 35.5
B L 28.2 3.65 48 17.2 39.2
B M 28.8 3.65 48 17.8 39.8
B H 18.8 3.65 48 7.8 29.8
Confidence level used: 0.95
Conf-level adjustment: bonferroni method for 12 estimates
The reason is that emmeans registers its own rbind methods when it is loaded:
> methods('rbind')
[1] rbind.data.frame rbind.data.table* rbind.emm_list* rbind.emmGrid* rbind.grouped_df* rbind.zoo*
The example below has the same structure as the one the OP showed.
Calling rbind.data.frame directly does not work because the elements are of class emmGrid, not data frames.
data
library(multcomp)
library(emmeans)
warp.lm <- lm(breaks ~ wool*tension, data = warpbreaks)
warp.emmGrid <- emmeans(warp.lm, ~ tension | wool)
contrast <- list(warp.emmGrid, warp.emmGrid)
If the OP used 'ARTool' and the columns differ between elements, the above solution may not work, because rbind requires all objects to have the same column names. In that case, we can convert each element to a tibble by looping over the list with map_dfr (from purrr), which also binds the rows:
library(ARTool)
library(purrr)
library(tibble)
map_dfr(contrast, as_tibble)
-output
# A tibble: 42 × 8
contrast estimate SE df t.ratio p.value Moisture_pairwise Fertilizer_pairwise
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
1 m1 - m2 -23.1 4.12 8.00 -5.61 0.00226 NA NA
2 m1 - m3 -33.8 4.12 8.00 -8.20 0.000169 NA NA
3 m1 - m4 -15.2 4.12 8.00 -3.68 0.0256 NA NA
4 m2 - m3 -10.7 4.12 8 -2.59 0.118 NA NA
5 m2 - m4 7.92 4.12 8 1.92 0.291 NA NA
6 m3 - m4 18.6 4.12 8 4.51 0.00849 NA NA
7 NA 6.83 10.9 24 0.625 0.538 m1 - m2 f1 - f2
8 NA 15.3 10.9 24 1.40 0.174 m1 - m3 f1 - f2
9 NA -5.83 10.9 24 -0.533 0.599 m1 - m4 f1 - f2
10 NA 8.50 10.9 24 0.777 0.445 m2 - m3 f1 - f2
# … with 32 more rows
data
data(Higgins1990Table5, package = "ARTool")
m <- art(DryMatter ~ Moisture*Fertilizer + (1|Tray), data=Higgins1990Table5)
a1 <- art.con(m, ~ Moisture)
a2 <- art.con(m, "Moisture:Fertilizer", interaction = TRUE)
contrast <- list(a1, a2)
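For readers who prefer to avoid purrr, the same idea can be sketched in base R: convert each element to a data frame, add any missing columns as NA, then rbind. The helper bind_rows_fill below is hypothetical, and the two small data frames only stand in for as.data.frame() applied to each list element; their values are copied from the first and seventh rows of the output above.

```r
# Hypothetical helper: row-bind data frames whose columns differ,
# filling columns that are absent from a given data frame with NA
bind_rows_fill <- function(dfs) {
  all_cols <- unique(unlist(lapply(dfs, names)))
  filled <- lapply(dfs, function(d) {
    for (m in setdiff(all_cols, names(d))) d[[m]] <- NA
    d[all_cols]
  })
  do.call(rbind, filled)
}

# Stand-ins for as.data.frame(a1) and as.data.frame(a2)
d1 <- data.frame(contrast = "m1 - m2", estimate = -23.1, SE = 4.12,
                 stringsAsFactors = FALSE)
d2 <- data.frame(estimate = 6.83, SE = 10.9,
                 Moisture_pairwise = "m1 - m2",
                 Fertilizer_pairwise = "f1 - f2",
                 stringsAsFactors = FALSE)

combined <- bind_rows_fill(list(d1, d2))
combined
```

With real emmGrid objects, the call would presumably look like bind_rows_fill(lapply(contrast, as.data.frame)). The sketch was only written with character and numeric columns in mind.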

Abline command is not showing a regression line?

I'm new to R programming and I'm trying to plot a regression line for this data set, but it doesn't seem to be working.
I followed exactly what my professor did, but it doesn't seem to work. I've also tried replacing the abline command with abline(lm(batters$EMH~batters$TB)), with similar results.
Here is my code for it:
batters<-read.table(header=TRUE, text="
X AVG EBH TB OPS K.to.BB.Ratio
1 LeMahieu 0.327 61 312 0.893 1.95
2 Urshela 0.314 55 236 0.889 3.48
3 Torres 0.278 64 292 0.871 2.64
4 Judge 0.272 46 204 0.921 2.21
5 Sanchez 0.232 47 208 0.841 3.13
6 Wong 0.285 40 202 0.784 1.76
7 Molina 0.270 34 167 0.711 2.52
8 Goldschmidt 0.260 60 284 0.821 2.13
9 Ozuna 0.243 53 230 0.804 1.84
10 DeJong 0.233 62 259 0.762 2.39
11 Altuve 0.298 61 275 0.903 1.98
12 Bregman 0.296 80 328 1.015 0.69
13 Springer 0.292 62 283 0.974 1.69
14 Reddick 0.275 36 205 0.728 1.83
15 Chirinos 0.238 40 162 0.791 2.45
16 Bellinger 0.305 84 351 1.035 1.14
17 Turner 0.290 51 244 0.881 1.72
18 Seager 0.272 64 236 0.817 2.23
19 Taylor 0.262 45 169 0.794 3.11
20 Muncy 0.251 58 251 0.889 1.65
21 Meadows 0.291 69 296 0.922 2.43
22 Garcia 0.282 47 227 0.796 4.03
23 Pham 0.273 56 255 0.818 1.52
24 Choi 0.261 41 188 0.822 1.69
25 Adames 0.254 46 222 0.735 3.32
26 Yelich 0.329 76 328 1.101 1.48
27 Braun 0.285 55 232 0.849 3.09
28 Moustakas 0.254 66 270 0.845 1.85
29 Grandal 0.246 56 240 0.848 1.28
30 Arcia 0.223 32 173 0.633 2.53")
plot(batters$EBH,batters$TB,main="Attribute Pairing 5",xlab="EBH",ylab="TB")
lm(formula = batters$EBH~batters$TB)
#Call:
#lm(formula = batters$EBH ~ batters$TB)
#Coefficients:
#(Intercept) batters$TB
# -4.1275 0.2416
lin_model_1<-lm(formula = batters$EBH~batters$TB)
summary(lin_model_1)
abline(-4.12752, 0.24162)
I apologize for the messy coding, this is for a class.
Your formula is backwards in the lm() function call. The dependent variable is on the left side of the "~".
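In your plot the y-axis (dependent variable) is TB, but in the linear regression model TB is the independent variable. The line from abline(-4.12752, 0.24162) is therefore drawn at heights of roughly 4 to 16 over the plotted EBH range, far below the TB values on the y-axis, so it never appears. For the regression line to match the plot, swap EBH and TB in the formula.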
plot(batters$EBH,batters$TB,main="Attribute Pairing 5",xlab="EBH",ylab="TB")
model <-lm(formula = batters$TB ~batters$EBH)
model
#Call:
#lm(formula = batters$TB ~ batters$EBH)
#Coefficients:
#(Intercept)  batters$EBH
#     46.510        3.603
abline(model)
#or
abline(46.51, 3.60)
Also, if you pass the fitted model object to abline, you do not need to type the intercept and slope by hand.
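The mismatch between the two formula directions can also be checked numerically, without looking at the plot. The sketch below uses only the first five rows of the batters data from the question, so the coefficients differ from the full-data fit; the variable names are chosen here for illustration.

```r
# First five rows of the batters data from the question
ebh <- c(61, 55, 64, 46, 47)
tb  <- c(312, 236, 292, 204, 208)

wrong <- lm(ebh ~ tb)   # predicts EBH from TB (direction used in the question)
right <- lm(tb ~ ebh)   # predicts TB from EBH (matches plot(ebh, tb))

# Heights at which abline() would draw each line over the plotted x values
wrong_y <- coef(wrong)[[1]] + coef(wrong)[[2]] * ebh
right_y <- coef(right)[[1]] + coef(right)[[2]] * ebh

range(tb)        # y-axis range of the plot
range(wrong_y)   # lies entirely below the y-axis range
range(right_y)   # overlaps the y-axis range
```

After plot(ebh, tb), calling abline(right) would draw the visible line, while abline(wrong) would fall below the plotting region.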

Accumulate or build a prediction where the prediction output is an input to the next item

Some data:
mydiamonds <- diamonds %>%
group_by(cut, color) %>%
mutate(cumprice = cumsum(price)) %>%
mutate(lag_cumprice = lag(cumprice)) %>%
na.omit(.)
A model:
nonsense_model <- glm(cumprice ~ depth + lag_cumprice, family = "poisson", data = mydiamonds)
To predict with this model, the output of the prediction is also an input, since it uses lagged data. So I cannot just use predict. Here's an attempted custom function:
acumPredict <- function(dta, mod, initial_amount) {
# for each row in the dataframe, predict/build
for(r in 1:nrow(dta)) {
total_exponent <-
mod$coefficients['(Intercept)'] +
(mod$coefficients['depth'] * dta$depth[r]) +
(mod$coefficients['lag_cumprice'] * initial_amount)
predictced_cumprice = exp(total_exponent)
#<update initial_value for the list item here with predictced_cumprice>
return(predictced_cumprice)
}
}
And, here is a list of dataframes, which is just mydiamonds split by the grouping, along with an item for the initial_value of cumulative_price:
mylist <- mydiamonds %>%
group_split %>%
map(~ list(dta = ., initial_val = min(.$cumprice)))
Now I have all the pieces. For each data frame within mylist, I would like to mutate a new field predicted_cumprice using the function acumPredict. acumPredict takes as input initial_amount, which should start at the list item's initial_val. Then, on each row, it should accumulate predictions so that the predicted cumprice for one row becomes the lag_cumprice input for the next row's prediction.
I hope my post is clear; it's tricky to communicate well.
I can visualize what I need, I just cannot code it. In other words, I need to update the value of initial_val for the list item on each iteration/row during mutate(predicted_cumprice = acumPredict(...)) (assuming this is the 'right' approach?).
How can I do this?
We can update initial_amount on each iteration inside the function and store the prediction for row r as we go.
-updated function
acumPredict <- function(dta, mod, initial_amount) {
dta$predicted_cumprice <- NA_real_
for(r in 1:nrow(dta)) {
total_exponent <-
mod$coefficients['(Intercept)'] +
(mod$coefficients['depth'] * dta$depth[r]) +
(mod$coefficients['lag_cumprice'] * initial_amount)
predictced_cumprice = exp(total_exponent)
initial_amount <- predictced_cumprice
dta$predicted_cumprice[r] <- predictced_cumprice
}
dta$predicted_cumprice
}
-testing
mylist2 <- mylist %>%
map( ~ {
.x$dta$predicted_cumprice <- acumPredict(.x$dta,
mod = nonsense_model, initial_amount = .x$initial_val)
.x
})
-checking the output
mylist2[1:2]
[[1]]
[[1]]$dta
# A tibble: 162 x 13
carat cut color clarity depth table price x y z cumprice lag_cumprice predicted_cumprice
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <dbl>
1 0.71 Fair D VS2 56.9 65 2858 5.89 5.84 3.34 5706 2848 2111578.
2 0.9 Fair D SI2 66.9 57 2885 6.02 5.9 3.99 8591 5706 2214081.
3 1 Fair D SI2 69.3 58 2974 5.96 5.87 4.1 11565 8591 2106000.
4 1.01 Fair D SI2 64.6 56 3003 6.31 6.24 4.05 14568 11565 2355904.
5 0.73 Fair D VS1 66 54 3047 5.56 5.66 3.7 17615 14568 2355406.
6 0.71 Fair D VS2 64.7 58 3077 5.61 5.58 3.62 20692 17615 2440519.
7 0.91 Fair D SI2 62.5 66 3079 6.08 6.01 3.78 23771 20692 2625804.
8 0.9 Fair D SI2 65.9 59 3205 6 5.95 3.94 26976 23771 2461070.
9 0.9 Fair D SI2 66 58 3205 6 5.97 3.95 30181 26976 2393474.
10 0.9 Fair D SI2 64.7 54 3205 6.1 6.04 3.93 33386 30181 2454723.
# … with 152 more rows
[[1]]$initial_val
[1] 5706
[[2]]
[[2]]$dta
# A tibble: 223 x 13
carat cut color clarity depth table price x y z cumprice lag_cumprice predicted_cumprice
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <dbl>
1 0.86 Fair E SI2 55.1 69 2757 6.45 6.33 3.52 3094 337 2217306.
2 1.01 Fair E I1 64.5 58 2788 6.29 6.21 4.03 5882 3094 2402788.
3 1.01 Fair E SI2 67.4 60 2797 6.19 6.05 4.13 8679 5882 2283138.
4 0.57 Fair E VVS1 58.7 66 2805 5.34 5.43 3.16 11484 8679 2844474.
5 0.96 Fair E SI2 53.1 63 2815 6.73 6.65 3.55 14299 11484 3611755.
6 0.98 Fair E SI2 53.3 67 2855 6.82 6.74 3.61 17154 14299 4037777.
7 1.01 Fair E SI2 67.6 57 2862 6.21 6.11 4.18 20016 17154 2913440.
8 0.8 Fair E SI1 56.3 63 2885 6.22 6.14 3.48 22901 20016 3343916.
9 0.71 Fair E VS2 64.6 59 2902 5.62 5.59 3.62 25803 22901 2845239.
10 0.9 Fair E SI2 65 58 2930 6.08 6.04 3.94 28733 25803 2608244.
# … with 213 more rows
[[2]]$initial_val
[1] 3094
-manual check for the first three values for first list element
> unname(exp(nonsense_model$coefficients['lag_cumprice'] * 5706 + nonsense_model$coefficients['depth'] * 56.9 + nonsense_model$coefficients['(Intercept)']))
[1] 2111578
> unname(exp(nonsense_model$coefficients['lag_cumprice'] * 2111578 + nonsense_model$coefficients['depth'] * 66.9 + nonsense_model$coefficients['(Intercept)']))
[1] 2214081
> unname(exp(nonsense_model$coefficients['lag_cumprice'] * 2214081 + nonsense_model$coefficients['depth'] * 69.3 + nonsense_model$coefficients['(Intercept)']))
[1] 2106000
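The same running update can also be expressed without an explicit loop, using base R's Reduce with accumulate = TRUE. This is only a sketch: the coefficients below are made-up stand-ins for coef(nonsense_model), and acum_predict_reduce is a hypothetical name.

```r
# Hypothetical coefficients standing in for coef(nonsense_model)
coefs <- c("(Intercept)" = 0.1, depth = 0.01, lag_cumprice = 0.001)

acum_predict_reduce <- function(depth, coefs, initial_amount) {
  # One step: the previous prediction feeds the lag_cumprice term
  step <- function(prev, d) {
    exp(coefs[["(Intercept)"]] +
          coefs[["depth"]] * d +
          coefs[["lag_cumprice"]] * prev)
  }
  # accumulate = TRUE keeps every intermediate value; [-1] drops the seed
  Reduce(step, depth, init = initial_amount, accumulate = TRUE)[-1]
}

depth <- c(60, 62, 58)
pred <- acum_predict_reduce(depth, coefs, initial_amount = 5)
pred
```

With the real model, the call inside map would presumably become acum_predict_reduce(.x$dta$depth, coef(nonsense_model), .x$initial_val).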

remove nonEst row(s) in emmeans result

I have an unbalanced design, so when I apply emmeans to my model at specific levels, a nested factor level that is absent there (but present at other levels) is marked as nonEst in the output table. How do I change my code so that the table below shows only the three estimable rows?
emmeans(model, specs = ~ Rot/Crop | Herb, at = list(Rot = "3", Herb="conv"))
Herb = conv:
Rot Crop emmean SE df lower.CL upper.CL
3 alfalfa nonEst NA NA NA NA
3 corn 3.50 0.283 270 2.94 4.06
3 oat 3.44 0.283 270 2.88 3.99
3 soybean 2.65 0.253 270 2.15 3.15
Confidence level used: 0.95
An option is to tidy it with broom and then remove the NA rows with na.omit
library(emmeans)
library(broom)
library(dplyr)
emmeans(model, specs = ~ Rot/Crop | Herb, at = list(Rot = "3", Herb="conv")) %>%
tidy %>%
na.omit
Or with as.data.frame/subset
subset(as.data.frame( emmeans(model, specs = ~ Rot/Crop | Herb,
at = list(Rot = "3", Herb="conv"))), !is.na(emmean))
Using a reproducible example
warp.lm <- lm(breaks ~ wool * tension, data = head(warpbreaks, 30))
emmeans (warp.lm, ~ wool | tension)
#tension = L:
# wool emmean SE df lower.CL upper.CL
# A 44.6 4.24 26 35.85 53.3
# B 23.3 7.34 26 8.26 38.4
#tension = M:
# wool emmean SE df lower.CL upper.CL
# A 24.0 4.24 26 15.29 32.7
# B nonEst NA NA NA NA
#tension = H:
# wool emmean SE df lower.CL upper.CL
# A 24.6 4.24 26 15.85 33.3
# B nonEst NA NA NA NA
emmeans (warp.lm, ~ wool | tension) %>%
tidy %>%
na.omit
# A tibble: 4 x 7
# wool tension estimate std.error df statistic p.value
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 A L 44.6 4.24 26 10.5 7.29e-11
#2 B L 23.3 7.34 26 3.18 3.78e- 3
#3 A M 24.0 4.24 26 5.67 5.84e- 6
#4 A H 24.6 4.24 26 5.80 4.15e- 6
Or in base R, coerce it to data.frame and then subset the non-NA rows
subset(as.data.frame(emmeans (warp.lm, ~ wool | tension)), !is.na(emmean))
# wool tension emmean SE df lower.CL upper.CL
#1 A L 44.55556 4.235135 26 35.850110 53.26100
#2 B L 23.33333 7.335470 26 8.255059 38.41161
#3 A M 24.00000 4.235135 26 15.294554 32.70545
#5 A H 24.55556 4.235135 26 15.850110 33.26100
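One design note: na.omit drops a row if any column is NA, whereas filtering on the estimate column alone only removes the non-estimable rows. The sketch below uses a hand-built data frame that mimics as.data.frame() of the table in the question (values copied from it); emm_df is a stand-in, not real emmeans output.

```r
# Stand-in for as.data.frame(emmeans(...)) from the question
emm_df <- data.frame(
  Rot    = "3",
  Crop   = c("alfalfa", "corn", "oat", "soybean"),
  emmean = c(NA, 3.50, 3.44, 2.65),
  SE     = c(NA, 0.283, 0.283, 0.253),
  stringsAsFactors = FALSE
)

# Keep only the rows with an estimable mean
estimable <- emm_df[!is.na(emm_df$emmean), ]
estimable
```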

P spline smoother

Hi, I am trying to fit a non-parametric regression smoother to the difference between the control and treatment groups, so as to determine the effectiveness of the appetite suppressant over time. I then need to use my model to estimate the difference between the treatment and control groups at t=0 and t=50.
I want to use a P-spline smoother, but I do not have enough background on it.
This is my data:
t
0 1 3 7 8 10 14 15 17 21 22 24 28 29 31 35 36 38 42 43 45 49 50 52 56 57 59 63 64 70 73 77 80 84 87 91 94 98 105
con
20.5 19.399 22.25 17.949 19.899 21.449 16.899 21.5 22.8 24.699 26.2 28.5 24.35 24.399 26.6 26.2 26.649 29.25 27.55 29.6 24.899 27.6 28.1 27.85 26.899 27.8 30.25 27.6 27.449 27.199 27.8 28.199 28 27.3 27.899 28.699 27.6 28.6 27.5
trt
21.3 16.35 19.25 16.6 14.75 18.149 14.649 16.7 15.05 15.5 13.949 16.949 15.6 14.699 14.15 14.899 12.449 14.85 16.75 14.3 16 16.85 15.65 17.149 18.05 15.699 18.25 18.149 16.149 16.899 18.95 22 23.6 23.75 27.149 28.449 25.85 29.7 29.449
where:
t - the time in days since the experiment started.
con - the median food intake of the control group.
trt - the median food intake of the treatment group.
Can anybody help please?
Just to give you a start: the mgcv package implements various regression spline bases, including P-splines (penalized B-splines with a difference penalty).
First, you need to set up your data:
dat <- data.frame(time = rep(t, 2), y = c(con, trt),
grp = gl(2, 39, labels = c("con", "trt")))
Then call gam for non-parametric regression:
library(mgcv) # no need to install; it comes with R
fit <- gam(y ~ s(time, bs = 'ps', by = grp) + grp, data = dat)
Read "mgcv: how to specify interaction between smooth and factor?" for how to specify the interaction. bs = 'ps' sets the P-spline basis. By default the basis dimension is k = 10, with evenly spaced knots. You can change k if you want.
For more about P-splines in mgcv, read "mgcv: how to extract knots, basis, coefficients and predictions for P-splines in adaptive smooth?".
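To get from the fitted model to the estimates the question asks for, one possible continuation is to predict both groups at t = 0 and t = 50 and subtract. This is a sketch under the model above, not a full analysis. The data vectors are typed in from the question (with t renamed to time_days to avoid masking base R's t function), and only point estimates are produced; standard errors for the difference would need extra work, for example via predict(..., type = "lpmatrix"), which is not shown here.

```r
library(mgcv)

# Data from the question
time_days <- c(0, 1, 3, 7, 8, 10, 14, 15, 17, 21, 22, 24, 28, 29, 31, 35,
               36, 38, 42, 43, 45, 49, 50, 52, 56, 57, 59, 63, 64, 70, 73,
               77, 80, 84, 87, 91, 94, 98, 105)
con <- c(20.5, 19.399, 22.25, 17.949, 19.899, 21.449, 16.899, 21.5, 22.8,
         24.699, 26.2, 28.5, 24.35, 24.399, 26.6, 26.2, 26.649, 29.25,
         27.55, 29.6, 24.899, 27.6, 28.1, 27.85, 26.899, 27.8, 30.25, 27.6,
         27.449, 27.199, 27.8, 28.199, 28, 27.3, 27.899, 28.699, 27.6,
         28.6, 27.5)
trt <- c(21.3, 16.35, 19.25, 16.6, 14.75, 18.149, 14.649, 16.7, 15.05,
         15.5, 13.949, 16.949, 15.6, 14.699, 14.15, 14.899, 12.449, 14.85,
         16.75, 14.3, 16, 16.85, 15.65, 17.149, 18.05, 15.699, 18.25,
         18.149, 16.149, 16.899, 18.95, 22, 23.6, 23.75, 27.149, 28.449,
         25.85, 29.7, 29.449)

dat <- data.frame(time = rep(time_days, 2), y = c(con, trt),
                  grp = gl(2, 39, labels = c("con", "trt")))
fit <- gam(y ~ s(time, bs = "ps", by = grp) + grp, data = dat)

# Prediction grids: same times for each group, same factor levels as dat
new_times <- c(0, 50)
new_con <- data.frame(time = new_times,
                      grp = factor("con", levels = levels(dat$grp)))
new_trt <- data.frame(time = new_times,
                      grp = factor("trt", levels = levels(dat$grp)))

# Estimated treatment-minus-control difference at each time
est_diff <- as.numeric(predict(fit, newdata = new_trt)) -
  as.numeric(predict(fit, newdata = new_con))
names(est_diff) <- paste0("t", new_times)
est_diff
```

Because both groups are observed at the same times, smoothing the pointwise difference trt - con with a single s(time, bs = "ps") term would be another reasonable route.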

Resources