The following model is model 1 of Preece and Baines (1978, Annals of Human Biology), and is used to describe human growth.
My Stan code for this model is as follows:
```{stan output.var="test"}
data {
  int<lower=1> n;
  ordered[n] t; // age
  ordered[n] y; // height of human
}
parameters {
  positive_ordered[2] h;
  real<lower=0, upper=t[n-1]> theta;
  positive_ordered[2] s;
  real<lower=0> sigma;
}
model {
  h[1] ~ uniform(0, y[n]);
  h[2] ~ normal(180, 20);
  sigma ~ student_t(2, 0, 1);
  s[1] ~ normal(5, 5);
  s[2] ~ normal(5, 5);
  theta ~ normal(10, 5);
  y ~ normal(h[2] - 2 * (h[2] - h[1]) * inv(exp(s[1] * (t - theta)) + exp(s[2] * (t - theta))), sigma);
}
```
I now want to create a hierarchical version of this model, with each parameter modelled either as identical across boys (such as the measurement variability sigma) or hierarchically (such as h_1, the adult height).
For the parameters sigma and theta, I want to keep the priors identical across boys: sigma ~ student_t(2, 0, 1) and theta ~ normal(10, 5).
Unfortunately, I have almost no experience implementing hierarchical models, and my attempts have not got beyond the simple binomial hierarchical examples in Bayesian textbooks (see chapter 12, pages 358-359 of Statistical Rethinking by Richard McElreath). I do, however, understand the theory behind hierarchical modelling, as presented in chapter 5 of Bayesian Data Analysis by Andrew Gelman and chapter 12 of Statistical Rethinking by Richard McElreath.
I am keen to discover how this type of hierarchical model would be implemented in Stan. Ideally, I am seeking explanations alongside the code, so that I may learn how to implement these types of hierarchical model examples independently in the future.
A sample of my data is as follows:
age boy01 boy02 boy03 boy04 boy05 boy06 boy07 boy08 boy09 boy10 boy11 boy12 boy13 boy14 boy15 boy16 boy17 boy18
1 1 81.3 76.2 76.8 74.1 74.2 76.8 72.4 73.8 75.4 78.8 76.9 81.6 78 76.4 76.4 76.2 75 79.7
2 1.25 84.2 80.4 79.8 78.4 76.3 79.1 76 78.7 81 83.3 79.9 83.7 81.8 79.4 81.2 79.2 78.4 81.3
3 1.5 86.4 83.2 82.6 82.6 78.3 81.1 79.4 83 84.9 87 84.1 86.3 85 83.4 86 82.3 82 83.3
4 1.75 88.9 85.4 84.7 85.4 80.3 84.4 82 85.8 87.9 89.6 88.5 88.8 86.4 87.6 89.2 85.4 84 86.5
5 2 91.4 87.6 86.7 88.1 82.2 87.4 84.2 88.4 90 91.4 90.6 92.2 87.1 91.4 92.2 88.4 85.9 88.9
6 3 101. 97 94.2 98.6 89.4 94 93.2 97.3 97.3 100. 96.6 99.3 96.2 101. 101. 101 95.6 99.4
7 4 110. 105. 100. 104. 96.9 102. 102. 107. 103. 111 105. 106. 104 106. 110. 107. 102. 104.
8 5 116. 112. 107. 111 104. 109. 109 113. 108. 118. 112 113. 111 113. 117. 115. 109. 112.
9 6 122. 119. 112. 116. 111. 116. 117. 119. 114. 126. 119. 120. 117. 120. 122. 121. 118. 119
10 7 130 125 119. 123. 116. 122. 123. 126. 120. 131. 125. 127. 124. 129. 130. 128 125. 128
I acknowledge the lack of precision in the decimal places. The data is stored as a tibble, which doesn't seem to respond to R's usual commands for printing more digits. For the sake of consistency, it might be better to simply ignore all rows after row 5, since rows 1-5 display the full precision present in the original data.
In the full data, the ages are
> Children$age
[1] 1.00 1.25 1.50 1.75 2.00 3.00 4.00 5.00 6.00 7.00 8.00 8.50 9.00 9.50 10.00 10.50 11.00 11.50 12.00 12.50
[21] 13.00 13.50 14.00 14.50 15.00 15.50 16.00 16.50 17.00 17.50 18.00
And there are 39 boys, listed in the same wide-data format as the above sample.
To start, let's fit a (non-hierarchical) non-linear growth model using Stan.
We read in the sample data.
library(tidyverse);
df <- read.table(text = "
age boy01 boy02 boy03 boy04 boy05 boy06 boy07 boy08 boy09 boy10 boy11 boy12 boy13 boy14 boy15 boy16 boy17 boy18
1 1 81.3 76.2 76.8 74.1 74.2 76.8 72.4 73.8 75.4 78.8 76.9 81.6 78 76.4 76.4 76.2 75 79.7
2 1.25 84.2 80.4 79.8 78.4 76.3 79.1 76 78.7 81 83.3 79.9 83.7 81.8 79.4 81.2 79.2 78.4 81.3
3 1.5 86.4 83.2 82.6 82.6 78.3 81.1 79.4 83 84.9 87 84.1 86.3 85 83.4 86 82.3 82 83.3
4 1.75 88.9 85.4 84.7 85.4 80.3 84.4 82 85.8 87.9 89.6 88.5 88.8 86.4 87.6 89.2 85.4 84 86.5
5 2 91.4 87.6 86.7 88.1 82.2 87.4 84.2 88.4 90 91.4 90.6 92.2 87.1 91.4 92.2 88.4 85.9 88.9
6 3 101. 97 94.2 98.6 89.4 94 93.2 97.3 97.3 100. 96.6 99.3 96.2 101. 101. 101 95.6 99.4
7 4 110. 105. 100. 104. 96.9 102. 102. 107. 103. 111 105. 106. 104 106. 110. 107. 102. 104.
8 5 116. 112. 107. 111 104. 109. 109 113. 108. 118. 112 113. 111 113. 117. 115. 109. 112.
9 6 122. 119. 112. 116. 111. 116. 117. 119. 114. 126. 119. 120. 117. 120. 122. 121. 118. 119
10 7 130 125 119. 123. 116. 122. 123. 126. 120. 131. 125. 127. 124. 129. 130. 128 125. 128", header = T, row.names = 1);
df <- df %>%
gather(boy, height, -age);
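gather() reshapes the wide table (one column per boy) into long format, with one row per boy-age observation. As a cross-check, the same reshape can be done in base R; here is a sketch on a hypothetical two-boy, two-age toy table (the names `wide` and `long` are mine, not from the answer):

```r
# toy wide table shaped like df: one age column plus one column per boy
wide <- data.frame(age = c(1, 2), boy01 = c(81.3, 91.4), boy02 = c(76.2, 87.6))

# base-R equivalent of gather(boy, height, -age):
# stack the per-boy columns and repeat the age column alongside them
long <- data.frame(
  age    = rep(wide$age, times = 2),
  boy    = rep(c("boy01", "boy02"), each = nrow(wide)),
  height = c(wide$boy01, wide$boy02)
)
long  # 4 rows: one height per boy per age
```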
We define the Stan model.
model <- "
data {
  int N;        // Number of observations
  real y[N];    // Height
  real t[N];    // Time
}
parameters {
  real<lower=0,upper=1> s[2];
  real h_theta;
  real theta;
  real<lower=0> sigma;    // constrained positive (the original declaration allowed negative scales)
}
transformed parameters {
  vector[N] mu;
  real h1;
  h1 = max(y);            // fix adult height h1 at the tallest observed height
  for (i in 1:N)
    mu[i] = h1 - 2 * (h1 - h_theta) / (exp(s[1] * (t[i] - theta)) + exp(s[2] * (t[i] - theta)));
}
model {
  // Priors
  s ~ cauchy(0, 2.5);
  y ~ normal(mu, sigma);
}
"
Here we consider weakly informative (regularising) priors on s[1] and s[2].
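Before fitting, the mean function can be sanity-checked in plain R: at t = theta the curve passes through h_theta, and for large t it levels off at the adult height h1. The parameter values below are made up purely for illustration:

```r
# Preece-Baines model 1 mean curve, written as a plain R function
pb1 <- function(t, h1, h_theta, theta, s1, s2) {
  h1 - 2 * (h1 - h_theta) / (exp(s1 * (t - theta)) + exp(s2 * (t - theta)))
}

# hypothetical parameter values, chosen only for illustration
h1 <- 175; h_theta <- 163; theta <- 14; s1 <- 0.1; s2 <- 1.2

pb1(theta, h1, h_theta, theta, s1, s2)  # equals h_theta: the denominator is exactly 2
pb1(60, h1, h_theta, theta, s1, s2)     # approaches h1 for ages far past theta
```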
Fit the model to the data.
library(rstan);
options(mc.cores = parallel::detectCores());
rstan_options(auto_write = TRUE);
model <- stan(
model_code = model,
data = list(
N = nrow(df),
y = df$height,
t = df$age),
iter = 4000);
Summary of the 6 model parameters:
summary(model, pars = c("h1", "h_theta", "theta", "s", "sigma"))$summary
# mean se_mean sd 2.5% 25% 50%
#h1 131.0000000 0.000000000 0.0000000 131.0000000 131.0000000 131.0000000
#h_theta 121.6874553 0.118527828 2.7554944 115.4316738 121.1654809 122.2134014
#theta 6.5895553 0.019738319 0.5143429 5.4232740 6.4053479 6.6469534
#s[1] 0.7170836 0.214402086 0.3124318 0.1748077 0.3843143 0.8765256
#s[2] 0.3691174 0.212062373 0.3035039 0.1519308 0.1930381 0.2066811
#sigma 3.1524819 0.003510676 0.1739904 2.8400096 3.0331962 3.1411533
# 75% 97.5% n_eff Rhat
#h1 131.0000000 131.0000000 8000.000000 NaN
#h_theta 123.0556379 124.3928800 540.453594 1.002660
#theta 6.8790801 7.3376348 679.024115 1.002296
#s[1] 0.9516115 0.9955989 2.123501 3.866466
#s[2] 0.2954409 0.9852540 2.048336 6.072550
#sigma 3.2635849 3.5204101 2456.231113 1.001078
So what does this mean? The Rhat values for s[1] and s[2] show that there are convergence issues for these two parameters. This is because s[1] and s[2] are indistinguishable: they cannot both be estimated at the same time. A stronger regularising prior on s[1] and s[2] would probably drive one of them to zero.
I'm not sure I understand the point of s[1] and s[2]; from a statistical-modelling point of view, you cannot obtain estimates for both parameters in the simple non-linear growth model that we're considering.
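The indistinguishability can be seen directly: the denominator of the mean function is symmetric in s[1] and s[2], so swapping the two values changes nothing about the likelihood. A quick check in plain R, with made-up values:

```r
# denominator of the Preece-Baines mean function
denom <- function(s1, s2, t, theta) exp(s1 * (t - theta)) + exp(s2 * (t - theta))

ages <- seq(1, 18, by = 0.5)
# swapping s1 and s2 gives identical values at every age, so the posterior has
# two symmetric modes and the chains wander between them (hence the bad Rhat)
all.equal(denom(0.2, 1.5, ages, 10), denom(1.5, 0.2, ages, 10))  # TRUE
```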
Update
As promised here is an update. This is turning into quite a long post, I've tried to make things as clear as possible by adding additional explanations.
Preliminary comments
Using positive_ordered as data type for s makes a significant difference in terms of convergence of solutions. It's not clear to me why that is the case, nor do I know how Stan implements positive_ordered, but it works.
In this hierarchical approach, we partially pool height data across all boys by modelling h1 ~ normal(mu_h1, sigma_h1), with priors on the hyperparameters mu_h1 ~ normal(max(y), 10) and sigma_h1 ~ cauchy(0, 10) (the latter is a half-Cauchy prior because sigma_h1 is declared as real<lower=0>).
To be honest, I am unsure about the interpretation (and interpretability) of some of the parameters. Estimates for h_1 and h_theta are very similar, and in some way cancel each other out. I would imagine that this creates some convergence issues when fitting the model, but as you can see further down, Rhat values seem ok. Still, as I don't know enough about the model, data and its context, I remain somewhat skeptical as to the interpretability of those parameters. Extending the model by turning some of the other parameters into group-level parameters is straightforward from a statistical modelling point of view; however I imagine that difficulties will arise from the indistinguishability and lack of interpretability.
All these points aside, this should give you a practical example of how to implement a hierarchical model.
The Stan model
model_code <- "
data {
  int N;                        // Number of observations
  int J;                        // Number of boys
  int<lower=1,upper=J> boy[N];  // Boy of observation
  real y[N];                    // Height
  real t[N];                    // Time
}
parameters {
  real<lower=0> h1[J];
  real<lower=0> h_theta;
  real<lower=0> theta;
  positive_ordered[2] s;
  real<lower=0> sigma;
  // Hyperparameters
  real<lower=0> mu_h1;
  real<lower=0> sigma_h1;
}
transformed parameters {
  vector[N] mu;
  for (i in 1:N)
    mu[i] = h1[boy[i]] - 2 * (h1[boy[i]] - h_theta) / (exp(s[1] * (t[i] - theta)) + exp(s[2] * (t[i] - theta)));
}
model {
  h1 ~ normal(mu_h1, sigma_h1);  // Partially pool h1 parameters across boys
  mu_h1 ~ normal(max(y), 10);    // Prior on h1 hyperparameter mu
  sigma_h1 ~ cauchy(0, 10);      // Half-Cauchy prior on h1 hyperparameter sigma
  h_theta ~ normal(max(y), 2);   // Prior on h_theta
  theta ~ normal(max(t), 2);     // Prior on theta
  s ~ cauchy(0, 1);              // Half-Cauchy priors on s[1] and s[2]
  y ~ normal(mu, sigma);
}
"
To summarise: We pool height data from all boys to improve estimates at the group (i.e. boy) level, by modelling the adult height parameter as h1 ~ normal(mu_h1, sigma_h1), where the hyperparameters mu_h1 and sigma_h1 characterise the normal distribution of adult height values across all boys. We choose weakly informative priors on the hyperparameters, and choose weakly informative priors on all remaining parameters similar to the first complete-pooling example.
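Looking ahead: if the sampler were ever to show divergent transitions or low effective sample sizes for h1, a standard remedy is the non-centered parameterization of the hierarchical part. A sketch of only the blocks that would change (not used for the results here; note the positivity constraint on h1 is dropped, which is effectively harmless since adult heights are far from zero):

```stan
parameters {
  vector[J] h1_raw;        // standard-normal offsets, one per boy
  real<lower=0> mu_h1;
  real<lower=0> sigma_h1;
  // ... remaining parameters unchanged ...
}
transformed parameters {
  // h1 becomes a deterministic function of the hyperparameters;
  // this implies h1[j] ~ normal(mu_h1, sigma_h1) without sampling h1 directly
  vector[J] h1;
  for (j in 1:J)
    h1[j] = mu_h1 + sigma_h1 * h1_raw[j];
  // ... mu computed from h1 exactly as before ...
}
model {
  h1_raw ~ normal(0, 1);   // replaces h1 ~ normal(mu_h1, sigma_h1)
  // ... remaining priors and likelihood unchanged ...
}
```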
Fit the model
# Fit model
fit <- stan(
  model_code = model_code,
  data = list(
    N = nrow(df),
    J = length(unique(df$boy)),
    boy = as.integer(factor(df$boy)),  # Stan needs integer group indices 1..J, not "boy01" labels
    y = df$height,
    t = df$age),
  iter = 4000)
Extract summary
We extract parameter estimates for all parameters; note that we now have as many h1 parameters as there are groups (i.e. boys).
# Get summary
summary(fit, pars = c("h1", "h_theta", "theta", "s", "sigma"))$summary
# mean se_mean sd 2.5% 25% 50%
#h1[1] 142.9406153 0.1046670943 2.41580757 138.4272280 141.2858391 142.909765
#h1[2] 143.7054020 0.1070466445 2.46570025 139.1301456 142.0233342 143.652657
#h1[3] 144.0352331 0.1086953809 2.50145442 139.3982034 142.3131167 143.971473
#h1[4] 143.8589955 0.1075753575 2.48015745 139.2689731 142.1666685 143.830347
#h1[5] 144.7359976 0.1109871908 2.55284812 140.0529359 142.9917503 144.660586
#h1[6] 143.9844938 0.1082691127 2.49497990 139.3378948 142.2919990 143.926931
#h1[7] 144.3857221 0.1092604239 2.51645359 139.7349112 142.6665955 144.314645
#h1[8] 143.7469630 0.1070594855 2.46860328 139.1748700 142.0660983 143.697302
#h1[9] 143.6841113 0.1072208284 2.47391295 139.0885987 141.9839040 143.644357
#h1[10] 142.9518072 0.1041206784 2.40729732 138.4289207 141.3114204 142.918407
#h1[11] 143.5352502 0.1064173663 2.45712021 138.9607665 141.8547610 143.483157
#h1[12] 143.0941582 0.1050061258 2.42894673 138.5579378 141.4295430 143.055576
#h1[13] 143.6194965 0.1068494690 2.46574352 138.9426195 141.9412820 143.577920
#h1[14] 143.4477182 0.1060254849 2.44776536 138.9142081 141.7708660 143.392231
#h1[15] 143.1415683 0.1049131998 2.42575487 138.6246642 141.5014391 143.102219
#h1[16] 143.5686919 0.1063594201 2.45328456 139.0064573 141.8962853 143.510276
#h1[17] 144.0170715 0.1080567189 2.49269747 139.4162885 142.3138300 143.965127
#h1[18] 143.4740997 0.1064867748 2.45545200 138.8768051 141.7989566 143.426211
#h_theta 134.3394366 0.0718785944 1.72084291 130.9919889 133.2348411 134.367152
#theta 8.2214374 0.0132434918 0.45236221 7.4609612 7.9127800 8.164685
#s[1] 0.1772044 0.0004923951 0.01165119 0.1547003 0.1705841 0.177522
#s[2] 1.6933846 0.0322953612 1.18334358 0.6516669 1.1630900 1.463148
#sigma 2.2601677 0.0034146522 0.13271459 2.0138514 2.1657260 2.256678
# 75% 97.5% n_eff Rhat
#h1[1] 144.4795105 147.8847202 532.7265 1.008214
#h1[2] 145.2395543 148.8419618 530.5599 1.008187
#h1[3] 145.6064981 149.2080965 529.6183 1.008087
#h1[4] 145.4202919 149.0105666 531.5363 1.008046
#h1[5] 146.3200407 150.0701757 529.0592 1.008189
#h1[6] 145.5551778 149.1365279 531.0372 1.008109
#h1[7] 145.9594956 149.5996605 530.4593 1.008271
#h1[8] 145.3032680 148.8695637 531.6824 1.008226
#h1[9] 145.2401743 148.7674840 532.3662 1.008023
#h1[10] 144.4811712 147.9218834 534.5465 1.007937
#h1[11] 145.1153635 148.5968945 533.1235 1.007988
#h1[12] 144.6479561 148.0546831 535.0652 1.008115
#h1[13] 145.1660639 148.6562172 532.5386 1.008138
#h1[14] 144.9975197 148.5273804 532.9900 1.008067
#h1[15] 144.6733010 148.1130207 534.6057 1.008128
#h1[16] 145.1163764 148.6027096 532.0396 1.008036
#h1[17] 145.5578107 149.2014363 532.1519 1.008052
#h1[18] 145.0249329 148.4886949 531.7060 1.008055
#h_theta 135.4870338 137.6753254 573.1698 1.006818
#theta 8.4812339 9.2516700 1166.7226 1.002306
#s[1] 0.1841457 0.1988365 559.9036 1.005333
#s[2] 1.8673249 4.1143099 1342.5839 1.001562
#sigma 2.3470429 2.5374239 1510.5824 1.001219
Visualise adult height estimates
Finally, we plot the adult height estimates h_1 for all boys, including 50% and 95% intervals.
# Plot h1 values
summary(fit, pars = c("h1"))$summary %>%
as.data.frame() %>%
rownames_to_column("Variable") %>%
mutate(
Variable = gsub("(h1\\[|\\])", "", Variable),
Variable = unique(df$boy)[as.integer(Variable)]) %>%
ggplot(aes(x = `50%`, y = Variable)) +
geom_point(size = 3) +
geom_segment(aes(x = `2.5%`, xend = `97.5%`, yend = Variable), size = 1) +
geom_segment(aes(x = `25%`, xend = `75%`, yend = Variable), size = 2) +
labs(x = "Median (plus/minus 95% and 50% CIs)", y = "h1")
Related
I have run a log-rank test with survdiff, like below:
survdiff(formula = Surv(YearsToEvent, Event) ~ Cat, data = RegressionData)
I get the following output:
N Observed Expected (O-E)^2/E (O-E)^2/V
0 30913 487 437.9 5.50 11.9
1 3755 56 23.2 46.19 48.0
2 3322 36 45.2 1.89 2.0
3 15796 260 332.6 15.85 27.3
Chisq= 71.9 on 3 degrees of freedom, p= 0.000000000000002
How can I save this output (especially the p-value) to a .txt file? I am looping over a bunch of regressions like this and want to save them all to a .txt file.
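One way to do this (a sketch using the built-in aml data from the survival package, since RegressionData isn't available here): survdiff returns the chi-squared statistic in the $chisq component, from which the p-value can be recomputed with pchisq and written out with cat.

```r
library(survival)

sd_fit <- survdiff(Surv(time, status) ~ x, data = aml)
# degrees of freedom = number of groups - 1
p_value <- pchisq(sd_fit$chisq, df = length(sd_fit$n) - 1, lower.tail = FALSE)

# append one line per model to a text file
cat(sprintf("aml: chisq = %.2f, p = %.3g\n", sd_fit$chisq, p_value),
    file = "logrank_results.txt", append = TRUE)
```

Inside a loop, the same cat(..., append = TRUE) call collects one line per fitted model.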
I have a mixed-effects model (with lme4) with a 2-way interaction term whose terms each have 4 levels, and I would like to investigate their effects in reference to the grand mean. I present an example here from the Cars93 data set, and omit the error term since it is not necessary for this example:
## shorten data frame for simplicity
df=Cars93[c(1:15),]
df=Cars93[is.element(Cars93$Make,c('Acura Integra', 'Audi 90','BMW 535i','Subaru Legacy')),]
df$Make=drop.levels(df$Make)
df$Model=drop.levels(df$Model)
## define contrasts (every factor has 4 levels)
contrasts(df$Make) = contr.treatment(4)
contrasts(df$Model) = contr.treatment(4)
## model
m1 <- lm(Price ~ Model*Make,data=df)
summary(m1)
As you can see, the first levels are omitted in the interaction term. I would like to have all 4 levels in the output, referenced to the grand mean (often called deviation coding). These are the sources I looked at: https://marissabarlaz.github.io/portfolio/contrastcoding/#coding-schemes and How to change contrasts to compare with mean of all levels rather than reference level (R, lmer)?. The last reference does not cover interactions though.
The simple answer is that what you want is not possible directly. You have to use a slightly different approach.
In a model with interactions, you want to use contrasts in which the mean is zero and not a specific level. Otherwise, the lower-order effects (i.e., main effects) are not main effects but simple effects (evaluated when the other factor level is at its reference level). This is explained in more details in my chapter on mixed models:
http://singmann.org/download/publications/singmann_kellen-introduction-mixed-models.pdf
To get what you want, you have to fit the model in a reasonable manner and then pass it to emmeans to compare against the intercept (i.e., the unweighted grand mean). This works also for interactions as shown below (as your code did not work, I use warpbreaks).
afex::set_sum_contrasts() ## uses contr.sum globally
library("emmeans")
## model
m1 <- lm(breaks ~ wool * tension,data=warpbreaks)
car::Anova(m1, type = 3)
coef(m1)[1]
# (Intercept)
# 28.14815
## both CIs include grand mean:
emmeans(m1, "wool")
# wool emmean SE df lower.CL upper.CL
# A 31.0 2.11 48 26.8 35.3
# B 25.3 2.11 48 21.0 29.5
#
# Results are averaged over the levels of: tension
# Confidence level used: 0.95
## same using test
emmeans(m1, "wool", null = coef(m1)[1], infer = TRUE)
# wool emmean SE df lower.CL upper.CL null t.ratio p.value
# A 31.0 2.11 48 26.8 35.3 28.1 1.372 0.1764
# B 25.3 2.11 48 21.0 29.5 28.1 -1.372 0.1764
#
# Results are averaged over the levels of: tension
# Confidence level used: 0.95
emmeans(m1, "tension", null = coef(m1)[1], infer = TRUE)
# tension emmean SE df lower.CL upper.CL null t.ratio p.value
# L 36.4 2.58 48 31.2 41.6 28.1 3.196 0.0025
# M 26.4 2.58 48 21.2 31.6 28.1 -0.682 0.4984
# H 21.7 2.58 48 16.5 26.9 28.1 -2.514 0.0154
#
# Results are averaged over the levels of: wool
# Confidence level used: 0.95
emmeans(m1, c("tension", "wool"), null = coef(m1)[1], infer = TRUE)
# tension wool emmean SE df lower.CL upper.CL null t.ratio p.value
# L A 44.6 3.65 48 37.2 51.9 28.1 4.499 <.0001
# M A 24.0 3.65 48 16.7 31.3 28.1 -1.137 0.2610
# H A 24.6 3.65 48 17.2 31.9 28.1 -0.985 0.3295
# L B 28.2 3.65 48 20.9 35.6 28.1 0.020 0.9839
# M B 28.8 3.65 48 21.4 36.1 28.1 0.173 0.8636
# H B 18.8 3.65 48 11.4 26.1 28.1 -2.570 0.0133
#
# Confidence level used: 0.95
Note that for coef() you probably want to use fixef() for lme4 models.
I have a mixed effect model with a log(x+1) transformed response variable. The output from emmeans with the type as "response" provides the mean and confidence intervals for both groups that I am comparing. However what I want is the mean and CI of the difference between the groups (i.e. the estimate). emmeans only provides the ratio (with type="response") or the log ratio (with type="link") and I am unsure how to change this into absolute values. If you run the model without the log(x+1) transformation then emmeans provides the estimated difference and CI around this difference, not the ratios. How can I also do this when my response variable is log(x+1) transformed?
bmnameF.lme2 = lme(log(bm+1)~TorC*name, random=~TorC|site,
data=matched.cases3F, method='REML')
emmeans(bmnameF.lme2, pairwise ~ TorC,
        type = 'response') %>% confint(OmeanFHR[[2]]) %>% as.data.frame
emmeans.TorC emmeans.emmean emmeans.SE emmeans.df emmeans.lower.CL emmeans.upper.CL contrasts.contrast contrasts.estimate contrasts.SE contrasts.df contrasts.lower.CL contrasts.upper.CL
Managed 376.5484 98.66305 25 219.5120 645.9267 Managed - Open 3.390123 1.068689 217 1.821298 6.310297
Open 111.0722 43.15374 25 49.8994 247.2381 Managed - Open 3.390123 1.068689 217 1.821298 6.310297
Let me show a different example so the results are reproducible to all viewers:
mod = lm(log(breaks+1) ~ wool*tension, data = warpbreaks)
As you see, with a log transformation, comparisons/contrasts are expressed as ratios by default. But this can be changed by specifying transform instead of type in the emmeans() call:
> emmeans(mod, pairwise ~ tension|wool, transform = "response")
$emmeans
wool = A:
tension response SE df lower.CL upper.CL
L 42.3 5.06 48 32.1 52.4
M 23.6 2.83 48 17.9 29.3
H 23.7 2.83 48 18.0 29.4
wool = B:
tension response SE df lower.CL upper.CL
L 27.7 3.32 48 21.0 34.4
M 28.4 3.40 48 21.6 35.3
H 19.3 2.31 48 14.6 23.9
Confidence level used: 0.95
$contrasts
wool = A:
contrast estimate SE df t.ratio p.value
L - M 18.6253 5.80 48 3.213 0.0065
L - H 18.5775 5.80 48 3.204 0.0067
M - H -0.0479 4.01 48 -0.012 0.9999
wool = B:
contrast estimate SE df t.ratio p.value
L - M -0.7180 4.75 48 -0.151 0.9875
L - H 8.4247 4.04 48 2.086 0.1035
M - H 9.1426 4.11 48 2.224 0.0772
P value adjustment: tukey method for comparing a family of 3 estimates
Or, you can do this later via the regrid() function:
emm1 = emmeans(mod, ~ tension | wool)
emm2 = regrid(emm1)
emm2 # estimates
pairs(emm2) # comparisons
regrid() creates a new emmGrid object where everything is already back-transformed, thus side-stepping the behavior that happens with contrasts of log-transformed results. (In the previous illustration, the transform argument just calls regrid after it constructs the reference grid.)
But there is another subtle thing going on: The transformation is auto-detected as log; the +1 part is ignored. Thus, the back-transformed estimates are all too large by 1. To get this right, you need to use the make.tran() function to create this generalization of the log transformation:
> emm3 = update(emmeans(mod, ~ tension | wool), tran = make.tran("genlog", 1))
> str(emm3)
'emmGrid' object with variables:
tension = L, M, H
wool = A, B
Transformation: “log(mu + 1)”
> regrid(emm3)
wool = A:
tension response SE df lower.CL upper.CL
L 41.3 5.06 48 31.1 51.4
M 22.6 2.83 48 16.9 28.3
H 22.7 2.83 48 17.0 28.4
wool = B:
tension response SE df lower.CL upper.CL
L 26.7 3.32 48 20.0 33.4
M 27.4 3.40 48 20.6 34.3
H 18.3 2.31 48 13.6 22.9
Confidence level used: 0.95
The comparisons will come out the same as shown earlier, because offsetting all the means by 1 doesn't affect the pairwise differences.
See vignette("transformations", "emmeans") or https://cran.r-project.org/web/packages/emmeans/vignettes/transformations.html for more details.
I am an R beginner user and I face the following problem. I have the following data frame:
distance speed
1 61.0 36.4
2 51.4 35.3
3 42.2 34.2
4 33.4 32.8
5 24.9 31.3
6 17.5 28.4
7 11.5 24.1
8 7.1 19.4
9 3.3 16.9
10 0.5 15.5
11 4.4 15.1
12 8.5 15.5
13 13.1 17.3
14 18.8 20.5
15 25.7 24.1
16 33.3 26.3
17 41.0 27.0
18 48.7 27.7
19 56.6 28.4
20 64.8 29.2
21 73.6 31.7
22 83.3 34.2
23 93.4 35.3
The distance column is the distance of a moving object from a specific reference point, and the speed column is the object's speed. As you can see, the object first approaches the point and then moves away from it. I am trying to plot its speed profile. I tried the following code, but it doesn't give me the plot I want, because I want to show how the speed changes as the object moves towards and then past the reference point.
ggplot(speedprofile, aes(x = distance, y = speed)) + #speedprofile is the data frame
geom_line(color = "red") +
geom_smooth() +
geom_vline(xintercept = 0) # the vline is the reference line
The plot is the following:
Then I tried manually setting the first 10 distances (those prior to the zero point) to negative values, which gives a plot closer to the one I want:
But there is a problem: a distance can't be defined as negative.
To sum up, the expected plot is the following (and I am sorry for the quality).
Do you have any ideas on how to solve this?
Thank you in advance!
You can do something like this to auto-compute the change point (to know when the distance should be negative) and then set the axis labels to be positive.
Your data (in case anyone needs it to answer):
read.table(text="distance speed
61.0 36.4
51.4 35.3
42.2 34.2
33.4 32.8
24.9 31.3
17.5 28.4
11.5 24.1
7.1 19.4
3.3 16.9
0.5 15.5
4.4 15.1
8.5 15.5
13.1 17.3
18.8 20.5
25.7 24.1
33.3 26.3
41.0 27.0
48.7 27.7
56.6 28.4
64.8 29.2
73.6 31.7
83.3 34.2
93.4 35.3", stringsAsFactors=FALSE, header=TRUE) -> speed_profile
Now, compute the "real" distance (negative for approaching, positive for receding):
speed_profile$real_distance <- c(-1, sign(diff(speed_profile$distance))) * speed_profile$distance
Now, compute the X axis breaks ahead of time:
breaks <- scales::pretty_breaks(10)(range(speed_profile$real_distance))
ggplot(speed_profile, aes(real_distance, speed)) +
geom_smooth(linetype = "dashed") +
geom_line(color = "#cb181d", size = 1) +
scale_x_continuous(
name = "distance",
breaks = breaks,
labels = abs(breaks) # make all the labels for the axis positive
)
Provided fonts are working well on your system you could even do:
labels <- abs(breaks)
labels[(!breaks == 0)] <- sprintf("%s\n→", labels[(!breaks == 0)])
ggplot(speed_profile, aes(real_distance, speed)) +
geom_smooth(linetype = "dashed") +
geom_line(color = "#cb181d", size = 1) +
scale_x_continuous(
name = "distance",
breaks = breaks,
labels = labels,
)
When I use cross-validation on my data it gives me two types of prediction: cvpred and Predicted. What is the difference between the two? I guess cvpred is the cross-validation prediction, but what is the other?
Here is some of my code:
crossvalpredict <- cv.lm(data = total,form.lm = formula(verim~X4+X4.1),m=5)
And this is the result:
fold 1
Observations in test set: 5
3 11 15 22 23
Predicted 28.02 32.21 26.53 25.1 21.28
cvpred 20.23 40.69 26.57 34.1 26.06
verim 30.00 31.00 28.00 24.0 20.00
CV residual 9.77 -9.69 1.43 -10.1 -6.06
Sum of squares = 330 Mean square = 66 n = 5
fold 2
Observations in test set: 5
2 7 21 24 25
Predicted 28.4 32.0 26.2 19.95 25.9
cvpred 52.0 81.8 36.3 14.28 90.1
verim 30.0 33.0 24.0 21.00 24.0
CV residual -22.0 -48.8 -12.3 6.72 -66.1
Sum of squares = 7428 Mean square = 1486 n = 5
fold 3
Observations in test set: 5
6 14 18 19 20
Predicted 34.48 36.93 19.0 27.79 25.13
cvpred 37.66 44.54 16.7 21.15 7.91
verim 33.00 35.00 18.0 31.00 26.00
CV residual -4.66 -9.54 1.3 9.85 18.09
Sum of squares = 539 Mean square = 108 n = 5
fold 4
Observations in test set: 5
1 4 5 9 13
Predicted 31.91 29.07 32.5 32.7685 28.9
cvpred 30.05 28.44 54.9 32.0465 11.4
verim 32.00 27.00 31.0 32.0000 30.0
CV residual 1.95 -1.44 -23.9 -0.0465 18.6
Sum of squares = 924 Mean square = 185 n = 5
fold 5
Observations in test set: 5
8 10 12 16 17
Predicted 27.8 30.28 26.0 27.856 35.14
cvpred 50.3 33.92 45.8 31.347 29.43
verim 28.0 30.00 24.0 31.000 38.00
CV residual -22.3 -3.92 -21.8 -0.347 8.57
Sum of squares = 1065 Mean square = 213 n = 5
Overall (Sum over all 5 folds)
ms
411
You can check that by reading the help of the function you are using, cv.lm. There you will find this paragraph:
The input data frame is returned, with additional columns
‘Predicted’ (Predicted values using all observations) and ‘cvpred’
(cross-validation predictions). The cross-validation residual sum
of squares (‘ss’) and degrees of freedom (‘df’) are returned as
attributes of the data frame.
This says that Predicted is a vector of predicted values made using all the observations; in other words, predictions made on your "training" data, or "in sample".
To check whether this is so, you can fit the same model using lm:
fit <- lm(verim~X4+X4.1, data=total)
And see if the predicted values from this model:
predict(fit)
are the same as those returned by cv.lm
When I tried it on the iris dataset in R, cv.lm()'s Predicted column returned the same values as predict() on the lm fit. So in that case they are in-sample predictions, where the model is fitted to and evaluated on the same observations.
lm() does not give "better results." I am not sure how predict() and cv.lm() can be the same. predict() returns the expected values of Y for each sample, estimated from the fitted model (the covariates X and their corresponding estimated beta values). Those beta values, and the model error, were estimated from the original data, so by using predict() you get an overly optimistic estimate of model performance; that is why it seems better. You get a better (more realistic) estimate of model performance using an iterated sample-holdout technique such as cross-validation (CV). The least biased estimate comes from leave-one-out CV, and the estimate with the least uncertainty (prediction error) comes from 2-fold (K=2) CV.
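The optimism of in-sample predictions is easy to demonstrate in base R (a sketch on the built-in mtcars data, computing leave-one-out CV predictions by refitting without each row in turn):

```r
fit <- lm(mpg ~ wt, data = mtcars)
in_sample <- predict(fit)  # same data used for fitting and predicting

# leave-one-out CV: refit without row i, then predict row i
loo <- vapply(seq_len(nrow(mtcars)), function(i) {
  f <- lm(mpg ~ wt, data = mtcars[-i, ])
  unname(predict(f, newdata = mtcars[i, , drop = FALSE]))
}, numeric(1))

mse_in  <- mean((mtcars$mpg - in_sample)^2)
mse_loo <- mean((mtcars$mpg - loo)^2)
mse_in < mse_loo  # TRUE: each LOO residual equals resid/(1 - leverage), so it is never smaller
```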