In an experiment, female fish were exposed to two levels of photoperiod (Ambient & Compressed), two levels of temperature (4 & 7). They were in four tanks (two tanks for each photoperiod, one tank for each temperature within photoperiod). There were nine samplings denoted by time_date in the data. Among other responses is "k". My interest is on the effects of photoperiod, temperature and time_date on "k".
Challenges faced: Unbalanced design (one photoperiod or temperature level not sampled during a sampling), pseudo-replication (each tank is a treatment (temperature masked within photoperiod)). with some reading, I came across the mixed models. I have tried with lmer (more importantly: I am not sure if am right) and fell into warnings and outputs with no p-values. I appreciate your help. Thank you in advance.
Here is the sample data
fem.fish <- structure(list(time_date = structure(c(8L, 8L, 8L, 8L, 8L, 8L,
8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L,
8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 9L, 9L,
9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 10L, 10L, 10L, 10L, 10L,
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L,
10L, 10L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 11L, 11L, 11L, 11L,
11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 12L, 12L, 12L, 12L,
12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L,
12L, 12L, 12L, 12L, 12L, 12L, 12L, 13L, 13L, 13L, 13L, 13L, 13L,
13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L,
13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L,
14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L,
14L, 14L, 14L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L,
15L, 15L, 15L, 15L, 15L, 15L, 16L, 16L, 16L, 16L, 16L, 16L, 16L,
16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L), .Label = c("30-Jan-18",
"11-Apr-18", "13-Jun-18", "07-Aug-18", "19-Sep-18", "30-Oct-18",
"28-Nov-18", "03-Jan-19", "17-Jan-19", "31-Jan-19", "14-Feb-19",
"28-Feb-19", "14-Mar-19", "27-Mar-19", "10-Apr-19", "24-Apr-19"
), class = "factor"), photo = structure(c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Ambient",
"Compress"), class = "factor"), temp = structure(c(2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("4",
"7"), class = "factor"), tank = structure(c(2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("T1",
"T2", "T3", "T4"), class = "factor"), k = c(5.041791145, 5.408503999,
5.535282299, 5.346402317, 5.376649977, 5.072021484, 6.097412109,
4.390658006, 5.13676712, 4.472827193, 5.381892125, 4.882544582,
4.655393586, 5.435528121, 4.985185185, 4.548431822, 5.041791145,
5.408503999, 5.535282299, 5.346402317, 5.376649977, 5.072021484,
6.097412109, 4.390658006, 5.13676712, 4.472827193, 5.381892125,
4.882544582, 4.655393586, 5.435528121, 4.985185185, 4.548431822,
5.517125816, 4.772205603, 5.928149807, 4.152323266, 4.666037968,
4.638984928, 4.044444444, 4.720296599, 5.315500686, 4.967790359,
3.520804755, 4.722326417, 5.051895044, 4.807450844, 5.096461818,
5.28703008, 5.653368614, 6.357164944, 3.979492188, 3.928861374,
5.632685221, 5.264668498, 5.281464786, 5.387205387, 4.332381668,
5.250388878, 4.580237638, 4.650926114, 5.65951009, 4.401587625,
5.194587481, 4.184813255, 4.44738449, 5.829977261, 4.331985587,
4.827988338, 4.022222222, 3.672891297, 5.148148148, 4.068381688,
5.71922963, 4.566763848, 5.330442907, 2.422536369, 5.346580575,
4.971865289, 5.018922289, 5.513702624, 4.432146456, 5.692296224,
4.738120151, 4.896057489, 5.50365439, 5.249023438, 5.737818961,
4.260276996, 5.242507722, 4.580758017, 5.021888504, 5.013662642,
4.308286338, 5.50840192, 4.732342764, 4.672289386, 5.715557782,
3.827088497, 4.632069971, 4.935541824, 4.008746356, 4.963859809,
4.836806618, 4.46244856, 4.839677641, 4.498269896, 4.88357943,
4.984069185, 4.596844478, 5.196200195, 5.165529005, 14.74622771,
5.397084548, 7.983198678, 5.691090246, 5.707491082, 5.187172012,
6.297376093, 4.647178889, 4.282407407, 4.333496094, 4.773656052,
4.770999725, 4.092207407, 3.917638484, 5.193905817, 3.704833984,
5.571239611, 4.226680384, 3.65230095, 4.78515625, 5.603027344,
4.159218067, 4.719370009, 4.437016946, 4.407713499, 4.284050303,
4.676783265, 4.311689337, 4.540625, 4.864470022, 4.668176455,
5.221193416, 4.997084123, 4.112752873, 5.587217586, 6.045051626,
4.605417744, 4.35030714, 5.185252617, 4.752696927, 4.446670562,
4.268256569, 4.30372087, 4.025205761, 5.696474074, 4.068342788,
3.5212701, 4.544646911, 5.212620027, 5.31978738, 4.879910442,
4.606482493, 4.33502906, 5.294067215, 5.770262391, 4.264308136,
4.501028807, 2.944958848, 4.180638577, 4.120435057, 3.833076111,
4.496793003, 4.232167131, 3.783896334, 5.070553936, 4.825776352,
4.643534043, 6.318587106, 5.66205358, 5.194631597, 4.72557037,
4.195096521, 4.956238551, 3.503093444, 5.24857851, 4.792524005,
4.44229595, 5.285131195, 4.335878892, 4.170953361, 4.045779268
)), row.names = c(NA, -192L), class = "data.frame")
What I tried and the first warning
fit1 <- lmer(k ~ 0 + photo*temp*time_date + (1|tank), data = fem.fish, REML = FALSE)
fixed-effect model matrix is rank deficient so dropping 12 columns / coefficients
boundary (singular) fit: see ?isSingular
My summary and another warning on correlation matrix
summary(fit1)
Linear mixed model fit by maximum likelihood ['lmerMod']
Formula: k ~ 0 + photo * temp * time_date + (1 | tank)
Data: fem.fish
AIC BIC logLik deviance df.resid
551.2 635.9 -249.6 499.2 166
Scaled residuals:
Min 1Q Median 3Q Max
-2.7467 -0.4380 -0.0447 0.3663 9.7226
Random effects:
Groups Name Variance Std.Dev.
tank (Intercept) 0.0000 0.0000
Residual 0.7883 0.8879
Number of obs: 192, groups: tank, 4
Fixed effects:
Estimate Std. Error t value
photoAmbient 5.284e+00 3.139e-01 16.832
photoCompress 4.937e+00 3.139e-01 15.728
temp7 -1.218e-14 4.439e-01 0.000
time_date17-Jan-19 -9.116e-02 4.439e-01 -0.205
time_date31-Jan-19 -9.798e-02 4.439e-01 -0.221
time_date14-Feb-19 1.264e-01 4.439e-01 0.285
time_date28-Feb-19 -3.986e-01 4.439e-01 -0.898
time_date14-Mar-19 3.655e-01 4.439e-01 0.823
time_date27-Mar-19 -3.979e-01 4.439e-01 -0.896
time_date10-Apr-19 -4.122e-01 4.439e-01 -0.929
time_date24-Apr-19 -2.184e-01 4.439e-01 -0.492
photoCompress:temp7 8.874e-15 6.278e-01 0.000
photoCompress:time_date31-Jan-19 -2.957e-01 6.278e-01 -0.471
photoCompress:time_date28-Feb-19 1.575e+00 6.278e-01 2.509
photoCompress:time_date14-Mar-19 -6.073e-01 6.278e-01 -0.967
temp7:time_date17-Jan-19 -4.121e-02 6.278e-01 -0.066
temp7:time_date31-Jan-19 2.382e-01 6.278e-01 0.379
temp7:time_date14-Feb-19 -2.024e-01 6.278e-01 -0.322
temp7:time_date28-Feb-19 -1.441e+00 6.278e-01 -2.295
temp7:time_date14-Mar-19 -1.104e+00 6.278e-01 -1.759
temp7:time_date27-Mar-19 -4.306e-01 6.278e-01 -0.686
temp7:time_date10-Apr-19 -7.885e-01 6.278e-01 -1.256
temp7:time_date24-Apr-19 -5.872e-01 6.278e-01 -0.935
photoCompress:temp7:time_date14-Mar-19 9.077e-01 8.879e-01 1.022
Correlation matrix not shown by default, as p = 24 > 12.
Use print(x, correlation=TRUE) or
vcov(x) if you need it
fit warnings:
fixed-effect model matrix is rank deficient so dropping 12 columns / coefficients
convergence code: 0
boundary (singular) fit: see ?isSingular
My understanding on t-values is not good at all, so I cannot establish whether there are significant effects or even whether the interactions are significant or not.
I will appreciate your suggestions on the modelling (Fitting the right model?) and more of what you find useful
Thank you so much all.
Try to import the "lmerTtest" package.
Before fit your model import this package, in this way you will see the p-value and the "*" of significance:
library("lme4")
library("lmerTest")
I have used your data for the following example. I think due to the fact that all your terms are categorical, you get the rank deficient model. I'd suggest you use time as continuous predictor, thereby you get rid of the rank-deficiency warning.
library(lme4)
library(parameters)
library(performance)
levels(fem.fish$time_date) <- 1:nlevels(fem.fish$time_date)
fem.fish$time_date <- as.numeric(fem.fish$time_date)
fit1 <- lmer(
k ~ 1 + photo * temp * time_date + (1 | tank),
data = fem.fish,
REML = FALSE
)
#> boundary (singular) fit: see ?isSingular
The second warning about the singular fit (now the first, and only warning) is because you literally have no variability in your outcome across the different groups (indicated by tank). This means that the random effects model here gives you not much more benefit than a simple linear model.
ranef(fit1)
#> $tank
#> (Intercept)
#> T1 0
#> T2 0
#> T3 0
#> T4 0
#>
#> with conditional variances for "tank"
Finally, you could use the packages parameters and performance to get comprehensive model summaries (including different p-value approximations like Satterthwaite or Kenward-Roger, standardized parameters or (cluster) robust standard errors) or model fit indices (like r2).
parameters::model_parameters(fit1)
#> Parameter | Coefficient | SE | 95% CI | t | df | p
#> -----------------------------------------------------------------------------------------------------
#> (Intercept) | 5.56 | 0.61 | [ 4.36, 6.76] | 9.06 | 182 | < .001
#> photo [Compress] | -1.46 | 1.04 | [-3.50, 0.58] | -1.41 | 182 | 0.160
#> temp [7] | 0.69 | 0.94 | [-1.16, 2.53] | 0.73 | 182 | 0.467
#> time_date | -0.04 | 0.05 | [-0.13, 0.06] | -0.74 | 182 | 0.461
#> photo [Compress] * temp [7] | 0.73 | 1.52 | [-2.24, 3.70] | 0.48 | 182 | 0.631
#> photo [Compress] * time_date | 0.12 | 0.09 | [-0.06, 0.31] | 1.35 | 182 | 0.178
#> temp [7] * time_date | -0.09 | 0.07 | [-0.23, 0.05] | -1.29 | 182 | 0.198
#> (photo [Compress] * temp [7]) * time_date | -0.07 | 0.13 | [-0.33, 0.19] | -0.52 | 182 | 0.604
performance::r2(fit1)
#> Warning: Can't compute random effect variances. Some variance components equal zero.
#> Solution: Respecify random structure!
#> Random effect variances not available. Returned R2 does not account for random effects.
#> # R2 for Mixed Models
#>
#> Conditional R2: NA
#> Marginal R2: 0.088
Now that your time variable is continuous, you may think about a non-linear relationship of the time trend. You could use the spline package to model this, and ggeffects to get effects plots. This works, of course, for the model with linear time trend as well as other curvilinear time trends.
library(ggeffects)
pr <- ggpredict(fit1, c("time_date", "photo", "temp"))
plot(pr)
library(splines)
fit2 <- lmer(
k ~ 1 + photo * temp * bs(time_date) + (1 | tank),
data = fem.fish,
REML = FALSE
)
#> boundary (singular) fit: see ?isSingular
pr <- ggpredict(fit2, c("time_date [all]", "photo", "temp"))
plot(pr)
Hope that helps!
Related
I have two countries with different starting and endings years. I have the mean earnings of different social classes. I would like to have the earnings of the first starting year = 100 as an index for each social class. Then i would like to see how they progressed from 100 in subsequent years. This in general should be easy to do, but it seems that since my countries have different numbers of observations, it is not working.
Here is the code that i have tried, but i only got missing values:
df=df %>%
group_by(cntry, year,class_m) %>%
mutate(base_year = (mean[first(year)]/mean)*100)
Here is the data:
df= structure(list(cntry = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L), .Label = c("at", "be"), class = "factor"),
year = structure(c(4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 6L, 6L,
6L, 6L, 7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 10L,
10L, 10L, 10L, 11L, 11L, 11L, 11L, 12L, 12L, 12L, 12L, 13L,
13L, 13L, 13L, 14L, 14L, 14L, 14L, 15L, 15L, 15L, 15L, 16L,
16L, 16L, 16L, 17L, 17L, 17L, 17L, 18L, 18L, 18L, 18L, 19L,
19L, 19L, 19L, 20L, 20L, 20L, 20L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 6L,
6L, 6L, 6L, 7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L,
10L, 10L, 10L, 10L, 11L, 11L, 11L, 11L, 12L, 12L, 12L, 12L,
13L, 13L, 13L, 13L, 14L, 14L, 14L, 14L, 15L, 15L, 15L, 15L,
16L, 16L, 16L, 16L, 17L, 17L, 17L, 17L, 18L, 18L, 18L, 18L
), .Label = c("1995", "1997", "2000", "2003", "2004", "2005",
"2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013",
"2014", "2015", "2016", "2017", "2018", "2019"), class = "factor"),
class_m = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L,
1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L,
3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L,
1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L,
3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), .Label = c("Low-skilled working class",
"Skilled working class", "Middle class", "Upper class"), class = "factor"),
mean = c(21667.3165297756, 31141.2479100646, 38694.5317839067,
48897.5586114381, 21893.6782367936, 29866.0796003899, 36846.1057208349,
46115.8225807015, 19914.101136956, 30201.1848571751, 36688.5006276306,
44334.4349912073, 20505.9102212244, 30071.1070352498, 37093.4347815202,
44630.7476325564, 20265.9465807599, 29827.9369893851, 40549.4855257344,
48107.2865241041, 22378.7756708627, 31334.7756747725, 39981.9785570756,
50347.8600052063, 23101.010596959, 31412.9240693068, 40458.6454333296,
51740.898756006, 19805.2965921531, 30817.6682795387, 41165.6041754244,
52782.5026014194, 19078.5626059941, 30499.5262897878, 41177.4423103889,
51240.6014436097, 20393.1169949116, 29796.8273849528, 39234.5103600113,
50494.5284121857, 20786.560760249, 31306.6058474771, 40854.36428628,
50339.5860855376, 20033.5844477617, 30424.7651611075, 39659.447696875,
49191.4195426966, 18851.261369003, 30412.4669765863, 41857.2930659497,
51097.4975692186, 22333.7894908968, 30863.010648668, 41852.0093099513,
54112.6228115753, 21709.19921875, 30039.5068801246, 41097.4541047158,
49862.44140625, 20113.5586718618, 30733.8952367545, 41658.1716627373,
51754.4018503782, 21818.4311173551, 33225.8409123812, 43882.2512500977,
51037.5228976151, 15858.5028150308, 18782.8272745439, 22871.4020551682,
26288.6154497508, 26599.1650213236, 31720.3186300543, 41940.5016413888,
51187.0060567118, 18564.6510736198, 21526.72898147, 24807.2116933588,
29207.4658820585, 23058.5603825146, 31862.7097532934, 37588.62928007,
45160.9518839946, 25495.8949453907, 31851.1999874662, 38276.6899334939,
46318.331560595, 23165.6350767837, 32586.7829065825, 37256.5740814167,
45285.0662561028, 23975.7581116063, 30787.3910726117, 37346.8507982085,
45180.6091420909, 23786.1529599028, 32413.707905246, 38596.3467614532,
47026.6344280445, 24272.92088131, 31167.7104944988, 37745.6268718255,
46128.4799968946, 24583.9968164343, 29819.2298432657, 40053.8477213667,
48223.1556254353, 23227.04705051, 29611.9190298389, 39086.0012315702,
46742.9511396314, 20980.1647228858, 29627.2417955117, 38648.6829503705,
45677.0658477392, 21397.8125304146, 30675.2233482807, 40735.634479222,
46355.3748374436, 22836.5595055445, 29859.0336509053, 40335.3885497182,
47934.8837121327, 21465.185981748, 30436.1330929852, 40091.5582937488,
48743.3268548605, 21375.6534656544, 31060.2359133816, 40006.7183770635,
47618.8685730448, 20901.803025412, 29971.1886677767, 39526.0725185188,
46793.098588355, 21710.1246251194, 30894.12481284, 39699.3077814615,
47179.3071888513)), class = c("grouped_df", "tbl_df", "tbl",
"data.frame"), row.names = c(NA, -140L), groups = structure(list(
cntry = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("at",
"be"), class = "factor"), year = structure(c(4L, 5L, 6L,
7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L,
19L, 20L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L,
13L, 14L, 15L, 16L, 17L, 18L), .Label = c("1995", "1997",
"2000", "2003", "2004", "2005", "2006", "2007", "2008", "2009",
"2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017",
"2018", "2019"), class = "factor"), .rows = structure(list(
1:4, 5:8, 9:12, 13:16, 17:20, 21:24, 25:28, 29:32, 33:36,
37:40, 41:44, 45:48, 49:52, 53:56, 57:60, 61:64, 65:68,
69:72, 73:76, 77:80, 81:84, 85:88, 89:92, 93:96, 97:100,
101:104, 105:108, 109:112, 113:116, 117:120, 121:124,
125:128, 129:132, 133:136, 137:140), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -35L), .drop = TRUE))
Is this what you want?
df %>%
group_by(cntry, class_m) %>%
mutate(base_year = 100 * (1 - (first(mean) - mean) / mean))
# # A tibble: 140 x 5
# # Groups: cntry, class_m [8]
# cntry year class_m mean base_year
# <fct> <fct> <fct> <dbl> <dbl>
# 1 at 2003 Low-skilled working class 21667. 100
# 2 at 2003 Skilled working class 31141. 100
# 3 at 2003 Middle class 38695. 100
# 4 at 2003 Upper class 48898. 100
# 5 at 2004 Low-skilled working class 21894. 101.
# 6 at 2004 Skilled working class 29866. 95.7
# 7 at 2004 Middle class 36846. 95.0
# 8 at 2004 Upper class 46116. 94.0
# 9 at 2005 Low-skilled working class 19914. 91.2
# 10 at 2005 Skilled working class 30201. 96.9
# # ... with 130 more rows
I was following this post, but I do not get how can I manage it with my data.
My plot looks like:
And I would like that the "strings" were the same color as the 2nd column, i.e. for ESR1 I would like the orange string, and for PIK3CA green.
Any idea about how can I manage with scale_fill_manual or any other argument?
Thanks!
My code:
colorfill <- c("white", "white", "darkgreen", "orange", "white", "white", "white", "white", "white", "white", "white", "white", "white", "white", "white", "white", "white")
ggplot(data = Allu,
aes(axis1 = Gene_mut, axis2 = Metastasis_Location, y = Freq)) +
geom_alluvium(aes(fill = Gene_mut),
curve_type = "quintic") +
geom_stratum(width = 1/4, fill = colorfill) +
geom_text(stat = "stratum", size = 3,
aes(label = after_stat(stratum))) +
scale_x_discrete(limits = c("Metastasis_Location", "Gene_mut"),
expand = c(0.05, .05)) +
theme_void()
My data:
structure(list(Metastasis_Location = structure(c(1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L,
7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L,
7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 8L, 8L, 9L, 9L, 9L, 10L,
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 11L, 11L,
11L, 11L), .Label = c("adrenal", "bone", "breast", "liver", "lung",
"muscle", "node", "pancreatic", "peritoneum", "pleural", "skin"
), class = "factor"), T0_T2_THERAPY_COD = structure(c(2L, 2L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("A",
"F"), class = "factor"), T0_T2_PD_event = structure(c(2L, 2L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L,
2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("No Progression",
"Progression"), class = "factor"), Gene_mut = structure(c(4L,
5L, 1L, 3L, 4L, 1L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 5L, 5L, 5L, 6L, 3L, 6L, 6L, 6L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L,
5L, 6L, 2L, 3L, 4L, 4L, 3L, 3L, 3L, 4L, 5L, 6L, 3L, 6L, 3L, 3L,
3L, 3L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 3L, 4L, 4L, 5L, 6L,
1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L,
5L, 5L, 5L, 3L, 4L, 3L, 4L, 5L, 6L, 3L, 3L, 4L, 5L, 6L, 6L, 6L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 3L, 4L, 3L, 4L, 5L,
6L, 3L, 4L, 5L, 6L, 3L, 4L, 5L, 6L, 1L, 6L, 3L, 3L, 4L, 4L, 5L
), .Label = c("AKT1", "ERBB2", "ESR1", "PIK3CA", "TP53", "WT"
), class = "factor"), LABO_ID = structure(c(45L, 8L, 13L, 11L,
11L, 26L, 7L, 15L, 23L, 26L, 35L, 39L, 7L, 19L, 26L, 32L, 33L,
35L, 39L, 15L, 19L, 35L, 1L, 37L, 34L, 43L, 47L, 3L, 10L, 18L,
20L, 28L, 31L, 36L, 42L, 9L, 10L, 14L, 18L, 20L, 28L, 31L, 36L,
44L, 45L, 8L, 10L, 18L, 28L, 42L, 2L, 7L, 39L, 7L, 39L, 3L, 4L,
42L, 5L, 42L, 6L, 21L, 1L, 10L, 22L, 28L, 46L, 9L, 10L, 14L,
28L, 46L, 10L, 28L, 48L, 25L, 23L, 32L, 33L, 40L, 43L, 24L, 3L,
18L, 24L, 28L, 31L, 36L, 42L, 18L, 27L, 28L, 31L, 36L, 45L, 18L,
24L, 27L, 28L, 42L, 16L, 16L, 18L, 18L, 18L, 29L, 23L, 39L, 39L,
40L, 1L, 12L, 47L, 3L, 18L, 20L, 28L, 31L, 36L, 38L, 42L, 5L,
18L, 20L, 27L, 28L, 31L, 36L, 38L, 41L, 45L, 8L, 18L, 27L, 28L,
42L, 48L, 6L, 17L, 30L, 31L, 31L, 18L, 18L, 18L, 29L, 39L, 39L,
40L, 43L, 31L, 31L, 48L, 30L, 13L, 34L, 18L, 36L, 18L, 36L, 18L
), .Label = c("ER-11", "ER-19", "ER-21", "ER-22", "ER-29", "ER-30",
"ER-31", "ER-32", "ER-33", "ER-38", "ER-40", "ER-43", "ER-49",
"ER-8", "ER-AZ-04", "ER-AZ-05", "ER-AZ-06", "ER-AZ-07", "ER-AZ-08",
"ER-AZ-10", "ER-AZ-11", "ER-AZ-11=ER-47", "ER-AZ-13", "ER-AZ-14",
"ER-AZ-15", "ER-AZ-16", "ER-AZ-17", "ER-AZ-18", "ER-AZ-20", "ER-AZ-20=ER-27",
"ER-AZ-21", "ER-AZ-23", "ER-AZ-23=ER-52", "ER-AZ-24", "ER-AZ-29",
"ER-AZ-31", "ER-AZ-33", "ER-AZ-35", "ER-AZ-37", "ER-AZ-38", "ER-AZ-39",
"ER-AZ-40", "ER-AZ-43", "ER-AZ-44", "ER-AZ-45", "ER-AZ-49", "ER-AZ-51",
"ER-AZ-53"), class = "factor"), Freq = c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -161L), groups = structure(list(
Metastasis_Location = structure(c(1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L,
7L, 7L, 8L, 8L, 9L, 9L, 9L, 10L, 10L, 10L, 10L, 10L, 10L,
10L, 10L, 10L, 11L, 11L, 11L, 11L, 11L), .Label = c("adrenal",
"bone", "breast", "liver", "lung", "muscle", "node", "pancreatic",
"peritoneum", "pleural", "skin"), class = "factor"), T0_T2_THERAPY_COD = structure(c(2L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L), .Label = c("A",
"F"), class = "factor"), T0_T2_PD_event = structure(c(2L,
2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L,
2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L,
2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L), .Label = c("No Progression",
"Progression"), class = "factor"), Gene_mut = structure(c(4L,
5L, 1L, 3L, 4L, 1L, 2L, 3L, 4L, 5L, 6L, 3L, 6L, 3L, 4L, 5L,
6L, 2L, 3L, 4L, 3L, 4L, 5L, 6L, 3L, 6L, 3L, 4L, 5L, 6L, 3L,
4L, 5L, 6L, 1L, 3L, 4L, 5L, 3L, 4L, 3L, 4L, 5L, 6L, 3L, 4L,
5L, 6L, 6L, 3L, 4L, 5L, 6L, 3L, 4L, 3L, 4L, 5L, 6L, 3L, 4L,
5L, 6L, 3L, 4L, 5L, 6L, 1L, 6L, 3L, 4L, 5L), .Label = c("AKT1",
"ERBB2", "ESR1", "PIK3CA", "TP53", "WT"), class = "factor"),
.rows = structure(list(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8:12,
13:19, 20:22, 23L, 24L, 25:27, 28:35, 36:45, 46:50, 51L,
52L, 53L, 54:55, 56:58, 59L, 60L, 61L, 62L, 63L, 64:67,
68:72, 73:75, 76L, 77L, 78:79, 80L, 81L, 82L, 83:89,
90:95, 96:100, 101L, 102L, 103L, 104L, 105L, 106L, 107:108,
109L, 110L, 111:112, 113L, 114:121, 122:131, 132:137,
138:140, 141L, 142L, 143L, 144L, 145L, 146L, 147L, 148L,
149L, 150L, 151L, 152L, 153L, 154L, 155L, 156L, 157:158,
159:160, 161L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -72L), .drop = TRUE))
You're right to think of scale_fill_manual(). I think this is the more programmable alternative to passing a vector like colorfill to an aesthetic outside aes(). The following plot uses your data and color vector to control how the fill aesthetic is coded throughout the plot, and notice that fill is passed the same variable, Gene_mut, in both layers (alluvium and stratum):
ggplot(data = Allu,
aes(axis1 = Gene_mut, axis2 = Metastasis_Location, y = Freq)) +
geom_alluvium(aes(fill = Gene_mut),
curve_type = "quintic") +
geom_stratum(aes(fill = Gene_mut), width = 1/4) +
scale_fill_manual(values = colorfill) +
geom_text(stat = "stratum", size = 3,
aes(label = after_stat(stratum))) +
scale_x_discrete(limits = c("Metastasis_Location", "Gene_mut"),
expand = c(0.05, .05)) +
theme_void()
Since Metastasis_Location takes different values than Gene_mut, fill treats those strata as having missing values, which by default are colored grey. You can change that behavior by passing a color string to the na.value parameter of scale_fill_manual().
I have a time series data at hourly level. I am trying to build a forecast for that data. The following is sample of data:
sample <-
structure(list(group_type = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("Group 1",
"Group 2", "Group 5"), class = "factor"), sub_group_type = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), .Label = c("Sub Group 1", "Sub Group 2", "Sub Group 3"),
class = "factor"), date = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("1/1/17",
"1/2/17", "1/3/17"), class = "factor"), hour = c(6L, 7L, 8L, 9L, 10L, 11L, 12L,
6L, 7L, 8L, 9L, 10L, 11L, 12L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 6L, 7L, 8L, 9L,
10L, 11L, 12L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 6L,
7L, 8L, 9L, 10L, 11L, 12L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 6L, 7L, 8L, 9L, 10L,
11L, 12L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 6L, 7L,
8L, 9L, 10L, 11L, 12L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 6L, 7L, 8L, 9L, 10L, 11L,
12L, 6L, 7L, 8L, 9L, 10L, 11L, 12L), weekday = structure(c(2L, 2L, 2L, 2L, 2L,
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L),
.Label = c("Monday", "Sunday", "Tuesday"), class = "factor"), total = c(9L, 9L,
10L, 6L, 2L, 14L, 3L, 11L, 12L, 12L, 0L, 10L, 8L, 13L, 14L, 17L, 12L, 5L, 9L, 7L,
10L, 13L, 23L, 11L, 3L, 6L, 10L, 11L, 14L, 16L, 13L, 2L, 3L, 4L, 14L, 11L, 16L,
8L, 12L, 7L, 6L, 13L, 13L, 22L, 12L, 7L, 9L, 8L, 14L, 9L, 16L, 15L, 6L, 7L, 6L,
12L, 13L, 14L, 7L, 3L, 13L, 11L, 6L, 8L, 15L, 11L, 3L, 10L, 9L, 7L, 12L, 10L, 10L,
3L, 14L, 8L, 12L, 10L, 20L, 5L, 4L, 8L, 12L, 3L, 0L, 4L, 5L, 1L, 6L, 7L, 0L, 3L,
1L, 1L, 0L, 2L, 2L, 0L, 2L, 0L, 3L, 7L, 6L, 2L, 1L)), .Names = c("group_type",
"sub_group_type", "date", "hour", "weekday", "total"), class = "data.frame",
row.names = c(NA, -105L))
I am applying the following functions to the above data:
models <- function(x){
x <- msts(x, seasonal.periods=c(24,168))
mod_exp <- ets(x, ic='aicc', restrict=T)
mod_hwa <- HoltWinters(x,seasonal = "additive")
mod_hwm <- HoltWinters(x,seasonal = "multiplicative")
mod_neural <- nnetar(x, p=7, size=25)
mod_tbats <- tbats(x, ic='aicc', seasonal.periods=7)
mod_bats <- bats(x, ic='aicc', seasonal.periods=7)
mod_stl <- stlm(x, s.window=7, ic='aicc', robust=TRUE, method='ets')
mod_sts <- StructTS(x)
}
test <- by(sample,list(sample$group_type,sample$sub_group_type,sample$date, sample$hour
),models)
However, I am getting the following error:
Error in ets(x, ic = "aicc", restrict = T) : y should be a univariate time series
If I split the data as follows as apply ets() function, I am able to run it without any issues. But, this splitting of data is not a very feasible option for me as the number of Groups and Sub Groups are too many and each of them has a different time series pattern:
sub_sample_1 <- sample[sample$group_type == "Group 1" & sample$sub_group_type == "Sub Group 1",6]
x <- msts(sub_sample_1, seasonal.periods=24)
mod_arima <- auto.arima(x, ic='aicc', stepwise=F)
mod_exp <- ets(x, ic='aicc', restrict=T)
mod_hwa <- HoltWinters(x,seasonal = "additive")
mod_hwm <- HoltWinters(x,seasonal = "multiplicative")
mod_neural <- nnetar(x, p=24, size=10)
mod_tbats <- tbats(x, ic='aicc', seasonal.periods=24)
mod_bats <- bats(x, ic='aicc', seasonal.periods=24)
mod_stl <- stlm(x, s.window=24, ic='aicc', robust=TRUE, method='ets')
mod_sts <- StructTS(x)
Is there any work around so that I can apply the models by group of columns with out encountering any errors?
Also, not all models are working for all the groups. For the sub_sample_1 data, HoltWinters, neuralnet, bats and stl are giving me error and others are working
> mod_hwa <- HoltWinters(x,seasonal = "additive")
Error in decompose(ts(x[1L:wind], start = start(x), frequency = f), seasonal) :
time series has no or less than 2 periods
> mod_hwm <- HoltWinters(x,seasonal = "multiplicative")
Error in HoltWinters(x, seasonal = "multiplicative") :
data must be non-zero for multiplicative Holt-Winters
> mod_bats <- bats(x, ic='aicc', seasonal.periods=24)
Error in optim(par = param.vector$vect, fn = calcLikelihoodNOTransformed, :
function cannot be evaluated at initial parameters
I can understand why these models are not working for my data. How do I exclude them when they give errors when I apply the function?
Thanks in advance for the help!
This question is similar (extension maybe) to my other question here
Several issues arise from your current setup:
Functions return the last line if no return() is specified. So your first attempt will lose all lines except for mod_sts which will be the value test is assigned for each subset of by.
In your subset code, you are actually passing the 6th column (an atomic vector) whereas you pass all columns of dataframe in your first code attempt. This may be the reason for your error where input should be per the msts docs:
A numeric vector, ts object, matrix or data frame. It is intended that
the time series data is univariate, otherwise treated the same as
ts().
Your by is receiving four groupings, group_type, sub_group_type, date, and hour unlike your second subset code of two. Unless your data is very large, these many groupings may result with few rows or no rows, and hence not enough data points for model procedures as your last code block seems to suggest.
With that said, consider the following adjustment in returning a list of named elements by first two groupings, specifying the 6th column. And because by takes a combinations of factors which in subsetting dataframe may yield no rows, below uses tryCatch to capture any errors and return empty lists to be filtered out in final line.
models <- function(x){
x <- msts(x, seasonal.periods=c(24,168))
list(
mod_exp = ets(x, ic='aicc', restrict=T),
mod_hwa = HoltWinters(x,seasonal = "additive"),
mod_hwm = HoltWinters(x,seasonal = "multiplicative"),
mod_neural = nnetar(x, p=7, size=25),
mod_tbats = tbats(x, ic='aicc', seasonal.periods=7),
mod_bats = bats(x, ic='aicc', seasonal.periods=7),
mod_stl = stlm(x, s.window=7, ic='aicc', robust=TRUE, method='ets'),
mod_sts = StructTS(x)
)
}
# TRY/CATCH TO CAPTURE ERRORS AND RETURN EMPTY LIST
test <- by(sample[,6], list(sample$group_type, sample$sub_group_type),
function(x) tryCatch({ models(x)
}, error=function(e) return(list(NA))))
# TO REMOVE NULLs AND NAs (EMPTY ITEMS)
test <- Filter(function(i) length(i) > 0, test)
I am trying to connect sets of (two) points at each level of x, in each facet. Here is a reproducible example:
datum <- structure(list(frequency = c(8L, 7L, 6L, 18L, 5L, 11L, 16L, 15L,
9L, 8L, 8L, 10L, 2L, 20L, 14L, 3L, 6L, 2L, 2L, 11L, 10L, 6L,
15L, 19L, 18L, 18L, 8L, 2L, 10L, 15L, 12L, 17L, 1L, 18L, 7L,
8L, 16L, 4L, 9L, 2L, 7L, 3L, 16L, 7L, 18L, 20L, 9L, 10L, 13L,
2L, 15L, 7L, 3L, 20L, 4L, 15L, 5L, 7L, 9L, 16L, 5L, 8L, 10L,
10L, 7L, 10L, 10L, 17L, 7L, 8L, 13L, 13L, 16L, 5L, 20L, 18L,
13L, 19L, 3L, 8L, 14L, 12L, 20L, 2L, 9L, 13L, 7L, 2L, 5L, 5L,
13L, 9L, 13L, 7L, 9L, 4L, 4L, 20L, 1L, 4L), band = structure(c(2L,
4L, 2L, 3L, 2L, 1L, 4L, 1L, 2L, 1L, 3L, 4L, 2L, 4L, 3L, 4L, 3L,
2L, 3L, 2L, 2L, 4L, 2L, 1L, 1L, 2L, 1L, 4L, 4L, 1L, 4L, 4L, 2L,
1L, 4L, 4L, 3L, 4L, 1L, 1L, 3L, 4L, 1L, 3L, 4L, 1L, 2L, 1L, 1L,
2L, 2L, 1L, 3L, 4L, 2L, 1L, 2L, 4L, 2L, 2L, 4L, 4L, 2L, 4L, 4L,
1L, 1L, 4L, 2L, 3L, 4L, 1L, 2L, 4L, 1L, 2L, 4L, 1L, 1L, 3L, 4L,
4L, 2L, 2L, 2L, 1L, 3L, 2L, 2L, 2L, 3L, 3L, 1L, 3L, 4L, 3L, 3L,
1L, 3L, 4L), .Label = c("1", "2", "3", "4"), class = "factor"),
test = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L,
2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L,
1L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L,
2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L
), .Label = c("1", "2"), class = "factor"), knowledge = structure(c(2L,
3L, 1L, 3L, 1L, 1L, 3L, 3L, 1L, 3L, 1L, 3L, 2L, 2L, 1L, 1L,
1L, 1L, 3L, 3L, 1L, 2L, 3L, 1L, 1L, 2L, 2L, 1L, 1L, 3L, 2L,
3L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 3L, 3L, 1L, 1L, 2L, 3L,
3L, 2L, 2L, 3L, 1L, 1L, 2L, 2L, 2L, 3L, 1L, 3L, 1L, 1L, 2L,
1L, 1L, 2L, 3L, 1L, 1L, 1L, 1L, 3L, 2L, 2L, 1L, 2L, 3L, 2L,
1L, 2L, 3L, 3L, 2L, 1L, 3L, 1L, 3L, 2L, 1L, 3L, 2L, 2L, 3L,
1L, 1L, 2L, 1L, 2L, 3L, 1L, 3L, 1L), .Label = c("1", "2",
"3"), class = "factor")), .Names = c("frequency", "band",
"test", "knowledge"), row.names = c(NA, -100L), class = "data.frame")
Here is the code I have so far:
ggplot(datum, aes(knowledge, frequency, color=test)) +
stat_summary(fun.y='mean', geom='point', position=position_dodge(width=.9), size=3) +
facet_grid(~band) +
labs(y='number of words (max = 20)', x='self-report knowledge') +
scale_x_discrete(labels=c('none', 'form', 'meaning'))
Looking at the left-most facet ('1') in the graph, I would like a line to connect the pretest to posttest in the none column, another line connecting pretest to posttest in the form column, and a line connecting the pretest to the posttest in the meaning column. I would like this done in each facet.
I hope that makes sense, and thanks!
I find relying on ggplot too much for data manipulation/summarizing can hurt more than it helps. I have no idea how to connect the position-dodged points with a line. Instead, I'd do something like this:
library(dplyr)
datsum = datum %>%
group_by(band, knowledge, test) %>%
summarize(mean = mean(frequency)) %>%
ungroup %>%
mutate(knowledge_fac = factor(knowledge, labels = c('none', 'form', 'meaning')))
ggplot(datsum, aes(x = test, y = mean)) +
geom_path(aes(group = band:knowledge)) +
geom_point(aes(color = factor(test))) +
facet_grid(band ~ knowledge_fac) +
labs(y='number of words (max = 20)', x='self-report knowledge')
Borrowing from Gregor's work in munging the data, I think this does what was requested. The mutate() chunk creates Test to be a numeric offset of -0.1 for test 1 and 0.1 for test 2. This is then added to the numeric value of knowledge. The result is the numeric x passed to ggplot2. Gregor correctly defined the groups, so the rest is straightforward.
library(dplyr)
datsum <- datum %>%
group_by(band, knowledge, test) %>%
summarize(mean = mean(frequency)) %>%
mutate(Test = 0.1 * (2 * (test == 2) - 1),
Knowledge = as.numeric(knowledge) + Test) %>%
ungroup
ggplot(datsum, aes(x = Knowledge, y = mean, color = test)) +
geom_path(aes(group = band:knowledge), color = "black") +
geom_point(size = 3) +
facet_wrap(~ band, nrow = 1) +
labs(y='number of words (max = 20)', x='self-report knowledge') +
scale_color_manual(values = c("orange", "blue")) +
scale_x_continuous(limits = c(0.5, 3.5), breaks = 1:3,
labels = c("none", "form", "meaning"))
I am trying to replicate a 3 Factor nested ANOVA anlaysis in a paper: Underwood, AJ (1993) The Mechanics of spatially replicated sampling programmes to detect environmental impacts in a variable world.
The data for the example (from Table 3, Underwood 1993) can be produced by:
dat <-
structure(list(B = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), .Label = c("A", "B"), class = "factor"), C = structure(c(2L,
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("C", "I"), class = "factor"),
Times = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 4L), .Label = c("1", "2", "3", "4"), class = "factor"),
Locations = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 1L, 1L,
1L, 2L, 2L, 2L, 3L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L,
3L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 1L, 1L, 1L, 2L, 2L,
2L, 3L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 1L, 1L,
1L, 2L, 2L, 2L, 3L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L,
3L), X = c(59L, 51L, 45L, 46L, 40L, 32L, 39L, 32L, 25L, 51L,
44L, 37L, 55L, 47L, 41L, 31L, 38L, 45L, 41L, 47L, 55L, 43L,
36L, 29L, 23L, 30L, 37L, 57L, 50L, 43L, 36L, 44L, 51L, 39L,
29L, 23L, 38L, 44L, 52L, 31L, 38L, 45L, 42L, 35L, 28L, 52L,
44L, 37L, 51L, 43L, 37L, 38L, 31L, 24L, 60L, 52L, 46L, 30L,
37L, 44L, 41L, 34L, 27L, 53L, 46L, 39L, 40L, 34L, 26L, 21L,
27L, 35L), Times.unique = structure(c(5L, 5L, 5L, 5L, 5L,
5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L,
7L, 7L, 7L, 7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L,
8L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L), .Label = c("A_1", "A_2", "A_3",
"A_4", "B_1", "B_2", "B_3", "B_4"), class = "factor")), .Names = c("B",
"C", "Times", "Locations", "Y", "Times.unique"), row.names = c(NA,
-72L), class = "data.frame")
dat
The data frame dat has 4 factors:
B - has two levels "A" and "B" (before v after)
Times - 8 levels, 4 within before "B" and 4 within after "A", coded as 1:4 within each. note that variable Times.unique is the same thing but with a unique code for each time (before and after)
Locations - has three levels, all measured every time both before and after
C - has two levels control (C) and (I). note: two locations are control and one is impact
While I am clear on how to analyse such a design using mixed models (lmer), I would like to replicate his example exactly so that I can run some simulations to compare his method.
In particular I am attempting to replicate the SS values presented in table 4 under column "a". He fits a design that has SS and df values for the following terms:
B -> SS = 66.13, df = 1
Times(B) -> SS = 280.64, df = 6
Locations -> SS = 283.86, df = 2
B x Locations -> SS = 29.26, df = 2
Times(B) x Locations-> SS = 575.45, df = 12
Residual -> SS = 2420.00, df = 48
Total -> SS = 6208.34, df = 71
I assume the Times(B) term represents Times nested within the Before/After treatment "B". For this example he ignores that Locations are from control and impact treatments and leaves out factor C altogether.
I have tried all possible combinations I can think of to reproduce this nested anova, using both unique Times coding and Times coded as 1:4 within B (before and after). I have tried using %in%, / and Error() arguments, as well as Anova from car to change the type of SS calculated. Examples of the %in% and / nested fits include:
aov(Y~B+Locations+Times%in%B+B:Locations+Times%in%B:Locations, data=dat)
aov(Y~B+Locations+B/Times+B:Locations+B/Times:Locations, data=dat)
I seem to be unable to replicate Underwood's SS values exactly, particularly for the two interaction terms. A friend let me fit the model in statistix, where the SS values can be reproduced exactly, so it is possible to obtain the above SS values for this model.
Can anyone help me fit this model in R? I wish to embed it in a larger simulation and really need to be able to run the model in R, such that the Underwood 1993 SS values are reproduced exactly?
Your problem is that dat$Locations is an integer, when it should be a factor (three unique locations). One hint is that your ANOVA line thinks Locations takes up only 1 df, while Underwood gives it 2.
Simply add the line:
dat$Locations = factor(dat$Locations)
And then your line of code reproduces the Underwood results perfectly:
aov(Y~B+Locations+B/Times+B:Locations+B/Times:Locations, data=dat)
#Call:
# aov(formula = Y ~ B + Locations + B/Times + B:Locations + B/Times:Locations,
# data = dat)
#
#Terms:
# B Locations B:Times B:Locations B:Locations:Times
#Sum of Squares 66.1250 2836.8611 280.6389 29.2500 575.4444
#Deg. of Freedom 1 2 6 2 12
# Residuals
#Sum of Squares 2420.0000
#Deg. of Freedom 48