I am attempting to convert existing SAS code from a research project into R. Unfortunately, I am finding myself stuck on how to approach this for repeated measures ANOVA, despite a few hours of looking at other people's questions on StackExchange and across the web. I suspect this is at least partly due to my not knowing the right questions to ask, and to my limited statistics background.
First, I will present some sample data (tab-delimited, which I'm not sure will be preserved on SE), then explain what I'm attempting to do, and then the code I have written as of this moment.
Sample data:
Full data frame at: http://grandprairiefriends.org/document/data.df
Obs SbjctID Sex Treatment Measured BirthDate DateStarted DateAssayed SubjectAge_Start_days SubjectAgeAssay.d. PreMass_mg PostMass_mg DiffMass_mg PerCentMassDiff Length_mm Width_mm PO1_abs_min PO1_r2 PO2_abs_min PO2_r2 ProteinConc_ul Protein1_net_abs Protein1_mg_ml Protein1_adjusted_mg_ml Protein2_net_abs Protein2_mg_ml Protein2_adjusted_mg_ml zPO_avg_abs_min z_Protein_avg_adjusted_mg_ml POPer_ug_Protein POPer_ug_Protein_x1000 ImgDarkness1 ImgDarkness2 ImgDarkness3 ImgDarkness4 DarknessAvg AGV_1_1 AGV_1_2 AGV_2_1 AGV_2_2 AGV_12_1 AGV_12_2 z_AGV predicted_premass resid_premass predicted_premass_calculated resid_premass_calculated predicted_postmass_calculated resid_postmass_calculated predicted_postmass resid_postmass ln_premass_mg ln_postmass_mg ln_length ln_melanization ln_po sqrt_p
1 aF001 Female a PO_P 08/05/09 09/06/09 09/13/09 32 39 282.7 309.4 26.66 9.43 10.1 5.3 0.0175 0.996 0.0201 0.996 40 0.227 0.960 0.960 0.234 1.030 1.030 0.0188 0.995 0.00031 0.31491 33.7045 35.9165 28.8383 30.3763 32.2089 NA NA NA NA NA NA NA 5.660963 -0.016576413 4.077123 1.567263 4.077123 1.657382 5.660963 0.0735429694 8.143128 8.273329 3.336283 NA -5.733124 -0.007231569
2 aF002 Female a PO_P 08/02/09 09/06/09 09/13/09 35 42 298.9 313.1 14.23 4.76 10.0 5.9 0.0123 0.999 0.0134 0.996 40 0.213 0.840 0.840 0.219 0.860 0.860 0.0129 0.850 0.00025 0.25196 31.8700 31.8800 32.4680 32.3020 32.1300 NA NA NA NA NA NA NA 5.640012 0.059996453 4.056173 1.643836 4.056173 1.690350 5.640012 0.1065103847 8.223519 8.290480 3.321928 NA -6.276485 -0.234465254
3 aF003 Female a PO_P 08/03/09 09/06/09 09/13/09 34 41 237.1 270.6 33.53 14.14 9.4 5.3 0.0227 0.992 0.0248 0.994 40 0.245 1.120 1.120 0.235 1.030 1.030 0.0238 1.075 0.00037 0.36822 36.0565 41.9355 41.6260 40.0180 39.9090 NA NA NA NA NA NA NA 5.509734 -0.041209334 3.925894 1.542630 3.925894 1.674895 5.509734 0.0910560222 7.889352 8.080018 3.232661 NA -5.392895 0.104336660
82 bM001 Male b PO_P 08/02/09 08/31/09 09/07/09 29 36 468.1 371.7 -96.38 -20.59 10.7 6.8 0.0049 0.999 0.0056 1.000 40 0.228 0.350 0.350 0.222 0.330 0.330 0.0053 0.340 0.00026 0.25735 NA NA NA NA NA NA NA NA NA NA NA NA 5.782468 0.366214334 4.198628 1.950054 4.198628 1.719513 5.640012 -0.0844204671 8.870673 8.537995 3.419539 NA -7.559792 -1.556393349
157 cM022 Male c PO_P 08/03/09 10/31/09 11/07/09 89 96 451.1 402.4 -48.71 -10.80 11.3 6.9 0.0024 0.995 0.0026 0.995 10 0.091 0.110 0.028 NA NA NA 0.0025 0.028 0.00152 1.51515 NA NA NA NA NA NA NA NA NA NA NA NA 5.897342 0.214325251 4.313502 1.798165 4.313502 1.683895 5.897342 0.1000552907 8.817303 8.652486 3.498251 NA -8.643856 -5.158429363
Explanation of what I'm looking to accomplish:
This experiment was attempting to determine if a particular feeding regime (Treatment) had an effect on the after-experiment mass of the subject (ln_postmass_mg). The mass of each individual was measured twice, once at the beginning (ln_premass_mg), and once at the end of the feeding regime. Sex, Treatment, and Measured are all categorical variables.
I have generated some R code, but the output does not match the SAS output, which is expected, since I don't believe it is coded for repeated measures yet. It's not clear to me whether I need to transpose or otherwise reshape my data frame in R to perform the additional analyses. I keep finding multiple different approaches to repeated measures problems and am not sure which, if any, apply to my particular case. If anyone can put me on the right track to learn how to write the additional lines of code necessary for the R equivalent, or has suggestions, I'd much appreciate it.
SAS Code:
/* test for effect of diet regime */
/* repeated measures ANOVA for mass */
proc glm data=No_diet_lab;
class measured sex Treatment;
model ln_premass ln_postmass=Measured Sex Treatment Measured*Sex Measured*Treatment Sex*Treatment Measured*Sex*Treatment /nouni;
repeated time 2;
run;
R Code:
options(contrasts=c("contr.sum","contr.poly"))
model <- lm(cbind(ln_premass_mg, ln_postmass_mg) ~ Sex + Treatment + Measured + Sex:Treatment + Sex:Measured + Measured:Treatment + Sex:Treatment:Measured, data = diet_lab_data, na.action=na.omit)
This should hopefully replicate your SAS output:
First we'll put the data in long form:
df <- subset(diet_lab_data, select = c("SubjectID", "Sex", "Treatment", "Measured",
"ln_premass_mg", "ln_postmass_mg"))
dfL <- reshape(df, varying = list(5:6), idvar = "SubjectID", direction = "long",
v.names = "ln_mass_mg")
dfL$time <- factor(dfL$time, levels = 1:2, labels = c("pre", "post"))
head(dfL); tail(dfL)
SubjectID Sex Treatment Measured time ln_mass_mg
aF001.1 aF001 Female a PO_P pre 8.143128
aF002.1 aF002 Female a PO_P pre 8.223519
aF003.1 aF003 Female a PO_P pre 7.889352
aF004.1 aF004 Female a PO_P pre 8.521993
aF005.1 aF005 Female a PO_P pre 8.335390
aF006.1 aF006 Female a PO_P pre 8.259743
SubjectID Sex Treatment Measured time ln_mass_mg
cM033.2 cM033 Male c Melaniz post 8.163398
bF037.2 bF037 Female b Melaniz post 8.222070
cM032.2 cM032 Male c Melaniz post 8.422485
cF030.2 cF030 Female c Melaniz post 8.580447
cM039.2 cM039 Male c Melaniz post 8.710118
cM036.2 cM036 Male c Melaniz post 8.049849
That's better. Now we fit the model using aov, specifying time as a within-subjects factor.
aovMod <- aov(ln_mass_mg ~ Sex * Treatment * Measured * time +
Error(SubjectID/time), data = dfL)
All that being said, I'm not sure this is the appropriate analysis, as your design is unbalanced. Consider a mixed-effects model.
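If you do want to try that, here is a minimal sketch with lme4 (my addition, not part of the original answer; it assumes the long-format dfL built above and a simple random intercept per subject, which is only one reasonable choice of random-effects structure):
library(lme4)
library(lmerTest)  # optional: adds approximate p-values to the lmer ANOVA table
# Fixed effects as before; a random intercept lets each subject have its own baseline mass
mixMod <- lmer(ln_mass_mg ~ Sex * Treatment * Measured * time + (1 | SubjectID),
               data = dfL)
anova(mixMod)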
I am currently analyzing eye-tracking data using the Sequential Bayes Factor method, and I would like to plot how the resulting Bayes Factor (BF; calculated from average looking times) changes as participants are added.
I would like the x-axis to represent the number of participants included in the calculation, and the y-axis to represent the resulting Bayes Factor.
For example, when participants 1-10 are included, BF = [y-value], and that is one plot point on the graph. When participants 1-11 are included, BF = [y-value], and that is the second plot point on the graph.
Is there a way to do this in R?
For example, I have this data set:
ID avg_PTL
<chr> <dbl>
1 D07 -0.0609
2 D08 0.0427
3 D12 0.112
4 D15 -0.106
5 D16 0.199
6 D19 0.0677
7 D20 0.0459
8 d21 -0.158
9 D23 0.0650
10 D25 0.0579
11 D27 0.0463
12 D29 0.00822
13 D30 0.00613
14 D36 -0.0484
15 D37 0.0312
16 D39 0.000547
17 D44 0.0336
18 D46 0.0514
19 D48 0.236
20 D51 -0.000487
21 D60 0.0410
22 D61 0.0622
23 D62 0.0337
24 D64 -0.125
25 D65 0.215
26 D66 0.200
And I calculate the BF with:
bf.mono.correct = ttestBF(x = avg_PTL_mono_correct$avg_PTL)
Any tips are much appreciated!
You can use sapply to run the test multiple times and just subset the vector of observations each time. For example
srange <- 10:nrow(avg_PTL_mono_correct)
BF <- sapply(srange, function(i) {
extractBF(ttestBF(x = avg_PTL_mono_correct$avg_PTL[1:i]), onlybf=TRUE)
})
plot(srange, BF)
This will result in a plot of the Bayes Factor against the number of participants included.
I would like to pass a variable that holds the name of a factor from an ANOVA model to the emmeans() statement. Here I use the oranges dataset from the emmeans package to make the code reproducible. This is my model and how I would usually calculate the emmeans of the factor store:
library(emmeans)
oranges$store <- as.factor(oranges$store)
model <- lm(sales1 ~ 1 + price1 + store, data = oranges)
means <- emmeans(model, pairwise ~ store, adjust = "tukey")
Now I would like to assign a variable (lsmeanfact) defining the factor for which the lsmeans are calculated.
lsmeanfact<-"store"
However, when I try to evaluate this variable inside the emmeans() call, it returns an error: it does not find a variable named lsmeanfact in the reference grid, so the variable is never evaluated.
means<-emmeans(model, pairwise ~ eval(parse(lsmeanfact)), adjust="tukey")
Error in emmeans(model, pairwise ~ eval(parse(lsmeanfact)), adjust = "tukey") :
No variable named lsmeanfact in the reference grid
How should I change my code so that the variable lsmeanfact is evaluated and the lsmeans for the factor it names (here "store") are correctly calculated?
You can make use of reformulate function.
library(emmeans)
lsmeanfact<-"store"
means <- emmeans(model, reformulate(lsmeanfact, 'pairwise'), adjust="tukey")
Or construct a formula with formula/as.formula.
means <- emmeans(model, formula(paste('pairwise', lsmeanfact, sep = '~')), adjust="tukey")
Here both reformulate(lsmeanfact, 'pairwise') and formula(paste('pairwise', lsmeanfact, sep = '~')) return pairwise ~ store.
You do not need to do anything special at all. The specs argument to emmeans() can be a character value. You can get the pairwise comparisons in a separate call, which is actually a better way to go anyway.
library(emmeans)
model <- lm(sales1 ~ price1 + store, data = oranges)
lsmeanfact <- "store"
( EMM <- emmeans(model, lsmeanfact) )
## store emmean SE df lower.CL upper.CL
## 1 8.01 2.61 29 2.67 13.3
## 2 9.60 2.30 29 4.89 14.3
## 3 7.84 2.30 29 3.13 12.6
## 4 10.44 2.35 29 5.63 15.2
## 5 10.19 2.28 29 5.53 14.9
## 6 15.22 2.28 29 10.56 19.9
##
## Confidence level used: 0.95
pairs(EMM)
## contrast estimate SE df t.ratio p.value
## 1 - 2 -1.595 3.60 29 -0.443 0.9976
## 1 - 3 0.165 3.60 29 0.046 1.0000
## 1 - 4 -2.428 3.72 29 -0.653 0.9856
## 1 - 5 -2.185 3.50 29 -0.625 0.9882
## 1 - 6 -7.209 3.45 29 -2.089 0.3206
## 2 - 3 1.761 3.22 29 0.546 0.9936
## 2 - 4 -0.833 3.23 29 -0.258 0.9998
## 2 - 5 -0.590 3.23 29 -0.182 1.0000
## 2 - 6 -5.614 3.24 29 -1.730 0.5239
## 3 - 4 -2.593 3.23 29 -0.802 0.9648
## 3 - 5 -2.350 3.23 29 -0.727 0.9769
## 3 - 6 -7.375 3.24 29 -2.273 0.2373
## 4 - 5 0.243 3.26 29 0.075 1.0000
## 4 - 6 -4.781 3.28 29 -1.457 0.6930
## 5 - 6 -5.024 3.23 29 -1.558 0.6314
##
## P value adjustment: tukey method for comparing a family of 6 estimates
Created on 2021-06-29 by the reprex package (v2.0.0)
Moreover, what is needed in specs is the name(s) of the factors involved, not the factors themselves. Note also that it was unnecessary to convert store to a factor before fitting the model.
Please be patient with me. I'm new to this site.
I am modeling turtle nest survival using the coxph() function and have run into a confusing problem with an interaction term between species and nest cages. I have nests from 3 species of turtles (7, 10, and 111 nests per species).
All nests of species 1 (7 nests) have nest cages.
None of the nests of species 2 (10 nests) have nest cages.
About half of the nests of species 3 (111 nests) have nest cages.
Here is my model with the summary output:
S<-Surv(time, event)
n8<-coxph(S~species:cage, data=nesta1)
Warning message:
In coxph(S ~ species:cage, data = nesta1) :
X matrix deemed to be singular; variable 1 5 6
summary(n8)
Call:
coxph(formula = S ~ species:cage, data = nesta1)
n= 128, number of events= 73
coef exp(coef) se(coef) z Pr(>|z|)
species1:cageN NA NA 0.0000 NA NA
species2:cageN 1.2399 3.4554 0.3965 3.128 0.00176 **
species3:cageN 0.5511 1.7351 0.2664 2.068 0.03860 *
species1:cageY -0.1054 0.8999 0.6145 -0.172 0.86379
species2:cageY NA NA 0.0000 NA NA
species3:cageY NA NA 0.0000 NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
species1:cageN NA NA NA NA
species2:cageN 3.4554 0.2894 1.5887 7.515
species3:cageN 1.7351 0.5763 1.0293 2.925
species1:cageY 0.8999 1.1112 0.2698 3.001
species2:cageY NA NA NA NA
species3:cageY NA NA NA NA
Concordance= 0.61 (se = 0.038 )
Rsquare= 0.079 (max possible= 0.993 )
Likelihood ratio test= 10.57 on 3 df, p=0.01426
Wald test = 11.36 on 3 df, p=0.009908
Score (logrank) test = 12.22 on 3 df, p=0.006672
I understand that I would have singularities for species 1 and 2, but not for species 3. Why would the "species3:cageY" line be singular when there are species 3 nests with nest cages on them?
Is it ok to include species 1 and 2 even though they have those singularities?
Edit: I cannot find any errors in my data. I have decimal numbers for the time variable for a few nests, but that doesn't seem to be a problem for species 3 nests without a nest cage. For species 3, I have the full range of time values for nests with and without a nest cage and I have both true and false events for nests with and without a nest cage.
Edit:
with( nesta1, table(event, species, cage))
, , cage = N
species
event 1 2 3
0 0 1 24
1 0 9 38
, , cage = Y
species
event 1 2 3
0 4 0 26
1 3 0 23
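(A quick check, added here for illustration and assuming the same nesta1 data frame: the columns of the interaction-only design matrix correspond to the species/cage combinations, and their column sums show how many nests fall into each one.)
# Count the nests contributing to each interaction column; all-zero columns are empty cells
mm <- model.matrix(~ species:cage, data = nesta1)[, -1]  # drop the intercept column
colSums(mm)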
Edit 2: I understand that interaction-only models are not very useful, but the interaction term results behave the same way whether I have other main effects in the model or not. I've removed the other main effects to simplify this question.
Thank you!
I have one time series, let's say
694 281 479 646 282 317 790 591 573 605 423 639 873 420 626 849 596 486 578 457 465 518 272 549 437 445 596 396 259 390
Now, I want to forecast the following values with an ARIMA model, but ARIMA requires the time series to be stationary, so first I have to determine whether the series above meets that requirement, which is where fUnitRoots comes in.
I think http://cran.r-project.org/web/packages/fUnitRoots/fUnitRoots.pdf can offer some help, but there is no simple tutorial.
I just want a small demo showing how to test whether one time series is stationary. Is there one?
Thanks in advance.
I will give an example using the urca package in R.
library(urca)
data(npext) # This is the data used by Nelson and Plosser (1982)
sample.data<-npext
head(sample.data)
year cpi employmt gnpdefl nomgnp interest indprod gnpperca realgnp wages realwag sp500 unemploy velocity M
1 1860 3.295837 NA NA NA NA -0.1053605 NA NA NA NA NA NA NA NA
2 1861 3.295837 NA NA NA NA -0.1053605 NA NA NA NA NA NA NA NA
3 1862 3.401197 NA NA NA NA -0.1053605 NA NA NA NA NA NA NA NA
4 1863 3.610918 NA NA NA NA 0.0000000 NA NA NA NA NA NA NA NA
5 1864 3.871201 NA NA NA NA 0.0000000 NA NA NA NA NA NA NA NA
6 1865 3.850148 NA NA NA NA 0.0000000 NA NA NA NA NA NA NA NA
I will use the ADF test to perform the unit root test on the industrial production index as an illustration. The lag is selected based on the SIC (BIC). I include a trend term as there is a trend in the data.
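(The call that produced the output below is not shown in the original post; a plausible reconstruction, assuming ur.df() from urca on the non-missing values of indprod with a trend term and BIC lag selection, is:)
adf.indprod <- ur.df(na.omit(sample.data$indprod), type = "trend",
                     lags = 2, selectlags = "BIC")
summary(adf.indprod)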
###############################################
# Augmented Dickey-Fuller Test Unit Root Test #
###############################################
Test regression trend
Call:
lm(formula = z.diff ~ z.lag.1 + 1 + tt + z.diff.lag)
Residuals:
Min 1Q Median 3Q Max
-0.31644 -0.04813 0.00965 0.05252 0.20504
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.052208 0.017273 3.022 0.003051 **
z.lag.1 -0.176575 0.049406 -3.574 0.000503 ***
tt 0.007185 0.002061 3.486 0.000680 ***
z.diff.lag 0.124320 0.089153 1.394 0.165695
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.09252 on 123 degrees of freedom
Multiple R-squared: 0.09796, Adjusted R-squared: 0.07596
F-statistic: 4.452 on 3 and 123 DF, p-value: 0.005255
Value of test-statistic is: -3.574 11.1715 6.5748
Critical values for test statistics:
1pct 5pct 10pct
tau3 -3.99 -3.43 -3.13
phi2 6.22 4.75 4.07
phi3 8.43 6.49 5.47
#Interpretation: BIC selects lag 1 as the optimal lag. The test statistic -3.574 is less than the critical value tau3 at 5 percent (-3.43), so the null hypothesis that there is a unit root is rejected at the 5 percent level (though not at the 1 percent level, where the critical value is -3.99).
Also, check the free forecasting book available here
You can, of course, carry out formal tests such as the ADF test, but I would suggest carrying out "informal tests" of stationarity as a first step.
Inspecting the data visually using plot() will help you identify whether or not the data is stationary.
The next step would be to investigate the autocorrelation function and partial autocorrelation function of the data. You can do this by calling both the acf() and pacf() functions. This will not only help you decide whether or not the data is stationary, but it will also help you identify tentative ARIMA models that can later be estimated and used for forecasting if they get the all clear after carrying out the necessary diagnostic checks.
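For the 30 observations you posted, those informal checks take only a few lines (the vector below is simply your series typed in):
y <- c(694, 281, 479, 646, 282, 317, 790, 591, 573, 605, 423, 639, 873, 420, 626,
       849, 596, 486, 578, 457, 465, 518, 272, 549, 437, 445, 596, 396, 259, 390)
plot.ts(y)  # look for a trend or changing variance
acf(y)      # a slowly decaying ACF would point towards non-stationarity
pacf(y)     # together with the ACF, helps suggest tentative ARIMA orders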
You should, indeed, be mindful of the fact that there are only 30 observations in the data you provided. This falls below the practical minimum of about 50 observations usually suggested for forecasting with ARIMA models.
If it helps, a moment after I plotted the data, I was fairly confident it was stationary. The estimated acf and pacf seem to confirm this view. Sometimes informal tests like these suffice.
This little-book-of-r-for-time-series may help you further.
I am a novice R user trying to work with a data set of 40,000 rows and 300 columns. I have found a solution for what I would like to do; however, my machine takes over an hour to run my code, and I suspect an expert could help me with a quicker solution (I can do this in Excel in half the time). I will post my solution at the end.
What I would like to do is the following:
Compute the average value of each column NY1 to NYn within each level of the YYYYMMbucket column.
Divide each original value by its corresponding YYYYMMbucket average.
Here is sample of my original data set:
YYYYMMbucket NY1 NY2 NY3 NY4
1 200701.3 0.309 NA 20.719 16260
2 200701.3 0.265 NA 19.482 15138
3 200701.3 0.239 NA 19.168 14418
4 200701.3 0.225 NA 19.106 14046
5 200701.3 0.223 NA 19.211 14040
6 200701.3 0.234 NA 19.621 14718
7 200701.3 0.270 NA 20.522 15780
8 200701.3 0.298 NA 22.284 16662
9 200701.2 0.330 NA 23.420 16914
10 200701.2 0.354 NA 23.805 17310
11 200701.2 0.388 NA 24.095 17448
12 200701.2 0.367 NA 23.954 17640
13 200701.2 0.355 NA 23.255 17748
14 200701.2 0.346 NA 22.731 17544
15 200701.2 0.347 NA 22.445 17472
16 200701.2 0.366 NA 21.945 17634
17 200701.2 0.408 NA 22.683 18876
18 200701.2 0.478 NA 23.189 21498
19 200701.2 0.550 NA 23.785 22284
20 200701.2 0.601 NA 24.515 22368
This is what my averages look like:
YYYYMMbucket NY1M NY2M
1 200701.1 0.4424574 NA
2 200701.2 0.4530000 NA
3 200701.3 0.2936935 NA
4 200702.1 0.4624063 NA
5 200702.2 0.4785937 NA
6 200702.3 0.3091161 NA
7 200703.1 0.4159687 NA
8 200703.2 0.4491875 NA
9 200703.3 0.2840081 NA
10 200704.1 0.4279137 NA
How I would like my final output to look:
NY1avgs NY2avgs NY3avgs
1 1.052117 NA 0.7560868
2 0.9023011 NA 0.7109456
3 0.8137734 NA 0.699487
4 0.7661047 NA 0.6972245
5 0.7592949 NA 0.7010562
6 0.7967489 NA 0.7160181
7 0.9193256 NA 0.7488978
8 1.014663 NA 0.8131974
9 0.7284768 NA 0.857904
Here's how I did it:
First I used "plyr" to compute my averages, simple enough:
test <- ddply(prf.delete2b, .(YYYYMMbucket), summarise,
              NY1M = mean(NY1), NY2M = mean(NY2), ...)  # "..." stands for the remaining NY columns
Then I used a series of the following:
x <- c(1:40893)
lookv <- function(x,ltab,rcol=2) ltab[max(which(ltab[,1]<=x)),rcol]
NY1Fun <- function(x) (prf.delete2b$NY1[x] / lookv((prf.delete2b$YYYYMMbucket[x]),test,2))
NY2Fun <- function(x) (prf.delete2b$NY2[x] / lookv((prf.delete2b$YYYYMMbucket[x]),test,3))
NY1Avgs <- lapply(x, NY1Fun)
NY2Avgs <- lapply(x, NY2Fun)
I also tried a variant of the above by saying:
NY1Fun <- function(x) (prf.delete2b$NY1[x] / subset(test, YYYYMMbucket == prf.delete2b$YYYYMMbucket[x], select =c(NY1M)))
lapply(x, NY1Fun)
Each variant of NYnFun takes a good 20 seconds to run so doing this 300 times takes much too long. Can anyone recommend any alternative to what I posted or point out any novice mistakes I've made?
Here is the customary data.table approach, which works pretty fast.
# CREATE DUMMY DATA
N = 1000
mydf = data.frame(
bucket = sample(letters, N, replace = T),
NY1 = runif(N),
NY2 = runif(N),
NY3 = runif(N),
NY4 = runif(N)
)
# SCALE COLUMNS BY AVG
library(data.table)
scale_x = function(x) x/ave(x)
mydt = data.table(mydf)
ans = mydt[,lapply(.SD, scale_x), by = 'bucket']
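One caveat (my addition, not part of the original answer): ave(x) with its default FUN returns NA for any group containing missing values, so columns like NY2 in the sample would come back entirely NA. A small variant that skips NAs when computing the group mean might be:
scale_x_na = function(x) x / mean(x, na.rm = TRUE)  # divide by the group mean, ignoring NAs
ans_na = mydt[, lapply(.SD, scale_x_na), by = 'bucket']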
How about:
test2 <- merge(prf.delete2b, test, all.x = TRUE)
test2[2:ncol(prf.delete2b)] / test2[(ncol(prf.delete2b) + 1):ncol(test2)]
In this case, I would use ave instead of ddply because ave returns a vector the same length as its input. ave only accepts a vector, so you need to use lapply to loop over the columns of your data.frame.
myFun <- function(x, groupVar) {
x / ave(x, groupVar, FUN=function(y) mean(y, na.rm=TRUE))
}
relToMeans <- data.frame(prf.delete2b[1],
lapply(prf.delete2b[-1], myFun, groupVar=prf.delete2b[1]))