I want to show that seed length differs between species, i.e. that the factor Species has an effect on Length.
For each species, I have several trees and for each tree, I have several seeds measured.
Using R, I did an ANOVA:
summary(aov(Length ~ Species))
However, the reviewer noticed a problem of independence because several seeds may come from the same tree (and this is indeed a real problem!).
To answer this issue, I think that I should do a nested ANOVA. Is that right ?
However, there are plenty of ways to write the code:
summary(aov(Length ~ Species*Tree))
summary(aov(Length ~ Tree*Species))
summary(aov(Length ~ Species/Tree))
summary(aov(Length ~ Species+Error(Tree)))
I believe this is the last possibility listed that will allow me to show that the length of seeds is different due to the species and taking into account that the seeds may come from the same tree.
Can you confirm ?
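To see how the first three formulas differ, it may help to ask R which terms each one expands to. This is only a sketch: it inspects the formulas themselves and needs no data, and the helper name term_labels is made up for illustration.

```r
# Expand each candidate formula into the terms R would actually fit.
# term_labels() is a hypothetical helper, not part of any package.
term_labels <- function(f) attr(terms(f), "term.labels")

crossed  <- term_labels(Length ~ Species * Tree)  # main effects plus interaction
reversed <- term_labels(Length ~ Tree * Species)  # same terms, entered in a different order
nested   <- term_labels(Length ~ Species / Tree)  # Species, then Tree within Species

print(crossed)
print(reversed)
print(nested)
```

The crossed forms include a Tree main effect, while the nested form only has Tree inside Species. None of the three treats Tree as an error stratum, which is what the Error(Tree) version does.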
When I run the command, I obtain this:
Error: Tree
Df Sum Sq Mean Sq F value Pr(>F)
Species 12 320.6 26.715 14.98 4.96e-15 ***
Residuals 71 126.6 1.784
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Error: Within
Df Sum Sq Mean Sq F value Pr(>F)
Residuals 1541 11.92 0.007733
Does this indeed mean that species has a significant effect on seed length?
Thanks so much for your help !!
Muriel
See here for some examples of nested ANOVA in R as well as some insight into mixed models.
I'd install the package lme4, do ?lmer in R, and look into the section "Mixed and Multilevel Models" on the page provided. Perhaps this is a better approach for your data.
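As a rough illustration of what Error(Tree) does, here is a sketch on simulated data (all numbers and names below are invented, not the asker's data). In a balanced design, testing Species in the Tree stratum should give the same F statistic as a one-way ANOVA on the per-tree means, which shows that trees, not seeds, are the units of replication.

```r
set.seed(42)

# Hypothetical balanced design: 3 species, 4 trees per species, 5 seeds per tree.
design <- expand.grid(seed = 1:5, tree = 1:4, species = 1:3)
seeds <- data.frame(
  Species = factor(paste0("sp", design$species)),
  # Tree labels must be unique across species, otherwise "tree 1" of
  # different species would be treated as the same tree.
  Tree = factor(paste0("sp", design$species, "_t", design$tree))
)

# Seed length = species effect + tree effect + seed-level noise.
tree_effect <- rnorm(nlevels(seeds$Tree), sd = 1)
seeds$Length <- 10 + as.integer(seeds$Species) +
  tree_effect[as.integer(seeds$Tree)] + rnorm(nrow(seeds), sd = 0.1)

# Species tested against the between-tree variation.
fit_strata <- aov(Length ~ Species + Error(Tree), data = seeds)
tree_stratum <- summary(fit_strata)[["Error: Tree"]][[1]]
print(tree_stratum)

# Same test done by hand: average the seeds of each tree, then one-way ANOVA.
tree_means <- aggregate(Length ~ Species + Tree, data = seeds, FUN = mean)
means_table <- summary(aov(Length ~ Species, data = tree_means))[[1]]
print(means_table)
```

With unbalanced data the two approaches no longer agree exactly, which is one reason a mixed model such as lmer(Length ~ Species + (1 | Tree)) is often suggested instead.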
I have a basic pre-post trial design. Two randomized Groups and two tests for each participant in each group one prior to the intervention (here V1) and one post (V2).
I am completely new to this and have been reading up a lot on this and based on a few sources it was suggested that an ANCOVA test with the pre-test as a covariate was the most appropriate.
So, I modeled as follows:
y <- aov(V2~Group+V1, data=x)
And checked for normality of residuals and used the Levene's test to test for correlation between V2 and Group.
I got the following result for a certain variable of interest -
summary(y)
Df Sum Sq Mean Sq F value Pr(>F)
Group 1 29996 29996 4.315 0.0386 *
V1 1 3710598 3710598 533.844 <2e-16 ***
Residuals 325 2258983 6951
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
I followed it up with a post-hoc Tukey test and found that it was significant there as well.
I have a couple of questions:
Is this the correct way to go?
Why is my V1 (pretest) covariate having such a high level of significance and what does it mean? (I assumed that randomization essentially means that there is no difference between the groups at baseline).
Can I conclude that there truly is a difference between the two groups for this particular aspect based on this?
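A small simulation may help with the second question. This is a sketch under invented assumptions (sample size, effect sizes and noise levels are all made up): the groups are randomized, so they do not differ at baseline, yet the pre-test term is still highly significant simply because the pre-test predicts the post-test within each participant.

```r
set.seed(1)
n <- 200

# Hypothetical randomized trial: both groups share the same pre-test distribution.
Group <- factor(rep(c("control", "treatment"), each = n / 2))
V1 <- rnorm(n, mean = 100, sd = 20)
# Post-test depends strongly on the pre-test, plus an assumed treatment effect of 5.
V2 <- 10 + 0.8 * V1 + 5 * (Group == "treatment") + rnorm(n, sd = 5)
x <- data.frame(Group, V1, V2)

# ANCOVA written as a linear model; the Group coefficient is the
# baseline-adjusted difference between the groups.
fit <- lm(V2 ~ V1 + Group, data = x)
coef_table <- summary(fit)$coefficients
print(coef_table)

# Separate check of baseline balance between the groups.
baseline <- t.test(V1 ~ Group, data = x)
print(baseline$p.value)
```

Note that aov() reports sequential sums of squares, so in V2 ~ Group + V1 the Group row is not adjusted for V1, whereas each coefficient in the lm() table is adjusted for the other term.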
I'm testing how visual perspective (1, completely first person -> 11, completely third person) varies as a function of Culture (AA, EA), Valence (Positive, Negative) and Event Type (Memory, Imagination) while controlling for age (continuous), sex (M, F) and SES (continuous) and allowing for individual differences.
This is an unbalanced design: we give participants 10 prompts, but participants can choose to either recall or imagine a relevant event. Therefore, each participant may have as many memories (no more than 10) and as many imaginations (no more than 10) as they want. In total we have 363 participants.
My dataset looks like this:
The model I fit looks like
VP.full.lm <- lmer(Visual.Perspective ~ Culture * Event.Type * Valence +
Sex + Age + SES +
(1|Participant.Number),
data=VP_Long)
When I run anova() function to see the effects of all variables, here is the output:
Type III Analysis of Variance Table with Satterthwaite's method
Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
Culture 30.73 30.73 1 859.1 4.9732 0.0260008 *
Valence 6.38 6.38 1 3360.3 1.0322 0.3097185
Event.Type 1088.61 1088.61 1 3385.9 176.1759 < 2.2e-16 ***
Sex 45.12 45.12 1 358.1 7.3014 0.0072181 **
Age 7.95 7.95 1 358.1 1.2869 0.2573719
SES 6.06 6.06 1 358.7 0.9807 0.3226824
Culture:Valence 6.39 6.39 1 3364.6 1.0348 0.3091004
Culture:Event.Type 71.53 71.53 1 3389.7 11.5766 0.0006756 ***
Valence:Event.Type 2.89 2.89 1 3385.4 0.4682 0.4938573
Culture:Valence:Event.Type 3.47 3.47 1 3390.6 0.5617 0.4536399
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
As you can see, the denominator DF for the effect of Culture is off: since Culture is a between-subject factor, its DF cannot be larger than our sample size. I've tried ddf = "Kenward-Roger" and tested the effect of Culture using emmeans::test(contrast(emmeans(VP.full.lm,c("Culture")), "trt.vs.ctrl"), joint = T), yet neither method solved the degrees-of-freedom issue.
I also thought that the participants who did not provide both memories and imaginations might be confusing the lmer model, so I subsetted my data to only include participants who provided both types of events. However, the degrees-of-freedom problem persists. It's also worth mentioning that once I removed the interaction between Culture and Event.Type, the degrees of freedom became plausible.
I wonder if anyone knows what is going on here, and how can we fix this issue? Or is there way we can explain away this weird issue...?
Thanks so much in advance!
This question might be more appropriate for CrossValidated ...
Not a complete solution, but some ideas:
from a practical point of view, the difference between 363 (or even 350) denominator df and 859 ddf is very small: the manual p-value calculation based on an F-statistic of 4.9732 gives pf(4.9732,1,350,lower.tail=FALSE)=0.0264, hardly different from your value of 0.0260.
since you are fitting a simple model (LMM not GLMM, only a single simple random effect, etc.), you might be able to refit your model in lme (from the nlme package): it uses a simpler df computation that might give you the 'right' answer in this case. Alternatively, you can get code from here that implements a (slightly extended) version of the algorithm from lme.
since you're doing type-III Anova, you should be very careful with the parameterization/contrasts in your model: if you're not using centered (sum-to-zero) contrasts, your results may not mean what you think (the afex::mixed() function does some checks to make sure that this is true). It's conceivable (although I doubt it) that the contrasts are throwing off your ddf calculations as well.
it's not clear how you're measuring "visual perspective", but if it's a ratings scale you might be better off with an ordinal response model ...
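The first point can be checked directly in base R. The sketch below recomputes the p-value for the Culture F statistic under the reported Satterthwaite ddf and under a more conservative ddf of 350 (the value used above as a rough lower bound, not something the model reports).

```r
# F statistic for Culture, taken from the anova() table in the question.
f_stat <- 4.9732

# p-value with the Satterthwaite denominator df reported by lmerTest.
p_reported <- pf(f_stat, df1 = 1, df2 = 859.1, lower.tail = FALSE)

# p-value with a conservative denominator df close to the number of participants.
p_conservative <- pf(f_stat, df1 = 1, df2 = 350, lower.tail = FALSE)

print(round(c(reported = p_reported, conservative = p_conservative), 4))
```

Both values fall on the same side of 0.05, so the conclusion about Culture would not change with the choice of ddf in this case.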
I've seen a few of these previously for very simple functions; however, the function I'm trying to fit is basically a mixture of 3 functions:
A gaussian (which dominates at x=0)
An exponential (which takes over post gaussian)
and a constant which rounds out the values
From the other examples of this error that I have read it seems that the issue is caused by poor initial guesses, but I have no idea how to correct this or if this is even the actual issue given the size of my function.
Here is my code and one sample of the data I'm looking at:
Value<-c(163301.080,269704.110,334570.550,409536.530,433021.260,418962.060,349554.460,253987.570,124461.710,140750.480,52612.790,54286.427,26150.025,14631.210,15780.244,8053.618,4402.581,2251.137,2743.511,1707.508,1246.894)
Height<-c(400,300,200,0,-200,-400,-600,-800,-1000,-1000,-1200,-1220,-1300,-1400,-1400,-1500,-1600,-1700,-1700,-1800,-1900)
Framed<-data.frame(Value,Height)
i<-nls(Value~a*exp(-Height^2/(2*b^2))+ c*exp(-d*abs(Height)) + e,
data=Framed,start = list(a=410000,b=5,c=10000,d=5,e=1200))
plot(Value~Height)
summary(i)
Thanks for your help. Now I have the same problem again. I've used your technique below (R noob; I was previously using the Manipulate plot in Mathematica) and I think I've got a relatively good fit for the data. Here is a graph of the data I'm attempting to fit (sorry, I can't upload it, not enough reputation):
http://imgur.com/GtzIzSr
However, I am getting the same issue. Is this to do with my fit or the massive amount of variability at low distances?
You're right about this usually being about bad starting values, and that's (part of) your case. Looking at your data and your guesses, it's clear that something is wrong. But before going into that, note that Framed was not created in the correct order. It should be X Y, or:
Framed <- data.frame(Height, Value)
With that in mind, try the following:
Vals2 <- 410000*exp(-Height^2/(2*5^2)) + 10000*exp(-5*abs(Height)) + 1200
plot(Framed)
lines(Height, Vals2)
You should get
This shows how bad your guesses are. Playing around with your function, it can be easily seen that b is far off. Change it to 500, and then:
That's much better, but still won't fit. And if you change the other parameters (c, d, and e), you'll notice they don't seem to affect the data too much, or at all. That's probably because a is much bigger and you have Height^2 in the first term. If you simplify your function, and run:
i<-nls(Value~a*exp(-Height^2/(2*b^2)), start = list(a=410000,b=500))
You'll find a fit. This is probably because non-linear functions get harder to fit as the number of parameters increases, especially if the parameters are correlated with each other. Fewer parameters are much easier to fit. You'll have to decide, however, if you can work with only a and b.
But if you plot that, it still doesn't look good. It's clear that your Value does not have its maximum at Height = 0, like it should from your description and from the simulated curve. There seems to be an error with your data, because if you try Height <- Height+200 along with the above changes, you'll get
> summary(i)
Formula: Value ~ a * exp(-Height^2/(2 * b^2))
Parameters:
Estimate Std. Error t value Pr(>|t|)
a 449820.71 10236.43 43.94 <2e-16 ***
b 496.60 12.54 39.59 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 17790 on 19 degrees of freedom
Number of iterations to convergence: 4
Achieved convergence tolerance: 2.164e-06
Now that's up to you to check if your data is indeed shifted and if you can simplify the function.
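The steps above can be put together in one sketch using the data from the question. It first compares how far the Gaussian term is from the data for b = 5 versus b = 500 (via the residual sum of squares rather than a plot), then fits the simplified two-parameter model on the shifted heights. The helper names gauss and rss and the column Shifted are introduced here for illustration only.

```r
Value <- c(163301.080, 269704.110, 334570.550, 409536.530, 433021.260,
           418962.060, 349554.460, 253987.570, 124461.710, 140750.480,
           52612.790, 54286.427, 26150.025, 14631.210, 15780.244,
           8053.618, 4402.581, 2251.137, 2743.511, 1707.508, 1246.894)
Height <- c(400, 300, 200, 0, -200, -400, -600, -800, -1000, -1000, -1200,
            -1220, -1300, -1400, -1400, -1500, -1600, -1700, -1700, -1800, -1900)
Framed <- data.frame(Height, Value)

# Gaussian term of the model and the residual sum of squares it leaves.
gauss <- function(h, a, b) a * exp(-h^2 / (2 * b^2))
rss <- function(a, b) sum((Framed$Value - gauss(Framed$Height, a, b))^2)

# Compare the original starting guess for b with the revised one.
rss_b5 <- rss(410000, 5)
rss_b500 <- rss(410000, 500)
print(c(b5 = rss_b5, b500 = rss_b500))

# Shift the heights by 200 (the suspected offset) and fit the simplified model.
Framed$Shifted <- Framed$Height + 200
fit <- nls(Value ~ a * exp(-Shifted^2 / (2 * b^2)),
           data = Framed, start = list(a = 410000, b = 500))
print(coef(fit))
```

Checking the residual sum of squares at the starting values is a quick numeric substitute for overlaying the curve on the data when choosing initial guesses.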
I am attempting to use mlogit in R to fit a transportation mode choice model. The problem is that I have a variable that only applies to certain alternatives.
To be more specific, I am attempting to predict the probability of using auto, transit and non motorized modes of transportation. My predictors are: distance, transit wait time, number of vehicles in household and in vehicle travel time.
It works when I format it this way:
> amres<-mlogit(mode~ivt+board|distance+nveh,data=AMLOGIT)
However, the result I get for in vehicle travel time (ivt) does not make sense:
> summary(amres)
Call:
mlogit(formula = mode ~ ivt + board | distance + nveh, data = AMLOGIT,
method = "nr", print.level = 0)
Frequencies of alternatives:
auto tansit nonmotor
0.24654 0.28378 0.46968
nr method
5 iterations, 0h:0m:2s
g'(-H)^-1g = 6.34E-08
gradient close to zero
Coefficients :
Estimate Std. Error t-value Pr(>|t|)
tansit:(intercept) 7.8392e-01 8.3761e-02 9.3590 < 2.2e-16 ***
nonmotor:(intercept) 3.2853e+00 7.1492e-02 45.9532 < 2.2e-16 ***
ivt 1.6435e-03 1.2673e-04 12.9691 < 2.2e-16 ***
board -3.9996e-04 1.2436e-04 -3.2161 0.001299 **
tansit:distance 3.2618e-04 2.0217e-05 16.1336 < 2.2e-16 ***
nonmotor:distance -2.9457e-04 3.3772e-05 -8.7224 < 2.2e-16 ***
tansit:nveh -1.5791e+00 4.5932e-02 -34.3799 < 2.2e-16 ***
nonmotor:nveh -1.8008e+00 4.8577e-02 -37.0720 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Log-Likelihood: -10107
McFadden R^2: 0.30354
Likelihood ratio test : chisq = 8810.1 (p.value = < 2.22e-16)
As you can see, the stats look great, but ivt should have a negative coefficient, not a positive one. My thought is that the non-motorized portion, which is all 0, is affecting it. I believe what I have to do is use the third part of the formula, as seen below:
> amres<-mlogit(mode~board|distance+nveh|ivt,data=AMLOGIT)
However, this results in:
Error in solve.default(H, g[!fixed]) :
Lapack routine dgesv: system is exactly singular: U[10,10] = 0
I believe this is, again, because the variable is all 0's for non-motorized but I am unsure how to fix this. How do I include an alternative specific variable if it does not apply to all alternatives?
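One way to see why an all-zero variable could make the system singular is to build the alternative-specific ivt columns by hand. The sketch below uses made-up long-format data (the names long, alts and X are hypothetical, and this is not how mlogit builds its model matrix internally): giving ivt a separate coefficient per alternative amounts to interacting it with a dummy for each alternative, and the non-motorized column is then identically zero.

```r
set.seed(1)
n <- 50
alts <- c("auto", "transit", "nonmotor")

# Hypothetical long-format data: one row per person and alternative.
long <- expand.grid(alt = alts, id = seq_len(n), stringsAsFactors = FALSE)
# In-vehicle time is always 0 for the non-motorized alternative.
long$ivt <- ifelse(long$alt == "nonmotor", 0, runif(nrow(long), 5, 60))

# One ivt column per alternative, as an alternative-specific coefficient implies.
X <- sapply(alts, function(a) long$ivt * (long$alt == a))

print(colSums(X != 0))  # number of non-zero entries in each column
print(qr(X)$rank)       # rank of the 3-column matrix
```

A column of zeros carries no information about its coefficient, so that coefficient cannot be estimated and the Hessian has a zero row and column, which is consistent with the "exactly singular" error.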
I am not well versed in the various implementations of logit models, but I imagine it has to do with making sure you have variation across both alternatives and choosers, so that the matrix can be properly determined.
What do you get from saying
amres<-mlogit(mode~distance| nveh | ivt+board,data=AMLOGIT)
mlogit separates groups of variables with pipes. As I understand it: the first part holds alternative-specific variables with a single (generic) coefficient, the second part holds variables that don't vary across alternatives (i.e. are only person specific: gender, income; I think nveh should be here), and the third part holds alternative-specific variables that get a separate coefficient for each alternative.
Ken Train, incidentally, has a set of vignettes on mlogit specifically that might be helpful. Viton mentions the partition with pipes.
Ken Train's Vignettes
Philip Viton's Vignettes
Yves Croissant's Vignettes
Looks like you may have perfect separation. Have you checked this by e.g. looking at crosstables of the variables? (Can't fit a model if one combination of predictors allows for perfect prediction...) Would be helpful to know size of dataset in this regard - you may be over-fitting for the amount of data you have. This is a general problem in modelling, not specific to mlogit.
You say "the stats look great" but values for Pr(>|t|)s and the Likelihood ratio test look implausibly significant, which would be consistent with this problem. This means the estimates of the coefficients are likely to be inaccurate. (Are they similar to the coefficients produced by univariate modelling ?). Perhaps a simpler model would be more appropriate.
Edit @user3092719:
You're fitting a generalized linear model, which can easily be overfit (as the outcome variable is discrete or nominal - i.e. has a restricted no. of values). mlogit is an extension of logistic regression; here's a simple example of the latter to illustrate:
> df1 <- data.frame(x=c(0, rep(1, 3)),
y=rep(c(0, 1), 2))
> xtabs( ~ x + y, data=df1)
y
x 0 1
0 1 0
1 1 2
Note the zero in the top right corner. This shows 'perfect separation', which means that if x=0 you know for sure that y=0 based on this set. So a probabilistic predictive model doesn't make much sense.
Some output from
> summary(glm(y ~ x, data=df1, binomial(link = "logit")))
gives
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -18.57 6522.64 -0.003 0.998
x 19.26 6522.64 0.003 0.998
Here the Std. Errors are suspiciously large relative to the values of the coefficients. You should also be alerted by Number of Fisher Scoring iterations: 17 - the large number of iterations needed to fit suggests numerical instability.
Your solution seems to involve ensuring that this problem of complete separation does not occur in your model, although hard to be sure without having a minimal working example.
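The checks described above can be scripted. The sketch below reuses the toy data frame from this answer, flags empty cells in the cross-table, and pulls out the standard errors and iteration count that signal trouble. The names has_empty_cell and se are introduced here for illustration.

```r
# Toy data from the example above.
df1 <- data.frame(x = c(0, rep(1, 3)),
                  y = rep(c(0, 1), 2))

# Step 1: look for empty cells in the cross-table of predictor and outcome.
tab <- xtabs(~ x + y, data = df1)
has_empty_cell <- any(tab == 0)
print(tab)

# Step 2: fit the logistic regression; warnings are suppressed here only
# to keep the output short, they are worth reading in real use.
fit <- suppressWarnings(glm(y ~ x, data = df1, family = binomial(link = "logit")))

# Step 3: inspect the standard errors and the number of Fisher scoring iterations.
se <- summary(fit)$coefficients[, "Std. Error"]
print(se)
print(fit$iter)
```

For a model with several categorical predictors, the same xtabs() check can be run for each predictor against the outcome before fitting.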
I want to do single df orthogonal contrast in anova (fixed or mixed model). Here is just example:
require(nlme)
data (Alfalfa)
Variety: a factor with levels Cossack, Ladak, and Ranger
Date : a factor with levels None S1 S20 O7
Block: a factor with levels 1 2 3 4 5 6
Yield : a numeric vector
These data are described in Snedecor and Cochran (1980) as an example
of a split-plot design. The treatment structure used in the experiment
was a 3 x 4 full factorial, with three varieties of alfalfa and four
dates of third cutting in 1943. The experimental units were arranged
into six blocks, each subdivided into four plots. The varieties of alfalfa
(Cossac, Ladak, and Ranger) were assigned randomly to the blocks and
the dates of third cutting (None, S1—September 1, S20—September 20,
and O7—October 7) were randomly assigned to the plots.
All four dates were used on each block.
model<-with (Alfalfa, aov(Yield~Variety*Date +Error(Block/Date/Variety)))
> summary(model)
Error: Block
Df Sum Sq Mean Sq F value Pr(>F)
Residuals 5 4.15 0.83
Error: Block:Date
Df Sum Sq Mean Sq F value Pr(>F)
Date 3 1.9625 0.6542 17.84 3.29e-05 ***
Residuals 15 0.5501 0.0367
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Error: Block:Date:Variety
Df Sum Sq Mean Sq F value Pr(>F)
Variety 2 0.1780 0.08901 1.719 0.192
Variety:Date 6 0.2106 0.03509 0.678 0.668
Residuals 40 2.0708 0.05177
I want to perform some comparison (orthogonal contrasts within a group), for example for date, two contrasts:
(a) S1 vs others (S20 O7)
(b) S20 vs O7
For variety factor two contrasts:
(c) Cossack vs others (Ladak and Ranger)
(d) Ladak vs Ranger
Thus the anova output would look like:
Error: Block
Df Sum Sq Mean Sq F value Pr(>F)
Residuals 5 4.15 0.83
Error: Block:Date
Df Sum Sq Mean Sq F value Pr(>F)
Date 3 1.9625 0.6542 17.84 3.29e-05 ***
(a) S1 vs others ? ?
(b) S20 vs O7 ? ?
Residuals 15 0.5501 0.0367
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Error: Block:Date:Variety
Df Sum Sq Mean Sq F value Pr(>F)
Variety 2 0.1780 0.08901 1.719 0.192
(c) Cossack vs others ? ? ?
(d) Ladak vs Ranger ? ? ?
Variety:Date 6 0.2106 0.03509 0.678 0.668
Residuals 40 2.0708 0.05177
How can I perform this?
First of all, why use ANOVA? You can use lme from the nlme package and in addition to the hypothesis tests aov gives you, you also get interpretable estimates of the effect sizes and the directions of the effects. At any rate, two approaches come to mind:
Specify contrasts on the variables manually, as explained here.
Install the multcomp package and use glht.
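The first approach (setting contrasts manually) might look like the following sketch. It uses simulated yields in a simplified randomized-block layout with a single error stratum, not the Alfalfa split-plot itself, and adds a "None vs rest" contrast as an assumption so that the three contrasts for the four-level Date factor form a complete orthogonal set. The split argument of summary() then reports one single-df line per contrast.

```r
set.seed(7)

# Hypothetical balanced layout: 6 blocks x 4 cutting dates, simulated yields.
date_levels <- c("None", "S1", "S20", "O7")
d <- expand.grid(Block = 1:6, Date = date_levels, stringsAsFactors = FALSE)
d$Block <- factor(d$Block)
d$Date <- factor(d$Date, levels = date_levels)
d$Yield <- rnorm(nrow(d), mean = 1.5, sd = 0.2)

# Orthogonal contrasts; rows follow the level order None, S1, S20, O7.
date_contrasts <- cbind(
  "None vs rest" = c(3, -1, -1, -1),
  "S1 vs S20+O7" = c(0,  2, -1, -1),
  "S20 vs O7"    = c(0,  0,  1, -1)
)
contrasts(d$Date) <- date_contrasts

fit <- aov(Yield ~ Block + Date, data = d)

# split= maps each contrast column to its own single-df line in the table.
tab <- summary(fit, split = list(Date = list("None vs rest" = 1,
                                             "S1 vs S20+O7" = 2,
                                             "S20 vs O7"    = 3)))[[1]]
print(tab)
```

Because the contrasts are orthogonal and the design is balanced, the single-df sums of squares add up to the overall Date sum of squares. Whether the same split behaves as expected across the strata of an Error() model is something to verify on the real data.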
glht is a little opinionated about models that are multivariate in their predictors. Long story short, though, if you were to create a diagonal matrix cm0 with the same dimensions and dimnames as the vcov of your model (let's assume it's an lme fit called model0), then summary(glht(model0,linfct=cm0)) should give the same estimates, SEs, and test statistics as summary(model0)$tTable (but incorrect p-values). Now, if you mess around with linear combinations of rows from cm0 and create new matrices with the same number of columns as cm0 but these linear combinations as rows, you'll eventually figure out the pattern to creating a matrix that will give you the intercept estimate for each cell (check it against predict(model0,level=0)). Now, another matrix with differences between various rows of this matrix will give you corresponding between-group differences. The same approach but with numeric values set to 1 instead of 0 can be used to get the slope estimates for each cell. Then the differences between these slope estimates can be used to get between-group slope differences.
Three things to keep in mind:
As I said, the p-values are going to be wrong for models other than lm, (possibly, haven't tried) aov, and certain survival models. This is because glht assumes a z distribution instead of a t distribution by default (except for lm). To get correct p-values, take the test statistic glht calculates and manually do 2*pt(abs(STAT),df=DF,lower.tail=FALSE) to get the two-tailed p-value, where STAT is the test statistic returned by glht and DF is the df from the corresponding type of default contrast in summary(model0)$tTable.
Your contrasts probably no longer test independent hypotheses, and multiple testing correction is necessary, if it wasn't already. Run the p-values through p.adjust.
This is my own distillation of a lot of handwaving from professors and colleagues, and a lot of reading of CrossValidated and Stack Overflow on related topics. I could be wrong in multiple ways, and if I am, hopefully someone more knowledgeable will correct us both.
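The manual p-value correction and the multiple-testing step can be sketched in base R as follows. The test statistics and degrees of freedom below are hypothetical placeholders, standing in for what glht and summary(model0)$tTable would return, and Holm is just one of the methods p.adjust offers.

```r
# Hypothetical test statistics, as glht might return for three contrasts.
stat <- c(2.5, -1.2, 3.1)
# Hypothetical denominator df, as would be read off summary(model0)$tTable.
df_resid <- 15

# p-values under the z distribution that glht assumes by default.
p_z <- 2 * pnorm(abs(stat), lower.tail = FALSE)

# Two-tailed p-values recomputed from the t distribution with the model's df.
p_t <- 2 * pt(abs(stat), df = df_resid, lower.tail = FALSE)

# Multiple-testing correction across the contrasts.
p_adj <- p.adjust(p_t, method = "holm")

print(round(cbind(p_z, p_t, p_adj), 4))
```

The t-based p-values are always somewhat larger than the z-based ones, and the gap grows as the degrees of freedom shrink.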