Having computed a Bray-Curtis dissimilarity matrix from my Hellinger-transformed data (26 samples, 3000+ species/OTUs), I went on to build an MDS plot.
I got the following metrics:
Dimensions: 2
Stress: 0.111155
Stress type 1, weak ties
Two convergent solutions found after 2 tries
Scaling: centring, PC rotation, halfchange scaling
Species: expanded scores based on ‘ALG_Hellinger’
However, the corresponding Shepard's plot looked as follows:
Although it achieves a good fit, it seems as if the BC dissimilarity does not have enough resolution to differentiate across samples. Is this right?
Testing it with ANOSIM, I got the following:
ANOSIM statistic R: 1
Significance: 0.001
Permutation: free
Number of permutations: 999
Upper quantiles of permutations (null model):
90% 95% 97.5% 99%
0.123 0.166 0.203 0.249
Dissimilarity ranks between and within classes:
0% 25% 50% 75% 100% N
Between 97 154.0 212.0 266.50 325 229
Cliona celata complex 19 32.0 46.0 59.00 66 21
Cliona viridis 3 26.5 37.5 48.50 60 6
Dysidea fragilis 56 56.5 57.0 59.50 62 3
Phorbas fictitius 1 18.5 48.5 79.75 96 66
And adonis told me the same:
Permutation: free
Number of permutations: 999
Terms added sequentially (first to last)
Df SumsOfSqs MeanSqs F.Model R2 Pr(>F)
SCIE_NAME 3 7.8738 2.62461 43.049 0.85445 0.001 ***
Residuals 22 1.3413 0.06097 0.14555
Total 25 9.2151 1.00000
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
That is, the differences among the samples are significant, but the MDS ordination seems somewhat misleading.
How can I test other aspects of the MDS, or change anything about this analysis, if that is even needed?
Thank you in advance!
André
I don't think that the Shepard plot is poor. Rather, it shows that your data are strongly clustered. This is consistent with adonis, which says that most (85%) of the variation is between clusters. It is also consistent with anosim, which shows that within-cluster distances are much shorter than between-cluster distances.
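To see concretely why strongly clustered data produce a Shepard plot with good fit but little mid-range resolution, here is a base-R sketch on a toy matrix (hand-rolled equivalents of what vegan's decostand(x, "hellinger") and vegdist(x, "bray") compute; not your data). Identical samples get dissimilarity 0 and samples sharing no species get 1, so the dissimilarities pile up near the extremes:

```r
# Toy abundance matrix (3 samples x 4 species); not the asker's data
x <- rbind(s1 = c(10, 0, 5, 5),
           s2 = c(10, 0, 5, 5),
           s3 = c(0, 20, 0, 0))

# Hellinger transform: square root of relative abundances per sample
hellinger <- sqrt(x / rowSums(x))

# Bray-Curtis dissimilarity between two sample rows
bray <- function(a, b) sum(abs(a - b)) / sum(a + b)

bray(hellinger["s1", ], hellinger["s2", ])  # identical samples -> 0
bray(hellinger["s1", ], hellinger["s3", ])  # no shared species -> 1
```

With 26 real samples falling into a few tight clusters, most within-cluster pairs sit near 0 and most between-cluster pairs near 1, which is exactly the banded Shepard plot described above.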
I have activity budget data from wild orangutans, for which I am investigating whether there is a difference in the time they spend feeding, resting and travelling before and after a forest fire event. I am running a linear mixed-effects model with the minutes spent feeding on a particular day as my response variable (with the number of minutes the orangutan is awake as an offset). Fire period and age/sex class are fixed effects, and orangutan ID is the random effect.
I have 2 levels of the fire_time factor ('pre' and 'post'), 4 levels of the Age_Sex factor ('SAF', 'FM', 'UFM', 'Adolescent'), 47 orangutans for the random effect and a total of 817 datapoints in this dataset.
My dataframe looks like this:
head(F)
Follow_num Ou_name Date Month fire_time Age_Sex Primary_Act AP_obs minutesin24hr Perc_of_waking_day Perc_of_24hr
1 2029 Teresia 2011-10-04 Oct-11 pre SAF Feeding 625 310 49.60 21.53
5 2030 Teresia 2011-10-05 Oct-11 pre SAF Feeding 610 285 46.72 19.79
9 2032 Teresia 2011-10-09 Oct-11 pre SAF Feeding 620 340 54.84 23.61
13 2034 Teresia 2011-10-11 Oct-11 pre SAF Feeding 670 405 60.45 28.13
17 2038 Victor 2011-10-27 Oct-11 pre FM Feeding 675 155 22.96 10.76
21 2040 Nero 2011-11-03 Nov-11 pre FM Feeding 640 295 46.09 20.49
The code for my model is as follows:
library(lme4)
lmer(minutesin24hr ~ Age_Sex + fire_time + (1|Ou_name), data = F, offset = AP_obs, REML = TRUE, na.action = "na.fail")
When I run this model using the lmerTest package to check degrees of freedom and p-values, it seems I have very large degrees of freedom for the levels that are significant (see Age_SexSAF and fire_timepre).
lmerTestmodel <- lmerTest::lmer(minutesin24hr ~ Age_Sex + fire_time + (1|Ou_name), data = F, offset = AP_obs, REML = TRUE, na.action = "na.fail")
REML criterion at convergence: 9370.7
Scaled residuals:
Min 1Q Median 3Q Max
-3.8955 -0.6304 0.1006 0.7141 2.3109
Random effects:
Groups Name Variance Std.Dev.
Ou_name (Intercept) 1636 40.44
Residual 5460 73.89
Number of obs: 817, groups: Ou_name, 47
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) -188.614 14.711 26.765 -12.821 6.14e-13 ***
Age_SexFM -20.297 17.978 24.696 -1.129 0.2698
Age_SexSAF -25.670 11.799 318.473 -2.176 0.0303 *
Age_SexUFM 12.925 22.806 27.319 0.567 0.5755
fire_timepre -29.558 6.214 709.117 -4.757 2.38e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) Ag_SFM A_SSAF A_SUFM
Age_SexFM -0.741
Age_SexSAF -0.505 0.374
Age_SexUFM -0.598 0.480 0.302
fire_timepr -0.298 -0.015 0.149 0.034
I imagine these large degrees of freedom are making the p-values significant, so I am sceptical about the model. Why am I getting such large degrees of freedom for just these two levels? There are more data in the Age_SexSAF and fire_timepre levels, but it still doesn't seem normal to me.
I am planning on reporting the estimates, confidence intervals and p-values in my thesis, but am concerned about reporting them if these degrees of freedom are wrong.
Apologies if this may be a naïve question, this is the first time I have ventured into mixed effects models. Any advice is greatly appreciated, thanks!
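One side note on the offset, separate from the degrees-of-freedom question: in a Gaussian (identity-link) model like lmer's, an offset is simply subtracted from the response, so this model effectively analyses minutesin24hr - AP_obs rather than a rate. A base-R sketch with lm on simulated stand-in data (variable names made up; the same algebra carries over to lmer):

```r
set.seed(1)
# Simulated stand-ins: z plays the role of AP_obs, y of minutesin24hr
n <- 50
x <- rnorm(n)
z <- runif(n, 500, 700)
y <- 50 + 2 * x + z + rnorm(n)

f1 <- lm(y ~ x, offset = z)  # identity-link offset...
f2 <- lm(I(y - z) ~ x)       # ...is the same as shifting the response

all.equal(coef(f1), coef(f2))  # TRUE
```

If the intent was "proportion of waking time spent feeding", a rate-style model (or the percentage columns already in the dataframe) may be closer to what is wanted than an identity-link offset.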
I'm working on a publication examining rising authorship of minorities for certain articles. There is a clear increasing trend, but I wanted to apply some statistical rigor. My data frame is simple: Years, and % minority authorship. However, Cochran-Armitage input dataframe doesn't make sense in my context. Am I using the right test?
I have prepared the dataframe by putting the years on the x-axis and the % minority authorship on the y-axis: essentially 1 row and 10 columns (each column representing one year). However, Cochran-Armitage cannot accept 1-row dataframes.
My dataframe looks like this:
year 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018
race 11.1 12.1 14.2 15.2 19.2 20.5 21.8 27.9 30.1 31.1
The Cochran-Armitage test is likely the wrong test: it is for association between a variable with 2 categories and an ordinal variable with K categories. You have two variables, each with one category.
I think a simple linear regression would work. In fact, when you run one on the data you provided (you are missing the % for 2018, so I removed that row), this is what you get:
> summary(y_p)
Call:
lm(formula = year ~ percent, data = y_p)
Residuals:
Min 1Q Median 3Q Max
-0.77079 -0.38560 -0.03582 0.35535 0.90139
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.004e+03 5.428e-01 3692.52 < 2e-16 ***
percent 4.045e-01 2.526e-02 16.01 2.32e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5586 on 8 degrees of freedom
Multiple R-squared: 0.9697, Adjusted R-squared: 0.966
F-statistic: 256.4 on 1 and 8 DF, p-value: 2.319e-07
This looks fairly significant to me, but you would need to check the residuals etc to be sure.
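For completeness, the fit above can be reproduced in a few lines of base R from the data in the question (2018 dropped, as noted). The output regresses year on percent; for a trend test it would be more conventional to flip this to percent ~ year, which gives the same t statistic and p-value for the slope, since both equal the t test of the correlation:

```r
# Data from the question (2018 omitted because its percentage is missing)
y_p <- data.frame(year = 2008:2017,
                  percent = c(11.1, 12.1, 14.2, 15.2, 19.2,
                              20.5, 21.8, 27.9, 30.1, 31.1))

fit <- lm(year ~ percent, data = y_p)  # same orientation as the output above
coef(fit)[["percent"]]                 # ~0.4045, matching the summary

f2 <- lm(percent ~ year, data = y_p)   # flipped, more natural orientation
# both slopes share the same t statistic: t = r * sqrt(n - 2) / sqrt(1 - r^2)
```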
Good morning,
I am having trouble understanding some of my outputs for my Kaplan Meier analyses.
I have managed to produce the following plots and outputs using ggsurvplot and survfit.
I first made a plot of survival time for 55 nests over time, and then did the same with the top predictors of nest failure, one being microtopography, as seen in this example.
Call: npsurv(formula = (S) ~ 1, data = nestdata, conf.type = "log-log")
26 observations deleted due to missingness
records n.max n.start events median 0.95LCL 0.95UCL
55 45 0 13 29 2 NA
Call: npsurv(formula = (S) ~ Microtopography, data = nestdata, conf.type = "log-log")
29 observations deleted due to missingness
records n.max n.start events median 0.95LCL 0.95UCL
Microtopography=0 14 13 0 1 NA NA NA
Microtopography=1 26 21 0 7 NA 29 NA
Microtopography=2 12 8 0 5 3 2 NA
So, I have two primary questions.
1. The survival curves are for a ground-nesting bird with an egg incubation time of 21-23 days. Incubation time is the number of days the hen sits on the eggs before they hatch. Knowing that, how is it possible that the median survival time in plot #1 is 29 days? It seems to fit with the literature I have read on this same species; however, I assume it has something to do with the left censoring in my models, but I am honestly at a loss. If anyone has any insight, or even any literature that could help me understand this concept, I would really appreciate it.
I am also wondering how I can compare median survival times for the 2nd plot. Because microtopography survival curves 1 and 2 never cross the 0.5 point, the median survival times returned are NA. I understand I can choose another interval, such as 0.75, but in this example that still wouldn't help me, because microtopography 0 never drops below 0.9 or so. How would one go about reporting these data? Would the workaround be to choose a survival interval, using:
summary(s,times=c(7,14,21,29))
Call: npsurv(formula = (S) ~ Microtopography, data = nestdata,
conf.type =
"log-log")
29 observations deleted due to missingness
Microtopography=0
time n.risk n.event censored survival std.err lower 95% CI upper 95% CI
7 3 0 0 1.000 0.0000 1.000 1.000
14 7 0 0 1.000 0.0000 1.000 1.000
21 13 0 0 1.000 0.0000 1.000 1.000
29 8 1 5 0.909 0.0867 0.508 0.987
Microtopography=1
time n.risk n.event censored survival std.err lower 95% CI upper 95% CI
7 9 0 0 1.000 0.0000 1.000 1.000
14 17 1 0 0.933 0.0644 0.613 0.990
21 21 3 0 0.798 0.0909 0.545 0.919
29 15 3 7 0.655 0.1060 0.409 0.819
Microtopography=2
time n.risk n.event censored survival std.err lower 95% CI upper 95% CI
7 1 2 0 0.333 0.272 0.00896 0.774
14 7 1 0 0.267 0.226 0.00968 0.686
21 8 1 0 0.233 0.200 0.00990 0.632
29 3 1 5 0.156 0.148 0.00636 0.504
Late to the party...
The median survival time of 29 days is the median incubation time that birds of this species are expected to spend in the egg until they hatch, based on your data. Your median of 21-23 days (based on ?) is probably based on many experiments/studies of eggs that have hatched, ignoring those that haven't hatched yet (those that failed?).
From your overall survival curve, it is clear that some eggs have not yet hatched, even after more than 35 days. These are taken into account when calculating the expected survival times. If you think that these eggs will fail, then omit them. Otherwise, the software cannot possibly know that they will eventually fail. But how can anyone know for sure if an egg is going to fail, even after 30 days? Is there a known maximum hatching time? The record-breaker of all hatched eggs?
These are not really R questions, so this question might be more appropriate for the statistics site. But the following might help.
how is it possible that the median survival time in plot #1 is 29 days?
The median survival is where the survival curve passes the 50% mark. Eyeballing it, 29 days looks right.
I am also wondering how I can compare median survival times for the 2nd plot. Because microtopography survival curves 1 and 2 never cross the 0.5 point.
Given your data, you cannot compare the median. You can compare the 75% or 90%, if you must. You can compare the point survival at, say, 30 days. You can compare the truncated average survival in the first 30 days.
In order to compare the median, you would have to make an assumption. A reasonable assumption would be an exponential decay after some tenure point that includes at least one failure.
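To make the "median is where the curve crosses 0.5" rule concrete, here is a hand-rolled Kaplan-Meier sketch in base R on toy data (not the nest data); survival::survfit reports NA for the median in exactly the same never-crosses situation as microtopography 0 and 1:

```r
# Hand-rolled Kaplan-Meier estimator: S(t) is the product of
# (1 - d_i / n_i) over the distinct event times up to t
km <- function(time, event) {
  ut <- sort(unique(time[event == 1]))                       # event times
  n  <- sapply(ut, function(t) sum(time >= t))               # number at risk
  d  <- sapply(ut, function(t) sum(time == t & event == 1))  # events at t
  data.frame(time = ut, surv = cumprod(1 - d / n))
}

# The median is the first time the curve reaches 0.5 or below;
# if it never does, the median is undefined and reported as NA
km_median <- function(fit) {
  hit <- which(fit$surv <= 0.5)
  if (length(hit)) fit$time[hit[1]] else NA_real_
}

km_median(km(c(1, 2, 3, 4, 5), c(1, 1, 1, 1, 1)))  # 3
km_median(km(c(1, 2, 3, 4), c(1, 0, 0, 0)))        # NA: curve stays above 0.5
```

Swapping 0.5 for 0.75 or 0.9 in km_median gives the higher quantiles mentioned above, which is the same workaround survfit offers via its quantile method.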
Sometimes your research may predict that the size of a regression coefficient may vary across groups. For example, you might believe that the regression coefficient of height predicting weight would differ across three age groups (young, middle age, senior citizen). Below, we have a data file with 3 fictional young people, 3 fictional middle age people, and 3 fictional senior citizens, along with their height and their weight. The variable age indicates the age group and is coded 1 for young people, 2 for middle aged, and 3 for senior citizens.
So, how can I compare regression coefficients (slope mainly) across three (or more) groups using R?
Sample data:
age height weight
1 56 140
1 60 155
1 64 143
2 56 117
2 60 125
2 64 133
3 74 245
3 75 241
3 82 269
There is an elegant answer to this on CrossValidated.
But briefly,
require(emmeans)
data <- data.frame(age = factor(c(1,1,1,2,2,2,3,3,3)),
height = c(56,60,64,56,60,64,74,75,82),
weight = c(140,155,142,117,125,133,245,241,269))
model <- lm(weight ~ height*age, data)
anova(model) #check the results
Analysis of Variance Table
Response: weight
Df Sum Sq Mean Sq F value Pr(>F)
height 1 25392.3 25392.3 481.5984 0.0002071 ***
age 2 2707.4 1353.7 25.6743 0.0129688 *
height:age 2 169.0 84.5 1.6027 0.3361518
Residuals 3 158.2 52.7
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
slopes <- emtrends(model, 'age', var = 'height') #gets each slope
slopes
age height.trend SE df lower.CL upper.CL
1 0.25 1.28 3 -3.84 4.34
2 2.00 1.28 3 -2.09 6.09
3 3.37 1.18 3 -0.38 7.12
Confidence level used: 0.95
pairs(slopes) #gets their comparisons two by two
contrast estimate SE df t.ratio p.value
1 - 2 -1.75 1.82 3 -0.964 0.6441
1 - 3 -3.12 1.74 3 -1.790 0.3125
2 - 3 -1.37 1.74 3 -0.785 0.7363
P value adjustment: tukey method for comparing a family of 3 estimates
To determine whether the regression coefficients "differ across three age groups" we can use the anova function in R. For example, using the data in the question, shown reproducibly in the note at the end:
fm1 <- lm(weight ~ height, DF)
fm3 <- lm(weight ~ age/(height - 1), DF)
giving the following, which is significant at the 2.7% level, so we would conclude that there are differences in the regression coefficients of the groups if we were using a 5% cutoff, but not if we were using a 1% cutoff. The age/(height - 1) in the formula for fm3 means that height is nested within age and the overall intercept is omitted. Thus the model estimates separate intercepts and slopes for each age group. This is equivalent to age + age:height - 1.
> anova(fm1, fm3)
Analysis of Variance Table
Model 1: weight ~ height
Model 2: weight ~ age/(height - 1)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 7 2991.57
2 3 149.01 4 2842.6 14.307 0.02696 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note 1: Above fm3 has 6 coefficients, an intercept and slope for each group. If you want 4 coefficients, a common intercept and separate slopes, then use
lm(weight ~ age:height, DF)
Note 2: We can also compare a model in which subsets of levels are the same. For example, we can compare a model in which ages 1 and 2 are the same to models in which they are all the same (fm1) and all different (fm3):
fm2 <- lm(weight ~ age/(height - 1), transform(DF, age = factor(c(1, 1, 3)[age])))
anova(fm1, fm2, fm3)
If you do a large number of tests you can get significance on some just by chance so you will want to lower the cutoff for p values.
Note 3: There are some notes on lm formulas here: https://sites.google.com/site/r4naturalresources/r-topics/fitting-models/formulas
Note 4: We used this as the input:
Lines <- "age height weight
1 56 140
1 60 155
1 64 143
2 56 117
2 60 125
2 64 133
3 74 245
3 75 241
3 82 269"
DF <- read.table(text = Lines, header = TRUE)
DF$age <- factor(DF$age)
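As a quick base-R sanity check that fm3's coefficients really contain one slope per age group (restating DF from Note 4 so the snippet stands alone), the nested-model slopes equal those from separate per-group regressions:

```r
# Rebuild DF as in Note 4 so this snippet is self-contained
Lines <- "age height weight
1 56 140
1 60 155
1 64 143
2 56 117
2 60 125
2 64 133
3 74 245
3 75 241
3 82 269"
DF <- read.table(text = Lines, header = TRUE)
DF$age <- factor(DF$age)

fm3 <- lm(weight ~ age/(height - 1), DF)

# The nested parameterisation yields one slope coefficient per group,
# with names ending in ":height"
slopes <- coef(fm3)[grep(":height", names(coef(fm3)))]

# They match the slopes of separate per-group regressions
by_group <- sapply(split(DF, DF$age),
                   function(d) coef(lm(weight ~ height, d))[["height"]])
```

This is why anova(fm1, fm3) is a joint test of "same line for everyone" against "own intercept and slope per group".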
I'm quite new to R, but I recently tried to run a two-way repeated-measures ANOVA to replicate the results my supervisor got in SPSS.
I've struggled for days and read dozens of articles to understand what is going on in R, but I still don't get the same results.
> mod <- lm(Y~A*B)
> Anova(mod, type="III")
Anova Table (Type III tests)
Response: Y
Sum Sq Df F value Pr(>F)
(Intercept) 0.000 1 0.0000 1.00000
A 2.403 5 8.6516 4.991e-08 ***
B 0.403 2 3.6251 0.02702 *
A:B 1.220 10 2.1962 0.01615 *
Residuals 51.987 936
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
My data are from a balanced design, and I used Type III SS since that is what SPSS uses as well. The sums of squares, the Df, and the linear model are the same as in SPSS; the only things that differ are the F and p values. Thus, it should not be a sum-of-squares mistake.
Results in SPSS are:
F Sig.
A 7.831 .000
B 2.681 .073
A:B 2.247 .014
I'm a little bit lost. Could it be a problem related to the contrasts?
Lucas
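Two things commonly cause this kind of SPSS/R mismatch. First, lm ignores the repeated-measures structure altogether; SPSS's repeated-measures ANOVA uses within-subject error strata (in R, aov with an Error(subject/...) term, or a mixed model). Second, Type III tests only match SPSS under sum-to-zero contrasts; R's default treatment contrasts give different Type III F values for main effects when an interaction is present, which fits the contrasts suspicion. A minimal sketch of the contrasts point on simulated (not the original) data, using base R's drop1 for marginal tests:

```r
set.seed(42)
# Toy balanced 6 x 3 design with replicates (simulated, not the original data)
d <- expand.grid(A = factor(1:6), B = factor(1:3), rep = 1:52)
d$Y <- rnorm(nrow(d))

# SPSS's Type III tests assume sum-to-zero contrasts; R defaults to treatment
op <- options(contrasts = c("contr.sum", "contr.poly"))
mod <- lm(Y ~ A * B, data = d)
dr <- drop1(mod, . ~ ., test = "F")  # marginal (Type III) F tests for A, B, A:B
options(op)                          # restore the previous contrasts
dr
```

With the car package installed, car::Anova(mod, type = "III") after setting contr.sum should agree with these marginal tests; running it under the default treatment contrasts is the classic way to get F values that disagree with SPSS.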