> data(dune)
> data(dune.env)
> str(dune.env)
'data.frame': 20 obs. of 5 variables:
$ A1 : num 2.8 3.5 4.3 4.2 6.3 4.3 2.8 4.2 3.7 3.3 ...
$ Moisture : Ord.factor w/ 4 levels "1"<"2"<"4"<"5": 1 1 2 2 1 1 1 4 3 2 ...
$ Management: Factor w/ 4 levels "BF","HF","NM",..: 4 1 4 4 2 2 2 2 2 1 ...
$ Use : Ord.factor w/ 3 levels "Hayfield"<"Haypastu"<..: 2 2 2 2 1 2 3 3 1 1 ...
$ Manure : Ord.factor w/ 5 levels "0"<"1"<"2"<"3"<..: 5 3 5 5 3 3 4 4 2 2 ...
As shown above, Moisture has four groups, Management has four groups, and Manure has five groups. When I run:
adonis(dune ~ Manure*Management*A1*Moisture, data=dune.env, permutations=99)
Call:
adonis(formula = dune ~ Manure * Management * A1 * Moisture, data = dune.env, permutations = 99)
Permutation: free
Number of permutations: 99
Terms added sequentially (first to last)
                  Df SumsOfSqs MeanSqs F.Model      R2 Pr(>F)
Manure             4    1.5239 0.38097 2.03088 0.35447   0.13
Management         2    0.6118 0.30592 1.63081 0.14232   0.16
A1                 1    0.3674 0.36743 1.95872 0.08547   0.21
Moisture           3    0.6929 0.23095 1.23116 0.16116   0.33
Manure:Management  1    0.1091 0.10906 0.58138 0.02537   0.75
Manure:A1          4    0.3964 0.09909 0.52826 0.09220   0.91
Management:A1      1    0.1828 0.18277 0.97431 0.04251   0.50
Manure:Moisture    1    0.0396 0.03963 0.21126 0.00922   0.93
Residuals          2    0.3752 0.18759         0.08727
Total             19    4.2990                 1.00000
Why is the Df of Management not 3 (4 - 1)?
This is a general, rather than a specific answer.
Your formula Moisture*Management*A1*Manure corresponds to a linear model with 160 (!) predictors (2*4*4*5):
dim(model.matrix(~Moisture*Management*A1*Manure, dune.env))
adonis builds this model matrix internally and uses it to construct the machinery for calculating the permutation statistics. When there are multicollinear combinations of predictors, it drops enough columns to make the problem well-defined again. The detailed rules for which columns get dropped depend on the order of the columns; if you reorder the factors in your question you'll see the reported Df change.
For what it's worth, I don't think the df calculations change the statistical outcomes at all: the statistics are based on the distributions derived from permutations, not on an analytical calculation that depends on the df.
Ben Bolker got it right. If you only look at Management and Manure and forget all other variables, you will see this:
> with(dune.env, table(Management, Manure))
          Manure
Management 0 1 2 3 4
        BF 0 2 1 0 0
        HF 0 1 2 2 0
        NM 6 0 0 0 0
        SF 0 0 1 2 3
Look at row Management NM and column Manure 0: each has only one non-zero cell, and it is the same cell. This means that Management NM and Manure 0 are synonyms, the same thing (or "aliased"). Once Manure is in the model, Management adds only three new levels, hence 2 d.f. If you reverse the order and enter Management first, only four levels of Manure remain unknown, which gives 3 d.f. for Manure.
Although you really have overparametrized your model, you would also get the same result with only these two variables. Compare models:
adonis2(dune ~ Manure + Management, data=dune.env)
adonis2(dune ~ Management + Manure, data=dune.env)
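The order dependence can also be checked without vegan. The following is a minimal sketch in base R, assuming only the cell counts from the cross-table above; `toy` and `seq_df` are hypothetical names introduced for illustration, and Manure is treated as an unordered factor here (the contrast coding changes, the ranks do not).

```r
# Toy reconstruction of the Management x Manure layout; the counts are copied
# from the cross-table above, this is not the full dune.env data set.
counts <- matrix(c(0, 2, 1, 0, 0,
                   0, 1, 2, 2, 0,
                   6, 0, 0, 0, 0,
                   0, 0, 1, 2, 3),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(Management = c("BF", "HF", "NM", "SF"),
                                 Manure = as.character(0:4)))
cells <- as.data.frame(as.table(counts), stringsAsFactors = TRUE)
toy <- cells[rep(seq_len(nrow(cells)), times = cells$Freq),
             c("Management", "Manure")]

# seq_df() is a hypothetical helper: the Df of each term is the increase in
# the rank of the model matrix when that term's columns are added, in
# formula order.
seq_df <- function(formula, data) {
  X <- model.matrix(formula, data)
  term_of_col <- attr(X, "assign")  # 0 = intercept, 1 = first term, ...
  term_labels <- attr(terms(formula), "term.labels")
  ranks <- vapply(seq_along(term_labels),
                  function(k) qr(X[, term_of_col <= k, drop = FALSE])$rank,
                  integer(1))
  setNames(diff(c(1L, ranks)), term_labels)
}

seq_df(~ Manure + Management, toy)  # Manure entered first
seq_df(~ Management + Manure, toy)  # Management entered first
```

With Manure first the helper should report 4 and 2 d.f., and with Management first 3 and 3 d.f., which is the pattern described above.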
I'm performing a count data analysis in R on a data set called 'doctor':
V2 V3 L4 V5
1 1 32 10.866795 1
2 2 104 10.674706 1
3 3 206 10.261581 1
4 4 186 9.446440 1
5 5 102 8.578665 1
6 1 2 9.841080 2
7 2 12 9.275472 2
8 3 28 8.649974 2
9 4 28 7.857481 2
10 5 31 7.287561 2
Stepwise AIC selected V3~V2+L4+V5+V2:L4:V5 as the best model. Now I want to set L4 as the offset and fit a negative binomial regression including the interaction, so I used nbinom=glm.nb(V3~V2+V5+V2:V5,offset=L4), but I get the error message Error in glm.control(...) : unused argument (offset = L4). What have I done wrong here?
Offsets are entered using an offset term in the model formula:
nbinom <- glm.nb(V3 ~ V2 + V5 + V2:V5 + offset(L4), data = doctor)
Also you can use V2*V5 instead of V2+V5+V2:V5
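A minimal sketch of both points, assuming the ten rows shown in the question are the whole 'doctor' data set and that L4 is already on the log scale. The column names come from the question; `mf` and `off` are names introduced here for illustration.

```r
library(MASS)  # provides glm.nb

# The ten rows shown in the question.
doctor <- data.frame(
  V2 = rep(1:5, times = 2),
  V3 = c(32, 104, 206, 186, 102, 2, 12, 28, 28, 31),
  L4 = c(10.866795, 10.674706, 10.261581, 9.446440, 8.578665,
         9.841080, 9.275472, 8.649974, 7.857481, 7.287561),
  V5 = rep(1:2, each = 5)
)

# V2 * V5 expands to the same terms as V2 + V5 + V2:V5.
attr(terms(V3 ~ V2 * V5), "term.labels")

# offset(L4) in the formula is stored as the model offset, not as a predictor.
mf <- model.frame(V3 ~ V2 * V5 + offset(L4), data = doctor)
off <- model.offset(mf)

# Negative binomial fit with L4 as the offset (its coefficient is fixed at 1).
nbinom <- glm.nb(V3 ~ V2 * V5 + offset(L4), data = doctor)
coef(nbinom)  # no coefficient is estimated for L4
```

With only ten observations the fit may emit convergence warnings; the point of the sketch is where the offset goes, not the estimates.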
I have produced a logistic regression model in R using the logistf function from the logistf package due to quasi-complete separation. I get the error message:
Error in solve.default(object$var[2:(object$df + 1), 2:(object$df + 1)]) :
system is computationally singular: reciprocal condition number = 3.39158e-17
The data is structured as shown below, though a lot of the data has been cut here. Numbers represent levels (i.e., 1 = very low, 5 = very high), not count data. Variables OrdA to OrdH are ordered factors. The variable Binary is a factor.
OrdA OrdB OrdC OrdE OrdF OrdG OrdH Binary
1 3 4 1 1 2 1 1
2 3 4 5 1 3 1 1
1 3 2 5 2 4 1 0
1 1 1 1 3 1 2 0
3 2 2 2 1 1 1 0
I have read here that this can be caused by multicollinearity, but have tested this and it is not the problem.
VIFModel <- lm(Binary ~ OrdA + OrdB + OrdC + OrdD + OrdE +
OrdF + OrdG + OrdH, data = VIFdata)
vif(VIFModel)
     GVIF Df GVIF^(1/(2*Df))
OrdA 6.09  3            1.35
OrdB 3.50  2            1.37
OrdC 7.09  3            1.38
OrdD 6.07  2            1.57
OrdE 5.48  4            1.23
OrdF 3.05  2            1.32
OrdG 5.41  4            1.23
OrdH 3.03  2            1.31
The post also indicates that the problem can be caused by having "more variables than observations." However, I have 8 independent variables and 82 observations.
For context each independent variable is ordinal with 5 levels, and the binary dependent variable has 30% of the observations with "successes." I'm not sure if this could be associated with the issue. How do I fix this issue?
X <- model.matrix(Binary ~ OrdA+OrdB+OrdC+OrdD+OrdE+OrdF+OrdG+OrdH,
Data3); dim(X); Matrix::rankMatrix(X)
[1] 82 24
[1] 23
attr(,"method")
[1] "tolNorm2"
attr(,"useGrad")
[1] FALSE
attr(,"tol")
[1] 1.820766e-14
Short answer: your ordinal input variables are transformed to 24 predictor variables (number of columns of the model matrix), but the rank of your model matrix is only 23, so you do indeed have multicollinearity in your predictor variables. I don't know what vif is doing ...
You can use svd(X) to help figure out which components are collinear ...
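As a rough illustration of the svd(X) idea, here is a sketch on a hypothetical toy data set (not the poster's Data3) in which one level of A always coincides with one level of B. The names `toy`, `null_idx` and `involved`, and the tolerances, are assumptions made for this example.

```r
# Hypothetical toy data, not the poster's Data3: level "c" of A occurs
# exactly when level "z" of B occurs, so two dummy columns coincide.
toy <- data.frame(
  A = factor(c("a", "a", "b", "b", "c", "c", "c", "b")),
  B = factor(c("x", "y", "x", "y", "z", "z", "z", "x"))
)
X <- model.matrix(~ A + B, toy)

s <- svd(X)
# Singular values that are numerically zero mark linear dependencies.
null_idx <- which(s$d < 1e-10 * max(s$d))

# Columns with non-negligible weight in the matching right singular vector
# are the ones involved in each dependency.
involved <- lapply(null_idx, function(j) colnames(X)[abs(s$v[, j]) > 1e-8])
involved
```

On the real model matrix the same steps would be applied to X as built above; with 24 columns and rank 23 there should be a single near-zero singular value to inspect.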
How to include the interaction between a covariate and time for a non-proportional hazards model?
I often find that the proportional hazards assumption for the Cox regressions doesn’t hold.
Take the following data as an example.
> head(data2)
no np_p age_dx1 race1 mr_dx er_1 pr_1 sct_1 surv_mo km_stts1
1 20 1 2 4 1 2 2 4 52 1
2 33 1 3 1 2 1 2 1 11 1
3 67 1 2 4 4 1 1 3 20 1
4 90 1 3 1 3 3 3 2 11 1
5 143 1 2 4 3 1 1 2 123 0
6 180 1 3 1 3 1 1 2 9 1
First, I fitted a Cox regression model.
> fit2 <- coxph(Surv(surv_mo, km_stts1) ~ np_p + age_dx1 + race1 + mr_dx + er_1 + pr_1 + sct_1, data = data2)
Second, I assessed the proportional hazards assumption.
> check_PH2 <- cox.zph(fit2, transform = "km")
> check_PH2
             rho   chisq        p
np_p     0.00946  0.0748 7.84e-01
age_dx1 -0.00889  0.0640 8.00e-01
race1   -0.03148  0.7827 3.76e-01
mr_dx   -0.03120  0.7607 3.83e-01
er_1    -0.14741 18.5972 1.61e-05
pr_1     0.05906  2.9330 8.68e-02
sct_1    0.17651 23.8030 1.07e-06
GLOBAL        NA 53.2844 3.26e-09
So, this means that the hazards for er_1 and sct_1 are not proportional over time (right?).
I think I can include the interaction between each of these two covariates and time separately in the model, but I don't know how to do that in R.
Thank you.
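One possible approach, offered as a sketch rather than a definitive answer, is the time-transform mechanism in coxph: a tt() term in the formula together with a tt argument that defines the covariate-by-time interaction. The example below uses the veteran data set shipped with the survival package as a stand-in for data2, with karno standing in for a covariate such as er_1 or sct_1; the log(t) form of the interaction is an assumption, and categorical covariates would need more care.

```r
library(survival)

# Illustrative only: the veteran data shipped with survival stands in for
# data2, and karno stands in for a covariate such as er_1 or sct_1.
fit_tt <- coxph(Surv(time, status) ~ trt + karno + tt(karno),
                data = veteran,
                tt = function(x, t, ...) x * log(t))  # assumed form: x * log(time)
coef(fit_tt)
```

The tt(karno) coefficient describes how the effect of karno changes with log time; a value near zero would suggest little departure from proportional hazards for that covariate.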
I am trying to fit a parametric survival model. I think I managed to do so. However, I could not succeed in calculating the survival probabilities:
library(survival)
zaman <- c(65,156,100,134,16,108,121,4,39,143,56,26,22,1,1,5,65,
56,65,17,7,16,22,3,4,2,3,8,4,3,30,4,43)
test <- c(rep(1,17),rep(0,16))
WBC <- c(2.3,0.75,4.3,2.6,6,10.5,10,17,5.4,7,9.4,32,35,100,
100,52,100,4.4,3,4,1.5,9,5.3,10,19,27,28,31,26,21,79,100,100)
status <- c(rep(1,33))
data <- data.frame(zaman,test,WBC)
surv3 <- Surv(zaman[test==1], status[test==1])
fit3 <- survreg( surv3 ~ log(WBC[test==1]),dist="w")
On the other hand, no problem at all while calculating the survival probabilities using the Kaplan-Meier Estimation:
fit2 <- survfit(Surv(zaman[test==0], status[test==0]) ~ 1)
summary(fit2)$surv
Any idea why?
You can get predictions from a survreg object with predict. By default these are predicted survival times on the original time scale, not survival probabilities:
predict(fit3)
If you're interested in combining this with the original data, and also in the residuals and standard errors of the predictions, you can use the augment function in my broom package:
library(broom)
augment(fit3)
A full analysis might look something like:
library(survival)
library(broom)
data <- data.frame(zaman, test, WBC, status)
subdata <- data[data$test == 1, ]
fit3 <- survreg( Surv(zaman, status) ~ log(WBC), subdata, dist="w")
augment(fit3, subdata)
With the output:
zaman test WBC status .fitted .se.fit .resid
1 65 1 2.30 1 115.46728 43.913188 -50.467281
2 156 1 0.75 1 197.05852 108.389586 -41.058516
3 100 1 4.30 1 85.67236 26.043277 14.327641
4 134 1 2.60 1 108.90836 39.624106 25.091636
5 16 1 6.00 1 73.08498 20.029707 -57.084979
6 108 1 10.50 1 55.96298 13.989099 52.037022
7 121 1 10.00 1 57.28065 14.350609 63.719348
8 4 1 17.00 1 44.47189 11.607368 -40.471888
9 39 1 5.40 1 76.85181 21.708514 -37.851810
10 143 1 7.00 1 67.90395 17.911170 75.096054
11 56 1 9.40 1 58.99643 14.848751 -2.996434
12 26 1 32.00 1 32.88935 10.333303 -6.889346
13 22 1 35.00 1 31.51314 10.219871 -9.513136
14 1 1 100.00 1 19.09922 8.963022 -18.099216
15 1 1 100.00 1 19.09922 8.963022 -18.099216
16 5 1 52.00 1 26.09034 9.763728 -21.090343
17 65 1 100.00 1 19.09922 8.963022 45.900784
In this case, the .fitted column holds the predicted survival times (on the scale of zaman), not probabilities; .resid is the observed time minus the fitted time.
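To get survival probabilities from the Weibull fit itself, the fitted linear predictor and scale can be plugged into the Weibull survival function. The following is a minimal sketch, assuming the test == 1 subset from the question; `surv_prob` and the time point 52 are hypothetical choices for illustration.

```r
library(survival)

# The test == 1 subset from the question (first 17 observations).
subdata <- data.frame(
  zaman = c(65, 156, 100, 134, 16, 108, 121, 4, 39, 143, 56, 26, 22, 1, 1, 5, 65),
  WBC = c(2.3, 0.75, 4.3, 2.6, 6, 10.5, 10, 17, 5.4, 7, 9.4, 32, 35, 100, 100, 52, 100),
  status = 1
)
fit3 <- survreg(Surv(zaman, status) ~ log(WBC), data = subdata, dist = "weibull")

# survreg's Weibull parameterisation maps to pweibull() as
# shape = 1 / scale and scale = exp(linear predictor).
lp <- predict(fit3, type = "lp")
surv_prob <- function(t) {
  pweibull(t, shape = 1 / fit3$scale, scale = exp(lp), lower.tail = FALSE)
}

surv_prob(52)  # P(T > 52) for each of the 17 subjects
```

As a sanity check, evaluating surv_prob at each subject's predicted median, predict(fit3, type = "quantile", p = 0.5), should give 0.5 for every subject.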