Why does the adonis function's Df change with different factor combinations?

> data(dune)
> data(dune.env)
> str(dune.env)
'data.frame': 20 obs. of 5 variables:
$ A1 : num 2.8 3.5 4.3 4.2 6.3 4.3 2.8 4.2 3.7 3.3 ...
$ Moisture : Ord.factor w/ 4 levels "1"<"2"<"4"<"5": 1 1 2 2 1 1 1 4 3 2 ...
$ Management: Factor w/ 4 levels "BF","HF","NM",..: 4 1 4 4 2 2 2 2 2 1 ...
$ Use : Ord.factor w/ 3 levels "Hayfield"<"Haypastu"<..: 2 2 2 2 1 2 3 3 1 1 ...
$ Manure : Ord.factor w/ 5 levels "0"<"1"<"2"<"3"<..: 5 3 5 5 3 3 4 4 2 2 ...
As shown above, Moisture has four groups, Management has four groups, and Manure has five groups. But when I run:
adonis(dune ~ Manure*Management*A1*Moisture, data=dune.env, permutations=99)
Call:
adonis(formula = dune ~ Manure * Management * A1 * Moisture, data = dune.env, permutations = 99)
Permutation: free
Number of permutations: 99
Terms added sequentially (first to last)
Df SumsOfSqs MeanSqs F.Model R2 Pr(>F)
Manure 4 1.5239 0.38097 2.03088 0.35447 0.13
Management 2 0.6118 0.30592 1.63081 0.14232 0.16
A1 1 0.3674 0.36743 1.95872 0.08547 0.21
Moisture 3 0.6929 0.23095 1.23116 0.16116 0.33
Manure:Management 1 0.1091 0.10906 0.58138 0.02537 0.75
Manure:A1 4 0.3964 0.09909 0.52826 0.09220 0.91
Management:A1 1 0.1828 0.18277 0.97431 0.04251 0.50
Manure:Moisture 1 0.0396 0.03963 0.21126 0.00922 0.93
Residuals 2 0.3752 0.18759 0.08727
Total 19 4.2990 1.00000
Why is the Df of Management not 3 (4 - 1)?

This is a general, rather than a specific answer.
Your formula Moisture*Management*A1*Manure corresponds to a linear model with 160 (!) predictor columns (4 × 4 × 2 × 5, where the 2 counts the intercept and the A1 slope):
dim(model.matrix(~Moisture*Management*A1*Manure, dune.env))
adonis builds this model matrix internally and uses it to construct the machinery for calculating the permutation statistics. When there are multicollinear combinations of predictors, it drops enough columns to make the problem well-defined again. The detailed rules for which columns get dropped depend on the order of the columns; if you reorder the factors in your formula you'll see the reported Df change.
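You can check how overdetermined this is directly (a sketch; Matrix::rankMatrix() is used only to get the rank):
library(vegan)
data(dune); data(dune.env)
X <- model.matrix(~ Moisture * Management * A1 * Manure, dune.env)
dim(X)                  ## 20 rows (sites) x 160 columns
Matrix::rankMatrix(X)   ## at most 20, so most columns must be dropped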
For what it's worth, I don't think the df calculations change the statistical outcomes at all — the statistics are based on the distributions derived from permutations, not from an analytical calculation that depends on the df.

Ben Bolker got it right. If you only look at Management and Manure and forget all other variables, you will see this:
> with(dune.env, table(Management, Manure))
Manure
Management 0 1 2 3 4
BF 0 2 1 0 0
HF 0 1 2 2 0
NM 6 0 0 0 0
SF 0 0 1 2 3
Look at the row for Management NM and the column for Manure 0: each has only one non-zero cell, and it is the same cell. This means that Management NM and Manure 0 are synonyms, the same thing (or "aliased"). Once Manure is in your model, Management contributes only three new levels, and hence 2 d.f. If you do it in the reverse order and enter Management first, only four levels of Manure remain unknown, which gives Manure 3 d.f.
Although your full model really is overparametrized, you would get the same result with only these two variables. Compare the models:
adonis2(dune ~ Manure + Management, data=dune.env)
adonis2(dune ~ Management + Manure, data=dune.env)
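If you want the aliased term named explicitly, the same pattern shows up in an ordinary linear model (a sketch: the random response is only a stand-in so that lm() runs, since alias() looks only at the model matrix):
fit <- lm(rnorm(20) ~ Manure + Management, data = dune.env)
alias(fit)   ## lists the Management dummy expressible as a combination of the Manure terms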

Related

Recursive partitioning for factors/characters problem

Currently I am working with the dataset predictions. In these data I have converted the obvious character variables into factors, because I think factors work better than characters for glmtree() (tell me if I am wrong about this):
> str(predictions)
'data.frame': 43804 obs. of 14 variables:
$ month : Factor w/ 7 levels "01","02","03",..: 6 6 6 6 1 1 2 2 3 3 ...
$ pred : num 0.21 0.269 0.806 0.945 0.954 ...
$ treatment : Factor w/ 2 levels "0","1": 1 1 2 2 2 2 2 2 2 2 ...
$ type : Factor w/ 4 levels "S","MS","ML",..: 1 1 4 4 4 4 4 4 4 4 ...
$ i_mode : Factor w/ 143 levels "AAA","ABC","CBB",..: 28 28 104 104 104 104 104 104 104 104 ...
$ r_mode : Factor w/ 29 levels "0","5","8","11",..: 4 4 2 2 2 2 2 2 2 2 ...
$ in_mode: Factor w/ 22 levels "XY",..: 11 11 6 6 6 6 6 6 6 6 ...
$ v_mode : Factor w/ 5 levels "1","3","4","7",..: 1 1 1 1 1 1 1 1 1 1 ...
$ di : num 1157 1157 1945 1945 1945 ...
$ cont : Factor w/ 5 levels "AN","BE",..: 2 2 2 2 2 2 2 2 2 2 ...
$ hk : num 0.512 0.512 0.977 0.977 0.941 ...
$ np : num 2 2 2 2 2 2 2 2 2 2 ...
$ hd : num 1 1 0.408 0.408 0.504 ...
$ nd : num 1 1 9 9 9 9 7 7 9 9 ...
I want to estimate a recursive partitioning model of this kind:
library("partykit")
glmtr <- glmtree(formula = pred ~ treatment + 1 | (month+type+i_mode+r_mode+in_mode+v_mode+di+cont+np+nd+hd+hk),
data = predictions,
maxdepth=6,
family = quasibinomial)
My data does not contain any NA values. However, the following error arises (even after converting the characters to factors):
Error in matrix(0, nrow = mi, ncol = nl) :
invalid 'nrow' value (too large or NA)
In addition: Warning message:
In matrix(0, nrow = mi, ncol = nl) :
NAs introduced by coercion to integer range
Any clue?
Thank you
You are right that glmtree() and the underlying mob() function expect the split variables to be factors in the case of nominal information. However, this is computationally feasible only for factors with a limited number of levels, because the algorithm tries all possible partitions of the levels into two groups. For your i_mode factor with nl = 143 levels, that means mi candidate splits:
nl <- 143                 # number of levels of i_mode
mi <- 2^(nl - 1L) - 1L    # number of ways to partition nl levels into two groups
mi
## [1] 5.575186e+42
Internally, mob() tries to create a matrix for storing all log-likelihoods associated with the corresponding partitioned models, and this fails because such a matrix cannot even be represented. (And even if it could be, you would never finish fitting all the associated models.) Admittedly, the error message is not very useful and should be improved. We will look into that for the next revision of the package.
To solve the problem, I would recommend turning the variables i_mode, r_mode, and in_mode into variables that are more suitable for binary splitting with exhaustive search. Maybe some of the variables are actually ordinal? If so, I would recommend turning them into ordered factors, or in the case of i_mode even into a numeric variable, because the number of levels is large enough. Alternatively, you could create several factors capturing different properties of the levels that could then be used for partitioning.
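A sketch of what that recoding could look like (the variable names come from the question; whether simple level codes are a meaningful numeric score for i_mode is an assumption about what its levels encode):
## treat i_mode as a numeric score (only sensible if its levels have an inherent order)
predictions$i_mode_num <- as.numeric(predictions$i_mode)
## declare r_mode and in_mode ordinal, so mob() tries only the nl - 1 order-preserving splits
predictions$r_mode  <- factor(predictions$r_mode,  ordered = TRUE)
predictions$in_mode <- factor(predictions$in_mode, ordered = TRUE)
library("partykit")
glmtr <- glmtree(pred ~ treatment | month + type + i_mode_num + r_mode +
                   in_mode + v_mode + di + cont + np + nd + hd + hk,
                 data = predictions, maxdepth = 6, family = quasibinomial)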

Error message in Firth's logistic regression

I have produced a logistic regression model in R using the logistf function from the logistf package due to quasi-complete separation. I get the error message:
Error in solve.default(object$var[2:(object$df + 1), 2:(object$df + 1)]) :
system is computationally singular: reciprocal condition number = 3.39158e-17
The data are structured as shown below, though a lot of the data has been cut here. The numbers represent levels (i.e. 1 = very low, 5 = very high), not count data. Variables OrdA to OrdH are ordered factors; the variable Binary is a factor.
OrdA OrdB OrdC OrdE OrdF OrdG OrdH Binary
1 3 4 1 1 2 1 1
2 3 4 5 1 3 1 1
1 3 2 5 2 4 1 0
1 1 1 1 3 1 2 0
3 2 2 2 1 1 1 0
I have read here that this can be caused by multicollinearity, but I have tested for this and it is not the problem.
VIFModel <- lm(Binary ~ OrdA + OrdB + OrdC + OrdD + OrdE +
OrdF + OrdG + OrdH, data = VIFdata)
vif(VIFModel)
GVIF Df GVIF^(1/(2*Df))
OrdA 6.09 3 1.35
OrdB 3.50 2 1.37
OrdC 7.09 3 1.38
OrdD 6.07 2 1.57
OrdE 5.48 4 1.23
OrdF 3.05 2 1.32
OrdG 5.41 4 1.23
OrdH 3.03 2 1.31
The post also indicates that the problem can be caused by having "more variables than observations." However, I have 8 independent variables and 82 observations.
For context each independent variable is ordinal with 5 levels, and the binary dependent variable has 30% of the observations with "successes." I'm not sure if this could be associated with the issue. How do I fix this issue?
X <- model.matrix(Binary ~ OrdA+OrdB+OrdC+OrdD+OrdE+OrdF+OrdG+OrdH,
Data3, family = "binomial"); dim(X); Matrix::rankMatrix(X)
[1] 82 24
[1] 23
attr(,"method")
[1] "tolNorm2"
attr(,"useGrad")
[1] FALSE
attr(,"tol")
[1] 1.820766e-14
Short answer: your ordinal input variables are expanded to 24 predictor columns (the number of columns of the model matrix), but the rank of your model matrix is only 23, so you do indeed have multicollinearity among your predictors. I don't know what vif is doing ...
You can use svd(X) to help figure out which components are collinear ...
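A minimal sketch of that diagnostic (X is the model matrix from the rank check above):
s <- svd(X)
round(s$d, 4)                      ## the last singular value should be near 0
zero_dir <- s$v[, which.min(s$d)]  ## right singular vector for that value
names(zero_dir) <- colnames(X)
round(zero_dir, 3)                 ## large entries mark the aliased columns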

R: Explorative linear regression, setting up a simple model with multiple dependent and independent variables

I have a study with several cases, all containing data from multiple ordinal factor variables (genotypes) and multiple numeric variables (various blood sample concentrations). I am trying to set up an explorative model to test for linearity between any of the numeric variables (dependent in the model) and any of the ordinal factor variables (independent in the model).
Dataset structure example (independent variables): genotypes
case_id genotype_1 genotype_2 ... genotype_n
1 0 0 1
2 1 0 2
... ... ... ...
n 2 1 0
and dependent variables (with matching case id:s): samples
case_id sample_1 sample_2 ... sample_n
1 0.3 0.12 6.12
2 0.25 0.15 5.66
... ... ... ...
n 0.44 0.26 6.62
Found one similar example in the forum which doesn't solve the problem:
model <- apply(samples, 2, function(xl) lm(xl ~ ., data = genotypes))
I can't figure out how to run simple linear regressions over every combination of a given set of dependent and independent variables. If I use the apply family, I guess the varying (x) term should be the dependent variable, since every dependent variable should be tested for linearity against the same set of independent variables (individually).
Extract from true data:
> genotypes
case_id genotype_1 genotype_2 genotype_3 genotype_4 genotype_5
1 1 2 2 1 1 0
2 2 NaN 1 NaN 0 0
3 3 1 0 0 0 NaN
4 4 2 2 1 1 0
5 5 0 0 0 1 NaN
6 6 2 2 1 0 0
7 9 0 0 0 0 1
8 10 0 0 0 NaN 0
9 13 0 0 0 NaN 0
10 15 NaN 1 NaN 0 1
> samples
case_id sample_1 sample_2 sample_3 sample_4 sample_5
1 1 0.16092019 0.08814160 -0.087733372 0.1966070 0.09085343
2 2 -0.21089678 -0.13289427 0.056583528 -0.9077926 -0.27928376
3 3 0.05102400 0.07724300 -0.212567535 0.2485348 0.52406368
4 4 0.04823619 0.12697286 0.010063683 0.2265085 -0.20257192
5 5 -0.04841221 -0.10780329 0.005759269 -0.4092782 0.06212171
6 6 -0.08926734 -0.19925538 0.202887833 -0.1536070 -0.05889369
7 9 -0.03652588 -0.18442457 0.204140717 0.1176950 -0.65290133
8 10 0.07038933 0.05797007 0.082702589 0.2927817 0.01149564
9 13 -0.14082554 0.26783539 -0.316528107 -0.7226103 -0.16165326
10 15 -0.16650266 -0.35291579 0.010063683 0.5210507 0.04404433
SUMMARY: Since I have a lot of data I want to create a simple model to help me select which possible correlations to look further into. Any ideas out there?
NOTE: I am not trying to fit a multiple linear regression model!
I feel like there must be a statistical test for linearity, but I can't recall it. Visual inspection is typically how I do it. A quick and dirty way to screen for linearity over a large number of variables is to compute cor() for each pair of dependent/independent variables; small multiples would be a handy way to visualize them.
Alternately, for each numeric dependent variable, check its correlation against each ordinal independent variable, a logged version of the independent variable, and an exponentiated version of the independent variable. If the correlation test (cor.test()) for the logged or exponentiated version gives a lower p-value than the plain version, it seems likely you have some linearity issues.
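A minimal sketch of that pairwise screen (it uses the samples and genotypes data frames from the question and assumes the first column of each is case_id):
S <- samples[, -1]     ## numeric blood-sample variables (dependent)
G <- genotypes[, -1]   ## genotype scores treated as numeric (independent)
## one correlation per dependent/independent pair; NaN entries are skipped
cor_mat <- cor(S, G, use = "pairwise.complete.obs")
round(cor_mat, 2)      ## rows = samples, columns = genotypes
## or one simple regression per pair, collecting the slope p-values
pvals <- sapply(G, function(g) sapply(S, function(y)
  summary(lm(y ~ g))$coefficients[2, 4]))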

Grouping error with lmer

I have a data frame with the following structure:
> t <- read.csv("combinedData.csv")[,1:7]
> str(t)
'data.frame': 699 obs. of 7 variables:
$ Awns : int 0 0 0 0 0 0 0 0 1 0 ...
$ Funnel : Factor w/ 213 levels "MEL001","MEL002",..: 1 1 2 2 2 3 4 4 4 4 ...
$ Plant : int 1 2 1 3 8 1 1 2 3 5 ...
$ Line : Factor w/ 8 levels "a","b","c","cA",..: 2 2 1 1 1 3 1 1 1 1 ...
$ X : int 1 2 3 4 7 8 9 10 11 12 ...
$ ID : Factor w/ 699 levels "MEL_001-1b","MEL_001-2b",..: 1 2 3 4 5 6 7 8 9 10 ...
$ BobWhite_c10082_241: int 2 2 2 2 2 2 0 2 2 0 ...
I want to construct a mixed-effects model. I know that the random effect I want to include (Funnel) is a factor in my data frame, but it does not work:
> lmer(t$Awns ~ (1|t$Funnel) + t$BobWhite_c10082_241)
Error: couldn't evaluate grouping factor t$Funnel within model frame: try adding grouping factor to data frame explicitly if possible
In fact this happens with whatever I include as a random effect, e.g. Plant:
> lmer(t$Awns ~ (1|t$Plant) + t$BobWhite_c10082_241)
Error: couldn't evaluate grouping factor t$Plant within model frame: try adding grouping factor to data frame explicitly if possible
Why is R giving me this error? The only other answer my google-fu turned up was that the random effect fed in wasn't a factor in the data frame. But as str() shows, t$Funnel certainly is.
It is actually not so easy to provide a convenient syntax for modeling functions and at the same time a robust implementation. Most package authors assume that you use the data parameter, and even then scoping issues can occur. Thus, strange things can happen if you specify variables with the DF$col syntax, since package authors rarely spend much effort to make this work correctly and seldom include unit tests for it.
It is therefore strongly recommended to use the data parameter if the modeling function offers a formula interface. Strange things can happen if you don't follow that practice (also with other modeling functions such as lm).
In your example:
lmer(Awns ~ (1|Funnel) + BobWhite_c10082_241, data = t)
This not only works, but is also more convenient to write.

Create a pedigree in R when only 1 parent is known for some individuals?

Is there a way to create a pedigree in R when only 1 parent is known for some individuals?
I've tried using the kinship2 package in R to create a pedigree, but I believe it can only handle individuals for whom either no parents are known (considered generation 1) or both parents are known thereafter.
I believe the synbreed package is able to deal with this, and I have tried the code below, but for some reason I receive an error message that I cannot decipher. Is there something wrong with the way my data are structured? Or with how I have formulated the arguments of the create.pedigree() function? Or is it not possible to construct this pedigree with the synbreed package?
Note rows 5 and 6 of data frame 'Ped' which have only one parent ID.
> Ped<-read.csv("Pedigree.csv",header=T)
> library(synbreed)
> head(Ped)
IndividualID SireID DamID Sex
1 019-35751 026-34118 026-34117 male
2 019-35740 <NA> <NA> female
3 019-35791 026-34129 026-34128 male
4 019-35702 <NA> <NA> male
5 019-35784 <NA> 026-34147 female
6 019-35764 <NA> 026-34133 male
> str(Ped)
'data.frame': 1136 obs. of 4 variables:
$ IndividualID: Factor w/ 1136 levels "019-35702","019-35712",..: 6 4 10 1 9 8 3 63 62 108 ...
$ SireID : Factor w/ 136 levels "019-35712","019-35756",..: 8 NA 15 NA NA NA 23 23 23 84 ...
$ DamID : Factor w/ 131 levels "026-34101","026-34103",..: 4 NA 7 NA 13 8 NA NA NA 30 ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 2 2 2 2 ...
> create.pedigree(Ped$IndividualID, Ped$SireID, Ped$DamID, unknown = NA)
Error in `[<-.data.frame`(`*tmp*`, is.na(pedigree), value = 0) :
unsupported matrix index in replacement
I don't know the synbreed package or the assumptions it makes about individuals with only one known parent, so the answer below may be different from what you are looking for.
However, if you know/believe that the unknown parents are independent individuals, then you can "fix" your dataset by adding "empty" parents.
A toy dataset that resembles what you show could be
fid id father mother sex
1 1 . . 1
1 2 . . 2
1 3 1 2 1
1 4 1 2 1
1 5 . 2 2
1 6 . 2 1
Here the fathers of individuals 5 and 6 are missing. We then add two new entries to represent the fathers we have not seen, so the dataset becomes
fid id father mother sex
1 1 . . 1
1 2 . . 2
1 3 1 2 1
1 4 1 2 1
1 100 . . 1
1 101 . . 1
1 5 100 2 2
1 6 101 2 1
where we have added two new fathers that are only used to fill out the pedigree structure. This last dataset can be read with
library(kinship2)
indata <- read.table("ped.txt", header=TRUE, na.strings=".")
with(indata, pedigree(id=id, dadid=father, momid=mother, sex=sex, famid=fid))
Now, in the fixed dataset we have made an implicit assumption that individuals 5 and 6 are half siblings, i.e. that they cannot be full siblings. If the synbreed package (and the relevant computations) can handle the possibility that those individuals are full siblings, then that is different (and computationally quite difficult) from what I suggest.
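For completeness, here is a sketch that builds the same toy pedigree inline instead of reading ped.txt (with a single family the famid argument can be omitted, and NA plays the role of "."):
library(kinship2)
toy <- data.frame(
  id     = c(1, 2, 3, 4, 100, 101, 5, 6),
  father = c(NA, NA, 1, 1, NA, NA, 100, 101),
  mother = c(NA, NA, 2, 2, NA, NA, 2, 2),
  sex    = c(1, 2, 1, 1, 1, 1, 2, 1)   ## 1 = male, 2 = female
)
ped <- with(toy, pedigree(id = id, dadid = father, momid = mother, sex = sex))
plot(ped)   ## the two placeholder fathers (100, 101) show up as founders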
