Unbalanced nested design adonis2 - r

I am hoping to get some advice on how to account for unbalanced nested data in a PERMANOVA using adonis2.
My experimental design has several plots, within each of which four samples were collected (a nested design). Three samples contained no individuals and were removed prior to analysis, creating the unbalanced design. This was necessary because I wanted to use Bray-Curtis dissimilarities, which cannot be computed for empty samples.
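For reference, the removal step looks something like this (a minimal sketch with a hypothetical community matrix dat and matching metadata datv; Bray-Curtis is undefined, 0/0, between pairs of empty samples):
library(vegan)
keep <- rowSums(dat) > 0  # drop samples that contained no individuals
dat <- dat[keep, ]
datv <- datv[keep, ]      # keep the metadata aligned with the community matrix
bc <- vegdist(dat, method = "bray")  # Bray-Curtis dissimilarities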
The grouping variable that I am hoping to test with the PERMANOVA has two levels, which are also unbalanced.
After doing some reading I gathered that, if my data were balanced, the code below could (?) work. But I am lost as to how to adjust it for unbalanced strata. Thanks in advance for your advice.
library(vegan)  # loads permute, which provides how(), Within() and Plots()

# Free permutation of samples within plots, and of whole plots across the design
perm <- how(within = Within(type = "free"),
            plots = Plots(type = "free", strata = datv$Plot),
            blocks = NULL, nperm = 999)
m1 <- adonis2(dat ~ groups, data = datv, permutations = perm)
m1
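For what it's worth, permute can tell you whether it considers a design feasible if you try generating permutations from it directly; my understanding is that freely permuting whole plots requires equally sized plots, so this is a quick way to see where an unbalanced design breaks (a sketch, assuming the datv and perm objects above):
library(permute)
numPerms(nrow(datv), control = perm)  # how many distinct permutations the design allows
# A few example permutations; this should fail with an informative error
# if the design cannot be honoured for unbalanced plots
head(shuffleSet(nrow(datv), nset = 5, control = perm))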

How to do an Exploratory Factor Analysis on data where some variables are ordinal and others dichotomous in R?

I am trying to do an exploratory factor analysis on a data set that contains items from questionnaires. Some questions were answered on a 4-level scale and some had only two response options. However, I am not sure how to implement this in R when using both types of variables.
I have seen that psych is a good toolbox, and I have used it on the ordinal and dichotomous items separately without problems, but I am not sure how to combine both. The tetrachoric() and polychoric() functions do not seem to work, for example, when I use both data types together.
My code as it stands is this:
library(psych)

poly_cor <- polychoric(data)
rho <- poly_cor$rho
cor.plot(rho, numbers = TRUE, upper = FALSE,
         main = "Polychoric Correlation", show.legend = FALSE)
fa.parallel(data, fm = "pa", fa = "fa", main = "Scree Plot")
poly_model <- fa(data, nfactors = 4, cor = "poly", fm = "mle", rotate = "oblimin")
However, polychoric() fails with an error saying that the items do not have an equal number of response alternatives.
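One possibility worth trying (a sketch, not tested on your data): psych::mixedCor() builds a single correlation matrix from mixed item types, with the polytomous and dichotomous items identified by column index; the indices below are placeholders you would replace with your own:
library(psych)
# Hypothetical layout: columns 1:10 are polytomous (4 levels), 11:15 dichotomous
mixed_cor <- mixedCor(data, p = 1:10, d = 11:15)
rho <- mixed_cor$rho
fa.parallel(rho, n.obs = nrow(data), fm = "pa", fa = "fa", main = "Scree Plot")
poly_model <- fa(rho, nfactors = 4, n.obs = nrow(data), fm = "mle", rotate = "oblimin")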

Using Amelia and decision trees

I have a panel dataset (countries and years) with a lot of missing data, so I've decided to use multiple imputation. The goal is to see the relationship between the proportion of women in management (managerial_value) and total fatal workplace injuries (total_fatal).
From what I've read online, Amelia is the best option for panel data, so I used it like so:
library(Amelia)

amelia_data <- amelia(spdata, ts = "year", cs = "country",
                      polytime = 1, intercs = FALSE)
where spdata is my original dataset.
This imputation process worked, but I'm unsure of how to proceed with forming decision trees using the imputed data (an object of class 'amelia').
I originally tried creating a function (amelia2df) to turn each of the 5 imputed datasets into a data frame:
amelia2df <- function(amelia_data, which_imp = 1) {
  stopifnot(inherits(amelia_data, "amelia"), is.numeric(which_imp))
  imps <- amelia_data$imputations[[which_imp]]
  as.data.frame(imps)
}

one_amelia   <- amelia2df(amelia_data, which_imp = 1)
two_amelia   <- amelia2df(amelia_data, which_imp = 2)
three_amelia <- amelia2df(amelia_data, which_imp = 3)
four_amelia  <- amelia2df(amelia_data, which_imp = 4)
five_amelia  <- amelia2df(amelia_data, which_imp = 5)
where one_amelia is the data frame for the first imputed dataset, two_amelia is the second, and so on.
I then combined them using rbind():
total_amelia <- rbind(one_amelia, two_amelia, three_amelia, four_amelia, five_amelia)
And used the new combined dataset total_amelia to construct a decision tree:
library(rpart)
library(rpart.plot)

set.seed(300)
tree_data <- total_amelia
I_index <- sample(1:nrow(tree_data), size = 0.75 * nrow(tree_data), replace = FALSE)
I_train <- tree_data[I_index, ]
I_test <- tree_data[-I_index, ]
fatal_tree <- rpart(total_fatal ~ managerial_value, data = I_train)
rpart.plot(fatal_tree)
fatal_tree
This "works" as in it doesn't produce an error, but I'm not sure that it is appropriately using the imputed data.
I found a couple of resources explaining how to apply least squares, logit, etc., but nothing about decision trees. I'm under the impression I'd need the 5 imputed datasets combined into one data frame, but I have not been able to find a way to do that.
I've also looked into Zelig and bind_rows but haven't found anything that returns one data frame that I can then use to form a decision tree.
Any help would be appreciated!
As already indicated by @Noah, you would set up the multiple imputation workflow differently than you currently do.
Multiple imputation is not really a tool to improve your results or to make them more correct.
It is a method that enables you to quantify the uncertainty that the missing data adds to your analysis.
All the different datasets created by multiple imputation are plausible imputations; because of that uncertainty, you don't know which one is correct.
You would therefore use multiple imputation in the following way:
1. Create your m imputed datasets.
2. Build your trees on each imputed dataset separately.
3. Do your analysis on each tree separately.
4. In your final paper, you can then state how much uncertainty is caused by the missing values/imputation.
This means you get, e.g., 5 different analysis results for m = 5 imputed datasets. At first this looks confusing, but it enables you to give bounds between which the correct result probably lies. And if you get completely different results for each imputed dataset, you know that there is too much uncertainty caused by the missing values to give reliable results.
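A minimal sketch of that per-imputation workflow, reusing the rpart model and the amelia_data object from the question (the lapply/sapply pattern is just one way to organize it):
library(rpart)
# One tree per imputed dataset
fits <- lapply(amelia_data$imputations, function(imp) {
  rpart(total_fatal ~ managerial_value, data = imp)
})
# One set of predictions per imputation; the spread across columns reflects
# the uncertainty due to the missing data
preds <- sapply(fits, predict, newdata = I_test)
apply(preds, 1, sd)  # per-observation variability across the m imputations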

Limiting Number of Predictions - R

I'm trying to create a model that predicts whether a given team makes the playoffs in the NHL, based on a variety of available team stats. However, I'm running into a problem.
I'm using R, specifically the caret package, and I'm having fairly good success so far, with one issue: I can't limit the number of teams that are predicted to make the playoffs.
I'm using a categorical variable as the prediction -- Y or N.
For example, using the random forest method from the caret package,
library(caret)

rf_fit <- train(playoff ~ ., data = train_set, method = "rf")
rf_predict <- predict(rf_fit, newdata = test_data_playoffs)
mean(rf_predict == test_data_playoffs$playoff)
gives an accuracy of approximately 90% for my test set, but that's because it's overpredicting. In the NHL, 16 teams make the playoffs, but this predicts 19 teams to make the playoffs. So I want to limit the number of "Y" predictions to 16.
Is there a way to limit the number of possible responses for a categorical variable? I'm sure there is, but google searching has given me limited success so far.
EDIT: Here is sample data, which can be created with the following code:
set.seed(100)  # for reproducibility
data <- data.frame(Y = sample(1:10, 32, replace = TRUE) / 10, N = rep(NA, 32))
data$N <- 1 - data$Y
This creates a data frame similar to what you get by using the "prob" option, where each row holds the predicted probabilities of Y and N:
pred <- predict(fit, newdata = test_data_playoffs, type = "prob")
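With class probabilities in hand, one way to cap the number of "Y" predictions at 16 is to rank teams by P(Y) and label only the 16 highest as playoff teams (a sketch using the sample data frame above, one row per team):
top16 <- rank(-data$Y, ties.method = "first") <= 16  # TRUE for the 16 highest P(Y)
data$pred_playoff <- ifelse(top16, "Y", "N")
table(data$pred_playoff)  # exactly 16 Y and 16 N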

Treatment of categorical variables in rpart

I wonder how rpart treats categorical variables. There are several references suggesting that for unordered factors it looks through all combinations. Indeed, the vignette, at the end of section 6.2, states:
(F)or a categorical predictor with m levels, all 2^(m-1) - 1 different possible splits are tested.
However, given my experience with the code, I find that difficult to believe. As supporting evidence, the vignette notes that running
rpart(Reliability ~ ., data = car90)
takes a really long, long time. In my case, however, it runs in seconds, despite the data having an unordered factor variable with 30 levels.
To demonstrate the issue further, I have created several variables with 52 levels, meaning that 2^51 - 1 ≈ 2.2 × 10^15 splits would need to be checked if all possibilities were explored. This code runs in about a minute, which in my opinion proves that not all combinations are checked.
library(rpart)

NROW = 50000
NVAR = 20

# 20 factor columns, each with 52 levels (a-z plus A-Z)
rand_letters = data.frame(replicate(NVAR, as.factor(c(
  letters[sample.int(26, floor(NROW / 2), replace = TRUE)],
  LETTERS[sample.int(26, ceiling(NROW / 2), replace = TRUE)]))))
rand_letters$target = rbinom(n = NROW, size = 1, prob = 0.1)

system.time({
  tree_letter = rpart(target ~ ., data = rand_letters, cp = 0.0003)
})
tree_letter
What combinations of categorical variables are ACTUALLY checked in rpart?
I know it is an old question, but I found this link that might answer some of it.
Bottom line is that rpart seems to be applying a simple algorithm:
First, sort the categories by their conditional means, p_i = E(Y | X = x_i).
Then compute the Gini indices for the groups obtained by splitting along that ordering.
Pick the two groups giving the maximum of these Gini indices.
Since only m - 1 splits along the ordering have to be scored, rather than all 2^(m-1) - 1 subsets, it should not be nearly as computationally expensive.
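As a rough illustration of that shortcut (this is not rpart's actual internals), here is what the reduced search space looks like for one of the 52-level factors from the simulation above:
x <- rand_letters$X1     # one 52-level factor from the simulated data
y <- rand_letters$target
p <- tapply(y, x, mean)  # conditional means E(Y | X = x_i)
ord <- names(sort(p))    # categories ordered by their conditional mean
# Candidate splits are "first k categories vs the rest" along this ordering
splits <- lapply(seq_len(length(ord) - 1), function(k) ord[1:k])
length(splits)  # 51 candidate splits instead of 2^51 - 1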
However, I personally have a case with a single categorical variable whose categories are US states, and rpart runs for an extremely long time when trying to use it to produce a classification tree. Creating dummy variables and running rpart on the 51 resulting indicators (one for each state) works fine.
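A sketch of that dummy-variable workaround (hypothetical data frame df with a factor column state and an outcome y):
library(rpart)
# One 0/1 indicator per state; the -1 drops the intercept so every level
# gets its own column
dummies <- as.data.frame(model.matrix(~ state - 1, data = df))
fit <- rpart(y ~ ., data = cbind(y = df$y, dummies))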

Labeling the centroids of a PCoA based on betadisper() multivariate dispersions in R

I've used the betadisper() function in the vegan package to generate multivariate dispersions and plot those data in a PCoA. In this example I'll be looking at the difference between the sexes in a single species.
Load the original data. For our purposes this can be just about anything; the data I'm using isn't special (its feature measurements come from a bioacoustic dataset). I am walking through my process:
my_original_data <- read.csv("mydata.csv", as.is = TRUE, check.names = FALSE)
# Extract just the numeric/quantitative data
myData <- my_original_data[, 13:107]
Based on previous research, we used an unsupervised randomForest to determine similarity within our original feature measurements:
library(randomForest)

full_urf <- randomForest(myData, proximity = TRUE, scale = TRUE,
                         ntree = 4999, importance = TRUE)
An index was then generated from the proximity matrix:
urf_dist_full <- as.dist(1 - full_urf$proximity)
A permutational MANOVA was run on the resulting index using the vegan package. The use of the pMANOVA was well researched and it is the correct test for my purposes:
library(vegan)

mod <- adonis(urf_dist_full ~ Sex * Age * Variant, data = my_original_data,
              permutations = 999, method = "euclidean")
my_original_data contains the qualitative factors Sex, Age and Variant. I could have extracted them, but it seemed cleaner to keep them within the original dataset.
After running a few homogeneity tests, I want to plot the multivariate dispersions. To do this I have been using the betadisper function:
Sex <- betadisper(urf_dist_full, my_original_data$Sex)
plot(Sex, main = "Sex Multivariate Dispersions")
That produces a beautiful PCoA plot of the two dispersions.
How can I label the centroids as Male and Female? I also want to produce this plot for the Variant category, but that has five levels rather than two, which really warrants labeling.
I've seen the boxplot() variant of this, but I like how the PCoA also shows clustering.
You can add labels to centroids like this:
ordilabel(scores(Sex, display = "centroids"))
where Sex is your betadisper result. If you do not want to use the original names of your centroids, you can change the names with:
ordilabel(scores(Sex, display = "centroids"), labels = c("A", "B"))
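The same approach should extend to the five-level Variant grouping from the question (a sketch, assuming the urf_dist_full object above; the centroid scores carry the group names as row names, so no explicit labels are needed):
Variant <- betadisper(urf_dist_full, my_original_data$Variant)
plot(Variant, main = "Variant Multivariate Dispersions")
ordilabel(scores(Variant, display = "centroids"))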
You can use the identify() function:
A <- plot(Sex)
identify(A, "centroids")
Or look at the scores (this doesn't add labels to the plot, but it shows you the centroid positions):
scores(Sex, choices = 1:2, display = "centroids")
