Pairwise ANOVA of a subset of data in R

I need to perform multiple pairwise ANOVAs in R and correct the p-values using Bonferroni. However, I don't need to compare every CLASS to each other. Below are my data format and selcontrasts, the pairs for which I need to contrast log10relquant. Does anyone know how I could do this? I use the dplyr, lsmeans and broom packages.
     SEX EXPERIENCED AGE CLASS compound    relquant log10relquant
1 FEMALE          NO  1D    1F      C14 0.004012910     -2.396541
2 FEMALE          NO  1D    1F      C14 0.003759812     -2.424834
3 FEMALE          NO  1D    1F      C14 0.003838553     -2.415832
4 FEMALE          NO  1D    1F      C14 0.003582754     -2.445783
5   MALE          NO  1D    1M      C14 0.005099237     -2.292495
6   MALE          NO  1D    1M      C14 0.005379093     -2.269291
selcontrasts <- c("1F - 1M", "4F - 4M", "4EF - 4EM",
"7F - 7M", "7EF - 7EM", # sex differences
"1M - 4M", "4M - 7M", "1M - 7M", "1F - 4F",
"4F - 7F", "1F - 7F", # age differences
"4M - 4EM", "7M - 7EM", "4F - 4EF",
"7F - 7EF" # social experience)
x=list(selcontrasts)
Currently I'm using this to contrast the whole dataset (i.e. to compare every CLASS) instead of only the selected contrasts:
pvalsage = data.frame(datagr %>%
  do(data.frame(summary(contrast(
    lsmeans(aov(log10relquant ~ CLASS, data = .), ~ CLASS),
    method = "pairwise", adjust = "none")))))
To only do the selected contrasts in list x, I tried:
pvalsage = data.frame(datagr %>%
  do(data.frame(summary(contrast(
    lsmeans(aov(log10relquant ~ CLASS, data = .), ~ CLASS),
    method = x, adjust = "none")))))
But I get the error:
Error in contrast.ref.grid(lsmeans(aov(log10relquant ~ CLASS, data = .),  :
  Nonconforming number of contrast coefficients

If I understand the question correctly (and I very well might not), there are really three factors involved: SEX (two levels), EXPERIENCED (two levels), and AGE (3 levels, namely 1, 4, and 7). And what is required is separate comparisons of the levels of each factor, for each combination of the other two.
If that is the case, then combining the three factors into one named CLASS just makes it a lot harder, because it makes it much harder to keep track of the levels of the factors separately. What's simpler is to fit a model that accounts for all three factors, estimate the means for each factor combination, and then do the required comparisons using by variables. Thus, for each dataset dat, you do:
require(emmeans)
mod = aov(log10relquant ~ SEX * EXPERIENCED * AGE, data = dat)
emm = emmeans(mod, ~ SEX * EXPERIENCED * AGE)
rbind(pairs(emm, by = c("EXPERIENCED", "AGE")),
      pairs(emm, by = c("SEX", "EXPERIENCED")),
      pairs(emm, by = c("SEX", "AGE")),
      adjust = "bonferroni")
I did not try to embed this in the functional-programming paradigm; I'll leave it to the OP to figure out those details.
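For what it's worth, here is a minimal, untested sketch of how that embedding might look, assuming datagr is the OP's data frame grouped with dplyr::group_by():
library(dplyr)
library(emmeans)
pvals <- datagr %>%
  do({
    # Fit the three-factor model within each group, then run the
    # selected comparisons with a Bonferroni adjustment
    mod <- aov(log10relquant ~ SEX * EXPERIENCED * AGE, data = .)
    emm <- emmeans(mod, ~ SEX * EXPERIENCED * AGE)
    data.frame(summary(rbind(pairs(emm, by = c("EXPERIENCED", "AGE")),
                             pairs(emm, by = c("SEX", "EXPERIENCED")),
                             pairs(emm, by = c("SEX", "AGE")),
                             adjust = "bonferroni")))
  })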
Note: The emmeans package (estimated marginal means) is a continuation of lsmeans, which will be deprecated in the future. It works the same way.
PS -- Looking at the code in the question, I am concerned that the end results will not show the actual estimates being compared (the EMMs), only the comparisons; the naming further implies that really only P values are sought. This grates on me. I don't like to watch people go straight to statistical tests without even looking at the quantities being tested.
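If nothing else, looking at those estimates first is a one-liner on the emm object from the code above:
summary(emm)   # the estimated marginal means being compared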

You could do the pairwise contrasts anyway, then filter the rows containing your selcontrasts into a new data frame, followed by p.adjust with method = "bonferroni" on only the contrasts of your interest (a sketch of this is below).
Or you could write a custom mycontr.lsmc function that defines selcontrasts and use that as method =.
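A minimal sketch of the filtering route, assuming the same grouped datagr pipeline as in the question (untested):
library(dplyr)
# Keep all pairwise contrasts, unadjusted
pvalsage <- data.frame(datagr %>%
  do(data.frame(summary(contrast(
    lsmeans(aov(log10relquant ~ CLASS, data = .), ~ CLASS),
    method = "pairwise", adjust = "none")))))
# Subset to the selected contrasts, then Bonferroni-adjust only those
pvalssel <- pvalsage %>%
  filter(contrast %in% selcontrasts) %>%
  mutate(p.adj = p.adjust(p.value, method = "bonferroni"))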


Error in generalized linear mixed model cross-validation: The value in 'data[[cat_col]]' must be constant within each ID

I am trying to conduct a 5-fold cross validation on a generalized linear mixed model using the groupdata2 and cvms packages. This is the code I tried to run:
data <- groupdata2::fold(detect, k = 5,
                         cat_col = 'outcome',
                         id_col = 'bird') %>%
  arrange(.folds)

cvms::cross_validate(
  data,
  "outcome ~ sex + year + season + (1 | bird) + (1 | obsname)",
  family = "binomial",
  fold_cols = ".folds",
  control = NULL,
  REML = FALSE)
This is the error I receive:
Error in groupdata2::fold(detect, k = 4, cat_col = "outcome", id_col = "bird") %>% :
1 assertions failed:
* The value in 'data[[cat_col]]' must be constant within each ID.
In the package vignette, the following explanation is given: "A participant must always have the same diagnosis (‘a’ or ‘b’) throughout the dataset. Otherwise, the participant might be placed in multiple folds." This makes sense in the example. However, my data is based on the outcome of resighting birds, so outcome varies depending on whether the bird was observed on that particular survey. Is there a way around this?
Reproducible example:
bird <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
outcome <- c(0,1,1,1,0,0,0,1,0,1,0,1,0,0,1)
df <- data.frame(bird, outcome)
df$outcome <- as.factor(df$outcome)
df$bird <- as.factor(df$bird)
data <- groupdata2::fold(df, k = 5,
                         cat_col = 'outcome',
                         id_col = 'bird') %>%
  arrange(.folds)
The full documentation says:
cat_col: Name of categorical variable to balance between folds.
E.g. when predicting a binary variable (a or b), we usually
want both classes represented in every fold.
N.B. If also passing an ‘id_col’, ‘cat_col’ should be
constant within each ID.
So in this case, where outcome varies within individual birds (id_col), you simply can't ask for the folds to be balanced with respect to the outcome. (I don't 100% understand this constraint in the software: it seems it should be possible to do at least approximate balancing by selecting groups (birds) with a balanced range of outcomes, but I can see how it could make the balancing procedure a lot harder.)
In my opinion, though, the importance of balancing outcomes is somewhat overrated in general. Lack of balance would mean that some of the simpler metrics in ?binomial_metrics (e.g. accuracy, sensitivity, specificity) are not very useful, but others (balanced accuracy, AUC, aic) should be fine.
A potentially greater problem is that you appear to have (potentially) crossed random effects (i.e. (1|bird) + (1|obsname)). I'm guessing obsname is the name of an observer: if some observers detected (or failed to detect) multiple birds and some birds were detected/failed by multiple observers, then there may be no way to define folds that are actually independent, or at least it may be very difficult.
You may be able to utilize the new collapse_groups() function in groupdata2 v2.0.0 instead of fold() for this. It allows you to take existing groups (e.g. bird) and collapse them to fewer groups (e.g. folds) with attempted balancing of multiple categorical columns, numeric columns, and factor columns (number of unique levels - though the same levels might be in multiple groups).
It does not have the constraints that fold() has with regard to changing outcomes, but on the other hand it does not come with the same "guarantees" in the "non-changing outcome" context. E.g. it does not guarantee at least one of each outcome level in every fold.
You need more birds than the number of folds though, so I've added a few to the test data:
bird <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,
4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,7)
outcome <- c(0,1,1,1,0,0,0,1,0,1,0,1,0,0,1,0,1,
0,1,1,0,1,1,0,0,1,1,0,0,1,0,0,1,1)
df <- data.frame(bird, outcome)
df$outcome <- as.factor(df$outcome)
df$bird <- as.factor(df$bird)
# Combine 'bird' groups into folds
data <- groupdata2::collapse_groups(
  data = df,
  n = 3,
  group_cols = "bird",
  cat_cols = "outcome",
  col_name = ".folds"
) %>%
  arrange(.folds)

# Check the balance of the relevant columns
groupdata2::summarize_balances(
  data = data,
  group_cols = ".folds",
  cat_cols = "outcome"
)$Groups
> # A tibble: 3 × 6
> .group_col .group `# rows` `# bird` `# outc_0` `# outc_1`
> <fct> <fct> <int> <int> <dbl> <dbl>
> 1 .folds 1 14 3 7 7
> 2 .folds 2 10 2 6 4
> 3 .folds 3 10 2 4 6
summarize_balances() shows us that we created 3 folds with 14 rows in the first fold and 10 in the other folds. There are 3 unique bird levels in the first fold and 2 in the others (normally only unique within the group, but here we know that birds are only in one group, as that is how collapse_groups() works with its group_cols argument).
The outcome counts (here # outc_0 and # outc_1) are somewhat decently balanced.
With larger datasets, you might want to run multiple collapsings and choose the one with the best balance from the summary. That can be done by adding num_new_group_cols = 10 to collapse_groups() (for even better results, enable the auto_tune setting) and then listing all the created group columns when running summarize_balances(), as sketched below.
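A rough sketch of that workflow (untested; I am assuming that with col_name = ".folds" and num_new_group_cols = 10 the created columns are named .folds_1 through .folds_10):
# Create 10 candidate fold columns, with auto-tuned balancing
data <- groupdata2::collapse_groups(
  data = df,
  n = 3,
  group_cols = "bird",
  cat_cols = "outcome",
  col_name = ".folds",
  num_new_group_cols = 10,
  auto_tune = TRUE
)
# Compare the balance of all candidates and pick the best one
groupdata2::summarize_balances(
  data = data,
  group_cols = paste0(".folds_", 1:10),
  cat_cols = "outcome"
)$Groups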
Hope this helps you or others in a similar position. The constraint in fold() is hard to get around with its current internal approach, but collapse_groups hopefully does the trick in those cases.
See more at https://rdrr.io/cran/groupdata2/man/collapse_groups.html

How to conduct LSD test with interactions in R?

I have field data:
sowing_date <- rep(c("Early", "Normal"), each = 12)
herbicide <- rep(rep(c("No", "Yes"), each = 6), 2)
nitrogen <- rep(rep(c("No", "Yes"), each = 3), 4)
Block <- rep(c("Block 1", "Block 2", "Block 3"), times = 8)
Yield <- c(30,27,25,40,41,42,37,38,40,48,47,46,25,27,26,
           41,41,42,38,39,42,57,59,60)
DataA <- data.frame(sowing_date, herbicide, nitrogen, Block, Yield)
and I conducted a 3-way ANOVA:
anova3way <- aov(Yield ~ sowing_date + herbicide + nitrogen +
                   sowing_date:herbicide + sowing_date:nitrogen +
                   herbicide:nitrogen + sowing_date:herbicide:nitrogen +
                   factor(Block), data = DataA)
summary(anova3way)
There is a 3-way interaction among the 3 factors, so I'd like to see which combination gives the greatest yield.
I know how to compare mean differences for a single factor, as below, but how can I do that in the case of interactions?
library(agricolae)
LSD_Test <- LSD.test(anova3way, "sowing_date")
LSD_Test
For example, I'd like to check the mean differences under the 3-way interaction, and also under the interaction between each pair of factors. I'd like to get this LSD result in R. Could you tell me how I can do that?
Many thanks,
One way, which does take some manual work, is to encode the experimental factors as -1 and 1 in order to properly separate the two- and three-factor interactions.
Once you have the values encoded, you can pull the residual degrees of freedom and the mean square error from the ANOVA model and pass them to the LSD.test function.
See Example below:
sowing_date <- rep(c("Early", "Normal"), each = 12)
herbicide <- rep(rep(c("No", "Yes"), each = 6), 2)
nitrogen <- rep(rep(c("No", "Yes"), each = 3), 4)
Block <- rep(c("Block 1", "Block 2", "Block 3"), times = 8)
Yield <- c(30,27,25,40,41,42,37,38,40,48,47,46,25,27,26,
           41,41,42,38,39,42,57,59,60)
DataA <- data.frame(sowing_date, herbicide, nitrogen, Block, Yield)

anova3way <- aov(Yield ~ sowing_date * herbicide * nitrogen +
                   factor(Block), data = DataA)
summary(anova3way)
summary(anova3way)
# Encode the experiment's parameters as -1 and 1
DataA$codeSD <- ifelse(DataA$sowing_date == "Early", -1, 1)
DataA$codeherb <- ifelse(DataA$herbicide == "No", -1, 1)
DataA$codeN2 <- ifelse(DataA$nitrogen == "No", -1, 1)
library(agricolae)
LSD_Test <- LSD.test(anova3way, c("sowing_date"))
LSD_Test
# Manually defining the treatment and specifying the residual degrees of
# freedom and mean square error (from the residuals of the ANOVA)
print(LSD.test(y = DataA$Yield, trt = DataA$sowing_date,
               DFerror = 14, MSerror = 34.3))

# Example for a two-factor interaction
print(LSD.test(y = DataA$Yield, trt = (DataA$codeSD * DataA$codeherb),
               DFerror = 14, MSerror = 34.3))

# Example for the three-factor interaction
print(LSD.test(y = DataA$Yield,
               trt = (DataA$codeSD * DataA$codeherb * DataA$codeN2),
               DFerror = 14, MSerror = 34.3))

# Calculate the means and SDs (as a check)
# DataA %>% group_by(sowing_date) %>% summarize(mean = mean(Yield), sd = sd(Yield))
# DataA %>% group_by(codeSD*codeherb*codeN2) %>% summarize(mean = mean(Yield), sd = sd(Yield))
You will need to manually track which run/condition goes with the -1 and 1 in the final report; a small helper for that is sketched below.
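For instance, a quick cross-tabulation (my addition, not from the original answer) shows which factor combinations map to each coded product:
# Which sowing_date x herbicide combinations sit at -1 and +1?
DataA$SDxHerb <- interaction(DataA$sowing_date, DataA$herbicide)
table(DataA$SDxHerb, DataA$codeSD * DataA$codeherb)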
Edit:
My answer above shows the overall effect of the interactions; for example, how the interaction of herbicide and nitrogen affects yield.
Based on your comment that you want to determine which combination provides the greatest yield, you can use the LSD.test() function again, passing a vector of factor names.
LSD_Test <- LSD.test(anova3way, c("sowing_date", "herbicide", "nitrogen"))
LSD_Test
From the groups part of the output you can see that Normal, Yes and Yes gives the optimal yield. The "groups" column identifies the clusters of statistically similar results; for example, the last 2 rows provide a similar yield.
...
$groups
Yield groups
Normal:Yes:Yes 58.66667 a
Early:Yes:Yes 47.00000 b
Normal:No:Yes 41.33333 c
Early:No:Yes 41.00000 cd
Normal:Yes:No 39.66667 cd
Early:Yes:No 38.33333 d
Early:No:No 27.33333 e
Normal:No:No 26.00000 e
...

I need to find all predictors (p-value < 0.05) from my dataset using loops. Is there any way to do it?

I am new to R and I am using the glm() function to fit a logistic model. I have 5 columns. I need to find all possible predictors using a loop, based on their p-values (less than 0.05).
My dataset has 40,000 entries containing numerical and categorical variables, and it looks more or less like this:
"Age" "Sex" "Occupation" "Education" "Income"
50 Male Farmer High School False
30 Female Maid High School False
25 Male Engineer Graduate True
The target variable "Income" denotes whether the person earns more or less than 30K: if True, they earn more than 30K, and vice versa. I would like to find the predictor variables that can be used to predict the target using loops. Also, can I find the best 3 predictors based on their p-values?
Thanks in advance!
If I understood your question correctly, you are looking for a way to test univariable models given your data frame (I am in fact in doubt whether you also want to test every combination of these variables).
My suggestion is to use the purrr::map function and create a list for every column. Check the following example based on your information:
library(tidyr)
library(purrr)

## Sample data
df <- data.frame(
  Age = rnorm(n = 40000,
              mean = mean(c(50, 30, 25)),
              sd = sd(c(50, 30, 25))),
  Occupation = sample(x = c("Farmer", "Maid", "Engineer"),
                      size = 40000,
                      replace = TRUE),
  Education = sample(x = c("High School", "Graduate", "UnderGraduate"),
                     size = 40000,
                     replace = TRUE),
  Income = as.logical(rbinom(40000, 1, 0.5))
)

## Split the predictor columns into a list of one-column data frames
list_df <- Map(cbind, split.default(df[-4], names(df)[-4]))

## Attach the target column (it keeps its original name, Income) to each
list_df <- lapply(list_df, cbind, "target" = df[4])

## Use map to fit one univariable model per list element
list_models <- map(.x = list_df,
                   .f = ~ glm(Income ~ ., data = .x, family = binomial))
You can call each model using list_models[[i]].
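The question also asks for the predictors with p < 0.05 and the best 3 by p-value. One hedged way to tabulate that from the fitted list (my sketch, assuming the broom package is acceptable):
library(broom)
library(dplyr)
# Collect coefficient-level p-values from every univariable model
pvals <- map_dfr(list_models, tidy, .id = "predictor") %>%
  filter(term != "(Intercept)")
# Predictors with at least one coefficient below 0.05
significant <- pvals %>% filter(p.value < 0.05)
# "Best" three predictors, ranked by their smallest p-value
top3 <- pvals %>%
  group_by(predictor) %>%
  summarise(min_p = min(p.value)) %>%
  arrange(min_p) %>%
  slice_head(n = 3)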
Now, addressing the second part of your question concerning p-values: given that each project is unique, and so are its metrics, I suggest you double-check your usage of p-values. Granted, they are important, but they provide a probability of acceptance given a specific statistical test and threshold, which depends on context. They are a fundamental tool of statistical quality control and decision-making (not only the t-test, but the F-test and so on). But for ranking? Hmm, I would say that is a little odd. Just saying :)

Effects from multinomial logistic model in mlogit

I received some good help getting my data formatted properly to produce a multinomial logistic model with mlogit here (Formatting data for mlogit).
However, I'm now trying to analyze the effects of covariates in my model. I find the help file for effects.mlogit() not very informative. One of the problems is that the model appears to produce a lot of rows of NAs (see index(mod1) below).
Can anyone clarify why my data is producing those NAs?
Can anyone help me get mlogit.effects to work with the data below?
I would consider shifting the analysis to multinom(). However, I can't figure out how to format the data to fit the formula for use with multinom(). My data is a series of rankings of seven different items (Accessible, Information, Trade offs, Debate, Social, Officials and Responsive). Would I just model whatever they picked as their first rank and ignore what they chose in the other ranks? I can get that information.
Reproducible code is below:
# Load packages
library(RCurl)
library(mlogit)
library(tidyr)
library(dplyr)

# URL where the data is stored
dat.url <- 'https://raw.githubusercontent.com/sjkiss/Survey/master/mlogit.out.csv'

# Get data
dat <- read.csv(dat.url)

# Complete cases only, as it seems mlogit cannot handle missing values or
# tied data, which in this case you might get because of median imputation
dat <- dat[complete.cases(dat), ]

# Renumber the choice index variable (X) so it has no gaps after removing
# some incomplete cases
dat$X <- seq(1, nrow(dat), 1)

# Tidy data to get it into long format
dat.out <- dat %>%
  gather(Open, Rank, -c(1, 9:12)) %>%
  arrange(X, Open, Rank)

# Create mlogit object
mlogit.out <- mlogit.data(dat.out, shape = 'long', alt.var = 'Open',
                          choice = 'Rank', ranked = TRUE, chid.var = 'X')

# Fit model
mod1 <- mlogit(Rank ~ 1 | gender + age + economic + Job, data = mlogit.out)
Here is my attempt to set up a data frame similar to the one portrayed in the help file. It doesn't work. I confess that although I know the apply family pretty well, tapply is murky to me.
with(mlogit.out, data.frame(economic=tapply(economic, index(mod1)$alt, mean)))
Compare from the help:
data("Fishing", package = "mlogit")
Fish <- mlogit.data(Fishing, varying = c(2:9), shape = "wide", choice = "mode")
m <- mlogit(mode ~ price | income | catch, data = Fish)
# compute a data.frame containing the mean value of the covariates
# in the sample (from the help file for effects)
z <- with(Fish, data.frame(price = tapply(price, index(m)$alt, mean),
                           catch = tapply(catch, index(m)$alt, mean),
                           income = mean(income)))
# compute the marginal effects (the second one is an elasticity)
effects(m, covariate = "income", data = z)
I'll try Option 3 and switch to multinom(). This code will model the log-odds of ranking an item 1st, compared to a reference item (e.g., "Debate" in the code below). With K = 7 items, if we call the reference item Item_K, then we're modeling

log[ Pr(Item_k is 1st) / Pr(Item_K is 1st) ] = α_k + x^T β_k

for k = 1, ..., K-1, where Item_k is one of the other (i.e. non-reference) items. The choice of reference level will affect the coefficients and their interpretation, but it will not affect the predicted probabilities. (The same goes for the reference levels of the categorical predictor variables.)
I'll also mention that I'm handling missing data a bit differently here than in your original code. Since my model only needs to know which item gets ranked 1st, I only need to throw out records where that info is missing. (E.g., in the original dataset record #43 has "Information" ranked 1st, so we can use this record even though 3 other items are NA.)
# Get data
dat.url <- 'https://raw.githubusercontent.com/sjkiss/Survey/master/mlogit.out.csv'
dat <- read.csv(dat.url)
# dataframe showing which item is ranked #1
ranks <- (dat[,2:8] == 1)
# for each combination of predictor variable values, count
# how many times each item was ranked #1
dat2 <- aggregate(ranks, by=dat[,9:12], sum, na.rm=TRUE)
# remove cases that didn't rank anything as #1 (due to NAs in original data)
dat3 <- dat2[rowSums(dat2[,5:11])>0,]
# (optional) set the reference levels for the categorical predictors
dat3$gender <- relevel(dat3$gender, ref="Female")
dat3$Job <- relevel(dat3$Job, ref="Government backbencher")
# response matrix in format needed for multinom()
response <- as.matrix(dat3[,5:11])
# (optional) set the reference level for the response by changing
# the column order
ref <- "Debate"
ref.index <- match(ref, colnames(response))
response <- response[,c(ref.index,(1:ncol(response))[-ref.index])]
# fit model (note that age & economic are continuous, while gender &
# Job are categorical)
library(nnet)
fit1 <- multinom(response ~ economic + gender + age + Job, data=dat3)
# print some results
summary(fit1)
coef(fit1)
cbind(dat3[,1:4], round(fitted(fit1),3)) # predicted probabilities
I didn't do any diagnostics, so I make no claim that the model used here provides a good fit.
You are working with ranked data, not just multinomial choice data. The structure for ranked data in mlogit is that the first set of records for a person contains all options, the second set contains all options except the one ranked first, and so on. But the index assumes an equal number of options each time, hence a bunch of NAs. We just need to get rid of them:
> with(mlogit.out, data.frame(
+     economic = tapply(economic,
+                       index(mod1)$alt[complete.cases(index(mod1)$alt)],
+                       mean)))
economic
Accessible 5.13
Debate 4.97
Information 5.08
Officials 4.92
Responsive 5.09
Social 4.91
Trade.Offs 4.91

T-test with grouping variable

I've got a data frame with 36 variables and 74 observations. I'd like to run a two-sample paired t-test on 35 of the variables, split by 1 grouping variable (with two levels).
For example: the data frame contains "age", "body weight" and "group" variables.
Now I suppose I can do the t-test for each variable with this code:
t.test(age ~ group)
But is there a way to test all 35 variables with one call, rather than one by one?
Sven has provided you with a great way of implementing what you wanted to have implemented. I, however, want to warn you about the statistical aspect of what you are doing.
Recall that if you are using the standard significance level of 0.05, this means that each t-test performed has a 5% chance of committing a Type 1 error (incorrectly rejecting the null hypothesis). Running 35 individual t-tests compounds your probability of committing at least one Type 1 error; assuming independent tests, more exactly:
Pr(at least one Type 1 error) = 1 - (0.95)^35 ≈ 0.834
Meaning that you have about an 83.4% chance of falsely rejecting at least one null hypothesis. Basically, by running so many t-tests, there is a very high probability that at least one of them will give an incorrect result.
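As a quick check of that arithmetic in R:
1 - 0.95^35   # familywise error rate: ~0.834
0.05 / 35     # Bonferroni-adjusted per-test threshold: ~0.00143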
Just FYI.
An example data frame:
dat <- data.frame(age = rnorm(10, 30), body = rnorm(10, 30),
                  weight = rnorm(10, 30), group = gl(2, 5))
You can use lapply:
lapply(dat[1:3], function(x)
  t.test(x ~ dat$group, paired = TRUE, na.action = na.pass))
In the command above, 1:3 gives the numbers of the columns containing the variables to test. The argument paired = TRUE is necessary to perform a paired t-test.
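Tying this back to the multiple-testing warning in the other answer, a small extension (my sketch, not part of the original answer) collects the p-values and applies a Bonferroni correction:
res <- lapply(dat[1:3], function(x)
  t.test(x ~ dat$group, paired = TRUE, na.action = na.pass))
p_raw <- sapply(res, function(r) r$p.value)
p.adjust(p_raw, method = "bonferroni")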
