Letters group Games-Howell post hoc in R - r

I use the sweetpotato database included in library agricolae of R:
data(sweetpotato)
This dataset contains two variables: yield(continous variable) and virus(factor variable).
Due to Levene test is significant I cannot assume homogeneity of variances and I apply Welch test in R instead of one-way ANOVA followed by Tukey posthoc.
Nevertheless, the problems come from when I apply posthoc test. In Tukey posthoc test I use library(agricolae) and displays me the superscript letters between virus groups. Therefore there are no problems.
Nevertheless, to perform Games-Howell posthoc, I use library(userfriendlyscience) and I obtain Games-Howell output but it's impossible for me to obtain a letter superscript comparison between virus groups as it is obtained through library(agricolae).
The code used it was the following:
library(userfriendlyscience)
data(sweetpotato)
oneway<-oneway(sweetpotato$virus, y=sweetpotato$yield, posthoc =
'games-howell')
oneway
I try with cld() importing previously library(multcompView) but doesn't work.
Can somebody could helps me?
Thanks in advance.

This functionality does not exist in userfriendlyscience at the moment. You can see which means differ, and with which p-values, by looking at the row names of the dataframe with the post-hoc test results. I'm not sure which package contains the sweetpotato dataset, but using the ChickWeight dataset that comes with R (and is used on the oneway manual page):
oneway(y=ChickWeight$weight, x=ChickWeight$Diet, posthoc='games-howell');
Yields:
### (First bit removed as it's not relevant.)
### Post hoc test: games-howell
diff ci.lo ci.hi t df p
2-1 19.97 0.36 39.58 2.64 201.38 .044
3-1 40.30 17.54 63.07 4.59 175.92 <.001
4-1 32.62 13.45 51.78 4.41 203.16 <.001
3-2 20.33 -6.20 46.87 1.98 229.94 .197
4-2 12.65 -10.91 36.20 1.39 235.88 .507
4-3 -7.69 -33.90 18.52 0.76 226.16 .873
The first three rows compare groups 2, 3 and 4 to 1: using alpha = .05, 1 and 2 have the same means, but 3 and 4 are higher. This allows you to compute the logical vector you need for multCompLetters in multcompView. Based on the example from the manual page at ?multcompView:
### Run oneway anova and store result in object 'res'
res <- oneway(y=ChickWeight$weight, x=ChickWeight$Diet, posthoc='games-howell');
### Extract dataframe with post hoc test results,
### and overwrite object 'res'
res <- res$intermediate$posthoc;
### Extract p-values and comparison 'names'
pValues <- res$p;
### Create logical vector, assuming alpha of .05
dif3 <- pValues > .05;
### Assign names (row names of post hoc test dataframe)
names(dif3) <- row.names(res);
### convert this vector to the letters to compare
### the group means (see `?multcompView` for the
### references for the algorithm):
multcompLetters(dif3);
This yields as final result:
2 3 4 1
"a" "b" "c" "abc"
This is what you need, right?
I added this functionality to userfriendlyscience, but it will be a while before this new version will be on CRAN. In the meantime, you can get the source code for this update at https://github.com/Matherion/userfriendlyscience/blob/master/R/oneway.R if you want (press the 'raw' button to get an easy-to-download version of the source code).
Note that if you need this updated version, you need to set parameter posthocLetters to TRUE, because it's FALSE by default. For example:
oneway(y=ChickWeight$weight,
x=ChickWeight$Diet,
posthoc='games-howell',
posthocLetters=TRUE);

shouldn't it be
dif3 <- pValues < .05, instead of dif3 <- pValues > .05 ?
This way the letters are the same if the distributions are 'the same' (this is, no evidence that they are different).
Please correct me if I'm interpreting this wrong.

Related

Calculating scaled score using Item Response Theory (IRT, 3PL) in R with MIRT package

I have an English Listening Comprehension test consisting of 50 items taken by about 400 students. I would like to score the test using the TOEFL scale (31-68), and it is claimed that TOEFL is scored using IRT (3PL model). I am using MIRT package in R to obtain three parameters, as I used 3PL model.
library(readxl)
TOEFL006 <- read_excel("TOEFL Prediction Mei 2021 Form 006.xlsx")
TOEFL006_LIST <- TOEFL006[, 9:58]
library(mirt)
sv <- mirt(TOEFL006_LIST, # data frame (ordinal only)
1, # 1 for unidimentional, 2 for exploratory
itemtype = '3PL') # models, i.e. Rasch, 1PL, 2PL, and 3PL
sv_coeffs <- coef(sv,
simplify = T,
IRTpars = T)
sv_coeffs
The result is shown below :
a
b
g
u
L1
2.198
0.165
0.198
1
L2
2.254
0.117
0.248
1
L3
2.103
-0.049
0.232
1
L4
4.663
0.293
0.248
1
L5
1.612
-0.374
0.001
1
...
...
...
...
...
Then I calculated the factor score using the following codes:
factor_score <- fscores(sv, method = "EAP", full.scores = T)
The first five results are shown below:
head(factor_score)
[1,] 2.1839888
[2,] 1.8886260
[3,] 0.6791995
[4,] 1.2761837
[5,] 0.8195919
[6,] -1.5257231
The problem is that I do not know how to convert the factor score into the TOEFL score of 31 - 68 (as shown on the ETS website for listening score: https://www.ets.org/s/toefl_itp/pdf/38781-TOEFL-itp-flyer_Level1_HR.pdf).
Would anyone help me show how I can do that in R, please? Or maybe there are other ways of obtaining students' scores. Your help is much appreciated.
The data can be downloaded here: https://drive.google.com/file/d/1WwwjzgxJRBByCXAjdlNkGNRCtXjMlddW/view?usp=sharing
Thank you very much for your help.
To calculate each respondents ability score use:
Theta <- fscores(mod, method = 'MAP', itemtype = '2-PL', stats.only = TRUE) head(personfit(mod, Theta=Theta))
Working through a similar project and found it in the mirt help files :)
I think you misunderstood something. It doesn't seem to be scored using IRT - according to wikipedia, it's just summed up, which is a quite normal way to score tests. Maybe I misunderstood it, but if it's done like you say, I think it uses a proprietary scoring model and you should likely contact and pay them to get you your scores.
So to get the scores, you would simply do TOEFL006_LIST$total <- RowSums(TOEFL006_LIST) and get the head using head(TOEFL006_LIST$total).
Also, I'm glad not to be a student of yours - you're putting their names out in the open in your linked dataset... If you're in Europe it would be illegal, possibly in other countries too.
I was looking for a solution to your question and I found that you can use this function to get a homologous score from a given model and theta value.
expected.test(model, as.matrix(theta))
theta should be specifically a matrix.

Error when using the Benjamini-Hochberg false discovery rate in R after Wilcoxon Rank

I have carried out a Wilcoxon rank sum test to see if there is any significant difference between the expression of 598019 genes between three disease samples vs three control samples. I am in R.
When I see how many genes have a p value < 0.05, I get 41913 altogether. I set the parameters of the Wilcoxon as follows;
wilcox.test(currRow[4:6], currRow[1:3], paired=F, alternative="two.sided", exact=F, correct=F)$p.value
(This is within an apply function, and I can provide my total code if necessary, I was a little unsure as to whether alternative="two.sided" was correct).
However, as I assume correcting for multiple comparisons using the Benjamini Hochberg False Discovery rate would lower this number, I then adjusted the p values via the following code
pvaluesadjust1 <- p.adjust(pvalues_genes, method="BH")
Re-assessing which p values are less than 0.05 via the below code, I get 0!
p_thresh1 <- 0.05
names(pvaluesadjust1) <- rownames(gene_analysis1)
output <- names(pvaluesadjust1)[pvaluesadjust1 < p_thresh1]
length(output)
I would be grateful if anybody could please explain, or direct me to somewhere which can help me understand what is going on!
Thank-you
(As an extra question, would a t-test be fine due to the size of the data, the Anderson-Darling test showed that the underlying data is not normal. I had far less genes which were less than 0.05 using this statistical test rather than Wilcoxon (around 2000).
Wilcoxon is a parametric test based on ranks. If you have only 6 samples, the best result you can get is rank 2,2,2 in disease versus 5,5,5 in control, or vice-versa.
For example, try the parameters you used in your test, on these random values below, and you that you get the same p.value 0.02534732.
wilcox.test(c(100,100,100),c(1,1,1),exact=F, correct=F)$p.value
wilcox.test(c(5,5,5),c(15,15,15),exact=F, correct=F)$p.value
So yes, with 598019 you can get 41913 < 0.05, these p-values are not low enough and with FDR adjustment, none will ever pass.
You are using the wrong test. To answer your second question, a t.test does not work so well because you don't have enough samples to estimate the standard deviation correctly. Below I show you an example using DESeq2 to find differential genes
library(zebrafishRNASeq)
data(zfGenes)
# remove spikeins
zfGenes = zfGenes[-grep("^ERCC", rownames(zfGenes)),]
head(zfGenes)
Ctl1 Ctl3 Ctl5 Trt9 Trt11 Trt13
ENSDARG00000000001 304 129 339 102 16 617
ENSDARG00000000002 605 637 406 82 230 1245
First three are controls, last three are treatment, like your dataset. To validate what I have said before, you can see that if you do a wilcoxon.test, the minimum value is 0.02534732
all_pvalues = apply(zfGenes,1,function(i){
wilcox.test(i[1:3],i[4:6],exact=F, correct=F)$p.value
})
min(all_pvalues,na.rm=T)
# returns 0.02534732
So we proceed with DESeq2
library(DESeq2)
#create a data.frame to annotate your samples
DF = data.frame(id=colnames(zfGenes),type=rep(c("ctrl","treat"),each=3))
# run DESeq2
dds = DESeqDataSetFromMatrix(zfGenes,DF,~type)
dds = DESeq(dds)
summary(results(dds),alpha=0.05)
out of 25839 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up) : 69, 0.27%
LFC < 0 (down) : 47, 0.18%
outliers [1] : 1270, 4.9%
low counts [2] : 5930, 23%
(mean count < 7)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
So you do get hits which pass the FDR cutoff. Lastly we can pull out list of significant genes
res = results(dds)
res[which(res$padj < 0.05),]

Am I following the correct procedures with the dunn.test function?

I tested differences among sampling sites in terms of abundance values using kruskal.test. However, I want to determine the multiple differences between sites.
The dunn.test function has the option to use a vector data with a categorical vector or use the formula expression as lm.
I write the function in the way to use in a data frame with many columns, but I have not found an example that confirms my procedures.
library(dunn.test)
df<-data.frame(a=runif(5,1,20),b=runif(5,1,20), c=runif(5,1,20))
kruskal.test(df)
dunn.test(df)
My results were:
Kruskal-Wallis chi-squared = 6.02, df = 2, p-value = 0.04929
Kruskal-Wallis chi-squared = 6.02, df = 2, p-value = 0.05
Comparison of df by group
Between 1 and 2 2.050609, 0.0202
Between 1 and 3 -0.141421, 0.4438
Between 2 and 3 -2.192031, 0.0142
I took a look at your code, and you are close. One issue is that you should be specifying a method to correct for multiple comparisons, using the method argument.
Correcting for Multiple Comparisons
For your example data, I'll use the Benjamini-Yekutieli variant of the False Discovery Rate (FDR). The reasons why I think this is a good performer for your data are beyond the scope of StackOverflow, but you can read more about it and other correction methods here. I also suggest you read the associated papers; most of them are open-access.
library(dunn.test)
set.seed(711) # set pseudorandom seed
df <- data.frame(a = runif(5,1,20),
b = runif(5,1,20),
c = runif(5,1,20))
dunn.test(df, method = "by") # correct for multiple comparisons using "B-Y" procedure
# Output
data: df and group
Kruskal-Wallis chi-squared = 3.62, df = 2, p-value = 0.16
Comparison of df by group
(Benjamini-Yekuteili)
Col Mean-|
Row Mean | 1 2
---------+----------------------
2 | 0.494974
| 0.5689
|
3 | -1.343502 -1.838477
| 0.2463 0.1815
alpha = 0.05
Reject Ho if p <= alpha/2
Interpreting the Results
The first row in each cell provides the Dunn's pairwise z test statistic for each comparison, and the second row provides your corrected p-values.
Notice that, once corrected for multiple comparisons, none of your pairwise tests are significant at an alpha of 0.05, which is not surprising given that each of your example "sites" was generated by exactly the same distribution. I hope this has been helpful. Happy analyzing!
P.S. In the future, you should use set.seed() if you're going to construct example dataframes using runif (or any other kind of pseudorandom number generation). Also, if you have other questions about statistical analysis, it's better to ask at: https://stats.stackexchange.com/

predicting and calculating reliability test statistics from repeated multiple regression model in r

I want to run MLR on my data using lm function in R. However, I am using data splitting cross validation method to access the reliability of the model. I intend using "sample" function to randomly split the data into the calibration and validation datasets by 80:20 ratio. This I want to repeat in say 100 times. Without setting a seed I believe the model from the different samplings will differ. I came across the function in previous post here and it solves the first part;
lst <- lapply(1:100, function(repetition) {
mod <- lm(...)
# Replace this with the code you need to train your model
return(mod)
})
save(lst, file="myfile.RData")
The concern now is how do I validate each of these 100 models and obtain reliability test statistics like RSME, ME, Rsquare for each of the models and hopefully obtain the confidence interval.
If I can get an output in the form of dataframe containing the predicted values for all the 100 models then I should proceed from there.
Any help please?
Thanks
To quickly recap your question: it seems that you want to fit an MLR model to a large training set and then use this model to make predictions on the remaining validation set. You want to repeat this process 100 times and afterwards you want to be able to analyze the characteristics and predictions of the individual models.
To accomplisch this you could just store temporary modelinformation in a datastructure during the modelgeneration and prediction process. You can then re-obtain and process all the information afterwards. You did not provide your own dataset in the description, so I will use one of R's built in datasets in order to demonstrate how this might work:
> library(car)
> Prestige <- Prestige[,c("prestige","education","income","women")]
> Prestige[,c("income")] <- log2(Prestige[,c("income")])
> head(Prestige,n=5)
prestige education income women
gov.administrators 68.8 13.11 -0.09620212 11.16
general.managers 69.1 12.26 -0.04955335 4.02
accountants 63.4 12.77 -0.11643822 15.70
purchasing.officers 56.8 11.42 -0.11972061 9.11
chemists 73.5 14.62 -0.12368966 11.68
We start by initializing some variables first. Let's say you want to create 100 models and use 80% of your data for training purposes:
nrIterations=100
totalSize <- nrow(Prestige)
trainingSize <- floor(0.80*totalSize)
We also want to create the datastructure that will be used to hold the intermediate modelinformation. R is quite a generic high level language in this regard, so we will just create a list of lists. This means that every listentry can by itself again hold another list of information. This gives us the flexibility to add whatever we need:
trainTestTuple <- list(mode="list",length=nrIterations)
We are now ready to create our models and predictions. During every loopiteration a different random trainingsubset is created while using the remaining data for testing purposes. Next, we fit our model to the trainingdata and we then use this obtained model to make predictions on the testdata. Note that we explicitly use the independent variables in order to predict the dependent variable:
for(i in 1:nrIterations)
{
trainIndices <- sample(seq_len(totalSize),size = trainingSize)
trainSet <- Prestige[trainIndices,]
testSet <- Prestige[-trainIndices,]
trainingFit <- lm(prestige ~ education + income + women, data=trainSet)
# Perform predictions on the testdata
testingForecast <- predict(trainingFit,newdata=data.frame(education=testSet$education,income=testSet$income,women=testSet$women),interval="confidence",level=0.95)
# Do whatever else you want to do (compare with actual values, calculate other stuff/metrics ...)
# ...
# add your training and testData to a tuple and add it to a list
tuple <- list(trainingFit,testingForecast) # Add whatever else you need ..
trainTestTuple[[i]] <- tuple # Add this list to the "list of lists"
}
Now, the relevant part: At the end of the iteration we put both the fitted model and the out of sample prediction results in a list. This list contains all the intermediate information that we want to save for the current iteration. We finish by putting this list in our list of lists.
Now that we are done with the modeling, we still have access to all the information we need and we can process and analyze it any way we want. We will take a look at the modeling and prediction results of model 50. First, we extract both the model and the prediction results from the list of lists:
> tuple_50 <- trainTestTuple[[50]]
> trainingFit_50 <- tuple_50[[1]]
> testingForecast_50 <- tuple_50[[2]]
We take a look at the model summary:
> summary(trainingFit_50)
Call:
lm(formula = prestige ~ education + log2(income) + women, data = trainSet)
Residuals:
Min 1Q Median 3Q Max
-15.9552 -4.6461 0.5016 4.3196 18.4882
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -287.96143 70.39697 -4.091 0.000105 ***
education 4.23426 0.43418 9.752 4.3e-15 ***
log2(income) 155.16246 38.94176 3.984 0.000152 ***
women 0.02506 0.03942 0.636 0.526875
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.308 on 77 degrees of freedom
Multiple R-squared: 0.8072, Adjusted R-squared: 0.7997
F-statistic: 107.5 on 3 and 77 DF, p-value: < 2.2e-16
We then explicitly obtain the model R-squared and RMSE:
> summary(trainingFit_50)$r.squared
[1] 0.8072008
> summary(trainingFit_50)$sigma
[1] 7.308057
We take a look at the out of sample forecasts:
> testingForecast_50
fit lwr upr
1 67.38159 63.848326 70.91485
2 74.10724 70.075823 78.13865
3 64.15322 61.284077 67.02236
4 79.61595 75.513602 83.71830
5 63.88237 60.078095 67.68664
6 71.76869 68.388457 75.14893
7 60.99983 57.052282 64.94738
8 82.84507 78.145035 87.54510
9 72.25896 68.874070 75.64384
10 49.19994 45.033546 53.36633
11 48.00888 46.134464 49.88329
12 20.14195 8.196699 32.08720
13 33.76505 27.439318 40.09079
14 24.31853 18.058742 30.57832
15 40.79585 38.329835 43.26187
16 40.35038 37.970858 42.72990
17 38.38186 35.818814 40.94491
18 40.09030 37.739428 42.44117
19 35.81084 33.139461 38.48223
20 43.43717 40.799715 46.07463
21 29.73700 26.317428 33.15657
And finally, we obtain some more detailed results about the 2nd forecasted value and the corresponding confidence intervals:
> testingPredicted_2ndprediction <- testingForecast_50[2,1]
> testingLowerConfidence_2ndprediction <- testingForecast_50[2,2]
> testingUpperConfidence_2ndprediction <- testingForecast_50[2,3]
EDIT
After rereading, it occured to me that you are obviously not splitting up the the same exact dataset each time. You are using completely different partitions of data during each iteration and they should be split up in a 80/20 fashion. However, the same solution can still be applied with minor modifications.
Also: For cross validation purposes you should probably take a look at cv.lm()
Description from the R help:
This function gives internal and cross-validation measures of predictive accuracy for multiple linear regression. (For binary logistic regression, use the CVbinary function.) The data are randomly assigned to a number of ‘folds’. Each fold is removed, in turn, while the remaining data is used to re-fit the regression model and to predict at the deleted observations.
EDIT: Reply to comment.
You can just take the means of the relevant performance metrics that you saved. For example, you can use an sapply on the trainTestTuple in order to extract the relevant elements from each sublist. sapply will return these elements as a vector from which you can calculate the mean. This should work:
mean_ME <- mean(sapply(trainTestTuple,"[[",2))
mean_MAD <- mean(sapply(trainTestTuple,"[[",3))
mean_MSE <- mean(sapply(trainTestTuple,"[[",4))
mean_RMSE <- mean(sapply(trainTestTuple,"[[",5))
mean_adjRsq <- mean(sapply(trainTestTuple,"[[",6))
Another small edit: The calculation of your MAD looks rather strange. It might be a good thing to double check if this is exactly what you want.

trouble with cbind in manova call in r

I'm trying to do a multivariate ANOVA with the manova function in R. My problem is that I'm trying to find a way to pass the list of dependent variables without typing them all in manually, as there are many and they have horrible names. My data are in a data frame where "unit" is the dependent variable (factor), and the rest of the columns are various numeric response variables. e.g.
unit C_pct Cln C_N_mol Cnmolln C_P_mol N_P_mol
1 C 48.22 3.88 53.92 3.99 3104.75 68.42
2 C 49.91 3.91 56.32 4.03 3454.53 62.04
3 C 50.75 3.93 56.96 4.04 3922.01 69.16
4 SH 50.72 3.93 46.58 3.84 2590.16 57.12
5 SH 51.06 3.93 43.27 3.77 2326.04 53.97
6 SH 48.62 3.88 40.97 3.71 2357.16 59.67
If I write the manova call as
fit <- manova(cbind(C_pct, Cln) ~ unit, data = plots)
it works fine, but I'd like to be able to pass a long list of columns without naming them one by one, something like
fit <- manova(cbind(colnames(plots[5:32])) ~ unit, data = plots)
or
fit <- manove(cbind(plots[,5:32]) ~ unit, data = plots)
I get the error
"Error in model.frame.default(formula = as.matrix(cbind(colnames(plots[5:32]))) ~ :
variable lengths differ (found for 'unit')
I'm sure it's because I'm using cbind wrong, but can't figure it out. Any help is appreciated! Sorry if the formatting is rough, this is my first question posted.
EDIT: Both ways (all 3, actually) work. thanks all!
manova, like most R modelling functions, builds its formula out of the names of the variables found in the dataset. However, when you pass it the colnames, you're technically passing the strings that represent the names of those variables. Hence the function doesn't know what to do with them, and chokes.
You can actually get around this. The LHS of the formula only has to resolve to a matrix; the use of cbind(C_pct, Cln, ...) is a way of obtaining a matrix by evaluating the names of its arguments C_pct, Cln, etc in the environment of your data frame. But if you provide a matrix to start with, then no evaluation is necessary.
fit <- manova(as.matrix(plots[, 5:32]) ~ unit, data=plots)
Some notes. The as.matrix is necessary because getting columns from a data frame like this, returns a data frame. manova won't like this, so we coerce the data frame to a matrix. Second, this works assuming you don't have an actual variable called plots inside your data frame plots. This is because, if R doesn't find a name inside your data frame, it then looks in the environment of the caller, in this case the global environment.
You can also create the matrix before fitting the model, with
plots$response <- as.matrix(plots[, 5:32])
fit <- manova(response ~ unit, data=plots)
You can build your formula as string and cast it to a formula:
responses <- paste( colnames( plots )[2:6], collapse=",")
myformula <- as.formula( paste0( "cbind(", responses , ")~ unit" ) )
manova( myformula, data = plots )
Call:
manova(myformula, data = plots)
Terms:
unit Residuals
resp 1 0.4 6.8
resp 2 0 0
resp 3 220.6 21.0
resp 4 0.1 0.0
resp 5 1715135.8 377938.1
Deg. of Freedom 1 4
Residual standard error: 1.3051760.027080132.293640.04966555307.3834
Estimated effects may be unbalanced

Resources