I'm trying to access individual elements of an object returned by the TukeyHSD function. I can see how to access the individual values, but how do I access the labels?
$`Auto$CargoSpace`
                 diff       lwr        upr        p adj
wagon-SUV   -3747.333 -7664.507  -980.0801 6.855348e-03
trunk-SUV   -4792.333 -5621.311 -2371.3357 2.065806e-05
trunk-wagon  -968.000 -3823.523 2125.54328 7.410039e-01
I'd like to be able to access each row/column combination the way I can with a tibble or dataframe. That way I can add an interpretation later in the code. Let's say I wanted to end up with this result:
CargoSpace p adj Analysis
wagon-SUV 6.85 Unlikely to produce a benefit
trunk-SUV 2.06 Worth investigating
Extract the matrix with $, then subset its rows with grep (for partial matches) or %in% (for exact matches). Build a data.frame from the row names of that subset together with its "p adj" column, then create the 'Analysis' column from the value of 'p.adj':
out1 <- out$`Auto$CargoSpace`
out2 <- out1[grep("SUV$", row.names(out1)), ]
out3 <- data.frame(CargoSpace = row.names(out2), p.adj = out2[, "p adj"])
out3$Analysis <- ifelse(out3$p.adj < 0.0001, "Worth investigating", "Unlikely to produce a benefit")
Reproducible example
fm1 <- aov(breaks ~ wool + tension, data = warpbreaks)
out <- TukeyHSD(fm1, "tension", ordered = TRUE)
out$tension
# diff lwr upr p adj
#M-H 4.722222 -4.6311985 14.07564 0.447421021
#L-H 14.722222 5.3688015 24.07564 0.001121788
#L-M 10.000000 0.6465793 19.35342 0.033626219
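As a minimal sketch of the same idea applied to the reproducible example: the comparison labels are simply the row names of the matrix stored in out$tension. The helper name tukey_to_df and the 0.05 cutoff are assumptions made for illustration, not part of the original answer.

```r
# Fit the model and run Tukey's HSD, as in the reproducible example
fm1 <- aov(breaks ~ wool + tension, data = warpbreaks)
out <- TukeyHSD(fm1, "tension", ordered = TRUE)

# out$tension is a numeric matrix; the labels are its row names
tukey_mat <- out$tension

# Hypothetical helper: matrix -> data frame with a label column.
# alpha = 0.05 is an illustrative threshold, not a recommendation.
tukey_to_df <- function(m, alpha = 0.05) {
  res <- data.frame(
    comparison = rownames(m),
    p.adj      = m[, "p adj"],
    row.names  = NULL
  )
  res$Analysis <- ifelse(res$p.adj < alpha,
                         "Worth investigating",
                         "Unlikely to produce a benefit")
  res
}

tukey_df <- tukey_to_df(tukey_mat)
print(tukey_df)

# A single cell can also be read straight from the matrix by label
tukey_mat["L-H", "p adj"]
```

Once the labels live in an ordinary column, each row/column combination can be addressed the same way as in any other data frame.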
I am using gamm4::gamm4 to model longitudinal change.
I am trying to use the modelsummary::modelsummary function to create an output table of the following results:
I would like to add t-values and std.error to the output of the fixed effects, and remove the empty tags values from the random effects
model_lmer <- gamm4(Y ~ Tract + s(Age, by = Tract, k = 10) + Sex,
data = (DF1),
random = ~ (0 + Tract | ID))
modelsummary(model_lmer$mer,
statistic = c("s.e. = {std.error}",
"t = {statistic}"))
But I am struggling to write the correct syntax to remove the "t" and "s.e." from the random effects output.
This is kind of tricky, actually. The issue is that modelsummary()
automatically drops empty rows when they are filled with NA or an
empty string "". However, since glue strings can include arbitrary
text, it is hard to think of a general way to figure out if the row is
empty or not, because modelsummary() cannot know ex ante what
constitutes an empty string.
If you have an idea on how this check could be implemented, please report it
on Github:
https://github.com/vincentarelbundock/modelsummary
In the meantime, you could use the powerful tidy_custom.CLASSNAME
mechanism
to customize the statistic and p.value statistics directly instead
of through a glue string:
library(gamm4)
library(modelsummary)
# simulate
x <- runif(100)
fac <- sample(1:20,100,replace=TRUE)
eta <- x^2*3 + fac/20; fac <- as.factor(fac)
y <- rpois(100,exp(eta))
# fit
mod <- gamm4(y~s(x),family=poisson,random=~(1|fac))
# customize
tidy_custom.glmerMod <- function(x) {
out <- parameters::parameters(x)
out <- insight::standardize_names(out, style = "broom")
out$statistic <- sprintf("t = %.3f", out$statistic)
out$p.value <- sprintf("p = %.3f", out$p.value)
out
}
# summarize
modelsummary(mod$mer,
statistic = c("{statistic}", "{p.value}"))
                Model 1
X(Intercept)    1.550
                t = 17.647
                p = 0.000
Xs(x)Fx1        0.855
                t = 4.445
                p = 0.000
Num.Obs.        100
RMSE            2.49
Note that I used simple glue strings in statistic = "{p.value}", otherwise they would be wrapped up in parentheses, as is default for standard errors.
I'm trying to write a loop that performs anova and TukeyHSD on my data across 3 samples for each "Label". Label in this case is a metabolic pathway. Data that goes into it are the genes expressed in said metabolic pathway.
For the test data, I created a small df that reproduces my error. In my actual data, I'm hoping to perform this across 2 factors (not just one), and I have thousands more rows.
library(reshape2)
df<-melt(data.frame(sample1 = c(0,0,3,4,5,1),sample2 = c(1,0,0,4,5,0),sample3 = c(0,0,0,8,0,0),Label = c("TCA cycle", "TCA cycle","TCA cycle", "Glycolysis","Glycolysis","Glycolysis"),Gene = c("k1","k2","k3","k4","k5","k6")))
My approach (annotated the best way I can!):
fxn<-unique(df$Label) #create list
for (i in 1:length(fxn)){
if (!exists("data")){ #if the "data" dataframe does not exist, start here!
depth<-aov(df$value[df$Label==fxn[i]]~df$variable[df$Label==fxn[i]]) #perform anova on my "df", gene values as a factor of samples (for each "fxn")
hsd<-TukeyHSD(depth) #calculate tukeyHSD
data<-as.data.frame(hsd$`df$variable[df$Label == fxn[i]]`) #grab dataframe of tukey HSD output
data$Label<-fxn[i] #add in the Label name as a column (so it looks like my original df, but with TukeyHSD output for each pairwise comparison
data<-as.data.frame(data)
}
if (exists("data")){ #if "data" exists, do this:
tmpdepth<-aov(df$value[df$Label==fxn[i]]~df$variable[df$Label==fxn[i]])
tmphsd<-TukeyHSD(tmpdepth)
tmpdata<-as.data.frame(tmphsd$`df$variable[df$Label == fxn[i]]`)
tmpdata$Label<-fxn[i]
tmpdata<-as.data.frame(tmpdata)
data<-rbind(data,tmpdata) #combine with original data
data<-as.data.frame
rm(tmpdata)
}
}
I'd like my output to look like this:
diff lwr upr p adj Label
sample2-sample1 -0.3333333 -8.600189 7.933522 0.9916089 Glycolysis
sample3-sample1 -0.6666667 -8.933522 7.600189 0.9669963 Glycolysis
sample3-sample2 -0.3333333 -8.600189 7.933522 0.9916089 Glycolysis
but the Label column has all the factors that went into "fxn".
Errors:
Error in rep(xi, length.out = nvar) :
attempt to replicate an object of type 'closure'
You forgot the second data in the last line before rm(tmpdata). It should be:
data<-as.data.frame(data)
In my implementation I changed your code as follows:
datav <- data.frame(diff = double(),
lwr = double(),
upr = double(),
'p adj' = double(),
'Label' = character())
for (fxn in unique(df$Label)){
depth <- aov(df$value[df$Label==fxn] ~ df$variable[df$Label==fxn])
hsd <- TukeyHSD(depth)
tmp <- as.data.frame(hsd$`df$variable[df$Label == fxn]`)
tmp$Label <- fxn
datav <- rbind(datav, tmp)
}
Because the data.frame is initialized beforehand, you do not need the if statements. Also, data is the name of a function in R, so I renamed the variable data to datav.
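The same result can also be sketched without growing a data frame inside a loop, by splitting the data by Label and binding the pieces at the end. This is only an illustration under a few assumptions: the long-format data is rebuilt by hand so the block runs without reshape2, the names long and tukey_by_label are hypothetical, and the model uses a formula with data = so that the Tukey component is simply called "variable".

```r
# Long-format data equivalent to the melted test data (built without reshape2)
wide <- data.frame(
  sample1 = c(0, 0, 3, 4, 5, 1),
  sample2 = c(1, 0, 0, 4, 5, 0),
  sample3 = c(0, 0, 0, 8, 0, 0),
  Label   = rep(c("TCA cycle", "Glycolysis"), each = 3)
)
long <- data.frame(
  Label    = rep(wide$Label, times = 3),
  variable = factor(rep(c("sample1", "sample2", "sample3"), each = nrow(wide))),
  value    = c(wide$sample1, wide$sample2, wide$sample3)
)

# Hypothetical helper: one aov + TukeyHSD per Label, stacked into one data frame
tukey_by_label <- function(d) {
  pieces <- lapply(split(d, d$Label), function(sub) {
    fit <- aov(value ~ variable, data = sub)
    res <- as.data.frame(TukeyHSD(fit, "variable")$variable)
    res$comparison <- rownames(res)  # keep the pairwise labels as a column
    res$Label <- sub$Label[1]
    res
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

tukey_all <- tukey_by_label(long)
print(tukey_all)
```

Keeping the comparison in its own column avoids the mangled row names that rbind produces when the same comparison appears for several labels.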
I would like to apply T test in R within a loop
Groups Length Size Diet place
A 2.4048381 0.7474989 1.6573392 334.3273456
A 2.72500485 0.86392165 1.8610832 452.5593152
A 1.396782867 0.533330367 0.8634525 225.5998728
B 1.3888505 0.46478175 0.92406875 189.9576476
B 1.38594795 0.60068945 0.7852585 298.3744962
B 2.53491245 0.95608005 1.5788324 303.9052525
I tried this code with loop, but it is not working:
for (i in 2:4){
t.test(table[,c(i)] ~ table$Groups, conf.level = 0.95)
}
Can anyone help me with this?
Thanks!
Your code computes the t-tests (three of them, since 2:4 covers columns 2 to 4; use 2:5 to include place as well), but the results are lost because you don't do anything with them. Try the following:
info <- read.table(header=TRUE, text="Groups Length Size Diet place
A 2.4048381 0.7474989 1.6573392 334.3273456
A 2.72500485 0.86392165 1.8610832 452.5593152
A 1.396782867 0.533330367 0.8634525 225.5998728
B 1.3888505 0.46478175 0.92406875 189.9576476
B 1.38594795 0.60068945 0.7852585 298.3744962
B 2.53491245 0.95608005 1.5788324 303.9052525")
results <- list()
for (i in 2:4){
results[[i]] <- t.test(info[,i] ~ info$Groups, conf.level = 0.95)
}
print(results)
When you work interactively in the REPL/console, calling t.test computes the results and returns them, and the console prints whatever is returned. In a script that you source, or inside a for loop, t.test still returns its results but they are not printed. This is why I put them into a list and printed the list afterwards.
By the way, I stored your data as info rather than table. R copes fine with variable names that are also function names, but every now and then you will have a hard time reading the error messages, so avoid naming variables table, matrix, c, or df.
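If a compact table is more useful than a printed list, the stored tests can be summarized afterwards. The following is a rough sketch under that assumption; the names measures, tests, and t_summary are made up for the example, and it loops over all four measurement columns rather than only columns 2 to 4.

```r
info <- read.table(header = TRUE, text = "Groups Length Size Diet place
A 2.4048381 0.7474989 1.6573392 334.3273456
A 2.72500485 0.86392165 1.8610832 452.5593152
A 1.396782867 0.533330367 0.8634525 225.5998728
B 1.3888505 0.46478175 0.92406875 189.9576476
B 1.38594795 0.60068945 0.7852585 298.3744962
B 2.53491245 0.95608005 1.5788324 303.9052525")

# Every column except the grouping column
measures <- setdiff(names(info), "Groups")

# One Welch t-test per column, stored in a named list
tests <- lapply(measures, function(v) {
  t.test(info[[v]] ~ info$Groups, conf.level = 0.95)
})
names(tests) <- measures

# Pull the pieces of interest out of each test object
t_summary <- data.frame(
  variable  = measures,
  statistic = vapply(tests, function(tt) unname(tt$statistic), numeric(1)),
  df        = vapply(tests, function(tt) unname(tt$parameter), numeric(1)),
  p.value   = vapply(tests, function(tt) tt$p.value, numeric(1)),
  row.names = NULL
)
print(t_summary)
```

Naming the list elements by column also avoids the empty first slot that results[[i]] leaves behind when the loop starts at 2.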
Using apply functions you could also do:
res<- cbind(
do.call(rbind,apply(info[,-1],2,function(cv)t.test(cv ~ info$Groups, conf.level = 0.95)
[c("statistic","parameter","p.value")]))
,
t(apply(info[,-1],2,function(cv)unlist(t.test(cv ~ info$Groups, conf.level = 0.95)
[c("conf.int","estimate")])))
)
res
> res
statistic parameter p.value conf.int1 conf.int2 estimate.mean in group A estimate.mean in group B
Length 0.7327329 3.991849 0.5044236 -1.13263 1.943907 2.175542 1.769904
Size 0.2339013 3.467515 0.8282072 -0.47739 0.5595231 0.714917 0.6738504
Diet 0.9336103 3.823748 0.4056203 -0.7396173 1.468761 1.460625 1.096053
place 0.9748978 3.162223 0.398155 -159.4359 306.2686 337.4955 264.0791
My goal is to create a function that, when looped over multiple variables of a data frame, will return a new data frame containing the percents and 95% confidence intervals for each level of each variable.
As an example, if I applied this function to "cyl" and "am" from the mtcars data frame, I would want this as the final result:
variable level ci.95
1 cyl 4 34.38 (19.50, 53.11)
2 cyl 6 21.88 (10.35, 40.45)
3 cyl 8 43.75 (27.10, 61.94)
4 am 0 59.38 (40.94, 75.5)
5 am 1 40.62 (24.50, 59.06)
So far, I have a function that seems to work for a single variable; however, I have two issues that I'm hoping the community can help me with:
General R-ifying my code. I'm still an R novice. I've read enough posts to know that R enthusiasts generally discourage using for loops, but I still really struggle with using the apply functions (which seem to be the alternative to for loops in most cases).
Applying this function to a list of variables - resulting in a single data frame containing the returned values from the function for each level of each variable.
Here's where I'm at with my code so far:
t1.props <- function(x, data = NULL) {
# Grab dataframe and/or variable name
if(!missing(data)){
var <- data[,deparse(substitute(x))]
} else {
var <- x
}
# Grab variable name for use in output
var.name <- substitute(x)
# Omit observations with missing data
var.clean <- na.omit(var)
# Number of nonmissing observations
n <- length(var.clean)
# Grab levels of variable
levels <- sort(unique(var.clean))
# Create an empty data frame to store values
out <- data.frame(variable = NA,
level = NA,
ci.95 = NA)
# Estimate prop, se, and ci for each level of the variable
for(i in seq_along(levels)) {
prop <- paste0("prop", i)
se <- paste0("se", i)
log.prop <- paste0("log.trans", i)
log.se <- paste0("log.se", i)
log.l <- paste0("log.l", i)
log.u <- paste0("log.u", i)
lcl <- paste0("lcl", i)
ucl <- paste0("ucl", i)
# Find the proportion for each level of the variable
assign(prop, sum(var.clean == levels[i]) / n)
# Find the standard error for each level of the variable
assign(se, sd(var.clean == levels[i]) /
sqrt(length(var.clean == levels[i])))
# Perform a logit transformation of the original percentage estimate
assign(log.prop, log(get(prop)) - log(1 - get(prop)))
# Transform the standard error of the percentage to a standard error of its
# logit transformation
assign(log.se, get(se) / (get(prop) * (1 - get(prop))))
# Calculate the lower and upper confidence bounds of the logit
# transformation
assign(log.l,
get(log.prop) -
qt(.975, (length(var.clean == levels[i]) - 1)) * get(log.se))
assign(log.u,
get(log.prop) +
qt(.975, (length(var.clean == levels[i]) - 1)) * get(log.se))
# Finally, perform inverse logit transformations to get the confidence bounds
assign(lcl, exp(get(log.l)) / (1 + exp(get(log.l))))
assign(ucl, exp(get(log.u)) / (1 + exp(get(log.u))))
# Create a combined 95% CI variable for easy copy/paste into Word tables
ci.95 <- paste0(round(get(prop) * 100, 2), " ",
"(", sprintf("%.2f", round(get(lcl) * 100, 2)), ",", " ",
round(get(ucl) * 100, 2), ")")
# Populate the "out" data frame with values
out <- rbind(out, c(as.character(var.name), levels[i], ci.95))
}
# Remove first (empty) row from out
# But only in the first iteration
if (is.na(out[1,1])) {
out <- out[-1, ]
rownames(out) <- 1:nrow(out)
}
out
}
data(mtcars)
t1.props(cyl, mtcars)
I appreciate any help or advice you have to offer.
You can also keep the function mainly intact and use lapply over it:
vars <- c("cyl", "am")
lapply(vars, t1.props, data=mtcars)
[[1]]
variable level ci.95
1 cyl 4 34.38 (19.50, 53.11)
2 cyl 6 21.88 (10.35, 40.45)
3 cyl 8 43.75 (27.10, 61.94)
[[2]]
variable level ci.95
1 am 0 59.38 (40.94, 75.5)
2 am 1 40.62 (24.50, 59.06)
And combine them all into one data frame with:
lst <- lapply(vars, t1.props, data=mtcars)
do.call(rbind,lst)
Data
You must simplify the var and var.name assignments to:
t1.props <- function(x, data = NULL) {
# Grab dataframe and/or variable name
if(!missing(data)){
var <- data[,x]
} else {
var <- x
}
# Grab variable name for use in output
var.name <- x
# Omit observations with missing data
var.clean <- na.omit(var)
# Number of nonmissing observations
n <- length(var.clean)
# Grab levels of variable
levels <- sort(unique(var.clean))
# Create an empty data frame to store values
out <- data.frame(variable = NA,
level = NA,
ci.95 = NA)
# Estimate prop, se, and ci for each level of the variable
for(i in seq_along(levels)) {
prop <- paste0("prop", i)
se <- paste0("se", i)
log.prop <- paste0("log.trans", i)
log.se <- paste0("log.se", i)
log.l <- paste0("log.l", i)
log.u <- paste0("log.u", i)
lcl <- paste0("lcl", i)
ucl <- paste0("ucl", i)
# Find the proportion for each level of the variable
assign(prop, sum(var.clean == levels[i]) / n)
# Find the standard error for each level of the variable
assign(se, sd(var.clean == levels[i]) /
sqrt(length(var.clean == levels[i])))
# Perform a logit transformation of the original percentage estimate
assign(log.prop, log(get(prop)) - log(1 - get(prop)))
# Transform the standard error of the percentage to a standard error of its
# logit transformation
assign(log.se, get(se) / (get(prop) * (1 - get(prop))))
# Calculate the lower and upper confidence bounds of the logit
# transformation
assign(log.l,
get(log.prop) -
qt(.975, (length(var.clean == levels[i]) - 1)) * get(log.se))
assign(log.u,
get(log.prop) +
qt(.975, (length(var.clean == levels[i]) - 1)) * get(log.se))
# Finally, perform inverse logit transformations to get the confidence bounds
assign(lcl, exp(get(log.l)) / (1 + exp(get(log.l))))
assign(ucl, exp(get(log.u)) / (1 + exp(get(log.u))))
# Create a combined 95% CI variable for easy copy/paste into Word tables
ci.95 <- paste0(round(get(prop) * 100, 2), " ",
"(", sprintf("%.2f", round(get(lcl) * 100, 2)), ",", " ",
round(get(ucl) * 100, 2), ")")
# Populate the "out" data frame with values
out <- rbind(out, c(as.character(var.name), levels[i], ci.95))
}
# Remove first (empty) row from out
# But only in the first iteration
if (is.na(out[1,1])) {
out <- out[-1, ]
rownames(out) <- 1:nrow(out)
}
out
}
The nice thing about the functions you're using is that almost all of them are already vectorized (sd is the exception; qt is in fact vectorized over its arguments, so the Vectorize wrapper below is harmless but not required). This means you can pass vectors to them without needing to write a single loop. I left out the parts of your function that deal with preparing the input and prettying up the output.
t1.props <- function(var, data=mtcars) {
N <- nrow(data)
levels <- names(table(data[,var]))
count <- unclass(table(data[,var])) # counts
prop <- count / N # proportions
se <- sqrt(prop * (1-prop)/(N-1)) # standard errors of props.
lprop <- log(prop) - log(1-prop) # logged prop
lse <- se / (prop*(1-prop)) # logged se
stat <- Vectorize(qt, "df")(0.975, N-1) # tstats
llower <- lprop - stat*lse # log lower
lupper <- lprop + stat*lse # log upper
lower <- exp(llower) / (1 + exp(llower)) # lower ci
upper <- exp(lupper) / (1 + exp(lupper)) # upper ci
data.frame(variable=var,
level=levels,
perc=100*prop,
lower=100*lower,
upper=100*upper)
}
So, the only explicit applying/looping comes when you apply the function to multiple variables as follows
## Apply your function to two variables
do.call(rbind, lapply(c("cyl", "am"), t1.props))
# variable level perc lower upper
# 4 cyl 4 34.375 19.49961 53.11130
# 6 cyl 6 21.875 10.34883 40.44691
# 8 cyl 8 43.750 27.09672 61.94211
# 0 am 0 59.375 40.94225 75.49765
# 1 am 1 40.625 24.50235 59.05775
As for the loop in your code, it is not particularly important in terms of efficiency, but you can see how much easier code is to read when it's concise, and the apply functions offer a lot of simple one-line solutions.
I think the most important thing to change in your code is the use of assign and get. Instead, you can store variables in lists or another data structure, and use setNames, names<-, or names(...) <- to name the components when needed.
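To illustrate that last point, here is a small sketch of per-level statistics kept in a named list instead of variables created with assign and read back with get. The names per_level and props are hypothetical, and only the proportion and its standard error are computed, following the formulas in the question.

```r
x  <- mtcars$cyl
lv <- sort(unique(x))
n  <- length(x)

# One list element per level, named by the level itself
per_level <- lapply(setNames(lv, lv), function(l) {
  is_level <- x == l
  list(
    prop = mean(is_level),          # proportion of observations at this level
    se   = sd(is_level) / sqrt(n)   # standard error, as in the question
  )
})

# Look a value up by name instead of get("prop1")
per_level[["4"]]$prop

# Collapse one statistic across levels into a named numeric vector
props <- vapply(per_level, function(s) s$prop, numeric(1))
print(props)
```

Because everything lives in one object, it can be passed to other functions or converted to a data frame without relying on variable names built from strings.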
Is there a package in R that produces tables like this one:
Basically, given a dataset of factors and variables, produces a nicely-formatted table with pairwise t-tests and letters indicating significance of differences?
Dummy data:
var = c(rnorm(100,mean=1,sd=2),rnorm(30,mean=-1,sd=2),rnorm(50,mean=0,sd=4))
factor = as.factor(c(rep(1,100),rep(2,30),rep(3,50)))
How would I take that dummy data and turn it into (one column of) the example table?
Disclaimer: this is not an exact answer, but it could be a good start. Use the tables package.
library(Hmisc)
library(tables)
stderr <- function(x) sd(x)/sqrt(length(x))
latex(
object = tabular((Species+1) ~ All(iris)* PlusMinus(mean, stderr, digits=1), data=iris)
, title = "Test"
, file=""
, size = "small"
, cdot = 3
, here = TRUE
, booktabs=TRUE
, center="centering"
)
Output
Edited
You can obtain letters using multcompView package.
Code
library(multcompView)
Sepal.Length.fm <- aov(Sepal.Length~Species, data=iris)
Sepal.Length.Letters <- data.frame("Letters"=multcompLetters(extract_p(TukeyHSD(Sepal.Length.fm)$"Species"))$"Letters")
Sepal.Width.fm <- aov(Sepal.Width~Species, data=iris)
Sepal.Width.Letters <- data.frame("Letters"=multcompLetters(extract_p(TukeyHSD(Sepal.Width.fm)$"Species"))$"Letters")
Petal.Length.fm <- aov(Petal.Length~Species, data=iris)
Petal.Length.Letters <- data.frame("Letters"=multcompLetters(extract_p(TukeyHSD(Petal.Length.fm)$"Species"))$"Letters")
Petal.Width.fm <- aov(Petal.Width~Species, data=iris)
Petal.Width.Letters <- data.frame("Letters"=multcompLetters(extract_p(TukeyHSD(Petal.Width.fm)$"Species"))$"Letters")
Letters <- cbind(Sepal.Length.Letters, Sepal.Width.Letters, Petal.Length.Letters, Petal.Width.Letters)
Output
Letters Letters Letters Letters
versicolor a a a a
virginica b b b b
setosa c c c c
You can do this with the aggregate function.
df = as.data.frame(cbind(var,factor))
aggregate(df$var, by=list(df$factor), t.test)
Group.1 x
1 1 4.939821
2 2 -2.128924
3 3 -1.431482
If you want several statistics to show up as separate columns, you can pass a function of your own as the final argument (FUN).
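A minimal sketch of such a function follows. The names dat, t_stats, and res_flat are assumptions, the columns are called value and group to avoid masking the base functions var and factor, and the seed is fixed only so the example is repeatable. Like the call above, it runs a one-sample t-test of each group's mean against zero, not pairwise comparisons.

```r
set.seed(1)  # arbitrary seed, only for repeatability
dat <- data.frame(
  value = c(rnorm(100, mean = 1, sd = 2),
            rnorm(30, mean = -1, sd = 2),
            rnorm(50, mean = 0, sd = 4)),
  group = factor(c(rep(1, 100), rep(2, 30), rep(3, 50)))
)

# Custom summary: return several numbers per group as a named vector
t_stats <- function(x) {
  tt <- t.test(x)
  c(t = unname(tt$statistic), p.value = tt$p.value)
}

# aggregate stores the named vector as a matrix column called x
res <- aggregate(dat$value, by = list(group = dat$group), FUN = t_stats)

# Flatten the matrix column into ordinary columns
res_flat <- do.call(data.frame, res)
print(res_flat)
```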