I'm trying to loop TukeyHSD tests through each column of a dataframe and compare treatment levels. Here's some mock data that's a simplified version of the data I have (my data has ~350 columns):
df1 <- data.frame(cmpd1 = c(500, 436, 1, 1, 1, 1),
                  cmpd2 = c(1, 1, 1, 1, 1, 253),
                  cmpd3 = c(1, 1, 300, 57, 150, 260),
                  treatment = c("W", "W", "A", "A", "D", "D"))
I've followed the suggestions in this post successfully and have created a loop that runs ANOVAs for each column, outputting only columns that had a p-value <0.07 for the treatment comparisons:
# specific compound differences
for (i in 1:3) {
  column <- names(df1[i])
  anova <- broom::tidy(aov(df1[, i] ~ treatment, data = df1))
  # only want aov with P < 0.07 printed
  if (anova$p.value[1] < 0.07) {
    print(column)
    print(anova)
  }
}
However, I'd like to run TukeyHSD tests on all columns in a similar way, outputting only the Tukey results that have a p-value < 0.07 for any given treatment comparison. I tried something like this, but it doesn't work, giving the error "Error in if (tukey[["p adj"]] < 0.07) { : argument is of length zero":
for (i in 1:3) {
  column <- names(df1[i])
  anova <- aov(df1[, i] ~ treatment, data = df1)
  tukey <- TukeyHSD(anova)
  # only want tukey with P < 0.07 printed
  if (tukey[["p adj"]] < 0.07) {
    print(column)
    print(tukey)
  }
}
I can't figure out the right way to have it output only Tukey tests that contain a p-value < 0.07. My ideal output would be something like this (with made-up values):
$cmpd1
diff lwr upr p adj
D-A 2.728484e-12 -29169.59 29169.59 1.0000000
W-A 3.637979e-12 -32278.10 32278.10 0.0001
W-D 1.484573e+04 -13620.88 43312.34 0.056
The output of TukeyHSD is a list, as is evident from its structure:
str(TukeyHSD(aov(df1[,1] ~ treatment, data = df1)))
List of 1
$ treatment: num [1:3, 1:4] -2.84e-14 4.67e+02 4.67e+02 -1.09e+02 3.58e+02 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:3] "D-A" "W-A" "W-D"
.. ..$ : chr [1:4] "diff" "lwr" "upr" "p adj"
- attr(*, "class")= chr [1:2] "TukeyHSD" "multicomp"
- attr(*, "orig.call")= language aov(formula = df1[, 1] ~ treatment, data = df1)
- attr(*, "conf.level")= num 0.95
- attr(*, "ordered")= logi FALSE
We can extract the list element 'treatment', which is a matrix, so [[ or $ with "p adj" won't work on it. Instead, index the matrix with [, using the column name after the , that separates the row and column indices, and wrap the result in any(), because there are 3 values for 'p adj' and if expects a single TRUE/FALSE logical input:
for (i in 1:3) {
  column <- names(df1[i])
  anova <- aov(df1[, i] ~ treatment, data = df1)
  tukey <- TukeyHSD(anova)
  # only want tukey with P < 0.07 printed
  if (any(tukey$treatment[, "p adj"] < 0.07)) {
    print(column)
    print(setNames(tukey, column))
  }
}
Output:
[1] "cmpd1"
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = df1[, i] ~ treatment, data = df1)
$cmpd1
diff lwr upr p adj
D-A -2.842171e-14 -109.1823 109.1823 1.000000
W-A 4.670000e+02 357.8177 576.1823 0.000839
W-D 4.670000e+02 357.8177 576.1823 0.000839
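As an aside, the same filtering can be done without an explicit for loop. This is a sketch of one possible alternative (not part of the original answer), collecting the significant Tukey tables in a named list:
sig <- lapply(names(df1)[1:3], function(col) {
  tk <- TukeyHSD(aov(df1[[col]] ~ treatment, data = df1))
  if (any(tk$treatment[, "p adj"] < 0.07)) tk$treatment else NULL
})
names(sig) <- names(df1)[1:3]
Filter(Negate(is.null), sig)  # keep only compounds with a comparison below 0.07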
I want to apply mk.test() to the large dataset and get results in a table/matrix.
My data look something like this:
Column A   Column B   ...   ColumnXn
1          2          ...   5
...        ...        ...   ...
3          4          ...   7
So far I managed to perform mk.test() for all columns and print the results:
for (i in 1:ncol(data)) {
  print(mk.test(as.numeric(unlist(data[, i]))))
}
I got all the results printed:
.....
Mann-Kendall trend test
data: as.numeric(unlist(data[, i]))
z = 4.002, n = 71, p-value = 6.28e-05
alternative hypothesis: true S is not equal to 0
sample estimates:
S varS tau
7.640000e+02 3.634867e+04 3.503154e-01
Mann-Kendall trend test
data: as.numeric(unlist(data[, i]))
z = 3.7884, n = 71, p-value = 0.0001516
alternative hypothesis: true S is not equal to 0
sample estimates:
S varS tau
7.240000e+02 3.642200e+04 3.283908e-01
....
However, I was wondering if it is possible to get the results in a table/matrix format that I could save to Excel.
Something like this:
Column     z        p-value     S             varS          tau
Column A   4.002    6.28e-05    7.640000e+02  3.634867e+04  3.503154e-01
...        ...      ...         ...           ...           ...
ColumnXn   3.7884   0.0001516   7.240000e+02  3.642200e+04  3.283908e-01
Is it possible to do so?
I would really appreciate your help.
Instead of printing the test results you can store them in a variable. This variable holds the various test statistics and values. To find the names of these properties, you can run the test on the first column and inspect the result with str():
testres = mk.test(as.numeric(unlist(data[ , 1])))
str(testres)
List of 9
$ data.name : chr "as.numeric(unlist(data[, 1]))"
$ p.value : num 0.296
$ statistic : Named num 1.04
..- attr(*, "names")= chr "z"
$ null.value : Named num 0
..- attr(*, "names")= chr "S"
$ parameter : Named int 3
..- attr(*, "names")= chr "n"
$ estimates : Named num [1:3] 3 3.67 1
..- attr(*, "names")= chr [1:3] "S" "varS" "tau"
$ alternative: chr "two.sided"
$ method : chr "Mann-Kendall trend test"
$ pvalg : num 0.296
- attr(*, "class")= chr "htest"
Here you see that, for example, the z-value is stored in testres$statistic, and similarly for the other properties. The values of S, varS and tau are not separate properties; they are grouped together in the named vector testres$estimates.
In the code you can create an empty data frame and, in the loop, append the results of each run to it. At the end you can write it out with write.csv():
library(trend)
# sample data
mydata = data.frame(ColumnA = c(1,3,5), ColumnB = c(2,4,1), ColumnXn = c(5,7,7))
# empty dataframe to store results
results = data.frame(matrix(ncol=6, nrow=0))
colnames(results) <- c("Column", "z", "p-value", "S", "varS", "tau")
for (i in 1:ncol(mydata)) {
  # store test results in variable
  testres = mk.test(as.numeric(unlist(mydata[, i])))
  # extract elements of result
  testvars = c(colnames(mydata)[i],   # column
               testres$statistic,     # z
               testres$p.value,       # p-value
               testres$estimates[1],  # S
               testres$estimates[2],  # varS
               testres$estimates[3])  # tau
  # add to results dataframe
  results[nrow(results) + 1, ] <- testvars
}
write.csv(results, "mannkendall.csv", row.names=FALSE)
The resulting csv file can be opened in Excel.
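If you prefer to avoid growing the data frame row by row, a functional variant builds the whole table in one step. This is a sketch under the same assumptions about the mk.test result structure shown above:
results2 <- do.call(rbind, lapply(names(mydata), function(col) {
  testres <- mk.test(as.numeric(mydata[[col]]))
  data.frame(Column = col,
             z = unname(testres$statistic),
             `p-value` = testres$p.value,
             S = unname(testres$estimates["S"]),
             varS = unname(testres$estimates["varS"]),
             tau = unname(testres$estimates["tau"]),
             check.names = FALSE)
}))
write.csv(results2, "mannkendall.csv", row.names = FALSE)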
Background
I want to plot the hazard ratio over time, including its confidence intervals, of a survival dataset. As an example, I will take a simplified dataset from the survival package: the colon dataset.
library(survival)
library(tidyverse)
# Colon survival dataset
data <- colon %>%
  filter(etype == 2) %>%
  select(c(id, rx, status, time)) %>%
  filter(rx == "Obs" | rx == "Lev+5FU") %>%
  mutate(rx = factor(rx))
The dataset contains patients that received a treatment (i.e., "Lev+5FU") and patients that did not (i.e., "Obs"). The survival curves are as follows:
fit <- survfit(Surv(time, status) ~ rx, data = data )
plot(fit)
Attempt
Using the cox.zph function, you can plot the hazard ratio of a cox model.
cox <- coxph(Surv(time, status) ~ rx, data = data)
plot(cox.zph(cox))
However, I want to plot the hazard ratio including 95% CI for this survival dataset using ggplot.
Question(s)
How do you extract the hazard ratio data and the 95% CIs from this cox.zph object to plot them in ggplot?
Are there other R packages that enable doing the same in a more convenient way?
Note: it’s important to recognize the correction by Dion Groothof. The lines and CIs are not really hazard ratios; they are estimates and bounds around time-varying log-hazard-ratios. You would need to exponentiate to get HRs.
The values are in the result returned from cox.zph:
str(cox.zph(cox))
#----------------------
List of 7
$ table : num [1:2, 1:3] 1.188 1.188 1 1 0.276 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:2] "rx" "GLOBAL"
.. ..$ : chr [1:3] "chisq" "df" "p"
$ x : num [1:291] 0 0.00162 0.00323 0.00485 0.00646 ...
$ time : num [1:291] 23 34 45 52 79 113 125 127 138 141 ...
$ y : num [1:291, 1] 2.09 2.1 2.1 2.1 2.11 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:291] "23" "34" "45" "52" ...
.. ..$ : chr "rx"
$ var : num [1, 1] 4.11
$ transform: chr "km"
$ call : language cox.zph(fit = cox)
- attr(*, "class")= chr "cox.zph"
To get a plot with any of the paradigms (base, lattice, or ggplot2), use time as the x axis, x for the solid line, and y for the points:
z <- cox.zph(cox)
ggdf <- data.frame(unclass(z)[c("time", "x", "y")])
ggplot(data = ggdf, aes(x = time, y = -x)) +
  geom_line() + ylim(range(z$y)) +
  geom_point(aes(x = time, y = z$y))
To get the CI, look at the relevant pieces of getAnywhere(plot.cox.zph):
xx <- x$x
yy <- x$y
df <- max(df)
nvar <- ncol(yy)
pred.x <- seq(from = min(xx), to = max(xx), length = nsmo)
#------------
if (se) {
bk <- backsolve(qmat$qr[1:df, 1:df], diag(df))
xtx <- bk %*% t(bk)
seval <- ((pmat %*% xtx) * pmat) %*% rep(1, df)
temp <- 2 * sqrt(x$var[i, i] * seval)
yup <- yhat + temp
ylow <- yhat - temp
yr <- range(yr, yup, ylow)
#---------------
if (se) {
lines(pred.x, exp(yup), col = col[2], lty = lty[2],
lwd = lwd[2])
lines(pred.x, exp(ylow), col = col[2], lty = lty[2],
lwd = lwd[2])
}
The survminer package will do this for you:
library(survminer)
ggcoxzph(cox.zph(cox))
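If you want to customize the result further, ggcoxzph returns a list-like object of ggplot objects (one panel per covariate), so individual panels can be modified with ordinary ggplot2 layers. A minimal sketch:
p <- ggcoxzph(cox.zph(cox))
p[[1]] + ggplot2::ggtitle("Scaled Schoenfeld residuals for rx")  # panel for rx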
I have read about loss functions theoretically, and also about how to build regression models based on them in R.
I can fit regression models based on various loss functions in R, except for the log-cosh loss function.
For example, I would like to build a linear regression model on 5-fold subsets of the data, and then extract the coefficients and calculate the individual and aggregated variances, as follows.
library(caret)  # for createFolds()
library(dplyr)  # for %>%

data = read.csv("train.csv")  # "critical_temp" is the dependent variable
data_nom_df = as.data.frame(scale(data))  # normalization

# Cross validation
set.seed(12345)
k = 5
folds <- createFolds(data_nom_df$critical_temp, k = k, list = TRUE, returnTrain = TRUE)
## Ordinary Least Square regression
# block A
lm = list()
for (i in 1:k) {
  lm[[i]] = lm(critical_temp ~ ., data = data_nom_df[folds[[i]], ])
}

# block B
lm_coef = list()
lm_coef_var = list()
for (j in 1:(lm[[1]]$coefficients %>% length())) {
  for (i in 1:k) {
    lm_coef[[i]] = lm[[i]]$coefficients[j]
    lm_coef_var[[j]] = lm_coef %>% unlist() %>% var()
  }
}

# block C
lm_var = unlist(lm_coef_var)
lm_df = cbind(coefficients = lm[[1]]$coefficients %>% names() %>% as.data.frame(),
              variance = lm_var %>% as.data.frame())
colnames(lm_df) = c("coefficients", "variance_lm")
lm_df

# block D
lm_var_sum = sum(lm_var)
lm_var_sum
The same goes for the rest of the regression models. However, I cannot find any code or package for fitting a regression model based on the log-cosh loss function in R.
Could you please point me to any source that would help me solve this problem?
This can be done from first principles. Also note the existence of the logcosh function in the limma package, which could be used in place of log(cosh(.)) if you run into numeric difficulties.
f <- function(b) with(cars, sum(log(cosh(dist - b[1] - b[2] * speed))))
fm0 <- lm(dist ~ speed, cars)
res <- optim(coef(fm0), f, method = "BFGS")
str(res)
## List of 5
## $ par : Named num [1:2] -12.82 3.47
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "speed"
## $ value : num 532
## $ counts : Named int [1:2] 28 10
## ..- attr(*, "names")= chr [1:2] "function" "gradient"
## $ convergence: int 0
## $ message : NULL
Graphics
# the black line is the ordinary least squares regression line and
# the red line is the log cosh regression line
plot(cars)
abline(fm0)
yfit <- res$par[1] + res$par[2] * cars$speed
lines(cars$speed, yfit, col = "red")
ADDED
Note that the optimization can also be written like this, which may be useful if you have many independent variables.
fm0 <- lm(dist ~ speed, cars)
X <- model.matrix(fm0)
f <- function(b) with(cars, sum(log(cosh(dist - X %*% b))))
res <- optim(coef(fm0), f, method = "BFGS")
res
giving:
$par
(Intercept) speed
-12.816190 3.469536
$value
[1] 531.5872
$counts
function gradient
28 10
$convergence
[1] 0
$message
NULL
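On the numeric difficulties mentioned above: log(cosh(x)) overflows for large |x| because cosh(x) itself overflows. Besides limma::logcosh, a hand-rolled stable version uses the identity log(cosh(x)) = |x| + log1p(exp(-2|x|)) - log(2). A sketch of the stabilized objective, reusing X and fm0 from above:
# numerically stable log(cosh(x)) that avoids overflow in cosh()
logcosh_stable <- function(x) abs(x) + log1p(exp(-2 * abs(x))) - log(2)
f_stable <- function(b) with(cars, sum(logcosh_stable(dist - X %*% b)))
optim(coef(fm0), f_stable, method = "BFGS")
# or, if the Bioconductor package limma is installed:
# f_stable <- function(b) with(cars, sum(limma::logcosh(dist - X %*% b)))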
I'm using export_summs in R to make a regression table, but when I use coeftest to get clustered standard errors, the table no longer reports N or R^2 properly in those columns. The coefficients and standard errors look good, just missing those additional stats. (I'm used to outreg2 in Stata which is much simpler.)
I tried using tidy_override() as suggested in the last example here (https://hughjonesd.github.io/huxtable/huxreg.pdf), no change.
# Reproducible example
library(jtools)    # export_summs
library(lmtest)    # coeftest
library(sandwich)  # vcovCL

datareg <- NULL
datareg$y <- rnorm(1000)
datareg$x <- rnorm(1000)
datareg$cluster_var <- rnorm(1000)
datareg <- data.frame(datareg)

reg0 <- lm(y ~ x, data = datareg)

reg1 <- coeftest(lm(y ~ x, data = datareg),
                 vcovCL, cluster = datareg$cluster_var)

export_summs(reg0, reg1,
             model.names = c("Basic", "Cluster SE"))
Running this issues a warning, and the resulting table is missing N and R2 for the coeftest model.
This is a case where the error message is fairly clear: the broom package does not have a glance method for coeftest objects. This is not an accident; the nature of the coeftest object does not allow broom to calculate model summary statistics. It retains very little information about the original model:
> str(reg1)
'coeftest' num [1:2, 1:4] 0.0483 0.0153 0.0329 0.0341 1.4668 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:2] "(Intercept)" "x"
..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
- attr(*, "method")= chr "t test of coefficients"
- attr(*, "df")= int 998
One option is to use the lm_robust function from the estimatr package. It returns objects with robust standard errors that are amenable to both glance and tidy:
reg2 <- estimatr::lm_robust(y ~ x, data = datareg)
export_summs(reg0, reg2,
             model.names = c("Basic", "Cluster SE"), number_format = NA)
──────────────────────────────────────────────────────────────────
Basic Cluster SE
────────────────────────────────────────────────────
(Intercept) 0.0482678107925753 0.0482678107925755
(0.032842483472098) (0.0329070612421128)
x 0.0152928320138191 0.015292832013819
(0.0333488383365212) (0.034094868727288)
────────────────────────────────────────────────────
N 1000 1000
R2 0.000210664993144995 0.000210665
──────────────────────────────────────────────────────────────────
*** p < 0.001; ** p < 0.01; * p < 0.05.
Column names: names, Basic, Cluster SE
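One caveat worth noting (my addition, not part of the original answer): as written, lm_robust gives heteroskedasticity-robust rather than clustered standard errors. For genuinely clustered errors, lm_robust also accepts a clusters argument:
# clustered standard errors via the clusters argument
reg2c <- estimatr::lm_robust(y ~ x, data = datareg,
                             clusters = cluster_var)
export_summs(reg0, reg2c,
             model.names = c("Basic", "Cluster SE"))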
Huxtable author here. This is how to do it with tidy_override:
library(generics)
library(huxtable)
library(jtools)
library(lmtest)
library(sandwich)
datareg <- NULL
datareg$y <- rnorm(1000)
datareg$x <- rnorm(1000)
datareg$cluster_var <- rnorm(1000)
datareg <- data.frame(datareg)
reg0 <- lm(y ~ x, data = datareg)
reg1 <- coeftest(reg0, vcovCL, cluster = datareg$cluster_var)
reg1 <- tidy_override(reg1, glance = list(nobs = 1000L, r.squared = 0.000),
extend = TRUE) # extend = TRUE is important
export_summs(reg0, reg1, model.names = c("Basic", "Cluster SE"))
Which gives:
────────────────────────────────────────────────────
Basic Cluster SE
───────────────────────────────────
(Intercept) -0.01 -0.01
(0.03) (0.03)
x -0.05 -0.05
(0.03) (0.03)
───────────────────────────────────
N 1000 1000
R2 0.00 0.00
────────────────────────────────────────────────────
*** p < 0.001; ** p < 0.01; * p < 0.05.
Column names: names, Basic, Cluster SE
This was fairly tricky and I appreciate your difficulties... I have improved the error reporting in huxreg as a result!
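A possible refinement (my suggestion, not from the answer above): instead of hard-coding nobs and r.squared, pull them from the underlying lm fit so they stay correct if the data change:
reg1 <- coeftest(reg0, vcovCL, cluster = datareg$cluster_var)
reg1 <- tidy_override(reg1,
                      glance = list(nobs = stats::nobs(reg0),
                                    r.squared = summary(reg0)$r.squared),
                      extend = TRUE)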
I am trying to run a linear model in R that does not specify an intercept. The reason is to eventually calculate the reduction in the sums of squares when an intercept is added. However, I am receiving different results when specifying this model using built-in factor contrasts versus explicitly stating the contrast values (i.e., -.5 and .5).
More specifically, using contrasts() results in a model with 2 terms (no intercept) while explicitly stating the contrast values via a column vector results in the correct model (no intercept and 1 term specifying the contrast).
group <- rep(c("c", "t"), each = 5)
group_cont <- rep(c(-.5, .5), each = 5)
var1 <- runif(10)
var2 <- runif(10)
test_data <- data.frame(
group = factor(group),
group_cont = group_cont,
y = var1,
x = var2
)
contrasts(test_data$group) <- cbind(grp = c(-.5, .5))
summary(lm(y ~ 1 + group, data = test_data)) # full model
summary(lm(y ~ 0 + group, data = test_data)) # weird results
summary(lm(y ~ 0 + group_cont, data = test_data)) # expected
Is there a way to specify a linear model without an intercept, but still use contrasts() to specify the contrast?
lm() takes a data frame and column names as inputs. When you use contrasts(), you are assigning an attribute to the column in your data frame, which you can inspect directly using the contrasts() function or attr(). However, you are not changing the data type itself. Using your example above:
> str(test_data)
'data.frame': 10 obs. of 4 variables:
$ group : Factor w/ 2 levels "c","t": 1 1 1 1 1 2 2 2 2 2 #### still a factor ####
..- attr(*, "contrasts")= num [1:2, 1] -0.5 0.5 #### NOTE The contrast attribute ####
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr "c" "t"
.. .. ..$ : chr "grp"
$ group_cont: num -0.5 -0.5 -0.5 -0.5 -0.5 0.5 0.5 0.5 0.5 0.5
$ y : num 0.161 0.518 0.417 0.335 0.301 ...
$ x : num 0.34 0.729 0.766 0.629 0.191 ...
> attr(test_data$group, "contrasts")
grp
c -0.5
t 0.5
So an attribute was added, but the type is still a factor, so lm treats it like a factor, giving you a coefficient for each level. Moreover, supplying contrasts() or calling attr() inside lm will throw an error. Depending on what you want the end result to look like, you may need to explore a different package like contrast. There is also a contrasts argument in lm, but I am not 100% sure that is what you are really looking for. See ?lm for more on that.
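If the goal is a single no-intercept slope for the stored contrast, one base-R workaround (a sketch; the column name "groupgrp" is just the factor name pasted to the contrast name) is to pull the design column out of model.matrix(), which does apply the contrasts when an intercept is present, and regress on it directly:
# build the +/-0.5 column from the factor's contrasts attribute
grp_col <- model.matrix(~ group, data = test_data)[, "groupgrp"]
summary(lm(y ~ 0 + grp_col, data = test_data))  # matches y ~ 0 + group_cont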