Linear Regression Model based on Log-Cosh Loss Function in R

I have read about loss functions in theory and about how to build regression models based on them in R.
I can fit regression models based on the different loss functions in R, except for the log-cosh loss function.
For example, I would like to fit a linear regression model on each of 5 cross-validation folds of the data, then extract the coefficients and calculate the individual and aggregated variances as follows.
library(caret) # createFolds()
library(magrittr) # %>%
data = read.csv("train.csv") # "critical_temp" is the dependent variable.
data_nom_df = as.data.frame(scale(data)) # Standardize (center and scale) every column
#Cross Validation
set.seed(12345)
k = 5
folds <- createFolds(data_nom_df$critical_temp, k = k, list = TRUE, returnTrain = TRUE)
## Ordinary Least Square regression
#block A
lm = list()
for (i in 1:k) {
lm[[i]] = lm(critical_temp~ .,
data = data_nom_df[folds[[i]],])
}
#block B
lm_coef = list()
lm_coef_var = list()
for(j in 1:(lm[[1]]$coefficients %>% length())){
for(i in 1:k){
lm_coef[[i]] = lm[[i]]$coefficients[j]
}
# variance of coefficient j across the k folds
lm_coef_var[[j]] = lm_coef %>% unlist() %>% var()
}
#block C
lm_var = unlist(lm_coef_var)
lm_df = cbind(coefficients = lm[[1]]$coefficients %>% names() %>% as.data.frame()
, variance = lm_var %>% as.data.frame())
colnames(lm_df) = c("coefficients", "variance_lm")
lm_df
#block D
lm_var_sum = sum(lm_var)
lm_var_sum
I do the same for the other regression models. However, I cannot find any code or package that fits a regression model based on the log-cosh loss function in R.
Could you please point me to a source that would help me solve this problem?

This can be done from first principles by minimizing the sum of log(cosh(residual)) with optim. Also note the logcosh function in the limma package, which could be used in place of log(cosh(.)) if you run into numeric difficulties.
f <- function(b) with(cars, sum(log(cosh(dist - b[1] - b[2] * speed))))
fm0 <- lm(dist ~ speed, cars)
res <- optim(coef(fm0), f, method = "BFGS")
str(res)
## List of 5
## $ par : Named num [1:2] -12.82 3.47
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "speed"
## $ value : num 532
## $ counts : Named int [1:2] 28 10
## ..- attr(*, "names")= chr [1:2] "function" "gradient"
## $ convergence: int 0
## $ message : NULL
Graphics
# the black line is the ordinary least squares regression line and
# the red line is the log cosh regression line
plot(cars)
abline(fm0)
yfit <- res$par[1] + res$par[2] * cars$speed
lines(cars$speed, yfit, col = "red")
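If log(cosh(.)) overflows for large residuals and you would rather not depend on limma, one option is the identity log(cosh(x)) = |x| + log1p(exp(-2|x|)) - log(2). The sketch below is only an illustration: logcosh_stable and gr are names made up here, and the gradient uses the fact that the derivative of log(cosh(r)) is tanh(r).

```r
# Numerically stable log(cosh(x)); cosh() itself overflows for large |x|.
# `logcosh_stable` is a hypothetical helper, not part of any package.
logcosh_stable <- function(x) {
  ax <- abs(x)
  ax + log1p(exp(-2 * ax)) - log(2)
}

X <- model.matrix(dist ~ speed, cars)
y <- cars$dist

# Objective: sum of log-cosh residuals.
f <- function(b) sum(logcosh_stable(y - X %*% b))
# Analytic gradient: d/db sum(log(cosh(y - X b))) = -X' tanh(y - X b).
gr <- function(b) -as.vector(crossprod(X, tanh(y - X %*% b)))

# Start from the OLS coefficients, as in the answer above.
fit <- optim(coef(lm(dist ~ speed, cars)), f, gr, method = "BFGS")
fit$par
```

Supplying gr is optional; optim falls back to finite differences when it is omitted.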
ADDED
Note that the optimization can also be written in matrix form, which may be useful if you have many independent variables.
fm0 <- lm(dist ~ speed, cars)
X <- model.matrix(fm0)
f <- function(b) with(cars, sum(log(cosh(dist - X %*% b))))
res <- optim(coef(fm0), f, method = "BFGS")
res
giving:
$par
(Intercept) speed
-12.816190 3.469536
$value
[1] 531.5872
$counts
function gradient
28 10
$convergence
[1] 0
$message
NULL
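To connect this with the per-fold coefficient variances asked about in the question, here is a rough sketch on the cars data. The helper fit_logcosh is a hypothetical name, the folds are drawn with base R's sample() as a stand-in for caret::createFolds, and the loss uses the overflow-safe form |r| + log1p(exp(-2|r|)) - log(2) of log(cosh(r)).

```r
# Fit a log-cosh regression; `fit_logcosh` is a hypothetical helper.
fit_logcosh <- function(formula, data) {
  fm0 <- lm(formula, data)               # OLS fit supplies starting values
  X <- model.matrix(fm0)
  y <- model.response(model.frame(fm0))
  f <- function(b) {
    r <- abs(y - X %*% b)
    sum(r + log1p(exp(-2 * r)) - log(2)) # equals sum(log(cosh(residuals)))
  }
  optim(coef(fm0), f, method = "BFGS")$par
}

set.seed(12345)
k <- 5
# Randomly assign each row to one of k folds.
fold_id <- sample(rep(seq_len(k), length.out = nrow(cars)))

# One column of coefficients per fold, each fitted on the other k - 1 folds.
coefs <- sapply(seq_len(k), function(i) {
  fit_logcosh(dist ~ speed, cars[fold_id != i, ])
})

coef_var <- apply(coefs, 1, var) # variance of each coefficient across folds
coef_var
sum(coef_var)                    # aggregated variance
```

For the data in the question, the formula would presumably be critical_temp ~ . and the data frame data_nom_df.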

Related

GLMM Random Intercept estimators in lme4

How do you get the random-intercept estimates from an lme4 fit object?
set.seed(247)
# Create Data
n=1000
x = runif(n)
id = rep(NA,n)
for (i in 1:10) {
id_s = (i-1)*100+1
id_e = i*100
id[id_s:id_e] = i
}
effects = rnorm(10)
lp = -0.5+0.5*x + effects[id]
probs = exp(lp)/(1+exp(lp))
Y2 = rbinom(n, 1, probs)
library(lme4)
fit_glmm2 = glmer(Y2 ~ x + (1|id), family = "binomial",control = glmerControl(calc.derivs = FALSE))
I thought they might be the u's, but there is a slight difference between them:
yy = coef(fit_glmm2) # looking only at the intercept
fit_glmm2@u + fit_glmm2@beta[1]
If you want the random effects, ranef() is the best way to get them:
r <- ranef(fit_glmm2)
str(r)
## List of 1
## $ id:'data.frame': 10 obs. of 1 variable:
## ..$ (Intercept): num [1:10] -0.693 0.297 0.54 -0.467 0.755 ...
## ..- attr(*, "postVar")= num [1, 1, 1:10] 0.0463 0.0385 0.0392 0.0436 0.0409 ...
## - attr(*, "class")= chr "ranef.mer"
raw <- unname(unlist(ranef(fit_glmm2)$id))
identical(raw, fit_glmm2@u*fit_glmm2@theta) ## TRUE
As described in vignette("lmer", package = "lme4"), the @u values are the spherical random effects, i.e. they are iid N(0,1) and need to be transformed to get the random effects b used in the formula X %*% beta + Z %*% b. In this case (an intercept-only RE), theta corresponds to the standard deviation of the random effect. u*theta won't work for more complicated cases; there you need getME(fit_glmm2, "Lambda") %*% getME(fit_glmm2, "u").
getME(., "b") will also work, but again for more complex models you'll have to work out how the b-vector is split into random intercepts, slopes, different RE terms, etc.
Turns out you can get them by multiplying the u parameter with the theta parameter, or by calling getME(.,"b"):
yy = coef(fit_glmm2) # looking only at the intercept
fit_glmm2@u*fit_glmm2@theta + fit_glmm2@beta[1] # or
# getME(fit_glmm2,"b") + fit_glmm2@beta[1]

Looping TukeyHSD function through multiple columns of a dataframe

I'm trying to loop TukeyHSD tests through each column of a dataframe and compare treatment levels. Here's some mock data that's a simplified version of the data I have (my data has ~350 columns):
df1 <- data.frame(cmpd1 = c(500,436,1,1,1,1),
cmpd2 = c(1,1,1,1,1,253),
cmpd3 = c(1,1,300,57,150,260),
treatment=c("W","W","A","A","D","D"))
I've followed the suggestions in this post successfully and have created a loop that runs ANOVAs for each column, outputting only columns that had a p-value <0.07 for the treatment comparisons:
# specific compound differences
for (i in 1:3){
column <- names(df1[i])
anova <- broom::tidy(aov(df1[,i] ~ treatment, data = df1))
# only want aov with P < 0.07 printed
if(anova$p.value[1] < 0.07) {
print(column)
print(anova)
}
}
However, I'd like to run TukeyHSD tests on all columns in a similar way, only outputting the tukey results that have a p-value <0.07 for any given treatment comparison. I tried something like this but it doesn't work, giving the error "Error in if (tukey[["p adj"]] < 0.07) { : argument is of length zero":
for (i in 1:3){
column <- names(df1[i])
anova <- aov(df1[,i] ~ treatment, data = df1)
tukey <- TukeyHSD(anova)
# only want tukey with P < 0.07 printed
if(tukey[["p adj"]] < 0.07) {
print(column)
print(tukey)
}
}
I can't figure out the right way to have it only output tukey tests that contain a p-value <0.07, so my ideal output would be something like this (this contains made-up values):
$cmpd1
diff lwr upr p adj
D-A 2.728484e-12 -29169.59 29169.59 1.0000000
W-A 3.637979e-12 -32278.10 32278.10 0.0001
W-D 1.484573e+04 -13620.88 43312.34 0.056
The output of TukeyHSD is a list, as is evident from its structure:
str(TukeyHSD(aov(df1[,1] ~ treatment, data = df1)))
List of 1
$ treatment: num [1:3, 1:4] -2.84e-14 4.67e+02 4.67e+02 -1.09e+02 3.58e+02 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:3] "D-A" "W-A" "W-D"
.. ..$ : chr [1:4] "diff" "lwr" "upr" "p adj"
- attr(*, "class")= chr [1:2] "TukeyHSD" "multicomp"
- attr(*, "orig.call")= language aov(formula = df1[, 1] ~ treatment, data = df1)
- attr(*, "conf.level")= num 0.95
- attr(*, "ordered")= logi FALSE
The list element 'treatment' is a matrix, so [[ or $ cannot be used to pull out the 'p adj' column. Instead, index the matrix with [, giving the column name after the comma (rows go before the comma, columns after), and wrap the comparison in any(), because 'p adj' holds three values while if expects a single TRUE/FALSE.
for (i in 1:3){
column <- names(df1[i])
anova <- aov(df1[,i] ~ treatment, data = df1)
tukey <- TukeyHSD(anova)
# only want tukey with P < 0.07 printed
if(any(tukey$treatment[, "p adj"] < 0.07)) {
print(column)
print(setNames(tukey, column))
}
}
-output
[1] "cmpd1"
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = df1[, i] ~ treatment, data = df1)
$cmpd1
diff lwr upr p adj
D-A -2.842171e-14 -109.1823 109.1823 1.000000
W-A 4.670000e+02 357.8177 576.1823 0.000839
W-D 4.670000e+02 357.8177 576.1823 0.000839
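With ~350 columns it may be more convenient to collect the results in a list rather than print inside a loop. The following is a sketch of that idea, not a drop-in replacement: the names cmpd_cols, tukey_tabs and keep are made up here, and treatment is converted to a factor explicitly as a precaution, since TukeyHSD works on factors.

```r
df1 <- data.frame(cmpd1 = c(500, 436, 1, 1, 1, 1),
                  cmpd2 = c(1, 1, 1, 1, 1, 253),
                  cmpd3 = c(1, 1, 300, 57, 150, 260),
                  treatment = factor(c("W", "W", "A", "A", "D", "D")))

cmpd_cols <- setdiff(names(df1), "treatment")

# Run TukeyHSD per compound and keep the matrix of pairwise comparisons.
tukey_tabs <- lapply(setNames(cmpd_cols, cmpd_cols), function(col) {
  fit <- aov(reformulate("treatment", response = col), data = df1)
  TukeyHSD(fit)$treatment
})

# Keep compounds where at least one comparison has adjusted p < 0.07.
keep <- vapply(tukey_tabs, function(m) any(m[, "p adj"] < 0.07), logical(1))
tukey_tabs[keep]
```

Building the formula with reformulate() keeps the column name in the model call, so each result is labelled by compound without needing setNames on the TukeyHSD object.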

How to plot the Hazard Ratio + CI over time of survival data in ggplot in R?

Background
I want to plot the hazard ratio over time, including its confidence intervals, of a survival dataset. As an example, I will take a simplified dataset from the survival package: the colon dataset.
library(survival)
library(tidyverse)
# Colon survival dataset
data <- colon %>%
filter(etype == 2) %>%
select(c(id, rx, status, time)) %>%
filter(rx == "Obs" | rx == "Lev+5FU") %>%
mutate(rx = factor(rx))
The dataset contains patients that received a treatment (i.e., "Lev+5FU") and patients that did not (i.e., "Obs"). The survival curves are as follows:
fit <- survfit(Surv(time, status) ~ rx, data = data )
plot(fit)
Attempt
Using the cox.zph function, you can plot the hazard ratio of a cox model.
cox <- coxph(Surv(time, status) ~ rx, data = data)
plot(cox.zph(cox))
However, I want to plot the hazard ratio including 95% CI for this survival dataset using ggplot.
Question(s)
How do you extract the hazard ratio data and the 95% CIs from this cox.zph object to plot them in ggplot?
Are there other R packages that enable doing the same in a more convenient way?
Note: it’s important to recognize the correction of Dion Groothof. The lines and CIs are not really hazard ratios. They are estimates and bounds around time varying log-hazard-ratios. You would need to exponentiate to get HRs.
The values are in the result returned from cox.zph:
str(cox.zph(cox))
#----------------------
List of 7
$ table : num [1:2, 1:3] 1.188 1.188 1 1 0.276 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:2] "rx" "GLOBAL"
.. ..$ : chr [1:3] "chisq" "df" "p"
$ x : num [1:291] 0 0.00162 0.00323 0.00485 0.00646 ...
$ time : num [1:291] 23 34 45 52 79 113 125 127 138 141 ...
$ y : num [1:291, 1] 2.09 2.1 2.1 2.1 2.11 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:291] "23" "34" "45" "52" ...
.. ..$ : chr "rx"
$ var : num [1, 1] 4.11
$ transform: chr "km"
$ call : language cox.zph(fit = cox)
- attr(*, "class")= chr "cox.zph"
To get a plot with any of the paradigms (base, lattice or ggplot2), use time as the x axis, x for the solid line, and y for the "points":
z <- cox.zph(cox)
ggdf <- data.frame( unclass(z)[c("time", "x","y")])
ggplot(data=ggdf, aes(x=time, y=-x))+
geom_line()+ ylim(range(z$y))+
geom_point(aes(x=time,y=z$y) )
To get the CI, look at the source with getAnywhere(plot.cox.zph). The relevant excerpts are:
xx <- x$x
yy <- x$y
df <- max(df)
nvar <- ncol(yy)
pred.x <- seq(from = min(xx), to = max(xx), length = nsmo)
#------------
if (se) {
bk <- backsolve(qmat$qr[1:df, 1:df], diag(df))
xtx <- bk %*% t(bk)
seval <- ((pmat %*% xtx) * pmat) %*% rep(1, df)
temp <- 2 * sqrt(x$var[i, i] * seval)
yup <- yhat + temp
ylow <- yhat - temp
yr <- range(yr, yup, ylow)
#---------------
if (se) {
lines(pred.x, exp(yup), col = col[2], lty = lty[2],
lwd = lwd[2])
lines(pred.x, exp(ylow), col = col[2], lty = lty[2],
lwd = lwd[2])
}
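Putting those excerpts together, a data frame with the smoothed curve and its band could be assembled roughly as follows. This is a sketch of my reading of the excerpt, not the package's exact code: df = 4 and nsmo = 40 are assumed to be the plot.cox.zph defaults, ci_df and half are names invented here, and the data subset is rebuilt with base R instead of dplyr.

```r
library(survival)

# Same subset as in the question, built with base R.
d <- subset(colon, etype == 2 & rx %in% c("Obs", "Lev+5FU"))
d$rx <- droplevels(d$rx)
cox <- coxph(Surv(time, status) ~ rx, data = d)
z <- cox.zph(cox)

df <- 4; nsmo <- 40        # assumed defaults of plot.cox.zph
xx <- z$x                  # transformed event times
yy <- z$y[, 1]             # scaled Schoenfeld residuals for rx
pred.x <- seq(min(xx), max(xx), length.out = nsmo)

# Natural-spline basis evaluated on the grid (pmat) and at the data (xmat).
lmat <- splines::ns(c(pred.x, xx), df = df, intercept = TRUE)
pmat <- lmat[seq_len(nsmo), ]
xmat <- lmat[-seq_len(nsmo), ]
qmat <- qr(xmat)
yhat <- as.vector(pmat %*% qr.coef(qmat, yy))

# Pointwise +/- 2 SE band, following the excerpt above.
bk <- backsolve(qmat$qr[1:df, 1:df], diag(df))
xtx <- bk %*% t(bk)
seval <- as.vector(((pmat %*% xtx) * pmat) %*% rep(1, df))
half <- 2 * sqrt(z$var[1, 1] * seval)

ci_df <- data.frame(
  time  = approx(xx, z$time, xout = pred.x, ties = mean)$y, # back to original time
  loghr = yhat, lower = yhat - half, upper = yhat + half
)
ci_df$hr <- exp(ci_df$loghr) # exponentiate to get hazard ratios
head(ci_df)
```

ci_df can then be handed to ggplot2, for example with geom_line() for loghr and geom_ribbon(aes(ymin = lower, ymax = upper)) for the band; exponentiate lower and upper as well if you want the plot on the hazard-ratio scale.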
The survminer package will do this for you:
library(survminer)
ggcoxzph(cox.zph(cox))

Regression table with clustered standard errors in R jupyter notebook?

I'm using export_summs in R to make a regression table, but when I use coeftest to get clustered standard errors, the table no longer reports N or R^2 properly in those columns. The coefficients and standard errors look good, just missing those additional stats. (I'm used to outreg2 in Stata which is much simpler.)
I tried using tidy_override() as suggested in the last example here (https://hughjonesd.github.io/huxtable/huxreg.pdf), no change.
# Reproducible example
datareg <- NULL
datareg$y <- rnorm(1000)
datareg$x <- rnorm(1000)
datareg$cluster_var <- rnorm(1000)
datareg <- data.frame(datareg)
reg0 <- lm(y ~ x
, data = datareg)
reg1 <- coeftest(
lm(y ~ x
, data = datareg)
, vcovCL, cluster = datareg$cluster_var)
export_summs(reg0, reg1,
model.names = c("Basic", "Cluster SE"))
Issues warning and output:
This is a case where the error message is fairly clear: the broom package does not have a glance method for coeftest objects. This is not an accident--the nature of the coeftest object does not allow for broom to calculate model summary statistics. It retains very little information about the original model:
> str(reg1)
'coeftest' num [1:2, 1:4] 0.0483 0.0153 0.0329 0.0341 1.4668 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:2] "(Intercept)" "x"
..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
- attr(*, "method")= chr "t test of coefficients"
- attr(*, "df")= int 998
One option is to use the lm_robust function from the estimatr package. It returns objects with robust standard errors that are amenable to both glance and tidy:
reg2 <- estimatr::lm_robust(y ~ x
, data = datareg)
export_summs(reg0, reg2,
model.names = c("Basic", "Cluster SE"), number_format = NA )
──────────────────────────────────────────────────────────────────
                      Basic                  Cluster SE
              ────────────────────────────────────────────────────
  (Intercept)   0.0482678107925753     0.0482678107925755
               (0.032842483472098)    (0.0329070612421128)
  x             0.0152928320138191     0.015292832013819
               (0.0333488383365212)   (0.034094868727288)
              ────────────────────────────────────────────────────
  N             1000                   1000
  R2            0.000210664993144995   0.000210665
──────────────────────────────────────────────────────────────────
  *** p < 0.001; ** p < 0.01; * p < 0.05.
Column names: names, Basic, Cluster SE
Huxtable author here. This is how to do it with tidy_override:
library(generics)
library(huxtable)
library(jtools)
library(lmtest)
library(sandwich)
datareg <- NULL
datareg$y <- rnorm(1000)
datareg$x <- rnorm(1000)
datareg$cluster_var <- rnorm(1000)
datareg <- data.frame(datareg)
reg0 <- lm(y ~ x, data = datareg)
reg1 <- coeftest(reg0, vcovCL, cluster = datareg$cluster_var)
reg1 <- tidy_override(reg1, glance = list(nobs = 1000L, r.squared = 0.000),
extend = TRUE) # extend = TRUE is important
export_summs(reg0, reg1, model.names = c("Basic", "Cluster SE"))
Which gives:
────────────────────────────────────────────────────
                     Basic       Cluster SE
                 ───────────────────────────────────
  (Intercept)       -0.01         -0.01
                    (0.03)        (0.03)
  x                 -0.05         -0.05
                    (0.03)        (0.03)
                 ───────────────────────────────────
  N                 1000          1000
  R2                0.00          0.00
────────────────────────────────────────────────────
  *** p < 0.001; ** p < 0.01; * p < 0.05.
Column names: names, Basic, Cluster SE
This was fairly tricky and I appreciate your difficulties... I have improved the error reporting in huxreg as a result!

getting sensitivity after fitting decision tree with caret

I am trying to obtain the sensitivity, specificity, etc. after fitting a decision tree, using caret (or even caret::confusionMatrix). I am using code along these lines:
fit <- rpart(
Bla ~ ...
,method="class"
,data=OrginalData
)
preds <- predict(fit, SomeData)
caret::sensitivity(factor(preds[,2]), factor(OrginalData$Bla))
str(preds)
Unfortunately, I get NA as the sensitivity. str() returns:
num [1:40, 1:2] 0.926 0.926 0.926 0.926 0.926 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:40] "1" "2" "3" "4" ...
..$ : chr [1:2] "NO" "YESR"
Any ideas?
The problem is that you are providing a numeric vector of probabilities as predictions, while caret::sensitivity expects a factor of predicted classes. Here is a working example using the Sonar data from mlbench:
library(mlbench)
library(rpart)
library(caret)
data(Sonar)
split data:
inds <- createDataPartition(Sonar$Class, p = 0.7, list = FALSE)
train <- Sonar[inds,]
test <- Sonar[-inds,]
create model:
fit <- rpart(Class ~ .,
method = "class",
data = train)
pred <- predict(fit, test)
convert probability to prediction:
caret::sensitivity(as.factor(ifelse(pred[,2] > 0.5, "R", "M")), test$Class)
#output
[1] 0.6969697
Do note that 0.5 as a threshold might not be appropriate; it should be tuned according to the desired model behavior. If there is no preference towards a certain type of error, balanced accuracy is a solid metric for tuning the probability threshold.
Alternatively, specify type = "class" when predicting with rpart:
pred <- predict(fit, test, type = "class")
caret::sensitivity(pred, test$Class)
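To see what the two prediction types look like and what sensitivity amounts to, here is a small base-R sketch using the kyphosis data that ships with rpart (a different dataset from the ones above, chosen only so the example needs no extra packages). It treats the first factor level as the positive class, which is my understanding of caret::sensitivity's default.

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

prob <- predict(fit, kyphosis)                 # matrix of class probabilities
cls  <- predict(fit, kyphosis, type = "class") # factor of predicted classes

# Thresholding the second class's probability at 0.5 should match the
# class predictions for a two-class tree.
lev <- levels(kyphosis$Kyphosis)
cls_from_prob <- factor(ifelse(prob[, 2] > 0.5, lev[2], lev[1]), levels = lev)

# Confusion table, then sensitivity and specificity by hand,
# with the first level taken as the positive class.
tab  <- table(predicted = cls, actual = kyphosis$Kyphosis)
sens <- tab[lev[1], lev[1]] / sum(tab[, lev[1]])
spec <- tab[lev[2], lev[2]] / sum(tab[, lev[2]])
c(sensitivity = sens, specificity = spec)
```

These values are computed on the training data, so they are optimistic; on a held-out set, replace kyphosis in the predict() calls and in the table with the test data.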

Resources