Getting predicted values at the response scale using the broom::augment function

I'm fitting a glm model in R and can get predicted values on the response scale using predict.glm(object=fm1, type="response"), where fm1 is the fitted model. I wonder how to get predicted values on the response scale using the augment function from the broom package. My minimal working example is given below.
Dilution <- c(1/128, 1/64, 1/32, 1/16, 1/8, 1/4, 1/2, 1, 2, 4)
NoofPlates <- rep(x=5, times=10)
NoPositive <- c(0, 0, 2, 2, 3, 4, 5, 5, 5, 5)
Data <- data.frame(Dilution, NoofPlates, NoPositive)
fm1 <- glm(formula = NoPositive/NoofPlates ~ log(Dilution),
           family = binomial("logit"), data = Data, weights = NoofPlates)
predict.glm(object=fm1, type="response")
# 1 2 3 4 5 6 7 8 9 10
# 0.02415120 0.07081045 0.19005716 0.41946465 0.68990944 0.87262421 0.95474066 0.98483820 0.99502511 0.99837891
library(broom)
broom::augment(x=fm1)
# NoPositive.NoofPlates log.Dilution. X.weights. .fitted .se.fit .resid .hat .sigma
# 1 0.0 -4.8520303 5 -3.6989736 1.1629494 -0.4944454 0.15937234 0.6483053
# 2 0.0 -4.1588831 5 -2.5743062 0.8837030 -0.8569861 0.25691194 0.5662637
# 3 0.4 -3.4657359 5 -1.4496388 0.6404560 1.0845988 0.31570923 0.4650405
# 4 0.4 -2.7725887 5 -0.3249714 0.4901128 -0.0884021 0.29247321 0.6784308
# 5 0.6 -2.0794415 5 0.7996960 0.5205868 -0.4249900 0.28989252 0.6523116
# 6 0.8 -1.3862944 5 1.9243633 0.7089318 -0.4551979 0.27931425 0.6486704
# 7 1.0 -0.6931472 5 3.0490307 0.9669186 0.6805552 0.20199632 0.6155754
# 8 1.0 0.0000000 5 4.1736981 1.2522190 0.3908698 0.11707018 0.6611557
# 9 1.0 0.6931472 5 5.2983655 1.5498215 0.2233227 0.05944982 0.6739965
# 10 1.0 1.3862944 5 6.4230329 1.8538108 0.1273738 0.02781019 0.6778365
# .cooksd .std.resid
# 1 0.0139540988 -0.5392827
# 2 0.0886414317 -0.9941540
# 3 0.4826245827 1.3111391
# 4 0.0022725303 -0.1050972
# 5 0.0543073747 -0.5043322
# 6 0.0637954916 -0.5362006
# 7 0.0375920888 0.7618349
# 8 0.0057798939 0.4159767
# 9 0.0008399932 0.2302724
# 10 0.0001194412 0.1291827

For a generalized linear model, the fitting is done on the scale of the linear predictor via a link function. For a Gaussian model this is the identity function, but for logistic regression it is the logit (it could also be the probit). This means you can get predicted values either on the link scale ("raw") or transformed back to the response scale. That is why ?predict.glm offers a type argument, which translates to type.predict in augment.
broom::augment(x=fm1, newdata = Data, type.predict = "response")
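As a quick check (my own sketch, not part of the original answer): the default .fitted column from augment is on the link (logit) scale, so applying the inverse link with plogis() reproduces the response-scale predictions from predict.glm(type = "response").
# .fitted is on the logit scale by default; plogis() is the inverse logit
aug_link <- broom::augment(x = fm1)
plogis(aug_link$.fitted)
predict.glm(object = fm1, type = "response")  # should give the same values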

Related

Interpretation of AUC NaN values in h2o cross-validation predictions summary

I have noticed that for some runs of:
train=as.h2o(u)
mod = h2o.glm(family= "binomial", x= c(1:15), y="dc",
training_frame=train, missing_values_handling = "Skip",
lambda = 0, compute_p_values = TRUE, nfolds = 10,
keep_cross_validation_predictions= TRUE)
there are NaNs in the cross-validation metrics summary of AUC for some CV iterations of the model.
For example:
print(mod@model$cross_validation_metrics_summary["auc",])
Cross-Validation Metrics Summary:
mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid cv_6_valid cv_7_valid cv_8_valid cv_9_valid cv_10_valid
auc 0.63244045 0.24962118 0.25 0.6666667 0.8095238 1.0 0.6666667 0.46666667 NaN NaN 1.0 0.2
NaNs in the CV summary seem to appear less frequently when I set a smaller nfolds=7.
How should these NaN values be interpreted, and when does h2o cross-validation output them?
I suppose it happens when the AUC can't be assessed correctly in an iteration. My training set has 70 complete rows.
Can such AUC cross-validation results (containing NaNs) be considered reliable?
There are specific cases that cause division by zero when calculating the ROC curve, which makes the AUC NaN. Given your small dataset, it is probable that some folds contain no true positives, and that is causing this issue.
We can test this by keeping the fold column and then counting the values of dc in each fold:
...
train <- as.h2o(u)
mod <- h2o.glm(family = "binomial"
, x = c(1:15)
, y = "dc"
, training_frame = train
, missing_values_handling = "Skip"
, lambda = 0
, compute_p_values = TRUE
, nfolds = 10
, keep_cross_validation_fold_assignment = TRUE
, seed = 1234)
fold <- as.data.frame(h2o.cross_validation_fold_assignment(mod))
df <- cbind(u,fold)
table(df[c("dc","fold_assignment")])
fold_assignment
dc 0 1 2 3 4 5 6 7 8 9
0 4 6 6 2 9 6 6 4 4 6
1 2 2 3 4 0 2 0 0 1 2
mod@model$cross_validation_metrics_summary["auc",]
Cross-Validation Metrics Summary:
mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid cv_6_valid cv_7_valid
auc 0.70238096 0.19357596 0.875 0.6666667 0.5 0.375 NaN 0.5833333 NaN
cv_8_valid cv_9_valid cv_10_valid
auc NaN 1.0 0.9166667
We see that the folds with NaN are the same folds that have only dc=0.
Setting the NaNs aside, the wide range of AUC across your folds (from 0.2 to 1) tells us that this is not a robust model and it is likely overfitted. Can you add more data?
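As a rough illustration (my own sketch, not from the original answer) of why the AUC becomes NaN when a fold contains only one class: the true-positive rate is TP / (TP + FN), and with no positive cases both terms are zero.
actual    <- c(0, 0, 0, 0)          # a fold that contains only dc = 0
predicted <- c(0.2, 0.7, 0.4, 0.1)  # arbitrary predicted probabilities
tp <- sum(predicted >  0.5 & actual == 1)  # true positives at a 0.5 cutoff
fn <- sum(predicted <= 0.5 & actual == 1)  # false negatives at a 0.5 cutoff
tp / (tp + fn)                             # 0/0 = NaN, so the ROC curve (and AUC) is undefined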

Using emmeans for lmer

I've been trying to calculate marginal means for my lmer and glmer models in R. I found the emmeans function and have been trying to understand it and apply it to my model. Getting the means for an interaction seems hard, so I'm starting with just additive predictors, but the function doesn't work the way it's presented in the examples (e.g. here: https://cran.r-project.org/web/packages/emmeans/vignettes/sophisticated.html)
emmeans(Oats.lmer, "nitro")
nitro emmean SE df lower.CL upper.CL
0.0 78.89207 7.294379 7.78 61.98930 95.79484
0.2 97.03425 7.136271 7.19 80.25029 113.81822
0.4 114.19816 7.136186 7.19 97.41454 130.98179
0.6 124.06857 7.070235 6.95 107.32795 140.80919
What I'm getting is:
emmeans(model2, "VariableA")
VariableA emmean SE df lower.CL upper.CL
0.4657459 2649.742 120.8955 19.07 2396.768 2902.715
Only one line, and the variable is averaged instead of being split into 0 and 1 (which are the values in the dataset; maybe the problem is that it's categorical?).
The model I'm running is:
model2 = lmer(rt ~ variableA + variableB + (1 | participant) + (1 | sequence/item),
              data = memoryData, REML = FALSE)
EDIT: The data file is quite big and I wasn't sure how to extract useful information from it, but here is the structure:
> str(memoryData)
'data.frame': 3168 obs. of 123 variables:
$ participant : int 10 10 10 10 10 10 10 10 10 10 ...
$ variableA : int 1 1 1 1 1 1 1 1 1 1 ...
$ variableB : int 1 1 1 1 1 1 1 1 1 1 ...
$ sequence: int 1 1 1 1 1 1 1 1 1 1 ...
$ item : int 25 26 27 28 29 30 31 32 33 34 ...
$ accuracy : int 1 1 1 1 1 1 0 1 1 1 ...
$ rt : num 1720 1628 1728 2247 1247 ...
Why is the function not working for me?
And as a further question, is there a way to get these means when I include interaction between variables A and B?
EDIT 2: OK, it did work when I changed the variable to a factor; I guess my way of doing it before was incorrect. But I'm still not sure how to calculate the means when there is an interaction, because with this method R says "NOTE: Results may be misleading due to involvement in interactions".
To see marginal means of interactions, add all variables of the interaction term to emmeans(), and use the at-argument if you want the marginal means at specific levels of the interaction terms.
Here are some examples, first for the average effect of the interaction and then for marginal effects at different levels of the interaction term. The latter is the more useful one for visualization.
library(ggeffects)
library(lme4)
library(emmeans)
data("sleepstudy")
sleepstudy$inter <- sample(1:5, size = nrow(sleepstudy), replace = T)
m <- lmer(Reaction ~ Days * inter + (1 + Days | Subject), data = sleepstudy)
# average marginal effect of interaction
emmeans(m, c("Days", "inter"))
#> Days inter emmean SE df lower.CL upper.CL
#> 4.5 2.994444 298.3427 8.84715 16.98 279.6752 317.0101
#>
#> Degrees-of-freedom method: kenward-roger
#> Confidence level used: 0.95
# marginal effects at different levels of interactions -
# useful for plotting
ggpredict(m, c("Days [3,5,7]", "inter"))
#>
#> # Predicted values of Reaction
#> # x = Days
#>
#> # inter = 1
#> x predicted std.error conf.low conf.high
#> 3 279.349 8.108 263.458 295.240
#> 5 304.839 9.818 285.597 324.082
#> 7 330.330 12.358 306.109 354.551
#>
#> # inter = 2
#> x predicted std.error conf.low conf.high
#> 3 280.970 7.624 266.028 295.912
#> 5 304.216 9.492 285.613 322.819
#> 7 327.462 11.899 304.140 350.784
#>
#> # inter = 3
#> x predicted std.error conf.low conf.high
#> 3 282.591 7.446 267.997 297.185
#> 5 303.593 9.384 285.200 321.985
#> 7 324.594 11.751 301.562 347.626
#>
#> # inter = 4
#> x predicted std.error conf.low conf.high
#> 3 284.212 7.596 269.325 299.100
#> 5 302.969 9.502 284.345 321.594
#> 7 321.726 11.925 298.353 345.099
#>
#> # inter = 5
#> x predicted std.error conf.low conf.high
#> 3 285.834 8.055 270.046 301.621
#> 5 302.346 9.839 283.062 321.630
#> 7 318.858 12.408 294.540 343.177
#>
#> Adjusted for:
#> * Subject = 308
emmeans(m, c("Days", "inter"), at = list(Days = c(3, 5, 7), inter = 1:5))
#> Days inter emmean SE df lower.CL upper.CL
#> 3 1 279.3488 8.132335 23.60 262.5493 296.1483
#> 5 1 304.8394 9.824196 20.31 284.3662 325.3125
#> 7 1 330.3300 12.366296 20.69 304.5895 356.0704
#> 3 2 280.9700 7.630745 18.60 264.9754 296.9646
#> 5 2 304.2160 9.493225 17.77 284.2529 324.1791
#> 7 2 327.4621 11.901431 17.84 302.4420 352.4822
#> 3 3 282.5912 7.445982 16.96 266.8786 298.3038
#> 5 3 303.5927 9.383978 16.98 283.7927 323.3927
#> 7 3 324.5942 11.751239 16.98 299.7988 349.3896
#> 3 4 284.2124 7.601185 18.34 268.2639 300.1609
#> 5 4 302.9694 9.504102 17.85 282.9900 322.9487
#> 7 4 321.7263 11.927612 17.99 296.6666 346.7860
#> 3 5 285.8336 8.076779 23.02 269.1264 302.5409
#> 5 5 302.3460 9.845207 20.48 281.8399 322.8521
#> 7 5 318.8584 12.416642 21.02 293.0380 344.6788
#>
#> Degrees-of-freedom method: kenward-roger
#> Confidence level used: 0.95
And a plotting example:
ggpredict(m, c("Days", "inter [1,3,5]")) %>% plot()
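If you prefer to stay within emmeans for plotting, emmip() produces a similar interaction plot; one possible call for the same model m (a sketch, not from the original answer) is:
# interaction plot via emmeans' emmip(); the at-argument is passed on to the
# reference grid, as with emmeans() above
emmip(m, inter ~ Days, at = list(Days = c(3, 5, 7), inter = 1:5), CIs = TRUE)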
You say that "changing the vari[a]ble to factor doesn't help", but I would think it should (as documented in the emmeans FAQ):
md <- transform(memoryData,
variableA=factor(variableA),
variableB=factor(variableB))
model2 = lmer (rt ~ variableA + variableB +
(1 |participant) + (1 |sequence/item), data=md, REML=FALSE)
emmeans(model2, ~variableA)
emmeans(model2, ~variableB)
emmeans(model2, ~variableA + variableB)
If this really doesn't work, then we need a reproducible example ...
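For the follow-up question about the A:B interaction, a possible extension of the code above (my sketch, not from the original answer; model3 is a hypothetical name, and md is the factor-converted data from the previous block) is to fit the interaction model and then ask emmeans for the cell means:
model3 <- lmer(rt ~ variableA * variableB +
                 (1 | participant) + (1 | sequence/item), data = md, REML = FALSE)
emmeans(model3, ~ variableA * variableB)   # means for each A x B combination
emmeans(model3, ~ variableA | variableB)   # means of A within each level of B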

R Flexsurv and time-dependent covariates

I read that the R flexsurv package can also be used for modeling time-dependent covariates, according to Christopher Jackson (2016) ["flexsurv: a platform for parametric survival modeling in R", Journal of Statistical Software, 70 (1)].
However, I was not able to figure out how, even after several adjustments and searches in online forums.
Before turning to the estimation of time-dependent covariates, I tried to create a simple model with only time-independent covariates to test whether I had specified the Surv object correctly. Here is a small example.
library(splitstackshape)
library(flexsurv)
## create sample data
n=50
set.seed(2)
t <- rpois(n,15)+1
x <- rnorm(n,t,5)
df <- data.frame(t,x)
df$id <- 1:n
df$rep <- df$t-1
Which looks like this:
t x id rep
1 12 17.696149 1 11
2 12 20.358094 2 11
3 11 2.058789 3 10
4 16 26.156213 4 15
5 13 9.484278 5 12
6 15 15.790824 6 14
...
And the long data:
long.df <- expandRows(df, "rep")
rep.vec<-c()
for(i in 1:n){
rep.vec <- c(rep.vec,1:(df[i,"t"]-1))
}
long.df$start <- rep.vec
long.df$stop <- rep.vec +1
long.df$censrec <- 0
long.df$censrec<-ifelse(long.df$stop==long.df$t,1,long.df$censrec)
Which looks like this:
t x id start stop censrec
1 12 17.69615 1 1 2 0
1.1 12 17.69615 1 2 3 0
1.2 12 17.69615 1 3 4 0
1.3 12 17.69615 1 4 5 0
1.4 12 17.69615 1 5 6 0
1.5 12 17.69615 1 6 7 0
1.6 12 17.69615 1 7 8 0
1.7 12 17.69615 1 8 9 0
1.8 12 17.69615 1 9 10 0
1.9 12 17.69615 1 10 11 0
1.10 12 17.69615 1 11 12 1
2 12 20.35809 2 1 2 0
...
Now I can estimate a simple Cox model to see whether it works:
coxph(Surv(t)~x,data=df)
This yields:
coef exp(coef) se(coef) z p
x -0.0588 0.9429 0.0260 -2.26 0.024
And in the long format:
coxph(Surv(start,stop,censrec)~x,data=long.df)
I get:
coef exp(coef) se(coef) z p
x -0.0588 0.9429 0.0260 -2.26 0.024
Taken together, I conclude that my transformation into the long format was correct. Now, turning to the flexsurv framework:
flexsurvreg(Surv(time=t)~x,data=df, dist="weibull")
yields:
Estimates:
data mean est L95% U95% se exp(est) L95% U95%
shape NA 5.00086 4.05569 6.16631 0.53452 NA NA NA
scale NA 13.17215 11.27876 15.38338 1.04293 NA NA NA
x 15.13380 0.01522 0.00567 0.02477 0.00487 1.01534 1.00569 1.02508
But
flexsurvreg(Surv(start,stop,censrec) ~ x ,data=long.df, dist="weibull")
causes an error:
Error in flexsurvreg(Surv(start, stop, censrec) ~ x, data = long.df, dist = "weibull") :
Initial value for parameter 1 out of range
Would anyone happen to know the correct syntax for the latter Surv object? If you use the correct syntax, do you get the same estimates?
Thank you very much,
best,
David
===============
EDIT AFTER FEEDBACK FROM 42
===============
library(splitstackshape)
library(flexsurv)
x<-c(8.136527, 7.626712, 9.809122, 12.125973, 12.031536, 11.238394, 4.208863, 8.809854, 9.723636)
t<-c(2, 3, 13, 5, 7, 37 ,37, 9, 4)
df <- data.frame(t,x)
#transform into long format for time-dependent covariates
df$id <- 1:length(df$t)
df$rep <- df$t-1
long.df <- expandRows(df, "rep")
rep.vec<-c()
for(i in 1:length(df$t)){
rep.vec <- c(rep.vec,1:(df[i,"t"]-1))
}
long.df$start <- rep.vec
long.df$stop <- rep.vec +1
long.df$censrec <- 0
long.df$censrec<-ifelse(long.df$stop==long.df$t,1,long.df$censrec)
coxph(Surv(t)~x,data=df)
coxph(Surv(start,stop,censrec)~x,data=long.df)
flexsurvreg(Surv(time=t)~x,data=df, dist="weibull")
flexsurvreg(Surv(start,stop,censrec) ~ x ,data=long.df, dist="weibull",inits=c(shape=.1, scale=1))
Which yields the same estimates for both coxph models but
Call:
flexsurvreg(formula = Surv(time = t) ~ x, data = df, dist = "weibull")
Estimates:
data mean est L95% U95% se exp(est) L95% U95%
shape NA 1.0783 0.6608 1.7594 0.2694 NA NA NA
scale NA 27.7731 3.5548 216.9901 29.1309 NA NA NA
x 9.3012 -0.0813 -0.2922 0.1295 0.1076 0.9219 0.7466 1.1383
N = 9, Events: 9, Censored: 0
Total time at risk: 117
Log-likelihood = -31.77307, df = 3
AIC = 69.54614
and
Call:
flexsurvreg(formula = Surv(start, stop, censrec) ~ x, data = long.df,
dist = "weibull", inits = c(shape = 0.1, scale = 1))
Estimates:
data mean est L95% U95% se exp(est) L95% U95%
shape NA 0.8660 0.4054 1.8498 0.3353 NA NA NA
scale NA 24.0596 1.7628 328.3853 32.0840 NA NA NA
x 8.4958 -0.0912 -0.3563 0.1739 0.1353 0.9128 0.7003 1.1899
N = 108, Events: 9, Censored: 99
Total time at risk: 108
Log-likelihood = -30.97986, df = 3
AIC = 67.95973
Reading the error message:
Error in flexsurvreg(Surv(start, stop, censrec) ~ x, data = long.df, dist = "weibull", :
initial values must be a numeric vector
And then reading the help page, ?flexsurvreg, it seemed worth attempting to set inits to a named numeric vector:
flexsurvreg(Surv(start,stop,censrec) ~ x ,data=long.df, dist="weibull", inits=c(shape=.1, scale=1))
Call:
flexsurvreg(formula = Surv(start, stop, censrec) ~ x, data = long.df,
dist = "weibull", inits = c(shape = 0.1, scale = 1))
Estimates:
data mean est L95% U95% se exp(est) L95% U95%
shape NA 5.00082 4.05560 6.16633 0.53454 NA NA NA
scale NA 13.17213 11.27871 15.38341 1.04294 NA NA NA
x 15.66145 0.01522 0.00567 0.02477 0.00487 1.01534 1.00569 1.02508
N = 715, Events: 50, Censored: 665
Total time at risk: 715
Log-likelihood = -131.5721, df = 3
AIC = 269.1443
Extremely similar results. My guess was basically a stab in the dark, so if this had not succeeded I would have had no guidance on how to choose other than to "expand the search."
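If expanding the search had been necessary, one possible (untested) approach is to loop over a small grid of starting values and keep whatever converges; the grid values below are arbitrary guesses, not from the original answer:
# try a small grid of (shape, scale) starting values and keep the fits that converge
fits <- list()
for (sh in c(0.1, 0.5, 1, 2)) {
  for (sc in c(0.5, 1, 5, 10)) {
    fit <- tryCatch(
      flexsurvreg(Surv(start, stop, censrec) ~ x, data = long.df,
                  dist = "weibull", inits = c(shape = sh, scale = sc)),
      error = function(e) NULL)
    if (!is.null(fit)) fits[[paste(sh, sc)]] <- fit
  }
}
sapply(fits, function(f) f$loglik)  # compare log-likelihoods of the successful fits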
I just want to mention that in flexsurv v1.1.1, running this code:
flexsurvreg(Surv(start,stop,censrec) ~ x ,data=long.df, dist="weibull")
doesn't return any errors. It also gives the same estimates as the non-time-varying command
flexsurvreg(Surv(time=t)~x,data=df, dist="weibull")
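To check which flexsurv version is installed before deciding whether explicit inits are still needed (a simple sanity check, not from the original answers):
packageVersion("flexsurv")       # compare against "1.1.1"
# install.packages("flexsurv")   # update if the installed version is older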

R: not enough observations, arguments are treated as the container rather than the content itself

So, I am trying to run a Bartlett test (or any such test) in R. It works fine with imported data:
data(foster, package = "HSAUR")
bartlett.test(weight ~ litgen,data = foster)
But not with my data:
mdat <- matrix(c(2.3,2.2,2.25, 2.2,2.1,2.2, 2.15, 2.15, 2.2, 2.25, 2.15, 2.25), nrow = 3, ncol = 4)
working_df = data.frame(mdat)
bartlett.test(X1 ~ X2, data = working_df)
Error in bartlett.test.default(c(2.3, 2.2, 2.25), c(2.2, 2.1, 2.2)) :
there must be at least 2 observations in each group
I have tried various functions and assignments, but the problem is that the arguments are treated as a single object rather than as their contents.
How can I run a Bartlett test with my data frames? How do I make the arguments be the contents rather than the container?
I don't know what you mean when you talk about "contents" and "container". The documentation at ?bartlett.test is pretty straightforward. You're trying to use a formula, so we'll look at the description of the formula argument:
formula a formula of the form lhs ~ rhs where lhs gives the data values and rhs the corresponding groups.
This matches with the structure of the foster data, where weight is numeric, and litgen is a categorical grouper.
head(foster)
litgen motgen weight
1 A A 61.5
2 A A 68.2
3 A A 64.0
4 A A 65.0
5 A A 59.7
6 A B 55.0
So, you need to put your data in that format.
your_data = data.frame(x = c(mdat), group = c(col(mdat)))
your_data
# x group
# 1 2.30 1
# 2 2.20 1
# 3 2.25 1
# 4 2.20 2
# 5 2.10 2
# 6 2.20 2
# 7 2.15 3
# 8 2.15 3
# 9 2.20 3
# 10 2.25 4
# 11 2.15 4
# 12 2.25 4
bartlett.test(x ~ group, data = your_data)
# Bartlett test of homogeneity of variances
#
# data: x by group
# Bartlett's K-squared = 0.86607, df = 3, p-value = 0.8336
That's all your groups at once. If you want to do pairwise comparisons, give subsets of your data to bartlett.test.
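For example, a pairwise comparison of groups 1 and 2 could look like this (my own sketch, reusing the your_data layout from above):
# compare the variances of groups 1 and 2 only
bartlett.test(x ~ group, data = subset(your_data, group %in% c(1, 2)))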

Determine threshold value based on sensitivities and specificities [duplicate]

This question already has answers here: Obtaining threshold values from a ROC curve
My data is a bit irregular for applying ROC analysis etc. for threshold determination. To simplify, here is a demo; let x be
x<-c(0,0,0,12, 3, 4, 5, 15, 15.3, 20,18, 26)
Suppose x = 15.1 is the unknown true threshold, and the corresponding test outcome y is negative (0) if x == 0 OR x > 15.1, and otherwise positive (1), such that:
y<-c(0,0,0,1, 1, 1, 1, 1, 0,0,0,0)
Because x == 0 also yields a negative outcome, I'm wondering how I can best determine the threshold of x for predicting y. I have tried the R packages pROC and ROCR, but neither seems straightforward for this situation. Would somebody have some suggestions for me?
You have a situation where you predict 0 for high values of x and predict 1 for low values of x, except that you always predict 0 if x == 0. Standard packages like pROC and ROCR expect low values of x to be associated with predicting y=0. You could transform your data to this situation by:
Flipping the sign of all your predictions
Replacing the zeros with a very negative value, so that those cases end up at the "predict 0" end
In code (using this answer to extract TPR and FPR for each cutoff):
x2 <- -x
x2[x2 == 0] <- -1000
library(ROCR)
pred <- prediction(x2, y)
perf <- performance(pred, "tpr", "fpr")
data.frame(cut=perf@alpha.values[[1]], fpr=perf@x.values[[1]],
           tpr=perf@y.values[[1]])
# cut fpr tpr
# 1 Inf 0.0000000 0.0
# 2 -3.0 0.0000000 0.2
# 3 -4.0 0.0000000 0.4
# 4 -5.0 0.0000000 0.6
# 5 -12.0 0.0000000 0.8
# 6 -15.0 0.0000000 1.0
# 7 -15.3 0.1428571 1.0
# 8 -18.0 0.2857143 1.0
# 9 -20.0 0.4285714 1.0
# 10 -26.0 0.5714286 1.0
# 11 -1000.0 1.0000000 1.0
Now you can select your favorite cutoff based on the true and false positive rates, remembering that the selected cutoff value is the negative of the threshold on the original scale.
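As a possible follow-up (my own sketch; roc_df and best are names I introduce here, not from the original answer), you could pick the cutoff that maximizes Youden's J = TPR - FPR and then undo the sign flip:
roc_df <- data.frame(cut = perf@alpha.values[[1]],
                     fpr = perf@x.values[[1]],
                     tpr = perf@y.values[[1]])
best <- roc_df[which.max(roc_df$tpr - roc_df$fpr), ]
best          # chosen cutoff on the transformed scale (here the row with cut = -15.0)
-best$cut     # the corresponding threshold on the original scale of x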
