How to make predictions even with NAs using predict()? - r

I want to use predict() with a polr() model to predict variable z, as per the following code. The first data frame is used to train the model; the second is the test data.
df <- data.frame(x=c(1, 2, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 2, 2, 1, 2, 1, 1, 2, 2),
y=c(32, 67, 12, 89, 45, 78, 43, 47, 14, 67, 16, 36, 25, 23, 56, 26, 35, 79, 13, 44),
z=as.factor(c(1, 2, 3, 2, 1, 2, 3, 2, 1, 2, 3, 2, 3, 2, 1, 2, 1, 2, 1, 2)))
test <- data.frame(x=c(1, 2, 1, 1, 2, 1, 2, 2, 1, 1),
y=c(34, NA, 78, NA, 89, 17, 27, 83, 23, 48),
z=c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1))
This is the polr() model (polr() comes from the MASS package):
library(MASS)
mod <- polr(z ~ x + y, data = df, Hess = TRUE)
And this is the predict() function with its outcome:
predict(mod, newdata = test)
[1] 2 <NA> 2 <NA> 2 2 2 2 2 2
My problem is that I want the model to make predictions even when there are NAs, as in the 2nd and 4th cases. I have tried the following, with the same result:
predict(mod, newdata = test, na.action = "na.exclude")
predict(mod, newdata = test, na.action = "na.pass")
predict(mod, newdata = test, na.action = "na.omit")
predict(mod, newdata = test, na.rm=T)
[1] 2 <NA> 2 <NA> 2 2 2 2 2 2
How can I get the model to make predictions even when there's some missing data?

This is more of a statistical or mathematical problem than a programming problem. To simplify things a little bit (and to show that the issue is general), I'll illustrate with a linear regression, but the concept extends to ordinal regression as well.
Suppose I've estimated a linear relationship, say z = 1 + 2*x + 3*y, and I want to predict a response when the predictors are {x=3, y=NA}. I get 1 + 2*3 + 3*NA, which is clearly NA.
If you want predictions when some of the predictor variables are unknown, you have to make some kind of assumption/decision about what to do — this is a question of interpretation, not mathematics. For example, you could set unknown values of y to the mean of the original data set, or the mean of the new data set, or some sensible reference value, or you could do multiple imputation — i.e., making several predictions based on several different draws from a reasonable distribution, then averaging the results. (For a linear regression model this will give you the same answer (point estimate) as using the mean of the distribution, but (1) the results will differ if you have an effectively nonlinear model like an ordinal or generalized linear regression; (2) multiple imputation will allow you to get sensible standard errors on the prediction.)
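For example, here is a minimal sketch of the mean-imputation option (assuming that filling in missing y values with the training-set mean is acceptable for your application):
test_imp <- test
test_imp$y[is.na(test_imp$y)] <- mean(df$y)  # impute missing y with the training mean
predict(mod, newdata = test_imp)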

Gtsummary columns for mixed model p-value and interaction

I have been conducting a cross-over experiment, testing a specific treatment in a group of patients who received treatments "1" and "2" in random order.
I am fairly new to R, and I wish to generate a table with tbl_summary with one column for each treatment's effect on various parameters, as well as a column for the p-value from the mixed model analysis (between-group comparison) and a sequence-interaction p-value.
I have calculated the between-period difference in response to treatment within each period by using a mixed model approach with the lme4-package. Then, I compared the treatment response between groups by the estimated marginal means (emmeans).
I have conducted my statistics using the following code:
library(emmeans)
library(lme4)
library(lmerTest)
df <- data.frame (record_id = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12),
treatment = c(1, 2, 2, 1, 2, 1, 2, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 2, 1, 2, 1, 2, 1, 2),
treatment_sequence = c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1),
treatment_response = c(-43.5, 135.0, 8.4, -7.2, 99.0, 159.0, 12.0, -27.0, 3.0, 12.0, -15.0, 91.5, 6.0, -9.0, 177.0, 27.0, 52.8, -54.0, -50.7, 63.0, -9.0, 186.0, -72.0, 15.0)
)
df
df_mm <- lmer(treatment_response ~ as.factor(treatment)*treatment_sequence + (1|record_id), data=df)
anova(df_mm)
emmeans(df_mm, list(pairwise ~ treatment), adjust = "bonferroni")
This gives the following output:
> df_mm <- lmer(treatment_response ~ as.factor(treatment)*treatment_sequence + (1|record_id), data=df)
> anova(df_mm) ###show model as anova???
Type III Analysis of Variance Table with Satterthwaite's method
Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
as.factor(treatment) 1890.0 1890.0 1 10 0.4575 0.5141
treatment_sequence 832.1 832.1 1 10 0.2014 0.6632
as.factor(treatment):treatment_sequence 7466.0 7466.0 1 10 1.8071 0.2086
> emmeans(df_mm, list(pairwise ~ treatment), adjust = "bonferroni")
NOTE: Results may be misleading due to involvement in interactions
$`emmeans of treatment`
treatment emmean SE df lower.CL upper.CL
1 1.45 19.9 19.7 -40.1 43
2 61.83 19.9 19.7 20.3 103
Results are averaged over the levels of: treatment_sequence
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
$`pairwise differences of treatment`
1 estimate SE df t.ratio p.value
treatment1 - treatment2 -60.4 26.2 10 -2.301 0.0442
Results are averaged over the levels of: treatment_sequence
Degrees-of-freedom method: kenward-roger
I would like the between-treatment comparison p-value (0.0442) along with the interaction p-value of 0.21 in the table. My aim is to create a table like this:
I have tried modifying the code from this post (Gtsummary columns for all post hoc pairwise comparisons), but I cannot seem to get it right.
Is this possible? And can someone help with the coding?
Below is a working example. BUT I don't think the emmeans method you're using is correct. If you want to use it, you'll need to update the code to grab the p-value from the emmeans object (it's just a random number for now).
library(gtsummary)
packageVersion("gtsummary")
#> [1] '1.6.1'
df <- data.frame (record_id = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12),
treatment = c(1, 2, 2, 1, 2, 1, 2, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 2, 1, 2, 1, 2, 1, 2),
treatment_sequence = c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1),
treatment_response = c(-43.5, 135.0, 8.4, -7.2, 99.0, 159.0, 12.0, -27.0, 3.0, 12.0, -15.0, 91.5, 6.0, -9.0, 177.0, 27.0, 52.8, -54.0, -50.7, 63.0, -9.0, 186.0, -72.0, 15.0)
)
mod <- lme4::lmer(treatment_response ~ as.factor(treatment) * treatment_sequence + (1 | record_id), data=df)
tt <- emmeans::emmeans(mod, list(pairwise ~ treatment), adjust = "bonferroni")$`pairwise differences of treatment`
#> NOTE: Results may be misleading due to involvement in interactions
tt |> as.data.frame() |> dplyr::select(dplyr::last_col()) |> dplyr::pull()
#> [1] 0.04419325
car::Anova(mod) %>%
broom::tidy() %>%
dplyr::filter(dplyr::n() == dplyr::row_number()) |>
dplyr::pull(p.value)
#> [1] 0.1788567
my_custom_stats <- function(data, variable, ...) {
formula <-
as.formula(glue::glue(
"{variable} ~ as.factor(treatment) * treatment_sequence + (1 | record_id)"
))
mod <- lme4::lmer(formula, data = data)
# I think this is not appropriate due to the interaction
# but if you're confident about this approach, update pw_difference_p to be from emmeans
pw_difference_p <-
emmeans::emmeans(mod, list(pairwise ~ treatment), adjust = "bonferroni")
pw_difference_p <- runif(1)
interacton_p <-
car::Anova(mod) %>%
broom::tidy() %>%
dplyr::filter(dplyr::n() == dplyr::row_number()) |>
dplyr::pull(p.value)
dplyr::tibble(
pw_difference_p = pw_difference_p,
interacton_p = interacton_p
)
}
df %>%
tbl_summary(
by = treatment,
include = treatment_response,
statistic = all_continuous() ~ "{mean} ± {sd}"
) %>%
add_stat(fns = ~my_custom_stats) %>%
modify_header(interacton_p = "**Interaction P**",
pw_difference_p = "**Treatment P**") %>%
modify_fmt_fun(c(interacton_p, pw_difference_p) ~ style_pvalue) %>%
as_kable()
#> NOTE: Results may be misleading due to involvement in interactions
Characteristic        1, N = 12   2, N = 12   Treatment P   Interaction P
treatment_response    1 ± 60      62 ± 76     0.3           0.2
Created on 2022-08-24 by the reprex package (v2.0.1)
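If you do decide the emmeans comparison is appropriate, one way to replace the runif(1) placeholder inside my_custom_stats() is to reuse the extraction shown above (a sketch, assuming the single pairwise contrast):
pw_difference_p <-
  emmeans::emmeans(mod, list(pairwise ~ treatment), adjust = "bonferroni")$`pairwise differences of treatment` |>
  as.data.frame() |>
  dplyr::select(dplyr::last_col()) |>
  dplyr::pull()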

Kaplan Meier curve after IPTW

I want to use IPTW to find the effects of a medication on cardiovascular death (1), which competes with non-cardiovascular death (2) and survival (0).
After the IPTW, I would like to do a competing risk analysis to find the effect of the medication on cardiovascular death and plot the resulting Kaplan Meier curve.
This is my start
library(tableone)
library(cmprsk) # crr() is provided by the cmprsk package
library(ipw)
library(sandwich)
library(survey)
treatment<-as.numeric(df$treatment==1)
#propensity score model
psmodel<-glm(treatment ~ age + sex, data=df )
ps<-predict(psmodel, type="response")
#weights
weight<-ifelse(treatment==1,1/(ps),1/(1-ps))
age<-as.numeric(df$age)
sex<-as.numeric(df$sex==1)
cov1 <-cbind(age,sex, weight, treatment)
ftime <-df$Survival
fstatus<-df$Outcome
competingrisk<-crr(ftime, fstatus, cov1, failcode=2)
summary(competingrisk)
My dataset looks like this but with 500 lines.
structure(list(Outcome = c(1, 1, 1, 2, 2, 2, 2, 0, 0, 1, 1, 0,
0, 1, 1, 0, 0, 2), Survival = c(7, 13, 14, 8, 9, 15, 14, 16,
14, 3, 7, 13, 14, 8, 9, 15, 16, 4), treatment = c(1, 0, 0, 1,
0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0), age = c(59, 58, 57,
56, 55, 54, 53, 52, 51, 50, 49, 48, 47, 46, 45, 44, 43, 42),
BMI = c(25, 24, 23, 22, 21, 20, 29, 28, 27, 26, 25, 24, 25,
24, 23, 22, 21, 20), sex = c(0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
0, 1, 0, 1, 0, 1, 0, 1)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -18L))
A Kaplan-Meier plot is naive to multiple covariates and to competing risks. I highly doubt it is the best way to analyse your data in this instance; however, it can be used for exploration, so here is the solution.
If you want to find the four Kaplan-Meier curves for the two events, you can use the following code:
library(survival)
#create a KM model
mod1 <- survfit(Surv(Survival,Outcome==1)~treatment,data=df)
#plot it
plot(mod1,
conf.int=TRUE, #show confidence intervals
col=c("red","green") #treatment=0 is red
)
#see ?plot.survfit for more parameters
mod2 <- survfit(Surv(Survival,Outcome==2)~treatment,data=df)
#plot it
plot(mod2,
conf.int=TRUE, #show confidence intervals
col=c("red","green") #treatment=0 is red
)
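Note that survfit() accepts case weights, so if you want the curves to reflect your IPTW weights, something like the following should work (a sketch, assuming the weight vector computed in your question lines up row-for-row with df):
mod1w <- survfit(Surv(Survival,Outcome==1)~treatment, data=df, weights=weight)
plot(mod1w,
conf.int=TRUE, #show confidence intervals
col=c("red","green") #treatment=0 is red
)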
However, as stated above, KM plots are naive to competing risks (the probabilities do not take the other event into consideration, so at certain time points the probabilities of the two events can sum to more than 100%). You would likely be better off plotting the Cumulative Incidence Function (CIF), which does take the competing risks into account. The cmprsk package has the cuminc() function for this.
library(cmprsk)
cif <- cuminc(df$Survival,df$Outcome,df$treatment)
plot(cif)
# See ?plot.cuminc for more parameters
However, I find these base plots to be quite unappealing, and so would highly recommend using the survminer package which utilises ggplot2 to create better plots:
library(survminer)
ggsurvplot(mod1)
ggsurvplot(mod2)
ggcompetingrisks(cif)
If you wish to analyse the data and find the effects of a treatment using the IPTW, you can use the crr() function as in your question, and this will return the subdistribution proportional hazards regression coefficients (i.e. the results of a Fine & Gray model). These can be interpreted in much the same way as the results of a Cox proportional hazards analysis (whilst accounting for the competing risks). Bear in mind that the IPTW is not a panacea, and therefore these results may still be non-causal.
Once you have the Fine & Gray model results from crr(), you can create a plot of the subdistributions across the two treatments by using the following code (assuming you have made the competingrisk object above)
cov2 <- matrix(c(0,1),ncol=1)
colnames(cov2) <- "treatment"
cr_pred <- predict(competingrisk,cov2)
plot(cr_pred)
# see ?plot.predict.crr for parameters
A few good resources are:
This blogpost by Emily Zabor
Tutorial in Biostatistics by Putter et al.

lm() breaks when using poly() with predictors set up as factors

I'm trying to model the relationship between a categorical predictor variable and a continuous outcome variable, using lm(). Since the predictor is categorical, the proper practice is to convert it to a factor variable type. However, when I use poly() for the predictor's regression term and the predictor is set up as a factor, lm() breaks. On the other hand, if I run lm() without poly() (but keep the predictor as a factor), or keep poly() but leave the predictor numeric, lm() doesn't break. I don't understand why it breaks, and I don't understand whether I can trust the results when it doesn't break.
Data
Data about 50 basketball players. One column (PosCode) gives the player's position in the game, and the other (Height) gives the player's height.
data <-
structure(list(Player = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27,
28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43,
44, 45, 46, 47, 48, 49, 50), PosCode = c(3, 3, 4, 1, 4, 1, 3,
1, 2, 2, 4, 1, 5, 5, 2, 1, 2, 5, 4, 4, 5, 4, 4, 4, 2, 3, 2, 3,
1, 1, 2, 4, 1, 2, 3, 1, 5, 4, 3, 4, 4, 1, 1, 4, 5, 1, 1, 1, 5,
2), Height = c(176.1, 179.1, 183.1, 169.7, 177.3, 179, 176.4,
174.9, 180.2, 176.5, 178.6, 167.9, 183.4, 166.2, 189.5, 171.9,
188.5, 172.6, 167.7, 172.6, 186.9, 163.8, 179.3, 165.4, 182.2,
166.1, 176.8, 171.9, 173.8, 163, 172.5, 184.9, 170.4, 170.6,
166.8, 172.6, 184.3, 163.3, 182.4, 165.8, 173.4, 182.1, 172.9,
184.9, 173.2, 185.8, 161.4, 186, 178.4, 170.7)), row.names = c(NA,
-50L), class = c("tbl_df", "tbl", "data.frame"))
> data
## # A tibble: 50 x 3
## Player PosCode Height
## <dbl> <dbl> <dbl>
## 1 1 3 176.
## 2 2 3 179.
## 3 3 4 183.
## 4 4 1 170.
## 5 5 4 177.
## 6 6 1 179
## 7 7 3 176.
## 8 8 1 175.
## 9 9 2 180.
## 10 10 2 176.
## # ... with 40 more rows
Modeling the data
I want to know whether I can predict players height from their position in the game. Since position is categorical (there are 5 possible positions), this variable should be of a factor type, with 5 levels.
library(tidyverse)
library(magrittr)
data %<>% mutate_at(vars(PosCode), ~ as.factor(.)) ## convert PosCode from dbl to fct
Modeling by using lm() without poly()
lm(Height ~ PosCode, data = data)
## Call:
## lm(formula = Height ~ PosCode, data = data)
##
## Coefficients:
## (Intercept) PosCode2 PosCode3 PosCode4 PosCode5
## 173.6714 4.9397 0.4429 0.1824 4.1857
Modeling by using lm() with poly()
lm(Height ~ poly(PosCode ,1), data = data)
## Error in qr.default(X) : NA/NaN/Inf in foreign function call (arg 1)
## In addition: Warning messages:
## 1: In mean.default(x) : argument is not numeric or logical: returning NA
## 2: In Ops.factor(x, xbar) : ‘-’ not meaningful for factors
If the predictor isn't a factor, there's no problem regardless of poly()
## convert PosCode from fct back to dbl
data %<>% mutate_at(vars(PosCode), ~ as.double(.))
## lm() without poly()
lm(Height ~ PosCode, data = data)
## Call:
## lm(formula = Height ~ PosCode, data = data)
## Coefficients:
## (Intercept) PosCode
## 174.3848 0.3112
## lm() with poly()
lm(Height ~ poly(PosCode ,1), data = data)
## Call:
## lm(formula = Height ~ poly(PosCode, 1), data = data)
## Coefficients:
## (Intercept) poly(PosCode, 1)
## 175.256 3.173
But clearly, treating PosCode as dbl rather than fct changes the model in a way that is wrong.
Bottom line
I don't understand why including poly() in lm() breaks it when the predictor is set up as a factor variable.
From help("poly"):
x a numeric vector at which to evaluate the polynomial.
Thus, you cannot use factors inside poly(). This is expected, because categorical variables (i.e., factors) have to be recoded, for example into dummy variables. And a quadratic effect makes sense neither for the categorical variable as a whole nor for the individual coded (dummy) variables. (It does not make sense from a substantive perspective, and even setting statistics aside, squaring a dummy variable that contains only 0s and 1s simply returns the same values.)
You can see that lm() recodes your factor because you get four coefficients (one less than the number of categories) for the variable PosCode in your first model.
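To see this recoding explicitly, you can inspect the design matrix that lm() builds (a quick illustration using the data from the question):
data$PosCode <- as.factor(data$PosCode)
head(model.matrix(Height ~ PosCode, data = data)) ## an intercept plus four dummy columns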
In the end, poly() is not of much use unless you set its argument degree to a value > 1.

Error in fitdist with gamma distribution

Below are my codes:
library(fitdistrplus)
s <- c(11, 4, 2, 9, 3, 1, 2, 2, 3, 2, 2, 5, 8,3, 15, 3, 9, 22, 0, 4, 10, 1, 9, 10, 11,
2, 8, 2, 6, 0, 15, 0 , 2, 11, 0, 6, 3, 5, 0, 7, 6, 0, 7, 1, 0, 6, 4, 1, 3, 5,
2, 6, 0, 10, 6, 4, 1, 17, 0, 1, 0, 6, 6, 1, 5, 4, 8, 0, 1, 1, 5, 15, 14, 8, 1,
3, 2, 9, 4, 4, 1, 2, 18, 0, 0, 10, 5, 0, 5, 0, 1, 2, 0, 5, 1, 1, 2, 3, 7)
o <- fitdist(s, "gamma", method = "mle")
summary(o)
plot(o)
and the error says:
Error in fitdist(s, "gamma", method = "mle") : the function mle
failed to estimate the parameters,
with the error code 100
The Gamma distribution doesn't allow zero values (the likelihood will evaluate to zero, and the log-likelihood will be infinite, for a response of 0) unless the shape parameter is exactly 1.0 (i.e., an exponential distribution - see below) ... that's a statistical/mathematical problem, not a programming problem. You're going to have to find something sensible to do about the zero values. Depending on what makes sense for your application, you could (for example)
choose a different distribution to test (e.g. pick a censoring point and fit a censored Gamma, or fit a zero-inflated Gamma distribution, or ...)
exclude the zero values (fitdist(s[s>0], ...))
set the zero values to some sensible non-zero value (fitdist(replace(s, which(s==0), 0.1), ...))
Which (if any) of these is best depends on your application.
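To see why the zeros are a problem, evaluate the Gamma log-density at zero (a quick check, not part of the original answer):
dgamma(0, shape=2, scale=1, log=TRUE) ## -Inf: zero is impossible whenever shape > 1
dgamma(0, shape=1, scale=1, log=TRUE) ## 0: finite, because shape=1 is the exponential case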
#Sandipan Dey's first answer (leaving the zeros in the data set) appears to make sense, but in fact it gets stuck at the shape parameter equal to 1.
o <- fitdist(s, "exp", method = "mle")
gives the same answer as #Sandipan's code (except that it estimates rate=0.2161572, the inverse of the scale parameter=4.626262 that's estimated for the Gamma distribution - this is just a change in parameterization). If you choose to fit an exponential instead of a Gamma, that's fine - but you should do it on purpose, not by accident ...
To illustrate that the zeros-included fit may not be working as expected, I'll construct my own negative log-likelihood function and display the likelihood surface for each case.
mfun <- function(sh,sc,dd=s) {
-sum(dgamma(dd,shape=sh,scale=sc,log=TRUE))
}
library(emdbook) ## for curve3d() helper function
Zeros-included surface:
cc1 <- curve3d(mfun(x,y),
## set up "shape" limits" so we evaluate
## exactly shape=1.000 ...
xlim=c(0.55,3.55),
n=c(41,41),
ylim=c(2,5),
sys3d="none")
png("gammazero1.png")
with(cc1,image(x,y,z))
dev.off()
In this case the surface is only defined at shape=1 (i.e. an exponential distribution); the white regions represent infinite log-likelihoods. It's not that shape=1 is the best fit, it's that it's the only fit ...
Zeros-excluded surface:
cc2 <- curve3d(mfun(x,y,dd=s[s>0]),
## set up "shape" limits" so we evaluate
## exactly shape=1.000 ...
xlim=c(0.55,3.55),
n=c(41,41),
ylim=c(2,5),
sys3d="none")
png("gammazero2.png")
with(cc2,image(x,y,z))
with(cc2,contour(x,y,z,add=TRUE))
abline(v=1.0,lwd=2,lty=2)
dev.off()
Just provide initial values for the gamma distribution parameters (scale, shape) to be estimated by mle via optim, along with lower bounds for the parameters, and it should work.
o <- fitdist(s, "gamma", lower=c(0,0), start=list(scale=1,shape=1))
summary(o)
#Fitting of the distribution ' gamma ' by maximum likelihood
#Parameters :
# estimate Std. Error
#scale 4.626262 NA
#shape 1.000000 NA
#Loglikelihood: -250.6432 AIC: 505.2864 BIC: 510.4766
As per the comments by #Ben Bolker, we may want to exclude the zero points first:
o <- fitdist(s[s!=0], "gamma", method = "mle", lower=c(0,0), start=list(scale=1,shape=1))
summary(o)
#Fitting of the distribution ' gamma ' by maximum likelihood
#Parameters :
# estimate Std. Error
#scale 3.401208 NA
#shape 1.622378 NA
#Loglikelihood: -219.6761 AIC: 443.3523 BIC: 448.19

How to get the order of the model used in auto.arima?

I am trying to use auto.arima on a time series. Now I need to know the order of the ARIMA model that has been selected. The return value is of class ARIMA, which doesn't seem to hold the order anywhere (or am I missing the values?). The code snippet and the output attributes are given below. (This is the same as in the R documentation.)
double[] list1 = {0, 0, 2, 1, 2, 10, 21, 0, 0, 3, 6, 5, 11, 51, 0, 11, 8, 6, 24, 25, 104, 0, 0, 6, 4, 5, 25, 71};
rconnection.assign("myData1", list1);
rconnection.eval("timeSeries1 <- ts(myData1,start=1,frequency="+staticBookingStage+")");
REXP fc = rconnection.eval("fitModel1 <- auto.arima(timeSeries1)");
System.out.println( fc.asList().names);
Output
[coef, sigma2, var.coef, mask, loglik, aic, arma, residuals, call, series, code, n.cond, nobs, model, bic, aicc, x, fitted]
Use the arimaorder() function:
library(forecast)
fit <- auto.arima(WWWusage)
arimaorder(fit)
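The order is also stored, less conveniently, in the arma component that appears in your list of attributes; a small sketch of pulling the non-seasonal order out of it by hand:
fit$arma # c(p, q, P, Q, period, d, D), as documented in ?arima
fit$arma[c(1, 6, 2)] # non-seasonal (p, d, q)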
