Passing predictors to a cox model from a list - r

I'm running a pretty simple Cox model with coxph() from the survival package.
surv_object <- Surv(time, event)
model <- coxph(surv_object ~ female + age + ethnicity + imd, data = df)
I need to run multiple Cox models, and for each model, my predictors change. I have all my predictors stored in a separate data frame such as this (we'll call it pred_df):
> pred_df
# A tibble: 4 x 2
predictor endpoint
<chr> <chr>
1 female Mortality
2 age Mortality
3 ethnicity Mortality
4 imd Mortality
Is there an easy way to pass the items from the predictor column to coxph()? Something like this:
coxph(surv_object ~ predictors, data = df)
What I've tried already:
I've tried a rather clumsy hack along these lines:
pred_vars <- pred_df %>%
  pull(predictor) %>%           # extract column values as a vector
  paste(collapse = " + ") %>%   # combine values into a string
  parse(text = .)               # parse the string as an expression
model <- coxph(surv_object ~ eval(pred_vars), data = df)
R actually accepts this and fits a model, but the output is uninterpretable: instead of reporting the individual predictors (female, age, ethnicity and imd), it reports a single term called eval(pred_vars).
Call:
coxph(formula = Surv(time, event) ~ eval(pred_vars), data = df)
n= 62976, number of events= 12882
(3287 observations deleted due to missingness)
coef exp(coef) se(coef) z Pr(>|z|)
eval(pred_vars) 3.336e-05 1.000e+00 5.339e-06 6.249 4.14e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
eval(pred_vars) 1 1 1 1
Concordance= 0.515 (se = 0.003 )
Rsquare= 0.001 (max possible= 0.989 )
Likelihood ratio test= 38.28 on 1 df, p=6e-10
Wald test = 39.04 on 1 df, p=4e-10
Score (logrank) test = 39.07 on 1 df, p=4e-10
There must be a simpler way of doing this?

Try reformulate(), which builds a formula from a character vector of term labels plus a response.
formula <- reformulate(
  termlabels = pred_df[["predictor"]],  # the whole predictor column, as a character vector
  response = "surv_object"              # the Surv() object created earlier
)
coxph(formula = formula, data = df)
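With the pred_df shown in the question, the resulting formula should look like this:
> formula
surv_object ~ female + age + ethnicity + imd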

You can do this in base R with as.formula and paste(..., collapse = " + "), like this:
foo <- as.formula(paste0("Surv(time, event) ~ ", paste(pred_df$predictor, collapse = " + ")))
Result of that line:
> foo
Surv(time, event) ~ female + age + ethnicity + imd
And then you just pass foo to your call to coxph.
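If you need several Cox models (say one per endpoint listed in pred_df), either approach can be wrapped in a loop. A minimal sketch, assuming df contains the time and event columns as well as every predictor:
library(survival)
# one character vector of predictor names per endpoint
pred_sets <- split(pred_df$predictor, pred_df$endpoint)
# fit one Cox model per predictor set
models <- lapply(pred_sets, function(p) {
  f <- as.formula(paste0("Surv(time, event) ~ ", paste(p, collapse = " + ")))
  coxph(f, data = df)
})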

Related

Calculate by hand Fitted Values of an Interaction from a regression output

I am working with an interaction model similar to this one below:
set.seed(1993)
moderating <- sample(c("Yes", "No"),100, replace = T)
x <- sample(c("Yes", "No"), 100, replace = T)
y <- sample(1:100, 100, replace = T)
df <- data.frame(y, x, moderating)
Results <- lm(y ~ x*moderating)
summary(Results)
Call:
lm(formula = y ~ x * moderating)
Residuals:
Min 1Q Median 3Q Max
-57.857 -29.067 3.043 22.960 59.043
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 52.4000 6.1639 8.501 2.44e-13 ***
xYes 8.4571 9.1227 0.927 0.356
moderatingYes -11.4435 8.9045 -1.285 0.202
xYes:moderatingYes -0.1233 12.4563 -0.010 0.992
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 30.82 on 96 degrees of freedom
Multiple R-squared: 0.04685, Adjusted R-squared: 0.01707
F-statistic: 1.573 on 3 and 96 DF, p-value: 0.2009
I'm learning how to calculate the fitted value of an interaction from a regression table. In the example, the base (omitted) category is x = No and moderating = No.
Thus far, I know the following fitted values:
#Calculate Fitted Value From a Regression Interaction by hand
#Omitted Variable = X_no.M_no
X_no.M_no <- 52.4000
X_yes.M_no <- 52.4000 + 8.4571
X_no.M_yes <- 52.4000 + -11.4435
X_yes.M_yes #<- ?
I do not understand how the final category, X_yes.M_yes, is calculated. My initial thought was X_yes.M_yes <- 52.4000 + -0.1233 (the intercept plus the interaction term), but that is incorrect. I know it's incorrect because, using the predict function, the fitted value of X_yes.M_yes is 49.29032, not 52.4000 + -0.1233 = 52.2767.
How do I calculate, by hand, the predicted value of the X_yes.M_yes category?
Here are the predicted values as generated from the predict function in R
#Validated Here Using the Predict Function:
newdat <- NULL
for (m in na.omit(unique(df$moderating))) {
  for (i in na.omit(unique(df$x))) {
    moderating <- m
    x <- i
    newdat <- rbind(newdat, data.frame(x, moderating))
  }
}
Prediction.1 <- cbind(newdat, predict(Results, newdat, se.fit = TRUE))
Prediction.1
Your regression looks like this in math:
y_hat = a + b*x + c*m + d*x*m
where x = 1 for "Yes" and 0 for "No", and m is defined the same way from moderating.
Then X_yes.M_yes implies x = 1 and m = 1, so your prediction is a + b + c + d,
or in your notation: X_yes.M_yes = 52.4000 + 8.4571 - 11.4435 - 0.1233 = 49.2903
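You can check this in R by summing the fitted coefficients directly, which avoids the rounding in the printed table:
# intercept + xYes + moderatingYes + xYes:moderatingYes
sum(coef(Results))
# 49.29032 -- the same value predict() returns for x = "Yes", moderating = "Yes"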

Logistic Regression on NBA shot data

I am using NBA shot data and am attempting to create shot prediction models using different regression techniques. However, when I try to fit a logistic regression model I run into the following warning:
Warning message:
glm.fit: algorithm did not converge
Also, the predictions do not seem to work at all: they are identical to the original Y variable (make or miss). I will provide my code below. I got the data from here: Shot Data.
nba_shots <- read.csv("shot_logs.csv")
library(dplyr)
library(ggplot2)
library(data.table)
library("caTools")
library(glmnet)
library(caret)
nba_shots_clean <- data.frame("game_id" = nba_shots$GAME_ID, "location" =
nba_shots$LOCATION, "shot_number" = nba_shots$SHOT_NUMBER,
"closest_defender" = nba_shots$CLOSEST_DEFENDER,
"defender_distance" = nba_shots$CLOSE_DEF_DIST, "points" = nba_shots$PTS,
"player_name" = nba_shots$player_name, "dribbles" = nba_shots$DRIBBLES,
"shot_clock" = nba_shots$SHOT_CLOCK, "quarter" = nba_shots$PERIOD,
"touch_time" = nba_shots$TOUCH_TIME, "game_result" = nba_shots$W
, "FGM" = nba_shots$FGM)
mean(nba_shots_clean$shot_clock) # NA
# this returned NA, which means that there are NAs in this column that we
# need to clean up
# if the shot clock was NA I assume that this means it was the end of a
# quarter and the shot clock was off.
# For now I'm going to just set all of these NAs equal to zero, so all zeros
# mean it is the end of a quarter
# checking the amount of NAs
last_shots <- nba_shots_clean[is.na(nba_shots_clean$shot_clock),]
nrow(last_shots) # this tells me there are 5567 shots taken when the shot
# clock was turned off at the end of a quarter
# setting these NAs equal to zero
nba_shots_clean[is.na(nba_shots_clean)] <- 0
# checking to see if it worked
nrow(nba_shots_clean[is.na(nba_shots_clean$shot_clock),]) # it worked
# create a test and train set
split = sample.split(nba_shots_clean, SplitRatio=0.75)
nbaTrain = subset(nba_shots_clean, split==TRUE)
nbaTest = subset(nba_shots_clean, split==FALSE)
# logistic regression
nbaLogitModel <- glm(FGM ~ location + shot_number + defender_distance +
points + dribbles + shot_clock + quarter + touch_time, data=nbaTrain,
family="binomial", na.action = na.omit)
nbaPredict = predict(nbaLogitModel, newdata=nbaTest, type="response")
cm = table(nbaTest$FGM, nbaPredict > 0.5)
print(cm)
This gives me the following output, which tells me the prediction didn't do anything, as it's the same as before.
    FALSE  TRUE
  0 21428     0
  1     0 17977
I would really appreciate any guidance.
The confusion matrix of your model (model prediction vs. nbaTest$FGM) tells you that your model has 100% accuracy!
This is due to the points variable in your dataset, which is perfectly associated with the dependent variable: 0 points always means a miss and 2 or 3 points always means a make, so points separates the two classes completely. That perfect separation is also what triggers the glm.fit: algorithm did not converge warning:
table(nba_shots_clean$points, nba_shots_clean$FGM)
        0     1
  0 87278     0
  2     0 58692
  3     0 15133
Try deleting points from your model:
# create a test and train set
set.seed(1234)
split = sample.split(nba_shots_clean, SplitRatio=0.75)
nbaTrain = subset(nba_shots_clean, split==TRUE)
nbaTest = subset(nba_shots_clean, split==FALSE)
# logistic regression
nbaLogitModel <- glm(FGM ~ location + shot_number + defender_distance +
dribbles + shot_clock + quarter + touch_time, data=nbaTrain,
family="binomial", na.action = na.omit)
summary(nbaLogitModel)
There are no warning messages now, and the estimated model is:
Call:
glm(formula = FGM ~ location + shot_number + defender_distance +
dribbles + shot_clock + quarter + touch_time, family = "binomial",
data = nbaTrain, na.action = na.omit)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.8995 -1.1072 -0.9743 1.2284 1.6799
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.427688 0.025446 -16.808 < 2e-16 ***
locationH 0.037920 0.012091 3.136 0.00171 **
shot_number 0.007972 0.001722 4.630 0.000003656291 ***
defender_distance -0.006990 0.002242 -3.117 0.00182 **
dribbles 0.010582 0.004859 2.178 0.02941 *
shot_clock 0.032759 0.001083 30.244 < 2e-16 ***
quarter -0.043100 0.007045 -6.118 0.000000000946 ***
touch_time -0.038006 0.005700 -6.668 0.000000000026 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 153850 on 111532 degrees of freedom
Residual deviance: 152529 on 111525 degrees of freedom
AIC: 152545
Number of Fisher Scoring iterations: 4
The confusion matrix is:
nbaPredict = predict(nbaLogitModel, newdata=nbaTest, type="response")
cm = table(nbaTest$FGM, nbaPredict > 0.5)
print(cm)
    FALSE  TRUE
  0 21554  5335
  1 16726  5955
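A quick sanity check on the corrected model is to compute the overall accuracy straight from that confusion matrix:
accuracy <- sum(diag(cm)) / sum(cm)   # correct predictions / all predictions
accuracy                              # (21554 + 5955) / 49570, roughly 0.55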

Successively removing predictor variable from formula

I have a model formula in the form of
model.all <- lme(Response ~ A + B + C)
I would like to update this model by successively removing a predictor variable from the model, so I would end up with 3 models, specifically:
mod.1 <- lme(Response ~ B + C) ; mod.2 <- lme(Response ~ A + C) ; mod.3 <- lme(Response ~ A + B)
I am thinking of a loop. I am aware of the update function, but I have too many predictor variables to change the code by hand.
Any suggestions would be appreciated.
I would use combn on this occasion; see the example below:
Example Data
Response <- runif(100)
A <- runif(100)
B <- runif(100)
C <- runif(100)
Solution
a <- c('A','B','C') #the names of your variables
b <- as.data.frame(combn(a,2)) #two-way combinations of those using combn
#create the formula for each model
my_forms <- sapply(b, function(x) paste('Response ~ ', paste(x,collapse=' + ')))
> my_forms #the formulas that will be used in the model
V1 V2 V3
"Response ~ A + B" "Response ~ A + C" "Response ~ B + C"
#run each model
my_models <- lapply(my_forms, function(x) lm(as.formula(x)))
Output
> summary(my_models[[1]])
Call:
lm(formula = as.formula(x))
Residuals:
Min 1Q Median 3Q Max
-0.48146 -0.20745 -0.00247 0.24263 0.58341
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.32415 0.08232 3.938 0.000155 ***
A 0.25404 0.09890 2.569 0.011733 *
B 0.07955 0.10129 0.785 0.434141
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2828 on 97 degrees of freedom
Multiple R-squared: 0.06507, Adjusted R-squared: 0.04579
F-statistic: 3.375 on 2 and 97 DF, p-value: 0.03827
As you can see, each model is saved as a list element in my_models. I find this quite easy to set up and run.
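With three predictors, the two-way combinations happen to be exactly the leave-one-out sets the question asks for. For larger models you can build the same thing explicitly; a sketch, assuming the variables sit in a data frame called dat:
a <- c('A', 'B', 'C')                                   # predictor names; extend as needed
drop_one <- combn(a, length(a) - 1, simplify = FALSE)   # every set with one predictor left out
forms <- lapply(drop_one, function(x) reformulate(x, response = 'Response'))
my_models <- lapply(forms, function(f) lm(f, data = dat))  # swap lm for lme if needed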

How can I dynamically regress and predict multiple items with R?

I'm trying to write a function that regresses multiple items, then tries to predict data based on the model:
"tnt" <- function(train_dep, train_indep, test_dep, test_indep)
{
y <- train_dep
x <- train_indep
mod <- lm (y ~ x)
estimate <- predict(mod, data.frame(x=test_indep))
rmse <- sqrt(sum((test_dep-estimate)^2)/length(test_dep))
print(summary(mod))
print(paste("RMSE: ", rmse))
}
If I pass it the following, it fails:
train_dep = vector1
train_indep <- cbind(vector2, vector3)
test_dep = vector4
test_indep <- cbind(vector5, vector6)
tnt(train_dep, train_indep, test_dep, test_indep)
Changing the above to something like the following works, but I want this done dynamically so I can pass it a matrix of any number of columns:
x1 = x[,1]
x2 = x[,2]
mod <- lm(y ~ x1+x2)
estimate <- predict(mod, data.frame(x1=test_indep[,1], x2=test_indep[,2]))
This looks like it could help, but I'm still confused about the rest of the process: http://finzi.psych.upenn.edu/R/Rhelp02a/archive/70843.html
Try this instead:
tnt <- function(train_dep, train_indep, test_dep, test_indep)
{
  dat <- as.data.frame(cbind(y = train_dep, train_indep))
  mod <- lm(y ~ ., data = dat)
  newdat <- as.data.frame(test_indep)
  names(newdat) <- names(dat)[2:length(dat)]  # give the test columns the training names
  estimate <- predict(mod, newdata = newdat)
  rmse <- sqrt(sum((test_dep - estimate)^2) / length(test_dep))
  print(summary(mod))
  print(paste("RMSE: ", rmse))
}
Call:
lm(formula = y ~ ., data = dat)
Residuals:
1 2 3
0 0 0
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0 0 NA NA
V2 1 0 Inf <2e-16 ***
V3 NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0 on 1 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: Inf on 1 and 1 DF, p-value: < 2.2e-16
[1] "RMSE: 0"
Warning message:
In predict.lm(mod, newdata = newdat) :
prediction from a rank-deficient fit may be misleading
The warning is because the data you are offering produce an exact, rank-deficient fit.
Modified to use the as.formula suggestion from the comments. Roman's comment above about passing everything as one data.frame and using the . notation in formulas is probably the best solution, but I implemented it with paste because you should know how to use paste and as.formula :-).
tnt <- function(train_dep, train_indep, test_dep, test_indep) {
  form <- as.formula(paste("train_dep ~",
                           paste("train_indep$", colnames(train_indep), sep = "", collapse = " + ")))
  mod <- lm(form)
  # Because the formula hard-codes train_indep$..., predict() cannot map the new
  # data onto the model terms and effectively returns the in-sample fitted values.
  estimate <- predict(mod, data.frame(x = test_indep))
  rmse <- sqrt(sum((test_dep - estimate)^2) / length(test_dep))
  print(summary(mod))
  print(paste("RMSE: ", rmse))
}

R not removing terms when doing MAM

I want to do a MAM (minimal adequate model), but I'm having difficulty removing some terms:
full.model <- glm(SSB_sq ~ Veg_height + Bare + Common + Birds_Foot + Average_March +
                    Average_April + Average_May + Average_June15 + Average_June20 +
                    Average_June25 + Average_July15 + Average_July20 + Average_July25,
                  family = "poisson")
summary(full.model)
I believe I have to remove these terms to start the MAM like so:
model1 <- update(full.model, ~ . - Veg_height:Bare:Common:Birds_Foot:Average_March:
                   Average_April:Average_May:Average_June15:Average_June20:Average_June25:
                   Average_July15:Average_July20:Average_July25, family = "poisson")
summary(model1)
anova(model1,full.model,test="Chi")
But I get this output:
anova(model1,full.model,test="Chi")
Analysis of Deviance Table
Model 1: SSB_sq ~ Veg_height + Bare + Common + Birds_Foot + Average_March +
Average_April + Average_May + Average_June15 + Average_June20 +
Average_June25 + Average_July15 + Average_July20 + Average_July25
Model 2: SSB_sq ~ Veg_height + Bare + Common + Birds_Foot + Average_March +
Average_April + Average_May + Average_June15 + Average_June20 +
Average_June25 + Average_July15 + Average_July20 + Average_July25
Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1 213 237.87
2 213 237.87 0 0
I've tried putting plus signs in model1 instead of colons (clutching at straws while reading my notes), but the same thing happens.
Why are both my models the same? I've tried searching on Google, but I'm not very good with the terminology, so my searches aren't bringing up much.
If I read your intention correctly, are you trying to fit a null model with no terms in it? If so, a simpler way is just to use SSB_sq ~ 1 as the formula, meaning a model with only an intercept.
fit <- lm(sr ~ ., data = LifeCycleSavings)
fit0 <- lm(sr ~ 1, data = LifeCycleSavings)
## or via an update:
fit01 <- update(fit, . ~ 1)
Which gives, for example:
> anova(fit)
Analysis of Variance Table
Response: sr
Df Sum Sq Mean Sq F value Pr(>F)
pop15 1 204.12 204.118 14.1157 0.0004922 ***
pop75 1 53.34 53.343 3.6889 0.0611255 .
dpi 1 12.40 12.401 0.8576 0.3593551
ddpi 1 63.05 63.054 4.3605 0.0424711 *
Residuals 45 650.71 14.460
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> anova(fit, fit0)
Analysis of Variance Table
Model 1: sr ~ pop15 + pop75 + dpi + ddpi
Model 2: sr ~ 1
Res.Df RSS Df Sum of Sq F Pr(>F)
1 45 650.71
2 49 983.63 -4 -332.92 5.7557 0.0007904 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
An explanation of the formulae I use:
The first model used the shortcut ., which means all remaining variables in the data argument (in my model that meant all variables in LifeCycleSavings went on the RHS of the formula, except for sr, which is already on the LHS).
In the second model (fit0), we only include 1 on the RHS of the formula. In R, 1 means an intercept, so sr ~ 1 means fit an intercept-only model. By default, an intercept is assumed, hence we did not need 1 when specifying the first model fit.
If you want to suppress an intercept, add - 1 or + 0 to your formula.
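For example, with the same data, either of these fits all the predictors without an intercept:
lm(sr ~ . - 1, data = LifeCycleSavings)
lm(sr ~ . + 0, data = LifeCycleSavings)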
For your data, the first model would be:
full.model <- glm(SSB_sq ~ ., data = FOO, family = "poisson")
where FOO is the data frame holding your variables - you are using a data frame, aren't you? The null, intercept-only model would be specified using one of:
null.model <- glm(SSB_sq ~ 1, data = FOO, family = "poisson")
or
null.model <- update(full.model, . ~ 1)
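As for why your update call changed nothing: - Veg_height:Bare:...:Average_July25 asks R to remove a single thirteen-way interaction term, which was never in the model, so the formula is left as it was. To drop terms one at a time on the way to a MAM, the usual pattern with your variable names would be:
model1 <- update(full.model, . ~ . - Veg_height)   # remove a single main effect
anova(model1, full.model, test = "Chi")            # test whether the removal is justified
drop1(full.model, test = "Chisq")                  # or test dropping each term in one call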
