Validating a model and introducing a new predictor in glm - r

I am hitting my head against the computer...
I have a prediction model in R that goes like this
m.final.glm <- glm(binary_outcome ~ rcs(PredictorA, parms=kn.a) + rcs(PredictorB, parms=kn.b) + PredictorC , family = "binomial", data = train_data)
I want to validate this model on test_data2 - first by updating the linear predictor (lp)
train_data$lp <- predict(m.final.glm, train_data)
test_data2$lp <- predict(m.final.glm, test_data2)
lp2 <- predict(m.final.glm, test_data2)
m.update2.lp <- glm(binary_outcome ~ 1, family="binomial", offset=lp2, data=test_data2)
m.update2.lp$coefficients[1]
m.final.update2.lp <- m.final.glm
m.final.update2.lp$coefficients[1] <- m.final.update2.lp$coefficients[1] + m.update2.lp$coefficients[1]
m.final.update2.lp$coefficients[1]
p2.update.lp <- predict(m.final.update2.lp, test_data2, type="response")
This gets me to the point where I have updated the linear predictor, i.e. in the summary of the model only the intercept is different, but the coefficients of each predictor are the same.
Next, I want to include a new predictor (it is categorical, if that matters), PredictorD, into the updated model. This means that the model has to have the updated linear predictor and the same coefficients for Predictors A, B and C but the model also has to contain Predictor D and estimate its significance.
How do I do this? I will be very grateful if you could help me with this. Thanks!!!

Related

Including an offset when using cph {rms} for validation of a Cox model

I am externally validating and updating a Cox model in R. The model predicts 5 year risk. I don't have access to the original data, just the equation for the linear predictor and the value of the baseline survival probability at 5 years.
I have assessed calibration and discrimination of the model in my dataset and found that the model needs to be updated.
I want to update the model by adjusting baseline risk only, so I have been using a Cox model with the linear predictor ("beta.sum") included as an offset term, to restrict its coefficient to be 1.
I want to be able to use cph instead of coxph as it makes internal validation by bootstrapping much easier. However, when including the linear predictor as an offset I get the error:
"Error in exp(object$linear.predictors) :
non-numeric argument to mathematical function"
Is there something I am doing incorrectly, or does the cph function not allow an offset within the formula? If so, is there another way to restrict the coefficient to 1?
My code is below:
load(file="k.Rdata")
### Predicted risk ###
# linear predictor (LP)
k$beta.sum <- -0.2201 * ((k$age/10)-7.036) + 0.2467 * (k$male - 0.5642) - 0.5567 * ((k$epi/5)-7.222) +
0.4510 * (log(k$acr_mgmmol/0.113)-5.137)
k$pred <- 1 - 0.9365^exp(k$beta.sum)
# Recalibrated model
# Using coxph:
cox.new <- coxph(Surv(time, rrt) ~ offset(beta.sum), data = k, x=TRUE, y=TRUE)
# new baseline survival at 5 years
library(pec)
predictSurvProb(cox.new, newdata=data.frame(beta.sum=0), times = 5) #baseline = 0.9570
# Using cph
cph.new <- cph(Surv(time, rrt) ~ offset(beta.sum), data=k, x=TRUE, y=TRUE, surv=TRUE)
The model will run without surv=TRUE included, but this means a lot of the commands I want to use cannot work, such as calibrate, validate and predictSurvProb.
EDIT:
I will include a way to reproduce the error
library(purr)
library(rms)
n <- 1000
set.seed(1234)
status <- as.numeric(rbernoulli(n, p=0.1))
time <- -5* log(runif(n))
lp <- rnorm(1000, mean=-2.7, sd=1)
mydata <- data.frame(status, time, lp)
test <- cph(Surv(time, status) ~ offset(lp), data=mydata, surv=TRUE)

Estimating multiple OLS with AR residuals

I am new to modeling in R, so I'm stumbling a bit...
I have a model in Eviews, which I have to translate to R and make further upgrades.
The model is multiple OLS with AR(1) of residuals.
I implemented it like this
model1 <- lm(y ~ x1 + x2 + x3, data)
data$e <- dplyr:: lag(residuals(model1), 1)
model2 <- lm(y ~ x1 + x2 + x3 + e, data)
My issue is the same as it is in this thread and I expected it: while parameter estimations are similar, they are different enought that I cannot use it.
I am planing of using ARIMA from stats package, but the problem is implementation. How to make AR(1) on residuals, and make other variables as they are?
Provided I understood you correctly, you can supply external regressors to your arima model through the xreg argument.
You don't provide sample data so I don't have anything to play with, but your model should translate to something like
model <- arima(data$y, xreg = as.matrix(data[, c("x1", "x2", "x3")]), order = c(1, 0, 0))
Explanation: The first argument data$y contains your time series data. xreg contains your external regressors as a matrix, with every column containing as many observations for that regressor as you have time points. order = c(1, 0, 0) defines an AR(1) model.

Confusion matrix for multinomial logistic regression & ordered logit

I would like to create confusion matrices for a multinomial logistic regression as well as a proportional odds model but I am stuck with the implementation in R. My attempt below does not seem to give the desired output.
This is my code so far:
CH <- read.table("http://data.princeton.edu/wws509/datasets/copen.dat", header=TRUE)
CH$housing <- factor(CH$housing)
CH$influence <- factor(CH$influence)
CH$satisfaction <- factor(CH$satisfaction)
CH$contact <- factor(CH$contact)
CH$satisfaction <- factor(CH$satisfaction,levels=c("low","medium","high"))
CH$housing <- factor(CH$housing,levels=c("tower","apartments","atrium","terraced"))
CH$influence <- factor(CH$influence,levels=c("low","medium","high"))
CH$contact <- relevel(CH$contact,ref=2)
model <- multinom(satisfaction ~ housing + influence + contact, weights=n, data=CH)
summary(model)
preds <- predict(model)
table(preds,CH$satisfaction)
omodel <- polr(satisfaction ~ housing + influence + contact, weights=n, data=CH, Hess=TRUE)
preds2 <- predict(omodel)
table(preds2,CH$satisfaction)
I would really appreciate some advice on how to correctly produce confusion matrices for my 2 models!
You can refer -
Predict() - Maybe I'm not understanding it
Here in predict() you need to pass unseen data for prediction.

GLMNET prediction with intercept

I have two questions about prediction using GLMNET - specifically about the intercept.
I made a small example of train data creation, GLMNET estimation and prediction on the train data (which I will later change to Test data):
# Train data creation
Train <- data.frame('x1'=runif(10), 'x2'=runif(10))
Train$y <- Train$x1-Train$x2+runif(10)
# From Train data frame to x and y matrix
y <- Train$y
x <- as.matrix(Train[,c('x1','x2')])
# Glmnet model
Model_El <- glmnet(x,y)
Cv_El <- cv.glmnet(x,y)
# Prediction
Test_Matrix <- model.matrix(~.-y,data=Train)[,-1]
Test_Matrix_Df <- data.frame(Test_Matrix)
Pred_El <- predict(Model_El,newx=Test_Matrix,s=Cv_El$lambda.min,type='response')
I want to have an intercept in the estimated formula. This code gives an error concerning the dimensions of the Test_Matrix matrix unless I remove the (Intercept) column of the matrix - as in
Test_Matrix <- model.matrix(~.-y,data=Train)[,-1]
My questions are:
Is it the right way to do this in order to get the prediction - when I want the prediction formula to include the intercept?
If it is the right way: Why do I have to remove the intercept in the matrix?
Thanks in advance.
The matrix x you were feeding into the glmnet function doesn't contain an intercept column. Therefore, you should respect this format when constructing your test matrix: i.e. just do model.matrix(y ~ . - 1, data = Train).
By default, an intercept is fit in glmnet (see the intercept parameter in the glmnet function). Therefore, when you called glmnet(x, y), you are technically doing glmnet(x, y, intercept = T). Thus, even though your x matrix didn't have an intercept, one was fit for you.
If you want to predict a model with intercept, you have to fit a model with intercept. Your code used model matrix x <- as.matrix(Train[,c('x1','x2')]) which is intercept-free, therefore if you provide an intercept when using predict, you get an error.
You can do the following:
x <- model.matrix(y ~ ., Train) ## model matrix with intercept
Model_El <- glmnet(x,y)
Cv_El <- cv.glmnet(x,y)
Test_Matrix <- model.matrix(y ~ ., Train) ## prediction matrix with intercept
Pred_El <- predict(Model_El, newx = Test_Matrix, s = Cv_El$lambda.min, type='response')
Note, you don't have to do
model.matrix(~ . -y)
model.matrix will ignore the LHS of the formula, so it is legitimate to use
model.matrix(y ~ .)

Replacing intercept with dummy variables in ARIMAX models in R

I am attempting to fit an ARIMAX model to daily consumption data in R. When I perform an OLS regression with lm() I am able to include a dummy variable for each unit and remove the constant term (intercept) to avoid less then full rank matrices.
lm1 <- lm(y ~ -1 + x1 + x2 + x3, data = dat)
I have not found a way to do this with arima() which forces me to use the constant term and exclude one of the dummy variables.
with(dat, arima(y, xreg = cbind(x1, x2))
Is there a specific reason why arima() doesn't allow this and is there a way to bypass?
See the documentation for the argument include.mean in ?arima, it seems you want the following: arima(y, xreg = cbind(x1, x2), include.mean=FALSE).
Be also aware of the definition of the model fitted by ARIMA as pointed by #RichardHardy.

Resources