use cox model to estimate survival - r

I first establish a cox model in R:
test1<- test[1:20,]
model.1 <- coxph(Surv(test1$days,test1$status==1) ~ test1$MTT+test1$ADC,data=test1)
and when i tried to predict next patient's survival like this:
covs1 <- data.frame(test[21,]$MTT,test[21,]$ADC)
summary(survfit(model.1, newdata= covs1, type ="aalen"))
it gave me too many survival results and the warning is
"'newdata' had 1 row but variables found have 20 rows "
fyi, there are 20 events and the results contain 20 survival results.

The names of the columns in the datframe being given as the basis for a prediction must have the same column names as are in the RHS of the model formula. I don't think yours will qualifiy unless you do something like this:
test1<- test[1:20,]
model.1 <- coxph( Surv(days, status==1) ~ MTT + ADC, data=test1)
covs1 <- test[21, c("MTT", "ADC")]
# then do your prediction
You should not use $ to supply arguments to Surv. It is important that the model be constructed in the environment of the dataframe.

Related

What is the correct way to use weights in a logistic regression in R?

My data includes survey data of car buyers. My data has a weight column that i used in SPSS to get sample sizes. Weight column is affected by demographic factors & vehicle sales. Now i am trying to put together a logistic regression model for a car segment which includes a few vehicles. I want to use the weight column in the logistic regression model & i tried to do so using "weights" in glm function. But the results are horrific. Deviances are too high, McFadden Rsquare too low. My dependent variable is binary, independent variables are on 1 to 5 scale. Weight column is numerical, ranging from 32 to 197. Could that be a reason that results are poor? Do i need to have values in weight column below 1?
Format of input file to R is -
WGT output I1 I2 I3 I4 I5
67 1 1 3 1 5 4
I1, I2, I3 being independent variables
logr<-glm(output~1,data=data1,weights=WGT,family="binomial")
logrstep<-step(logr,direction = "both",scope = formula(data1))\
logr1<-glm(output~ (formula from final iteration),weights = WGT,data=data1,family="binomial")
hl <- hoslem.test(data1$output,fitted(logr1),g=10)
I want a logistic regression model with better accuracy & gain a better understanding of using weights with logistic regression
I would check out the survey package. This will allow you to specify weights for the survey design using the svydesign function. Additionally, you can use the svyglm function to perform your weighted logistic regression. See http://r-survey.r-forge.r-project.org/survey/
Something like the following assuming your data is in a dataframe called df
my_svy <- svydesign(df, ids = ~1, weights = ~WGT)
Then you can do the following:
my_fit <- svyglm(output ~1, my_svy, family = "binomial")
For a full reprex check out the below example
library(survey)
# Generate Some Random Weights
mtcars$wts <- rnorm(nrow(mtcars), 50, 5)
# Make vs a factor just for illustrative purposes
mtcars$vs <- as.factor(mtcars$vs)
# Build the Complete survey Object
svy_df <- svydesign(data = mtcars, ids = ~1, weights = ~wts)
# Fit the logistic regression
fit <- svyglm(vs ~ gear + disp, svy_df, family = "binomial")
# Store the summary object
(fit_sumz <- summary(fit))
# Look at the AIC if desired
AIC(fit)
# Pull out the deviance if desired
fit_sumz$deviance
As far as the stepwise regression, this typically isn't a great methodology for a statistical point of view. It results in a higher R2 and some other issues regarding inference (see https://www.stata.com/support/faqs/statistics/stepwise-regression-problems/).

Pooling sandwich variance estimator over multiply imputed datasets

I am running a poisson regression on multiply imputed data to predict a common binary outcome. After running mice, I have obtained a stacked data frame comprising the raw data and five imputed datasets. Here is a toy example:
df <- mice::nhanes
imp <- mice(df) #impute data
com <- complete(imp, "long", TRUE) #creates data frame
I now want to:
Run the regression on each imputed dataset
Calculate robust standard errors using a sandwich variance estimator
Combine / pool the results of both analyses
I can run the regression on the mids object using the with and pool commands:
fit.pois.mids <- with(imp, glm(hyp ~ age + bmi + chl, family = poisson))
summary(pool(fit.pois.mids))
I can also run the regression on each of the imputed datasets before combining them:
imp.df <- split(com, com$.imp); names(imp.df) <- c("raw", "imp1", "imp2", "imp3", "imp4", "imp5") #creates list of data frames representing each imputed dataset
fit.pois <- lapply(imp.df, function(x) {
fit <- glm(hyp ~ age + bmi + chl, data = x, family = poisson)
fit
})
summary(MIcombine(fit.pois))
Similarly, I can calculate the standard errors for each imputed dataset:
sand <- lapply(fit.pois, function(x) {
se <- coeftest(x, vcov = sandwich)
se
})
Unfortunately, MIcombine does not seem to return p-values. This post suggests using Zelig, but for that matter, I may as well just use mice. Further it does not appear to be possible to combine the estimates of the standard errors:
summary(MIcombine(sand.df))
Error in UseMethod("vcov") :
no applicable method for 'vcov' applied to an object of class "coeftest"
For the sake of simplicity, it seems that mice is a better option for pooling the results of the regression; however, I am wondering how I would go about updating (i.e., pooling and combining) the standard errors. What are some ways this could be addressed?

Why doesn't predict like the dimensions of my newdata?

I want to perform a multiple regression in R and make predictions based on the trained model. Below is an example code I am using:
price = c(10,18,18,11,17)
predictors = cbind(c(5,6,3,4,5),c(2,1,8,5,6))
predict(lm(price ~ predictors), data.frame(predictors=matrix(c(3,5),nrow=1)))
So, based on the 2-variate regression model trained by 5 samples, I want to make a prediction for the test data point where the first variate is 3 and second variate is 5. But I get a warning from above code saying that 'newdata' had 1 rows but variable(s) found have 5 rows. How can I correct above code? Below code works fine where I give the variables separately to the model formula. But since I will have hundreds of variates, I have to give them in a matrix since it would be unfeasible to append hundreds of columns using + sign.
price = c(10,18,18,11,17)
predictor1 = c(5,6,3,4,5)
predictor2 = c(2,1,8,5,6)
predict(lm(price ~ predictor1 + predictor2), data.frame(predictor1=3,predictor2=5))
Thanks in advance!
The easiest way to get past the issue of matching up variable names from a matrix of covariates to newdata data.frame column names is to put your input data into a data.frame as well. Try this
price = c(10,18,18,11,17)
predictors = cbind(c(5,6,3,4,5),c(2,1,8,5,6))
indata<-data.frame(price,predictors=predictors)
predict(lm(price ~ ., indata), data.frame(predictors=matrix(c(3,5),nrow=1)))
Here we combine price and predictors into a data.frame such that it will be named the same say as the newdata data.frame. We use the . in the formula to mean "all other columns" so we don't have to specify them explicitly.
Need to build the model first, then predict from it:
mod1 <- lm(price ~ predictor1 + predictor2)
predict( mod1 , data.frame(predictor1=3,predictor2=5))

'predict' gives different results than using manually the coefficients from 'summary'

Let me state my confusion with the help of an example,
#making datasets
x1<-iris[,1]
x2<-iris[,2]
x3<-iris[,3]
x4<-iris[,4]
dat<-data.frame(x1,x2,x3)
dat2<-dat[1:120,]
dat3<-dat[121:150,]
#Using a linear model to fit x4 using x1, x2 and x3 where training set is first 120 obs.
model<-lm(x4[1:120]~x1[1:120]+x2[1:120]+x3[1:120])
#Usig the coefficients' value from summary(model), prediction is done for next 30 obs.
-.17947-.18538*x1[121:150]+.18243*x2[121:150]+.49998*x3[121:150]
#Same prediction is done using the function "predict"
predict(model,dat3)
My confusion is: the two outcomes of predicting the last 30 values differ, may be to a little extent, but they do differ. Whys is it so? should not they be exactly same?
The difference is really small, and I think is just due to the accuracy of the coefficients you are using (e.g. the real value of the intercept is -0.17947075338464965610... not simply -.17947).
In fact, if you take the coefficients value and apply the formula, the result is equal to predict:
intercept <- model$coefficients[1]
x1Coeff <- model$coefficients[2]
x2Coeff <- model$coefficients[3]
x3Coeff <- model$coefficients[4]
intercept + x1Coeff*x1[121:150] + x2Coeff*x2[121:150] + x3Coeff*x3[121:150]
You can clean your code a bit. To create your training and test datasets you can use the following code:
# create training and test datasets
train.df <- iris[1:120, 1:4]
test.df <- iris[-(1:120), 1:4]
# fit a linear model to predict Petal.Width using all predictors
fit <- lm(Petal.Width ~ ., data = train.df)
summary(fit)
# predict Petal.Width in test test using the linear model
predictions <- predict(fit, test.df)
# create a function mse() to calculate the Mean Squared Error
mse <- function(predictions, obs) {
sum((obs - predictions) ^ 2) / length(predictions)
}
# measure the quality of fit
mse(predictions, test.df$Petal.Width)
The reason why your predictions differ is because the function predict() is using all decimal points whereas on your "manual" calculations you are using only five decimal points. The summary() function doesn't display the complete value of your coefficients but approximate the to five decimal points to make the output more readable.

omitting columns from data

The data set has 252 observations and 18 variables. I needed a test sample with every tenth observation and a training sample with the remaining data so I created two separate datasets:
id <- seq(1, nrow(fat), by=10)
test <- fat[id,]
train <-fat[id,]
I can do a linear regression using all predictors except brozek and density:
model2 <- lm(siri ~ .-brozek -density, train)
I need to do a principal component regression model
fatpca<-prcomp(fat[-id,]
but this still includes the variables brozek and density.
How do I exclude these variables to do a PCR model?
To remove a couple of variables you have a few choices:
trainsub <- subset(train,select=-c(brozek,density))
or
trainsub <- train[!colnames(train) %in% c("brozek","density"))
or
trainsub <- dplyr::select(train,-c(brozek,density))
You can also use a formula interface with prcomp, i.e.
prcomp(~ . -brozek - density, data=train)

Resources