Both x and y variables are censored in a regression

Is there any existing routine in R to fit a Tobit regression model with both censored x and y variables? I know the survreg function in the survival package can deal with censored response variables. What about a left-censored x predictor variable?

There is a framework in the survival package both for Tobit regression and for interval-censored variables. This is Therneau's example using Tobin's original data:
library(survival)
tfit <- survreg(Surv(durable, durable > 0, type = 'left') ~ age + quant,
                data = tobin, dist = 'gaussian')
predict(tfit, type = "response")
And the Surv function will accept interval censoring.
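For example, an interval-censored variable can be coded with type = "interval2", where an NA bound marks an open end (so a left-censored value has a missing lower bound). The values below are made up purely for illustration:
lower <- c(NA, 2.1, 3.0, 5.5)   # NA lower bound = left-censored
upper <- c(1.4, 2.1, 4.2, NA)   # NA upper bound = right-censored; equal bounds = exact
Surv(lower, upper, type = "interval2")
# the resulting Surv object can then be used on the left-hand side of a survreg() formula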

R squared of predicted values

I have a basic question about the predict function and the R-squared of its predicted values.
data <- datasets::faithful
# Linear model
model <- lm(eruptions ~ ., data = data)
summary(model)[[9]]   # adjusted R-squared
# Create test data to predict eruption values with the model above
test.data <- data.frame(waiting = rnorm(80, mean = 70, sd = 14))
pred <- predict(model, test.data)   # predict eruptions from the previous model
# New data set with the predicted values
newdata <- data.frame(eruptions = pred, waiting = test.data$waiting)
# Linear model fitted to the predicted values
new.model <- lm(eruptions ~ ., data = newdata)
summary(new.model)[[9]]   # R-squared from predicted values
The R-squared of the data set with predicted values is 1. It seems obvious that if a model is fitted to values predicted from the same variables, the fit measured by R-squared is perfect (= 1). But what I actually want is to measure how well my model predicts other data sets, for example test.data in the code. Am I using the predict function correctly?
Thanks in advance
Pass your new variables as the "newdata" argument of the predict() function.
Type ?predict in an R console to get its R Documentation:
newdata
An optional data frame in which to look for variables with
which to predict. If omitted, the fitted values are used.
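To measure how well the model generalizes, you also need held-out observations whose true eruptions values are known, so the predictions can be compared against something. A minimal sketch (the 200-row training split is arbitrary):
set.seed(1)
idx   <- sample(nrow(faithful), 200)    # training rows
train <- faithful[idx, ]
test  <- faithful[-idx, ]
fit  <- lm(eruptions ~ waiting, data = train)
pred <- predict(fit, newdata = test)    # newdata supplies the test predictors
cor(pred, test$eruptions)^2             # R-squared of predictions vs. observed values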

Incorporating random intercepts in R package rms for mixed effects logistic regression

Frank Harrell's R package rms is an amazing tool for implementing multiple logistic regression. However, I would like to know how, or whether, it is possible to incorporate random effects into a model run through rms. I know that rms can run through nlme, but only via the generalized least squares function (Gls), not the lme function, which allows for random effects. Mixed-effects models can be problematic to analyse and interpret, but are occasionally necessary to account for nested effects.
I'm not sure if it's helpful in this case, but I have copied some code from the rms help files that runs a simple logistic regression model, and added a line demonstrating a mixed-effects logistic regression model run through glmmPQL from the MASS package.
n <- 1000 # define sample size
require(rms)
set.seed(17) # so can reproduce the results
age <- rnorm(n, 50, 10)
blood.pressure <- rnorm(n, 120, 15)
cholesterol <- rnorm(n, 200, 25)
sex <- factor(sample(c('female','male'), n,TRUE))
label(age) <- 'Age' # label is in Hmisc
label(cholesterol) <- 'Total Cholesterol'
label(blood.pressure) <- 'Systolic Blood Pressure'
label(sex) <- 'Sex'
units(cholesterol) <- 'mg/dl' # uses units.default in Hmisc
units(blood.pressure) <- 'mmHg'
ch <- cut2(cholesterol, g=40, levels.mean=TRUE) # use mean values in intervals
table(ch)
f <- lrm(ch ~ age)
require(MASS)
f1<-glmmPQL(ch~age, random=~1|sex, family=binomial)
summary(f1)
I would be interested in any insight as to whether random effects can be incorporated in rms, either for logistic regression (lrm) or run through nlme for linear regression.
Thanks to all
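For reference, a minimal random-intercept logistic regression with glmmPQL on a genuinely binary outcome might look like the sketch below; the clustering variable clinic and the outcome y are simulated purely for illustration and are not part of the rms example:
library(MASS)
set.seed(17)
n <- 1000
clinic <- factor(sample(1:10, n, replace = TRUE))   # hypothetical clustering variable
age    <- rnorm(n, 50, 10)
y      <- rbinom(n, 1, plogis(-2 + 0.03 * age))     # binary outcome
d      <- data.frame(y, age, clinic)
m <- glmmPQL(y ~ age, random = ~ 1 | clinic, family = binomial, data = d)
summary(m)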

Comparison of R and scikit-learn for a classification task with logistic regression

I am doing a Logistic Regression described in the book 'An Introduction to Statistical Learning with Applications in R' by James, Witten, Hastie, Tibshirani (2013).
More specifically, I am fitting the binary classification model to the 'Wage' dataset from the R package 'ISLR' described in §7.8.1.
The predictor 'age' (transformed to a degree-4 polynomial) is fitted against the binary outcome wage > 250. Age is then plotted against the predicted probabilities of the 'True' value.
The model in R is fit as follows:
library(ISLR)   # provides the Wage data set
attach(Wage)
fit = glm(I(wage > 250) ~ poly(age, 4), data = Wage, family = binomial)
agelims = range(age)
age.grid = seq(from = agelims[1], to = agelims[2])
preds = predict(fit, newdata = list(age = age.grid), se = T)
pfit = exp(preds$fit) / (1 + exp(preds$fit))
Complete code (author's site): http://www-bcf.usc.edu/~gareth/ISL/Chapter%207%20Lab.txt
The corresponding plot from the book: http://www-bcf.usc.edu/~gareth/ISL/Chapter7/7.1.pdf (right)
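The probability curve in the book's figure then comes from plotting the fitted probabilities against the age grid, roughly:
plot(age.grid, pfit, type = "l", xlab = "Age", ylab = "P(wage > 250)")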
I tried to fit a model to the same data in scikit-learn:
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
# df is the Wage data loaded as a pandas DataFrame
poly = PolynomialFeatures(4)
X = poly.fit_transform(df.age.values.reshape(-1, 1))
y = (df.wage > 250).map({False: 0, True: 1}).values
clf = LogisticRegression()
clf.fit(X, y)
X_test = poly.fit_transform(np.arange(df.age.min(), df.age.max()).reshape(-1, 1))
prob = clf.predict_proba(X_test)
I then plotted the probabilities of the 'True' value against the age range, but the resulting plot looks quite different. (I'm not talking about the CI bands or the rug plot, just the probability curve.) Am I missing something here?
After some more reading I understand that scikit-learn fits a regularized logistic regression model by default, whereas glm in R is unregularized. Statsmodels' GLM implementation (Python) is unregularized and gives results identical to R's.
http://statsmodels.sourceforge.net/stable/generated/statsmodels.genmod.generalized_linear_model.GLM.html#statsmodels.genmod.generalized_linear_model.GLM
The R package LiblineaR is similar to scikit-learn's logistic regression (when using 'liblinear' solver).
https://cran.r-project.org/web/packages/LiblineaR/
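For a rough comparison from the R side, something along these lines should reproduce scikit-learn's regularized behaviour. This is only a sketch under the assumption that type = 0 selects L2-regularized logistic regression and that cost plays the role of scikit-learn's C (see ?LiblineaR); note also that poly() builds an orthogonal polynomial basis rather than raw powers, so the fits are comparable but not identical:
library(ISLR)
library(LiblineaR)
x <- as.matrix(poly(Wage$age, 4))        # degree-4 (orthogonal) polynomial basis
y <- as.numeric(Wage$wage > 250)
m <- LiblineaR(data = x, target = y, type = 0, cost = 1)   # L2-regularized logistic regression
p <- predict(m, newx = x, proba = TRUE)$probabilities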

Two-stage least squares regression with binomial response variable

Hi, I'd like to run a two-stage least squares regression with a binomial response variable.
For a continuous response variable I use the tsls() function from the R package "sem".
Here are my commands; I'd like to know whether I'm doing this right.
(x: endogenous variable, z: instrumental variable, y: response variable (0 or 1))
xhat<-lm(x~z)$fitted.values
R<-glm(y~xhat, family=binomial)
R$residuals<-c(y - R$coef[1]+x*R$coef[2])
Thank you
This isn't quite right. For the glm() function in R, the family = binomial option defaults to a logistic regression, so your residual calculation does not apply the transformation you want.
You can use the residuals.glm() function to generate the residuals of a generalized linear model, and the residuals() function for a linear model. Similarly, you can use the predict() function instead of lm(x ~ z)$fitted.values to get the xhats. However, since in this case you have xhat and x (not the same variable), use code close to your solution above, but with the plogis() function for the inverse-logit transformation:
stageOne <- lm(x ~ z)                          # first stage: regress x on the instrument z
xhat <- predict(stageOne)                      # fitted values of x
stageTwo <- glm(y ~ xhat, family = binomial)   # second stage: logistic regression on xhat
residuals <- y - plogis(stageTwo$coef[1] + stageTwo$coef[2] * x)   # residuals on the probability scale
A nice feature of the predict() and residuals() or residuals.glm() functions is that they can be extended to other datasets.
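As a self-contained illustration of the two stages, here is a sketch with simulated data (the data-generating process, with a single instrument z and confounder u, is entirely made up):
set.seed(1)
n <- 500
z <- rnorm(n)                                # instrument
u <- rnorm(n)                                # unobserved confounder
x <- 0.8 * z + u + rnorm(n)                  # endogenous predictor
y <- rbinom(n, 1, plogis(-0.5 + x + u))      # binary response
stageOne <- lm(x ~ z)
xhat <- predict(stageOne)
stageTwo <- glm(y ~ xhat, family = binomial)
residuals <- y - plogis(coef(stageTwo)[1] + coef(stageTwo)[2] * x)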

Calculate the Survival prediction using Cox Proportional Hazard model in R

I'm trying to calculate the Survival prediction using Cox Proportional Hazard model in R.
library(survival)
data(lung)
model<-coxph(Surv(time,status ==2)~age + sex + ph.karno + wt.loss, data=lung)
predict(model, data=lung, type ="expected")
When I use the above code, I get the predicted cumulative hazard, corresponding to the formula
\hat{H}_i(t) = \hat{H}_0(t) \exp(x_i'\hat{\beta})
But what I actually want is the predicted survival, corresponding to the formula
\hat{S}_i(t) = \hat{S}_0(t)^{\exp(x_i'\hat{\beta})}
How do I predict the Survival in R?
Thanks in Advance.
You can use either predict or survfit. With predict and type = "expected", the newdata argument needs values for every variable in the model, including the response (time and status):
predict(model,
        newdata = list(time = 100, status = 1, age = 60, sex = 1, ph.karno = 60, wt.loss = 15),
        type = "expected")
[1] 0.2007497
There's also a plot method for survfit objects:
?plot.survfit
png(); plot(survfit(model)); dev.off()
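To get predicted survival probabilities rather than the cumulative hazard, you can either use the identity S_i(t) = exp(-H_i(t)) on the "expected" predictions, or ask survfit for the whole predicted curve; a sketch:
# survival probability via S(t) = exp(-H(t))
surv_prob <- exp(-predict(model, type = "expected"))
# or the full predicted survival curve for a chosen covariate pattern
newpat <- data.frame(age = 60, sex = 1, ph.karno = 60, wt.loss = 15)
sf <- survfit(model, newdata = newpat)
summary(sf, times = c(100, 365))   # survival estimates at selected times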
