Survival Analysis for Telecom Churn using R

I am working on a telecom churn problem; here is my dataset.
http://www.sgi.com/tech/mlc/db/churn.data
Names - http://www.sgi.com/tech/mlc/db/churn.names
I'm new to survival analysis. Given the training data, my idea is to build a survival model that estimates survival time, along with predicting churn/non-churn on the test data based on the independent factors. Could anyone help me with code or pointers on how to approach this problem?
To be precise, say my training data has customer call usage details, plan details, tenure of the account, etc., and whether or not the customer churned.
Using general classification models, I can predict churn or not on the test data. Now, using survival analysis, I want to predict the tenure of survival in the test data.
Thanks,
Maddy

If you're still interested (or for the benefit of those coming later), I've written a few guides specifically for conducting survival analysis on customer churn data using R. They cover a bunch of different analytical techniques, all with sample data and R code.
Basic survival analysis: http://daynebatten.com/2015/02/customer-churn-survival-analysis/
Basic cox regression: http://daynebatten.com/2015/02/customer-churn-cox-regression/
Time-dependent covariates in cox regression: http://daynebatten.com/2015/12/survival-analysis-customer-churn-time-varying-covariates/
Time-dependent coefficients in cox regression: http://daynebatten.com/2016/01/customer-churn-time-dependent-coefficients/
Restricted mean survival time (quantify the impact of churn in dollar terms): http://daynebatten.com/2015/03/customer-churn-restricted-mean-survival-time/
Pseudo-observations (quantify dollar gain/loss associated with the churn effects of variables): http://daynebatten.com/2015/03/customer-churn-pseudo-observations/
Please forgive the goofy images.

Here is some code to get you started:
First, read the data
nm <- read.csv("http://www.sgi.com/tech/mlc/db/churn.names",
               skip = 4, colClasses = c("character", "NULL"), header = FALSE, sep = ":")[[1]]
dat <- read.csv("http://www.sgi.com/tech/mlc/db/churn.data", header=FALSE, col.names=c(nm, "Churn"))
Use Surv() to set up a survival object for modeling
library(survival)
# Churn is read as character in R >= 4.0; converting via factor gives 1 = censored, 2 = event
s <- with(dat, Surv(account.length, as.numeric(as.factor(Churn))))
Fit a cox proportional hazards model and plot the result
model <- coxph(s ~ total.day.charge + number.customer.service.calls, data=dat[, -4])
summary(model)
plot(survfit(model))
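If, as in the question, the goal is to estimate survival tenure on test data, survfit() also accepts a newdata argument. A minimal sketch, assuming a hold-out data frame test with the same columns as dat:
# Hedged sketch: predicted survival curves for new customers
# `test` is an assumed hold-out set with the same columns as `dat`
sf <- survfit(model, newdata = test)
plot(sf)                       # one survival curve per test customer
summary(sf)$table[, "median"]  # estimated median tenure per customer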
Add a stratum:
model <- coxph(s ~ total.day.charge + strata(number.customer.service.calls <= 3), data=dat[, -4])
summary(model)
plot(survfit(model), col=c("blue", "red"))

Related

Longitudinal analysis using sampling weights in R

I have longitudinal data from two surveys and I want to do a pre-post analysis. Normally, I would use survey::svyglm() or svyVGAM::svy_vglm (for multinomial family) to include sampling weights, but these functions don't account for the random effects. On the other hand, lme4::lmer accounts for the repeated measures, but not the sampling weights.
For continuous outcomes, I understand that I can do
w_data_wide <- svydesign(ids = ~1, data = data_wide, weights = data_wide$weight)
svyglm((post-pre) ~ group, w_data_wide)
and get the same estimates that I would get if I could use lmer(outcome ~ group*time + (1|id), data_long) with weights [please correct me if I'm wrong].
However, for categorical variables, I don't know how to do the analyses. WeMix::mix() has a weights parameter, but I'm not sure whether it treats them as sampling weights. In any case, that function doesn't support the multinomial family.
So, to sum up: can you enlighten me on how to do a pre-post analysis of categorical outcomes with 2 or more levels? Any tips about packages/functions in R, and how to use or write them, would be appreciated.
I give below some data sets with binomial and multinomial outcomes:
library(data.table)
set.seed(1)
data_long <- data.table(
  id = rep(1:5, 2),
  time = c(rep("Pre", 5), rep("Post", 5)),
  outcome1 = sample(c("Yes", "No"), 10, replace = TRUE),
  outcome2 = sample(c("Low", "Medium", "High"), 10, replace = TRUE),
  outcome3 = rnorm(10),
  group = rep(sample(c("Man", "Woman"), 5, replace = TRUE), 2),
  weight = rep(c(1, 0.5, 1.5, 0.75, 1.25), 2)
)
data_wide <- dcast(data_long, id~time, value.var = c('outcome1','outcome2','outcome3','group','weight'))[, `:=` (weight_Post = NULL, group_Post = NULL)]
EDIT
As I said below in the comments, I've been using lmer and glmer with the variables used to calculate the weights as predictors. It turns out that glmer reports a lot of problems (convergence failures, high eigenvalues...), so I had another look at @ThomasLumley's answer in this post and at others (https://stat.ethz.ch/pipermail/r-help/2012-June/315529.html | https://stats.stackexchange.com/questions/89204/fitting-multilevel-models-to-complex-survey-data-in-r).
So, my question now is whether I can use participant ids as clusters in svydesign
library(survey)
w_data_long_cluster <- svydesign(ids = ~id, data = data_long, weights = data_long$weight)
summary(svyglm(factor(outcome1) ~ group*time, w_data_long_cluster, family="quasibinomial"))
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)         1.875e+01  1.000e+00  18.746   0.0339 *
groupWoman         -1.903e+01  1.536e+00 -12.394   0.0513 .
timePre             5.443e-09  5.443e-09   1.000   0.5000
groupWoman:timePre  2.877e-01  1.143e+00   0.252   0.8431
and still interpret groupWoman:timePre as the difference in the average rate of change/improvement in the outcome over time between sex groups, as if I were using a mixed model with participants as random effects.
Thank you once again!
A linear model with svyglm does not give the same parameter estimates as lme4::lmer. It does estimate the same parameters as lme4::lmer if the model is correctly specified, though.
Generalised linear models with svyglm or svy_vglm don't estimate the same parameters as lme4::glmer, as you note. However, they do estimate perfectly good regression parameters, and if you aren't specifically interested in the variance components or in estimating the realised random effects (BLUPs), I would recommend just using svy_vglm.
Another option, if you have non-survey software for random-effects versions of these models, is to use that. If you scale the weights to sum to the sample size, and if all the clustering in the design is modelled by random effects in the model, you will get at least a reasonable approximation to valid inference. That's what I've seen recommended for Bayesian survey modelling, for example.
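To make the scaling idea concrete, here is a minimal sketch using the question's data_long; the binary recoding of outcome1 and the constant level-2 weight are illustrative assumptions, and my understanding is that WeMix::mix() treats its weights as sampling weights at each level:
library(WeMix)
# Scale the level-1 weights to sum to the sample size
data_long$w1 <- data_long$weight * nrow(data_long) / sum(data_long$weight)
data_long$w2 <- 1  # assumed constant weight for the id level
data_long$y1 <- as.integer(data_long$outcome1 == "Yes")  # illustrative binary recoding
m <- mix(y1 ~ group * time + (1 | id), data = data_long,
         weights = c("w1", "w2"), family = binomial(link = "logit"))
summary(m)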

ARIMAX exogenous variables reverse causality

I am trying to fit an ARIMAX model to figure out whether containment measures (using the Government Response Stringency Index, on a 0-100 scale) have a significant effect on the daily new-case rate. I also want to add testing rates.
I programmed everything in R (every ts is stationary, ...) and ran the Granger causality test. Result: Pr(>F) is less than 0.05, so the null hypothesis of no Granger causality can be rejected: the new-case rate and the containment measures exhibit reverse causality.
Is there any way to transform the variable "stringency index" and continue with an ARIMAX model? If so, how can this be done in R?
In R, you have the "forecast" package to build ARIMA models. Recall that there is a difference between true ARIMAX models and linear regressions with ARIMA errors. Check this post by Rob Hyndman (the forecast package's author) for more detail:
The ARIMAX model muddle
Here are Rob Hyndman's examples of fitting a linear regression with ARIMA errors (more information here):
library(forecast)
library(fpp2)  # for the uschange data set

# Fit a linear regression with AR errors
fit <- Arima(uschange[, "Consumption"], xreg = uschange[, "Income"], order = c(1, 0, 0))

# Forecast and plot predictions
fcast <- forecast(fit, xreg = rep(mean(uschange[, 2]), 8))
autoplot(fcast) + xlab("Year") + ylab("Percentage change")

# Use auto.arima to find the optimal parameters
fit <- auto.arima(uschange[, "Consumption"], xreg = uschange[, "Income"])

# Plot predictions
fcast <- forecast(fit, xreg = rep(mean(uschange[, 2]), 8))
autoplot(fcast) + xlab("Year") + ylab("Percentage change")
Regarding your question about how to address the reverse causality, it is clear that you have endogeneity bias: the response stringency index affects the daily new-case rate and vice versa. If this is a prediction problem rather than an estimation one, I wouldn't worry too much about it as long as the predictions are good. For an estimation/causation question, I would try to find different exogenous variables or use instrumental/control variables.
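If you do want to transform the stringency index, one common device is to lag it so that only past policy values enter the model, which removes the contemporaneous feedback from cases to policy. A hedged sketch, where cases and stringency are assumed numeric daily series and the 7-day lag is only illustrative:
library(forecast)
k <- 7  # illustrative lag in days (an assumption; tune to your data)
# Regress today's cases on the stringency index k days earlier
fit_lag <- auto.arima(cases[-(1:k)], xreg = head(stringency, -k))
summary(fit_lag)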

Questions about the predict function in the randomForestSRC package

In common with other machine-learning methods, I divided my original data set into 70% training and 30% test sets.
Here is my code.
install.packages("randomForestSRC")
library(randomForestSRC)
data(pbc, package = "randomForestSRC")
data <- na.omit(pbc)
train <- sample(1:nrow(data), round(nrow(data) * 0.70))
data.grow <- rfsrc(Surv(days, status) ~ .,
                   data[train, ],
                   ntree = 100,
                   tree.err = TRUE,
                   importance = TRUE,
                   nsplit = 1,
                   proximity = TRUE)
data.pred <- predict(data.grow,
                     data[-train, ],
                     importance = TRUE,
                     tree.err = TRUE)
My question is about the predict function in this code.
Originally, I wanted to build a prediction model based on a random survival forest to predict disease development.
For example, after building the prediction model with the training data set, I wanted to estimate the probability of disease development for test data that has no information about disease incidence for each individual, because I would like to know the probability of disease development based on a subject's general characteristics, such as age, BMI, sex, and so on.
However, contrary to my intention to build a prediction model as described above, the predict function in this package didn't work on data that has no status information (event/censored).
The predict function seems to require outcome information (event/censored).
Therefore, I cannot understand what the predict function means here.
If predict works only with outcome information, then how can I make a prediction of disease development based on a subject's general characteristics in the future?
In addition, if the prediction in this model requires the outcome information, what does "predict" mean in the random survival forest model?
Please let me know what the predict function in this package means.
Thank you for reading my long question.
The predict for this type of model, i.e. predict.rfsrc, works much like you'd expect it to if you've used predict with glm, lm, RRF or other models.
The predict statement does not require you to know the outcome for the prediction data set. I am trying to understand why you thought that it did.
Your example rfsrc statement does not work because it refers to columns that are not in the example data set.
I think the best plan is that I will show you using a reproducible example, below. If you have further questions you can ask me in a comment.
# Train an RFSRC model
library(randomForestSRC)
mtcars.mreg <- rfsrc(Surv(mpg, cyl) ~ ., data = mtcars[1:30, ],
                     tree.err = TRUE, importance = TRUE)
# Simulate new data
new_data <- mtcars[31:32, ]
# Predict
predicted <- predict(mtcars.mreg, new_data)
predicted
Sample size of test (predict) data: 2
Number of grow trees: 1000
Average no. of grow terminal nodes: 4.898
Total no. of grow variables: 9
Analysis: RSF
Family: surv-CR
Test set error rate: NA
predicted$predicted
event.1 event.2 event.3
[1,] 0.4781338 2.399299 14.71493
[2,] 3.2185606 4.720809 2.15895
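To confirm that the outcome is not needed, you can drop the outcome columns from the new data before predicting; only the prediction (not the test-set error rate) is then returned:
# The outcome columns are not required in the test data
new_data_no_outcome <- new_data[, !(names(new_data) %in% c("mpg", "cyl"))]
predict(mtcars.mreg, new_data_no_outcome)$predicted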

Running diagnostics on a multivariate multiple regression in R

I have a data set that gives the rates of incidence of some phenomena in all the zip codes of a state, and some demographic data. The rates are given for each year in the data set (year 1 - year 6). A snippet of the data is available here.
I've run a multivariate linear regression to examine the impact of the demographic variables on the rates, per Fox & Weisberg (2011), weighted by the average zip code population across all years (var = POPmean):
Y <- cbind(data$rateY1, data$rateY2, data$rateY3, data$rateY4, data$rateY5, data$rateY6)
model <- lm(Y ~ someVAR1+someVAR2+someVAR3+someVAR4+someVAR5, data=data, weights= POPmean)
summary(model)
coef(model)
summary(manova(model))
I'd like to plot the regression diagnostics for this model for each year, but have no idea how to do so. I'd like to use influencePlot() from the car package, but when I try to do so:
influencePlot(model, id.method="noteworthy", main="Robustness Check")
I receive an error stating that the lengths of x and y differ (which, of course, they do). Can anyone help me figure out how to plot the regression diagnostics for the model given above, or suggest an alternative method?
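Not a full answer, but one workaround sketch: influencePlot() handles single-response lm objects, so the model can be refit one response (year) at a time, using the variable names from the question:
library(car)
predictors <- c("someVAR1", "someVAR2", "someVAR3", "someVAR4", "someVAR5")
for (y in paste0("rateY", 1:6)) {
  fit_y <- lm(reformulate(predictors, response = y),
              data = data, weights = POPmean)
  influencePlot(fit_y, main = paste("Robustness Check:", y))
}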

R codes for Poisson fixed and random effect for claims data

I have claim-frequency panel data for five years, consisting of the age, cc, and make of the car, the age and gender of the insured person, and the geographical area. I have run a Poisson regression in R to model the claim frequency. Now I want to test for Poisson fixed and random effects in R. I have done some research and found related code: fm <- glmer(Y ~ A + B + (1|Subject), family = poisson, data = pData) from the lme4 package. How do I apply it in my context? Thanks in advance.
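Mapping that template onto the claims setting might look like the sketch below; every variable name (claims_panel, claims, policy_id, and so on) is an illustrative assumption, with the insured person or policy as the grouping factor for the random intercept:
library(lme4)
# Random-intercept Poisson model for claim counts (names are assumptions)
fm_random <- glmer(claims ~ age + cc + make + gender + area + (1 | policy_id),
                   family = poisson, data = claims_panel)
summary(fm_random)
# A fixed-effects analogue uses policy dummies instead of a random intercept
fm_fixed <- glm(claims ~ age + cc + make + gender + area + factor(policy_id),
                family = poisson, data = claims_panel)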
