Running diagnostics on a multivariate multiple regression in r - r

I have a data set that gives the rates of incidence of some phenomena in all the zip codes of a state, and some demographic data. The rates are given for each year in the data set (year 1 - year 6). A snippet of the data is available here.
I've run a multivariate linear regression to examine the impact of the demographic variables on the rates, per Fox & Weisberg (2011), weighted by the average zip code population across all years (var = POPmean):
Y <- cbind(data$rateY1, data$rateY2, data$rateY3, data$rateY4, data$rateY5, data$rateY6)
model <- lm(Y ~ someVAR1+someVAR2+someVAR3+someVAR4+someVAR5, data=data, weights= POPmean)
summary(model)
coef(model)
summary(manova(model))
I'd like to plot the regression diagnostics for this model for each year, but have no idea how to do so. I'd like to use influencePlot() from the car package, but when I try to do so:
influencePlot(model, id.method="noteworthy", main="Robustness Check")
I receive an error stating that the lengths of x,y differ (which, of course, they do). Can anyone help figure out how to plot the regression diagnostics for the model given above? Or suggest an alternative method?
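One possible workaround, sketched here on simulated data rather than the real zip-code data: the coefficients of a multivariate lm are the same as those from fitting each response column separately, so the usual single-response diagnostics can be computed year by year. The data frame dat, its values, and the use of only two predictors and two years are assumptions made for illustration.

```r
# Simulated stand-in for the zip-code data (all values are hypothetical).
set.seed(1)
n <- 60
dat <- data.frame(
  someVAR1 = rnorm(n),
  someVAR2 = rnorm(n),
  POPmean  = runif(n, 500, 5000)
)
dat$rateY1 <- 1 + 0.5 * dat$someVAR1 + rnorm(n)
dat$rateY2 <- 2 - 0.3 * dat$someVAR2 + rnorm(n)

responses <- c("rateY1", "rateY2")

# Fit one weighted lm per response column (i.e. per year).
fits <- lapply(responses, function(resp) {
  f <- reformulate(c("someVAR1", "someVAR2"), response = resp)
  lm(f, data = dat, weights = POPmean)
})
names(fits) <- responses

# Standard influence measures for each single-response fit.
diagnostics <- lapply(fits, function(fit) {
  data.frame(
    hat      = hatvalues(fit),
    rstudent = rstudent(fit),
    cook     = cooks.distance(fit)
  )
})

# Flag observations using the common 4/n rule of thumb for Cook's distance.
flagged <- lapply(diagnostics, function(d) which(d$cook > 4 / nrow(d)))

# The multivariate fit gives the same coefficients, column by column.
mlm <- lm(cbind(rateY1, rateY2) ~ someVAR1 + someVAR2,
          data = dat, weights = POPmean)
```

With the car package installed, calling influencePlot(fits$rateY1) on one of these single-response fits should avoid the length mismatch, since each fit has a single vector of residuals.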

Related

How to find the best fitting regression equations

I have a very large dataset consisting of car insurance policyholders (C) and those who died in a car accident (D). The dataset includes different rate types (what type of insurance was in place). I want to do a logistic regression as a function of age. Is there a way to find an optimal regression equation?
For example, I currently have something like this in R:
glm(cbind(D, C - D) ~ d_regr + 1, data = data, family = binomial)
where d_regr is something like age, (age^2), (age^3)/3 and so on.
Is there a good way to find an optimal function that depends only on the variable age, for example by maximizing the pseudo R^2?
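One common approach, given here only as a rough sketch on simulated data, is to fit a sequence of polynomial terms in age and compare the fits with an information criterion such as AIC instead of a pseudo R^2. The simulated counts, the range of degrees, and the choice of AIC are all assumptions, not part of the original question.

```r
# Simulated policyholder counts (C) and deaths (D) by age; values are hypothetical.
set.seed(42)
age <- 18:80
C <- rep(5000, length(age))
p <- plogis(-5 + 0.002 * (age - 45)^2)  # true log-odds are quadratic in age
D <- rbinom(length(age), size = C, prob = p)
dat <- data.frame(age, C, D)

# Fit binomial GLMs with orthogonal polynomials of increasing degree.
degrees <- 1:4
fits <- lapply(degrees, function(k) {
  glm(cbind(D, C - D) ~ poly(age, k), family = binomial, data = dat)
})

# Compare the candidate models by AIC (lower is better).
aic <- sapply(fits, AIC)
best_degree <- degrees[which.min(aic)]
comparison <- data.frame(degree = degrees,
                         deviance = sapply(fits, deviance),
                         AIC = aic)
print(comparison)
```

AIC penalizes extra terms, so it does not simply reward the highest degree the way a raw goodness-of-fit measure would. Splines are another option if a polynomial turns out to be too rigid.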

Gamma distribution in a GLMM

I am trying to create a GLMM in R. I want to find out how the emergence time of bats depends on different factors. Here I take the time difference between the departure of the respective bat and the sunset of the day as dependent variable (metric). As fixed factors I would like to include different weather data (metric) as well as the reproductive state (categorical) of the bats. Additionally, there is the transponder number (individual identification code) as a random factor to exclude inter-individual differences between the bats.
I first worked in R with a linear mixed model (package lme4), but the QQ plot of the residuals deviates very strongly from the normal distribution. Also a histogram of the data rather indicates a gamma distribution. As a result, I implemented a GLMM with a gamma distribution. Here is an example with one weather parameter:
model <- glmer(formula = difference_in_min ~ repro + precipitation + (1 + repro | `transponder number`), data = trip, control = ctrl, family = Gamma(link = "log"))
However, since there was no change in the QQ plot this way, I looked at the residual diagnostics of the DHARMa package. But the distribution assumption still doesn't seem to be correct, because the data in the QQ plot deviates very much here, too.
Residual diagnostics from DHARMa
But if the data also do not correspond to a gamma distribution, what alternative is there? Or maybe the problem lies somewhere else entirely.
Does anyone have an idea where the error might lie?
But if the data also do not correspond to a gamma distribution, what alternative is there?
One alternative is called the lognormal distribution (https://en.wikipedia.org/wiki/Log-normal_distribution)
A Gaussian model assumes the response is symmetrically (normally) distributed around its conditional mean, which does not appear to hold for your data. The lognormal distribution instead assumes that the logarithm of the response is normally distributed, which suits positive, right-skewed data. Note that the log transform requires every value of difference_in_min to be strictly positive. Following your previous code, you would fit it like this:
model <- glmer(formula = log(difference_in_min) ~ repro + precipitation + (1 + repro | `transponder number`), data = trip, control = ctrl, family = gaussian(link = "identity"))
Or, instead of glmer, you can call lmer directly, in which case you don't need to specify the distribution (glmer may tell you to do this in a warning message anyway):
model <- lmer(formula = log(difference_in_min) ~ repro + precipitation + (1 + repro | `transponder number`), data = trip, control = ctrl)
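To judge whether the lognormal really fits better than the gamma, the two can be compared by AIC, as long as the AIC of the log-scale model is put back on the original scale by adding 2 * sum(log(y)). The sketch below does this on simulated data and, as a simplification, leaves out the random effects entirely; the variable names and values are assumptions.

```r
# Simulated positive, right-skewed response; values are hypothetical.
set.seed(3)
n <- 500
precipitation <- runif(n, 0, 10)
y <- rlnorm(n, meanlog = 3 + 0.05 * precipitation, sdlog = 0.6)

# Both log() and the Gamma family need strictly positive values.
stopifnot(all(y > 0))

# Gamma GLM with log link versus a normal model for log(y).
fit_gamma <- glm(y ~ precipitation, family = Gamma(link = "log"))
fit_lnorm <- lm(log(y) ~ precipitation)

# AIC(fit_lnorm) is on the log scale; the Jacobian term 2 * sum(log(y))
# moves it to the scale of y so the two values are comparable.
aic_gamma <- AIC(fit_gamma)
aic_lnorm <- AIC(fit_lnorm) + 2 * sum(log(y))

print(c(gamma = aic_gamma, lognormal = aic_lnorm))
```

If some time differences are zero or negative (for example, bats leaving before sunset), neither model applies directly and the response would need to be shifted or modeled differently.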

How to get individual coefficients and residuals in panel data using fixed effects

I have panel data that includes income for individuals over several years, and I am interested in the income trends of individuals, i.e. individual coefficients for income over the years, and in the residuals for each individual in each year (the unexpected changes in income according to my model). However, many observations are missing income data for one or more years, so with a linear regression I lose the majority of my observations. The data structure is like this:
caseid<-c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4)
years<-c(1998,2000,2002,2004,2006,2008,1998,2000,2002,2004,2006,2008,
1998,2000,2002,2004,2006,2008,1998,2000,2002,2004,2006,2008)
income<-c(1100,NA,NA,NA,NA,1300,1500,1900,2000,NA,2200,NA,
NA,NA,NA,NA,NA,NA, 2300,2500,2000,1800,NA, 1900)
df<-data.frame(caseid, years, income)
I decided to use a random effects model, which I think will still predict income for the missing years through a maximum likelihood approach. However, since the Hausman test gives a significant result, I decided to use a fixed effects model instead. I ran the code below, using the plm package:
inc.fe<-plm(income~years, data=df, model="within", effect="individual")
However, I get coefficients only for years and not for individuals; and I cannot get residuals.
To maybe give an idea, the code in Stata should be
xtset caseid
xtreg income year, fe
predict resid, e
Then I tried to run the pvcm function from the same library, which is a function for variable coefficients.
inc.wi<-pvcm(Income~Year, data=ldf, model="within", effect="individual")
However, I get the following error message:
"Error in FUN(X[[i]], ...) : insufficient number of observations".
How can I get individual coefficients and residuals with pvcm by resolving this error or by using some other function?
My original long form data has 202976 observations and 15 years.
Does the fixef function from package plm give you what you are looking for?
Continuing your example:
fixef(inc.fe)
Residuals are extracted by:
residuals(inc.fe)
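As a cross-check that needs only base R, the same within estimator can be obtained as a least-squares dummy variable (LSDV) regression with one intercept per individual. This is a sketch of an equivalent formulation, not of what plm does internally; it uses the example data from the question, and na.action = na.exclude is chosen so that the residuals line up with the original rows.

```r
# Example data from the question.
caseid <- rep(1:4, each = 6)
years  <- rep(seq(1998, 2008, by = 2), times = 4)
income <- c(1100, NA, NA, NA, NA, 1300,
            1500, 1900, 2000, NA, 2200, NA,
            NA, NA, NA, NA, NA, NA,
            2300, 2500, 2000, 1800, NA, 1900)
df <- data.frame(caseid, years, income)

# LSDV regression: one intercept per individual plus a common slope for years.
# Individuals with no observed income (caseid 3) drop out of the fit.
lsdv <- lm(income ~ 0 + factor(caseid) + years, data = df,
           na.action = na.exclude)

# Individual intercepts and the common slope.
print(coef(lsdv))

# Residuals padded with NA where income is missing, so they match df row by row.
df$resid <- residuals(lsdv)
print(df)
```

The slope for years here is the fixed effects (within) estimate, and the factor(caseid) coefficients are the individual intercepts, so they should agree with what fixef() reports.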
You have a random effects model with random slopes and intercepts. This is also known as a random coefficients regression model. The missingness is the tricky part, which (I'm guessing) you'll have to write custom code to solve after you choose how you wish to do so.
But you haven't clearly/properly specified your model (at least in your question) as far as I can tell. Let's define some terms:
Let Y_it = income for ind i (i= 1,..., N) in year t (t= 1,...,T). As I read your question, you have not specified which of the two models below you wish to have:
M1: random intercepts, global slope, random slopes
Y_it ~ N(\mu_i + B T + \gamma_i I T, \sigma^2)
\mu_i ~ N(\phi_0, \tau_0^2)
\gamma_i ~ N(\phi_1, \tau_1^2)
M2: random intercepts, random slopes
Y_it ~ N(\mu_i + \gamma_i I T, \sigma^2)
\mu_i ~ N(\phi_0, \tau_0^2)
\gamma_i ~ N(\phi_1, \tau_1^2)
Also, your example data is nonsensical (see below). As you can see, you don't have enough observations to estimate all parameters. I'm not familiar with library(plm) but the above models (without missingness) can be estimated in lme4 easily. Without a realistic example dataset, I won't bother providing code.
R> table(df$caseid, is.na(df$income))

    FALSE TRUE
  1     2    4
  2     4    2
  3     0    6
  4     5    1
Given that you do have missingness, you should be able to produce estimates for either hierarchical model via the typical methods, such as EM. But I do think you'll have to write the code to do the estimation yourself.

life expectancy survival package R

I would like to calculate the life-years lost due to a disease in a way that I correct for other variables in the model (corrected group prognosis method). My dataset is a cohort of individuals for which I have follow-up time till death/censored and a variable whether they died, together with covariates as age, sex and prevalence of disease. I searched the web and I got the impression this should be possible with the survival package in R.
I used the following code which returns probabilities:
fit1 <- coxph(Surv(fup_death, death) ~ age + sex + prev_disease, data)
direct <- survexp( ~prev_disease, data=data, ratetable=fit1)
I also tried the survfit function, but then my computer crashes:
t<-survfit(fit1, newdata = data)
How can I derive the life-expectancy in the ones with the disease and without the disease? Or should I do it differently?
Thank you in advance!
Best,
Symen
The calculation for years of life lost is the difference in mean survival. You can get survfit objects for two separate but comparable conditions like this:
fit1 <- coxph(Surv(fup_death, death) ~ age + sex + prev_disease, data)
survfit_WithDisease <- survfit(fit1,
                               newdata = data.frame(age = 50,
                                                    sex = 'm',
                                                    prev_disease = TRUE))
survfit_NoDisease <- survfit(fit1,
                             newdata = data.frame(age = 50,
                                                  sex = 'm',
                                                  prev_disease = FALSE))
and by setting print.rmean=TRUE you can get estimates of mean survival for each condition.
print(survfit_WithDisease,print.rmean=TRUE)
print(survfit_NoDisease,print.rmean=TRUE)
Note that mean isn't defined for every survival curve. There are several options for calculating mean survival when the survival curve does not go all the way to zero, which you should read about in ?print.survfit.
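To make the "difference in mean survival" concrete, here is a minimal sketch on simulated data that computes the restricted mean survival time (the area under each predicted curve up to a cutoff tau) by hand and takes the difference. The simulated cohort, the cutoff tau = 20, and the helper rmst() are assumptions introduced for illustration; they are not part of the survival package.

```r
library(survival)

# Simulated cohort; all values are hypothetical.
set.seed(7)
n <- 400
dat <- data.frame(
  age = round(runif(n, 40, 80)),
  sex = factor(sample(c("m", "f"), n, replace = TRUE)),
  prev_disease = rbinom(n, 1, 0.3) == 1
)
rate <- 0.05 * exp(0.03 * (dat$age - 60) + 0.7 * dat$prev_disease)
t_event <- rexp(n, rate)
t_cens  <- runif(n, 5, 30)
dat$fup_death <- pmin(t_event, t_cens)
dat$death     <- as.integer(t_event <= t_cens)

fit <- coxph(Surv(fup_death, death) ~ age + sex + prev_disease, data = dat)

# Two comparable reference individuals: with and without the disease.
nd <- data.frame(age = 50,
                 sex = factor("m", levels = levels(dat$sex)),
                 prev_disease = c(TRUE, FALSE))
sf <- survfit(fit, newdata = nd)  # sf$surv has one column per row of nd

# Hypothetical helper: area under a step survival curve from 0 to tau.
rmst <- function(time, surv, tau) {
  keep <- time <= tau
  t <- c(0, time[keep], tau)
  s <- c(1, surv[keep])
  sum(diff(t) * s)
}

tau <- 20
rmst_with    <- rmst(sf$time, sf$surv[, 1], tau)
rmst_without <- rmst(sf$time, sf$surv[, 2], tau)
years_lost   <- rmst_without - rmst_with
print(c(with = rmst_with, without = rmst_without, lost = years_lost))
```

The result is life-years lost within the first tau years for one reference covariate pattern (age 50, male). Averaging over the covariate distribution of the cohort would need one curve per individual, which is what makes survfit(fit1, newdata = data) expensive on a large dataset.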

Survival Analysis for Telecom Churn using R

I am working on a telecom churn problem, and here is my dataset:
http://www.sgi.com/tech/mlc/db/churn.data
Names - http://www.sgi.com/tech/mlc/db/churn.names
I'm new to survival analysis. Given the training data, my idea is to build a survival model that estimates survival time and also predicts churn/non-churn on the test data, based on the independent factors. Could anyone help me with code or pointers on how to approach this problem?
To be precise, say my training data contains
customer call usage details, plan details, account tenure, etc., and whether or not the customer churned.
Using general classification models, I can predict churn or no churn on the test data. Now, using survival analysis, I want to predict the survival tenure in the test data.
Thanks,
Maddy
If you're still interested (or for the benefit of those coming later), I've written a few guides specifically for conducting survival analysis on customer churn data using R. They cover a bunch of different analytical techniques, all with sample data and R code.
Basic survival analysis: http://daynebatten.com/2015/02/customer-churn-survival-analysis/
Basic cox regression: http://daynebatten.com/2015/02/customer-churn-cox-regression/
Time-dependent covariates in cox regression: http://daynebatten.com/2015/12/survival-analysis-customer-churn-time-varying-covariates/
Time-dependent coefficients in cox regression: http://daynebatten.com/2016/01/customer-churn-time-dependent-coefficients/
Restricted mean survival time (quantify the impact of churn in dollar terms): http://daynebatten.com/2015/03/customer-churn-restricted-mean-survival-time/
Pseudo-observations (quantify dollar gain/loss associated with the churn effects of variables): http://daynebatten.com/2015/03/customer-churn-pseudo-observations/
Please forgive the goofy images.
Here is some code to get you started:
First, read the data
nm <- read.csv("http://www.sgi.com/tech/mlc/db/churn.names",
skip=4, colClasses=c("character", "NULL"), header=FALSE, sep=":")[[1]]
dat <- read.csv("http://www.sgi.com/tech/mlc/db/churn.data", header=FALSE, col.names=c(nm, "Churn"))
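Use Surv() to set up a survival object for modeling:
library(survival)
# Churn may be read as text rather than a factor, so build an explicit 0/1 event
# indicator (this assumes churners are labeled with a value containing "True")
s <- with(dat, Surv(account.length, as.numeric(grepl("True", Churn))))
Fit a Cox proportional hazards model and plot the result: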
model <- coxph(s ~ total.day.charge + number.customer.service.calls, data=dat[, -4])
summary(model)
plot(survfit(model))
Add a stratum:
model <- coxph(s ~ total.day.charge + strata(number.customer.service.calls <= 3), data=dat[, -4])
summary(model)
plot(survfit(model), col=c("blue", "red"))
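To get from a fitted model to a predicted tenure for test customers, one option is to predict a survival curve per customer and read off the median (the first time the curve drops to 0.5 or below). The sketch below uses simulated data with the same column names as above, since the original URLs may no longer resolve; the train/test split and all values are assumptions.

```r
library(survival)

# Simulated churn data; all values are hypothetical.
set.seed(11)
n <- 600
dat <- data.frame(
  total.day.charge = runif(n, 10, 60),
  number.customer.service.calls = rpois(n, 1.5)
)
rate <- 0.005 * exp(0.02 * dat$total.day.charge +
                    0.3 * dat$number.customer.service.calls)
t_churn <- rexp(n, rate)
t_obs   <- runif(n, 20, 240)
dat$account.length <- pmin(t_churn, t_obs)
dat$churn          <- as.integer(t_churn <= t_obs)

train <- dat[1:450, ]
test  <- dat[451:600, ]

model <- coxph(Surv(account.length, churn) ~ total.day.charge +
                 number.customer.service.calls, data = train)

# Relative risk score (linear predictor) for each test customer.
risk <- predict(model, newdata = test, type = "lp")

# One predicted survival curve per test customer (columns of sf$surv).
sf <- survfit(model, newdata = test)

# Median tenure: first time the curve reaches 0.5 or below; NA if it never does.
median_tenure <- apply(sf$surv, 2, function(s) {
  idx <- which(s <= 0.5)
  if (length(idx) > 0) sf$time[idx[1]] else NA_real_
})

print(head(data.frame(risk = risk, median_tenure = median_tenure)))
```

Customers whose curve never falls to 0.5 within the observed follow-up get NA, which is a limit of the data rather than of the code: the model cannot say when they churn beyond the longest observed tenure.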

Resources