Life expectancy with the survival package in R

I would like to calculate the life-years lost due to a disease while adjusting for the other variables in the model (the corrected group prognosis method). My dataset is a cohort of individuals for whom I have follow-up time until death or censoring and an indicator of whether they died, together with covariates such as age, sex and prevalence of disease. I searched the web and got the impression that this should be possible with the survival package in R.
I used the following code, which returns probabilities:
fit1 <- coxph(Surv(fup_death, death) ~ age + sex + prev_disease, data)
direct <- survexp( ~prev_disease, data=data, ratetable=fit1)
I also tried the survfit function, but then my computer crashes:
t<-survfit(fit1, newdata = data)
How can I derive the life-expectancy in the ones with the disease and without the disease? Or should I do it differently?
Thank you in advance!
Best,
Symen

The calculation for years of life lost is the difference in mean survival. You can get survfit objects for two separate but comparable conditions like this:
fit1 <- coxph(Surv(fup_death, death) ~ age + sex + prev_disease, data)
survfit_WithDisease <- survfit(fit1,
                               newdata = data.frame(age = 50,
                                                    sex = 'm',
                                                    prev_disease = TRUE))
survfit_NoDisease <- survfit(fit1,
                             newdata = data.frame(age = 50,
                                                  sex = 'm',
                                                  prev_disease = FALSE))
and by setting print.rmean=TRUE you can get estimates of mean survival for each condition.
print(survfit_WithDisease,print.rmean=TRUE)
print(survfit_NoDisease,print.rmean=TRUE)
Note that the mean isn't defined for every survival curve. When the survival curve does not go all the way to zero, there are several options for calculating a (restricted) mean survival, which you should read about in ?print.survfit.
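As a rough illustration of the difference-in-means idea, here is a sketch on the lung data that ships with the survival package. The variable poor (ECOG score above 1) is only a stand-in for prev_disease, the 730-day horizon is an arbitrary choice, and get_rmean is a hypothetical helper; none of these come from the question.

```r
library(survival)

# Built-in lung data: time is in days, status == 2 marks a death.
d <- na.omit(lung[, c("time", "status", "age", "sex", "ph.ecog")])
d$poor <- as.integer(d$ph.ecog > 1)  # stand-in for prev_disease (assumption)

fit <- coxph(Surv(time, status == 2) ~ age + sex + poor, data = d)

# Two curves for the same covariate profile, differing only in 'poor'
profile    <- data.frame(age = 60, sex = 1)
sf_with    <- survfit(fit, newdata = cbind(profile, poor = 1))
sf_without <- survfit(fit, newdata = cbind(profile, poor = 0))

# Hypothetical helper: restricted mean survival up to a common horizon
get_rmean <- function(sf, horizon) {
  tab <- summary(sf, rmean = horizon)$table
  if (is.matrix(tab)) tab <- tab[1, ]
  unname(tab[grep("rmean", names(tab))[1]])
}

horizon      <- 730  # arbitrary common upper limit, in days
rmst_with    <- get_rmean(sf_with, horizon)
rmst_without <- get_rmean(sf_without, horizon)

# Days of life lost within the horizon for this covariate profile
days_lost <- rmst_without - rmst_with
days_lost
```

Using the same upper limit for both curves keeps the two restricted means comparable. The result applies to one covariate profile only, so other profiles would need their own pair of curves.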

Related

What code would you use in R to learn if there is a significant difference between two predictor variables in the way they affect a response variable?

I am trying to learn if Extraversion affects how people follow COVID-19 guidelines when stratified by male/female gender.
The data is set up so that Extraversion is the predictor variable, guideline compliance is the response variable, and the data is stratified by gender.
I tried to create a separate model for each gender and then put them through an anova (this is what ChatGPT recommended), but R rejected that, reporting "models were not all fitted to the same size of dataset".
I think the issue is that there are far more females than males in the sample. Also, I would like to use another method to find a significant difference between the two models' results, because each model follows the Poisson distribution and ANOVA only likes normal distributions.
dataMale <- data %>%
  dplyr::filter(Q34 == "Male")
dataFemale <- data %>%
  dplyr::filter(Q34 == "Female")
poisson_male <- glm(CDC ~ Extraversion, data=dataMale, family=poisson())
poisson_female <- glm(CDC ~ Extraversion, data=dataFemale, family=poisson())
anova(poisson_male, poisson_female, test="Chisq")

How to estimate a regression with both variables i and t simultaneously

I want to estimate a regression for a variable, LWAGE (log wage), against EXP (years of work experience). The data that I have has participants tracked across 7 years, so each year their number of years of work experience increases by 1.
When I do the regression for
$LWAGE_i = \beta_0 + \beta_1 EXP_i + e_i$
I used
reg1 <- lm(LWAGE~EXP, data=df)
Now I'm trying to do the following regression:
$LWAGE_{it} = \beta_0 + \beta_1 EXP_{it} + e_{it}$
But I'm not sure how to include the time-based component in my regression. I searched around but couldn't find anything relevant.
Are you attempting to include time-fixed effects in your model or an interaction between your variable EXP and time (calling this TIME for this demonstration)?
For time fixed effects using lm() you can just include time as a variable in your model. Time should be a factor.
reg2 <- lm(LWAGE~EXP + TIME, data = df)
As an interaction between EXP and TIME it would be
reg3 <- lm(LWAGE~EXP*TIME, data = df)
Based on your description, it sounds like you might be looking for the interaction, i.e. how does the effect of experience on log wages vary over time?
You can also take a look at the plm package for working with panel data.
https://cran.r-project.org/web/packages/plm/vignettes/plmPackage.html
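To see how the two specifications differ in practice, here is a small sketch on simulated panel data. The data-generating numbers, the ID column and the sample sizes are made up for illustration; only the model formulas mirror the ones above.

```r
set.seed(1)

# Simulated panel: 50 people followed for 7 years (all values are made up)
n_id <- 50
n_t  <- 7
df <- expand.grid(ID = seq_len(n_id), TIME = seq_len(n_t))
start_exp <- sample(0:20, n_id, replace = TRUE)   # experience in year 1
df$EXP   <- start_exp[df$ID] + df$TIME - 1        # grows by 1 each year
df$LWAGE <- 1 + 0.03 * df$EXP + 0.02 * df$TIME + rnorm(nrow(df), sd = 0.2)
df$TIME  <- factor(df$TIME)                       # time must be a factor

# Time fixed effects: one common EXP slope plus a shift for each year
reg2 <- lm(LWAGE ~ EXP + TIME, data = df)

# Interaction: the EXP slope itself is allowed to differ by year
reg3 <- lm(LWAGE ~ EXP * TIME, data = df)

# F-test of whether the year-specific slopes improve on the common slope
anova(reg2, reg3)
```

With 7 years, reg2 has 8 coefficients (intercept, EXP, 6 year dummies) and reg3 adds 6 year-by-EXP terms, so the anova comparison is a joint test of those 6 extra slopes.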

What weights mean in WeightIt package

I want to balance my data using the WeightIt package in R (method = "ebal"). I have used code similar to the one below:
#Balancing covariates between treatment groups (binary)
W1 <- weightit(treat ~ age + educ + married + nodegree + re74, data = lalonde, method = "ebal", estimand = "ATT")
match.data(W1)
The outcome is my data table with an additional column called weights. What do those weights mean and how do I go on from here? (My next step would be to do a logit regression with a balanced dataset)
Thank you so much for helping!
weightit() estimates weights that, when applied to a dataset, yield balance in the treatment groups. To estimate effects in the weighted sample, include the weights in a regression of the outcome on the treatment. This is demonstrated in the WeightIt vignette.
You should not use match.data() with WeightIt. I'm not sure where you found the code to do that. match.data() is for use with MatchIt, which is a different package with its own functions. The fact that match.data() happened to work with WeightIt is unintended behavior and should not be relied on.
To estimate the effect of the treatment on a binary outcome (which I'll denote as Y in the code below and assume is in the lalonde dataset, even though in reality it is not), you would run the following after running the first line in your code above:
fit <- glm(Y ~ treat, data = lalonde, weights = W1$weights, family = binomial)
lmtest::coeftest(fit, vcov. = sandwich::vcovHC)
The coefficient on treat is the log odds ratio of the outcome.
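To illustrate how weights enter the outcome model and how to read the coefficient, here is a base-R sketch on simulated data. It does not use WeightIt: the weights come from a plain logistic propensity model (ATT-style odds weights) as a stand-in for W1$weights, and every variable and number is made up.

```r
set.seed(42)

# Simulated data (all values are made up for illustration)
n     <- 500
age   <- rnorm(n, 40, 10)
treat <- rbinom(n, 1, plogis(-2 + 0.05 * age))
Y     <- rbinom(n, 1, plogis(-1 + 0.5 * treat + 0.02 * age))
dat   <- data.frame(Y, treat, age)

# Stand-in ATT weights from a logistic propensity model (NOT entropy
# balancing): treated units get 1, controls get the odds ps / (1 - ps)
ps    <- fitted(glm(treat ~ age, data = dat, family = binomial))
dat$w <- ifelse(dat$treat == 1, 1, ps / (1 - ps))

# quasibinomial gives the same point estimates as binomial but avoids the
# "non-integer #successes" warning caused by non-integer weights
fit <- glm(Y ~ treat, data = dat, weights = w, family = quasibinomial)

log_or     <- coef(fit)[["treat"]]  # log odds ratio
odds_ratio <- exp(log_or)           # odds ratio
c(log_or = log_or, odds_ratio = odds_ratio)
```

The model-based standard errors from this fit should not be trusted with estimated weights, which is why the answer pairs the fit with a robust covariance matrix.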

Application of a multi-way cluster-robust function in R

Hello (first timer here),
I would like to estimate a "two-way" cluster-robust variance-covariance matrix in R. I am using a particular canned routine from the "multiwayvcov" library. My question relates solely to the set-up of the cluster.vcov function in R. I have panel data of various crime outcomes. My cross-sectional unit is the "precinct" (over 40 precincts) and I observe crime in those precincts over several "months" (i.e., 24 months). I am evaluating an intervention that 'turns on' (dummy coded) for only a few months throughout the year.
I include "precinct" and "month" fixed effects (i.e., a full set of precinct and month dummies enter the model). I have only one independent variable I am assessing. I want to cluster on "both" dimensions but I am unsure how to set it up.
Do I estimate all the fixed effects with lm first? Or, do I simply run a model regressing crime on the independent variable (excluding fixed effects), then use cluster.vcov i.e., ~ precinct + month_year.
This seems like it would provide the wrong standard error though. Right? I hope this was clear. Sorry for any confusion. See my set up below.
library(multiwayvcov)
library(lmtest)  # provides coeftest()
model <- lm(crime ~ as.factor(precinct) + as.factor(month_year) + policy, data = DATASET_full)
boot_both <- cluster.vcov(model, ~ precinct + month_year)
coeftest(model, boot_both)
### What the documentation offers as an example
### https://cran.r-project.org/web/packages/multiwayvcov/multiwayvcov.pdf
library(lmtest)
data(petersen)
m1 <- lm(y ~ x, data = petersen)
### Double cluster by firm and year using a formula
vcov_both_formula <- cluster.vcov(m1, ~ firmid + year)
coeftest(m1, vcov_both_formula)
Is it appropriate to first estimate a model that ignores the fixed effects?
First, the answer: you should first estimate your lm model with the fixed effects included. This will give you your asymptotically correct parameter estimates. The default standard errors are incorrect, however, because they are calculated from a vcov matrix which assumes iid errors.
To replace the iid covariance matrix with a cluster-robust vcov matrix, you can use cluster.vcov, i.e. my_new_vcov_matrix <- cluster.vcov(model, ~ precinct + month_year).
Then a recommendation: I warmly recommend the function felm from lfe for both multi-way fixed effects and cluster-robust standard errors.
The syntax is as follows:
library(multiwayvcov)
library(lfe)
data(petersen)
my_fe_model <- felm(y~x | firmid + year | 0 | firmid + year, data=petersen )
summary(my_fe_model)
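To make the "fit with fixed effects first, then swap the covariance matrix" workflow concrete, here is a sketch that uses sandwich::vcovCL as a stand-in for cluster.vcov (it also accepts a multi-way clustering formula) on the PetersenCL data from the sandwich package. The choice of package and data set is my assumption, not something taken from the question.

```r
library(sandwich)

data("PetersenCL", package = "sandwich")  # columns: firm, year, x, y

# Step 1: fit the model with both sets of fixed effects included
m_fe <- lm(y ~ x + factor(firm) + factor(year), data = PetersenCL)

# Step 2: build a two-way (firm and year) cluster-robust covariance matrix
vc_both <- vcovCL(m_fe, cluster = ~ firm + year)

# Step 3: the point estimate is untouched; only the standard error changes
est    <- unname(coef(m_fe)["x"])
se_iid <- unname(sqrt(diag(vcov(m_fe))["x"]))
se_cl  <- unname(sqrt(diag(vc_both)["x"]))
c(estimate = est, se_iid = se_iid, se_clustered = se_cl)
```

The full table can be obtained with lmtest::coeftest(m_fe, vcov. = vc_both), in the same way as with the matrix returned by cluster.vcov.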

Running diagnostics on a multivariate multiple regression in r

I have a data set that gives the rates of incidence of some phenomena in all the zip codes of a state, and some demographic data. The rates are given for each year in the data set (year 1 - year 6). A snippet of the data is available here.
I've run a multivariate linear regression to examine the impact of the demographic variables on the rates, per Fox & Weisberg (2011), weighted by the average zip code population across all years (var = POPmean):
Y <- cbind(data$rateY1, data$rateY2, data$rateY3, data$rateY4, data$rateY5, data$rateY6)
model <- lm(Y ~ someVAR1+someVAR2+someVAR3+someVAR4+someVAR5, data=data, weights= POPmean)
summary(model)
coef(model)
summary(manova(model))
I'd like to plot the regression diagnostics for this model for each year, but have no idea how to do so. I'd like to use influencePlot() from the car package, but when I try to do so:
influencePlot(model, id.method="noteworthy", main="Robustness Check")
I receive an error stating that the lengths of x,y differ (which, of course, they do). Can anyone help figure out how to plot the regression diagnostics for the model given above? Or suggest an alternative method?
