Prediction function in R

I am working on a data set of employed and unemployed people. After estimating the parameters of a (log)wage equation, I am trying to predict the (log)wages that unemployed people would earn according to the fitted model (in the data set, wages for unemployed people are NA).
After calling predict I still get predictions only for employed people. Does anyone know how to get the full column of "predML"? Thank you in advance!!
library(sampleSelection)  # provides selection()
ml_restricted <- selection(employed ~ schooling + age + agesq + married, logwage ~ schooling + age + agesq, data = data)
summary(ml_restricted)
# find predicted values
predML <- predict(ml_restricted)
data <- cbind(data, predML)
employed2 <- ifelse(data$employed == 1, "employed", "unemployed")
data <- cbind(data, employed2)
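The behaviour described here (predictions only for rows that entered the fit) can be reproduced with a plain lm() as a stand-in. This is only a sketch on simulated data: the data frame toy and its values are made up, and lm() is used instead of selection(), so whether the predict method for selection objects accepts newdata in the same way is an assumption to check against the sampleSelection documentation.

```r
# Simulated stand-in data: log wages are observed only for employed people
set.seed(42)
n <- 20
toy <- data.frame(
  schooling = sample(8:18, n, replace = TRUE),
  age       = sample(20:60, n, replace = TRUE),
  employed  = rep(c(1, 0), times = c(14, 6))
)
toy$logwage <- ifelse(toy$employed == 1,
                      1 + 0.08 * toy$schooling + 0.01 * toy$age + rnorm(n, sd = 0.2),
                      NA)

# Rows with a missing response are dropped when the model is fitted
fit <- lm(logwage ~ schooling + age, data = toy)

# Without newdata: one value per row actually used in the fit
n_fitted <- length(predict(fit))

# With newdata: one value per row of the data frame that is passed in,
# because only the right-hand-side variables are needed for prediction
toy$pred <- predict(fit, newdata = toy)
n_predicted <- sum(!is.na(toy$pred))

c(fitted = n_fitted, predicted = n_predicted)
```

In this stand-in, passing the full data frame as newdata is what produces a prediction for the rows with a missing response.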

Related

Can I do Fine-Gray regression on a split survival dataset?

This is my first question here, so if I need to share more information please let me know.
I have done a Cox regression analysis in R in which I am interested in the effect of implant surface on reoperation over 36 months. Here's a reproducible example:
library(survival)
set.seed(1)  # make the simulated data reproducible
n <- 100
df <- data.frame(id=1:n,
time=sample(1:36, n, replace=TRUE),
event=sample(0:2, n, replace=TRUE),
implantsurface=sample(0:1, n, replace=TRUE),
covariate1=sample(0:1, n, replace=TRUE),
covariate2=sample(0:1, n, replace=TRUE))
df$time <- as.numeric(df$time)
After adjusting for a number of covariates, I found that the proportional hazards assumption was violated for covariate1. I split my dataset into 0-4 months and 4-36 months as follows (simplified code), after which the PH assumption was no longer violated:
fit1 <- survSplit(Surv(time, event == 1) ~
implantsurface + covariate1 + covariate2,
data = df, cut=c(4),
episode= "tgroup")
fit2 <- coxph(Surv(tstart, time, event) ~
implantsurface + strata(tgroup):covariate1 + covariate2,
data = fit1)
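For readers who have not used survSplit, here is a minimal sketch of what the split does, on a three-row toy data frame (the data frame toy and its values are made up for illustration). Each subject followed past the cut point at 4 gets two rows, and the new tgroup column records which interval a row belongs to.

```r
library(survival)

# Hypothetical toy data: three subjects with follow-up of 2, 10 and 30 months
toy <- data.frame(id = 1:3,
                  time = c(2, 10, 30),
                  event = c(1, 0, 1))

# Split follow-up at 4 months; "tgroup" numbers the resulting intervals
toy_split <- survSplit(Surv(time, event) ~ ., data = toy,
                       cut = 4, episode = "tgroup")

# Subject 1 stays on one row; subjects 2 and 3 each get a 0-4 and a 4+ row
toy_split
```

The columns tstart and tgroup exist only in the split data frame, so any later step that refers to tgroup has to be given a data set that still carries that column.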
Now I would also like to adjust for competing risks with Fine-Gray regression, but I am unable to do this for the split dataset. I have tried the following:
FG <- finegray(Surv(time = time, event = event.competing, type = "mstate") ~
implantsurface + strata(tgroup):covariate1 + covariate2,
data = fit1, etype = "event_of_interest")
FGfit <- coxph(Surv(fgstart, fgstop, fgstatus) ~
implantsurface + strata(tgroup):covariate1 + covariate2,
weights = fgwt, data = FG)
Error in strata(tgroup) : object 'tgroup' not found
Does anyone know how/if Fine-Gray can be applied to a split survival dataset? Many thanks in advance for thinking along!

Why do vif() results from the car package differ from those in lmridge R?

I'm not a frequent poster so I apologize if this format is not correct.
If you tell me how to show the data I will make that change.
In the meantime, here is the code.
At K = 0, where ridge regression reduces to ordinary least squares, vif() in lmridge gives very different values from vif() in car. Why are they different?
library(car)
library(lmridge)
data <- Duncan
ds <- as.data.frame(scale(data[,2:4]))
m <- lm(prestige ~ income + education, data = ds)
vif(m)
m2 <- lmridge(prestige ~ income + education, data = ds,
scaling = "scaled",
K = seq(0,1,0.10))
vif.lmridge(m2)
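One way to see which of the two outputs matches the textbook definition is to compute the VIFs by hand. The sketch below uses base R only and the built-in mtcars data as a stand-in (an assumption, chosen so it runs without car or lmridge). It does not explain what lmridge does internally; it only gives a reference value to compare against.

```r
# Stand-in data: three standardized numeric predictors from mtcars
X <- as.data.frame(scale(mtcars[, c("wt", "hp", "disp")]))

# Definition: regress each predictor on the others, VIF_j = 1 / (1 - R_j^2)
vif_by_regression <- sapply(names(X), function(v) {
  f <- reformulate(setdiff(names(X), v), response = v)
  1 / (1 - summary(lm(f, data = X))$r.squared)
})

# Equivalent shortcut: diagonal of the inverse correlation matrix
vif_by_inverse <- diag(solve(cor(X)))

# The two routes should agree up to numerical tolerance
all.equal(unname(vif_by_regression), unname(vif_by_inverse))
```

Running the same two calculations on the income and education columns of ds would show which package reproduces the definition at K = 0.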

How to plot survival relative to general population with age on the X-axis (left-truncated data)?

I am trying to compare the survival in my study cohort with the survival in the Dutch general population (matched for age and sex). I created a rate table of the Dutch population.
library(relsurv)
setwd("")
nldpop <- transrate.hmd("mltper_1x1.txt","fltper_1x1.txt")
Then I wanted to plot the survival of my cohort (observed) and the survival of the population (expected) with age on the X-axis. However, the 'survexp' function does not seem to support the (start, stop, event) format. It only works with the usual (futime, event) format, see below, but then I get follow-up time on the X-axis. Does anyone know how to get age on the X-axis instead of follow-up time?
# Observed and expected survival with time on X-axis
fit <- survfit(Surv(futime, event)~1)
efit <- survexp(futime ~ 1, rmap = list(year=(date_entry), age=(age_entry), sex=(sex)),
ratetable=nldpop)
plot(fit)
lines(efit)
You didn't provide example data, so I used the survival::mgus data for this. Your problem may be due to incorrectly specified variable names in the rmap option.
library(relsurv)
library(dplyr)  # for %>% and mutate()
nldpop <- transrate.hmd("mltper_1x1.txt", "fltper_1x1.txt")
mgus2 <- mgus %>% mutate(date_year = dxyr + 1900)
fit <- survfit(Surv(futime, death) ~ 1, data = mgus2)
efit <- survexp(Surv(futime, death) ~ 1, data = mgus2,
ratetable = nldpop, rmap = list(age = age*365.25, year = date_year, sex = sex))
plot(fit)
lines(efit)

R function with separating data and finding linear regression

I want to estimate the impact that height has on earnings separately by gender. I split my data into male and female subsets, but when I run lm(earnings~height+education+age, data = data_female) I get this error:
Error in model.frame.default(formula = earnings ~ height + education + :
variable lengths differ (found for 'education')
Would you be able to help in either suggesting a better way to refine my model or helping to fix this particular error? Please let me know.
setwd("~/Google Drive/R Data")
data <- read.csv('data_ass5.csv')
height <- data$height
earnings <- data$earnings
gender <- data$sex
age <- data$age
education <- data$educ
multiple_regression <- lm(earnings~height+age+gender+education,data = data)
lm(earnings~height+age+gender+education,data = data)
summary(multiple_regression)
summary(linear_regression)
multiple_regression_redefined <- lm(earnings~age+gender+education,data = data)
# Now I want to assess specifically the impact of gender on earnings,
# so I am trying to refine my model as follows,
# but the last lm line causes an error. Could you advise whether
# this is the correct way to refine it and/or why I am getting the error?
# I even tried adding na.rm=TRUE to the lm call, but the error remains.
data_female <- subset(data,gender==0)
data_male <- subset(data,gender==1)
lm(earnings~height+education+age, data = data_female)
Per the docs for lm, the data argument resolves the variables in the formula in two ways that are NOT mutually exclusive:
data
an optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which lm is called.
Specifically, all your vector assignments are redundant and overlap with column names in the data frame except for gender and education:
height <- data$height
earnings <- data$earnings
gender <- data$sex
age <- data$age
education <- data$educ
multiple_regression <- lm(earnings~height+age+gender+education,data = data)
When the above is run, all referenced names except gender and education come from the data frame. But gender and education are pulled from the global environment, from the vectors you assigned above. Had you used sex and educ, the values would be pulled from the data frame like all the others.
Relatedly, your subset calls use the gender vector and not the sex column. Fortunately, the two are identical, so no errors or undesired results occurred.
data_female <- subset(data,gender==0)
data_male <- subset(data,gender==1)
Therefore, once you subset your data, lm pulls all values from the subset data frame except one, education, which it takes from the global environment. But remember that education is based on the full data frame, so it is longer than the columns of the subset data frame.
Altogether, simply avoid assigning the redundant vectors and use the column names for both the full and the subset data frames.
# REMOVE THESE REDUNDANT VECTOR ASSIGNMENTS
# height <- data$height
# earnings <- data$earnings
# gender <- data$sex
# age <- data$age
# education <- data$educ
# REPLACE gender WITH sex AND education WITH educ (RENAME COLS IF NEEDED)
multiple_regression <- lm(earnings ~ height + age + sex + educ, data = data)
# REPLACE gender WITH sex
data_female <- subset(data, sex==0)
data_male <- subset(data, sex==1)
# REPLACE education WITH educ
lm(earnings ~ height + educ + age, data = data_female)
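The mechanism described in this answer can be reproduced on a small simulated data frame. This is only a sketch: the values are made up and the column names are assumed to match the asker's file (sex, educ).

```r
# Simulated stand-in for the asker's data (all values are made up)
set.seed(1)
data <- data.frame(
  earnings = rnorm(10, mean = 50000, sd = 10000),
  height   = rnorm(10, mean = 170, sd = 10),
  age      = sample(20:60, 10, replace = TRUE),
  sex      = rep(0:1, each = 5),
  educ     = sample(8:18, 10, replace = TRUE)
)

# A global vector whose name is NOT a column of `data`
education <- data$educ                 # length 10

data_female <- subset(data, sex == 0)  # 5 rows

# `education` is not found in data_female, so lm() falls back to the
# length-10 global vector while the other variables have length 5
res <- try(lm(earnings ~ height + education + age, data = data_female),
           silent = TRUE)
failed <- inherits(res, "try-error")
if (failed) cat(res)                   # shows the error message

# Using the column name keeps every variable inside the subset
fit_female <- lm(earnings ~ height + educ + age, data = data_female)
nobs(fit_female)
```

The first call fails with the same "variable lengths differ" message as in the question, and the second one fits on the five female rows only.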

Predictions using time dependent covariates in survival model

Simple question: how do you specify time-dependent covariates in the data frame supplied to newdata when making predictions?
In other words, I fit a model with time dependent covariates:
cfit <- coxph(Surv(tstart, tstop, status) ~ treat + sex + age +
inherit + cluster(id), data=cgd)
Now I'd like to create a prediction for a patient using updated data from that patient. In other words, what is their survival probability, given that we observed changes in certain covariates within certain time intervals?
I can predict survival for a new patient, as follows:
survfit(cfit, newdata=data.frame(treat = "placebo", age = 12, sex ="male", inherit = "X-linked"))$surv
But this does not allow me to update predictions as time passes from the start of observation for that patient, allowing for the incorporation of updated covariates.
This is detailed in the 4th paragraph of the details section of the help page ?survfit.coxph. Basically you need an id column that shows which rows belong to the same person, then for each row you need the beginning time, the ending time, and the values of the covariates during that time period. Each time period for the individual being predicted will have its own row in newdata (so the time periods should not overlap).
A reproducible example in R to go with my comment above:
## Install package that has a dataset with data applicable for this problem
install.packages("ipw")
library(ipw)
library(survival)  # for coxph(), Surv() and survfit()
## Load data
head(haartdat,n=100)
## Fit model, with time varying covariate cd4.sqrt
model.2 <- coxph(Surv(tstart, fuptime, event) ~ sex + age + cd4.sqrt + cluster(patient), data = haartdat)
## Create dataframe of variables (one row)
covs <- data.frame(age = 25,sex = 1,cd4.sqrt = 24)
covs
## Get survival probabilities for these variables at baseline
summary(survfit(model.2, newdata = covs, type = "aalen"))
## Now create two 'newdata' sets of covariates, for time points up to 1900
## covs.2 the data is same at all 20 time points
covs.2 <- data.frame(patient = 1, covs[rep(1, 20), ], row.names = NULL)
covs.2$tstart <- seq(-100,1800,100)
covs.2$fuptime <- seq(0,1900,100)
covs.2$event <- rep(0,20)
## covs.3 has varying cd4.sqrt
covs.3 <- data.frame(patient = 2, covs[rep(1, 20), ], row.names = NULL)
covs.3$tstart <- seq(-100,1800,100)
covs.3$fuptime <- seq(0,1900,100)
covs.3$event <- rep(0,20)
covs.3$cd4.sqrt <- seq(20.25,25,0.25)
## Combine into one dataset
covs.4 <- rbind(covs.2,covs.3)
covs.4
## Create survival probabilities, with the id = argument
summary(survfit(model.2, newdata = covs.4, type = "aalen", id = patient))
