Survival analysis with time-dependent input data - r

I am fairly new to survival analysis and I apologize if this is a trivial question, but I wasn't able to find any solution to my problem.
I'm trying to find a good model for predicting whether and when a contract for a specific product (identified by the ID column) will be bought, i.e. a time-to-event prediction. I am mostly interested in the probability that the event will occur within 3 months. However, my data is essentially a monthly time series. A sample of the dataset would look something like this:
ID  Time     Number of assistance calls  Number of product malfunctions  Time to fix  Contract bought
1   2012-01  0                           0                               NA           0
1   2012-02  3                           1                               37.124       0
1   2012-03  2                           0                               NA           0
1   2012-04  0                           0                               NA           1
2   2012-03  1                           0                               NA           0
2   2012-04  0                           0                               NA           0
Here's what I struggle with. I could use a survival analysis model, e.g. a Cox proportional hazards model, which can handle time-dependent variables, but in that case it wouldn't be able to predict [1]. I could also summarize the data for each ID, but that would mean losing some of the information contained in the data, e.g. whether a malfunction occurred 1, 2 or 3 months before the event.
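To make the first option concrete, this is roughly how I picture restructuring the monthly rows into the counting-process (start, stop) format that coxph expects (a sketch only; df, calls and malf are my own shorthand for the columns above):
library(survival)

# Counting-process layout for the sample rows above; time is measured in months
# since each product entered the data, and "bought" is the per-interval event flag
df <- data.frame(
  id     = c(1, 1, 1, 1, 2, 2),
  start  = c(0, 1, 2, 3, 0, 1),
  stop   = c(1, 2, 3, 4, 1, 2),
  calls  = c(0, 3, 2, 0, 1, 0),
  malf   = c(0, 1, 0, 0, 0, 0),
  bought = c(0, 0, 0, 1, 0, 0)
)

fit <- coxph(Surv(start, stop, bought) ~ calls + malf, data = df)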
Is there a better way to approach this?
Thank you very much for any tips!
Sources:
[1] https://www.annualreviews.org/doi/10.1146/annurev.publhealth.20.1.145

Related

How to calculate similarity of numbers (in list)

I am looking for a method for calculating a similarity score for a list of numbers. Ideally the method should give a result in a fixed range, for example from 0 to 1, where 0 means not similar at all and 1 means all numbers are identical.
For clarity let me provide a few examples:
0 1 2 3 4 5 6 7 8 9 10 => the similarity should be 0 or close to zero as all numbers are different
1 1 1 1 1 1 1 => 1
10 9 11 10.5 => close to 1
1 1 1 1 1 1 1 1 1 1 100 => score should be still pretty high as only the last value is different
I have tried to calculate the similarity based on normalization and average, but that gives me really bad results when there is one 'bad number'.
Thank you.
Similarity measures are always incredibly subjective, and the right one to use depends heavily on what you're trying to use it for. We already have three typical measures of central tendency (mean, median, mode). It's hard to say which measure will work for you, because there are different ways of measuring that will do what you're asking but give wildly different results for other lists (like [1]*7 + [100]*7). Here's one solution:
import statistics as stats

def tester(ell):
    # Share of the list made up of repeated values (0 when all values are distinct)
    mode_measure = 1 - len(set(ell)) / len(ell)
    # 1 minus the coefficient of variation (standard deviation relative to the mean)
    avg_measure = 1 - stats.stdev(ell) / stats.mean(ell)
    return max(avg_measure, mode_measure)

How to do survival analysis in R with time-varying exposure to an intervention, using Surv and coxph?

I have survival data in this format, with a time-varying exposure to Intervention:
ID start stop status Intervention
1 2 14 0 0
2 2 5 0 0
3 2 3 0 0
3 3 10 1 1
4 5 8 0 0
5 6 10 0 0
For example, for patient ID #3: from day 2 to day 3, the patient has not yet received the intervention (Intervention = 0), but starting on day 3 and lasting until day 10 (when the patient dies), the patient has received the intervention (Intervention = 1).
I thought that I could then estimate the time-varying effect of exposure in the following manner:
coxph(Surv(start, stop, status) ~ Intervention + cluster(ID), data = df.td)
However, I recently found that this method is not correct for right-censored data (see "Two different results from coxph in R, using same stop and start times, why?"). Yet most basic guides to time-dependent survival analysis use a line like this (for example, https://www.emilyzabor.com/tutorials/survival_analysis_in_r_tutorial.html).
Is this method correct for estimating the effect of Intervention on outcome, given the structure of the data?
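For reference, here is the same data and model call as a self-contained snippet (nothing new, just made runnable so the question is reproducible):
library(survival)

# The six rows shown above
df.td <- data.frame(
  ID           = c(1, 2, 3, 3, 4, 5),
  start        = c(2, 2, 2, 3, 5, 6),
  stop         = c(14, 5, 3, 10, 8, 10),
  status       = c(0, 0, 0, 1, 0, 0),
  Intervention = c(0, 0, 0, 1, 0, 0)
)

coxph(Surv(start, stop, status) ~ Intervention + cluster(ID), data = df.td)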

Creating time-varying covariates when time is different for each observation

I'm attempting to conduct survival analysis with time-varying covariates. The data comes from a longitudinal survey that is administered every two years, and currently looks something like this:
id  event1yr  event2yr  income14  income16  income18  income20
1   2014      2020      8         10        13        8
2   2018      NA        13        15        24        35
In the case of my study, I am trying to begin time (t_0) at event1yr and measure time from that variable, which obviously differs across observations. So, for instance, the time to event for observation 1 is 6 years, whereas the time to event for observation 2 is right-censored at 2 years. The main issue comes with also trying to pull data from different time points, since the beginning time is different. For instance, income for years 0-2 (exclusive) for observation 1 would come from income14, but income for years 0-2 for observation 2 would come from income18. In the end, I'd like my data to look something like this:
id st.time end.time event2 censor inc
1 0 2 0 0 8
1 2 4 0 0 10
1 4 6 1 0 13
2 0 2 0 1 24
Thus, I'm trying to think of the best way to code to account for pulling the data from different points in time since the beginning reference time is not constant across observations.
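For concreteness, here is a rough base-R sketch of the transformation I have in mind (assumptions: the wide data sits in a data frame called dat, 2020 is the last survey wave, and the object names are my own):
# Build person-period rows with time measured in years from event1yr
dat <- data.frame(id       = c(1, 2),
                  event1yr = c(2014, 2018),
                  event2yr = c(2020, NA),
                  income14 = c(8, 13),
                  income16 = c(10, 15),
                  income18 = c(13, 24),
                  income20 = c(8, 35))

inc_years <- c(2014, 2016, 2018, 2020)            # calendar years the income columns refer to
inc_cols  <- paste0("income", c(14, 16, 18, 20))

long <- do.call(rbind, lapply(seq_len(nrow(dat)), function(i) {
  row   <- dat[i, ]
  end   <- if (is.na(row$event2yr)) 2020 else row$event2yr   # censor at the last wave
  keep  <- inc_years >= row$event1yr & inc_years < end       # waves at or after t_0
  start <- inc_years[keep] - row$event1yr
  data.frame(id       = row$id,
             st.time  = start,
             end.time = start + 2,
             event2   = as.integer(!is.na(row$event2yr) & inc_years[keep] + 2 == row$event2yr),
             censor   = as.integer(is.na(row$event2yr) & inc_years[keep] + 2 == end),
             inc      = unname(unlist(row[inc_cols])[keep]))
}))
long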

How can loading factors from PCA be used to calculate an index that can be applied for each individual in a data frame in R?

I am using principal component analysis (PCA) based on ~30 variables to compose an index that classifies individuals into 3 different categories (top, middle, bottom) in R.
I have a dataframe of ~2000 individuals with 28 binary and 2 continuous variables.
Now, I would like to use the loading factors from PC1 to construct an index that classifies my 2000 individuals on these 30 variables into 3 different groups.
Problem: Despite extensive research, I could not find out how to use the loading factors in PCA_loadings to give each individual a score (based on the loadings of the 30 variables), which would subsequently allow me to rank each individual for further classification. Does it make sense to display the loading factors in a graph?
I've performed the following steps:
a) Ran a PCA using PCA_outcome <- prcomp(na.omit(df1), scale = T)
b) Extracted the loadings using PCA_loadings <- PCA_outcome$rotation
c) Removed all the variables for which the loading factors were close to 0.
I have considered creating 30 new variables, one for each loading factor, which I would sum up for each binary variable == 1 (though I am not sure how to proceed with the continuous variables). Consequently, I would assign each individual a score. However, I do not know how to assemble the 30 values from the loading factors into a score for each individual.
R code
df1 <- read.table(text="
educ call house merge_id school members
A 1 0 1 12_3 0 0.9
B 0 0 0 13_3 1 0.8
C 1 1 1 14_3 0 1.1
D 0 0 0 15_3 1 0.8
E 1 1 1 16_3 3 3.2", header=T)
## Run PCA
PCA_outcome <- prcomp(na.omit(df1), scale = T)
## Extract loadings
PCA_loadings <- PCA_outcome$rotation
## Explanation: A-E are 5 of the 2000 individuals and the variables (education, call, house, school, members) represent my 30 variables (binary and continuous).
Expected results:
- Get a rank score for each individual
- Subsequently, assign a category 1-3 to each individual.
I'm not 100% sure what you're asking, but here's an answer to the question I think you're asking.
First of all, PC1 of a PCA won't necessarily provide you with an index of socio-economic status. As explained here, PC1 simply "accounts for as much of the variability in the data as possible". PC1 may well work as a good metric for socio-economic status for your data set, but you'll have to critically examine the loadings and see if this makes sense. Depending on the signs of the loadings, it could be that a very negative PC1 corresponds to a very positive socio-economic status. As I say: look at the results with a critical eye. An explanation of how PC scores are calculated can be found here. Anyway, that's a discussion that belongs on Cross Validated, so let's get to the code.
It sounds like you want to perform the PCA, pull out PC1, and associate it with your original data frame (and merge_ids). If that's your goal, here's a solution.
# Create data frame
df <- read.table(text = "educ call house merge_id school members
A 1 0 1 12_3 0 0.9
B 0 0 0 13_3 1 0.8
C 1 1 1 14_3 0 1.1
D 0 0 0 15_3 1 0.8
E 1 1 1 16_3 3 3.2", header = TRUE)
# Perform PCA
PCA <- prcomp(df[, names(df) != "merge_id"], scale = TRUE, center = TRUE)
# Add PC1
df$PC1 <- PCA$x[, 1]
# Look at new data frame
print(df)
#> educ call house merge_id school members PC1
#> A 1 0 1 12_3 0 0.9 0.1000145
#> B 0 0 0 13_3 1 0.8 1.6610864
#> C 1 1 1 14_3 0 1.1 -0.8882381
#> D 0 0 0 15_3 1 0.8 1.6610864
#> E 1 1 1 16_3 3 3.2 -2.5339491
Created on 2019-05-30 by the reprex package (v0.2.1.9000)
As you say you have to use PCA, I'm assuming this is for a homework question, so I'd recommend reading up on PCA so that you get a feel of what it does and what it's useful for.
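If the remaining step is the top/middle/bottom classification, one simple option would be to cut PC1 at its tertiles (a sketch only; the cut-points are an assumption, and whether group 1 is "top" or "bottom" depends on the sign of the loadings, so inspect PCA$rotation first):
# Split PC1 into three roughly equal-sized groups
df$group <- cut(df$PC1,
                breaks = quantile(df$PC1, probs = c(0, 1/3, 2/3, 1)),
                labels = c(1, 2, 3),
                include.lowest = TRUE)
table(df$group)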

Probability of account win/loss using Bayesian Statistics

I am trying to estimate the probability of winning or losing an account, and I'd like to do this using Bayesian Methods. I'm not really that familiar with these methods, but I think I understand the general idea.
I know some information about losses and wins. Wins are usually characterized by some combination of activities; losses are usually characterized by a different combination of activities. I'd like to be able to get some posterior probability of whether or not a new observation will be won or lost, based on the current number of activities associated with that account.
Here is an example of my data: (This is just a sample for simplicity)
Email Call Callback Outcome
14 9 2 1
3 2 4 0
16 14 2 0
15 1 3 1
5 2 2 0
1 1 0 0
10 3 5 0
2 0 1 0
17 8 4 1
3 15 2 0
17 1 3 0
10 7 5 0
10 2 3 0
8 0 0 1
14 10 3 0
1 9 3 1
5 10 3 1
13 5 1 0
9 4 4 0
So from here I know that 30% of the observations have an outcome of 1 (win) and 70% have an outcome of 0 (loss). Let's say that I want to use the other columns to get a probability of win/loss for a new observation which may have a small number of events (emails, calls, and callbacks) associated with it.
Now let's say that I want to use the counts/proportions of the different events as priors for a new observation. This is where I start getting tripped up. My thinking is to create a Dirichlet distribution for wins and losses, i.e. two separate distributions, one for wins and one for losses, using the counts/proportions of events for each outcome as the priors. I'm not sure how to do this in R. I think my course of action would be to estimate a Dirichlet distribution (since I have 3 variables) for each outcome using maximum likelihood. I've been trying to use the dirichlet.simul and dirichlet.mle functions from the sirt package in R. I'm not sure if I need to simulate one first?
Another issue is once I have this distribution, it's unclear to me how to get a posterior distribution of a new observation. I've read several papers and can't seem to find a straightforward process on how to do this. (Or maybe there's some holes in my understanding). Any pushes in the right direction would be greatly appreciated.
This is the code I've tried so far:
library(sirt)   # provides dirichlet.simul() and dirichlet.mle()

### FOR WON ACCOUNTS
set.seed(789)
N <- 6
probs <- c(0.535714286, 0.330357143, 0.133928571)
alpha <- matrix(probs, nrow = N, ncol = length(probs), byrow = TRUE)
x <- dirichlet.simul(alpha)
dirichlet.mle(x)
$alpha
[1] 0.3385607 0.2617939 0.1972898
$alpha0
[1] 0.7976444
$xsi
[1] 0.4244507 0.3282088 0.2473405
### FOR LOST ACCOUNTS
set.seed(789)
N2 <- 14
probs2 <- c(0.528037383, 0.308411215, 0.163551402)
alpha2 <- matrix(probs2, nrow = N2, ncol = length(probs2), byrow = TRUE)
x2 <- dirichlet.simul(alpha2)
dirichlet.mle(x2)
$alpha
[1] 0.3388486 0.2488771 0.2358043
$alpha0
[1] 0.8235301
$xsi
[1] 0.4114587 0.3022077 0.2863336
Not sure if this is a correct approach or how to get posteriors from here. I realize all the outputs look similar across won/lost accounts. I just used some simulated data to represent what I'm working with.
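To make my goal concrete, here is my rough (possibly wrong) understanding of what the final Bayes-rule step would look like. It is deliberately simplified: it plugs the fitted mean proportions (the $xsi values above) into a multinomial likelihood instead of the full Dirichlet-multinomial, and uses the 30% / 70% base rates as priors; new_obs is a made-up account:
xsi_win  <- c(0.4244507, 0.3282088, 0.2473405)   # mean proportions, won accounts ($xsi above)
xsi_loss <- c(0.4114587, 0.3022077, 0.2863336)   # mean proportions, lost accounts

new_obs <- c(Email = 4, Call = 2, Callback = 1)  # hypothetical new account

lik_win  <- dmultinom(new_obs, prob = xsi_win)   # P(counts | win)
lik_loss <- dmultinom(new_obs, prob = xsi_loss)  # P(counts | loss)

# Posterior probability of a win, with P(win) = 0.3 and P(loss) = 0.7
post_win <- lik_win * 0.3 / (lik_win * 0.3 + lik_loss * 0.7)
post_win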
