Hello, I am learning about survival analysis, and I was curious whether I could use the survival package on survival data of the following form.
Here is some code to generate data in this form:
# Life-table style data: one row per interval
start_interval <- seq(0, 13)
end_interval <- seq(1, 14)
living_at_start <- round(seq(1000, 0, length.out = 14))  # number alive at interval start
dead_in_interval <- c(abs(diff(living_at_start)), 0)     # deaths within each interval
df <- data.frame(start_interval, end_interval, living_at_start, dead_in_interval)
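For reference, head(df, 3) shows the life-table layout produced by the code above:

> head(df, 3)
  start_interval end_interval living_at_start dead_in_interval
1              0            1            1000               77
2              1            2             923               77
3              2            3             846               77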
From my use of the survival package so far, it seems to expect one row per individual survival time, but I might be misreading the documentation of the Surv function. If survival will not work, what other packages are out there for this type of data?
If there is no package or function to estimate the survival function from this format directly, I can easily calculate it myself with the following equation.
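The equation is the simple life-table estimate (it matches the by-hand calculation used below): S(t_k) = n_k / n_0, where n_k is the number living at the start of interval k and n_0 is the initial cohort size, i.e. living_at_start / max(living_at_start).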
Since the survival package needs one observation per survival time, we need to do some transformation. Using the simulated data:
Simulated Data:
library(survival)
start_interval <- seq(0, 13)
end_interval <- seq(1, 14)
living_at_start <- round(seq(1000, 0, length.out = 14))
dead_in_interval <- c(abs(diff(living_at_start)), 0)
df <- data.frame(start_interval, end_interval, living_at_start, dead_in_interval)
Transforming the data by duplicating each row by the number dead in that interval:
duptimes <- df$dead_in_interval
rid <- rep(1:nrow(df), duptimes)  # row indices, repeated once per death
df.t <- df[rid, ]
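As a quick sanity check (one line, using only the objects defined above), the expanded data should contain one row per death:

nrow(df.t) == sum(df$dead_in_interval)  # TRUE: 1000 rows, one per death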
Using the Surv Function
test <- Surv(time = df.t$start_interval,
             time2 = df.t$end_interval,
             event = rep(1, nrow(df.t)),  # every observation is a death
             type = "interval")
Fitting the survival curve
summary(survfit(test ~ 1))
Comparing with the by-hand calculation from the original data:
df$living_at_start/max(df$living_at_start)
They match.
Questions
When using the survfit function, why is the number at risk 1001 at time 0 when there are only 1000 people in the data?
length(test)  # 1000
I have the following conceptual problem, which I can't get my head around.
Below is an example of survey data where I have a time column that indicates how long someone needs to respond to a certain question; responses faster than some time threshold get flagged for cleaning.
Now, I'm interested in how the amount of cleaning would change based on this threshold, i.e. what would happen if I increased the threshold, and what would happen if I decreased it.
So my idea was to just create a ROC curve (or other model metrics) to have a visual cue about a potential threshold. The problem is that I don't have a machine-learning-like model that would give me class probabilities. So I was wondering if there's any way to create a ROC curve nonetheless with this type of data. I had the idea of simply looping through my data at maybe 100 different thresholds, calculating false and true positive rates at each threshold, and then doing a simple line plot, but I was hoping for a more elegant solution that doesn't require a loop.
Any ideas?
Example data:
time column indicates the time needed per case
truth column indicates the current decision that I want to compare against
predicted column indicates the cleaning decision if I were to cut at a time threshold of 2.5 s. This is what I need to change/loop through.
library(dplyr)

set.seed(3)
df <- data.frame(time = c(2.5 + rnorm(5), 3.5 + rnorm(5)),
                 truth = rep(c("cleaned", "final"), each = 5)) %>%
  mutate(predicted = if_else(time < 2.5, "cleaned", "final"))
You can use ROCR for this, too:
library(ROCR)
library(dplyr)

set.seed(3)
df <- data.frame(time = c(2.5 + rnorm(5), 3.5 + rnorm(5)),
                 truth = rep(c("cleaned", "final"), each = 5)) %>%
  mutate(predicted = if_else(time < 2.5, "cleaned", "final"))
pred <- prediction(df$time, df$truth)  # labels order "cleaned" < "final", so "final" is the positive class
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize = TRUE)
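If you want to see candidate thresholds directly on the curve, ROCR's plot method can annotate them (print.cutoffs.at is a standard ROCR plotting argument; the grid of cutoffs here is an arbitrary choice):

plot(perf, colorize = TRUE, print.cutoffs.at = seq(1, 5, by = 0.5))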
You can also check the AUC value:
auc <- performance(pred, measure = "auc")
auc@y.values[[1]]
[1] 0.92
Cross-checking the AUC value with pROC:
library(pROC)
roc(df$truth, df$time)
Call:
roc.default(response = df$truth, predictor = df$time)
Data: df$time in 5 controls (df$truth cleaned) < 5 cases (df$truth final).
Area under the curve: 0.92
It is the same in both cases!
So my idea was to just create a ROC curve
Creating a ROC curve is as easy as
library(pROC)
set.seed(3)
data.frame(time = c(2.5 + rnorm(5), 3.5 + rnorm(5)),
           truth = rep(c("cleaned", "final"), each = 5)) |>
  roc(truth, time) |>
  plot()
The problem is that I don't have a machine-learning-like model that would give me class probabilities.
Sorry, I do not understand what is machine-learning-like about the question. A ROC curve only needs a numeric predictor and a binary truth; no model-derived class probabilities are required.
I had the idea of just looping through my data at maybe 100 different thresholds
There is no point in looping over 100 possible thresholds if you've got 10 observations. Sensible cutoffs are the nine situated in between your time values. You can get those from roc:
df <- data.frame(time = c(2.5 + rnorm(5), 3.5 + rnorm(5)),
                 truth = rep(c("cleaned", "final"), each = 5))
thresholds <- roc(df, truth, time)$thresholds
print(thresholds)
which, for one particular draw of the example data, printed:
[1] -Inf 1.195612 1.739608 1.968531 2.155908 2.329745 2.561073
[8] 3.093424 3.969994 4.586341 Inf
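And if you do want the rates at every candidate threshold without writing an explicit loop, here is a vectorized sketch (using only df and thresholds from above; treating "cleaned", i.e. time below the threshold, as the positive class is my assumption):

# Classify every observation at every threshold at once
pred_mat <- outer(df$time, thresholds, "<")          # TRUE means classified as "cleaned"
tpr <- colMeans(pred_mat[df$truth == "cleaned", ])   # true positive rate per threshold
fpr <- colMeans(pred_mat[df$truth == "final", ])     # false positive rate per threshold
plot(fpr, tpr, type = "s")                           # the ROC curve, by hand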
What exactly is implied by the term looping, and whether you want to exclude just for and while loops or whatever else you consider to be a loop, needs a precise definition. Is c(1, 2, 3, 4) * 5 a loop? There will be a loop running under the hood either way.
This is my first question on Stack Overflow.
Situation: I have 2 time series. Both series have the same values, but the second series has 5 NAs at the start. Hence, the first series has 105 observations, while the second has 110. I have fitted an ARIMA(0,1,0) using the Arima function to both series separately, and then used the forecast package to predict 10 steps into the future.
Issue: Even though the ARIMA coefficients for both series are the same, the projections (10 steps) turn out different. I am uncertain why this is the case. Has anyone come across this before? Any guidance is highly appreciated.
Tried: I tried setting a seed, creating the index manually, and using auto.arima for the model fitting. However, none of these steps helped me reconcile the difference.
I have added a picture to show what I see. Please note I have hidden the middle part of the series so that you can see the start and the end. The yellow highlighted cells are the projection outputs from the forecast package. I manually added the year index after extracting the results from R.
[Figure: the projected and base time series, shown in Excel]
library(forecast)

Rates <- read.csv("Rates_for_ARIMA.csv")
set.seed(123)

# ARIMA with NA
Simple_Arima <- Arima(
  ts(Rates$Rates1),
  order = c(0, 1, 0),
  include.drift = TRUE)
fcasted_Arima <- forecast(Simple_Arima, h = 10)
fcasted_Arima$mean
# ARIMA without NA
Rates2 <- as.data.frame(Rates$Rates2)
## Remove the trailing empty rows from the CSV
Rates2 <- Rates2[-c(106, 107, 108, 109, 110), ]
Simple_Arima2 <- Arima(
  ts(Rates2),
  order = c(0, 1, 0),
  include.drift = TRUE)
fcasted_Arima2 <- forecast(Simple_Arima2, h = 10)
fcasted_Arima2$mean
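As an aside (assuming the trailing blank cells are read in as NA, which is a guess about the CSV contents), na.omit would drop them without hard-coding row numbers:

Rates2 <- na.omit(Rates$Rates2)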
The link to the data is here (CSV format).
Could you share your data and code so that others can check whether there is any issue with them?
I tried to come up with an example and got the same results for both series, one that includes NAs and one that doesn't.
library(forecast)

set.seed(123)
ts1 <- arima.sim(model = list(order = c(0, 1, 0)), n = 105)  # a random walk
ts2 <- ts(c(rep(NA, 5), ts1), start = 1)  # same values, 5 leading NAs
fit1 <- forecast::Arima(ts1, order = c(0, 1, 0))
fit2 <- forecast::Arima(ts2, order = c(0, 1, 0))
pred1 <- forecast::forecast(fit1, 10)
pred2 <- forecast::forecast(fit2, 10)
forecast::autoplot(pred1)
forecast::autoplot(pred2)
> all.equal(as.numeric(pred1$mean), as.numeric(pred2$mean))
[1] TRUE
I am trying to forecast three variables using R, but I am running into issues with how to deal with correlation.
The three variables I am trying to forecast are Revenue, Subscriptions and Price.
My initial approach was to do two independent time series forecasts of subscriptions and price and multiply the outcomes to generate the revenue forecast.
I wanted to understand whether this approach makes sense, as there is an inherent correlation between the price and the number of subscribers, and this is the part I do not know how to deal with.
# Load packages
library(forecast)

# Read data
data <- read.csv("data.csv")
data.train <- data[1:57, ]
data.test <- data[58:72, ]
# Create time series for the variables of interest
data.subs <- ts(data.train$subs, start = c(2014, 1), frequency = 12)
data.price <- ts(data.train$price, start = c(2014, 1), frequency = 12)

# Create models
subs.stlm <- stlm(data.subs)
price.stlm <- stlm(data.price)

# Forecast
subs.pred <- forecast(subs.stlm, h = 15, level = c(0.6, 0.75, 0.9))
price.pred <- forecast(price.stlm, h = 15, level = c(0.6, 0.75, 0.9))
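For what it's worth, the revenue forecast under this independent approach is just the product of the two point forecasts (a sketch; note that multiplying the two prediction intervals element-wise would not give a valid interval for revenue):

revenue.pred <- subs.pred$mean * price.pred$mean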
Any help is greatly appreciated!
Looks like you can use the vector autoregression (VAR) model. Take a look at the description and the code provided here:
https://otexts.org/fpp2/VAR.html
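A minimal sketch of that idea with the vars package, reusing data.subs and data.price from the question (my own example, not from the linked chapter; the lag order p = 2 is an arbitrary assumption, and VARselect can pick it properly):

library(vars)

y <- cbind(subs = data.subs, price = data.price)  # joint bivariate series
var.fit <- VAR(y, p = 2, type = "const")          # p = 2 is an assumption; see VARselect(y)
var.pred <- predict(var.fit, n.ahead = 15)

# Revenue point forecast as the product of the joint point forecasts
rev.pred <- var.pred$fcst$subs[, "fcst"] * var.pred$fcst$price[, "fcst"]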
I am interested in replicating an experiment in a paper [1] I came across. The idea is that I need to simulate a Cox proportional hazards model in which the hazard depends on the first two covariates in the data frame. I am trying to make a plot similar to one in the paper (figure not shown here), but as a "hex" version of it. The problem is that I can't seem to get the "z-axis" correct.
library(ggplot2)
library(tibble)
# geom_hex() needs the hexbin package installed: install.packages("hexbin")

set.seed(42)  # this makes the example exactly reproducible

# 50,000 random uniforms
obs <- runif(50000, min = -1, max = .999)

# reshape the uniforms into a matrix: 5,000 observations x 10 covariates
obs <- matrix(data = obs, nrow = 5000, ncol = 10)

# make is_censored
is_censored <- sample(0:1, 5000, TRUE, prob = c(0.40, 0.60))

# hazard function: survival times depend on the first two covariates
const <- 1
time <- rexp(n = 5000, const * exp(-(obs[, 1] + 2 * obs[, 2])))

# dataset
df <- cbind(obs, is_censored, time)

# names for the covariates
names <- letters[1:10]
colnames(df)[1:10] <- names

# truth data
x <- df[, 1]; y <- df[, 2]
true <- tibble(x, y, time)

ggplot(true, aes(x, y)) +
  geom_hex(bins = 30)
I thought that if I mapped time to fill for the "z-axis" I would get the correct gradient, but instead I got the wrong result:

ggplot(true, aes(x, y, fill = time)) +
  geom_hex(bins = 30)
How can I get the proper gradient?
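No answer is recorded here, but a likely fix (my suggestion, not from the original thread): geom_hex always fills by bin count, so to shade by a third variable you need each hexagon to summarise the time values of the points falling inside it, which is what stat_summary_hex does:

# Shade each hexagon by the mean survival time of the points inside it
ggplot(true, aes(x, y, z = time)) +
  stat_summary_hex(fun = mean, bins = 30)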
[1] Deep Survival: A Deep Cox Proportional Hazards Network
I am using the CausalImpact R package and I would like to get the counterfactual/control time series from the output after estimation. I run the following code, which is basically the same as the example code on the package's website.
library(CausalImpact)

set.seed(1)
x1 <- 100 + arima.sim(model = list(ar = 0.999), n = 100)
y <- 1.2 * x1 + rnorm(100)
y[71:100] <- y[71:100] + 10
data <- cbind(y, x1)
pre.period <- c(1, 70)
post.period <- c(71, 100)
impact <- CausalImpact(data, pre.period, post.period)
The local linear trend is in
impact$model$bsts.model$state.contributions
while the coefficient draws are supposed to be in
impact$model$bsts.model$coefficients
so I run
trend <- colMeans(impact$model$bsts.model$state.contributions[1:1000, 1, 1:100])
trend + mean(impact$model$bsts.model$coefficients[1:1000, 2]) * x1
to get the counterfactual time series; however, this is far from the actual counterfactual series when plotting the results with
plot(impact)
Can somebody tell me how I can get back the counter-factual time series?
Thanks in advance!
The point predictions for the entire time series (both pre- and post-intervention) can be found in
impact$series
in the point.pred column. The counterfactual is the part of the point predictions that falls in the post.period portion of that column. impact$series provides the data for all three graphs in plot(impact).
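A short sketch of that extraction (column name as in the CausalImpact documentation; the integer subsetting assumes the default 1..100 time index used above):

# Counterfactual (predicted y in the absence of the intervention) for the post-period
counterfactual <- impact$series[71:100, "point.pred"]
plot(as.numeric(counterfactual), type = "l")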