survival::survfit (left, right, and interval censoring) in R

I'm attempting to estimate survival probabilities using the survfit function from the survival package. My dataset consists of animals that were captured at various times over the course of roughly two years. Some animals died, some were censored after capture, and some lived beyond the end of the study (I'm guessing this means I have left-, right-, and interval-censored data).
I can estimate survival probability using right censoring only, but this assumes all animals were captured on the same day and does not account for new animals being added through time. What I would like to do is estimate survival as a function of calendar day, not as a function of time since capture.
Example data:
time1 <- c(2, 386, 0, 1, 384, 3, 61, 33, 385, 64)
time2 <- c(366, 665, 285, 665, 665, 454, 279, 254, 665, 665)
censor <- c(3, 3, 3, 3, 3, 3, 3, 3, 3, 3)
region <- c(1, 6, 1, 6, 5, 1, 1, 1, 5, 6)
m1 <- data.frame(time1, time2, censor, region)
Code:
km.2 <- survfit(Surv(time1, time2, censor, type = "interval") ~ region, data = m1)
Note that the above code runs, but it doesn't estimate what I laid out above. I hope this is just a matter of specifying certain arguments to the survfit function, but this is where I am lost. Thanks for the help.

Not sure if you've figured this out by now, since it was nearly a year ago. I'm a bit confused by the experiment you're describing.
However, one item that pops out immediately is time1. I believe you can't have any interval start or end at 0. I recommend adding 0.5 or 1 to that specific time observation and explaining why in your write-up. A 0 value is a likely culprit for why it's not estimating properly.
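A minimal sketch of that fix (my own illustration; the 0.5 shift is arbitrary):

library(survival)
m1$time1[m1$time1 == 0] <- 0.5  # no interval should start at 0
km.2 <- survfit(Surv(time1, time2, censor, type = "interval") ~ region, data = m1)
summary(km.2)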


Specifying truncation point in glmmTMB R package

I am working with a large dataset that contains longitudinal data on the gambling behavior of 184,113 participants. The data is based on complete tracking of electronic gambling behavior within a gambling operator. Gambling behavior data is aggregated at a monthly level, for a total of 70 months. I have an ID variable separating participants, a time variable (months), as well as numerous gambling behavior variables such as active days played in a given month, bets placed in a given month, total losses in a given month, etc. Participants vary in when they have been active gambling. One participant may have gambled at months 2, 3, 4, and 7, another at months 3, 5, and 7, and a third at months 23, 24, 48, 65, etc.
I am attempting to run a truncated negative binomial 2 model in glmmTMB, and I am wondering how the package handles the lack of zeros. I have longitudinal data on gambling behavior: days played for each month (for a total of 70 months). The variable can take values between 1 and 31 (depending on the month); there are no zeros, because participants' months with 0 days played are absent from the dataset. Example of how the data are structured, with just two participants:
# Example variables and data frame in long form
# Includes id variable, time variable and example variable
id <- c(1, 1, 1, 1, 2, 2, 2)
time <- c(2, 3, 4, 7, 3, 5, 7)
daysPlayed <- c(2, 2, 3, 3, 2, 2, 2)
dfLong <- data.frame(id = id, time = time, daysPlayed = daysPlayed)
My question: How do I specify where the truncation happens in glmmTMB? Does it default to 0? I want to truncate at zero, and have run the following code (I am going to compare models; the first one is a simple unconditional one):
DaysPlayedUnconditional <- glmmTMB(daysPlayed ~ 1 + (1 | id), dfLong, family = truncated_nbinom2)
Will it do the trick?
From Ben Bolker through r-sig-mixed-models@r-project.org:
"I'm not 100% clear on your question, but: glmmTMB only does zero-truncation, not k-truncation with k>0, i.e. you can only specify the model Prob(x==0) = 0; Prob(x>0) = Prob(NBinom(x))/Prob(NBinom(x>0)) (terrible notation, but hopefully you get the idea)"

Calculating cumulative hypergeometric distribution

Suppose I have 100 marbles, and 8 of them are red. I draw 30 marbles, and I want to know the probability that at least five of them are red. I am currently using http://stattrek.com/online-calculator/hypergeometric.aspx and I entered 100, 8, 30, and 5 for population size, number of successes, sample size, and number of successes in sample, respectively. So the probability I'm interested in is the cumulative probability $P(X \geq 5)$, which is 0.050 in this case. My question is, how do I calculate this in R?
I tried
> 1-phyper(5, 8, 92, 30, lower.tail = TRUE)
[1] 0.008503108
But this is very different from the previous answer.
phyper(5, 8, 92, 30) gives the probability of drawing five or fewer red marbles.
1 - phyper(5, 8, 92, 30) thus returns the probability of getting six or more red marbles
Since you want the probability of getting five or more (i.e. more than 4) red marbles, you should use one of the following:
1 - phyper(4, 8, 92, 30)
[1] 0.05042297
phyper(4, 8, 92, 30, lower.tail=FALSE)
[1] 0.05042297
Why use 1 - phyper(..., lower.tail = TRUE) at all? It is easier to use phyper(..., lower.tail = FALSE). Even though the two are mathematically equivalent, there are numerical reasons for preferring the latter: when the upper-tail probability is tiny, computing it directly avoids the precision lost to floating-point cancellation in the subtraction from 1.
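As a cross-check (my addition), you can also sum the point probabilities directly; since only 8 marbles are red, the upper tail runs from 5 to 8:

sum(dhyper(5:8, 8, 92, 30))
# [1] 0.05042297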
Does that fix your problem? I believe you are putting the correct inputs into the phyper function. Is it possible that you're looking at the wrong output on the website you linked?

How to present multiple time-series data to an SVM (ksvm) in R (or, How to present two-dimensional input data to an SVM)

How can I make a ksvm model aware that the first 100 numbers in a dataset are all time series data from one sensor, while the next 100 numbers are all time series data from another sensor, etc, for six separate time series sensor inputs? Alternatively (and perhaps more generally), how can I present two-dimensional input data to an SVM?
The process for which I need a binary yes/no prediction model has six non-periodic time-series inputs, all with the same sampling frequency. An event triggers the start of data collection, and after a pre-determined time I need a yes/no prediction (preferably including a probability-of-correctness output). The characteristics of the time-series inputs that should produce 'yes' vs. 'no' are not known, but it is known that there should be some correlation between each of the input time series and the final outcome. There is also significant noise present on all inputs. Both the meaningful information and the noise appear on the inputs as short-duration bursts (the meaningful bursts always occur at roughly the same time for a given input source), but identifying which bursts are meaningful and which are noise is difficult; i.e. the fact that a burst happened at the "right" time for one input does not necessarily indicate a "yes" output; it may just be noise. To know whether the prediction should be "yes", the model needs to somehow incorporate information from all six time-series inputs. I have a collection of prior data with approximately 900 'no' results and 100 'yes' results.
I'm pretty new to both R and SVMs, but I think I want to use an SVM model (kernlab's ksvm). I'm having trouble figuring out how to present the input data to it. I'm also not sure how to tell ksvm that the data is time-series data, or whether that is even relevant.
I've tried using the Rattle GUI front-end to R to pull in my data from CSV files, but I can't figure out how to present the time-series data from all six inputs to the ksvm model. As a CSV input, it seems the only way to import the data for all 1000 samples is to organize the input data such that all sample data (for all six time-series inputs) is on a single line of the CSV file, with each known-result file's data on its own line. But in doing so, the fact that the 1st, 2nd, 3rd, etc. numbers are each part of the time series from the first sensor is lost in the translation, as is the fact that the 101st, 102nd, 103rd, etc. numbers are each part of the time series from the second sensor, and so on; to the ksvm model, each data sample is just an isolated number unrelated to its neighbors. How can I present this data to ksvm as six separate but interrelated time-series arrays? Or how can I present a two-dimensional array of data to ksvm?
UPDATE:
OK, there are two basic strategies I've tried, with dismal results (well, the resulting models were better than blind guessing, but not by much).
First of all, not being familiar with R, I used the Rattle GUI front-end. I have a feeling that by doing so I may be limiting my options. But anyway, here's what I've done:
Example known result files (shown with only 4 sensors instead of 6, and only 7 time samples instead of 100):
training168_yes.csv
Seconds Since 1/1/2000,sensor1,sensor2,sensor3,sensor4
454768042.4, 0, 0, 0, 0
454768042.6, 51, 60, 0, 172
454768043.3, 0, 0, 0, 0
454768043.7, 300, 0, 0, 37
454768044.0, 0, 0, 1518, 0
454768044.3, 0, 0, 0, 0
454768044.7, 335, 0, 0, 4273
training169_no.csv
Seconds Since 1/1/2000,sensor1,sensor2,sensor3,sensor4
454767904.5, 0, 0, 0, 0
454767904.8, 51, 0, 498, 0
454767905.0, 633, 0, 204, 55
454767905.3, 0, 0, 0, 512
454767905.6, 202, 655, 739, 656
454767905.8, 0, 0, 0, 0
454767906.0, 0, 934, 0, 7814
The only way I know to get the data for all training samples into R/Rattle is to massage and combine all result files into a single .csv file, with one sample result per line. I can think of only two ways to do that, so I tried them both (knowing, as I did it, that I was hiding potentially important information, which is the point of this question):
TRIAL #1: For each result file, sum each sensor's samples into a single number, blasting away all temporal information:
result,sensor1,sensor2,sensor3,sensor4
no, 886, 1589, 1441, 9037
yes, 686, 60, 1518, 4482
no, 632, 1289, 1173, 9152
yes, 411, 67, 988, 5030
no, 772, 1703, 1351, 9008
yes, 490, 70, 1348, 4909
When I'm done using Rattle to generate the SVM, Rattle's Log tab gives me the following script, which can be used to generate and train an SVM in RGui:
library(rattle)
building <- TRUE
scoring <- ! building
library(colorspace)
crv$seed <- 42
crs$dataset <- read.csv("file:///C:/Users/mminich/Desktop/stackoverflow/trainingSummary1.csv", na.strings=c(".", "NA", "", "?"), strip.white=TRUE, encoding="UTF-8")
set.seed(crv$seed)
crs$nobs <- nrow(crs$dataset) # 6 observations
crs$sample <- crs$train <- sample(nrow(crs$dataset), 0.67*crs$nobs) # 4 observations
crs$validate <- NULL
crs$test <- setdiff(setdiff(seq_len(nrow(crs$dataset)), crs$train), crs$validate) # 2 observations
# The following variable selections have been noted.
crs$input <- c("sensor1", "sensor2", "sensor3", "sensor4")
crs$numeric <- c("sensor1", "sensor2", "sensor3", "sensor4")
crs$categoric <- NULL
crs$target <- "result"
crs$risk <- NULL
crs$ident <- NULL
crs$ignore <- NULL
crs$weights <- NULL
require(kernlab, quietly=TRUE)
set.seed(crv$seed)
crs$ksvm <- ksvm(as.factor(result) ~ .,
                 data = crs$dataset[, c(crs$input, crs$target)],
                 kernel = "polydot",
                 kpar = list("degree" = 1),
                 prob.model = TRUE)
TRIAL #2: For each result file, sum the samples across all sensors at each time point into a single number, blasting away any information about the individual sensors:
result,time1, time2, time3, time4, time5, time6, time7
no, 0, 549, 892, 512, 2252, 0, 8748
yes, 0, 283, 0, 337, 1518, 0, 4608
no, 0, 555, 753, 518, 2501, 0, 8984
yes, 0, 278, 12, 349, 1438, 3, 4441
no, 0, 602, 901, 499, 2391, 0, 7989
yes, 0, 271, 3, 364, 1474, 1, 4599
And again I use Rattle to generate the SVM, and Rattle's log tab gives me the following script:
library(rattle)
building <- TRUE
scoring <- ! building
library(colorspace)
crv$seed <- 42
crs$dataset <- read.csv("file:///C:/Users/mminich/Desktop/stackoverflow/trainingSummary2.csv", na.strings=c(".", "NA", "", "?"), strip.white=TRUE, encoding="UTF-8")
set.seed(crv$seed)
crs$nobs <- nrow(crs$dataset) # 6 observations
crs$sample <- crs$train <- sample(nrow(crs$dataset), 0.67*crs$nobs) # 4 observations
crs$validate <- NULL
crs$test <- setdiff(setdiff(seq_len(nrow(crs$dataset)), crs$train), crs$validate) # 2 observations
# The following variable selections have been noted.
crs$input <- c("time1", "time2", "time3", "time4", "time5", "time6", "time7")
crs$numeric <- c("time1", "time2", "time3", "time4", "time5", "time6", "time7")
crs$categoric <- NULL
crs$target <- "result"
crs$risk <- NULL
crs$ident <- NULL
crs$ignore <- NULL
crs$weights <- NULL
require(kernlab, quietly=TRUE)
set.seed(crv$seed)
crs$ksvm <- ksvm(as.factor(result) ~ .,
                 data = crs$dataset[, c(crs$input, crs$target)],
                 kernel = "polydot",
                 kpar = list("degree" = 1),
                 prob.model = TRUE)
Unfortunately, even with nearly 1000 training datasets, both of the resulting models give me only slightly better results than random chance. I'm pretty sure they would do better if there were a way to avoid blasting away either the temporal information or the distinction between sensors. How can I do that? BTW, I don't know if it's important, but the readings for all sensors are taken at almost exactly the same time, while the time difference between one reading and the next varies by maybe 10 to 20% from one run to the next (i.e. between "training" files), and I have no control over that. I think that's probably safe to ignore (i.e. it's probably safe to just number the readings sequentially: 1, 2, 3, etc.).
An SVM takes a feature vector and uses it to build a classifier. Your feature vectors can have seven dimensions: one for each of the six sensors, plus time as the seventh. Each point in time from which you have a signal produces another vector, so create t vectors Vt, of size 7 each, and make those your feature vectors. Populate them with your data and pass them to ksvm. By adding t as another feature, you not only correlate all the data that happened at a specific time with each other, but you also help the SVM learn that there is a progression of values.
You can then choose a subset of the Vt as a training set. You will have to manually tag these vectors with a label giving the correct classification.
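Here is a minimal sketch of that layout (my own illustration; the data is simulated placeholder noise and every name and value is hypothetical):

library(kernlab)

set.seed(42)
n_runs <- 20      # small stand-in for the ~1000 real runs
n_times <- 100    # time samples per run
n_sensors <- 6

# One row per (run, time) pair: six sensor readings plus the sample
# index as a seventh feature, as described above
labels <- factor(rep(sample(c("yes", "no"), n_runs, replace = TRUE,
                            prob = c(0.1, 0.9)), each = n_times))
X <- cbind(matrix(rnorm(n_runs * n_times * n_sensors), ncol = n_sensors,
                  dimnames = list(NULL, paste0("sensor", 1:n_sensors))),
           time = rep(seq_len(n_times), times = n_runs))

# Fit an SVM on the 7-dimensional feature vectors
fit <- ksvm(X, labels, kernel = "rbfdot", prob.model = TRUE)
predict(fit, X[1:3, , drop = FALSE], type = "probabilities")

A run's final yes/no prediction could then be derived from its 100 per-time-point predictions, e.g. by majority vote or by averaging the predicted class probabilities.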

R: mix() in mixdist package returning error

I have installed the mixdist package in R to combine distributions. Specifically, I'm using the mix() function. See documentation.
Basically, I'm getting
Error in nlm(mixlike, lmixdat = mixdat, lmixpar = fitpar, ldist = dist, :
missing value in parameter
I googled the error message, but no useful results popped up.
My first argument to mix() is a data frame called data.df. It is formatted exactly like the built-in data set pike65. I also did data.df <- as.mixdata(data.df).
My second argument has two rows. It is a data frame called datapar, formatted exactly like pikepar. My pi values are 0.5 and 0.5. My mu values are 250 and 463 (based on my data set). My sigma values are 0.5 and 1.
My call to mix() looks like:
fitdata <- mix(data.df, datapar, "norm", constr = mixconstr(consigma="CCV"), emsteps = 3, print.level = 2)
The printing shows that my pi values go from 0.5 to NaN after the first iteration, and that my gradient is becoming 0.
I would appreciate any help in sorting out this error.
Thanks,
n.i.
Using the test data you linked to:
library(mixdist)
time <- seq(673,723)
counts <-c(3,12,8,12,18,24,39,48,64,88,101,132,198,253,331,
419,563,781,1134,1423,1842,2505,374,6099,9343,13009,
15097,13712,9969,6785,4742,3626,3794,4737,5494,5656,4806,
3474,2165,1290,799,431,213,137,66,57,41,35,27,27,27)
data.df <- data.frame(time = time, counts = counts)
data.mix <- as.mixdata(data.df)
We can see that
startparam <- mixparam(c(699, 707), 1)
data.fit <- mix(data.mix, startparam, "norm")
gives the same error. This error appears to be closely tied to the data (so the reason this data does not work could be different from the reason yours does not, but this is the only example you offered up).
The problem with this data is that the probability of membership in the two groups becomes indistinguishable at some point. When that happens, the "E" step of the algorithm cannot estimate the pi variable properly. Here
pnorm(717,707,1)
# [1] 1
pnorm(717,699,1)
# [1] 1
both values are exactly 1, and this seems to be causing the error. When mix takes 1 minus this value and compares the ratio to estimate group membership, it gets NaN values, which are propagated into the estimates of the proportions. When these NaN values are internally passed to nlm() to do the estimation, you get the error message
Error in nlm(mixlike, lmixdat = mixdat, lmixpar = fitpar, ldist = dist, :
missing value in parameter
The same error message can be replicated with
f <- function(x) sum((x-1:length(x))^2)
nlm(f, c(10,10))
nlm(f, c(10,NaN)) #error
So it appears the mixdist package will not work in this scenario. You may wish to contact the package maintainer to see if they are aware of the problem. In the meantime, you will need to find another way to estimate the parameters of your mixture model.
Now, I am not an expert in mixture distributions, but I think @MrFlick's accepted answer is a little misleading for anyone googling the error message (although no doubt correct for the example he gave). The core problem is that in both your linked code and your example, the sigma values are very small compared to the mu values. I think the algorithm just cannot manage to find a solution with such small starting sigma values. If you increase the sigma values, you will get a solution. Using the linked code as an example:
library(mixdist)
time <- seq(673,723)
counts <- c(3, 12, 8, 12, 18, 24, 39, 48, 64, 88, 101, 132, 198, 253, 331, 419, 563, 781, 1134, 1423, 1842, 2505, 374, 6099, 9343, 13009, 15097, 13712, 9969, 6785, 4742, 3626, 3794, 4737, 5494, 5656, 4806, 3474, 2165, 1290, 799, 431, 213, 137, 66, 57, 41, 35, 27, 27, 27)
data.df <- data.frame(time=time, counts=counts)
data.mix <- as.mixdata(data.df)
startparam <- mixparam(mu = c(699,707), sigma = 1)
data.fit <- mix(data.mix, startparam, "norm") ## Leads to the error message
startparam <- mixparam(mu = c(699,707), sigma = 5) # Adjust start parameters
data.fit <- mix(data.mix, startparam, "norm")
plot(data.fit)
data.fit ### Estimates somewhat reasonable mixture distributions
# Parameters:
# pi mu sigma
# 1 0.853 699.3 4.494
# 2 0.147 708.6 2.217
The bottom line: if you can increase your starting sigma values, the mix function might find reasonable estimates for you. You do not necessarily have to try another package.
In addition, you can get this message if you have missing data in your dataset. Using the example data sets:
data(pike65)
data(pikepar)
pike65$freq[10] <- NA
fitpike1 <- mix(pike65, pikepar, "lnorm", constr = mixconstr(consigma = "CCV"), emsteps = 3)
Error in nlm(mixlike, lmixdat = mixdat, lmixpar = fitpar, ldist =
dist, : missing value in parameter
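One possible workaround (my suggestion, not part of the answer above) is to drop the incomplete rows before calling mix(). Note that this effectively merges the affected grouping interval with its neighbor, so imputing the missing count may be preferable:

pike65.complete <- pike65[complete.cases(pike65), ]
fitpike1 <- mix(pike65.complete, pikepar, "lnorm",
                constr = mixconstr(consigma = "CCV"), emsteps = 3)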

Discrete Deconvolution in R

I have a set of data with two slightly overlapping peaks that I would like to deconvolve into their respective components.
The measured data (variable h) is a function of the first event (variable f) and a second unmeasured event (typically denoted as variable g). The data set may be reconstructed using the following code:
h <- as.numeric(c(256, 208, 139, 406, 316, 226))
f <- as.numeric(c(256, 208, 139))
t <- as.numeric(c(1, 2, 4, 5, 6, 8))
test <- data.frame(h, f, t)
In the data above, the variable t represents time. The first event begins just after t=0 and the second event begins just after t=4. My objective is to figure out how much of the second event (where h = 406, 316, and 226) is attributable to the residual effects of f, and how much is due to g. In other words, I would like to solve for the variable g at t = 5, 6, and 8.
Both of these events can be assumed to follow a monoexponential decay function. [Plot of log10(h) against t not reproduced here.]
In researching this problem, it appears that the R decon package is only suitable for measurement-error problems rather than for this type of discrete deconvolution analysis. Does anyone know of an alternative method to solve this problem?
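One simple approach, offered as a sketch under the stated monoexponential assumption rather than as a definitive method: since a monoexponential decay is linear on the log scale, you can fit log(f) against t for the first event, extrapolate that fit to t = 5, 6, and 8, and subtract the extrapolated tail of f from h to estimate g.

h <- c(256, 208, 139, 406, 316, 226)
t <- c(1, 2, 4, 5, 6, 8)
f <- h[1:3]  # first event, measured at t = 1, 2, 4

# Monoexponential decay is linear on the log scale: log(f) = a + b*t
t_f <- t[1:3]
fit_f <- lm(log(f) ~ t_f)

# Extrapolated residual contribution of f at the later times
f_tail <- exp(predict(fit_f, newdata = data.frame(t_f = t[4:6])))

# Estimated second event
g <- h[4:6] - f_tail
round(g)
# [1] 293 224 164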
