I have data for two populations that are binned by age, with different bins for each population.
Age bins in population 1: 18-24, 25-29, 30-34, 35-45, 46-60, 61+
Age bins in population 2: 15-19, 20-24, 25-29, 30-34 ... 85-89, 90+
I want to infer a continuous distribution from these binned data in order to compare the two populations more directly. I tried fitting an untruncated negative binomial distribution but it was underestimating the lower bins:
So, now I want to try a truncated negative binomial distribution. I did the following:
library(truncdist)
library(fitdistrplus)
dtruncated_nbinom <- function(x)
dtrunc(x, "nbinom", a=18, b=100)
ptruncated_nbinom <- function(q)
ptrunc(q, "nbinom", a=18, b=100)
pop1_nbinom <- fitdistcens(pop1_dt, "truncated_nbinom")
But I got the following error:
Error in computing default starting values.
Error in manageparam(start.arg = start, fix.arg = fix.arg, obs = pseudodata, :
Error in start.arg.default(obs, distname) :
Unknown starting values for distribution truncated_nbinom.
Any advice on how to approach/resolve this?
Here's the pop 1 data:
pop1 <- data.table(left = c(18,25,30,35,46,61), right = c(25,30,35,46,61,100), counts = c(2745,3115,2726,3433,1368,204))
pop1_dt <- pop1[rep(1:nrow(pop1), pop1[,counts]), .(left, right)]
Related
I would like to
obtain the predicted time of the event, given a set of covariates
obtain the time at which the risk is equal to my specified threshold,
given covariates obtain the risk, given time and covariates
All this using ic_par (parametric) or ic_npar (non-parametric) or ic_sp (semi-parametric) models (not bayesian models) from icenReg
There are 3 functions in icenReg (https://cran.r-project.org/web/packages/icenReg/icenReg.pdf) that I believe do at least two of those things:
sampleSurv
getFitEsts
getSCurves
Can someone explain what those three functions do? Especially the difference between sampleSurv and getFitEsts?
From what I understand, the time to the event is modelled as a probability curve.
So you do not obtain a defined predicted time to the event, but rather a probability of this event occuring through time.
Thus, you can obtain the probability of the event to occur after X days, or you can obtain the time at which the event has a probability of X % to have occured.
getFitEsts() will provide these 2 estimates from an object previously fitted by ic_sp(), ic_par() or ic_bayes()
Here is an example of how to obtain these estimates, with an example from icenReg package :
data("IR_diabetes")
flatPrior_fit <- ic_bayes(cbind(left, right) ~ gender, data = IR_diabetes, model = "po", dist = "gamma")
newdata <- data.frame(gender = c(unique(IR_diabetes$gender)))
rownames(newdata) <- c(as.character(unique(IR_diabetes$gender)))
# plot the survival probability curve
plot(flatPrior_fit)
# plot the same curve according to each factor
plot(flatPrior_fit,newdata)
maleCovs <- data.frame(gender = c("male"))
femaleCovs <- data.frame(gender = c("female"))
# median survival time as calculated by the model if males and females are considered together
# = 50 % probability of the event occurring
getFitEsts(flatPrior_fit , p = 0.5)
# median survival time for males
getFitEsts(flatPrior_fit, newdata = maleCovs, p = 0.5)
# median survival time for females
getFitEsts(flatPrior_fit, newdata = femaleCovs, p = 0.5)
# Probability that males died at day 15 ( = 1 - probability that they survived )
getFitEsts(flatPrior_fit, newdata = maleCovs, q = 15)
# Probability that males died at day 15 ( = 1 - probability that they survived )
getFitEsts(flatPrior_fit, newdata = femaleCovs, q = 15)
getScurves() work only for semi parametric models.
It allows to obtain the interval of the survival probability for each time step, as plotted on the curve :
data("IR_diabetes")
# fit a semi parametric model (proportional odds)
sp_fit <- ic_sp(cbind(left, right) ~ gender, data = IR_diabetes, model = "po")
# plot the survival curve
plot(sp_fit, newdata)
# obtain the intervals and associated survival probability of this survival curve for each time step
getSCurves(sp_fit,newdata)
Finally, sampleSurv() draw samples from the probability curve you fitted, contained between the intervals computed, according to the quantile you need. These results are variable because there is multiple possibilities between these intervals.
I hope it helped a bit to understand these functions
We are carrying out an Operational Risk study, in particular we are fitting a severity frequency function with a negative binomial as follows:
# Negative Binomial Fitting
fit = MASS::fitdistr(datosf$Freq,"negative binomial")[[1]]
BN_s <- fit[1]
BN_mu <- fit[2]
# fitdistr parametrises the BN with size and mu, we calculate the parameter p as size/(size+mu)
BN_prob<-fit[1]/(fit[1]+fit[2])
# scale size to model annual frequency
BN_size= BN_s*f_escala
# goodness-of-fit test
chi_2_test = chisq.test(datosf$Freq,rnbinom(n=l,size=BN_s,prob=BN_prob))
# goodness-of-fit plot
nbinom = function(x)dnbinom(x, size = BN_s, mu = BN_mu)
hist(datosf$Freq, freq=FALSE, nclass=50)
curve(nbinom, from=0, to=max(datosf$Freq), n=max(datosf$Freq)+1, add=TRUE, col="blue")
In the data frame datosf$Freq we have the frequency (of the historical series) grouped monthly.
Currently, we have the objective of weighting these years according to the time horizon using the function:
w(t) = 1.05 - t/20 where t is the number of years and t=1,....,10
i.e. the objective is to maximise the following likelihood function:
L(x_i,\theta) = \prod_{i} w_i f(x_i,\theta)
Where x_i is the frequency and f(x_i) is the negative binomial density function.
How can we readapt the code to include the weights w_i?
Thank you very much!
Getting this error when attempting to run my code: Error in model.frame.default(data = list(an = c(0, 0), gn = c(0, 0), pkd = c(0, : variable lengths differ (found for 'age.c')
Here is my code:
library(KMsurv)
library(survival)
data(kidrecurr)
attach(kidrecurr)
head(kidrecurr)
#Long Form
kid.dat <- data.frame(patient=rep(patient, 2), time=c(time1, time2), infect=c(infect1, infect2), age=age, gender=gender, gn=gn, an=an, pkd=pkd)
kid.surv <- Surv(kid.dat$time, kid.dat$infect)
age.c <- kid.dat$age-mean(kid.dat$age)
#Cox ph
kid.ph <- coxph(kid.surv~kid.dat$an+kid.dat$gn+kid.dat$pkd+kid.dat$gender+age.c)
#Frailty Model
kid.f <- coxph(kid.surv~kid.dat$an+kid.dat$gn+kid.dat$pkd+kid.dat$gender+age.c+frailty(kid.dat$patient, dist="gamma"))
#Marginal Model
kid.m <- coxph(kid.surv~kid.dat$an+kid.dat$gn+kid.dat$pkd+kid.dat$gender+age.c+cluster(kid.dat$patient))
#Summaries for confidence intervals, hazard ratios, and p-values
summary(kid.ph)
summary(kid.f)
summary(kid.m)
#Plot of survivals for gender for each model
newdat <- data.frame(an=0, gn=0, pkd=0, gender=0:1, age.c=0); newdat
fit.ph <- survfit(kid.ph, newdat)
fit.m <- survfit(kid.m, newdat)
##Frailty model items for plot
modf.bh <- basehaz(kid.f)
modf.surv0f25 <- exp(-modf.bh$hazard*.25)
modf.surv1f25 <- exp(-modf.bh$hazard*.25)^exp(kid.f$coef)
My goal is to produce a single plot of the estimated survival functions for males and females from the coxph models, frailty model, and marginal model. However, I am getting this error and cannot get the rest to work.
Any help is appreciated! I am not sure what the issue is. I understand that the dataframe should just be one line for each of the gender levels and the rest is "zeroed" out, which I have in my data.frame.
I have tried amending the age variable to have 0 for 76 entries and then gender repeating, but this does not create the plot that I need (multiple survival lines).
I try to find any way for test Poisson residuals like normals in aov(). In my hypothetical example:
# For normal distribution
x <- rep(seq(from=10, to=50, by=0.5),6)
y1 <- rnorm(length(x), mean=10, sd=1.5)
#Normality test in aov residuals
y1.av<-aov(y1 ~ x)
shapiro.test(y1.av$res)
# Shapiro-Wilk normality test
#
#data: y1.av$res
#W = 0.99782, p-value = 0.7885
Sounds silly, OK!!
Now, I'll like to make a same approche but for Poisson distribution:
# For Poisson distribution
x <- rep(seq(from=10, to=50, by=0.5),6)
y2 <- rpois(x, lambda=10)
#Normality test in aov residuals
y2.av<-aov(y2 ~ x)
poisson.test(y2.av$res)
Error in poisson.test(y2.av$res) :
'x' must be finite, nonnegative, and integer
There is any stat approach for make this?
Thanks!
You could analyse your data below a counting context. Discrete data, such as variables of Poisson nature, can be analysed based on observed frequencies. You can formulate hypothesis testing for this task. Being your data y you can contrast the null hypothesis that y follows a Poisson distribution with some parameter lambda against the alternative hypothesis that y does not come from the Poisson distribution. Let's sketch the test with you data:
#Data
set.seed(123)
# For Poisson distribution
x <- rep(seq(from=10, to=50, by=0.5),6)
y2 <- rpois(x, lambda=10)
Now we obtain the counts, which are elemental for the test:
#Values
df <- as.data.frame(table(y2),stringsAsFactors = F)
df$y2 <- as.integer(df$y2)
After that we must separate the observed values O and its groups or categories classes. Both elements constitute the y variable:
#Observed values
O <- df$Freq
#Groups
classes <- df$y2
As we are testing a Poisson distribution, we must compute the lambda parameter. This can be obtained with Maximum Likelihood Estimation (MLE). The MLE for Poisson is the mean (considering we have counts and groups in order to determine this value), so we compute it with next code:
#MLE
meanval <- sum(O*classes)/sum(O)
Now, we have to get the probabilities of each class:
#Probs
prob <- dpois(classes,meanval)
Poisson distribution can go to infinite values, so we must compute the probability for the values that can be greater than our last group in order to have probabilities that sum to one:
prhs <- 1-sum(prob)
This probability can be easily added to the last value of our group in order to transform to account for values greater or equal to it (For example, instead of only having the probability that y equals to 20 we can have the probability that y is greater or equal to 20):
#Add probability
prob[length(prob)]<-prob[length(prob)]+prhs
With this we can conduct a goodness of fit test using chisq.test() function in R. It requires the observed values O and the probabilities prob that we have computed. Just a reminder that this test uses to set wrong degrees of freedom, so we can correct it by the formulation of the test that uses k-q-1 degrees. Where k is the number of groups and q is the number of parameters computed (we have computed one parameter with MLE). Next the test:
chisq.test(O,p=prob)
The output:
Chi-squared test for given probabilities
data: O
X-squared = 7.6692, df = 17, p-value = 0.9731
The key value from the test is the X-squared value which is the test statistic. We can reuse the value to obtain the real p-value (In our example, we have k=18 and minus 2, the degrees of freedom are 16).
The p.value can be obtained with next code:
p.value <- 1-pchisq(7.6692, 16)
The output:
[1] 0.9581098
As this value is not greater that known significance levels we do not reject the null hypothesis and we can affirm that y comes from a Poisson distribution.
I try to find the distribution for this dataset. I tried with the fitdistrplus package
data <- data.matrix(Book1)
descdist(data, discrete = FALSE)
but get this error:
Error in descdist(data, discrete = FALSE) : data must be a numeric vector
You can use instead
data <- as.numeric(Book1)
descdist(data, discrete = FALSE)
This gets you this graph:
And these values:
summary statistics
------
min: 3 max: 35
median: 5
mean: 6.244898
estimated sd: 3.517
estimated skewness: 1.977063
estimated kurtosis: 9.456783
If you then decide that the closest is an exponentional distribution, you can get its parameters like this
ft <- fitdist(data, distr = "exp" )
ft
Fitting of the distribution ' exp ' by maximum likelihood
Parameters:
estimate Std. Error
rate 0.1601307 0.002299016
And you can compare their density using this function:
denscomp(ft)