Discrete Deconvolution in R

I have a set of data with two slightly overlapping peaks that I would like to deconvolve into their respective components.
The measured data (variable h) is a function of the first event (variable f) and a second unmeasured event (typically denoted as variable g). The data set may be reconstructed using the following code:
h <- as.numeric(c(256, 208, 139, 406, 316, 226))
f <- as.numeric(c(256, 208, 139))
t <- as.numeric(c(1, 2, 4, 5, 6, 8))
test <- data.frame(h, f, t)
In the data above, the variable t represents time. The first event begins just after t = 0 and the second begins just after t = 4. My objective is to figure out how much of the second event (where h = 406, 316, and 226) is attributable to the residual effects of f and how much is due to g. In other words, I would like to solve for g at t = 5, 6, and 8.
Both events can be assumed to follow a monoexponential decay function, so each appears as a roughly linear segment when log10(h) is plotted against t (original plot omitted).
From my research, the R decon package appears suitable only for measurement-error problems rather than for this type of discrete deconvolution. Does anyone know of an alternative method for solving this problem?
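One approach worth sketching (my own, not from the original post): since f is monoexponential, log(h) is linear in t over the first event, so an lm() fit to the first three points can be extrapolated to t = 5, 6, and 8 and subtracted from h to estimate g.
# Sketch: estimate g by extrapolating the first event's decay.
# Assumes f decays monoexponentially, i.e. log(h) is linear in t.
test <- data.frame(h = c(256, 208, 139, 406, 316, 226),
                   t = c(1, 2, 4, 5, 6, 8))
fit <- lm(log(h) ~ t, data = test[1:3, ])           # fit the first event only
f_tail <- exp(predict(fit, newdata = test[4:6, ]))  # residual of f at t = 5, 6, 8
g <- test$h[4:6] - f_tail                           # the remainder is attributed to g
g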

survival::survfit (left, right, and interval censoring)

I'm attempting to estimate survival probabilities using the survfit function from the survival package. My dataset consists of animals that were captured at various times over the course of about two years. Some animals died, some were censored after capture, and some lived beyond the end of the study (I'm guessing this means I have left-, right-, and interval-censored data).
I can estimate survival probability using right censoring only, but this assumes all animals were captured on the same day and does not account for animals added over time. What I would like to do is estimate survival as a function of calendar day rather than time since capture.
Example data:
time1 <- c(2, 386, 0, 1, 384, 3, 61, 33, 385, 64)
time2 <- c(366, 665, 285, 665, 665, 454, 279, 254, 665, 665)
censor <- c(3, 3, 3, 3, 3, 3, 3, 3, 3, 3)
region <- c(1, 6, 1, 6, 5, 1, 1, 1, 5, 6)
m1 <- data.frame(time1, time2, censor, region)
Code:
library(survival)
km.2 <- survfit(Surv(time1, time2, censor, type = "interval") ~ region, data = m1)
Note the above code runs but doesn't estimate what I laid out above. I hope this is an issue of specifying certain arguments in the survfit function but this is where I am lost. Thanks for the help
Not sure if you've figured this out by now, since it was nearly a year ago. I'm a bit confused by the experiment you're describing.
However, one item that pops out immediately is time1. I believe you can't have any times start or end at 0. I recommend adding 0.5 or 1 to that specific time observation (and explaining why in your write-up), as sketched below. A 0 value is a likely culprit for why it's not estimating properly.
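A minimal sketch of that suggestion (mine, not part of the original answer; it addresses only the zero start time, not the calendar-time alignment the question asks about):
library(survival)
m1$time1[m1$time1 == 0] <- 0.5  # shift the zero start time, as suggested above
km.2 <- survfit(Surv(time1, time2, censor, type = "interval") ~ region, data = m1)
summary(km.2)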

R: How do I aggregate losses by a Poisson observation?

I'm new to R, but I am trying to use it to aggregate losses observed from a severity distribution by an observation from a frequency distribution, essentially what rcompound does. However, I need a more granular approach, as I need to manipulate the severity distribution before 'aggregation'.
Let's take an example. Suppose you have:
rpois(10,lambda=3)
This gives you something like:
[1] 2 2 3 5 2 5 6 4 3 1
Additionally, suppose we have severity of losses determined by:
rgamma(20,shape=1,scale=10000)
So that we also have the following output:
[1] 233.0257 849.5771 7760.4402 731.5646 8982.7640 24172.2369 30824.8424 22622.8826 27646.5168 1638.2333 6770.9010 2459.3722 782.0580 16956.1417 1145.4368 5029.0473 3485.6412 4668.1921 5637.8359 18672.0568
My question is: what is an efficient way to get R to take each Poisson observation in turn and aggregate the corresponding losses from my severity distribution? For example, the first Poisson observation is 2, so summing the first two observations from my gamma distribution gives 1082.60.
I say this needs to be efficient (in run time) because:
- the Poisson parameter may become large, up to 1000 or so;
- the number of realisations is likely to reach 1,000,000, i.e. up to a million Poisson and gamma observations to process.
Any help would be greatly appreciated.
Thanks, Dave.
It looks like you want to split the gamma vector at positions indicated by the cumulative sums of the Poisson vector.
The following function (from here) does the splitting:
splitAt <- function(x, pos) unname(split(x, cumsum(seq_along(x) %in% pos)))
pois <- c(2, 2, 3, 5, 2, 5, 6, 4, 3, 1)
gam <- c(233.0257, 849.5771, 7760.4402, 731.5646, 8982.7640, 24172.2369, 30824.8424, 22622.8826, 27646.5168, 1638.2333, 6770.9010, 2459.3722, 782.0580, 16956.1417, 1145.4368, 5029.0473, 3485.6412, 4668.1921, 5637.8359, 18672.0568)
posits <- cumsum(pois)
Then do the following:
sapply(splitAt(gam, posits + 1), sum)
[1] 1082.603 8492.005 63979.843 61137.906 17738.200 19966.153 18672.057
According to the post I linked to above, the splitAt() function slows down for large arrays, so you could (if necessary) consider the alternatives proposed in that post. For my part, I generated 1e6 Poissons and 1e6 gammas, and the above function ran in 0.78 seconds on my machine. An index-based alternative is sketched below.
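As an alternative sketch (mine, not from the original answer): if you draw exactly sum(pois) severities, an index-based aggregation with rep() and rowsum() avoids splitting altogether and stays fully vectorised at the ~1e6 scale.
pois <- rpois(1e6, lambda = 3)
gam  <- rgamma(sum(pois), shape = 1, scale = 10000)
idx  <- rep(seq_along(pois), pois)  # which Poisson draw each severity belongs to
agg  <- rowsum(gam, idx)            # aggregate loss per Poisson observation
# Note: Poisson draws of 0 produce no rows in agg; their aggregate loss is 0.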

Using interpolation to derive a function in R

I'm trying to derive an approximating function from some X and Y values in R.
As I understand it, splines can be used, but I can't work out how from the documentation.
Here's what I'd like to do:
x <- c(0, 1, 2, 3, 4)  # x and y must be the same length
y <- c(200, 320, 455, 612, 899)
## Example of goal (findfun is hypothetical)
approxfun <- findfun(x, y, pow = 5)
approxfun
This should return a result of the form f(x) = a*x^5 + b*x^4 + c*x^3 + d*x^2 + e*x + f, where a, b, c, d, e, and f are real numbers.
The core issue I'm trying to tackle is solving an equation of the form sum(f(n), n = 1..N) = max, i.e. finding the N that exhausts a given total of accumulated sums of f. In other words, if I have 100 apples and eat an increasing amount each day as they begin to turn over-ripe, and the amount I eat on day x is f(x), I need to know how many days the apples will last.
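Here is a sketch of one way to do both steps (my own, not from the original post): splinefun() returns an interpolating function directly, and cumsum() over its values answers the how-many-days question. The budget of 3000 is a made-up stand-in for the 100 apples.
# Sketch: interpolate a function through the points, then accumulate it
# until a (hypothetical) budget is exhausted.
x <- c(0, 1, 2, 3, 4)
y <- c(200, 320, 455, 612, 899)
f <- splinefun(x, y)   # cubic-spline interpolant; values beyond x = 4 are extrapolations
budget <- 3000         # hypothetical total to be consumed
days <- which(cumsum(f(1:100)) > budget)[1] - 1  # last day still within budget
days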

How to create a random loss sample in R using an if function

I am currently working on generating some random data for a school project.
I have created a variable in R using a binomial distribution to determine whether an observation had a loss (yes = 1, no = 0).
Next, I am trying to generate the loss amount using a random distribution for all observations that had a loss (= 1).
As my loss amount is a percentage, it can be anywhere between 0 and 1, which is why I am using a beta distribution (see "What Is The Intuition Behind Beta Distribution" on stats.stackexchange).
In a third step, I am looking for an if statement that combines my two variables.
Please find below my code (which is only working for the Loss_Y_N variable):
Loss_Y_N <- rbinom(1000000, 1, 0.01)
Loss_Amount <- dbeta(x, 10, 990, ncp = 0, log = FALSE)  # broken: x is undefined, and dbeta() gives the density, not a random draw
Ideally I can combine the two into something like:
if (Loss_Y_N == 1) Loss_Amount <- rbeta(...)  # ... is meant to be a random draw with mean 0.15 and 0 < x <= 1
else Loss_Amount <- 0
Any input highly appreciated!
Create a vector for your loss proportion, fill in the elements corresponding to losses with draws from the beta, and tweak the beta's parameters until you get the desired result.
N <- 100000
loss_indicator <- rbinom(N, 1, 0.1)
loss_prop <- numeric(N)
loss_prop[loss_indicator > 0] <- rbeta(sum(loss_indicator), 10, 990)
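One caveat to add (my note, not the original answer's): a Beta(a, b) distribution has mean a / (a + b), so Beta(10, 990) has mean 0.01, while the question asked for a mean of 0.15. Something like Beta(3, 17) hits that target:
loss_prop[loss_indicator > 0] <- rbeta(sum(loss_indicator), 3, 17)  # mean 3/20 = 0.15
mean(rbeta(1e5, 3, 17))  # sanity check: roughly 0.15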

R: mix() in mixdist package returning error

I have installed the mixdist package in R to combine distributions. Specifically, I'm using the mix() function. See documentation.
Basically, I'm getting
Error in nlm(mixlike, lmixdat = mixdat, lmixpar = fitpar, ldist = dist, :
missing value in parameter
I googled the error message, but no useful results popped up.
My first argument to mix() is a data frame called data.df. It is formatted exactly like the built-in data set pike65. I also did data.df <- as.mixdata(data.df).
My second argument has two rows. It is a data frame called datapar, formatted exactly like pikepar. My pi values are 0.5 and 0.5. My mu values are 250 and 463 (based on my data set). My sigma values are 0.5 and 1.
My call to mix() looks like:
fitdata <- mix(data.df, datapar, "norm", constr = mixconstr(consigma="CCV"), emsteps = 3, print.level = 2)
The printing shows that my pi values go from 0.5 to NaN after the first iteration, and that my gradient is becoming 0.
I would appreciate any help in sorting out this error.
Thanks,
n.i.
Using the test data you linked to:
library(mixdist)
time <- seq(673,723)
counts <-c(3,12,8,12,18,24,39,48,64,88,101,132,198,253,331,
419,563,781,1134,1423,1842,2505,374,6099,9343,13009,
15097,13712,9969,6785,4742,3626,3794,4737,5494,5656,4806,
3474,2165,1290,799,431,213,137,66,57,41,35,27,27,27)
data.df <- data.frame(time = time, counts = counts)
data.mix <- as.mixdata(data.df)  # convert to a mixdata object (missing from the original snippet)
We can see that
startparam <- mixparam(c(699, 707), 1)
data.fit <- mix(data.mix, startparam, "norm")
gives the same error. This error appears to be closely tied to the data (so the reason this data fails could be different from why yours does, but this is the only example you offered up).
The problem with this data is that the probabilities for the two groups become indistinguishable at some point. When that happens, the "E" step of the algorithm cannot estimate the pi variable properly. Here
pnorm(717,707,1)
# [1] 1
pnorm(717,699,1)
# [1] 1
both are exactly 1 (to double precision), and this seems to be causing the error. When mix() takes 1 minus this value and compares the ratios to assign group membership, it gets NaN values, which propagate to the estimated proportions. When these NaN values are passed internally to nlm() to do the estimation, you get the error message
Error in nlm(mixlike, lmixdat = mixdat, lmixpar = fitpar, ldist = dist, :
missing value in parameter
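To make the underflow concrete (my addition, not part of the original answer): the upper-tail probability is not truly zero; it is simply lost when computed as 1 minus pnorm():
1 - pnorm(717, 707, 1)                  # 0: the tail underflows in double precision
pnorm(717, 707, 1, lower.tail = FALSE)  # ~7.6e-24: the same tail computed directly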
The same error message can be replicated with
f <- function(x) sum((x-1:length(x))^2)
nlm(f, c(10,10))
nlm(f, c(10,NaN)) #error
So it appears the mixdist package will not work in this scenario. You may wish to contact the package maintainer to see if they are aware of the problem. In the meantime, you will need to find another way to estimate the parameters of your mixture model.
Now, I am not an expert in mixture distributions, but I think @MrFlick's accepted answer is a little misleading for anyone googling the error message (although it is no doubt correct for the example he gave). The core problem is that in both your linked code and your example, the sigma values are very small compared to the mu values. The algorithm apparently cannot find a solution with such small starting sigma values; if you increase them, you will get a solution. Using the linked code as an example:
library(mixdist)
time <- seq(673,723)
counts <- c(3, 12, 8, 12, 18, 24, 39, 48, 64, 88, 101, 132, 198, 253, 331, 419, 563, 781, 1134, 1423, 1842, 2505, 374, 6099, 9343, 13009, 15097, 13712, 9969, 6785, 4742, 3626, 3794, 4737, 5494, 5656, 4806, 3474, 2165, 1290, 799, 431, 213, 137, 66, 57, 41, 35, 27, 27, 27)
data.df <- data.frame(time=time, counts=counts)
data.mix <- as.mixdata(data.df)
startparam <- mixparam(mu = c(699,707), sigma = 1)
data.fit <- mix(data.mix, startparam, "norm") ## Leads to the error message
startparam <- mixparam(mu = c(699,707), sigma = 5) # Adjust start parameters
data.fit <- mix(data.mix, startparam, "norm")
plot(data.fit)
data.fit ### Estimates somewhat reasonable mixture distributions
# Parameters:
# pi mu sigma
# 1 0.853 699.3 4.494
# 2 0.147 708.6 2.217
Bottom line: if you increase your starting sigma values, the mix() function may find reasonable estimates for you. You do not necessarily have to try another package.
In addition, you can get this message if you have missing data in your dataset.
Using the example data set:
data(pike65)
data(pikepar)
pike65$freq[10] <- NA
fitpike1 <- mix(pike65, pikepar, "lnorm", constr = mixconstr(consigma = "CCV"), emsteps = 3)
Error in nlm(mixlike, lmixdat = mixdat, lmixpar = fitpar, ldist = dist, :
missing value in parameter
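A quick guard against this case (my suggestion, not part of the original answer): check for missing frequencies before calling mix(). How to fill them is a data decision; treating missing counts as zero is one crude option.
data(pike65)
pike65$freq[10] <- NA                 # reintroduce the NA from above
anyNA(pike65)                         # TRUE: mix() will fail on this input
pike65$freq[is.na(pike65$freq)] <- 0  # crude fill; justify it for your own data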
