R: How do I aggregate losses by a Poisson observation?

I'm new to R, but I am trying to use it to aggregate losses observed from a severity distribution by an observation from a frequency distribution - essentially what rcompound does. However, I need a more granular approach, as I need to manipulate the severity distribution before 'aggregation'.
Let's take an example. Suppose you have:
rpois(10,lambda=3)
This gives you something like:
[1] 2 2 3 5 2 5 6 4 3 1
Additionally, suppose we have severity of losses determined by:
rgamma(20,shape=1,scale=10000)
So that we also have the following output:
[1] 233.0257 849.5771 7760.4402 731.5646 8982.7640 24172.2369 30824.8424 22622.8826 27646.5168 1638.2333 6770.9010 2459.3722 782.0580 16956.1417 1145.4368 5029.0473 3485.6412 4668.1921 5637.8359 18672.0568
My question is: what is an efficient way to get R to take each Poisson observation in turn and then aggregate losses from my severity distribution? For example, the first Poisson observation is 2. Therefore, adding two observations (the first two) from my Gamma distribution gives 233.0257 + 849.5771 = 1082.60.
I say this needs to be 'efficient' (in run time) because:
- The Poisson parameter may become significantly large, i.e. up to 1,000 or so.
- The realisations are likely to be up to 1,000,000, i.e. up to a million Poisson and Gamma observations to sort through.
Any help would be greatly appreciated.
Thanks, Dave.

It looks like you want to split the gamma vector at positions given by the cumulative sums of the Poisson vector.
The following function (from here) does the splitting:
splitAt <- function(x, pos) unname(split(x, cumsum(seq_along(x) %in% pos)))
pois <- c(2, 2, 3, 5, 2, 5, 6, 4, 3, 1)
gam <- c(233.0257, 849.5771, 7760.4402, 731.5646, 8982.7640, 24172.2369, 30824.8424, 22622.8826, 27646.5168, 1638.2333, 6770.9010, 2459.3722, 782.0580, 16956.1417, 1145.4368, 5029.0473, 3485.6412, 4668.1921, 5637.8359, 18672.0568)
posits <- cumsum(pois)
Then do the following:
sapply(splitAt(gam, posits + 1), sum)
[1] 1082.603 8492.005 63979.843 61137.906 17738.200 19966.153 18672.057
According to the post I linked to above, the splitAt() function slows down for large vectors, so you could (if necessary) consider the alternatives proposed in that post. For my part, I generated 1e6 Poisson and 1e6 gamma observations, and the above approach ran in 0.78 sec on my machine.
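If the splitting step itself ever becomes the bottleneck, a split-free alternative is to build a group index with rep() and sum within groups using rowsum(). This is only a sketch of that idea (aggregate_losses is a name I've made up); it assumes gam holds at least sum(pois) severities, and it returns 0 for Poisson draws of zero:
aggregate_losses <- function(pois, gam) {
  grp <- rep(seq_along(pois), times = pois)  # which Poisson draw each loss belongs to
  totals <- numeric(length(pois))            # stays 0 for Poisson draws of zero
  sums <- rowsum(gam[seq_along(grp)], grp)   # grouped sums of the severities
  totals[as.integer(rownames(sums))] <- sums
  totals
}
pois <- rpois(1e6, lambda = 3)
gam  <- rgamma(sum(pois), shape = 1, scale = 10000)
head(aggregate_losses(pois, gam))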

Related

R: dpois not returning correct probabilities?

I am working with a dataset of the number of truffles found in 288 search areas. I plan to test the null hypothesis that the truffles are distributed randomly, so I am using dpois() to calculate the expected probabilities. There are 4 categories (0, 1, 2, or 3 truffles per plot). The expected probabilities will later be converted to expected proportions and incorporated into a chisq.test analysis.
The problem is that the expected probabilities I get with the following code don't make sense. They should sum to 1, but they are much too small. I ran the exact same code with another dataset and it produced sensible values. What is going on here?
trufflesFound<-c(rep(0,203),rep(1,39),rep(2,18),rep(3,28))
trufflesTable<-table(trufflesFound)
trufflesTable
mean(trufflesTable)
expTruffPois<-dpois(x = 0:3, lambda = mean(trufflesTable))
expTruffPois
These are the probabilities it gives me, which are much too low!
0: 0.00000000000000000000000000000005380186
1: 0.00000000000000000000000000000387373404
2: 0.00000000000000000000000000013945442527
3: 0.00000000000000000000000000334690620643
In contrast, this dataset works just fine:
extinctData<-c(rep(1,13),rep(2,15),rep(3,16),rep(4,7),rep(5,10),rep(6,4),7,7,8,9,9,10,11,14,16,16,20)
extinctFreqTable <- table(extinctData)
extinctFreqTable
mean(extinctFreqTable)
expPois <- dpois(x = 0:20, lambda = mean(extinctFreqTable))
expPois
sum(expPois)
The sum is 0.9999997, which is close to the expected value of 1
Thoughts?
Lambda should be the mean number of truffles per plot, but mean(trufflesTable) returns the average of the category counts. Use mean(trufflesFound) instead. The reason the second one looks "right" is that mean(extinctData) happens to be relatively close to mean(extinctFreqTable).
Note that the probabilities don't sum exactly to 1, because given the mean it is conceivable that we'd observe more than 3 truffles in a future search area.
trufflesFound<-c(rep(0,203),rep(1,39),rep(2,18),rep(3,28))
expTruffPois<-dpois(x = 0:3, lambda = mean(trufflesFound))
expTruffPois
#> [1] 0.57574908 0.31786147 0.08774301 0.01614715
sum(expTruffPois)
#> [1] 0.9975007
Created on 2022-02-08 by the reprex package (v2.0.1)
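As a quick check of where those tiny numbers came from, compare the two candidate lambdas (this just reuses trufflesFound from above):
trufflesFound <- c(rep(0,203), rep(1,39), rep(2,18), rep(3,28))
mean(table(trufflesFound))  # 72: the average of the four category counts (203, 39, 18, 28)
mean(trufflesFound)         # about 0.55: the average number of truffles per plot
# dpois(0:3, lambda = 72) is vanishingly small, which is exactly the output shown above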

R function to find difference in mean greater than or equal to a specific number

I have just started my basic statistics course using R, and we're studying how to use R for paired t-tests. I have come across questions where we're given two sets of data and asked to test whether the difference in means is equal to 0, greater than 0, and so on. The function we use for two samples x and y with unknown variance is similar to the one below:
t.test(x, y, var.equal=TRUE, alternative="greater")
My question is: how would we do this if we wanted to test that the difference in means is greater than or equal to a specified number, against the alternative that it is less than that number, rather than 0?
For example, say we're given before and after weights for 10 people. How do we test that the mean difference in weight is greater than or equal to, say, 3 kg against the alternative that the mean difference in weight is less than 3 kg? Is there a way to do this? I would really appreciate any guidance on this matter.
It might be worthwhile posting on https://stats.stackexchange.com/ as well if you're in need of a more theoretical justification. Is it OK to add/subtract the 3 kg from either x or y and then use the t-test to check for similarity? I think this would at least tell you which outcome is more likely, if that's the end goal. It would be good to get feedback on this.
# number of obs, and rnorm dist for simulating
N <- 10
mu <- 70
sd <- 10
set.seed(1)
x <- round(rnorm(N, mu, sd), 1)
# three outcomes
# (1) no change
y_same <- x + round(rnorm(N, 0, 5), 1)
# (2) average increase of 3
y_imp <- x + rnorm(N, 3, 5)
# (3) average decrease of 3
y_dec <- x + rnorm(N, -3, 5)
# say y_imp is true
y_act <- y_imp
# can we test whether we're closer to the output by altering
# the original data? or conversely, altering y_imp
t_inc <- t.test(x+3, y_act, var.equal=TRUE, alternative="two.sided")
t_dec <- t.test(x-3, y_act, var.equal=TRUE, alternative="two.sided")
t_inc$p.value
[1] 0.8279801
t_dec$p.value
[1] 0.0956033
# one with the highest p.value has the closest distribution, so
# +3 kg more likely than -3kg
You can set mu=3 to change the null hypothesis from 0 to 3 assuming your x variables are in the units you describe above.
t.test(x, y, mu=3, alternative="greater", paired=TRUE)
More (general) information on Stack Exchange [here](https://stats.stackexchange.com/questions/206316/can-a-paired-or-two-group-t-test-test-if-the-difference-between-two-means-is-l/206317#206317).
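For instance, reusing the simulated vectors from the answer above (treating x as the 'before' weights and y_act as the 'after' weights, which is my reading of that code), the test of H0: mean difference >= 3 kg against H1: mean difference < 3 kg could be written as:
# Paired test of H0: mean(after - before) >= 3 vs H1: mean(after - before) < 3
t.test(y_act, x, mu = 3, paired = TRUE, alternative = "less")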

Calculating probabilities of simulated random variables in R

I have a graph (image not included) with nodes A, X, Y and B.
I need to travel from A to B, and I assume that I take the fastest route from A to B every day.
The travel times (in hours) between the nodes are exponentially distributed. I have simulated them, with the relevant lambda values, in R as follows:
AtoX <- rexp(1000, 4)
AtoY <- rexp(1000, 2.5)
XtoY <- rexp(1000, 10)
YtoX <- rexp(1000, 10)
XtoB <- rexp(1000, 3)
YtoB <- rexp(1000, 5)
I calculated the average daily travel time in R as follows:
AXB <- AtoX + XtoB
AYB <- AtoY + YtoB
AXYB <- AtoX + XtoY + YtoB
AYXB <- AtoY + YtoX + XtoB
TravelTimes <- pmin(AXB, AYB, AXYB, AYXB)
averageTravelTime <- mean(TravelTimes)
I'm now trying to find the following for every single day:
(1) With what probability is each of the four possible routes from A to B taken?
(2) What is the probability that I have to travel for more than half an hour?
For (1), I understand that I need the cumulative distribution function (CDF), P(X <= x), for each route.
For (2), I understand that I need P(X > 0.5), i.e. 1 minus the CDF at 0.5, where 0.5 denotes half an hour.
I have only just started learning R, and I am unsure of how to go about doing this.
Reading the documentation, it seems that I might need to do something like the following to calculate the CDF:
pexp()
1 - pexp()
How can I do this?
Let R1, R2, R3, R4 be, in some order, random variables corresponding to the total time of the four routes. Then, being sums of independent exponential random variables, each of them follows the Erlang or the Gamma distribution (see here).
To answer 1, you want to find P(min{R1, R2, R3, R4} = R_i) for i=1,2,3,4. While the minimum of independent exponential random variables is tractable (see here), as far as I know that is not the case with Erlang/Gamma distributions in general. Hence, I believe you need to answer this question numerically, using simulations.
The same applies to the second question requiring to find P(min{R1, R2, R3, R4} >= 1/2).
Hence, we have
table(apply(cbind(AXB, AYB, AXYB, AYXB), 1, which.min)) / 1000
# 1 2 3 4
# 0.312 0.348 0.264 0.076
and
mean(TravelTimes >= 0.5)
# [1] 0.145
as our estimates. By increasing 1000 to some higher number (e.g., 1e6 works fast) one could make those estimates more precise.

Chi squared goodness of fit for a geometric distribution

As an assignment I had to develop an algorithm and generate samples for a given geometric distribution with PMF
P(X = n) = p(1 - p)^(n-1), n = 1, 2, ..., with p = 0.3.
Using the inverse transform method, I came up with the following expression for generating the values:
X = ceiling(log(U) / log(1 - p))
where U represents a value (or n values, depending on the size of the sample) drawn from a Unif(0,1) distribution, and p is 0.3 as stated in the PMF above.
I have the algorithm and the implementation in R, and I have already generated QQ plots to visually assess the agreement of the empirical values with the theoretical ones (generated with R), i.e., whether the generated sample does indeed follow the geometric distribution.
Now I wanted to submit the generated sample to a goodness of fit test, namely the Chi-square, yet I'm having trouble doing this in R.
[I think this was moved a little hastily, in spite of your response to whuber's question, since I think that before solving the 'how do I write this algorithm in R' problem, it's probably more important to deal with the 'what you're doing is not the best approach to your problem' issue (which certainly belongs where you posted it). Since it's here, I will deal with the 'doing it in R' aspect, but I would urge you to go back and ask about the second question (as a new post).]
Firstly the chi-square test is a little different depending on whether you test
H0: the data come from a geometric distribution with parameter p
or
H0: the data come from a geometric distribution with parameter 0.3
If you want the second, it's quite straightforward. First, with the geometric, if you want to use the chi-square approximation to the distribution of the test statistic, you will need to group adjacent cells in the tail. The 'usual' rule - much too conservative - suggests that you need an expected count in every bin of at least 5.
I'll assume you have a nice large sample size. In that case, you'll have many bins with substantial expected counts and you don't need to worry so much about keeping it so high, but you will still need to choose how you will bin the tail (whether you just choose a single cut-off above which all values are grouped, for example).
I'll proceed as if n were say 1000 (though if you're testing your geometric random number generation, that's pretty low).
First, compute your expected counts:
dgeom(0:20,.3)*1000
[1] 300.0000000 210.0000000 147.0000000 102.9000000 72.0300000 50.4210000
[7] 35.2947000 24.7062900 17.2944030 12.1060821 8.4742575 5.9319802
[13] 4.1523862 2.9066703 2.0346692 1.4242685 0.9969879 0.6978915
[19] 0.4885241 0.3419669 0.2393768
Warning: dgeom and friends go from x=0, not x=1; while you can shift the inputs and outputs of the R functions, it's much easier to subtract 1 from all your geometric values and test that. I will proceed as if your sample has had 1 subtracted, so that it starts from 0.
I'll cut that off at the 15th term (x=14), and group 15+ into its own group (a single group in this case). If you wanted to follow the 'greater than five' rule of thumb, you'd cut it off after the 12th term (x=11). In some cases (such as smaller p), you might want to split the tail across several bins rather than one.
> expec <- dgeom(0:14,.3)*1000
> expec <- c(expec, 1000-sum(expec))
> expec
[1] 300.000000 210.000000 147.000000 102.900000 72.030000 50.421000
[7] 35.294700 24.706290 17.294403 12.106082 8.474257 5.931980
[13] 4.152386 2.906670 2.034669 4.747562
The last cell is the "15+" category. We also need the probabilities.
Now we don't yet have a sample; I'll just generate one:
y <- rgeom(1000,0.3)
but now we want a table of observed counts:
(x <- table(factor(y,levels=0:14),exclude=NULL))
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 <NA>
292 203 150 96 79 59 47 25 16 10 6 7 0 2 5 3
Now you could compute the chi-square directly and then calculate the p-value:
> (chisqstat <- sum((x-expec)^2/expec))
[1] 17.76835
(pval <- pchisq(chisqstat,15,lower.tail=FALSE))
[1] 0.2750401
but you can also get R to do it:
> chisq.test(x,p=expec/1000)
Chi-squared test for given probabilities
data: x
X-squared = 17.7683, df = 15, p-value = 0.275
Warning message:
In chisq.test(x, p = expec/1000) :
Chi-squared approximation may be incorrect
Now the case of unspecified p is similar, but (to my knowledge) you can no longer get chisq.test to do it directly; you have to do it the first way, estimating the parameter from the data (by maximum likelihood or minimum chi-square), and then test as above, but with one fewer degree of freedom because of the estimated parameter.
See the example of doing a chi-square for a Poisson with an estimated parameter here; the geometric follows much the same approach as above, with the adjustments described at the link (dealing with the unknown parameter, including the loss of 1 degree of freedom).
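To make that concrete, here is one possible sketch of the estimated-p case (my own illustration, not code from the linked example); it assumes the sample y already starts at 0 (i.e. 1 has been subtracted), estimates p by maximum likelihood, bins the tail at 15+, and drops one extra degree of freedom for the estimated parameter:
y <- rgeom(1000, 0.3)                    # stand-in for your shifted sample
p_hat <- 1/(1 + mean(y))                 # ML estimate of p in the 'starts at 0' form
obs <- table(factor(pmin(y, 15), levels = 0:15))     # lump 15+ into the last bin
expec <- dgeom(0:14, p_hat) * length(y)
expec <- c(expec, length(y) - sum(expec))            # expected count for the "15+" bin
chisqstat <- sum((obs - expec)^2/expec)
pchisq(chisqstat, df = length(expec) - 2, lower.tail = FALSE)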
Let us assume you've got your randomly-generated variates in a vector x. You can do the following:
x <- rgeom(1000,0.2)
x_tbl <- table(x)
x_val <- as.numeric(names(x_tbl))
x_df <- data.frame(count=as.numeric(x_tbl), value=x_val)
# Expand to fill in "gaps" in the values caused by 0 counts
all_x_val <- data.frame(value = 0:max(x_val))
x_df <- merge(all_x_val, x_df, by="value", all.x=TRUE)
x_df$count[is.na(x_df$count)] <- 0
# Get theoretical probabilities
x_df$eprob <- dgeom(x_df$value, 0.2)
# Chi-square test: once with asymptotic dist'n,
# once with bootstrap evaluation of chi-sq test statistic
chisq.test(x=x_df$count, p=x_df$eprob, rescale.p=TRUE)
chisq.test(x=x_df$count, p=x_df$eprob, rescale.p=TRUE,
simulate.p.value=TRUE, B=10000)
There's a "goodfit" function described as "Goodness-of-fit Tests for Discrete Data" in package "vcd".
library(vcd)
G.fit <- goodfit(x, type = "nbinomial", par = list(size = 1))
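If I remember the vcd interface correctly (worth confirming against its documentation), the fitted object can then be summarised and plotted, giving a formal goodness-of-fit test and a rootogram:
summary(G.fit)  # goodness-of-fit test for the fitted distribution
plot(G.fit)     # rootogram of observed versus fitted counts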
I was going to use the code you had posted in an earlier question, but it now appears that you have deleted that code. I find that offensive. Are you using this forum to gather homework answers and then defacing it to remove the evidence? (Deleted questions can still be seen by those of us with sufficient rep, and the interface prevents deletion of questions with upvoted answers, so you should not be able to delete this one.)
Generate a QQ Plot for testing a geometrically distributed sample
--- question---
I have a sample of n elements generated in R with
sim.geometric <- function(nvals)
{
p <- 0.3
u <- runif(nvals)
ceiling(log(u)/log(1-p))
}
for which I want to test the distribution, specifically whether it indeed follows a geometric distribution. I want to generate a QQ plot but have no idea how to.
--------reposted answer----------
A QQ plot should be approximately a straight line when the sample is compared to a "true" sample drawn from a geometric distribution with the same probability parameter. One gives two vectors to the function, which essentially compares their inverse ECDFs at each quantile. (Your attempt is not particularly successful:)
sim.res <- sim.geometric(100)
sim.rgeom <- rgeom(100, 0.3)
qqplot(sim.res, sim.rgeom)
Here I follow the lead of the authors of qqplot's help page (which results in flipping that upper curve around the line of identity):
png("QQ.png")
qqplot(qgeom(ppoints(100),prob=0.3), sim.res,
main = expression("Q-Q plot for" ~~ {G}[n == 100]))
dev.off()
---image not included---
You can add a "line of good fit" by plotting a line through the 25th and 75th percentile points of each distribution. (I added jittering to get a better idea of where the "probability mass" is located:)
sim.res <- sim.geometric(500)
qqplot(jitter(qgeom(ppoints(500),prob=0.3)), jitter(sim.res),
main = expression("Q-Q plot for" ~~ {G}[n == 500]), ylim=c(0,max( qgeom(ppoints(500),prob=0.3),sim.res )),
xlim=c(0,max( qgeom(ppoints(500),prob=0.3),sim.res )))
qqline(sim.res, distribution = function(p) qgeom(p, 0.3),
prob = c(0.25, 0.75), col = "red")
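One detail worth checking (my note, not part of the original answer): sim.geometric() counts trials and so starts at 1, while qgeom()/rgeom() count failures and start at 0, echoing the 'subtract 1' warning in the chi-square answer above. Aligning the two conventions before plotting might look like:
sim.res <- sim.geometric(500)
# shift down by 1 so both axes use R's 'failures before first success' convention
qqplot(qgeom(ppoints(500), prob = 0.3), sim.res - 1,
       main = "Q-Q plot, parameterisations aligned")
qqline(sim.res - 1, distribution = function(p) qgeom(p, 0.3),
       probs = c(0.25, 0.75), col = "red")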

Bootstrapping to compare two groups

In the following code I use bootstrapping to calculate the C.I. and the p-value under the null hypothesis that two different fertilizers applied to tomato plants have no effect on plant yields (the alternative being that the "improved" fertilizer is better). The first random sample (x) comes from plants where a standard fertilizer has been used, while an "improved" one has been used on the plants where the second sample (y) comes from.
x <- c(11.4,25.3,29.9,16.5,21.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
total <- c(x,y)
library(boot)
diff <- function(x,i) mean(x[i[6:11]]) - mean(x[i[1:5]])
b <- boot(total, diff, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
What I don't like about the code above is that resampling is done as if there was only one sample of 11 values (separating the first 5 as belonging to sample x leaving the rest to sample y).
Could you show me how this code should be modified in order to draw resamples of size 5 with replacement from the first sample and separate resamples of size 6 from the second sample, so that bootstrap resampling would mimic the “separate samples” design that produced the original data?
EDIT 2:
Hack deleted, as it was a wrong solution. Instead, one has to use the strata argument of the boot function:
total <- c(x,y)
id <- as.factor(c(rep("x",length(x)),rep("y",length(y))))
b <- boot(total, diff, strata=id, R = 10000)
...
Be aware that you're not going to get even close to a correct estimate of your p-value:
x <- c(1.4,2.3,2.9,1.5,1.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
total <- c(x,y)
b <- boot(total, diff, strata=id, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
> p.value
[1] 0.5162
How would you explain a p-value of 0.51 for two samples where all values of the second are higher than the highest value of the first?
The above code is fine for getting a (biased) estimate of the confidence interval, but significance testing for the difference should be done by permutation over the complete dataset.
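A permutation test along those lines might look like the sketch below (my own illustration of that suggestion, not code from the original answer); it shuffles the group labels over the complete dataset and compares the observed difference in means with the permutation distribution:
set.seed(1)
total <- c(x, y)
id <- rep(c("x", "y"), c(length(x), length(y)))
obs_diff <- mean(total[id == "y"]) - mean(total[id == "x"])
perm_diff <- replicate(10000, {
  shuffled <- sample(id)
  mean(total[shuffled == "y"]) - mean(total[shuffled == "x"])
})
# one-sided p-value: how often a random labelling gives a difference at least as large
mean(perm_diff >= obs_diff)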
Following John, I think the appropriate way to use the bootstrap to test whether the sums of these two samples are significantly different is as follows:
x <- c(1.4,2.3,2.9,1.5,1.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
# boot() expects a statistic of the form function(data, indices), so wrap sum()
sum_stat <- function(d, i) sum(d[i])
b_x <- boot(x, sum_stat, R = 10000)
b_y <- boot(y, sum_stat, R = 10000)
z <- (b_x$t0 - b_y$t0)/sqrt(var(b_x$t[,1]) + var(b_y$t[,1]))
pnorm(z)
So we can clearly reject the null hypothesis that they come from the same population. I may have missed a degrees-of-freedom adjustment; I am not sure how bootstrapping works in that regard, but such an adjustment would not change the results drastically.
While the actual soil beds could be considered a stratified variable in some instances, this is not one of them. You only have the one manipulation, between the groups of plants. Therefore, your null hypothesis is that they really do come from the exact same population. Treating the items as if they're from a single set of 11 samples is the correct way to bootstrap in this case.
If you had two plots, and in each plot tried the different fertilizers over different seasons in a counterbalanced fashion, then the plots would be stratified samples and you'd want to treat them as such. But that isn't the case here.
