Verifying a Poisson process with rate λ = 10

I'm working on a scenario where I have to generate some numbers at a rate of 10, use cumsum to sequence them, and then remove anything with a value over 12 (this represents the timings of visitors to a website):
Visits = rexp(4000, rate = 10)
Sequenced = cumsum(Visits)
Sequenced <- Sequenced[Sequenced <= 12]
From here I need to verify that the generated "visits" follows a Poisson process with a rate of 10, but I'm not sure I'm doing this right.
TheMean = mean(Sequenced)
HourlyRate1 = TheMean/12 # divided by 12 as data contains up to 12 hours
This does not generate an answer of (or near) 10 (I thought it would based on the rate parameter of the rexp function).
I am new to this, so I believe I have misunderstood something along the way, but I'm not sure what. Can somebody please point me in the right direction? Using the data generated in the first code segment above, I need to "verify the visits follow a Poisson process with rate λ = 10".

You are measuring the wrong thing.
Since Sequenced (the times of the visits) cannot exceed 12, its mean is likely to be about 6; if that is the case, it simply confirms that you applied the limit of 12.
What does have a Poisson distribution is the number of terms in Sequenced: its expected value is 12 × 10 = 120, with a variance of 120 and so a standard deviation of about 10.95. You could look at that, or divide it by 12 (in which case the expected value is 10 and the standard deviation about 0.9, though that quantity is not Poisson distributed and can take non-integer values), with the R code
NumberOfVisits <- length(Sequenced)
VisitsPerUnitTime <- NumberOfVisits / 12
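One way to check this empirically (a sketch of one possible approach; the 5000 replications and the object names are arbitrary) is to repeat the whole simulation many times and compare the distribution of the counts with a Poisson distribution with mean 120:
counts <- replicate(5000, {
  visits <- cumsum(rexp(4000, rate = 10))
  sum(visits <= 12)   # number of visits in the first 12 hours
})
mean(counts)  # should be close to 120
var(counts)   # should also be close to 120, since a Poisson has mean equal to variance
You could also compare a histogram of counts with dpois(x, 120) for a visual check.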

Related

How can I create a normal distributed set of data in R?

I'm a newbie in statistics and I'm studying R.
I decided to do this exercise to practice some analysis with an original dataset.
This is the issue: I want to create a dataset of, let's say, 100 subjects, and for each one of them I have a test score.
This test score has a range that goes from 0 to 70 and the mean score is 48 (and it's improbable that someone scores 0).
First I tried to create the set with x <- round(runif(100, min=0, max=70)), but then I found out that the values were not normally distributed using plot(x).
So I searched for another R command and found this, but I couldn't decide the min/max:
ex1 <- round(rnorm(100, mean=48 , sd=5))
I really can't understand what I have to do!
I would like to write a function that gives me a normally distributed set of data, in the range 0-70, with a mean of 48 and a not-too-big standard deviation, in order to do some t-tests later...
Any help?
Thanks a lot in advance guys
The normal distribution, by definition, does not have a min or max. If you go more than a few standard deviations from the mean, the probability density is very small, but not 0. You can truncate a normal distribution, chopping off the tails. Here, I use pmin and pmax to set any values below 0 to 0, and any values above 70 to 70:
ex1 <- round(rnorm(100, mean=48 , sd=5))
ex1 <- pmin(ex1, 70)
ex1 <- pmax(ex1, 0)
You can calculate the probability of an individual observation being below or above a certain point using pnorm. For your mean of 48 and SD of 5, the probability an individual observation is less than 0 is very small:
pnorm(0, mean = 48, sd = 5)
# [1] 3.997221e-22
This probability is so small that the truncation step is unnecessary in most applications. But if you started experimenting with bigger standard deviations, or mean values closer to the bounds, it could become necessary.
This method of truncation is simple, but it is a bit of a hack. If you truncated a distribution to be within 1 SD of the mean using this method, you would end up with spikes at the upper and lower bounds that are even higher than the density at the mean! But it should work well enough for less extreme applications. A more robust method might be to draw more samples than you need, and keep the first n samples that fall within your bounds. If you really care to do things right, there are packages that implement truncated normal distributions.
(Because the normal distribution is symmetric, and 100 is farther from your mean than 0, the probability of an observation > 100 is even smaller.)
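If you would rather avoid the pmin/pmax spikes, the resampling approach mentioned above could look roughly like this (a sketch only; the oversampling factor of 2 is an arbitrary choice, and with sd = 5 the bounds are almost never hit anyway):
n <- 100
candidates <- round(rnorm(2 * n, mean = 48, sd = 5))            # draw more than needed
ex1 <- head(candidates[candidates >= 0 & candidates <= 70], n)  # keep the first n in-bounds values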

Create Simulation in R

I have the following problem:
A casualty insurance company has 1000 policyholders, each of whom will
independently present a claim in the next month with probability 5%.
Assuming that the amounts of the claims made are independent exponential
random variables with mean 800 Dollars.
Does anyone know how to create a simulation in R to estimate the probability
that the sum of those claims exceeds 50,000 Dollars?
This sounds like a homework assignment, so it's probably best to consult with your teacher(s) if you're unsure about how to approach this. Bearing that in mind, here's how I'd go about simulating this:
First, create a function that generates values from an exponential distribution and sums those values, based on the values you give in your problem description.
get_sum_claims <- function(n_policies, prob_claim, mean_claim) {
  sum(rexp(n = n_policies * prob_claim, rate = 1/mean_claim))
}
Next, make this function return the sum of all claims lots of times, and store the results. The line with map_dbl does this, essentially instructing R to return 100000 simulated sums of claims from the get_sum_claims function.
library(tidyverse)
claim_sums <- map_dbl(1:100000, ~ get_sum_claims(1000, 0.05, 800))
Finally, we can calculate the probability that the sum of claims is greater than 50000 by using the code below:
sum(claim_sums > 50000)/length(claim_sums)
This gives a fairly reliable estimate of ~ 0.046 as the probability that the sum of claims exceeds 50000 in a given month.
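Since get_sum_claims always draws exactly 1000 * 0.05 = 50 claims, this estimate can also be cross-checked analytically: the sum of 50 independent exponentials with mean 800 has a gamma distribution (a quick sanity check, not part of the simulation itself):
pgamma(50000, shape = 50, scale = 800, lower.tail = FALSE)  # P(sum of 50 claims > 50000)
# roughly 0.046, in line with the simulated estimate above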
I'm a bit inexperienced with R, but here is my solution.
First, construct a function which simulates a single trial. To do so, one needs to determine how many claims, n, are filed. I hope it is clear that n ~ Binomial(1000, 0.05). Note that you cannot simply assume n = 1000 * 0.05 = 50. By doing so, you would decrease the variance, which will result in a lower probability. I can explain why this is the case if needed. Then, generate and sum n values drawn from an exponential distribution with mean 800.
simulate_total_claims <- function() {
  claim_amounts <- rexp(rbinom(n = 1, size = 1000, prob = 0.05), rate = 1/800)
  total <- sum(claim_amounts)
  return(total)
}
Now, all that needs to be done is run the above function a lot and determine the proportion of runs which have values greater than 50000.
totals <- rerun(.n = 100000, simulate_total_claims())
estimated_prob <- mean(unlist(totals) > 50000)
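To see why fixing the number of claims at 50 understates the spread, one could compare the theoretical standard deviations of the two approaches (a quick check based on the variance of a compound binomial-exponential sum; the algebra is standard but not part of the answer above):
p <- 0.05; mu <- 800; n <- 1000
# each policy contributes B * X with B ~ Bernoulli(p) and X ~ Exp(mean mu),
# so Var(B * X) = E[(B * X)^2] - E[B * X]^2 = 2 * p * mu^2 - (p * mu)^2
sqrt(n * (2 * p * mu^2 - (p * mu)^2))  # about 7899: sd when the number of claims is random
sqrt(n * p * mu^2)                     # about 5657: sd when the number of claims is fixed at 50
The value of sd(unlist(totals)) from the simulation above should land near the first of these two numbers.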

Calculation of Precision, Recall and F-Score with confusionMatrix

We have developed an algorithm that detects the number of repetitions per resistance exercise machine from accelerometer data. People always performed 10 repetitions, 2 sets per machine.
n people × 10 repetitions × 2 sets = total number of repetitions performed.
Now, I wanted to calculate the precision, recall and f-score with confusionMatrix from the caret package.
I made an xlsx file with two rows, representing the real (upper row) and the algorithmically predicted (lower row) number of repetitions.
I coded the following:
reps_prec_phone1<- read.xlsx("Reps_for_Precision_Recall_FSCORE.xlsx", sheet = "2Vec_Phone1", startRow = 0, colNames = FALSE)
reps_prec_pred_phone1<-as.factor(reps_prec_phone1[1,])
reps_prec_real_Phone1<-as.factor(reps_prec_phone1[2,])
result_phone1 <- confusionMatrix(reps_prec_pred_phone1, reps_prec_real_Phone1, mode="prec_recall")
As you can see in the confusionMatrix output, 385 sets (1 set consists of 10 repetitions) were counted instead of 3850 repetitions. Now I am wondering, methodologically, how I can get confusionMatrix to calculate the number of repetitions instead of the number of sets.
In my case the error rate is 1 - Accuracy = 2.5%. Since sets and repetitions differ by a factor of 10, I could simply divide the error rate by 10 and recalculate the accuracy as 1 - 0.0025 = 0.9975. However, does anyone know how to solve this issue with confusionMatrix?
Thank you in advance for your brain power & experience!
There's a theoretical mistake.
A confusion matrix is meant to compare given observations with predicted values. R converts your data to a factor, so your values {10, 11} are interpreted as the levels of that factor rather than as numeric counts, and R then tallies how often each (predicted, real) pair of levels occurs. In short, you have a wrong idea of what a confusion matrix is.
Also, any model will produce biased predictions here because your data are extremely unbalanced; in short, there is almost nothing to predict.
So you don't have a programming problem; it's more a theoretical one. Visit Stack Exchange to clear up these ideas.
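To illustrate the first point with a tiny made-up example (these per-set counts are hypothetical, not the poster's data): because the counts become factor levels, each cell of the resulting table counts whole sets, not individual repetitions.
real <- factor(c(10, 10, 11, 10), levels = c(10, 11))  # true repetitions per set
pred <- factor(c(10, 11, 10, 10), levels = c(10, 11))  # detected repetitions per set
table(pred, real)  # each cell counts sets (columns of the sheet), not repetitions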

R: How to generate a series of exponential deviates that sum to some number

I am trying to generate a series of wait times for a Markov chain where the wait times are exponentially distributed numbers with rate equal to one. However, I don't know the number of transitions of the process, only the total time spent in the process.
So, for example:
t <- rexp(100,1)
tt <- cumsum(c(0,t))
t is a vector of the successive and independent waiting times, and tt is a vector of the actual transition times starting from 0.
Again, the problem is that I don't know the length of t (i.e. the number of transitions), only how much total waiting time will elapse (i.e. the floor of the last entry in tt).
What is an efficient way to generate this in R?
The Wikipedia entry for Poisson process has everything you need. The number of arrivals in the interval has a Poisson distribution, and once you know how many arrivals there are, the arrival times are uniformly distributed within the interval. Say, for instance, your interval is of length 15.
N <- rpois(1, lambda = 15)
arrives <- sort(runif(N, max = 15))
waits <- c(arrives[1], diff(arrives))
Here, arrives corresponds to your tt and waits corresponds to your t (by the way, it's not a good idea to name a vector t, since t is already the transpose function in R). Of course, the last entry of waits has been truncated, but you mentioned only knowing the floor of the last entry of tt anyway. If it's really needed, you could replace it with an independent exponential (bigger than waits[N]), if you like.
If I got this right: you want to know how many transitions it'll take to fill your time interval. Since the transitions are random and unknown, there's no way to predict for a given sample. Here's how to find the answer:
tfoo<-rexp(100,1)
max(which(cumsum(tfoo)<=10))
[1] 10
tfoo<-rexp(100,1) # do another trial
max(which(cumsum(tfoo)<=10))
[1] 14
Now, if you expect to need to draw some huge sample, e.g. rexp(1e10, 1), then maybe you should draw in 'chunks': draw 1e9 samples and see if sum(tfoo) exceeds your time threshold. If so, dig through the cumsum. If not, draw another 1e9 samples, and so on.
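A rough sketch of that chunked approach (the function name, the chunk size and the simple concatenation are all just illustrative):
draw_waits <- function(total_time, chunk = 1e6) {
  waits <- numeric(0)
  running <- 0
  while (running <= total_time) {
    new_waits <- rexp(chunk, rate = 1)  # draw another chunk of exponential waits
    waits <- c(waits, new_waits)
    running <- running + sum(new_waits)
  }
  waits[cumsum(waits) <= total_time]    # keep only the waits that fit in the interval
}
For example, draw_waits(10, chunk = 100) reproduces the small examples above without guessing the sample size in advance.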

Generating sorted random ints without the sort? O(n)

Just been looking at a code golf question about generating a sorted list of 100 random integers. What popped into my head, however, was the idea that you could generate instead a list of positive deltas, and just keep adding them to a running total, thus:
deltas: 1 3 2 7 2
ints: 1 4 6 13 15
In fact, you would use floats, then normalise to fit some upper limit, and round, but the effect is the same.
Although it wouldn't make for shorter code, it would certainly be faster without the sort step. But the thing I have no real handle on is this: Would the resulting distribution of integers be the same as generating 100 random integers from a uniformly distributed probability density function?
Edit: A sample script:
import random,sys
running = 0
max = 1000
deltas = [random.random() for i in range(0,11)]
floats = []
for d in deltas:
    running += d
    floats.append(running)
upper = floats.pop()
ints = [int(round(f/upper*max)) for f in floats]
print(ints)
Whose output (fair dice roll) was:
[24, 71, 133, 261, 308, 347, 499, 543, 722, 852]
UPDATE: Alok's answer and Dan Dyer's comment point out that using an exponential distribution for the deltas would give a uniform distribution of integers.
So you are asking if the numbers generated in this way are going to be uniformly distributed.
You are generating a series:
y_j = ∑_{i=0}^{j} (x_i / A)
where A is the sum of all the x_i, and the x_i are the (positive) deltas.
This can be done iff the x_i are exponentially distributed (with any fixed mean). So, if the x_i are uniformly distributed, the resulting y_j will not be uniformly distributed.
Having said that, it's fairly easy to generate exponential xi values.
One example would be:
sum := 0
for I = 1 to N do:
    X[I] = sum = sum - ln(RAND)
sum = sum - ln(RAND)
for I = 1 to N do:
    X[I] = X[I] / sum
and you will have your random numbers sorted in the range [0, 1).
Reference: Generating Sorted Lists of Random Numbers. The paper has other (faster) algorithms as well.
Of course, this generates floating-point numbers. For uniform distribution of integers, you can replace sum above by sum/RANGE in the last step (i.e., the R.H.S becomes X[I]*RANGE/sum, and then round the numbers to the nearest integer).
A uniform distribution has an upper and a lower bound. If you use your proposed method, and your deltas happen to be chosen large enough that you run into the upper bound before you have generated all your numbers, what would your algorithm do next?
Having said that, you may want to investigate the exponential distribution, which is the distribution of the interval times between random events occurring at a given average rate (i.e. in a Poisson process).
If you take the number range as being 1 to 1000, and you have to use 100 of these numbers, the deltas will have to average at least 10, otherwise you cannot reach the 1000 mark. How about some working to demonstrate it in action...
The chance of any given number appearing in an evenly distributed random selection is 100/1000, i.e. 1/10 - no shock there; take that as the basis.
Assume you start using deltas, and each delta is drawn uniformly from 1 to 10.
The odds of getting the number 1 are 1/10 - seems fine.
The odds of getting the number 2 are 1/10 + (1/10 * 1/10) (because you could hit 2 deltas of 1 in a row, or just hit a 2 as the first delta).
The odds of getting the number 3 are 1/10 + (1/10 * 1/10 * 1/10) + (1/10 * 1/10) + (1/10 * 1/10).
The first case was a delta of 3, the second was hitting 3 deltas of 1 in a row, the third case would be a delta of 1 followed by a 2, and the fourth case was a delta of 2 followed by a 1.
For the sake of my fingers typing, we won't generate the combinations that hit 5.
Immediately the first few numbers have a greater percentage chance than the straight random.
This could be altered by changing the delta value so the fractions are all different, but I do not believe you could find a delta that produced identical odds.
To give an analogy that might make it sink in: if you consider your delta range as just 6 and you run that twice, it is the equivalent of throwing 2 dice - each of the deltas is independent, but you know that a total of 7 has a higher chance of coming up than 2.
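A quick R simulation makes the analogy concrete (purely illustrative):
rolls <- replicate(1e5, sum(sample(1:6, 2, replace = TRUE)))  # two "deltas" of size 1-6
round(table(rolls) / length(rolls), 3)  # a total of 7 turns up far more often than 2 or 12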
I think it will be extremely similar, but the extremes will be different because of the normalization. For example, 100 numbers chosen at random between 1 and 100 could all be 1. However, 100 numbers created using your system could all have deltas of 0.01, but when you normalize them you'll scale them up to be in the range 1 -> 100, which means you'll never get that strange possibility of a set of very low numbers.
Alok's answer and Dan Dyer's comment point out that using an exponential distribution for the deltas would give a uniform distribution of integers.
So the new version of the code sample in the question would be:
import random,sys
running = 0
max = 1000
deltas = [random.expovariate(1.0) for i in range(0,11)]
floats = []
for d in deltas:
    running += d
    floats.append(running)
upper = floats.pop()
ints = [int(round(f/upper*max)) for f in floats]
print(ints)
Note the use of random.expovariate(1.0), a Python exponential distribution random number generator (very useful!). Here it's called with a mean of 1.0, but since the script normalises against the last number in the sequence, the mean itself doesn't matter.
Output (fair dice roll):
[11, 43, 148, 212, 249, 458, 539, 725, 779, 871]
Q: Would the resulting distribution of integers be the same as generating 100 random integers from a uniformly distributed probability density function?
A: Each delta will be uniformly distributed. The central limit theorem tells us that the distribution of a sum of a large number of such deviates (since they have a finite mean and variance) will tend to the normal distribution. Hence the later values in your sequence, which are sums of many deltas, will not be uniformly distributed.
So the short answer is "no". I'm afraid I cannot give a simple solution without doing algebra I don't have time for today!
The reference (1979) in Alok's answer is interesting. It gives an algorithm for generating the uniform order statistics not by addition but by successive multiplication:
max = 1.
for i = N downto 1 do
    out[i] = max = max * RAND^(1/i)
where RAND is uniform on [0,1). This way you don't have to normalize at the end, and in fact don't even have to store the numbers in an array; you could use this as an iterator.
The Exponential Distribution: Theory, Methods and Applications by N. Balakrishnan and Asit P. Basu gives another derivation of this algorithm on page 22 and credits Malmquist (1950).
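For what it's worth, that recurrence is easy to try out in R (a minimal sketch; n = 10 is arbitrary):
n <- 10
out <- numeric(n)
running_max <- 1
for (i in n:1) {
  running_max <- running_max * runif(1)^(1/i)  # shrink the current maximum
  out[i] <- running_max
}
out  # n uniform order statistics on (0, 1), already in increasing order, no sort needed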
You can do it in two passes:
in the first pass, generate deltas between 0 and (MAX_RAND/n)
in the second pass, normalise the random numbers to be within bounds
Still O(n), with good locality of reference.
