I want to define the following very simple function:
P(x, n) = exp(-lambda) * sum_{k=0}^{n-1} lambda^k / k!
where lambda = q / x^2 and q = qgamma(0.95, n, 1), i.e. the 0.95 quantile of the Gamma(n, 1) distribution.
My work so far
prob <- function(x, n) {
  quan <- qgamma(0.95, n, 1)
  temp <- quan / x^2
  first_term <- exp(-temp)
  second_term <- temp^(0:(n - 1)) / factorial(0:(n - 1))
  second_term <- sum(second_term)
  first_term * second_term
}
The problem here is that, for big n, the sum (the second term) involves very big numbers, which R treats as infinity.
So for example:
prob(0.5, n = 1000)
[1] NaN
Because the quantile for n = 1000 equals 1052.577, in the numerator we have to calculate 1052.577^999 and in the denominator factorial(999). R evaluates both of these as infinity:
> factorial(999)
[1] Inf
> 1052.577^999
[1] Inf
So when R tries to divide them, NaN is produced. However, the output of this function always lies in the interval (0, 1), since it's a probability. Is there any way to calculate the value of this function at this point?
Your prob function is just the cumulative Poisson with lambda = temp and k = n - 1. Use ppois:
prob <- function(x, n) {
  return(ppois(n - 1, qgamma(0.95, n, 1) / x^2))
}
prob(0.5, n = 1000)
# [1] 0
prob(0.5, n = 1000) returns 0 because n - 1 = 999 is so far below the mean (lambda = qgamma(0.95, 1000, 1)/0.5^2 = 4210.308) that the probability is smaller than the smallest representable double and underflows to 0.
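If the magnitude of that tiny probability is still needed, ppois can return it on the log scale via its log.p argument; a minimal sketch (log_prob is just an illustrative name):
# log of P(N <= n - 1), computed on the log scale to avoid the underflow to 0
log_prob <- function(x, n) {
  ppois(n - 1, qgamma(0.95, n, 1) / x^2, log.p = TRUE)
}
log_prob(0.5, n = 1000)  # a large negative number instead of 0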
Related
I'm trying to compute a specific sum in R as quickly as possible. The object of interest is
sum_{l=1}^{L} [ sum_{k=1}^{K} sum_{j=0}^{x_lk - 1} log(alpha_lk + j) - sum_{j=0}^{N_l - 1} log(A_l + j) ]
and the relevant input objects are two L x K matrices, x (containing only positive integers) and alpha (containing only positive real values). A is equivalent to rowSums(alpha) and N is equivalent to rowSums(x). Subscripts l and k denote a row / a column of alpha or x, respectively.
At first I thought it would be easy to come up with something super quick, but I couldn't find an elegant solution. I think a matrix-valued version of seq() would be very helpful here. Does anyone have a creative solution to implement this efficiently?
Here's an easy-to-read, but obviously inefficient, loop-based version for reference:
# parameters
L = 20
K = 5
# x ... L x K matrix of integers
x = matrix(1 : (L * K), L, K)
# alpha ... L x K matrix of positive real numbers
alpha = matrix(1 : (L * K) / 100, L, K)
# N ... sum over rows of x
N = rowSums(x)
# A ... sum over rows of alpha
A = rowSums(alpha)
# implementation
stacksum = function(x, alpha, N, A){
  # parameters
  K = ncol(x)
  L = nrow(x)
  result = 0
  for(ll in 1:L){
    # first part of sum
    first.sum = 0
    for(kk in 1:K){
      # create sequence
      sequence.k = seq(alpha[ll, kk], (alpha[ll, kk] + x[ll, kk] - 1), 1)
      # take logs and sum
      first.sum = first.sum + sum(log(sequence.k))
    }
    # second part of sum
    second.sum = sum(log(seq(A[ll], (A[ll] + N[ll] - 1), 1)))
    # add to result
    result = result + first.sum - second.sum
  }
  return(result)
}
# test
stacksum(x, alpha, N, A)
Update: added an lgamma solution based on Robert Dodier's comments.
The original approach uses sequence() and rep.int().
# parameters
L <- 20
K <- 5
# x ... L x K matrix of integers
x <- matrix(1 : (L * K), L, K)
# alpha ... L x K matrix of positive real numbers
alpha <- matrix(1 : (L * K) / 100, L, K)
# N ... sum over rows of x
N <- rowSums(x)
# A ... sum over rows of alpha
A <- rowSums(alpha)
# proposed solution
stacksum2 <- function(x, alpha, N, A) {
  # sequence() builds integer-step sequences from integer starting points,
  # so the fractional parts of alpha and A are added back with rep.int()
  sum(log(sequence(x, alpha) + rep.int(alpha %% 1, x))) -
    sum(log(sequence(N, A) + rep.int(A %% 1, N)))
}
# solution from Robert Dodier's comments
stacksum3 <- function(x, alpha, N, A) {
  sum(lgamma(alpha + x) - lgamma(alpha)) - sum(lgamma(A + N) - lgamma(A))
}
# OP solution
stacksum1 = function(x, alpha, N, A){
  # parameters
  K = ncol(x)
  L = nrow(x)
  result = 0
  for(ll in 1:L){
    # first part of sum
    first.sum = 0
    for(kk in 1:K){
      # create sequence
      sequence.k = seq(alpha[ll, kk], (alpha[ll, kk] + x[ll, kk] - 1), 1)
      # take logs and sum
      first.sum = first.sum + sum(log(sequence.k))
    }
    # second part of sum
    second.sum = sum(log(seq(A[ll], (A[ll] + N[ll] - 1), 1)))
    # add to result
    result = result + first.sum - second.sum
  }
  result
}
res <- list(
stacksum1(x, alpha, N, A),
stacksum2(x, alpha, N, A),
stacksum3(x, alpha, N, A)
)
all.equal(res[1:2], res[-1])
#> [1] TRUE
microbenchmark::microbenchmark(stacksum1 = stacksum1(x, alpha, N, A),
stacksum2 = stacksum2(x, alpha, N, A),
stacksum3 = stacksum3(x, alpha, N, A),
check = "equal")
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> stacksum1 1654.2 1704.60 1899.384 1740.80 1964.75 4234.4 100
#> stacksum2 238.2 246.45 258.284 252.35 268.40 319.4 100
#> stacksum3 18.5 19.05 20.981 20.55 21.70 36.4 100
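The lgamma version works because each inner sum of logs collapses into a difference of log-gamma values: sum_{j=0}^{x-1} log(alpha + j) = lgamma(alpha + x) - lgamma(alpha), and likewise for the A/N term. A quick sanity check of that identity (a0 and x0 are arbitrary illustrative values):
a0 <- 2.3  # an arbitrary positive "alpha" value
x0 <- 4    # an arbitrary positive integer "x" value
sum(log(a0 + 0:(x0 - 1)))     # direct sum of logs
lgamma(a0 + x0) - lgamma(a0)  # log-gamma form; both give the same value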
I am trying to create a function that performs the sign test. The function should take a vector x and a median m and return a p-value defined by min(1, 2*min(P{N ≤ i}, P{N ≥ i})), where N ~ Bin(n, 0.5), n is the number of values that are not equal to m, and i is the number of those values that are smaller than m.
Here's what I've gotten so far:
test <- function(x, m) {
  n <- length(x[x != m])
  i <- length(n[n < m])
  min(1, c(pbinom(q = i, size = n, p = 0.5), 1 - pbinom(q = i, size = n, p = 0.5)))
}
However, when I test it out with given values, it gives the wrong answer:
test(x=1:3, m=2)
[1] 0.25
test(x = 1:5, m = 2)
[1] 0.0625
The right answers should be 1 and 0.625. I don't know which step I have done wrong.
You're on the right track. Aside from two minor errors (you forgot the factor 2, and there is a problem in your calculation of i: note that in your script n is an integer, not a vector!), there's a more subtle but important error in the third term inside the min:
We have
1 - P(N ≤ i) = P(N > i), which is different from P(N ≥ i) for discrete probability distributions (like the binomial distribution).
You can fix this by subtracting a tiny value eps from i in the last term: since N is integer-valued, P(N ≤ i - eps) = P(N < i), so 1 - pbinom(i - eps, ...) gives P(N ≥ i):
test <- function(x, m) {
  # number of entries that are not m
  n <- sum(x != m)
  # number of entries that are smaller than m
  i <- sum(x < m)
  eps <- 1e-10
  min(1,
      2 * pbinom(q = i, size = n, p = 0.5),
      # to get P(N >= i), we need 1 - P(N < i), not 1 - P(N <= i)
      2 * (1 - pbinom(q = i - eps, size = n, p = 0.5))
  )
}
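With these fixes the function reproduces the expected results quoted in the question:
test(x = 1:3, m = 2)
# [1] 1
test(x = 1:5, m = 2)
# [1] 0.625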
I need help with code to generate random numbers subject to constraints.
Specifically, I am trying to simulate random numbers ALFA and BETA from, respectively, a Normal and a Gamma distribution such that ALFA - BETA < 1.
Here is what I have written but it does not work at all.
set.seed(42)
n <- 0
repeat {
n <- n + 1
a <- rnorm(1, 10, 2)
b <- rgamma(1, 8, 1)
d <- a - b
if (d < 1)
alfa[n] <- a
beta[n] <- b
l = length(alfa)
if (l == 10000) break
}
Due to vectorization, it will be faster to generate the numbers "all at once" rather than in a loop:
set.seed(42)
N = 1e5
a = rnorm(N, 10, 2)
b = rgamma(N, 8, 1)
d = a - b
alfa = a[d < 1]
beta = b[d < 1]
length(alfa)
# [1] 36436
This generated 100,000 candidates, 36,436 of which met your criterion. If you want to generate n samples, setting N = 4 * n will probably generate more than enough; keep the first n.
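For example, a minimal sketch of that oversample-and-truncate idea (n, N and keep below are illustrative choices, not part of the original code):
set.seed(42)
n <- 10000                     # number of accepted pairs wanted
N <- 4 * n                     # oversample; the acceptance rate is roughly 36%
a <- rnorm(N, 10, 2)
b <- rgamma(N, 8, 1)
keep <- which(a - b < 1)[1:n]  # indices of the first n accepted candidates
alfa <- a[keep]
beta <- b[keep]
length(alfa)                   # n, unless fewer than n candidates were accepted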
Your loop has 2 problems: (a) you need curly braces to enclose multiple lines after an if statement, and (b) you are using n as an attempt counter when it should be a success counter; as written, your loop will only stop if the 10000th attempt happens to be a success. Move n <- n + 1 inside the if statement to fix this (and note that alfa and beta also need to be created before the loop):
set.seed(42)
n <- 0
alfa = numeric(0)
beta = numeric(0)
repeat {
  a <- rnorm(1, 10, 2)
  b <- rgamma(1, 8, 1)
  d <- a - b
  if (d < 1) {
    n <- n + 1
    alfa[n] <- a
    beta[n] <- b
    l = length(alfa)
    if (l == 500) break
  }
}
But the first way is better: because this version "grows" alfa and beta inside the loop and generates the numbers one at a time, it takes longer to generate 500 numbers than the vectorized code above takes to generate more than 36,000.
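If a loop is needed anyway, pre-allocating alfa and beta instead of growing them removes one of those two costs; a minimal sketch with an illustrative target of 500, matching the break condition above:
set.seed(42)
n_target <- 500
alfa <- numeric(n_target)  # pre-allocated rather than grown inside the loop
beta <- numeric(n_target)
n <- 0
while (n < n_target) {
  a <- rnorm(1, 10, 2)
  b <- rgamma(1, 8, 1)
  if (a - b < 1) {
    n <- n + 1
    alfa[n] <- a
    beta[n] <- b
  }
}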
As commented by @Gregor Thomas, your attempt fails because of the missing curly braces around the body of the if statement. If you would like to skip {} after if, you can try the code below, which collects each accepted pair into a list in a single expression:
set.seed(42)
r <- list()
repeat {
  a <- rnorm(1, 10, 2)
  b <- rgamma(1, 8, 1)
  d <- a - b
  if (d < 1) r[[length(r) + 1]] <- cbind(alfa = a, beta = b)
  if (length(r) == 100000) break
}
r <- do.call(rbind, r)
such that
> head(r)
alfa beta
[1,] 9.787751 12.210648
[2,] 9.810682 14.046190
[3,] 9.874572 11.499204
[4,] 6.473674 8.812951
[5,] 8.720010 8.799160
[6,] 11.409675 10.602608
Consider the Markov chain with state space S = {1, 2}, transition matrix
P = | 1/2  1/2 |
    |  0    1  |
and initial distribution α = (1/2, 1/2).
Simulate 5 steps of the Markov chain (that is, simulate X0, X1, . . . , X5). Repeat the simulation 100
times. Use the results of your simulations to solve the following problems.
Estimate P(X1 = 1|X0 = 1). Compare your result with the exact probability.
My solution:
# returns Xn
func2 <- function(alpha1, mat1, n1)
{
  xn <- alpha1 %*% matrixpower(mat1, n1 + 1)
  return(xn)
}
alpha <- c(0.5, 0.5)
mat <- matrix(c(0.5, 0.5, 0, 1), nrow = 2, ncol = 2)
n <- 10
for (variable in 1:100)
{
  print(func2(alpha, mat, n))
}
What is the difference if I run this code once or 100 times (as is said in the problem-statement)?
How can I find the conditional probability from here on?
Let
alpha <- c(1, 1) / 2
mat <- matrix(c(1 / 2, 0, 1 / 2, 1), nrow = 2, ncol = 2) # Different from yours
be the initial distribution and the transition matrix. Your func2 only computes the n-th step (marginal) distribution, which isn't what is needed, and it doesn't simulate anything. Instead we may use
chainSim <- function(alpha, mat, n) {
  out <- numeric(n)
  out[1] <- sample(1:2, 1, prob = alpha)
  for (i in 2:n)
    out[i] <- sample(1:2, 1, prob = mat[out[i - 1], ])
  out
}
where out[1] is generated using only the initial distribution and then for subsequent terms we use the transition matrix.
Then we have
set.seed(1)
# Doing once
chainSim(alpha, mat, 1 + 5)
# [1] 2 2 2 2 2 2
so the chain started at 2 and got stuck there, since state 2 is absorbing under the specified transition probabilities.
Doing it for 100 times we have
# Doing 100 times
sim <- replicate(chainSim(alpha, mat, 1 + 5), n = 100)
rowMeans(sim - 1)
# [1] 0.52 0.78 0.87 0.94 0.99 1.00
where the output shows, for each of the six time points X0, ..., X5, how often we ended up in state 2 rather than state 1. That gives one (out of many) reasons why 100 repetitions are more informative: a single simulation got stuck at state 2, while repeating it 100 times explores many more possible paths.
Then the conditional probability can be found with
mean(sim[2, sim[1, ] == 1] == 1)
# [1] 0.4583333
while the true probability is 0.5 (given by the upper left entry of the transition matrix).
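For reference, the same estimate written as an explicit count of transitions (the helper names are just illustrative; this is equivalent to the mean() above):
n_start_1 <- sum(sim[1, ] == 1)                 # simulated chains with X0 = 1
n_stay_1 <- sum(sim[1, ] == 1 & sim[2, ] == 1)  # of those, chains with X1 = 1
n_stay_1 / n_start_1                            # same value as mean(sim[2, sim[1, ] == 1] == 1)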
In R, I have a vector of integers. From this vector, I would like to randomly reduce the value of each integer element, in order to obtain a sum of the vector that is a given percentage of the initial sum.
In this example, I would like to reduce the vector "x" to a vector "y", where each element has been randomly reduced to obtain a sum of the elements equal to 50% of the initial sum.
The resulting vector should have values that are non-negative and below the original value.
set.seed(1)
perc<-50
x<-sample(1:5,10,replace=TRUE)
xsum<-sum(x) # sum is 33
toremove<-floor(xsum*perc*0.01)
x # 2 2 3 5 2 5 5 4 4 1
y<-magicfunction(x,perc)
y # 0 2 1 4 0 3 2 1 2 1
sum(y) # sum is 16 (rounded half of 33)
Can you think of a way to do it? Thanks!
Assuming that x is long enough, we may rely on some appropriate law of large numbers (also assuming that x is regular enough in certain other ways). For that purpose we will generate values of another random variable Z taking values in [0, 1] and with mean perc: multiplying each element of x by an independent draw of Z gives a vector whose expected sum is perc times the original sum.
set.seed(1)
perc <- 50 / 100
x <- sample(1:10000, 1000)
sum(x)
# [1] 5014161
x <- round(x * rbeta(length(x), perc / 3 / (1 - perc), 1 / 3))
sum(x)
# [1] 2550901
sum(x) * 2
# [1] 5101802
sum(x) * 2 / 5014161
# [1] 1.017479 # About 1.7% deviation
Here for Z I chose a certain beta distribution with mean perc, but you could pick another one too. The lower its variance, the more precise the result. For instance, the following choice is much better, since the previously chosen beta distribution is, in fact, bimodal:
set.seed(1)
perc <- 50 / 100
x <- sample(1:1000, 100)
sum(x)
# [1] 49921
x <- round(x * rbeta(length(x), 100 * perc / (1 - perc), 100))
sum(x)
# [1] 24851
sum(x) * 2
# [1] 49702
sum(x) * 2 / 49921
# [1] 0.9956131 # Less than 0.5% deviation!
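Both parameter choices give a beta distribution with mean perc, since the mean of Beta(a, b) is a / (a + b); the second choice simply has a much smaller variance. A quick check of the means (a1, b1, a2, b2 are just the two parameter pairs used above, with perc = 0.5):
perc <- 0.5
a1 <- perc / 3 / (1 - perc); b1 <- 1 / 3   # first (bimodal) choice
a2 <- 100 * perc / (1 - perc); b2 <- 100   # second (concentrated) choice
c(a1 / (a1 + b1), a2 / (a2 + b2))          # both equal perc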
An alternative solution is this function, which downsamples the original vector by removing counts at indices sampled with probability proportional to the element sizes. It then truncates any element that would fall below zero and iteratively approaches the desired total.
removereads <- function(x, perc = NULL){
  xsum <- sum(x)
  toremove <- floor(xsum * perc)
  toremove2 <- toremove
  irem <- 1
  while(toremove2 > (toremove * 0.01)){
    message("Downsampling iteration ", irem)
    tmp <- sample(1:length(x), toremove2, prob = x, replace = TRUE)
    tmp2 <- table(tmp)
    y <- x
    common <- as.numeric(names(tmp2))
    y[common] <- x[common] - tmp2
    y[y < 0] <- 0
    toremove2 <- toremove - (xsum - sum(y))
    irem <- irem + 1
  }
  return(y)
}
set.seed(1)
x<-sample(1:1000,10000,replace=TRUE)
perc<-0.9
y<-removereads(x,perc)
plot(x,y,xlab="Before reduction",ylab="After reduction")
abline(0,1)
And the graphical results: [plot of x ("Before reduction") against y ("After reduction"), with the identity line from abline(0, 1)]
Here's a solution which uses draws from the Dirichlet distribution:
set.seed(1)
x = sample(10000, 1000, replace = TRUE)
magic = function(x, perc, alpha = 1){
  # sample from the Dirichlet distribution
  # sum(p) == 1
  # lower values should reduce by less than larger values
  # larger alpha means the result will have more "randomness"
  p = rgamma(length(x), x / alpha, 1)
  p = p / sum(p)
  # scale p up an amount so we can subtract it from x
  # and get close to the desired sum
  reduce = round(p * (sum(x) - sum(round(x * perc))))
  y = x - reduce
  # No negatives
  y = c(ifelse(y < 0, 0, y))
  return(y)
}
alpha = 500
perc = 0.7
target = sum(round(perc * x))
y = magic(x, perc, alpha)
# Hopefully close to 1
sum(y) / target
> 1.000048
# Measure of the "randomness"
sd(y / x)
> 0.1376637
Basically, it tries to figure out how much to reduce each element by while still getting close to the sum you want. You can control how "random" you want the new vector by increasing alpha.
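To see how alpha controls the randomness, one can compare the spread of the reduction ratios for a small alpha and for the alpha = 500 used above (a rough sketch; the exact numbers depend on the seed, but the second sd should be clearly larger):
set.seed(1)
sd(magic(x, perc = 0.7, alpha = 1) / x)    # close to a deterministic scaling
sd(magic(x, perc = 0.7, alpha = 500) / x)  # noticeably more spread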