Random sampling with sample() gives unexpected results - r

Consider the following when performing random sampling in R:
n <- 10
k <- 10
p <- 0.10 # proportion of the k objects to subsample
probs <- c(0.30, 0.30, 0.30, rep(0.10/7, 7)) # probabilities for each of the k objects
Here, the roles of n and k are irrelevant; however, there is the condition that n >= k.
x <- sort(sample(k, size = ceiling(p * k), replace = FALSE)) # works
y <- sample(x, size = n, replace = TRUE, prob = probs[x]) # throws error
I am wondering why the function call assigned to y above throws an error.
The error I receive is:
Error in sample.int(x, size, replace, prob) :
incorrect number of probabilities
My thinking is that the 'size' argument to sample() (i.e., n*p) cannot evaluate to 1 in the second function call (y variable), but I haven't been able to find anything documenting this error in the help files to sample().
I know that ceiling() can act strangely in some instances, but I'm not convinced that this could be the issue.
When the above code is run, x is set to the integer data type, e.g., 1L, 2L, etc., which leads to the error in evaluating y.
Does someone have an idea on how to fix this issue?

If x is a single value, sample(x) samples from the values 1 through x (see the Details section of the help), or 1 through floor(x) if x isn't an integer. So the prob argument has to be a vector of length x. In your code ceiling(p * k) is 1, so x is a single value and probs[x] is always a vector of length 1, which causes the error whenever x is greater than 1.
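One possible workaround, sketched here rather than taken from the original answer, is to sample positions with sample.int() and then index into x, so that a length-1 x is never expanded to 1:x. The helper name safe_sample is hypothetical.

```r
# Sample from the elements of x by position, so that a length-1 x
# is treated as a one-element set rather than as 1:x.
# safe_sample is a hypothetical helper name, not part of base R.
safe_sample <- function(x, size, replace = FALSE, prob = NULL) {
  x[sample.int(length(x), size = size, replace = replace, prob = prob)]
}

n <- 10
k <- 10
p <- 0.10
probs <- c(0.30, 0.30, 0.30, rep(0.10 / 7, 7))

x <- sort(sample(k, size = ceiling(p * k), replace = FALSE))
y <- safe_sample(x, size = n, replace = TRUE, prob = probs[x])
```

With these settings x holds a single value, so y simply repeats that value n times instead of raising an error.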

Related

Finding optimal parameter for each input combination in the objective function in an optimization

I am calibrating a model, and for that I have to estimate a parameter for each input combination I give to the objective function. I have a bit more than 10k input combinations, and I want to find the optimal parameter for each combination. All other variables in the model are known. I managed to estimate a single minimal value for the whole set, but that doesn't help me, and when I tried my approach for each combination I got the error: Error in mP[, logik] <- mPv[, logik, drop = FALSE] : NAs are not allowed in subscripted assignments.
My objective function looks like this
x_vol <- vector(mode = "double", length = 10776)
objective_function_vol <- function(x_vol){
  S <- calibration_set$index_level
  K <- calibration_set$strike
  tau <- calibration_set$tau
  r <- calibration_set$riskfree_rate
  q <- calibration_set$q
  model_prices_vol <- vector(mode = "double", length = 10776)
  for (i in 1:10776){
    model_prices_vol[i] <- hestonCallcf(S = S[i], K = K[i], t = tau[i],
                                        r = r[i], q = 0,
                                        v0 = x_vol[i],
                                        vbar = 0.1064688, rho = -0.9914710,
                                        a = 1.6240300, vvol = 0.98839192)
    print(i)
  }
  diff_sq <- (market_price - model_prices_vol)^2
  wdiff <- diff_sq/market_price
  error <- sum(wdiff)/10776
  return(error)
}
I am using NMOF::DEopt for the optimization. Is it possible to write a second loop that stores the optimal values of x_vol? I suspect that using the subscript i for both the known inputs and the unknown parameter is somehow wrong.
The error means that some objective-function calls resulted in NA.
If you only wish to minimize a single parameter (i.e. a scalar), Differential Evolution is probably not the method you want. A grid search along one dimension, possibly with refinements, would likely work better.
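A minimal sketch of such a grid search with refinement is shown below. The objective obj is a hypothetical stand-in for the squared pricing error of one input combination (the real one would call the pricing function), and grid_search_1d, n_grid and n_refine are names chosen here for illustration.

```r
# Hypothetical one-dimensional objective; its minimum is at v0 = 0.09.
obj <- function(v0) (sqrt(v0) - 0.3)^2

# Coarse grid search that repeatedly narrows the interval around the
# best grid point. Non-finite objective values (NA, NaN) are treated
# as infeasible instead of being passed on to the optimizer.
grid_search_1d <- function(f, lower, upper, n_grid = 101, n_refine = 3) {
  for (pass in seq_len(n_refine)) {
    grid <- seq(lower, upper, length.out = n_grid)
    vals <- vapply(grid, f, numeric(1))
    vals[!is.finite(vals)] <- Inf
    best <- which.min(vals)
    lower <- grid[max(best - 1, 1)]
    upper <- grid[min(best + 1, n_grid)]
  }
  grid[best]
}

v0_hat <- grid_search_1d(obj, lower = 0.0001, upper = 2)
```

Running this once per input combination, for example inside a loop or vapply() over the rows of the calibration set, would give one estimate per combination.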

Incorrect number of probabilities

Arrivals <- sample(c(0,1,2,3,4), size=1, prob = c(.15,.25,.3,.2,.1),replace = TRUE)
Buyers <- sample(Arrivals, size=1, prob = .6, replace = TRUE)
I want to take a sample of a sample.
Here Arrivals gives me back a single integer. Yet I still get the error
Error in sample.int(x, size, replace, prob) :
incorrect number of probabilities
I found many answers on here saying that x and prob need to be the same length, and that a mismatch is the typical reason for the error.
But x (Arrivals) and prob are the same length and I still get the error.
Any idea why?
If you pass a single numeric value x into sample(), it thinks you want to sample from 1 to x. That's why it is telling you that you have the wrong number of probabilities in your second sample() call for Buyers.
For example, if Arrivals is set to 2, then calling sample(Arrivals) is saying "I want to sample from c(1, 2)". But you only provide one probability instead of two, and that's why you get the error.
set.seed(123)
Arrivals <- sample(c(0,1,2,3,4), size=1, prob = c(.15,.25,.3,.2,.1), replace = TRUE) # returns 2
Buyers <- sample(Arrivals, size=1, prob = c(.6, .4), replace = TRUE) # runs without error
From the sample documentation:
If x has length 1, is numeric (in the sense of is.numeric) and x >= 1, sampling via sample takes place from 1:x. Note that this convenience feature may lead to undesired behaviour when x is of varying length in calls such as sample(x). See the examples.
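If the intent is that each arrival independently becomes a buyer with probability 0.6 (an assumption about the question, not something it states), then a binomial draw may be closer to "a sample of a sample" than a second sample() call. A sketch under that assumption:

```r
set.seed(123)
Arrivals <- sample(0:4, size = 1, prob = c(.15, .25, .3, .2, .1))

# Assumption: every arrival buys independently with probability 0.6,
# so the number of buyers is Binomial(Arrivals, 0.6).
Buyers <- rbinom(1, size = Arrivals, prob = 0.6)
```

This also behaves sensibly when Arrivals is 0, because rbinom() then returns 0.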

R code Gaussian mixture -- numerical expression has 2 elements: only the first used

I'm trying to create a Gaussian Mix function according to these parameters:
For each sample, roll a die with k sides
If the j-th side appears from the roll, draw a sample from Normal(muj, sdj) where muj and sdj are the mean and standard deviation for the j-th Normal distribution respectively. This means you should have k different Normal distributions to choose from. Note that muj is the mathematical form of referring to the j-th element in a vector called mus.
The resulting sample from this Normal is then from a Gaussian Mixture.
Where:
n, an integer that represents the number of independent samples you want from this random variable
mus, a numeric vector with length k
sds, a numeric vector with length k
prob, a numeric vector with length k that indicates the probability of choosing the different Gaussians. This should have a default to NULL.
This is what I came up with so far:
n <- c(1)
mus <- c()
sds <- c()
prob <- c()
rgaussmix <- function(n, mus, sds, prob = NULL){
  if(length(mus) != length(sds)){
    stop("mus and sds have different lengths")
  }
  for(i in 1:seq_len(n)){
    if(is.null(prob)){
      rolls <- c(NA, n)
      rolls <- sample(c(1:length(mus)), n, replace=TRUE)
      avg <- rnorm(length(rolls), mean=mus[rolls], sd=sds[rolls])
    }else{
      rolls <- c(NA, n)
      rolls <- sample(c(1:length(mus), n, replace=TRUE, p=prob))
      avg <- rnorm(length(rolls), mean=mus[rolls], sd=sds[rolls])
    }
  }
  return(avg)
}
rgaussmix(2, 1:3, 1:3)
It seems to match most of the requirements, but it keeps giving me the following error:
numerical expression has 2 elements: only the first used
number of items to replace is not a multiple of replacement length
I've tried looking at the lengths of multiple variables, but I can't seem to figure out where the error is coming from!
Could someone please help me?
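If you do seq_len(2) it gives you:
[1] 1 2
So with n = 2, 1:seq_len(n) is 1:(1:2), which doesn't make sense: the : operator uses only the first element of each operand and warns about the rest. Use seq_len(n) on its own.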
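As a small illustration of why seq_len(n) on its own is the usual loop index, the sketch below counts loop iterations; count_iterations is a hypothetical helper written for this example.

```r
# Count how many times a loop over seq_len(n) runs.
count_iterations <- function(n) {
  iterations <- 0
  for (i in seq_len(n)) iterations <- iterations + 1
  iterations
}

two_runs <- count_iterations(2)   # loop body runs twice
zero_runs <- count_iterations(0)  # seq_len(0) is empty, so the loop is skipped
countdown <- 1:0                  # 1:0 counts down, giving c(1, 0)
```

The last line shows the other common pitfall: a loop over 1:n with n = 0 would still run twice.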
Also you can avoid the loops in your code, by sampling the number of tries you need, for example if you do:
rnorm(3,c(0,10,20),1)
[1] -0.507961 8.568335 20.279245
It gives you 1st sample from the 1st mean, 2nd sample from 2nd mean and so on. So you can simplify your function to:
rgaussmix <- function(n, mus, sds, prob = NULL){
  if(length(mus) != length(sds)){
    stop("mus and sds have different lengths")
  }
  ## default: every component is equally likely
  if(is.null(prob)){
    prob <- rep(1/length(mus), length(mus))
  }
  ## roll the k-sided die n times, then draw from the chosen components
  rolls <- sample(length(mus), n, replace=TRUE, prob=prob)
  avg <- rnorm(n, mean=mus[rolls], sd=sds[rolls])
  avg
}
You can plot the results:
plot(density(rgaussmix(10000,c(0,5,10),c(1,1,1))),main="mixture of 0,5,10")
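Besides plotting, a numerical sanity check is possible: the mean of a mixture is the probability-weighted mean of the component means. The sketch below redefines the function so the block is self-contained; the extra length check on prob and the chosen weights are additions made for this example.

```r
rgaussmix <- function(n, mus, sds, prob = NULL) {
  stopifnot(length(mus) == length(sds))
  k <- length(mus)
  if (is.null(prob)) prob <- rep(1 / k, k)
  stopifnot(length(prob) == k)
  # sample.int() picks component indices 1..k with the given weights
  rolls <- sample.int(k, n, replace = TRUE, prob = prob)
  rnorm(n, mean = mus[rolls], sd = sds[rolls])
}

set.seed(1)
mus <- c(0, 5, 10)
prob <- c(0.2, 0.3, 0.5)
draws <- rgaussmix(1e5, mus = mus, sds = c(1, 1, 1), prob = prob)

empirical_mean <- mean(draws)
expected_mean <- sum(prob * mus)  # 0.2 * 0 + 0.3 * 5 + 0.5 * 10 = 6.5
```

With this many draws the two means should agree to within a few hundredths.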

How to vectorise sampling from non-identically distributed Bernoulli random variables?

Given a sequence of independent but not identically distributed Bernoulli trials with success probabilities given by a vector, e.g.:
x <- seq(0, 50, 0.1)
prob <- - x*(x - 50)/1000 # trial probabilities for trials 1 to 501
What is the most efficient way to obtain a random variate from each trial? I am assuming that vectorisation is the way to go.
I know of two functions that give Bernoulli random variates:
rbernoulli from the package purrr, which does not accept a vector of success probabilities as an input. In this case it may be possible to wrap the function in an apply type operation.
rbinom with arguments size = 1 gives Bernoulli random variates. It also accepts a vector of probabilities, so that:
rbinom(n = length(prob), size = 1, prob = prob)
gives an output with the right length. However, I am not entirely sure that this is actually what I want. The bits in the helpfile ?rbinom that seem relevant are:
The length of the result is determined by n for rbinom, and is the
maximum of the lengths of the numerical arguments for the other
functions.
The numerical arguments other than n are recycled to the length of the
result. Only the first elements of the logical arguments are used.
However, n is a parameter with no default, so I am not sure what the first sentence means. I presume the second sentence means that I get what I want, since only size = 1 should be recycled. However this thread seems to suggest that this method does not work.
This blog post gives some other methods as well. One commentator mentions my suggested idea using rbinom.
Another way to test that rbinom is vectorised for prob, taking advantage of the fact that the sum of N bernoulli random variables is a binomial random variable with denominator N:
x <- seq(0, 50, 0.1)
prob <- -x*(x - 50)/1000
n <- rbinom(length(prob), size=1000, prob=prob)
par(mfrow=c(1, 2))
plot(prob ~ x)
plot(n ~ x)
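The same check can be done numerically instead of by eye. This is a sketch that assumes 1000 Bernoulli trials per probability, as in the plot above; counts and max_dev are names chosen here.

```r
set.seed(1)
x <- seq(0, 50, 0.1)
prob <- -x * (x - 50) / 1000

size <- 1000
# One binomial count per element of prob; prob is used element by element
counts <- rbinom(length(prob), size = size, prob = prob)

# If prob is vectorised over, each observed proportion tracks its own prob
max_dev <- max(abs(counts / size - prob))
```

If rbinom() reused only the first probability (which is 0 here), every count would be 0 and max_dev would be about 0.625, the largest value in prob.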
If you don't trust random strangers on the internet and do not understand documentation, maybe you can convince yourself by testing. Just set the random seed to get reproducible results:
x <- seq(0, 50, 0.1)
prob <- - x*(x - 50)/1000
#501 separate draws of 1 random number
set.seed(42)
res1 <- sapply(prob, rbinom, n = 1, size = 1)
#501 "simultaneous" (vectorized) draws
set.seed(42)
res2 <- rbinom(501, 1, prob)
identical(res1, res2)
#[1] TRUE

R Estimating parameters of binomial distribution

I'm trying to estimate the parameters n and p of a Binomial distribution by maximum likelihood in R.
I'm using the function optim from stats package, but there is an error.
That is my code:
xi = rbinom(100, 20, 0.5) # Sample
n = length(xi) # Sample size
# Log-Likelihood
lnlike <- function(theta){
log(prod(choose(theta[1],xi))) + sum(xi*log(theta[2])) +
(n*theta[1] - sum(xi))*log(1-theta[2])
}
# Optimizing
optim(theta <- c(10,.3), lnlike, hessian=TRUE)
Error in optim(theta <- c(10, 0.3), lnlike, hessian = TRUE) :
function cannot be evaluated at initial parameters
Has anyone done this? Which function did you use?
tl;dr you're going to get a likelihood of zero (and thus a negative-infinite log-likelihood) if the response variable is greater than the binomial N (which is the theoretical maximum value of the response). In most practical problems, N is taken as known and just the probability is estimated. If you do want to estimate N, you need to (1) constrain it to be >= the largest value in the sample; (2) do something special to optimize over a parameter that must be discrete (this is an advanced/tricky problem).
The first part of this answer shows debugging strategies for identifying the problem; the second illustrates a strategy for optimizing N and p simultaneously (by brute force over a reasonable range of N).
Setup:
set.seed(101)
n <- 100
xi <- rbinom(n, size=20, prob=0.5) # Sample
Log-likelihood function:
lnlike <- function(theta){
log(prod(choose(theta[1],xi))) + sum(xi*log(theta[2])) +
(n*theta[1] - sum(xi))*log(1-theta[2])
}
Let's break this down.
theta <- c(10,0.3) ## starting values
lnlike(c(10,0.3)) ## -Inf
OK, the log-likelihood is -Inf at the starting value. Not surprising that optim() can't work with that.
Let's work through the terms.
log(prod(choose(theta[1],xi))) ## -Inf
OK, we're already in trouble on the first term.
prod(choose(theta[1],xi)) ## 0
The product is zero ... why?
choose(theta[1],xi)
## [1] 120 210 10 0 0 10 120 210 0 0 45 210 1 0
Lots of zeros. Why? What are the values of xi that are problematic?
## [1] 7 6 9 12 11 9 7 6
Aha! We're OK for 7, 6, 9 ... but in trouble with 12.
badvals <- (choose(theta[1],xi)==0)
all(badvals==(xi>10)) ## TRUE
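The same diagnosis can be reproduced with dbinom(), which works on the log scale directly. The sketch below uses only the eight sample values printed above as a small illustrative vector; loglik is a name chosen here.

```r
# The eight observations shown above, used as a small illustrative sample
xi_head <- c(7, 6, 9, 12, 11, 9, 7, 6)

# Binomial log-likelihood with theta = c(size, prob)
loglik <- function(theta, x) {
  sum(dbinom(x, size = theta[1], prob = theta[2], log = TRUE))
}

ll_too_small <- loglik(c(10, 0.3), xi_head)       # size below max(x)
ll_feasible <- loglik(c(max(xi_head), 0.3), xi_head)  # size equal to max(x)
```

Any observation larger than size has probability zero, so the first value is -Inf, while the second is finite.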
If you really want to do this, you can do it by brute-force enumeration over reasonable values of n ...
## likelihood function
llik2 <- function(p,n) {
    -sum(dbinom(xi,prob=p,size=n,log=TRUE))
}
## possible N values (15 to 50)
nvec <- max(xi):50
Lvec <- numeric(length(nvec))
for (i in seq_along(nvec)) {
    ## optim() wants method="Brent"/lower/upper for 1-D optimization
    Lvec[i] <- optim(par=0.5,fn=llik2,n=nvec[i],method="Brent",
                     lower=0.001,upper=0.999)$value
}
nvec[which.min(Lvec)] ## 20
par(las=1,bty="l")
plot(nvec,Lvec,type="b")
Why do you get into trouble?
If you do lnlike(c(10, 0.3)), you get -Inf. That's why your error message is complaining about lnlike rather than optim.
Often, n is known, and only p needs to be estimated. In this situation, both the moment estimator and the maximum likelihood estimator are available in closed form, and no numerical optimization is needed. So, it is really weird to estimate n.
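For a known binomial size, the maximum likelihood estimate of p is the sample mean divided by the size. The sketch below compares that closed form with a numerical optimum; size_known, nll and the use of optimize() are choices made for this illustration.

```r
set.seed(101)
size_known <- 20
xi <- rbinom(100, size = size_known, prob = 0.5)

# Closed-form MLE of p when the binomial size is known
p_closed <- mean(xi) / size_known

# Numerical check: minimize the negative log-likelihood over p
nll <- function(p) -sum(dbinom(xi, size = size_known, prob = p, log = TRUE))
p_numeric <- optimize(nll, interval = c(0.001, 0.999))$minimum
```

The two estimates should agree up to the tolerance of optimize().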
If you do want to estimate n, you have to be aware that it is constrained. Check
range(xi) ## 5 15
Your observations have range [5, 15], therefore it is required that n >= 15. How can you pass an initial value of 10? The search for n should start from a large value and then move gradually downward until it reaches max(xi). So, you might try 30 as the initial value for n.
Additionally, you don't need to define lnlike in the current way. Do this:
lnlike <- function(theta, x) -sum(dbinom(x, size = theta[1], prob = theta[2], log = TRUE))
optim is often used for minimization (though it can do maximization). I have put a minus sign in the function to get negative log likelihood. In this way, you are minimizing lnlike w.r.t. theta.
You should also pass your observations xi as additional argument to lnlike, rather than taking it from global environment.
Naive try with optim:
In my comment, I already said that I don't believe using optim to estimate n will work, because n must be an integer while optim is meant for continuous variables. These errors and warnings should convince you.
optim(c(30,.3), fn = lnlike, x = xi, hessian = TRUE)
Error in optim(c(30, 0.3), fn = lnlike, x = xi, hessian = TRUE) :
non-finite finite-difference value [1]
In addition: There were 15 or more warnings (use warnings() to see the first 15)
> warnings()
Warning messages:
1: In dbinom(x, size = theta[1], prob = theta[2], log = TRUE) : NaNs produced
2: In dbinom(x, size = theta[1], prob = theta[2], log = TRUE) : NaNs produced
3: In dbinom(x, size = theta[1], prob = theta[2], log = TRUE) : NaNs produced
4: In dbinom(x, size = theta[1], prob = theta[2], log = TRUE) : NaNs produced
5: In dbinom(x, size = theta[1], prob = theta[2], log = TRUE) : NaNs produced
Solution?
Ben has shown you a way. Instead of letting optim estimate n, we manually do a grid search for n. For each candidate n, we perform a univariate optimization w.r.t. p. (Oops, in fact, there is no need to do numerical optimization here, because the best p for a given n is available in closed form.) In this way, you get a profile likelihood of n. Then, we find the n on the grid that minimizes the negative profile log-likelihood.
Ben has provided full details, and I shall not repeat them. Nice (and swift) work, Ben!
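A sketch of the profile-likelihood idea without any numerical optimizer follows. It relies on the closed form p = mean(x) / N for fixed N; the names profile_nll, N_grid, N_hat and p_hat, and the upper grid limit of 50, are assumptions made for this example.

```r
set.seed(101)
xi <- rbinom(100, size = 20, prob = 0.5)

# Negative profile log-likelihood of N: for a fixed N, the MLE of p
# is mean(x) / N, so it can be plugged in directly.
profile_nll <- function(N, x) {
  p_hat <- mean(x) / N
  -sum(dbinom(x, size = N, prob = p_hat, log = TRUE))
}

# N cannot be smaller than the largest observation
N_grid <- max(xi):50
nll <- vapply(N_grid, profile_nll, numeric(1), x = xi)

N_hat <- N_grid[which.min(nll)]
p_hat <- mean(xi) / N_hat
```

Because p is eliminated analytically, only the integer grid over N has to be searched.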
