Incorrect number of probabilities - r

Arrivals <- sample(c(0,1,2,3,4), size=1, prob = c(.15,.25,.3,.2,.1),replace = TRUE)
Buyers <- sample(Arrivals, size=1, prob = .6, replace = TRUE)
I want to take a sample of a sample.
Here Arrivals gives me back a single integer, yet I still get the error
Error in sample.int(x, size, replace, prob) :
incorrect number of probabilities
I found many answers on here saying that x and prob need to be the same length, and that a mismatch is the typical reason for the error.
But x (Arrivals) and prob are the same length and I still get the error.
Any idea why?

If you pass a single numeric value x to sample(), it assumes you want to sample from 1 to x. That is why it complains about the number of probabilities in your second sample() call, the one for Buyers.
For example, if Arrivals is 2, then calling sample(Arrivals) means "I want to sample from c(1, 2)". But you provide only one probability instead of two, which is why you get the error.
set.seed(123)
Arrivals <- sample(c(0,1,2,3,4), size=1, prob = c(.15,.25,.3,.2,.1), replace = TRUE) # returns 2
Buyers <- sample(Arrivals, size=1, prob = c(.6, .4), replace = TRUE) # runs without error
From the sample documentation:
If x has length 1, is numeric (in the sense of is.numeric) and x >= 1, sampling via sample takes place from 1:x. Note that this convenience feature may lead to undesired behaviour when x is of varying length in calls such as sample(x). See the examples.
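If the intent is that each arrival independently buys with probability 0.6 (an assumption; the question does not say so), a binomial draw avoids the single-number pitfall altogether. A minimal sketch, with lower-case variable names and an arbitrary seed chosen for illustration:

```r
set.seed(123)  # arbitrary seed, for reproducibility only
arrivals <- sample(0:4, size = 1, prob = c(.15, .25, .3, .2, .1))

# Assumption: each arrival buys independently with probability 0.6.
# rbinom() handles arrivals = 0 as well (it returns 0).
buyers <- rbinom(n = 1, size = arrivals, prob = 0.6)

# The original failure mode, captured rather than thrown: the single number 3
# means "sample from 1:3", so one probability is not enough.
msg <- tryCatch(
  sample(3, size = 1, prob = 0.6),
  error = function(e) conditionMessage(e)
)
print(msg)
```

The result always satisfies 0 <= buyers <= arrivals, whatever value arrivals takes.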

Related

Random sampling with sample() gives unexpected results

Consider the following when performing random sampling in R:
n <- 10
k <- 10
p <- 0.10 # proportion of the k objects to subsample
probs <- c(0.30, 0.30, 0.30, rep(0.10/7, 7)) # probabilities for each of the k objects
Here, the roles of n and k are irrelevant; however, there is the condition that n >= k.
x <- sort(sample(k, size = ceiling(p * k), replace = FALSE)) # works
y <- sample(x, size = n, replace = TRUE, prob = probs[x]) # throws error
I am wondering why the function call assigned to y above throws an error.
The error I receive is:
Error in sample.int(x, size, replace, prob) :
incorrect number of probabilities
My thinking is that the 'size' argument to sample() (i.e., n*p) cannot evaluate to 1 in the second function call (y variable), but I haven't been able to find anything documenting this error in the help files to sample().
I know that ceiling() can act strangely in some instances, but I'm not convinced that this could be the issue.
When the above code is run, x is set to the integer data type, e.g., 1L, 2L, etc., which leads to the error in evaluating y.
Does someone have an idea on how to fix this issue?
If x is a single value, sample(x) samples from values 1 through x (see the Details section of the help), or 1 through floor(x) if x isn't an integer. So the prob argument has to be a vector of length x. In your code probs[x] is always a vector of length 1, which causes the error.
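One way around this is the resample() helper shown in the examples of ?sample, which always treats x as a vector of values, even when it has length 1. A sketch applied to the code from the question (the seed is an arbitrary choice):

```r
# Helper from the examples in ?sample: index into x instead of sampling x itself.
resample <- function(x, ...) x[sample.int(length(x), ...)]

n <- 10
k <- 10
p <- 0.10
probs <- c(0.30, 0.30, 0.30, rep(0.10 / 7, 7))

set.seed(1)  # arbitrary seed, for reproducibility only
x <- sort(sample(k, size = ceiling(p * k), replace = FALSE))  # a single value here

# length(probs[x]) equals length(x), so the lengths always agree.
y <- resample(x, size = n, replace = TRUE, prob = probs[x])
print(y)
```

Because x holds a single value in this setup, y is just that value repeated n times; with a larger p the same call samples among several values.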

How should I specify argument "prob" when using sample() for resampling?

In short
I'm trying to better understand the argument prob as part of the function sample in R. In what follows, I both ask a question, and provide a piece of R code in connection with my question.
Question
Suppose I have generated 10,000 standard normal values with rnorm. I then want to draw a sample of size 5 from these 10,000 values.
How should I set the prob argument of sample so that the draw respects the shape of the original sample, which is denser in the middle and thinner in the tails (so that these 5 numbers come from the denser areas more often than from the tails)?
x = rnorm(1e4)
sample( x = x, size = 5, replace = TRUE, prob = ? ) ## what should be "prob" here?
# OR I leave `prob` to be the default by not using it:
sample( x = x, size = 5, replace = TRUE )
You are overthinking this.
You want to resample these samples, following the original distribution or an empirical distribution. Think about how an empirical CDF is obtained:
plot(sort(x), 1:length(x)/length(x))
In other words, the empirical distribution puts the same mass, 1/length(x), on every observation:
plot(sort(x), rep(1/length(x), length(x)))
So we want prob = rep(1/length(x), length(x)), or simply prob = rep(1, length(x)), as sample normalizes prob internally. Or just leave it unspecified, since equal probability is the default.
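A quick check of this, as a sketch: with equal weights, draws land in the dense middle more often simply because more of the values in x lie there. The seed and the resample size below are arbitrary choices, and roughly 0.68 is the standard normal probability of falling within one standard deviation of zero.

```r
set.seed(42)     # arbitrary seed, for reproducibility only
x <- rnorm(1e4)  # the "mother" sample

# Resample with the default (equal) weights; no prob argument is needed.
draws <- sample(x, size = 1e4, replace = TRUE)

# Share of values within one standard deviation of zero, before and after.
share_original <- mean(abs(x) < 1)
share_resampled <- mean(abs(draws) < 1)
print(c(original = share_original, resampled = share_resampled))
```

Both shares should sit close to 0.68, which shows that the density of the original sample carries over without any explicit weights.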

How to vectorise sampling from non-identically distributed Bernoulli random variables?

Given a sequence of independent but not identically distributed Bernoulli trials with success probabilities given by a vector, e.g.:
x <- seq(0, 50, 0.1)
prob <- - x*(x - 50)/1000 # trial probabilities for trials 1 to 501
What is the most efficient way to obtain a random variate from each trial? I am assuming that vectorisation is the way to go.
I know of two functions that give Bernoulli random variates:
rbernoulli from the package purrr, which does not accept a vector of success probabilities as an input. In this case it may be possible to wrap the function in an apply-type operation.
rbinom with arguments size = 1 gives Bernoulli random variates. It also accepts a vector of probabilities, so that:
rbinom(n = length(prob), size = 1, prob = prob)
gives an output with the right length. However, I am not entirely sure that this is actually what I want. The bits in the helpfile ?rbinom that seem relevant are:
The length of the result is determined by n for rbinom, and is the
maximum of the lengths of the numerical arguments for the other
functions.
The numerical arguments other than n are recycled to the length of the
result. Only the first elements of the logical arguments are used.
However, n is a parameter with no default, so I am not sure what the first sentence means. I presume the second sentence means that I get what I want, since only size = 1 should be recycled. However this thread seems to suggest that this method does not work.
This blog post gives some other methods as well. One commentator mentions my suggested idea using rbinom.
Another way to test that rbinom is vectorised over prob takes advantage of the fact that the sum of N Bernoulli random variables with the same success probability is a binomial random variable with size N:
x <- seq(0, 50, 0.1)
prob <- -x*(x - 50)/1000
n <- rbinom(length(prob), size = 1000, prob = prob) # one binomial count per probability
par(mfrow=c(1, 2))
plot(prob ~ x)
plot(n ~ x)
If you don't trust random strangers on the internet and do not understand documentation, maybe you can convince yourself by testing. Just set the random seed to get reproducible results:
x <- seq(0, 50, 0.1)
prob <- - x*(x - 50)/1000
#501 separate draws of 1 random number
set.seed(42)
res1 <- sapply(prob, rbinom, n = 1, size = 1)
#501 "simultaneous" (vectorized) draws
set.seed(42)
res2 <- rbinom(501, 1, prob)
identical(res1, res2)
#[1] TRUE
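For comparison, a common alternative that does not rely on rbinom at all is to compare uniform draws against the probabilities. This is a sketch of that general technique, not something taken from the answers above; the seed is arbitrary.

```r
x <- seq(0, 50, 0.1)
prob <- -x * (x - 50) / 1000  # success probabilities, between 0 and 0.625

set.seed(42)  # arbitrary seed, for reproducibility only

# One uniform draw per trial; trial i succeeds when its draw falls below prob[i].
draws <- as.integer(runif(length(prob)) < prob)

print(length(draws))
print(mean(draws))
```

Each element of draws is 1 with probability prob[i], and trials whose probability is 0 can never succeed because runif() only returns values strictly between 0 and 1.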

How to solve errors in the frbs package of R using the GFS.GCCL method?

I'm using the frbs package in R on my data set with 5-fold stratified cross-validation. I've implemented the stratified CV myself. In each fold I use the GFS.GCCL method in the frbs.learn function and predict the result using the test data. I get this error as well as 30 identical warning messages:
Error: object 'temp.rule.degree' not found
Warning: In max(MF.temp[m, ], na.rm = TRUE) :
no non-missing arguments to max; returning -Inf
My code is below:
library(frbs)
data <- read.csv(file.address)
data[, 30] <- unclass(data[, 30]) # column 30 has the class of samples
# I choose to have 5 attributes since my data is high dimensional
data <- data[, c(1, 14, 20, 26, 27, 30)]
k <- 5 # 5-fold
seed <- 1
folds <- strf.cv(data, k, seed) # stratification function for CV
range.data.inp <- matrix(apply(data[, -ncol(data)], 2, range), nrow = 2)
data <- norm.data(as.matrix(data[, -ncol(data)]), range.data.inp,
                  min.scale = 0.1, max.scale = 1)
ctrl <- list(popu.size = 30, num.class = 2, num.labels = 3,
             persen_cross = 0.9, max.gen = 200, persen_mutant = 0.3,
             name = "sim-1")
for (i in 1:k) {
  str <- paste("fold", i)
  print(str)
  test.ind <- folds[[str]]
  test.data <- data[test.ind, ]
  train.data <- data[-test.ind, ]
  obj <- frbs.learn(train.data, method.type = "GFS.GCCL",
                    range.data.inp, ctrl)
  pred <- predict(obj, test.data)
  print("Predicted classes:")
  print(pred)
}
I have no idea what causes the error and the warnings. Please let me know what I should do.
I've had a similar problem (and others) trying to reproduce the SLAVE learning, starting from the iris example data. I had 2 format issues to solve before I could run this with my artificial data:
my data frame import was giving me integer columns, whereas the learner needs at least numeric.
my distribution of class values was not flat. When I flattened the distribution (3 values, so n/3 samples per value) everything went fine.
That's all I know.
Hope it helps.
I encountered the same issue when running SLAVE and GFS.GCCL. Looking at the source code of the library, I found that in frbs.learn() each method has its own code for calculating the range of the input data, so I think the problem may lie with the range of the input data. For example, the GFS.GCCL source sets its parameters like this:
range.data.input <- range.data
data.train.ori <- data.train
popu.size <- control$popu.size
persen_cross <- control$persen_cross
persen_mutant <- control$persen_mutant
max.gen <- control$max.gen
name <- control$name
n.labels <- control$num.labels
n.class <- control$num.class
num.labels <- matrix(rep(n.labels, ncol(range.data)), nrow = 1)
num.labels <- cbind(num.labels, n.class)
## normalize range of data and data training
range.data.norm <- range.data.input
range.data.norm[1, ] <- 0
range.data.norm[2, ] <- 1
range.data.input.ori <- range.data.input
data.tra.norm <- norm.data(data.train[, 1 : ncol(data.train) - 1], range.data.input, min.scale = 0, max.scale = 1)
data.train <- cbind(data.tra.norm, matrix(data.train[, ncol(data.train)], ncol = 1))
In the first line, range.data comes either from your own specification or from the default setting of frbs.learn(). By default, it holds the min and max of each variable. In the source code:
range.data <- rbind(dt.min, dt.max)
After that, the range of data taken by the GFS.GCCL is
range.data.norm <- range.data.input
range.data.norm[1, ] <- 0
range.data.norm[2, ] <- 1
which is between 0 and 1. GFS.GCCL also takes range.data.input as a parameter, so it uses both range.data.norm and range.data.input.
Therefore, I think the error arises when some internal calculation relies on range.data.input holding the actual min and max of each variable, but the supplied values are not those min and max.
In summary, after I removed the range.data argument from the frbs.learn() call, both GFS.GCCL and SLAVE worked for me.
You can download the source code from here:
https://cran.r-project.org/web/packages/frbs/index.html
You can find the code for GFS.GCCL and SLAVE in:
FRBS.MainFunction.R
GFS.Methods.R
In addition to @Pilip38's good advice, I have three other ideas that have fixed similar errors for me while working with the frbs package.
Most important: Make sure your output variable is never equal to 0. It looks like you have a binary output variable, so I am hoping that just adding 1 to it, so that it is 1/2 instead of 0/1, will work.
Try setting your range.data.inp matrix to be all 0's in the first row and all 1's in the second. Naturally it's better to have a tighter range, but it may be causing your bug.
Try decreasing the number of labels to 2.
It can be a brittle procedure.
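The data-preparation suggestions from these answers (numeric rather than integer columns, class labels that are never 0, and a range matrix with minima in the first row and maxima in the second) can be sketched in base R. The data frame df and its column names are hypothetical toy data, and none of this is confirmed to fix the error; it only shows the shape of the inputs being described.

```r
# Hypothetical toy data: two integer predictors and a 0/1 class column (last).
df <- data.frame(
  a = c(1L, 4L, 2L, 8L),
  b = c(10L, 30L, 20L, 40L),
  class = c(0L, 1L, 0L, 1L)
)

# Convert every column from integer to numeric (double).
df[] <- lapply(df, as.numeric)

# Shift the class labels from 0/1 to 1/2 so the output is never 0.
df$class <- df$class + 1

# 2 x p range matrix for the input columns only:
# row 1 holds the minima, row 2 the maxima.
inputs <- df[, -ncol(df), drop = FALSE]
range.data.inp <- apply(inputs, 2, range)
print(range.data.inp)
```

The resulting matrix has one column per input variable and excludes the class column.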

"sample" and "rbinom" functions in R

I guess it has been asked before, but I'm still a bit rusty about "sample" and "rbinom" functions in R, and would like to ask the following two simple questions:
a) Let's say we have:
rbinom(n = 5, size = 1, prob = c(0.9,0.2,0.3))
So "n" = 5 but "prob" is given for only three of the draws. Which probabilities does R use for the remaining two?
b) Let's say we have:
sample(x = 1:3, size = 1, prob = c(.5,0.2,0.9))
According to R-help (?sample):
The optional prob argument can be used to give a vector of weights
for obtaining the elements of the vector being sampled.
They need not sum to one, but they should be non-negative and not all zero.
The question would be: why does "prob" not need to sum to one?
Any answers would be very appreciated: thank you!
From the documentation for rbinom:
The numerical arguments other than n are recycled to the length of the result.
This means that in your example the prob vector you pass in will be recycled until it reaches the required length (here 5). So the vector which will be used is:
c(0.9, 0.2, 0.3, 0.9, 0.2)
As for the sample function, as @thelatemail pointed out, the probabilities do not have to sum to 1. It appears that the prob vector gets normalized to sum to 1 internally.
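Both behaviours can be checked directly. The sketch below uses degenerate probabilities (0 and 1) so that the recycling is visible without any randomness, then compares seeded draws; the seeds and the sample size of 1e5 are arbitrary choices.

```r
# Degenerate probabilities make the recycling visible:
# prob is recycled to c(1, 0, 1, 1, 0).
recycled <- rbinom(n = 5, size = 1, prob = c(1, 0, 1))
print(recycled)  # 1 0 1 1 0

# Same seed, short vs. explicitly recycled prob: identical draws.
set.seed(1)
a <- rbinom(n = 5, size = 1, prob = c(0.9, 0.2, 0.3))
set.seed(1)
b <- rbinom(n = 5, size = 1, prob = c(0.9, 0.2, 0.3, 0.9, 0.2))
same <- identical(a, b)

# Weights that do not sum to one: all the weight sits on the value 3.
only_three <- sample(x = 1:3, size = 10, replace = TRUE, prob = c(0, 0, 5))

# Observed frequencies approach weights / sum(weights).
set.seed(1)
w <- c(0.5, 0.2, 0.9)
freq <- table(sample(x = 1:3, size = 1e5, replace = TRUE, prob = w)) / 1e5
print(rbind(observed = as.numeric(freq), expected = w / sum(w)))
```

Here w / sum(w) is c(0.3125, 0.125, 0.5625), and the observed frequencies should land close to those values.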

Resources