In R, I would like to generate a multinomially distributed random number vector of a given size N, for example using rmultinom, but with a maximum size for each of the K boxes.
For example:
set.seed(1)
draw = rmultinom(n = 1, size = 1000, prob = c(67,211,264,166,144,52,2,175))
In this case, size is 1000, the total number of objects that are put into eight boxes (the length of prob), and prob = c(67,211,264,166,144,52,2,175) is the vector of probabilities for the eight boxes (which is internally normalized to sum to 1). In addition, I would like c(67,211,264,166,144,52,2,175) to also be the vector of maximum sizes for the eight boxes.
However, as things stand it is possible to generate counts that exceed c(67,211,264,166,144,52,2,175) (for instance in the example above, draw[7,]=4 is higher than 2), whereas I would like each count to be less than or equal to the maximum size of the corresponding box specified in prob, while draw still sums to size = 1000.
Do you know any function or any simple way to do that? I was not able to find the answer.
From wikipedia: "For n independent trials each of which leads to a success for exactly one of k categories, with each category having a given fixed success probability, the multinomial distribution gives the probability of any particular combination of numbers of successes for the various categories".
The keyword here is independent. Your constraint on the number of times each category can be drawn means the sampling is not independent. If your problem were multinomial, it would be possible - though very unlikely - that all numbers could be drawn from box 7. This is not what you want, so you can't use rmultinom.
Here's a different approach:
# vector of item counts
m <- c(67,211,264,166,144,52,2,175)
# expand the item counts into a single vector with i repeated m[i] times
d <- unlist(lapply(1:length(m), function(x) rep(x, m[x])))
# sample from d without replacement
s <- sample(d, size=1000, replace=FALSE)
# count how many items of each type were sampled
table(factor(s))
1 2 3 4 5 6 7 8
63 197 242 153 135 48 2 160
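For reuse, the same idea can be wrapped in a small helper (a sketch; rcapped_multinom is just a name used here, not an existing function). This is equivalent to a multivariate hypergeometric draw, and using tabulate() with nbins keeps boxes that happen to receive zero draws, which table(factor(s)) would silently drop.
# draw `size` items without replacement from boxes with capacities `caps`
rcapped_multinom <- function(size, caps) {
  pool <- rep(seq_along(caps), times = caps)    # box i appears caps[i] times
  drawn <- sample(pool, size, replace = FALSE)  # sample without replacement
  tabulate(drawn, nbins = length(caps))         # counts per box, each <= caps[i]
}
set.seed(1)
caps <- c(67, 211, 264, 166, 144, 52, 2, 175)
draw <- rcapped_multinom(1000, caps)
draw
sum(draw)          # 1000
all(draw <= caps)  # TRUE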
I am simulating some draws using random numbers. Unfortunately, the generated numbers are not as random as I would like: I find that some of them are linear combinations of the others.
In detail, I have the following starting data:
start_vector = c(1,10,30,40,50,100) # length equal to 6
residual_of_model = 5
n = 1000 # Number of simulations
For each element of start_vector, I simulate n observations from a normal distribution, treating each draw as random noise added to the original value (the one in start_vector):
out_vec <- matrix(NA, nrow = n, ncol = length(start_vector))
for (h_aux in seq_along(start_vector)) {
  random_noise <- rnorm(n, 0, residual_of_model)
  out_vec[, h_aux] <- as.numeric(start_vector[h_aux]) + random_noise
}
At this point, I obtain a matrix with 1000 rows and 6 columns. I expected all the columns and all the rows of this matrix to be linearly independent of one another.
If I check the columns using the findLinearCombos() function from the caret package, I find that they are indeed independent:
caret::findLinearCombos(out_vec)
If I try to evaluate the independence among the rows, using the following code:
caret::findLinearCombos(t(out_vec))
I find that all the rows from 7 to 1000 are reported as linear combinations of the first 6 (the length of start_vector).
This seems really strange to me: I would expect to see no dependencies at all, since the rows are generated by adding random noise drawn with rnorm.
What am I missing? Is there some bug? Thanks in advance!
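For what it's worth, a quick rank check (a sketch, not from the original post) illustrates why this is expected rather than a bug: the rows are vectors of length 6, so at most 6 of them can be linearly independent, and findLinearCombos() therefore reports every remaining row as a combination of those.
# any 1000 x 6 matrix has rank at most 6, regardless of how random its entries are
set.seed(42)
out_vec <- matrix(rnorm(1000 * 6), nrow = 1000, ncol = 6)
qr(out_vec)$rank      # 6: the rank cannot exceed the number of columns
qr(t(out_vec))$rank   # also 6, so only 6 of the 1000 rows can be independent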
Working in R, I need to create a vector of length n with the values randomly drawn from a Poisson distribution with lambda=1, but with a lower bound of 2 and upper bound of 6 (i.e. all numbers will be either 2,3,4,5, or 6).
I am unsure how to do this. I tried creating a for loop that would replace any values outside that range with values inside the range:
set.seed(123)
n<-25 #example length
example<-rpois(n,1)
test<-example #redundant - only duplicating to compare with original *example* values
for (i in 1:length(n)){
  if (test[i]<2||test[i]>6){
    test[i]<-rpois(1,1)
  }
}
But this didn't seem to work (I still get 0s and 1s, etc., in test). Any ideas would be greatly appreciated!
Here is one way to generate n numbers from a Poisson distribution and then replace any values that fall outside the range with random numbers inside the range.
n<-25 #example length
example<-rpois(n,1)
inds <- example < 2 | example > 6
example[inds] <- sample(2:6, sum(inds), replace = TRUE)
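If instead the goal is values that follow the Poisson(1) shape restricted to 2-6 (a truncated Poisson, rather than uniform replacement), one option, sketched below, is to sample 2:6 directly with probabilities proportional to the Poisson pmf; sample() normalizes prob internally.
set.seed(123)
n <- 25
p <- dpois(2:6, lambda = 1)                          # Poisson(1) pmf restricted to 2..6
example <- sample(2:6, n, replace = TRUE, prob = p)  # draw only values in 2..6
range(example)  # always within 2..6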
I am working in R on a multiclass classification problem and I want to use e1071. How is scaling done for multiclass classification? On this page, they say that
“A logical vector indicating the variables to be scaled. If scale is of length 1, the value is recycled as many times as needed. Per default, data are scaled internally (both x and y variables) to zero mean and unit variance. The center and scale values are returned and used for later predictions.”
I am wondering how y is scaled. When we have m classes, we have m columns for y, each with a different mean and variance. So after scaling y, we get a different number in each column for the same class, which doesn't make sense to me.
Could you please let me know what is going on in scaling? I am so curious to know that.
Also, I am wondering what this means:
"If scale is of length 1, the value is recycled as many times as needed."
Let's have a look at the documentation for the argument scale:
A logical vector indicating the variables to be scaled. If scale is of length 1, the value is recycled as many times as needed. Per default, data are scaled internally (both x and y variables) to zero mean and unit variance.
The value expected here is a logical vector (so a vector of TRUE and FALSE). If this vector has as many values as you have columns in your matrix, then the columns are scaled or not according to your vector (e.g. with svm(..., scale = c(TRUE, FALSE, TRUE), ...) the first and third columns are scaled while the second one is not).
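For instance, a call along these lines (a sketch using the built-in iris data; which columns to scale is an arbitrary choice here) scales the first two predictors and leaves the last two untouched:
library(e1071)
# iris has four numeric predictors; scale Sepal.Length and Sepal.Width only
fit <- svm(Species ~ ., data = iris, scale = c(TRUE, TRUE, FALSE, FALSE))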
What happens during scaling is explained in the third sentence quoted above: "data are scaled [...] to zero mean and unit variance". To do this:
you subtract the mean of the column from each value in that column (this is called centering), and
then you divide each value of the column by the column's standard deviation (this is the actual scaling).
You can reproduce the scaling with the following example:
# create a data.frame with four variables
# as you can see the difference between each term of aa and bb is one
# and the difference between each term of cc is 21.63 while dd is random
(df <- data.frame(aa = 11:15,
                  bb = 1:5,
                  cc = 1:5 * 21.63,
                  dd = rnorm(5, 12, 4.2)))
# then we subtract the mean of each column from that column and
# put everything back together into a data.frame
(df1 <- as.data.frame(sapply(df, function(x) {x-mean(x)})))
# you can observe that now the mean value of each column is 0 and
# that aa==bb because the difference between each term was the same
# now we divide each column by its standard deviation
(df1 <- as.data.frame(sapply(df1, function(x) {x/sd(x)})))
# as you can see, the first three columns are now equal because the
# only difference between them was that cc == 21.63*bb
# the data frame df1 is now identical to what you would obtain by
# using the default scaling function `scale`
(df2 <- scale(df))
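A quick check (not part of the original answer) confirms that each scaled column has mean 0 and standard deviation 1, and that the manual two-step version matches scale():
round(colMeans(df2), 10)        # all (numerically) zero
apply(df2, 2, sd)               # all one
max(abs(as.matrix(df1) - df2))  # ~0: df1 and df2 agree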
Scaling is necessary when your columns represent data on different scales. For example, if you wanted to distinguish individuals that are obese from lean ones you could collect their weight, height and waist-to-hip ratio. Weight would probably have values ranging from 50 to 95 kg, while height would be around 175 cm (± 20 cm) and waist-to-hip could range from 0.60 to 0.95. All these measurements are on different scales so that it is difficult to compare them. Scaling the variables solves this problem. Moreover, if one variable reaches high numerical values while the other ones do not, this variable will likely be given more importance during multivariate algorithms. Therefore scaling is advisable in most cases for such methods.
Scaling does affect the mean and the variance of each variable but as it is applied equally to each row (potentially belonging to different classes) this is not a problem.
In some code I want to choose n random numbers in [0,1) which sum to 1.
I do so by choosing the numbers independently in [0,1) and normalizing them by dividing each one by the total sum:
from random import random

numbers = [random() for _ in range(n)]
total = sum(numbers)
numbers = [x / total for x in numbers]
My "problem" is, that the distribution I get out is quite skew. Choosing a million numbers not a single one gets over 1/2. By some effort I've calculated the pdf, and it's not nice.
Here is the weird looking pdf I get for 5 variables:
Do you have an idea for a nice algorithm to choose the numbers, that result in a more uniform or simple distribution?
You are looking to partition the interval from 0 to 1.
Choose n - 1 numbers from 0 to 1, sort them, and take the distances between consecutive values (including the endpoints 0 and 1).
This partitions the interval [0, 1] into n pieces, which should yield the occasional large result that you aren't getting.
Even so, for large values of n, you can generally expect your maximum value to decrease as well, just not as quickly as with your method.
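A minimal sketch of that cut-point idea (written in R, like most of this page; partition_unit is just an illustrative name): n - 1 uniform cut points split [0, 1] into n gaps that sum to 1.
partition_unit <- function(n) {
  cuts <- sort(runif(n - 1))  # n - 1 cut points in [0, 1)
  diff(c(0, cuts, 1))         # the n gaps between consecutive cut points
}
set.seed(1)
x <- partition_unit(5)
x
sum(x)  # 1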
You might be interested in the Dirichlet distribution, which is used to generate quantities that sum to 1 if you're looking for probabilities. There's also a section on how to generate them using gamma distributions here.
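A sketch of the gamma construction (assuming a symmetric Dirichlet with parameter alpha; alpha = 1 gives the uniform distribution on the simplex, which is also what the sorted-cut-point method above produces):
rdirichlet_one <- function(n, alpha = 1) {
  g <- rgamma(n, shape = alpha)  # independent Gamma(alpha, 1) draws
  g / sum(g)                     # normalizing the draws gives a Dirichlet sample
}
set.seed(1)
x <- rdirichlet_one(5)
x
sum(x)  # 1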
Another way to get n random numbers which sum up to 1:
import random
def create_norm_arr(n, remaining=1.0):
    random_numbers = []
    for _ in range(n - 1):
        r = random.random()  # get a random number in [0, 1)
        r = r * remaining
        remaining -= r
        random_numbers.append(r)
    random_numbers.append(remaining)
    return random_numbers
random_numbers = create_norm_arr(5)
print(random_numbers)
print(sum(random_numbers))
This makes large values more likely than plain normalization does, though entries earlier in the list tend to be larger, since each one takes a uniform share of whatever remains.
I am looking to use dbinom() in R to compute a probability. The documentation gives dbinom(x, size, prob, log = FALSE), and I understand all the arguments except x, which is described as a "vector of quantiles". Can anyone explain what that means in context? Say I would like to find the probability of obtaining the number 5 exactly twice if I sample 10 times from the numbers 1-5. In this case the binomial probability would be
choose(10, 2) * (1/5)^2 * (4/5)^8
In your example the "number of times you see a five" is the quantile of interest. Loosely speaking, a "quantile" is a possible value of a random variable. So if you want to find the probability of seeing a 5 exactly x = 2 times out of size = 10 draws, where each number has prob = 1/5 of being drawn, you would enter dbinom(2, 10, 1/5).
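As a quick check, the dbinom() call reproduces the hand-computed value from the question:
dbinom(2, 10, 1/5)                  # 0.3019899
choose(10, 2) * (1/5)^2 * (4/5)^8   # same value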