sample integer values with specified mean - r

I want to generate a sample of integer numbers in R with a specified mean.
I used mu+sd*scale(rnorm(n)) to generate a sample of n values whose mean is exactly mu, but this produces floating-point values; I would like to generate integer values instead. For example, for mean=4 and sample size n=5, a valid sample would be {2,6,4,3,5}.
Any ideas on how to do this in R while satisfying the constraint of a specific value of the mean?

Picking n values with a mean of m is equivalent to picking n values that sum to m*n. (I'm assuming you're going to stick to positive integers -- otherwise things get much harder!) Here is a solution based on sampling partitions (sets of values that add up to the desired total) uniformly, but I'm not sure it's what you want, since it doesn't sample uniformly over values, but over partitions ... perhaps someone else can do better, or figure out how to reweight the samples.
This brute-force solution will also become impractical for cases much larger than your example (there are 627 partitions of 20, 5604 of 30, 37338 of 40 ...)
m <- 4
n <- 5
library("partitions")
pp <- parts(m*n) ## all sets of integers that sum to m*n (=20 here)
## keep partitions with exactly n (=5) non-zero parts; their non-zero values sit in the first n rows
pp5 <- pp[1:n, colSums(pp > 0) == n]
set.seed(101) ## for reproducibility
## sample uniformly from this set
pp5[,sample(ncol(pp5),size=1)] ## 9, 5, 4, 1, 1
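A quick check, reusing pp5, m and n from above, confirms that every retained partition (and hence every draw) has the requested mean:
all(colMeans(pp5) == m)                    ## TRUE: each column sums to m*n
mean(pp5[, sample(ncol(pp5), size = 1)])   ## 4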

Related

Randomly generated numbers are linear combinations of each other even if not specified

I am simulating some draws using random numbers. Unexpectedly, the generated numbers are not as random as I would like: I find that some of them are linear combinations of the others.
In details, I have the following starting data:
start_vector = c(1,10,30,40,50,100) # length equal to 6
residual_of_model = 5
n = 1000 # Number of simulations
I simulate n observations from a normal distribution for each element of start_vector, treating the draws as random noise to add to the original value (the one in start_vector):
out_vec <- matrix(NA, nrow = n, ncol = length(start_vector))
for (h_aux in 1:length(start_vector)) {
  random_noise <- rnorm(n, 0, residual_of_model)
  out_vec[, h_aux] <- as.numeric(start_vector[h_aux]) + random_noise
}
At this point, I obtain a matrix of size 1000x6 (n rows, one column per element of start_vector). In theory, I expect the columns, and likewise the rows, to be linearly independent of one another.
If I check this using the findLinearCombos() function from the caret package, I find that all the columns are independent:
caret::findLinearCombos(out_vec)
If I try to evaluate the independence among the rows, using the following code:
caret::findLinearCombos(t(out_vec))
I find that all rows from 7 to 1000 are flagged as linear combinations of the first 6 (the length of start_vector).
This seems really strange to me; I would expect to see no dependencies at all, since the rows are generated by adding random noise from rnorm.
What am I missing? Is there some bug? Thanks in advance!
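For reference, a matrix with only 6 columns has rank at most 6, so once 6 rows are linearly independent, every additional row is necessarily a linear combination of them, regardless of how the noise was drawn; a minimal check, reusing out_vec from the code above:
## the rank is bounded by the smaller dimension (6 here), no matter how random the entries are
qr(out_vec)$rank      ## almost surely 6
qr(t(out_vec))$rank   ## the same value: transposing does not change the rank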

Number from sample to be drawn from a Poisson distribution with upper/lower bounds

Working in R, I need to create a vector of length n with the values randomly drawn from a Poisson distribution with lambda=1, but with a lower bound of 2 and upper bound of 6 (i.e. all numbers will be either 2,3,4,5, or 6).
I am unsure how to do this. I tried creating a for loop that would replace any values outside that range with values inside the range:
set.seed(123)
n <- 25  # example length
example <- rpois(n, 1)
test <- example  # redundant - only duplicating to compare with original *example* values
for (i in 1:length(n)) {
  if (test[i] < 2 || test[i] > 6) {
    test[i] <- rpois(1, 1)
  }
}
But this didn't seem to work (I still get 0's and 1's, etc., in test). Any ideas would be greatly appreciated!
Here is one way to generate n numbers from a Poisson distribution and then replace all numbers that fall outside the range with random numbers inside the range.
n<-25 #example length
example<-rpois(n,1)
inds <- example < 2 | example > 6
example[inds] <- sample(2:6, sum(inds), replace = TRUE)
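Note that replacing out-of-range values with uniform draws from 2:6 changes the shape of the distribution inside the bounds. If you want the kept values to follow the Poisson(1) shape conditional on lying in 2..6, a minimal rejection-sampling sketch (redrawing only the offending entries until all are in range) would be:
n <- 25
x <- rpois(n, 1)
repeat {
  inds <- x < 2 | x > 6
  if (!any(inds)) break           # stop once every value is inside the bounds
  x[inds] <- rpois(sum(inds), 1)  # redraw only the out-of-range values
}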

Generate N random integers that are sampled from a uniform distribution and sum to M in R [duplicate]

In some code I want to choose n random numbers in [0,1) which sum to 1.
I do so by choosing the numbers independently in [0,1) and normalizing them by dividing each one by the total sum:
numbers = [random() for i in range(n)]
numbers = [n/sum(numbers) for n in numbers]
My "problem" is, that the distribution I get out is quite skew. Choosing a million numbers not a single one gets over 1/2. By some effort I've calculated the pdf, and it's not nice.
Here is the weird looking pdf I get for 5 variables:
Do you have an idea for a nice algorithm to choose the numbers, that result in a more uniform or simple distribution?
You are looking to partition the distance from 0 to 1.
Choose n - 1 numbers from 0 to 1, sort them and determine the distances between each of them.
This partitions the interval from 0 to 1 and should yield the occasional large value that your current method isn't producing.
Even so, for large values of n, you can generally expect the maximum value to decrease as well, just not as quickly as with your method.
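Since the question is tagged R, a minimal sketch of this idea: draw n - 1 uniform cut points, sort them, and take the gaps between consecutive points.
n <- 5
cuts <- sort(runif(n - 1))    # n - 1 uniform cut points in (0, 1)
x <- diff(c(0, cuts, 1))      # the n gaps between consecutive cut points
sum(x)                        # 1 (up to floating-point error)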
You might be interested in the Dirichlet distribution, which is used to generate quantities that sum to 1 if you're looking for probabilities. There is also a standard way to generate such vectors from gamma distributions.
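A minimal sketch of that gamma construction in R (all shape parameters equal to 1 gives the flat Dirichlet, i.e. uniform on the simplex):
n <- 5
g <- rgamma(n, shape = 1)   # independent Gamma(1) draws
x <- g / sum(g)             # normalizing yields a Dirichlet(1, ..., 1) vector
sum(x)                      # 1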
Another way to get n random numbers which sum up to 1:
import random

def create_norm_arr(n, remaining=1.0):
    random_numbers = []
    for _ in range(n - 1):
        r = random.random()  # get a random number in [0, 1)
        r = r * remaining
        remaining -= r
        random_numbers.append(r)
    random_numbers.append(remaining)
    return random_numbers
random_numbers = create_norm_arr(5)
print(random_numbers)
print(sum(random_numbers))
This makes large values more likely, but note that the entries are not exchangeable: earlier positions tend to receive the larger values.

simulating the t-distribution -- random samples

I am new to simulation exercises in R. I want to create 1000 samples of size 25 from a t distribution with degrees of freedom 10.
Do I need to create a single vector of data from the rt generator, and then sample repeatedly from that? So, for example, I could create the vector:
singlevector <- rt(5000, 10), which draws 5000 values from a t distribution with df = 10. I would then treat this as my population and sample from it; I chose the population size of 5000 arbitrarily here.
OR, should I create my 1000 samples calling on this random t generator every time?
In other words, create a matrix with 25 rows and 1000 columns, each column containing the vector from a new call of rt(25, 10).
Since you are sampling independent, identically distributed values, all three of these approaches are statistically equivalent.
1. call the random number generator once to get as many (or more) values than you need, then sample that vector without replacement
2. call the random number generator 1000 times, picking 25 values each time
3. call the random number generator once, picking 25000 values, then subdivide the vector into individual samples in order (rather than randomly)
The latter two are not just statistically but computationally equivalent. In the first approach, the order of samples gets scrambled, but that makes no difference to the statistical properties.
Approach #1:
set.seed(101)
x1 <- rt(25000,10)
r1 <- do.call(cbind,split(x1,sample(0:24999) %/% 25))
Illustrating the equivalence of #2 and #3:
set.seed(101)
r2 <- replicate(1000, rt(25, 10))
set.seed(101)
r3 <- matrix(rt(25000,10),nrow=25)
identical(r2,r3) ## TRUE
In general solution #3 is fastest (but all of these approaches are very fast for problems of this order of magnitude, i.e. approx 5 milliseconds (#3) vs 10 milliseconds (#2) for 25 x 1000 samples on my laptop); I would pick whichever approach is easiest for you to understand when you read the code.

R: Statistics of distribution

I have the number of samples per unit and need to calculate statistics with R.
The table is like this (all rows and columns are actually filled with values, I only write a few here for easier visibility, and there are many more columns):
Hour      1    2    3    4
H1       72   11   98   65
H2       19   27
H3
H4
H5
:
H200000
I.e. in the first hour (H1) there were 72 samples of value 1, 11 samples of value 2, etc.; in the second hour (H2) there were 19 samples of value 1, 27 samples of value 2, etc.
I need to calculate the mean and standard deviation per hour (i.e. per row). As there are many thousands of rows I need a fast method.
Example: The manual mean-calculation for hour 1 (H1) would be:
(72x1 + 11x2 + 98x3 + 65x4)/(72+11+98+65) = 2.6
I suppose there are R-methods or packages that can do this, but I fail to find where. Your support is highly appreciated.
Thanks,
Chris
You want to calculate a weighted mean, so you need weighted.mean. For the first row:
values <- c(1, 2, 3, 4)
weights <- c(72, 11, 98, 65)
weighted.mean(values, weights)
There is no single standard definition of a weighted standard deviation (frequency weights and reliability weights lead to different formulas). You could use a hand-rolled weighted RMS as an estimator (but this assumes that your input sample really comes from a single Gaussian, i.e. there are no outliers -- not sure if that's the case for your example).
# same values and weights as above
sqrt(sum(weights * values^2) / sum(weights))
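Since the weights here are counts, a frequency-weighted standard deviation (the sample sd you would get by expanding each value according to its count) may be closer to what is wanted; a minimal sketch using the same values and weights:
wm <- weighted.mean(values, weights)
sqrt(sum(weights * (values - wm)^2) / (sum(weights) - 1))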
You should read your data into a table and iterate over every row. Also, "many thousands of rows" is not necessarily a large number for such a simple calculation. This is very basic stuff, maybe checking out a tutorial would also be beneficial.
You are much better off (i.e. faster calculations) using matrix operations instead of applying something by row. For example, assuming X is the matrix containing your data, you can get the weighted means the following way:
vals <- 1:ncol(X)                    # the values 1, 2, ..., one per column
wmeans <- (X %*% vals) / rowSums(X)  # per-row weighted mean: sum(count * value) / sum(count)
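If you also need the per-row standard deviation, the same matrix operations extend to it; a sketch of the population version (mean of squares minus squared mean, per row), reusing X, vals and wmeans from above:
wvar <- (X %*% vals^2) / rowSums(X) - wmeans^2  # weighted E[v^2] - (E[v])^2 for each row
wsd  <- sqrt(wvar)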
Assuming your table is a matrix called dataset (one row per hour) and the column values 1, 2, ..., ncol(dataset) are stored in a vector called values, you just need to do:
# the 1 as 2nd argument applies the function over the rows; each row supplies the weights
w.means <- apply(dataset, 1, function(counts) weighted.mean(values, counts))
