A survey shows an average score of 4.2 out of 5, with a sample size of 14. How do I create a dataframe of the combinations of results that achieve a score of 4.2?
I tried this, but the grid got too big (5^14 is about 6.1 billion rows):
library(tidyverse)
n <- 14
avg <- 4.2
df <- expand.grid(rep(list(c(1:5)),n))
df <- df %>%
rowwise() %>%
mutate(avge = mean(c_across(everything()))) %>%
filter(avge >= 4)
The aim, given the limited information above, is to know the distribution of combinations of individual scores, to see which combinations are more likely to occur, and to find how many low and high scores are needed to reach the average above.
Thanks!
If you can tolerate doing this randomly, then
set.seed(42) # only so that you get the same results I show here
n <- 14
iter <- 1000000
scores <- integer(0)
while (iter > 0) {
tmp <- sample(1:5, size = n, replace = TRUE)
if (mean(tmp) > 4) {
scores <- tmp
break
}
iter <- iter - 1
}
mean(scores)
# [1] 4.142857
scores
# [1] 5 3 5 5 5 3 3 5 5 2 5 5 4 3
Notes:
The reason I use iter in there is to preclude the possibility of an "infinite" loop. While here it reacts rather quickly and is highly unlikely to go that far, if you change the conditions then it is possible your conditions could be infeasible or just highly improbable. If you don't need this, then remove iter and use instead while (TRUE) ...; you can always interrupt R with Escape (or whichever mechanism your IDE provides).
The reason I prefill scores with an empty vector and use tmp is so that you won't accidentally assume that scores having values means you have your average. That is, if the constraints are too tight, then you should find nothing, and therefore scores should not have values.
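If you find yourself reusing this guarded-rejection pattern, it can be wrapped in a small helper function (a sketch of my own; the name and interface are not from the answer above):
sample_mean_above <- function(n, target, max_iter = 1e6) {
  for (i in seq_len(max_iter)) {
    tmp <- sample(1:5, size = n, replace = TRUE)
    if (mean(tmp) > target) return(tmp)  # first qualifying draw
  }
  integer(0)  # nothing found within the iteration limit
}
set.seed(42)
mean(sample_mean_above(14, 4))
# [1] 4.142857  # same draw as above, since the RNG calls are identical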
FYI: if you're looking for an average of 4.2, two things to note:
change the conditional to be what you need, such as looking for 4.2 ... but ...
looking for floating-point equality is going to bite you hard (see Why are these numbers not equal?, Is floating point math broken?, and https://en.wikipedia.org/wiki/IEEE_754); I suggest looking within a tolerance, perhaps
tol <- 0.02
# ...
if (abs(mean(tmp) - 4.2) < tol) {
scores <- tmp
break
}
# ...
where tol is some meaningful number. Unfortunately, using this seed (and my iter limit) there is no combination of 14 votes (of 1 to 5) that produces a mean within tol = 0.01 of 4.2:
set.seed(42)
n <- 14
iter <- 100000
scores <- integer(0)
tol <- 0.01
while (iter > 0) {
tmp <- sample(1:5, size = n, replace = TRUE)
# if (mean(tmp) > 4) {
if (abs(mean(tmp) - 4.2) < tol) {
scores <- tmp
break
}
iter <- iter - 1
}
iter
# [1] 0 # <-- this means the loop exited on the iteration-limit, not something found
scores
# integer(0)
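In fact, no seed can succeed here: with 14 votes the mean is always a multiple of 1/14, and no such multiple falls within 0.01 of 4.2. A quick check (my addition, not part of the original answer):
possible_means <- (14:70) / 14  # all achievable means of 14 votes in 1..5
min(abs(possible_means - 4.2))
# [1] 0.01428571  # the closest achievable mean is 59/14, more than tol away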
if you instead set tol = 0.02 then you will find something:
tol <- 0.02
# ...
scores
# [1] 4 4 4 4 4 5 4 5 5 5 3 4 3 5
mean(scores)
# [1] 4.214286
You can try the code below
n <- 14
avg <- 4.2
repeat{
x <- sample(1:5, n, replace = TRUE)
if (sum(x) == round(avg * n)) break
}
and you will see
> x
[1] 5 5 5 5 5 5 4 5 5 4 1 5 1 4
> mean(x)
[1] 4.214286
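Note that round(avg * n) is round(58.8) = 59 here, so the loop targets a total of 59, and the resulting mean is 59/14, not exactly 4.2:
round(4.2 * 14)
# [1] 59
59 / 14
# [1] 4.214286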
In R, I have a vector of integers. From this vector, I would like to randomly reduce the value of each integer element in order to obtain a sum of the vector that is a percentage of the initial sum.
In this example, I would like to reduce the vector "x" to a vector "y", where each element has been randomly reduced to obtain a sum of the elements equal to 50% of the initial sum.
The resulting vector should have values that are non-negative and no greater than the original values.
set.seed(1)
perc<-50
x<-sample(1:5,10,replace=TRUE)
xsum<-sum(x) # sum is 33
toremove<-floor(xsum*perc*0.01)
x # 2 2 3 5 2 5 5 4 4 1
y<-magicfunction(x,perc)
y # 0 2 1 4 0 3 2 1 2 1
sum(y) # sum is 16 (rounded half of 33)
Can you think of a way to do it? Thanks!
Assuming that x is long enough, we may rely on some appropriate law of large numbers (also assuming that x is regular enough in certain other ways). For that purpose we will generate values of another random variable Z taking values in [0,1] and with mean perc.
set.seed(1)
perc <- 50 / 100
x <- sample(1:10000, 1000)
sum(x)
# [1] 5014161
x <- round(x * rbeta(length(x), perc / 3 / (1 - perc), 1 / 3))
sum(x)
# [1] 2550901
sum(x) * 2
# [1] 5101802
sum(x) * 2 / 5014161
# [1] 1.017479 # about a 2% deviation
Here for Z I chose a certain beta distribution giving mean perc, but you could pick some other too. The lower its variance, the more precise the result. For instance, the following is much better, since the previously chosen beta distribution is, in fact, bimodal:
set.seed(1)
perc <- 50 / 100
x <- sample(1:1000, 100)
sum(x)
# [1] 49921
x <- round(x * rbeta(length(x), 100 * perc / (1 - perc), 100))
sum(x)
# [1] 24851
sum(x) * 2
# [1] 49702
sum(x) * 2 / 49921
# [1] 0.9956131 # Less than 0.5% deviation!
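To match the magicfunction interface from the question, the low-variance version can be wrapped up as follows (a minimal sketch of my own, not from the answer; the concentration parameter k = 100 is an assumption, and larger values give lower variance):
magicfunction <- function(x, perc, k = 100) {
  p <- perc / 100  # the question passes perc as a percentage
  # beta draws with mean p: shape1 / (shape1 + shape2) = p
  round(x * rbeta(length(x), k * p / (1 - p), k))
}
set.seed(1)
x <- sample(1:1000, 100)
y <- magicfunction(x, 50)
sum(y) / sum(x)  # should be close to 0.5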
An alternative solution is this function, which downsamples the original vector by removing reads with probability proportional to element size. It clips elements so they don't fall below zero, and retries until the amount removed is within 1% of the target.
removereads <- function(x, perc = NULL) {
  xsum <- sum(x)
  toremove <- floor(xsum * perc)  # total amount to remove
  toremove2 <- toremove           # amount still to remove in the current attempt
  irem <- 1
  while (toremove2 > (toremove * 0.01)) {  # accept once within 1% of the target
    message("Downsampling iteration ", irem)
    # draw reads to remove, weighted by element size
    tmp <- sample(1:length(x), toremove2, prob = x, replace = TRUE)
    tmp2 <- table(tmp)
    y <- x
    common <- as.numeric(names(tmp2))
    y[common] <- x[common] - tmp2
    y[y < 0] <- 0  # clip: elements cannot fall below zero
    toremove2 <- toremove - (xsum - sum(y))  # shortfall caused by clipping
    irem <- irem + 1
  }
  return(y)
}
set.seed(1)
x <- sample(1:1000, 10000, replace = TRUE)
perc <- 0.9
y <- removereads(x, perc)
plot(x, y, xlab = "Before reduction", ylab = "After reduction")
abline(0, 1)
The resulting plot (not reproduced here) shows the reduced values lying on or below the y = x line.
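A quick sanity check on the output (my addition, not part of the original answer):
1 - sum(y) / sum(x)  # fraction removed; by construction within 1% of perc = 0.9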
Here's a solution which uses draws from the Dirichlet distribution:
set.seed(1)
x = sample(10000, 1000, replace = TRUE)
magic = function(x, perc, alpha = 1){
# sample from the Dirichlet distribution
# sum(p) == 1
# lower values should reduce by less than larger values
# larger alpha means the result will have more "randomness"
p = rgamma(length(x), x / alpha, 1)
p = p / sum(p)
# scale p up an amount so we can subtract it from x
# and get close to the desired sum
reduce = round(p * (sum(x) - sum(round(x * perc))))
y = x - reduce
# No negatives
y = c(ifelse(y < 0, 0, y))
return (y)
}
alpha = 500
perc = 0.7
target = sum(round(perc * x))
y = magic(x, perc, alpha)
# Hopefully close to 1
sum(y) / target
# [1] 1.000048
# Measure of the "randomness"
sd(y / x)
# [1] 0.1376637
Basically, it tries to figure out how much to reduce each element by while still getting close to the sum you want. You can control how "random" you want the new vector by increasing alpha.
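To see the accuracy/randomness trade-off directly, you can sweep a few alpha values (my quick check, not from the original answer; exact numbers will vary):
set.seed(1)
sapply(c(50, 500, 5000), function(a) {
  y <- magic(x, 0.7, a)
  c(ratio = sum(y) / target, sd = sd(y / x))
})
# the ratio stays near 1, while sd(y / x) grows with alpha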
In Excel, it's easy to perform a calculation on a previous cell by referencing that earlier cell. For example, starting from an initial value of 100 (step = 0), each next step would be 0.9 * previous + 9, computed simply by dragging the formula down from the first cell (step = 1). The first 11 values (steps 0 through 10) would look like:
step value
[1,] 0 100.00000
[2,] 1 99.00000
[3,] 2 98.10000
[4,] 3 97.29000
[5,] 4 96.56100
[6,] 5 95.90490
[7,] 6 95.31441
[8,] 7 94.78297
[9,] 8 94.30467
[10,] 9 93.87420
[11,] 10 93.48678
I've looked around the web and StackOverflow, and the best I could come up with is a for loop (below). Are there more efficient ways to do this? Is it possible to avoid a for loop? It seems like most functions in R (such as cumsum, diff, apply, etc) work on existing vectors instead of calculating new values on the fly from previous ones.
# for loop; this works
value <- 100  # initial value
for (i in 2:11) {
  current <- 0.9 * value[i-1] + 9
  value <- append(value, current)
}
cbind(step = 0:10, value)  # prints the example output shown above
It seems like you're looking for a way to do recursive calculations in R. Base R has two ways of doing this which differ by the form of the function used to do the recursion. Both methods could be used for your example.
Reduce can be used with recursion equations of the form v[i+1] = function(v[i], x[i]) where v is the calculated vector and x an input vector; i.e. where the i+1 output depends only the i-th values of the calculated and input vectors and the calculation performed by function(v, x) may be nonlinear. For you case, this would be
value <- 100
nout <- 10
# v[i+1] = function(v[i], x[i])
v <- Reduce(function(v, x) .9*v + 9, x=numeric(nout), init=value, accumulate=TRUE)
cbind(step = 0:nout, v)
filter is used with recursion equations of the form y[i] = x[i] + filter[1]*y[i-1] + ... + filter[p]*y[i-p] where y is the calculated vector and x an input vector; i.e. where the output can depend linearly upon lagged values of the calculated vector as well as the i-th value of the input vector. For your case, this would be:
value <- 100
nout <- 10
# y[i] = x[i] + filter[1]*y[i-1] + ... + filter[p]*y[i-p]
y <- c(value, stats::filter(x=rep(9, nout), filter=.9, method="recursive", sides=1, init=value))
cbind(step = 0:nout, y)
For both functions, the number of recursion steps is given by the length of the input vector x: each element of x produces one output value on top of the initial value.
Both of these approaches give your result.
Use our knowledge about the geometric series: unrolling the recurrence v[i] = 0.9 * v[i-1] + 9 gives the closed form v[i] = 0.9^i * 100 + 9 * (0.9^i - 1) / (0.9 - 1), which can be evaluated for all steps at once.
i <- 0:10
0.9 ^ i * 100 + 9 * (0.9 ^ i - 1) / (0.9 - 1)
#[1] 100.00000 99.00000 98.10000 97.29000 96.56100 95.90490 95.31441 94.78297 94.30467 93.87420 93.48678
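The same trick applies to any recurrence of the form v[i] = a * v[i-1] + b with a != 1; here is a hypothetical general helper (my addition, not from the answer):
geom_rec <- function(v0, a, b, steps) {
  i <- 0:steps
  # closed form: v[i] = a^i * v0 + b * (a^i - 1) / (a - 1)
  a^i * v0 + b * (a^i - 1) / (a - 1)
}
geom_rec(100, 0.9, 9, 10)  # reproduces the eleven values above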
You could also use purrr::accumulate:
data.frame(value = purrr::accumulate(1:10, ~ .x * .9 + 9, .init = 100))
value
1 100.00000
2 99.00000
3 98.10000
4 97.29000
5 96.56100
6 95.90490
7 95.31441
8 94.78297
9 94.30467
10 93.87420
11 93.48678
.init is the initial value (so ten accumulation steps over 1:10 yield the 11 values for steps 0 through 10), and there is also the argument .dir if you want to control the direction ("forward" is the default).
I am trying to generate n random numbers whose sum is less than 1.
So I can't just run runif(n). But I can condition each iteration on the sum of all values generated up to that point.
The idea is to start with an empty vector, v, and set up a loop such that at each iteration i a runif() value is generated. Before it is accepted as an element of v, i.e. v[i] <- runif(), the test sum(v) + candidate < 1 is carried out; if it passes, the value is accepted, but if it fails (i.e. the sum would exceed 1), the value is tossed out and iteration i is repeated.
I am far from implementing this idea, but I would like to resolve it along the lines of something similar to what follows. It's not so much a practical problem, but more of an exercise to understand the syntax of loops in general:
n <- 4
v <- 0
for (i in 1:n) {
  rdom <- runif(1)
  if ((sum(v) + rdom) < 1) v[i] <- rdom
  # keep trying before moving on to iteration i + 1???? i stays i?????
}
I have looked into while (I even put while in the title); however, I need the vector to have n elements, so I get stuck if I try something that basically tells R to keep adding random uniform realizations to v while sum(v) < 1, because I can end up with fewer than n elements in v.
Here's a possible solution. It originally used the more generic repeat rather than while; I edited it to use while and save a couple of lines.
set.seed(0)
n <- 4
v <- numeric(n)
i <- 0
while (i < n) {
ith <- runif(1)
if (sum(c(v, ith)) < 1) {
i <- i+1
v[i] <- ith
}
}
v
# [1] 0.89669720 0.06178627 0.01339033 0.02333120
Using a repeat block, you must check the exit condition explicitly anyway, but, with the vector-growing problem likewise removed, it looks very similar:
set.seed(0)
n <- 4
v <- numeric(n)
i <- 0
repeat {
ith <- runif(1)
if (sum(c(v, ith)) < 1) {
i <- i+1
v[i] <- ith
}
if (i == n) break
}
If you really want to keep exactly the same procedure that you have posted (aka iteratively sample the n values one at a time from the standard uniform distribution, rejecting any samples that cause your sum to exceed 1), then the following code is mathematically equivalent, shorter, and more efficient:
samp <- function(n) {
v <- rep(0, n)
for (i in 1:n) {
v[i] <- runif(1, 0, 1-sum(v))
}
v
}
Basically, this code uses the mathematical fact that if the sum of the vector is currently sum(v), then sampling from the standard uniform distribution until you get a value no greater than 1-sum(v) is exactly equivalent to sampling in the uniform distribution from 0 to 1-sum(v). The advantage of using the latter approach is that it's much more efficient -- we don't need to keep rejecting samples and trying again, and can instead just sample once for each element.
To get a sense of the runtime differences, consider sampling 100 observations with n=10, comparing to a working implementation of the code from your post (copied from my other answer to this question):
OP <- function(n) {
v <- rep(0, n)
for (i in 1:n){
rdom <- runif(1)
while (sum(v) + rdom > 1) rdom <- runif(1)
v[i] <- rdom
}
v
}
set.seed(144)
system.time(samples.OP <- replicate(100, OP(10)))
# user system elapsed
# 261.937 1.641 265.805
system.time(samples.josliber <- replicate(100, samp(10)))
# user system elapsed
# 0.004 0.001 0.004
In this case, the new approach is roughly 60,000 times faster.
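If you want to confirm that the two procedures also agree in distribution, not just in speed, a Q-Q plot of corresponding elements is a quick check (my addition; keep the replicate count modest, since OP() is slow):
set.seed(1)
a <- replicate(2000, samp(3)[1])
b <- replicate(2000, OP(3)[1])
qqplot(a, b)
abline(0, 1)  # points hugging this line indicate matching distributions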
It sounds like you're trying to uniformly sample from a space of n variables where the following constraints hold:
x_1 + x_2 + ... + x_n <= 1
x_1 >= 0
x_2 >= 0
...
x_n >= 0
The "hit and run" algorithm is the mathematical machinery that enables you to do exactly this. In 2-dimensional space, the algorithm will sample uniformly from the following triangle, with each location in the shaded area being equally likely to be selected:
The algorithm is provided in R through the hitandrun package, which requires you to specify the linear inequalities that define the space through a constraint matrix, direction vector, and right-hand side vector:
library(hitandrun)
n <- 3
constr <- list(constr = rbind(rep(1, n), -diag(n)),
dir = c(rep("<=", n+1)),
rhs = c(1, rep(0, n)))
set.seed(144)
samples <- hitandrun(constr, n.samples=1000)
head(samples, 10)
# [,1] [,2] [,3]
# [1,] 0.28914690 0.01620488 0.42663224
# [2,] 0.65489979 0.28455231 0.00199671
# [3,] 0.23215115 0.00661661 0.63597912
# [4,] 0.29644234 0.06398131 0.60707269
# [5,] 0.58335047 0.13891392 0.06151205
# [6,] 0.09442808 0.30287832 0.55118290
# [7,] 0.51462261 0.44094683 0.02641638
# [8,] 0.38847794 0.15501252 0.31572793
# [9,] 0.52155055 0.09921046 0.13304728
# [10,] 0.70503030 0.03770875 0.14299089
Breaking down this code a bit, we generated the following constraint matrix:
constr
# $constr
# [,1] [,2] [,3]
# [1,] 1 1 1
# [2,] -1 0 0
# [3,] 0 -1 0
# [4,] 0 0 -1
#
# $dir
# [1] "<=" "<=" "<=" "<="
#
# $rhs
# [1] 1 0 0 0
Reading across the first line of constr$constr we have 1, 1, 1 which indicates "1*x1 + 1*x2 + 1*x3". The first element of constr$dir is <=, and the first element of constr$rhs is 1; putting it together we have x1 + x2 + x3 <= 1. From the second row of constr$constr we read -1, 0, 0 which indicates "-1*x1 + 0*x2 + 0*x3". The second element of constr$dir is <= and the second element of constr$rhs is 0; putting it together we have -x1 <= 0 which is the same as saying x1 >= 0. The similar non-negativity constraints follow in the remaining rows.
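The same constraint list can be generated for any dimension with a small helper (the function name is mine, not part of the hitandrun package):
# "sum of components <= 1, every component >= 0" in constraint-list form
unit_simplex_constr <- function(n) {
  list(constr = rbind(rep(1, n), -diag(n)),
       dir = rep("<=", n + 1),
       rhs = c(1, rep(0, n)))
}
samples5 <- hitandrun(unit_simplex_constr(5), n.samples = 1000)  # now in 5 dimensions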
Note that the hit and run algorithm has the nice property of having the exact same distribution for each of the variables:
hist(samples[,1])
hist(samples[,2])
hist(samples[,3])
Meanwhile, the distribution of the samples from your procedure will be highly uneven, and as n increases this problem will get worse and worse.
OP <- function(n) {
v <- rep(0, n)
for (i in 1:n){
rdom <- runif(1)
while (sum(v) + rdom > 1) rdom <- runif(1)
v[i] <- rdom
}
v
}
samples.OP <- t(replicate(1000, OP(3)))
hist(samples.OP[,1])
hist(samples.OP[,2])
hist(samples.OP[,3])
An added advantage is that the hit-and-run algorithm appears faster -- I generated these 1000 replicates in 0.006 seconds on my computer with hit-and-run and it took 0.3 seconds using the modified code from the OP.
Here's how I would do it, without any loop, if or while:
set.seed(123)
x <- runif(1) # start with the sum that you want to obtain
n <- 4 # number of generated random numbers, can be chosen arbitrarily
y <- sort(runif(n-1,0,x)) # choose n-1 random points to cut the range [0:x]
z <- c(y[1],diff(y),x-y[n-1]) # result: determine the length of the segments
#> z
#[1] 0.11761257 0.10908627 0.02723712 0.03364156
#> sum(z)
#[1] 0.2875775
#> all.equal(sum(z),x)
#[1] TRUE
The advantage here is that you can determine exactly which sum you want to obtain and how many numbers n you want to generate for this. If you set, e.g., x <- 1 in the second line, the n random numbers stored in the vector z will add up to one.
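For instance, with x <- 1 the same three lines produce n numbers that sum to exactly one:
set.seed(123)
n <- 4
y <- sort(runif(n - 1))            # cut the range [0, 1] at n-1 random points
z <- c(y[1], diff(y), 1 - y[n-1])  # segment lengths
sum(z)
# [1] 1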
I am filling a 10x10 matrix (mat) randomly until sum(mat) == 100.
I wrote the following... (i = 2 for another reason not specified here, but I kept it at 2 to be consistent with my actual code)
mat <- matrix(rep(0, 100), nrow = 10)
mat[1,] <- c(0,0,0,0,0,0,0,0,0,1)
mat[2,] <- c(0,0,0,0,0,0,0,0,1,0)
mat[3,] <- c(0,0,0,0,0,0,0,1,0,0)
mat[4,] <- c(0,0,0,0,0,0,1,0,0,0)
mat[5,] <- c(0,0,0,0,0,1,0,0,0,0)
mat[6,] <- c(0,0,0,0,1,0,0,0,0,0)
mat[7,] <- c(0,0,0,1,0,0,0,0,0,0)
mat[8,] <- c(0,0,1,0,0,0,0,0,0,0)
mat[9,] <- c(0,1,0,0,0,0,0,0,0,0)
mat[10,] <- c(1,0,0,0,0,0,0,0,0,0)
i <- 2
set.seed(129)
while( sum(mat) < 100 ) {
# pick random cell
rnum <- sample( which(mat < 1), 1 )
mat[rnum] <- 1
##
print(paste0("i =", i))
print(paste0("rnum =", rnum))
print(sum(mat))
i = i + 1
}
For some reason, when sum(mat) == 99 there are several extra steps... I would assume that once i = 91 the while loop would stop, but it continues past this. Can someone explain what I have done wrong?
If I change the while condition to
while( sum(mat) < 100 & length(which(mat < 1)) > 0 )
the issue remains.
Your problem is equivalent to randomly ordering the indices of a matrix that are equal to 0. You can do this in one line with sample(which(mat < 1)). I suppose if you wanted to get exactly the same sort of output, you might try something like:
set.seed(144)
idx <- sample(which(mat < 1))
for (i in seq_along(idx)) {
print(paste0("i =", i))
print(paste0("rnum =", idx[i]))
print(sum(mat)+i)
}
# [1] "i =1"
# [1] "rnum =5"
# [1] 11
# [1] "i =2"
# [1] "rnum =70"
# [1] 12
# ...
See ?sample
Arguments:
x: Either a vector of one or more elements from which to choose,
or a positive integer. See ‘Details.’
...
If ‘x’ has length 1, is numeric (in the sense of ‘is.numeric’) and
‘x >= 1’, sampling _via_ ‘sample’ takes place from ‘1:x’. _Note_
that this convenience feature may lead to undesired behaviour when
‘x’ is of varying length in calls such as ‘sample(x)’. See the
examples.
In other words, if x in sample(x) is of length 1, sample returns a random number from 1:x. This happens towards the end of your loop, where there is just one 0 left in your matrix and one index is returned by which(mat < 1).
The iteration repeats on level 99 because sample() behaves very differently when the first parameter is a vector of length 1 and when it is longer. When it has length 1, sample() assumes you want a random number from 1 to that number; when it has length > 1, you get a random element of that vector.
Compare
sample(c(99, 100), 1)  # returns 99 or 100
and
sample(c(100), 1)      # length-1 input: returns a random value from 1:100
Of course, this is an inefficient way of filling your matrix. As @josilber pointed out, a single call to sample could do everything you need.
The issue comes from how sample and which do the sampling when you have only a single '0' value left.
For example, do this:
mat <- matrix(rep(1, 100), nrow = 10)
Now you have a matrix of all 1s. Let's set two entries to 0:
mat[15]<-0
mat[18]<-0
and then sample
sample(which(mat<1))
[1] 18 15
By adding a size = 1 argument you would get one or the other.
Now let's try this:
mat[18]<-1
sample(which(mat<1))
[1] 3 13 8 2 4 14 11 9 10 5 15 7 1 12 6
Oops, you did not get [1] 15. Instead, what happens is that only a single integer (15 in this case) is passed to sample. When you do sample(x) and x is a single integer, it gives you a sample of 1:x with the integers in random order.
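A safe idiom that sidesteps this convenience behaviour appears in the Examples section of ?sample:
resample <- function(x, ...) x[sample.int(length(x), ...)]
resample(which(mat < 1))  # returns 15 even though only one zero cell remains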