take a sample that has a specific mean - r

Let's say I have a population like {1,2,3, ..., 23} and I want to generate a sample so that the sample's mean equals 6.
I tried to use the sample function, using a custom probability vector, but it didn't work:
population <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23)
mean(population)
minimum <- min(population)
maximum <- max(population)
amplitude <- maximum - minimum
expected <- 6
n <- length(population)
prob.vector = rep(expected, each=n)
for(i in seq(1, n)) {
if(expected > population[i]) {
prob.vector[i] <- (i - minimum) / (expected - minimum)
} else {
prob.vector[i] <- (maximum - i) / (maximum - expected)
}
}
sample.size <- 5
sample <- sample(population, sample.size, prob = prob.vector)
mean(sample)
The mean of the sample is about the mean of the population (oscillates around 12), and I wanted it to be around 6.
A good sample would be:
{3,5,6,8,9}, mean=6.2
{2,3,4,8,9}, mean=5.6
The problem is different from sample integer values in R with specific mean because I have a specific population and I can't just generate arbitrary real numbers, they must be inside the population.
The plot of the probability vector:

You can try this:
m = local({b=combn(1:23,5);
d = colMeans(b);
e = b[,d>5.5 &d<6.5];
function()sample(e[,sample(ncol(e),1)])})
m()
[1] 8 5 6 9 3
m()
[1] 6 4 5 3 13
breakdown:
b=combn(1:23,5) # combine the numbers into 5
d = colMeans(b) # find all the means
e = b[,d>5.5 &d<6.5] # select only the means that are within a 0.5 range of 6
sample(e[,sample(ncol(e),1)]) # sample the values the you need

Related

Generating a Random Permutation in R

I try to implement a example using R in Simulation (2006, 4ed., Elsevier) by Sheldon M. Ross, which wants to generate a random permutation and reads as follows:
Suppose we are interested in generating a permutation of the numbers 1,2,... ,n
which is such that all n! possible orderings are equally likely.
The following algorithm will accomplish this by
first choosing one of the numbers 1,2,... ,n at random;
and then putting that number in position n;
it then chooses at random one of the remaining n-1 numbers and puts that number in position n-1 ;
it then chooses at random one of the remaining n-2 numbers and puts it in position n-2 ;
and so on
Surely, we can achieve a random permutation of the numbers 1,2,... ,n easily by
sample(1:n, replace=FALSE)
For example
> set.seed(0); sample(1:5, replace=FALSE)
[1] 1 4 3 5 2
However, I want to get similar results manually according to the above algorithmic steps. Then I try
## write the function
my_perm = function(n){
x = 1:n # initialize
k = n # position n
out = NULL
while(k>0){
y = sample(x, size=1) # choose one of the numbers at random
out = c(y,out) # put the number in position
x = setdiff(x,out) # the remaining numbers
k = k-1 # and so on
}
out
}
## test the function
n = 5; set.seed(0); my_perm(n) # set.seed for reproducible
and have
[1] 2 2 4 5 1
which is obviously incorrect for there are two 2 . How can I fix the problem?
You have implemented the logic correctly but there is only one thing that you need to be aware which is related to R.
From ?sample
If x has length 1, is numeric (in the sense of is.numeric) and x >= 1, sampling via sample takes place from 1:x
So when the last number is remaining in x, let's say that number is 4, sampling would take place from 1:4 and return any 1 number from it.
For example,
set.seed(0)
sample(4, 1)
#[1] 2
So you need to adjust your function for that after which the code should work correctly.
my_perm = function(n){
x = 1:n # initialize
k = n # position n
out = NULL
while(k>1){ #Stop the while loop when k = 1
y = sample(x, size=1) # choose one of the numbers at random
out = c(y,out) # put the number in position
x = setdiff(x,out) # the remaining numbers
k = k-1 # and so on
}
out <- c(x, out) #Add the last number in the output vector.
out
}
## test the function
n = 5
set.seed(0)
my_perm(n)
#[1] 3 2 4 5 1
Sample size should longer than 1. You can break it by writing a condition ;
my_perm = function(n){
x = 1:n
k = n
out = NULL
while(k>0){
if(length(x)>1){
y = sample(x, size=1)
}else{
y = x
}
out = c(y,out)
x = setdiff(x,out)
k = k-1
}
out
}
n = 5; set.seed(0); my_perm(n)
[1] 3 2 4 5 1

How to do large combinations with condition in R efficiently?

Survey shows average score of 4.2 out of 5, with sample size of 14. How do I create a dataframe that provides a combination of results to achieve score of 4.2?
I tried this but it got too big
library(tidyverse)
n <- 14
avg <- 4.2
df <- expand.grid(rep(list(c(1:5)),n))
df <- df %>%
rowwise() %>%
mutate(avge = mean(c_across())) %>%
filter(ave >= 4)
The aim for this is, given the limited information above, I want to know the distribution of combinations of individual scores and see which combination is more likely to occur and how many low scores + high scores needed to have an average of that score above.
Thanks!
If you can tolerate doing this randomly, then
set.seed(42) # only so that you get the same results I show here
n <- 14
iter <- 1000000
scores <- integer(0)
while (iter > 0) {
tmp <- sample(1:5, size = n, replace = TRUE)
if (mean(tmp) > 4) {
scores <- tmp
break
}
iter <- iter - 1
}
mean(scores)
# [1] 4.142857
scores
# [1] 5 3 5 5 5 3 3 5 5 2 5 5 4 3
Notes:
The reason I use iter in there is to preclude the possibility of an "infinite" loop. While here it reacts rather quickly and is highly unlikely to go that far, if you change the conditions then it is possible your conditions could be infeasible or just highly improbable. If you don't need this, then remove iter and use instead while (TRUE) ...; you can always interrupt R with Escape (or whichever mechanism your IDE provides).
The reason I prefill scores with an empty vector and use tmp is so that you won't accidentally assume that scores having values means you have your average. That is, if the constraints are too tight, then you should find nothing, and therefore scores should not have values.
FYI: if you're looking for an average of 4.2, two things to note:
change the conditional to be what you need, such as looking for 4.2 ... but ...
looking for floating-point equality is going to bite you hard (see Why are these numbers not equal?, Is floating point math broken?, and https://en.wikipedia.org/wiki/IEEE_754), I suggest looking within a tolerance, perhaps
tol <- 0.02
# ...
if (abs(mean(tmp) - 4.2) < tol) {
scores <- tmp
break
}
# ...
where tol is some meaningful number. Unfortunately, using this seed (and my iter limit) there is no combination of 14 votes (of 1 to 5) that produce a mean that is within tol = 0.01 of 4.2:
set.seed(42)
n <- 14
iter <- 100000
scores <- integer(0)
tol <- 0.01
while (iter > 0) {
tmp <- sample(1:5, size = n, replace = TRUE)
# if (mean(tmp) > 4) {
if (abs(mean(tmp) - 4.2) < tol) {
scores <- tmp
break
}
iter <- iter - 1
}
iter
# [1] 0 # <-- this means the loop exited on the iteration-limit, not something found
scores
# integer(0)
if you instead set tol = 0.02 then you will find something:
tol <- 0.02
# ...
scores
# [1] 4 4 4 4 4 5 4 5 5 5 3 4 3 5
mean(scores)
# [1] 4.214286
You can try the code below
n <- 14
avg <- 4.2
repeat{
x <- sample(1:5, n, replace = TRUE)
if (sum(x) == round(avg * n)) break
}
and you will see
> x
[1] 5 5 5 5 5 5 4 5 5 4 1 5 1 4
> mean(x)
[1] 4.214286

Reduce total sum of vector elements in R

in R, I have a vector of integers. From this vector, I would like to reduce the value of each integer element randomly, in order to obtain a sum of the vector that is a percentage of the initial sum.
In this example, I would like to reduce the vector "x" to a vector "y", where each element has been randomly reduced to obtain a sum of the elements equal to 50% of the initial sum.
The resulting vector should have values that are non-negative and below the original value.
set.seed(1)
perc<-50
x<-sample(1:5,10,replace=TRUE)
xsum<-sum(x) # sum is 33
toremove<-floor(xsum*perc*0.01)
x # 2 2 3 5 2 5 5 4 4 1
y<-magicfunction(x,perc)
y # 0 2 1 4 0 3 2 1 2 1
sum(y) # sum is 16 (rounded half of 33)
Can you think of a way to do it? Thanks!
Assuming that x is long enough, we may rely on some appropriate law of large numbers (also assuming that x is regular enough in certain other ways). For that purpose we will generate values of another random variable Z taking values in [0,1] and with mean perc.
set.seed(1)
perc <- 50 / 100
x <- sample(1:10000, 1000)
sum(x)
# [1] 5014161
x <- round(x * rbeta(length(x), perc / 3 / (1 - perc), 1 / 3))
sum(x)
# [1] 2550901
sum(x) * 2
# [1] 5101802
sum(x) * 2 / 5014161
# [1] 1.017479 # One percent deviation
Here for Z I chose a certain beta distribution giving mean perc, but you could pick some other too. The lower the variance, the more precise the result. For instance, the following is much better as the previously chosen beta distribution is, in fact, bimodal:
set.seed(1)
perc <- 50 / 100
x <- sample(1:1000, 100)
sum(x)
# [1] 49921
x <- round(x * rbeta(length(x), 100 * perc / (1 - perc), 100))
sum(x)
# [1] 24851
sum(x) * 2
# [1] 49702
sum(x) * 2 / 49921
# [1] 0.9956131 # Less than 0.5% deviation!
An alternative solution is this function, which downsamples the original vector by a random fraction proportional to the vector element size. Then it checks that elements don't fall below zero, and iteratively approaches an optimal solution.
removereads<-function(x,perc=NULL){
xsum<-sum(x)
toremove<-floor(xsum*perc)
toremove2<-toremove
irem<-1
while(toremove2>(toremove*0.01)){
message("Downsampling iteration ",irem)
tmp<-sample(1:length(x),toremove2,prob=x,replace=TRUE)
tmp2<-table(tmp)
y<-x
common<-as.numeric(names(tmp2))
y[common]<-x[common]-tmp2
y[y<0]<-0
toremove2<-toremove-(xsum-sum(y))
irem<-irem+1
}
return(y)
}
set.seed(1)
x<-sample(1:1000,10000,replace=TRUE)
perc<-0.9
y<-removereads(x,perc)
plot(x,y,xlab="Before reduction",ylab="After reduction")
abline(0,1)
And the graphical results:
Here's a solution which uses draws from the Dirichlet distribution:
set.seed(1)
x = sample(10000, 1000, replace = TRUE)
magic = function(x, perc, alpha = 1){
# sample from the Dirichlet distribution
# sum(p) == 1
# lower values should reduce by less than larger values
# larger alpha means the result will have more "randomness"
p = rgamma(length(x), x / alpha, 1)
p = p / sum(p)
# scale p up an amount so we can subtract it from x
# and get close to the desired sum
reduce = round(p * (sum(x) - sum(round(x * perc))))
y = x - reduce
# No negatives
y = c(ifelse(y < 0, 0, y))
return (y)
}
alpha = 500
perc = 0.7
target = sum(round(perc * x))
y = magic(x, perc, alpha)
# Hopefully close to 1
sum(y) / target
> 1.000048
# Measure of the "randomness"
sd(y / x)
> 0.1376637
Basically, it tries to figure out how much to reduce each element by while still getting close to the sum you want. You can control how "random" you want the new vector by increasing alpha.

Trying to make a script calculate a value (using a function) for every 24 rows

I have not been able to find a solution to a problem similar to this on StackOverflow. I hope someone can help!
I am using the R environment.
I have data from turtle nests. There are two types of hourly data in each nest. The first is hourly Temperature, and it has an associated hourly Development (amount of "anatomical" embryonic development").
I am calculating a weighted median. In this case, the median is temperature and it is weighted by development.
I have a script here that I am using to calculated weighted median:
weighted.median <- function(x, w, probs=0.5, na.rm=TRUE) {
x <- as.numeric(as.vector(x))
w <- as.numeric(as.vector(w))
if(anyNA(x) || anyNA(w)) {
ok <- !(is.na(x) | is.na(w))
x <- x[ok]
w <- w[ok]
}
stopifnot(all(w >= 0))
if(all(w == 0)) stop("All weights are zero", call.=FALSE)
#'
oo <- order(x)
x <- x[oo]
w <- w[oo]
Fx <- cumsum(w)/sum(w)
#'
result <- numeric(length(probs))
for(i in seq_along(result)) {
p <- probs[i]
lefties <- which(Fx <= p)
if(length(lefties) == 0) {
result[i] <- x[1]
} else {
left <- max(lefties)
result[i] <- x[left]
if(Fx[left] < p && left < length(x)) {
right <- left+1
y <- x[left] + (x[right]-x[left]) * (p-Fx[left])/(Fx[right]- Fx[left])
if(is.finite(y)) result[i] <- y
}
}
}
names(result) <- paste0(format(100 * probs, trim = TRUE), "%")
return(result)
}
So from the function you can see that I need two input vectors, x and w (which will be temperature and development, respectively).
The problem I'm having is that I have hourly temperature traces that last anywhere from 5 days to 53 days (i.e., 120 hours to 1272 hours).
I would like to calculate the daily weighted median for all days within a nest (i.e., take the 24 rows of x and w, and calculate the weighted median, then move onto rows 25-48, and so forth.) The output vector would therefore be a list of daily weighted medians with length n/24 (where n is the total number of rows in x).
In other words, I would like to analyse my data automatically, in a fashion equivalent to manually doing this (nest1 is the datasheet for Nest 1 which contains two vectors, temp and devo (devo is the weight))):
`weighted.median(nest1$temp[c(1,1:24)],nest1$devo[c(1,1:24)],na.rm=TRUE)`
followed by
weighted.median(nest1$temp[c(1,25:48)],nest1$devo[c(1,25:48)],na.rm=TRUE)
followed by
weighted.median(nest1$temp[c(1,49:72)],nest1$devo[c(1,49:72)],na.rm=TRUE)
all the way to
`weighted.median(nest1$temp[c(1,n-23:n)],nest1$devo[c(1,n-23:n)],na.rm=TRUE)`
I'm afraid I don't even know where to start. Any help or clues would be very much appreciated.
The main idea is to create a new column for day 1, day 2, ..., day n/24, split the dataframe into subsets by day, and apply your function to each subset.
First I create some sample data:
set.seed(123)
n <- 120 # number of rows
nest1 <- data.frame(temp = rnorm(n), devo = rpois(n, 5))
Create the splitting variable:
nest1$day <- rep(1:(nrow(nest1)/24), each = 24)
Then, use the by() function to split nest1 by nest1$day and apply the function to each subset:
out <- by(nest1, nest1$day, function(d) {
weighted.median(d$temp, d$devo, na.rm = TRUE)
})
data.frame(day = dimnames(out)[[1]], x = as.vector(out))
# day x
# 1 1 -0.45244433
# 2 2 0.15337312
# 3 3 0.07071673
# 4 4 0.23873174
# 5 5 -0.27694709
Instead of using by, you can also use the group_by + summarise functions from the dplyr package:
library(dplyr)
nest1 %>%
group_by(day) %>%
summarise(x = weighted.median(temp, devo, na.rm = TRUE))
# # A tibble: 5 x 2
# day x
# <int> <dbl>
# 1 1 -0.452
# 2 2 0.153
# 3 3 0.0707
# 4 4 0.239
# 5 5 -0.277

How to generate random integers in R so that no two consecutive numbers are the same

Is there a method to generate random integers in R such that any two consecutive numbers are different? It is probably along the lines of x[k+1] != x[k] but I can't work out how to put it all together.
Not sure if there is a function available for that. Maybe this function can do what you want:
# n = number of elements
# sample_from = draw random numbers from this range
random_non_consecutive <- function(n=10,sample_from = seq(1,5))
{
y=c()
while(length(y)!=n)
{
y= c(y,sample(sample_from,n-length(y),replace=T))
y=y[!c(FALSE, diff(y) == 0)]
}
return(y)
}
Example:
random_non_consecutive(20,c(2,4,6,8))
[1] 6 4 6 2 6 4 2 8 4 2 6 2 8 2 8 2 8 4 8 6
Hope this helps.
The function above has a long worst-case runtime. We can keep that worst-case more constant with for example the following implementation:
# n = number of elements
# sample_from = draw random numbers from this range
random_non_consecutive <- function(n=10,sample_from = seq(1,5))
{
y= rep(NA, n)
prev=-1 # change this if -1 is in your range, to e.g. max(sample_from)+1
for(i in seq(n)){
y[i]=sample(setdiff(sample_from,prev),1)
prev = y[i]
}
return(y)
}
Another approach is to over-sample and remove the disqualifying ones as follows:
# assumptions
n <- 5 # population size
sample_size <- 1000
# answer
mu <- sample_size * 1/n
vr <- sample_size * 1/n * (1 - 1/n)
addl_draws <- round(mu + vr, 0)
index <- seq(1:n)
sample_index <- sample(index, sample_size + addl_draws, replace = TRUE)
qualified_sample_index <- sample_index[which(diff(sample_index) != 0)]
qualified_sample_index <- qualified_sample_index[1:sample_size]
# In the very unlikely event the number of qualified samples < sample size,
# NA's will fill the vector. This will print those N/A's
print(which(is.na(qualified_sample_index) == TRUE))

Resources