Generate samples from data following a normal distribution but with a new mean (R)

I have a vector of numbers:
set.seed(1)
x <- rnorm(8334, 1.456977, 0.3552899)
mean(x)
[1] 1.454307
Essentially, I want to randomly sample 2000 numbers from x such that the mean of this sample is lower.
The key is I don't want to generate new random numbers but only sample from x, without replacement, such that I get a subset with a different mean.
Can anyone help me?
Thanks!

This method is not truly "random" as it only picks from values that are no greater than mean(x). Let me know if this is good enough for you -
set.seed(1)
x <- rnorm(8334, 1.456977, 0.3552899)
mean(x)
[1] 1.454307
y <- sample(x, 2000, prob = x <= mean(x)) # x > mean(x) has 0 chance of getting sampled
all(y %in% x)
[1] TRUE
mean(y)
[1] 1.170856
This is effectively the same as -
z <- sample(x[x <= mean(x)], 2000)
all(z %in% x)
[1] TRUE
mean(z)
[1] 1.172033
Also, for 2000 values, the lowest possible mean is this -
mean(sort(x)[1:2000])
[1] 0.9847526
UPDATE -
Here's one way to get a random sample from both sides of mean(x), although the weighting is arbitrary and I don't know if this would guarantee a sample mean less than mean(x) -
z <- sample(x, 2000, prob = (x <= mean(x)) + 0.1)
mean(z)
[1] 1.225991
table(z <= mean(x))
FALSE TRUE
202 1798
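If a guarantee is needed, one option (my addition, not part of the answer above) is to wrap the weighted draw in a resampling loop that keeps drawing until the condition holds. The 0.1 weight for values above the mean is the same arbitrary choice as above.
set.seed(1)
x <- rnorm(8334, 1.456977, 0.3552899)
repeat {
  z <- sample(x, 2000, prob = (x <= mean(x)) + 0.1)
  if (mean(z) < mean(x)) break   # redraw until the sample mean is below mean(x)
}
mean(z) < mean(x)  # TRUE by construction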

How about doing rejection sampling, i.e. sampling 2000 numbers from your vector until you hit one sample that fulfills the desired properties?
set.seed(1)
x <- rnorm(8334, 1.456977, 0.3552899)
m_x <- mean(x)
y <- sample(x, 2000)
while(mean(y) >= m_x)
y <- sample(x, 2000)
mean(y)
#> [1] 1.4477
Created on 2019-06-18 by the reprex package (v0.3.0)
This should be quite fast since there is a (roughly) even chance for each sample's mean to be greater or smaller than the old one.
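If you want to reuse this, a minimal wrapper might look like the sketch below. The max_iter guard is my addition (the original loop has none); it protects against the case where an acceptable sample is very unlikely and the loop would otherwise run indefinitely.
sample_below_mean <- function(x, size, max_iter = 1000) {
  m_x <- mean(x)
  for (i in seq_len(max_iter)) {
    y <- sample(x, size)           # resample without replacement
    if (mean(y) < m_x) return(y)   # accept the first sample with a lower mean
  }
  stop("no acceptable sample found within 'max_iter' draws")
}
set.seed(1)
x <- rnorm(8334, 1.456977, 0.3552899)
y <- sample_below_mean(x, 2000)
mean(y) < mean(x)  # TRUE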

Randomize a normal distribution for the example:
x <- rnorm(8334, 1.45, 0.355)
Pick a sample of 2000 numbers:
y <- sample(x, 2000)
Increase y's sd by a factor of 1.5 (scaling around the mean so the shift below is not rescaled), then lower its mean by 0.5:
y <- mean(y) + (y - mean(y)) * 1.5
y <- y - 0.5
The mean and sd of y will now be about:
mean(y) # ~0.9325603
sd(y)   # ~0.5348885
Hope this is the answer you are looking for.

Related

Minimum absolute difference between vector pairs (greedy algorithm)

Given a numeric vector, I'd like to find the smallest absolute difference in combinations of size 2. However, the point of friction comes with the use of combn to create the matrix holding the pairs. How would one handle issues when a matrix/vector is too large?
When the number of resulting pairs (number of columns) using combn is too large, I get the following error:
Error in matrix(r, nrow = len.r, ncol = count) :
invalid 'ncol' value (too large or NA)
This post states that the size limit of a matrix is roughly one billion rows and two columns.
Here is the code I've used. Apologies for the use of cat in my function output -- I'm solving the Minimum Absolute Difference in an Array Greedy Algorithm problem in HackerRank and R outputs are only counted as correct if they're given using cat:
minimumAbsoluteDifference <- function(arr) {
combos <- combn(arr, 2)
cat(min(abs(combos[1,] - combos[2,])))
}
# This works fine
input0 <- c(3, -7, 0)
minimumAbsoluteDifference(input0) #returns 3
# This fails
inputFail <- rpois(10e4, 1)
minimumAbsoluteDifference(inputFail)
#Error in matrix(r, nrow = len.r, ncol = count) :
# invalid 'ncol' value (too large or NA)
TL;DR
No need for combn or the like, simply:
min(abs(diff(sort(v))))
The Nitty Gritty
Finding the difference between every possible pair of elements is O(n^2). So when we get to vectors of length 1e5, the task is burdensome both computationally and memory-wise.
We need a different approach.
How about sorting and taking the difference only with its neighbor?
By first sorting, for any element v[j] the smallest absolute difference involving v[j] is min(|v[j] - v[j-1]|, |v[j] - v[j+1]|), i.e. it is attained against one of its immediate neighbours. For example, given the sorted vector v:
v = -9 -8 -6 -4 -2 3 8
The smallest distance from -2 is the smaller of its two neighbour differences:
|-2 - 3| = 5
|-4 - (-2)| = 2
There is no need in checking any other elements.
This is easily implemented in base R as follows:
getAbsMin <- function(v) min(abs(diff(sort(v))))
I'm not going to use rpois, as with any reasonably sized vector duplicates will be produced, which would trivially give 0 as an answer. A more sensible test would be with runif or sample (minimumAbsoluteDifference2 is from the answer provided by @RuiBarradas):
set.seed(1729)
randUnif100 <- lapply(1:100, function(x) {
runif(1e3, -100, 100)
})
randInts100 <- lapply(1:100, function(x) {
sample(-(1e9):(1e9), 1e3)
})
head(sapply(randInts100, getAbsMin))
[1] 586 3860 2243 2511 5186 3047
identical(sapply(randInts100, minimumAbsoluteDifference2),
sapply(randInts100, getAbsMin))
[1] TRUE
options(scipen = 99)
head(sapply(randUnif100, getAbsMin))
[1] 0.00018277206 0.00020549633 0.00009834766 0.00008395873 0.00005299225 0.00009313226
identical(sapply(randUnif100, minimumAbsoluteDifference2),
sapply(randUnif100, getAbsMin))
[1] TRUE
It's very fast as well:
library(microbenchmark)
microbenchmark(a = getAbsMin(randInts100[[50]]),
b = minimumAbsoluteDifference2(randInts100[[50]]),
times = 25, unit = "relative")
Unit: relative
expr min lq mean median uq max neval
a 1.0000 1.0000 1.0000 1.0000 1.00000 1.00000 25
b 117.9799 113.2221 105.5144 107.6901 98.55391 81.05468 25
Even for very large vectors, the result is instantaneous:
set.seed(321)
largeTest <- sample(-(1e12):(1e12), 1e6)
system.time(print(getAbsMin(largeTest)))
[1] 3
user system elapsed
0.083 0.003 0.087
Something like this?
minimumAbsoluteDifference2 <- function(x){
stopifnot(length(x) >= 2)
n <- length(x)
inx <- rep(TRUE, n)
m <- NULL
for(i in seq_along(x)[-n]){
inx[i] <- FALSE
curr <- abs(x[i] - x[which(inx)])
m <- min(c(m, curr))
}
m
}
# This works fine
input0 <- c(3, -7, 0)
minimumAbsoluteDifference(input0) #returns 3
minimumAbsoluteDifference2(input0) #returns 3
set.seed(2020)
input1 <- rpois(1e3, 1)
minimumAbsoluteDifference(input1) #returns 0
minimumAbsoluteDifference2(input1) #returns 0
inputFail <- rpois(1e5, 1)
minimumAbsoluteDifference(inputFail) # This fails
minimumAbsoluteDifference2(inputFail) # This does not fail
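As a small extension (my addition, not part of either answer), the same sort-and-diff idea can also report which pair of values attains the minimum; getAbsMinPair is a hypothetical helper name.
getAbsMinPair <- function(v) {
  s <- sort(v)
  d <- diff(s)         # non-negative, since s is sorted
  i <- which.min(d)
  list(min_diff = d[i], pair = s[c(i, i + 1)])
}
getAbsMinPair(c(3, -7, 0))  # min_diff = 3, pair = 0 3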

Reduce total sum of vector elements in R

In R, I have a vector of integers. From this vector, I would like to reduce the value of each integer element randomly, in order to obtain a sum of the vector that is a percentage of the initial sum.
In this example, I would like to reduce the vector "x" to a vector "y", where each element has been randomly reduced to obtain a sum of the elements equal to 50% of the initial sum.
The resulting vector should have values that are non-negative and below the original value.
set.seed(1)
perc<-50
x<-sample(1:5,10,replace=TRUE)
xsum<-sum(x) # sum is 33
toremove<-floor(xsum*perc*0.01)
x # 2 2 3 5 2 5 5 4 4 1
y<-magicfunction(x,perc)
y # 0 2 1 4 0 3 2 1 2 1
sum(y) # sum is 16 (rounded half of 33)
Can you think of a way to do it? Thanks!
Assuming that x is long enough, we may rely on some appropriate law of large numbers (also assuming that x is regular enough in certain other ways). For that purpose we will generate values of another random variable Z taking values in [0,1] with mean perc, and multiply each element of x by an independent draw of Z.
set.seed(1)
perc <- 50 / 100
x <- sample(1:10000, 1000)
sum(x)
# [1] 5014161
x <- round(x * rbeta(length(x), perc / 3 / (1 - perc), 1 / 3))
sum(x)
# [1] 2550901
sum(x) * 2
# [1] 5101802
sum(x) * 2 / 5014161
# [1] 1.017479 # About 1.7% deviation
Here for Z I chose a certain beta distribution with mean perc (recall that Beta(a, b) has mean a / (a + b)), but you could pick some other distribution too. The lower its variance, the more precise the result. For instance, the following is much better, as the previously chosen beta distribution is, in fact, bimodal:
set.seed(1)
perc <- 50 / 100
x <- sample(1:1000, 100)
sum(x)
# [1] 49921
x <- round(x * rbeta(length(x), 100 * perc / (1 - perc), 100))
sum(x)
# [1] 24851
sum(x) * 2
# [1] 49702
sum(x) * 2 / 49921
# [1] 0.9956131 # Less than 0.5% deviation!
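To match the magicfunction(x, perc) interface from the question (where perc is given as a percentage, e.g. 50), the second, lower-variance beta choice can be wrapped up as in the sketch below. The concentration parameter conc = 100 is the same arbitrary choice as above and could be tuned; the name magicfunction is borrowed from the question.
magicfunction <- function(x, perc, conc = 100) {
  p <- perc / 100                               # convert percentage to a fraction
  round(x * rbeta(length(x), conc * p / (1 - p), conc))  # Beta draws have mean p
}
set.seed(1)
x <- sample(1:1000, 100)
y <- magicfunction(x, 50)
sum(y) / sum(x)  # should be close to 0.5
Each element is multiplied by a value in [0, 1], so the results stay non-negative and at most the original value (up to rounding); the total is only approximately perc percent of the original sum, as discussed above.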
An alternative solution is this function, which downsamples the original vector by removing counts with probability proportional to each element's size. It then clamps elements at zero and keeps iterating until (almost) the requested total has been removed.
removereads <- function(x, perc = NULL){
  xsum <- sum(x)
  toremove <- floor(xsum * perc)
  toremove2 <- toremove
  y <- x
  irem <- 1
  while(toremove2 > (toremove * 0.01)){
    message("Downsampling iteration ", irem)
    # draw the counts to remove, weighted by the current element sizes
    tmp <- sample(1:length(y), toremove2, prob = y, replace = TRUE)
    tmp2 <- table(tmp)
    common <- as.numeric(names(tmp2))
    y[common] <- y[common] - tmp2
    y[y < 0] <- 0
    # how much still needs to be removed after clamping at zero
    toremove2 <- toremove - (xsum - sum(y))
    irem <- irem + 1
  }
  return(y)
}
set.seed(1)
x<-sample(1:1000,10000,replace=TRUE)
perc<-0.9
y<-removereads(x,perc)
plot(x,y,xlab="Before reduction",ylab="After reduction")
abline(0,1)
And the graphical results (the scatterplot of x against y with the y = x reference line is not reproduced here).
Here's a solution which uses draws from the Dirichlet distribution:
set.seed(1)
x = sample(10000, 1000, replace = TRUE)
magic = function(x, perc, alpha = 1){
# sample from the Dirichlet distribution
# sum(p) == 1
# lower values should reduce by less than larger values
# larger alpha means the result will have more "randomness"
p = rgamma(length(x), x / alpha, 1)
p = p / sum(p)
# scale p up an amount so we can subtract it from x
# and get close to the desired sum
reduce = round(p * (sum(x) - sum(round(x * perc))))
y = x - reduce
# No negatives
y = c(ifelse(y < 0, 0, y))
return (y)
}
alpha = 500
perc = 0.7
target = sum(round(perc * x))
y = magic(x, perc, alpha)
# Hopefully close to 1
sum(y) / target
> 1.000048
# Measure of the "randomness"
sd(y / x)
> 0.1376637
Basically, it tries to figure out how much to reduce each element by while still getting close to the sum you want. You can control how "random" you want the new vector by increasing alpha.

Repeated Sampling

I have a question about repeated sampling. Let's say I am interested in the distribution of sample means. So what I would do is generate a sample of size 1000 10000 times and look at the mean of each sample. Can I instead just take one sample of size 10000*1000 and then look at the mean of the first 1000 elements, then of elements 1001 to 2000, and so on?
I would say yes. With 10,000,000 draws you are sampling the same underlying distribution either way. If you set.seed to the same value for both of the approaches you mention, you get exactly the same group means. If you change the seed and run a t-test, the two sets of means are not significantly different.
#First Method
seed <- 5554
set.seed(seed)
group_of_means_1 <- replicate(n=10000, expr = mean(rnorm(1000)))
set.seed(seed)
mean_of_means_1 <- mean(replicate(n=10000, expr = mean(rnorm(1000))))
#Method you propose
set.seed(5554)
big_sample <- data.frame(
group=rep(1:10000, each=1000),
samples=rnorm(10000 * 1000, 0, 1)
)
group_means_2 <- aggregate(samples ~ group,
FUN = mean,
data=big_sample)
mean_of_means_2 <- mean(group_means_2$samples)
#comparison
mean_of_means_1 == mean_of_means_2
t.test(group_of_means_1, group_means_2$samples)
If you're controlling for the seed, both approaches should yield identical outcomes:
set.seed(1)
mean(sample(1:9, 3))
#[1] 5.666667
mean(sample(1:9, 3))
#[1] 4
mean(sample(1:9, 3))
# [1] 5.333333
set.seed(1)
x <- sample(1:9)
mean(x[1:3])
#[1] 5.666667
mean(x[4:6])
#[1] 4
mean(x[7:9])
# [1] 5.333333
Here is an example that generates 10,000 sample means of 1,000 items drawn randomly from a uniform distribution. Based on the Central Limit Theorem, we expect these means to be normally distributed with a mean of 0.5.
# set seed to make reproducible
set.seed(95014)
# generate 10,000 means of 1,000 items pulled from a uniform distribution
mean_x <- NULL
for (i in 1:10000){
mean_x <- c(mean_x,mean(runif(1000)))
}
hist(mean_x)
...and the output is a histogram of mean_x, roughly bell-shaped and centred near 0.5 (not reproduced here).
@Len Greski, I can also do it that way, right?
a <- runif(10000000)
j <- 1
x <- NULL
while (j <= 10000000){
x <- c(x,mean(a[j:(j+999)]))
j <- j + 1000
}
x
hist(x)
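A more memory-friendly way to get the 10,000 group means from one big draw (my suggestion, not from the answers above) is to reshape the vector into a matrix and use colMeans, which avoids growing a vector inside a loop. Since matrix() fills by column, column j holds elements (j-1)*1000 + 1 to j*1000, i.e. the same grouping as the while loop.
set.seed(95014)
a <- runif(10000 * 1000)
group_means <- colMeans(matrix(a, nrow = 1000, ncol = 10000))
length(group_means)  # 10000
hist(group_means)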

Generate non-negative (or positive) random integers that sum to a fixed value

I would like to randomly assign positive integers to G groups, such that they sum up to V.
For example, if G = 3 and V = 21, valid results may be (7, 7, 7), (10, 6, 5), etc.
Is there a straightforward way to do this?
Editor's notice (from 李哲源):
If values are not restricted to integers, the problem is simple and has been addressed in Choosing n numbers with fixed sum.
For integers, there is a previous Q & A: Generate N random integers that sum to M in R but it appears more complicated and is hard to follow. The loop based solution over there is also not satisfying.
non-negative integers
Let n be the sample size:
x <- rmultinom(n, V, rep.int(1 / G, G))
is a G x n matrix, where each column is a multinomial sample that sums up to V.
By passing rep.int(1 / G, G) to argument prob I assume that each group has equal probability of "success".
positive integers
As Gregor mentions, a multinomial sample can contain 0. If such samples are undesired, they should be rejected. As a result, we sample from a truncated multinomial distribution.
In How to generate target number of samples from a distribution under a rejection criterion I suggested an "over-sampling" approach to achieve "vectorization" for truncated sampling. Simply put, knowing the acceptance probability we can estimate the expected number of trials M needed to see the first "success" (a sample with no zeros). We first draw, say, 1.25 * M samples for each requested sample, so that with high probability there are at least n "successes" among them, and we randomly return n of them as the output.
The following function implements this idea to generate truncated multinomial samples without 0.
positive_rmultinom <- function (n, V, prob) {
## input validation
G <- length(prob)
if (G > V) stop("'G > V' causes 0 in a sample for sure!")
if (any(prob < 0)) stop("'prob' can not contain negative values!")
## normalization
sum_prob <- sum(prob)
if (sum_prob != 1) prob <- prob / sum_prob
## minimal probability
min_prob <- min(prob)
## expected number of trials to get a "success" on the group with min_prob
M <- round(1.25 * 1 / min_prob)
## sampling
N <- n * M
x <- rmultinom(N, V, prob)
keep <- which(colSums(x == 0) == 0)
x[, sample(keep, n)]
}
Now let's try
V <- 76
prob <- c(53, 13, 9, 1)
Directly using rmultinom to draw samples can occasionally result in ones with 0:
## number of samples that contain 0 in 1000 trials
sum(colSums(rmultinom(1000, V, prob) == 0) > 0)
#[1] 355 ## or some other value greater than 0
But there is no such issue by using positive_rmultinom:
## number of samples that contain 0 in 1000 trials
sum(colSums(positive_rmultinom(1000, V, prob) == 0) > 0)
#[1] 0
There's probably a less expensive way, but this seems to work.
G <- 3
V <- 21
m <- data.frame(matrix(rep(1:V,G),V,G))
tmp <- expand.grid(m) # all possibilities
out <- tmp[which(rowSums(tmp) == V),] # pluck those that sum to 'V'
out[sample(1:nrow(out),1),] # randomly select a row
Not sure how to do this with runif.
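On the runif point, one rough sketch (my addition, not from the answer above) is to scale uniforms so they sum to V, floor them, and give the rounding remainder to one group. Note that this yields non-negative integers (zeros are possible) and is not uniformly distributed over all compositions; runif_sum_to_v is a hypothetical helper name.
runif_sum_to_v <- function(G, V) {
  u <- runif(G)
  y <- floor(u / sum(u) * V)       # non-negative integers summing to at most V
  y[1] <- y[1] + (V - sum(y))      # absorb the rounding remainder
  y
}
set.seed(1)
runif_sum_to_v(3, 21)  # three non-negative integers summing to 21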
I figured out what I believe to be a much simpler solution. You first generate random integers in your min-to-max range, count how many times each value occurs, and return the vector of counts (including zeros).
Note that this solution may include zeros even if the minimum value is greater than zero.
Hope this helps future r people with this problem :)
rand.vect.with.total <- function(min, max, total) {
  # generate random numbers
  x <- sample(min:max, total, replace=TRUE)
  # count occurrences of each value
  sum.x <- table(x)
  # build the output vector, one count per value in min:max
  vals <- min:max
  out <- numeric(length(vals))
  for (i in seq_along(vals)) {
    out[i] <- sum.x[as.character(vals[i])]
  }
  out[is.na(out)] <- 0
  return(out)
}
rand.vect.with.total(0, 3, 5)
# [1] 3 1 1 0
rand.vect.with.total(1, 5, 10)
#[1] 4 1 3 0 2
Note, I also posted this here Generate N random integers that sum to M in R, but this answer is relevant to both questions.
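An equivalent but more compact way to do the counting step (my rewrite, not from the answer above) is to tabulate over a factor that carries all levels from min to max, which handles the zero counts and the min offset automatically; rand_vect_with_total2 is a hypothetical name.
rand_vect_with_total2 <- function(min, max, total) {
  draws <- sample(min:max, total, replace = TRUE)
  # table over a factor keeps empty levels, so zero counts are preserved
  as.vector(table(factor(draws, levels = min:max)))
}
set.seed(1)
rand_vect_with_total2(0, 3, 5)        # four counts summing to 5
sum(rand_vect_with_total2(1, 5, 10))  # 10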

R: cap the product of positive real numbers x and y at 1 efficiently

I have this function that takes the product of two positive numbers and returns the product if it is less than or equal to 1, and returns 1 otherwise.
f1 <- function(x, y) ifelse(x*y <= 1, x*y, 1)
It annoys me that I have to do the x*y calculation twice. Is there a base R function that can do this, or another way to do the task? I am aware that the difference in computing time is probably small (is it O vs 2*O?), but still, out of curiosity.
We create the object and then do the assignment
out <- x*y
out[out >1] <- 1
Or another option is pmin
out1 <- pmin(x*y, 1)
checking
identical(out, out1)
#[1] TRUE
data
set.seed(24)
x <- abs(rnorm(10, 0.5))
y <- abs(rnorm(10, 0.7))
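If you want to check the relative cost on your own machine, a quick benchmark sketch is below (my addition; timings will vary, the microbenchmark package is assumed to be installed, and larger vectors x2 and y2 are used so the differences are measurable).
library(microbenchmark)
set.seed(24)
x2 <- abs(rnorm(1e6, 0.5))
y2 <- abs(rnorm(1e6, 0.7))
microbenchmark(
  ifelse_version = ifelse(x2 * y2 <= 1, x2 * y2, 1),
  index_version  = {out <- x2 * y2; out[out > 1] <- 1; out},
  pmin_version   = pmin(x2 * y2, 1),
  times = 20
)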
