Repeated Sampling - r

I have a question about repeated sampling. Let's say I am interested in the distribution of sample means. What I would do is generate 10,000 samples of size 1,000 and look at the mean of each sample. Can I instead just take one sample of size 10000*1000 and then look at the mean of the first 1000 elements, then of elements 1001 to 2000, and so on?

I would say yes. Because the draws are independent and identically distributed, splitting one long stream of 10,000,000 draws into blocks of 1,000 is equivalent to drawing 10,000 separate samples of size 1,000. If you set.seed() the same value for both of the approaches you mention, you get exactly the same answer; if you change the seed and run a t-test, the results are not significantly different.
# First method: 10,000 separate samples of size 1,000
seed <- 5554
set.seed(seed)
group_of_means_1 <- replicate(n = 10000, expr = mean(rnorm(1000)))
mean_of_means_1 <- mean(group_of_means_1)
# Method you propose: one long sample split into blocks of 1,000
set.seed(seed)
big_sample <- data.frame(
  group = rep(1:10000, each = 1000),
  samples = rnorm(10000 * 1000, 0, 1)
)
group_means_2 <- aggregate(samples ~ group, FUN = mean, data = big_sample)
mean_of_means_2 <- mean(group_means_2$samples)
# Comparison: identical with the same seed
mean_of_means_1 == mean_of_means_2
t.test(group_of_means_1, group_means_2$samples)

If you're controlling for the seed, both approaches should yield identical outcomes:
set.seed(1)
mean(sample(1:9, 3))
#[1] 5.666667
mean(sample(1:9, 3))
#[1] 4
mean(sample(1:9, 3))
# [1] 5.333333
set.seed(1)
x <- sample(1:9)
mean(x[1:3])
#[1] 5.666667
mean(x[4:6])
#[1] 4
mean(x[7:9])
# [1] 5.333333

Here is an example that generates 10,000 sample means of 1,000 items drawn randomly from a uniform distribution. Based on the Central Limit Theorem, we expect these means to be normally distributed with a mean of 0.5.
# set seed to make reproducible
set.seed(95014)
# generate 10,000 means of 1,000 items pulled from a uniform distribution
mean_x <- NULL
for (i in 1:10000) {
  mean_x <- c(mean_x, mean(runif(1000)))
}
hist(mean_x)
...and the output is a histogram of mean_x, approximately normal and centered around 0.5.

@Len Greski, I can also do it that way, right?
a <- runif(10000000)
j <- 1
x <- NULL
while (j <= 10000000) {
  x <- c(x, mean(a[j:(j + 999)]))
  j <- j + 1000
}
x
hist(x)
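For what it's worth, the same block means can be computed without the explicit loop by reshaping the long vector into a matrix and taking column means. This is just a sketch reusing the a defined above (x_block_means is my own name, not from the original comment):
# Each column of the matrix holds one block of 1,000 consecutive draws
x_block_means <- colMeans(matrix(a, nrow = 1000))
hist(x_block_means)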

Related

Partial sorting of a vector

Say I have a vector of random numbers; I can order them from lowest to highest:
set.seed(1)
x <- runif(20)
v <- x[order(x)]
Now, say I want to order them but with some degree of noise.
I can randomly move elements like this:
z <-sample(1:20,2)
replace(v, z, v[rev(z)])
but this doesn't necessarily move closely related values. I am equally likely to switch the 1st and 20th values as the 5th and 6th. I would like to have some control over the switching, so that I can switch more closely ranked values.
Ideally, I would be able to reorder the vector to have a specific Spearman's correlation with the fully ordered version. Rather than the Spearman correlation being 1 when the vector is perfectly ordered, is there a way to reorder that same vector of numbers to have, e.g., a Spearman's correlation of 0.5?
What if you added some noise to their rankings? This makes sure values don't get moved too far away from their starting point. For example:
set.seed(1)
N <- 50
D <- 3 # controls how far things can move
x <- runif(N)
v <- x[order(rank(x) + runif(N, -D, D))]  # noisily ordered
z <- x[order(x)]                          # fully ordered
layout(matrix(c(1, 3, 2, 3), nrow = 2))
plot(v, main = "Mixed")
plot(z, main = "Ordered")
plot(z, v, xlab = "ordered", ylab = "mixed"); abline(0, 1)
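Since the question ultimately asks about a target Spearman correlation, a quick check you could add (my own suggestion, using the v and z defined above) is to compute the correlation between the noisy and the fully sorted versions, and then tune D until it drops to roughly the level you want:
# Spearman correlation between the sorted vector and its noisy reordering;
# larger D gives lower correlation
cor(z, v, method = "spearman")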
I don't think I have completely understood your question, but here's a start. I am simply swapping randomly chosen consecutive values of the sorted vector, repeatedly. You can control the amount of swapping with the n_swaps argument.
noisy_sort <- function(x, n_swaps) {
  sorted_x <- sort(x)
  indices <- sample(seq_along(x[-1]), n_swaps)
  for (i in indices) {
    sorted_x[c(i, i + 1)] <- sorted_x[c(i + 1, i)]
  }
  sorted_x
}
set.seed(1)
x <- runif(20)
result <- noisy_sort(x, 3)
order(result)
[1] 1 2 3 5 4 6 7 8 9 10 11 13 12 14 15 16 17 19 18 20
# (the swapped pairs are at positions 4-5, 12-13, and 18-19)
Here is a very rudimentary algorithm.
For distinct ranks, Spearman's correlation is rho = 1 - 6*SSE/(N*(N^2 - 1)), where SSE is the sum of squared differences between the ranks, so you can back out the SSE that corresponds to the desired rho: SSE = (1 - rho)*N*(N^2 - 1)/6. Then, using a Markov chain Monte Carlo (MCMC) style search, you sample a pair of indices to swap and move to the new vector with the swapped elements if it brings the SSE closer to the desired value.
I used the number of iterations as the stopping criterion. You can change the condition so that it stops once a target tolerance level is met.
set.seed(1)
n <- 20
x <- runif(n)
v <- sort(x)
calc_exp_sse <- function(rho, N) {
  (1 - rho) * N * (N^2 - 1) / 6
}
exp_sse <- calc_exp_sse(0.5, n)
ord <- 1:n
vec <- ord
for (i in 1:1000) {
  swap <- vec
  swid <- sample(n, 2L)
  swap[swid] <- swap[c(swid[2L], swid[1L])]
  if (abs(exp_sse - sum((ord - swap)^2)) < abs(exp_sse - sum((ord - vec)^2))) {
    vec <- swap
  }
}
vec
cor(vec, ord, method="spearman")
#[1] 0.5007519
cor(v, v[vec], method="spearman")
#[1] 0.5007519

Generate samples from data following normal distribution but with new mean

I have a vector of numbers that is
set.seed(1)
x <- rnorm(8334, 1.456977, 0.3552899)
mean(x)
[1] 1.454307
Essentially, I want to randomly sample 2000 numbers from x such that the mean of this sample is lower.
The key is that I don't want to generate new random numbers, only sample from x without replacement, such that I get a subset with a different mean.
Can anyone help me?
Thanks!
This method is not truly "random" as it only picks from values that are smaller than mean(x). Let me know if this is good enough for you -
set.seed(1)
x <- rnorm(8334, 1.456977, 0.3552899)
mean(x)
[1] 1.454307
y <- sample(x, 2000, prob = x <= mean(x)) # x > mean(x) has 0 chance of getting sampled
all(y %in% x)
[1] TRUE
mean(y)
[1] 1.170856
This is effectively the same as -
z <- sample(x[x <= mean(x)], 2000)
all(z %in% x)
[1] TRUE
mean(z)
[1] 1.172033
Also, for 2000 values, the lowest possible mean is this -
mean(sort(x)[1:2000])
[1] 0.9847526
UPDATE -
Here's one way to get a random sample from both sides of mean(x), although the weighting is arbitrary and I don't know if it guarantees a sample mean less than mean(x) -
z <- sample(x, 2000, prob = (x <= mean(x)) + 0.1)
mean(z)
[1] 1.225991
table(z <= mean(x))
FALSE TRUE
202 1798
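To get a feel for how reliable that weighting is, you could repeat the weighted draw many times and look at how often the sample mean actually lands below mean(x); just an empirical check, not a guarantee:
# Proportion of repeated weighted draws whose mean falls below the full-sample mean
sim_means <- replicate(1000, mean(sample(x, 2000, prob = (x <= mean(x)) + 0.1)))
mean(sim_means < mean(x))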
How about doing rejection sampling, i.e. sampling 2000 numbers from your vector until you hit one sample that fulfills the desired properties?
set.seed(1)
x <- rnorm(8334, 1.456977, 0.3552899)
m_x <-mean(x)
y <- sample(x, 2000)
while(mean(y) >= m_x)
y <- sample(x, 2000)
mean(y)
#> [1] 1.4477
Created on 2019-06-18 by the reprex package (v0.3.0)
This should be quite fast, since each new sample mean has a (roughly) even chance of falling below the mean of the full vector.
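If you need the sample mean below a specific target rather than just below mean(x), the same rejection idea applies, but note that plain rejection becomes impractically slow once the target is more than a few standard errors below mean(x). A sketch, where target_mean is a hypothetical value I chose close to mean(x):
target_mean <- 1.45  # hypothetical target; must be close to mean(x) for this to terminate quickly
y <- sample(x, 2000)
tries <- 1
while (mean(y) >= target_mean) {
  y <- sample(x, 2000)
  tries <- tries + 1
}
mean(y)
tries  # number of draws that were needed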
# Randomize a normal distribution for the example
x <- rnorm(8334, 1.45, 0.355)
# Pick a sample of 2000 numbers
y <- sample(x, 2000)
# Lower y's mean by 0.5
y <- y - 0.5
# Increase y's spread by a factor of 1.5
y <- y * 1.5
# The mean and the sd of y will now be about
mean(y) # ~0.9325603
sd(y)   # ~0.5348885
Hope this is the answer you are looking for.

How to find the most similar columns in a matrix?

I have a matrix in which I would like to find those columns that are very similar (I am not looking to find identical columns)
# to generate a matrix
Mat <- matrix(rexp(400 * 1000, rate = .1), ncol = 1000, nrow = 400)
I personally thought of "cor" or "all.equal" and tried the following, but it did not work.
indexmax <- apply(Mat, MARGIN = 2, function(x) which(cor(x) >= 0.5, arr.ind = TRUE))
What I need as output is which columns are highly similar and the degree of their similarity (it can be the correlation coefficient).
Similar means their values are close within some threshold (for example, over 75% of the residuals (e.g. column1 - column2) are less than 0.5 in absolute value).
I would also love to see how this differs from being correlated. Do they give identical results?
Using correlation you could try (with a simpler matrix for demonstration)
set.seed(123)
Mat <- matrix(rnorm(300), ncol = 10)
library(matrixcalc)
corr <- cor(Mat)
res <-which(lower.triangle(corr)>.3, arr.ind = TRUE)
data.frame(res[res[,1] != res[,2],], correlation = corr[res[res[,1] != res[,2],]])
row col correlation
1 8 1 0.3387738
2 6 2 0.3350891
Both row and col actually refer to the columns in your original matrix. So, for example, the correlation between column 8 and column 1 is 0.3387738
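If you prefer to avoid the matrixcalc dependency, base R's lower.tri() gives the same pairs (a sketch on the corr matrix from above; lower.tri() excludes the diagonal, so the self-correlations drop out automatically):
res2 <- which(corr > .3 & lower.tri(corr), arr.ind = TRUE)
data.frame(res2, correlation = corr[res2])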
I'd take a linear regression approach:
Mat <- matrix(rexp(400 * 100, rate = .1), ncol = 100, nrow = 400)
combinations <- combn(1:ncol(Mat), m = 2)
sigma <- NULL
for (i in 1:ncol(combinations)) {
  # Residual standard error of regressing one column of the pair on the other
  sigma <- c(sigma, summary(lm(Mat[, combinations[1, i]] ~ Mat[, combinations[2, i]]))$sigma)
}
sigma <- data.frame(sigma = sigma, comb_nr = 1:ncol(combinations))
The residual standard error then serves as an optional similarity criterion.
You can further order data frame by sigma and get best/worst combinations.
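For example, a minimal sketch of that ordering step, using the sigma data frame and combinations matrix built above:
# Pairs ordered from most to least similar by residual standard error
head(sigma[order(sigma$sigma), ])
# Column indices of the single most similar pair
combinations[, sigma$comb_nr[order(sigma$sigma)][1]]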
If you want a (not so elegant) straightforward approach that's likely to be very slow for matrices of your size, you can do this:
set.seed(1)
Mat <- matrix(runif(40000), ncol=100, nrow=400)
col.combs <- t(combn(1:ncol(Mat), 2))
similar <- data.frame(Col1 = NULL, Col2 = NULL, Corr = NULL, Pct.Diff = NULL)
# Compare each pair of columns
for (k in 1:nrow(col.combs)) {
  i <- col.combs[k, 1]
  j <- col.combs[k, 2]
  # Fraction of rows whose difference is within the threshold
  diff.thresh <- (abs(Mat[, i] - Mat[, j]) < 0.5)
  pair.corr <- cor(Mat[, i], Mat[, j])
  if (mean(diff.thresh) > 0.75)
    similar <- rbind(similar, data.frame(Col1 = i, Col2 = j, Corr = pair.corr,
                                         Pct.Diff = 100 * mean(diff.thresh)))
}
In this example there are 2590 distinct pairs of columns with more than 75% of their values within 0.5 of each other (elementwise). You can check the actual difference and correlation coefficient by looking at the resulting data frame.
> head(similar)
Col1 Col2 Corr Pct.Diff
1 1 2 -0.003187894 76.75
2 1 3 0.074061019 76.75
3 1 4 0.082668387 78.00
4 1 5 0.001713751 75.50
5 1 8 0.052228907 75.75
6 1 12 -0.017921978 78.00
Perhaps it's not the best solution, but gets the job done.
Also, if you're unsure why I used mean(diff.thresh), it's because the sum of a logical vector is the number of TRUE elements. The mean is the sum divided by the length, which means that in this case it's the fraction of values within the threshold.
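The same idea can be written a bit more compactly with apply() over the column pairs; a sketch assuming the same Mat and the 0.5 / 75% thresholds used above (pair.idx and similar2 are my own names):
# Fraction of rows within 0.5 of each other, for every pair of columns
pair.idx <- combn(ncol(Mat), 2)
pct.close <- apply(pair.idx, 2, function(p) mean(abs(Mat[, p[1]] - Mat[, p[2]]) < 0.5))
similar2 <- data.frame(Col1 = pair.idx[1, ], Col2 = pair.idx[2, ], Pct.Diff = 100 * pct.close)
head(similar2[similar2$Pct.Diff > 75, ])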

How to add data to R data frame

I can't imagine it should be that difficult, but probably, coming from Python, my mindset is biased.
I know I'm going to carry out 50 calculations and the result of each calculation, together with two parameters characterizing the calculation, should build up a data frame.
So my approach is to instantiate the data frame and then I want to add the results whenever they become available. Please see the indicated row below:
# Number of simulations
nsim = 50
# The data frame which should carry the calculation (parameters and solutions).
sol <- data.frame(col.names=c("ni", "Xbar", "n"))
# Fifty values for n.
n <- seq.int(5, 5000, length.out=nsim)
for (ni in n)
{
  # A random sample containing possible duplicates.
  X <- sample(seq(-ni, ni, length=ni+1), replace=T)
  Xbar <- round(mean(X), 3)
  sol <- rbind(sol, c(ni, Xbar, n)) # <<-- How to do this correctly??
}
This doesn't work.
There are two ways to do this correctly. One is to pre-define your data.frame (its size) and then populate it iteratively in a for-loop:
nsim <- 10 # reduce to 10 to simplify output
n <- seq.int(5, 5000, length.out=nsim)
sol <- setNames(data.frame(matrix(nrow=nsim, ncol=3)), c("ni", "Xbar", "n"))
set.seed(1) # for reproducibility
for (ni in seq_along(n)) {
  Xbar <- round(mean(sample(seq(-n[ni], n[ni], length=n[ni]+1), replace=T)), 3)
  sol[ni, ] <- c(ni, Xbar, n[ni])
}
Alternatively, you can use sapply on your n vector to create a vector of results and then assemble everything into a data frame:
set.seed(1) # for reproducibility
sol <- data.frame(
ni = seq_along(n),
Xbar = sapply(n, function(ni) {
round(mean(sample(seq(-ni, ni, length=ni+1), replace=T)), 3)
}),
n = n
)
Either way, you'll end up with a nice dataframe:
> str(sol)
'data.frame': 10 obs. of 3 variables:
$ ni : num 1 2 3 4 5 6 7 8 9 10
$ Xbar: num 0.667 -0.232 -14.599 -26.026 36.51 ...
$ n : num 5 560 1115 1670 2225 ...
1) Check what your initial sol contains.
> sol <- data.frame(col.names=c("ni", "Xbar", "n"))
> sol
col.names
1 ni
2 Xbar
3 n
Not what you want. See this question.
2) Make sure seq.int does what you expect - check the documentation (or just the output) of seq.int, e.g. look at what n contains:
> n
[1] 5.0000 106.9388 208.8776 310.8163 412.7551 514.6939 616.6327
[8] 718.5714 820.5102 922.4490 1024.3878 1126.3265 1228.2653 1330.2041
[15] 1432.1429 1534.0816 1636.0204 1737.9592 1839.8980 1941.8367 2043.7755
[22] 2145.7143 2247.6531 2349.5918 2451.5306 2553.4694 2655.4082 2757.3469
[29] 2859.2857 2961.2245 3063.1633 3165.1020 3267.0408 3368.9796 3470.9184
[36] 3572.8571 3674.7959 3776.7347 3878.6735 3980.6122 4082.5510 4184.4898
[43] 4286.4286 4388.3673 4490.3061 4592.2449 4694.1837 4796.1224 4898.0612
[50] 5000.0000
Is that what you meant?
3) Given (1), the problems are not surprising, but in any case, just step through the first iteration of the loop a line at a time and see what happens:
nsim = 50
sol <- data.frame(col.names=c("ni", "Xbar", "n"))
ni=5
X <- sample(seq(-ni, ni, length=ni+1), replace=T)
Xbar <- round(mean(X), 3)
sol <- rbind(sol, c(ni, Xbar, n))
print(sol)
Gives:
Warning message:
In `[<-.factor`(`*tmp*`, ri, value = 5) :
invalid factor level, NA generated
> print(sol)
col.names
1 ni
2 Xbar
3 n
4 <NA>
Now the behavior is unsurprising; we can't add three columns to something with one column.
4) You don't want to do it this way anyway. It's better to initialize sol to be its final size and then fill it in.
See, for example, this answer
However, the more common R idiom would be to avoid loops where possible; there are a number of functions that will let you create the whole thing at once.
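For example, a loop-free sketch along the lines of the sapply answer above (vapply just adds a type check; the object names are mine):
set.seed(1)  # for reproducibility
nsim <- 50
n <- seq.int(5, 5000, length.out = nsim)
Xbar <- vapply(n, function(ni) {
  round(mean(sample(seq(-ni, ni, length = ni + 1), replace = TRUE)), 3)
}, numeric(1))
sol <- data.frame(ni = seq_len(nsim), Xbar = Xbar, n = n)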
First of all, can you clarify the output format that you expect?
Modifying the code to generate a data frame produces output like the following (let me know if this is what you expect; if so, it's not difficult to generate):
ni Xbar n
10.000 2.182 12.000
If this is what you expect, then one way to do this would be as follows:
Step 1: Create Vectors
Step 2: Create Data frame from the above vectors
Step 3: Run your operations in a loop & fill in row by row.
nsim <- 50
n <- seq.int(5, 5000, length.out = nsim)
ni <- vector(mode = 'numeric', length = nsim)
Xbar <- vector(mode = 'numeric', length = nsim)
out <- data.frame(ni = ni, Xbar = Xbar, n = n)
for (i in 1:length(n)) {
  X <- sample(seq(-n[i], n[i], length = n[i] + 1), replace = T)
  out[i, 'Xbar'] <- round(mean(X), 3)
  out[i, 'ni'] <- n[i]
}
The output is a 50-row data frame with the columns ni, Xbar and n.

Finding an inversion score using R

Sorry in advance if "inversion score" isn't the proper terminology. Here's a wiki entry.
Consider a list of values, for instance
1 2 3 4 7 6 9 10 8
would have three penalties (a score of 3)
The 6 comes after 7
The 8 comes after 9
The 8 comes after 10
How can I calculate this inversion for a given vector of numbers in R? Note that some values will be NA, and I just want to skip these.
Your "inversion score" is a central component of Kendall's tau statistic. According to Wikipedia (see link), the tau statistic is (# concordant pairs-#discordant pairs)/(n*(n-1)/2). I believe that what R reports as T is the number of concordant pairs. Therefore, we should be able to reconstruct the number of discordant pairs (which I think is what you want) via n*(n-1)/2-T, as follows
x <- c(1,2,3,4,7,6,9,10,8)
(cc <- cor.test(sort(x),x,method="kendall"))
## Kendall's rank correlation tau
## data: sort(x) and x
## T = 33, p-value = 0.0008543
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## 0.8333333
So this function should work:
ff <- function(x) {
cc <- cor.test(sort(x),x,method="kendall")
n <- length(x)
n*(n-1)/2-unname(cc$statistic["T"])
}
ff(x) is 3 as requested (it would be good if you gave more examples of desired output ...). I haven't checked speed, but this has the advantage of being implemented in underlying C code.
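Since the question mentions NA values, one tweak worth noting: cor.test() will error out here if x contains NAs, because sort() drops them and the two vectors end up with different lengths, so you may want to drop NAs first. A sketch, assuming no ties among the remaining values (the exact Kendall test that reports T requires that):
ff_na <- function(x) {
  x <- x[!is.na(x)]  # skip NAs, as the question requests
  cc <- cor.test(sort(x), x, method = "kendall")
  n <- length(x)
  n * (n - 1) / 2 - unname(cc$statistic["T"])
}
ff_na(c(1, 2, 3, 4, 7, 6, 9, 10, 8, NA))  # still 3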
I quickly came up with two strategies: a naive one and a slightly more clever one using the outer function.
We look at two vectors of numbers A and B, where A is your example.
A <- scan(text = "1 2 3 4 7 6 9 10 8")
B <- sample(1:2321)
Define and try the naive inversion counting:
simpleInversion <- function(A) {
  sum <- 0
  n <- length(A)
  for (i in 1:(n-1)) {
    for (j in (i+1):n) {
      sum <- sum + (A[i] > A[j])
    }
  }
  return(sum)
}
simpleInversion(A)
simpleInversion(B)
Define and try the slightly more clever inversion counting:
cleverInversion <- function(A) {
  tab <- outer(A, A, FUN = ">")
  return(sum(tab[upper.tri(tab)]))
}
cleverInversion(A)
cleverInversion(B)
For the version which ignores NAs we can simply add an na.omit:
cleverInversion2 <- function(A) {
  AA <- na.omit(A)
  Tab <- outer(AA, AA, FUN = ">")
  return(sum(Tab[upper.tri(Tab)]))
}
A[2] <- NA
cleverInversion2(A)
Hope this helps.
Edit: A faster version
Both functions quickly become quite slow as the size of the vector grows, so I came up with a faster version:
fastInversion <- function(A) {
  return(sum(cbind(1, -1) %*% combn(na.omit(A), 2) > 0))
}
C <- sample(c(1:500, NA))
library("microbenchmark")
microbenchmark(
simpleInversion(C),
cleverInversion(C),
fastInversion(C))
#Unit: microseconds
# expr min lq median uq max neval
# simpleInversion(C) 128538.770 130483.626 133999.272 144660.116 185767.208 100
# cleverInversion(C) 9546.897 9893.358 10513.799 12564.298 17041.789 100
# fastInversion(C) 104.632 114.229 193.144 198.209 324.614 100
So we gain a speed-up of nearly two orders of magnitude, and the speed-up is even greater for larger vectors.
You could test each pair of values in your vector, counting the number that are inverted:
inversion.score <- function(vec) {
  sum(apply(combn(length(vec), 2), 2, function(x) vec[x[2]] < vec[x[1]]), na.rm=T)
}
inversion.score(c(1, 2, 3, 7, 6, 9, 10, 8, NA))
# [1] 3
