R: Having trouble producing multiple multi-number samples

I'm trying to draw samples from a runif(100,900,1100) population. Now I want to draw 25 samples of size n = 5 from this population with replacement, but it seems that sample() outputs only scalar samples. What is the best approach for this?

This gives you a 5*25 matrix (each column corresponds to one sample) with numbers generated from a uniform distribution.
matrix(runif(5*25,900,1100), nrow = 5, ncol = 25)
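Or, if you instead want to first generate runif(100,900,1100) and then draw 25 samples from the resulting vector, store the population once and sample from it (calling runif() inside the function would generate a new population for every sample):
pop <- runif(100, 900, 1100)
sapply(1:25, function(i) sample(pop, 5, replace = TRUE))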
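As a minimal sketch of the second approach, assuming the population should be generated once and then reused, replicate() can stand in for sapply(). The names pop, samples and sample_means and the seed are illustrative only, not part of the original answer.

```r
set.seed(42)                      # arbitrary seed, only for reproducibility
pop <- runif(100, 900, 1100)      # fixed population of 100 values

# 25 samples of size 5, drawn with replacement; one sample per column
samples <- replicate(25, sample(pop, size = 5, replace = TRUE))

dim(samples)                      # 5 rows, 25 columns
sample_means <- colMeans(samples) # one mean per sample
```

Keeping one sample per column means column-wise helpers such as colMeans() give one statistic per sample.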

Related

Distribution of mean*standard deviation of sample from gaussian

I'm trying to assess the feasibility of an instrumental variable in my project with a variable I haven't seen before. The variable is essentially an interaction between the mean and standard deviation of a sample drawn from a Gaussian, and I'm trying to see what its distribution might look like. Below is what I'm trying to do; any help is much appreciated.
Generate a set of 1000 individuals with a variable x following the Gaussian distribution; draw 50 random samples of 5 individuals from this distribution with replacement; calculate the mean and standard deviation of x for each sample; create an interaction variable named y, calculated by multiplying the mean and standard deviation of x for each sample; plot the distribution of y.
Beginner's version
There might be more efficient ways to code this, but this is easy to follow, I guess:
stat_pop <- rnorm(1000, mean = 0, sd = 1)
N = 50
# As Ben suggested, we create a data.frame filled with NA values
samples <- data.frame(mean = rep(NA, N), sd = rep(NA, N))
# Now we use a loop to populate the data.frame
for(i in 1:N){
# draw 5 samples from population (without replacement)
# I assume you want to replace for each turn of taking 5
# If you want to replace between drawing each of the 5,
# I think it should be obvious how to adapt the following code
smpl <- sample(stat_pop, size = 5, replace = FALSE)
# the data.frame currently has two columns. In each row i, we put mean and sd
samples[i, ] <- c(mean(smpl), sd(smpl))
}
# $ is used to get a certain column of the data.frame by the column name.
# Here, we create a new column y based on the existing two columns.
samples$y <- samples$mean * samples$sd
# plot a histogram
hist(samples$y)
R matches unnamed arguments by position, i.e., you are not required to name every parameter. E.g., rnorm(1000, mean = 0, sd = 1) is the same as rnorm(1000, 0, 1) and even the same as rnorm(1000), since 0 and 1 are the default values.
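A quick way to convince yourself of this is to reset the random seed before each call and compare the results. This is only a sketch; the seed value and the variable names a, b and d are arbitrary.

```r
set.seed(1)
a <- rnorm(5, mean = 0, sd = 1)   # all arguments named
set.seed(1)
b <- rnorm(5, 0, 1)               # matched by position
set.seed(1)
d <- rnorm(5)                     # defaults: mean = 0, sd = 1

identical(a, b) && identical(a, d)
```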
Somewhat more efficient version
In R, explicit loops are often slower than vectorized functions, so it is worth avoiding them where a vectorized alternative exists. For your question, it does not make any noticeable difference. However, for large data sets, performance should be kept in mind. The following might be a bit harder to follow:
stat_pop <- rnorm(1000, mean = 0, sd = 1)
N = 50
n = 5
# again, I set replace = FALSE here; if you meant to replace each individual
# (so the same individual can be drawn more than once in each "draw 5"),
# set replace = TRUE
# replicate repeats the "draw 5" action N times
smpls <- replicate(N, sample(stat_pop, n, replace = FALSE))
# we transform the output and turn it into a data.frame to make it
# more convenient to work with
samples <- data.frame(t(smpls))
samples$mean <- rowMeans(samples)
samples$sd <- apply(samples[, c(1:n)], 1, sd)
samples$y <- samples$mean * samples$sd
hist(samples$y)
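If the five individuals within each sample should be drawn with replacement, as the question describes, one possible variant keeps the samples as columns and computes the statistics column-wise. This is a sketch under that assumption; the names smpl_mean, smpl_sd and y and the seed are illustrative.

```r
set.seed(123)                     # arbitrary seed for reproducibility
stat_pop <- rnorm(1000, mean = 0, sd = 1)
N <- 50                           # number of samples
n <- 5                            # individuals per sample

# n x N matrix: each column is one sample drawn with replacement
smpls <- replicate(N, sample(stat_pop, n, replace = TRUE))

smpl_mean <- colMeans(smpls)      # mean of each sample
smpl_sd   <- apply(smpls, 2, sd)  # standard deviation of each sample
y <- smpl_mean * smpl_sd          # interaction variable

hist(y, main = "mean * sd per sample", xlab = "y")
```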
General note
Usually, you should do some research on the problem before posting here. Then, you either find out how it works by yourself, or you can provide an example of what you tried. To this end, you can simply google each of the steps you outlined (e.g., google "generate random standard distribution R" in order to find out about the function rnorm()).
Run ?rnorm to get help on the function in RStudio.

adjusting lists so that drawing elements from them leads to a uniform distribution

I have 2 lists, and I would like to sample elements from each of the lists and then calculate the difference between them (get the nth index from list A and nth index from list B and subtract them). When I plot the probability density of the difference, I see that the result looks roughly normal, peaking at zero (strictly, the difference of two independent uniform draws follows a triangular distribution), which is expected (see the example and plot below).
Is there a way to adjust the lists so that randomly drawing elements from each, when subtracted by the same nth index, would lead to similar frequencies of occurrences across the range of possibilities? In other words, can I arrange the list so that the subtraction of randomly drawn elements from each list leads to uniform distribution?
I provided the code below to illustrate my question.
listA <- -10:10
listB <- -10:10
#plot the distribution of the list
hist(listB, freq = FALSE, xlab = 'x', density = 20)
#sample random elements from both of the lists
sampledListA <- sample(listA, 1000, replace = TRUE)
sampledListB <- sample(listB, 1000, replace = TRUE)
# I then draw one element from each of the lists and I calculate the difference of the two drawn value
# and I want the occurrences of the differences to be similar in probability.
# I can calculate the difference by element
listDiff <- sampledListA - sampledListB
#here is the normal distribution this leads to
hist(listDiff, freq = FALSE, xlab = 'x', density = 20)
# I can calculate the possible differences using the outer function
diffMatrix <- data.frame(outer(listA,listB, '-'))
#change the column and row names
library(stats)
nms <- as.character(listA)
rownames(diffMatrix) <- nms
names(diffMatrix) <- nms
diffMatrix
# I can then find the list of possible unique differences, and draw samples from that
vectorized <- unlist(diffMatrix)
diffRange <- unique(vectorized)
getDiffSamples<-sample(diffRange, 1000, replace = TRUE) #get 1000 random sample from each diff value
hist(getDiffSamples,freq = FALSE, xlab = 'x', density = 20) #then I will have uniform distribution
# I can then get any value from this distribution, find its index (location of row and column) in the matrix of differences
which(diffMatrix == getDiffSamples[1], arr.ind=TRUE)
# but I am looking for a way to adjust the list, because, my ultimate goal is to have a list of A and B that when I randomly pick one from each list and
# get the subtraction of the pair when plotted U(listA, listB) will have the shape of uniform probability.
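To see why the histogram of differences peaks at zero, it can help to enumerate every equally likely pair instead of sampling. The sketch below only illustrates the shape of the distribution for the two lists in the question; it does not by itself solve the problem of adjusting the lists. The names all_diffs, counts and probs are illustrative.

```r
listA <- -10:10
listB <- -10:10

# all 21 * 21 equally likely (a, b) pairs and their differences
all_diffs <- as.vector(outer(listA, listB, "-"))

# number of pairs producing each difference d: 21 - |d|
counts <- table(all_diffs)
probs  <- counts / length(all_diffs)

barplot(probs, xlab = "difference", ylab = "probability")
```

A difference of 0 can be formed by 21 pairs, while a difference of 20 or -20 can be formed by only one pair each, which is what produces the triangular shape.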

Hierarchical clustering for centers of kmeans in R

I have a huge data set (200,000 rows * 40 columns) where each row represents an observation and each column is a variable. For this data, I would like to do hierarchical clustering. Unfortunately, because the number of rows is huge, this is impossible on my computer: I would need to compute the distance matrix for all pairs of observations, i.e. a (200,000 * 200,000) matrix.
The answer to this question suggests first using kmeans to calculate a number of centers, then performing the hierarchical clustering on the coordinates of these centers using the library FactoMineR.
The problem: I keep getting an error when applying the same method!
# example (HCPC and plot.HCPC come from FactoMineR)
library(FactoMineR)
# Data
MyData <- rbind(matrix(rnorm(70000, sd = 0.3), ncol = 2),
                matrix(rnorm(70000, mean = 1, sd = 0.3), ncol = 2))
colnames(MyData) <- c("x", "y")
kClust_MyData <- kmeans(MyData, 1000, iter.max=20)
Hclust_MyData <- HCPC(kClust_MyData$centers, graph=FALSE, nb.clust=-1)
plot.HCPC(Hclust_MyData, choice="tree")
But I get:
Error in catdes(data.clust, ncol(data.clust), proba = proba, row.w = res.sauv$call$row.w.init) :
object 'data.clust' not found
The package fastcluster has a method hclust.vector that does not require a distance matrix as input, but computes the distances itself in a more memory efficient way. From the fastcluster manual:
The call
hclust.vector(X, method='single', metric=[...])
is equivalent to
hclust(dist(X, metric=[...]), method='single')
but uses less memory and is equally fast
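The two-step idea from the question (k-means first, then hierarchical clustering on the centers) can also be sketched with base R only. This is not the FactoMineR or fastcluster code discussed above, just an illustration under assumed settings: the data are a smaller toy version of the question's example, and the choices of 20 centers, the "ward.D2" linkage and k = 2 are arbitrary.

```r
set.seed(1)
# smaller toy data than in the question, so the sketch runs quickly
MyData <- rbind(matrix(rnorm(2000, sd = 0.3), ncol = 2),
                matrix(rnorm(2000, mean = 1, sd = 0.3), ncol = 2))
colnames(MyData) <- c("x", "y")

# step 1: compress the observations into a modest number of centers
km <- kmeans(MyData, centers = 20, iter.max = 50)

# step 2: hierarchical clustering on the centers only (20 x 20 distances)
hc <- hclust(dist(km$centers), method = "ward.D2")

# step 3: cut the tree and map each observation to its center's group
center_groups <- cutree(hc, k = 2)
obs_groups <- center_groups[km$cluster]

table(obs_groups)
```

Only the distances between centers are ever computed, so memory use depends on the number of centers rather than on the number of observations.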

Calculate each element in a matrix based on elements in two other matrices

I'm trying to populate a matrix with i x j entries from a random normal distribution based on the means and standard deviations stored in two other matrices. Is there a way to use rnorm pulling each entry from the two "data" matrices (the two matrices with the means and standard deviations) without using a loop?
Sure, just do it:
means <- matrix(1:4, 2, 2)
sds <- matrix((1:4)/1000, 2, 2)
result <- matrix(rnorm(4, mean = means, sd = sds), 2, 2)
or (following the comment from Frank below)
result <- array(rnorm(length(means), mean = means, sd = sds),
dim = dim(means))
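This works because rnorm() is vectorized over mean and sd, and R stores matrices column by column, so the i-th draw uses the i-th mean and the i-th standard deviation. A small check, with an arbitrary seed, is to use tiny standard deviations and confirm that each entry lands next to its own mean:

```r
set.seed(7)
means <- matrix(1:4, 2, 2)
sds   <- matrix((1:4) / 1000, 2, 2)

result <- array(rnorm(length(means), mean = means, sd = sds),
                dim = dim(means))

# with such small sds, each entry should sit very close to its own mean
round(result - means, 3)
```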

Sampling Distribution from a data-set with one column

I want to create a sampling distribution for a mean. I have a variable x with at least ten thousand values. I want to take 500 samples (n = 10) and then show the distribution of the sample means in a histogram. I think it worked with the following, but can anyone check whether this does what I meant and tell me what the 2 within the apply function stands for?
x <- rnorm(10000, 7.5, 1.5)
draws = sample(x, size = 10 * 500, replace = TRUE)
draws = matrix(draws, 10)
drawmeans = apply(draws, 2, mean)
hist(drawmeans)
Any help would be sincerely appreciated!
You could do this using replicate if you wanted; it is one of many different ways. For a data frame df with a column Scores:
out = replicate(500, mean(sample(df$Scores,10)))
hist(out)
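On the 2 in apply(): the second argument of apply() is MARGIN, where 1 means "apply the function over rows" and 2 means "over columns". Since matrix(draws, 10) puts each sample of 10 values in its own column, MARGIN = 2 gives one mean per sample. A short sketch reusing the question's setup (the seed and the names col_means_apply and row_means_apply are illustrative):

```r
set.seed(99)
x <- rnorm(10000, 7.5, 1.5)
draws <- matrix(sample(x, size = 10 * 500, replace = TRUE), nrow = 10)

col_means_apply <- apply(draws, 2, mean)  # MARGIN = 2: one mean per column
row_means_apply <- apply(draws, 1, mean)  # MARGIN = 1: one mean per row

length(col_means_apply)  # 500, one per sample
length(row_means_apply)  # 10
all.equal(col_means_apply, colMeans(draws))
```

For plain means, colMeans(draws) gives the same values as apply(draws, 2, mean).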
