How should I specify argument "prob" when using sample() for resampling? - r

In short
I'm trying to better understand the argument prob as part of the function sample in R. In what follows, I both ask a question, and provide a piece of R code in connection with my question.
Question
Suppose I have generated 10,000 random standard rnorms. I then want to draw a sample of size 5 from this mother 10,000 standard rnorms.
How should I set the prob argument within the sample such that the probability of drawing these 5 numbers from the mother rnorm considers that the middle areas of the mother rnorm are denser but tail areas are thinner (so in drawing these 5 numbers it would draw from the denser areas more frequently than the tail areas)?
x = rnorm(1e4)
sample( x = x, size = 5, replace = TRUE, prob = ? ) ## what should be "prob" here?
# OR I leave `prob` to be the default by not using it:
sample( x = x, size = 5, replace = TRUE )

Overthinking is devil.
You want to resample these samples, following the original distribution or an empirical distribution. Think about how an empirical CDF is obtained:
plot(sort(x), 1:length(x)/length(x))
In other words, the empirical PDF is just
plot(sort(x), rep(1/length(x), length(x)))
So, we want prob = rep(1/length(x), length(x)) or simply, prob = rep(1, length(x)) as sample normalizes prob internally. Or, just leave it unspecified as equal probability is default.

Related

Distribution of mean*standard deviation of sample from gaussian

I'm trying to assess the feasibility of an instrumental variable in my project with a variable I havent seen before. The variable essentially is an interaction between the mean and standard deviation of a sample drawn from a gaussian, and im trying to see what this distribution might look like. Below is what im trying to do, any help is much appreciated.
Generate a set of 1000 individuals with a variable x following the gaussian distribution, draw 50 random samples of 5 individuals from this distribution with replacement, calculate the means and standard deviation of x for each sample, create an interaction variable named y which is calculated by multiplying the mean and standard deviation of x for each sample, plot the distribution of y.
Beginners version
There might be more efficient ways to code this, but this is easy to follow, I guess:
stat_pop <- rnorm(1000, mean = 0, sd = 1)
N = 50
# As Ben suggested, we create a data.frame filled with NA values
samples <- data.frame(mean = rep(NA, N), sd = rep(NA, N))
# Now we use a loop to populate the data.frame
for(i in 1:N){
# draw 5 samples from population (without replacement)
# I assume you want to replace for each turn of taking 5
# If you want to replace between drawing each of the 5,
# I think it should be obvious how to adapt the following code
smpl <- sample(stat_pop, size = 5, replace = FALSE)
# the data.frame currently has two columns. In each row i, we put mean and sd
samples[i, ] <- c(mean(smpl), sd(smpl))
}
# $ is used to get a certain column of the data.frame by the column name.
# Here, we create a new column y based on the existing two columns.
samples$y <- samples$mean * samples$sd
# plot a histogram
hist(samples$y)
Most functions here use positional arguments, i.e., you are not required to name every parameter. E.g., rnorm(1000, mean = 0, sd = 1) is the same as rnorm(1000, 0, 1) and even the same as rnorm(1000), since 0 and 1 are the default values.
Somewhat more efficient version
In R, loops are very inefficient and, thus, ought to be avoided. In case of your question, it does not make any noticeable difference. However, for large data sets, performance should be kept in mind. The following might be a bit harder to follow:
stat_pop <- rnorm(1000, mean = 0, sd = 1)
N = 50
n = 5
# again, I set replace = FALSE here; if you meant to replace each individual
# (so the same individual can be drawn more than once in each "draw 5"),
# set replace = TRUE
# replicate repeats the "draw 5" action N times
smpls <- replicate(N, sample(stat_pop, n, replace = FALSE))
# we transform the output and turn it into a data.frame to make it
# more convenient to work with
samples <- data.frame(t(smpls))
samples$mean <- rowMeans(samples)
samples$sd <- apply(samples[, c(1:n)], 1, sd)
samples$y <- samples$mean * samples$sd
hist(samples$y)
General note
Usually, you should do some research on the problem before posting here. Then, you either find out how it works by yourself, or you can provide an example of what you tried. To this end, you can simply google each of the steps you outlined (e.g., google "generate random standard distribution R" in order to find out about the function rnorm().
Run ?rnorm to get help on the function in RStudio.

How to compute some basic statistics in R with the density and support only?

I have no sample and I'd like to compute the variance, mean, median, and mode of a distribution which I only have a vector with it's density and a vector with it's support. Is there an easy way to compute this statistics in R with this information?
Suppose that I only have the following information:
Support
Density
sum(Density) == 1 #TRUE
length(Support)==length(Density)# TRUE
You have to do weighted summations
F.e., starting with #Johann example
set.seed(312345)
x = rnorm(1000, mean=10, sd=1)
x_support = density(x)$x
x_density = density(x)$y
plot(x_support, x_density)
mean(x)
prints
[1] 10.00558
and what, I believe, you're looking for
m = weighted.mean(x_support, x_density)
computes mean as weighted mean of values, producing output
10.0055796130192
There are weighted.sd, weighted.sum functions which should help you with other quantities you're looking for.
Plot
If you don't need a mathematical solution, and an empirical one is all right, you can achieve a pretty good approximation by sampling.
Let's generate some data:
set.seed(6854684)
x = rnorm(50,mean=10,sd=1)
x_support = density(x)$x
x_density = density(x)$y
# see our example:
plot(x_support, x_density )
# the real mean of x
mean(x)
Now to 'reverse' the process we generate a large sample from that density distribution:
x_sampled = sample(x = x_support, 1000000, replace = T, prob = x_density)
# get the statistics
mean(x_sampled)
median(x_sampled)
var(x_sampled)
etc...

What should be an Optimal value of K in K means Clustering for it to be implemented on ANY Dataset?

Like the Question speaks, I'm making a Visualization tool that is bound to work for any dataset provided. What should be the Optimal K value I should select and How?
So you can use Calinski criterion from vegan package, also your phrasing of question is little debatable. I am hoping this is what you expecting, please comment in case of otherwise.
For example, You can do:
n = 100
g = 6
set.seed(g)
d <- data.frame(
x = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))),
y = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))))
require(vegan)
fit <- cascadeKM(scale(d, center = TRUE, scale = TRUE), 1, 10, iter = 1000)
plot(fit, sortg = TRUE, grpmts.plot = TRUE)
calinski.best <- as.numeric(which.max(fit$results[2,]))
cat("Calinski criterion optimal number of clusters:", calinski.best, "\n")
This would result in value of 5, which means you can use 5 clusters, the algorithm works with the fundamentals on withiness and betweeness of k means clustering. You can also write a manual code basis on that.
From the documentation from here:
criterion: The criterion that will be used to select the best
partition. The default value is "calinski", which refers to the
Calinski-Harabasz (1974) criterion. The simple structure index ("ssi")
is also available. Other indices are available in function clustIndex
(package cclust). In our experience, the two indices that work best
and are most likely to return their maximum value at or near the
optimal number of clusters are "calinski" and "ssi".
A manual code would look like something as below:
At the first iteration since there is no SSB( Betweeness of the variance).
wss <- (nrow(d)-1)*sum(apply(d,2,var))
#TSS = WSS ##No betweeness at first observation, total variance equal to withness variance, TSS is total sum of squares, WSS is within sum of squress
for (i in 2:15) wss[i] <- sum(kmeans(d,centers=i)$withinss) #from second observation onward, since TSS would remain constant and between sum of squares will increase, correspondingly withiness would decrease.
#Plotting the same using the plot command for 15 iterations.(This is not constant, you have to decide what iterations you can do here.
plot(1:15, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares",col="mediumseagreen",pch=12)
An output of above can look like this, Here after the point at which the line become constant is the point that you have to pick for optimum cluster size, in this case it is 5 :

Sample from a custom likelihood function

I have the following likelihood function which I used in a rather complex model (in practice on a log scale):
library(plyr)
dcustom=function(x,sd,L,R){
R. = (log(R) - log(x))/sd
L. = (log(L) - log(x))/sd
ll = pnorm(R.) - pnorm(L.)
return(ll)
}
df=data.frame(Range=seq(100,500),sd=rep(0.1,401),L=200,U=400)
df=mutate(df, Likelihood = dcustom(Range, sd,L,U))
with(df,plot(Range,Likelihood,type='l'))
abline(v=200)
abline(v=400)
In this function, the sd is predetermined and L and R are "observations" (very much like the endpoints of a uniform distribution), so all 3 of them are given. The above function provides a large likelihood (1) if the model estimate x (derived parameter) is in between the L-R range, a smooth likelihood decrease (between 0 and 1) near the bounds (of which the sharpness is dependent on the sd), and 0 if it is too much outside.
This function works very well to obtain estimates of x, but now I would like to do the inverse: draw a random x from the above function. If I would do this many times, I would generate a histogram that follows the shape of the curve plotted above.
The ultimate goal is to do this in C++, but I think it would be easier for me if I could first figure out how to do this in R.
There's some useful information online that helps me start (http://matlabtricks.com/post-44/generate-random-numbers-with-a-given-distribution, https://stats.stackexchange.com/questions/88697/sample-from-a-custom-continuous-distribution-in-r) but I'm still not entirely sure how to do it and how to code it.
I presume (not sure at all!) the steps are:
transform likelihood function into probability distribution
calculate the cumulative distribution function
inverse transform sampling
Is this correct and if so, how do I code this? Thank you.
One idea might be to use the Metropolis Hasting Algorithm to obtain a sample from the distribution given all the other parameters and your likelihood.
# metropolis hasting algorithm
set.seed(2018)
n_sample <- 100000
posterior_sample <- rep(NA, n_sample)
x <- 300 # starting value: I chose 300 based on your likelihood plot
for (i in 1:n_sample){
lik <- dcustom(x = x, sd = 0.1, L = 200, R =400)
# propose a value for x (you can adjust the stepsize with the sd)
x.proposed <- x + rnorm(1, 0, sd = 20)
lik.proposed <- dcustom(x = x.proposed, sd = 0.1, L = 200, R = 400)
r <- lik.proposed/lik # this is the acceptance ratio
# accept new value with probablity of ratio
if (runif(1) < r) {
x <- x.proposed
posterior_sample[i] <- x
}
}
# plotting the density
approximate_distr <- na.omit(posterior_sample)
d <- density(approximate_distr)
plot(d, main = "Sample from distribution")
abline(v=200)
abline(v=400)
# If you now want to sample just a few values (for example, 5) you could use
sample(approximate_distr,5)
#[1] 281.7310 371.2317 378.0504 342.5199 412.3302

How to vectorise sampling from non-identically distributed Bernoulli random variables?

Given a sequence of independent but not identically distributed Bernoulli trials with success probabilities given by a vector, e.g.:
x <- seq(0, 50, 0.1)
prob <- - x*(x - 50)/1000 # trial probabilities for trials 1 to 501
What is the most efficient way to obtain a random variate from each trial? I am assuming that vectorisation is the way to go.
I know of two functions that give Bernoulli random variates:
rbernoulli from the package purr, which does not accept a vector of success probabilities as an input. In this case it may be possible to wrap the function in an apply type operation.
rbinom with arguments size = 1 gives Bernoulli random variates. It also accepts a vector of probabilities, so that:
rbinom(n = length(prob), size = 1, prob = prob)
gives an output with the right length. However, I am not entirely sure that this is actually what I want. The bits in the helpfile ?rbinom that seem relevant are:
The length of the result is determined by n for rbinom, and is the
maximum of the lengths of the numerical arguments for the other
functions.
The numerical arguments other than n are recycled to the length of the
result. Only the first elements of the logical arguments are used.
However, n is a parameter with no default, so I am not sure what the first sentence means. I presume the second sentence means that I get what I want, since only size = 1 should be recycled. However this thread seems to suggest that this method does not work.
This blog post gives some other methods as well. One commentator mentions my suggested idea using rbinom.
Another way to test that rbinom is vectorised for prob, taking advantage of the fact that the sum of N bernoulli random variables is a binomial random variable with denominator N:
x <- seq(0, 50, 0.1)
prob <- -x*(x - 50)/1000
n <- rbinom(prob, size=1000, prob)
par(mfrow=c(1, 2))
plot(prob ~ x)
plot(n ~ x)
If you don't trust random strangers on the internet and do not understand documentation, maybe you can convince yourself by testing. Just set the random seed to get reproducible results:
x <- seq(0, 50, 0.1)
prob <- - x*(x - 50)/1000
#501 seperate draws of 1 random number
set.seed(42)
res1 <- sapply(prob, rbinom, n = 1, size = 1)
#501 "simultaneous" (vectorized) draws
set.seed(42)
res2 <- rbinom(501, 1, prob)
identical(res1, res2)
#[1] TRUE

Resources