Geometric distribution with general random variable - r

I got this exercise for my homework in the "Statistical Theory" course.
We were asked to simulate a geometric distribution for a random variable; so far this is my code, with the graph immediately after it.
Define a discrete random variable starting from the Uniform(0,1)
distribution. Simulate for n=1000 and plot the distribution of its
mean as a function of n, and the PMF. Add a horizontal line for the
theoretical mean (find it analytically, write your solution in TeX;
you may use a known formula for this distribution).
Geometric(p): choose the p parameter randomly from U(0,1), while writing
your code for a general p. Please avoid "magic numbers" inside the
code. The writing should be strictly parametric.
My question is: how can I get a better and more accurate result? My goal is for the blue line to converge as closely as possible to the theoretical expected value (mean).
library(glue)

p = runif(1)      # choose p randomly; keep the rest of the code general in p
n = 1000
real_avg = 1 / p  # theoretical mean of Geometric(p): E[X] = 1/p
U = runif(n)      # Uniform(0,1) draws feeding the inverse-CDF transform
cum_sum = 0
avg = numeric()
for (i in 1:n) {
  # inverse-CDF transform: X = ceiling(log(U) / log(1 - p)) ~ Geometric(p)
  cum_sum = cum_sum + ceiling(log(U[i]) / log(1 - p))
  avg = c(avg, cum_sum / i)  # running mean after i experiments
}
plot(1:n, avg, type = "l", lwd = 2, col = "blue",
     ylab = glue("Observed Mean for p={round(p, digits = 4)}"),
     xlab = "Number of Experiments")
abline(h = real_avg, col = "red")
print(glue("p={round(p, 4)}"))
print(glue("E[X]={1/p}"))

Related

Difference between prop.table() & dnorm()

Could someone explain why the following two plots yield different results:
prop.table(table(S)) [where 'S' is the Random variable...representing Roulette wheel outcomes in this case]
dnorm([a list of values over the range of S], mean(S), sd(S))
Here is my code snippet:
# Frequency Plot of Random Variable (S)
plot(prop.table(table(S)), xlab = "Net Profit", ylab = "Probability", type = "h")
base <- seq(min(S),max(S),length = B)
pdf = data.frame(profit = base, probability = dnorm(base,avg,sd))
lines(pdf)
I can't upload pictures of my plot because of inadequate reputation
However, the 'line-plot' peak is about half of the 'prop.table(table(S))' plot
Could you clarify my understanding?
prop.table(table(S)) gives us the probability of a value occurring (as given by the value's frequency of occurrence)
dnorm(value,mean,std) gives us the probability of a value occurring (as given by the normal distribution )
If both are the probability of the same thing, shouldn't the peaks overlap, as shown in the video?
Thanks in advance :D
Update:
Here is the exact code I'm using:
set.seed(1)
plays <- 1000
B <- 10000
#Monte Carlo Sim for Roulette Wheel
S <- replicate(B, {  # S because Random Variable
  sum(sample(c(-1, 1), plays, replace = TRUE, prob = c(18/38, 20/38)))
  # -1 -> casino loses the bet ; 1 -> casino wins the bet
})
avg = mean(S); sd = sd(S)
# Frequency Plot of Random Variable of R. Wheel outcome
plot(prop.table(table(S)), xlab = "Net Profit", ylab = "Probability", type = "h")
base <- seq(min(S),max(S),length = B)
pdf = data.frame(profit = base, probability = dnorm(base,avg,sd))
lines(pdf)
A probability density is not a probability. It is a probability per unit of something.
Your sample, S, is only ever going to be divisible by 2, since each outcome is either -1 or 1 and there is an even number of plays. When you tabulate, you'll notice this. prop.table then returns the proportions, i.e. the probabilities, of those values (-2, 0, 2, 4, 6, ...). These are discrete values, not continuous.
dnorm returns the density for a given normal distribution. So if you want to use dnorm to emulate a probability, you need to multiply it by the per-unit width. In this case that is 2: the spacing between adjacent possible values of S, which is also the width of the histogram bars.
pdf2 = data.frame(profit = base, probability = dnorm(base,avg,sd) * 2)
lines(pdf2, col="blue", lwd=2)
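As a quick sanity check of that rescaling, you can compare the tabulated probabilities with the rescaled densities at the values S actually takes (a sketch that assumes S, avg and sd from the simulation above):

# Compare empirical probabilities with density * spacing (spacing = 2 here)
support <- as.numeric(names(table(S)))          # values S actually takes
empirical <- as.numeric(prop.table(table(S)))   # observed probabilities
approx_prob <- dnorm(support, avg, sd) * 2      # normal density times spacing
head(cbind(support, empirical, approx_prob))    # the two columns should be close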

What should be an Optimal value of K in K means Clustering for it to be implemented on ANY Dataset?

As the question says, I'm making a visualization tool that should work for any dataset provided. What is the optimal K value I should select, and how?
You can use the Calinski criterion from the vegan package. Note that the phrasing of your question is a little debatable, since there is no single K that is optimal for every dataset; it has to be estimated from the data at hand. I hope this is what you are expecting; please comment if otherwise.
For example, you can do:
n = 100
g = 6
set.seed(g)
d <- data.frame(
  x = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1) * i^2))),
  y = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1) * i^2))))
require(vegan)
fit <- cascadeKM(scale(d, center = TRUE, scale = TRUE), 1, 10, iter = 1000)
plot(fit, sortg = TRUE, grpmts.plot = TRUE)
calinski.best <- as.numeric(which.max(fit$results[2,]))
cat("Calinski criterion optimal number of clusters:", calinski.best, "\n")
This results in a value of 5, which means you can use 5 clusters. The criterion is built on the within-cluster and between-cluster variance that k-means works with, and you can also write manual code based on the same idea.
From the documentation:
criterion: The criterion that will be used to select the best
partition. The default value is "calinski", which refers to the
Calinski-Harabasz (1974) criterion. The simple structure index ("ssi")
is also available. Other indices are available in function clustIndex
(package cclust). In our experience, the two indices that work best
and are most likely to return their maximum value at or near the
optimal number of clusters are "calinski" and "ssi".
Manual code would look something like the following.
At the first step (k = 1) there is no between-cluster sum of squares (SSB), so the total variance equals the within-cluster variance:
# TSS = WSS at k = 1: with a single cluster there is no between-group variance,
# so the total sum of squares (TSS) equals the within sum of squares (WSS).
wss <- (nrow(d) - 1) * sum(apply(d, 2, var))
# From k = 2 onward TSS stays constant while the between sum of squares grows,
# so the within sum of squares decreases as k increases.
for (i in 2:15) wss[i] <- sum(kmeans(d, centers = i)$withinss)
# Plot WSS for k = 1..15 (the range of k to try is your choice, not a constant).
plot(1:15, wss, type = "b", xlab = "Number of Clusters",
     ylab = "Within groups sum of squares", col = "mediumseagreen", pch = 12)
The output of the above is an elbow plot: the point after which the line flattens out is the cluster size to pick for the optimum, in this case 5.
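If you would rather pick that flattening point programmatically than by eye, one rough heuristic (my own illustration, with a hypothetical 10% threshold, not part of the answer above) is to look at where the relative drop in WSS becomes small:

# Rough elbow heuristic: pick the smallest k after which adding one more
# cluster reduces WSS by less than the (hypothetical) 10% threshold.
rel_drop <- -diff(wss) / head(wss, -1)   # relative WSS reduction going from k to k+1
k_elbow <- which(rel_drop < 0.10)[1]     # first k where the extra cluster barely helps
cat("Elbow heuristic suggests about", k_elbow, "clusters\n")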

Method of Moments for Gamma distribution- histogram and superimposing the PDF

I have this question: 'Model the data in nfsold (nfsold is just a vector containing 150 numbers) as a set of 150 independent observations from a Gamma(lambda, k) distribution. Use the Method of Moments to obtain estimates of k and lambda. Draw a histogram of the data and superimpose the PDF of your fitted gamma distribution as a preliminary check that this distribution matches the observed data.'
This is the code I have written.
# The first moment of each Xi, i = 1,...,n, is E(Xi) = k/lambda.
# The second moment of each Xi is E(Xi^2) = k(k+1)/lambda^2.
# Since we have to estimate two things, k and lambda, we require two moments.
x_bar = mean = sum(nfsold)/150 #This is the first moment
mean
second_moment = sum(nfsold^2)/150
second_moment
# (1/n) * sum(x_i)   = k/lambda
# (1/n) * sum(x_i^2) = k(k+1)/lambda^2
# Solving this pair of moment equations gives the method-of-moments estimates:
# lambda_hat = x_bar / (m2 - x_bar^2),  k_hat = x_bar^2 / (m2 - x_bar^2)
lamda_hat = (x_bar)/((second_moment)-(x_bar)^2)
lamda_hat
k_hat = (x_bar)^2/ ((second_moment)-(x_bar)^2)
k_hat
independent_observations = dgamma(x,k_hat, rate = lamda_hat)
hist( independent_observations, breaks = 15, prob = TRUE, main="Histogram for the Gamma Distribution of the data in nfsold", xlab="Independent Observations", ylab="P.D.F")
curve(dgamma(x,k_hat, rate =lamda_hat), add=TRUE, col="green")
My problem is that my superimposed curve does not follow my histogram, so I feel like there is something wrong with my code. Could I please have some help with correcting it?
Thanks!
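For reference, the preliminary check the exercise describes is usually drawn from the raw data rather than from density values; a minimal sketch of that overlay (assuming nfsold and the k_hat and lamda_hat estimates computed above) would be:

# Histogram of the raw data on the density scale, with the fitted Gamma PDF
# superimposed (uses k_hat and lamda_hat from the method-of-moments step above)
hist(nfsold, breaks = 15, prob = TRUE,
     main = "nfsold with fitted Gamma density",
     xlab = "nfsold", ylab = "Density")
curve(dgamma(x, shape = k_hat, rate = lamda_hat), add = TRUE, col = "green")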

Sample from a custom likelihood function

I have the following likelihood function which I used in a rather complex model (in practice on a log scale):
library(plyr)
dcustom = function(x, sd, L, R){
  R. = (log(R) - log(x)) / sd
  L. = (log(L) - log(x)) / sd
  ll = pnorm(R.) - pnorm(L.)
  return(ll)
}
df=data.frame(Range=seq(100,500),sd=rep(0.1,401),L=200,U=400)
df=mutate(df, Likelihood = dcustom(Range, sd,L,U))
with(df,plot(Range,Likelihood,type='l'))
abline(v=200)
abline(v=400)
In this function, the sd is predetermined and L and R are "observations" (very much like the endpoints of a uniform distribution), so all three of them are given. The function returns a large likelihood (1) if the model estimate x (a derived parameter) is within the L-R range, a smooth decrease (between 0 and 1) near the bounds (whose sharpness depends on the sd), and 0 if x is far outside.
This function works very well to obtain estimates of x, but now I would like to do the inverse: draw a random x from the above function. If I would do this many times, I would generate a histogram that follows the shape of the curve plotted above.
The ultimate goal is to do this in C++, but I think it would be easier for me if I could first figure out how to do this in R.
There's some useful information online that helps me start (http://matlabtricks.com/post-44/generate-random-numbers-with-a-given-distribution, https://stats.stackexchange.com/questions/88697/sample-from-a-custom-continuous-distribution-in-r) but I'm still not entirely sure how to do it and how to code it.
I presume (not sure at all!) the steps are:
transform likelihood function into probability distribution
calculate the cumulative distribution function
inverse transform sampling
Is this correct and if so, how do I code this? Thank you.
One idea might be to use the Metropolis-Hastings algorithm to obtain a sample from the distribution, given all the other parameters and your likelihood.
# Metropolis-Hastings algorithm
set.seed(2018)
n_sample <- 100000
posterior_sample <- rep(NA, n_sample)
x <- 300  # starting value: I chose 300 based on your likelihood plot
for (i in 1:n_sample){
  lik <- dcustom(x = x, sd = 0.1, L = 200, R = 400)
  # propose a value for x (you can adjust the step size with the sd)
  x.proposed <- x + rnorm(1, 0, sd = 20)
  lik.proposed <- dcustom(x = x.proposed, sd = 0.1, L = 200, R = 400)
  r <- lik.proposed / lik  # the acceptance ratio (the proposal is symmetric)
  # accept the new value with probability min(1, r)
  if (runif(1) < r) {
    x <- x.proposed
  }
  posterior_sample[i] <- x  # record the current state, also when the proposal is rejected
}
# plotting the density
approximate_distr <- na.omit(posterior_sample)
d <- density(approximate_distr)
plot(d, main = "Sample from distribution")
abline(v=200)
abline(v=400)
# If you now want to sample just a few values (for example, 5) you could use
sample(approximate_distr,5)
#[1] 281.7310 371.2317 378.0504 342.5199 412.3302
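The inverse-transform route listed in the question (normalise the likelihood into a density, build the CDF, then invert it) can also be done numerically on a grid. A minimal sketch, assuming the same dcustom function and the sd = 0.1, L = 200, R = 400 values used above:

# Numerical inverse-transform sampling from the normalised likelihood
step <- 0.1
grid <- seq(50, 600, by = step)                  # grid covering the support
dens <- dcustom(grid, sd = 0.1, L = 200, R = 400)
dens <- dens / sum(dens * step)                  # normalise so it integrates to 1
cdf <- cumsum(dens * step)                       # numerical CDF on the grid
keep <- c(TRUE, diff(cdf) > 0)                   # drop flat stretches so the CDF is invertible
inv_cdf <- approxfun(cdf[keep], grid[keep])      # interpolated inverse CDF
draws <- inv_cdf(runif(10000))                   # inverse-transform samples
hist(draws, breaks = 50, main = "Inverse-transform sample", xlab = "x")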

Given a random variable with probability density function f(x), how to compute the expected value of this random variable in R?

Given a random variable with probability density function f(x), how to compute the expected value of this random variable in R?
If you want to compute the expected value, just compute:
E(X) = integral of x*f(x) dx over the whole domain of X.
The integration can easily be done using the function integrate().
Say you have the standard normal density function (you can easily define your own density function):
f <- function(x){
  1 / sqrt(2 * pi) * exp((-1/2) * x^2)
}
You calculate the expected value simply by:
f2 <- function(x){x*f(x)}
integrate(f2,-Inf,Inf )
Note that sometimes you need to wrap your function with Vectorize(): integrate() expects an integrand that accepts a vector of values. For more info, see the help pages of integrate() and Vectorize().
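For instance (a hypothetical density written in a scalar-only style, just to illustrate the Vectorize() point):

# A density written with scalar-only control flow errors inside integrate(),
# but wrapping it with Vectorize() fixes that.
f_scalar <- function(x) if (x < 0) 0 else exp(-x)  # Exp(1) density, one x at a time
f_vec <- Vectorize(f_scalar)
integrate(function(x) x * f_vec(x), 0, Inf)         # expected value, approximately 1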
Does it help to know that the expectation E is the integral of x*f(x) dx for x in (-inf, inf)?
You could also use the inverse transform sampling method. All you need is the cumulative distribution function F(x) of your random variable X. It uses the fact that, when X has pdf f(x), the random variable U = F(X) is uniform, so X = F^-1(U). This means you can sample from a uniform variable and then transform it through F^-1 to get a sample from X. You can then take the mean of your sample.
Here is an example for the exponential distribution with parameter lambda = 5, mean = 1/5, F(x) = 1 - exp(-lambda * x) and F^-1(u) = -log(1 - u) / lambda.
sample_exp = function(n, lambda = 5){
  u = runif(n)
  y = -log(1 - u) / lambda   # inverse-CDF transform
  mean(y)
}
n = seq(10, 4000, 10)
res = sapply(n, sample_exp)
plot(n, res, type = "l", xlab = "sample size",
     ylab = "Estimated mean", main = "True mean = 0.2")
Below is a plot of the estimated mean as a function of the sample size:
