Difference between prop.table() & dnorm() - r

Could someone explain why the following two plots yield different results:
prop.table(table(S)) [where S is the random variable, here representing roulette-wheel outcomes]
dnorm([a sequence of values over the range of S], mean(S), sd(S))
Here is my code snippet:
# Frequency plot of the random variable S
plot(prop.table(table(S)), xlab = "Net Profit", ylab = "Probability", type = "h")
base <- seq(min(S),max(S),length = B)
pdf = data.frame(profit = base, probability = dnorm(base,avg,sd))
lines(pdf)
I can't upload pictures of my plot because of inadequate reputation
However, the peak of the line plot is about half the height of the prop.table(table(S)) plot.
Could you clear up my understanding?
prop.table(table(S)) gives us the probability of each value occurring (as given by the value's frequency of occurrence)
dnorm(value, mean, sd) gives us the probability of a value occurring (as given by the normal distribution)
If both are probabilities of the same thing, shouldn't the peaks overlap, as shown in the video?
Thanks in advance :D
Update:
Here is the exact code I'm using:
set.seed(1)
plays <- 1000
B <- 10000
#Monte Carlo Sim for Roulette Wheel
S <- replicate(B,{ # S is the random variable (the casino's net profit)
sum(sample(c(-1,1), plays, replace = TRUE, prob = c(18/38,20/38)))
# -1 -> casino loses the bet ; 1 -> casino wins the bet
})
avg = mean(S); sd = sd(S)
# Frequency Plot of Random Variable of R. Wheel outcome
plot(prop.table(table(S)), xlab = "Net Profit", ylab = "Probability", type = "h")
base <- seq(min(S),max(S),length = B)
pdf = data.frame(profit = base, probability = dnorm(base,avg,sd))
lines(pdf)

A probability density is not a probability. It is a probability per unit of something.
Your sample S can only take values spaced 2 apart: each play changes the total by +1 or -1, so with 1000 plays the net profit is always even. You'll notice this when you tabulate. prop.table then returns the proportions, i.e. probabilities, of those discrete values (..., -2, 0, 2, 4, 6, ...). These are probabilities of discrete values, not a continuous density.
dnorm returns the density of the given normal distribution. So if you want to use dnorm to approximate a probability, you need to multiply it by the width of one unit of support; in this case 2, the spacing (and width) of the histogram bars.
pdf2 = data.frame(profit = base, probability = dnorm(base,avg,sd) * 2)
lines(pdf2, col="blue", lwd=2)
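To see the scaling concretely, here is a quick check (a sketch, assuming S, avg and sd from the simulation above): the tabulated probabilities sum to 1, while the density only matches them after being multiplied by the spacing of 2.
# Quick check, assuming S, avg and sd exist from the simulation above
support <- as.numeric(names(table(S)))            # observed net profits, spaced 2 apart
discrete_prob <- as.numeric(prop.table(table(S))) # probability of each outcome
approx_prob <- dnorm(support, avg, sd) * 2        # density times the spacing of 2
head(cbind(support, discrete_prob, approx_prob))  # the two columns should roughly agree
sum(discrete_prob)                                # exactly 1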

Related

Geometric distribution with general random variable

I got this exercise for my homework in the "Statistical Theory" course.
We were asked to generate a random variable with a geometric distribution; so far this is my code, with the graph immediately after it.
Define a discrete random variable starting from the Uniform(0,1) distribution. Simulate for n = 1000 and plot the distribution of its mean as a function of n, and the PMF. Add a horizontal line for the theoretical mean (find it analytically, write your solution in TeX; you may use a formula known for this distribution).
Geometric(p): choose the p parameter randomly from U(0,1) while writing your code for a general p. Please avoid "magic numbers" inside the code. The writing should be strictly parametric.
My question is: how can I get a better, more accurate result? My goal is for the blue line (the running mean) to converge as closely as possible to the theoretical expected value (mean).
library(glue)
p = runif(1) # choose a random p
n = 1000
U = runif(n) # uniform draws fed to the inverse-CDF transform below
real_avg = 1/p
cum_sum = 0
avg = numeric()
for (i in 1:n) {
  cum_sum = cum_sum + ceiling(log(U[i],10)/log(1-p,10)) # one Geometric(p) draw via the inverse CDF (the log base cancels)
  avg = c(avg, cum_sum / i) # running mean after i experiments
}
plot(1:n, avg, type = "l", lwd = 2, col = "blue", ylab = glue("Observed mean for p={round(p,digits=4)}"),
xlab = "Number of Experiments")
abline(h=real_avg,col="red")
print(glue("p={round(p,4)}"))
print(glue("E[X]={1/p}"))

Set weights for ewcdf {spatstat} [R]

I want to compare a reference distribution d_1 with a sample d_2 drawn proportionally to size w_1 using the Kolmogorov–Smirnov distance.
Given that d_2 is weighted, I was considering accounting for this using the Weighted Empirical Cumulative Distribution Function in R (using ewcdf {spatstat}).
The example below shows that I am probably misspecifying the weights, because when length(d_1) == length(d_2) the Kolmogorov–Smirnov distance is not 0.
Can someone help me with this? For clarity, see the reproducible example below.
#loop for testing sample sizes 1:length(d_1)
d_stat <- data.frame(1:1000, rep(NA, 1000))
names(d_stat) <- c("sample_size", "ks_distance")
for (i in 1:1000) {
#reference distribution
d_1 <- rpois(1000, 500)
w_1 <- d_1/sum(d_1)
m_1 <- data.frame(d_1, w_1)
#sample from the reference distribution
m_2 <-m_1[(sample(nrow(m_1), size=i, prob=w_1, replace=F)),]
d_2 <- m_2$d_1
w_2 <- m_2$w_1
#ewcdf for the reference distribution and the sample
f_d_1 <- ewcdf(d_1)
f_d_2 <- ewcdf(d_2, 1/w_2, normalise=F, adjust=1/length(d_2))
#kolmogorov-smirnov distance
d_stat[i,2] <- max(abs(f_d_1(d_2) - f_d_2(d_2)))
}
d_stat[1000,2]
Your code generates some data d1 and associated numeric weights w1. These data are then treated as a reference population. The code takes a random sample d2 from this population of values d1, with sampling probabilities proportional to the associated weights w1. From the sample, you compute the weighted empirical distribution function f_d_2 of the sampled values d2, with weights inversely proportional to the original sampling probabilities. This function f_d_2 is a correct estimate of the original population distribution function, by the Horvitz-Thompson principle. But it's not exactly equal to the original population distribution, because it's a sample. The Kolmogorov-Smirnov test statistic should not be zero; it should be a small value.
I don’t quite understand what you are trying to do here.
Why would you expect ewcdf(d_1) and ewcdf(d_2, w_2, normalise=F) to give
the same result for i=1000? The first one is the usual ecdf which jumps at
the unique values of the input vector with a jump size determined by the
number of times the value is repeated (more ties – larger jumps). The second
one jumps at the same unique values with a height determined by the sum of
the weights you have provided.
What does give identical results is ewcdf(d_2, w_2) and
ewcdf(d_1, w_1), but this is not the same as ewcdf(d_1).
To understand why the latter two are different, I would suggest a much
smaller handmade example with a couple of ties:
library(spatstat)
#> Loading required package: spatstat.data
#> Loading required package: nlme
#> Loading required package: rpart
#>
#> spatstat 1.60-1.006 (nickname: 'See Above')
#> For an introduction to spatstat, type 'beginner'
x <- c(1,2,3,3,4)
e <- ewcdf(x)
This is the usual ecdf which jumps with value 1/5 at x=1, 1/5 at x=2, 2*1/5 at
x=3 and 1/5 at x=4:
plot(e)
Now you define the weights as:
w <- x/sum(x)
w
#> [1] 0.07692308 0.15384615 0.23076923 0.23076923 0.30769231
Thus the ewcdf will jump with value 1/13 at x=1, 2/13 at x=2, 2*3/13 at
x=3 and 4/13 at x=4 (with the usual ecdf overlaid in red):
plot(ewcdf(x, w, normalise = FALSE), axes = FALSE)
axis(1)
axis(2, at = (0:13)/13, labels = c("0", paste(1:13, 13, sep = "/")), las = 2 )
abline(h = cumsum(c(1,2,6,4)/13), lty = 3, col = "gray")
plot(e, add = TRUE, col = "red")
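A quick numerical check of the earlier claim (a sketch, assuming d_1, w_1, d_2 and w_2 as left over from the question's loop at i = 1000, where d_2 is simply a permutation of d_1 and w_2 the matching permutation of w_1):
# Sketch: the two weighted ecdfs agree, but differ from the plain ecdf
v <- sort(unique(d_1))
all.equal(ewcdf(d_1, w_1)(v), ewcdf(d_2, w_2)(v))  # TRUE
max(abs(ewcdf(d_1, w_1)(v) - ewcdf(d_1)(v)))       # > 0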

Method of Moments for Gamma distribution- histogram and superimposing the PDF

I have this question: 'Model the data in nfsold (nfsold is just a vector containing 150 numbers) as a set of 150 independent observations from a Gamma(lambda; k) distribution. Use the method of moments to obtain estimates of k and lambda. Draw a histogram of the data and superimpose the PDF of your fitted gamma distribution as a preliminary check that this distribution matches the observed data.'
This is the code I have written.
#The first moment of each Xi, i = 1,...,n, is E(Xi) = k/lambda.
#The second moment of each Xi is E(Xi^2) = k(k+1)/lambda^2.
#Since we have to estimate two things, k and lambda, we need two moments.
x_bar = mean = sum(nfsold)/150 #This is the first moment
mean
second_moment = sum(nfsold^2)/150
second_moment
#(1/n)(sum xi) = k/lambda
#(1/n)(sum xi^2) = k(k+1)/lambda^2
#Solving these two moment equations (method of moments) gives lambda and k.
lamda_hat = (x_bar)/((second_moment)-(x_bar)^2)
lamda_hat
k_hat = (x_bar)^2/ ((second_moment)-(x_bar)^2)
k_hat
independent_observations = dgamma(x,k_hat, rate = lamda_hat)
hist( independent_observations, breaks = 15, prob = TRUE, main="Histogram for the Gamma Distribution of the data in nfsold", xlab="Independent Observations", ylab="P.D.F")
curve(dgamma(x,k_hat, rate =lamda_hat), add=TRUE, col="green")
My problem is that my superimposed curve does not follow my histogram, so I feel like there is something wrong with my code; please could I have some help with correcting it?
Thanks!
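For what it's worth, here is a minimal sketch of the intended plot (assuming nfsold, k_hat and lamda_hat as defined above): the histogram should be drawn from the data themselves, and the fitted density curve is then superimposed on it.
# Sketch: histogram of the data with the fitted Gamma density on top
hist(nfsold, breaks = 15, prob = TRUE,
     main = "Histogram of nfsold with fitted Gamma density",
     xlab = "nfsold", ylab = "Density")
curve(dgamma(x, shape = k_hat, rate = lamda_hat), add = TRUE, col = "green")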

How should I specify argument "prob" when using sample() for resampling?

In short
I'm trying to better understand the argument prob as part of the function sample in R. In what follows, I both ask a question, and provide a piece of R code in connection with my question.
Question
Suppose I have generated 10,000 standard normal draws with rnorm. I then want to draw a sample of size 5 from this parent set of 10,000 values.
How should I set the prob argument of sample so that the probability of drawing these 5 numbers reflects the fact that the middle of the parent distribution is denser and the tails are thinner (so the draw comes from the denser regions more often than from the tails)?
x = rnorm(1e4)
sample( x = x, size = 5, replace = TRUE, prob = ? ) ## what should be "prob" here?
# OR I leave `prob` to be the default by not using it:
sample( x = x, size = 5, replace = TRUE )
Overthinking is the devil.
You want to resample these samples, following the original distribution or an empirical distribution. Think about how an empirical CDF is obtained:
plot(sort(x), 1:length(x)/length(x))
In other words, the empirical distribution puts equal probability mass 1/length(x) on each sampled point:
plot(sort(x), rep(1/length(x), length(x)))
So we want prob = rep(1/length(x), length(x)), or simply prob = rep(1, length(x)), since sample normalises prob internally. Or just leave it unspecified, as equal probabilities are the default.
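A small sketch to convince yourself (a fresh x is drawn here just to keep it self-contained; the seed is arbitrary): resampling with the default equal probabilities already reproduces the shape of the parent sample, because the dense middle of x simply contains more points to pick from.
# Sketch: default (equal-probability) resampling follows the empirical distribution
set.seed(1)                                    # arbitrary seed, for reproducibility
x <- rnorm(1e4)
res <- sample(x, size = 1e4, replace = TRUE)   # prob left unspecified
qqplot(x, res)                                 # points hug the identity line
abline(0, 1)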

Plot a ROC curve in R without using any packages

Hi, I am pretty new to programming in R and I am having trouble plotting a ROC curve without using any packages.
I generated my data using:
library(PearsonDS) # assumed source of rpearsonIII(); it is not in base R
d=rpearsonIII(100,0.5,360,20)
nd=rnorm(100,450,25)
I need a vector with values < 400 for d and > 400 for nd, so I did:
spec = (cumsum(nd[nd>400])/sum(nd))*100
sens = (cumsum(d[d<400])/sum(nd))*100
and then I plotted it like this:
plot(1-spec,sens)
but the plot was nothing like I expected it to be.
Edit:
Thanks to the advice given my code looks like this now:
sc2 = c(rnorm(50,450,25),rpearsonIII(50,0.5,360,20))
scF = sc2 < 395
thresholds <- sort(sc2)
pos <- sum(scF);pos
neg <- sum(!scF);neg
tn <- cumsum(!scF);tn
spec <- tn/neg;spec
tp <- pos - cumsum(scF);tp
sens <- tp/pos;sens
plot(1 - spec, sens, type = "l", col = "red",
ylab = "Sensitivity", xlab = "1 - Specificity")
abline(c(0,0),c(1,1))
The plotted ROC curve looks like this:
[ROC curve plot]
My problem now is that if I change the order of the generated data (rnorm and rpearsonIII), the curve is reversed.
I don't know what rpearsonIII is, so I am just going to make some sample random data, with the actual classes in actuals and the scores for the predictions in scores.
set.seed(100)
actuals <- sample(c(TRUE,FALSE), 100, replace = TRUE)
scores <- runif(100,-1,1)
The long version with explanation
If in your data the actuals are strings or factors rather than logicals, you will need to convert them to logicals using:
actuals <- actuals == "positiveClass"
Next we want to order the instances based on their scores. We can do this using:
actuals <- actuals[order(scores)]
If you want to keep track of the thresholds for the sensitivities and specificity, you can keep them aligned using:
thresholds <- sort(scores)
Now we need to get our sensitivities and specificities. Sensitivity is TP/P and specificity is TN/N. Getting the total number of positives P is easy: since our actuals are logical, we can just use sum(actuals). Similarly, we can get our negatives N using sum(!actuals).
pos <- sum(actuals)
neg <- sum(!actuals)
First let's get our true negatives at each threshold. That is pretty easy: it is just the number of FALSE values at or below each threshold. Since our data are in order by threshold, we can calculate that (and the specificity) using:
tn <- cumsum(!actuals)
spec <- tn/neg
The number of true positives is slightly harder because we are looking for the number of positives greater than the threshold, so cumsum alone won't work. However, since the number above the threshold equals the total minus the number at or below the threshold, we can get our true positives using:
tp <- pos - cumsum(actuals)
sens <- tp/pos
Now all we need to do is plot the two.
plot(1 - spec, sens, type = "l", col = "red",
ylab = "Sensitivity", xlab = "1 - Specificity")
abline(c(0,0),c(1,1))
To get the AUC of the curve, we need to add up the area under it: a height multiplied by a width at each threshold. Because our points run from high sensitivity to low sensitivity as the threshold increases, it is easiest to take the heights as the specificities and the widths as the differences between adjacent values of 1 - sensitivity (this gives the same area as sensitivity against 1 - specificity). The diff function gives us those differences; we just need to put a 0 value at the beginning to get the width of the first column.
width <- diff(c(0, 1 - sens))
auc <- sum(spec*width)
The minimal code version:
actuals <- actuals[order(scores)]
sens <- (sum(actuals) - cumsum(actuals))/sum(actuals)
spec <- cumsum(!actuals)/sum(!actuals)
plot(1 - spec, sens, type = "l", col = "red",
ylab = "Sensitivity", xlab = "1 - Specificity")
abline(c(0,0),c(1,1))
(auc <- sum(spec*diff(c(0, 1 - sens))))
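As an aside on the follow-up about the reversed curve, here is a sketch (assuming sc2 and scF from the edit above): the labels must be reordered by the scores, as in the recipe, and the recipe assumes that larger scores indicate the positive class. Here the positive class (scF) sits at the low end of sc2, so order by the negated scores, i.e. treat -sc2 as the score; otherwise the curve ends up below the diagonal.
# Sketch: applying the minimal recipe to the asker's sc2/scF, using -sc2 as the score
actuals <- scF[order(-sc2)]
sens <- (sum(actuals) - cumsum(actuals))/sum(actuals)
spec <- cumsum(!actuals)/sum(!actuals)
plot(1 - spec, sens, type = "l", col = "red",
     ylab = "Sensitivity", xlab = "1 - Specificity")
abline(c(0,0), c(1,1))
(auc <- sum(spec*diff(c(0, 1 - sens))))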
