I have the following task:
Assume the population of interest can be modeled by a Bernoulli distribution with
p = 0.5.
For each sample size n simulate r = 5,000 draws (by using a for loop over (i in
1:r)) from that Bernoulli distribution with p = 0.5 and calculate the standardized
sample mean for each draw.
The last histogram looks good with a curve, but the 1st and 2nd are wrong. Maybe someone can help me with this. Thanks in advance for your time!
I have done the following:
set.seed(2005)
x1 <- rbinom(5000,3,0.5)
par(mfrow=c(2,2))
hist(x=x1,
main=expression(paste(" Random Variables with",size,"=1 and",prob,"=0.5")),
sub="Standardized value of sample average",
xlab="n=3", ylab="Probability", probability = TRUE)
curve(dnorm(x, mean = mean(x), sd=sd(x)), add = TRUE, col="blue")
Essentially, what happened in the first two panels is that for small n the default histogram breaks were a poor fit for the handful of distinct values in the data. You can fix that by letting the breaks depend on the data range. Here, I chose the breaks depending on whether the range of the data is smaller than 10: if it is, the breaks are computed manually at integer steps; otherwise the default "Sturges" algorithm is used.
par(mfrow=c(2,2))
N <- c(2, 5, 25, 100)
for (i in seq_along(N)) {
set.seed(2015 + i)
n <- N[i]
xx <- rbinom(10000, n, 0.78)
if (diff(range(xx)) < 10) {
breaks <- seq(floor(min(xx)), ceiling(max(xx)))
} else {
breaks <- "Sturges"
}
hist(
x = xx, breaks = breaks,
main=expression(paste("Bernoulli Random Variables with",size,"=1 and",prob,"=0.78")),
sub = "Standardized value of sample average",
xlab = paste0("n=",n), ylab = "Probability", probability = TRUE
)
curve(dnorm(x, mean = mean(xx), sd=sd(xx)), add = TRUE, col="blue")
}
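The task itself asks for the standardized sample mean, computed in a for loop over 1:r, which neither snippet above does directly. A minimal sketch of that reading follows. The sample sizes in N are an assumption (the task text does not list them), and the breaks are chosen so that one bar is centred on each attainable value of the standardized mean.

```r
set.seed(2005)
p <- 0.5                 # Bernoulli success probability from the task
r <- 5000                # number of replications from the task
N <- c(2, 5, 25, 100)    # assumed sample sizes, not given in the task

par(mfrow = c(2, 2))
for (n in N) {
  z <- numeric(r)
  for (i in 1:r) {
    draws <- rbinom(n, size = 1, prob = p)              # n Bernoulli(p) draws
    z[i] <- (mean(draws) - p) / sqrt(p * (1 - p) / n)   # standardized sample mean
  }
  # The number of successes k runs over 0..n, so the standardized mean can
  # only take the values (k - n*p) / sqrt(n*p*(1-p)). Put one bar on each.
  breaks <- ((0:(n + 1)) - 0.5 - n * p) / sqrt(n * p * (1 - p))
  hist(z, breaks = breaks, probability = TRUE,
       main = paste0("n = ", n), xlab = "Standardized sample mean")
  curve(dnorm(x), add = TRUE, col = "blue")             # standard normal reference
}
```

Because the values are standardized, the reference curve is the standard normal in every panel, so there is no need to estimate a mean and standard deviation from the sample.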
Created on 2021-01-07 by the reprex package (v0.3.0)
Related
For an assignment I was asked this:
For the values of
(shape=5,rate=1),(shape=50,rate=10),(shape=.5,rate=.1), plot the
histogram of a random sample of size 10000. Use a density rather than
a frequency histogram so that you can add in a line for the population
density (hint: you will use both rgamma and dgamma to make this plot).
Add an abline for the population and sample mean. Also, add a subtitle
that reports the population variance as well as the sample variance.
My current code looks like this:
library(ggplot2)
set.seed(1234)
x = seq(1, 1000)
s = 5
r = 1
plot(x, dgamma(x, shape = s, rate = r), rgamma(x, shape = s, rate = r), sub =
paste0("Shape = ", s, "Rate = ", r), type = "l", ylab = "Density", xlab = "", main =
"Gamma Distribution of N = 1000")
After running it I get this error:
Error in plot.window(...) : invalid 'xlim' value
What am I doing incorrectly?
plot() does not take y1 and y2 arguments. See ?plot. You need to do a plot (or histogram) of one y variable (e.g., from rgamma), then add the second y variable (e.g., from dgamma) using something like lines().
Here's one way to get what you want:
#specify parameters
s = 5
r = 1
# plot histogram of random draws
set.seed(1234)
N = 1000
hist(rgamma(N, shape=s, rate=r), breaks=100, freq=FALSE)
# add true density curve
x = seq(from=0, to=20, by=0.1)
lines(x=x, y=dgamma(x, shape=s, rate=r))
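The assignment also asks for lines at the population and sample means and a subtitle with both variances. A sketch of how that could be added for one parameter pair is below; it relies on the standard Gamma facts that the mean is shape/rate and the variance is shape/rate^2. The variable names, colours and line types are illustrative choices.

```r
set.seed(1234)
s <- 5       # shape
r <- 1       # rate
N <- 10000   # sample size requested in the assignment

draws <- rgamma(N, shape = s, rate = r)
pop_mean <- s / r      # population mean of Gamma(shape, rate)
pop_var  <- s / r^2    # population variance of Gamma(shape, rate)

hist(draws, breaks = 100, freq = FALSE,
     main = paste0("Gamma(shape = ", s, ", rate = ", r, ")"),
     sub = sprintf("Population variance = %.3f, sample variance = %.3f",
                   pop_var, var(draws)),
     xlab = "")
curve(dgamma(x, shape = s, rate = r), add = TRUE, col = "blue")  # population density
abline(v = pop_mean, col = "blue", lty = 2)                      # population mean
abline(v = mean(draws), col = "red", lty = 3)                    # sample mean
```

The other two parameter pairs can be handled by changing s and r and rerunning the block.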
I am trying to compare the power functions of the chi-square test and the t-test for one particular value. My overall goal is to show that the t-test is more powerful (because it makes an assumption about the distribution). I used the pwr package for R to calculate the power of each test, then wrote two functions and plotted the results.
However, I do not find that the t-test is better than the chi-square test, and I am puzzled by the result. I have spent hours on it, so any help is much appreciated.
Is the code wrong, do I misunderstand the power functions, or is there something wrong in the package?
library(pwr)
#mu is the value for which the power is calculated
#no is the number of observations
#function of the power of the t-test with a h0 of .2
g <- function(mu, alpha, no) { #calculate the power of a particular value for the t-test with h0=.2
p <- mu-.20
sigma <- sqrt(.5*(1-.5))
pwr.t.test(n = no, d = p/sigma, sig.level = alpha, type = "one.sample", alternative="greater")$power # d is the effect size p/sigma
}
#chi squared test
h <- function(mu, alpha, no, degree) {#calculate the power of a particular value for the chi squared test
p01 <- .2 # these constructs the effect size (which is a bit different for the chi squared)
p02 <- .8
p11 <-mu
p12 <- 1-p11
effect.size <- sqrt(((p01-p11)^2/p01)+((p02-p12)^2/p02)) # effect size
pwr.chisq.test(N=no, df=degree, sig.level = alpha, w=effect.size)$power
}
#create a diagram
plot(1, 1, type = "n",
xlab = expression(mu),
xlim = c(.00, .75),
ylim = c(0, 1.1),
ylab = expression(1-beta),
axes=T, main="Power function t-Test and Chi-squared-Test")
axis(side = 2, at = c(0.05), labels = c(expression(alpha)), las = 3)
axis(side = 1, at = .2, labels = expression(mu[0]))
abline(h = c(0.05, 1), lty = 2)
legend(.5,.5, # places a legend at the appropriate place
c("t-Test","Chi-square-Test"), # puts text in the legend
lwd=c(2.5,2.5),col=c("black","red"))
curve(h(x, alpha = 0.05, no = 100, degree=1), from = .00, to = .75, add = TRUE, col="red",lwd=c(2.5,2.5) )
curve(g(x, alpha = 0.05, no = 100), from = .00, to = .75, add = TRUE, lwd=c(2.5,2.5))
Thanks a lot in advance!
If I understand the problem correctly, you are testing a binomial proportion with mean 0.2 under the null, against the alternative that the mean is greater than 0.2? If so, then on line 2 of your function g, shouldn't it be sigma <- sqrt(.2*(1-.2)) instead of sigma <- sqrt(.5*(1-.5))? That way, your standard deviation will be smaller, resulting in a larger test statistic and hence a smaller p-value, leading to higher power.
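To see how much the choice of sigma matters, here is a small sketch that compares the two standard deviations at a single alternative. It uses base R's power.t.test as a stand-in for pwr.t.test so that it runs without extra packages. The alternative mu = 0.3 is an arbitrary illustrative value.

```r
mu0   <- 0.2    # mean under the null, from the question
mu    <- 0.3    # illustrative alternative (assumption)
n     <- 100    # number of observations, as in the question's plot
alpha <- 0.05

# One-sample, one-sided t-test power for a given standard deviation
power_with <- function(sigma) {
  power.t.test(n = n, delta = mu - mu0, sd = sigma, sig.level = alpha,
               type = "one.sample", alternative = "one.sided")$power
}

power_half <- power_with(sqrt(0.5 * (1 - 0.5)))  # sigma used in the question
power_null <- power_with(sqrt(0.2 * (1 - 0.2)))  # sigma under the null p = 0.2

c(sigma_0.5 = power_half, sigma_0.4 = power_null)
```

The smaller standard deviation gives a larger standardized effect size for the same mu - mu0, so the second power is the higher of the two.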
My GAM curves are being shifted downwards. Is there something wrong with the intercept? I'm using the same code as An Introduction to Statistical Learning. Any help is appreciated.
Here's the code. I simulated some data (a straight line with noise), and fit GAM multiple times using bootstrap.
(It took me a while to figure out how to plot multiple GAM fits in one graph, thanks to Sam's answer in this post and to this post.)
library(gam)
N = 1e2
set.seed(123)
dat = data.frame(x = 1:N,
y = seq(0, 5, length = N) + rnorm(N, mean = 0, sd = 2))
plot(dat$x, dat$y, xlim = c(1,100), ylim = c(-5,10))
gamFit = vector('list', 5)
for (ii in 1:5){
ind = sample(1:N, N, replace = T) #bootstrap
gamFit[[ii]] = gam(y ~ s(x, 10), data = dat, subset = ind)
par(new=T)
plot(gamFit[[ii]], col = 'blue',
xlim = c(1,100), ylim = c(-5,10),
axes = F, xlab='', ylab='')
}
The issue is with plot.gam. If you take a look at the help page (?plot.gam), there is a parameter called scale, which states:
a lower limit for the number of units covered by the limits on the ‘y’ for each plot. The default is scale=0, in which case each plot uses the range of the functions being plotted to create their ylim. By setting scale to be the maximum value of diff(ylim) for all the plots, then all subsequent plots will produced in the same vertical units. This is essential for comparing the importance of fitted terms in additive models.
This is an issue because the ylim you pass (-5 to 10) is not the range of the function being plotted. So what you need to do is change
plot(gamFit[[ii]], col = 'blue',
xlim = c(1,100), ylim = c(-5,10),
axes = F, xlab='', ylab='')
to
plot(gamFit[[ii]], col = 'blue',
scale = 15,
axes = F, xlab='', ylab='')
Or you can just remove the xlim and ylim parameters from both calls to plot, and the automatic setting of plot to use the full range of the data will make everything work.
I am trying to get density estimates for the log of stock prices in R. I know I can plot it using plot(density(x)). However, I actually want values for the function.
I'm trying to implement the kernel density estimation formula. Here's what I have so far:
a <- read.csv("boi_new.csv", header=FALSE)
S = a[,3] # takes column of increments in stock prices
dS=S[!is.na(S)] # omits first empty field
N = length(dS) # Sample size
rseed = 0 # Random seed
x = rep(c(1:5),N/5) # Inputted data
set.seed(rseed) # Sets random seed for reproducibility
QL <- function(dS){
h = density(dS)$bandwidth
r = log(dS^2)
f = 0*x
for(i in 1:N){
f[i] = 1/(N*h) * sum(dnorm((x-r[i])/h))
}
return(f)
}
QL(dS)
Any help would be much appreciated. Been at this for days!
You can pull the values directly from the density function:
x = rnorm(100)
d = density(x, from=-5, to = 5, n = 1000)
d$x
d$y
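Since d$x is a fixed grid, you may also want the estimate at arbitrary points. One way, sketched here with base R's approxfun, is to interpolate linearly between the grid values; the name dens_at and the query points are illustrative.

```r
set.seed(1)
x <- rnorm(100)
d <- density(x, from = -5, to = 5, n = 1000)

# Linear interpolation of the estimated density on the grid d$x
dens_at <- approxfun(d$x, d$y)

vals <- dens_at(c(-1, 0, 1.5))   # density estimates at three query points
vals
dens_at(6)                       # NA: outside the grid from -5 to 5
```

Points outside the from/to range return NA, so choose that range wide enough for the values you need.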
Alternatively, if you really want to write your own kernel density function, here's some code to get you started:
Set the points z and x range:
z = c(-2, -1, 2)
x = seq(-5, 5, 0.01)
Now we'll add the points to a graph
plot(0, 0, xlim=c(-5, 5), ylim=c(-0.02, 0.8),
pch=NA, ylab="", xlab="z")
for(i in 1:length(z)) {
points(z[i], 0, pch="X", col=2)
}
abline(h=0)
Put a normal density around each point and sum them:
## Now we combine the kernels,
x_total = numeric(length(x))
for(i in 1:length(x_total)) {
for(j in 1:length(z)) {
x_total[i] = x_total[i] +
dnorm(x[i], z[j], sd=1)
}
}
and add the curves to the plot:
lines(x, x_total, col=4, lty=2)
Finally, calculate the complete estimate:
## Just as a histogram is the sum of the boxes,
## the kernel density estimate is just the sum of the bumps.
## All that's left to do, is ensure that the estimate has the
## correct area, i.e. in this case we divide by $n=3$:
plot(x, x_total/3,
xlim=c(-5, 5), ylim=c(-0.02, 0.8),
ylab="", xlab="z", type="l")
abline(h=0)
This corresponds to
density(z, adjust=1, bw=1)
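The double loop can also be written in vectorized form. The sketch below wraps it in a hypothetical helper kde() with an explicit bandwidth h and compares it with density() on the same grid. The helper name and the grid settings are assumptions for illustration.

```r
z <- c(-2, -1, 2)   # data points from the example above
h <- 1              # bandwidth (standard deviation of each normal kernel)

# Gaussian kernel density estimate at the points x:
# f(x) = 1/(n*h) * sum_j phi((x - z_j) / h)
kde <- function(x, data, h) {
  rowMeans(dnorm(outer(x, data, "-"), sd = h))
}

d <- density(z, bw = h, from = -5, to = 5, n = 512)
manual <- kde(d$x, z, h)

max_diff <- max(abs(manual - d$y))   # discrepancy from density()'s binned approximation
max_diff

# Approximate area under the estimate on [-5, 5]
area <- sum(manual) * diff(d$x[1:2])
area
```

density() uses a binned approximation internally, so the two curves agree closely rather than exactly.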
I am using the following code to create a standard normal distribution in R:
x <- seq(-4, 4, length=200)
y <- dnorm(x, mean=0, sd=1)
plot(x, y, type="l", lwd=2)
I need the x-axis to be labeled at the mean and at points three standard deviations above and below the mean. How can I add these labels?
The easiest (but not general) way is to restrict the limits of the x axis. The +/- 1:3 sigma will be labeled as such, and the mean will be labeled as 0 - indicating 0 deviations from the mean.
plot(x,y, type = "l", lwd = 2, xlim = c(-3.5,3.5))
Another option is to use more specific labels:
plot(x,y, type = "l", lwd = 2, axes = FALSE, xlab = "", ylab = "")
axis(1, at = -3:3, labels = c("-3s", "-2s", "-1s", "mean", "1s", "2s", "3s"))
Using the code in this answer, you could skip creating x and just use curve() on the dnorm function:
curve(dnorm, -3.5, 3.5, lwd=2, axes = FALSE, xlab = "", ylab = "")
axis(1, at = -3:3, labels = c("-3s", "-2s", "-1s", "mean", "1s", "2s", "3s"))
But this doesn't use the given code anymore.
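If the distribution is not standard normal, the same idea works with tick positions computed from the mean and standard deviation. A sketch follows, with mu = 100 and s = 15 chosen purely as illustrative values:

```r
mu <- 100   # illustrative mean (assumption)
s  <- 15    # illustrative standard deviation (assumption)

x <- seq(mu - 4 * s, mu + 4 * s, length.out = 200)
y <- dnorm(x, mean = mu, sd = s)

# Suppress the default x axis, then label the mean and +/- 1, 2, 3 sd
plot(x, y, type = "l", lwd = 2, xaxt = "n", xlab = "", ylab = "Density")
ticks <- mu + (-3:3) * s
axis(1, at = ticks, labels = ticks)
```

Using xaxt = "n" keeps the y axis and box, unlike axes = FALSE, which removes both axes.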
If you like doing things the hard way, without R's built-in function, or you want to do this outside R, you can use the following formula.
x<-seq(-4,4,length=200)
s = 1
mu = 0
y <- (1/(s * sqrt(2*pi))) * exp(-((x-mu)^2)/(2*s^2))
plot(x,y, type="l", lwd=2, col = "blue", xlim = c(-3.5,3.5))
An extremely inefficient and unusual, but beautiful, solution based on the ideas of Monte Carlo simulation is this:
simulate many draws (or samples) from a given distribution (say the normal) using rnorm. The rnorm function takes arguments (A, B, C) and returns a vector of A samples from a normal distribution centered at B, with standard deviation C.
plot the density of these draws using density().
Thus, to take a sample of size 50,000 from a standard normal (i.e., a normal with mean 0 and standard deviation 1) and plot its density, we do the following:
x = rnorm(50000,0,1)
plot(density(x))
As the number of draws goes to infinity, the estimated density converges to the normal density. To illustrate this, see the image below, which shows, from left to right and top to bottom, 5,000, 50,000, 500,000, and 5 million samples.
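A sketch of how such a comparison could be produced is below. To keep it quick it stops at 500,000 draws (an assumption, the 5 million panel is omitted), and it overlays the exact density from dnorm so the gap is visible.

```r
set.seed(42)
sizes <- c(5e3, 5e4, 5e5)   # scaled-down sample sizes (assumption, for speed)

par(mfrow = c(1, 3))
max_err <- numeric(length(sizes))
for (k in seq_along(sizes)) {
  d <- density(rnorm(sizes[k]), from = -4, to = 4)
  plot(d, main = paste(format(sizes[k], big.mark = ",", scientific = FALSE), "draws"))
  curve(dnorm(x), add = TRUE, col = "red", lty = 2)   # exact standard normal density
  max_err[k] <- max(abs(d$y - dnorm(d$x)))            # largest gap on the grid
}
max_err
```

The vector max_err records the largest absolute difference between the estimate and the true density for each sample size.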
In the general case, for example Normal(2, 1):
f <- function(x) dnorm(x, 2, 1)
plot(f, -1, 5)
This is very general: f can be defined freely, with any given parameters, for example:
f <- function(x) dbeta(x, 0.1, 0.1)
plot(f, 0, 1)
I particularly like lattice for this purpose. It makes it easy to add graphical information such as shaded areas under a curve, which you usually need in probability problems such as finding P(a < X < b).
Please have a look:
library(lattice)
e4a <- seq(-4, 4, length = 10000) # Data to set up our normal
e4b <- dnorm(e4a, 0, 1)
xyplot(e4b ~ e4a, # Lattice xyplot
type = "l",
main = "Plot 2",
panel = function(x,y, ...){
panel.xyplot(x,y, ...)
panel.abline( v = c(0, 1, 1.5), lty = 2) #set z and lines
xx <- c(1, x[x>=1 & x<=1.5], 1.5) #Color area
yy <- c(0, y[x>=1 & x<=1.5], 0)
panel.polygon(xx,yy, ..., col='red')
})
In this example I make the area between z = 1 and z = 1.5 stand out. You can easily change these parameters to suit your problem.
Axis labels are automatic.
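If you would rather stay in base graphics, the same shading can be sketched with polygon(), and pnorm() gives the probability that the shaded area represents. The bounds a and b match the example above; the variable names are illustrative.

```r
a <- 1     # lower bound of the shaded region
b <- 1.5   # upper bound of the shaded region

x <- seq(-4, 4, length.out = 1000)
y <- dnorm(x)
plot(x, y, type = "l", lwd = 2, ylab = "Density")

# Close the polygon along the x axis so that only the area under the curve is filled
inside <- x >= a & x <= b
polygon(c(a, x[inside], b), c(0, y[inside], 0), col = "red", border = NA)

prob <- pnorm(b) - pnorm(a)   # P(a < Z < b) for a standard normal Z
prob
```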
This is how to write it in functions:
normalCriticalTest <- function(mu, s) {
x <- seq(-4, 4, length=200) # x extends from -4 to 4
# y follows the formula of the normal density f(x)
y <- (1/(s * sqrt(2*pi))) * exp(-((x-mu)^2)/(2*s^2))
plot(x,y, type="l", lwd=2, xlim = c(-3.5,3.5)) # draw the curve
# critical values of the standard normal: 2.5% of the area lies in each tail
abline(v = c(-1.96, 1.96), col="red")
}
normalCriticalTest(0, 1) # draw a standard normal distribution with vertical lines.