R: prob = TRUE in hist() doesn't give a density distribution

earthquake <- function(lambda = 1, n_sim = 10, n = 100) {
  meanls <- c()
  for (i in 1:n) {
    meanls <- c(meanls, round(mean(rexp(n_sim, 1/lambda)), 2))
  }
  return(meanls)
}
xbar <- earthquake(2.4, 1000, 40)
hist(xbar, prob = TRUE, col = "moccasin", las = 1)
I have the code above, and since I set probability to TRUE it should produce a density histogram, but I just get a frequency diagram. Is there anything else I should do with the data?

If you set a random seed you can replicate your results; otherwise you will need to adjust your xlim= according to your data. You do not say why you are using sd=2.4/sqrt(40) as the standard deviation instead of sd(xbar), which is what I have used here. That produces a very broad, flat curve that does not match the data at all: with earthquake(2.4, 1000, 40) each mean in xbar is computed from n_sim = 1000 draws, so the theoretical standard deviation of the means is 2.4/sqrt(1000) ≈ 0.076, far smaller than 2.4/sqrt(40) ≈ 0.38. If you wanted the standard error curve for the 40 means, that would be sd(xbar)/sqrt(40).
set.seed(42)
xbar <- earthquake(2.4, 1000, 40)
range(xbar)
# [1] 2.19 2.59
hist(xbar, prob = TRUE, xlim = c(2.1, 2.7), col = "moccasin", las = 1)
curve(dnorm(x, mean = mean(xbar), sd = sd(xbar)), col = "blue", add = TRUE, lwd = 2)
lines(density(xbar), col = "red", lwd = 2)
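As a quick sanity check on the point about the standard deviation, you can compare the empirical standard deviation of the simulated means with the two candidate values discussed above:
sd(xbar)          # empirical sd of the 40 simulated means
2.4 / sqrt(1000)  # theoretical sd of the mean of n_sim = 1000 Exp(rate = 1/2.4) draws
2.4 / sqrt(40)    # the value used in the question; far too wide for these data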

Related

Bernoulli random variables in R and the central limit theorem

I have the following task:
Assume the population of interest can be modeled by a Bernoulli distribution with p = 0.5.
For each sample size n, simulate r = 5,000 draws (using a for loop over (i in 1:r)) from that Bernoulli distribution with p = 0.5 and calculate the standardized sample mean for each draw.
The last histogram looks good with a curve, but the 1st and 2nd are wrong. Maybe someone can help me with this. Thanks in advance for your time!
I have done the following:
set.seed(2005)
x1 <- rbinom(5000, 3, 0.5)
par(mfrow = c(2, 2))
hist(x = x1,
     main = expression(paste("Random Variables with ", size, " = 1 and ", prob, " = 0.5")),
     sub = "Standardized value of sample average",
     xlab = "n=3", ylab = "Probability", probability = TRUE)
curve(dnorm(x, mean = mean(x1), sd = sd(x1)), add = TRUE, col = "blue")  # use x1, not the curve's own x
Essentially, what happened in the first two panels is that for small n the histogram breaks were calculated in an ungraceful manner. You can fix that by letting the breaks depend on the data range. Here, I choose the breaks depending on whether the range of the data is smaller than 10: if it is, the breaks are calculated manually; otherwise the default "Sturges" algorithm is used.
par(mfrow = c(2, 2))
N <- c(2, 5, 25, 100)
for (i in seq_along(N)) {
  set.seed(2015 + i)
  n <- N[i]
  xx <- rbinom(10000, n, 0.78)
  if (diff(range(xx)) < 10) {
    breaks <- seq(floor(min(xx)), ceiling(max(xx)))
  } else {
    breaks <- "Sturges"
  }
  hist(
    x = xx, breaks = breaks,
    main = expression(paste("Bernoulli Random Variables with ", size, " = 1 and ", prob, " = 0.78")),
    sub = "Standardized value of sample average",
    xlab = paste0("n=", n), ylab = "Probability", probability = TRUE
  )
  curve(dnorm(x, mean = mean(xx), sd = sd(xx)), add = TRUE, col = "blue")
}
Created on 2021-01-07 by the reprex package (v0.3.0)
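If you also want the standardized sample means that the task asks for, rather than histograms of the raw counts, here is a minimal sketch assuming p = 0.5, r = 5,000 and a single sample size n (the seed and n = 3 are arbitrary choices):
set.seed(2005)
p <- 0.5; r <- 5000; n <- 3
z <- replicate(r, (mean(rbinom(n, 1, p)) - p) / sqrt(p * (1 - p) / n))
hist(z, probability = TRUE, main = paste0("Standardized sample mean, n = ", n))
curve(dnorm(x), add = TRUE, col = "blue")  # standard normal reference curve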

Overlay a normal curve on a histogram

I repeat rnorm 50 times with n=100, mean=100 and sd=25. Then I plot the histogram of all the sample means, but now I need to overlay a normal curve on the histogram.
x <- replicate(50, rnorm(100, 100, 25), simplify = FALSE)
x
sapply(x, mean)
sapply(x, sd)
hist(sapply(x, mean))
Do you know how to overlay a normal curve over the histogram of the means?
Thanks
When we plot the density rather than the frequency histogram by setting freq=FALSE, we can overlay the curve of a normal distribution with the mean and standard deviation of the means. For the xlim of the curve we use the range of the means.
mean.of.means <- mean(sapply(x, mean))
sd.of.means <- sd(sapply(x, mean))  # sd of the sample means, about 25/sqrt(100) = 2.5
r <- range(sapply(x, mean))
v <- hist(sapply(x, mean), freq=FALSE, xlim=r, ylim=c(0, .5))
curve(dnorm(x, mean=mean.of.means, sd=sd.of.means), r[1], r[2], add=TRUE, col="red")
It is also possible to draw a sufficiently large sample from a normal distribution and overlay the histogram with the lines of its density estimate.
lines(density(rnorm(1e6, mean.of.means, sd.of.means)))
Note that I have used 500 mean values in my answer, since the comparison with a normal distribution may become meaningless with too few values. You can also play with the breaks= option in the histogram function (see the sketch after the data below).
Data
set.seed(42)
x <- replicate(500, rnorm(100, 100, 25), simplify = FALSE)
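For example, playing with the breaks= option mentioned above (20 bins is just an arbitrary choice):
hist(sapply(x, mean), freq=FALSE, breaks=20, xlim=r)
curve(dnorm(x, mean=mean.of.means, sd=sd.of.means), add=TRUE, col="red")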

Create sample vector data in R with a skewed distribution and a limited range

I want to create a sample vector of data in R in which I can control the range of values selected, so I think I want to use sample to limit the range of values generated, rather than an rnorm-type command that generates values based on the type of distribution, variance, SD, etc.
So I'm looking to draw a sample with a specified range (e.g. 1-5) from a skewed distribution, something like this:
x=rexp(100,1/10)
Here's what I have, but it does not provide a skewed distribution:
y=sample(1:5,234, replace=T)
How can I have my cake (limited range) and eat it too (skewed distribution), so to speak.
Thanks
set.seed(3)
hist(sample(1:10, size = 100, replace = TRUE, prob = 10:1))
The beta distribution takes values from 0 to 1. If you want your values to be from 0 to 5, for instance, you can multiply them by 5. You can also control the skewness through the beta distribution's shape parameters.
For example, you can get three types of skewness: negative (left-skewed), positive (right-skewed), and none (symmetric). Using R and the beta distribution you can get such distributions as follows; notice that the green vertical line marks the mean and the red one the median:
x= rbeta(10000,5,2)
hist(x, main="Negative or Left Skewness", freq=FALSE)
lines(density(x), col='red', lwd=3)
abline(v = c(mean(x),median(x)), col=c("green", "red"), lty=c(2,2), lwd=c(3, 3))
x= rbeta(10000,2,5)
hist(x, main="Positive or Right Skewness", freq=FALSE)
lines(density(x), col='red', lwd=3)
abline(v = c(mean(x),median(x)), col=c("green", "red"), lty=c(2,2), lwd=c(3, 3))
x= rbeta(10000,5,5)
hist(x, main="Symmetrical", freq=FALSE)
lines(density(x), col='red', lwd=3)
abline(v = c(mean(x),median(x)), col=c("green", "red"), lty=c(2,2), lwd=c(3, 3))
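To map the beta draws onto the question's 1-5 range, you can rescale them. A minimal sketch, where the shape parameters 2 and 5 are just one right-skewed choice and 234 matches the sample size in the question:
set.seed(1)
y <- 1 + 4*rbeta(234, 2, 5)  # rbeta() gives values in (0, 1); rescale to (1, 5)
range(y)
hist(y, freq=FALSE, main="Right-skewed sample bounded between 1 and 5")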
To better see what the sample function is doing with integers, use the barplot function, not the histogram function:
set.seed(3)
barplot(table(sample(1:10, size = 100, replace = TRUE, prob = 10:1)))

Getting values from kernel density estimation in R

I am trying to get density estimates for the log of stock prices in R. I know I can plot it using plot(density(x)). However, I actually want values for the function.
I'm trying to implement the kernel density estimation formula. Here's what I have so far:
a <- read.csv("boi_new.csv", header=FALSE)
S = a[,3] # takes column of increments in stock prices
dS=S[!is.na(S)] # omits first empty field
N = length(dS) # Sample size
rseed = 0 # Random seed
x = rep(c(1:5),N/5) # Inputted data
set.seed(rseed) # Sets random seed for reproducibility
QL <- function(dS){
h = density(dS)$bandwidth
r = log(dS^2)
f = 0*x
for(i in 1:N){
f[i] = 1/(N*h) * sum(dnorm((x-r[i])/h))
}
return(f)
}
QL(dS)
Any help would be much appreciated. Been at this for days!
You can pull the values directly from the density function:
x = rnorm(100)
d = density(x, from=-5, to = 5, n = 1000)
d$x
d$y
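If you need the estimate at specific points rather than on the grid that density() returns, you can interpolate the d object from above, for example with approx() (the query points 0 and 1.5 are arbitrary):
approx(d$x, d$y, xout = c(0, 1.5))$y  # linearly interpolated density values at 0 and 1.5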
Alternatively, if you really want to write your own kernel density function, here's some code to get you started:
Set the points z and x range:
z = c(-2, -1, 2)
x = seq(-5, 5, 0.01)
Now we'll add the points to a graph
plot(0, 0, xlim=c(-5, 5), ylim=c(-0.02, 0.8),
     pch=NA, ylab="", xlab="z")
for (i in 1:length(z)) {
  points(z[i], 0, pch="X", col=2)
}
abline(h=0)
Put normal densities around each point:
## Now we combine the kernels
x_total <- numeric(length(x))
for (i in 1:length(x_total)) {
  for (j in 1:length(z)) {
    x_total[i] <- x_total[i] + dnorm(x[i], z[j], sd=1)
  }
}
and add the curves to the plot:
lines(x, x_total, col=4, lty=2)
Finally, calculate the complete estimate:
## Just as a histogram is the sum of the boxes,
## the kernel density estimate is just the sum of the bumps.
## All that's left to do is ensure that the estimate has the
## correct area, i.e. in this case we divide by n = 3:
plot(x, x_total/3,
     xlim=c(-5, 5), ylim=c(-0.02, 0.8),
     ylab="", xlab="z", type="l")
abline(h=0)
This corresponds to
density(z, adjust=1, bw=1)
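A quick way to check that correspondence, assuming the z, x and x_total objects from the sketch above:
d <- density(z, bw = 1, from = -5, to = 5, n = length(x))
max(abs(d$y - x_total/3))  # close to zero; density() uses an FFT approximation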

Plot weighted frequency matrix

This question is related to two different questions I have asked previously:
1) Reproduce frequency matrix plot
2) Add 95% confidence limits to cumulative plot
I wish to reproduce this plot in R:
I have got this far, using the code below:
# Set the number of bets and number of trials and % lines
numbet <- 36
numtri <- 1000
# Fill a matrix where the rows are the cumulative bets and the columns are the trials
xcum <- matrix(NA, nrow=numbet, ncol=numtri)
for (i in 1:numtri) {
  x <- sample(c(0,1), numbet, prob=c(5/6,1/6), replace = TRUE)
  xcum[,i] <- cumsum(x)/(1:numbet)
}
# Plot the trials as transparent lines so you can see the build up
matplot(xcum, type="l", xlab="Number of Trials", ylab="Relative Frequency",
        main="", col=rgb(0.01, 0.01, 0.01, 0.02), las=1)
My question is: How can I reproduce the top plot in one pass, without plotting multiple samples?
Thanks.
You can produce such a plot by using this code:
boring <- function(x, occ) occ/x
boring_seq <- function(occ, length.out){
  x <- seq(occ, length.out=length.out)
  data.frame(x = x, y = boring(x, occ))
}
numbet <- 31
odds <- 6
plot(1, 0, type="n",
     xlim=c(1, numbet + odds), ylim=c(0, 1),
     yaxp=c(0, 1, 2),
     main="Frequency matrix",
     xlab="Successive occasions",
     ylab="Relative frequency")
axis(2, at=c(0, 0.5, 1))
for (i in 1:odds) {
  xy <- boring_seq(i, numbet + 1)
  lines(xy$x, xy$y, type="o", cex=0.5)
}
for (i in 1:numbet) {
  xy <- boring_seq(i, odds + 1)
  lines(xy$x, 1 - xy$y, type="o", cex=0.5)
}
You can also use Koshke's method by limiting the combinations of values to those with s < 6; at Andrie's request I added a condition on the difference of ps$n and ps$s to get a "pointed" configuration.
library(plyr)  # for ldply()
ps <- ldply(0:35, function(i) data.frame(s=0:i, n=i))
plot.new()
plot.window(c(0, 36), c(0, 1))
apply(ps[ps$s < 6 & ps$n - ps$s < 30, ], 1, function(x){
  s <- x[1]; n <- x[2]
  lines(c(n, n+1, n, n+1), c(s/n, s/(n+1), s/n, (s+1)/(n+1)), type="o")
})
axis(1)
axis(2)
lines(6:36, 6/(6:36), type="o")
# need to fill in the unconnected points on the upper frontier
A Weighted Frequency Matrix is also called a Position Weight Matrix (in bioinformatics). It can be represented in the form of a sequence logo. This is at least how I plot a weighted frequency matrix.
library(cosmo)
data(motifPWM); attributes(motifPWM) # Loads a sample position weight matrix (PWM) containing 8 positions.
plot(motifPWM) # Plots the PWM as sequence logo.
