Getting values from kernel density estimation in R - r

I am trying to get density estimates for the log of stock prices in R. I know I can plot it using plot(density(x)). However, I actually want values for the function.
I'm trying to implement the kernel density estimation formula. Here's what I have so far:
a <- read.csv("boi_new.csv", header=FALSE)
S = a[,3] # takes column of increments in stock prices
dS=S[!is.na(S)] # omits first empty field
N = length(dS) # Sample size
rseed = 0 # Random seed
x = rep(c(1:5),N/5) # Inputted data
set.seed(rseed) # Sets random seed for reproducibility
QL <- function(dS){
h = density(dS)$bandwidth
r = log(dS^2)
f = 0*x
for(i in 1:N){
f[i] = 1/(N*h) * sum(dnorm((x-r[i])/h))
}
return(f)
}
QL(dS)
Any help would be much appreciated. Been at this for days!

You can pull the values directly from the density function:
x = rnorm(100)
d = density(x, from=-5, to = 5, n = 1000)
d$x
d$y
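If you need the estimate at arbitrary points rather than only on the grid stored in d$x, one option is to interpolate the grid. The sketch below uses base R's approxfun; the name f_hat and the seed are my own choices for illustration, not part of the question.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
x <- rnorm(100)
d <- density(x, from = -5, to = 5, n = 1000)

# approxfun() returns a function that linearly interpolates (d$x, d$y)
f_hat <- approxfun(d$x, d$y)

f_hat(0)              # estimated density at 0
f_hat(c(-1, 0.5, 2))  # vectorised over several points
f_hat(10)             # NA: outside the range the density was computed on
```

The interpolated values are only as fine-grained as the grid, so increase n in density() if you need more resolution.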
Alternatively, if you really want to write your own kernel density function, here's some code to get you started:
Set the points z and x range:
z = c(-2, -1, 2)
x = seq(-5, 5, 0.01)
Now we'll add the points to a graph
plot(0, 0, xlim=c(-5, 5), ylim=c(-0.02, 0.8),
pch=NA, ylab="", xlab="z")
for(i in 1:length(z)) {
points(z[i], 0, pch="X", col=2)
}
abline(h=0)
Put a normal density around each point:
## Now we combine the kernels,
x_total = numeric(length(x))
for(i in 1:length(x_total)) {
for(j in 1:length(z)) {
x_total[i] = x_total[i] +
dnorm(x[i], z[j], sd=1)
}
}
and add the curves to the plot:
lines(x, x_total, col=4, lty=2)
Finally, calculate the complete estimate:
## Just as a histogram is the sum of the boxes,
## the kernel density estimate is just the sum of the bumps.
## All that's left to do, is ensure that the estimate has the
## correct area, i.e. in this case we divide by $n=3$:
plot(x, x_total/3,
xlim=c(-5, 5), ylim=c(-0.02, 0.8),
ylab="", xlab="z", type="l")
abline(h=0)
This corresponds to
density(z, adjust=1, bw=1)
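The double loop above can also be written as a small vectorised function that returns the estimate at any set of points. This is only a sketch: kde_at is a hypothetical helper name, and it assumes a Gaussian kernel with a fixed bandwidth h.

```r
# Gaussian kernel density estimate evaluated at arbitrary points.
# kde_at() is a hypothetical helper, not a base R function.
kde_at <- function(x_eval, z, h) {
  # For each evaluation point, average the normal kernels centred on the data:
  # mean(dnorm(xi, z, h)) equals 1/(n*h) * sum(dnorm((xi - z)/h))
  sapply(x_eval, function(xi) mean(dnorm(xi, mean = z, sd = h)))
}

z <- c(-2, -1, 2)
x <- seq(-5, 5, 0.01)
est <- kde_at(x, z, h = 1)

# Compare with the built-in estimate on its own grid
d <- density(z, bw = 1, from = -5, to = 5)
max(abs(kde_at(d$x, z, h = 1) - d$y))  # should be small
```

density() bins the data and uses a fast Fourier transform, so the two results agree closely but not necessarily to machine precision.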
The plots above give:

Related

Variables in R how, central limit

I have the following task:
Assume the population of interest can be modeled by a Bernoulli distribution with p = 0.5.
For each sample size n, simulate r = 5,000 draws (using a for loop over i in 1:r) from that Bernoulli distribution with p = 0.5 and calculate the standardized sample mean for each draw.
The last histogram looks good with a curve, but the 1st and 2nd are wrong. Maybe someone can help me with this. Thanks in advance for your time!
I have done the following:
set.seed(2005)
x1 <- rbinom(5000,3,0.5)
par(mfrow=c(2,2))
hist(x=x1,
main=expression(paste(" Random Variables with",size,"=1 and",prob,"=0.5")),
sub="Standardized value of smple sample avearge",
xlab="n=3", ylab="Probability", probability = TRUE)
curve(dnorm(x, mean = mean(x), sd=sd(x)), add = TRUE, col="blue")
Essentially what happened in the first two panels is that for a small n the histogram breaks were calculated in an ungraceful manner. You can fix that by letting the breaks depend on the data range. Here, I chose the breaks depending on whether the range of the data was smaller than 10. If this is TRUE, manually calculate breaks, otherwise use the default "Sturges" algorithm for breaks.
par(mfrow=c(2,2))
N <- c(2, 5, 25, 100)
for (i in seq_along(N)) {
set.seed(2015 + i)
n <- N[i]
xx <- rbinom(10000, n, 0.78)
if (diff(range(xx)) < 10) {
breaks <- seq(floor(min(xx)), ceiling(max(xx)))
} else {
breaks <- "Sturges"
}
hist(
x = xx, breaks = breaks,
main=expression(paste("Bernoulli Random Variables with",size,"=1 and",prob,"=0.78")),
sub = "Standardized value of sample average",
xlab = paste0("n=",n), ylab = "Probability", probability = TRUE
)
curve(dnorm(x, mean = mean(xx), sd=sd(xx)), add = TRUE, col="blue")
}
Created on 2021-01-07 by the reprex package (v0.3.0)
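Neither snippet above standardizes the sample mean, which the task asks for. A minimal sketch of that step, assuming "standardized" means subtracting p and dividing by sqrt(p(1 - p)/n), could look like this; n = 25 and the seed are illustrative choices only.

```r
set.seed(2005)   # arbitrary seed, chosen only for reproducibility
p <- 0.5
r <- 5000
n <- 25          # one example sample size; the task repeats this for several n

z <- numeric(r)
for (i in 1:r) {
  draws <- rbinom(n, size = 1, prob = p)             # n Bernoulli(p) draws
  z[i] <- (mean(draws) - p) / sqrt(p * (1 - p) / n)  # standardized sample mean
}

hist(z, probability = TRUE, main = paste0("n = ", n),
     xlab = "Standardized sample mean")
curve(dnorm(x), add = TRUE, col = "blue")  # standard normal for comparison
```

Because the values are standardized, the reference curve is the standard normal for every n, so the panels can be compared directly.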

Conway Maxwell Distribution Density Plot

I have written my own code to simulate the Conway maxwell distribution sample.
This is the pmf (Guikema & Coffelt, 2008):
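Reading the formula off the dcomp function in this question (with lamb written as \lambda and v as \nu), the pmf in this parameterisation appears to be the following. Note that the code truncates the infinite sum at j = 100.

```latex
\[
P(X = x) = \frac{1}{Z(\lambda,\nu)} \left( \frac{\lambda^{x}}{x!} \right)^{\nu},
\quad x = 0, 1, 2, \dots,
\qquad
Z(\lambda,\nu) = \sum_{j=0}^{\infty} \left( \frac{\lambda^{j}}{j!} \right)^{\nu}
\]
```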
However, I have run into a problem when plotting the density.
rcomp <- function(n,lamb,v)
{
u <- runif(n)
w <- integer(n)
for(i in 1:n) {
z=sum(sapply( 0:100, function(j) (( ((lamb)^j) / (factorial(j)) )^v) ))
x <- seq(1, 50, 1) #seq of 1 to 50, increase by 1
px <- (((lamb^x)/factorial(x))^v)/z
# px is pmf of re-parameter conway maxwell
w[i] <- if (u[i] < px[1]) 0 else (max (which (cumsum(px) <= u[i])))
}
return (w)
}
dcomp <- function(x,lamb,v) {
z=sum(sapply( 0:100, function(j) (( ((lamb)^j) / (factorial(j)) )^v) ))
px <- (((lamb^x)/factorial(x))^v)/z
return(px)
}
I want to plot the density to check whether lamb or v acts as the location parameter, but the plot I get looks weird.
x = rcomp(100,6,0.2); pdf = dcomp(x,6,0.2)
x1 = rcomp(100,6,0.5); pdf1 = dcomp(x1,6,0.5)
x2 = rcomp(100,6,0.7); pdf2 = dcomp(x2,6,0.7)
plot(x2, pdf2, type="l", lwd=1,lty=1,col="blue")
How could I solve this problem?
Source: Guikema & Coffelt (2008), A Flexible Count Data Regression Model for Risk Analysis. Risk Analysis 28(1): 215.
You have to sort the values of the x coordinate if you want a graph to connect the points in their axis order.
Note, however, that there might be better ways to graph the density you want. See the red curve: I first create a vector x of values within a certain range and then compute the PDF for those values. These (x, y) pairs are what the lines function plots.
set.seed(2673) # Make the results reproducible
x2 <- rcomp(100, 6, 0.7)
x2 <- sort(x2)
pdf2 <- dcomp(x2, 6, 0.7)
plot(x2, pdf2, type = "l", lwd = 1, lty = 1, col = "blue")
x <- seq(0, 50, length.out = 100)
y <- dcomp(x, 6, 0.2)
lines(x, y, type = "l", col = "red")
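Since the distribution is discrete, another option is to evaluate the pmf on the integer support and draw it as vertical spikes, which avoids sampling and sorting altogether. In this sketch dcomp_sketch is a renamed copy of the question's dcomp (including its truncation of the normalising sum at 100 terms), and the support 0:50 is an assumption that suits these parameter values.

```r
# Copy of the question's dcomp(); the normalising constant is truncated
# at j = 100, exactly as in the original code.
dcomp_sketch <- function(x, lamb, v) {
  j <- 0:100
  z <- sum((lamb^j / factorial(j))^v)
  (lamb^x / factorial(x))^v / z
}

support <- 0:50
pmf <- dcomp_sketch(support, lamb = 6, v = 0.7)

sum(pmf)  # should be very close to 1 if the support is wide enough

plot(support, pmf, type = "h", lwd = 2, col = "blue",
     xlab = "x", ylab = "P(X = x)")
```

Repeating the plot for several values of lamb and v on the same support makes it easier to see which parameter shifts the location.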

adjusting plot axis in user defined function - R

I have a function in R which creates a standard normal plot, and then uses a for loop that calls density plots for the t distribution for various degrees of freedom. The plot looks like:
Note that the density for degrees of freedom = 2 extends outside of the y axis limits. I am wondering if there is a way to edit the for loop so that the axis limits are adjusted according to the range of the density lines that are drawn.
The for loop code that I am using is as follows:
N <- 1000
n <- c(25,50,100,200)
df<-c(1:4,seq(5,25,by=5))
histPlot <- function(data) {
x <- seq(-4, 4, length=100)
y <- dnorm(x, mean=0, sd=1)
plot(x, y, type="l",
main=paste("Distribution of size", nrow(data)/9000, sep=" "),
xlab="standard deviation")
colors <- brewer.pal(n = 9, name = "Spectral")
i<-1
for (d in df) {
lines(density(data[data$df==d, "t"]),col=colors[i])
legend("topright", pch=c(21,21), col=c(colors, "black"), legend=c(df, "normal"), bty="o", cex=.8)
i <- i+1
}
}
The lines calls inside the for loop add to the existing plot, so the axis limits are fixed by the initial plot call.
This means you have to set the ylim parameter in the plot function call. This makes the plot taller, so the lines stay visible when they are added.
Try it like this:
plot(x, y, type="l",
main=paste("Distribution of size", nrow(data)/9000, sep=" "),
xlab="standard deviation",
ylim = c(0, 1)) # This line will make the plot higher, i.e. the y axis range will be from 0 to 1
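If you would rather not hard-code the upper limit, you can estimate all the densities first and derive ylim from the tallest one. The sketch below uses simulated t values as a stand-in for the question's data (the column names df and t are taken from the question, the sample size of 200 per group is made up) and rainbow() in place of brewer.pal() so that it needs no extra package.

```r
set.seed(1)  # arbitrary seed for the made-up data
df <- c(1:4, seq(5, 25, by = 5))
# Hypothetical stand-in for the question's data: 200 t values per df
data <- data.frame(df = rep(df, each = 200),
                   t  = unlist(lapply(df, function(d) rt(200, d))))

x <- seq(-4, 4, length = 100)
y <- dnorm(x)

# Estimate every density before plotting so the tallest one sets ylim
dens <- lapply(df, function(d) density(data$t[data$df == d], from = -4, to = 4))
y_max <- max(y, sapply(dens, function(d) max(d$y)))

plot(x, y, type = "l", ylim = c(0, y_max), xlab = "standard deviation")
cols <- rainbow(length(df))
for (i in seq_along(dens)) lines(dens[[i]], col = cols[i])
legend("topright", lty = 1, col = c(cols, "black"),
       legend = c(df, "normal"), cex = 0.8)
```

Drawing the legend once after the loop also avoids redrawing it on every iteration.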

How to plot a normal distribution by labeling specific parts of the x-axis?

I am using the following code to create a standard normal distribution in R:
x <- seq(-4, 4, length=200)
y <- dnorm(x, mean=0, sd=1)
plot(x, y, type="l", lwd=2)
I need the x-axis to be labeled at the mean and at points three standard deviations above and below the mean. How can I add these labels?
The easiest (but not general) way is to restrict the limits of the x axis. The +/- 1:3 sigma will be labeled as such, and the mean will be labeled as 0 - indicating 0 deviations from the mean.
plot(x,y, type = "l", lwd = 2, xlim = c(-3.5,3.5))
Another option is to use more specific labels:
plot(x,y, type = "l", lwd = 2, axes = FALSE, xlab = "", ylab = "")
axis(1, at = -3:3, labels = c("-3s", "-2s", "-1s", "mean", "1s", "2s", "3s"))
Using the code in this answer, you could skip creating x and just use curve() on the dnorm function:
curve(dnorm, -3.5, 3.5, lwd=2, axes = FALSE, xlab = "", ylab = "")
axis(1, at = -3:3, labels = c("-3s", "-2s", "-1s", "mean", "1s", "2s", "3s"))
But this doesn't use the given code anymore.
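For a normal distribution that is not standard, the same idea works if the tick positions are computed from the mean and standard deviation. This is a sketch with illustrative values mu = 10 and sigma = 2; the variable names are my own.

```r
mu <- 10; sigma <- 2   # illustrative values, not from the question
x <- seq(mu - 4 * sigma, mu + 4 * sigma, length = 200)
y <- dnorm(x, mean = mu, sd = sigma)

# Suppress the default x axis, then place ticks at the mean and at 1 to 3 sd
plot(x, y, type = "l", lwd = 2, xaxt = "n", xlab = "", ylab = "")
at <- mu + (-3:3) * sigma
axis(1, at = at, labels = at)
```

Passing a character vector to labels instead (as in the answers above) lets you mix numeric positions with text such as "mean".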
If you like doing things the hard way, without using R's built-in functions, or you want to do this outside R, you can use the following formula.
x<-seq(-4,4,length=200)
s = 1
mu = 0
y <- (1/(s * sqrt(2*pi))) * exp(-((x-mu)^2)/(2*s^2))
plot(x,y, type="l", lwd=2, col = "blue", xlim = c(-3.5,3.5))
An extremely inefficient and unusual, but beautiful, solution based on the idea of Monte Carlo simulation is this:
1) Simulate many draws (or samples) from a given distribution (say the normal) using rnorm. The rnorm function takes arguments (A, B, C) and returns a vector of A samples from a normal distribution centered at B, with standard deviation C.
2) Plot the density of these draws using plot(density(...)).
Thus, to take a sample of size 50,000 from a standard normal (i.e., a normal with mean 0 and standard deviation 1) and plot its density, we do the following:
x = rnorm(50000,0,1)
plot(density(x))
As the number of draws goes to infinity, the estimated density converges to the normal density. To illustrate this, see the image below, which shows from left to right and top to bottom 5000, 50000, 500000, and 5 million samples.
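The convergence can also be checked numerically by comparing the estimated density with dnorm on the same grid. This is a rough sketch: the sample sizes, the seed, and the range [-3, 3] are arbitrary choices, and max_err is a name introduced here.

```r
set.seed(42)  # arbitrary seed for reproducibility
sizes <- c(500, 50000)

# Largest absolute gap between the estimated and the true density on [-3, 3]
max_err <- sapply(sizes, function(n) {
  d <- density(rnorm(n, 0, 1), from = -3, to = 3)
  max(abs(d$y - dnorm(d$x)))
})

data.frame(draws = sizes, max_err = max_err)
```

The gap should shrink as the number of draws grows, although it shrinks slowly, which is why this approach is so inefficient.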
In the general case, for example Normal(2, 1):
f <- function(x) dnorm(x, 2, 1)
plot(f, -1, 5)
This is very general: f can be defined freely, with any given parameters, for example:
f <- function(x) dbeta(x, 0.1, 0.1)
plot(f, 0, 1)
I particularly love Lattice for this. It makes it easy to show graphical information such as specific areas under a curve, which you usually need when dealing with probability problems such as finding P(a < X < b).
Please have a look:
library(lattice)
e4a <- seq(-4, 4, length = 10000) # Data to set up our normal
e4b <- dnorm(e4a, 0, 1)
xyplot(e4b ~ e4a, # Lattice xyplot
type = "l",
main = "Plot 2",
panel = function(x,y, ...){
panel.xyplot(x,y, ...)
panel.abline( v = c(0, 1, 1.5), lty = 2) #set z and lines
xx <- c(1, x[x>=1 & x<=1.5], 1.5) #Color area
yy <- c(0, y[x>=1 & x<=1.5], 0)
panel.polygon(xx,yy, ..., col='red')
})
In this example I make the area between z = 1 and z = 1.5 stand out. You can easily change these parameters according to your problem.
Axis labels are automatic.
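The same shading can be sketched in base graphics with polygon, and pnorm gives the probability that the shaded area represents. The bounds a and b below simply mirror the z = 1 and z = 1.5 used in the Lattice example.

```r
a <- 1; b <- 1.5                 # bounds of the shaded band
x <- seq(-4, 4, length = 1000)
y <- dnorm(x)

plot(x, y, type = "l", lwd = 2, ylab = "density")
in_band <- x >= a & x <= b
# Close the polygon along the x axis so the band is filled down to zero
polygon(c(a, x[in_band], b), c(0, y[in_band], 0), col = "red")
abline(v = c(0, a, b), lty = 2)

p <- pnorm(b) - pnorm(a)         # P(a < X < b) for a standard normal
round(p, 4)                      # 0.0918
```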
This is how to write it as a function:
normalCriticalTest <- function(mu, s) {
x <- seq(-4, 4, length=200) # x extends from -4 to 4
# y follows the formula of the normal distribution: f(Y)
y <- (1/(s * sqrt(2*pi))) * exp(-((x-mu)^2)/(2*s^2))
plot(x,y, type="l", lwd=2, xlim = c(-3.5,3.5)) # draw the graph
# vertical lines at -1.96 and 1.96, leaving 2.5% of the area in each tail of a standard normal
abline(v = c(-1.96, 1.96), col="red")
}
normalCriticalTest(0, 1) # draw a standard normal distribution with vertical lines.
Final result:

Plot weighted frequency matrix

This question is related to two different questions I have asked previously:
1) Reproduce frequency matrix plot
2) Add 95% confidence limits to cumulative plot
I wish to reproduce this plot in R:
I have got this far, using the code beneath the graphic:
#Set the number of bets and number of trials and % lines
numbet <- 36
numtri <- 1000
#Fill a matrix where the rows are the cumulative bets and the columns are the trials
xcum <- matrix(NA, nrow=numbet, ncol=numtri)
for (i in 1:numtri) {
x <- sample(c(0,1), numbet, prob=c(5/6,1/6), replace = TRUE)
xcum[,i] <- cumsum(x)/(1:numbet)
}
#Plot the trials as transparent lines so you can see the build up
matplot(xcum, type="l", xlab="Number of Trials", ylab="Relative Frequency", main="", col=rgb(0.01, 0.01, 0.01, 0.02), las=1)
My question is: How can I reproduce the top plot in one pass, without plotting multiple samples?
Thanks.
You can produce this plot by using this code:
boring <- function(x, occ) occ/x
boring_seq <- function(occ, length.out){
x <- seq(occ, length.out=length.out)
data.frame(x = x, y = boring(x, occ))
}
numbet <- 31
odds <- 6
plot(1, 0, type="n",
xlim=c(1, numbet + odds), ylim=c(0, 1),
yaxp=c(0,1,2),
main="Frequency matrix",
xlab="Successive occasions",
ylab="Relative frequency"
)
axis(2, at=c(0, 0.5, 1))
for(i in 1:odds){
xy <- boring_seq(i, numbet+1)
lines(xy$x, xy$y, type="o", cex=0.5)
}
for(i in 1:numbet){
xy <- boring_seq(i, odds+1)
lines(xy$x, 1-xy$y, type="o", cex=0.5)
}
You can also use Koshke's method, limiting the combinations of values to those with s < 6. At Andrie's request, I added the condition on the difference between ps$n and ps$s to get a "pointed" configuration.
library(plyr) # provides ldply
ps <- ldply(0:35, function(i)data.frame(s=0:i, n=i))
plot.new()
plot.window(c(0,36), c(0,1))
apply(ps[ps$s<6 & ps$n - ps$s < 30, ], 1, function(x){
s<-x[1]; n<-x[2];
lines(c(n, n+1, n, n+1), c(s/n, s/(n+1), s/n, (s+1)/(n+1)), type="o")})
axis(1)
axis(2)
lines(6:36, 6/(6:36), type="o")
# need to fill in the unconnected points on the upper frontier
A weighted frequency matrix is also called a position weight matrix (PWM) in bioinformatics.
It can be represented in the form of a sequence logo.
This is, at least, how I plot a weighted frequency matrix.
library(cosmo)
data(motifPWM); attributes(motifPWM) # Loads a sample position weight matrix (PWM) containing 8 positions.
plot(motifPWM) # Plots the PWM as sequence logo.
