I'm trying to plot a few Binomial distributions and show that as N increases, the curve looks more and more like the normal. I've tried using dbinom, but here's what I get:
Here's the code I'm using to produce this distribution:
x <- -5:250
y10 <- dbinom(x, 10, 0.5)
y30 <- dbinom(x, 30, 0.5)
y60 <- dbinom(x, 60, 0.5)
y100 <- dbinom(x, 100, 0.5)
ynorm <- dnorm(x, mean=-1, sd=1)
y10 <- y10 * sqrt(y10) / 0.8
y30 <- y30 * sqrt(y30) / 0.8
y60 <- y60 * sqrt(y60) / 0.8
y100 <- y100 * sqrt(y100) / 0.8
y10 <- y10[7:17]
y30 <- y30[17:27]
y60 <- y60[32:42]
y100 <- y100[52:62]
plot(range(0, 10), range(0, 0.5), type = "n")
lines(ynorm, col = "red", type = "l")
lines(y10, col = "blue", type = "l")
lines(y30, col = "orange", type = "l")
lines(y60, col = "green", type = "l")
lines(y100, col = "yellow", type = "l")
Does anyone know how to correctly adjust a binomial distribution in R?
Theoretically, an N of 1000 should make it look like a normal distribution, but I have no clue how to get there, and I've tried and failed to use ggplot2 :(
You can rescale the x values so that x == 0 always occurs at the peak of each binomial density, by finding the x value at which each density reaches its maximum. For example:
library(ggplot2)
theme_set(theme_classic())
library(dplyr)
x <- -5:250
n = c(6,10,30,60,100)
p = 0.5
binom = data.frame(x=rep(x, length(n)),
                   y=dbinom(x, rep(n, each=length(x)), p),
                   n=rep(n, each=length(x)))
ggplot(binom %>% filter(y > 1e-5) %>%
         group_by(n) %>%
         mutate(x = x - x[which.max(y)]),
       aes(x, y, colour=factor(n))) +
  geom_line() + geom_point(size=0.6) +
  labs(colour="n")
In reference to your comment, here's one way to add a normal density in addition to the binomial density: The mean of a binomial distribution is n*p, where n is the number of trials and p is the probability of success. The variance is n*p*(1-p). So, for each of the binomial densities above, we want normal densities with the same mean and variance. We create a data frame of these below and then plot the binomial and normal densities together.
First, create a new vector of x values that includes a higher density of points, to reflect the fact that the normal distribution is continuous, rather than discrete:
x = seq(-5,250,length.out=2000)
Now we create a data frame of normal densities with the same means and variances as the binomial densities above:
normal = data.frame(x=rep(x, length(n)),
                    y=dnorm(x, rep(n, each=length(x))*p, (rep(n, each=length(x))*p*(1-p))^0.5),
                    n=rep(n, each=length(x)))
# Cut off y-values below ymin
ymin = 1e-3
So now we have two data frames to plot. We still add the binom data frame in the main call to ggplot. But here we also add a call to geom_line for plotting the normal densities. And we give geom_line the normal data frame. Also, for this plot we've used geom_segment to emphasize the discrete points of the binomial density (you could also use geom_bar for this).
ggplot(binom %>% filter(y > ymin), aes(x, y)) +
geom_point(size=1.2, colour="blue") +
geom_line(data=normal %>% filter(y > ymin), lwd=0.7, colour="red") +
geom_segment(aes(x=x, xend=x, y=0, yend=y), lwd=0.8, alpha=0.5, colour="blue") +
facet_grid(. ~ n, scales="free", space="free")
Here's what the new plot looks like. You can change the scaling in various ways and there are probably many other ways to tweak it, depending on what you want to emphasize.
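For example, here is a minimal sketch of one such tweak (assuming the binom data frame and the n and p values defined above): standardize each binomial by its mean n*p and standard deviation sqrt(n*p*(1-p)), and rescale the probabilities by the same standard deviation, so that every curve can be overlaid directly on the standard normal density. This also makes the convergence the question asked about easy to see.
# Sketch: standardize each binomial so all curves overlay the standard normal
std <- binom %>%
  filter(y > 1e-5) %>%
  mutate(z = (x - n * p) / sqrt(n * p * (1 - p)),  # standardized x
         dens = y * sqrt(n * p * (1 - p)))         # pmf rescaled to a density

# Standard normal reference curve
norm_ref <- data.frame(z = seq(-4, 4, length.out = 200))
norm_ref$dens <- dnorm(norm_ref$z)

ggplot(std, aes(z, dens, colour = factor(n))) +
  geom_line() +
  geom_line(data = norm_ref, aes(z, dens), colour = "black",
            linetype = 2, inherit.aes = FALSE) +
  labs(x = "standardized x", y = "density", colour = "n")
As n grows, the standardized binomial curves should hug the dashed normal density more and more closely.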
GauPro is an R library for fitting Gaussian processes. You can also get it to produce a nice predicted curve for you.
The documentation for GauPro uses built-in R plotting functions to produce plots like this:
gp <- GauPro(x,y) ## fit a gaussian process model to x & y
plot(x,y) ## plots the x,y points
curve(gp$predict(x), add=T, col=2) ## adds the predicted curve from the gaussian process
What would be the equivalent using ggplot? I can get the points to show up, but I can't quite figure out how to add the curve.
The GauPro documentation I refer to is here.
We can do this by building a little data frame of predictions. Let's start by loading the necessary packages and creating some sample data:
library(GauPro)
library(ggplot2)
set.seed(69)
x <- 1:10
y <- cumsum(runif(10))
Now we can create our model and plot it using the same plotting functions shown in the vignette you linked:
gp <- GauPro(x, y)
plot(x, y)
curve(gp$predict(x), add = TRUE, col = 2)
Now if we want to customize this plot using ggplot, we need a data frame with columns for the x values at which we wish to predict, the y prediction at that point, and a column each for upper and lower 95% confidence intervals. We can obtain the x values like this:
new_x <- seq(min(x), max(x), length.out = 100)
and we can get the three sets of corresponding y values using predict like this:
predict_df <- predict(gp, new_x, se.fit = TRUE)
predict_df$x <- new_x
predict_df$y <- predict_df$mean
predict_df$lower <- predict_df$y - 1.96 * predict_df$se
predict_df$upper <- predict_df$y + 1.96 * predict_df$se
This is now quite straightforward to plot in ggplot, with themes customized as you choose:
ggplot(data.frame(x, y), aes(x, y)) +
geom_point() +
geom_line(data = predict_df, color = "deepskyblue4", linetype = 2) +
geom_ribbon(data = predict_df, aes(ymin = lower, ymax = upper),
alpha = 0.2, fill = "deepskyblue4") +
theme_minimal()
Created on 2020-07-29 by the reprex package (v0.3.0)
I have some data for fitting crude and adjusted logit GAMs:
library(mgcv)
## Simulate some data...
set.seed(3);n<-400
dat <- gamSim(1,n=n)
mu <- binomial()$linkinv(dat$f/4-2)
phi <- .5
a <- mu*phi;b <- phi - a;
dat$y <- rbeta(n,a,b)
## Fitting GAMs
crude <- gam(y~s(x0),family=binomial(link="logit"),data=dat)
adj <- gam(y~s(x0)+s(x1)+s(x2)+s(x3),family=binomial(link="logit"),data=dat)
Now I would like to find the value of x0 at which the odds ratio (OR) is 1.00 (i.e. a probability of 0.50). For this purpose I use visreg with the argument plot = FALSE.
## Prepare data for ggplotting
library(visreg)
p.crude <- visreg(crude, "x0", plot = FALSE)
p.adj <- visreg(adj, "x0", plot = FALSE)
library(dplyr)
bind_rows(
mutate(p.crude$fit, Model = "crude"),
mutate(p.adj$fit, Model = "adj")
) -> fits
OK, now I'll compute the ORs from the log(OR) values. Is the following code correct?
# Compute ORs and CI from LogOR
fits$or <- exp(fits$visregFit)
fits$ci.low <- exp(fits$visregLwr)
fits$ci.up <- exp(fits$visregUpr)
Now I use approx in order to interpolate the x0 value at which the OR is 1.00:
## Interpolate x0 which give OR 1.00 (or 50% of probability)
x.crude <- round(approx(x = crude$fitted.values, y=crude$model$x0, xout = .5)$y, 1)
x.adj <- round(approx(x = adj$fitted.values, y=adj$model$x0, xout = .5)$y, 1)
Finally, I'm plotting the two models in a single graph:
## Plotting using ggplot
library(ggplot2)
ggplot(data = fits) +
geom_vline(aes(xintercept = x.crude), size=.2, color="black")+
geom_vline(aes(xintercept = x.adj), size=.2, color="red")+
annotate(geom ="text", x= x.crude - 0.05, y=.5, label = x.crude, size=3.5) +
annotate(geom ="text", x= x.adj - 0.05, y=.5, label = x.adj, size=3.5, color="red") +
geom_ribbon(aes(x0, ymin=ci.low, ymax=ci.up, group=Model, fill=Model), alpha=.05) +
geom_line(aes(x0, or, group=Model, color=Model)) +
labs(x="X0", y="Odds ratio")+
theme_bw(16)
As you can see, only the crude model crosses an OR of (almost exactly) 1.00 (at x0 = 0.9), while this never happens for the adj model.
First, how can I get an interpolation at exactly OR = 1?
Second, with the limits of my statistical knowledge, my understanding was that I should observe a crossing of OR = 1 for the adj model as well, given the x0 values observed under that model. Why is its curve shifted upwards?
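One hypothetical sketch for the first question (not verified against this data): instead of interpolating on the fitted probabilities, interpolate on the visreg fit itself, i.e. find the x0 at which the log-OR (visregFit) crosses 0 for each model.
# Hypothetical sketch: x0 at which the log-OR (visregFit) crosses 0, i.e. OR = 1,
# computed separately for each model from the fits data frame built above
fits %>%
  group_by(Model) %>%
  summarise(x0_at_OR1 = approx(x = visregFit, y = x0, xout = 0)$y)
If the adj model's log-OR never crosses 0 over the observed x0 range, this returns NA, which would be consistent with its curve sitting entirely above OR = 1.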
First things first: I have a mixture of two distributions (their components overlap), and I know which distribution each sample comes from.
I want to plot a density histogram of the samples together with the mixture density curve.
Let's head to the code (seg 1):
library(mixtools)
# two components
set.seed(1) # for reproducible example
b1 <- rnorm(900000, mean=8, sd=2) # samples
b2 <- rnorm(100000, mean=17, sd=2)
# densities corresponding to samples
d = dnorm(c(b1, b2), mean = 8, sd = 2)*.9 + dnorm(c(b1, b2), mean = 17, sd = 2)*.1
# ground truth
b <- data.frame(ss=c(b1,b2), dd=d, gg=factor(c(rep(1, length(b1)), rep(2, length(b2)))))
# sample from mixed distribution
c <- b[sample(nrow(b), 500000),]
library(ggplot2)
ggplot(data = c, aes(x = ss)) +
geom_histogram(aes(y = stat(density)), binwidth = .5, alpha = .3, position="identity") +
geom_line(data = c, aes(x = ss, y = dd), color = "red", inherit.aes = FALSE)
This result is fine.
But I want to fill the bars according to each sample's group, so I change the code (seg 2):
ggplot(data=c, aes(x=ss)) +
geom_histogram(aes(y=stat(density), fill=gg, color=gg),
binwidth=.5, alpha=.3, position="identity") +
geom_line(data=c, aes(x=ss, y=dd), color="red", inherit.aes=FALSE)
The result is wrong: ggplot calculates the density of each group separately, so the two groups come out looking about the same height.
Then I found some methods like this (seg 3):
breaks = seq(min(c$ss), max(c$ss), .5) # form cut points
bins1 = cut(with(c, ss[gg==1]), breaks) # form intervals by cutting
bins2 = cut(with(c, ss[gg==2]), breaks)
cnt1 = sapply(split(with(c, ss[gg==1]), bins1), length) # assign points to its interval
cnt2 = sapply(split(with(c, ss[gg==2]), bins2), length)
h = data.frame(
  x = head(breaks, -1) + .25,
  dens1 = cnt1/sum(cnt1, cnt2),  # height of density bar
  dens2 = cnt2/sum(cnt1, cnt2)
  # weight = sapply(split(samples.mixgamma$samples, bins), sum)
)
ggplot(h) +
geom_bar(aes(x, dens1), fill="red", alpha = .3, stat="identity") +
geom_bar(aes(x, dens2), fill="blue", alpha = .3, stat="identity") +
geom_line(data=c, aes(x=ss, y=dd), color="red", inherit.aes=FALSE)
or set y=stat(count)/sum(stat(count)) like this (seg 4):
ggplot(data=c, aes(x=ss)) +
geom_histogram(aes(y=stat(count)/sum(stat(count)), fill=gg, color=gg),
binwidth=.5, alpha=.3, position="identity") +
geom_line(data=c, aes(x=ss, y=dd), color="red", inherit.aes=FALSE)
The results are the same and still wrong: all the bars are about half as tall as in seg 1.
So: how can I fill the two groups with different colours (as in seg 2) while keeping the correct overall proportions (as in seg 1), avoiding the mistakes of seg 3 and seg 4?
Many thanks!
The solution: the probability density should be calculated as y = stat(count)/.5/sum(stat(count)). I had only normalized the counts but had not divided the mass by the bin width (0.5), so answers such as the one linked and seg 3 need to be modified accordingly.
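Applied to seg 2, that gives (a minimal sketch using the c data frame from above):
# Corrected seg 2: divide by the bin width (.5) as well as the total count,
# so the per-group bars together form a proper density
ggplot(data=c, aes(x=ss)) +
  geom_histogram(aes(y=stat(count)/.5/sum(stat(count)), fill=gg, color=gg),
                 binwidth=.5, alpha=.3, position="identity") +
  geom_line(data=c, aes(x=ss, y=dd), color="red", inherit.aes=FALSE)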
I was wondering how I can modify the following code to get all of these fits into a single plot.
data(airquality)
library(quantreg)
library(ggplot2)
library(data.table)
library(devtools)
# source Quantile LOESS
source("https://www.r-statistics.com/wp-content/uploads/2010/04/Quantile.loess_.r.txt")
airquality2 <- na.omit(airquality[ , c(1, 4)])
#'' quantreg::rq
rq_fit <- rq(Ozone ~ Temp, 0.95, airquality2)
rq_fit_df <- data.table(t(coef(rq_fit)))
names(rq_fit_df) <- c("intercept", "slope")
#'' quantreg::lprq
lprq_fit <- lapply(1:3, function(bw){
  fit <- lprq(airquality2$Temp, airquality2$Ozone, h = bw, tau = 0.95)
  return(data.table(x = fit$xx, y = fit$fv, bw = paste0("bw=", bw), fit = "quantreg::lprq"))
})
#'' Quantile LOESS
ql_fit <- Quantile.loess(airquality2$Ozone, jitter(airquality2$Temp), window.size = 10,
the.quant = .95, window.alignment = c("center"))
ql_fit_df <- data.table(x = ql_fit$x, y = ql_fit$y.loess, bw = "bw=1", fit = "Quantile LOESS")
I want to have all these fits in a plot.
geom_quantile can calculate quantiles using the rq method internally, so we don't need to create the rq_fit_df separately. However, the lprq and Quantile LOESS methods aren't available within geom_quantile, so I've used the data frames you provided and plotted them using geom_line.
In addition, to include the rq line in the color and linetype mappings and in the legend we add aes(colour="rq", linetype="rq") as a sort of "artificial" mapping inside geom_quantile.
library(dplyr) # For bind_rows()
ggplot(airquality2, aes(Temp, Ozone)) +
geom_point() +
geom_quantile(quantiles=0.95, formula=y ~ x, aes(colour="rq", linetype="rq")) +
geom_line(data=bind_rows(lprq_fit, ql_fit_df),
aes(x, y, colour=paste0(gsub("q.*:","",fit),": ", bw),
linetype=paste0(gsub("q.*:","",fit),": ", bw))) +
theme_bw() +
scale_linetype_manual(values=c(2,4,5,1,1)) +
labs(colour="Method", linetype="Method",
title="Different methods of estimating the 95th percentile by quantile regression")
I am analyzing data from a wind turbine. Normally this is the sort of thing I would do in Excel, but the quantity of data requires something heavy-duty. I have never used R before, so I am just looking for some pointers.
The data consists of 2 columns WindSpeed and Power, so far I have arrived at importing the data from a CSV file and scatter-plotted the two against each other.
What I would like to do next is to sort the data into ranges: for example, all data where WindSpeed is between x and y, then find the average power generated for each range and graph the resulting curve.
From that average I want to recalculate the average based on the data that falls within one or two standard deviations of it (basically ignoring outliers).
Any pointers are appreciated.
For those who are interested, I am trying to create a graph similar to this. It's a pretty standard type of graph, but like I said the sheer quantity of data requires something heavier than Excel.
Since you're no longer in Excel, why not use a modern statistical methodology that doesn't require crude binning of the data and ad hoc methods to remove outliers: locally weighted regression, as implemented by loess.
Using a slight modification of csgillespie's sample data:
w_sp <- sample(seq(0, 100, 0.01), 1000)
power <- 1/(1+exp(-(w_sp -40)/5)) + rnorm(1000, sd = 0.1)
plot(w_sp, power)
x_grid <- seq(0, 100, length = 100)
lines(x_grid, predict(loess(power ~ w_sp), x_grid), col = "red", lwd = 3)
Here's another version, similar in motivation to #hadley's, using an additive model with an adaptive smoother from package mgcv:
Dummy data first, as used by #hadley
w_sp <- sample(seq(0, 100, 0.01), 1000)
power <- 1/(1+exp(-(w_sp -40)/5)) + rnorm(1000, sd = 0.1)
df <- data.frame(power = power, w_sp = w_sp)
Fit the additive model with gam(), using an adaptive smoother and smoothness selection via REML:
require(mgcv)
mod <- gam(power ~ s(w_sp, bs = "ad", k = 20), data = df, method = "REML")
summary(mod)
Predict from our model and get the standard errors of the fit; use the latter to generate an approximate 95% confidence interval:
x_grid <- with(df, data.frame(w_sp = seq(min(w_sp), max(w_sp), length = 100)))
pred <- predict(mod, x_grid, se.fit = TRUE)
x_grid <- within(x_grid, fit <- pred$fit)
x_grid <- within(x_grid, upr <- fit + 2 * pred$se.fit)
x_grid <- within(x_grid, lwr <- fit - 2 * pred$se.fit)
Plot everything, adding the loess fit from #hadley's answer for comparison:
plot(power ~ w_sp, data = df, col = "grey")
lines(fit ~ w_sp, data = x_grid, col = "red", lwd = 3)
## upper and lower confidence intervals ~95%
lines(upr ~ w_sp, data = x_grid, col = "red", lwd = 2, lty = "dashed")
lines(lwr ~ w_sp, data = x_grid, col = "red", lwd = 2, lty = "dashed")
## add loess fit from #hadley's answer
lines(x_grid$w_sp, predict(loess(power ~ w_sp, data = df), x_grid), col = "blue",
lwd = 3)
First we will create some example data to make the problem concrete:
w_sp = sample(seq(0, 100, 0.01), 1000)
power = 1/(1+exp(-(rnorm(1000, mean=w_sp, sd=5) -40)/5))
Suppose we want to bin the wind speeds into [0, 5), [5, 10), etc., and average the power within each bin. Then
bin_incr = 5
bins = seq(0, 95, bin_incr)
y_mean = sapply(bins, function(x) mean(power[w_sp >= x & w_sp < (x+bin_incr)]))
We have now computed the mean power within each range of interest. Note: if you want the median values instead, just change mean to median (a sketch of that follows the plot below). All that's left to do is plot them:
plot(w_sp, power)
points(seq(2.5, 97.5, 5), y_mean, col=3, pch=16)
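For instance, the median version is a one-line change (a sketch using the same bins as above):
# Medians instead of means, overlaid as a second set of points
y_median = sapply(bins, function(x) median(power[w_sp >= x & w_sp < (x + bin_incr)]))
points(seq(2.5, 97.5, 5), y_median, col = 4, pch = 17)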
To get the average based on data that falls within two standard deviations of the average, we need to create a slightly more complicated function:
noOutliers = function(x, power, w_sp, bin_incr) {
  d = power[w_sp >= x & w_sp < (x + bin_incr)]
  m_d = mean(d)
  # keep only the values within two standard deviations of the bin mean
  d_trim = d[d > (m_d - 2*sd(d)) & d < (m_d + 2*sd(d))]
  return(mean(d_trim))
}
y_no_outliers = sapply(bins, noOutliers, power, w_sp, bin_incr)
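These trimmed means can then be overlaid on the scatterplot in the same way as y_mean (a small sketch):
# Outlier-trimmed bin means (red) alongside the raw bin means (green)
plot(w_sp, power)
points(seq(2.5, 97.5, 5), y_mean, col = 3, pch = 16)
points(seq(2.5, 97.5, 5), y_no_outliers, col = 2, pch = 16)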
Here are some examples of fitted curves (weibull analysis) for commercial turbines:
http://www.inl.gov/wind/software/
http://www.irec.cmerp.net/papers/WOE/Paper%20ID%20161.pdf
http://www.icaen.uiowa.edu/~ie_155/Lecture/Power_Curve.pdf
I'd recommend also playing around with Hadley's own ggplot2. His website is a great resource: http://had.co.nz/ggplot2/ .
# If you haven't already installed ggplot2:
install.packages("ggplot2", dependencies = TRUE)
# Load the ggplot2 package
require(ggplot2)
# csgillespie's example data
w_sp <- sample(seq(0, 100, 0.01), 1000)
power <- 1/(1+exp(-(w_sp -40)/5)) + rnorm(1000, sd = 0.1)
# Bind the two variables into a data frame, which ggplot prefers
wind <- data.frame(w_sp = w_sp, power = power)
# Take a look at how the first few rows look, just for fun
head(wind)
# Create a simple plot
ggplot(data = wind, aes(x = w_sp, y = power)) + geom_point() + geom_smooth()
# Create a slightly more complicated plot as an example of how to fine tune
# plots in ggplot
p1 <- ggplot(data = wind, aes(x = w_sp, y = power))
p2 <- p1 + geom_point(colour = "darkblue", size = 1, shape = 16)
p3 <- p2 + geom_smooth(method = "loess", se = TRUE, colour = "purple")
p3 + scale_x_continuous(name = "mph") +
scale_y_continuous(name = "power") +
labs(title = "Wind speed and power")