This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
ggplot2: Overlay histogram with density curve
sorry for what is probably a simple question, but I have a bit of a problem.
I have created a histogram that is based on a binomial distribution with mean=0.65 and sd=0.015 with 10000 samples. The histogram itself looks fine. However, I need to overlay a normal distribution on top of this (with the same mean and standard deviation). Currently, I have the following:
qplot(x, data=prob, geom="histogram", binwidth=.05) + stat_function(geom="line", fun=dnorm, arg=list(mean=0.65, sd=0.015))
A distribution shows up, but it is TINY. This is likely because the mean's count goes up to almost 2,000, while the normal distribution is much smaller. Simply put, it is not fitted with the data the way that R automatically would do. Is there a way to specify the line of the normal distribution to fit the histogram, or is there some way to manipulate the histogram to fit the normal distribution?
Thanks in advance.
"The distribution is tiny" because you are plotting a density function over counts. You should use the same metric in both plot, eg.:
I try to generate some data for your example:
x <- rbinom(10000, 10, 0.15)
prob <- data.frame(x=x/(mean(x)/0.65))
And plot both as density functions:
library(ggplot2)
ggplot(prob, aes(x=x)) + geom_histogram(aes(y = ..density..), binwidth=.05) + stat_function(geom="line", fun=dnorm, arg=list(mean=0.65, sd=0.015))
#daroczig's answer is correct about needing to be consistent in plotting densities rather than counts, but: I'm having trouble seeing how you managed to get a binomial sample with those properties. In particular, the mean of the binomial is n*p, the variance is n*p*(1-p), the standard deviation is sqrt(n*p*(1-p)), so ..
b.m <- 0.65
b.sd <- 0.015
Calculate variance:
b.v <- b.sd^2 ## n*p*(1-p)
Calculate p:
## (1-p) = b.v/(n*p) = b.v/b.m
## p = 1-b.v/b.m
b.p <- 1-b.v/b.m
Calculate n:
## n = n*p/p = b.m/b.p
b.n <- b.m/b.p
This gives n=0.6502251, p=0.9996538 -- so I don't see how you can get this binomial distribution without n<1, unless I messed up the algebra ...
Related
I have some some samples of a high dimensional density that I would like to plot. I like to create a grid of them where their bivariate density is plotted when they cross. For example, in Bayes and Big Data - The Consensus Monte Carlo Algorithm, Scott et al. (2016) has the following plot:
In this plot, above the diagonal are the distributions on a scale just large enough to fit the plots. In the below diagonal, the bivariate densities are plotted on a common scale.
Does anyone know how I can achieve such a plot?
For instance if I have just generated a 5-dimensional Gaussian distribution using:
library(MASS)
data <- MASS::mvrnorm(n=10000, mu=c(1,2,3,4,5), Sigma = diag(5))
This is relatively easy using the facet_matrix() from the ggforce package. You just have to specify which layer goes on what part of the plot (i.e. layer.upper = 1 says that the first layer (geom_density2d()) should go in the upper triangular part of the matrix. The geom_autodensity() makes sure that the KDE touches the bottom part of the plot.
library(MASS)
library(ggforce)
data <- MASS::mvrnorm(n=10000, mu=c(1,2,3,4,5), Sigma = diag(5))
df <- as.data.frame(data)
ggplot(df) +
geom_density2d(aes(x = .panel_x, y = .panel_y)) +
geom_autodensity() +
geom_point(aes(x = .panel_x, y = .panel_y)) +
facet_matrix(vars(V1:V5), layer.upper = 1, layer.diag = 2)
More details about facet_matrix() are posted here.
So, I am challenged and request a little guidance.
I have used the rriskDistributions package to evaluate some CDFs for some industrial sector injury data with the get.lnorm.par() function. It fits the data great, unfortunately, the axes require swapping because my response variable is currently on the x-axis, and needs to be on the y-axis. Unfortunately again, the get.lnorm.par() function requires that the probabilities be only on the y-axis, and I cannot figure out how to create the same curve with swapped axes.
I want to get it to look something like this:
An example of the code that I have worked through in ggplot follows:
x <- c(0.0416988,0.0656371,0.1015444,0.1270270,0.1536680,0.1694981,0.2509653)
y <- c(3170221,6810103,14999840,26623982,48903587,74177290,266181110)
prob <- c(x) ## There are 389 different x values, but keeping it simple!
quant <- c(y) ## Same as x.
df1 <- data.frame(prob,quant)
plot2 <- ggplot(df1, aes(x=prob, y=quant)) + geom_point() +
geom_smooth(method="lm", formula= log(y)~x, se=FALSE) +
labs(y="quantiles", x="probabilities", title="Probs vs Quants")
plot2
I have created lines that fit this data, but everything ends at the last data point.
When I used get.lnorm.par(), the fit was great, but like stated previously, the axes require flipping. When I tried this, I continued to get errors about infinite output and could not define the bounds of the function to be plotted.
So, here is the code using the rriskDistributions package:
pct <- c(0.0416988,0.0656371,0.1015444,0.1270270,0.1536680,0.1694981,0.2509653)
my.lnorm<-get.lnorm.par(p=pct, q=c(3170221,6810103,14999840,26623982,48903587,74177290,266181110),
tol = 0.001, scaleX = c(0,0.0809))
Essentially, I am trying to create a fit curve for the data (either exponential or power) that expands, or predicts beyond the final data point. This I cannot figure out for the life of me, and changing any of the parameters in the rriskDistributions functions is quite challenging.
Any thoughts?
Thanks.
So I have a barplot in which the y axis is the log (frequencies). From just eyeing it, it appears that bars decrease exponentially, but I would like to know this for sure. What I want to do is also plot an exponential on this same graph. Thus, if my bars fall below the exponential, I would know that my bars to decrease either exponentially or faster than exponential, and if the bars lie on top of the exponential, I would know that they dont decrease exponentially. How do I plot an exponential on a bar graph?
Here is my graph if that helps:
If you're trying to fit density of an exponential function, you should probably plot density histogram (not frequency). See this question on how to plot distributions in R.
This is how I would do it.
x.gen <- rexp(1000, rate = 3)
hist(x.gen, prob = TRUE)
library(MASS)
x.est <- fitdistr(x.gen, "exponential")$estimate
curve(dexp(x, rate = x.est), add = TRUE, col = "red", lwd = 2)
One way of visually inspecting if two distributions are the same is with a Quantile-Quantile plot, or Q-Q plot for short. Typically this is done when inspecting if a distribution follows standard normal.
The basic idea is to plot your data, against some theoretical quantiles, and if it matches that distribution, you will see a straight line. For example:
x <- qnorm(seq(0,1,l=1002)) # Theoretical normal quantiles
x <- x[-c(1, length(x))] # Drop ends because they are -Inf and Inf
y <- rnorm(1000) # Actual data. 1000 points drawn from a normal distribution
l.1 <- lm(sort(y)~sort(x))
qqplot(x, y, xlab="Theoretical Quantiles", ylab="Actual Quantiles")
abline(coef(l.1)[1], coef(l.1)[2])
Under perfect conditions you should see a straight line when plotting the theoretical quantiles against your data. So you can do the same plotting your data against the exponential function you think it will follow.
I have a probability density function in a plot called ph that i derived from two samples of data, by the help of a user of stackoverflow, in this way
few <-read.table('outcome.dat',head=TRUE)
many<-read.table('alldata.dat',head=TRUE)
mh <- hist(many$G,breaks=seq(0,1.,by=0.03), plot=FALSE)
fh <- hist(few$G, breaks=mh$breaks, plot=FALSE)
ph <- fh
ph$density <- fh$counts/(mh$counts+0.001)
plot(ph,freq=FALSE,col="blue")
I would like to fit the best curve of the plot of ph, but i can't find a working method.
how can i do this? I have to extract the vaule from ph and then works on they? or there is same function that works on
plot(ph,freq=FALSE,col="blue")
directly?
Assuming you mean that you want to perform a curve fit to the data in ph, then something along the lines of
nls(FUN, cbind(ph$counts, ph$mids),...) may work. You need to know what sort of function 'FUN' you think the histogram data should fit, e.g. normal distribution. Read the help file on nls() to learn how to set up starting "guess" values for the coefficients in FUN.
If you simply want to overlay a curve onto the histogram, then smoo<-spline(ph$mids,ph$counts);
lines(smoo$x,smoo$y)
will come close to doing that. You may have to adjust the x and/or y scaling.
Do you want a density function?
x = rnorm(1000)
hist(x, breaks = 30, freq = FALSE)
lines(density(x), col = "red")
I have two density curves plotted using this:
Network <- Mydf$Networks
quartiles <- quantile(Mydf$Avg.Position, probs=c(25,50,75)/100)
density <- ggplot(Mydf, aes(x = Avg.Position, fill = Network))
d <- density + geom_density(alpha = 0.2) + xlim(1,11) + opts(title = "September 2010") + geom_vline(xintercept = quartiles, colour = "red")
print(d)
I'd like to compute the area under each curve for a given Avg.Position range. Sort of like pnorm for the normal curve. Any ideas?
Calculate the density seperately and plot that one to start with. Then you can use basic arithmetics to get the estimate. An integration is approximated by adding together the area of a set of little squares. I use the mean method for that. the length is the difference between two x-values, the height is the mean of the y-value at the begin and at the end of the interval. I use the rollmeans function in the zoo package, but this can be done using the base package too.
require(zoo)
X <- rnorm(100)
# calculate the density and check the plot
Y <- density(X) # see ?density for parameters
plot(Y$x,Y$y, type="l") #can use ggplot for this too
# set an Avg.position value
Avg.pos <- 1
# construct lengths and heights
xt <- diff(Y$x[Y$x<Avg.pos])
yt <- rollmean(Y$y[Y$x<Avg.pos],2)
# This gives you the area
sum(xt*yt)
This gives you a good approximation up to 3 digits behind the decimal sign. If you know the density function, take a look at ?integrate
Three possibilities:
The logspline package provides a different method of estimating density curves, but it does include pnorm style functions for the result.
You could also approximate the area by feeding the x and y variables returned by the density function to the approxfun function and using the result with the integrate function. Unless you are interested in precise estimates of small tail areas (or very small intervals) then this will probably give a reasonable approximation.
Density estimates are just sums of the kernels centered at the data, one such kernel is just the normal distribution. You could average the areas from pnorm (or other kernels) with the sd defined by the bandwidth and centered at your data.