I have two density curves plotted using this:
Network <- Mydf$Networks
quartiles <- quantile(Mydf$Avg.Position, probs=c(25,50,75)/100)
density <- ggplot(Mydf, aes(x = Avg.Position, fill = Network))
d <- density + geom_density(alpha = 0.2) + xlim(1, 11) +
  ggtitle("September 2010") + geom_vline(xintercept = quartiles, colour = "red")
print(d)
I'd like to compute the area under each curve for a given Avg.Position range. Sort of like pnorm for the normal curve. Any ideas?
Calculate the density separately and plot that first. Then you can use basic arithmetic to get the estimate: an integral is approximated by summing the areas of a set of small trapezoids, where the width is the difference between two consecutive x-values and the height is the mean of the y-values at the start and end of the interval (the trapezoidal rule). I use the rollmean function from the zoo package for the heights, but this can be done with the base package too.
require(zoo)
X <- rnorm(100)
# calculate the density and check the plot
Y <- density(X) # see ?density for parameters
plot(Y$x,Y$y, type="l") #can use ggplot for this too
# set an Avg.position value
Avg.pos <- 1
# construct lengths and heights
xt <- diff(Y$x[Y$x<Avg.pos])
yt <- rollmean(Y$y[Y$x<Avg.pos],2)
# This gives you the area
sum(xt*yt)
This gives you a good approximation, accurate to about three decimal places. If you know the density function itself, take a look at ?integrate.
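For example, if the density has a known closed form (the standard normal here, purely as an illustration), integrate() computes the area directly:
# area under the standard normal density up to 1
integrate(dnorm, lower = -Inf, upper = 1)
pnorm(1) # same area, for comparison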
Three possibilities:
The logspline package provides a different method of estimating density curves, and it does include pnorm-style functions for the result.
You could also approximate the area by feeding the x and y values returned by the density function to approxfun and passing the result to integrate (sketched below). Unless you are interested in precise estimates of small tail areas (or very small intervals), this will probably give a reasonable approximation.
Density estimates are just sums of kernels centered at the data points, and one common kernel is the normal density. You could average the areas from pnorm (or another kernel's CDF) centered at your data points, with the sd given by the bandwidth.
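Here is a minimal sketch of the first two possibilities (rnorm data stands in for your Avg.Position values; plogspline is the pnorm-style CDF of the fitted logspline density):
require(logspline)
set.seed(1)
X <- rnorm(100)
fit <- logspline(X)
plogspline(1, fit) - plogspline(0, fit) # area between 0 and 1
# Possibility 2: interpolate the output of density() and integrate it
d <- density(X)
f <- approxfun(d$x, d$y)
integrate(f, lower = 0, upper = 1)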
Why did I get lines instead of standard bubbles in my Q-Q plot?
My code:
data <- read.csv("C:\\Users\\anton\\SanFrancisco.csv")
x <- data$ï..San.Francisco
head(x)
library("fitdistrplus")
fitnor <- fitdist(x, "norm")
fitlogis <- fitdist(x, "logis")
qqcomp(list(fitnor, fitlogis), legendtext=c("Normal", "Logistic"))
From the documentation for qqcomp (see ?qqcomp):
qqcomp provides a plot of the quantiles of each theoretical
distribution (x-axis) against the empirical quantiles of the data
(y-axis), by default defining probability points as (1:n - 0.5)/n for
theoretical quantile calculation (data are assumed continuous). For
large dataset (n > 1e4), lines are drawn instead of points and
customized with the fitpch parameter.
This is by design. Your data must have more than 10000 values. In that case the bubbles on the Q-Q plot would be difficult to distinguish individually, and they are large enough that the bubbles for one model would cover those for the other.
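To see the switch for yourself, here is a minimal sketch (simulated normal data stands in for the San Francisco values):
library(fitdistrplus)
set.seed(1)
x <- rnorm(20000, mean = 15, sd = 3) # n > 1e4, so lines are drawn
fitnor <- fitdist(x, "norm")
fitlogis <- fitdist(x, "logis")
qqcomp(list(fitnor, fitlogis), legendtext = c("Normal", "Logistic"))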
I have some samples from a high-dimensional density that I would like to plot. I'd like to create a grid of panels in which the bivariate density of each pair of dimensions is plotted where they cross. For example, in Bayes and Big Data - The Consensus Monte Carlo Algorithm, Scott et al. (2016) have the following plot:
In this plot, the panels above the diagonal show the distributions on a scale just large enough to fit each plot, while below the diagonal the bivariate densities are plotted on a common scale.
Does anyone know how I can achieve such a plot?
For instance, suppose I have just generated a 5-dimensional Gaussian sample using:
library(MASS)
data <- MASS::mvrnorm(n=10000, mu=c(1,2,3,4,5), Sigma = diag(5))
This is relatively easy using facet_matrix() from the ggforce package. You just have to specify which layer goes on which part of the plot (e.g. layer.upper = 1 says that the first layer, geom_density2d(), should go in the upper triangular part of the matrix). geom_autodensity() makes sure that the KDE touches the bottom of each diagonal panel.
library(MASS)
library(ggforce)
data <- MASS::mvrnorm(n=10000, mu=c(1,2,3,4,5), Sigma = diag(5))
df <- as.data.frame(data)
ggplot(df) +
  geom_density2d(aes(x = .panel_x, y = .panel_y)) +
  geom_autodensity() +
  geom_point(aes(x = .panel_x, y = .panel_y)) +
  facet_matrix(vars(V1:V5), layer.upper = 1, layer.diag = 2)
More details about facet_matrix() are posted here.
My data are pre-processed image data, and I want to separate two classes. In theory (and hopefully in practice) the best threshold is the local minimum between the two peaks of the bimodally distributed data.
My testdata is: http://www.file-upload.net/download-9365389/data.txt.html
I tried to follow this thread:
I plotted the histogram and calculated the kernel density estimate:
datafile <- read.table("....txt")
data <- datafile$V1
hist(data)
d <- density(data) # returns the density data with defaults
hist(data,prob=TRUE)
lines(d) # plots the results
But how to continue?
I would calculate the first and second derivatives of the density function to find the local extrema, specifically the local minimum. However, I have no idea how to do this in R, and the object returned by density(data) is not an ordinary function that I can differentiate. So please help me: how can I calculate the derivatives and find the local minimum of the valley between the two peaks of density(data)?
There are a few ways to do this.
First, using d for the density as in your question, d$x and d$y contain the x and y values of the density curve. The minimum occurs where the derivative dy/dx = 0. Since the x-values are equally spaced, we can estimate dy with diff(d$y) and look for the d$x at which abs(diff(d$y)) is smallest:
d$x[which.min(abs(diff(d$y)))]
# [1] 2.415785
The problem is that dy/dx is also 0 at the maxima of the density curve. In this case the minimum is shallow and the maxima are sharply peaked, so it works here, but you can't count on that.
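A more robust variant of the diff() idea, sketched with simulated bimodal data standing in for data.txt, is to look for points where the slope changes sign from negative to positive, which picks out only the local minima:
set.seed(1)
data <- c(rnorm(500, mean = 1), rnorm(500, mean = 4))
d <- density(data)
# diff(sign(diff(d$y))) is +2 where the slope flips from falling to
# rising, i.e. at a local minimum of the estimated density
minima <- which(diff(sign(diff(d$y))) == 2) + 1
d$x[minima]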
So a second way uses optimize(...), which seeks a local minimum within a given interval. optimize(...) needs a function as its argument, so we use approxfun(d$x, d$y) to create an interpolating function.
optimize(approxfun(d$x,d$y),interval=c(1,4))$minimum
# [1] 2.415791
Finally, we show that this is indeed the minimum:
hist(data,prob=TRUE)
lines(d, col="red", lty=2)
v <- optimize(approxfun(d$x,d$y),interval=c(1,4))$minimum
abline(v=v, col="blue")
Another approach, which is actually preferable, uses k-means clustering.
df <- read.csv("data.txt", header = FALSE)
colnames(df) <- "X"
# bimodal
km <- kmeans(df,centers=2)
df$clust <- as.factor(km$cluster)
library(ggplot2)
ggplot(df, aes(x = X)) +
  geom_histogram(aes(fill = clust, y = ..count../sum(..count..)),
                 binwidth = 0.5, color = "grey50") +
  stat_density(geom = "line", color = "red")
The data actually looks more trimodal than bimodal.
# trimodal
km <- kmeans(df,centers=3)
df$clust <- as.factor(km$cluster)
library(ggplot2)
ggplot(df, aes(x = X)) +
  geom_histogram(aes(fill = clust, y = ..count../sum(..count..)),
                 binwidth = 0.5, color = "grey50") +
  stat_density(geom = "line", color = "red")
So I have a barplot in which the y-axis is the log of the frequencies. From eyeballing it, the bars appear to decrease exponentially, but I would like to know this for sure. What I want to do is also plot an exponential curve on the same graph. Then, if my bars fall below the exponential, I would know that they decrease exponentially or faster, and if the bars lie on top of the exponential, I would know that they don't decrease exponentially. How do I plot an exponential on a bar graph?
Here is my graph if that helps:
If you're trying to fit the density of an exponential distribution, you should probably plot a density histogram (not a frequency histogram). See this question on how to plot distributions in R.
This is how I would do it.
x.gen <- rexp(1000, rate = 3) # simulated exponential data
hist(x.gen, prob = TRUE) # density histogram, not counts
library(MASS)
x.est <- fitdistr(x.gen, "exponential")$estimate # ML estimate of the rate
curve(dexp(x, rate = x.est), add = TRUE, col = "red", lwd = 2)
One way of visually inspecting whether two distributions are the same is with a Quantile-Quantile plot, or Q-Q plot for short. Typically this is done when checking whether a distribution follows the standard normal.
The basic idea is to plot your data against theoretical quantiles; if your data match that distribution, you will see a straight line. For example:
x <- qnorm(seq(0,1,l=1002)) # Theoretical normal quantiles
x <- x[-c(1, length(x))] # Drop ends because they are -Inf and Inf
y <- rnorm(1000) # Actual data. 1000 points drawn from a normal distribution
l.1 <- lm(sort(y)~sort(x))
qqplot(x, y, xlab="Theoretical Quantiles", ylab="Actual Quantiles")
abline(coef(l.1)[1], coef(l.1)[2])
Under perfect conditions you should see a straight line when plotting the theoretical quantiles against your data. So you can do the same thing, plotting your data against quantiles of the exponential distribution you think they follow.
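A minimal sketch of that idea, with rexp data standing in for your observed values and the rate estimated as 1/mean:
set.seed(1)
dat <- rexp(1000, rate = 3) # stand-in for your data
p <- (seq_along(dat) - 0.5) / length(dat) # probability points
theo <- qexp(p, rate = 1 / mean(dat)) # theoretical exponential quantiles
qqplot(theo, dat, xlab = "Theoretical Quantiles", ylab = "Actual Quantiles")
abline(lm(sort(dat) ~ sort(theo)))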
Possible Duplicate:
ggplot2: Overlay histogram with density curve
Sorry for what is probably a simple question, but I have a bit of a problem.
I have created a histogram based on a binomial distribution with mean=0.65 and sd=0.015, with 10000 samples. The histogram itself looks fine. However, I need to overlay a normal distribution on top of it (with the same mean and standard deviation). Currently, I have the following:
qplot(x, data=prob, geom="histogram", binwidth=.05) + stat_function(geom="line", fun=dnorm, args=list(mean=0.65, sd=0.015))
A distribution shows up, but it is TINY. This is likely because the count at the mean goes up to almost 2,000, while the normal density is much smaller. Simply put, the curve is not scaled to the data the way R would do automatically. Is there a way to scale the normal curve to fit the histogram, or some way to manipulate the histogram to fit the normal distribution?
Thanks in advance.
"The distribution is tiny" because you are plotting a density function over counts. You should use the same metric in both plot, eg.:
Let me generate some data for your example:
x <- rbinom(10000, 10, 0.15)
prob <- data.frame(x=x/(mean(x)/0.65))
And plot both as density functions:
library(ggplot2)
ggplot(prob, aes(x = x)) +
  geom_histogram(aes(y = ..density..), binwidth = .05) +
  stat_function(geom = "line", fun = dnorm, args = list(mean = 0.65, sd = 0.015))
@daroczig's answer is correct about needing to plot densities rather than counts consistently, but I'm having trouble seeing how you managed to get a binomial sample with those properties. In particular, the mean of a binomial is n*p and its variance is n*p*(1-p), so the standard deviation is sqrt(n*p*(1-p)). So:
b.m <- 0.65
b.sd <- 0.015
Calculate variance:
b.v <- b.sd^2 ## n*p*(1-p)
Calculate p:
## (1-p) = b.v/(n*p) = b.v/b.m
## p = 1-b.v/b.m
b.p <- 1-b.v/b.m
Calculate n:
## n = n*p/p = b.m/b.p
b.n <- b.m/b.p
This gives n=0.6502251, p=0.9996538, so I don't see how you can get this binomial distribution without n<1 (which is impossible for a binomial), unless I messed up the algebra ...
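As a quick numeric check of the algebra, plugging b.n and b.p back into the binomial formulas recovers the requested moments:
b.n * b.p # mean: 0.65
sqrt(b.n * b.p * (1 - b.p)) # sd: 0.015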