Who can explain this to me?
If I run the following
repet <- 10000
size <- 100
p <- .5
data <- (rbinom(repet, size, p) - size * p) / sqrt(size * p * (1-p))
hist(data, freq = FALSE)
x = seq(min(data) - 1, max(data) + 1, .01)
lines(x, dnorm(x), col='green', lwd = 4)
then I get reasonable agreement of the histogram and the theoretical density (due to the Central Limit Theorem).
If I use
hist(data, breaks = 100, freq = FALSE)
the histogram is significantly different from the theoretical density.
This change in behavior happens when I increase the number of breaks from 51 to 52. Why does it happen?
Is has to do with the fact that the data you are generating from rbinom isn't continuous. It's discrete. There are only ~35 discrete values in there (with set.seed(15) and length(unique(data))). When you force the histogram to have 100 breaks, you find that many of those bin end up being empty
sum(hist(data, breaks = 100, freq = FALSE)$counts==0)
# [1] 36
So if you'll notice the second histogram has a bar, then a space (for a bar with height 0), repeating. The total area under the curve has to be the same for both histograms but because the bars in the second plot are half as wide, they need to be twice as all.
The point of all of this is to be careful when using histograms with discrete data. They are intended for continuous data. Also, the number of bins you choose can make a big difference on interpretation. If you change defaults, you should have a very good reason to do so.
Look at the values in data -- the precision is limited to tenths of a unit. Therefore, if you have too many bins, some of the bins will fall between the data points and will have a zero hit count. The others will have a correspondingly higher density.
In your experiments, there is a discontinuous effect because breaks...
is a suggestion only; the breakpoints will be set to pretty values
You can override the arbitrary behavior of breaks by precisely specifying the breaks with a vector. I demonstrate that below, along with a more direct (integer-based) histogram of the binomial results:
probability=0.5 ## probability of success per trial
trials=14 ## number of trials per result
reps=1e6 ## number of results to generate (data size)
## generate histogram of random binomial results
data <- rbinom(reps,trials,probability)
offset = 0.5 ## center histogram bins around integer data values
window <- seq(0-offset,trials+offset) ## create the vector of 'breaks'
hist(data,breaks=window)
## demonstrate the central limit theorem with a predictive curve over the histogram
population_variance = probability*(1-probability) ## from model of Bernoulli trials
prediction_variance <- population_variance / trials
y <- dnorm(seq(0,1,0.01),probability,sqrt(prediction_variance))
lines(seq(0,1,0.01)*trials,y*reps/trials, col='green', lwd=4)
Regarding the first chart shown in the question: Using repet <- 10000, the histogram should be very close to normal (the "Law of large numbers" results in convergence), and running the same experiment repeatedly (or further increasing repet) doesn't change the shape much -- despite the explicit randomness. The apparent randomness in the first chart is also an artifact of the plotting bug in question. To put it more plainly: both charts shown in the question are very wrong (because of breaks).
Related
I want to change the x axis range of my histogram.
How can I remove all hours for which there is no data from the x axis?
Help would be appreciated.
One way you can artificially squeeze the values together is to change the bin size. Here is a randomly generated extreme bimodal distribution like yours. In each bar is 500 counts of data:
hist(rbinom(n=1000,
size=1,
prob = .50))
If we simply reduce the number of bins that the values go into, it reduces the gap in between the values and labeling below where they lie:
hist(rbinom(n=1000,
size=1,
prob = .50),
breaks = 3)
The problem lies in that data won't be represented accurately depending on what you're doing, so I would suggest explaining why you made that decision if you share this elsewhere.
A less extreme example below with various counts but still gaps:
hist(rbinom(n=1000,
size=10,
prob = .20))
And here if we shift the bins again, they will distribute in the plot with less gaps, making it more right skewed but more distribution within the bars:
hist(rbinom(n=1000,
size=10,
prob = .20),
breaks = 6)
As far as I'm aware there isn't really a way to just remove the values in between the plot. For one, its a distribution plot, so whatever values you remove a priori will just redistribute when you generate the plot. Second, if there is in fact a way, and I wouldn't doubt there is, your x axis would look strange, as you would have a bunch of missing data unaccounted for, looking like this:
Therefore, changing the bins, showing where the data is actually allocated, is probably the best choice.
I have a single series of values (i.e. one column of data), and I would like to create a plot with the range of data values on the x-axis and the frequency that each value appears in the data set on the y-axis.
What I would like is very close to a Kernel Density Plot:
# Kernel Density Plot
d <- density(mtcars$mpg) # returns the density data
plot(d) # plots the results
and Frequency distribution in R on stackoverflow.
However, I would like frequency (as opposed to density) on the y-axis.
Specifically, I'm working with network degree distributions, and would like a double-log scale with open, circular points, i.e. this image.
I've done research into related resources and questions, but haven't found what I wanted:
Cookbook for R's Plotting distributions is close to what I want, but not precisely. I'd like to replace the y-axis in its density curve example with "count" as it is defined in the histogram examples.
The ecdf() function in R (i.e. this question) may be what I want, but I'd like the observed frequency, and not a normalized value between 0 and 1, on the y-axis.
This question is related to frequency distributions, but I'd like points, not bars.
EDIT:
The data is a standard power-law distribution, i.e.
dat <- c(rep(1, 1000), rep(10, 100), rep(100, 10), 100)
The integral of a density is approximately 1 so multiplying the density$y estimate by the number of values should give you something on the scale of a frequency. If you want a "true" frequency then you should use a histogram:
d <- density(mtcars$mpg)
d$y <- d$y * length(mtcars$mpg) ; plot(d)
This is a histogram with breaks that are 1 unit each:
hist(mtcars$mpg,
breaks=trunc(min(mtcars$mpg)):(1+trunc(max(mtcars$mpg))), add=TRUE)
So this is the superposed comparison:
d <- density(mtcars$mpg)
d$y <- d$y * length(mtcars$mpg) ; plot(d, ylim=c(0,4) )
hist(mtcars$mpg, breaks=trunc(min(mtcars$mpg)):(1+trunc(max(mtcars$mpg))), add=TRUE)
You'll want to look at the density page where the default density bandwidth choice is criticized and alternatives offered. f you use the adjust parameter you might see a closer (smoothed correspondence to the histogram
If you have discrete values for observations and want to make a plot with points on the log scale, then
dat <- c(rep(1, 1000), rep(10, 100), rep(100, 10), 100)
dd<-aggregate(rep.int(1, length(dat))~dat, FUN=sum)
names(dd)<-c("val","freq")
plot(freq~val, dd, log="xy")
might be what you are after.
So I have a barplot in which the y axis is the log (frequencies). From just eyeing it, it appears that bars decrease exponentially, but I would like to know this for sure. What I want to do is also plot an exponential on this same graph. Thus, if my bars fall below the exponential, I would know that my bars to decrease either exponentially or faster than exponential, and if the bars lie on top of the exponential, I would know that they dont decrease exponentially. How do I plot an exponential on a bar graph?
Here is my graph if that helps:
If you're trying to fit density of an exponential function, you should probably plot density histogram (not frequency). See this question on how to plot distributions in R.
This is how I would do it.
x.gen <- rexp(1000, rate = 3)
hist(x.gen, prob = TRUE)
library(MASS)
x.est <- fitdistr(x.gen, "exponential")$estimate
curve(dexp(x, rate = x.est), add = TRUE, col = "red", lwd = 2)
One way of visually inspecting if two distributions are the same is with a Quantile-Quantile plot, or Q-Q plot for short. Typically this is done when inspecting if a distribution follows standard normal.
The basic idea is to plot your data, against some theoretical quantiles, and if it matches that distribution, you will see a straight line. For example:
x <- qnorm(seq(0,1,l=1002)) # Theoretical normal quantiles
x <- x[-c(1, length(x))] # Drop ends because they are -Inf and Inf
y <- rnorm(1000) # Actual data. 1000 points drawn from a normal distribution
l.1 <- lm(sort(y)~sort(x))
qqplot(x, y, xlab="Theoretical Quantiles", ylab="Actual Quantiles")
abline(coef(l.1)[1], coef(l.1)[2])
Under perfect conditions you should see a straight line when plotting the theoretical quantiles against your data. So you can do the same plotting your data against the exponential function you think it will follow.
I have two related problems.
Problem 1: I'm currently using the code below to generate a histogram overlayed with a density plot:
hist(x,prob=T,col="gray")
axis(side=1, at=seq(0,100, 20), labels=seq(0,100,20))
lines(density(x))
I've pasted the data (i.e. x above) here.
I have two issues with the code as it stands:
the last tick and label (100) of the x-axis does not appear on the histogram/plot. How can I put these on?
I'd like the y-axis to be of count or frequency rather than density, but I'd like to retain the density plot as an overlay on the histogram. How can I do this?
Problem 2: using a similar solution to problem 1, I now want to overlay three density plots (not histograms), again with frequency on the y-axis instead of density. The three data sets are at:
http://pastebin.com/z5X7yTLS
http://pastebin.com/Qg8mHg6D
http://pastebin.com/aqfC42fL
Here's your first 2 questions:
myhist <- hist(x,prob=FALSE,col="gray",xlim=c(0,100))
dens <- density(x)
axis(side=1, at=seq(0,100, 20), labels=seq(0,100,20))
lines(dens$x,dens$y*(1/sum(myhist$density))*length(x))
The histogram has a bin width of 5, which is also equal to 1/sum(myhist$density), whereas the density(x)$x are in small jumps, around .2 in your case (512 even steps). sum(density(x)$y) is some strange number definitely not 1, but that is because it goes in small steps, when divided by the x interval it is approximately 1: sum(density(x)$y)/(1/diff(density(x)$x)[1]) . You don't need to do this later because it's already matched up with its own odd x values. Scale 1) for the bin width of hist() and 2) for the frequency of x length(x), as DWin says. The last axis tick became visible after setting the xlim argument.
To do your problem 2, set up a plot with the correct dimensions (xlim and ylim), with type = "n", then draw 3 lines for the densities, scaled using something similar to the density line above. Think however about whether you want those semi continuous lines to reflect the heights of imaginary bars with bin width 5... You see how that might make the density lines exaggerate the counts at any particular point?
Although this is an aged thread, if anyone catches this. I would only think it is a 'good idea' to forego translating the y density to count scales based on what the user is attempting to do.
There are perfectly good reasons for using frequency as the y value. One idea in particular that comes to mind is that using counts for the y scale value can give an analyst a good idea about where to begin the 'data hunt' for stratifying heterogenous data, if a mixed distribution model cannot soundly or intuitively be applied.
In practice, overlaying a density estimate over the observed histogram can be very useful in data quality checks. For example, in the above, if I were looking at the above graphic as a single source of data with the assumption that it describes "1 thing" and I wish to model this as "1 thing", I have an issue. That is, I have heterogeneous data which may require some level of stratification. The density overlay then becomes a simple visual tool for detecting heterogeneity (apart from using log transformations to smooth between-interval variation), and a direction (locations of the mixed distributions) for stratifying the data.
I have two density curves plotted using this:
Network <- Mydf$Networks
quartiles <- quantile(Mydf$Avg.Position, probs=c(25,50,75)/100)
density <- ggplot(Mydf, aes(x = Avg.Position, fill = Network))
d <- density + geom_density(alpha = 0.2) + xlim(1,11) + opts(title = "September 2010") + geom_vline(xintercept = quartiles, colour = "red")
print(d)
I'd like to compute the area under each curve for a given Avg.Position range. Sort of like pnorm for the normal curve. Any ideas?
Calculate the density seperately and plot that one to start with. Then you can use basic arithmetics to get the estimate. An integration is approximated by adding together the area of a set of little squares. I use the mean method for that. the length is the difference between two x-values, the height is the mean of the y-value at the begin and at the end of the interval. I use the rollmeans function in the zoo package, but this can be done using the base package too.
require(zoo)
X <- rnorm(100)
# calculate the density and check the plot
Y <- density(X) # see ?density for parameters
plot(Y$x,Y$y, type="l") #can use ggplot for this too
# set an Avg.position value
Avg.pos <- 1
# construct lengths and heights
xt <- diff(Y$x[Y$x<Avg.pos])
yt <- rollmean(Y$y[Y$x<Avg.pos],2)
# This gives you the area
sum(xt*yt)
This gives you a good approximation up to 3 digits behind the decimal sign. If you know the density function, take a look at ?integrate
Three possibilities:
The logspline package provides a different method of estimating density curves, but it does include pnorm style functions for the result.
You could also approximate the area by feeding the x and y variables returned by the density function to the approxfun function and using the result with the integrate function. Unless you are interested in precise estimates of small tail areas (or very small intervals) then this will probably give a reasonable approximation.
Density estimates are just sums of the kernels centered at the data, one such kernel is just the normal distribution. You could average the areas from pnorm (or other kernels) with the sd defined by the bandwidth and centered at your data.