test <- rep(5,20)
hist(test,freq=FALSE,breaks=5)
The vector contains 20 times the value 5. When I plot this with freq=FALSE and breaks=5 I expect to see 1 bar at x=5 with height = 1.0, because the value 5 makes up 100% of the data.
Why do I instead see a single bar that spans x=0 to x=5 and has height = 0.2?
hist plots an estimate of the probability density when freq=FALSE or prob=TRUE, so the total area of the bars in the histogram sums to 1. Since the single bar that is plotted spans the horizontal range (0,5), i.e. has width 5, its height must be 0.2 (5 * 0.2 = 1).
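As a quick check of the area argument, here is a minimal sketch that asks hist for its numbers without drawing anything and multiplies each bar height by its width. The names h, widths and area are only for illustration.

```r
test <- rep(5, 20)

# Compute the histogram without drawing it
h <- hist(test, breaks = 5, plot = FALSE)

widths <- diff(h$breaks)             # width of each bar
area   <- sum(h$density * widths)    # height times width, summed over bars

print(h$breaks)    # bin edges chosen by hist
print(h$density)   # bar heights used when freq = FALSE
print(area)        # total area of the density histogram
```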
If you really want the histogram you were expecting (i.e. heights correspond to fraction of counts, areas don't necessarily sum to 1), you can do this:
h <- hist(test,plot=FALSE)
h$counts <- h$counts/length(test)
plot(h)
Another possibility is to force the bar widths to be equal to 1.0, e.g.
hist(test,freq=FALSE,breaks=0:10)
Or maybe you want
plot(table(test)/length(test))
or
plot(table(test)/length(test),lwd=10,lend="butt")
?
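To see why unit-width bins give the picture the question expected, the sketch below (variable names are illustrative) checks that with breaks=0:10 the density of each bin equals its share of the data, which is also what table(test)/length(test) reports.

```r
test <- rep(5, 20)

# Unit-width bins: (0,1], (1,2], ..., (9,10]
h <- hist(test, breaks = 0:10, plot = FALSE)

rel_freq <- h$counts / length(test)    # fraction of observations per bin

# With bin width 1, density and relative frequency coincide
print(all.equal(h$density, rel_freq))

# The same fractions, one entry per distinct value
print(table(test) / length(test))
```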
See also: How do you use hist to plot relative frequencies in R?
Who can explain this to me?
If I run the following
repet <- 10000
size <- 100
p <- .5
data <- (rbinom(repet, size, p) - size * p) / sqrt(size * p * (1-p))
hist(data, freq = FALSE)
x = seq(min(data) - 1, max(data) + 1, .01)
lines(x, dnorm(x), col='green', lwd = 4)
then I get reasonable agreement of the histogram and the theoretical density (due to the Central Limit Theorem).
If I use
hist(data, breaks = 100, freq = FALSE)
the histogram is significantly different from the theoretical density.
This change in behavior happens when I increase the number of breaks from 51 to 52. Why does it happen?
It has to do with the fact that the data you are generating from rbinom isn't continuous; it's discrete. There are only ~35 distinct values in there (with set.seed(15), check length(unique(data))). When you force the histogram to have 100 breaks, many of those bins end up being empty:
sum(hist(data, breaks = 100, freq = FALSE)$counts==0)
# [1] 36
So you'll notice that the second histogram has a bar, then a space (for a bar with height 0), repeating. The total area of the bars has to be the same for both histograms, but because the bars in the second plot are half as wide, they need to be twice as tall.
The point of all of this is to be careful when using histograms with discrete data. They are intended for continuous data. Also, the number of bins you choose can make a big difference on interpretation. If you change defaults, you should have a very good reason to do so.
Look at the values in data -- they are not continuous but fall on a grid with spacing 0.2, because 1/sqrt(size * p * (1-p)) = 1/5. Therefore, if you have too many bins, some of the bins will fall between the data points and will have a zero hit count. The others will have a correspondingly higher density.
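A small sketch of that effect, using the same simulation as in the question. The seed and the names spacing, h and bin_width are arbitrary choices for illustration. It compares the gap between neighbouring data values with the bin width that hist ends up using.

```r
set.seed(15)   # arbitrary seed, only for reproducibility
repet <- 10000
size  <- 100
p     <- 0.5
data  <- (rbinom(repet, size, p) - size * p) / sqrt(size * p * (1 - p))

# Smallest gap between two distinct data values
spacing <- min(diff(sort(unique(data))))

h <- hist(data, breaks = 100, plot = FALSE)
bin_width <- min(diff(h$breaks))

print(spacing)               # gap between neighbouring data values
print(bin_width)             # bin width actually used
print(sum(h$counts == 0))    # number of empty bins
```

When the bin width is smaller than the gap between data values, some bins cannot contain any observation, whatever the sample size.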
In your experiments, there is a discontinuous effect because breaks...
is a suggestion only; the breakpoints will be set to pretty values
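The sketch below illustrates this. It assumes that hist, when given a single number for breaks, derives its bin edges from pretty(range(x), n = breaks, min.n = 1); that is my reading of the default method, so treat it as an assumption. The data are regenerated with an arbitrary seed.

```r
set.seed(1)   # arbitrary seed
# Same standardisation as in the question: size = 100, p = 0.5
data <- (rbinom(10000, 100, 0.5) - 100 * 0.5) / sqrt(100 * 0.5 * 0.5)

for (n in c(51, 52, 100)) {
  h <- hist(data, breaks = n, plot = FALSE)
  # Assumption: these are the edges hist derives from the suggestion n
  edges <- pretty(range(data), n = n, min.n = 1)
  cat("requested:", n,
      " bins used:", length(h$breaks) - 1,
      " bin width:", diff(h$breaks)[1],
      " same as pretty():", isTRUE(all.equal(h$breaks, edges)), "\n")
}
```

The number of bins actually used can differ a lot from the number requested, and it changes in jumps rather than smoothly.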
You can override the arbitrary behavior of breaks by precisely specifying the breaks with a vector. I demonstrate that below, along with a more direct (integer-based) histogram of the binomial results:
probability=0.5 ## probability of success per trial
trials=14 ## number of trials per result
reps=1e6 ## number of results to generate (data size)
## generate histogram of random binomial results
data <- rbinom(reps,trials,probability)
offset = 0.5 ## center histogram bins around integer data values
window <- seq(0-offset,trials+offset) ## create the vector of 'breaks'
hist(data,breaks=window)
## demonstrate the central limit theorem with a predictive curve over the histogram
population_variance = probability*(1-probability) ## from model of Bernoulli trials
prediction_variance <- population_variance / trials
y <- dnorm(seq(0,1,0.01),probability,sqrt(prediction_variance))
lines(seq(0,1,0.01)*trials,y*reps/trials, col='green', lwd=4)
Regarding the first chart shown in the question: Using repet <- 10000, the histogram should be very close to normal (the "Law of large numbers" results in convergence), and running the same experiment repeatedly (or further increasing repet) doesn't change the shape much -- despite the explicit randomness. The apparent randomness in the first chart is also an artifact of the plotting bug in question. To put it more plainly: both charts shown in the question are very wrong (because of breaks).
I am making a histogram in R using:
hist(SOME_MATRIX[,4],breaks=500,ylim = c(0,1000))
But my bars are much taller than the range I gave to the y-axis (0 - 1000). Is there a way using "hist()" to cut off the bars at a maximum value as well?
With the caveats discussed in the comments, here's how to cut off the bars at 1000:
# Save plot data in an object
x=hist(rnorm(1e5),breaks=50,ylim = c(0,1000))
# Cut off counts at 1000
x$counts[x$counts>1000] = 1000
# Re-plot histogram. Max of y-range is > 1000 to show cutoff.
plot(x, ylim=c(0,1500))
I had some problems while trying to plot a histogram to show the frequency of every value while plotting the value as well. For example, suppose I use the following code:
x <- sample(1:10,1000,replace=T)
hist(x,label=TRUE)
The result is a plot with labels over the bars, but with the frequencies of 1 and 2 merged into a single bar.
Apart from separating this bar into two, one for 1 and one for 2, I also need to put the values under each bar.
For example, with the code above the number 10 appears under the tick at the right edge of its bar, whereas I need the values placed right under the bars.
Is there any way to do both in a single histogram with hist function?
Thanks in advance!
Calling hist invisibly returns an object with information you can use to modify the plot. You can pull out the midpoints and the heights and use them to put the labels where you want them. You can use the pos argument in text to specify where the label should be in relation to the point (thanks #rawr).
x <- sample(1:10,1000,replace=T)
## Histogram
info <- hist(x, breaks = 0:10)
with(info, text(mids, counts, labels=counts, pos=1))
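For the second part of the question (values under each bar), one possible sketch is to suppress the default x axis and draw a new one at the bin midpoints. Using xaxt = "n" together with axis() this way is my suggestion, not part of the answer above.

```r
x <- sample(1:10, 1000, replace = TRUE)

# One bin per integer: (0,1], (1,2], ..., (9,10]; suppress the default x axis
info <- hist(x, breaks = 0:10, xaxt = "n")

# Put the values 1..10 under the centre of their bars
axis(side = 1, at = info$mids, labels = 1:10)

# Counts just below the top of each bar
text(info$mids, info$counts, labels = info$counts, pos = 1)
```

Because bin i covers (i-1, i], its midpoint i - 0.5 is where the label for value i belongs.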
delme <- exp(rnorm(1000,1.5,0.3))
boxplot(delme,log="y")
boxplot(log10(delme))
Why are the whiskers different in these 2 plots?
Thanks
Agus
I would say that in your first plot you just changed the y axis to log, so the values you plot still range between roughly 1 and 10. In this plot the y axis is a log scale. The whiskers on this axis look different because the space between each "tick" (i.e. axis break) is not constant (there is more space between 2 and 4 than between 8 and 10).
In the second plot, you take the log of the values then plot them, so they range from .2 to 1, and are plotted with a linear y axis.
Compare the summary of the raw data with that of the log-transformed data.
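A sketch of that comparison, using boxplot.stats (which computes the five numbers a boxplot draws) rather than summary; the seed and variable names are arbitrary. As far as I understand boxplot, log="y" only changes the axis, so the whisker limits are still computed on the untransformed data, whereas boxplot(log10(delme)) computes them on the log scale.

```r
set.seed(42)   # arbitrary seed, only for reproducibility
delme <- exp(rnorm(1000, 1.5, 0.3))

# Order: lower whisker, lower hinge, median, upper hinge, upper whisker
raw_stats <- boxplot.stats(delme)$stats          # basis of boxplot(delme, log = "y")
log_stats <- boxplot.stats(log10(delme))$stats   # basis of boxplot(log10(delme))

# Back-transform the log-scale statistics so both rows are in the same units
print(rbind(raw = raw_stats, log_back = 10^log_stats))
```

The median and hinges should agree closely between the two rows, while the whisker ends can differ because the 1.5 * IQR rule is applied on different scales.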
I want to plot a relative frequency histogram in R, of this data:
0.1575850
0.1378830 0.1462112 0.1303224 0.3538677 0.2497142 0.2359662 0.1647894 0.1861195
0.3957871 0.2135463 0.1584121 0.1690736 0.4232640 0.2058885 0.1615527 0.3250968
0.1529143 0.5984977 0.2334365 0.2141899 0.1495538
I want to use seq(0,1,0.2) for the argument "breaks", and set freq=FALSE to get the DENSITY (not the counts) plot. Based on what the hist function help states, I would expect that the total area of the relative frequency histogram (or the sum of $density) would be equal to one, but instead I'm getting this:
cc$density
[1] 2.5000000 2.0454545 0.4545455 0.0000000 0.0000000
Any suggestions as to what could be happening? I tried to use the histogram function of {lattice}, and the histogram seems fine, but I couldn't change the size of the label and axis text using the regular arguments (cex.lab and cex.axis).
Thanks for your help and time.
Hint: sum(cc$density) == 5, and 5 * 0.2 == 1. (You can stop reading here, or...)
To calculate the area under the histogram, you have to multiply the height of each bar (which is what cc$density gives you) by the width of each bar, which is 0.2 in your case, and then add up the results.
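The same check in code, with the data from the question typed in as a vector (the name x is mine; cc matches the object used above):

```r
x <- c(0.1575850,
       0.1378830, 0.1462112, 0.1303224, 0.3538677, 0.2497142, 0.2359662, 0.1647894, 0.1861195,
       0.3957871, 0.2135463, 0.1584121, 0.1690736, 0.4232640, 0.2058885, 0.1615527, 0.3250968,
       0.1529143, 0.5984977, 0.2334365, 0.2141899, 0.1495538)

cc <- hist(x, breaks = seq(0, 1, 0.2), plot = FALSE)

print(sum(cc$density))                      # sum of bar heights, not the area
print(sum(cc$density * diff(cc$breaks)))    # area: height times width, summed

# Relative frequencies (fraction of observations per bin)
print(cc$counts / sum(cc$counts))
```

If relative frequencies are what you are after, the last line gives them directly; they sum to 1 regardless of the bin width.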