I am trying to set the number of bins in hist() in R to 10, as follows:
> hist(x, breaks=10)
But the number of bins is not exactly 10. I tried several other numbers of bins, and the same thing happens.
?hist says breaks can specify
a single number giving the number of cells for the histogram.
So I wonder what I can do now? Thanks!
You can always create custom breakpoints
x = rnorm(500)
brks = seq(-3,3,0.1)
hist(x, breaks = brks)
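As a minimal sketch of why a single number is not honored exactly: hist() hands that number to pretty(), which rounds the break points to "nice" values, so the bin count can come out different. The seed and the variable names below are only for illustration. Passing a vector of 11 break points is one way to get exactly 10 bins.

```r
set.seed(42)                      # arbitrary seed, only for reproducibility
x <- rnorm(500)

# With a single number, hist() uses pretty() to pick "nice" break points,
# so the number of bins is only approximately what was asked for.
suggested <- pretty(range(x), n = 10, min.n = 1)
length(suggested) - 1             # bins that hist(x, breaks = 10) ends up with

# A vector of 11 break points spanning the data gives exactly 10 bins.
brks <- seq(min(x), max(x), length.out = 11)
h <- hist(x, breaks = brks, plot = FALSE)
length(h$counts)                  # number of bins actually used
```

Calling plot(h) afterwards draws the histogram with those exact bins.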
Tim wrote in comments:
The problem with that is I specified brks = seq(min(x),max(x),length.out=500), but hist(x, breaks = brks) complained that some entries of x wouldn't be included in the histogram
I had the same problem. I suspect it happens because values on the border of the range are not counted. I have two solutions, but neither satisfies me 100%.
Solution 1.
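When making the sequence, set the minimum a little lower and the maximum a little higher. Pad by subtracting and adding a small amount rather than multiplying, because multiplying moves the limit the wrong way when min(x) or max(x) is negative.
eps = 1e-5 * diff(range(x))
brks = seq(min(x) - eps, max(x) + eps, length.out=500)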
Solution 2. Instead of hist() use a combination of cut() and barplot(). The plot looks almost the same as the one from hist(), but you don't get back a histogram object (breaks, counts, density) the way you do from hist().
barplot(table(cut(x, 10)), space=0)
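A quick way to check either solution is to confirm that no observation was dropped. The sketch below assumes a small additive margin (the size of pad is an arbitrary choice) and compares the counts from hist() with those from cut() and table() on the same breaks.

```r
set.seed(1)                       # arbitrary seed, only for reproducibility
x <- rnorm(500)                   # includes negative values

pad  <- 1e-6 * diff(range(x))     # small additive margin (assumed size)
brks <- seq(min(x) - pad, max(x) + pad, length.out = 500)

# plot = FALSE returns the histogram object without drawing it
h <- hist(x, breaks = brks, plot = FALSE)
sum(h$counts) == length(x)        # TRUE when every observation landed in a bin

# cut() + table() on the same breaks should give the same per-bin counts
counts <- as.vector(table(cut(x, brks, include.lowest = TRUE)))
all(counts == h$counts)
```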
Related
I'd like to plot a dataset that consists of two vectors of length 100. The mean difference of the vectors being high and the variance of each of them being considerably smaller, it is quite difficult to plot both vectors and still be able to see the variation within each vector.
I'd like to be able to set the breaks manually so that we can see both the difference between the vectors and the variation within them.
Consider this data set
a=rnorm(100,sd=0.005)+1
b=rnorm(100,sd=0.005)+10
vec = c(a,b)
Neither plot(vec) nor plot(vec,log="y") gives satisfying results, as it is not possible to distinguish the variation within the vector (see picture).
I'd like the breaks on the y-axis to be (min(a), max(a), 5, min(b), max(b)) (and get equal distance between them). How could one achieve that?
Depending on exactly what you are trying to do, a simple transformation of the data in each part of the vector might be enough:
vec2 <- c( (a - min(a))/ (max(a)-min(a)) , 3 + (b - min(b))/ (max(b)-min(b)) )
plot(vec2, axes=FALSE)
box()
axis(1)
axis(2, at=c(0,1,2,3,4), labels = round(c(min(a), max(a), 5, min(b), max(b)),2))
Alternative approaches might be a custom transformation in ggplot, a secondary axis in ggplot, breaking the graph into facets, or using ggbreak.
I'm analyzing numeric data with values from 1 to 7. I want to plot boxplots and show the significance across categories. My problem is that adding the labels also extends the values on the y axis. This might imply that the possible data range goes beyond 7, which is not ideal. I tried using ylim(), but it cuts off the significance labels. Is there a way to make the axis values run from 1 to 7 without cutting off the information that should appear beyond this range?
my current plot:
when using ylim()
the desired outcome is something like that:
As mentioned in the comments, the solution is setting breaks:
gboxplot(...)+ scale_y_continuous(breaks = seq(0, 7, by = 1))
I have really skewed data and I want to set my histogram's last bin to run from a threshold number to infinity so that my histogram will not be skewed. I know we can use xlim or coord_cartesian to zoom, but I want to keep all the data.
x=data.frame(x=100*rbeta(10000,2,50))
ggplot(data=x,aes(x))+geom_histogram(bins=20)+scale_x_continuous(breaks =seq(1,100,by=5))
The accepted answer will get a little ugly if the aggregated bin gets too big. You can map the values (mapvalues() is from the plyr package):
x <- mapvalues(x,
from = c(aggBinLow:aggBinHigh),
to = c(rep.int(aggBinLow,aggBinHigh-aggBinLow+1)))
and add a scale with distinct values:
g +
scale_x_continuous(breaks=min:aggBinLow,labels=c(sprintf("%s",min:aggBinLow-1),">aggBinLow-1"))
Use geom_histogram(breaks=c(...)) to set customised bins, where c(...) is the vector of values you want. For example c(seq(from=1,to=11,by=1),100000)
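The same "one wide last bin" idea can be sketched in base R with cut(), which accepts Inf as the final break. This is an illustration rather than the ggplot2 call above; the threshold of 10 and the variable names are assumptions chosen for the example.

```r
set.seed(1)                               # arbitrary seed, only for reproducibility
x <- 100 * rbeta(10000, 2, 50)            # skewed data, as in the question

threshold <- 10                           # assumed cut-off for the overflow bin
brks <- c(seq(0, threshold, by = 1), Inf) # unit-wide bins, then one open-ended bin

# cut() accepts Inf as the last break, so the final bin is (10, Inf]
bins   <- cut(x, breaks = brks, include.lowest = TRUE)
counts <- table(bins)

# One bar per bin; the last bar collects everything above the threshold
barplot(counts, space = 0, las = 2)
```

Because the bars are drawn from counts, the overflow bar has the same width as the others even though its bin is much wider, which keeps all the data on the plot without stretching the x axis.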
Who can explain this to me?
If I run the following
repet <- 10000
size <- 100
p <- .5
data <- (rbinom(repet, size, p) - size * p) / sqrt(size * p * (1-p))
hist(data, freq = FALSE)
x = seq(min(data) - 1, max(data) + 1, .01)
lines(x, dnorm(x), col='green', lwd = 4)
then I get reasonable agreement of the histogram and the theoretical density (due to the Central Limit Theorem).
If I use
hist(data, breaks = 100, freq = FALSE)
the histogram is significantly different from the theoretical density.
This change in behavior happens when I increase the number of breaks from 51 to 52. Why does it happen?
It has to do with the fact that the data you are generating from rbinom isn't continuous. It's discrete. There are only ~35 discrete values in there (with set.seed(15) and length(unique(data))). When you force the histogram to have 100 breaks, you find that many of those bins end up being empty
sum(hist(data, breaks = 100, freq = FALSE)$counts==0)
# [1] 36
So you'll notice the second histogram has a bar, then a space (for a bar with height 0), repeating. The total area of the bars has to be the same for both histograms, but because the bars in the second plot are half as wide, they need to be twice as tall.
The point of all of this is to be careful when using histograms with discrete data. They are intended for continuous data. Also, the number of bins you choose can make a big difference on interpretation. If you change defaults, you should have a very good reason to do so.
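The alternating bar/gap pattern can be reproduced with explicit breaks, which avoids depending on what pretty() happens to choose. This is a sketch under the question's parameters; the break ranges of ±10.05 and ±10.1 are assumptions picked so that the possible values sit in the middle of the bins.

```r
set.seed(15)
repet <- 10000; size <- 100; p <- 0.5
data <- (rbinom(repet, size, p) - size * p) / sqrt(size * p * (1 - p))

step <- 1 / sqrt(size * p * (1 - p))   # spacing between possible values (0.2 here)

# Bins half as wide as the spacing: every other bin cannot contain a value
narrow <- hist(data, breaks = seq(-10.05, 10.05, by = step / 2), plot = FALSE)
# Bins as wide as the spacing, each centred on one possible value
wide   <- hist(data, breaks = seq(-10.1, 10.1, by = step), plot = FALSE)

sum(narrow$counts == 0)                  # at least half of the narrow bins are empty
max(narrow$density) / max(wide$density)  # tallest narrow bar vs tallest wide bar
```

The tallest bar holds the same count in both histograms, so halving the bin width doubles its density; plot(narrow, freq = FALSE) and plot(wide, freq = FALSE) show the two shapes.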
Look at the values in data: they are all multiples of 0.2, that is, of 1/sqrt(size * p * (1-p)). Therefore, if you have too many bins, some of the bins will fall between the data points and will have a zero hit count. The others will have a correspondingly higher density.
In your experiments, there is a discontinuous effect because breaks...
is a suggestion only; the breakpoints will be set to pretty values
You can override the arbitrary behavior of breaks by precisely specifying the breaks with a vector. I demonstrate that below, along with a more direct (integer-based) histogram of the binomial results:
probability=0.5 ## probability of success per trial
trials=14 ## number of trials per result
reps=1e6 ## number of results to generate (data size)
## generate histogram of random binomial results
data <- rbinom(reps,trials,probability)
offset = 0.5 ## center histogram bins around integer data values
window <- seq(0-offset,trials+offset) ## create the vector of 'breaks'
hist(data,breaks=window)
## demonstrate the central limit theorem with a predictive curve over the histogram
population_variance = probability*(1-probability) ## from model of Bernoulli trials
prediction_variance <- population_variance / trials
y <- dnorm(seq(0,1,0.01),probability,sqrt(prediction_variance))
lines(seq(0,1,0.01)*trials,y*reps/trials, col='green', lwd=4)
Regarding the first chart shown in the question: Using repet <- 10000, the histogram should be very close to normal (the "Law of large numbers" results in convergence), and running the same experiment repeatedly (or further increasing repet) doesn't change the shape much -- despite the explicit randomness. The apparent randomness in the first chart is also an artifact of the plotting bug in question. To put it more plainly: both charts shown in the question are very wrong (because of breaks).
I need to plot a vector of numbers. Let's say these numbers range from 0 to 1000. I need to make a histogram where the x axis goes from 100 to 500, and I want to specify the number of bins to be 10. How do I do this?
I know how to use xlim and breaks separately, but I don't know how to get a given number of bins inside the custom range.
This is a very good question actually! I was bothered by this all the time but finally your question has kicked me to finally solve it :-)
Well, in this case we cannot simply do hist(x, xlim = c(100, 500), breaks = 9), as the breaks refer to the whole range of x, not related to xlim (in other words, xlim is used only for plotting, not for computing the histogram and setting the actual breaks). This is a clear flaw of the hist function and there is no simple remedy found in the documentation.
I think the easiest way out is to "xlim" the values before they go to the hist function:
x <- runif(1000, 0, 1000) # example data
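hist(x[x >= 100 & x <= 500], breaks = seq(100, 500, length.out = 11))
A single number passed to breaks is only a suggestion, so for exactly 10 cells pass a vector of 11 break points, as above.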
For more info see ?hist