I am surprised that there seems to be no question about this problem. At least I haven't found any with an accurate answer.
Take the simple case of rolling two dice and adding the pips shown. Possible results range from 2 to 12. Now I want to plot the histogram for this event, i.e. one bin per possible number. That would make 11 bins (2, 3, 4, 5, ..., 12).
# Example dataset: how often did we get "2","3", "4"(1x2, 3x3, 2x4, 4x5, 8x6, 14x7, ...)
Dice <- c(2,rep(3,3),rep(4,2),rep(5,4),rep(6,8),rep(7,14),rep(8,9),rep(9,5),rep(10,4),rep(11,1),rep(12,2))
hist(Dice,breaks=seq(2,12)) # 11 breakpoints give only 10 bins: 2 and 3 share the first bin
hist(Dice,breaks=11) # same for automatic breaks (and for breaks=12 or 13...)
What I need is a histogram plot with 11 bins - that is one bin per possible result. How can I trick R into doing this?
Thank you!
hist(Dice,breaks=seq(1.5,12.5))
This is not a histogram per se, but you could try this:
barplot(table(Dice))
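To check that the half-integer breaks really give one bin per possible result, the bin counts can be compared with the tabulation. A minimal sketch, using the Dice vector from the question and plot = FALSE so that nothing is drawn:

```r
Dice <- c(2, rep(3,3), rep(4,2), rep(5,4), rep(6,8), rep(7,14),
          rep(8,9), rep(9,5), rep(10,4), rep(11,1), rep(12,2))

# Breaks at 1.5, 2.5, ..., 12.5 put each integer in the middle of its own bin
h <- hist(Dice, breaks = seq(1.5, 12.5), plot = FALSE)

length(h$counts)                          # number of bins
h$mids                                    # bin centres
all(h$counts == as.vector(table(Dice)))   # do the counts match the tabulation?
```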
I am trying to plot a histogram using R.
I decided to use the function hist() but I cannot understand why by changing the "breaks" option the sum of the density also changes.
In fact if I write
h <- hist(data, freq =F, breaks = "FD")
and then run
sum(h$density)
the result is 2 (same thing for breaks = "Scott"). While if I use
h <- hist(data, freq =F)
the result is 1 (as expected).
Summing the density values only makes sense if your bins are one unit wide. You want to sum the areas of the bars, which is the density value times the bin width. Presumably your FD bins are half the width of the default bins.
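A small sketch of that check, assuming a made-up normal sample in place of the asker's data. The density values are stored in the returned object whether or not the histogram is drawn, so plot = FALSE is used here:

```r
set.seed(1)                       # arbitrary seed, for reproducibility only
data <- rnorm(1000)               # made-up sample standing in for the real data

h <- hist(data, breaks = "FD", plot = FALSE)

sum(h$density)                    # depends on the bin width, so not 1 in general
sum(h$density * diff(h$breaks))   # total area of the bars
```

diff(h$breaks) gives the width of each bin, so the second sum is height times width added up over all bars.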
When you make a histogram and define the breaks argument, R uses some functions to generate those breaks. I want to obtain the range values for the breaks generated by the histogram such that if I made the following histogram
hist(df$foo, breaks = 5)
I want a list or data.frame that has the value ranges of the breaks:
list(c("1_lower"="<num>","1_upper"="<num2>","2_lower"="<num3>","2_upper"="<num4>"))
I hope this is possible. Any help is greatly appreciated.
According to the documentation ?hist - if you set h<-hist(...), then h$breaks will give you the breakpoints.
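For example, the lower and upper edge of each bin can be put into a data.frame like this. This is only a sketch: foo is made-up data standing in for df$foo, and the column names are my own choice.

```r
set.seed(1)
foo <- runif(200, min = 0, max = 50)   # made-up data standing in for df$foo

h <- hist(foo, breaks = 5, plot = FALSE)

# Consecutive breakpoints are the lower and upper edge of each bin
bins <- data.frame(lower = head(h$breaks, -1),
                   upper = tail(h$breaks, -1),
                   count = h$counts)
bins
```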
I have really skewed data and I want to set my histogram's last bin to run from a threshold number to infinity, so that the histogram is not dominated by the long tail. I know we can set xlim or coord_cartesian to zoom in, but I want to keep all the data.
library(ggplot2)
x <- data.frame(x = 100*rbeta(10000,2,50)) # name the column so that aes(x) can find it
ggplot(data=x,aes(x))+geom_histogram(bins=20)+scale_x_continuous(breaks =seq(1,100,by=5))
The accepted answer will get a little ugly if the aggregated bin gets too big. You can map the values (mapvalues() is from the plyr package):
x <- mapvalues(x,
from = c(aggBinLow:aggBinHigh),
to = c(rep.int(aggBinLow,aggBinHigh-aggBinLow+1)))
and add a scale with distinct values:
g +
scale_x_continuous(breaks=min:aggBinLow,labels=c(sprintf("%s",min:aggBinLow-1),">aggBinLow-1"))
Use geom_histogram(breaks=c(...)) to set customised bins, where c(...) is the vector of values you want. For example c(seq(from=1,to=11,by=1),100000)
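A minimal sketch of the breaks approach, assuming the beta-distributed data from the question and an arbitrary threshold of 10 chosen only for illustration:

```r
library(ggplot2)

set.seed(1)
x <- data.frame(x = 100 * rbeta(10000, 2, 50))

# Unit-wide bins from 0 to 10, then one wide bin for everything above 10
brks <- c(seq(0, 10, by = 1), 100)

p <- ggplot(x, aes(x)) + geom_histogram(breaks = brks)
p

sum(layer_data(p)$count)   # number of observations that ended up in a bin
```

Note that the final bar is drawn at its real width (from 10 to 100 here), so the x-axis still extends to the upper break.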
I have a matrix with approximate dimensions 20000 x 1. I would like to plot the values in a histogram with bins of width 0.01 from -0.05 to +0.15. However, the values in the matrix are pretty random, for example 0.0123421, 0.0124523, 0.124523, -0.011234, etc. Thus, I need to first count the number of values that fall into a particular bin, and then plot a histogram. For the numbers I gave, I'd have 2 values between 0.01 and 0.02, 1 between -0.02 and -0.01, and so on, which I need in a histogram. Is there an easy way to do this? I'm relatively new to R, so any help is appreciated!
As an example illustrating breaks (content summarized from an excellent post on R-bloggers which you can refer to here), let's assume that you start with some normally distributed data. In R, you can generate normal data using the rnorm() function:
data <- rnorm(n=1000, mean=24.2, sd=2.2)
We can then generate a simple histogram using the following call:
hist(data)
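Now, let's assume that you want to have coarser or finer groups for your bins. There are a number of ways to do this. You could, for example, use the breaks argument. A single number is treated only as a suggestion for the number of cells, because R rounds the breakpoints to "pretty" values. Below is a tidy example illustrating this: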
hist(data, breaks=20, main="Breaks=20")
hist(data, breaks=5, main="Breaks=5")
Now, if you want more control over exactly where the breakpoints between bins fall, you can give the breaks argument a vector of breakpoints, like this:
hist(data, breaks=c(17,20,23,26,29,32), main="Breaks is vector of breakpoints")
This dictates exactly the start and end point of each bin. Of course, you could give the breaks vector as a sequence like this to cut down on the messiness of the code:
hist(data, breaks=seq(17,32,by=3), main="Breaks is vector of breakpoints")
Note that when giving breakpoints, the default for R is that the histogram cells are right-closed (left open) intervals of the form (a,b]. You can change this with the right=FALSE option, which would change the intervals to be of the form [a,b). This is important if you have a lot of points exactly at the breakpoint.
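A tiny illustration of the difference, with toy values chosen so that two points sit exactly on a breakpoint. The extreme values 1 and 3 are still counted in both cases because include.lowest defaults to TRUE:

```r
x <- c(1, 2, 2, 3)   # two points sit exactly on the breakpoint 2
brks <- c(1, 2, 3)

right_closed <- hist(x, breaks = brks, plot = FALSE)$counts
left_closed  <- hist(x, breaks = brks, right = FALSE, plot = FALSE)$counts

right_closed   # the two 2s are counted in the first bin
left_closed    # the two 2s are counted in the second bin
```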
hist(x, breaks = seq(-.05, .15, .01))
See ?hist
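If you also want the counts themselves, hist() returns them. A sketch with made-up uniform data standing in for the 20000 x 1 matrix; note that the breaks have to span the whole range of the data, otherwise hist() stops with an error:

```r
set.seed(1)
x <- matrix(runif(20000, min = -0.05, max = 0.15), ncol = 1)  # made-up 20000 x 1 matrix

brks <- seq(-0.05, 0.15, by = 0.01)
h <- hist(x, breaks = brks, plot = FALSE)   # plot = FALSE: only compute the bins

length(h$counts)                            # number of bins
counts <- data.frame(lower = head(h$breaks, -1),
                     upper = tail(h$breaks, -1),
                     count = h$counts)
head(counts)
```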
I'm a beginner R programmer attempting to plot a histogram of an insurance claims dataset with 100,000+ observations which is heavily skewed (mean=$61,000, median=$20,000, max value=$15M).
I've submitted the following code to graph the adj_unl_claim variable over the $0-$100,000 domain:
hist(test$adj_unl_claim, freq=FALSE, ylim=c(0,1), xlim=c(0,100000),
prob=TRUE, breaks=10, col='red')
with the result being an empty graph with axes but no histogram bars - just an empty graph.
I suspect the problem is related to the skewed nature of my data, but I've tried every combination of breaks and xlim and nothing works. Any solutions are much appreciated!
If you've set freq = FALSE, then you are getting a histogram of probability densities. These are likely much less than 1. Consequently, your histogram bars are probably printed super-tiny along the x-axis. Try again without setting the ylim, and R will automatically calculate reasonable y axis limits.
Note also that setting the xlim doesn't change the actual plot, just how much of it you see. So you might not actually see 10 breaks, if some of them fall beyond the 100000 limit in your plot. You might actually want to subset your data to exclude values over 100000 first, and then do a histogram on the reduced dataset to get the plot you want. Maybe, I'm not sure what your objective is here.
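To see how small the densities get, here is a sketch with made-up exponential "claims" (not the asker's data):

```r
set.seed(1)
claim <- rexp(1000, rate = 1 / 20000)   # made-up skewed claims, mean about 20000

h <- hist(claim, plot = FALSE)
max(h$density)                          # far below 1, so ylim = c(0, 1) flattens the bars

# Leaving out ylim lets R pick y-axis limits that show the bars
hist(claim, freq = FALSE, col = "red")
```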
This might give you something to play with, using some of Tyler's suggestions.
> claim <- c(15000000, rexp(99999, rate = 1/400)^1.76)
> summary(claim)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
       0     4261    20080    61730    67790 15000000
>
> hs <- 100000 # highest value to show on histogram
> br <- 10 # number of bars to show on histogram
>
> hist(claim, xlim = c(0,hs), freq = FALSE, breaks = br*max(claim)/hs, col='red')
>
> length(claim[claim<hs]) / length(claim) #proportion of claims shown
[1] 0.82267
> sum(claim[claim<hs]) / sum(claim) #proportion of value shown
[1] 0.3057994
where hist produced something like this: [figure: histogram of claim, x-axis from 0 to 100,000]
The problem with this is that although the histogram covers about 82% of the claims in this pseudo-data, it only covers about 31% of the value of the claims. So unless the only point you want to make is that most claims are small, you might want to consider a different graph.
My guess is that the real point from your data is that while most claims are fairly small, most of the cost is in the big claims. The big claims will not show up in a histogram, even if you extend the scale. Instead break the claims up into groups of differing widths, including for example 0-$1000 and $1M+, and show with a dot plot (a) what proportion of claims fall into each group and (b) what proportion of the values of claims fall into each group.
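A rough sketch of that idea, using pseudo-data of the same shape as above. The group edges and labels are illustrative assumptions, not a recommendation:

```r
set.seed(1)
claim <- c(15000000, rexp(99999, rate = 1/400)^1.76)   # pseudo-data, values are made up

# Groups of differing widths; the cut points are assumptions for illustration
edges <- c(0, 1e3, 1e4, 1e5, 1e6, Inf)
labs  <- c("0-1k", "1k-10k", "10k-100k", "100k-1M", "1M+")
grp   <- cut(claim, breaks = edges, labels = labs, include.lowest = TRUE)

prop_count <- as.vector(table(grp)) / length(claim)            # share of claims per group
prop_value <- as.vector(tapply(claim, grp, sum)) / sum(claim)  # share of total value per group

# One block per group, with a dot for each of the two proportions
m <- rbind(claims = prop_count, value = prop_value)
colnames(m) <- labs
dotchart(m, xlab = "proportion")
```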
Two things to try:
hist(test$adj_unl_claim[test$adj_unl_claim < 100000])
will plot a histogram of all claims of less than $100k. This omits the tail in the interest of showing the bulk of the data. Alternatively,
hist(log(test$adj_unl_claim))
will log-transform your claim size, effectively bringing the long tail back in.
Thanks, subsetting my data did the trick. I also added two lines of code that calculate the proportion of observations in each histogram bin and then plot them with specific x and y limits:
k<-hist(gb2_agg$adj_unl_claim,prob=TRUE,breaks=100000)
k$counts<-k$counts/sum(k$counts)
plot(k,ylim=c(0,.02),xlim=c(0,50000),col='blue')