Understanding hist() and break intervals in R [duplicate] - r

This question already has answers here:
Exact number of bins in Histogram in R
(5 answers)
Closed 4 years ago.
I've recently started using R and I don't think I'm understanding the hist() function well. I'm currently working with a numeric vector of length 296, and I'd like to divide it up into 10 equal intervals, and produce a frequency histogram to see which values fall into each interval. I thought hist(dataset, breaks = 10) would do the job, but it's dividing it into 12 intervals instead. I obviously misunderstood what breaks does.
If I want to divide up my data into 10 intervals in my histogram, how should I go about doing that? Thank you.

According to the documentation, a single number passed to the breaks argument is treated only as a suggestion, because hist() then chooses "pretty" breakpoints. To force exactly 10 equally spaced bins, the easiest approach is probably the following:
x = rnorm(50)
hist(x, breaks = seq(min(x), max(x), length.out = 11))
The length should be n+1 where n is the number of desired bins.
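A quick way to confirm this, as a minimal sketch, is to call hist() with plot = FALSE and count the bins it returns. The set.seed() call, the sample size of 296 (taken from the question) and the variable names are illustrative assumptions, not part of the original answer.

```r
# Illustrative check: n + 1 equally spaced breakpoints give exactly n bins.
set.seed(1)                      # assumed seed, only for reproducibility
x <- rnorm(296)                  # stand-in for the 296-element vector
n_bins <- 10
brks <- seq(min(x), max(x), length.out = n_bins + 1)

h <- hist(x, breaks = brks, plot = FALSE)   # compute the bins without drawing
length(h$counts)                 # number of bins actually used
sum(h$counts)                    # number of observations counted
```

Because the breakpoints run from min(x) to max(x), every observation falls inside some bin.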

If you read help(hist) you will find this explanation:
breaks: one of:
• a vector giving the breakpoints between histogram cells,
• a function to compute the vector of breakpoints,
• a single number giving the number of cells for the histogram,
• a character string naming an algorithm to compute the
number of cells (see ‘Details’),
• a function to compute the number of cells.
In the last three cases the number is a suggestion only; as
the breakpoints will be set to ‘pretty’ values, the number is
limited to ‘1e6’ (with a warning if it was larger). If
‘breaks’ is a function, the ‘x’ vector is supplied to it as
the only argument (and the number of breaks is only limited
by the amount of available memory).
So the help specifically says that if you provide the function with a number it will only be used as a suggestion.
One possible solution is to provide the breakpoints yourself, making sure they span the whole range of x, like so:
x <- rnorm(296)
hist(x, breaks=c(-4,-3,-2,-1,0,1,2,3,4,5))
If you would rather specify the number of bins than the breakpoints, you can use the cut function and plot the resulting factor:
plot(cut(x, 10))
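To see what cut() does here, the sketch below tabulates the factor it returns; calling plot() on a factor draws a bar chart of these counts. The seed and variable names are assumptions made for illustration.

```r
set.seed(1)                      # assumed seed, only for reproducibility
x <- rnorm(296)

groups <- cut(x, breaks = 10)    # factor with 10 equal-width intervals
counts <- table(groups)          # observations per interval
nlevels(groups)                  # number of intervals
sum(counts)                      # number of observations tabulated

barplot(counts, las = 2, cex.names = 0.6)   # same picture as plot(groups)
```

Unlike hist(), this produces a bar chart of a factor, so the x-axis is categorical rather than numeric.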

Related

R: Control number of histogram bins

I am using the hist function to analyze some data I generated. For an analysis assay I would like to control the number of histogram bins precisely.
I know the breaks argument, and I have seen that in many cases the number of bins appears to be directly related to the number passed as breaks (i.e. no_bins = no_breaks + 1).
Because of R's algorithm this is not always the case. Is there a way to force R to output a specific number of bins?
Let me know if I need to specify further details.
Best and many thanks!
From ?hist, there are several options for controlling the bins through the breaks argument.
breaks one of:
a vector giving the breakpoints between histogram cells,
a function to compute the vector of breakpoints,
a single number giving the number of cells for the histogram,
a character string naming an algorithm to compute the number of cells
(see ‘Details’),
a function to compute the number of cells.
In the last three cases the number is a suggestion only; the
breakpoints will be set to pretty values. If breaks is a function, the
x vector is supplied to it as the only argument.
For the greatest precision, you have to set the breakpoints exactly, either by supplying a vector of breakpoints, or a function to compute them. You need to cover the entire range of x with your breakpoints and there will be 1 more breakpoint than bins (i.e. no_bins + 1 = no_breaks).
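As a sketch of the second option, breaks can be a function that returns the full vector of breakpoints, which hist() then uses as given. The helper name equal_breaks and the choice of 5 bins are hypothetical, introduced only for this example.

```r
# Hypothetical helper: builds a function that returns n_bins + 1 breakpoints
# spanning the full range of the data it is given.
equal_breaks <- function(n_bins) {
  function(v) seq(min(v), max(v), length.out = n_bins + 1)
}

set.seed(1)                      # assumed seed, only for reproducibility
x <- rnorm(100)
h <- hist(x, breaks = equal_breaks(5), plot = FALSE)
length(h$breaks)                 # n_bins + 1 breakpoints
length(h$counts)                 # n_bins bins
```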

What does a single number mean when passed as parameter 'breaks' in an R histogram?

I am learning to plot histograms in R, but I have a problem understanding the "breaks" parameter when it is given a single number. The help says:
breaks: a single number giving the number of cells for the histogram
I did the following experiment:
data("women")
hist(women$weight, breaks = 7)
I expected this to give me 7 bins, but it gives me 10 bins instead.
What does breaks = 7 mean? What does "number of cells" mean in the help?
If you read the help entry for the breaks argument carefully to the end, it says:
breaks
one of:
a vector giving the breakpoints between histogram cells,
a function to compute the vector of breakpoints,
a single number giving the number of cells for the histogram,
a character string naming an algorithm to compute the number of cells (see ‘Details’),
a function to compute the number of cells.
In the last three cases the number is a suggestion only; the breakpoints will be set to pretty values. If breaks is a function, the
x vector is supplied to it as the only argument.
So, as you can see, the number n is treated only as a "suggestion": hist() tries to get close to that value, but the result depends on the input values and on whether they can be split nicely into n buckets (it uses the function pretty to compute the breakpoints).
Hence, the only way to force the number of bins is to provide the vector of breakpoints between the cells, for example:
data("women")
n <- 7
minv <- min(women$weight)
maxv <- max(women$weight)
breaks <- c(minv, minv + cumsum(rep.int((maxv - minv) / n, n-1)), maxv)
hist(women$weight, breaks = breaks)
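To make the difference visible, this sketch compares the breakpoints hist() derives from the suggestion breaks = 7 with an explicit vector of n + 1 equally spaced breakpoints. The seq() call is assumed here as a shorthand for the cumsum() construction above, and nothing is drawn because plot = FALSE.

```r
data("women")
w <- women$weight

# A single number is only a suggestion: hist() picks round ("pretty") breakpoints
h_suggested <- hist(w, breaks = 7, plot = FALSE)
h_suggested$breaks               # the breakpoints actually chosen
length(h_suggested$counts)       # 10 bins for this data, as in the question

# An explicit vector of n + 1 breakpoints is used as given
n <- 7
h_exact <- hist(w, breaks = seq(min(w), max(w), length.out = n + 1),
                plot = FALSE)
length(h_exact$counts)           # n bins
```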

How to count line segment occurrences by pixel in R?

I am trying to convey the concentration of lines in 2D space by showing the number of crossings through each pixel in a grid. I am picturing something similar to a density plot, but with more intuitive units. I was drawn to the spatstat package and its line segment class (psp) as it allows you to define line segments by their end points and incorporate the entire line in calculations. However, I'm struggling to find the right combination of functions to tally these counts and would appreciate any suggestions.
As shown in the example below with 50 lines, the density function produces values in (0, 140), the pixellate function tallies the total length through each pixel and takes values in (0, 0.04), and as.mask produces a binary indicator of whether a line went through each pixel. I'm hoping to see something where the scale takes integer values, say 0..10.
require(spatstat)
set.seed(1234)
numLines = 50
# define line segments
L = psp(runif(numLines),runif(numLines),runif(numLines),runif(numLines), window=owin())
# image with 2-dimensional kernel density estimate
D = density.psp(L, sigma=0.03)
# image with total length of lines through each pixel
P = pixellate.psp(L)
# binary mask giving whether a line went through a pixel
B = as.mask.psp(L)
par(mfrow=c(2,2), mar=c(2,2,2,2))
plot(L, main="L")
plot(D, main="density.psp(L)")
plot(P, main="pixellate.psp(L)")
plot(B, main="as.mask.psp(L)")
The pixellate.psp function allows you to optionally specify weights to use in the calculation. I considered trying to manipulate this to normalize the pixels to take a count of one for each crossing, but the weight is applied uniquely to each line (and not specific to the line/pixel pair). I also considered calculating a binary mask for each line and adding the results, but it seems like there should be an easier way. I know that you can sample points along a line, and then do a count of the points by pixel. However, I am concerned about getting the sampling right so that there is one and only one point per line crossing of a pixel.
Is there a straightforward way to do this in R? Otherwise, would this be an appropriate suggestion for a future package enhancement? Is this more easily accomplished in another language such as Python or MATLAB?
The example above and my testing have been with spatstat 1.40-0, R 3.1.2, on x86_64-w64-mingw32.
You are absolutely right that this is something to put in as a future enhancement. It will be done in one of the next versions of spatstat. It will probably be an option in pixellate.psp to count the number of crossing lines rather than measure the total length.
For now you have to do something a bit convoluted, for example:
require(spatstat)
set.seed(1234)
numLines = 50
# define line segments
L <- psp(runif(numLines),runif(numLines),runif(numLines),runif(numLines), window=owin())
# split into individual lines and use as.mask.psp on each
masklist <- lapply(1:nsegments(L), function(i) as.mask.psp(L[i]))
# convert to 0-1 image for easy addition
imlist <- lapply(masklist, as.im.owin, na.replace = 0)
rslt <- Reduce("+", imlist)
# plot
plot(rslt, main = "")

What method does outline=FALSE use to determine outliers? [duplicate]

This question already has answers here:
In ggplot2, what do the end of the boxplot lines represent?
(4 answers)
Closed 10 years ago.
In R, I have used the outline=FALSE parameter to exclude outliers when plotting a box and whisker for a particular set. It's worked spectacularly, but leaves me wondering how exactly it determines which elements are outliers.
boxplot(x, horizontal = TRUE, axes = FALSE, outline = FALSE)
An "outlier" in the terminology of box-and-whisker plots is any point in the data set that falls farther than a specified distance beyond the box, by default more than 1.5 times the interquartile range (the length of the box) below the lower hinge or above the upper hinge. To get there, see ?boxplot.stats: first, look at the definition of out in the output
out: the values of any data points which lie beyond the extremes of the whiskers (if(do.out)).
These are the "outliers".
Second, look at the definition of the whiskers, which are based on the coef parameter, which is 1.5 by default:
the whiskers extend to the most extreme data point which is no more than coef times the length of the box away from the box.
Finally, look at the definition of the "hinges", which are the ends of the box:
The two ‘hinges’ are versions of the first and third quartile, i.e., close to quantile(x, c(1,3)/4).
Put these together, and you get outliers defined (approximately) as points that lie more than 1.5 times the interquartile range below the first quartile or above the third quartile. The reasons for these somewhat convoluted definitions are (I think) partly historical and partly the desire to have the components of the plots reflect actual values that are present in the data (rather than, say, the halfway point between two data points) as much as possible. (You would probably need to go back to the original literature referenced in the help page for the full justifications and explanations.)
The thing to be careful about is that points defined as "outliers" by this algorithm are not necessarily outliers in the usual statistical sense (e.g. points that are surprisingly extreme based on a particular statistical model of the data). In particular, if you have a big data set you will necessarily see lots of "outliers" (one indication that you might want to switch to a more data-hungry graphical summary such as a violin plot or beanplot).
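The whisker rule quoted above (coef times the length of the box beyond the hinges) can be checked by hand. The sketch below, on a small made-up vector, computes the hinges with fivenum(), derives the limits from coef, and compares the points outside them with what boxplot.stats() reports. The data and variable names are assumptions for illustration.

```r
x <- c(1:10, 20)                 # made-up data with one extreme value
coef <- 1.5                      # default used by boxplot.stats()

hinges <- fivenum(x)[c(2, 4)]    # lower and upper hinge (ends of the box)
box_len <- diff(hinges)          # length of the box
lower_limit <- hinges[1] - coef * box_len
upper_limit <- hinges[2] + coef * box_len

manual_out <- x[x < lower_limit | x > upper_limit]
manual_out                            # points beyond the limits
boxplot.stats(x, coef = coef)$out     # the same points, as reported by R
```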
For boxplot, outliers are the points that lie above or below the "whiskers". By default, the whiskers extend to the most extreme data points that lie no more than the range argument times the interquartile range from the box. The default value of range is 1.5, but you can change it, which also changes the list of outliers.
You can also see that with the boxplot.stats function, which performs the computation used by the plot.
For example, if you have the following vector:
v <- c(runif(10), -0.5, -1)
boxplot(v)
By default, only the -1 value is considered an outlier (for the random sample drawn here). You can see it with boxplot.stats:
boxplot.stats(v)$out
[1] -1
But if you increase the range argument (or coef for boxplot.stats), then -1 is no longer considered an outlier for this sample:
boxplot(v, range=2)
boxplot.stats(v, coef=2)$out
numeric(0)
This is admittedly not immediately evident from boxplot(). Look at the range parameter:
this determines how far the plot whiskers extend out from the box. If ‘range’ is positive, the whiskers extend to the most extreme data point which is no more than ‘range’ times the interquartile range from the box. A value of zero causes the whiskers to extend to the data extremes.
So the value of range is used, together with the interquartile range and the box (given by the quartiles), to determine where the whiskers end. And everything outside the whiskers is an outlier.
I'll be the first to agree that this definition is unintuitive. Sadly enough, it is established by now.
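As a small illustration of the range argument, the sketch below calls boxplot() with plot = FALSE on made-up data and inspects the returned whisker ends and outliers for two values of range. The data vector is an assumption chosen for the example.

```r
x <- c(1:10, 20)                 # made-up data with one extreme value

b_default <- boxplot(x, plot = FALSE)              # default range = 1.5
b_default$out                    # points beyond the whiskers
b_default$stats[5, 1]            # end of the upper whisker

b_extreme <- boxplot(x, range = 0, plot = FALSE)   # whiskers reach the extremes
b_extreme$out                    # no points are left outside
b_extreme$stats[5, 1]            # upper whisker now ends at max(x)
```

Note that outline = FALSE only suppresses the drawing of the points in out; it does not change how they are determined.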

R question about plotting probability/density histogram the right way

I have a 500 x 2 matrix (500 rows and 2 columns): the left column gives the index (x value) of the observations, and the right column gives the probability with which this x occurs, so it is a typical probability density relationship.
My question is how to plot the histogram the right way, so that the x-axis is the x index and the y-axis is the density (0.01-1.00). The bandwidth of the estimator is 0.33.
Thanks in advance!
The end of the data looks like this, just for orientation:
[490,] 2.338260830 0.04858685
[491,] 2.347839477 0.04797310
[492,] 2.357418125 0.04736149
[493,] 2.366996772 0.04675206
[494,] 2.376575419 0.04614482
[495,] 2.386154067 0.04553980
[496,] 2.395732714 0.04493702
[497,] 2.405311361 0.04433653
[498,] 2.414890008 0.04373835
[499,] 2.424468656 0.04314252
[500,] 2.434047303 0.04254907
@everyone,
Yes, I have already done the estimation, and the bandwidth is what I mentioned. The data is ordered from low to high values, so the probability is about 0.22 at the beginning, about 0.48 at the peak, and about 0.15 at the end.
The density line plots like a charm, but in addition I have to plot a histogram. How can I do this and order the blocks properly (how should the data be split into boxes, etc.)?
Any suggestions?
Here is part of the data AFTER the estimation. All values are discrete, so I assume a histogram can be created, hopefully.
[491,] 4.956164 0.2618131
[492,] 4.963014 0.2608723
[493,] 4.969863 0.2599309
[494,] 4.976712 0.2589889
[495,] 4.983562 0.2580464
[496,] 4.990411 0.2571034
[497,] 4.997260 0.2561599
[498,] 5.004110 0.2552159
[499,] 5.010959 0.2542716
[500,] 5.017808 0.2533268
[501,] 5.024658 0.2523817
Best regards,
I appreciate the fast responses! (bow)
Would it do the job to create a histogram just for the indexes, grouping them in blocks of 25 or 50 each, for instance, and computing the average probability for each block (of 25, 50, 100, 150, 200, 250, etc.) as the boxes?
Assuming the rows are in order from lowest to highest value of x, as they appear to be, you can use the default plot command, the only change you need is the type:
plot(your.data, type = 'l')
EDIT:
Ok, I'm not sure this is better than the density plot, but it can be done:
x = dnorm(seq(-1, 1, length = 500))
x.bins = rep(1:50, each = 10)
bars = aggregate(x, by = list(x.bins), FUN = sum)[,2]
barplot(bars)
In your case, replace x with the probabilities from the second column of your matrix.
EDIT2:
On second thought, this only makes sense if your 500 rows represent discrete events. If they are instead points along a continuous distribution function, adding them together as I have done is incorrect. Mathematically, I don't think you can produce the binned probability for a range using only a few points from within that range.
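If the rows are densely spaced points on an estimated density curve, one hedged alternative is to integrate the curve over each bin instead of summing the heights. The sketch below uses a stand-in matrix built from density() on simulated data (an assumption, since the original data is not available) and the trapezoidal rule. It only approximates the bin probabilities from the curve; it does not recover the underlying sample.

```r
set.seed(42)                          # assumed seed, only for reproducibility
d <- density(rnorm(200), n = 500)     # stand-in for the estimated density
M <- cbind(d$x, d$y)                  # same layout as the matrix in the question

# Trapezoidal area under the curve between consecutive grid points
dx   <- diff(M[, 1])
area <- dx * (head(M[, 2], -1) + tail(M[, 2], -1)) / 2

# Assign each small interval to one of 10 equal-width bins by its midpoint
mid  <- head(M[, 1], -1) + dx / 2
bin  <- cut(mid, breaks = 10)
prob <- tapply(area, bin, sum)        # approximate probability per bin

sum(prob)                             # should be close to 1
barplot(prob, las = 2, cex.names = 0.6)
```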
Assuming M is the matrix, wouldn't this just be:
plot(x=M[ , 1], y = M[ , 2] )
You have already done the density estimation since this is not the original data.
