How to sort vector into bins in R?

I have a vector that consists of numbers that can take on any value between 1 and 100.
I want to sort that vector into bins of a certain size.
My logic:
1.) Divide the range (in this case, 1:100) into the number of bins you want (let's say 10 for this example)
Result: (1,10.9], (10.9,20.8], (20.8,30.7], (30.7,40.6], (40.6,50.5], (50.5,60.4], (60.4,70.3], (70.3,80.2], (80.2,90.1], (90.1,100]
2.) Then sort my vector
I found a handy function that almost does all this in one fell swoop: cut(). Here is my code:
> table(cut(vector, breaks = 10))
(0.959,10.9] (10.9,20.8] (20.8,30.7] (30.7,40.5] (40.5,50.4] (50.4,60.3] (60.3,70.1] (70.1,80] (80,89.9] (89.9,99.8]
175 171 117 103 82 67 54 46 39 31
Unfortunately, the intervals are different from the bins we calculated for the possible range (1:100). So I tried fixing this by adding the endpoints of that range to the vector:
> table(cut(c(1,100,vector), breaks = 10))
(0.901,10.9] (10.9,20.8] (20.8,30.7] (30.7,40.6] (40.6,50.5] (50.5,60.4] (60.4,70.3] (70.3,80.2] (80.2,90.1] (90.1,100]
176 171 117 104 82 66 54 48 38 31
This almost worked perfectly, except that the left-most interval starts at 0.901 for some reason.
My questions:
1.) Is there a way to do this (using cut or another function/package) without having to insert artificial data points to get the specified bin ranges?
2.) If not, why does the lower bin start from 0.901 and not 1?

Based on your response to @Allan Cameron, I understand that you want to divide your vector into 10 bins of the same size. But when you only give the number of breaks to the cut() function, the interval widths it calculates differ across the groups. As @akrun said, this happens because of how the function computes the breaks when only their number is specified.
I do not know of a way to avoid this within the function, but I think it is easier to define the bins yourself, as @Gregor Thomas suggested. Here is an example of how I would approach what you want:
vec <- sample(1:100, size = 500, replace = T)
# Here I suppose that you want to divide the data in
# intervals of the same length
breaks <- seq(min(vec), max(vec), length.out = 11)  # 11 breaks give 10 equal-width bins
cut(vec, breaks = breaks, include.lowest = TRUE)    # keep the minimum value in the first bin
Another option would be the cut_interval() function from the ggplot2 package, which cuts the vector into n groups of the same length.
library(ggplot2)
cut_interval(vec, n = 10)
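Wrapping the result in table() then gives counts per bin, in the same spirit as the table(cut(...)) call in the question:
table(cut_interval(vec, n = 10))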

why does the lower bin start from 0.901 and not 1?
The answer is the first bit of the Details section of the ?cut help page:
When breaks is specified as a single number, the range of the data is divided into breaks pieces of equal length, and then the outer limits are moved away by 0.1% of the range to ensure that the extreme values both fall within the break intervals.
That 0.1% adjustment is the reason your lower bound is 0.901; the upper bound isn't adjusted because it is a closed interval, ], not an open one, ), on that end.
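As a quick check of that arithmetic, assuming the data span exactly 1 to 100 as in the second call above:
r <- c(1, 100)
min(r) - diff(r) * 0.001  # 0.1% of the 99-unit range, subtracted from the minimum
# [1] 0.901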
If you'd like to use other breaks, you can specify exact breaks however you want. Perhaps this:
my_breaks = seq(1, 100, length.out = 11) ## for n bins, you need n+1 breaks
my_breaks
# [1] 1.0 10.9 20.8 30.7 40.6 50.5 60.4 70.3 80.2 90.1 100.0
cut(vector, breaks = my_breaks, include.lowest = TRUE)
But I actually think Allan's suggestion of 0:10 * 10 might be what you really want. I wouldn't dismiss it too quickly:
table(cut(1:100, breaks = 0:10*10))
# (0,10] (10,20] (20,30] (30,40] (40,50] (50,60] (60,70] (70,80] (80,90] (90,100]
# 10 10 10 10 10 10 10 10 10 10

Related

I want bins of 5 mm every time, in an R script that analyzes different normal distributions

Right off the bat, I'm a newbie to R and maybe I'm misunderstanding the concept of what my code does vs what I want it to do. Here is the code I've written so far.
his <- hist(x, breaks = seq(floor(min(x)), ceiling(max(x)) + ifelse(ceiling(max(x)) %% 5 != 0, 5, 0), 5))
Here is some sample data:
Autonr X
1 -12
2 -6
3 -17
4 8
5 -11
6 -10
7 10
8 -22
I'm not able to upload one of the histograms that did work, but it should show bins of 5, no matter how large the spread of the data. The number of bins should therefore be flexible.
The idea of the code above is to make sure that the outer ranges of my data always fall within neatly defined 5 mm bins. Maybe I'm overlooking something, but I can't see why this does not always work. With some datasets it does, with others it doesn't.
I get: some 'x' not counted; maybe 'breaks' do not span range of 'x'.
Any help would be greatly appreciated, as I don't want to have to tinker with my breaks and bins every time I get a new dataset to run through this.
Rather than passing a vector of breaks, you can supply a single value, in this case the number of bins needed given the range of the data and a binwidth of 5.
# Generate uniformly distributed dummy data between 0 and 53
set.seed(5)
x <- runif(1000, 0, 53)
# Plot histogram with binwidths of 5.
hist(x, breaks = ceiling(diff(range(x)) / 5 ))
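Note that when breaks is a single number, hist() treats it only as a suggestion, so the resulting bin width may not be exactly 5. If you need bins of exactly 5 mm aligned to multiples of 5 every time, one option (a sketch, not the only way) is to build the break vector explicitly:
# snap the break range outwards to multiples of 5 so every value is covered
brks <- seq(floor(min(x) / 5) * 5, ceiling(max(x) / 5) * 5, by = 5)
hist(x, breaks = brks)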
For the sake of completeness, here is another approach, which uses breaks_width() from the scales package. scales is part of Hadley Wickham's ggplot2 ecosystem.
# create sample data
set.seed(5)
x <- runif(1000, 17, 53)
# plot histogram
hist(x, breaks = scales::breaks_width(5)(range(x)))
Explanation
scales::breaks_width(5) creates a function which is then called with the vector of minimum and maximum values of x as returned by range(x). So,
scales::breaks_width(5)(range(x))
returns a vector of breaks with the required bin size of 5:
[1] 15 20 25 30 35 40 45 50 55
The nice thing is that we have full control of the bins. E.g., we can ask for weird things like a bin size of 3 which is offset by 2:
scales::breaks_width(3, 2)(range(x))
[1] 17 20 23 26 29 32 35 38 41 44 47 50 53 56

R - Generating frequency table from a table of pre-defined bins

I need to generate a cumulative frequency plot of some bubble size data (I have >1000000 objects). In geology the way we do this is by using geometric binning.
I calculate the bins using the following method:
smallest object value (0.0015 mm) * 10^0.1 = upper limit of bin 1; the upper limit of each successive bin is generated by multiplying its lower limit by 10^0.1
Bin 1: 0.0015 - 0.001888388
Bin 2: 0.001888388 - 0.002377340
I tried writing a while loop to generate these as breakpoints in R but it wasn't working. So I generated my bins in Excel and now have a table with bins that range from my smallest object to my largest, sized appropriately.
What I now want to do is read this into R and use it to find the frequency of objects in each bin. I can't find how to do this, possibly because in most disciplines you don't set your bins like this.
I am fairly new to R so am trying to keep my code fairly simple.
Thanks,
B
The easiest option is to use ?cut. Here's an example with randomly generated data.
# generate data
set.seed(666)
runif(min=0, max=100, n=1000) -> x
# create arbitrary cutpoints (these should be replaced by the ones generated by your geometric method)
cutpoints <- c(0, 1, 10, 11, 12, 15, 20, 50, 90, 99, 99.01, 100)
table(cut(x, cutpoints))
(0,1] (1,10] (10,11] (11,12] (12,15] (15,20]
9 92 13 10 27 45
(20,50] (50,90] (90,99] (99,99.01] (99.01,100]
310 399 87 0 8
Also note that the include.lowest parameter in cut defaults to FALSE:
include.lowest: logical, indicating if an ‘x[i]’ equal to the lowest
(or highest, for ‘right = FALSE’) ‘breaks’ value should be
included.
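For the geometric bins described in the question, the breakpoints can also be generated directly in R instead of in Excel. A rough sketch, assuming each break is the previous one multiplied by 10^0.1 (the object sizes in x are placeholders, not real data):
x <- c(0.0015, 0.002, 0.01, 0.5, 2)                   # placeholder object sizes in mm
ratio  <- 10^0.1
n_bins <- ceiling(log(max(x) / min(x)) / log(ratio))  # enough bins to reach the largest object
breaks <- min(x) * ratio^(0:n_bins)
table(cut(x, breaks, include.lowest = TRUE))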

Grouping a Data Column in R

I have a data frame of 48 samples of zinc concentration readings. I am trying to group the data as Normal, High and Low (0-30 low, 31-80 normal and above 80 high) and then plot a scatter plot with a different pch for each group.
Here are the first 5 entries:
sample concentration
1 1 71.1
2 2 131.7
3 3 13.9
4 4 31.7
5 5 6.4
THANKS
In general, please try to include sample data by using dput(head(data)) and pasting the output into your question.
What you want is called binning (grouping is a very different operation in R terms).
The standard function here is cut :
numericVector <- runif(100, min = 1, max = 100 ) # sample vector
cut(numericVector, breaks = c(0, 30, 80, Inf), right = TRUE, include.lowest = FALSE, labels = c("0-30", "31-80", "above 80"))
Please check the function documentation to adapt the parameters to your specific case.
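Since the question also asks for a scatter plot with a different pch per group, here is a minimal sketch, assuming the data frame is called df with columns sample and concentration (the first five rows from the question are used as stand-in data):
df  <- data.frame(sample = 1:5, concentration = c(71.1, 131.7, 13.9, 31.7, 6.4))
grp <- cut(df$concentration, breaks = c(0, 30, 80, Inf),
           labels = c("Low", "Normal", "High"))
# the factor returned by cut() drives both the plotting symbol and the colour
plot(df$sample, df$concentration,
     pch = as.integer(grp), col = as.integer(grp),
     xlab = "Sample", ylab = "Zinc concentration")
legend("topright", legend = levels(grp), pch = 1:3, col = 1:3)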

cutting a variable into pieces in R

I'm trying to cut() my data D into 3 pieces: [0-4], [5-12], [13-40]. But I wonder how exactly to define my breaks in cut to achieve that?
Here is my data and R code:
D <- read.csv("https://raw.githubusercontent.com/rnorouzian/m/master/t.csv", h = T)
table(cut(D$time, breaks = c(0, 5, 9, 12))) ## what should breaks be?
# (0,5] (5,9] (9,12] # these cuts are not the 3 pieces I want
# 228 37 10
The notation (a,b] means ">a and <=b".
So, to get your desired result, just define the cuts so you get the grouping that you want, including a lower and upper bound:
table(cut(D$time, breaks=c(-1, 4, 12, 40)))
## (-1,4] (4,12] (12,40]
## 319 47 20
You may also find it helpful to look at two further arguments: right = FALSE, which changes the endpoints of the intervals from (a,b] to [a,b), and include.lowest, which includes the lowest breaks value (in the OP's example, this makes the first interval [0,5], closed on the lower bound). You can also use infinity as an endpoint. Here's an example with a couple of those options put to use:
table(cut(D$time, breaks = c(-Inf, 4, 12, Inf), include.lowest=TRUE))
## [-Inf,4] (4,12] (12, Inf]
## 319 47 20
This produces the right buckets, but the interval notation needs tweaking, assuming all times are integers. You might need to adjust the labels manually: each time a label uses right-open interval notation, replace it with the equivalent closed-interval notation, using your best string 'magic'.
Personally, I like to make sure all possibilities are covered. Perhaps future data from this process might exceed 40? I like to put an upper bound of +Inf in all my cuts; this prevents NA from creeping into the data.
What cut really needs is a 'whole numbers only' option.
F=cut(D$time,c(0,5,13,40),include.lowest = TRUE,right=FALSE)
# the below levels hard coded but you could write a loop to turn all labels
# of the form [m,n) into [m,n-1]
levels(F)[1:2]=c('[0,4]','[5,12]')
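As a sketch of the loop alluded to in the comment above (not part of the original answer), the right-open labels can also be rewritten programmatically instead of hard-coded:
# rewrite every right-open label "[m,n)" as the closed integer label "[m,n-1]"
lb   <- levels(F)
open <- grepl("\\)$", lb)
m    <- sub("^\\[([^,]+),.*$", "\\1", lb[open])
n    <- as.numeric(sub("^.*,([^)]+)\\)$", "\\1", lb[open]))
levels(F)[open] <- sprintf("[%s,%g]", m, n - 1)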
Typically there would be more analysis before final results are obtained, so I wouldn't sweat the labels too much until the work is closer to complete.
Here are my results
> table(F)
F
[0,4] [5,12] [13,40]
319 47 20
R can compare integers to floats, like in
> 6L >= 8.5
[1] FALSE
Thus you can use floats as breaks in cut such as in
table(cut(D$time, breaks = c(-.5, 4.5, 12.5, 40.5)))
For integers this fulfills your bucket definition of [0-4], [5-12], [13-40] without you having to think too much about square brackets versus round brackets.
A fancy alternative would be clustering around the means of your buckets, as in
D <- read.csv("https://raw.githubusercontent.com/rnorouzian/m/master/t.csv", h = T)
D$cluster <- kmeans(D$time, centers = c(4/2, (5+12)/2, (13+40)/2))$cluster
plot(D$time, rnorm(nrow(D)), col=D$cluster)
You should add two additional arguments, right and include.lowest, to your code!
table(cut(D$time, breaks = c(0, 5, 13, 40), right = FALSE, include.lowest = TRUE))
With right = FALSE the intervals are closed on the left and open on the right, which gives your desired result. include.lowest = TRUE ensures that the highest break value (here 40) is included in the last interval.
Result:
[0,5) [5,13) [13,40]
319 47 20
Vice versa you can write:
table(cut(D$time, breaks = c(0, 4, 12, 40), right=TRUE, include.lowest = TRUE))
with the result:
[0,4] (4,12] (12,40]
319 47 20
Both mean exactly what you are looking for:
[0,4] [5,12] [13,40]
319 47 20

R dtw package: cumulative cost matrix decreases at some points along the path?

I am exploring the results of Dynamic Time Warping as implemented in the dtw package. While doing some sanity checks I came across a result which I cannot rationalize. At some points along the warp path, the cumulative distance appears to decrease. Example below:
mat= structure(c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.01,0.01,0.02,0.03,0.04,0.06,0.09,0.11,0.13,0.16,0.18,0.2,0.22,0.24,0.24,0.22,0.22,0.22,0.22,0.21,0.2,0.19,0.2,0.23,0.29,0.34,0.41,0.51,0.62,0.73,0.82,0.9,0.95,1,1,1,0.92,0.92,0.89,0.89,0.84,0.79,0.7,0.53,0.37,0.23,0.17,0.13,0.11,0.09,0.08,0.07,0.07,0.07,0.07,0.07,0.07,0.08,0.08,0.08,0.09,0.1,0.13,0.15,0.19,0.22,0.27,0.29,0.34,0.35,0.36,0.35,0.38,0.37,0.37,0.32,0.3,0.26,0.24,0.21,0.19,0.17,0.15,0.14,0.12,0.1,0.09,0.09,0.08,0.08,0.07,0.07,0.07,0.07,0.06,0.06,0.06,0.05,0.05,0.05,0.05,0.04,0.04,0.04,0.04,0.04,0.04,0.04,0.04,0.04,0.04,0.04,0.04,0.03,0.04,0.04,0.04,0.03,0.03,0.03,0.04,0.04,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.01,0.01,0.01,0.02,0.02,0.03,0.04,0.05,0.06,0.07,0.08,0.1,0.12,0.12,0.13,0.14,0.15,0.17,0.19,0.2,0.21,0.22,0.24,0.23,0.24,0.26,0.3,0.32,0.33,0.35,0.39,0.44,0.49,0.55,0.61,0.67,0.71,0.76,0.83,0.9,0.97,1,0.99,0.86,0.68,0.5,0.41,0.33,0.28,0.23,0.2,0.17,0.15,0.13,0.12,0.1,0.1,0.1,0.11,0.11,0.11,0.11,0.11,0.11,0.11,0.13,0.15,0.17,0.18,0.2,0.21,0.24,0.25,0.28,0.29,0.32,0.35,0.36,0.34,0.32,0.3,0.3,0.28,0.26,0.23,0.22,0.19,0.17,0.15,0.14,0.12,0.1,0.09,0.09,0.08,0.08,0.07,0.07,0.07,0.06,0.06,0.05,0.05,0.05,0.05,0.05,0.05,0.04,0.04,0.04,0.04,0.04,0.04,0.04,0.04,0.04,0.04,0.04,0.04,0.04,0.04,0.04,0.04,0.04,0.04),.Dim=c(149L,2L))
tw = dtw(mat[,1], mat[,2], keep.internals = T, step.pattern = asymmetricP05)
d.phi = tw$costMatrix[ cbind(tw$index1, tw$index2) ]
which(diff(d.phi) < 0)
# 45 50 53 54 61 70 72 73 80 81 101 115 117 120 124 125 129 139 184 189 191 193
plot(diff(d.phi))
This should not be the case, as d_phi is a sum of non-negative distance measures, multiplied by m which takes values 0 or 1.
I doubt this is an implementation problem with the dtw package, so where am I making a mistake?
Another sanity check (taken from the reference below) plots the path on top of the costMatrix. Below, indices 45:55 are plotted; within this window, transitions 45, 50, 53, and 54 have decreasing cumulative cost (from diff(d.phi) above). The first such transition is diff(d.phi)[45].
i = 45:55
i1 = tw$index1[i]
i2 = tw$index2[i]
r= range(c(i1,i2))
s = r[1]:r[2]
ccm <- tw$costMatrix[s,s]
image(x=1:nrow(ccm),y=1:ncol(ccm),ccm)
text(row(ccm),col(ccm),label=round(ccm,3))
lines(i1-r[1]+1,i2-r[1]+1)
If this is the actual path taken by the DP algorithm, how can the cumulative distance along this path decrease at those points?
http://cran.r-project.org/web/packages/dtw/vignettes/dtw.pdf
This is due to the use of a "multi-step" recursion like asymmetricP05. Such a pattern allows the warping path to be composed of long segments, e.g. knight's moves.
To verify the monotonicity, you should only consider the starting positions of each of the "knight's moves" - not all of the cells passed through. The index1 and index2 properties do include the intermediate cells (to provide a smoother curve), which explains your observation.
To convince yourself: (1) try another, more intuitive, pattern like asymmetric; and (2) note how the stepsTaken property has a different length than index1/2.
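As a quick check of this point (a sketch, assuming your version of dtw exposes the index1s/index2s fields, which for multi-step patterns keep only the step endpoints), the cumulative cost evaluated at those endpoints should be non-decreasing:
# cumulative cost at the endpoints of each step only
d.phi.s <- tw$costMatrix[cbind(tw$index1s, tw$index2s)]
which(diff(d.phi.s) < 0)   # expected to be empty, i.e. integer(0)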
