I'm trying to cut() my data D into 3 pieces: [0-4], [5-12], [13-40]. How exactly should I define my breaks in cut to achieve that?
Here is my data and R code:
D <- read.csv("https://raw.githubusercontent.com/rnorouzian/m/master/t.csv", h = T)
table(cut(D$time, breaks = c(0, 5, 9, 12))) ## what should breaks be?
# (0,5] (5,9] (9,12]   # not the 3 pieces I want
#   228    37     10
The notation (a,b] means ">a and <=b".
So, to get your desired result, just define the cuts so you get the grouping that you want, including a lower and upper bound:
table(cut(D$time, breaks=c(-1, 4, 12, 40)))
## (-1,4] (4,12] (12,40]
## 319 47 20
You may also find it helpful to look at two more arguments: right=FALSE, which changes the intervals from (a,b] to [a,b), and include.lowest, which makes the lowest breaks value inclusive (in the OP's example the first interval becomes [0,5], closed on the lower bound). You can also use -Inf and Inf as breaks. Here's an example with a couple of those options put to use:
table(cut(D$time, breaks = c(-Inf, 4, 12, Inf), include.lowest=TRUE))
## [-Inf,4] (4,12] (12,Inf]
## 319 47 20
This produces the right buckets, but the interval notation would need tweaking, assuming all times are integers. You might need to tweak the labels manually: wherever a label uses right-open interval notation, replace it with the equivalent closed-interval notation, using your best string 'magic'.
Personally, I like to make sure all possibilities are covered. Perhaps future data from this process might exceed 40? I like to put an upper bound of +Inf in all my cuts. This prevents NA from creeping into the data.
What cut needs is a 'whole numbers only' option.
F = cut(D$time, c(0, 5, 13, 40), include.lowest = TRUE, right = FALSE)
# the levels below are hard-coded, but you could write a loop to turn all labels
# of the form [m,n) into [m,n-1] (see the sketch after the results below)
levels(F)[1:2]=c('[0,4]','[5,12]')
Typically there would be more analysis before final results are obtained, so I wouldn't sweat the labels too much until the work is closer to complete.
Here are my results
> table(F)
F
[0,4] [5,12] [13,40]
319 47 20
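For completeness, here is a sketch of that relabelling loop (my own code, not the original answerer's; it assumes integer breaks and the default labels that cut produces):
G <- cut(D$time, c(0, 5, 13, 40), include.lowest = TRUE, right = FALSE)
labs <- levels(G)
for (i in seq_along(labs)) {
  if (grepl("\\)$", labs[i])) {                                  # right-open label "[m,n)"
    bounds <- as.numeric(strsplit(gsub("\\[|\\)", "", labs[i]), ",")[[1]])
    labs[i] <- paste0("[", bounds[1], ",", bounds[2] - 1, "]")   # rewrite as "[m,n-1]"
  }
}
levels(G) <- labs
table(G)  # [0,4] [5,12] [13,40]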
R can compare integers to floats, like in
> 6L >= 8.5
[1] FALSE
Thus you can use floats as breaks in cut such as in
table(cut(D$time, breaks = c(-.5, 4.5, 12.5, 40.5)))
For integer data this fulfills your bucket definition of [0-4], [5-12], [13-40] without you having to think too much about square brackets versus round brackets.
A fancy alternative would be clustering around the means of your buckets, as in
D <- read.csv("https://raw.githubusercontent.com/rnorouzian/m/master/t.csv", h = T)
D$cluster <- kmeans(D$time, centers = c(4/2, (5+12)/2, (13+40)/2))$cluster
plot(D$time, rnorm(nrow(D)), col=D$cluster)
You should add two additional arguments, right and include.lowest, to your code!
table(cut(D$time, breaks = c(0, 5, 13, 40), right=FALSE, include.lowest = TRUE))
With right=FALSE the intervals are closed on the left and open on the right, which gives your desired result. include.lowest=TRUE ensures that your highest break value (here 40) is included in the last interval.
Result:
[0,5) [5,13) [13,40]
319 47 20
Conversely, you can write:
table(cut(D$time, breaks = c(0, 4, 12, 40), right=TRUE, include.lowest = TRUE))
with the result:
[0,4] (4,12] (12,40]
319 47 20
Both mean exactly what you're looking for:
[0,4] [5,12] [13,40]
319 47 20
Related
I have a vector that consists of numbers that can take on any value between 1 and 100.
I want to sort that vector into bins of a certain size.
My logic:
1.) Divide the range (in this case, 1:100) into the number of bins you want (let's say 10 for this example)
Result: (1,10.9], (10.9,20.8], (20.8,30.7], (30.7,40.6], (40.6,50.5], (50.5,60.4], (60.4,70.3], (70.3,80.2], (80.2,90.1], (90.1,100]
2.) Then sort my vector
I found a handy function that almost does all this in one fell swoop: cut(). Here is my code:
> table(cut(vector, breaks = 10))
(0.959,10.9] (10.9,20.8] (20.8,30.7] (30.7,40.5] (40.5,50.4] (50.4,60.3] (60.3,70.1] (70.1,80] (80,89.9] (89.9,99.8]
175 171 117 103 82 67 54 46 39 31
Unfortunately, the intervals are different from the bins we calculated from the possible range (1:100). So I tried fixing this by adding that range into the vector:
> table(cut(c(1,100,vector), breaks = 10))
(0.901,10.9] (10.9,20.8] (20.8,30.7] (30.7,40.6] (40.6,50.5] (50.5,60.4] (60.4,70.3] (70.3,80.2] (80.2,90.1] (90.1,100]
176 171 117 104 82 66 54 48 38 31
This almost worked perfectly, except that the left-most interval starts at 0.901 for some reason.
My questions:
1.) Is there a way to do this (using cut or another function/package) without having to insert artificial data points to get the specified bin ranges?
2.) If not, why does the lower bin start from 0.901 and not 1?
Based on your response to #Allan Cameron, I understand that you want to divide your vector into 10 bins of the same size. But when you define only the number of breaks in the cut() function, the sizes of the intervals it calculates differ across the groups. As #akrun said, this happens because of the way the function computes the breaks when you only specify their number.
I do not know if there is a way to avoid this in the function. But I think it will be easier if you define the bins yourself, as #Gregor Thomas suggested. Here is an example of how I would approach it:
vec <- sample(1:100, size = 500, replace = TRUE)
# Here I suppose that you want to divide the data into
# intervals of the same length
breaks <- seq(min(vec), max(vec), length.out = 11)  # 11 breaks give 10 equal-width bins
cut(vec, breaks = breaks, include.lowest = TRUE)    # include.lowest so min(vec) is not dropped as NA
Another option would be the cut_interval() function from the ggplot2 package, which cuts the vector into n groups of the same length.
library(ggplot2)
cut_interval(vec, n = 10)
why does the lower bin start from 0.901 and not 1?
The answer is the first bit of the Details section of the ?cut help page:
When breaks is specified as a single number, the range of the data is divided into breaks pieces of equal length, and then the outer limits are moved away by 0.1% of the range to ensure that the extreme values both fall within the break intervals.
That 0.1% adjustment is the reason your lower bound is 0.901. The upper bound is adjusted too (to 100.099), but the default label formatting rounds it to 100, and since the interval is closed (]) on that end, 100 still falls inside it.
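Here is the arithmetic spelled out (my own sketch, assuming the padded data span exactly 1 to 100):
rx <- c(1, 100)     # range(c(1, 100, vector))
dx <- diff(rx)      # 99
rx[1] - dx/1000     # 0.901   -> lower outer limit
rx[2] + dx/1000     # 100.099 -> upper outer limit, shown as 100 in the labels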
If you'd like to use other breaks, you can specify exact breaks however you want. Perhaps this:
my_breaks = seq(1, 100, length.out = 11) ## for n bins, you need n+1 breaks
my_breaks
# [1] 1.0 10.9 20.8 30.7 40.6 50.5 60.4 70.3 80.2 90.1 100.0
cut(vector, breaks = my_breaks, include.lowest = TRUE)
But I actually think Allan's suggestion of 0:10 * 10 might be what you really want. I wouldn't dismiss it too quickly:
table(cut(1:100, breaks = 0:10*10))
# (0,10] (10,20] (20,30] (30,40] (40,50] (50,60] (60,70] (70,80] (80,90] (90,100]
# 10 10 10 10 10 10 10 10 10 10
Right off the bat, I'm a newbie to R and maybe I'm misunderstanding the concept of what my code does vs what I want it to do. Here is the code I've written so far.
his <- hist(x, breaks = seq(floor(min(x)), ceiling(max(x)) + ifelse(ceiling(max(x)) %% 5 != 0, 5, 0), 5))
Here is some sample data:
Autonr X
1 -12
2 -6
3 -17
4 8
5 -11
6 -10
7 10
8 -22
I'm not able to upload one of the histograms that did work, but it should show bins of 5, no matter how large the spread of the data. The number of bins should therefore be flexible.
The idea of the code above is to make sure that the outer ranges of my data always fall within neatly defined 5 mm bins. Maybe I'm overlooking something, but I can't seem to understand why this does not always work. In some cases it does, but with other datasets it doesn't.
I get: some 'x' not counted; maybe 'breaks' do not span range of 'x'.
Any help would be greatly appreciated, as I don't want to have to tinker around with my breaks and bins every time I get a new dataset to run through this.
Rather than passing a vector of breaks, you can supply a single value: in this case, the number of bins needed given the range of the data and a bin width of 5.
# Generate uniformly distributed dummy data between 0 and 53
set.seed(5)
x <- runif(1000, 0, 53)
# Plot histogram with binwidths of 5.
hist(x, breaks = ceiling(diff(range(x)) / 5 ))
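For reference, here is a sketch of an explicit-breaks variant (my own, not part of the answer): build breaks aligned to multiples of 5, so they always span the data whatever its sign or spread.
brks <- seq(floor(min(x) / 5) * 5, ceiling(max(x) / 5) * 5, by = 5)
hist(x, breaks = brks)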
For the sake of completeness, here is another approach which uses breaks_width() from the scales package. scales is part of Hadley Wickham's ggplot2 ecosystem.
# create sample data
set.seed(5)
x <- runif(1000, 17, 53)
# plot histogram
hist(x, breaks = scales::breaks_width(5)(range(x)))
Explanation
scales::breaks_width(5) creates a function which is then called with the vector of minimum and maximum values of x as returned by range(x). So,
scales::breaks_width(5)(range(x))
returns a vector of breaks with the required bin size of 5:
[1] 15 20 25 30 35 40 45 50 55
The nice thing is that we have full control of the bins. E.g., we can ask for weird things like a bin size of 3 which is offset by 2:
scales::breaks_width(3, 2)(range(x))
[1] 17 20 23 26 29 32 35 38 41 44 47 50 53 56
I'm exploring the use of the cut function and am trying to cut the following basic vector into 10 breaks. I'm able to do it, but I'm confused as to why my initial break occurs at -0.01 rather than 0:
test_vec <- 0:10
test_vec2 <- cut(test_vec, breaks = 10)
test_vec2
yields:
(-0.01,1] (-0.01,1] (1,2] (2,3] (3,4] (4,5] (5,6] (6,7] (7,8] (8,9] (9,10]
Why does this produce two instances of (-0.01,1], and why does the lower bound not start at 0?
tl;dr to get what you might want, you'll probably need to specify breaks explicitly, and include.lowest=TRUE:
cut(x, breaks = 0:10, include.lowest = TRUE)
The issue is probably this, from the "Details" of ?cut:
When ‘breaks’ is specified as a single number, the range of the
data is divided into ‘breaks’ pieces of equal length, and then the
outer limits are moved away by 0.1% of the range to ensure that
the extreme values both fall within the break intervals.
Since the range is (0,10), the outer limits are (-0.01, 10.01); as #Onyambu suggests, the results are asymmetric because the value at 0 lies on the left-hand boundary (not included) whereas the value at 10 lies on the right-hand boundary (included).
The (apparent) asymmetry is due to formatting; if you follow the code below (the core of base:::cut.default()), you'll see that the top break is actually at 10.01, but gets formatted as "10" because the default number of digits is 3 ...
x <- 0:10
breaks <- 10
dig <- 3                                                     # default significant digits for the labels
nb <- as.integer(breaks + 1)                                 # number of break points = breaks + 1
dx <- diff(rx <- range(x, na.rm = TRUE))                     # rx = c(0, 10), dx = 10
breaks <- seq.int(rx[1L], rx[2L], length.out = nb)           # 0, 1, ..., 10
breaks[c(1L, nb)] <- c(rx[1L] - dx/1000, rx[2L] + dx/1000)   # outer limits moved to -0.01 and 10.01
ch.br <- formatC(0 + breaks, digits = dig, width = 1L)       # labels formatted to 3 significant digits
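As a quick check (mine, not part of the original answer), the first and last formatted labels confirm that 10.01 is displayed as "10":
ch.br[c(1, nb)]
# [1] "-0.01" "10"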
I need to generate a cumulative frequency plot of some bubble size data (I have >1000000 objects). In geology the way we do this is by using geometric binning.
I calculate the bins using the following method:
smallest object value (0.0015 mm) * 10^0.1 = upper limit of bin 1; the upper limit of each successive bin is generated by multiplying the lower limit by 10^0.1
Bin 1: 0.0015 - 0.001888388
Bin 2: 0.001888388 - 0.002377340
I tried writing a while loop to generate these as breakpoints in R but it wasn't working. So I generated my bins in Excel and now have a table with bins that range from my smallest object to my largest, with bins sized appropriately.
What I now want to do is read this into R and use it to find the frequency of objects in each bin. I can't find how to do this - possibly because in most disciplines you don't set your bins like this.
I am fairly new to R so am trying to keep my code fairly simple.
Thanks,
B
The easiest option is to use ?cut. Here's an example with randomly generated data.
# generate data
set.seed(666)
runif(min=0, max=100, n=1000) -> x
# create arbitrary cutpoints (these should be replaced by the ones generated by your geometric method)
cutpoints <- c(0, 1, 10, 11, 12, 15, 20, 50, 90, 99, 99.01, 100)
table(cut(x, cutpoints))
(0,1] (1,10] (10,11] (11,12] (12,15] (15,20]
9 92 13 10 27 45
(20,50] (50,90] (90,99] (99,99.01] (99.01,100]
310 399 87 0 8
Also note that the include.lowest parameter in cut defaults to FALSE:
include.lowest: logical, indicating if an ‘x[i]’ equal to the lowest
(or highest, for ‘right = FALSE’) ‘breaks’ value should be
included.
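To generate the geometric breakpoints directly in R instead of importing them from Excel, something like the following sketch could work (my own code, not part of the answer; sizes stands for your vector of bubble sizes, and 0.0015 and the 10^0.1 factor come from the question):
smallest <- 0.0015
n_bins <- ceiling(log10(max(sizes) / smallest) / 0.1)  # enough bins to reach the largest object
breaks <- smallest * 10^(0.1 * (0:n_bins))             # each break is the previous one times 10^0.1
table(cut(sizes, breaks, include.lowest = TRUE))       # include.lowest keeps the smallest object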
I am trying to display my data using radarchart {fmsb}. The values of my records are highly variable. Therefore, low values are not visible on the final plot.
Is there a way to "free" the axis for each record, to visualize the data independently of their scale?
Dummy example:
df <- data.frame(n = c(100, 0, 0.3, 60, 0.3),
                 j = c(100, 0, 0.001, 70, 7),
                 v = c(100, 0, 0.001, 79, 3),
                 z = c(100, 0, 0.001, 80, 99))
      n       j       v       z
1 100.0 100.000 100.000 100.000 # max
2   0.0   0.000   0.000   0.000 # min
3   0.3   0.001   0.001   0.001 # small values -> not visible on the final chart!!
4  60.0  70.000  79.000  80.000
5   0.3   7.000   3.000  99.000
Create radarchart
require(fmsb)
radarchart(df, axistype=0, pty=32, axislabcol="grey",# na.itp=FALSE,
seg = 5, centerzero = T)
Result: (only records #2 and #3 are visible; record #1, with its low values, is not visible!)
How can I make all records (rows) visible, i.e. how can I "free" the axis for each of my records? Thank you a lot,
If you want to be sure to see all 4 dimensions whatever the differences, you'll need a logarithmic scale.
Since, by design, the radar chart cannot show negative values, our choice of base is restricted by the range of values and by our number of segments (axis ticks).
If we want an integer base the minimum we can choose is:
seg0 <- 5 # your initial choice, could be changed
base <- ceiling(
max(apply(df[-c(1,2),],MARGIN = 1,max) / apply(df[-c(1,2),],MARGIN = 1,min))
^(1/(seg0-1))
)
Here we have a base 5.
Let's normalize and transform our data.
First we normalize the data by setting the maximum to 1 for all series, then we apply our logarithmic transformation, which sets the maximum of each series to seg0 (n for the black series, z for the others) and the minimum among all series to a value between 1 and 2 (here the v value of the black series).
df_normalized <- as.data.frame(df[-c(1,2),]/apply(df[-c(1,2),],MARGIN = 1,max))
df_transformed <- rbind(rep(seg0,4),rep(0,4),log(df_normalized,base) + seg0)
radarchart(df_transformed, axistype=0, pty=32, axislabcol="grey",# na.itp=FALSE,
seg = seg0, centerzero = T,maxmin=T)
If we look at the green series we see:
j and v have same order of magnitude
n is about 5^2 = 25 times smaller than j (5 is the value of the base, ^2 because they are 2 segments apart)
v is about 5^2 = 25 times (again) smaller than z
If we look at the black series we see that n is about 5^3.5 ≈ 300 times bigger than the other dimensions.
If we look at the red series we see that the order of magnitude is the same among all dimensions.
Maybe a workaround for your problem:
If you transform your data before running radarchart
(e.g. logarithm, square root, ...), then you can also visualise small values.
Here is an example using a cubic root transformation:
library(specmine)
df.c <- data.frame(cubic_root_transform(df)) # transform dataset
radarchart(df.c, axistype=0, pty=32, axislabcol="grey", # na.itp=FALSE,
seg = 5, centerzero = T)
and the result will look like this:
EDIT:
If you want to zoom the small values even more you can do that with a higher order of the root.
e.g.
t <- 5 # for fifth-order root
df.t <- data.frame(apply(df, 2, function(x) x^(1/t))) # transform dataset
radarchart(df.t, axistype=0, pty=32, axislabcol="grey",# na.itp=FALSE,
seg = 5, centerzero = T)
You can adjust the "zoom" as you want by changing the value of t.
This way you should be able to find a visualization that suits your data.
Here is an example using 10-th root transformation:
df.c <- data.frame(df^(1/10)) # transform dataset (base R, no extra package needed)
radarchart(df.c, axistype=0, pty=32, axislabcol="grey", # na.itp=FALSE,
seg = 5, centerzero = T)
and the result will look like this:
You can try different n-th roots to find the one that is best for you. As n grows, the root of a number near zero grows faster (moves toward 1), so small values are magnified more.
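A quick numeric illustration (my own, not from the answer) of how higher-order roots magnify values near zero:
0.001^(1/3)   # 0.1
0.001^(1/10)  # about 0.50
100^(1/3)     # about 4.64
100^(1/10)    # about 1.58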