I need to generate a cumulative frequency plot of some bubble size data (I have >1000000 objects). In geology the way we do this is by using geometric binning.
I calculate the bins using the following method:
smallest object value (0.0015 mm) * 10^0.1 = upper limit of bin 1; the upper limit of each successive bin is generated by multiplying its lower limit by 10^0.1:
Bin 1: 0.0015 - 0.001888388
Bin 2: 0.001888388 - 0.002377340
I tried writing a while loop to generate these as breakpoints in R, but it wasn't working. So I generated my bins in Excel and now have a table with bins that range from my smallest object to my largest, with bins sized appropriately.
What I now want to do is read this into R and use it to find the frequency of objects in each bin. I can't find how to do this - possibly because in most disciplines you don't set your bins like this.
I am fairly new to R, so I am trying to keep my code fairly simple.
Thanks,
B
The easiest option is to use ?cut. Here's an example with randomly generated data.
# generate data
set.seed(666)
x <- runif(n = 1000, min = 0, max = 100)
# create arbitrary cutpoints (these should be replaced by the ones generated by your geometric method)
cutpoints <- c(0, 1, 10, 11, 12, 15, 20, 50, 90, 99, 99.01, 100)
table(cut(x, cutpoints))
(0,1] (1,10] (10,11] (11,12] (12,15] (15,20]
9 92 13 10 27 45
(20,50] (50,90] (90,99] (99,99.01] (99.01,100]
310 399 87 0 8
Also note that the include.lowest parameter in cut defaults to FALSE:
include.lowest: logical, indicating if an ‘x[i]’ equal to the lowest
(or highest, for ‘right = FALSE’) ‘breaks’ value should be
included.
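To connect this back to the geometric bins in the question, here is a minimal sketch (my own addition, assuming your measurements are in a numeric vector x, in mm) that generates the breakpoints in R directly, so the Excel step isn't needed, and then tabulates and accumulates the frequencies:
# Geometric breaks: start at the smallest object and keep multiplying by 10^0.1
# until the largest object is covered (x is assumed to hold the bubble sizes in mm)
smallest <- min(x)                                   # 0.0015 in the question
ratio    <- 10^0.1
n.bins   <- ceiling(log10(max(x) / smallest) / 0.1)  # number of bins needed to reach max(x)
breaks   <- smallest * ratio^(0:n.bins)              # lower edge of bin 1 up to the top edge
counts   <- table(cut(x, breaks, include.lowest = TRUE))
cum.freq <- cumsum(counts) / length(x)               # cumulative frequency for plotting
include.lowest = TRUE makes sure the smallest object itself (which sits exactly on the first break) is counted.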
Related
Problem:
Say I have a numeric vector x of observations of some distance (in R). For instance it could be the throwing length of a number of people.
x <- c(3,3,3,7,7,7,7,8,8,12,15,15,15,20,30)
h <- hist(x, breaks = 30, xlim = c(1,30))
I then want to define a set S of "selectors" (ranges) that select as many of my observations as possible while spanning as little distance as possible (the cost is the sum of the ranges in S). Each selector si must span a range of at least 3 (its resolution).
Example:
In the toy data x I could put the first selector s1 at [6;8], which selects 4+2 observations (distances 7 and 8), uses 3 distances, and covers 6/15 observations in total ([7;9] would give the same, but for simplicity I put the selector midpoint at the max frequency). Next would be adding s2 at [14;16] (6 distances, 9/15 covered). In summary, S would be built along these steps:
[6;8] (3, 6/15) #s1
[6;8], [14;16] (6, 9/15) #s2
[3;8], [14;16] (9, 12/15) #Extending s1 (cheapest)
[3;8], [12;16] (11, 13/15) #Extending s2
[3;8], [12;16], [29;31], (14, 14/15) #s3
[3;8], [12;20], [29;31], (18, 15/15) #Extending s2
One would stop the iterations when a certain total distance is used (sum of S) or when a certain fraction of data is covered by S. Or plot the sum of S against fraction of data covered and decide from that.
For very large data (100,000s of clustered observations spread over 1,000,000s of distance units) I could probably be more sloppy by increasing the minimum step allowed (above it is 1, maybe try 100) and coarsening the resolution (above it is 3, one could try maybe 1000).
Since it's equivalent to maximizing the area under density(x) while minimizing the ranges of x, my intuition is that one could approximate the steps described (for time and memory considerations) using density() and optim(). Maybe it's even a well-known maximization/minimization problem.
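To make the steps above concrete, here is a rough greedy sketch (purely illustrative and untested at scale; the helper names covered/width, the candidate moves, and the stopping budget are my own choices, not an established algorithm):
# Greedy sketch: at each step either open a new selector of width `res`
# or extend an existing one by `step`, picking whichever move gains the most
# observations per unit of distance added.
x <- c(3,3,3,7,7,7,7,8,8,12,15,15,15,20,30)
res    <- 3     # minimum selector width (in integer distances)
step   <- 1     # extension step
budget <- 18    # stop once the summed selector widths reach this
covered <- function(sel) sum(sapply(x, function(v) any(v >= sel[, 1] & v <= sel[, 2])))
width   <- function(sel) sum(sel[, 2] - sel[, 1] + 1)
selectors <- matrix(numeric(0), ncol = 2)
grid <- seq(min(x), max(x))
while (nrow(selectors) == 0 || width(selectors) < budget) {
  best <- NULL
  best_gain <- -Inf
  # candidate moves: open a new selector of width `res` centred on each grid point
  for (m in grid) {
    cand <- rbind(selectors, c(m - (res - 1) / 2, m + (res - 1) / 2))
    gain <- (covered(cand) - covered(selectors)) / res
    if (gain > best_gain) { best_gain <- gain; best <- cand }
  }
  # candidate moves: extend an existing selector by `step` on either side
  if (nrow(selectors) > 0) {
    for (i in seq_len(nrow(selectors))) for (side in 1:2) {
      cand  <- selectors
      delta <- if (side == 1) -step else step
      cand[i, side] <- cand[i, side] + delta
      gain <- (covered(cand) - covered(selectors)) / step
      if (gain > best_gain) { best_gain <- gain; best <- cand }
    }
  }
  selectors <- best
  if (covered(selectors) == length(x)) break   # everything is covered
}
selectors   # one row per selector: lower and upper bound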
Any suggestions that could get me started would be very appreciated.
Right off the bat, I'm a newbie to R and maybe I'm misunderstanding the concept of what my code does vs what I want it to do. Here is the code I've written so far.
his <- hist(x, breaks = seq(floor(min(x)), ceiling(max(x)) + ifelse(ceiling(max(x)) %% 5 != 0, 5, 0), 5))
Here is some sample data:
Autonr X
1 -12
2 -6
3 -17
4 8
5 -11
6 -10
7 10
8 -22
I'm not able to upload one of the histograms that did work, but it should show bins of 5, no matter how large the spread of the data. The number of bins should therefore be flexible.
The idea of the code above is to make sure that the outer ranges of my data always fall within neatly defined 5mm bins. Maybe I've lost the overview, but I can't seem to understand why this does not always work. In some cases it does, but with other datasets it doesn't.
I get: some 'x' not counted; maybe 'breaks' do not span range of 'x'.
Any help would be greatly appreciated, as I don't want to have to tinker around with my breaks and bins every time I get a new dataset to run through this.
Rather than passing a vector of breaks, you can supply a single value, in this case the calculation of how many bins are needed given the range of the data and a binwidth of 5.
# Generate uniformly distributed dummy data between 0 and 53
set.seed(5)
x <- runif(1000, 0, 53)
# Plot histogram with binwidths of 5.
hist(x, breaks = ceiling(diff(range(x)) / 5 ))
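One caveat: when breaks is a single number, hist() treats it only as a suggestion (the actual breakpoints are chosen by pretty()). If you specifically want bin edges sitting on exact multiples of 5 that always span the data, as in the original attempt, a minimal sketch is to round the data range outward to multiples of 5 yourself:
# Round the range outward to multiples of 5, then build explicit width-5 breaks
binwidth <- 5
lo <- floor(min(x) / binwidth) * binwidth     # nearest multiple of 5 at or below min(x)
hi <- ceiling(max(x) / binwidth) * binwidth   # nearest multiple of 5 at or above max(x)
hist(x, breaks = seq(lo, hi, by = binwidth))
This also avoids the warning in the question, which appears whenever the distance between floor(min(x)) and the adjusted maximum is not a multiple of 5, so the last break falls short of max(x).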
For the sake of completeness, here is another approach which uses breaks_width() from the scales package. scales is part of Hadley Wickham's ggplot2 ecosystem.
# create sample data
set.seed(5)
x <- runif(1000, 17, 53)
# plot histogram
hist(x, breaks = scales::breaks_width(5)(range(x)))
Explanation
scales::breaks_width(5) creates a function which is then called with the vector of minimum and maximum values of x as returned by range(x). So,
scales::breaks_width(5)(range(x))
returns a vector of breaks with the required bin size of 5:
[1] 15 20 25 30 35 40 45 50 55
The nice thing is that we have full control of the bins. E.g., we can ask for weird things like a bin size of 3 which is offset by 2:
scales::breaks_width(3, 2)(range(x))
[1] 17 20 23 26 29 32 35 38 41 44 47 50 53 56
I'm trying to cut() my data D into 3 pieces: [0-4], [5-12], [13-40] (see pic below). But I wonder how to exactly define my breaks in cut to achieve that?
Here is my data and R code:
D <- read.csv("https://raw.githubusercontent.com/rnorouzian/m/master/t.csv", h = T)
table(cut(D$time, breaks = c(0, 5, 9, 12))) ## what should breaks be?
# (0,5] (5,9] (9,12] # cuts not how I want the 3 pieces .
# 228 37 10
The notation (a,b] means ">a and <=b".
So, to get your desired result, just define the cuts so you get the grouping that you want, including a lower and upper bound:
table(cut(D$time, breaks=c(-1, 4, 12, 40)))
## (-1,4] (4,12] (12,40]
## 319 47 20
You may also find it helpful to look at two further arguments: right = FALSE, which changes the endpoints of the intervals from (a,b] to [a,b), and include.lowest, which includes the lowest breaks value (in the OP's example, making the first interval [0,5] closed on the lower bound). You can also use infinity as a break. Here's an example with a couple of those options put to use:
table(cut(D$time, breaks = c(-Inf, 4, 12, Inf), include.lowest=TRUE))
## [-Inf,4] (4,12] (12, Inf]
## 319 47 20
This produces the right buckets, but the interval notation needs tweaking (assuming all times are integers): each time you have a right-open interval notation, replace the factor label with the corresponding closed-interval notation. Use your best string 'magic'.
Personally, I like to make sure all possibilities are covered. Perhaps future data from this process might exceed 40? I like to put an upper bound of +Inf in all my cuts. This prevents NA from creeping into the data.
What cut needs is a 'whole numbers only' option.
F=cut(D$time,c(0,5,13,40),include.lowest = TRUE,right=FALSE)
# the below levels hard coded but you could write a loop to turn all labels
# of the form [m,n) into [m,n-1]
levels(F)[1:2]=c('[0,4]','[5,12]')
Typically there would be more analysis before final results are obtained, so I wouldn't sweat the labels too much until the work is closer to complete.
Here are my results
> table(F)
F
[0,4] [5,12] [13,40]
319 47 20
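For reference, here is a sketch of the label-rewriting loop mentioned in the comments above (my own helper; it assumes integer endpoints and uses base-R regex functions). It turns every right-open label [m,n) into the closed form [m,n-1]:
# Generic version of the manual relabelling above: rewrite every "[m,n)" label
# as "[m,n-1]" (assumes integer endpoints)
F <- cut(D$time, c(0, 5, 13, 40), include.lowest = TRUE, right = FALSE)
labs <- levels(F)
open.right <- grepl("\\)$", labs)
parts <- regmatches(labs, regexec("^\\[(-?[0-9]+),(-?[0-9]+)\\)$", labs))
labs[open.right] <- sapply(parts[open.right], function(p)
  sprintf("[%s,%d]", p[2], as.integer(p[3]) - 1))
levels(F) <- labs
table(F)
Alternatively, cut() itself takes a labels argument, so labels = c("[0,4]", "[5,12]", "[13,40]") would set the names directly and skip the rewriting.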
R can compare integers to floats, like in
> 6L >= 8.5
[1] FALSE
Thus you can use floats as breaks in cut such as in
table(cut(D$time, breaks = c(-.5, 4.5, 12.5, 40.5)))
For integers this fulfills your bucket definition of [0-4], [5-12], [13-40] without you having to think too much about square versus round brackets.
A fancy alternative would be clustering around the means of your buckets, as in
D <- read.csv("https://raw.githubusercontent.com/rnorouzian/m/master/t.csv", h = T)
D$cluster <- kmeans(D$time, centers = c(4/2, (5+12)/2, (13+40)/2))$cluster
plot(D$time, rnorm(nrow(D)), col=D$cluster)
You should add two additional arguments, right and include.lowest, to your code!
table(cut(D$time, breaks = c(0, 5, 13, 40), right=FALSE, include.lowest = TRUE))
With right=FALSE the intervals are closed on the left and open on the right, which gives your desired result. include.lowest=TRUE ensures that your highest break value (here 40) is included in the last interval.
Result:
[0,5) [5,13) [13,40]
319 47 20
Conversely, you can write:
table(cut(D$time, breaks = c(0, 4, 12, 40), right=TRUE, include.lowest = TRUE))
with the result:
[0,4] (4,12] (12,40]
319 47 20
Both mean exactly what you are looking for:
[0,4] [5,12] [13,40]
319 47 20
I am trying to display my data using radarchart {fmsb}. The values of my records are highly variable, so low values are not visible on the final plot.
Is there a way to "free" the axis for each record, to visualize the data independently of their scale?
Dummy example:
df<-data.frame(n = c(100, 0,0.3,60,0.3),
j = c(100,0, 0.001, 70,7),
v = c(100,0, 0.001, 79, 3),
z = c(100,0, 0.001, 80, 99))
      n       j       v       z
1 100.0 100.000 100.000 100.000 # max
2   0.0   0.000   0.000   0.000 # min
3   0.3   0.001   0.001   0.001 # small values -> not visible on the final chart!!
4  60.0  70.000  79.000  80.000
5   0.3   7.000   3.000  99.000
Create radarchart
require(fmsb)
radarchart(df, axistype=0, pty=32, axislabcol="grey",# na.itp=FALSE,
seg = 5, centerzero = T)
Result: only records #2 and #3 are visible; record #1, which has low values, is not visible!
How can I make all records (rows) visible, i.e. how do I "free" the axis for each of my records? Thank you a lot,
If you want to be sure to see all 4 dimensions whatever the differences, you'll need a logarithmic scale.
Since by design the radar chart cannot show negative values, we are restricted in our choice of base by the range of the values and by our number of segments (axis ticks).
If we want an integer base the minimum we can choose is:
seg0 <- 5 # your initial choice, could be changed
base <- ceiling(
  max(apply(df[-c(1, 2), ], MARGIN = 1, max) /
      apply(df[-c(1, 2), ], MARGIN = 1, min))^(1 / (seg0 - 1))
)
Here we have a base 5.
Let's normalize and transform our data.
First we normalize the data by setting the maximum to 1 for all series; then we apply our logarithmic transformation, which will set the maximum of each series to seg0 (n for black, z for the others) and the minimum among all series to between 1 and 2 (here the v value of the black series).
df_normalized <- as.data.frame(df[-c(1,2),]/apply(df[-c(1,2),],MARGIN = 1,max))
df_transformed <- rbind(rep(seg0,4),rep(0,4),log(df_normalized,base) + seg0)
radarchart(df_transformed, axistype=0, pty=32, axislabcol="grey",# na.itp=FALSE,
seg = seg0, centerzero = T,maxmin=T)
If we look at the green series we see:
j and v have same order of magnitude
n is about 5^2 = 25 times smaller than j (5 is the value of the base, ^2 because 2 segments)
v is about 5^2 = 25 times (again) smaller than z
If we look at the black series we see that n is about 5^3.5 (roughly 300) times bigger than the other dimensions.
If we look at the red series we see that the order of magnitude is the same among all dimensions.
Maybe a workaround for your problem:
If you transform your data before running radarchart (e.g. logarithm, square root, ...), then you can also visualise small values.
Here is an example using a cube root transformation:
library(specmine)
df.c<-data.frame(cubic_root_transform(df)) # transform dataset
radarchart(df.c, axistype=0, pty=32, axislabcol="grey",# na.itp=FALSE,
seg = 5, centerzero = T)
and the result will look like this:
EDIT:
If you want to zoom in on the small values even more, you can use a higher-order root.
e.g.
t<-5 # for fifth order root
df.t <- data.frame(apply(df, 2, function(x) x^(1/t))) # transform dataset
radarchart(df.t, axistype=0, pty=32, axislabcol="grey",# na.itp=FALSE,
seg = 5, centerzero = T)
You can adjust the "zoom" as you want by changing the value of t, so you can find a visualization that suits your data.
Here is an example using a 10th-root transformation:
library(specmine)
df.c<-data.frame((df)^(1/10)) # transform dataset
radarchart(df.c, axistype=0, pty=32, axislabcol="grey",# na.itp=FALSE,
seg = 5, centerzero = T)
and the result will look like this:
You can try different n-th roots to find the one that works best for you: as n grows, the root of a number near zero grows faster.
Not sure whether this should go on Cross Validated or not, but we'll see. Basically, I recently obtained data from an instrument (masses of compounds from 0 to 630), which I binned into 0.025-wide bins before plotting a histogram, as seen below:
I want to identify the bins that have high frequency and stand out against the background noise (the background noise increases as you move from right to left on the x-axis). Imagine drawing a curve on top of the points that have almost blurred together into a black lump and then selecting the bins that sit above that curve for further investigation; that's what I'm trying to do. I plotted a kernel density estimate to see if I could overlay it on top of my histogram and use it to identify points above the curve. However, the density plot makes no headway with this, as the density values are too low (see the second plot). Does anyone have any recommendations on how I can go about solving this problem? The blue line represents the overlaid density function and the red line represents the ideal solution (I need a way of somehow automating this in R).
The data below is only part of my dataset, so it's not really a good representation of my plot (which contains about 300,000 points), and as my bin sizes are quite small (0.025) there's a huge spread of data (in total there are 25,000 or so bins).
df <- read.table(header = TRUE, text = "
values
1 323.881306
2 1.003373
3 14.982121
4 27.995091
5 28.998639
6 95.983138
7 2.0117459
8 1.9095478
9 1.0072853
10 0.9038475
11 0.0055748
12 7.0964916
13 8.0725191
14 9.0765316
15 14.0102531
16 15.0137390
17 19.7887675
18 25.1072689
19 25.8338140
20 30.0151683
21 34.0635308
22 42.0393751
23 42.0504938
")
bin <- seq(0, 324, by = 0.025)
hist(df$values, breaks = bin, prob=TRUE, col = "grey")
lines(density(df$values), col = "blue")
Assuming you're dealing with a vector bin.densities that has the densities for each bin, a simple way to find outliers would be:
look at a window around each bin, say +/- 50 bins
current.bin <- 1
window.size <- 50
# parentheses matter here: ':' binds more tightly than '-' and '+';
# the lower index is also clipped at 1 so it stays positive
window <- bin.densities[max(1, current.bin - window.size):(current.bin + window.size)]
find the 5% lower and 95% upper quantile values (or really any cutoffs you think work)
lower.quant <- quantile(window, 0.05)
upper.quant <- quantile(window, 0.95)
then say that the current bin is an outlier if it falls outside your quantile range.
this.is.too.high <- (bin.densities[current.bin] > upper.quant)
this.is.too.low <- (bin.densities[current.bin] < lower.quant)
#final result
this.is.outlier <- this.is.too.high | this.is.too.low
I haven't actually tested this code, but this is the general approach I would take. You can play around with window size and the quantile percentages until the results look reasonable. Again, not exactly super complex math but hopefully it helps.
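As a follow-up sketch in the same untested spirit, here is the whole procedure applied to every bin at once, with the window clipped at both edges (bin.densities is assumed to come from the histogram computed in the question):
# All-bins version of the above: flag bins whose density falls outside the
# local 5%-95% quantile band of a +/- 50-bin window (window clipped at the edges)
h <- hist(df$values, breaks = bin, plot = FALSE)   # df and bin as in the question
bin.densities <- h$density
window.size <- 50
n <- length(bin.densities)
is.outlier <- vapply(seq_len(n), function(i) {
  idx <- max(1, i - window.size):min(n, i + window.size)
  window <- bin.densities[idx]
  bin.densities[i] < quantile(window, 0.05) ||
    bin.densities[i] > quantile(window, 0.95)
}, logical(1))
which(is.outlier)   # indices of bins that stand out from their local background
For the stated goal, only the "too high" bins matter, so you could drop the lower-quantile test and keep just the bins that exceed the local upper quantile.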