I'm having trouble finding a function in R that performs equal-frequency discretization. I stumbled on the 'infotheo' package, but after some testing I found that the algorithm is broken. 'dprep' seems to no longer be supported on CRAN.
EDIT:
For clarity, I do not need to separate the values between the bins. I really want equal frequency; it doesn't matter if one value ends up in two bins. E.g.:
c(1,3,2,1,2,2)
should give one bin c(1,1,2) and another c(2,2,3)
EDIT: given your real goal, why don't you just do (corrected):
EqualFreq2 <- function(x, n){
  nx <- length(x)
  # base number of observations per bin, plus which bins get one extra
  nrepl <- floor(nx/n)
  nplus <- sample(1:n, nx - nrepl*n)
  nrep <- rep(nrepl, n)
  nrep[nplus] <- nrepl + 1
  # assign bin indices to the sorted positions of x
  x[order(x)] <- rep(seq.int(n), nrep)
  x
}
This returns a vector indicating, for each value, which bin it belongs to. As some values might be present in both bins, you can't define unambiguous bin limits, but you can do:
x <- rpois(50,5)
y <- EqualFreq2(x,15)
table(y)
split(x,y)
Original answer:
You can easily just use cut() for this:
EqualFreq <- function(x, n, include.lowest = TRUE, ...){
  nx <- length(x)
  # pick break positions at (approximately) equally spaced ranks
  id <- round(c(1, (1:(n-1))*(nx/n), nx))
  breaks <- sort(x)[id]
  if(sum(duplicated(breaks)) > 0) stop("n is too large.")
  cut(x, breaks, include.lowest = include.lowest, ...)
}
Which gives:
set.seed(12345)
x <- rnorm(50)
table(EqualFreq(x,5))
[-2.38,-0.886] (-0.886,-0.116] (-0.116,0.586] (0.586,0.937] (0.937,2.2]
            10              10             10            10          10
x <- rpois(50,5)
table(EqualFreq(x,5))
[1,3] (3,5] (5,6] (6,7] (7,11]
   10    13    11     6     10
As you can see, for discrete data an exactly equal binning is impossible in most cases, but this method gives you the best binning available.
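To see why the duplicated-breaks check above is needed: with heavily tied discrete data, the rank-based candidate breaks collide. A small illustration (exact values depend on the draw):
set.seed(1)
x <- rpois(50, 1)   # many ties at 0 and 1
sort(x)[round(c(1, (1:4)*(50/5), 50))]   # typically contains repeated values, so cut() would fail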
This sort of thing is also quite easily solved by using (abusing?) the conditioning plot infrastructure from lattice, in particular function co.intervals():
cutEqual <- function(x, n, include.lowest = TRUE, ...) {
  stopifnot(require(lattice))
  # co.intervals() with overlap = 0 returns an n x 2 matrix of interval
  # endpoints; take the first lower bound plus all upper bounds as breaks
  cut(x, co.intervals(x, n, 0)[c(1, (n+1):(n*2))],
      include.lowest = include.lowest, ...)
}
Which reproduces @Joris' excellent answer:
> set.seed(12345)
> x <- rnorm(50)
> table(cutEqual(x, 5))
[-2.38,-0.885] (-0.885,-0.115] (-0.115,0.587] (0.587,0.938] (0.938,2.2]
            10              10             10            10          10
> y <- rpois(50, 5)
> table(cutEqual(y, 5))
[0.5,3.5] (3.5,5.5] (5.5,6.5] (6.5,7.5] (7.5,11.5]
       10        13        11         6         10
In the latter, discrete, case the breaks are different although they have the same effect; the same observations are in the same bins.
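If you want to inspect the raw intervals that co.intervals() produces before they are collapsed into breaks (a quick look, not part of the original answer):
library(lattice)
# each row is one interval: column 1 holds the lower bounds, column 2 the upper bounds
co.intervals(rpois(50, 5), number = 5, overlap = 0)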
How about?
> a <- rnorm(50)
> table(Hmisc::cut2(a, m = 10))
[-2.2020,-0.7710) [-0.7710,-0.2352) [-0.2352, 0.0997) [ 0.0997, 0.9775)
               10                10                10                10
[ 0.9775, 2.5677]
               10
The classInt package is built "for choosing univariate class intervals for mapping or other graphics purposes". You can just do:
dataset <- c(1,3,2,1,2,2)
library(classInt)
classIntervals(dataset, 2, style = 'quantile')
where 2 is the number of bins you want and the quantile style provides quantile breaks. Several styles are available for this function: "fixed", "sd", "equal", "pretty", "quantile", "kmeans", "hclust", "bclust", "fisher", or "jenks". Check the docs for more info.
Here is a function that handles the error 'breaks' are not unique and automatically selects the closest n_bins value to the one you set up.
equal_freq <- function(var, n_bins) {
  require(ggplot2)
  n_bins_orig <- n_bins
  res <- tryCatch(cut_number(var, n = n_bins), error = function(e) e)
  # keep reducing the number of bins until cut_number() stops failing
  while(grepl("'breaks' are not unique", res[1]) & n_bins > 1) {
    n_bins <- n_bins - 1
    res <- tryCatch(cut_number(var, n = n_bins), error = function(e) e)
  }
  if(n_bins_orig != n_bins)
    warning(sprintf("It's not possible to calculate with n_bins=%s; setting n_bins to %s.", n_bins_orig, n_bins))
  return(res)
}
Example:
equal_freq(mtcars$carb, 10)
Which returns the binned variable along with the following warning:
It's not possible to calculate with n_bins=10; setting n_bins to 5.
Here is a one-liner solution inspired by @Joris' answer:
x <- rpois(50,5)
binSize <- 5
desiredFrequency <- floor(length(x)/binSize)
# note: rep() here assumes length(x) is divisible by binSize
split(sort(x), rep(1:binSize, rep(desiredFrequency, binSize)))
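A variant that tolerates a remainder (my tweak, not part of the original one-liner) cuts the rank sequence into nearly equal index groups instead:
x <- rpois(53, 5)   # length deliberately not divisible by 5
binSize <- 5
# cut() splits the ranks 1..53 into binSize nearly equal groups
split(sort(x), cut(seq_along(x), binSize, labels = FALSE))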
Here's another solution using mltools.
set.seed(1)
x <- round(rnorm(20), 2)
x.binned <- mltools::bin_data(x, bins = 5, binType = "quantile")
table(x.binned)
x.binned
[-2.21, -0.622) [-0.622, 0.1) [0.1, 0.526) [0.526, 0.844) [0.844, 1.6]
              4             4            4              4            4
We can use the package cutr with the argument what = "rough"; the look of the labels can be customized to taste:
# devtools::install_github("moodymudskipper/cutr")
library(cutr)
smart_cut(c(1, 3, 2, 1, 2, 2), 2, "rough", brackets = NULL, sep="-")
# [1] 1-2 2-3 1-2 1-2 2-3 2-3
# Levels: 1-2 < 2-3
Related
Recently I have been trying to simulate a game.
I throw two dice at the same time. If the sum of the two dice is greater than or equal to 10, I win 1 point.
If it is lower than 10, I lose 1 point. I will do this 1000 times.
At the very beginning, I draw 2000 random samples with set.seed(1234):
set.seed(1234)
d = sample(c(1:6), size = 2000, replace = T)
d
Then I turn it into a matrix and sum each row:
a = matrix(d, nrow=1000, ncol=2, byrow=T)
t = rowSums(a)
t
Now I have 1000 elements (the sum of the two dice for each throw). I would like to create a vector X holding the points I get.
However, how can I apply an if statement to create the vector X here?
Thank you very much.
Do you mean this?
X <- ifelse(t>=10,1,-1)
or
X <- 2*(t>=10)-1
Using case_when
library(dplyr)
case_when(t >= 10 ~ 1, TRUE ~ -1)
You could assign a temporary variable and assign points by comparing the values.
tmp <- t
t[tmp >= 10] <- 1
t[tmp < 10] <- -1
Or without a temporary variable.
t1 <- c(-1, 1)[(t >= 10) + 1]
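All of these produce the same scoring vector. A quick sanity check (assuming t still holds the raw dice sums from the question, not the overwritten points):
identical(ifelse(t >= 10, 1, -1), c(-1, 1)[(t >= 10) + 1])
#> [1] TRUE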
My question requires me to find two local maxima for y using this equation:
x <- seq(-5, 5, length = 10001)
y <- (10 * ((x-1)^2)^(1/3)) / (x^2 + 9)
plot(x,y)
I believe I can get one maximum with either max(y) or y[which.max(y)],
but I can't find the other one, since the graph has two peaks.
y[which.max(y)]
#> [1] 1.637347
max(y)
#> [1] 1.637347
I tried finding the second local maximum this way, but I'm not sure it's correct:
y[which.max(x>2)]
#> [1] 0.7695067
Created on 2019-03-24 by the reprex package (v0.2.0).
Here is a rather roundabout way of getting it done (correctly?).
# Library kader for the cuberoot() function
library('kader')

# Declaring x and calculating y
varX <- seq(-5, 5, length = 10001)
crootX <- kader:::cuberoot((varX - 1)^2)
actY <- (10 * crootX) / (varX^2 + 9)

# Storage variable for the maxima
outs <- c()

# Looping through the interior y values, looking for those that are greater
# than both the preceding AND the succeeding value (the endpoints are skipped
# to avoid out-of-bounds comparisons)
for(i in 2:(length(actY) - 1)){
  if(actY[i] > actY[i-1] && actY[i] > actY[i+1]){
    outs[i] <- actY[i]
  }
}

# Subsetting said values and output
outs <- subset(outs, !is.na(outs))
outs
#[1] 1.6373473 0.8818895
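A more compact alternative (a common base R idiom, not from the answer above) looks for points where the slope changes from positive to negative:
x <- seq(-5, 5, length = 10001)
y <- (10 * ((x-1)^2)^(1/3)) / (x^2 + 9)
# sign of successive differences flips from +1 to -1 at a strict local maximum
peaks <- which(diff(sign(diff(y))) == -2) + 1
y[peaks]   # should recover the same two maxima, roughly 1.637 and 0.882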
Using R's help page example on fminsearch as a starting point:
# Rosenbrock function
rosena <- function(x, a) 100*(x[2]-x[1]^2)^2 + (a-x[1])^2 # min: (a, a^2)
fminsearch(rosena, c(-1.2, 1), a = sqrt(2))
# x = (1.414214 2.000010) , fval = 1.239435e-11
I want to evaluate something like this but with only one variable, such as:
rosena <- function(x, a) 100*(x[1]-x[1]^2)^2 + (a-x[1])^2
but when I run
fminsearch(rosena, c(1), a = sqrt(2))
it gives the error: Error in X[2:d1, ] : incorrect number of dimensions
fminsearch seems to require a vector of length at least 2; however, this example calls for a vector of length 1.
Note: fminsearch is in the "pracma" package
It looks like a bug in the pracma package.
The anms function is dropping a dimension upon subscripting; relevant excerpts:
d <- length(x0) # i.e. 1
d1 <- d + 1 # i.e. 2
...
X <- matrix(0, nrow = d1, ncol = d)
...
X <- X[o, ] # could put drop = FALSE here
I think you should file a bug report with the package author.
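As a workaround until that is fixed (my suggestion, not from the thread), base R's optimize() handles one-dimensional minimization directly, so no length-2 vector is needed:
# one-dimensional variant of the function from the question
rosena1 <- function(x, a) 100*(x - x^2)^2 + (a - x)^2
# the interval c(-5, 5) is an arbitrary choice for illustration
optimize(function(x) rosena1(x, a = sqrt(2)), interval = c(-5, 5))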
In this example I'm trying to generate a random time series for 3 individuals at 4 time points (below, x contains the first time point for each individual). I want the values to randomly increase rather than decrease in time. Below is my current solution.
set.seed(0)
x <- rnorm(3)
x
[1] 1.2629543 -0.3262334 1.3297993
y <- c(x,
       x*runif(1,.8,1.2),
       x*runif(1,.9,1.3),
       x*runif(1,1,1.4))
y
[1] 1.2629543 -0.3262334 1.3297993 1.4642135 -0.3782206 1.5417106 1.6138915 -0.4168839 1.6993107 1.5967772
[11] -0.4124631 1.6812906
This has some problems.
For each individual the same coefficient is used for calculating the values of a given time point, resulting in identical trends. How could I get a random coefficient for each multiplication? I could use lapply, but then the vector would be "grouped" by individual rather than by time point.
I don't wish to write out the formulas for the last time points separately and be so precise. The exact coefficients are not important; I just need the values to have a tendency to increase slightly, though occasional decreases should also be allowed. How could I extend the vector more "effectively"?
How can I make negative values increase as well?
I managed to solve this thanks to Federico Manigrasso. The solution is below.
TimeSer <- function(num.id, years, init.val) {
  df <- data.frame(id = factor(rep(1:num.id, length(years))),
                   year = rep(years, each = num.id))
  yrs <- length(years) - 1
  # per-step bounds for the random relative change
  minim <- seq(-.1, by = -.1, len = yrs)
  maxim <- seq(.4, by = .4, len = yrs)
  val <- list(init.val)
  for (i in 1:yrs) {
    val[[i + 1]] <- unlist(lapply(init.val, function (x) {
      x + (x * runif(1, minim[i], maxim[i]))
    }))
  }
  df$val <- unlist(val)
  df
}
df <- TimeSer(num.id = 3, years = 2006:2016, init.val = rnorm(3,1e5, 1e5))
Visual representation of the results:
num.id <- length(unique(df$id))
par(mfrow = c(1, num.id))
lapply(1:num.id, function(x) {
  plot(unique(df$year), df$val[df$id == x], type = 'l', col = x)
})
I suggest putting the output in a list; it's a lot less messy, and you can transform it into a vector later (using unlist).
This is how I would rewrite your code:
x <- rnorm(3)
time <- 3
output <- list(x)  # init output list with initial data
par1 <- c(0.8, 0.9, 1)
par2 <- c(1.2, 1.3, 1.4)
for(i in 1:time){
  # a fresh runif() draw for every element gives each individual its own step
  a <- unlist(lapply(x, function(x){ x + runif(1, par1[i], par2[i]) }))
  output[[i+1]] <- a
  x <- a
}
Let me know if this solves all your problems.
I'm working with variables resembling the val values created below:
# data --------------------------------------------------------------------
data("mtcars")
val <- c(mtcars$wt, 10.55)
I'm cutting this variable in the following manner:
# Cuts --------------------------------------------------------------------
cut_breaks <- pretty_breaks(n = 10, eps.correct = 0)(val)
res <- cut2(x = val, cuts = cut_breaks)
which produces the following results:
> table(res)
res
[ 1, 2) [ 2, 3) [ 3, 4) [ 4, 5) [ 5, 6) 6 7 8 9 [10,11]
      4       8      16       1       3 0 0 0 0       1
In the created output I would like to change the following:
I'm not interested in creating groups with one value. Ideally, I would like each group to have at least 3 or 4 values. Paradoxically, I can live with groups having 0 values, as those will be dropped later on when merging on my real data.
Any changes to the cutting mechanism have to work on a variable with integer values.
The cuts have to be pretty. I'm trying to avoid something like 1.23 - 2.35, even if those values would be most sensible considering the distribution.
In effect, what I'm trying to achieve is this: make more or less even, pretty groups, and if a really tiny group turns up, bump it together with the next group; empty groups are not a worry.
Full code
For convenience, the full code is available below:
# Libs --------------------------------------------------------------------
Vectorize(require)(package = c("scales", "Hmisc"),
                   character.only = TRUE)
# data --------------------------------------------------------------------
data("mtcars") val <- c(mtcars$wt, 10.55)
# Cuts --------------------------------------------------------------------
cut_breaks <- pretty_breaks(n = 10, eps.correct = 0)(val)
res <- cut2(x = val, cuts = cut_breaks)
What I've tried
First approach
I tried playing with the eps.correct = 0 value in pretty_breaks, as in the code:
cut_breaks <- pretty_breaks(n = cuts, eps.correct = 0)(variable)
but none of the values got me anywhere close.
Second approach
I've also tried using the m = 5 argument of the cut2 function, but I keep arriving at the same result.
Comment replies
My breaks function
I tried the mybreaks function, but I would have to put some work into it to get nice cuts for more bizarre variables. Broadly speaking, pretty_breaks cuts well for me; just the tiny groups that occur from time to time are not desired.
> set.seed(1); require(scales)
> mybreaks <- function(x, n, r=0) {
+ unique(round(quantile(x, seq(0, 1, length=n+1)), r))
+ }
> x <- runif(n = 100)
> pretty_breaks(n = 5)(x)
[1] 0.0 0.2 0.4 0.6 0.8 1.0
> mybreaks(x = x, n = 5)
[1] 0 1
You could use the quantile() function as a relatively easy way to get similar numbers of observations in each of your groups.
For example, here's a function that takes a vector of values x, a desired number of groups n, and a desired rounding off point r for the breaks, and gives you suggested cut points.
mybreaks <- function(x, n, r=0) {
  unique(round(quantile(x, seq(0, 1, length=n+1)), r))
}
cut_breaks <- mybreaks(val, 5)
res <- cut(val, cut_breaks, include.lowest=TRUE)
table(res)
[2,3]  (3,4] (4,11]
    8     16      5
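The quantile() approach evens out the counts but gives up pretty breaks. The "bump a tiny group into its neighbour" goal from the question can also be sketched as a post-processing step on the pretty breaks; merge_small_bins below is a hypothetical helper, not from any posted answer:
merge_small_bins <- function(x, breaks, min_n = 3) {
  repeat {
    counts <- as.vector(table(cut(x, breaks, include.lowest = TRUE)))
    small <- which(counts > 0 & counts < min_n)
    if (length(small) == 0 || length(breaks) <= 2) break
    i <- small[1]
    # fuse bin i with the next bin by dropping the break between them,
    # or with the previous bin when it is already the last bin
    drop <- if (i + 1 < length(breaks)) i + 1 else i
    breaks <- breaks[-drop]
  }
  breaks
}

cut_breaks <- merge_small_bins(val, pretty_breaks(n = 10, eps.correct = 0)(val))
table(cut2(x = val, cuts = cut_breaks))
On val this should fuse the lone observations in [4, 5) and [10,11] into wider neighbouring bins while keeping whole-number break points, and it tolerates empty groups as requested.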