Looping through two vectors in R

Good day,
I need a function that creates increasing IDs from two parameters. I came up with this function, which works fine, but I want it vectorized, and I cannot seem to avoid O(N²) complexity. Are there any 'better' ways to do this?
Standard function:
threshold <- 3
calculateID <- function(p, r) {
  return((p - 1) * threshold + r)
}
calculateID(1, 1) #returns 1
calculateID(1, 2) #returns 2
calculateID(1, 3) #returns 3
calculateID(2, 1) #returns 4
#.....
calculateID(5, 3) #returns 15
Vectorized function: I would like to give the two parameters as vectors, so the function only has to be called once:
threshold <- 3
calculateIDVectorized <- function(p, r) {
  return(unlist(
    lapply(p, function(x) {
      lapply(r, function(y) {
        (x - 1) * threshold + y
      })
    })
  ))
}
calculateIDVectorized(c(1, 2, 3, 4, 5), c(1, 2, 3)) # should return 1-15
To clarify: I want every p and r argument to be used, so you should always get a result of length length(p) * length(r)

You can use outer:
calculateIDVectorized <- function(p, r) as.vector(t(outer(p, r, calculateID)))
calculateIDVectorized(c(1, 2, 3, 4, 5), c(1, 2, 3))
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
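For intuition about the t() call: outer() returns a length(p) x length(r) matrix of IDs, and as.vector() reads a matrix column by column, so the transpose is what produces the p-major order the question asks for. A small illustration (with threshold = 3 as above):
outer(c(1, 2), c(1, 2, 3), calculateID)
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6
as.vector(outer(c(1, 2), c(1, 2, 3), calculateID))     # 1 4 2 5 3 6 -- wrong order
as.vector(t(outer(c(1, 2), c(1, 2, 3), calculateID)))  # 1 2 3 4 5 6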

Since the OP was interested in fast computation, I compared the solutions:
library(microbenchmark)
p <- c(1:500) # using larger data set
r <- c(1:20)
threshold <- length(r) # parameterizing threshold so the IDs stay consecutive
m <- microbenchmark(
  tidy = crossing(p, r) %>%
    rowwise %>%
    transmute(out = calculateID(p, r)) %>%
    pull(out),
  dcv = do.call(Vectorize(calculateID), unname(rev(expand.grid(r, p)))),
  numbering = rev(expand.grid(r, p)) %>%
    arrange(Var2, Var1) %>%
    transmute(out = row_number()) %>%
    pull(out),
  hybrid = rev(expand.grid(r, p)) %>%
    rowwise() %>%
    transmute(out = calculateID(Var2, Var1)) %>%
    pull(out),
  outer = as.vector(t(outer(p, r, calculateID))),
  outer_c = c(t(outer(p, r, calculateID))),
  david = rep((p - 1), each = length(r)) * threshold + r
)
m
# Unit: microseconds
#      expr       min        lq       mean     median         uq        max neval
#      tidy 45441.869 47370.776 52123.6770 49482.1970 54158.4285 116780.840   100
#       dcv 16259.935 17156.225 19641.6731 17897.8885 21576.0865  55489.586   100
# numbering  5947.147  6379.337  7127.5125  6576.3560  6952.3205  12005.854   100
#    hybrid 44124.099 45856.210 51531.9480 47642.5405 52225.0600 175778.380   100
#     outer   106.655   120.711   141.1137   128.9665   143.2465    265.072   100
#   outer_c   117.811   137.446   152.5958   142.1315   155.9650    327.101   100
#     david   223.125   230.711   257.5622   241.8675   260.6100    920.164   100
So it looks like the options using outer() are fastest, with as.vector() edging out c(). @DavidArenburg's rep()-based solution is also right up with the outer() solutions.
I added a hybrid option using dplyr::transmute() because rev(expand.grid()) was significantly faster than crossing(); it appears marginally faster than the straight dplyr route, but still not as fast as do.call(Vectorize(...)) or the others.
Another option (added above) is to arrange the data frame and create IDs using dplyr::row_number() or 1:nrow(). This works if all the combinations of p and r are present and unique, but fails with non-sequential values, as the sketch below shows.
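A quick sketch of that caveat: with a gap in p, plain row numbering no longer matches the computed IDs.
threshold <- 3  # back to the question's setting
p2 <- c(1, 3)   # p = 2 is missing
r2 <- c(1, 2, 3)
as.vector(t(outer(p2, r2, calculateID)))  # 1 2 3 7 8 9 -- the IDs reflect the gap
seq_len(length(p2) * length(r2))          # 1 2 3 4 5 6 -- plain row numbering does not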

Another base R option using do.call + Vectorize + expand.grid
> do.call(Vectorize(calculateID),unname(rev(expand.grid(r,p))))
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Data
p <- c(1, 2, 3, 4, 5)
r <- c(1, 2, 3)

An option with tidyverse
library(dplyr)
library(tidyr)
crossing(p, r) %>%
  rowwise %>%
  transmute(out = calculateID(p, r)) %>%
  pull(out)
#[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15


Finding a list of previous max values in order of a vector R

I want to find the list of previous maxima. So for the vector:
3, 2, 2, 3, 4, 3, 9, 5, 2, 3, 4, 6, 120, 1
the first max is 3, the next is 4 (because 4 > 3), then 9 (because 9 > 4), and then 120 (because 120 > 9).
So as output I would need the positions:
1, 5, 7, 13
Is there anyway to do this without a for loop?
```
vector <- c(3, 2, 2, 3, 4, 3, 9, 5, 2, 3, 4, 6, 120, 1)
results <- 1
max <- 3
for (i in 2:length(vector)) {
  if (vector[i] > max) {
    results <- c(results, i)
    max <- vector[i]
  } else {
    next
  }
}
```
This can be done with run-length encoding:
vec <- c(3,2,2,3,4,3,9,5,2,3,4,6,120,1)
r <- rle(cummax(vec))
c(1, 1+cumsum(r$lengths)[-length(r$lengths)])
# [1] 1 5 7 13
And a shorter, more succinct variant from @user20650 (thanks!):
which(as.logical(c(1, diff(cummax(vec)))))
# [1] 1 5 7 13
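To unpack that one-liner: cummax() turns the vector into its running maximum, diff() is nonzero exactly where a new maximum appears, and the prepended 1 marks position 1, which is a record by definition:
vec <- c(3, 2, 2, 3, 4, 3, 9, 5, 2, 3, 4, 6, 120, 1)
cummax(vec)        # 3 3 3 3 4 4 9 9 9 9 9 9 120 120
diff(cummax(vec))  # 0 0 0 1 0 5 0 0 0 0 0 111 0
which(as.logical(c(1, diff(cummax(vec)))))
# [1] 1 5 7 13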
Maybe another solution with dplyr and tibble:
library(dplyr)
library(tibble)
cummax(vector) %>%
  enframe() %>%
  group_by(value) %>%
  slice_head() %>%
  pull(name)
[1] 1 5 7 13
Another way is to use a recursive function
findAllMaximums <- function(data, index = 1, results = c()) {
  if (index > length(data)) return(results)  # stop after the last element is checked
  if (index == 1) return(findAllMaximums(data, index + 1, index))
  if (data[index] > max(data[results])) results <- append(results, index)
  return(findAllMaximums(data, index + 1, results))
}
vector <- c(3, 2, 2, 3, 4, 3, 9, 5, 2, 3, 4, 6, 120, 1)
print(findAllMaximums(vector))
sapply(split(1:length(vector), cummax(vector)), `[`, 1)
## 3 4 9 120 <- the names of the result vector (=max values)
## 1 5 7 13 <- the values (=indexes)
Take only the firsts of the cummax() grouping.
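The same "first of each cummax() group" idea can be written with match(), since a running maximum only ever repeats consecutively (an alternative sketch, not from the answer above):
vector <- c(3, 2, 2, 3, 4, 3, 9, 5, 2, 3, 4, 6, 120, 1)
match(unique(cummax(vector)), cummax(vector))  # first index of each new maximum
# [1] 1 5 7 13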

Creating a Basic R Dice Rolling Function to Sum Dice Values

I'm trying to write a function that combines up to 4 (fair, 6-sided) dice rolls to create a specific value (named 'target.mountain') as many times as possible, given the numbers shown on the dice.
Then return these values along with any that aren't used in that combination. If the leftover numbers that don't form the 'target.mountain' sum to within the range 5-10, return that sum as well.
So as an example say I roll 4,3,2,5 and my target.mountain value is 9
I would do
4 + 5 -> 9 and as 2 + 3 = 5 my function would return 9, 5
Another example could be
Roll = (2,3,6,4) --> (6 + 3), (4 + 2) --> 9, 6
Once these values have been found, list them so the output appears like:
[1] 9, 5 (example 1)
[1] 9, 6 (example 2)
How do I go about doing this?
If you have ever played the board game 'Mountain Goats' then that may shed some light on how I need the dice to work as I just cannot figure it out!
Let's make the problem a bit harder, say 5 dice.
library(tidyverse)
rolls <- sample(1:6, replace = TRUE, size = 5)
target.mountain <- 7
# Make all possible combinations of the dice:
map_dfr(seq_along(rolls), ~ combn(seq_along(rolls), .x, simplify = FALSE) %>%
          map(~ tibble(dice = list(.), sum = sum(rolls[.]),
                       rolls = list(rolls[.]), length = length(.)))) %>%
  # filter to only those combinations which equal the target
  filter(sum == target.mountain) %>%
  # Now make all possible combinations of the sets that equal the target
  {map2(.x = list(.),
        .y = nrow(.) %>% map(.x = seq(.), .f = combn, x = ., simplify = FALSE) %>%
          unlist(recursive = FALSE),
        ~ .x[unlist(.y), ])} %>%
  # Subset to non-overlapping sets
  subset(map_lgl(., ~ length(reduce(.x$dice, union)) == length(unlist(.x$dice)))) -> part1
map(part1, as.data.frame)
#[[1]]
# dice sum rolls length
#1 1, 3 7 3, 4 2
#
#[[2]]
# dice sum rolls length
#1 4, 5 7 6, 1 2
#
#[[3]]
# dice sum rolls length
#1 2, 3, 5 7 2, 4, 1 3
#
#[[4]]
# dice sum rolls length
#1 1, 3 7 3, 4 2
#2 4, 5 7 6, 1 2
From here you can apply whatever rules you want:
part1 %>%
  # subset to the largest number of sets
  subset(map_dbl(., nrow) == max(map_dbl(., nrow))) %>%
  # subset to the fewest number of total dice
  subset(map_dbl(., ~ sum(.x$length)) == min(map_dbl(., ~ sum(.x$length)))) %>%
  # if there are still ties, pick the first
  `[[`(1) -> part2
as.data.frame(part2)
# dice sum rolls length
#1 1, 3 7 3, 4 2
#2 4, 5 7 6, 1 2
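A hypothetical final step (my sketch, not part of the answer above): collapse part2 into the output format the question asks for, reporting the target once per non-overlapping set plus the leftover sum when it lands in the 5-10 range. The names used, dice_left and friends, are illustrative.
used      <- unlist(part2$dice)                      # dice indices consumed by the sets
dice_left <- rolls[setdiff(seq_along(rolls), used)]  # faces of the unused dice
leftover  <- sum(dice_left)
c(rep(target.mountain, nrow(part2)),
  if (leftover >= 5 && leftover <= 10) leftover)     # leftover appended only when in range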
A possible solution to the problem:
target.mountain = 9
dice <- c(4,3,2,5)
library(tidyverse)
fn <- function(target.mountain, dice) {
  fltr <- map(seq_along(dice), ~ combn(dice, .x, sum) == target.mountain)
  out <- map(seq_along(dice), ~ combn(dice, .x))
  sum_target <- map2(out, fltr, ~ .x[, .y]) %>%
    purrr::discard(.x = ., function(x) length(x) == 0) %>%
    keep(.x = ., .p = function(x) length(x) == min(lengths(.))) %>%
    flatten_dbl()
  no_sum_target <- dice[!(dice %in% sum_target)]
  result <- toString(c(sum(sum_target), no_sum_target))
  return(result)
}
fn(target.mountain = target.mountain, dice = dice)
#> [1] "9, 3, 2"
Created on 2021-03-29 by the reprex package (v1.0.0)

Get indices of repeated instances of elements of a vector in other vector (both very large)

I have two vectors: one (A) of about 100 million non-unique elements (integers), the other (B) of 1 million unique elements from the same set. I am trying to get a list containing the indices of the repeated instances of each element of B in A.
A <- c(2, 1, 1, 1, 2, 1, 1, 3, 3, 2)
B <- 1:3
# would result in this:
[[1]]
[1] 2 3 4 6 7
[[2]]
[1] 1 5 10
[[3]]
[1] 8 9
I first, naively, tried this:
b_indices <- lapply(B, function(b) which(A == b))
which is horribly inefficient and apparently wouldn't complete for years.
The second thing I tried was to create a list of empty vectors, indexed with all elements of B, and then loop through A, appending the index to the corresponding vector for each element. Although conceptually a single pass, repeatedly appending to vectors forces R to copy them, and this approach would apparently take ~2-3 days, which is still too slow...
Is there anything that could work faster?
This is fast:
A1 <- order(A, method = "radix")  # positions of A, from smallest to largest value
split(A1, A[A1])                  # group those positions by the sorted values
#$`1`
#[1] 2 3 4 6 7
#
#$`2`
#[1] 1 5 10
#
#$`3`
#[1] 8 9
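Why this works, using the small example above: order() gives the positions of A from smallest to largest value, and indexing A by that gives the sorted values, so split() buckets the positions by value in a single pass.
A  <- c(2, 1, 1, 1, 2, 1, 1, 3, 3, 2)
A1 <- order(A, method = "radix")
A1     # 2 3 4 6 7 1 5 10 8 9  -- positions, sorted by the value they hold
A[A1]  # 1 1 1 1 1 2 2  2 3 3  -- the values, now sorted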
B <- seq_len(1e6)
set.seed(42)
A <- sample(B, 1e8, TRUE)
system.time({
  A1 <- order(A, method = "radix")
  res <- split(A1, A[A1])
})
#   user  system elapsed
#  8.650   1.056   9.704
data.table is arguably the most efficient way of dealing with big data in R, and it would even let you avoid having to use that 1-million-length vector altogether!
require(data.table)
a <- data.table(x=rep(c("a","b","c"),each=3))
a[ , list( yidx = list(.I) ) , by = x ]
   x  yidx
1: a 1,2,3
2: b 4,5,6
3: c 7,8,9
Using your example data:
a <- data.table(x=c(2, 1, 1, 1, 2, 1, 1, 3, 3, 2))
a[ , list( yidx = list(.I) ) , by = x ]
   x      yidx
1: 2   1, 5,10
2: 1 2,3,4,6,7
3: 3       8,9
Add this to your benchmarks. I dare say it should be significantly faster than using the built-in functions, if you test it at scale. The bigger the data, the better the relative performance of data.table, in my experience.
In my benchmark it only takes about 46% as long as order on my Debian laptop and only 5% as long as order on my Windows laptop with 8GB RAM and a 2.x GHz CPU.
B <- seq_len(1e6)
set.seed(42)
A <- data.table(x = sample(B, 1e8, TRUE))
system.time({
  res <- A[ , list( yidx = list(.I) ) , by = x ]
})
#   user  system elapsed
#   4.25    0.22    4.50
We can also use dplyr
library(dplyr)
data_frame(A) %>%   # data_frame() is now deprecated; tibble() works the same
  mutate(B = row_number()) %>%
  group_by(A) %>%
  summarise(B = list(B)) %>%
  .$B
#[[1]]
#[1] 2 3 4 6 7
#[[2]]
#[1] 1 5 10
#[[3]]
#[1] 8 9
On a smaller dataset of size 1e5, system.time gives:
#   user  system elapsed
#   0.01    0.00    0.02
but with the larger example shown in the other answer it is slower. However, this is dplyr...

Find top deciles from dataframe by group

I am attempting to create new variables using a function and lapply rather than working right in the data with loops. I used to use Stata and would have solved this problem with a method similar to that discussed here.
Since naming variables programmatically is so difficult or at least awkward in R (and it seems you can't use indexing with assign), I have left the naming process until after the lapply. I am then using a for loop to do the renaming prior to merging and again for the merging. Are there more efficient ways of doing this? How would I replace the loops? Should I be doing some sort of reshaping?
#Reproducible data
data <- data.frame("custID" = c(1:10, 1:20),
                   "v1" = rep(c("A", "B"), c(10, 20)),
                   "v2" = c(30:21, 20:19, 1:3, 20:6), stringsAsFactors = TRUE)
#Function to analyze customer distribution for each category (v1)
pf <- function(cat, df) {
  df <- df[df$v1 == cat, ]
  df <- df[order(-df$v2), ]
  # Divide the customers into top percents
  nr <- nrow(df)
  p10 <- round(nr * .10, 0)
  cat("Number of people in the Top 10% :", p10, "\n")
  p20 <- round(nr * .20, 0)
  p11_20 <- p20 - p10
  cat("Number of people in the 11-20% :", p11_20, "\n")
  # Keep only those customers in the top groups
  df <- df[1:p20, ]
  # Create a variable to identify the percent group the customer is in
  top_pct <- integer(length = p10 + p11_20)
  # Identify those in each group
  top_pct[1:p10] <- 10
  top_pct[(p10 + 1):p20] <- 20
  # Add this variable to the data frame
  df$top_pct <- top_pct
  # Keep only custID and the new variable
  df <- subset(df, select = c(custID, top_pct))
  return(df)
}
##Run the customer distribution function
v1Levels <- levels(data$v1)
res <- lapply(v1Levels, pf, df = data)
#Explore the results
summary(res)
# Length Class Mode
# [1,] 2 data.frame list
# [2,] 2 data.frame list
print(res)
# [[1]]
# custID top_pct
# 1 1 10
# 2 2 20
#
# [[2]]
# custID top_pct
# 11 1 10
# 16 6 10
# 12 2 20
# 17 7 20
##Merge the two data frames but with top_pct as a different variable for each category
#Change the new variable name
for (i in 1:length(res)) {
  names(res[[i]])[2] <- paste0(v1Levels[i], "_top_pct")
}
#Merge the results
res_m <- res[[1]]
for (i in 2:length(res)) {
  res_m <- merge(res_m, res[[i]], by = "custID", all = TRUE)
}
print(res_m)
# custID A_top_pct B_top_pct
# 1 1 10 10
# 2 2 20 20
# 3 6 NA 10
# 4 7 NA 20
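As an aside, the renaming and merging loops can be collapsed without touching pf(): Map() renames each result and Reduce() folds the list through merge() (a sketch using the res and v1Levels objects from above).
res <- Map(function(d, lvl) setNames(d, c("custID", paste0(lvl, "_top_pct"))),
           res, v1Levels)
res_m <- Reduce(function(x, y) merge(x, y, by = "custID", all = TRUE), res)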
Stick to your Stata instincts and use a single data set:
require(data.table)
DT <- data.table(data)
DT[,r:=rank(v2)/.N,by=v1]
You can see the result by typing DT.
From here, you can group the within-v1 rank, r, if you want to. Following Stata idioms...
DT[, g := {
  x <- rep(0, .N)
  x[r > .8] <- 20
  x[r > .9] <- 10
  x
}]
This is like gen and then two replace ... if statements. Again, you can see the result with DT.
Finally, you can subset with
DT[g>0]
which gives
custID v1 v2 r g
1: 1 A 30 1.000 10
2: 2 A 29 0.900 20
3: 1 B 20 0.975 10
4: 2 B 19 0.875 20
5: 6 B 20 0.975 10
6: 7 B 19 0.875 20
These steps can also be chained together:
DT[,r:=rank(v2)/.N,by=v1][,g:={x = rep(0,.N);x[r>.8] = 20;x[r>.9] = 10;x}][g>0]
(Thanks to @ExperimenteR.)
To rearrange for the desired output in the OP, with values of v1 in columns, use dcast:
dcast(
DT[,r:=rank(v2)/.N,by=v1][,g:={x = rep(0,.N);x[r>.8] = 20;x[r>.9] = 10;x}][g>0],
custID~v1)
Currently, dcast requires the latest version of data.table, available (I think) from Github.
You don't need the function pf to achieve what you want. Try a dplyr/tidyr combo:
library(dplyr)
library(tidyr)
data %>%
  group_by(v1) %>%
  arrange(desc(v2)) %>%
  mutate(n = n()) %>%
  filter(row_number() <= round(n * .2)) %>%
  mutate(top_pct = ifelse(row_number() <= round(n * .1), 10, 20)) %>%
  select(custID, top_pct) %>%
  spread(v1, top_pct)
# custID A B
#1 1 10 10
#2 2 20 20
#3 6 NA 10
#4 7 NA 20
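A side note: spread() is superseded in tidyr 1.0+; the same pipeline works with pivot_wider() for the final reshape (a sketch, with an explicit ungroup() added for safety):
library(dplyr)
library(tidyr)
data %>%
  group_by(v1) %>%
  arrange(desc(v2)) %>%
  mutate(n = n()) %>%
  filter(row_number() <= round(n * .2)) %>%
  mutate(top_pct = ifelse(row_number() <= round(n * .1), 10, 20)) %>%
  select(custID, top_pct) %>%   # grouping keeps v1 in the selection
  ungroup() %>%
  pivot_wider(names_from = v1, values_from = top_pct)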
The idiomatic way to do this kind of thing in R would be to use a combination of split and lapply. You're halfway there with your use of lapply; you just need to use split as well.
lapply(split(data, data$v1), function(df) {
  cutoff <- quantile(df$v2, c(0.8, 0.9))
  top_pct <- ifelse(df$v2 > cutoff[2], 10, ifelse(df$v2 > cutoff[1], 20, NA))
  na.omit(data.frame(id = df$custID, top_pct))
})
Finding quantiles is done with quantile.

Split a vector into quantiles in R

I see a lot of questions and answers regarding order and sort. Is there anything that sorts vectors or data frames into groupings (like quartiles or deciles)? I have a "manual" solution, but there's likely a better solution that has been group-tested.
Here's my attempt:
temp <- data.frame(name=letters[1:12], value=rnorm(12), quartile=rep(NA, 12))
temp
# name value quartile
# 1 a 2.55118169 NA
# 2 b 0.79755259 NA
# 3 c 0.16918905 NA
# 4 d 1.73359245 NA
# 5 e 0.41027113 NA
# 6 f 0.73012966 NA
# 7 g -1.35901658 NA
# 8 h -0.80591167 NA
# 9 i 0.48966739 NA
# 10 j 0.88856758 NA
# 11 k 0.05146856 NA
# 12 l -0.12310229 NA
temp.sorted <- temp[order(temp$value), ]
temp.sorted$quartile <- rep(1:4, each=12/4)
temp <- temp.sorted[order(as.numeric(rownames(temp.sorted))), ]
temp
# name value quartile
# 1 a 2.55118169 4
# 2 b 0.79755259 3
# 3 c 0.16918905 2
# 4 d 1.73359245 4
# 5 e 0.41027113 2
# 6 f 0.73012966 3
# 7 g -1.35901658 1
# 8 h -0.80591167 1
# 9 i 0.48966739 3
# 10 j 0.88856758 4
# 11 k 0.05146856 2
# 12 l -0.12310229 1
Is there a better (cleaner/faster/one-line) approach? Thanks!
There's a handy ntile function in package dplyr. It's flexible in the sense that you can very easily define the number of *tiles or "bins" you want to create.
Load the package (install first if you haven't) and add the quartile column:
library(dplyr)
temp$quartile <- ntile(temp$value, 4)
Or, if you want to use dplyr syntax:
temp <- temp %>% mutate(quartile = ntile(value, 4))
Result in both cases is:
temp
# name value quartile
#1 a -0.56047565 1
#2 b -0.23017749 2
#3 c 1.55870831 4
#4 d 0.07050839 2
#5 e 0.12928774 3
#6 f 1.71506499 4
#7 g 0.46091621 3
#8 h -1.26506123 1
#9 i -0.68685285 1
#10 j -0.44566197 2
#11 k 1.22408180 4
#12 l 0.35981383 3
data:
Note that you don't need to create the "quartile" column in advance; use set.seed() to make the randomization reproducible:
set.seed(123)
temp <- data.frame(name=letters[1:12], value=rnorm(12))
The method I use is one of the following, or Hmisc::cut2(value, g=4):
temp$quartile <- with(temp,
  cut(value,
      breaks = quantile(value, probs = seq(0, 1, by = 0.25), na.rm = TRUE),
      include.lowest = TRUE))
An alternate might be:
temp$quartile <- with(temp, factor(
  findInterval(value,
               c(-Inf, quantile(value, probs = c(0.25, .5, .75), na.rm = TRUE), Inf)),
  labels = c("Q1", "Q2", "Q3", "Q4")
))
The first one has the side effect of labeling the quartiles with the break values, which I consider a "good thing", but if that is not "good for you", or if the valid problems raised in the comments are a concern, you can go with the second version. You can use labels= in cut, or you can add this line to your code:
temp$quartile <- factor(temp$quartile, levels=c("1","2","3","4") )
Or, even quicker but slightly more obscure in how it works (the result is then a numeric vector rather than a factor):
temp$quartile <- as.numeric(temp$quartile)
I'll add the data.table version for anyone else Googling it (i.e., @BondedDust's solution translated to data.table and pared down a tad):
library(data.table)
setDT(temp)
temp[ , quartile := cut(value,
                        breaks = quantile(value, probs = 0:4/4),
                        labels = 1:4, right = FALSE)]
Which is much better (cleaner, faster) than what I had been doing:
temp[ , quartile :=
        as.factor(ifelse(value < quantile(value, .25), 1,
                  ifelse(value < quantile(value, .5),  2,
                  ifelse(value < quantile(value, .75), 3, 4))))]
Note, however, that this approach requires the quantiles to be distinct, e.g. it will fail on rep(0:1, c(100, 1)); what to do in this case is open ended so I leave it up to you.
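To see that caveat concretely: with rep(0:1, c(100, 1)), four of the five quartile breaks are 0, and cut() errors because breaks must be unique.
x <- rep(0:1, c(100, 1))
quantile(x, probs = 0:4/4)
#   0%  25%  50%  75% 100%
#    0    0    0    0    1
# cut(x, breaks = quantile(x, probs = 0:4/4))  # fails: 'breaks' are not unique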
Adapting dplyr::ntile to take advantage of data.table optimizations provides a faster solution.
library(data.table)
setDT(temp)
temp[order(value) , quartile := floor( 1 + 4 * (.I-1) / .N)]
Probably doesn't qualify as cleaner, but it's faster and one-line.
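What the one-liner computes: after subsetting by order(value), .I runs from 1 to .N over the reordered rows, so (.I - 1)/.N lies in [0, 1) and floor(1 + 4 * (.I - 1)/.N) maps it onto 1..4. The same arithmetic on a plain index vector:
N <- 12
floor(1 + 4 * (seq_len(N) - 1) / N)
# [1] 1 1 1 2 2 2 3 3 3 4 4 4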
Timing on a bigger data set
Comparing this solution to ntile and cut for data.table as proposed by @docendo_discimus and @MichaelChirico.
library(microbenchmark)
library(dplyr)
set.seed(123)
n <- 1e6
temp <- data.frame(name=sample(letters, size=n, replace=TRUE), value=rnorm(n))
setDT(temp)
microbenchmark(
  "ntile" = temp[, quartile_ntile := ntile(value, 4)],
  "cut" = temp[, quartile_cut := cut(value,
                                     breaks = quantile(value, probs = seq(0, 1, by = 1/4)),
                                     labels = 1:4, right = FALSE)],
  "dt_ntile" = temp[order(value), quartile_ntile_dt := floor(1 + 4 * (.I - 1)/.N)]
)
Gives:
Unit: milliseconds
     expr      min       lq     mean   median       uq      max neval
    ntile 608.1126 647.4994 670.3160 686.5103 691.4846 712.4267   100
      cut 369.5391 373.3457 375.0913 374.3107 376.5512 385.8142   100
 dt_ntile 117.5736 119.5802 124.5397 120.5043 124.5902 145.7894   100
You can use the quantile() function, but you need to handle rounding/precision when using cut(). So
set.seed(123)
temp <- data.frame(name=letters[1:12], value=rnorm(12), quartile=rep(NA, 12))
brks <- with(temp, quantile(value, probs = c(0, 0.25, 0.5, 0.75, 1)))
temp <- within(temp, quartile <- cut(value, breaks = brks, labels = 1:4,
                                     include.lowest = TRUE))
Giving:
> head(temp)
name value quartile
1 a -0.56047565 1
2 b -0.23017749 2
3 c 1.55870831 4
4 d 0.07050839 2
5 e 0.12928774 3
6 f 1.71506499 4
Sorry for being a bit late to the party. I wanted to add my one-liner using cut2, as I didn't know the max/min of my data and wanted the groups to be identically sized. I read about cut2 in an issue which was marked as a duplicate (link below).
library(Hmisc) #For cut2
set.seed(123) #To keep answers below identical to my random run
temp <- data.frame(name=letters[1:12], value=rnorm(12), quartile=rep(NA, 12))
temp$quartile <- as.numeric(cut2(temp$value, g=4)) #as.numeric to number the factors
temp$quartileBounds <- cut2(temp$value, g=4)
temp
Result:
> temp
name value quartile quartileBounds
1 a -0.56047565 1 [-1.265,-0.446)
2 b -0.23017749 2 [-0.446, 0.129)
3 c 1.55870831 4 [ 1.224, 1.715]
4 d 0.07050839 2 [-0.446, 0.129)
5 e 0.12928774 3 [ 0.129, 1.224)
6 f 1.71506499 4 [ 1.224, 1.715]
7 g 0.46091621 3 [ 0.129, 1.224)
8 h -1.26506123 1 [-1.265,-0.446)
9 i -0.68685285 1 [-1.265,-0.446)
10 j -0.44566197 2 [-0.446, 0.129)
11 k 1.22408180 4 [ 1.224, 1.715]
12 l 0.35981383 3 [ 0.129, 1.224)
Similar issue where I read about cut2 in detail
# Rank each value by counting how many values it equals or exceeds, then
# divide the ranks into four equally sized bins:
temp$quartile <- ceiling(sapply(temp$value, function(x) sum(x - temp$value >= 0)) / (length(temp$value) / 4))
Try this function:
getQuantileGroupNum <- function(vec, group_num, decreasing = FALSE) {
  if (decreasing) {
    abs(cut(vec, quantile(vec, probs = seq(0, 1, 1 / group_num), type = 8, na.rm = TRUE),
            labels = FALSE, include.lowest = TRUE) - group_num - 1)
  } else {
    cut(vec, quantile(vec, probs = seq(0, 1, 1 / group_num), type = 8, na.rm = TRUE),
        labels = FALSE, include.lowest = TRUE)
  }
}
> t1 <- runif(7)
> t1
[1] 0.4336094 0.2842928 0.5578876 0.2678694 0.6495285 0.3706474 0.5976223
> getQuantileGroupNum(t1, 4)
[1] 2 1 3 1 4 2 4
> getQuantileGroupNum(t1, 4, decreasing=T)
[1] 3 4 2 4 1 3 1
I would like to propose a version which seems to be more robust, since I ran into a lot of problems using quantile() in the breaks option of cut() on my dataset. I am using the ntile function of dplyr, but it also works with ecdf as input.
temp[, `:=`(quartile = .bincode(x = ntile(value, 100), breaks = seq(0, 100, 25), right = TRUE, include.lowest = TRUE),
            decile   = .bincode(x = ntile(value, 100), breaks = seq(0, 100, 10), right = TRUE, include.lowest = TRUE)
)]
temp[, `:=`(quartile = .bincode(x = ecdf(value)(value), breaks = seq(0, 1, 0.25), right = TRUE, include.lowest = TRUE),
            decile   = .bincode(x = ecdf(value)(value), breaks = seq(0, 1, 0.1),  right = TRUE, include.lowest = TRUE)
)]
Is that correct?
Take care with ntile() if your original values are clustered at some values. To create equally sized groups, it will allocate rows with the same original value into different groups. This may not be desirable.
I had a case where scores of individuals were clustered at certain values and it was important that individuals with the same original score were placed in the same group (e.g. allocating students to groups based on test score). ntile() allocated individuals with the same score to different groups (unfair in this case), but cut() with quantile() does not (but groups are only approximately equal in size).
library(dplyr)
library(reshape2)
library(ggplot2)
# awkward data: cannot be fairly and equally divided into quartiles or quintiles
# (similar results are obtained from more realistic cases of clustered values)
example <- data.frame(id = 1:49, x = c(rep(1:7, each = 7))) %>%
  mutate(ntileQuartile = ntile(x, 4),
         cutQuartile = cut(x, breaks = quantile(x, seq(0, 1, by = 1/4)),
                           include.lowest = TRUE, label = 1:4),
         ntileQuintile = ntile(x, 5),
         cutQuintile = cut(x, breaks = quantile(x, seq(0, 1, by = 1/5)),
                           include.lowest = TRUE, label = 1:5))
# graph: x axis is original score, colour is group allocation
# ntile creates equal groups, but some values of original score are split
# into separate groups. cut creates different sized groups, but score
# exactly determines the group.
melt(example, id.vars = c("id", "x"),
     variable.name = "method", value.name = "groupNumber") %>%
  ggplot(aes(x, fill = groupNumber)) +
  geom_histogram(colour = "black", bins = 13) +
  facet_wrap(vars(method))
There is possibly a quicker way, but I would do:
a <- rnorm(100) # Our data
q <- quantile(a) # You can supply your own breaks, see ?quantile
# Define a simple function that checks in which quantile a number falls
getQuant <- function(x)
{
  for (i in 1:(length(q) - 1))
  {
    if (x >= q[i] && x < q[i + 1])
      break
  }
  i
}
# Apply the function to the data
res <- unlist(lapply(as.matrix(a), getQuant))
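For what it's worth, base R's findInterval() performs the same binning without an explicit loop; a sketch using the a and q defined above (all.inside = TRUE keeps the maximum inside the top interval):
res2 <- findInterval(a, q, all.inside = TRUE)
all(res == res2)  # TRUE: same quartile numbers as the loop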
