Merging two data.tables that don't have common columns - r

I want to merge two data.tables that don't have a common column, so I would end up with N1*N2 rows, where N1 and N2 are the number of rows in each dataframe.
Doing this with base R works:
A <- data.frame(id = 1:6, value = 19:24)
B <- data.frame(value2 = c(25, 25, 26, 26), value3 = 4:5)
A
#> id value
#> 1 1 19
#> 2 2 20
#> 3 3 21
#> 4 4 22
#> 5 5 23
#> 6 6 24
B
#> value2 value3
#> 1 25 4
#> 2 25 5
#> 3 26 4
#> 4 26 5
merge(A, B, all = TRUE)
#> id value value2 value3
#> 1 1 19 25 4
#> 2 2 20 25 4
#> 3 3 21 25 4
#> 4 4 22 25 4
#> 5 5 23 25 4
#> 6 6 24 25 4
#> 7 1 19 25 5
#> 8 2 20 25 5
#> 9 3 21 25 5
#> 10 4 22 25 5
#> 11 5 23 25 5
#> 12 6 24 25 5
#> 13 1 19 26 4
#> 14 2 20 26 4
#> 15 3 21 26 4
#> 16 4 22 26 4
#> 17 5 23 26 4
#> 18 6 24 26 4
#> 19 1 19 26 5
#> 20 2 20 26 5
#> 21 3 21 26 5
#> 22 4 22 26 5
#> 23 5 23 26 5
#> 24 6 24 26 5
But if I now have two data.tables and not dataframes anymore, it errors:
library(data.table)
A <- data.table(id = 1:6, value = 19:24)
B <- data.table(value2 = c(25, 25, 26, 26), value3 = 4:5)
merge(A, B, all = TRUE)
#> Error in merge.data.table(A, B, all = TRUE): A non-empty vector of column names for `by` is required.
How can I reproduce the base R behavior with data.table (without necessarily using merge())?

You are looking for a cross-join. In data.table, there is a CJ() function, but it cross-joins vectors rather than whole tables, so for two data.tables you can instead add a constant dummy key and join on it:
res <- setkey(A[, c(k=1, .SD)], k)[B[, c(k=1, .SD)], allow.cartesian = TRUE][, k := NULL]
res
id value value2 value3
1: 1 19 25 4
2: 2 20 25 4
3: 3 21 25 4
4: 4 22 25 4
5: 5 23 25 4
6: 6 24 25 4
7: 1 19 25 5
8: 2 20 25 5
9: 3 21 25 5
10: 4 22 25 5
11: 5 23 25 5
12: 6 24 25 5
13: 1 19 26 4
14: 2 20 26 4
15: 3 21 26 4
16: 4 22 26 4
17: 5 23 26 4
18: 6 24 26 4
19: 1 19 26 5
20: 2 20 26 5
21: 3 21 26 5
22: 4 22 26 5
23: 5 23 26 5
24: 6 24 26 5
id value value2 value3
Note the alternative dplyr solution:
dplyr::cross_join(A, B)
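For completeness, the same cross join can be sketched in base R by expanding row indices (equivalent to merge(A, B, by = NULL)); cross_join_df is a made-up helper name for illustration, not from any package:

```r
# Cross join two data.frames by pairing every row index of A
# with every row index of B. cross_join_df is a hypothetical name.
cross_join_df <- function(A, B) {
  idx <- expand.grid(i = seq_len(nrow(A)), j = seq_len(nrow(B)))
  res <- cbind(A[idx$i, , drop = FALSE], B[idx$j, , drop = FALSE])
  rownames(res) <- NULL
  res
}

A <- data.frame(id = 1:6, value = 19:24)
B <- data.frame(value2 = c(25, 25, 26, 26), value3 = 4:5)
res <- cross_join_df(A, B)
nrow(res)  # 24 = 6 * 4
```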

An alternative coming from this GitHub issue on the data.table repo:
library(data.table)
A <- data.table(id = 1:6, value = 19:24)
B <- data.table(value2 = c(25, 25, 26, 26), value3 = 4:5)
CJDT <- function(...) {
  Reduce(function(DT1, DT2) cbind(DT1, DT2[rep(1:.N, each = nrow(DT1))]), list(...))
}
CJDT(A, B)
#> id value value2 value3
#> 1: 1 19 25 4
#> 2: 2 20 25 4
#> 3: 3 21 25 4
#> 4: 4 22 25 4
#> 5: 5 23 25 4
#> 6: 6 24 25 4
#> 7: 1 19 25 5
#> 8: 2 20 25 5
#> 9: 3 21 25 5
#> 10: 4 22 25 5
#> 11: 5 23 25 5
#> 12: 6 24 25 5
#> 13: 1 19 26 4
#> 14: 2 20 26 4
#> 15: 3 21 26 4
#> 16: 4 22 26 4
#> 17: 5 23 26 4
#> 18: 6 24 26 4
#> 19: 1 19 26 5
#> 20: 2 20 26 5
#> 21: 3 21 26 5
#> 22: 4 22 26 5
#> 23: 5 23 26 5
#> 24: 6 24 26 5
#> id value value2 value3
Created on 2023-02-06 with reprex v2.0.2

A[, as.list(B), by = names(A)]
results
id value value2 value3
1: 1 19 25 4
2: 1 19 25 5
3: 1 19 26 4
4: 1 19 26 5
5: 2 20 25 4
6: 2 20 25 5
7: 2 20 26 4
8: 2 20 26 5
9: 3 21 25 4
10: 3 21 25 5
11: 3 21 26 4
12: 3 21 26 5
13: 4 22 25 4
14: 4 22 25 5
15: 4 22 26 4
16: 4 22 26 5
17: 5 23 25 4
18: 5 23 25 5
19: 5 23 26 4
20: 5 23 26 5
21: 6 24 25 4
22: 6 24 25 5
23: 6 24 26 4
24: 6 24 26 5
data
A <- data.table(id = 1:6, value = 19:24)
B <- data.table(value2 = c(25, 25, 26, 26), value3 = 4:5)

How to number each element in column conditionally on elements in other column in a dataset

I have a large dataset with thousands of measurements. What I want is to assign a visit number to each measurement so that all three consecutive measurements fall under the same visit number. After three consecutive measurements, the visit number increases. So the first three measurements are visit 1, the fourth to sixth measurements are visit 2, and so on. When there are only two or less measurements left, I want to mark the visit as missing.
Example dataset
DF <- data.frame(ID = rep("ID01", 10),
M = 1:10)
What I want:
DF$V <- c(rep(1:3, each = 3), NA)
Is there a way to do this automatically?
Thanks for any help.
Update: What if each measurement contains numerous other measurements? So that:
DF <- data.frame(ID = rep("ID01", 50),
M0 = sample(50),
M = rep(1:10, each = 5))
What I want:
DF$V <- c(rep(rep(1:3, each = 3), each = 5), rep(NA, 5))
Even when the length of each level of DF$M changes (and thus is not fixed at n <- 15). E.g. sum(DF$M == 1) = 21, sum(DF$M == 2) = 26, etc.
Again, thanks for any help.
A possible solution, which, thanks to a comment of @DarrenTsai, is now more concise (thanks, @DarrenTsai!):
library(dplyr)
n <- 15
DF %>%
  group_by(ID) %>%
  mutate(V = rep(1:(n() %/% n), each = n)[1:n()]) %>%
  ungroup
#> ID M0 M V
#> 1 ID01 20 1 1
#> 2 ID01 13 1 1
#> 3 ID01 41 1 1
#> 4 ID01 21 1 1
#> 5 ID01 45 1 1
#> 6 ID01 10 2 1
#> 7 ID01 17 2 1
#> 8 ID01 43 2 1
#> 9 ID01 5 2 1
#> 10 ID01 4 2 1
#> 11 ID01 37 3 1
#> 12 ID01 22 3 1
#> 13 ID01 14 3 1
#> 14 ID01 23 3 1
#> 15 ID01 39 3 1
#> 16 ID01 33 4 2
#> 17 ID01 42 4 2
#> 18 ID01 26 4 2
#> 19 ID01 31 4 2
#> 20 ID01 1 4 2
#> 21 ID01 48 5 2
#> 22 ID01 49 5 2
#> 23 ID01 18 5 2
#> 24 ID01 29 5 2
#> 25 ID01 2 5 2
#> 26 ID01 15 6 2
#> 27 ID01 8 6 2
#> 28 ID01 32 6 2
#> 29 ID01 7 6 2
#> 30 ID01 27 6 2
#> 31 ID01 11 7 3
#> 32 ID01 9 7 3
#> 33 ID01 36 7 3
#> 34 ID01 50 7 3
#> 35 ID01 34 7 3
#> 36 ID01 40 8 3
#> 37 ID01 24 8 3
#> 38 ID01 16 8 3
#> 39 ID01 46 8 3
#> 40 ID01 3 8 3
#> 41 ID01 47 9 3
#> 42 ID01 19 9 3
#> 43 ID01 28 9 3
#> 44 ID01 6 9 3
#> 45 ID01 38 9 3
#> 46 ID01 35 10 NA
#> 47 ID01 25 10 NA
#> 48 ID01 44 10 NA
#> 49 ID01 12 10 NA
#> 50 ID01 30 10 NA
UPDATED
The following solution works in the case of variable lengths of the levels of DF$M. It is based on the following ideas:
Calculate the maximum number of rows across all groups of M.
For each group of M, append rows to match that maximum.
Use the previous solution (whose code is above) to accomplish the OP's goal.
library(dplyr)
library(magrittr) # needed for the tee pipe %T>%
DF <- DF %>%
  slice(-30) # removes row 30, to force variable lengths in DF$M
DF %>%
  mutate(idaux = row_number()) %>%
  add_count(M, name = "aux") %T>%
  {m <<- max(.$aux)} %>%
  group_by(M) %>%
  slice(c(1:n(), rep(n(), m - n()))) %>%
  ungroup %>%
  group_by(ID) %>%
  mutate(V = rep(1:(n() %/% (3*m)), each = 3*m)[1:n()]) %>%
  ungroup %>%
  distinct %>%
  select(ID, M0, M, V) %>%
  as.data.frame()
#> ID M0 M V
#> 1 ID01 18 1 1
#> 2 ID01 22 1 1
#> 3 ID01 3 1 1
#> 4 ID01 17 1 1
#> 5 ID01 40 1 1
#> 6 ID01 20 2 1
#> 7 ID01 48 2 1
#> 8 ID01 39 2 1
#> 9 ID01 25 2 1
#> 10 ID01 49 2 1
#> 11 ID01 42 3 1
#> 12 ID01 36 3 1
#> 13 ID01 11 3 1
#> 14 ID01 5 3 1
#> 15 ID01 37 3 1
#> 16 ID01 30 4 2
#> 17 ID01 45 4 2
#> 18 ID01 1 4 2
#> 19 ID01 50 4 2
#> 20 ID01 46 4 2
#> 21 ID01 15 5 2
#> 22 ID01 16 5 2
#> 23 ID01 47 5 2
#> 24 ID01 14 5 2
#> 25 ID01 27 5 2
#> 26 ID01 8 6 2
#> 27 ID01 34 6 2
#> 28 ID01 9 6 2
#> 29 ID01 7 6 2
#> 30 ID01 43 7 3
#> 31 ID01 24 7 3
#> 32 ID01 29 7 3
#> 33 ID01 13 7 3
#> 34 ID01 23 7 3
#> 35 ID01 26 8 3
#> 36 ID01 2 8 3
#> 37 ID01 21 8 3
#> 38 ID01 38 8 3
#> 39 ID01 28 8 3
#> 40 ID01 6 9 3
#> 41 ID01 44 9 3
#> 42 ID01 19 9 3
#> 43 ID01 32 9 3
#> 44 ID01 4 9 3
#> 45 ID01 12 10 NA
#> 46 ID01 10 10 NA
#> 47 ID01 35 10 NA
#> 48 ID01 33 10 NA
#> 49 ID01 41 10 NA
A {data.table} solution:
library(data.table)
setDT(DF)
n <- 15
DF[
  ,
  V := c(rep(x = seq_len(.N %/% n), each = n), rep(NA, times = .N %% n)),
  by = "ID"
]
DF
#> ID M0 M V
#> 1: ID01 50 1 1
#> 2: ID01 30 1 1
#> 3: ID01 34 1 1
#> 4: ID01 2 1 1
#> 5: ID01 39 1 1
#> 6: ID01 15 2 1
#> 7: ID01 41 2 1
#> 8: ID01 24 2 1
#> 9: ID01 47 2 1
#> 10: ID01 49 2 1
#> 11: ID01 8 3 1
#> 12: ID01 42 3 1
#> 13: ID01 46 3 1
#> 14: ID01 28 3 1
#> 15: ID01 1 3 1
#> 16: ID01 4 4 2
#> 17: ID01 45 4 2
#> 18: ID01 43 4 2
#> 19: ID01 37 4 2
#> 20: ID01 26 4 2
#> 21: ID01 13 5 2
#> 22: ID01 20 5 2
#> 23: ID01 27 5 2
#> 24: ID01 22 5 2
#> 25: ID01 38 5 2
#> 26: ID01 10 6 2
#> 27: ID01 12 6 2
#> 28: ID01 48 6 2
#> 29: ID01 35 6 2
#> 30: ID01 44 6 2
#> 31: ID01 31 7 3
#> 32: ID01 14 7 3
#> 33: ID01 40 7 3
#> 34: ID01 23 7 3
#> 35: ID01 19 7 3
#> 36: ID01 3 8 3
#> 37: ID01 21 8 3
#> 38: ID01 5 8 3
#> 39: ID01 9 8 3
#> 40: ID01 7 8 3
#> 41: ID01 25 9 3
#> 42: ID01 36 9 3
#> 43: ID01 33 9 3
#> 44: ID01 29 9 3
#> 45: ID01 17 9 3
#> 46: ID01 11 10 NA
#> 47: ID01 16 10 NA
#> 48: ID01 6 10 NA
#> 49: ID01 32 10 NA
#> 50: ID01 18 10 NA
#> ID M0 M V
Created on 2022-07-20 by the reprex package (v2.0.1)
We can use the rle() function:
r <- rle(DF$M)$lengths
l <- unlist(Map(\(x, y) rep(x, each = 3 * y), 1:3, r))[1:(nrow(DF) - r[length(r)])]
repna <- rep(NA, nrow(DF) - length(l))
DF$v <- c(l, repna)
output
ID M0 M v
1 ID01 3 1 1
2 ID01 47 1 1
3 ID01 46 1 1
4 ID01 11 1 1
5 ID01 37 1 1
6 ID01 18 2 1
7 ID01 29 2 1
8 ID01 16 2 1
9 ID01 32 2 1
10 ID01 2 2 1
11 ID01 21 3 1
12 ID01 31 3 1
13 ID01 19 3 1
14 ID01 17 3 1
15 ID01 41 3 1
16 ID01 30 4 2
17 ID01 22 4 2
18 ID01 5 4 2
19 ID01 44 4 2
20 ID01 43 4 2
21 ID01 27 5 2
22 ID01 23 5 2
23 ID01 33 5 2
24 ID01 26 5 2
25 ID01 38 5 2
26 ID01 20 6 2
27 ID01 39 6 2
28 ID01 50 6 2
29 ID01 40 6 2
30 ID01 28 6 2
31 ID01 35 7 3
32 ID01 6 7 3
33 ID01 24 7 3
34 ID01 14 7 3
35 ID01 42 7 3
36 ID01 48 8 3
37 ID01 9 8 3
38 ID01 49 8 3
39 ID01 7 8 3
40 ID01 1 8 3
41 ID01 4 9 3
42 ID01 13 9 3
43 ID01 10 9 3
44 ID01 34 9 3
45 ID01 45 9 3
46 ID01 36 10 NA
47 ID01 12 10 NA
48 ID01 8 10 NA
49 ID01 25 10 NA
50 ID01 15 10 NA
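The fixed-size case can also be sketched in base R with ave(), grouping per ID as the dplyr and data.table answers do. This is my own sketch, not from the answers above, assuming n = 3 measurements per visit:

```r
DF <- data.frame(ID = rep("ID01", 10), M = 1:10)
n <- 3

# Number each run of n consecutive rows per ID as one visit;
# mark the incomplete final visit as NA.
DF$V <- ave(seq_along(DF$M), DF$ID, FUN = function(i) {
  v <- (seq_along(i) - 1) %/% n + 1
  v[v > length(i) %/% n] <- NA
  v
})
```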

tidyverse: binding list elements of same dimension

Using reduce(bind_cols), list elements of the same dimension can be combined. However, I would like to know how to combine only the same-dimension elements (possibly specifying the dimension in some way) from a list that may contain elements of different dimensions.
library(tidyverse)
df1 <- data.frame(A1 = 1:10, A2 = 10:1)
df2 <- data.frame(B = 11:30)
df3 <- data.frame(C = 31:40)
ls1 <- list(df1, df3)
ls1
[[1]]
A1 A2
1 1 10
2 2 9
3 3 8
4 4 7
5 5 6
6 6 5
7 7 4
8 8 3
9 9 2
10 10 1
[[2]]
C
1 31
2 32
3 33
4 34
5 35
6 36
7 37
8 38
9 39
10 40
ls1 %>%
reduce(bind_cols)
A1 A2 C
1 1 10 31
2 2 9 32
3 3 8 33
4 4 7 34
5 5 6 35
6 6 5 36
7 7 4 37
8 8 3 38
9 9 2 39
10 10 1 40
ls2 <- list(df1, df2, df3)
ls2
[[1]]
A1 A2
1 1 10
2 2 9
3 3 8
4 4 7
5 5 6
6 6 5
7 7 4
8 8 3
9 9 2
10 10 1
[[2]]
B
1 11
2 12
3 13
4 14
5 15
6 16
7 17
8 18
9 19
10 20
11 21
12 22
13 23
14 24
15 25
16 26
17 27
18 28
19 29
20 30
[[3]]
C
1 31
2 32
3 33
4 34
5 35
6 36
7 37
8 38
9 39
10 40
ls2 %>%
reduce(bind_cols)
Error: Can't recycle `..1` (size 10) to match `..2` (size 20).
Run `rlang::last_error()` to see where the error occurred.
Question
I am looking for a function that combines all data.frames in a list, with an argument for the number of rows.
One option could be:
map(split(ls2, map_int(ls2, NROW)), bind_cols)
$`10`
A1 A2 C
1 1 10 31
2 2 9 32
3 3 8 33
4 4 7 34
5 5 6 35
6 6 5 36
7 7 4 37
8 8 3 38
9 9 2 39
10 10 1 40
$`20`
B
1 11
2 12
3 13
4 14
5 15
6 16
7 17
8 18
9 19
10 20
11 21
12 22
13 23
14 24
15 25
16 26
17 27
18 28
19 29
20 30
You can use:
n <- 1:max(sapply(ls2, nrow))
res <- do.call(cbind, lapply(ls2, `[`, n, ,drop = FALSE))
res
# A1 A2 B C
#1 1 10 11 31
#2 2 9 12 32
#3 3 8 13 33
#4 4 7 14 34
#5 5 6 15 35
#6 6 5 16 36
#7 7 4 17 37
#8 8 3 18 38
#9 9 2 19 39
#10 10 1 20 40
#NA NA NA 21 NA
#NA.1 NA NA 22 NA
#NA.2 NA NA 23 NA
#NA.3 NA NA 24 NA
#NA.4 NA NA 25 NA
#NA.5 NA NA 26 NA
#NA.6 NA NA 27 NA
#NA.7 NA NA 28 NA
#NA.8 NA NA 29 NA
#NA.9 NA NA 30 NA
A little shorter with purrr::map_dfc (with n as defined above):
purrr::map_dfc(ls2, `[`, n, , drop = FALSE)
We can use cbind.fill from rowr (note that the rowr package has since been archived on CRAN):
library(rowr)
do.call(cbind.fill, c(ls2, fill = NA))
A base R option using tapply + sapply
tapply(
ls2,
sapply(ls2, nrow),
function(x) do.call(cbind, x)
)
gives
$`10`
A1 A2 C
1 1 10 31
2 2 9 32
3 3 8 33
4 4 7 34
5 5 6 35
6 6 5 36
7 7 4 37
8 8 3 38
9 9 2 39
10 10 1 40
$`20`
B
1 11
2 12
3 13
4 14
5 15
6 16
7 17
8 18
9 19
10 20
11 21
12 22
13 23
14 24
15 25
16 26
17 27
18 28
19 29
20 30
You may also use an if inside reduce if you want to combine only the elements that match the first item of the list (the first item takes priority):
df1 <- data.frame(A1 = 1:10, A2 = 10:1)
df2 <- data.frame(B = 11:30)
df3 <- data.frame(C = 31:40)
ls1 <- list(df1, df3)
ls2 <- list(df1, df2, df3)
library(tidyverse)
reduce(ls2, ~if(nrow(.x) == nrow(.y)){bind_cols(.x, .y)} else {.x})
#> A1 A2 C
#> 1 1 10 31
#> 2 2 9 32
#> 3 3 8 33
#> 4 4 7 34
#> 5 5 6 35
#> 6 6 5 36
#> 7 7 4 37
#> 8 8 3 38
#> 9 9 2 39
#> 10 10 1 40
Created on 2021-06-09 by the reprex package (v2.0.0)
Here's another tidyverse option.
We're creating a dummy ID in each data.frame based on the row_number(), then joining all data.frames by the dummy ID, and then dropping the dummy ID.
ls2 %>%
map(., ~mutate(.x, id = row_number())) %>%
reduce(full_join, by = "id") %>%
select(-id)
This gives us:
A1 A2 B C
1 1 10 11 31
2 2 9 12 32
3 3 8 13 33
4 4 7 14 34
5 5 6 15 35
6 6 5 16 36
7 7 4 17 37
8 8 3 18 38
9 9 2 19 39
10 10 1 20 40
11 NA NA 21 NA
12 NA NA 22 NA
13 NA NA 23 NA
14 NA NA 24 NA
15 NA NA 25 NA
16 NA NA 26 NA
17 NA NA 27 NA
18 NA NA 28 NA
19 NA NA 29 NA
20 NA NA 30 NA
We can also use the Reduce function from base R:
lst <- list(df1, df2, df3)
# First we create an id column in each underlying data set
ls2 <- lapply(lst, \(x) {
  x$id <- 1:nrow(x)
  x
})
Reduce(function(x, y) {
  if (nrow(x) == nrow(y)) merge(x, y, by = "id") else x
}, ls2)
id A1 A2 C
1 1 1 10 31
2 2 2 9 32
3 3 3 8 33
4 4 4 7 34
5 5 5 6 35
6 6 6 5 36
7 7 7 4 37
8 8 8 3 38
9 9 9 2 39
10 10 10 1 40
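Since the question literally asks for "a function ... with an argument of number of rows", here is a minimal base-R sketch of exactly that; bind_same_rows is a made-up name, not from any package:

```r
# Keep only the data.frames in `lst` with exactly `nrows` rows,
# then cbind them. bind_same_rows is a hypothetical helper.
bind_same_rows <- function(lst, nrows) {
  keep <- Filter(function(d) nrow(d) == nrows, lst)
  do.call(cbind, keep)
}

df1 <- data.frame(A1 = 1:10, A2 = 10:1)
df2 <- data.frame(B = 11:30)
df3 <- data.frame(C = 31:40)
res <- bind_same_rows(list(df1, df2, df3), 10)
names(res)  # "A1" "A2" "C"
```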

Is there any method to sort the matrix by both column and row in R?

Could you guys help me?
I have a matrix like this. The first column and row are the IDs.
I need to sort it by both column and row ID, like this.
Thanks!
Two thoughts:
mat <- matrix(1:25, nr=5, dimnames=list(c('4',3,5,2,1), c('4',3,5,2,1)))
mat
# 4 3 5 2 1
# 4 1 6 11 16 21
# 3 2 7 12 17 22
# 5 3 8 13 18 23
# 2 4 9 14 19 24
# 1 5 10 15 20 25
If you want a strictly alphabetic ordering, then this will work:
mat[order(rownames(mat)),order(colnames(mat))]
# 1 2 3 4 5
# 1 25 20 10 5 15
# 2 24 19 9 4 14
# 3 22 17 7 2 12
# 4 21 16 6 1 11
# 5 23 18 8 3 13
This will not work well if the names are intended to be ordered numerically:
mat <- matrix(1:30, nr=3, dimnames=list(c('2',1,3), c('4',3,5,2,1,6,7,8,9,10)))
mat
# 4 3 5 2 1 6 7 8 9 10
# 2 1 4 7 10 13 16 19 22 25 28
# 1 2 5 8 11 14 17 20 23 26 29
# 3 3 6 9 12 15 18 21 24 27 30
mat[order(rownames(mat)),order(colnames(mat))]
# 1 10 2 3 4 5 6 7 8 9
# 1 14 29 11 5 2 8 17 20 23 26
# 2 13 28 10 4 1 7 16 19 22 25
# 3 15 30 12 6 3 9 18 21 24 27
Note the string ordering of the names (1, 10, 2, ...). For numeric ordering, you need a slight modification:
mat[order(as.numeric(rownames(mat))),order(as.numeric(colnames(mat)))]
# 1 2 3 4 5 6 7 8 9 10
# 1 14 11 5 2 8 17 20 23 26 29
# 2 13 10 4 1 7 16 19 22 25 28
# 3 15 12 6 3 9 18 21 24 27 30
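Wrapping the numeric variant in a small helper keeps it reusable; this is my own sketch, and sort_by_ids is a made-up name (drop = FALSE guards single-row/column matrices):

```r
# Sort a matrix by its numeric row and column names.
# sort_by_ids is a hypothetical helper name.
sort_by_ids <- function(m) {
  m[order(as.numeric(rownames(m))),
    order(as.numeric(colnames(m))), drop = FALSE]
}

mat <- matrix(1:25, nrow = 5,
              dimnames = list(c('4', 3, 5, 2, 1), c('4', 3, 5, 2, 1)))
sort_by_ids(mat)
```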

r - use dplyr::group_by in combination with purrr::pmap

I have the following dataframe:
df <- data.frame(a = c(1:20),
b = c(2:21),
c = as.factor(c(rep(1,5), rep(2,10), rep(3,5))))
and I want to do the following:
df1 <- df %>% group_by(c) %>% mutate(a = lead(b))
but originally I have many variables to which I need to apply the lead() function in combination with group_by() on multiple variables. I'm trying the purrr::pmap() to achieve this:
df2 <- pmap(list(df[,1],df[,2],df[,3]), function(x,y,z) group_by(z) %>% lead(y))
Unfortunately this results in error:
Error in UseMethod("group_by_") :
no applicable method for 'group_by_' applied to an object of class "c('integer', 'numeric')"
You can do this with mutate_at and named arguments to funs(), which creates new columns instead of overwriting them. (In current dplyr, funs() is deprecated; use across() with a named list of functions instead.) Note that this does nothing to a, but you can rename the columns afterwards as desired.
df <- data.frame(
a = c(1:20),
b = c(2:21),
b2 = 3:22,
b3 = 4:23,
c = as.factor(c(rep(1, 5), rep(2, 10), rep(3, 5)))
)
library(tidyverse)
df %>%
group_by(c) %>%
mutate_at(vars(starts_with("b")), funs(lead = lead(.)))
#> # A tibble: 20 x 8
#> # Groups: c [3]
#> a b b2 b3 c b_lead b2_lead b3_lead
#> <int> <int> <int> <int> <fct> <int> <int> <int>
#> 1 1 2 3 4 1 3 4 5
#> 2 2 3 4 5 1 4 5 6
#> 3 3 4 5 6 1 5 6 7
#> 4 4 5 6 7 1 6 7 8
#> 5 5 6 7 8 1 NA NA NA
#> 6 6 7 8 9 2 8 9 10
#> 7 7 8 9 10 2 9 10 11
#> 8 8 9 10 11 2 10 11 12
#> 9 9 10 11 12 2 11 12 13
#> 10 10 11 12 13 2 12 13 14
#> 11 11 12 13 14 2 13 14 15
#> 12 12 13 14 15 2 14 15 16
#> 13 13 14 15 16 2 15 16 17
#> 14 14 15 16 17 2 16 17 18
#> 15 15 16 17 18 2 NA NA NA
#> 16 16 17 18 19 3 18 19 20
#> 17 17 18 19 20 3 19 20 21
#> 18 18 19 20 21 3 20 21 22
#> 19 19 20 21 22 3 21 22 23
#> 20 20 21 22 23 3 NA NA NA
Created on 2018-09-07 by the reprex package (v0.2.0).
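If you want to avoid the tidyverse entirely, the same multi-column grouped lead can be sketched in base R with ave(); this is my own sketch, and lead1 is a made-up helper mimicking dplyr::lead:

```r
df <- data.frame(a = 1:20,
                 b = 2:21,
                 b2 = 3:22,
                 b3 = 4:23,
                 c = as.factor(c(rep(1, 5), rep(2, 10), rep(3, 5))))

# Shift a vector up by one, padding with NA (like dplyr::lead).
lead1 <- function(x) c(x[-1], NA)

cols <- c("b", "b2", "b3")
# ave() applies lead1 within each level of df$c, per column.
df[paste0(cols, "_lead")] <- lapply(df[cols], function(x) ave(x, df$c, FUN = lead1))
```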

Segregate data into 4 equal percentage chunks

I need to segregate the data into 4 equal chunks based on percentage, ordered by Qty_Ordered. I tried using the bins.quantiles function (from the binr package) in R, but it is not working. Any other methods which can be used?
Input
SL.No Item Qty_Ordered
1 VT25 2
2 VT58 4
3 VT40 10
4 VT58 2
5 VT 69 12
6 VT 67 6
7 VT45 21
8 VT 25 16
9 VT 40 24
10 VT98 10
11 VT78 18
12 VT40 6
13 VT 25 26
14 VT85 6
15 VT78 10
16 VT25 4
17 VT40 15
18 VT69 24
Output
SL.No Item Qty Ordered Class
1 VT25 2 1
4 VT58 2 1
2 VT58 4 1
16 VT25 4 1
6 VT 67 6 2
12 VT40 6 2
14 VT85 6 2
3 VT40 10 2
10 VT98 10 2
15 VT78 10 3
5 VT 69 12 3
17 VT40 15 3
8 VT 25 16 3
11 VT78 18 3
7 VT45 21 4
9 VT 40 24 4
18 VT69 24 4
13 VT 25 26 4
Maybe this?
library(data.table)
test <- fread(input = "SL.No Item Qty_Ordered
1 VT25 2
2 VT58 4
3 VT40 10
4 VT58 2
5 VT69 12
6 VT67 6
7 VT45 21
8 VT25 16
9 VT40 24
10 VT98 10
11 VT78 18
12 VT40 6
13 VT25 26
14 VT85 6
15 VT78 10
16 VT25 4
17 VT40 15
18 VT69 24", header = T)
setorder(test, Qty_Ordered)
test[, Class := .I %/% ((.N+1)/4) + 1]
test
# SL.No Item Qty_Ordered Class
# 1: 1 VT25 2 1
# 2: 4 VT58 2 1
# 3: 2 VT58 4 1
# 4: 16 VT25 4 1
# 5: 6 VT67 6 2
# 6: 12 VT40 6 2
# 7: 14 VT85 6 2
# 8: 3 VT40 10 2
# 9: 10 VT98 10 2
# 10: 15 VT78 10 3
# 11: 5 VT69 12 3
# 12: 17 VT40 15 3
# 13: 8 VT25 16 3
# 14: 11 VT78 18 3
# 15: 7 VT45 21 4
# 16: 9 VT40 24 4
# 17: 18 VT69 24 4
# 18: 13 VT25 26 4
Here's a way using the tidyverse
library(tidyverse)
df <- read.table(text = "SL.No Item Qty_Ordered
1 VT25 2
2 VT58 4
3 VT40 10
4 VT58 2
5 VT69 12
6 VT67 6
7 VT45 21
8 VT25 16
9 VT40 24
10 VT98 10
11 VT78 18
12 VT40 6
13 VT25 26
14 VT85 6
15 VT78 10
16 VT25 4
17 VT40 15
18 VT69 24",header = T)
df %>%
  mutate(Class = findInterval(x = Qty_Ordered, vec = quantile(Qty_Ordered), rightmost.closed = T)) %>%
  arrange(Class)
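For reference, dplyr::ntile(Qty_Ordered, 4) does this kind of rank-based bucketing directly. A base-R sketch of the same idea (my own; group sizes at ties may differ slightly from the answers above):

```r
# Assign each value to one of n rank-based buckets (ntile-style):
# rank the values, then scale ranks onto 1..n.
ntile_base <- function(x, n) {
  r <- rank(x, ties.method = "first")
  as.integer(ceiling(r * n / length(x)))
}

qty <- c(2, 4, 10, 2, 12, 6, 21, 16, 24, 10, 18, 6, 26, 6, 10, 4, 15, 24)
ntile_base(qty, 4)
```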
