Count common sets of items between different customers

I have data on customers and the different products they have purchased:
Customer Product
1 A
1 B
1 C
2 D
2 E
2 F
3 A
3 B
3 D
4 A
4 B
I would like to check which sets of products occur together across different customers. I want to get the count for product combinations of different lengths. For example, the product combination A and B together occurs in three different customers; the product group A, B and C occurs in one customer. And so on for all different sets of 2 or more products in the data. Something like:
Product Group Number
A, B, C 1
D, E, F 1
A, B, D 1
A, B 3
Thus, I'm counting the A, B combination in customers who only have product A and B (e.g. customer 4), and in customers who have A and B, but also any other product (e.g. customer 1, who has A, B and C).
Does anyone have any ideas how to do that with either a tidyverse or base R approach? I feel like it ought to be pretty trivial - maybe pivot_wider first, then count?
I have found this question and answer that can do what I need for pairs of products, but I need to count combinations also for more products than two.

If you have the possibility to use a non-base package, you can use a tool dedicated for the task of finding item sets: arules::apriori. It is much faster on larger data sets.
library(arules)
# coerce data frame to binary incidence matrix
# use apriori to get "frequent itemsets"
r = apriori(data = as.matrix(table(dat) > 0),
# set: type of association mined, minimal support needed of an item set,
# minimal number of items per item set
par = list(target = "frequent itemsets",
support = 0,
minlen = 2))
# coerce itemset to data.frame, select relevant rows and columns
d = as(r, "data.frame")
d[d$count > 0, c("items", "count")]
# items count
# 4 {B,C} 1
# 5 {A,C} 1
# 6 {E,F} 1
# 7 {D,E} 1
# 10 {D,F} 1
# 13 {B,D} 1
# 14 {A,D} 1
# 15 {A,B} 3
# 25 {A,B,C} 1
# 26 {D,E,F} 1
# 35 {A,B,D} 1
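As a sanity check on what is being counted here, the incidence matrix can be built and queried by hand in base R. This is only a sketch on the toy data from the question; the helper name count_itemset is made up for illustration and is not part of arules.

```r
# Toy data from the question
dat <- data.frame(
  Customer = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4),
  Product  = c("A", "B", "C", "D", "E", "F", "A", "B", "D", "A", "B")
)

# Logical incidence matrix: one row per customer, one column per product
inc <- unclass(table(dat)) > 0

# Hypothetical helper: number of customers whose basket contains every item
count_itemset <- function(inc, items) {
  sum(rowSums(inc[, items, drop = FALSE]) == length(items))
}

count_itemset(inc, c("A", "B"))       # 3
count_itemset(inc, c("A", "B", "C"))  # 1
```

The two counts agree with the {A,B} and {A,B,C} rows in the output above.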
Timing on larger data set: 10000 customers with up to 6 products each. apriori is roughly 70 to 160 times faster by mean run time.
# Unit: milliseconds
# expr min lq mean median uq max neval
# f_henrik(dat) 38.95475 39.8621 41.44454 40.67313 41.05565 57.64655 20
# f_allan(dat) 4578.20595 4622.2363 4664.57187 4654.58713 4679.78119 4924.22537 20
# f_jay(dat) 2799.10516 2939.9727 2995.90038 2971.24127 2999.82019 3444.70819 20
# f_uwe_dt(dat) 2943.26219 3007.1212 3028.37550 3027.46511 3060.38380 3076.25664 20
# f_uwe_dplyr(dat) 6339.03141 6375.7727 6478.77979 6448.56399 6521.54196 6816.09911 20
10000 customers with up to 10 products each. apriori is several hundred times faster (roughly 385 to 900 times by mean run time).
# Unit: milliseconds
# expr min lq mean median uq max neval
# f_henrik(dat) 58.40093 58.95241 59.71129 59.63988 60.43591 61.21082 20
# f_jay(dat) 52824.67760 53369.78899 53760.43652 53555.69881 54049.91600 55605.47980 20
# f_uwe_dt(dat) 22612.87954 22820.12012 22998.85072 22974.32710 23220.00390 23337.22815 20
# f_uwe_dplyr(dat) 26083.20240 26255.88861 26445.49295 26402.67887 26659.81195 27046.83491 20
On the second, larger data set (up to 10 products per customer), Allan's code gave warnings (In rawToBits(as.raw(x)) : out-of-range values treated as 0 in coercion to raw), which seemed to affect the result. Thus, it is not included in the second benchmark.
Data and benchmark code:
set.seed(3)
n_cust = 10000
n_product = sample(2:6, n_cust, replace = TRUE) # 2:10 in second run
dat = data.frame(
Customer = rep(1:n_cust, n_product),
Product = unlist(lapply(n_product, function(n) sample(letters[1:6], n)))) # 1:10 in 2nd run
library(microbenchmark)
res = microbenchmark(f_henrik(dat),
f_allan(dat),
f_jay(dat),
f_uwe_dt(dat),
f_uwe_dplyr(dat),
times = 20L)
Check for equality:
henrik = f_henrik(dat)
allan = f_allan(dat)
jay = f_jay(dat)
uwe_dt = f_uwe_dt(dat)
uwe_dplyr = f_uwe_dplyr(dat)
# change outputs to common format for comparison
# e.g. string format, column names, order
henrik$items = substr(henrik$items, 2, nchar(henrik$items) - 1)
henrik$items = gsub(",", ", ", henrik$items)
l = list(
henrik = henrik, allan = allan, jay = jay, uwe_dt = uwe_dt, uwe_dplyr = uwe_dplyr)
l = lapply(l, function(d){
d = setNames(as.data.frame(d), c("items", "count"))
d = d[order(d$items), ]
row.names(d) = NULL
d
})
all.equal(l[["henrik"]], l[["allan"]])
# TRUE
all.equal(l[["henrik"]], l[["jay"]])
# TRUE
all.equal(l[["henrik"]], l[["uwe_dt"]])
# TRUE
all.equal(l[["henrik"]], l[["uwe_dplyr"]])
# TRUE
Functions:
f_henrik = function(dat){
r = apriori(data = as.matrix(table(dat) > 0),
par = list(target = "frequent itemsets",
support = 0,
minlen = 2))
d = as(r, "data.frame")
d[d$count > 0, c("items", "count")]
}
f_allan = function(dat){
all_multiples <- function(strings)
{
n <- length(strings)
do.call("c", sapply(1:2^n, function(x) {
mystrings <- strings[as.character(rawToBits(as.raw(x))[seq(n)]) == "01"]
if (length(mystrings) > 1) paste(mystrings, collapse = ", ") else NULL
}))
}
dat %>%
group_by(Customer) %>%
arrange(Product) %>%
summarize(Product_group = all_multiples(Product)) %>%
group_by(Product_group) %>%
count(Product_group)
}
f_jay = function(dat){
a <- split(dat$Product, dat$Customer) ## thx to @Henrik
r <- range(lengths(a))
pr <- unlist(lapply(r[1]:r[2], function(x)
combn(unique(dat$Product), x, list)), recursive=F)
or <- rowSums(outer(pr, a, Vectorize(function(x, y) all(x %in% y))))
res <- data.frame(p.group=sapply(pr, toString), number=or)
res[res$number > 0, ]
}
f_uwe_dt = function(dat){
setorder(setDT(dat), Customer, Product)
dat[, .(Product.Group = unlist(lapply(tail(seq(.N), -1L),
function(m) combn(unique(Product), m, toString, FALSE)))),
by = Customer][
, .N, by = Product.Group]
}
f_uwe_dplyr = function(dat){
dat %>%
arrange(Customer, Product) %>%
group_by(Customer) %>%
summarise(Product.Group = n() %>%
seq() %>%
tail(-1L) %>%
lapply(function(m) combn(unique(Product), m, toString, FALSE)) %>%
unlist()) %>%
ungroup() %>%
count(Product.Group)
}

If you define a little helper function that gets all multiple groupings:
all_multiples <- function(strings)
{
n <- length(strings)
do.call("c", sapply(1:2^n, function(x) {
mystrings <- strings[as.character(rawToBits(as.raw(x))[seq(n)]) == "01"]
if (length(mystrings) > 1) paste(mystrings, collapse = ", ") else NULL
}))
}
then you can do this nicely in a tidyverse pipe:
dat %>%
group_by(Customer) %>%
arrange(Product) %>%
summarize(Product_group = all_multiples(Product)) %>%
group_by(Product_group) %>%
count(Product_group)
#> # A tibble: 11 x 2
#> # Groups: Product_group [11]
#> Product_group n
#> <chr> <int>
#> 1 A, B 3
#> 2 A, B, C 1
#> 3 A, B, D 1
#> 4 A, C 1
#> 5 A, D 1
#> 6 B, C 1
#> 7 B, D 1
#> 8 D, E 1
#> 9 D, E, F 1
#> 10 D, F 1
#> 11 E, F 1
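One caveat with this helper: as.raw() can only hold values from 0 to 255, so bit masks above 255 are coerced to 0 (with a warning) and subsets are lost once a customer has more than eight products. A possible workaround, sketched here under the assumption that no customer has more than about 20 products, is to take the bits from intToBits() instead. The name all_multiples_int is hypothetical.

```r
# Variant of all_multiples() that reads the subset mask from intToBits(),
# which returns 32 bits per integer (least significant bit first)
all_multiples_int <- function(strings) {
  n <- length(strings)
  stopifnot(n <= 20)  # all 2^n - 1 masks are enumerated, so keep n small
  out <- lapply(seq_len(2^n - 1), function(x) {
    picked <- strings[intToBits(x)[seq_len(n)] == as.raw(1)]
    # keep only groups of two or more products
    if (length(picked) > 1) paste(picked, collapse = ", ") else NULL
  })
  unlist(out)
}

all_multiples_int(c("A", "B", "C"))
# "A, B"    "A, C"    "B, C"    "A, B, C"
```

It can be dropped into the same pipe in place of all_multiples(). The number of subsets still doubles with every extra product, so this only removes the 8-product limit, not the cost.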

For the sake of completeness, here is a solution in data.table syntax, which can be translated to dplyr syntax as well.
For both implementations, the core idea is the same:
1) Sort by Product (an important step that the other answers posted so far have neglected).
2) For each Customer, create the product groups by using combn() with varying lengths m. Product.Group is a kind of natural key created by concatenating the included products using the toString() function. Here, we can see why sorting Product is important: products B, A as well as A, B should appear in the same product group A, B.
3) Finally, count the number of occurrences by Product.Group.
data.table version
library(data.table)
setorder(setDT(df), Customer, Product)
df[, .(Product.Group = unlist(lapply(tail(seq(.N), -1L),
function(m) combn(unique(Product), m, toString, FALSE)))),
by = Customer][
, .N, by = Product.Group]
Product.Group N
1: A, B 3
2: A, C 1
3: B, C 1
4: A, B, C 1
5: D, E 1
6: D, F 1
7: E, F 1
8: D, E, F 1
9: A, D 1
10: B, D 1
11: A, B, D 1
dplyr version
library(dplyr)
df %>%
arrange(Customer, Product) %>%
group_by(Customer) %>%
summarise(Product.Group = n() %>%
seq() %>%
tail(-1L) %>%
lapply(function(m) combn(unique(Product), m, toString, FALSE)) %>%
unlist()) %>%
ungroup() %>%
count(Product.Group)
Product.Group n
<chr> <int>
1 A, B 3
2 A, B, C 1
3 A, B, D 1
4 A, C 1
5 A, D 1
6 B, C 1
7 B, D 1
8 D, E 1
9 D, E, F 1
10 D, F 1
11 E, F 1
Data
library(data.table)
df <- fread("
Customer Product
1 A
1 B
1 C
2 D
2 E
2 F
3 A
3 B
3 D
4 A
4 B")
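The point about sorting can be checked on a tiny made-up example in base R. In this sketch (the data frame unsorted and the key vectors are illustrative names only), two customers buy the same two products but the rows arrive in a different order:

```r
# Two customers with the same basket, listed in different row order
unsorted <- data.frame(Customer = c(1, 1, 2, 2),
                       Product  = c("A", "B", "B", "A"))
baskets <- split(unsorted$Product, unsorted$Customer)

# Without sorting, the concatenated keys differ between the customers
keys_raw <- vapply(baskets, function(p) combn(p, 2, toString), character(1))

# With sorting, both customers map to the same key
keys_sorted <- vapply(baskets, function(p) combn(sort(p), 2, toString), character(1))

unname(keys_raw)     # "A, B" "B, A"
unname(keys_sorted)  # "A, B" "A, B"
```

Counting keys_raw would report two groups with one customer each, while keys_sorted gives one group with two customers.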

You could split the data by customer, then get all combinations of product pairs and triples using combn. Then find matches using %in% with outer, create a data frame by collapsing products using toString, and finally discard the combinations whose count is zero.
# a <- aggregate(Product ~ Customer, dat, I)$Product ## old solution
# if (is.matrix(a)) a <- as.data.frame(t(a)) ## old solution
a <- split(dat$Product, dat$Customer) ## thx to @Henrik
r <- range(lengths(a))
pr <- unlist(lapply(r[1]:r[2], function(x)
combn(unique(dat$Product), x, list)), recursive=F)
or <- rowSums(outer(pr, a, Vectorize(function(x, y) all(x %in% y))))
res <- data.frame(p.group=sapply(pr, toString), number=or)
res[res$number > 0, ]
# p.group number
# 1 A, B 3
# 2 A, C 1
# 3 A, D 1
# 6 B, C 1
# 7 B, D 1
# 13 D, E 1
# 14 D, F 1
# 15 E, F 1
# 16 A, B, C 1
# 17 A, B, D 1
# 35 D, E, F 1
Data
dat <- read.table(header=TRUE, text="Customer Product
1 A
1 B
1 C
2 D
2 E
2 F
3 A
3 B
3 D
4 A
4 B")

Related

Defining indices for row sequences more succinctly

I have a dataframe like this:
set.seed(123)
df <- data.frame(A = sample(LETTERS[1:5], 50, replace = TRUE),
B = sample(LETTERS[1:5], 50, replace = TRUE))
I want to filter the dataframe on two parameters: (i) the target rows that match a certain criterion and (ii) a certain number of rows that precede the target rows. Specifically, I want to filter rows where A == "A" & B == "A" as well as the five rows preceding the target row. I can do this with a two-step operation: first by defining a function, and second by using the function as input for slice:
Sequ <- function(col1, col2) {
# get row indices of target row with function `which`
inds <- which(col1 == "A" & col2 == "A")
# sort row indices of the rows before target row AND target row itself
sort(unique(c(inds-5, inds-4, inds-3,inds-2, inds-1, inds)))
}
library(dplyr)
df %>%
slice(Sequ(col1 = A, col2 = B))
A B
1 D C
2 D B
3 C B
4 C D
5 B B
6 A A
7 E B
8 E D
9 D C
10 D D
11 A A
12 C C
13 D E
14 B E
15 B E
16 B A
17 A A
18 C D
19 C B
20 B D
21 A B
22 A A
But surely there must be a more efficient replacement for this part: sort(unique(c(inds-5, inds-4, inds-3,inds-2, inds-1, inds))). If I want to filter not just the preceding 5 but, say, 10 or 100 rows, this way of defining each index individually quickly becomes impractical. How can this part be coded more economically?

1) Define bothA which takes a matrix and returns TRUE if any row is all A's. Then use rollapply to apply it as a moving window.
library(zoo)
bothA <- function(x) any(rowSums(rbind(x) == "A") == 2)
ok <- rollapply(df, 6, bothA, align = "left", partial = TRUE, by.column = FALSE)
df[ok, ]
2) or in a pipe
df %>%
filter(rollapply(., 6, bothA, align = "left", partial = TRUE, by.column = FALSE))
3) This also works:
ok <- rollapply(rowSums(df == "A") == 2, 6, any, align = "left", partial = TRUE)
df[ok, ]

Here is a dplyr solution that can be directly used in a pipe, with no need for filter.
Sequ <- function(x, col1, col2, value = "A"){
x %>%
mutate(grp = lag(cumsum({{col1}} == value & {{col2}} == value), default = 0)) %>%
group_by(grp) %>%
slice_tail(n = 5) %>%
ungroup() %>%
select(-grp)
}
df %>% Sequ(A, B)
## A tibble: 23 x 2
# A B
# <chr> <chr>
# 1 B D
# 2 C C
# 3 E A
# 4 D B
# 5 A A
# 6 C D
# 7 E E
# 8 C E
# 9 C C
#10 A A
## … with 13 more rows

One dplyr and purrr solution could be:
df %>%
filter(row_number() %in% unlist(map(which(A == "A" & B == "A"), ~ (.x-5):.x)))
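
For comparison, the index vector from the question can also be generalised in base R without any extra package. The following is a minimal sketch; the function name rows_with_lead_in, its arguments, and the toy data frame are made up for illustration.

```r
# Row indices of every hit plus the n_before rows preceding it
rows_with_lead_in <- function(hit, n_before) {
  inds <- which(hit)
  offsets <- 0:n_before
  # subtract every offset from every hit position
  idx <- rep(inds, each = length(offsets)) - rep(offsets, times = length(inds))
  # drop positions before the first row, then de-duplicate overlapping windows
  sort(unique(idx[idx >= 1]))
}

toy <- data.frame(A = c("B", "C", "A", "D", "E", "B", "C", "A"),
                  B = c("A", "C", "A", "B", "B", "D", "E", "A"))

keep <- rows_with_lead_in(toy$A == "A" & toy$B == "A", n_before = 2)
keep        # 1 2 3 6 7 8
toy[keep, ]
```

The result can be passed to slice() just like the output of the original Sequ(), and n_before can be set to 10 or 100 without changing the code.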

How to relocate several columns in one step using dplyr::relocate?

I would like to reorder some columns to come after a particular other column using dplyr::relocate. Here is a MWE:
a <- letters[1:3]
b <- letters[4:6]
c <- letters[7:9]
d <- letters[10:12]
mytib <- tibble::tibble(a,b,c,d)
# A tibble: 3 x 4
# a b c d
# <chr> <chr> <chr> <chr>
# 1 a d g j
# 2 b e h k
# 3 c f i l
mytib %>%
relocate(c, .after = a)
This example works but is there a way that I could, with one relocate command, move c after a and, say, d after b?
I tried the following without success:
mytib %>%
relocate(c(c, d), .after(c(a, b)))
Edit 1: I explicitly ask about relocate because functions like select do not work for large datasets where all I know is after which column (name) I want to insert a column.
Edit 2: This is my expected output:
# A tibble: 3 x 4
# a c b d
# <chr> <chr> <chr> <chr>
# 1 a g d j
# 2 b h e k
# 3 c i f l

As dplyr::relocate itself apparently doesn't allow relocating in pairs, you can "hack" this behavior by preparing a list of column pairs like the ones you describe ("c after a" & "d after b") and reduce over that list, passing your df in as an .init value and in each reduce-step relocating one pair.
Like this:
library(dplyr)
library(purrr)
df_relocated <- reduce(
.x = list(c('c','a'), c('d','b')),
.f = ~ relocate(.x, .y[1], .after = .y[2]),
.init = mytib
)
This produces a tibble just as you expect it:
> df_relocated
# A tibble: 3 x 4
a c b d
<chr> <chr> <chr> <chr>
1 a g d j
2 b h e k
3 c i f l

In case you want to work with two lists, where element 1 of list 2 should be relocated after element 1 of list 1 and so forth, this would be a solution:
reduce2(
.x = c("a", "b"),
.y = c("c", "d"),
.f = ~ relocate(..1, ..3, .after = ..2),
.init = mytib
)

Creating Nodes and Edges Dataframes from Tidy Dataframes

I have a data frame that's of this structure:
df <- data.frame(var1 = c(1,1,1,2,2,3,3,3,3),
cat1 = c("A","B","D","B","C","D","E","B","A"))
> df
var1 cat1
1 1 A
2 1 B
3 1 D
4 2 B
5 2 C
6 3 D
7 3 E
8 3 B
9 3 A
And I am looking to create both nodes and edges data frames from it, so that I can draw a network graph, using VisNetwork. This network will show the number/strength of connections between the different cat1 values, as grouped by the var1 value.
I have the nodes data frame sorted:
nodes <- data.frame(id = unique(df$cat1))
> nodes
id
1 A
2 B
3 D
4 C
5 E
What I'd like help with is how to process df in the following manner:
for each distinct value of var1 in df, tally up the group of nodes that are common to that value of var1 to give an edges dataframe that ultimately looks like the one below. Note that I'm not bothered about the direction of flow along the edges. Just that they are connected is all I need.
> edges
from to value
1 A B 2
2 A D 2
3 A E 1
4 B C 1
5 B D 2
6 B E 1
7 D E 1
With thanks in anticipation,
Nevil
Update: I found here a similar problem, and have adapted that code to give, which is getting close to what I want, but not quite there...
> df %>% group_by(var1) %>%
filter(n()>=2) %>% group_by(var1) %>%
do(data.frame(t(combn(.$cat1, 2,function(x) sort(x))),
stringsAsFactors=FALSE))
# A tibble: 10 x 3
# Groups: var1 [3]
var1 X1 X2
<dbl> <chr> <chr>
1 1. A B
2 1. A D
3 1. B D
4 2. B C
5 3. D E
6 3. B D
7 3. A D
8 3. B E
9 3. A E
10 3. A B

I don't know if there is already a suitable function to achieve this task. Here is a detailed procedure to do it. With this, you should be able to define your own function. Hope it helps!
# create an adjacency matrix
mat <- table(df)
mat <- t(mat) %*% mat
as.table(mat) # look at your adjacency matrix
# since the network is not directed, we can consider only the (strictly) upper triangular matrix
mat[lower.tri(mat, diag = TRUE)] <- 0
as.table(mat) # look at the new adjacency matrix
library(dplyr)
edges <- as.data.frame(as.table(mat))
edges <- filter(edges, Freq != 0)
colnames(edges) <- c("from", "to", "value")
edges <- arrange(edges, from)
edges # output
# from to value
#1 A B 2
#2 A D 2
#3 A E 1
#4 B C 1
#5 B D 2
#6 B E 1
#7 D E 1
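To see why the matrix product counts shared groups, here is a small base R check on the data from the question. It is a sketch that assumes each (var1, cat1) pair occurs at most once, so the incidence matrix only holds 0 and 1; the names inc and co are illustrative.

```r
df <- data.frame(var1 = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
                 cat1 = c("A", "B", "D", "B", "C", "D", "E", "B", "A"))

# Incidence matrix: one row per var1 group, one column per cat1 value
inc <- unclass(table(df))

# crossprod(inc) is t(inc) %*% inc: entry [i, j] sums, over all groups,
# the product of the 0/1 indicators for categories i and j
co <- crossprod(inc)

co["A", "B"]  # 2: A and B appear together in groups 1 and 3
co["B", "C"]  # 1: B and C appear together only in group 2
diag(co)      # number of groups each category appears in
```

Off-diagonal entries are the edge weights; the diagonal is why the procedure above zeroes out the lower triangle including the diagonal before building the edge list.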

here's a couple other ways...
in base R...
values <- unique(df$var1[duplicated(df$var1)])
do.call(rbind,
lapply(values, function(i) {
nodes <- as.character(df$cat1[df$var1 == i])
edges <- combn(nodes, 2)
data.frame(from = edges[1, ],
to = edges[2, ],
value = i,
stringsAsFactors = F)
})
)
in tidyverse...
library(dplyr)
library(tidyr)
df %>%
group_by(var1) %>%
filter(n() >= 2) %>%
mutate(cat1 = as.character(cat1)) %>%
summarise(edges = list(data.frame(t(combn(cat1, 2)), stringsAsFactors = F))) %>%
unnest(edges) %>%
select(from = X1, to = X2, value = var1)
in tidyverse using tidyr::complete...
library(dplyr)
library(tidyr)
df %>%
group_by(var1) %>%
mutate(cat1 = as.character(cat1)) %>%
mutate(i.cat1 = cat1) %>%
complete(cat1, i.cat1) %>%
filter(cat1 < i.cat1) %>%
select(from = cat1, to = i.cat1, value = var1)
in tidyverse using tidyr::expand...
library(dplyr)
library(tidyr)
df %>%
group_by(var1) %>%
mutate(cat1 = as.character(cat1)) %>%
expand(cat1, to = cat1) %>%
filter(cat1 < to) %>%
select(from = cat1, to, value = var1)

parent child structure in R dataframe

I have a csv that contains an org structure as follows, plus some additional columns. I use R to create charts and it works great!
The challenge is when trying to create the charts for a subset, i.e. one manager and its children/grandchildren.
Is there any filtering that is possible in dplyr or any alternative package?
Sample format:
emp_id mgr_id nest_id
A A 0
B A 1
C B 2
D C 3
D1 D 4
D2 D 4
E C 3
E1 E 4
F C 3
G B 2
H G 3
The subset I need is for manager "C"
Scenario 1: emp_id == C should contain all of the nodes 'D', 'D1', 'D2', 'E', 'E1', 'F'
expected structure:
manager all_children
C D
C D1
C D2
C E
C E1
C F
Scenario 2: emp_id == C should contain all above nodes but retain the mgr_id structure for 'D', 'E'
expected structure:
manager all_children
C D
C E
C F
D D1
D D2
E E1

Consider base R's by(), which creates a list of data frames, one for every level of mgr_id (not just C):
SCENARIO 1
dfList <- by(df, df$mgr_id, function(i){
names(i) <- paste0(names(i), "_") # SUFFIX UNDERSCORE (TO AVOID DUP COLUMNS)
child <- merge(i, df, by.x="mgr_id_", by.y="emp_id")[,1:2]
grandchild <- merge(child, df, by.x="emp_id_", by.y="mgr_id")[c("mgr_id_", "emp_id")]
names(child) <- sub("_$", "", names(child)) # REMOVE LAST UNDERSCORE
names(grandchild) <- sub("_$", "", names(grandchild)) # REMOVE LAST UNDERSCORE
rbind(child, grandchild)
})
dfList$C
# mgr_id emp_id
# 1 C D
# 2 C E
# 3 C F
# 4 C D1
# 5 C D2
# 6 C E1
SCENARIO 2 (where the selected columns change in grandchild and then first column rename)
dfList <- by(df, df$mgr_id, function(i){
names(i) <- paste0(names(i), "_") # SUFFIX UNDERSCORE (TO AVOID DUP COLUMNS)
child <- merge(i, df, by.x="mgr_id_", by.y="emp_id")[,1:2]
grandchild <- merge(child, df, by.x="emp_id_", by.y="mgr_id")[c("emp_id_", "emp_id")]
names(child) <- sub("_$", "", names(child)) # REMOVE LAST UNDERSCORE
names(grandchild) <- sub("_$", "", names(grandchild)) # REMOVE LAST UNDERSCORE
names(grandchild)[1] <- "mgr_id"
rbind(child, grandchild)
})
dfList$C
# mgr_id emp_id
# 1 C D
# 2 C E
# 3 C F
# 4 D D1
# 5 D D2
# 6 E E1
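The two merges above cover exactly two levels below the manager (children and grandchildren). If the hierarchy can be deeper, one option is to walk it level by level. The following base R sketch assumes the emp_id/mgr_id columns from the question; the names org and descendants are hypothetical.

```r
org <- data.frame(
  emp_id = c("A", "B", "C", "D", "D1", "D2", "E", "E1", "F", "G", "H"),
  mgr_id = c("A", "A", "B", "C", "D",  "D",  "C", "E",  "C", "B", "G"),
  stringsAsFactors = FALSE
)

# Collect everyone below `manager`, one level of the tree per iteration
descendants <- function(org, manager) {
  found <- character(0)
  frontier <- manager
  while (length(frontier) > 0) {
    # direct reports of the current level, skipping anyone already seen
    # (this also guards against the root reporting to itself)
    kids <- org$emp_id[org$mgr_id %in% frontier &
                         !org$emp_id %in% c(found, manager)]
    found <- c(found, kids)
    frontier <- kids
  }
  found
}

below_c <- descendants(org, "C")

# Scenario 1: every descendant listed under manager C
scenario1 <- data.frame(manager = "C", all_children = below_c)

# Scenario 2: same nodes, but keeping each node's direct manager
scenario2 <- org[org$emp_id %in% below_c, c("mgr_id", "emp_id")]
```

The loop stops as soon as a level has no further reports, so it needs no knowledge of nest_id or of the maximum depth.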

Here is one solution using functions from dplyr and data.table. dt3 is the output for scenario 1, while dt4 is the output for scenario 2.
# Load packages
library(dplyr)
library(data.table)
# Create example data frame
dt <- read.table(text = "emp_id mgr_id nest_id
A A 0
B A 1
C B 2
D C 3
D1 D 4
D2 D 4
E C 3
E1 E 4
F C 3
G B 2
H G 3",
header = TRUE, stringsAsFactors = FALSE)
# Process the data
dt2 <- dt %>%
# Filter levels lower than 1
filter(nest_id > 1) %>%
mutate(group_id = ifelse(nest_id > 2, 0, 1)) %>%
# Create "run_id", which will be used to fill manager label
mutate(run_id = rleid(group_id)) %>%
mutate(run_id = ifelse(run_id %% 2 == 0, run_id - 1, run_id)) %>%
group_by(run_id) %>%
mutate(manager = first(emp_id)) %>%
# Select for manager C
filter(manager %in% "C") %>%
ungroup() %>%
# Remove rows if manager == emp_id
filter(manager != emp_id) %>%
rename(all_children = emp_id)
# Scenario 1
dt3 <- dt2 %>% select(manager, all_children)
# Scenario 2
dt4 <- dt2 %>%
select(manager = mgr_id, all_children) %>%
arrange(manager, all_children)

How can I replace a factor levels with the top n levels (by number of occurrences)

This question is related to How can I replace a factor levels with the top n levels (by some metric), plus [other]?. As a metric I want to use the number of occurrences of the factor. I know I can do it by making a list of the occurrences, but I was wondering if there is a prettier way.
Example:
library(data.table);
library(plyr);
fac <- data.table(score = as.factor(c(3,4,5,3,3,3,5)));
ocCnt <- data.table(lapply(fac,count)$score);
fac$occurrence <- 0;
for(i in 1:length(fac$score)){fac$occurrence[i]<-ocCnt[x==fac$score[i]]$freq};
Then I could use the function described in the referenced question/answer:
hotfactor= function(fac,by,n=10,o="other") {
levels(fac)[rank(-xtabs(by~fac))[levels(fac)]>n] <- o
fac
}
To continue the example, if we want only to see the most popular factor we do:
hotfactor(fac$score,fac$occurrence,1);
To get the answer:
[1] 3 other other 3 3 3 other
Levels: 3 other
So my question is, can I do this without having to add a list which counts the occurrences?
Note that I want to do this for the n most popular factors (not just for the most popular factor).

Use table and which.max:
score <- factor(c(3,4,5,3,3,3,5))
levels(score)[- which.max(table(score))] <- "other"
#[1] 3 other other 3 3 3 other
#Levels: 3 other
Obviously this breaks ties by taking the first maximum value.
If you want to keep the top two levels:
score <- factor(c(3, 4,5,3,3,3,5), levels =c(4,3,5))
levels(score)[!levels(score) %in% names(sort(table(score), decreasing = TRUE)[1:2])] <- "other"
#[1] 3 other 5 3 3 3 5
#Levels: other 3 5
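The top-two line above can be wrapped into a small function for arbitrary n. This is a sketch rather than a drop-in replacement for hotfactor; the name keep_top_levels and its arguments are made up here.

```r
# Keep the n most frequent levels of a factor and lump the rest together
keep_top_levels <- function(f, n, other = "other") {
  stopifnot(is.factor(f))
  counts <- sort(table(f), decreasing = TRUE)
  keep <- names(counts)[seq_len(min(n, length(counts)))]
  # assigning the same label to several levels merges them into one level
  levels(f)[!levels(f) %in% keep] <- other
  f
}

score <- factor(c(3, 4, 5, 3, 3, 3, 5))
as.character(keep_top_levels(score, 1))
# "3" "other" "other" "3" "3" "3" "other"
as.character(keep_top_levels(score, 2))
# "3" "other" "5" "3" "3" "3" "5"
```

No separate occurrence column is needed, since table() does the counting inside the function.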

If you don't know how many levels you need to cover, say, 90% of your data and are willing to use dplyr, you could do something along the following lines:
library(dplyr)
df <- data.frame(
f = factor(mapply(rep, letters[1:5], 2^(1:5)) %>% unlist(use.names = F))
)
df %>%
count(f, sort = T) %>%
mutate(p = cumsum(n) / nrow(df))
# A tibble: 5 x 3
# f n p
# <fctr> <int> <dbl>
# 1 e 32 0.5161290
# 2 d 16 0.7741935
# 3 c 8 0.9032258
# 4 b 4 0.9677419
# 5 a 2 1.0000000
(top <- df %>%
count(f, sort = T) %>%
mutate(p = cumsum(n) / nrow(df)) %>%
filter(cumall(p < .91)) %>%
select(f) %>%
unlist(use.names = F))
# [1] e d c
# Levels: a b c d e
levels(df$f) <- c(levels(df$f), 'z')
df$f[!df$f %in% top] <- 'z'
df %>%
count(f, sort = T) %>%
mutate(p = cumsum(n) / nrow(df))
# A tibble: 4 x 3
# f n p
# <fctr> <int> <dbl>
# 1 e 32 0.5161290
# 2 d 16 0.7741935
# 3 c 8 0.9032258
# 4 z 6 1.0000000
