How to relocate several columns in one step using dplyr::relocate?

I would like to reorder some columns to come after a particular other column using dplyr::relocate. Here is a MWE:
a <- letters[1:3]
b <- letters[4:6]
c <- letters[7:9]
d <- letters[10:12]
mytib <- tibble::tibble(a,b,c,d)
# A tibble: 3 x 4
#   a     b     c     d
#   <chr> <chr> <chr> <chr>
# 1 a     d     g     j
# 2 b     e     h     k
# 3 c     f     i     l
mytib %>%
  relocate(c, .after = a)
This example works but is there a way that I could, with one relocate command, move c after a and, say, d after b?
I tried the following without success:
mytib %>%
  relocate(c(c, d), .after(c(a, b)))
Edit 1: I explicitly ask about relocate because functions like select are not practical for large datasets, where all I know is the name of the column after which I want to insert another column.
Edit 2: This is my expected output:
# A tibble: 3 x 4
#   a     c     b     d
#   <chr> <chr> <chr> <chr>
# 1 a     g     d     j
# 2 b     h     e     k
# 3 c     i     f     l

As dplyr::relocate itself apparently doesn't allow relocating in pairs, you can "hack" this behavior by preparing a list of column pairs like the ones you describe ("c after a" & "d after b") and reducing over that list, passing your df in as the .init value and relocating one pair per reduce step.
Like this:
library(dplyr)
library(purrr)
df_relocated <- reduce(
  .x = list(c('c', 'a'), c('d', 'b')),
  .f = ~ relocate(.x, .y[1], .after = .y[2]),
  .init = mytib
)
This produces a tibble just as you expect it:
> df_relocated
# A tibble: 3 x 4
  a     c     b     d
  <chr> <chr> <chr> <chr>
1 a     g     d     j
2 b     h     e     k
3 c     i     f     l

In case you want to work with two lists, where element 1 of list 2 should be relocated after element 1 of list 1 and so forth, this would be a solution:
reduce2(
  .x = c("a", "b"),
  .y = c("c", "d"),
  .f = ~ relocate(..1, ..3, .after = ..2),
  .init = mytib
)
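If you need this more often, the same pattern can be wrapped in a small helper (a sketch; relocate_after() is a made-up name, not a dplyr function):
library(dplyr)
library(purrr)

# move each column in `cols` after the matching column in `after`
relocate_after <- function(df, cols, after) {
  reduce2(cols, after,
          ~ relocate(..1, ..2, .after = ..3),
          .init = df)
}

relocate_after(mytib, cols = c("c", "d"), after = c("a", "b"))
# same result: columns ordered a, c, b, d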

Related

Count common sets of items between different customers

I have data on customers and the different products they have purchased:
Customer Product
1 A
1 B
1 C
2 D
2 E
2 F
3 A
3 B
3 D
4 A
4 B
I would like to check which sets of products occur together across different customers. I want to get the count for product combinations of different lengths. For example, the product combination A and B together occurs in three different customers; the product group A, B and C occurs in one customer. And so on for all different sets of 2 or more products in the data. Something like:
Product Group   Number
A, B, C         1
D, E, F         1
A, B, D         1
A, B            3
Thus, I'm counting the A, B combination in customers who only have product A and B (e.g. customer 4), and in customers who have A and B, but also any other product (e.g. customer 1, who has A, B and C).
Does anyone have any ideas how to do that with either a tidyverse or base R approach? I feel like it ought to be pretty trivial - maybe pivot_wider first, then count?
I have found this question and answer that can do what I need for pairs of products, but I need to count combinations also for more products than two.
If you have the possibility to use a non-base package, you can use a tool dedicated for the task of finding item sets: arules::apriori. It is much faster on larger data sets.
library(arules)
# coerce data frame to binary incidence matrix
# use apriori to get "frequent itemsets"
r = apriori(data = as.matrix(table(dat) > 0),
            # set: type of association mined, minimal support needed of an item set,
            # minimal number of items per item set
            par = list(target = "frequent itemsets",
                       support = 0,
                       minlen = 2))
# coerce itemset to data.frame, select relevant rows and columns
d = as(r, "data.frame")
d[d$count > 0, c("items", "count")]
# items count
# 4 {B,C} 1
# 5 {A,C} 1
# 6 {E,F} 1
# 7 {D,E} 1
# 10 {D,F} 1
# 13 {B,D} 1
# 14 {A,D} 1
# 15 {A,B} 3
# 25 {A,B,C} 1
# 26 {D,E,F} 1
# 35 {A,B,D} 1
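As a side note, if you just want to look at the itemsets rather than convert them to a data.frame, arules also provides inspect() and sort():
inspect(sort(r, by = "support"))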
Timing on larger data set: 10000 customers with up to 6 products each. apriori is quite a lot faster.
# Unit: milliseconds
#              expr        min         lq       mean     median         uq        max neval
#     f_henrik(dat)   38.95475    39.8621   41.44454   40.67313   41.05565   57.64655    20
#      f_allan(dat) 4578.20595  4622.2363 4664.57187 4654.58713 4679.78119 4924.22537    20
#        f_jay(dat) 2799.10516  2939.9727 2995.90038 2971.24127 2999.82019 3444.70819    20
#     f_uwe_dt(dat) 2943.26219  3007.1212 3028.37550 3027.46511 3060.38380 3076.25664    20
#  f_uwe_dplyr(dat) 6339.03141  6375.7727 6478.77979 6448.56399 6521.54196 6816.09911    20
10000 customers with up to 10 products each. apriori is several hundred times faster.
# Unit: milliseconds
#              expr         min          lq        mean      median          uq         max neval
#     f_henrik(dat)    58.40093    58.95241    59.71129    59.63988    60.43591    61.21082    20
#        f_jay(dat) 52824.67760 53369.78899 53760.43652 53555.69881 54049.91600 55605.47980    20
#     f_uwe_dt(dat) 22612.87954 22820.12012 22998.85072 22974.32710 23220.00390 23337.22815    20
#  f_uwe_dplyr(dat) 26083.20240 26255.88861 26445.49295 26402.67887 26659.81195 27046.83491    20
On the larger data set, Allan's code gave warnings (In rawToBits(as.raw(x)) : out-of-range values treated as 0 in coercion to raw) which seemed to affect the result. Thus, it is not included in the second benchmark.
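The warning is easy to reproduce: as.raw() only represents integers 0 to 255, so the subset enumeration in all_multiples(), which runs x over 1:2^n, overflows once 2^n exceeds 255, i.e. with 8 or more products per customer:
as.raw(256)
# Warning message:
# out-of-range values treated as 0 in coercion to raw
# [1] 00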
Data and benchmark code:
set.seed(3)
n_cust = 10000
n_product = sample(2:6, n_cust, replace = TRUE) # 2:10 in second run
dat = data.frame(
  Customer = rep(1:n_cust, n_product),
  Product = unlist(lapply(n_product, function(n) sample(letters[1:6], n)))) # 1:10 in 2nd run
library(microbenchmark)
res = microbenchmark(f_henrik(dat),
                     f_allan(dat),
                     f_jay(dat),
                     f_uwe_dt(dat),
                     f_uwe_dplyr(dat),
                     times = 20L)
Check for equality:
henrik = f_henrik(dat)
allan = f_allan(dat)
jay = f_jay(dat)
uwe_dt = f_uwe_dt(dat)
uwe_dplyr = f_uwe_dplyr(dat)
# change outputs to common format for comparison
# e.g. string format, column names, order
henrik$items = substr(henrik$items, 2, nchar(henrik$items) - 1)
henrik$items = gsub(",", ", ", henrik$items)
l = list(
  henrik = henrik, allan = allan, jay = jay, uwe_dt = uwe_dt, uwe_dplyr = uwe_dplyr)
l = lapply(l, function(d){
  d = setNames(as.data.frame(d), c("items", "count"))
  d = d[order(d$items), ]
  row.names(d) = NULL
  d
})
all.equal(l[["henrik"]], l[["allan"]])
# TRUE
all.equal(l[["henrik"]], l[["jay"]])
# TRUE
all.equal(l[["henrik"]], l[["uwe_dt"]])
# TRUE
all.equal(l[["henrik"]], l[["uwe_dplyr"]])
# TRUE
Functions:
f_henrik = function(dat){
  r = apriori(data = as.matrix(table(dat) > 0),
              par = list(target = "frequent itemsets",
                         support = 0,
                         minlen = 2))
  d = as(r, "data.frame")
  d[d$count > 0, c("items", "count")]
}
f_allan = function(dat){
  all_multiples <- function(strings)
  {
    n <- length(strings)
    do.call("c", sapply(1:2^n, function(x) {
      mystrings <- strings[as.character(rawToBits(as.raw(x))[seq(n)]) == "01"]
      if (length(mystrings) > 1) paste(mystrings, collapse = ", ") else NULL
    }))
  }
  dat %>%
    group_by(Customer) %>%
    arrange(Product) %>%
    summarize(Product_group = all_multiples(Product)) %>%
    group_by(Product_group) %>%
    count(Product_group)
}
f_jay = function(dat){
  a <- split(dat$Product, dat$Customer) ## thx to @Henrik
  r <- range(lengths(a))
  pr <- unlist(lapply(r[1]:r[2], function(x)
    combn(unique(dat$Product), x, list)), recursive = FALSE)
  or <- rowSums(outer(pr, a, Vectorize(function(x, y) all(x %in% y))))
  res <- data.frame(p.group = sapply(pr, toString), number = or)
  res[res$number > 0, ]
}
f_uwe_dt = function(dat){
  setorder(setDT(dat), Customer, Product)
  dat[, .(Product.Group = unlist(lapply(tail(seq(.N), -1L),
                                        function(m) combn(unique(Product), m, toString, FALSE)))),
      by = Customer][
        , .N, by = Product.Group]
}
f_uwe_dplyr = function(dat){
  dat %>%
    arrange(Customer, Product) %>%
    group_by(Customer) %>%
    summarise(Product.Group = n() %>%
                seq() %>%
                tail(-1L) %>%
                lapply(function(m) combn(unique(Product), m, toString, FALSE)) %>%
                unlist()) %>%
    ungroup() %>%
    count(Product.Group)
}
If you define a little helper function that gets all multiple groupings:
all_multiples <- function(strings)
{
  n <- length(strings)
  do.call("c", sapply(1:2^n, function(x) {
    # the bits of x select one subset of strings per iteration
    mystrings <- strings[as.character(rawToBits(as.raw(x))[seq(n)]) == "01"]
    # keep only subsets with at least two elements
    if (length(mystrings) > 1) paste(mystrings, collapse = ", ") else NULL
  }))
}
then you can do this nicely in a tidyverse pipe:
dat %>%
  group_by(Customer) %>%
  arrange(Product) %>%
  summarize(Product_group = all_multiples(Product)) %>%
  group_by(Product_group) %>%
  count(Product_group)
#> # A tibble: 11 x 2
#> # Groups:   Product_group [11]
#>    Product_group     n
#>    <chr>         <int>
#>  1 A, B              3
#>  2 A, B, C           1
#>  3 A, B, D           1
#>  4 A, C              1
#>  5 A, D              1
#>  6 B, C              1
#>  7 B, D              1
#>  8 D, E              1
#>  9 D, E, F           1
#> 10 D, F              1
#> 11 E, F              1
For the sake of completeness, here is a solution in data.table syntax which can be translated to dplyr syntax as well.
For both implementations, the core idea is the same:
1. Sort by Product (an important step which has been neglected by the other answers posted so far).
2. For each Customer, create the product groups by using combn() with varying lengths m (see the small demo after this list). Product.Group is a kind of natural key created by concatenating the included products using the toString() function. Here we can see why sorting Product is important: products B, A as well as A, B should appear in the same product group A, B.
3. Finally, count the number of occurrences by Product.Group.
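For illustration, this is what the combn() step produces for a single customer who bought products A, B and C, here with m = 2:
combn(c("A", "B", "C"), 2, toString)
# [1] "A, B" "A, C" "B, C"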
data.table version
library(data.table)
setorder(setDT(df), Customer, Product)
df[, .(Product.Group = unlist(lapply(tail(seq(.N), -1L),
                                     function(m) combn(unique(Product), m, toString, FALSE)))),
   by = Customer][
     , .N, by = Product.Group]
    Product.Group N
 1:          A, B 3
 2:          A, C 1
 3:          B, C 1
 4:       A, B, C 1
 5:          D, E 1
 6:          D, F 1
 7:          E, F 1
 8:       D, E, F 1
 9:          A, D 1
10:          B, D 1
11:       A, B, D 1
dplyr version
library(dplyr)
df %>%
arrange(Customer, Product) %>%
group_by(Customer) %>%
summarise(Product.Group = n() %>%
seq() %>%
tail(-1L) %>%
lapply(function(m) combn(unique(Product), m, toString, FALSE)) %>%
unlist()) %>%
ungroup() %>%
count(Product.Group)
   Product.Group     n
   <chr>         <int>
 1 A, B              3
 2 A, B, C           1
 3 A, B, D           1
 4 A, C              1
 5 A, D              1
 6 B, C              1
 7 B, D              1
 8 D, E              1
 9 D, E, F           1
10 D, F              1
11 E, F              1
Data
library(data.table)
df <- fread("
Customer Product
1 A
1 B
1 C
2 D
2 E
2 F
3 A
3 B
3 D
4 A
4 B")
You could split the data along customers, then get all combinations of product pairs and triples using combn. Then find matches using %in% with outer, create a data frame by collapsing the products using toString, and finally discard elements which are zero.
# a <- aggregate(Product ~ Customer, dat, I)$Product ## old solution
# if (is.matrix(a)) a <- as.data.frame(t(a)) ## old solution
a <- split(dat$Product, dat$Customer) ## thx to @Henrik
r <- range(lengths(a))
pr <- unlist(lapply(r[1]:r[2], function(x)
  combn(unique(dat$Product), x, list)), recursive = FALSE)
or <- rowSums(outer(pr, a, Vectorize(function(x, y) all(x %in% y))))
res <- data.frame(p.group = sapply(pr, toString), number = or)
res[res$number > 0, ]
#    p.group number
# 1     A, B      3
# 2     A, C      1
# 3     A, D      1
# 6     B, C      1
# 7     B, D      1
# 13    D, E      1
# 14    D, F      1
# 15    E, F      1
# 16 A, B, C      1
# 17 A, B, D      1
# 35 D, E, F      1
Data
dat <- read.table(header=TRUE, text="Customer Product
1 A
1 B
1 C
2 D
2 E
2 F
3 A
3 B
3 D
4 A
4 B")

Combine string elements within a list into one variable in R

I have a data frame (df) with one variable that's a list containing string vectors (mylist).
library(tibble)
v1 = c("a", "b", "c")
v2 = c("d", "e", "f", "g", "h")
v3 = c("x", "y", "z", "k")
df = tibble(id = 1:3, mylist = list(v1, v2, v3))
How can I combine the elements of mylist into a single variable for each row? I want my data to look like this:
id mylist
1 "a b c"
2 "d e f g h"
3 "x y z k"
One dplyr option could be:
df %>%
  rowwise() %>%
  mutate(mylist = Reduce(paste, mylist))
     id mylist
  <int> <chr>
1     1 a b c
2     2 d e f g h
3     3 x y z k
A base R option would be to collapse the list elements using sapply() and paste():
df$mylist <- sapply(df$mylist, paste, collapse = " ")
df
# A tibble: 3 x 2
     id mylist
  <int> <chr>
1     1 a b c
2     2 d e f g h
3     3 x y z k
Or, using dplyr with purrr::map_chr():
library(purrr)
library(dplyr)
df %>%
  mutate(mylist = map_chr(mylist, paste, collapse = " "))
An option is to unnest and do a group_by paste
library(dplyr)
library(tidyr)
library(stringr)
df %>%
  # // expand the dataset by unnesting the column
  unnest(c(mylist)) %>%
  # // grouped by id
  group_by(id) %>%
  # // paste the elements of mylist to a single string
  summarise(mylist = str_c(mylist, collapse = ' '))
# A tibble: 3 x 2
# id mylist
# <int> <chr>
#1 1 a b c
#2 2 d e f g h
#3 3 x y z k

Solution on R group by issue _ multiple combination

I'm using the group by function on a dataset in R, but the target IDs can overlap across groups. Here is the sample dataset:
ID Var1
A 1
A 3
B 2
C 3
C 1
D 2
With the traditional group by on each ID, I can do
DT <- data.table(dataset)
DT[, sum(Var1), by = ID]
and get the result:
ID V1
A 4
B 2
C 4
D 2
However, I have to group the IDs as A+B, B+C and D
(PS. say that F = A+B and G = B+C)
and the target result dataset below:
ID V1
F 6
G 6
D 2
If I used a recoding technique on ID, the duplicated B would have to be covered twice.
Does anyone have a solution?
Many thanks!
library(dplyr)
library(tidyr)
df <- df %>% mutate(F = ifelse(ID %in% c("A", "B"), 1, 0),
                    G = ifelse(ID %in% c("B", "C"), 1, 0),
                    D = ifelse(ID == "D", 1, 0))
df %>%
  gather(var, val, F:D) %>%
  filter(val == 1) %>%
  group_by(var) %>%
  summarise(V1 = sum(Var1))
# # A tibble: 3 x 2
# var V1
# <chr> <dbl>
# 1 D 2
# 2 F 6
# 3 G 6
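For comparison, here is a base R sketch of the same idea: spell the grouping out as a named list of IDs, so the duplicated B can simply belong to two groups (the sample data is recreated first):
df <- data.frame(ID = c("A", "A", "B", "C", "C", "D"),
                 Var1 = c(1, 3, 2, 3, 1, 2))
# an ID may appear in more than one group
groups <- list(F = c("A", "B"), G = c("B", "C"), D = "D")
data.frame(ID = names(groups),
           V1 = sapply(groups, function(ids) sum(df$Var1[df$ID %in% ids])))
#   ID V1
# F  F  6
# G  G  6
# D  D  2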

How to transform a table of a hierarchical network to an edge list efficiently in R

I know there are plenty of open and answered questions about generating edge lists but I found none that fits my case:
I have a table that represents a strictly hierarchical network and I want to convert it to one that has 3 columns: source node, target node and type of interaction. The table is fairly redundant, with the first column listing all first-level nodes, the second column the corresponding second-level nodes, and so on:
first second third
1 A F L
2 B F L
3 C G L
4 D F L
5 E G L
6 L L L
The table I want looks like this:
source target level
1 F L third
2 G L third
3 L L third
4 A F second
5 B F second
6 C G second
7 D F second
8 E G second
9 L L second
So far, I only had to do it once with fairly few levels, so I used the following clumsy approach with dplyr:
library(dplyr)
example.df <- data.frame(
  "first" = c("A", "B", "C", "D", "E", "L"),
  "second" = c("F", "F", "G", "F", "G", "L"),
  "third" = c("L", "L", "L", "L", "L", "L")
)
name.v <- c("source", "target")
third.df <- example.df %>%
  group_by(second) %>%
  summarise(third = unique(third))
names(third.df) <- name.v
second.df <- example.df %>%
  group_by(first) %>%
  summarise(second = unique(second))
names(second.df) <- name.v
hier.df <- bind_rows("third" = third.df, "second" = second.df, .id = "level") %>%
  select(source, target, level)
# using igraph to generate the image
library(igraph)
hier.graph <- graph_from_data_frame(hier.df)
plot(hier.graph)
Obviously, this would scale pretty badly. So for programming, I would prefer to find a more lean approach, e.g. in base R.
You could try using map_df here
Your data
df <- read.table(text=" first second third
A F L
B F L
C G L
D F L
E G L
L L L", header=TRUE, stringsAsFactors=FALSE)
Solution
library(purrr)
library(dplyr)
map_df(2:ncol(df), ~select(df, (.x-1):.x) %>% setNames(c("source", "target")), .id = "id") %>%
  group_by(id) %>%
  distinct() %>%
  ungroup() %>%
  mutate(id = colnames(df)[as.numeric(id) + 1])
# A tibble: 9 x 3
# id source target
# <chr> <chr> <chr>
# 1 second A F
# 2 second B F
# 3 second C G
# 4 second D F
# 5 second E G
# 6 second L L
# 7 third F L
# 8 third G L
# 9 third L L
This scales to an arbitrary number of columns:
set.seed(1)
new_df <- as_tibble(matrix(sample(LETTERS, 25, replace = FALSE), ncol = 5)) %>%
  setNames(c("first", "second", "third", "fourth", "fifth"))
myfun <- function(data) {
  map_df(2:ncol(data), ~select(data, (.x-1):.x) %>% setNames(c("source", "target")), .id = "id") %>%
    group_by(id) %>%
    distinct() %>%
    ungroup() %>%
    mutate(id = colnames(data)[as.numeric(id) + 1])
}
myfun(new_df)
# A tibble: 20 x 3
# id source target
# <chr> <chr> <chr>
# 1 second G S
# 2 second J W
# 3 second N M
# 4 second U L
# 5 second E B
# 6 third S D
# 7 third W C
# 8 third M Y
# 9 third L V
# 10 third B X
# 11 fourth D F
# 12 fourth C H
# 13 fourth Y I
# 14 fourth V P
# 15 fourth X K
# 16 fifth F Z
# 17 fifth H Q
# 18 fifth I O
# 19 fifth P A
# 20 fifth K R
igraph's as_data_frame() will take care of this for you. Its what argument can be "edges", "vertices" or "both"; the latter returns both vertices and edges in a list of data.frames.
?igraph::as_data_frame
igraph::as_data_frame(x = hier.graph, what = "edges") %>%
  `colnames<-`(c("source", "target", "level"))
# source target level
# 1 F L third
# 2 G L third
# 3 L L third
# 4 A F second
# 5 B F second
# 6 C G second
# 7 D F second
# 8 E G second
# 9 L L second
Maybe this is clumsy too, but columns 1 & 2 make the second level connections; columns 2 & 3 are the third level. Just separate them and use rbind.
SL = unique(cbind(TAB[, 1:2], "second"))
names(SL) = c("source", "target", "level")
TL = unique(cbind(TAB[, 2:3], "third"))
names(TL) = c("source", "target", "level")
rbind(TL, SL)
   source target  level
1       F      L  third
3       G      L  third
6       L      L  third
11      A      F second
2       B      F second
31      C      G second
4       D      F second
5       E      G second
61      L      L second
DATA
TAB = read.table(text="first second third
1 A F L
2 B F L
3 C G L
4 D F L
5 E G L
6 L L L",
header=TRUE)
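In the same spirit, the rbind idea can be generalised to an arbitrary number of columns by looping over consecutive column pairs (a sketch; edge_list is just a made-up helper name):
edge_list <- function(tab) {
  pairs <- lapply(2:ncol(tab), function(i) {
    # columns i-1 and i hold the source/target pairs of level names(tab)[i]
    el <- unique(setNames(tab[, c(i - 1, i)], c("source", "target")))
    el$level <- names(tab)[i]
    el
  })
  do.call(rbind, pairs)
}
edge_list(TAB)
# same nine edges as above, with the second level listed first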

dplyr concat columns stored in variable (mutate and non standard evaluation)

I would like to concatenate an arbitrary number of columns in a dataframe based on a variable cols_to_concat
df <- dplyr::data_frame(a = letters[1:3], b = letters[4:6], c = letters[7:9])
cols_to_concat = c("a", "b", "c")
To achieve the desired result with this specific value of cols_to_concat I could do this:
df %>%
  dplyr::mutate(concat = paste0(a, b, c))
But I need to generalise this, using syntax a bit like this
# (DOES NOT WORK)
df %>%
  dplyr::mutate(concat = paste0(cols))
I'd like to use the new NSE approach of dplyr 0.7.0, if this is appropriate, but can't figure out the correct syntax.
You can perform this operation using only the tidyverse if you'd like to stick to those packages and principles. You can do it by using either mutate() or unite_(), which comes from the tidyr package.
Using mutate()
library(dplyr)
df <- tibble(a = letters[1:3], b = letters[4:6], c = letters[7:9])
cols_to_concat <- c("a", "b", "c")
df %>% mutate(new_col = do.call(paste0, .[cols_to_concat]))
# A tibble: 3 × 4
a b c new_col
<chr> <chr> <chr> <chr>
1 a d g adg
2 b e h beh
3 c f i cfi
Using unite_()
library(tidyr)
df %>% unite_(col='new_col', cols_to_concat, sep="", remove=FALSE)
# A tibble: 3 × 4
new_col a b c
* <chr> <chr> <chr> <chr>
1 adg a d g
2 beh b e h
3 cfi c f i
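Note that unite_() has since been deprecated; in current versions of tidyr the same call is written with unite() and all_of():
df %>% unite(col = 'new_col', all_of(cols_to_concat), sep = "", remove = FALSE)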
EDITED July 2020
As of dplyr 1.0.0, it appears that across() and c_across() are replacing the underscore verbs (e.g. unite_) and scoped variants like mutate_if(), mutate_at() and mutate_all(). Below is an example using that convention. Not the most concise, but still an option that promises to be more extensible.
Using c_across()
library(dplyr)
df <- tibble(a = letters[1:3], b = letters[4:6], c = letters[7:9])
cols_to_concat <- c("a", "b", "c")
df %>%
  rowwise() %>%
  mutate(new_col = paste0(c_across(all_of(cols_to_concat)), collapse = ""))
#> # A tibble: 3 x 4
#> # Rowwise:
#> a b c new_col
#> <chr> <chr> <chr> <chr>
#> 1 a d g adg
#> 2 b e h beh
#> 3 c f i cfi
Created on 2020-07-08 by the reprex package (v0.3.0)
You can try syms from rlang:
library(dplyr)
packageVersion('dplyr')
#[1] ‘0.7.0’
df <- dplyr::data_frame(a = letters[1:3], b = letters[4:6], c = letters[7:9])
cols_to_concat = c("a", "b", "c")
library(rlang)
cols_quo <- syms(cols_to_concat)
df %>% mutate(concat = paste0(!!!cols_quo))
# or
df %>% mutate(concat = paste0(!!!syms(cols_to_concat)))
# # A tibble: 3 x 4
# a b c concat
# <chr> <chr> <chr> <chr>
# 1 a d g adg
# 2 b e h beh
# 3 c f i cfi
You can do the following:
library(dplyr)
df <- dplyr::data_frame(a = letters[1:3], b = letters[4:6], c = letters[7:9])
cols_to_concat = lapply(list("a", "b", "c"), as.name)
q <- quos(paste0(!!! cols_to_concat))
df %>%
  dplyr::mutate(concat = !!! q)
