Concatenate strings by group with dplyr [duplicate] - r

This question already has answers here:
Collapse / concatenate / aggregate a column to a single comma separated string within each group
(6 answers)
Closed 5 years ago.
I have a data frame that looks like this:
> data <- data.frame(foo=c(1, 1, 2, 3, 3, 3), bar=c('a', 'b', 'a', 'b', 'c', 'd'))
> data
foo bar
1 1 a
2 1 b
3 2 a
4 3 b
5 3 c
6 3 d
I would like to create a new column bars_by_foo which is the concatenation of the values of bar by foo. So the new data should look like this:
foo bar bars_by_foo
1 1 a ab
2 1 b ab
3 2 a a
4 3 b bcd
5 3 c bcd
6 3 d bcd
I was hoping that the following would work:
p <- function(v) {
Reduce(f=paste, x = v)
}
data %>%
group_by(foo) %>%
mutate(bars_by_foo=p(bar))
But that code gives me an error
Error: incompatible types, expecting a character vector.
What am I doing wrong?

You could simply do
data %>%
group_by(foo) %>%
mutate(bars_by_foo = paste0(bar, collapse = ""))
Without any helper functions
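For comparison, the same per-group collapse can be sketched in base R with ave(), which applies a function within each group and writes the result back to every row of that group. This is a swapped-in alternative, not part of the answer above, and it assumes bar is stored as character rather than as a factor.

```r
# Sample data from the question; bar is kept as character on purpose
data <- data.frame(foo = c(1, 1, 2, 3, 3, 3),
                   bar = c("a", "b", "a", "b", "c", "d"),
                   stringsAsFactors = FALSE)

# Collapse bar within each foo group; the single collapsed string
# is repeated on every row that belongs to the group
data$bars_by_foo <- ave(data$bar, data$foo,
                        FUN = function(x) paste0(x, collapse = ""))
data
```

The grouped mutate() version stays closer to the rest of a dplyr pipeline; the ave() form is only useful when avoiding the dependency matters.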

It looks like there's a bit of an issue with the mutate function here. I've found that it's often a better approach to work with summarise when you're grouping data in dplyr (though that's in no way a hard-and-fast rule).
The paste function also introduces whitespace into the result, so either set sep = "" or just use paste0.
Here is my code:
p <- function(v) {
Reduce(f=paste0, x = v)
}
data %>%
group_by(foo) %>%
summarise(bars_by_foo = p(as.character(bar))) %>%
merge(., data, by = 'foo') %>%
select(foo, bar, bars_by_foo)
Resulting in..
foo bar bars_by_foo
1 1 a ab
2 1 b ab
3 2 a a
4 3 b bcd
5 3 c bcd
6 3 d bcd
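The whitespace point in this answer is easy to check in base R. The sketch below uses a made-up vector v and compares paste with its default separator, paste with an empty separator, and paste0, plus the collapse argument as an alternative to Reduce.

```r
v <- c("b", "c", "d")  # illustrative values only

with_space <- paste(v[1], v[2])            # default sep = " "
no_space   <- paste(v[1], v[2], sep = "")  # empty separator
same_thing <- paste0(v[1], v[2])           # paste0 never inserts a separator

# Two ways to join a whole vector into one string
folded    <- Reduce(paste0, v)             # pairwise fold, as in the answer
collapsed <- paste0(v, collapse = "")      # single vectorised call
```

Both folded and collapsed end up as one string; collapse does it without the helper function.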

You can try this:
agg <- aggregate(bar~foo, data = data, paste0, collapse="")
df <- merge(data, agg, by = "foo", all = TRUE)
colnames(df) <- c(colnames(data), "bars_by_foo") # optional
# foo bar bars_by_foo
# 1 1 a ab
# 2 1 b ab
# 3 2 a a
# 4 3 b bcd
# 5 3 c bcd
# 6 3 d bcd

Your function works if you make sure that bar is a character vector and not a factor. Note that paste (rather than paste0) leaves spaces between the values.
data <- data.frame(foo=c(1, 1, 2, 3, 3, 3), bar=c('a', 'b', 'a', 'b', 'c', 'd'),
stringsAsFactors = FALSE)
library("dplyr")
p <- function(v) {
Reduce(f=paste, x = v)
}
data %>%
group_by(foo) %>%
mutate(bars_by_foo=p(bar))
Source: local data frame [6 x 3]
Groups: foo [3]
foo bar bars_by_foo
<dbl> <chr> <chr>
1 1 a a b
2 1 b a b
3 2 a a
4 3 b b c d
5 3 c b c d
6 3 d b c d
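A base R sketch of why the factor matters, using the helper p from the question. Before R 4.0, data.frame() converted strings to factors by default. With a factor, Reduce returns a single-element group unchanged (still a factor), while groups with several elements go through paste and come back as character. That mix of types across groups is most likely what the dplyr error was complaining about; treat that link as an inference rather than a confirmed diagnosis.

```r
p <- function(v) Reduce(f = paste, x = v)

# bar as a factor, the way data.frame() built it before R 4.0
bar <- factor(c("a", "b", "a", "b", "c", "d"))
foo <- c(1, 1, 2, 3, 3, 3)

# Apply p to each group separately, as a grouped mutate would
pieces <- lapply(split(bar, foo), p)

# Groups 1 and 3 are pasted into character strings;
# group 2 has one element, so Reduce hands the factor back untouched
sapply(pieces, class)
```

Converting with as.character(bar), or building the data frame with stringsAsFactors = FALSE, makes every group return a character value.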

Related

R: Repeating row of dataframe with respect to multiple count columns

I have an R data frame that has a structure similar to the following:
df <- data.frame(var1 = c(1, 1), var2 = c(0, 2), var3 = c(3, 0), f1 = c('a', 'b'), f2=c('c', 'd') )
So visually the DataFrame would look like
> df
var1 var2 var3 f1 f2
1 1 0 3 a c
2 1 2 0 b d
What I want to do is the following:
(1) Treat the first C=3 columns as counts for three different classes. (C is the number of classes, given as an input variable.) Add a new column called "class".
(2) For each row, duplicate the last two entries of the row according to the count of each class (separately); and append the class number to the new "class" column.
For example, the output for the above dataset would be
> df_updated
f1 f2 class
1 a c 1
2 a c 3
3 a c 3
4 a c 3
5 b d 1
6 b d 2
7 b d 2
where row (a c) is duplicated 4 times, 1 time with respect to class 1, and 3 times with respect to class 3; row (b d) is duplicated 3 times, 1 time with respect to class 1 and 2 times with respect to class 2.
I tried looking at previous posts on duplicating rows based on counts (e.g. this link), and I could not figure out how to adapt the solutions there to multiple count columns (and also appending another class column).
Also, my actual dataset has many more rows and classes (say 1000 rows and 20 classes), so ideally I want a solution that is as efficient as possible.
I wonder if anyone can help me on this. Thanks in advance.
Here is a tidyverse option. We can use uncount from tidyr to duplicate the rows according to the count in value (i.e., from the var columns) after pivoting to long format.
library(tidyverse)
df %>%
pivot_longer(starts_with("var"), names_to = "class") %>%
filter(value != 0) %>%
uncount(value) %>%
mutate(class = str_extract(class, "\\d+"))
Output
f1 f2 class
<chr> <chr> <chr>
1 a c 1
2 a c 3
3 a c 3
4 a c 3
5 b d 1
6 b d 2
7 b d 2
Another slight variation is to use expandRows from splitstackshape in conjunction with tidyverse.
library(splitstackshape)
df %>%
pivot_longer(starts_with("var"), names_to = "class") %>%
filter(value != 0) %>%
expandRows("value") %>%
mutate(class = str_extract(class, "\\d+"))
base R (with reshape2::melt for the reshaping step)
Row order (and row names) notwithstanding:
tmp <- subset(reshape2::melt(df, id.vars = c("f1","f2"), value.name = "class"), class > 0, select = -variable)
tmp[rep(seq_along(tmp$class), times = tmp$class),]
# f1 f2 class
# 1 a c 1
# 2 b d 1
# 4 b d 2
# 4.1 b d 2
# 5 a c 3
# 5.1 a c 3
# 5.2 a c 3
dplyr
library(dplyr)
library(tidyr) # needed for pivot_longer
df %>%
pivot_longer(-c(f1, f2), values_to = "class") %>%
dplyr::filter(class > 0) %>%
select(-name) %>%
slice(rep(row_number(), times = class))
# # A tibble: 7 x 3
# f1 f2 class
# <chr> <chr> <dbl>
# 1 a c 1
# 2 a c 3
# 3 a c 3
# 4 a c 3
# 5 b d 1
# 6 b d 2
# 7 b d 2
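If avoiding extra packages matters, the same expansion can be sketched with nested rep() calls in base R alone. This is an alternative to the answers above, not a restatement of them; C is the number of count columns, as in the question, and the names counts, row_idx, class_idx and out are illustrative.

```r
df <- data.frame(var1 = c(1, 1), var2 = c(0, 2), var3 = c(3, 0),
                 f1 = c("a", "b"), f2 = c("c", "d"),
                 stringsAsFactors = FALSE)
C <- 3  # number of leading count columns

counts <- as.matrix(df[, seq_len(C)])
# Counts flattened row by row: row 1's classes 1..C, then row 2's, and so on
times <- as.vector(t(counts))

# One (row, class) pair per count cell, each repeated by its count
row_idx   <- rep(rep(seq_len(nrow(df)), each = C), times = times)
class_idx <- rep(rep(seq_len(C), times = nrow(df)), times = times)

out <- data.frame(df[row_idx, c("f1", "f2")], class = class_idx)
rownames(out) <- NULL
out
```

Zero counts drop out on their own because rep() repeats those pairs zero times, so no explicit filter step is needed.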

Using pipe operation in R properly

I was examining the code below:
library(dplyr)
DF = data.frame('A' = 1:3, 'B' =2:4)
Condition = 'A'
fn1 = function(x) x + 3
fn2 = function(x) x + 5
DF %>% mutate('aa' = 3:5) %>%
{if (Condition == 'A') {
bb = . %>% mutate('A1' = fn1(A), 'B1' = fn1(B))
bb
} else {
bb = . %>% mutate('A1' = fn2(A), 'B1' = fn2(B))
bb
}
}
Basically, I have 2 similar functions fn1 and fn2. Now based on some condition, I want to use one of these functions.
The above implementation produces the output below instead of a data frame:
Functional sequence with the following components:
1. mutate(., A1 = fn1(A), B1 = fn1(B))
Use 'functions' to extract the individual functions.
Can you please help me write the pipe sequence properly so that the above code executes?
We could use across within mutate
library(dplyr)
DF %>%
mutate(aa = 3:5, across(c(A, B), ~ if(Condition == 'A') fn1(.)
else fn2(.), .names = "{.col}1"))
-output
A B aa A1 B1
1 1 2 3 4 5
2 2 3 4 5 6
3 3 4 5 6 7
Also, an option is to get the functions in a list and convert the logical vector to numeric index for subsetting
DF %>%
mutate(aa = 3:5,
across(c(A, B), ~ list(fn2, fn1)[[1 + (Condition == 'A')]](.),
.names = "{.col}1"))
-output
A B aa A1 B1
1 1 2 3 4 5
2 2 3 4 5 6
3 3 4 5 6 7
Based on the comments, if we need a custom name for the new columns, create a named vector and replace with str_replace_all
library(stringr)
nm1 <- setNames(c("XXX", "YYY"), names(DF)[1:2])
DF %>%
mutate(aa = 3:5,
across(c(A, B), ~ list(fn2, fn1)[[1 + (Condition == 'A')]](.),
.names = "{str_replace_all(.col, nm1)}"))
A B aa XXX YYY
1 1 2 3 4 5
2 2 3 4 5 6
3 3 4 5 6 7
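On the original symptom: in magrittr, a pipeline that starts with . (as in bb = . %>% mutate(...)) defines a function instead of applying one, which is why a "Functional sequence" was printed rather than a data frame. A simple way around it is to pick the function before the pipeline starts. The sketch below shows that idea in base R only; the name fn is illustrative.

```r
DF <- data.frame(A = 1:3, B = 2:4)
Condition <- "A"
fn1 <- function(x) x + 3
fn2 <- function(x) x + 5

# Decide once which function to use, outside of any pipeline
fn <- if (Condition == "A") fn1 else fn2

DF$aa <- 3:5
# Apply the chosen function to A and B, storing the results as A1 and B1
DF[c("A1", "B1")] <- lapply(DF[c("A", "B")], fn)
DF
```

The same fn can then be used inside a dplyr call such as mutate(A1 = fn(A), B1 = fn(B)), with no braces or if/else in the middle of the pipe.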

Including map() function to tabulate each element in a character vector returns an error

I'd like to tabulate the frequencies of each individual character in a character vector. This vector contains the answers to a set of items in a survey, with the structure "ADCDAB...", where "A" is the answer to the first item, "D" the answer to the second one, etc.
I'd like to process the data with purrr::map combined with base string functions.
p1 <- strsplit(test$answer, "")
map(p1,table)
However, if I put the same steps into a dplyr pipeline:
test %>%
mutate(p1=strsplit(answer,"")) %>%
map(p1,table)
the system returns the following error message:
Error: Index 1 must have length 1, not 10
What's wrong with the second syntax?
A dummy dataset
structure(list(answer = c(".BBCBD.A.D", "...DB..AA.", "B......AB.",
"BDDDBACADD", "BB.ABC.AAD"), d.n.i = c(1, 2, 3, 4, 5)), row.names = c(NA,
5L), class = "data.frame")
Here is a base R option
x <- "ADCDAB"
out <- table(utf8ToInt(x))
names(out) <- intToUtf8(names(out), multiple = TRUE)
out
#A B C D
#2 1 1 2
With multiple elements use lapply
x <- c("ADCDAB", "EFG")
f <- function(i) {
out <- table(utf8ToInt(i))
names(out) <- intToUtf8(names(out), multiple = TRUE)
out
}
lapply(x, f)
Returns
#[[1]]
#A B C D
#2 1 1 2
#[[2]]
#E F G
#1 1 1
If you need output as single table, try
x <- c("ADCDAB", "EFGAA")
f(paste(x, collapse = ""))
#A B C D E F G
#4 1 1 2 1 1 1
.. or as dataframe
as.data.frame(f(paste(x, collapse = "")))
# Var1 Freq
#1 A 4
#2 B 1
#3 C 1
#4 D 2
#5 E 1
#6 F 1
#7 G 1
You could do :
library(tidyverse)
test %>% mutate(p1 = strsplit(answer,""), p2 = map(p1, table))
However, I would suggest something like below :
test %>%
mutate(p1 = strsplit(answer,"")) %>%
unnest(p1) %>%
count(answer, p1)
# answer p1 n
# <chr> <chr> <int>
#1 ABCD A 1
#2 ABCD B 1
#3 ABCD C 1
#4 ABCD D 1
#5 ADCDAB A 2
#6 ADCDAB B 1
#7 ADCDAB C 1
#8 ADCDAB D 2
data
test <- data.frame(answer = c("ADCDAB", "ABCD"), stringsAsFactors = FALSE)
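As for why the piped version in the question fails: %>% passes the whole data frame as the first argument of map(), so p1 and table no longer land in the argument positions they had in the standalone call. A base R sketch on the dummy dataset from the question avoids the issue by splitting and tabulating the column directly. The names survey, chars, per_row and overall are illustrative.

```r
# Dummy dataset from the question
survey <- data.frame(
  answer = c(".BBCBD.A.D", "...DB..AA.", "B......AB.",
             "BDDDBACADD", "BB.ABC.AAD"),
  d.n.i = c(1, 2, 3, 4, 5),
  stringsAsFactors = FALSE
)

# One character vector per respondent
chars <- strsplit(survey$answer, "")

# Frequency table per respondent, then one table over all answers
per_row <- lapply(chars, table)
overall <- table(unlist(chars))

per_row[[1]]
overall
```

In a pipeline, the equivalent is to call map() inside mutate(), as the answer above does, so that it receives the list column and not the data frame.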

Recursively sum data frames for matching rows

I would like to combine a set of data frames into a single data frame by summing columns that have matching variables (instead of appending columns).
For example, given
df1 <- data.frame(A = c(0,0,1,1,1,2,2), B = c(1,2,1,2,3,1,5), x = c(2,3,1,5,3,7,0))
df2 <- data.frame(A = c(0,1,1,2,2,2), B = c(1,1,3,2,4,5), x = c(4,8,4,1,0,3))
df3 <- data.frame(A = c(0,1,2), B = c(5,4,2), x = c(5,3,1))
I want to match by "A" and "B" and sum the values of "x". For this example, I can get the desired result as follows:
library(plyr)
library(dplyr)
# rename columns so that join_all preserves them all:
colnames(df1)[3] <- "x1"
colnames(df2)[3] <- "x2"
colnames(df3)[3] <- "x3"
# join the data frames by matching "A" and "B" values:
res <- join_all(list(df1, df2, df3), by = c("A", "B"), type = "full")
# get the sums and drop superfluous columns:
arrange(res, A, B) %>%
rowwise() %>%
mutate(x = sum(x1, x2, x3, na.rm = TRUE)) %>%
select(A, B, x)
Result:
A B x
<dbl> <dbl> <dbl>
1 0 1 6
2 0 2 3
3 0 5 5
4 1 1 9
5 1 2 5
6 1 3 7
7 1 4 3
8 2 1 7
9 2 2 2
10 2 4 0
11 2 5 3
A more general solution is
library(dplyr)
# function to get the desired result for two data frames:
my_merge <- function(df1, df2)
{
m1 <- merge(df1, df2, by = c("A", "B"), all = TRUE)
m1 <- rowwise(m1) %>% # use the merged result, not the global res
mutate(x = sum(x.x, x.y, na.rm = TRUE)) %>%
select(A, B, x)
return(m1)
}
l1 <- list(df2, df3) # omit the first data frame
res <- df1 # initial value of the result
for(df in l1) res <- my_merge(res, df) # call the function repeatedly
Is there a more efficient option for combining a large set of data frames? Ideally it should be recursive (i.e. it's better not to join all data frames into one massive data frame before calculating the sums).
An easier option is to bind the rows of the datasets, then group by the columns of interest and get the summarised output by getting the sum of 'x'
library(tidyverse)
bind_rows(df1, df2, df3) %>%
group_by(A, B) %>%
summarise(x = sum(x))
# A tibble: 11 x 3
# Groups: A [?]
# A B x
# <dbl> <dbl> <dbl>
# 1 0 1 6
# 2 0 2 3
# 3 0 5 5
# 4 1 1 9
# 5 1 2 5
# 6 1 3 7
# 7 1 4 3
# 8 2 1 7
# 9 2 2 2
#10 2 4 0
#11 2 5 3
If there are many objects in the global environment with the pattern "df" followed by some digits
mget(ls(pattern= "^df\\d+")) %>%
bind_rows %>%
group_by(A, B) %>%
summarise(x = sum(x))
Since the OP mentioned memory constraints, doing the join first and then summing with rowSums (or with + and reduce) may be more efficient
mget(ls(pattern= "^df\\d+")) %>%
reduce(full_join, by = c("A", "B")) %>%
transmute(A, B, x = rowSums(.[3:5], na.rm = TRUE)) %>%
arrange(A, B)
# A B x
#1 0 1 6
#2 0 2 3
#3 0 5 5
#4 1 1 9
#5 1 2 5
#6 1 3 7
#7 1 4 3
#8 2 1 7
#9 2 2 2
#10 2 4 0
#11 2 5 3
This could also be done with data.table
library(data.table)
rbindlist(mget(ls(pattern= "^df\\d+")))[, .(x = sum(x)), by = .(A, B)]
Ideally it should be recursive (i.e. it's better not to join all data frames into one massive data frame before calculating the sums).
If you're memory constrained and willing to sacrifice speed (vs @akrun's data.table approach), use one table at a time in a loop:
library(data.table)
tabs = c("df1", "df2", "df3")
# enumerate all combos for the results table
# initializing sum to 0
res = CJ(A = 0:2, B = 1:5, x = 0)
# loop over tabs, adding on
for (i in seq_along(tabs)){
tab = get(tabs[[i]])
res[tab, on=.(A, B), x := x + i.x][]
rm(tab)
}
If you need to read tables from disk, change tabs to file names and get to fread or whatever function.
I am skeptical that you can fit all the tables in memory, but cannot also fit an rbind-ed copy of them together.
Similarly (thanks to @akrun's comment), use his approach pairwise:
res = data.table(get(tabs[[1]]))[0L]
for (i in seq_along(tabs)){
tab = get(tabs[[i]])
res = rbind(res, tab)[, .(x = sum(x)), by=.(A,B)]
rm(tab)
}
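The pairwise idea can also be sketched without any packages, folding the list of data frames with Reduce() and aggregate() so that only the running total and one new table are combined at a time. This is a base R variation on the loops above, not a benchmarked recommendation; the helper name add_pair is made up for the example.

```r
df1 <- data.frame(A = c(0,0,1,1,1,2,2), B = c(1,2,1,2,3,1,5), x = c(2,3,1,5,3,7,0))
df2 <- data.frame(A = c(0,1,1,2,2,2), B = c(1,1,3,2,4,5), x = c(4,8,4,1,0,3))
df3 <- data.frame(A = c(0,1,2), B = c(5,4,2), x = c(5,3,1))

# Stack the running total with the next table and re-sum x per (A, B)
add_pair <- function(acc, df) {
  aggregate(x ~ A + B, data = rbind(acc, df), FUN = sum)
}

res <- Reduce(add_pair, list(df1, df2, df3))

# aggregate() does not return rows ordered by A then B, so sort explicitly
res <- res[order(res$A, res$B), ]
rownames(res) <- NULL
res
```

For many large tables, the data.table versions above are likely faster; this form mainly shows that the fold itself needs nothing beyond base R.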
