dplyr concat columns stored in variable (mutate and non standard evaluation) - r

I would like to concatenate an arbitrary number of columns in a dataframe based on a variable cols_to_concat
df <- dplyr::data_frame(a = letters[1:3], b = letters[4:6], c = letters[7:9])
cols_to_concat = c("a", "b", "c")
To achieve the desired result with this specific value of cols_to_concat I could do this:
df %>%
dplyr::mutate(concat = paste0(a, b, c))
But I need to generalise this, using syntax a bit like this
# (DOES NOT WORK)
df %>%
dplyr::mutate(concat = paste0(cols))
I'd like to use the new NSE approach of dplyr 0.7.0, if this is appropriate, but can't figure out the correct syntax.

You can perform this operation using only the tidyverse if you'd like to stick to those packages and principles. You can do it by using either mutate() or unite_(), which comes from the tidyr package.
Using mutate()
library(dplyr)
df <- tibble(a = letters[1:3], b = letters[4:6], c = letters[7:9])
cols_to_concat <- c("a", "b", "c")
df %>% mutate(new_col = do.call(paste0, .[cols_to_concat]))
# A tibble: 3 × 4
a b c new_col
<chr> <chr> <chr> <chr>
1 a d g adg
2 b e h beh
3 c f i cfi
Using unite_()
library(tidyr)
df %>% unite_(col='new_col', cols_to_concat, sep="", remove=FALSE)
# A tibble: 3 × 4
new_col a b c
* <chr> <chr> <chr> <chr>
1 adg a d g
2 beh b e h
3 cfi c f i
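Since unite_() is deprecated in recent tidyr releases, the same result can be sketched with unite() and the tidyselect helper all_of(). This is a minimal sketch assuming a tidyr version that re-exports all_of(); the name `united` is only illustrative.

```r
library(dplyr)
library(tidyr)

df <- tibble(a = letters[1:3], b = letters[4:6], c = letters[7:9])
cols_to_concat <- c("a", "b", "c")

# all_of() selects the columns named in the character vector;
# remove = FALSE keeps the original columns next to the new one
united <- df %>%
  unite("new_col", all_of(cols_to_concat), sep = "", remove = FALSE)

united
```

all_of() also raises an error if any name in cols_to_concat is missing from the data, which tends to surface typos early.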
EDITED July 2020
As of dplyr 1.0.0, across() and c_across() supersede the scoped variants such as mutate_if(), mutate_at() and mutate_all(), and underscore verbs such as tidyr's unite_() are deprecated in favour of their tidy-evaluation counterparts. Below is an example using that convention. It is not the most concise, but it is an option that promises to be more extensible.
Using c_across()
library(dplyr)
df <- tibble(a = letters[1:3], b = letters[4:6], c = letters[7:9])
cols_to_concat <- c("a", "b", "c")
df %>%
rowwise() %>%
mutate(new_col = paste0(c_across(all_of(cols_to_concat)), collapse=""))
#> # A tibble: 3 x 4
#> # Rowwise:
#> a b c new_col
#> <chr> <chr> <chr> <chr>
#> 1 a d g adg
#> 2 b e h beh
#> 3 c f i cfi
Created on 2020-07-08 by the reprex package (v0.3.0)
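Note that the output above is still marked "# Rowwise:", so later verbs would keep operating row by row. A small sketch of the same pipeline with ungroup() appended to return an ordinary tibble (the name `res` is only illustrative):

```r
library(dplyr)

df <- tibble(a = letters[1:3], b = letters[4:6], c = letters[7:9])
cols_to_concat <- c("a", "b", "c")

res <- df %>%
  rowwise() %>%
  mutate(new_col = paste0(c_across(all_of(cols_to_concat)), collapse = "")) %>%
  ungroup()  # drop the rowwise marker so later verbs work on whole columns

res
```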

You can try syms from rlang:
library(dplyr)
packageVersion('dplyr')
#[1] ‘0.7.0’
df <- dplyr::data_frame(a = letters[1:3], b = letters[4:6], c = letters[7:9])
cols_to_concat = c("a", "b", "c")
library(rlang)
cols_quo <- syms(cols_to_concat)
df %>% mutate(concat = paste0(!!!cols_quo))
# or
df %>% mutate(concat = paste0(!!!syms(cols_to_concat)))
# # A tibble: 3 x 4
# a b c concat
# <chr> <chr> <chr> <chr>
# 1 a d g adg
# 2 b e h beh
# 3 c f i cfi
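If this is needed in several places, the syms() approach can be wrapped in a small helper. The function name concat_cols and its arguments are hypothetical, not part of dplyr or rlang; this is a sketch under the assumption that `cols` holds existing column names.

```r
library(dplyr)
library(rlang)

# Hypothetical helper: paste the columns named in `cols` into `new_col`
concat_cols <- function(data, cols, new_col = "concat", sep = "") {
  data %>%
    # !!new_col := sets the output name from a string;
    # !!!syms(cols) splices the column symbols into paste()
    mutate(!!new_col := paste(!!!syms(cols), sep = !!sep))
}

df <- tibble(a = letters[1:3], b = letters[4:6], c = letters[7:9])
out <- concat_cols(df, c("a", "c"), new_col = "ac", sep = "-")
out
```

Unquoting `sep` with `!!` avoids it being masked by a data column that happens to be called `sep`.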

You can do the following:
library(dplyr)
df <- dplyr::data_frame(a = letters[1:3], b = letters[4:6], c = letters[7:9])
cols_to_concat = lapply(list("a", "b", "c"), as.name)
# quo() builds a single quosure; unquote it with !! (recent rlang rejects `!!!` on a named argument)
q <- quo(paste0(!!! cols_to_concat))
df %>%
dplyr::mutate(concat = !! q)

Related

How to relocate several columns in one step using dplyr::relocate?

I would like to reorder some columns to come after a particular other column using dplyr::relocate. Here is a MWE:
a <- letters[1:3]
b <- letters[4:6]
c <- letters[7:9]
d <- letters[10:12]
mytib <- tibble::tibble(a,b,c,d)
# A tibble: 3 x 4
# a b c d
# <chr> <chr> <chr> <chr>
# 1 a d g j
# 2 b e h k
# 3 c f i l
mytib %>%
relocate(c, .after = a)
This example works but is there a way that I could, with one relocate command, move c after a and, say, d after b?
I tried the following without success:
mytib %>%
relocate(c(c, d), .after(c(a, b)))
Edit 1: I explicitly ask about relocate because functions like select do not work for large datasets where all I know is after which column (name) I want to insert a column.
Edit 2: This is my expected output:
# A tibble: 3 x 4
# a c b d
# <chr> <chr> <chr> <chr>
# 1 a g d j
# 2 b h e k
# 3 c i f l
As dplyr::relocate itself apparently doesn't allow relocating in pairs, you can "hack" this behavior by preparing a list of column pairs like the ones you describe ("c after a" & "d after b") and reduce over that list, passing your df in as an .init value and in each reduce-step relocating one pair.
Like this:
library(dplyr)
library(purrr)
df_relocated <- reduce(
.x = list(c('c','a'), c('d','b')),
.f = ~ relocate(.x, .y[1], .after = .y[2]),
.init = mytib
)
This produces a tibble just as you expect it:
> df_relocated
# A tibble: 3 x 4
a c b d
<chr> <chr> <chr> <chr>
1 a g d j
2 b h e k
3 c i f l
In case you want to work with two lists, where element 1 of list 2 should be relocated after element 1 of list 1 and so forth, this would be a solution:
reduce2(
.x = c("a", "b"),
.y = c("c", "d"),
.f = ~ relocate(..1, ..3, .after = ..2),
.init = mytib
)
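The reduce2() idea above can be wrapped in a helper that takes the columns to move and their anchors as character vectors. The name relocate_pairs is hypothetical; the sketch assumes both vectors have the same length and uses all_of() so the strings are treated explicitly as column names.

```r
library(dplyr)
library(purrr)

mytib <- tibble(a = letters[1:3], b = letters[4:6],
                c = letters[7:9], d = letters[10:12])

# Hypothetical helper: move cols[i] directly after after[i], one pair at a time
relocate_pairs <- function(data, cols, after) {
  stopifnot(length(cols) == length(after))
  reduce2(
    cols, after,
    function(acc, col, anchor) {
      relocate(acc, all_of(col), .after = all_of(anchor))
    },
    .init = data
  )
}

moved <- relocate_pairs(mytib, cols = c("c", "d"), after = c("a", "b"))
names(moved)
```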

Combine string elements within a list into one variable in R

I have a data frame (df) with one variable that's a list containing string vectors (mylist).
v1 = c("a", "b", "c")
v2 = c("d", "e", "f", "g", "h")
v3 = c("x", "y", "z", "k")
df = tibble(id = seq(1:3), mylist = list(v1, v2, v3))
How can I combine the elements of mylist into a single variable for each row? I want my data to look like this:
id mylist
1 "a b c"
2 "d e f g h"
3 "x y z k"
One dplyr option could be:
df %>%
rowwise() %>%
mutate(mylist = Reduce(paste, mylist))
id mylist
<int> <chr>
1 1 a b c
2 2 d e f g h
3 3 x y z k
A base R option would be to collapse the list elements using sapply() and paste():
df$mylist <- sapply(df$mylist, paste, collapse = " ")
df
# A tibble: 3 x 2
id mylist
<int> <chr>
1 1 a b c
2 2 d e f g h
3 3 x y z k
Or, using dplyr with purrr::map_chr():
library(purrr)
library(dplyr)
df %>%
mutate(mylist = map_chr(mylist, paste, collapse = " "))
Another option is to unnest the list column and paste the values back together within each group:
library(dplyr)
library(tidyr)
library(stringr)
df %>%
# expand the dataset by unnesting the column
unnest(c(mylist)) %>%
# group by id
group_by(id) %>%
# paste the elements of mylist into a single string
summarise(mylist = str_c(mylist, collapse=' '))
# A tibble: 3 x 2
# id mylist
# <int> <chr>
#1 1 a b c
#2 2 d e f g h
#3 3 x y z k
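As a stricter variant of the sapply() option, vapply() can be used so that each list element is guaranteed to collapse to exactly one string. This is a sketch on the question's data, with the tibble rebuilt here so the block stands alone.

```r
library(tibble)

v1 <- c("a", "b", "c")
v2 <- c("d", "e", "f", "g", "h")
v3 <- c("x", "y", "z", "k")
df <- tibble(id = 1:3, mylist = list(v1, v2, v3))

# character(1) declares the expected result per element;
# vapply() errors if any element returns something else
df$mylist <- vapply(df$mylist, paste, character(1), collapse = " ")
df
```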

R group-by issue: summing over overlapping combinations of IDs

I'm using a group-by function on a dataset in R, but some IDs need to be counted in more than one group. Here is the sample dataset:
ID Var1
A 1
A 3
B 2
C 3
C 1
D 2
With a traditional group-by on each ID, I can do
DT<- data.table(dataset )
DT[,sum(Var1),by = ID]
and get the result:
ID V1
A 4
B 2
C 4
D 2
However, I have to group ID by A+B, B+C, and D
(say that F = A+B and G = B+C).
The target result dataset is below:
ID V1
F 6
G 6
D 2
If I recode ID, B can only be assigned to one group, although it has to be counted in both F and G.
Does anyone have a solution?
Many thanks!
library(dplyr)
library(tidyr)
df <- df %>% mutate(F=ifelse(ID %in% c("A", "B"), 1, 0),
G = ifelse(ID %in% c("B", "C"), 1, 0),
D = ifelse(ID == "D", 1, 0))
df %>%
gather(var, val, F:D) %>%
filter(val==1) %>%
group_by(var) %>%
summarise(V1=sum(Var1))
# # A tibble: 3 x 2
# var V1
# <chr> <dbl>
# 1 D 2
# 2 F 6
# 3 G 6
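An alternative that avoids one indicator column per group is to describe the overlapping groups as a named list and sum over each membership set. This is a base R sketch; the object names dataset, group_defs and result are illustrative, and the data is the question's sample.

```r
dataset <- data.frame(
  ID   = c("A", "A", "B", "C", "C", "D"),
  Var1 = c(1, 3, 2, 3, 1, 2)
)

# Each group lists the IDs it contains; B appears in both F and G
group_defs <- list(F = c("A", "B"), G = c("B", "C"), D = "D")

# Sum Var1 over the rows whose ID belongs to each group
sums <- vapply(
  group_defs,
  function(ids) sum(dataset$Var1[dataset$ID %in% ids]),
  numeric(1)
)

result <- data.frame(ID = names(sums), V1 = unname(sums))
result
```

Because membership is tested with %in% per group, an ID such as B contributes to every group that lists it.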

Using purrr functions to do left join and bind rows

I've built a web scraping function that takes a variety of arguments. Let's use the sample arguments for demonstration purposes.
Arguments: year, type, gender and col_types.
My function takes the referenced arguments and scrapes the data to return a df.
I am looking to join the alternate col_types to standard based on matches at the year, type, gender, name.
Then I want to bind all of the rows to one df.
Sample Data:
library(dplyr)
# Sample DF
a <- tibble(year = 2019, type = "full_year", col_types = "standard", gender = "M", name = c("a","b","c"), variable_1 = 1:3)
b <- tibble(year = 2019, type = "full_year", col_types = "alternate", gender = "M", name = c("a","b","c"), variable_2 = 1:3, variable_3 = 8:10)
c <- tibble(year = 2019, type = "full_year", col_types = "standard", gender = "F", name = c("ab","ba","ca"), variable_1 = 4:6)
d <- tibble(year = 2019, type = "full_year", col_types = "alternate", gender = "F", name = c("ab","ba","ca"), variable_2 = 1:3, variable_3 = 16:18)
e <- tibble(year = 2019, type = "last_month", col_types = "standard", gender = "M", name = c("a","b","c"), variable_1 = 1:3)
f <- tibble(year = 2019, type = "last_month", col_types = "alternate", gender = "M", name = c("a","b","c"), variable_2 = 1:3, variable_3 = 8:10)
g <- tibble(year = 2019, type = "last_month", col_types = "standard", gender = "F", name = c("ab","ba","ca"), variable_1 = 4:6)
h <- tibble(year = 2019, type = "last_month", col_types = "alternate", gender = "F", name = c("ab","ba","ca"), variable_2 = 1:3, variable_3 = 16:18)
# I know this is not going to work as it presents me with NA where I want there to be joins
df <- bind_rows(a, b, c, d, e, f, g, h)
# Adding desired output
df <- bind_rows(a, b, c, d, e, f, g, h)
m_fy_join <-
a %>%
left_join(b %>% select(-matches("col_types")))
f_fy_join <-
c %>%
left_join(d %>% select(-matches("col_types")))
m_lm_join <-
e %>%
left_join(f %>% select(-matches("col_types")))
f_lm_join <-
g %>%
left_join(h %>% select(-matches("col_types")))
# Desired Output
desired_output <- bind_rows(m_fy_join, f_fy_join, m_lm_join, f_lm_join)
What purrr function can I use to do a left_join, and then bind rows?
I don't think you necessarily need to do a join. You can bind all the tibbles together and use coalesce to get rid of the NAs (which arise due to the fact that the "standard"s don't have variable 2/3 and the "alternate"s don't have variable 1).
I think this may be the easiest given the way your data is currently arranged. But you might consider re-engineering the process (if possible) so that all the "alternate" tibbles are added to one list when created, and all the "standard" tibbles are added to another. You could then just bind_rows each list and join the two results together, rather than devising a way to manage a bunch of tibbles which are all mixed together.
library(tidyverse)
bind_rows(a, b, c, d, e, f, g, h) %>%
group_by(year, type, gender, name) %>%
summarise_at(vars(contains('variable')), reduce, coalesce)
# # A tibble: 12 x 7
# # Groups: year, type, gender [4]
# year type gender name variable_1 variable_2 variable_3
# <dbl> <chr> <chr> <chr> <int> <int> <int>
# 1 2019 full_year F ab 4 1 16
# 2 2019 full_year F ba 5 2 17
# 3 2019 full_year F ca 6 3 18
# 4 2019 full_year M a 1 1 8
# 5 2019 full_year M b 2 2 9
# 6 2019 full_year M c 3 3 10
# 7 2019 last_month F ab 4 1 16
# 8 2019 last_month F ba 5 2 17
# 9 2019 last_month F ca 6 3 18
# 10 2019 last_month M a 1 1 8
# 11 2019 last_month M b 2 2 9
# 12 2019 last_month M c 3 3 10
Edit: Thanks for showing desired output. I've checked and this output is equivalent, except for ordering and the fact that it doesn't have a col_types column.
library(dplyr)
library(purrr)
my_join_function <- function(df1, df2) {
x <- get(df1)
y <- get(df2)
left_join(x, select(y, -matches("col_types")))
}
desired_output2 <- map2_df(
.x = c("a", "c", "e", "g"),
.y = c("b", "d", "f", "h"),
.f = my_join_function
)
testthat::expect_error(testthat::expect_identical(desired_output, desired_output2))
Error: testthat::expect_identical(desired_output, desired_output2) did not throw an error.
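If the tibbles are kept in two parallel lists rather than looked up by name with get(), the join-then-bind step can be sketched with purrr::map2() followed by bind_rows(). The objects below (std_m, alt_m, and so on) are cut-down stand-ins for the question's data, and the explicit `by` columns are an assumption about the join keys.

```r
library(dplyr)
library(purrr)

std_m <- tibble(year = 2019, type = "full_year", col_types = "standard",
                gender = "M", name = c("a", "b"), variable_1 = 1:2)
alt_m <- tibble(year = 2019, type = "full_year", col_types = "alternate",
                gender = "M", name = c("a", "b"), variable_2 = 1:2, variable_3 = 8:9)
std_f <- tibble(year = 2019, type = "full_year", col_types = "standard",
                gender = "F", name = c("ab", "ba"), variable_1 = 4:5)
alt_f <- tibble(year = 2019, type = "full_year", col_types = "alternate",
                gender = "F", name = c("ab", "ba"), variable_2 = 1:2, variable_3 = 16:17)

standard  <- list(std_m, std_f)
alternate <- list(alt_m, alt_f)

# Join each standard tibble to its alternate partner, then stack the results
joined <- map2(
  standard, alternate,
  ~ left_join(.x, select(.y, -col_types),
              by = c("year", "type", "gender", "name"))
) %>%
  bind_rows()

joined
```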

Count occurrence of a categorical variable, when grouping and summarising by a different variable in R

I have a table df that looks like this:
a <- c(10,20, 20, 20, 30)
b <- c("u", "u", "u", "r", "r")
c <- c("a", "a", "b", "b", "b")
df <- data.frame(a,b,c)
I would like to create a new table that contains the mean of col a, grouped by variable c. And I would like to have a column with the counts of the occurrence of b types within each group c.
I would therefore like the result table to look like df2:
a_m <- c(15, 23.3)
c <- c("a", "b")
counts_b <-c("2 u", "1 u, 2 r")
df2 <- data.frame(a_m, c, counts_b)
What I have so far is:
df2 <- df %>% group_by(c) %>% summarise(a_m = mean(a, na.rm = TRUE))
I do not know how to add the column counts_b in the example df2.
Giulia
Here's a way using a little table magic:
df %>%
group_by(c) %>%
summarise(a_mean = mean(a),
b_list = paste(names(table(b)), table(b), collapse = ', '))
# A tibble: 2 x 3
c a_mean b_list
<fct> <dbl> <chr>
1 a 15.0 r 0, u 2
2 b 23.3 r 2, u 1
Here is another solution using reshape2. The output format may be more convenient to work with: each value of b has its own column with the number of occurrences.
library(reshape2)
out1 <- dcast(df, c ~ b, value.var="c", fun.aggregate=length)
c r u
1 a 0 2
2 b 2 1
out2 <- df %>% group_by(c) %>% summarise(a_m = mean(a))
# A tibble: 2 x 2
c a_m
<fctr> <dbl>
1 a 15.00000
2 b 23.33333
df2 <- merge(out1, out2, by="c")
c r u a_m
1 a 0 2 15.00000
2 b 2 1 23.33333
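To get a single counts_b column closer to the format in the question (counts first, no zero entries), one option is to count the (c, b) pairs and paste them per group. This is a sketch that assumes b and c are stored as character columns; the intermediate names means and counts are illustrative, and the counts appear in alphabetical order of b rather than order of appearance.

```r
library(dplyr)

df <- data.frame(
  a = c(10, 20, 20, 20, 30),
  b = c("u", "u", "u", "r", "r"),
  c = c("a", "a", "b", "b", "b"),
  stringsAsFactors = FALSE
)

# Mean of a per group of c
means <- df %>%
  group_by(c) %>%
  summarise(a_m = mean(a, na.rm = TRUE))

# count() only returns combinations that occur, so there are no "0 r" entries
counts <- df %>%
  count(c, b) %>%
  group_by(c) %>%
  summarise(counts_b = paste(n, b, collapse = ", "))

df2 <- left_join(means, counts, by = "c")
df2
```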
