dplyr concatenate column by variable value - r

I can concatenate one column of a data.frame with the code below when the column name is known.
But what if the column name is stored in a variable?
More generally, how can I specify columns by the value of a variable (with !!sym(), perhaps)?
Here is the test code:
> library(dplyr)
> packageVersion("dplyr")
[1] ‘1.0.7’
> df <- data.frame(x = 1:3, y = c("A", "B", "A"))
> df %>%
group_by(y) %>%
summarise(z = paste(x, collapse = ","))
# A tibble: 2 x 2
y z
<chr> <chr>
1 A 1,3
2 B 2
I have a variable a with the value "x". How can I do the summarise above using that variable? Passing a directly pastes the string itself rather than the column:
> a <- "x"
> df %>%
group_by(y) %>%
summarise(z = paste(a, collapse = ","))
# A tibble: 2 x 2
y z
<chr> <chr>
1 A x
2 B x
Solution-1: use !!sym()
> a <- "x"
> df %>%
group_by(y) %>%
summarise(z = paste(!!sym(a), collapse = ","))
# A tibble: 2 x 2
y z
<chr> <chr>
1 A 1,3
2 B 2
Solution-2: Rename the column to a fixed name first
> df %>%
group_by(y) %>%
rename(new_col = all_of(a)) %>%
summarise(z = paste(new_col, collapse = ","))
# A tibble: 2 x 2
y z
<chr> <chr>
1 A 1,3
2 B 2
Are there any other ways to do this?
Similar questions: https://stackoverflow.com/a/15935166/2530783, https://stackoverflow.com/a/50537209/2530783

Here are some other options -
Use .data -
library(dplyr)
a <- "x"
df %>% group_by(y) %>% summarise(z = toString(.data[[a]]))
# y z
# <chr> <chr>
#1 A 1, 3
#2 B 2
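Use get -
df %>% group_by(y) %>% summarise(z = toString(get(a)))
Use as.name -
df %>% group_by(y) %>% summarise(z = toString(!!as.name(a)))
toString(x) is equivalent to paste(x, collapse = ", "), so the separator is a comma followed by a space. Use paste(x, collapse = ",") if you need the output without the space.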
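If the same pattern is needed in several places, the .data[[a]] option can be wrapped in a small helper. The following is a minimal sketch rather than part of the original answer: collapse_by() and its argument names are hypothetical, and it assumes dplyr >= 1.0.0 so that across() and all_of() are available.

```r
library(dplyr)

df <- data.frame(x = 1:3, y = c("A", "B", "A"))

# Hypothetical helper: both column names arrive as character strings.
collapse_by <- function(data, group_col, value_col, sep = ",") {
  data %>%
    # all_of() selects the grouping column named by the string
    group_by(across(all_of(group_col))) %>%
    # .data[[...]] looks up the value column named by the string
    summarise(z = paste(.data[[value_col]], collapse = sep), .groups = "drop")
}

a <- "x"
res <- collapse_by(df, "y", a)
res
```

Keeping the column names as plain strings means the helper can be called from loops or with names read from a configuration file, without any quoting or unquoting at the call site.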

Related

turn pivot_wider() into spread()

I love the new tidyr pivot_wider() function, but it hasn't been officially added to the CRAN release yet, and I cannot install tidyr from GitHub on the server I use. How can I convert the following code to use the older spread() function?
test <- data.frame(x = c(1,1,2,2,2,2,3,3,3,4),
y = c(rep("a", 5), rep("b", 5)))
test %>%
count(x, y) %>%
group_by(x) %>%
mutate(prop = prop.table(n)) %>%
mutate(v1 = paste0(n, ' (', round(prop, 2), ')')) %>%
pivot_wider(id_cols = x, names_from = y, values_from = v1)
Desired Output:
# A tibble: 4 x 3
# Groups: x [4]
x a b
<dbl> <chr> <chr>
1 1 2 (1) NA
2 2 3 (0.75) 1 (0.25)
3 3 NA 3 (1)
4 4 NA 1 (1)
I tried this, but it is not quite right:
test %>%
count(x, y) %>%
group_by(x) %>%
mutate(prop = prop.table(n)) %>%
mutate(v1 = paste0(n, ' (', round(prop, 2), ')')) %>%
spread(y, v1) %>%
select(-n, -prop)
Any help appreciated!
One option is to remove the columns 'n' and 'prop' before the spread() call. If they are left in, spread() treats them as identifying columns too, so each distinct combination of their values gets its own row.
library(dplyr)
library(tidyr)
test %>%
count(x, y) %>%
group_by(x) %>%
mutate(prop = prop.table(n)) %>%
mutate(v1 = paste0(n, ' (', round(prop, 2), ')')) %>%
select(-n, -prop) %>%
spread(y, v1)
# A tibble: 4 x 3
# Groups: x [4]
# x a b
# <dbl> <chr> <chr>
#1 1 2 (1) <NA>
#2 2 3 (0.75) 1 (0.25)
#3 3 <NA> 3 (1)
#4 4 <NA> 1 (1)
Or using base R
tbl <- table(test)
tbl[] <- paste0(tbl, "(", prop.table(tbl, 1), ")")
You can use data.table package:
> library(data.table)
> setDT(test)[,.(n=.N),by=.(x,y)][,.(y=y,n=n,final=gsub('\\(1\\)','',paste0(n,'(',round(prop.table(n),2), ')'))),by=x]
x y n final
1: 1 a 2 2
2: 2 a 3 3(0.75)
3: 2 b 1 1(0.25)
4: 3 b 3 3
5: 4 b 1 1
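The effect described in the spread() answer above, where leftover columns produce extra rows, can be checked directly by comparing row counts. This is a small sketch using the question's data; the object names labelled, wide_extra and wide_ok are made up for illustration, and ungroup() is added here only to keep the comparison simple.

```r
library(dplyr)
library(tidyr)

test <- data.frame(x = c(1, 1, 2, 2, 2, 2, 3, 3, 3, 4),
                   y = c(rep("a", 5), rep("b", 5)))

labelled <- test %>%
  count(x, y) %>%
  group_by(x) %>%
  mutate(prop = prop.table(n),
         v1 = paste0(n, " (", round(prop, 2), ")")) %>%
  ungroup()

# n and prop kept: every distinct (x, n, prop) combination gets its own row
wide_extra <- labelled %>% spread(y, v1)

# n and prop dropped first: one row per value of x
wide_ok <- labelled %>% select(-n, -prop) %>% spread(y, v1)

nrow(wide_extra)
nrow(wide_ok)
```

Comparing the two row counts shows whether the identifying columns were reduced to just x before spreading.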

convert R list to tibble() - purrr or better option?

library(tidyverse)
library(purrr)
x <- c(20, 30, 58)
n <- 100
mylist <- data_frame(x = c(0, x), n) %>%
distinct() %>%
filter(x >= 0 & x < n) %>%
arrange(x) %>%
bind_rows(data_frame(x = n)) %>%
mutate(lag_x = lag(x)) %>%
mutate(y = x - lag_x) %>%
filter(!is.na(y)) %>%
summarise(n = list(rep(row_number(), y))) %>%
pull(n)
What's the best way to convert the list above into a tibble? purrr maybe? I am actually going to use this list inside of a mutate call, to add said list as a column to another tibble.
# A tibble: 100 x 1
grp
<dbl>
1 1
2 1
3 1
4 1
etc...
Use unnest() and rename():
library(tidyverse)
x <- c(20, 30, 58)
n <- 100
data_frame(x = c(0, x), n) %>%
distinct() %>%
filter(x >= 0 & x < n) %>%
arrange(x) %>%
bind_rows(data_frame(x = n)) %>%
mutate(lag_x = lag(x)) %>%
mutate(y = x - lag_x) %>%
filter(!is.na(y)) %>%
summarise(n = list(rep(row_number(), y))) %>%
unnest(n) %>%
rename(grp = n)
## # A tibble: 100 x 1
## grp
## <int>
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
## 7 1
## 8 1
## 9 1
## 10 1
## # ... with 90 more rows
I would use a combination of tibble and unlist. This way:
new_tibble <- tibble(grp = unlist(mylist))
##if you want to add it as column to a data frame, here is how I'd do it
mock_df <- tibble(x = rnorm(100),
y = rnorm(100))
mock_df %>% mutate(grp = unlist(mylist))
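If only the grp column is needed, the same group ids can also be built without the intermediate list. The sketch below is an alternative to the pipeline in the question, not a restatement of either answer; it assumes the cut points in x are sorted, unique, and strictly between 0 and n, and the names breaks, widths and grp_tbl are illustrative.

```r
library(tibble)

x <- c(20, 30, 58)
n <- 100

# Interval boundaries: 0, the cut points, and the total length
breaks <- c(0, x, n)

# Number of rows that fall in each interval
widths <- diff(breaks)

# One group id per interval, repeated once per row of that interval
grp_tbl <- tibble(grp = rep(seq_along(widths), times = widths))
grp_tbl
```

Because rep() returns a plain vector, the same expression can be used directly inside mutate() on a data frame with n rows.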

Identify subsets containing only repeats of an expression

I have a dataset like so:
df<-data.frame(x=c("A","A","A","A", "B","B","B","B","B",
"C","C","C","C","C","D","D","D","D","D"),
y= as.factor(c(rep("Eoissp2",4),rep("Eoissp1",5),"Eoissp1","Eoisp4","Automerissp1","Automerissp2","Acharias",rep("Eoissp2",3),rep("Eoissp1",2))))
I want to identify the subsets of x whose values in y all contain the expression Eois. A, B, and D should be returned in a vector because every value of y in those groups contains Eois, while C contains a mix of different values (e.g. Eois, Automeris and Acharias). For this example the output would be:
output<- c("A", "B", "D")
Using new df:
> df %>% filter(str_detect(y,"Eois")) %>% group_by(x) %>% distinct(y) %>%
count() %>% filter(n==1) %>% select(x)
# A tibble: 2 x 1
# Groups: x [2]
x
<fct>
1 A
2 B
(Answer below uses the original df posted by the question author.)
Using the pipe function in magrittr & functions from dplyr:
> df %>% group_by(x) %>% distinct(y)
# A tibble: 7 x 2
# Groups: x [3]
x y
<fct> <fct>
1 A plant1a
2 B plant1b
3 C plant1a
4 C plant2a
5 C plant3a
6 C plant4a
7 C plant5a
Then you can roll up the results like this:
> results <- df %>% group_by(x) %>% distinct(y) %>%
count() %>% filter(n==1) %>% select(x)
> results
# A tibble: 2 x 1
# Groups: x [2]
x
<fct>
1 A
2 B
If you know your original data frame is always going to come with the x's in order, you can drop the group_by part.
A dplyr-based solution could be:
library(dplyr)
df %>% group_by(x) %>%
filter(grepl("Eoiss", y)) %>%
mutate(y = sub("\\d+", "", y)) %>%
filter(n() >1 & length(unique(y)) == 1) %>%
select(x) %>% unique(.)
# A tibble: 3 x 1
# Groups: x [3]
# x
# <fctr>
#1 A
#2 B
#3 D
Data
df<-data.frame(x=c("A","A","A","A", "B","B","B","B","B",
"C","C","C","C","C","D","D","D","D","D"),
y= as.factor(c(rep("Eoissp2",4),
rep("Eoissp1",5),"Eoissp1","Eoisp4","Automerissp1","Automerissp2",
"Acharias",rep("Eoissp2",3),rep("Eoissp1",2))))
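If the requirement is read as "keep the groups in which every value of y contains Eois", it can be expressed with a single all() per group. This is a minimal sketch under that reading, which is an assumption about the question's intent; the column name all_eois is made up for illustration.

```r
library(dplyr)

df <- data.frame(
  x = c("A", "A", "A", "A", "B", "B", "B", "B", "B",
        "C", "C", "C", "C", "C", "D", "D", "D", "D", "D"),
  y = as.factor(c(rep("Eoissp2", 4), rep("Eoissp1", 5),
                  "Eoissp1", "Eoisp4", "Automerissp1", "Automerissp2", "Acharias",
                  rep("Eoissp2", 3), rep("Eoissp1", 2)))
)

output <- df %>%
  group_by(x) %>%
  # TRUE only when every y in the group matches the pattern
  summarise(all_eois = all(grepl("Eois", as.character(y))), .groups = "drop") %>%
  filter(all_eois) %>%
  pull(x) %>%
  as.character()

output
```

Unlike the approach that counts distinct values, this does not require the values within a group to be identical, only that each one matches the pattern.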

Are there any disadvantages for doing a mutate + filter vs summarise on a grouped data frame?

In dplyr 0.5.0, calling summarise on a grouped data frame does not guarantee any resultant row order (currently it reorders the rows by group; I am not sure how it handles duplicate grouping levels).
To get around this, I would like to replace all summarise(x = ...) operations with mutate(x = ...) %>% filter(row_number() == 1). Are there any disadvantages or drawbacks to doing this?
Example of the two operations.
tmp_df <-
data.frame(group = rep(c(2L, 1L), each = 5), b = rep(c(-1, 1), each = 5)) %>%
group_by(group)
tmp_df %>%
summarise(b = sum(b))
tmp_df %>%
mutate(b = sum(b)) %>%
filter(row_number() == 1)
producing:
> tmp_df %>%
+ summarise(b = sum(b))
# A tibble: 2 × 2
group b
<int> <dbl>
1 1 5
2 2 -5
> tmp_df %>%
+ mutate(b = sum(b)) %>%
+ filter(row_number() == 1)
Source: local data frame [2 x 2]
Groups: group [2]
group b
<int> <dbl>
1 2 -5
2 1 5
EDIT: In response to a comment, for readability I can define the function:
summarise_o <- function (.data, ...) {
# order preserving summarise
mutate_(.data, .dots = lazyeval::lazy_dots(...)) %>%
filter(row_number() == 1) %>%
return
}
and simply call:
tmp_df %>%
summarise_o(b = sum(b))
One option is to make 'group' a factor whose levels follow the order of first appearance, so that summarise() returns the groups in that order:
tmp_df <- data.frame(group = rep(c(2L, 1L), each = 5), b = rep(c(-1, 1), each = 5)) %>%
group_by(group = factor(group, levels = unique(group)))
tmp_df %>%
summarise(b = sum(b))
# A tibble: 2 x 2
# group b
# <fctr> <dbl>
#1 2 -5
#2 1 5
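Another way to get the original order back, without converting the grouping column, is to sort the summarised result by the position at which each group first appears. This is a sketch of that idea rather than something taken from the answer above; first_seen and res are illustrative names.

```r
library(dplyr)

tmp_df <- data.frame(group = rep(c(2L, 1L), each = 5),
                     b = rep(c(-1, 1), each = 5))

# Group values in the order they first occur in the data
first_seen <- unique(tmp_df$group)

res <- tmp_df %>%
  group_by(group) %>%
  summarise(b = sum(b)) %>%
  # match() gives each group's rank of first appearance
  arrange(match(group, first_seen))

res
```

This keeps group as an integer column, which may matter if later steps join on it or do arithmetic with it.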

tidyverse - preferred way to turn a named vector into a data.frame/tibble

Using the tidyverse a lot, I often face the challenge of turning named vectors into a data.frame/tibble whose columns are the names of the vector.
What is the preferred/tidyversey way of doing this?
EDIT: This is related to: this and this GitHub issue
So I want:
require(tidyverse)
vec <- c("a" = 1, "b" = 2)
to become this:
# A tibble: 1 × 2
a b
<dbl> <dbl>
1 1 2
I can do this via e.g.:
vec %>% enframe %>% spread(name, value)
vec %>% t %>% as_tibble
Use case example:
require(tidyverse)
require(rvest)
txt <- c('<node a="1" b="2"></node>',
'<node a="1" c="3"></node>')
txt %>% map(read_xml) %>% map(xml_attrs) %>% map_df(~t(.) %>% as_tibble)
Which gives
# A tibble: 2 × 3
a b c
<chr> <chr> <chr>
1 1 2 <NA>
2 1 <NA> 3
This is now directly supported using bind_rows (introduced in dplyr 0.7.0):
library(tidyverse)
vec <- c("a" = 1, "b" = 2)
bind_rows(vec)
#> # A tibble: 1 x 2
#> a b
#> <dbl> <dbl>
#> 1 1 2
This quote from https://cran.r-project.org/web/packages/dplyr/news.html explains the change:
bind_rows() and bind_cols() now accept vectors. They are treated as rows by the former and columns by the latter. Rows require inner names like c(col1 = 1, col2 = 2), while columns require outer names: col1 = c(1, 2). Lists are still treated as data frames but can be spliced explicitly with !!!, e.g. bind_rows(!!! x) (#1676).
With this change, it means that the following line in the use case example:
txt %>% map(read_xml) %>% map(xml_attrs) %>% map_df(~t(.) %>% as_tibble)
can be rewritten as
txt %>% map(read_xml) %>% map(xml_attrs) %>% map_df(bind_rows)
which is also equivalent to
txt %>% map(read_xml) %>% map(xml_attrs) %>% { bind_rows(!!! .) }
The equivalence of the different approaches is demonstrated in the following example:
library(tidyverse)
library(rvest)
txt <- c('<node a="1" b="2"></node>',
'<node a="1" c="3"></node>')
temp <- txt %>% map(read_xml) %>% map(xml_attrs)
# x, y, and z are identical
x <- temp %>% map_df(~t(.) %>% as_tibble)
y <- temp %>% map_df(bind_rows)
z <- bind_rows(!!! temp)
identical(x, y)
#> [1] TRUE
identical(y, z)
#> [1] TRUE
z
#> # A tibble: 2 x 3
#> a b c
#> <chr> <chr> <chr>
#> 1 1 2 <NA>
#> 2 1 <NA> 3
The idiomatic way would be to splice the vector with !!! within a tibble() call so the named vector elements become column definitions:
library(tibble)
vec <- c("a" = 1, "b" = 2)
tibble(!!!vec)
#> # A tibble: 1 x 2
#> a b
#> <dbl> <dbl>
#> 1 1 2
Created on 2019-09-14 by the reprex package (v0.3.0)
This works for me: c("a" = 1, "b" = 2) %>% t() %>% tbl_df()
Interestingly you can use the as_tibble() method for lists to do this in one call. Note that this isn't best practice since this isn't an exported method.
tibble:::as_tibble.list(vec)
as_tibble(as.list(c(a=1, b=2)))
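For readers on a recent tibble, there is also an exported function intended for this conversion. The sketch below assumes tibble >= 3.0.0, where as_tibble_row() is available; one_row is an illustrative name.

```r
library(tibble)

vec <- c(a = 1, b = 2)

# Each named element becomes one column of a single-row tibble
one_row <- as_tibble_row(vec)
one_row
```

Since this is an exported function, it avoids reaching into the package namespace with ::: as in the first line above.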
