collapsing strings with summarise_all [duplicate] - r

I have the following data:
df = data.frame(
id = c("anton", "anton", "charly", "charly", "klaus", "klaus"),
fruits=c("apple", "cherry", "pear", "pear", "apple", "pear"),
number=c(1,4,1,2,3,5))
id fruits number
1 anton apple 1
2 anton cherry 4
3 charly pear 1
4 charly pear 2
5 klaus apple 3
6 klaus pear 5
desired outcome:
id fruits number
1 anton apple, cherry 1, 4
2 charly pear, pear 1, 2
3 klaus apple, pear 3, 5
it works with
library(dplyr)
df.wide <- df %>%
group_by(id) %>%
summarise_all(funs(toString(na.omit(.))))
but I get the warning
"funs() is deprecated as of dplyr 0.8.0. Please use a list of either
functions or lambdas:
Simple named list:
list(mean = mean, median = median)
Auto named with tibble::lst():
tibble::lst(mean, median)
Using lambdas
list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))".
How could I reproduce it? 'list' and then 'unnest'? (tried it, but cannot wrap my head around it how to unnest all columns)

Note that summarise_all() is also superseded as of dplyr 1.0.0. Instead, you can use across() together with a purrr-style lambda function:
df = data.frame(
id = c("anton", "anton", "charly", "charly", "klaus", "klaus"),
fruits=c("apple", "cherry", "pear", "pear", "apple", "pear"),
number=c(1,4,1,2,3,5))
library(dplyr)
df.wide <- df %>%
group_by(id) %>%
summarise_all(funs(toString(na.omit(.))))
df.wide
#> # A tibble: 3 x 3
#> id fruits number
#> <chr> <chr> <chr>
#> 1 anton apple, cherry 1, 4
#> 2 charly pear, pear 1, 2
#> 3 klaus apple, pear 3, 5
df_new <- df %>%
group_by(id) %>%
summarise(across(everything(), ~toString(na.omit(.))))
df_new
#> # A tibble: 3 x 3
#> id fruits number
#> <chr> <chr> <chr>
#> 1 anton apple, cherry 1, 4
#> 2 charly pear, pear 1, 2
#> 3 klaus apple, pear 3, 5
Created on 2020-09-25 by the reprex package (v0.3.0)

You can either spell out each column with paste(), or use across() in combination with everything():
df %>%
group_by(id) %>%
summarise(fruits = paste(fruits, collapse = ", "),
number = paste(number, collapse = ", "))
df %>%
group_by(id) %>%
summarise(across(everything(), ~paste(., collapse = ", ")))
which yields
id fruits number
<chr> <chr> <chr>
1 anton apple, cherry 1, 4
2 charly pear, pear 1, 2
3 klaus apple, pear 3, 5
for examples on how to use these new functions, see: https://www.tidyverse.org/blog/2020/04/dplyr-1-0-0-colwise/
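
As a minimal sketch of the same idea, across() can also be limited to named columns and given an ordinary function instead of a formula lambda. This assumes dplyr >= 1.0.0; the name df_collapsed is only illustrative.

```r
library(dplyr)

df <- data.frame(
  id     = c("anton", "anton", "charly", "charly", "klaus", "klaus"),
  fruits = c("apple", "cherry", "pear", "pear", "apple", "pear"),
  number = c(1, 4, 1, 2, 3, 5)
)

# Collapse only the listed columns; the grouping column `id` is kept as is.
df_collapsed <- df %>%
  group_by(id) %>%
  summarise(across(c(fruits, number), function(x) toString(na.omit(x))))

df_collapsed
```

Listing the columns explicitly is useful when the data frame holds other columns that should not be collapsed into strings.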

Related

R how to create a new column that summarizes the total of a string detect

I have a data frame in R that looks like this:
structure(list(items = c("Apple", "Apple, Pear", "Apple, Pear, Banana"
)), row.names = c(NA, -3L), class = "data.frame")
I would like to create new columns for each item in the "items" column and count the frequency of each item. For example, I want to create an "Apple" column that contains the frequency of "Apple" in the "items" column, a "Pear" column that contains the frequency of "Pear" in the "items" column, and so on.
The final data frame should look like this:
structure(list(items = c("Apple", "Apple, Pear", "Apple, Pear, Banana"
), Apple = c(3, 3, 3), Pear = c(2, 2, 2), Banana = c(1, 1, 1)), row.names = c(NA,
-3L), class = "data.frame")
I have tried using the mutate() and str_count() functions from the dplyr and stringr packages, but I'm not sure how to get the final data frame that I want.
Here is the code that I have tried so far:
items %>%
mutate(Apple = str_count(items, "Apple"),
Pear = str_count(items, "Pear"),
Banana = str_count(items, "Banana"))
This gets me part way there, but I'm not sure how to create a new column for each item and count the frequency of each item. Can someone help me figure out how to do this in R?
You can wrap str_count with sum:
items %>%
mutate(Apple = sum(str_count(items, "Apple")),
Pear = sum(str_count(items, "Pear")),
Banana = sum(str_count(items, "Banana")))
items Apple Pear Banana
1 Apple 3 2 1
2 Apple, Pear 3 2 1
3 Apple, Pear, Banana 3 2 1
This is especially useful when you have many rows and values.
Here is a solution that separates the rows, counts the items, combines the counts with the original data using cbind, and finally pivots wider and fills the NAs:
library(dplyr)
library(tidyr)
df %>%
separate_rows(items, sep='\\,') %>%
count(items1 = trimws(items)) %>%
cbind(df) %>%
pivot_wider(names_from = items1, values_from = n) %>%
fill(-items, .direction = "downup")
items Apple Banana Pear
<chr> <int> <int> <int>
1 Apple 3 1 2
2 Apple, Pear 3 1 2
3 Apple, Pear, Banana 3 1 2
Using map - loop over the words of interest, and transmute to return a single column with the count of the word in the items column and bind the output to the original data
library(purrr)
library(dplyr)
library(stringr)
map_dfc(c("Apple", "Pear", "Banana"), ~ df1 %>%
transmute(!! .x := sum(str_count(items, .x)))) %>%
bind_cols(df1, .)
-output
items Apple Pear Banana
1 Apple 3 2 1
2 Apple, Pear 3 2 1
3 Apple, Pear, Banana 3 2 1
Or another option is to split the column 'items', use mtabulate and cbind the columns after getting the colSums
library(qdapTools)
cbind(df1, as.list(colSums(mtabulate(strsplit(df1$items, ",\\s*")))))
items Apple Banana Pear
1 Apple 3 1 2
2 Apple, Pear 3 1 2
3 Apple, Pear, Banana 3 1 2
You can try the following,
library(tidyverse)
df <- structure(list(items = c(
"Apple", "Apple, Pear", "Apple, Pear, Banana"
)),
row.names = c(NA,-3L),
class = "data.frame")
total_count <- function(x, word) {
paste0(x, collapse = ", ") %>%
stringr::str_count(word)
}
df %>%
mutate(Apple = total_count(items, "Apple"),
Pear = total_count(items, "Pear"),
Banana = total_count(items, "Banana"))
#> items Apple Pear Banana
#> 1 Apple 3 2 1
#> 2 Apple, Pear 3 2 1
#> 3 Apple, Pear, Banana 3 2 1
Created on 2023-01-04 with reprex v2.0.2
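
If the item names are not known in advance, a base R sketch along these lines may help. It assumes the items are separated by a comma plus optional whitespace, and the names tab and counts are only illustrative.

```r
df <- data.frame(
  items = c("Apple", "Apple, Pear", "Apple, Pear, Banana"),
  stringsAsFactors = FALSE
)

# Split every row into single items and tabulate them over the whole column.
tab <- table(unlist(strsplit(df$items, ",\\s*")))
counts <- setNames(as.integer(tab), names(tab))

# Add one column per item; the single total is recycled to every row.
for (item in names(counts)) {
  df[[item]] <- counts[[item]]
}

df
```

The new columns come out in alphabetical order (Apple, Banana, Pear) because table() sorts its names.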

Compare overlap of groups pairwise using tidyverse

I have a tidy data.frame in this format:
library(tidyverse)
df = data.frame(name = c("Clarence","Clarence","Clarence","Shelby","Shelby", "Patricia","Patricia"), fruit = c("Apple", "Banana", "Grapes", "Apple", "Apricot", "Banana", "Grapes"))
df
# name fruit
#1 Clarence Apple
#2 Clarence Banana
#3 Clarence Grapes
#4 Shelby Apple
#5 Shelby Apricot
#6 Patricia Banana
#7 Patricia Grapes
I want to compare the overlaps between groups in a pairwise manner (i.e. if both people have an apple that counts as an overlap of 1) so that I end up with a dataframe that looks like this:
df2 = data.frame(names = c("Clarence-Shelby", "Clarence-Patricia", "Shelby-Patricia"), n_overlap = c(1, 2, 0))
df2
# names n_overlap
#1 Clarence-Shelby 1
#2 Clarence-Patricia 2
#3 Shelby-Patricia 0
Is there an elegant way to do this in the tidyverse framework? My real dataset is much larger than this and will be grouped on multiple columns.
If the 0 overlap is not important, a solution is:
> df %>% inner_join(df,by="fruit") %>% filter(name.x<name.y) %>% count(name.x,name.y)
name.x name.y n
1 Clarence Patricia 2
2 Clarence Shelby 1
If you really need non-overlapping pairs:
> a = df %>% inner_join(df,by="fruit") %>% filter(name.x<name.y) %>% count(name.x,name.y)
> b = as.data.frame(t(combn(sort(unique(df$name)),2)))
> colnames(b)=colnames(a)[1:2]
> a %>% full_join(b) %>% replace_na(list(n=0))
Joining, by = c("name.x", "name.y")
name.x name.y n
1 Clarence Patricia 2
2 Clarence Shelby 1
3 Patricia Shelby 0
Try this,
combinations <- apply(combn(unique(df$name), 2), 2, function(z) paste(sort(z), collapse = "-"))
combinations
# [1] "Clarence-Shelby" "Clarence-Patricia" "Patricia-Shelby"
library(dplyr)
df %>%
group_by(fruit) %>%
summarize(names = paste(sort(unique(name)), collapse = "-")) %>%
right_join(tibble(names = combinations), by = "names") %>%
group_by(names) %>%
summarize(n_overlap = sum(!is.na(fruit)))
# # A tibble: 3 x 2
# names n_overlap
# <chr> <int>
# 1 Clarence-Patricia 2
# 2 Clarence-Shelby 1
# 3 Patricia-Shelby 0
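
For comparison, here is a base R sketch (not taken from the answers above) that counts the overlap directly with intersect(), so pairs with no shared fruit get 0 without a join. The names fruits_by_name, pairs and overlap are only illustrative.

```r
df <- data.frame(
  name  = c("Clarence", "Clarence", "Clarence", "Shelby", "Shelby",
            "Patricia", "Patricia"),
  fruit = c("Apple", "Banana", "Grapes", "Apple", "Apricot",
            "Banana", "Grapes"),
  stringsAsFactors = FALSE
)

# One character vector of fruits per person (names sorted alphabetically).
fruits_by_name <- split(df$fruit, df$name)

# Every unordered pair of people, one pair per column.
pairs <- combn(names(fruits_by_name), 2)

# Count the fruits shared by each pair.
overlap <- data.frame(
  names = paste(pairs[1, ], pairs[2, ], sep = "-"),
  n_overlap = vapply(seq_len(ncol(pairs)), function(i) {
    length(intersect(fruits_by_name[[pairs[1, i]]],
                     fruits_by_name[[pairs[2, i]]]))
  }, integer(1))
)

overlap
```

Because it compares sets pair by pair, this also copes with a fruit that is shared by three or more people.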

Separate multi-value obs with pairs of values and count

I have a data frame combining single-value and multi-value observations.
dataset <- c("Apple;Banana;Kiwi", "orange", "Apple;Banana", "orange" )
dataset <- as.data.frame(dataset)
My output :
dataset
1 Apple;Banana;Kiwi
2 orange
3 Apple;Banana
4 orange
What I want: split each observation into all pairwise combinations of its values, in two columns, and count them so that I can make a graph
from   | to     | weight
Apple  | Banana | 2
Apple  | Kiwi   | 1
Banana | Kiwi   | 1
orange | NA     | 2
What I tried :
dataset2 <- dataset %>%
separate_rows(dataset, sep = ";")
We may use combn on each row and get the frequency
stack(table(unlist(lapply(strsplit(dataset$dataset, ";"),
function(x) if(length(x) > 1) combn(x, 2, FUN = toString) else x))))[2:1]
-output
ind values
1 Apple, Banana 2
2 Apple, Kiwi 1
3 Banana, Kiwi 1
4 orange 2
You could do:
library(dplyr)
result <-
do.call(rbind, lapply(strsplit(dataset$dataset, ';'), function(x) {
if(length(x) == 1) return(c(x, NA_character_))
do.call(rbind, lapply(1:(length(x) - 1), function(i) c(x[i], x[i+1])))
}))
as.data.frame(table(paste(result[,1], result[,2]))) %>%
tidyr::separate(Var1, into = c('from', 'to'), sep = ' ') %>%
mutate(to = ifelse(to == 'NA', NA, to),
weight = Freq) %>%
select(-Freq)
#> from to weight
#> 1 Apple Banana 2
#> 2 Banana Kiwi 1
#> 3 orange <NA> 2
Another possible solution:
library(tidyverse)
pmap(dataset, ~ if (str_detect(.x, ";"))
{combn(.x %>% str_split(";") %>% unlist, 2, str_c, collapse=";")} else {.x}) %>%
map_dfr(data.frame) %>%
separate(1, ";", into = c("from", "to"), fill = "right") %>%
count(from, to, name = "weight")
#> from to weight
#> 1 Apple Banana 2
#> 2 Apple Kiwi 1
#> 3 Banana Kiwi 1
#> 4 orange <NA> 2
Or without purrr:
library(tidyverse)
dataset %>%
rowwise %>%
mutate(from = ifelse(str_detect(dataset, ";"), combn(dataset %>%
str_split(";") %>% unlist, 2, str_c, collapse=";") %>% list,
list(dataset))) %>%
unnest_longer(from) %>%
separate(from, ";", into = c("from", "to"), fill = "right") %>%
count(from, to, name = "weight")
#> # A tibble: 4 × 3
#> from to weight
#> <chr> <chr> <int>
#> 1 Apple Banana 2
#> 2 Apple Kiwi 1
#> 3 Banana Kiwi 1
#> 4 orange <NA> 2

How to subset list by ID for specific output

Say I have a list of people who used the cafeteria.
Fruits ID Date
apple 1 100510
apple 2 100710
banana 2 110710
banana 1 120910
kiwi 2 120710
apple 3 100210
kiwi 3 110810
I want to select the people who took both apple and banana, so that my new dataset contains only the people who meet this inclusion criterion:
ID
1
2
(because only ID 1 and 2 had both apple and banana in the dataset)
what code should I use in R?
In base R you could do something like
data.frame(ID = names(which(sapply(split(df$Fruits, df$ID), function(x) {
"apple" %in% x & "banana" %in% x
}))))
#> ID
#> 1 1
#> 2 2
This gives you the IDs that have both "apple" and "banana".
If you want the subset of the data frame containing these rows, you can do:
df[df$ID %in% names(which(sapply(split(df$Fruits, df$ID), function(x) {
"apple" %in% x & "banana" %in% x
}))),]
#> Fruits ID Date
#> 1 apple 1 100510
#> 2 apple 2 100710
#> 3 banana 2 110710
#> 4 banana 1 120910
#> 5 kiwi 2 120710
You can use dplyr package to check if any of the entries is an apple AND any of the entries is a banana per ID:
library(dplyr)
df <- data.frame(Fruits = c("apple", "apple", "banana", "banana", "kiwi", "apple", "kiwi"),
ID = c(1,2,2,1,2,3,3),
Date = c(100510,100710,110710,120910, 120710,100210,110810))
df %>%
group_by(ID) %>%
filter(any(Fruits == "apple") & any(Fruits == "banana")) %>%
ungroup() %>%
select(ID) %>%
distinct()
In this case, the result is
# A tibble: 2 x 1
ID
<dbl>
1 1
2 2
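
If the list of required fruits may grow, the two any() checks can be folded into one all(... %in% ...) test. This is a sketch under that assumption; the names required and ids are only illustrative.

```r
library(dplyr)

df <- data.frame(
  Fruits = c("apple", "apple", "banana", "banana", "kiwi", "apple", "kiwi"),
  ID     = c(1, 2, 2, 1, 2, 3, 3)
)

# Fruits that every selected ID must have taken at least once.
required <- c("apple", "banana")

ids <- df %>%
  group_by(ID) %>%
  filter(all(required %in% Fruits)) %>%
  ungroup() %>%
  distinct(ID)

ids
```

Adding another fruit to required tightens the criterion without touching the pipeline.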

How to left_join() two datasets but only select specific columns from one of the datasets?

Here are two datasets: (this is fake data)
library(tidyverse)
myfruit <- tibble(fruit_name = c("apple", "pear", "banana", "cherry"),
number = c(2, 4, 6, 8))
fruit_info <- tibble(fruit_name = c("apple", "pear", "banana", "cherry"),
colour = c("red", "green", "yellow", "d.red"),
batch_number = c(4, 4, 4, 4),
type = c("gala", "conference", "cavendish", "bing"),
weight = c(10, 11, 12, 13),
age_days = c(20, 22, 24, 16))
> myfruit
# A tibble: 4 x 2
fruit_name number
<chr> <dbl>
1 apple 2
2 pear 4
3 banana 6
4 cherry 8
> fruit_info
# A tibble: 4 x 6
fruit_name colour batch_number type weight age_days
<chr> <chr> <dbl> <chr> <dbl> <dbl>
1 apple red 4 gala 10 20
2 pear green 4 conference 11 22
3 banana yellow 4 cavendish 12 24
4 cherry d.red 4 bing 13 16
I want to use the dplyr::left_join() function to combine myfruit and fruit_info together, but I only want "batch_number" and "type" columns from fruit_info.
I know I can do this:
new_myfruit <- left_join(myfruit, fruit_info, by = "fruit_name") %>%
select(fruit_name, number, batch_number, type)
# OR
new_myfruit <- left_join(myfruit, fruit_info, by = "fruit_name") %>%
select(-colour, -weight, -age_days)
But if I had many more columns in fruit_info and I had to type in many column names into the select() function it would be very time-consuming. So, is there a more efficient way to do this?
Edit:
I've seen examples online where you can do something like this:
new_myfruits <- left_join(myfruit,
fruit_info %>% select(batch_number, type),
by = "fruit_name")
But I get an error which says:
# Error: Join columns must be present in data.
# x Problem with `fruit_name`.
Anyone know what I'm doing wrong?
I would appreciate any help :)
Does this work:
> myfruit %>% left_join(
+ fruit_info %>% select(1,3,4)
+ )
Joining, by = "fruit_name"
# A tibble: 4 x 4
fruit_name number batch_number type
<chr> <dbl> <dbl> <chr>
1 apple 2 4 gala
2 pear 4 4 conference
3 banana 6 4 cavendish
4 cherry 8 4 bing
>
I believe you're getting that error because your select statement does not include "fruit_name", so it can't match on that column. Try new_myfruits <- left_join(myfruit, fruit_info %>% select(fruit_name, batch_number, type), by = "fruit_name") instead.
Try this. You can combine select() with contains() and in the last function add the tags you want to extract, so there is no need of setting each name individually or by column number. Here the code:
library(dplyr)
#Code
newdf <- left_join(myfruit, fruit_info, by = "fruit_name") %>%
select(contains(c('fruit','number','type')))
Output:
# A tibble: 4 x 4
fruit_name number batch_number type
<chr> <dbl> <dbl> <chr>
1 apple 2 4 gala
2 pear 4 4 conference
3 banana 6 4 cavendish
4 cherry 8 4 bing
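
Another option, sketched here under the assumption that the wanted columns are held in a character vector, is to select the join key together with those columns using all_of() before joining. The vector name wanted is only illustrative.

```r
library(dplyr)

myfruit <- tibble(fruit_name = c("apple", "pear", "banana", "cherry"),
                  number = c(2, 4, 6, 8))
fruit_info <- tibble(fruit_name = c("apple", "pear", "banana", "cherry"),
                     colour = c("red", "green", "yellow", "d.red"),
                     batch_number = c(4, 4, 4, 4),
                     type = c("gala", "conference", "cavendish", "bing"),
                     weight = c(10, 11, 12, 13),
                     age_days = c(20, 22, 24, 16))

# Columns to bring over from fruit_info; the join key is added separately.
wanted <- c("batch_number", "type")

new_myfruit <- left_join(
  myfruit,
  select(fruit_info, all_of(c("fruit_name", wanted))),
  by = "fruit_name"
)

new_myfruit
```

all_of() raises an error if a listed column is missing, which makes a typo in wanted show up immediately.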

Resources