Summarise multiple variables to strings in dplyr - r

I wish to summarize two variables in string. Let's say this is my id
#visit
id source1 source2
1 a t
2 c l
3 c z
1 b x
second dataset:
#transaction
id transactions
1 1
3 2
1 2
I'd like to join these data together but convert them to string at the same time:
I can do for one variable ( let's say source 1):
library(dplyr)
%>% left_join(visit, transaction, by="id")
%>% group_by( id)
%>% summarise( Source = toString(unique(source1)), transactions = toString(unique(transactions)) )
This gives me the following output:
id source transactions
1 a,b 1,2
2 c NA
3 c 2
But I wish to summarize for two variables: So my desire output would be something like that:
id source transactions
1 a,t > b,x 1,2
2 c,l NA
3 c,z 2

You can paste the two variables together, using both sep and collapse to combine:
visit %>% left_join(transaction) %>%
group_by(id) %>%
summarise(source = paste(unique(source1), unique(source2), sep = ', ', collapse = ' > '),
transaction = na_if(toString(unique(na.omit(transactions))), ''))
## # A tibble: 3 × 3
## id source transaction
## <int> <chr> <chr>
## 1 1 a, t > b, x 1, 2
## 2 2 c, l <NA>
## 3 3 c, z 2
Beware, though; paste and toString stupidly coerce NAs to strings. You may want to wrap in na.omit or use na_if.

Related

count number of combinations by group

I am struggling to count the number of unique combinations in my data. I would like to first group them by the id and then count, how many times combination of each values occurs. here, it does not matter if the elements are combined in 'd-f or f-d, they still belongs in teh same category, as they have same element:
combinations:
n
c-f: 2 # aslo f-c
c-d-f: 1 # also cfd or fdc
d-f: 2 # also f-d or d-f. The dash is only for isualization purposes
Dummy example:
# my data
dd <- data.frame(id = c(1,1,2,2,2,3,3,4, 4, 5,5),
cat = c('c','f','c','d','f','c','f', 'd', 'f', 'f', 'd'))
> dd
id cat
1 1 c
2 1 f
3 2 c
4 2 d
5 2 f
6 3 c
7 3 f
8 4 d
9 4 f
10 5 f
11 5 d
Using paste is a great solution provided by #benson23, but it considers as unique category f-d and d-f. I wish, however, that the order will not matter. Thank you!
Create a "combination" column in summarise, we can count this column afterwards.
An easy way to count the category is to order them at the beginning, then in this case they will all be in the same order.
library(dplyr)
dd %>%
group_by(id) %>%
arrange(id, cat) %>%
summarize(combination = paste0(cat, collapse = "-"), .groups = "drop") %>%
count(combination)
# A tibble: 3 x 2
combination n
<chr> <int>
1 c-d-f 1
2 c-f 2
3 d-f 2

Separate rows with conditions

I have this dataframe separate_on_condition with two columns:
separate_on_condition <- data.frame(first = 'a3,b1,c2', second = '1,2,3,4,5,6')`
# first second
# 1 a3,b1,c2 1,2,3,4,5,6
How can I turn it to:
# A tibble: 6 x 2
first second
<chr> <chr>
1 a 1
2 a 2
3 a 3
4 b 4
5 c 5
6 c 6
where:
a3 will be separated into 3 rows
b1 into 1 row
c2 into 2 rows
Is there a better way on achieving this instead of using rep() on first column and separate_rows() on the second column?
Any help would be much appreciated!
Create a row number column to account for multiple rows.
Split second column on , in separate rows.
For each row extract the data to be repeated along with number of times it needs to be repeated.
library(dplyr)
library(tidyr)
library(stringr)
separate_on_condition %>%
mutate(row = row_number()) %>%
separate_rows(second, sep = ',') %>%
group_by(row) %>%
mutate(first = rep(str_extract_all(first(first), '[a-zA-Z]+')[[1]],
str_extract_all(first(first), '\\d+')[[1]])) %>%
ungroup %>%
select(-row)
# first second
# <chr> <chr>
#1 a 1
#2 a 2
#3 a 3
#4 b 4
#5 c 5
#6 c 6
You can the following base R option
with(
separate_on_condition,
data.frame(
first = unlist(sapply(
unlist(strsplit(first, ",")),
function(x) rep(gsub("\\d", "", x), as.numeric(gsub("\\D", "", x)))
), use.names = FALSE),
second = eval(str2lang(sprintf("c(%s)", second)))
)
)
which gives
first second
1 a 1
2 a 2
3 a 3
4 b 4
5 c 5
6 c 6
Here is an alternative approach:
add NA to first to get same length
use separate_rows to bring each element to a row
use extract by regex digit to split first into first and helper
group and slice by values in helper
do some tweaking
library(tidyr)
library(dplyr)
separate_on_condition %>%
mutate(first = str_c(first, ",NA,NA,NA")) %>%
separate_rows(first, second, sep = "[^[:alnum:].]+", convert = TRUE) %>%
extract(first, into = c("first", "helper"), "(.{1})(.{1})", remove=FALSE) %>%
group_by(second) %>%
slice(rep(1:n(), each = helper)) %>%
ungroup() %>%
drop_na() %>%
mutate(second = row_number()) %>%
select(first, second)
first second
<chr> <int>
1 a 1
2 a 2
3 a 3
4 b 4
5 c 5
6 c 6

Check if values of one dataframe exist in another dataframe in exact order

I have 1 dataframe of data and multiple "reference" dataframes. I'm trying to automate checking if values of the dataframe match the values of the reference dataframes. Importantly, the values must also be in the same order as the values in the reference dataframes. These columns are of the columns of importance, but my real dataset contains many more columns.
Below is a toy dataset.
Dataframe
group type value
1 A Teddy
1 A William
1 A Lars
2 B Dolores
2 B Elsie
2 C Maeve
2 C Charlotte
2 C Bernard
Reference_A
type value
A Teddy
A William
A Lars
Reference_B
type value
B Elsie
B Dolores
Reference_C
type value
C Maeve
C Hale
C Bernard
For example, in the toy dataset, group1 would score 1.0 (100% correct) because all its values in A match the values and order of values of An in reference_A. However, group2 would score 0.0 because the values in B are out of order compared to reference_B and 0.66 because 2/3 values in C match the values and order of values in reference_C.
Desired output
group type score
1 A 1.0
2 B 0.0
2 C 0.66
This was helpful, but does not take order into account:
Check whether values in one data frame column exist in a second data frame
Update: Thank you to everyone that has provided solutions! These solutions are great for the toy dataset, but have not yet been adaptable to datasets with more columns. Again, like I wrote in my post, the columns that I've listed above are of importance — I'd prefer to not drop the unneeded columns if necessary.
We may also do this with mget to return a list of data.frames, bind them together, and do a group by mean of logical vector
library(dplyr)
mget(ls(pattern = '^Reference_[A-Z]$')) %>%
bind_rows() %>%
bind_cols(df1) %>%
group_by(group, type = type...1) %>%
summarise(score = mean(value...2 == value...5))
# Groups: group [2]
# group type score
# <int> <chr> <dbl>
#1 1 A 1
#2 2 B 0
#3 2 C 0.667
This is another tidyverse solution. Here, I am adding a counter (i.e. rowname) to both reference and data. Then I join them together on type and rowname. At the end, I summarize them on type to get the desired output.
library(dplyr)
library(purrr)
library(tibble)
list(`Reference A`, `Reference B`, `Reference C`) %>%
map(., rownames_to_column) %>%
bind_rows %>%
left_join({Dataframe %>%
group_split(type) %>%
map(., rownames_to_column) %>%
bind_rows},
. , by=c("type", "rowname")) %>%
group_by(type) %>%
dplyr::summarise(group = head(group,1),
score = sum(value.x == value.y)/n())
#> # A tibble: 3 x 3
#> type group score
#> <chr> <int> <dbl>
#> 1 A 1 1
#> 2 B 2 0
#> 3 C 2 0.667
Here's a "tidy" method:
library(dplyr)
# library(purrr) # map2_dbl
Reference <- bind_rows(Reference_A, Reference_B, Reference_C) %>%
nest_by(type, .key = "ref") %>%
ungroup()
Reference
# # A tibble: 3 x 2
# type ref
# <chr> <list<tbl_df[,1]>>
# 1 A [3 x 1]
# 2 B [2 x 1]
# 3 C [3 x 1]
Dataframe %>%
nest_by(group, type, .key = "data") %>%
left_join(Reference, by = "type") %>%
mutate(
score = purrr::map2_dbl(data, ref, ~ {
if (length(.x) == 0 || length(.y) == 0) return(numeric(0))
if (length(.x) != length(.y)) return(0)
sum((is.na(.x) & is.na(.y)) | .x == .y) / length(.x)
})
) %>%
select(-data, -ref) %>%
ungroup()
# # A tibble: 3 x 3
# group type score
# <int> <chr> <dbl>
# 1 1 A 1
# 2 2 B 0
# 3 2 C 0.667

Create a list of all values of a variable grouped by another variable in R

I have a data frame that contains two variables, like this:
df <- data.frame(group=c(1,1,1,2,2,3,3,4),
type=c("a","b","a", "b", "c", "c","b","a"))
> df
group type
1 1 a
2 1 b
3 1 a
4 2 b
5 2 c
6 3 c
7 3 b
8 4 a
I want to produce a table showing for each group the combination of types it has in the data frame as one variable e.g.
group alltypes
1 1 a, b
2 2 b, c
3 3 b, c
4 4 a
The output would always list the types in the same order (e.g. groups 2 and 3 get the same result) and there would be no repetition (e.g. group 1 is not "a, b, a").
I tried doing this using dplyr and summarize, but I can't work out how to get it to meet these two conditions - the code I tried was:
> df %>%
+ group_by(group) %>%
+ summarise(
+ alltypes = paste(type, collapse=", ")
+ )
# A tibble: 4 × 2
group alltypes
<dbl> <chr>
1 1 a, b, a
2 2 b, c
3 3 c, b
4 4 a
I also tried turning type into a set of individual counts, but not sure if that's actually useful:
> df %>%
+ group_by(group, type) %>%
+ tally %>%
+ spread(type, n, fill=0)
Source: local data frame [4 x 4]
Groups: group [4]
group a b c
* <dbl> <dbl> <dbl> <dbl>
1 1 2 1 0
2 2 0 1 1
3 3 0 1 1
4 4 1 0 0
Any suggestions would be greatly appreciated.
I think you were very close. You could call the sort and unique functions to make sure your result adheres to your conditions as follows:
df %>% group_by(group) %>%
summarize(type = paste(sort(unique(type)),collapse=", "))
returns:
# A tibble: 4 x 2
group type
<int> <chr>
1 1 a, b
2 2 b, c
3 3 b, c
4 4 a
To expand on Florian's answer this could be extended to generating an ordered list based on values in your data set. An example could be determining the order of dates:
library(lubridate)
library(tidyverse)
# Generate random dates
set.seed(123)
Date = ymd("2018-01-01") + sort(sample(1:200, 10))
A = ymd("2018-01-01") + sort(sample(1:200, 10))
B = ymd("2018-01-01") + sort(sample(1:200, 10))
C = ymd("2018-01-01") + sort(sample(1:200, 10))
# Combine to data set
data = bind_cols(as.data.frame(Date), as.data.frame(A), as.data.frame(B), as.data.frame(C))
# Get order of dates for each row
data %>%
mutate(D = Date) %>%
gather(key = Var, value = D, -Date) %>%
arrange(Date, D) %>%
group_by(Date) %>%
summarize(Ord = paste(Var, collapse=">"))
Somewhat tangential to the original question but hopefully helpful to someone.

n objects, a value for each combination of two objects, find minimum value for each object in R

I want to find the minimum value associated with an object out of a dataframe. The dataframe contains two columns representing all combinations of the objects and a value-column for each combination. It looks like this:
id_A id_B dist
206 208 2385.5096
207 208 467.8890
207 209 576.4631
...
208 209 1081.539
208 210 8214.439
...
I tried the following recommended dplyr functions:
df %>%
group_by(id_A) %>%
slice(which.min(dist))
But it creates not the desired output:
id_A id_B dist
...
207 208 467.8890
208 209 1081.5393
...
Note that for id 208 the combination with id 207 has the lowest value, but is not associated to id 208 (when it is in the grouped_by column).
I wrote a function doing this right, but since I got many entries it is way to slow. Its a loop subsetting the data by all entries containing a specific id and then finds the minimum within that subset and associates that value with that id.
Do you have an idea, how to make that fast e.g. using dplyr.
The issue boils down to needing a long (rather than wide) data format. First, here are some reproducible data (using the pipe from dplyr):
df <-
LETTERS[1:4] %>%
combn(2) %>%
t %>%
data.frame() %>%
mutate(val = 1:n()) %>%
setNames( c("id_A", "id_B", "dist") )
gives:
id_A id_B dist
1 A B 1
2 A C 2
3 A D 3
4 B C 4
5 B D 5
6 C D 6
What we want is a pair of columns giving matching each category with the distance from its row. For this, I am using gather from tidyr. It creates new columns telling us which column the data came from and what value that held. Here, we are telling it to pull from columns id_A and id_B to give us the category for each ID entry (it then duplicates the dist column as necessary)
df %>%
gather(whichID, Category, id_A, id_B)
Gives
dist whichID Category
1 1 id_A A
2 2 id_A A
3 3 id_A A
4 4 id_A B
5 5 id_A B
6 6 id_A C
7 1 id_B B
8 2 id_B C
9 3 id_B D
10 4 id_B C
11 5 id_B D
12 6 id_B D
We can then pass that data.frame to group_by and then use summarise to give us whatever information we wanted. I know that you didn't ask for the max, but I am including it just to show the general syntax you can use to get whatever type of result you want:
df %>%
gather(whichID, Category, id_A, id_B) %>%
group_by(Category) %>%
summarise(minDist = min(dist)
, maxDist = max(dist))
Returns:
Category minDist maxDist
<chr> <int> <int>
1 A 1 3
2 B 1 5
3 C 2 6
4 D 3 6
I just looked at the question and realized that you wanted to also display which comparison had the minimum value. Here is an approach that does that by tracking an index of the match (so that it is replicated when gathering) and then pulls the correct row from the original df and pastes together the two comparison values:
df %>%
mutate(whichComparison = 1:n()) %>%
gather(whichID, Category, id_A, id_B) %>%
group_by(Category) %>%
summarise(minDist = min(dist)
, whichMin = whichComparison[which.min(dist)]
, maxDist = max(dist)
, whichMax = whichComparison[which.max(dist)]) %>%
mutate(
minComp = sapply(whichMin, function(x){
paste(df[x, "id_A"], df[x, "id_B"], sep = " vs " )})
, maxComp = sapply(whichMax, function(x){
paste(df[x, "id_A"], df[x, "id_B"], sep = " vs " )})
)
returns
Category minDist whichMin maxDist whichMax minComp maxComp
<chr> <int> <int> <int> <int> <chr> <chr>
1 A 1 1 3 3 A vs B A vs D
2 B 1 1 5 5 A vs B B vs D
3 C 2 2 6 6 A vs C C vs D
4 D 3 3 6 6 A vs D C vs D
If you really want a single column giving which comparison gave the min value (and the max, in my output), you can instead use the index to pull both the id_A and id_B from the original df, knock out the one that matches the Category of interest, then use use_first_valid_of from the package janitor to grab just the one you are interested in. Because this generated a large number of intermediate columns, I am using select to clean things back up:
df %>%
mutate(whichComparison = 1:n()) %>%
gather(whichID, Category, id_A, id_B) %>%
group_by(Category) %>%
summarise(minDist = min(dist)
, maxDist = max(dist)
, whichMin = whichComparison[which.min(dist)]
, whichMax = whichComparison[which.max(dist)]) %>%
mutate(
minA = df$id_A[whichMin]
, minB = df$id_B[whichMin]
, maxA = df$id_A[whichMax]
, maxB = df$id_B[whichMax]
) %>%
mutate_each(funs(ifelse(. == Category, NA, as.character(.)) )
, minA:maxB) %>%
mutate(minComp = use_first_valid_of(minA, minB)
, maxComp = use_first_valid_of(maxA, maxB)) %>%
select(-(whichMin:maxB))
returns:
Category minDist maxDist minComp maxComp
<chr> <int> <int> <chr> <chr>
1 A 1 3 B D
2 B 1 5 A D
3 C 2 6 A D
4 D 3 6 A C
An alternative approach is to first convert the distance pairs to a matrix. Here, I first duplicate the comparisons in the reverse order to ensure that the matrix is complete (using tidyr to spread):
bind_rows(
df
, rename(df, id_A = id_B, id_B = id_A)
) %>%
spread(id_B, dist)
returns:
id_A A B C D
1 A NA 1 2 3
2 B 1 NA 4 5
3 C 2 4 NA 6
4 D 3 5 6 NA
Then, we just apply across rows much like we would if we working from a distance matrix (which may be where your data actually started):
bind_rows(
df
, rename(df, id_A = id_B, id_B = id_A)
) %>%
spread(id_B, dist) %>%
mutate(
minDist = apply(as.matrix(.[, -1]), 1, min, na.rm = TRUE)
, minComp = names(.)[apply(as.matrix(.[, -1]), 1, which.min) + 1]
, maxDist = apply(as.matrix(.[, -1]), 1, max, na.rm = TRUE)
, maxComp = names(.)[apply(as.matrix(.[, -1]), 1, which.max) + 1]
) %>%
select(Category = `id_A`
, minDist:maxComp)
returns:
Category minDist minComp maxDist maxComp
1 A 1 B 3 D
2 B 1 A 5 D
3 C 2 A 6 D
4 D 3 A 6 C

Resources