Selecting most frequent combinations while removing the parts of that combination R - r

I have a dataset with combinations and their frequency as shown below in an example. The idea is to find all combinations (every name has to be used) to have the highest possible value for count (frequency).
Person 1
Person 2
Count
A
B
4
A
D
4
A
C
3
B
C
2
C
D
1
B
D
0
A, B, C and D are names of people and count is the frequency of a combination of two people.
In this example the highest count can be reached by having an AD and BC combination, which sums to 6 (4+2). If we take AB and CD as a combination the total sum of count will be lower (5, 4+1).
I would like to have a dataset looking like this as an answer:
Person 1
Person 2
Count
A
D
4
B
C
2
How can I create this dataset from the original without having duplicate names and with having the highest possible count. So if there is an AD combination, there can not be another combination including A or D.
I tried following code, but this does not give me the desired dataset:
dat <- data %>%
arrange(desc(count))
count = 0
while (nrow(dat)>0){
print(dat[1,])
dat <- dat %>%
filter(!(X1==X1[1]|X1==X2[1]|X2==X1[1]|X2==X2[1]))
}
dat is the arranged dataset shown in the first table. I print the first row with the highest count and delete all combinations that has one of the names in their combination (because I can use a name only once). This is looped until there are no more people left.
This code will give following dataset:
Person 1
Person 2
Count
A
B
4
C
D
1
Thank you in advance.

There is probably a more elegant solution with igraph, but here is my approach:
Using your data
your_data <- tibble::tribble( ~Person.1, ~Person.2, ~Count, "A", "B", 4L, "A", "D", 4L, "A", "C", 3L, "B", "C", 2L, "C", "D", 1L, "B", "D", 0L)
and assuming Person.1 and Person.2 are in alphabetical order, you can do
library(purrr)
with(your_data, unique(c(Person.1, Person.2))) %>%
combinat::permn(\(x) split(x, (seq_along(x) + 1) %/% 2) %>%
map(sort) %>%
map_dfr(set_names, c("Person.1", "Person.2"))) %>%
map(~ arrange(.x, Person.1)) %>%
unique() %>%
imap(~ dplyr::left_join(.x, your_data)) %>%
rlist::list.sort((sum(Count))) %>%
first()
returning the desired
# A tibble: 2 x 3
Person.1 Person.2 Count
<chr> <chr> <int>
1 A D 4
2 B C 2

Related

Move subgroup under repeated main group while keeping main group once in data.frame R

I'm aware that the question is awkward. If I could phrase it better I'd probably find the solution in an other thread.
I have this data structure...
df <- data.frame(group = c("X", "F", "F", "F", "F", "C", "C"),
subgroup = c(NA, "camel", "horse", "dog", "cat", "orange", "banana"))
... and would like to turn it into this...
data.frame(group = c("X", "F", "camel", "horse", "dog", "cat", "C", "orange", "banana"))
... which is surprisingly confusing. Also, I would prefer not using a loop.
EDIT: I updated the example to clarify that solutions that depend on sorting unfortunately do not do the trick.
Here an (edited) answer with new data.
Using data.table is going to help a lot. The idea is to split the df into groups and lapply() to each group what we need. Whe have to take care of some things meanwhile.
library(data.table)
# set as data.table
setDT(df)
# to mantain the ordering, you need to put as factor the group.
# the levels are going to give the ordering infos to split
df[,':='(group = factor(group, levels =unique(df$group)))]
# here the split function, splitting df int a list
df_list <-split(df, df$group, sorted =F)
# now you lapply to each element what you need
df_list <-lapply(df_list, function(x) data.frame(group = unique(c(as.character(x$group),x$subgroup))))
# put into a data.table and remove NAs
rbindlist(df_list)[!is.na(df_onecol$group)]
group
1: X
2: F
3: camel
4: horse
5: dog
6: cat
7: C
8: orange
9: banana
With the edited data we need to add another column (here row_number) to sort by:
df %>%
pivot_longer(col = everything()) %>%
mutate(r_n = row_number()) %>%
group_by(value) %>% slice(1) %>%
arrange(r_n) %>%
filter(!is.na(value))
#output
# A tibble: 9 × 3
# Groups: value [9]
name value r_n
<chr> <chr> <int>
1 group X 1
2 group F 3
3 subgroup camel 4
4 subgroup horse 6
5 subgroup dog 8
6 subgroup cat 10
7 group C 11
8 subgroup orange 12
9 subgroup banana 14

tidyverse alternative to left_join & rows_update when two data frames differ in columns and rows

There might be a *_join version for this I'm missing here, but I have two data frames, where
The merging should happen in the first data frame, hence left_join
I not only want to add columns, but also update existing columns in the first data frame, more specifically: replace NA's in the first data frame by values in the second data frame
The second data frame contains more rows than the first one.
Condition #1 and #2 make left_join fail. Condition #3 makes rows_update fail. So I need to do some steps in between and am wondering if there's an easier solution to get the desired output.
x <- data.frame(id = c(1, 2, 3),
a = c("A", "B", NA))
id a
1 1 A
2 2 B
3 3 <NA>
y <- data.frame(id = c(1, 2, 3, 4),
a = c("A", "B", "C", "D"),
q = c("u", "v", "w", "x"))
id a q
1 1 A u
2 2 B v
3 3 C w
4 4 D x
and the desired output would be:
id a q
1 1 A u
2 2 B v
3 3 C w
I know I can achieve this with the following code, but it looks unnecessarily complicated to me. So is there maybe a more direct approach without having to do the intermediate pipes in the two commands below?
library(tidyverse)
x %>%
left_join(., y %>% select(id, q), by = c("id")) %>%
rows_update(., y %>% filter(id %in% x$id), by = "id")
You can left_join and use coalesce to replace missing values.
library(dplyr)
x %>%
left_join(y, by = 'id') %>%
transmute(id, a = coalesce(a.x, a.y), q)
# id a q
#1 1 A u
#2 2 B v
#3 3 C w

R: group_id by changing row values

1) Firstly, I have this data frame:
df <- data.frame(value=c("a","a","a", "b", "b", "b", "a", "a", "a"), ,
desired_id=c(1,1,1,2,2,2,3,3,3))
How do I generate the desired_id column?
My groups are assigned by row order.
That is, everytime the value column changes, I want the group indices to assign the next higher group indices.
I tried df$desired_id_replicate <- df %>% group_by(value) %>% group_indices
but that doesn't work as all value=="a" will be assigned the same group indices.
2)Secondly, I have this data frame:
df <- data.frame(value=c("a","a","a", "b", "b", "b", "a", "a", "a"),
value2=c("a","a","c", "b", "b", "c", "a", "a", "d"),
desired_id=c(1,1,2,3,3,4,5,5,6))
How do I generate the desired_id from the value and value2 column.
My groups are assigned row-wise again. That is, everytime a unique combination of value and value2 changes, the next higher desired_id should be assigned.
Similar to the above, I tried df$desired_id_replicate <- df %>% group_by(value, value2) %>% group_indices
but that doesn't work as all value=="a"&value2=="a" will be assigned the same group indices.
Thank you!
We can use rleid (run-length-encoding id) from data.table which would basically increment 1 for each element that is not equal to the previous element
library(data.table)
library(dplyr)
df%>%
mutate(newcol = rleid(value))
and for the second dataset, it would be
df %>%
mutate(new = rleid(value, value2))
# value value2 desired_id new
#1 a a 1 1
#2 a a 1 1
#3 a c 2 2
#4 b b 3 3
#5 b b 3 3
#6 b c 4 4
#7 a a 5 5
#8 a a 5 5
#9 a d 6 6
Or with rle from base R
df$newcol <- with(rle(df$value), rep(seq_along(values), lengths))

detecting sequence by group and compute new variable for the subset

I need to detect a sequence by group in a data.frame and compute new variable.
Consider I have this following data.frame:
df1 <- data.frame(ID = c(1,1,1,1,1,1,1,2,2,2,3,3,3,3),
seqs = c(1,2,3,4,5,6,7,1,2,3,1,2,3,4),
count = c(2,1,3,1,1,2,3,1,2,1,3,1,4,1),
product = c("A", "B", "C", "C", "A,B", "A,B,C", "D", "A", "B", "A", "A", "A,B,C", "D", "D"),
stock = c("A", "A,B", "A,B,C", "A,B,C", "A,B,C", "A,B,C", "A,B,C,D", "A", "A,B", "A,B", "A", "A,B,C", "A,B,C,D", "A,B,C,D"))
df1
> df1
ID seqs count product stock
1 1 1 2 A A
2 1 2 1 B A,B
3 1 3 3 C A,B,C
4 1 4 1 C A,B,C
5 1 5 1 A,B A,B,C
6 1 6 2 A,B,C A,B,C
7 1 7 3 D A,B,C,D
8 2 1 1 A A
9 2 2 2 B A,B
10 2 3 1 A A,B
11 3 1 3 A A
12 3 2 1 A,B,C A,B,C
13 3 3 4 D A,B,C,D
14 3 4 1 D A,B,C,D
I am interested to compute a measure for ID that follow this sequence:
- Count == 1
- Count > 1
- Count == 1
In the example this is true for:
- rows 2, 3, 4 for `ID==1`
- rows 8, 9, 10 for `ID==2`
- rows 12, 13, 14 for `ID==3`
For these ID and rows, I need to compute a measure called new that takes the value of the product of the last row of the sequence if it is in the second row of the sequence and NOT in the stock of the first sequence.
The desired outcome is shown below:
> output
ID seq1 seq2 seq3 new
1 1 2 3 4 C
2 2 1 2 3
3 3 2 3 4 D
Note:
In the sequence detected for ID no new products are added to the stock.
In the original data there are a lot of IDs who do not have any sequences.
Some ID have multiple qualifying sequences. All should be recorded.
Count is always 1 or greater.
The original data holds millions of ID with up to 1500 sequences.
How would you write an efficient piece of code to get this output?
Here's a data.table option:
library(data.table)
char_cols <- c("product", "stock")
setDT(df1)[,
(char_cols) := lapply(.SD, as.character),
.SDcols = char_cols] # in case they're factors
df1[, c1 := (count == 1) &
(shift(count) > 1) &
(shift(count, 2L) == 1),
by = ID] #condition1
df1[, pat := paste0("(", gsub(",", "|", product), ")")] # pattern
df1[, c2 := mapply(grepl, pat, shift(product)) &
!mapply(grepl, pat, shift(stock, 2L)),
by = ID] # condition2
df1[(c1), new := ifelse(c2, product, "")] # create new column
df1[, paste0("seq", 1:3) := shift(seqs, 2:0)] # create seq columns
df1[(c1), .(ID, seq1, seq2, seq3, new)] # result
Here's another approach using tidyverse; however, I think lag and lead has made this solution a bit time-consuming. I included the comments within the code to make it more legible.
But I spent enough time on it, to post it anyway.
library(tidyverse)
df1 %>% group_by(ID) %>%
# this finds the row with count > 1 which ...
#... the counts of the row before and the one of after it equals to 1
mutate(test = (count > 1 & c(F, lag(count==1)[-1]) & c(lead(count==1)[-n()],F))) %>%
# this makes a column which has value of True for each chunk...
#that meets desired condition to later filter based on it
mutate(test2 = test | c(F,lag(test)[-1]) | c(lead(test)[-n()], F)) %>%
filter(test2) %>% ungroup() %>%
# group each three occurrences in case of having multiple ones within each ID
group_by(G=trunc(3:(n()+2)/3)) %>% group_by(ID,G) %>%
# creating new column with string extracting techniques ...
#... (assuming those columns are characters)
mutate(new=
str_remove_all(
as.character(regmatches(stock[2], gregexpr(product[3], stock[2]))),
stock[1])) %>%
# selecting desired columns and adding times for long to wide conversion
select(ID,G,seqs,new) %>% mutate(times = 1:n()) %>% ungroup() %>%
# long to wide conversion using tidyr (part of tidyverse)
gather(key, value, -ID, -G, -new, -times) %>%
unite(col, key, times) %>% spread(col, value) %>%
# making the desired order of columns
select(-G,-new,new) %>% as.data.frame()
# ID seqs_1 seqs_2 seqs_3 new
# 1 1 2 3 4 C
# 2 2 1 2 3
# 3 3 2 3 4 D

Creating a matrix of co-occurence from many variables, and plotting it

I have a data set of individuals with a number of health conditions. Individuals either do (1) or do not (0) have each condition (my real data set has 14). What I want to do is summarise the data so I know how often pairs of conditions occur. Note that some individuals may have three or four of the conditions, but what I'm interested in is the pairwise co-occurence. I would then like to plot this as a heatmap.
I suspect that the solution involves the 'gather' function from tidyr, but I haven't been able to work it out. This is an example of what my input looks like and what I'd like to achieve:
Here's some data on individuals and whether or not they have conditions "a", "b" or "c":
library(tidyverse)
library(viridis)
dat <- tibble(
id = c(1:15),
a = c(1,0,0,0,1,1,1,0,1,0,0,0,1,0,1),
b = c(1,0,0,1,1,1,0,0,1,0,0,1,1,0,1),
c = c(0,0,1,1,0,1,0,1,0,1,1,0,1,1,0))
I want to summarise how often each of the conditions occur, and how often they co-occur. In this case, it's evident that conditions "a" and "b" co-occur more often than do either of these with "c", which usually occurs on its own. Below is my imagined idea of what the data will look like in a plottable format. The first column is 'variable 1', the second is 'variable 2', and the third, is the count of how often these occur together. Below that is the plot which I have in my mind.
plotdat <- tibble(
var1 = c("a", "a", "a", "b", "b", "c"),
var2 = c("a", "b", "c", "b", "c", "c"),
count = c(7, 6, 2, 8, 3, 8))
ggplot(plotdat) +
geom_tile(aes(var1, var2, fill = count)) +
scale_fill_viridis()
Perhaps this is not the right approach at all and I actually need to convert the data into a 3x3 matrix. Any possible solutions would be gratefully received!
Here is a way
library(tidyverse)
as.matrix(dat[-1]) %>%
crossprod() %>%
`[<-`(upper.tri(.), NA) %>%
as.data.frame() %>%
rownames_to_column() %>%
gather(key, value, -rowname) %>%
filter(!is.na(value))
# rowname key value
#1 a a 7
#2 b a 6
#3 c a 2
#4 b b 8
#5 c b 3
#6 c c 8
The most important piece is crossprod, I think. But let's go through it step by step.
You don't need column id so we exclude it and convert dat[-1] to a matrix because this is what crossprod expects.
as.matrix(dat[-1]) %>%
crossprod()
# a b c
#a 7 6 2
#b 6 8 3
#c 2 3 8
Then we replace the upper triangle of this matrix with NA because you don't want to compare a-b and b-a etc.
The next step is to convert to a dataframe, make the rownames a column and reshape from wide to long
as.matrix(dat[-1]) %>%
crossprod() %>%
`[<-`(upper.tri(.), NA) %>%
as.data.frame() %>%
rownames_to_column() %>%
gather(key, value, -rowname)
# rowname key value
#1 a a 7
#2 b a 6
#3 c a 2
#4 a b NA
#5 b b 8
#6 c b 3
#7 a c NA
#8 b c NA
#9 c c 8
Finally remove NAs to get desired output.

Resources