Iterating through all df column pairs and counting non-zero intersections - r

I have a ~15000 × 1000 data frame, where each column represents an individual and each row represents the incidence of a trait (0 or 1).
I want to efficiently compare all pairs of columns and generate a comma-separated list of all mutual traits (row names) for every possible pair.
Currently, I am looping over all column combinations via combn and pasting mutual row names into a string. That is to say, I have a solution; however, it is very, very slow (presumably quadratic in the number of columns).
Is there a way to vectorise this problem or approach it with tidyr/dplyr etc.? I can't seem to find one.
For example:
------ | individual1 | individual2 | individual3 | ...
trait1 |      0      |      1      |      1      | ...
trait2 |      0      |      0      |      0      | ...
trait3 |      1      |      1      |      1      | ...
...    |     ...     |     ...     |     ...     | ...
This yields the string trait1,trait3 for the pair individual2 and individual3.
Thanks!
Toy data (the actual data is too sparse to simply pull a subset):
df <- data.frame(trait = c("a", "b", "c", "d", "e"),
                 ind1 = c(0, 1, 1, 0, 1),
                 ind2 = c(1, 0, 1, 0, 1),
                 ind3 = c(1, 0, 1, 1, 1))

Try applying a custom function to each combination of columns. Maybe the efficiency can be improved a little.
t(combn(1:(ncol(df) - 1), 2, function(x) {
  string <- paste(df$trait[df[[x[1] + 1]] == 1 & df[[x[2] + 1]] == 1], collapse = ",")
  c(names(df)[x + 1], string)
}))
#      [,1]    [,2]      [,3]
# [1,] "Alice" "Bob"     "c,e"
# [2,] "Alice" "Charlie" "c,e"
# [3,] "Bob"   "Charlie" "a,c,e"
Data
df <- data.frame(trait = c("a", "b", "c", "d", "e"),
                 Alice = c(0, 1, 1, 0, 1),
                 Bob = c(1, 0, 1, 0, 1),
                 Charlie = c(1, 0, 1, 1, 1))
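As a side note, if only the number of shared traits per pair is needed (rather than the comma-separated names), the counting step can be fully vectorised with a matrix cross-product; a minimal sketch on the data above:
m <- as.matrix(df[-1])   # 0/1 matrix: traits x individuals
rownames(m) <- df$trait
crossprod(m)             # entry (i, j) = number of traits shared by individuals i and j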

Although this question already has an accepted answer, I would like to suggest a different approach using dplyr and tidyr, as well as a data.table variant.
Whenever column names are treated as data items, this indicates that the dataset is stored in an untidy format, IMHO. Reshaping the data into long format makes it possible to apply the usual data manipulations such as joining, grouping, and aggregating.
dplyr and tidyr
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(!"trait") %>%
  filter(value == 1L) %>%
  select(-value) %>%
  inner_join(., ., by = "trait") %>%
  filter(name.x < name.y) %>%
  group_by(name.x, name.y) %>%
  summarise(traits = toString(trait)) %>%
  ungroup()
# A tibble: 3 x 3
  name.x name.y  traits
  <chr>  <chr>   <chr>
1 Alice  Bob     c, e
2 Alice  Charlie c, e
3 Bob    Charlie a, c, e
Explanation
df %>%
  pivot_longer(!"trait") %>%
  filter(value == 1L)
reshapes the data into long format, which is a compact representation of the original matrix in wide format:
# A tibble: 10 x 3
   trait name    value
   <fct> <chr>   <dbl>
 1 a     Bob         1
 2 a     Charlie     1
 3 b     Alice       1
 4 c     Alice       1
 5 c     Bob         1
 6 c     Charlie     1
 7 d     Charlie     1
 8 e     Alice       1
 9 e     Bob         1
10 e     Charlie     1
The value column is dropped as it is no longer needed. Then the long data is joined with itself to find all names which match on trait. The result includes pairs of names given in both orders, e.g., (Alice, Bob) and (Bob, Alice), as well as self-pairs, e.g., (Bob, Bob). These are removed by filtering on name.x < name.y.
Finally, the data are grouped and summarised.
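To illustrate why the name.x < name.y filter is needed, here is a sketch of the raw self-join (before filtering) on the toy data:
long <- df %>%
  pivot_longer(!"trait") %>%
  filter(value == 1L) %>%
  select(-value)
# contains both orderings, e.g., (Alice, Bob) and (Bob, Alice),
# as well as self-pairs such as (Bob, Bob)
inner_join(long, long, by = "trait")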
data.table
The data.table variant implements the same approach but has the advantage of allowing a non-equi self-join, which reduces the number of rows directly in the join rather than in a subsequent filtering step.
library(data.table)
long <- melt(setDT(df), id.vars = "trait", variable.name = "name")[value == 1]
long[long, on = .(trait, name < name),
     .(name1 = x.name, name2 = i.name, trait), nomatch = NULL][
       , .(traits = toString(trait)), keyby = .(name1, name2)]
   name1   name2  traits
1: Alice     Bob    c, e
2: Alice Charlie    c, e
3:   Bob Charlie a, c, e

Related

tidyverse alternative to left_join & rows_update when two data frames differ in columns and rows

There might be a *_join version for this that I'm missing, but I have two data frames where:
1. The merging should happen in the first data frame, hence left_join.
2. I not only want to add columns but also update existing columns in the first data frame; more specifically, I want to replace NAs in the first data frame with values from the second data frame.
3. The second data frame contains more rows than the first one.
Conditions #1 and #2 make left_join fail. Condition #3 makes rows_update fail. So I need some steps in between and am wondering if there's an easier solution to get the desired output.
x <- data.frame(id = c(1, 2, 3),
                a = c("A", "B", NA))
  id    a
1  1    A
2  2    B
3  3 <NA>
y <- data.frame(id = c(1, 2, 3, 4),
                a = c("A", "B", "C", "D"),
                q = c("u", "v", "w", "x"))
  id a q
1  1 A u
2  2 B v
3  3 C w
4  4 D x
and the desired output would be:
  id a q
1  1 A u
2  2 B v
3  3 C w
I know I can achieve this with the following code, but it looks unnecessarily complicated to me. Is there a more direct approach that avoids the intermediate pipes in the two commands below?
library(tidyverse)
x %>%
  left_join(., y %>% select(id, q), by = c("id")) %>%
  rows_update(., y %>% filter(id %in% x$id), by = "id")
You can left_join and use coalesce to replace missing values.
library(dplyr)
x %>%
  left_join(y, by = 'id') %>%
  transmute(id, a = coalesce(a.x, a.y), q)
#  id a q
#1  1 A u
#2  2 B v
#3  3 C w
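Another option that makes the fill-the-NAs intent explicit is rows_patch() (available since dplyr 1.0.0), which only overwrites missing values; a sketch, where the semi_join drops the extra rows of y so rows_patch does not error on unmatched ids:
library(dplyr)
x %>%
  left_join(select(y, id, q), by = "id") %>%
  rows_patch(semi_join(select(y, id, a), x, by = "id"), by = "id")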

detecting sequence by group and compute new variable for the subset

I need to detect a sequence by group in a data.frame and compute a new variable.
Consider I have this following data.frame:
df1 <- data.frame(ID = c(1,1,1,1,1,1,1,2,2,2,3,3,3,3),
                  seqs = c(1,2,3,4,5,6,7,1,2,3,1,2,3,4),
                  count = c(2,1,3,1,1,2,3,1,2,1,3,1,4,1),
                  product = c("A", "B", "C", "C", "A,B", "A,B,C", "D",
                              "A", "B", "A", "A", "A,B,C", "D", "D"),
                  stock = c("A", "A,B", "A,B,C", "A,B,C", "A,B,C", "A,B,C", "A,B,C,D",
                            "A", "A,B", "A,B", "A", "A,B,C", "A,B,C,D", "A,B,C,D"))
> df1
   ID seqs count product   stock
1   1    1     2       A       A
2   1    2     1       B     A,B
3   1    3     3       C   A,B,C
4   1    4     1       C   A,B,C
5   1    5     1     A,B   A,B,C
6   1    6     2   A,B,C   A,B,C
7   1    7     3       D A,B,C,D
8   2    1     1       A       A
9   2    2     2       B     A,B
10  2    3     1       A     A,B
11  3    1     3       A       A
12  3    2     1   A,B,C   A,B,C
13  3    3     4       D A,B,C,D
14  3    4     1       D A,B,C,D
I am interested in computing a measure for IDs that follow this sequence of counts:
- Count == 1
- Count > 1
- Count == 1
In the example this is true for:
- rows 2, 3, 4 for `ID==1`
- rows 8, 9, 10 for `ID==2`
- rows 12, 13, 14 for `ID==3`
For these IDs and rows, I need to compute a measure called new that takes the value of the product in the last row of the sequence, if that product appears in the product of the second row AND does not appear in the stock of the first row of the sequence.
The desired outcome is shown below:
> output
  ID seq1 seq2 seq3 new
1  1    2    3    4   C
2  2    1    2    3
3  3    2    3    4   D
Note:
- In the detected sequences, no new products are added to the stock.
- In the original data there are many IDs that do not have any qualifying sequences.
- Some IDs have multiple qualifying sequences; all should be recorded.
- Count is always 1 or greater.
- The original data holds millions of IDs with up to 1500 sequences each.
How would you write an efficient piece of code to get this output?
Here's a data.table option:
library(data.table)
char_cols <- c("product", "stock")
setDT(df1)[, (char_cols) := lapply(.SD, as.character),
           .SDcols = char_cols]                           # in case they're factors
df1[, c1 := (count == 1) &
            (shift(count) > 1) &
            (shift(count, 2L) == 1),
    by = ID]                                              # condition 1: the 1, >1, 1 count pattern
df1[, pat := paste0("(", gsub(",", "|", product), ")")]   # regex pattern, e.g. "(A|B)"
df1[, c2 := mapply(grepl, pat, shift(product)) &
            !mapply(grepl, pat, shift(stock, 2L)),
    by = ID]                                              # condition 2: product match
df1[(c1), new := ifelse(c2, product, "")]                 # create the new column
df1[, paste0("seq", 1:3) := shift(seqs, 2:0)]             # create the seq columns
df1[(c1), .(ID, seq1, seq2, seq3, new)]                   # result
Here's another approach using the tidyverse; however, I think lag and lead have made this solution a bit time-consuming. I included comments within the code to make it more legible.
I spent enough time on it to post it anyway.
library(tidyverse)
df1 %>% group_by(ID) %>%
  # find the rows with count > 1 whose neighbouring rows
  # (the one before and the one after) have count == 1
  mutate(test = (count > 1 & c(F, lag(count == 1)[-1]) & c(lead(count == 1)[-n()], F))) %>%
  # flag every row of each chunk that meets the condition, to filter on below
  mutate(test2 = test | c(F, lag(test)[-1]) | c(lead(test)[-n()], F)) %>%
  filter(test2) %>% ungroup() %>%
  # group each run of three rows, in case an ID has multiple qualifying sequences
  group_by(G = trunc(3:(n() + 2) / 3)) %>% group_by(ID, G) %>%
  # create the new column via string extraction
  # (assuming product and stock are characters)
  mutate(new =
           str_remove_all(
             as.character(regmatches(stock[2], gregexpr(product[3], stock[2]))),
             stock[1])) %>%
  # select the desired columns and add times for the long-to-wide conversion
  select(ID, G, seqs, new) %>% mutate(times = 1:n()) %>% ungroup() %>%
  # long-to-wide conversion using tidyr (part of the tidyverse)
  gather(key, value, -ID, -G, -new, -times) %>%
  unite(col, key, times) %>% spread(col, value) %>%
  # put the columns in the desired order
  select(-G, -new, new) %>% as.data.frame()
#   ID seqs_1 seqs_2 seqs_3 new
# 1  1      2      3      4   C
# 2  2      1      2      3
# 3  3      2      3      4   D

New column conditional on whether number is even/uneven and on column

Say I have the following df:
id <- rep(1:2, c(7, 6))
name <- c('a','t','signal','b','s','e','signal','x','signal','r','s','t','signal')
df <- data.frame(id, name)
   id   name
1   1      a
2   1      t
3   1 signal
4   1      b
5   1      s
6   1      e
7   1 signal
8   2      x
9   2 signal
10  2      r
11  2      s
12  2      t
13  2 signal
I want to add a new column with a character value conditional on whether the id number is even or odd, and on whether the string 'signal' has been reached in the name column.
For odd id numbers, the character should be 'T' up to and including the first 'signal' in name; after the signal it should become 'C'.
For even id numbers, the character should be 'C' up to and including the first 'signal'; after the signal it should become 'T'.
For the example given, this should result in the following data.frame:
   id   name condition
1   1      a         T
2   1      t         T
3   1 signal         T
4   1      b         C
5   1      s         C
6   1      e         C
7   1 signal         C
8   2      x         C
9   2 signal         C
10  2      r         T
11  2      s         T
12  2      t         T
13  2 signal         T
Any help is very much appreciated!
This is not a vectorized solution, but it seems to work.
Data preparation: I add a new column to hold the condition.
id <- rep(1:2, c(7, 6))
name <- c('a','t','signal','b','s','e','signal','x','signal','r','s','t','signal')
df <- data.frame(id, name)
df$condition <- rep("X", nrow(df))
I need to track two states: (i) whether the signal has switched; (ii) whether the id parity changed since the previous row (from even to odd or the other way around). Then I read the data row by row and update the condition state along with the two variables.
signal <- FALSE
last <- 1
for (i in 1:nrow(df)) {
  # id parity changed - reset signal
  if (last != (df[i, "id"] %% 2)) signal <- FALSE
  if (!signal) {
    df[i, "condition"] <- ifelse(df[i, "id"] %% 2, "T", "C")
  } else {
    df[i, "condition"] <- ifelse(df[i, "id"] %% 2, "C", "T")
  }
  # the signal is now on
  if (df[i, "name"] == "signal") signal <- TRUE
  # save the parity of the last id (even or odd)
  last <- df[i, "id"] %% 2
}
I hope it helps.
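For comparison, the same parity-and-signal state logic can be vectorised; a minimal base R sketch (my own aside, not part of the loop solution above):
# TRUE for rows strictly after the first "signal" within each id
after <- ave(df$name == "signal", df$id,
             FUN = function(s) cumsum(c(0, head(s, -1))) > 0)
# odd id XOR after-signal decides the label
df$condition <- ifelse((df$id %% 2 == 1) != after, "T", "C")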
We could make use of %% with == to create the column
library(dplyr)
df1 %>%
  group_by(id) %>%
  mutate(ind = (cumsum(lag(name, default = name[1]) == 'signal') > 0) + 1,
         condition = c('T', 'C')[ifelse(id %% 2 > 0, ind,
                                        as.integer(factor(ind, levels = rev(unique(ind)))))]) %>%
  select(-ind)
# A tibble: 13 x 3
# Groups:   id [2]
#      id name   condition
#   <int> <chr>  <chr>
# 1     1 a      T
# 2     1 t      T
# 3     1 signal T
# 4     1 b      C
# 5     1 s      C
# 6     1 e      C
# 7     1 signal C
# 8     2 x      C
# 9     2 signal C
#10     2 r      T
#11     2 s      T
#12     2 t      T
#13     2 signal T
data
df1 <- data.frame(id, name, stringsAsFactors=FALSE)
Another approach could be
id <- rep(1:2,c(7,6))
name <- c('a','t','signal','b','s','e','signal','x','signal','r','s','t','signal')
df <- data.frame(id, name)
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(FirstSignalIndex = min(which(name == 'signal'))) %>%
  mutate(condition = ifelse((id %% 2) == 0,
                            ifelse(row_number() > FirstSignalIndex, 'T', 'C'),
                            ifelse(row_number() > FirstSignalIndex, 'C', 'T')))
Hope this helps!

Grouping R data multiple times before summing

I'm trying to group my data by a number of variables before providing a summary table showing the sum of the values within each group.
I have created the below data as an example.
Value <- c(21000, 10000, 50000, 60000, 2000, 4000, 5500, 10000, 35000, 40000)
Group <- c("A", "A", "B", "B", "C", "C", "A", "A", "B", "C")
Type <- c(1, 2, 1, 2, 1, 1, 1, 2, 2, 1)
Matrix <- cbind(Value, Group, Type)
I want to group the above data first by the Group variable and then by the Type variable, and then sum the values within each group, to get an output similar to the example I put together in Excel. I would usually use the aggregate function if I just wanted to group by one variable, but I am not sure how to translate this to multiple variables.
Further to this, I then need to provide an identical table where the values are calculated with a count rather than a sum.
Many thanks in advance!
You can supply multiple groupings to aggregate:
df <- data.frame(Value, Group, Type)
> aggregate(df$Value, list(Type = df$Type, Group = df$Group), sum)
  Type Group     x
1    1     A 26500
2    2     A 20000
3    1     B 50000
4    2     B 95000
5    1     C 46000
> aggregate(df$Value, list(Type = df$Type, Group = df$Group), length)
  Type Group x
1    1     A 2
2    2     A 2
3    1     B 1
4    2     B 2
5    1     C 3
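The formula interface of aggregate expresses the same two calls a little more compactly:
aggregate(Value ~ Type + Group, data = df, FUN = sum)
aggregate(Value ~ Type + Group, data = df, FUN = length)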
There are other packages which may be easier to use such as data.table:
> library(data.table)
> dt <- as.data.table(df)
> dt[, .(Count = length(Value), Sum = sum(Value)),
     by = .(Type, Group)]
   Type Group Count   Sum
1:    1     A     2 26500
2:    2     A     2 20000
3:    1     B     1 50000
4:    2     B     2 95000
5:    1     C     3 46000
dplyr is another option, and @waskuf has a good example of that.
Using dplyr (note that "Matrix" needs to be a data.frame):
library(dplyr)
Matrix <- data.frame(Value, Group, Type)
Matrix %>%
  group_by(Group, Type) %>%
  summarise(Sum = sum(Value), Count = n()) %>%
  ungroup()

Conditionally removing duplicates

I have a dataset in which I need to conditionally remove duplicated rows based on values in another column.
Specifically, I need to delete any row where size = 0, but only if that SampleID is duplicated (appears in more than one row).
SampleID<-c("a", "a", "b", "b", "b", "c", "d", "d", "e")
size<-c(0, 1, 1, 2, 3, 0, 0, 1, 0)
data<-data.frame(SampleID, size)
I want to delete the rows:
SampleID size
       a    0
       d    0
And keep:
SampleID size
       a    1
       b    1
       b    2
       b    3
       c    0
       d    1
       e    0
Note: the actual dataset is very large, so I am not looking for a way to simply remove known rows by row number.
In dplyr we can do this using group_by and filter:
library(dplyr)
data %>%
  group_by(SampleID) %>%
  filter(!(size == 0 & n() > 1))  # equivalently: filter(size != 0 | n() == 1)
#> # A tibble: 7 x 2
#> # Groups:   SampleID [5]
#>   SampleID  size
#>   <fct>    <dbl>
#> 1 a            1
#> 2 b            1
#> 3 b            2
#> 4 b            3
#> 5 c            0
#> 6 d            1
#> 7 e            0
Using the data.table framework, first transform your data set into a data.table:
require(data.table)
setDT(data)
Build the list of ids whose zero-size rows can be deleted, i.e. ids that also have at least one row with a non-zero size:
dropable_ids = unique(data[size != 0, SampleID])
Finally, keep the lines whose id is not in the droppable list or whose size is non-zero:
data = data[!(SampleID %in% dropable_ids & size == 0), ]
Please note that !(a & b) is equivalent to !a | !b (De Morgan's law), but the data.table framework doesn't handle | well.
Hope it helps
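Alternatively, the grouped condition from the dplyr answer above translates directly to data.table using the group size .N; a sketch, starting from the original data frame:
library(data.table)
setDT(data)[, .SD[!(size == 0 & .N > 1)], by = SampleID]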
A solution that works in base R without data.table, and is easy for R starters to follow:
# Find all duplicates
data$dup1 <- duplicated(data$SampleID)
data$dup2 <- duplicated(data$SampleID, fromLast = TRUE)
data$dup  <- ifelse(data$dup1 == TRUE | data$dup2 == TRUE, 1, 0)
# Subset to the relevant rows
data$drop <- ifelse(data$dup == 1 & data$size == 0, 1, 0)
data2 <- subset(data, drop == 0)
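The same dup-and-drop idea condenses to a single base R expression with ave, which here returns each row's group size; a sketch, again starting from the original data frame:
data2 <- data[!(data$size == 0 &
                ave(data$size, data$SampleID, FUN = length) > 1), ]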
