Grouping IDs based on at least one common value - r

I have a list whose elements are integer vectors, and I would like to merge these elements only if they share at least one value. Elements that don't share any values with the rest should stay as they are. Here is my sample data:
x <- list(c(1, 2), c(1, 2, 3), c(2, 3, 4), c(3, 4, 5), c(4, 5, 8), c(6, 9, 7), 7, c(5, 8), 10, 11)
And here is my desired output:
desired_result <- list(c(1, 2, 3, 4, 5, 8),
c(6, 9, 7),
10,
11)
I would like to do it first with reduce or accumulate functions from purrr but any other tidyverse solution would be welcomed. I have tried this solution so far but it only gave me one union and apparently abandons the rest:
x %>%
reduce(~ if(any(.x %in% .y)) union(.x, .y) else .x)
[1] 1 2 3 4 5 8
In general I am looking for a way of grouping integers (ids) with common values like a sort of clustering but so far my efforts have been in vain unfortunately.
Thank you very much indeed for your help in advance.

I suspect there's a set covering solution to be had, but in the interim here's a graph approach:
First, let's convert the integer vectors to an edge list so it can be made into a graph. We can use expand.grid.
library(igraph)
edgelist <- do.call(rbind,lapply(x,\(x)expand.grid(x,x))) #R version >= 4.1.0
Now we have a two column data.frame showing the connections between all the integers (a set of edges).
igraph::graph.data.frame can conveniently make a graph from this.
From there we can use igraph::components to extract the connected components.
g <- graph.data.frame(edgelist)
split(names(components(g)$membership),components(g)$membership)
#$`1`
#[1] "1" "2" "3" "4" "5" "8"
#$`2`
#[1] "6" "9" "7"
#$`3`
#[1] "10"
#$`4`
#[1] "11"
Or with Tidyverse:
library(dplyr); library(purrr)
map_dfr(x, ~expand.grid(.x,.x)) %>%
graph.data.frame() %>%
components() %>%
pluck(membership) %>%
stack() %>%
{split(as.numeric(as.character(.[,2])),.[,1])}
$`1`
[1] 1 2 3 4 5 8
$`2`
[1] 6 9 7
$`3`
[1] 10
$`4`
[1] 11
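For comparison, the same grouping can be sketched in base R without a graph library. This is only a minimal illustration, not part of the answer above: merge_overlapping is a hypothetical helper name, and the order of the returned groups may differ from the igraph output.

```r
x <- list(c(1, 2), c(1, 2, 3), c(2, 3, 4), c(3, 4, 5), c(4, 5, 8),
          c(6, 9, 7), 7, c(5, 8), 10, 11)

# Hypothetical helper: process the vectors one at a time and keep a
# list of pairwise disjoint groups.
merge_overlapping <- function(sets) {
  groups <- list()
  for (s in sets) {
    # Which existing groups share at least one value with s?
    hits <- vapply(groups, function(g) any(s %in% g), logical(1))
    # Fuse s with every group it touches; leave the others untouched
    merged <- unique(c(unlist(groups[hits]), s))
    groups <- c(groups[!hits], list(merged))
  }
  groups
}

res <- merge_overlapping(x)
```

Because a new vector is fused with every group it overlaps, the groups stay disjoint after each step, which is what the single-pass reduce() attempt in the question could not guarantee.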

One way of doing it:
Set i is adjacent to set j iff intersect(i, j) is non-empty. We want the connected components of the graph whose adjacency matrix has 1 at position (i, j) iff set i is adjacent to set j, and 0 otherwise. The first four lines of the pipeline build the undirected graph from the adjacency matrix, the fifth and sixth find the connected components, and the rest splits the list by that membership and takes unique values.
library(tidyverse)
library(igraph)
map(x, function(a) map_int(x, ~length(base::intersect(a, .x)) > 0) * 1L) %>%
reduce(rbind) %>%
graph.adjacency() %>%
as.undirected() %>%
components() %>%
pluck("membership") %>%
split(seq_along(.), .) %>%
map(~unique(unlist(x[.x])))
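To make the adjacency-matrix idea concrete, here is a rough base R sketch that finds the components by repeatedly squaring the matrix (a transitive closure) instead of calling igraph. The names adj, reach and membership are my own, and this is an illustration for small inputs only, since the matrix is quadratic in the length of x.

```r
x <- list(c(1, 2), c(1, 2, 3), c(2, 3, 4), c(3, 4, 5), c(4, 5, 8),
          c(6, 9, 7), 7, c(5, 8), 10, 11)

# adj[i, j] is TRUE iff elements i and j of x share at least one value
adj <- sapply(x, function(a) sapply(x, function(b) any(a %in% b)))

# Transitive closure by repeated squaring: reach[i, j] becomes TRUE iff
# j can be reached from i through a chain of overlapping elements
reach <- adj
repeat {
  nxt <- ((reach * 1) %*% (reach * 1)) > 0
  if (all(nxt == reach)) break
  reach <- nxt
}

# Label each element by the smallest index in its component,
# then collect the unique values per component
membership <- apply(reach, 1, function(r) which(r)[1])
groups <- lapply(split(seq_along(x), membership),
                 function(idx) unique(unlist(x[idx])))
```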

Thanks to a very informative post introduced by my dear friend @Ian Campbell, I thought I would challenge myself to write a custom function for this purpose. It is still a first version: it is not very elegant and can certainly be improved, but it has been stable on the inputs I have tried.
anoush <- function(x) {
# First we check whether x is a list
stopifnot(is.list(x))
# Then we take every element of the input and calculate the intersect between
# that element & others. In case there were some we would store the indices
# in `vec` vector. So in the end we have a list called `ind` whose elements
# are all the indices connected with the corresponding elements of the original
# list for example first element of `ind` is `1`, `2`, `3` which means in
# the original list these elements have common values.
ind <- lapply(1:length(x), function(a) {
vec <- c()
for(i in 1:length(x)) {
if(length(unique(base::intersect(x[[a]], x[[i]]))) > 0) {
vec <- c(vec, i)
}
}
vec
})
# Then we compare each element of `ind` with the other elements again.
# Wherever there is an intersection, we calculate the `union` of the two.
# For each element we end up with a list of accumulated values, and we
# use `Reduce` to combine them into a single vector. So for each element
# of `ind` we end up with a collection of indices, which also results in
# duplicated values. For example, elements `1` through `5` of `dup_ind`
# contain the same values because the corresponding elements of the
# original list have common values.
dup_ind <- lapply(1:length(ind), function(a) {
out <- c()
for(i in 1:length(ind)) {
if(length(unique(base::intersect(ind[[a]], ind[[i]]))) > 0) {
out[[i]] <- union(ind[[a]], ind[[i]])
}
vec2 <- Reduce("union", out)
}
vec2
})
# Here we get rid of the duplicated elements of the list by means of the
# `relist` function. Since all the duplicated elements turn into
# `integer(0)` in this process, I have filtered those out.
un <- unlist(dup_ind)
res <- Map(`[`, dup_ind, relist(!duplicated(un), skeleton = dup_ind))
res2 <- Filter(length, res)
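# lapply always returns a list; sapply could simplify to a matrix
# when all groups happen to have the same length
lapply(res2, function(a) unique(unlist(lapply(a, function(b) `[[`(x, b)))))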
}
Output
> anoush(x)
[[1]]
[1] 1 2 3 4 5 8
[[2]]
[1] 6 9 7
[[3]]
[1] 10
[[4]]
[1] 11

Related

Get index and value of non-NA list element

Take a list like as.list(rep(c(NA, 4, NA), times = c(5, 1, 2))) i.e.
[[1]]
[1] NA
[[2]]
[1] NA
[[3]]
[1] NA
[[4]]
[1] NA
[[5]]
[1] NA
[[6]] # index of non-NA list element, 6
[1] 4 # ...and its corresponding value, 4
[[7]]
[1] NA
[[8]]
[1] NA
I want to extract the index of the non-NA element (here: 6), and its corresponding value (here: 4). Is there any idiomatic way to get these two numbers?
1) Base R Assuming that the list L contains only scalars and NA's this returns a 2 column matrix with one row for each set of xy coordinates and an attribute recording which positions were omitted.
Omit the x= and y= if you don't want the column names. If you don't want the attribute recording the positions of the NA's append [,] to the end of the line. If you know that there is only one scalar you might want to wrap it in c(...) to produce a 2 element vector. If you prefer data frame output replace cbind with data.frame.
na.omit(cbind(x = seq_along(L), y = unlist(L)))
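As a quick illustration of the base R line on the sample list from the question (the variable names m and xy are mine): with a single non-NA element, appending [,] also collapses the one-row matrix to a named vector.

```r
L <- as.list(rep(c(NA, 4, NA), times = c(5, 1, 2)))

# One row per non-NA element: x is the index, y is the value
m <- na.omit(cbind(x = seq_along(L), y = unlist(L)))

# Drop the na.action attribute; a one-row matrix becomes a named vector
xy <- m[, ]
xy
```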
2) tidyverse Or, using the tidyverse:
library(tibble)
library(tidyr)
drop_na(enframe(unlist(L)))
2a) which could alternately be written using pipes like this:
L %>% unlist %>% enframe %>% drop_na
I am not sure this is elegant enough, but it works:
mylist <- as.list(rep(c(NA, 4, NA), times = c(5, 1, 2)))
x <- (1:length(mylist))[!sapply(mylist,is.na)]
y <- mylist[[x]]
coor <- c(x,y)
coor
Output:
6 4
Find the index of the non-NA list element (here 6) to get x.
Remove all NA list elements to get y.
my_list <- as.list(rep(c(NA, 4, NA), times = c(5, 1, 2)))
x <- which(!is.na(my_list))
y <- unlist(Filter(function(a) any(!is.na(a)), my_list))
coordinate <- c(x,y)
coordinate
> coordinate
[1] 6 4
Using stack
na.omit(stack(setNames(L, seq_along(L))))
values ind
6 4 6

Differing number of rows

Suppose I have a vector of numbers, a <- c(1, 2, 3, 4, 5, 6), and a vector of positions, b <- c(1, 2, 3).
I want to get the numbers in a up to and including each position in b.
I do this with lapply(b, function(x) a[1:x]) and I get the result
[[1]]
[1] 1
[[2]]
[1] 1 2
[[3]]
[1] 1 2 3
Now I want to combine them in a data frame. Normally, if the number of values for every position were equal, I would have done t(as.data.frame(lapply(b, function(x) a[1:x]))). But I cannot do that here because the numbers of values differ. How can I put zeros in for the non-existing values?
If the output list is 'lst1', make the lengths the same with the length<- assignment, then replace the resulting NAs with 0:
lapply(lst1, function(x) {
length(x) <- max(lengths(lst1))
replace(x, is.na(x), 0)})
data
lst1 <- lapply(b, function(x) a[1:x])
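Since the question asks for a data frame, here is one possible way to finish the job, sketched with base R only. The names n, padded and df are my own; each row of the result corresponds to one position in b.

```r
a <- c(1, 2, 3, 4, 5, 6)
b <- c(1, 2, 3)
lst1 <- lapply(b, function(x) a[1:x])

# Pad every vector with zeros up to the longest length,
# then stack the padded vectors as rows
n <- max(lengths(lst1))
padded <- t(vapply(lst1, function(v) c(v, rep(0, n - length(v))),
                   numeric(n)))
df <- as.data.frame(padded)
df
```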

Conditionally select element positions of objects in list and return new list with object element positions

I have a list of many objects; in this MWE there are only 6. For the objects msfr, msfl, mshr and mshl, I am interested in selecting the positions of elements that equal the respective value in vms (this object holds the maximum value of the elements in the 'ms...' objects) and where scheduled.day corresponds with today. My objective is to return a second list of objects (the 'ms...' objects) holding the positions of the elements where the conditions are true. Here is the data:
l <- list(msfr=c(1,5,0,0,0),
msfl=c(1,4,0,5,0),
mshr=c(1,0,0,0,0),
mshl=c(0,0,0,0,4),
vms= c(1,5,0,5,4),
scheduled.day = c(0, 3, 0, 4, 3))
today <- 3
ctt<- which(l[["scheduled.day"]] == today)
The closest I have come to achieving my objective is with the following code. But by using %in% element positions are being returned for the object msfl.
obj.names <- names(l)[1:4]
l2 <- lapply(lapply(l[obj.names],"["), function(x){which(x %in% l[["vms"]][ctt])}) # ctt being ignored
> l2
$msfr
[1] 2
$msfl
[1] 2 4
$mshr
integer(0)
$mshl
[1] 5
What I am looking for is some neat code that will return a list that looks like this.
l3 <- list(msfr = 2,
mshl = 5)
> l3
$msfr
[1] 2
$mshl
[1] 5
I must be blind!
lapply(lapply(l[obj.names],"["), function(x){which(x == l[["vms"]] & l[["scheduled.day"]] == today)})
I don't know why I didn't try this before, but it does the trick. If anyone has an alternative solution, I would be glad to see it.
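One caveat, as far as I can tell: that call still returns integer(0) for msfl and mshr. If the empty entries should be dropped so the result matches l3 exactly, wrapping the call in Filter(length, ...) is one option. A small sketch using the data from the question:

```r
l <- list(msfr = c(1, 5, 0, 0, 0),
          msfl = c(1, 4, 0, 5, 0),
          mshr = c(1, 0, 0, 0, 0),
          mshl = c(0, 0, 0, 0, 4),
          vms  = c(1, 5, 0, 5, 4),
          scheduled.day = c(0, 3, 0, 4, 3))
today <- 3
obj.names <- names(l)[1:4]

# Positions where the value equals vms AND the day is today;
# Filter(length, ...) then drops objects with no matching position
l3 <- Filter(length, lapply(l[obj.names], function(x) {
  which(x == l[["vms"]] & l[["scheduled.day"]] == today)
}))
l3
```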
Another option is to convert it to data.frame, then do a group by subset to return a summarised index and finally convert to list
library(tibble)
library(dplyr)
library(tidyr)
enframe(l[obj.names]) %>%
unnest(c(value)) %>%
group_by(name) %>%
summarise(rn = list(row_number()[value ==l[['vms']] & l[["scheduled.day"]] == today])) %>%
unnest(c(rn)) %>%
deframe %>%
as.list
#$msfr
#[1] 2
#$mshl
#[1] 5

R regexp for odd sorting of a char vector

I have several hundred files that need their columns sorted in a convoluted way. Imagine a character vector x which is the result of names(foo) where foo is a data.frame:
x <- c("x1","i2","Component.1","Component.10","Component.143","Component.13",
"r4","A","C16:1n-7")
I'd like to have it ordered according to the following rule: First, alphabetical for anything starting with "Component". Second, alphabetical for anything remaining starting with "C" and a number. Third anything remaining in alphabetical order.
For x that would be:
x[c(3,4,6,5,9,8,2,7,1)]
Is this a regexp kind of task? And does one use match? Each file will have a different number of columns (so x will be of varying lengths). Any tips appreciated.
You can achieve that with the function order from base-r:
x <- c("x1","i2","Component.1","Component.10","Component.143","Component.13",
"r4","A","C16:1n-7")
order(
!startsWith(x, "Component"), # 0 - starts with component, 1 - o.w.
!grepl("^C\\d", x), # 0 - starts with C<NUMBER>, 1 - o.w.
x # alphabetical
)
# output: 3 4 6 5 9 8 2 7 1
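Since the goal is to reorder the columns of many data frames, the same keys can be wrapped in a small function and applied to names(foo). This is only a sketch: column_order is a hypothetical helper, foo is a dummy data frame, and method = "radix" is my addition so that the alphabetical key does not depend on the session locale.

```r
x <- c("x1", "i2", "Component.1", "Component.10", "Component.143",
       "Component.13", "r4", "A", "C16:1n-7")
foo <- as.data.frame(matrix(0, nrow = 2, ncol = length(x),
                            dimnames = list(NULL, x)))

# Hypothetical helper: returns the column permutation for a name vector
column_order <- function(nms) {
  order(!startsWith(nms, "Component"),  # "Component..." columns first
        !grepl("^C\\d", nms),           # then "C<digit>..." columns
        nms,                            # ties broken alphabetically
        method = "radix")               # C-locale string ordering
}

foo_sorted <- foo[column_order(names(foo))]
names(foo_sorted)
```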
A brute-force solution using only base R:
first = sort(x[grepl('^Component', x)])
second = sort(x[grepl('^C\\d', x)])
third = sort(setdiff(x, c(first, second)))
c(first, second, third)
We can split into different elements and then use mixedsort from gtools
v1 <- c(gtools::mixedsort(grep("Component", x, value = TRUE)),
gtools::mixedsort(grep("^C\\d+", x, value = TRUE)))
c(v1, gtools::mixedsort(x[!x %in% v1]))
#[1] "Component.1" "Component.10" "Component.13" "Component.143" "C16:1n-7" "A" "i2" "r4"
#[9] "x1"
Or another option in select assuming that these are the columns of the data.frame
library(dplyr)
df1 %>%
select(gtools::mixedsort(names(.)[starts_with('Component')]),
gtools::mixedsort(names(.)[matches("^C\\d+")]),
gtools::mixedsort(names(.)[everything()]))
If it is just the order of occurrence
df1 %>%
select(starts_with('Component'), matches('^C\\d+'), sort(names(.)[everything()]))
data
set.seed(24)
df1 <- as.data.frame(matrix(rnorm(5 * 9), ncol = 9,
dimnames = list(NULL, x)))

Removing all subsets from a list

I have a list that looks as follows:
a <- c(1, 3, 4)
b <- c(0, 2, 6)
c <- c(3, 4)
d <- c(0, 2, 6)
list(a, b, c, d)
From this list I would like to remove all subsets such that the list looks as follows:
[[1]]
[1] 1 3 4
[[2]]
[1] 0 2 6
How do I do this? In my actual data I am working with a very long list (> 500k elements) so any suggestions for an efficient implementation are welcome.
Here is an approach.
lst <- list(a, b, c, d) # The list
First, remove all duplicates.
lstu <- unique(lst)
If the list still contains more than one element, we order the list by the lengths of its elements (decreasing).
lstuo <- lstu[order(-lengths(lstu))]
Then subsets can be filtered with this command:
lstuo[c(TRUE, !sapply(2:length(lstuo),
function(x) any(sapply(seq_along(lstuo)[-x],
function(y) all(lstuo[[x]] %in% lstuo[[y]])))))] # compare within the ordered list
The result:
[[1]]
[1] 1 3 4
[[2]]
[1] 0 2 6
Alternative approach
Your data
lst <- list(a, b, c, d) # The list
lstu <- unique(lst) # remove duplicates, piggyback Sven's approach
Make matrix of values and index
m <- combn(lstu, 2) # 2-row matrix of non-self pairwise combinations of values
n <- combn(length(lstu), 2) # 2-row matrix of non-self pairwise combination of index
Determine if subset
issubset <- t(sapply(list(c(1,2),c(2,1)), function(z) mapply(function(x,y) all(x %in% y), m[z[1],], m[z[2],])))
Discard subset vectors from list
discard <- c(n*issubset)[c(n*issubset)>0]
ans <- if (length(discard)) lstu[-discard] else lstu # lstu[-integer(0)] would drop everything
Output
[[1]]
[1] 1 3 4
[[2]]
[1] 0 2 6
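Another way to express the same idea is a single loop that only compares each vector against the vectors already kept. This is a sketch under the assumption that the elements are plain atomic vectors; drop_subsets is a hypothetical name, and I have not measured it on a list of 500k elements.

```r
# Hypothetical helper: keep only vectors that are not contained in
# another vector of the list
drop_subsets <- function(sets) {
  sets <- unique(sets)
  # Longest first, so any superset is seen before its subsets
  sets <- sets[order(-lengths(sets))]
  keep <- list()
  for (s in sets) {
    # Is s fully contained in a vector we already kept?
    is_sub <- any(vapply(keep, function(k) all(s %in% k), logical(1)))
    if (!is_sub) keep <- c(keep, list(s))
  }
  keep
}

a <- c(1, 3, 4)
b <- c(0, 2, 6)
c <- c(3, 4)
d <- c(0, 2, 6)
result <- drop_subsets(list(a, b, c, d))
result
```

Comparing only against kept vectors is enough: if a vector is a subset of one that was itself dropped, it is also a subset of whatever kept vector caused that drop.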

Resources