Differing number of rows - r

Suppose I have a vector of numbers, a <- c(1, 2, 3, 4, 5, 6), and a vector of positions, b <- c(1, 2, 3).
For every position in b, I want the numbers in a up to and including that position.
I do this with lapply(b, function(x) a[1:x]) and get the result
[[1]]
[1] 1
[[2]]
[1] 1 2
[[3]]
[1] 1 2 3
Now I want to combine them in a data frame. If the number of values for every position were equal, I would normally have done t(as.data.frame(lapply(b, function(x) a[1:x]))). But I cannot do that here because the numbers of rows differ. How can I put zeros in place of the non-existing values?

If the output list is 'lst1', make all elements the same length with the length<- assignment (which pads with NA), then replace the NAs with 0:
lapply(lst1, function(x) {
length(x) <- max(lengths(lst1))
replace(x, is.na(x), 0)})
Data:
lst1 <- lapply(b, function(x) a[1:x])
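The answer above returns a padded list; the question also asks for a data frame. A minimal sketch of that last step, assuming the vectors a and b from the question (the names n, padded and res are only illustrative):

```r
a <- c(1, 2, 3, 4, 5, 6)
b <- c(1, 2, 3)
lst1 <- lapply(b, function(x) a[1:x])

# Pad every element to the longest length; `length<-` fills with NA,
# which is then replaced by 0
n <- max(lengths(lst1))
padded <- lapply(lst1, function(x) {
  length(x) <- n
  replace(x, is.na(x), 0)
})

# Bind the equal-length vectors as rows and convert to a data frame
res <- as.data.frame(do.call(rbind, padded))
res
```

Each position in b becomes one row, and the columns get the default names V1, V2 and V3.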

Related

Get index and value of non-NA list element

Take a list like as.list(rep(c(NA, 4, NA), times = c(5, 1, 2))) i.e.
[[1]]
[1] NA
[[2]]
[1] NA
[[3]]
[1] NA
[[4]]
[1] NA
[[5]]
[1] NA
[[6]] # index of non-NA list element, 6
[1] 4 # ...and its corresponding value, 4
[[7]]
[1] NA
[[8]]
[1] NA
I want to extract the index of the non-NA element (here: 6), and its corresponding value (here: 4). Is there any idiomatic way to get these two numbers?
1) Base R. Assuming that the list L contains only scalars and NAs, this returns a two-column matrix with one row for each (x, y) pair and an attribute recording which positions were omitted.
Omit the x= and y= if you don't want the column names. If you don't want the attribute recording the positions of the NAs, append [,] to the end of the line. If you know that there is only one scalar, you might want to wrap the result in c(...) to produce a two-element vector. If you prefer data frame output, replace cbind with data.frame.
na.omit(cbind(x = seq_along(L), y = unlist(L)))
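The variations described above can be sketched as follows, assuming L is the list from the question (the names m, v and df are only illustrative):

```r
L <- as.list(rep(c(NA, 4, NA), times = c(5, 1, 2)))

# Matrix result; na.omit() attaches an "na.action" attribute
m <- na.omit(cbind(x = seq_along(L), y = unlist(L)))

# Re-subsetting with [, ] drops that attribute; with a single row left,
# the result simplifies to a named vector
v <- m[, ]
v

# Data frame variant: swap cbind for data.frame
df <- na.omit(data.frame(x = seq_along(L), y = unlist(L)))
df
```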
2) tidyverse. Alternatively, using the tidyverse:
library(tibble)
library(tidyr)
drop_na(enframe(unlist(L)))
2a) which could alternately be written using pipes like this:
L %>% unlist %>% enframe %>% drop_na
I am not sure this is elegant enough, but it works:
mylist <- as.list(rep(c(NA, 4, NA), times = c(5, 1, 2)))
x <- (1:length(mylist))[!sapply(mylist,is.na)]
y <- mylist[[x]]
coor <- c(x,y)
coor
Output:
[1] 6 4
Find the index of the non-NA list element (here 6) to get x.
Remove all list elements that are NA to get y.
my_list <- as.list(rep(c(NA, 4, NA), times = c(5, 1, 2)))
x <- which(!is.na(my_list))
y <- unlist(Filter(function(a) any(!is.na(a)), my_list))
coordinate <- c(x,y)
coordinate
> coordinate
[1] 6 4
Using stack
na.omit(stack(setNames(L, seq_along(L))))
values ind
6 4 6

Grouping Ids based on at least one common values

I have a list whose elements are vectors of integers, and I would like to merge these elements only if they share at least one value. Elements that don't share any values with the rest should stay as they are. Here is my sample data:
x <- list(c(1, 2), c(1, 2, 3), c(2, 3, 4), c(3, 4, 5), c(4, 5, 8), c(6, 9, 7), 7, c(5, 8), 10, 11)
And here is my desired output:
desired_result <- list(c(1, 2, 3, 4, 5, 8),
c(6, 9, 7),
10,
11)
I would prefer to do it with the reduce or accumulate functions from purrr, but any other tidyverse solution would be welcome. I have tried this solution so far, but it only gives me one union and apparently abandons the rest:
x %>%
reduce(~ if(any(.x %in% .y)) union(.x, .y) else .x)
[1] 1 2 3 4 5 8
In general I am looking for a way of grouping integers (ids) with common values like a sort of clustering but so far my efforts have been in vain unfortunately.
Thank you very much indeed for your help in advance.
I suspect there's a set covering solution to be had, but in the interim here's a graph approach:
First, let's convert the integer vectors to an edge list so it can be made into a graph. We can use expand.grid.
library(igraph)
edgelist <- do.call(rbind,lapply(x,\(x)expand.grid(x,x))) #R version >= 4.1.0
Now we have a two column data.frame showing the connections between all the integers (a set of edges).
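To see what goes into the edge list, here is a small sketch for individual elements of x (the names pairs and loops are only illustrative). Note that expand.grid also pairs each value with itself, which is why single-value elements such as 10 and 11 still end up as vertices in the graph:

```r
# Every value of one element paired with every value of the same element
pairs <- expand.grid(c(2, 3, 4), c(2, 3, 4))
nrow(pairs)    # 3 x 3 combinations, in columns Var1 and Var2

# A single-value element still yields one row, a self-loop
loops <- expand.grid(10, 10)
nrow(loops)
```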
igraph::graph.data.frame can conveniently make a graph from this.
From there we can use igraph::components to extract the connected components.
g <- graph.data.frame(edgelist)
split(names(components(g)$membership),components(g)$membership)
#$`1`
#[1] "1" "2" "3" "4" "5" "8"
#$`2`
#[1] "6" "9" "7"
#$`3`
#[1] "10"
#$`4`
#[1] "11"
Or with Tidyverse:
library(dplyr); library(purrr)
map_dfr(x, ~expand.grid(.x,.x)) %>%
graph.data.frame() %>%
components() %>%
pluck(membership) %>%
stack() %>%
{split(as.numeric(as.character(.[,2])),.[,1])}
$`1`
[1] 1 2 3 4 5 8
$`2`
[1] 6 9 7
$`3`
[1] 10
$`4`
[1] 11
One way of doing it:
Set i is adjacent to set j iff intersect(i, j) is non-empty. We want the connected components of the graph whose adjacency matrix has 1 at position (i, j) iff set i is adjacent to set j, and 0 otherwise. The first four lines build the adjacency matrix (and the graph from it), the fifth and sixth lines find the connected components, and the rest split the list by that membership and take the unique values.
library(tidyverse)
library(igraph)
map(x, function(a) map_int(x, ~length(base::intersect(a, .x)) > 0) * 1L) %>%
reduce(rbind) %>%
graph.adjacency() %>%
as.undirected() %>%
components() %>%
pluck("membership") %>%
split(seq_along(.), .) %>%
map(~unique(unlist(x[.x])))
Thanks to a very informative post introduced by my dear friend @Ian Campbell, I decided to challenge myself to write a custom function for this purpose. It is still a first version: it is not very elegant and can certainly be improved, but it has been stable on the inputs I have tried so far.
anoush <- function(x) {
# First we check whether x is a list
stopifnot(is.list(x))
# Then we take every element of the input and calculate the intersect between
# that element & others. In case there were some we would store the indices
# in `vec` vector. So in the end we have a list called `ind` whose elements
# are all the indices connected with the corresponding elements of the original
# list for example first element of `ind` is `1`, `2`, `3` which means in
# the original list these elements have common values.
ind <- lapply(1:length(x), function(a) {
vec <- c()
for(i in 1:length(x)) {
if(length(unique(base::intersect(x[[a]], x[[i]]))) > 0) {
vec <- c(vec, i)
}
}
vec
})
# Then we go on to again compare each element of `ind` with other elements
# in case there were any intersect, we will calculate the `union` of them.
# for each element we will end up with a list of accumulated values, but
# in the end we use `Reduce` to capture only the last one. So for each
# element of `ind` we end up having a collection of indices that also
# result in duplicated values. For example elements `1` through `5` of
# `dup_ind` contains the same value cause in the original list these
# elements have common values.
dup_ind <- lapply(1:length(ind), function(a) {
out <- c()
for(i in 1:length(ind)) {
if(length(unique(base::intersect(ind[[a]], ind[[i]]))) > 0) {
out[[i]] <- union(ind[[a]], ind[[i]])
}
vec2 <- Reduce("union", out)
}
vec2
})
# Here we get rid of the duplicated elements of the list by means of the
# `relist` function, and since in this process all the duplicated elements
# will turn to `integer(0)` I have filtered those out.
un <- unlist(dup_ind)
res <- Map(`[`, dup_ind, relist(!duplicated(un), skeleton = dup_ind))
res2 <- Filter(length, res)
sapply(res2, function(a) unique(unlist(lapply(a, function(b) `[[`(x, b)))))
}
Output
> anoush(x)
[[1]]
[1] 1 2 3 4 5 8
[[2]]
[1] 6 9 7
[[3]]
[1] 10
[[4]]
[1] 11
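For comparison, the same grouping can be sketched in base R without igraph by merging incrementally: each new element is folded together with every existing group it overlaps. This is only a sketch under the assumption that the order of the resulting groups does not matter; merge_overlapping is a hypothetical helper name, not a function from any package.

```r
x <- list(c(1, 2), c(1, 2, 3), c(2, 3, 4), c(3, 4, 5), c(4, 5, 8),
          c(6, 9, 7), 7, c(5, 8), 10, 11)

# Hypothetical helper: single pass that keeps `groups` pairwise disjoint
merge_overlapping <- function(sets) {
  groups <- list()
  for (s in sets) {
    # Which existing groups share at least one value with s?
    hit <- vapply(groups, function(g) any(s %in% g), logical(1))
    # Fold s and every group it touches into one new group
    merged <- unique(c(unlist(groups[hit]), s))
    groups <- c(groups[!hit], list(merged))
  }
  groups
}

res <- merge_overlapping(x)
res
```

The groups contain the same values as the desired output, but a merged group is moved to the end of the list, so the groups may come out in a different order.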

Removing all subsets from a list

I have a list that looks as follows:
a <- c(1, 3, 4)
b <- c(0, 2, 6)
c <- c(3, 4)
d <- c(0, 2, 6)
list(a, b, c, d)
From this list I would like to remove all subsets such that the list looks as follows:
[[1]]
[1] 1 3 4
[[2]]
[1] 0 2 6
How do I do this? In my actual data I am working with a very long list (> 500k elements) so any suggestions for an efficient implementation are welcome.
Here is an approach.
lst <- list(a, b, c, d) # The list
First, remove all duplicates.
lstu <- unique(lst)
If the list still contains more than one element, we order the list by the lengths of its elements (decreasing).
lstuo <- lstu[order(-lengths(lstu))]
Then subsets can be filtered with this command:
lstuo[c(TRUE, !sapply(2:length(lstuo),
function(x) any(sapply(seq_along(lstuo)[-x],
function(y) all(lstuo[[x]] %in% lstuo[[y]])))))] # compare within the ordered list
The result:
[[1]]
[1] 1 3 4
[[2]]
[1] 0 2 6
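The three steps above can be wrapped into one function. The following is a sketch, not the answerer's code: remove_subsets is a hypothetical name, and it assumes that no element contains repeated values, so that after sorting by decreasing length only earlier elements can contain a later one. It also handles the case where a single element is left after deduplication.

```r
a <- c(1, 3, 4)
b <- c(0, 2, 6)
c <- c(3, 4)
d <- c(0, 2, 6)

# Hypothetical helper combining dedupe, ordering and subset filtering
remove_subsets <- function(lst) {
  u <- unique(lst)                 # drop exact duplicates
  if (length(u) < 2) return(u)     # nothing left to compare
  u <- u[order(-lengths(u))]       # longest elements first
  keep <- vapply(seq_along(u), function(i) {
    # Only earlier (at least as long) elements can contain u[[i]]
    earlier <- u[seq_len(i - 1)]
    !any(vapply(earlier, function(big) all(u[[i]] %in% big), logical(1)))
  }, logical(1))
  u[keep]
}

res <- remove_subsets(list(a, b, c, d))
res
```

This still compares elements pairwise, so the run time grows quadratically with the number of unique elements; for a list with more than 500k elements a different strategy may be needed.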
Alternative approach
Your data
lst <- list(a, b, c, d) # The list
lstu <- unique(lst) # remove duplicates, piggyback Sven's approach
Make matrix of values and index
m <- combn(lstu, 2) # 2-row matrix of non-self pairwise combinations of values
n <- combn(length(lstu), 2) # 2-row matrix of non-self pairwise combination of index
Determine if subset
issubset <- t(sapply(list(c(1,2),c(2,1)), function(z) mapply(function(x,y) all(x %in% y), m[z[1],], m[z[2],])))
Discard subset vectors from list
discard <- c(n*issubset)[c(n*issubset)>0]
ans <- lstu[-discard]
Output
[[1]]
[1] 1 3 4
[[2]]
[1] 0 2 6

For each row, return the column name of the largest value whilst removing ties

I have this data:
x = c(1,1,3, 3, 2)
y = c(1,2,1, 1, 2)
z = c(1,1,2, 3, 7)
data <- data.frame(x, y, z)
And I would like to get a vector indicating the column number of the highest value in each row, with ties either removed or marked as NA.
I have tried which.max:
HighestIncludingTies <- apply(data, 1, which.max)
However, this does not mark ties with NA (or something similar).
Thanks a lot for any help or guidance!
Here's an attempt using max.col:
HighsNoTies <- max.col(data,"first")
replace(HighsNoTies, HighsNoTies != max.col(data,"last"), NA)
#[1] NA 2 1 NA 3
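The max.col approach works because the first and the last position of the maximum differ exactly when the maximum is tied. The same result can be reached more explicitly by counting how often each row attains its maximum; this is a sketch using the data from the question (the names m, row_max, n_at_max and res are only illustrative):

```r
data <- data.frame(x = c(1, 1, 3, 3, 2),
                   y = c(1, 2, 1, 1, 2),
                   z = c(1, 1, 2, 3, 7))
m <- as.matrix(data)

# Maximum of every row, and how many columns reach it
row_max  <- apply(m, 1, max)
n_at_max <- rowSums(m == row_max)

# Keep the column index only where the maximum is unique
res <- ifelse(n_at_max == 1, max.col(m, "first"), NA)
res
```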

Find indices of rows from matrix A in matrix B

Let's consider two matrices A and B, where the rows of A are a subset of the rows of B. How can I find the index of each row of A in matrix B?
Here is a reproducible example:
set.seed(30)
B <- matrix(rnorm(n =30,mean = 0), ncol=3)
A <- subset(B, B[,1] > 1)
The goal is to find the indices idx, which in this case are rows 4 and 5.
Nested apply loops should do it.
apply(A, 1, function(a)
which(apply(B, 1, function(b) all(b==a)))
)
# [1] 4 5
Or alternatively, using colSums
apply(A, 1, function(a)
which(colSums(t(B) == a) == ncol(B)))
# [1] 4 5
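The colSums variant relies on recycling: t(B) == a compares a against every column of t(B), that is, against every row of B, and a row matches when all ncol(B) comparisons are TRUE. A small deterministic sketch (the matrices here are made up for illustration and are not the ones from the question):

```r
B <- matrix(c(1, 2, 3,
              4, 5, 6,
              7, 8, 9,
              4, 5, 0), ncol = 3, byrow = TRUE)
A <- B[c(2, 3), , drop = FALSE]   # rows 2 and 3 of B

# Row 4 of B shares two of three values with A[1, ], so it must not match
idx <- apply(A, 1, function(a) which(colSums(t(B) == a) == ncol(B)))
idx
```

If a row of A occurred more than once in B, which would return several indices for it and apply would return a list instead of a vector.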
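Alternatively, you could flag the rows of B that also occur in A (this adds a 0/1 indicator column to B rather than returning the indices):
transform(B, idx = 1 * duplicated(rbind(A, B))[-seq_len(nrow(A))])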
A nice solution that avoids nested apply calls, originally by @Arun.
> match(apply(A, 1, paste, collapse="\b"), apply(B, 1, paste, collapse="\b"))
[1] 4 5
This takes a slightly different approach and relies on the fact that a matrix is a vector, so it won't work if you have data.frames:
which( B %in% A , arr.ind=TRUE )[1:nrow(A)]
#[1] 4 5
And if you had really big matrices and wanted to be a bit more efficient, you could use %in% on just the first column, like so:
which( B[1:nrow(B)] %in% A[1:nrow(A)] , arr.ind=TRUE )
But I don't expect this would make too much of a difference except in really big matrices.
If you had your data as data.frames you could do the same thing by passing just the first column to which:
A <- data.frame(A)
B <- data.frame(B)
which( B$X1 %in% A$X1 )
#[1] 4 5
