Take a list such as as.list(rep(c(NA, 4, NA), times = c(5, 1, 2))), i.e.
[[1]]
[1] NA
[[2]]
[1] NA
[[3]]
[1] NA
[[4]]
[1] NA
[[5]]
[1] NA
[[6]] # index of non-NA list element, 6
[1] 4 # ...and its corresponding value, 4
[[7]]
[1] NA
[[8]]
[1] NA
I want to extract the index of the non-NA element (here: 6), and its corresponding value (here: 4). Is there any idiomatic way to get these two numbers?
1) Base R Assuming that the list L contains only scalars and NAs, this returns a two-column matrix with one row for each pair of xy coordinates and an attribute recording which positions were omitted:
na.omit(cbind(x = seq_along(L), y = unlist(L)))
Omit the x = and y = if you don't want the column names. If you don't want the attribute recording the positions of the NAs, append [,] to the end of the line. If you know that there is only one scalar, you might want to wrap the whole expression in c(...) to produce a 2-element vector. If you prefer data frame output, replace cbind with data.frame.
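For reference, those variations would look something like this (a sketch, assuming the question's list is bound to L):
L <- as.list(rep(c(NA, 4, NA), times = c(5, 1, 2)))
na.omit(cbind(seq_along(L), unlist(L)))               # without column names
na.omit(cbind(x = seq_along(L), y = unlist(L)))[,]    # [,] drops the na.action attribute
c(na.omit(cbind(x = seq_along(L), y = unlist(L))))    # c(...) gives the 2-element vector 6 4
na.omit(data.frame(x = seq_along(L), y = unlist(L)))  # data frame output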
2) tidyverse Or, using the tidyverse:
library(tibble)
library(tidyr)
drop_na(enframe(unlist(L)))
2a) which could alternatively be written using pipes like this:
L %>% unlist %>% enframe %>% drop_na
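For the example list, this should give a one-row tibble (a sketch of the expected result):
L <- as.list(rep(c(NA, 4, NA), times = c(5, 1, 2)))
L %>% unlist %>% enframe %>% drop_na
# # A tibble: 1 x 2
#    name value
#   <int> <dbl>
# 1     6     4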
I am not sure this is elegant enough, but it works:
mylist <- as.list(rep(c(NA, 4, NA), times = c(5, 1, 2)))
x <- seq_along(mylist)[!sapply(mylist, is.na)]
y <- mylist[[x]]
coor <- c(x,y)
coor
Output:
[1] 6 4
Find the index of the non-NA list element (6) to get x.
Remove all list elements that are NA to get y.
my_list <- as.list(rep(c(NA, 4, NA), times = c(5, 1, 2)))
x <- which(!is.na(my_list))
y <- unlist(Filter(function(a) any(!is.na(a)), my_list))
coordinate <- c(x,y)
coordinate
[1] 6 4
Using stack
na.omit(stack(setNames(L, seq_along(L))))
  values ind
6      4   6
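Note that ind comes back as a factor; to recover the plain index/value pair, something like this works (out is just an illustrative name for the result above):
out <- na.omit(stack(setNames(L, seq_along(L))))
c(as.numeric(as.character(out$ind)), out$values)
# [1] 6 4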
Related
I have a list whose elements are integer vectors, and I would like to accumulate (merge) elements whenever they share at least one value. Elements that don't share any values with the rest should stay as they are. Here is my sample data:
x <- list(c(1, 2), c(1, 2, 3), c(2, 3, 4), c(3, 4, 5), c(4, 5, 8), c(6, 9, 7), 7, c(5, 8), 10, 11)
And here is my desired output:
desired_result <- list(c(1, 2, 3, 4, 5, 8),
                       c(6, 9, 7),
                       10,
                       11)
I would like to do it with the reduce or accumulate functions from purrr, but any other tidyverse solution would be welcome. I have tried this so far, but it only gives me one union and apparently abandons the rest:
x %>%
reduce(~ if(any(.x %in% .y)) union(.x, .y) else .x)
[1] 1 2 3 4 5 8
In general I am looking for a way of grouping integers (ids) with common values, a sort of clustering, but so far my efforts have been in vain.
Thank you very much indeed for your help in advance.
I suspect there's a set covering solution to be had, but in the interim here's a graph approach:
First, let's convert the integer vectors to an edge list so it can be made into a graph. We can use expand.grid.
library(igraph)
edgelist <- do.call(rbind, lapply(x, \(v) expand.grid(v, v))) # R version >= 4.1.0
Now we have a two column data.frame showing the connections between all the integers (a set of edges).
igraph::graph.data.frame can conveniently make a graph from this.
From there we can use igraph::components to extract the connected components.
g <- graph.data.frame(edgelist)
split(names(components(g)$membership), components(g)$membership)
#$`1`
#[1] "1" "2" "3" "4" "5" "8"
#$`2`
#[1] "6" "9" "7"
#$`3`
#[1] "10"
#$`4`
#[1] "11"
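The vertex names come back as character strings; if you need numeric vectors matching the desired output, a small follow-up (a sketch) is:
lapply(split(names(components(g)$membership), components(g)$membership),
       as.numeric)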
Or with the tidyverse:
library(dplyr); library(purrr)
map_dfr(x, ~ expand.grid(.x, .x)) %>%
  graph.data.frame() %>%
  components() %>%
  pluck("membership") %>%
  stack() %>%
  {split(as.numeric(as.character(.[, 2])), .[, 1])}
$`1`
[1] 1 2 3 4 5 8
$`2`
[1] 6 9 7
$`3`
[1] 10
$`4`
[1] 11
One way of doing it:
i is adjacent to j iff intersect(i, j) is not the empty set. We want to find the connected components of the matrix that has 1 at position (i, j) iff set i is adjacent to set j, and 0 otherwise. The first four lines of the pipe build the adjacency matrix, the 5th and 6th find the connected components, and the rest splits the list based on that membership and takes the unique values.
library(tidyverse)
library(igraph)
map(x, function(a) map_int(x, ~ length(base::intersect(a, .x)) > 0) * 1L) %>%
  reduce(rbind) %>%
  graph.adjacency() %>%
  as.undirected() %>%
  components() %>%
  pluck("membership") %>%
  split(seq_along(.), .) %>%
  map(~ unique(unlist(x[.x])))
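For the sample x this should produce the same grouping as the edge-list approach:
# $`1`
# [1] 1 2 3 4 5 8
# $`2`
# [1] 6 9 7
# $`3`
# [1] 10
# $`4`
# [1] 11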
Thanks to a very informative post introduced by my dear friend @Ian Campbell, I thought I would challenge myself to write a custom function for this purpose. It is still a first version, not very elegant, and can certainly be improved greatly, but for now it is stable: I tried it on some inputs and it did not disappoint.
anoush <- function(x) {
  # First we check whether x is a list
  stopifnot(is.list(x))
  # Then we take every element of the input and calculate the intersect between
  # that element and the others. In case there are any, we store the indices
  # in the vector `vec`. So in the end we have a list called `ind` whose elements
  # are all the indices connected with the corresponding elements of the original
  # list. For example, the first element of `ind` is `1`, `2`, `3`, which means
  # these elements of the original list have common values.
  ind <- lapply(1:length(x), function(a) {
    vec <- c()
    for (i in 1:length(x)) {
      if (length(unique(base::intersect(x[[a]], x[[i]]))) > 0) {
        vec <- c(vec, i)
      }
    }
    vec
  })
  # Then we go on to again compare each element of `ind` with the other elements;
  # in case there is any intersect, we calculate the `union` of them.
  # For each element we end up with a list of accumulated values, but in the end
  # we use `Reduce` to capture only the last one. So for each element of `ind`
  # we end up having a collection of indices that also results in duplicated
  # values. For example, elements `1` through `5` of `dup_ind` contain the same
  # value because in the original list these elements have common values.
  dup_ind <- lapply(1:length(ind), function(a) {
    out <- c()
    for (i in 1:length(ind)) {
      if (length(unique(base::intersect(ind[[a]], ind[[i]]))) > 0) {
        out[[i]] <- union(ind[[a]], ind[[i]])
      }
      vec2 <- Reduce("union", out)
    }
    vec2
  })
  # Here we get rid of the duplicated elements of the list by means of the
  # `relist` function, and since in this process all the duplicated elements
  # turn into `integer(0)`, I have filtered those out.
  un <- unlist(dup_ind)
  res <- Map(`[`, dup_ind, relist(!duplicated(un), skeleton = dup_ind))
  res2 <- Filter(length, res)
  sapply(res2, function(a) unique(unlist(lapply(a, function(b) `[[`(x, b)))))
}
Output
> anoush(x)
[[1]]
[1] 1 2 3 4 5 8
[[2]]
[1] 6 9 7
[[3]]
[1] 10
[[4]]
[1] 11
Suppose I have a vector of numbers a, a <- c(1, 2, 3, 4, 5, 6), and a vector of positions b, b <- c(1, 2, 3).
Then, for every position in b, I want to get the numbers in a up to and including that position.
I do lapply(b, function(x) a[1:x]) and I get this result:
[[1]]
[1] 1
[[2]]
[1] 1 2
[[3]]
[1] 1 2 3
Now I want to combine them in a data frame. Normally, if the number of values for every position were equal, I would have done t(as.data.frame(lapply(b, function(x) a[1:x]))). But I cannot do that here because the numbers of rows are different. How can I put zeros for the non-existing values?
If the output list is 'lst1', then make the lengths the same with length<- assignment:
lapply(lst1, function(x) {
  length(x) <- max(lengths(lst1))
  replace(x, is.na(x), 0)
})
data
lst1 <- lapply(b, function(x) a[1:x])
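Putting it together with the question's a and b (a quick end-to-end sketch; do.call(rbind, ...) is one way to combine the padded rows):
a <- c(1, 2, 3, 4, 5, 6)
b <- c(1, 2, 3)
lst1 <- lapply(b, function(x) a[1:x])
padded <- lapply(lst1, function(x) {
  length(x) <- max(lengths(lst1))  # pad with NA up to the longest element
  replace(x, is.na(x), 0)          # then turn the NA padding into zeros
})
do.call(rbind, padded)
#      [,1] [,2] [,3]
# [1,]    1    0    0
# [2,]    1    2    0
# [3,]    1    2    3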
I have several hundred files that need their columns sorted in a convoluted way. Imagine a character vector x which is the result of names(foo) where foo is a data.frame:
x <- c("x1","i2","Component.1","Component.10","Component.143","Component.13",
"r4","A","C16:1n-7")
I'd like to have it ordered according to the following rule: first, alphabetical for anything starting with "Component"; second, alphabetical for anything remaining that starts with "C" and a number; third, anything remaining, in alphabetical order.
For x that would be:
x[c(3,4,6,5,9,8,2,7,1)]
Is this a regexp kind of task? And does one use match? Each file will have a different number of columns (so x will be of varying lengths). Any tips appreciated.
You can achieve that with the function order from base R:
x <- c("x1","i2","Component.1","Component.10","Component.143","Component.13",
"r4","A","C16:1n-7")
order(
!startsWith(x, "Component"), # 0 - starts with component, 1 - o.w.
!grepl("^C\\d", x), # 0 - starts with C<NUMBER>, 1 - o.w.
x # alphabetical
)
# output: 3 4 6 5 9 8 2 7 1
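To get the reordered vector itself, index x with that result (ord is just an illustrative name):
ord <- order(!startsWith(x, "Component"), !grepl("^C\\d", x), x)
x[ord]
# [1] "Component.1"   "Component.10"  "Component.13"  "Component.143"
# [5] "C16:1n-7"      "A"             "i2"            "r4"           
# [9] "x1"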
A brute-force solution using only base R:
first = sort(x[grepl('^Component', x)])
second = sort(x[grepl('^C\\d', x)])
third = sort(setdiff(x, c(first, second)))
c(first, second, third)
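For the sample x this should reproduce the same ordering:
# [1] "Component.1"   "Component.10"  "Component.13"  "Component.143"
# [5] "C16:1n-7"      "A"             "i2"            "r4"           
# [9] "x1"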
We can split it into the different groups of elements and then use mixedsort from gtools:
v1 <- c(gtools::mixedsort(grep("Component", x, value = TRUE)),
gtools::mixedsort(grep("^C\\d+", x, value = TRUE)))
c(v1, gtools::mixedsort(x[!x %in% v1]))
#[1] "Component.1" "Component.10" "Component.13" "Component.143" "C16:1n-7" "A" "i2" "r4"
#[9] "x1"
Or, another option with select, assuming that these are the columns of the data.frame:
library(dplyr)
df1 %>%
select(mixedsort(starts_with('Component')),
mixedsort(names(.)[matches("^C\\d+")]),
gtools::mixedsort(names(.)[everything()]))
If it is just the order of occurrence
df1 %>%
select(starts_with('Component'), matches('^C\\d+'), sort(names(.)[everything()]))
data
set.seed(24)
df1 <- as.data.frame(matrix(rnorm(5 * 9), ncol = 9,
dimnames = list(NULL, x)))
I have a list that looks as follows:
a <- c(1, 3, 4)
b <- c(0, 2, 6)
c <- c(3, 4)
d <- c(0, 2, 6)
list(a, b, c, d)
From this list I would like to remove all subsets such that the list looks as follows:
[[1]]
[1] 1 3 4
[[2]]
[1] 0 2 6
How do I do this? In my actual data I am working with a very long list (> 500k elements) so any suggestions for an efficient implementation are welcome.
Here is an approach.
lst <- list(a, b, c, d) # The list
First, remove all duplicates.
lstu <- unique(lst)
If the list still contains more than one element, we order the list by the lengths of its elements (decreasing).
lstuo <- lstu[order(-lengths(lstu))]
Then subsets can be filtered with this command:
lstuo[c(TRUE, !sapply(2:length(lstuo),
                      function(x) any(sapply(seq_along(lstuo)[-x],
                                             function(y) all(lstuo[[x]] %in% lstuo[[y]])))))]
The result:
[[1]]
[1] 1 3 4
[[2]]
[1] 0 2 6
Alternative approach
Your data
lst <- list(a, b, c, d) # The list
lstu <- unique(lst) # remove duplicates, piggyback Sven's approach
Make matrix of values and index
m <- combn(lstu, 2) # 2-row matrix of non-self pairwise combinations of values
n <- combn(length(lstu), 2) # 2-row matrix of non-self pairwise combination of index
Determine if subset
issubset <- t(sapply(list(c(1, 2), c(2, 1)),
                     function(z) mapply(function(x, y) all(x %in% y),
                                        m[z[1], ], m[z[2], ])))
Discard subset vectors from list
discard <- c(n*issubset)[c(n*issubset)>0]
ans <- lstu[-discard]
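For the example, only c(3, 4) is a proper subset of another element, so discard is 3 and lstu[-discard] keeps the two maximal sets.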
Output
[[1]]
[1] 1 3 4
[[2]]
[1] 0 2 6
I want to build a matrix or data frame consisting of the names of the columns where the corresponding element of the data frame is not NA. For example, suppose I have:
zz <- data.frame(a = c(1, NA, 3, 5),
b = c(NA, 5, 4, NA),
c = c(5, 6, NA, 8))
which gives:
   a  b  c
1  1 NA  5
2 NA  5  6
3  3  4 NA
4  5 NA  8
I want to recognize each NA and build a new matrix or df that looks like:
a c
b c
a b
a c
There will be the same number of NAs in each row of the input matrix/df. I can't seem to get the right code to do this. Suggestions appreciated!
library(dplyr)
library(tidyr)
zz %>%
  mutate(k = row_number()) %>%
  gather(column, value, a, b, c) %>%
  filter(!is.na(value)) %>%
  group_by(k) %>%
  summarise(temp_var = paste(column, collapse = " ")) %>%
  separate(temp_var, into = c("var1", "var2"))
# A tibble: 4 × 3
      k var1  var2 
* <int> <chr> <chr>
1     1 a     c    
2     2 b     c    
3     3 a     b    
4     4 a     c    
Here's a possible vectorized base R approach
indx <- which(!is.na(zz), arr.ind = TRUE)
matrix(names(zz)[indx[order(indx[, "row"]), "col"]], ncol = 2, byrow = TRUE)
# [,1] [,2]
#[1,] "a" "c"
#[2,] "b" "c"
#[3,] "a" "b"
#[4,] "a" "c"
This finds non-NA indices, sorts by rows order and then subsets the names of your zz data set according to the sorted index. You can wrap it into as.data.frame if you prefer it over a matrix.
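For example, the data frame version would just be (a sketch):
as.data.frame(matrix(names(zz)[indx[order(indx[, "row"]), "col"]],
                     ncol = 2, byrow = TRUE))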
EDIT: transpose the data frame once before processing, so we don't need to transpose twice inside the loop as in the first version.
cols <- names(zz)
for (column in cols) {
  zz[[column]] <- ifelse(is.na(zz[[column]]), NA, column)
}
t_zz <- t(zz)
cols <- vector("list", length = ncol(t_zz))
for (i in 1:ncol(t_zz)) {
  cols[[i]] <- na.omit(t_zz[, i])
}
new_dt <- as.data.frame(t(do.call("cbind", cols)))
The tricky part here is that your goal actually changes the data frame's structure, so the task of "removing the NA in each row" has to build the new data frame row by row, since each column of a given row can come from a different column of the original data frame.
zz[1, ] is a one-row data frame; use t to convert it into a vector so we can use na.omit, then transpose it back into a row.
I used two for loops, but for loops are not necessarily bad in R. The first one is vectorized over each column. The second one needs to be done row by row anyway.
EDIT: growing objects is very bad for performance in R. I knew I could use rbindlist from data.table, which can take a list of data frames, but the OP doesn't want new packages. My first attempt just used rbind, which cannot take a list as input. Later I found that an alternative is to use do.call. It's still slower than rbindlist, though.
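For reference, the data.table variant mentioned above might look something like this (a sketch; use.names = FALSE binds by position, since the row fragments carry different names):
library(data.table)
rbindlist(lapply(cols, function(v) as.data.frame(t(v))), use.names = FALSE)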