Split a list whose elements are multiple element lists - r

Say I have a list a which is defined as:
a <- list("aaa;bbb", "aaa", "bbb", "aaa;ccc")
I want to split this list by semicolon ;, get only unique values, and return another list. So far I have split the list using str_split():
a <- str_split(a, ";")
which gives me
> a
[[1]]
[1] "aaa" "bbb"
[[2]]
[1] "aaa"
[[3]]
[1] "bbb"
[[4]]
[1] "aaa" "ccc"
How can I manipulate this list (using unique()?) to give me something like
[[1]]
[1] "aaa"
[[2]]
[1] "bbb"
[[3]]
[1] "ccc"
or more simply,
[[1]]
[1] "aaa" "bbb" "ccc"

One option is to flatten the split list with unlist(), keep the distinct values with unique(), and wrap the result in list().
# So first you use your code
a <- list("aaa;bbb", "aaa", "bbb", "aaa;ccc")
# Load required library
library(stringr) # load str_split
a <- str_split(a, ";")
# Finally use list() with unique() and unlist()
list(unique(unlist(a)))
# And the output
[[1]]
[1] "aaa" "bbb" "ccc"

One alternative in base R is rapply(), which applies a function to each of the innermost elements of a nested list and, by default, returns the most simplified object possible. Applied to the original (unsplit) list a, it returns a character vector.
unique(rapply(a, strsplit, split=";"))
[1] "aaa" "bbb" "ccc"
To return a list, wrap the output in list():
list(unique(rapply(a, strsplit, split=";")))
[[1]]
[1] "aaa" "bbb" "ccc"
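For comparison, here is a minimal base R sketch of the whole pipeline without stringr. It assumes every element of a is a single string; the names parts, vals, one_vector_in_list and one_value_per_item are only illustrative. It produces both output shapes asked for in the question.

```r
a <- list("aaa;bbb", "aaa", "bbb", "aaa;ccc")

# strsplit() expects a character vector, so flatten the list first
parts <- strsplit(unlist(a), ";", fixed = TRUE)

# Flatten the pieces and drop duplicates (order of first appearance is kept)
vals <- unique(unlist(parts))

one_vector_in_list <- list(vals)     # a single element holding all values
one_value_per_item <- as.list(vals)  # one list element per distinct value
```

Which shape is more convenient probably depends on what consumes the result; as.list() is the one to pick if each value should be its own element.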

Related

Program to iterate through lists within a list

I have a list containing a number of lists containing character vectors. The lists are always arranged so that the first list contains a vector with a single element, the second list contains a vector with two elements and the third contains one or more vectors containing three elements.
fruits <- list(
list(c("orange")),
list(c("pear", "orange")),
list(c("orange", "pear", "grape"),
c("orange", "lemon", "pear"))
)
I need to iterate through the lists in order to remove the elements from the vector in the previous list. i.e. I would first find the value from the vector in the first list ('orange') and remove it from the vector in the second list, then take the values from the second list ("pear", "orange") and remove them from both vectors in the third list, so I ended up with:
new_fruits <- list(
list(c("orange")),
list(c("pear")),
list(c("grape"),
c("lemon"))
)
I should add that I have had a go at doing this, but I'm finding the lists within lists make it quite complicated and my solution is long and not very efficient.
Maybe you can try the code below
new_fruits <- s <- c()
for (k in seq_along(fruits)) {
new_fruits[[k]] <- lapply(fruits[[k]],function(x) x[!x%in%s])
s <- union(s,unlist(fruits[[k]]))
}
which gives
> new_fruits
[[1]]
[[1]][[1]]
[1] "orange"
[[2]]
[[2]][[1]]
[1] "pear"
[[3]]
[[3]][[1]]
[1] "grape"
[[3]][[2]]
[1] "lemon"
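The same idea can be wrapped in a small function, which makes it easier to check against the expected result. This is only a sketch: remove_seen and seen are hypothetical names, and it assumes the outer list is ordered so that earlier sub-lists should always be removed from later ones.

```r
fruits <- list(
  list(c("orange")),
  list(c("pear", "orange")),
  list(c("orange", "pear", "grape"),
       c("orange", "lemon", "pear"))
)

# Hypothetical helper: drop every value already seen in an earlier sub-list
remove_seen <- function(nested) {
  seen <- character(0)
  out <- vector("list", length(nested))
  for (k in seq_along(nested)) {
    # Keep only the values of each vector that have not appeared before
    out[[k]] <- lapply(nested[[k]], function(v) v[!v %in% seen])
    # Remember everything in this sub-list for the following iterations
    seen <- union(seen, unlist(nested[[k]]))
  }
  out
}

new_fruits <- remove_seen(fruits)

expected <- list(list("orange"), list("pear"), list("grape", "lemon"))
identical(new_fruits, expected)
```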
Here is an idea where we collapse each vector to a single string, split those strings again, then unlist and keep the unique values. Note that this returns a flat list of the distinct values rather than the nested structure shown in the question, i.e.
as.list(unique(unlist(strsplit(unlist(lapply(fruits, function(i) sapply(i, toString))), ', '))))
#[[1]]
#[1] "orange"
#[[2]]
#[1] "pear"
#[[3]]
#[1] "grape"
#[[4]]
#[1] "lemon"
Another two compact options:
> mapply(fruits,append(list(list("")),fruits[-length(fruits)], after = length(fruits)), FUN = function(x,y) sapply(x,function(item)list(setdiff(item,y[[1]]))))
[[1]]
[[1]][[1]]
[1] "orange"
[[2]]
[[2]][[1]]
[1] "pear"
[[3]]
[[3]][[1]]
[1] "grape"
[[3]][[2]]
[1] "lemon"
or also
> append(fruits[[1]],mapply(fruits[-1],fruits[-length(fruits)], FUN = function(x,y) sapply(x,function(item)list(setdiff(item,y[[1]])))), after = length(fruits))
[[1]]
[1] "orange"
[[2]]
[[2]][[1]]
[1] "pear"
[[3]]
[[3]][[1]]
[1] "grape"
[[3]][[2]]
[1] "lemon"

split vector of strings with partial match

If I have a list of some elements:
x = c('abc', 'bbc', 'cd', 'hj', 'aa', 'zz', 'd9', 'jk')
I'd like to split it every time there's an 'a' to create a nested list:
[1][[1]] 'abc', 'bbc', 'cd', 'hj'
[2][[1]] 'aa', 'zz', 'd9', 'jk'
I tried
split(x, 'a')
but split doesn't look for partial matches.
We can build a grouping vector: grepl() flags the elements that contain the substring 'a' as a logical vector, and cumsum() turns it into a group number that increases by one at each match. That grouping vector is then used in split():
split(x, cumsum(grepl('a', x)))
#$`1`
#[1] "abc" "bbc" "cd" "hj"
#$`2`
#[1] "aa" "zz" "d9" "jk"
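To see why this works, it may help to look at the intermediate vectors. The sketch below only unpacks the one-liner above; starts and groups are illustrative names.

```r
x <- c("abc", "bbc", "cd", "hj", "aa", "zz", "d9", "jk")

# TRUE wherever the element contains an "a", i.e. where a new group begins
starts <- grepl("a", x, fixed = TRUE)

# The running count of TRUEs gives every element the number of its group
groups <- cumsum(starts)

res <- split(x, groups)
```

Here starts is TRUE at positions 1 and 5, so groups is 1 1 1 1 2 2 2 2. One thing to keep in mind: if the first element did not contain an "a", the leading elements would end up in a group named "0".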
Another base R solution using split + findInterval (the code is not as short as the answer by @akrun)
split(x,findInterval(seq_along(x),grep("a",x)))
such that
> split(x,findInterval(seq_along(x),grep("a",x)))
$`1`
[1] "abc" "bbc" "cd" "hj"
$`2`
[1] "aa" "zz" "d9" "jk"
Another base R possibility could be:
split(x, cumsum(nchar(sub("a", "", x, fixed = TRUE)) - nchar(x) != 0))
$`1`
[1] "abc" "bbc" "cd" "hj"
$`2`
[1] "aa" "zz" "d9" "jk"
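All three answers differ only in how they build the grouping vector passed to split(). As a quick sanity check (the variable names g_cumsum, g_interval and g_nchar are illustrative), the grouping vectors can be compared directly:

```r
x <- c("abc", "bbc", "cd", "hj", "aa", "zz", "d9", "jk")

# 1. Running count of the elements that contain "a"
g_cumsum <- cumsum(grepl("a", x))

# 2. For each position, count how many match positions lie at or before it
g_interval <- findInterval(seq_along(x), grep("a", x))

# 3. An element contains "a" if removing the first "a" changes its length
g_nchar <- cumsum(nchar(sub("a", "", x, fixed = TRUE)) - nchar(x) != 0)

same_groups <- identical(g_cumsum, g_interval) && identical(g_cumsum, g_nchar)
```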

Append element of two nested lists with separate order in R

I am trying to append the element github from listA to listB, matching on the element id. How can I achieve this?
G1 <- list(id = "aaa", github = "www.github.aaa")
G2 <- list(id = "bbb", github = "www.github.bbb")
R1 <- list(id = "aaa", reddit = "www.reddit.aaa")
R2 <- list(id = "bbb", reddit = "www.reddit.bbb")
listA <- list(G1, G2)
listB <- list(R2, R1)
With:
> listB
[[1]]
[[1]]$id
[1] "bbb"
[[1]]$reddit
[1] "www.reddit.bbb"
[[2]]
[[2]]$id
[1] "aaa"
[[2]]$reddit
[1] "www.reddit.aaa"
What I'm trying to get is:
> listB
[[1]]
[[1]]$id
[1] "bbb"
[[1]]$reddit
[1] "www.reddit.bbb"
[[1]]$github
[1] "www.github.bbb"
[[2]]
[[2]]$id
[1] "aaa"
[[2]]$reddit
[1] "www.reddit.aaa"
[[2]]$github
[1] "www.github.aaa"
I tried mapply, but it doesn't give me the expected result. Instead, it pairs the two lists by position and ignores id:
> mapply(c, listA, listB, SIMPLIFY=FALSE)
[[1]]
[[1]]$id
[1] "aaa"
[[1]]$github
[1] "www.github.aaa"
[[1]]$id
[1] "bbb"
[[1]]$reddit
[1] "www.reddit.bbb"
[[2]]
[[2]]$id
[1] "bbb"
[[2]]$github
[1] "www.github.bbb"
[[2]]$id
[1] "aaa"
[[2]]$reddit
[1] "www.reddit.aaa"
Thank you
One possibility is to convert them to data.frames and perform a merge:
merge(do.call(rbind, lapply(listA, data.frame, stringsAsFactors=FALSE)),
do.call(rbind, lapply(listB, data.frame, stringsAsFactors=FALSE)), by="id")
This returns
   id         github         reddit
1 aaa www.github.aaa www.reddit.aaa
2 bbb www.github.bbb www.reddit.bbb
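If the result should stay a nested list in the order of listB, as in the question, one possible sketch looks up each id of listB in listA with match(). The names ids_a and listB_new are illustrative, and the sketch assumes every element carries exactly one id.

```r
G1 <- list(id = "aaa", github = "www.github.aaa")
G2 <- list(id = "bbb", github = "www.github.bbb")
R1 <- list(id = "aaa", reddit = "www.reddit.aaa")
R2 <- list(id = "bbb", reddit = "www.reddit.bbb")
listA <- list(G1, G2)
listB <- list(R2, R1)

# Collect the ids of listA once so they can be searched by value
ids_a <- vapply(listA, function(el) el$id, character(1))

listB_new <- lapply(listB, function(el) {
  hit <- match(el$id, ids_a)          # position of this id in listA, or NA
  if (!is.na(hit)) {
    el$github <- listA[[hit]]$github  # copy the matching github entry
  }
  el
})
```

Elements of listB whose id does not occur in listA are returned unchanged, which seemed the least surprising choice for a sketch.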

How do I paste string columns in data.frame

Suppose we have:
mydf <- data.frame(a = LETTERS, b = LETTERS, c = LETTERS)
Now we want to add a new column, containing a concatenation of all columns.
So that rows in the new column read "AAA", "BBB", ...
In my mind the following should work?
mydf[,"Concat"] <- apply(mydf, 1, paste0)
In addition to @akrun's answer, here is a short explanation of why your code didn't work.
apply() passes each row to paste0 as a single character vector, and this is how paste and paste0 behave when given one vector:
> paste0(c("A","A","A"))
[1] "A" "A" "A"
To concatenate the elements of a vector into a single string, you need to use the collapse argument:
> paste0(c("A","A","A"), collapse="")
[1] "AAA"
Consequently, your code should have been:
> apply(mydf, 1, paste0, collapse="")
[1] "AAA" "BBB" "CCC" "DDD" "EEE" "FFF" "GGG" "HHH" "III" "JJJ" "KKK" "LLL" "MMM" "NNN" "OOO" "PPP" "QQQ" "RRR" "SSS" "TTT" "UUU" "VVV"
[23] "WWW" "XXX" "YYY" "ZZZ"
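We can use do.call with paste0 for faster execution. It passes each column to paste0 as a separate argument, so the concatenation is vectorized over rows: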
mydf[, "Concat"] <- do.call(paste0, mydf)
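As a small check that the two approaches agree, here is a sketch on a three-row data frame. The names by_row and by_col are illustrative; unname() is used because do.call() turns column names into argument names, so a column that happened to be called collapse would otherwise be interpreted as paste0's collapse argument.

```r
mydf <- data.frame(a = LETTERS[1:3], b = LETTERS[1:3], c = LETTERS[1:3],
                   stringsAsFactors = FALSE)

# Row-wise: each row arrives as one vector, so collapse is needed
by_row <- apply(mydf, 1, paste0, collapse = "")

# Column-wise: every column becomes its own (unnamed) argument to paste0
by_col <- do.call(paste0, unname(as.list(mydf)))

# Add the new column only after both versions have been computed
mydf$Concat <- by_col
```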

Extract a fixed-length character in R

I have an attribute consisting of DNA sequences and would like to translate each sequence into its amino acid names.
So I need to split each sequence into fixed-length chunks of 3 characters.
Here is the sample of the data
data=c("AATAGACGT","TGACCC","AAATCACTCTTT")
How can I extract it into:
[1] "AAT" "AGA" "CGT"
[2] "TGA" "CCC"
[3] "AAA" "TCA" "CTC" "TTT"
So far I can only find how to split a string using a regex as the separator.
Try
strsplit(data, '(?<=.{3})', perl=TRUE)
Or
library(stringi)
stri_extract_all_regex(data, '.{1,3}')
Another solution, still one liner, but less elegant than the other ones (using lapply):
lapply(data, function(u) substring(u, seq(1, nchar(u), 3), seq(3, nchar(u),3)))
#[[1]]
#[1] "AAT" "AGA" "CGT"
#[[2]]
#[1] "TGA" "CCC"
#[[3]]
#[1] "AAA" "TCA" "CTC" "TTT"
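Another option inserts a space after every three characters. Note that this returns one space-separated string per sequence rather than a vector of chunks:
as.list(gsub("(.{3})", "\\1 ", data))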
[[1]]
[1] "AAT AGA CGT "
[[2]]
[1] "TGA CCC "
[[3]]
[1] "AAA TCA CTC TTT "
or
regmatches(data, gregexpr(".{3}", data))
[[1]]
[1] "AAT" "AGA" "CGT"
[[2]]
[1] "TGA" "CCC"
[[3]]
[1] "AAA" "TCA" "CTC" "TTT"
Another:
library(gsubfn)
strapply(data, "...")
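If some sequences might have a length that is not a multiple of 3, a variant of the substring() approach that keeps the trailing partial chunk could look like this. It is only a sketch: split_fixed is a hypothetical helper, and it assumes non-empty strings, since seq() would fail on an empty one.

```r
data <- c("AATAGACGT", "TGACCC", "AAATCACTCTTT")

# Hypothetical helper: cut one non-empty string into chunks of `width` characters
split_fixed <- function(s, width = 3) {
  starts <- seq(1, nchar(s), by = width)
  # substring() stops at the end of the string, so a short last chunk is kept
  substring(s, starts, starts + width - 1)
}

codons <- lapply(data, split_fixed)
partial <- split_fixed("AATAG")  # last chunk has only two characters
```

For translation into amino acids an incomplete trailing chunk would probably need to be dropped or reported, which is left to the caller here.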
