R strsplit, nested lists blues

I am facing this issue in R in which I want to split the strings on comma and then further split on semicolon, but only keep the first item before the semicolon i.e. ee and jj below. I have tried a bunch of things but nested lists seem too convoluted!
Here's what I am doing:
d <- c("aa,bb,cc,dd,ee;e,ff",
"gg,hh,ii,jj;j")
e=strsplit(d,",")
myfun2 <- function(x,arg1) {
strsplit(x,";")
}
f=lapply(e,myfun2)
f=
[[1]]
[[1]][[1]]
[1] "aa"
[[1]][[2]]
[1] "bb"
[[1]][[3]]
[1] "cc"
[[1]][[4]]
[1] "dd"
[[1]][[5]]
[1] "ee" "e"
[[1]][[6]]
[1] "ff"
[[2]]
[[2]][[1]]
[1] "gg"
[[2]][[2]]
[1] "hh"
[[2]][[3]]
[1] "ii"
[[2]][[4]]
[1] "jj" "j"
Here's the output that I want
Correct output=
[[1]]
[1] "aa" "bb" "cc" "dd" "ee" "ff"
[[2]]
[1] "gg" "hh" "ii" "jj"
I have tried a bunch of things using lapply to the nested list "f" and used "[[" and "[" but with no success.
Any help is greatly appreciated. (I know that I am missing something silly, but just can't figure it out right now!)

This is your code
d <- c("aa,bb,cc,dd,ee;e,ff", "gg,hh,ii,jj;j")
e <- strsplit(d,",")
myfun2 <- function(x,arg1) { strsplit(x,";") }
f <- lapply(e,myfun2)
If we start from your f, then the next step would be
lapply(f,function(x) mapply(`[`,x,1))
[[1]]
[1] "aa" "bb" "cc" "dd" "ee" "ff"
[[2]]
[1] "gg" "hh" "ii" "jj"
Basically, you need an inner and outer type apply function to go down the two levels of nesting.
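A minimal sketch of the same two-level idea, using vapply for the inner step so the result type is checked. The names first_pieces and fields are made up for illustration; everything else is base R.

```r
d <- c("aa,bb,cc,dd,ee;e,ff", "gg,hh,ii,jj;j")

# Outer split on commas, inner split on semicolons (same shape as f in the question)
f <- lapply(strsplit(d, ","), strsplit, ";")

# Outer lapply walks the two strings; inner vapply keeps the first piece of each field
first_pieces <- lapply(f, function(fields) vapply(fields, `[`, character(1), 1))
first_pieces
```

vapply fails loudly if any field does not yield exactly one string, which can be handy when the input is messier than the example.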

We can use gsub to match the pattern ; followed by one or more lowercase letters, replace the match with '', and then split (strsplit) on ,.
strsplit(gsub(';[a-z]+', '', d), ',')
#[[1]]
#[1] "aa" "bb" "cc" "dd" "ee" "ff"
#[[2]]
#[1] "gg" "hh" "ii" "jj"
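If the text after the semicolon might contain digits or uppercase letters, a slightly more general pattern could be used. This is only a sketch, and it assumes that everything between a semicolon and the next comma should be dropped; cleaned and result are illustrative names.

```r
d <- c("aa,bb,cc,dd,ee;e,ff", "gg,hh,ii,jj;j")

# ";[^,]*" removes a semicolon and everything up to the next comma (or the end of the string)
cleaned <- gsub(";[^,]*", "", d)
result <- strsplit(cleaned, ",")
result
```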

Related

How to expand loop n times in R programming?

I have a vector of characters 'A', 'B', 'C', 'D' and would like to loop n times to get all possible combinations (4^n) of the characters. How do I write a function that will perform this given input n?
For example, if n=2, my loop will look something like this:
string <- c('A','B','C','D')
combination = c()
count = 1
for (j in string) {
for (k in string) {
combination[count] <- paste0(j,k)
count = count + 1
}
}
which will yield:
> combination
[1] "AA" "AB" "AC" "AD" "BA" "BB" "BC" "BD" "CA" "CB" "CC" "CD" "DA" "DB" "DC" "DD"
and if n=3, the code will be like this:
combination = c()
count = 1
for (j in string) {
for (k in string) {
for (l in string) {
combination[count] <- paste0(j,k,l)
count = count + 1
}
}
}
which yields
> combination
[1] "AAA" "AAB" "AAC" "AAD" "ABA" "ABB" "ABC" "ABD" "ACA" "ACB" "ACC" "ACD" "ADA" "ADB" "ADC" "ADD" "BAA" "BAB" "BAC"
[20] "BAD" "BBA" "BBB" "BBC" "BBD" "BCA" "BCB" "BCC" "BCD" "BDA" "BDB" "BDC" "BDD" "CAA" "CAB" "CAC" "CAD" "CBA" "CBB"
[39] "CBC" "CBD" "CCA" "CCB" "CCC" "CCD" "CDA" "CDB" "CDC" "CDD" "DAA" "DAB" "DAC" "DAD" "DBA" "DBB" "DBC" "DBD" "DCA"
[58] "DCB" "DCC" "DCD" "DDA" "DDB" "DDC" "DDD"
In a word, recursion:
#' n: length of each combination string
#' basis: the starting vector of characters to combine
#' extras: the vector of characters to be combined with basis. Defaults to basis
#' i: The current depth of recursion. Users should generally not need to access
#' this parameter.
combine <- function(n, basis=c('A','B','C','D'), extras=basis, i=1) {
x <- expand.grid(basis, extras)
y <- paste0(x$Var1, x$Var2)
if (i == n-1) {
return(y)
} else {
return(combine(n, y, extras, i+1))
}
}
Giving, for example,
> combine(2)
[1] "AA" "BA" "CA" "DA" "AB" "BB" "CB" "DB" "AC" "BC" "CC" "DC" "AD" "BD" "CD" "DD"
and
> combine(3)
[1] "AAA" "BAA" "CAA" "DAA" "ABA" "BBA" "CBA" "DBA" "ACA" "BCA" "CCA" "DCA" "ADA" "BDA" "CDA" "DDA" "AAB" "BAB" "CAB" "DAB" "ABB" "BBB" "CBB" "DBB" "ACB" "BCB" "CCB"
[28] "DCB" "ADB" "BDB" "CDB" "DDB" "AAC" "BAC" "CAC" "DAC" "ABC" "BBC" "CBC" "DBC" "ACC" "BCC" "CCC" "DCC" "ADC" "BDC" "CDC" "DDC" "AAD" "BAD" "CAD" "DAD" "ABD" "BBD"
[55] "CBD" "DBD" "ACD" "BCD" "CCD" "DCD" "ADD" "BDD" "CDD" "DDD"
etc.
Feel free to sort the output if another order is more desirable.
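For comparison, here is a non-recursive sketch that builds all n columns at once with expand.grid. The function name all_strings is made up for this example. Reversing the columns reproduces the order of the nested loops in the question, and this version also handles n = 1, which the recursive function as written does not appear to do (i never equals n-1).

```r
all_strings <- function(n, chars = c("A", "B", "C", "D")) {
  # One column per position; expand.grid varies the first column fastest
  grid <- expand.grid(rep(list(chars), n), stringsAsFactors = FALSE)
  # Reverse the columns so the last position varies fastest, like the nested loops
  do.call(paste0, unname(as.list(grid)[n:1]))
}

all_strings(2)
```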

R - how to sort named list of character vectors

I'm new to R and looking for the following:
My input:
v = list(bob=c("aa", "cc"), cas=c("tt", "ff"), john=c("aa", "bb"))
v
$bob
[1] "aa" "cc"
$cas
[1] "tt" "ff"
$john
[1] "aa" "bb"
I want to sort based on the character vectors inside it, the desired output I'm looking for :
sorted_v
$john
[1] "aa" "bb"
$bob
[1] "aa" "cc"
$cas
[1] "tt" "ff"
How to obtain sorted_v?
We can paste all the elements of the list together, sort them and extract the names of them.
sorted_v <- v[names(sort(sapply(v, paste0, collapse = "")))]
sorted_v
#$john
#[1] "aa" "bb"
#$bob
#[1] "aa" "cc"
#$cas
#[1] "tt" "ff"
OR
as @ycw mentioned in the comments, we can also use toString instead of the paste0 and collapse combination:
sorted_v <- v[names(sort(sapply(v, toString)))]
Also, using @A5C1D2H2I1M1N2O1R2T1's and @ycw's inputs, we can reduce it to
v[order(sapply(v, toString))]
#$john
#[1] "aa" "bb"
#$bob
#[1] "aa" "cc"
#$cas
#[1] "tt" "ff"
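To see what is actually being compared, here is a short sketch that builds the sort keys explicitly. The variable keys is illustrative, and method = "radix" is an optional choice made here so that strings are compared in the C locale regardless of the session's locale.

```r
v <- list(bob = c("aa", "cc"), cas = c("tt", "ff"), john = c("aa", "bb"))

# One sort key per list element, e.g. bob becomes "aa, cc"
keys <- vapply(v, toString, character(1))

# Reorder the list by its keys; radix ordering uses the C locale for strings
sorted_v <- v[order(keys, method = "radix")]
names(sorted_v)
```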

R: how to apply to a list a function that loops over the previous subelements

I guess this is better understood with an example, I feel this is really easy but I cannot get around it...
I have a list that looks like this:
[[1]] [1] "A" "B" "C" "D" "E" "F"
[[2]] [1] "A" "B" "C"
[[3]] [1] "A" "B" "C" "D"
[[4]] [1] "A" "B" "C" "D"
[[5]] [1] "A" "B" "C" "D" "E"
And I want to obtain this:
[[1]] [1] "A" "A;B" "A;B;C" "A;B;C;D" "A;B;C;D;E" "A;B;C;D;E;F"
[[2]] [1] "A" "A;B" "A;B;C"
[[3]] [1] "A" "A;B" "A;B;C" "A;B;C;D"
[[4]] [1] "A" "A;B" "A;B;C" "A;B;C;D"
[[5]] [1] "A" "A;B" "A;B;C" "A;B;C;D" "A;B;C;D;E"
So I need a function to apply in this way:
list2 <- lapply(list1,
function(x) {
#something here
})
We can loop through the list with lapply; for each vector, sapply over the indices 1 to length(x), take the first i elements, and paste them together. The code below uses collapse=",", so swap in collapse=";" to get the separator shown in the question.
lapply(list1, function(x) sapply(seq(length(x)),
function(i) paste(x[seq_len(i)], collapse=",")))
#[[1]]
#[1] "A" "A,B" "A,B,C" "A,B,C,D" "A,B,C,D,E" "A,B,C,D,E,F"
#[[2]]
#[1] "A" "A,B" "A,B,C"
#[[3]]
#[1] "A" "A,B" "A,B,C" "A,B,C,D"
#[[4]]
#[1] "A" "A,B" "A,B,C" "A,B,C,D"
#[[5]]
#[1] "A" "A,B" "A,B,C" "A,B,C,D" "A,B,C,D,E"
Or another option is Reduce with accumulate = TRUE
lapply(list1, function(x) Reduce(function(...) paste(..., sep=","), x, accumulate = TRUE))
This can be written without an anonymous function call if the sep is not important
lapply(list1, Reduce, f = paste, accumulate = TRUE)
data
list1 <- lapply(c(6, 3, 4, 4, 5), function(i) LETTERS[1:i])
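Since the question asks for ";" as the separator, the Reduce approach can be sketched with that separator. The argument names acc and nxt are arbitrary, and list2 is the name used in the question.

```r
list1 <- lapply(c(6, 3, 4, 4, 5), function(i) LETTERS[1:i])

list2 <- lapply(list1, function(x) {
  # accumulate = TRUE keeps every intermediate result, not just the final one
  Reduce(function(acc, nxt) paste(acc, nxt, sep = ";"), x, accumulate = TRUE)
})
list2[[2]]
```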

Sort character list in ascending order based on order of numerical list in r

I want to arrange a list with characters based on order/arrange results on another list.
For example, given a list char, and list of values (numbers) mini, I can get sorted char list:
sorted<-mapply(function(x, y) y[x], lapply(mini, order), char)
I want to use arrange/order so that the char list is sorted by ascending values in the mini list
(I want to have ascendant alphabetical char when values in mini are same).
Suggestions?
EDIT: dummy example
char <- list(A=c("dd", "aa", "cc", "ff"), B=c("rr", "ee", "tt", "aa"))
mini <- list(A=c(4,2,4,4), B=c(5,5,7,1))
char
$A
"dd" "aa" "cc" "ff" ...
$B
"rr" "ee" "tt" "aa" ...
mini
$A
4 2 4 4 ...
$B
5 5 7 1 ...
expected result:
sorted
$A
"aa" "cc" "dd" "ff"
$B
"aa" "ee" "rr" "tt"
Try this:
Map(function(x, y) y[order(x, y)], mini, char)
lapply( names(char), function(nm) char[[nm]][order(mini[[nm]], char[[nm]])])
#------
[[1]]
[1] "aa" "cc" "dd" "ff"
[[2]]
[1] "aa" "ee" "rr" "tt"
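Both lines rely on order() accepting a second key to break ties: values in mini are compared first, and char decides among equal values. A small sketch with the example data, showing that the Map variant keeps the list names (the printed output above comes from the lapply variant, which drops them):

```r
char <- list(A = c("dd", "aa", "cc", "ff"), B = c("rr", "ee", "tt", "aa"))
mini <- list(A = c(4, 2, 4, 4), B = c(5, 5, 7, 1))

# order(m, ch): sort by the numeric values, then alphabetically within ties
sorted <- Map(function(m, ch) ch[order(m, ch)], mini, char)
sorted
```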

Extract elements between a character and space

I'm having a hard time extracting elements between a / and a blank space. I can do this when I have two characters like < and > for instance, but the space is throwing me. I'd like the most efficient way to do this in base R, as this will be lapplied to thousands of vectors.
I'd like to turn this:
x <- "This/DT is/VBZ a/DT short/JJ sentence/NN consisting/VBG of/IN some/DT nouns,/JJ verbs,/NNS and/CC adjectives./VBG"
This:
[1] "DT" "VBZ" "DT" "JJ" "NN" "VBG" "IN" "DT" "JJ" "NNS" "CC" "VBG"
EDIT:
Thank you all for the answers. I'm going for speed, so Andres's code wins out. Dwin's code wins for the shortest amount of code. Dirk, yours was the second fastest. The stringr solution was the slowest (I figured it would be) and wasn't in base, but is pretty understandable (which really is the intent of the stringr package, I think, as this seems to be Hadley's philosophy with most things).
I appreciate your assistance. Thanks again.
I thought I'd include the benchmarking since this will be lapplied over several thousand vectors:
  test   replications elapsed relative user.self sys.self
1 ANDRES        10000    1.06 1.000000      1.05        0
3 DIRK          10000    1.29 1.216981      1.20        0
2 DWIN          10000    1.56 1.471698      1.43        0
4 FLODEL        10000    8.46 7.981132      7.70        0
Similar but a bit more succinct:
#1- Separate the elements by the blank space
y=unlist(strsplit(x,' '))
#2- extract just what you want from each element:
sub('^.*/([^ ]+).*$','\\1',y)
Here ^ and $ anchor the beginning and end of the string, and .* matches any run of characters.
[^ ]+ matches one or more non-blank characters.
\\1 refers to the first captured group, i.e. whatever the parentheses matched.
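Because each token holds exactly one slash in this example, the capture group is not strictly needed. A simpler sketch, under that assumption, just deletes everything up to and including the slash; tokens and tags are illustrative names.

```r
x <- "This/DT is/VBZ a/DT short/JJ sentence/NN consisting/VBG of/IN some/DT nouns,/JJ verbs,/NNS and/CC adjectives./VBG"

# fixed = TRUE treats the split string literally, skipping the regex engine
tokens <- strsplit(x, " ", fixed = TRUE)[[1]]

# Remove everything from the start of each token through the slash
tags <- sub("^.*/", "", tokens)
tags
```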
Use regex pattern that is fwd-slash or space:
strsplit(x, "/|\\s" )
[[1]]
[1] "This" "DT" "is" "VBZ" "a" "DT" "short"
[8] "JJ" "sentence" "NN" "consisting" "VBG" "of" "IN"
[15] "some" "DT" "nouns," "JJ" "verbs," "NNS" "and"
[22] "CC" "adjectives." "VBG"
Didn't read the Q closely enough. One could use that result to extract the even numbered elements:
strsplit(x, "/|\\s")[[1]][seq(2, 24, by=2)]
[1] "DT" "VBZ" "DT" "JJ" "NN" "VBG" "IN" "DT" "JJ" "NNS" "CC" "VBG"
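The hard-coded 24 only fits this particular sentence. One way to avoid it, sketched here, is a recycled logical index that keeps every second element; parts and tags are illustrative names.

```r
x <- "This/DT is/VBZ a/DT short/JJ sentence/NN consisting/VBG of/IN some/DT nouns,/JJ verbs,/NNS and/CC adjectives./VBG"

parts <- strsplit(x, "/|\\s")[[1]]

# c(FALSE, TRUE) is recycled along parts, so only the even positions are kept
tags <- parts[c(FALSE, TRUE)]
tags
```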
Here is a one-liner:
R> x <- paste("This/DT is/VBZ a/DT short/JJ sentence/NN consisting/VBG",
+            "of/IN some/DT nouns,/JJ verbs,/NNS and/CC adjectives./VBG")
R> matrix(do.call(c, strsplit(gsub("[a-zA-Z.,]*/", " ", x), " ")),
+ ncol=2, byrow=TRUE)[,2]
[1] "DT" "VBZ" "DT" "JJ" "NN" "VBG" "IN" "DT" "JJ" "NNS" "CC" "VBG"
R>
The key is to get rid of 'text before slash':
R> gsub("[a-zA-Z.,]*/", " ", x)
[1] " DT  VBZ  DT  JJ  NN  VBG  IN  DT  JJ  NNS  CC  VBG"
R>
after which it is just a matter of splitting the string
R> strsplit(gsub("[a-zA-Z.,]*/", " ", x), " ")
[[1]]
[1] "" "DT" "" "VBZ" "" "DT" "" "JJ" "" "NN"
[11] "" "VBG" "" "IN" "" "DT" "" "JJ" "" "NNS"
[21] "" "CC" "" "VBG"
and filtering the "". There may well be more compact ways for the last bit.
R>
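One compact way to drop the empty strings, as a sketch, is nzchar(), which is TRUE for non-empty strings; pieces and tags are illustrative names.

```r
x <- "This/DT is/VBZ a/DT short/JJ sentence/NN consisting/VBG of/IN some/DT nouns,/JJ verbs,/NNS and/CC adjectives./VBG"

pieces <- strsplit(gsub("[a-zA-Z.,]*/", " ", x), " ")[[1]]

# Keep only the non-empty pieces
tags <- pieces[nzchar(pieces)]
tags
```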
The stringr package has nice functions for working with strings, with very intuitive names. Here you can use str_extract_all to get all matches (including the leading slash), then str_sub to remove the slashes:
library(stringr)
str_extract_all(x, "/\\w*")
# [[1]]
# [1] "/DT" "/VBZ" "/DT" "/JJ" "/NN" "/VBG" "/IN" "/DT" "/JJ" "/NNS"
# [11] "/CC" "/VBG"
str_sub(str_extract_all(x, "/\\w*")[[1]], start = 2)
# [1] "DT" "VBZ" "DT" "JJ" "NN" "VBG" "IN" "DT" "JJ" "NNS" "CC" "VBG"
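Since the question asks for base R, roughly the same extraction can be sketched with gregexpr and regmatches. The lookbehind (?<=/) needs perl = TRUE and leaves the slash out of the match, so no trimming step is needed. This is an alternative shown for comparison, not one of the benchmarked answers.

```r
x <- "This/DT is/VBZ a/DT short/JJ sentence/NN consisting/VBG of/IN some/DT nouns,/JJ verbs,/NNS and/CC adjectives./VBG"

# Match runs of word characters that are immediately preceded by a slash
tags <- regmatches(x, gregexpr("(?<=/)\\w+", x, perl = TRUE))[[1]]
tags
```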
