R strsplit with multiple unordered split arguments? - r

Given a character string
test_1<-"abc def,ghi klm"
test_2<-"abc, def ghi klm"
I wish to obtain
"abc"
"def"
"ghi"
However, using strsplit, one must know the order of the splitting values in the string, as strsplit uses the first value to do the first split, the second to do the second... and then recycles.
But this does not:
strsplit(test_1, c(",", " "))
strsplit(test_2, c(" ", ","))
strsplit(test_2, split=c("[:punct:]","[:space:]"))[[1]]
I am looking to split the string wherever I find any of my splitting values in a single step.

Actually strsplit uses grep patterns as well. (A comma is a regex metacharacter whereas a space is not; hence the need for double escaping the commas in the pattern argument. So the use of "\\s" would be more to improve readability than of necessity):
> strsplit(test_1, "\\, |\\,| ") # three possibilities OR'ed
[[1]]
[1] "abc" "def" "ghi" "klm"
> strsplit(test_2, "\\, |\\,| ")
[[1]]
[1] "abc" "def" "ghi" "klm"
Without using both \\, and \\, (note extra space that SO does not show) you would have gotten some character(0) values. Might have been clearer if I had written:
> strsplit(test_2, "\\,\\s|\\,|\\s")
[[1]]
[1] "abc" "def" "ghi" "klm"
#Fojtasek is so right: Using character classes often simplifies the task because it creates an implicit logical OR:
> strsplit(test_2, "[, ]+")
[[1]]
[1] "abc" "def" "ghi" "klm"
> strsplit(test_1, "[, ]+")
[[1]]
[1] "abc" "def" "ghi" "klm"

In case you don't like regular expressions, you can call strsplit() multiple times:
strsplits <- function(x, splits, ...)
{
for (split in splits)
{
x <- unlist(strsplit(x, split, ...))
}
return(x[!x == ""]) # Remove empty values
}
strsplits(test_1, c(" ", ","))
# "abc" "def" "ghi" "klm"
strsplits(test_2, c(" ", ","))
# "abc" "def" "ghi" "klm"
Updated for the added example
strsplits(test_1, c("[[:punct:]]","[[:space:]]"))
# "abc" "def" "ghi" "klm"
strsplits(test_2, c("[[:punct:]]","[[:space:]]"))
# "abc" "def" "ghi" "klm"
But if you are going to use regular expressions, you might as well go with #DWin's approach:
strsplit(test_1, "[[:punct:][:space:]]+")[[1]]
# "abc" "def" "ghi" "klm"
strsplit(test_2, "[[:punct:][:space:]]+")[[1]]
# "abc" "def" "ghi" "klm"

You could go with strsplit(test_1, "\\W").

test_1<-"abc def,ghi klm"
test_2<-"abc, def ghi klm"
key_words <- c("abc","def","ghi")
matches <- str_c(key_words, collapse ="|")
str_extract_all(test_1, matches)
str_extract_all(test_2, matches)

Related

How to split string in R with regular expression when parts of the regular expression are to be kept in the subsequent splitted strings?

I have a vector of character strings like this x = c("ABC", "ABC, EF", "ABC, DEF, 2 stems", "DE, other comments, and stuff").
I'd like to split each of these into two components: 1) the set of capital letters (2 or 3 letters, separated by commas), and 2) everything after the last "[A-Z][A-Z], ".
The results should be
[[1]]
[1] "ABC"
[[2]]
[1] "ABC, EF"
[[3]]
[1] "ABC, DEF" "2 stems"
[[4]]
[1] "DE" "other comments, and stuff"
I tried strsplit(x, "[A-Z][A-Z], [a-z0-9]") and strsplit(x, "(?:[A-Z][A-Z], )[a-z0-9]"), both of which returned
[[1]]
[1] "ABC"
[[2]]
[1] "ABC, EF"
[[3]]
[1] "ABC, D" " stems"
[[4]]
[1] "" "ther comments, and stuff"
The identification of where to split depends on a combination of the end of the first substring and the beginning of the second substring, and so those parts get excluded from the final result.
Any help appreciated in splitting as indicated above while including the relevant parts of the split regex in each substring!
One option would be str_split
library(stringr)
str_split(x, ", (?=[a-z0-9])", n = 2)
#[[1]]
#[1] "ABC"
#[[2]]
#[1] "ABC, EF"
#[[3]]
#[1] "ABC, DEF" "2 stems"
#[[4]]
#[1] "DE" "other comments, and stuff"

append a list element-wise to elements of a nested list in R

I'm new to R and still trying to get my head around the apply family instead of using loops.
I have two lists, one nested, the other not, both composed of characters:
>lst1 <- list(c("ABC", "DEF", "GHI"), c("JKL", "MNO", "PQR"))
>lst2 <- c("abc", "def")
I would like to create a third list such that each element of lst2 is appended as the last element of the respective sublist in lst1. The desired output looks like this:
>lst3
[[1]]
[1] "ABC" "DEF" "GHI" "abc"
[[2]]
[1] "JKL" "MNO" "PQR" "def"
My experience thus far in R tells me there likely is a way of doing this without writing a loop explicitly.
You can use Map which does exactly what mapply(..., simplify = F) do:
Map(c, lst1, lst2)
[[1]]
[1] "ABC" "DEF" "GHI" "abc"
[[2]]
[1] "JKL" "MNO" "PQR" "def"
You can definitely use lapply if you apply your function over the length of your lst1 vector. This works:
lapply(1:length(lst1),function(i) append(lst1[[i]],lst2[[i]]))
[[1]]
[1] "ABC" "DEF" "GHI" "abc"
[[2]]
[1] "JKL" "MNO" "PQR" "def"
lapply will not do what you need. You can use a loop with append to do this:
list1 <- list(c("ABC","DEF","GHI"),c("JKL","MNO","PQR"))
list2 <- c("abc","def")
listcomplete <- list(c("ABC","DEF","GHI","abc"),c("JKL","MNO","PQR","def"))
for (i in 1:length(list2)) {
list1[[i]] <- append(list1[[i]],list2[i])
}
Results:
> list1
[[1]]
[1] "ABC" "DEF" "GHI" "abc"
[[2]]
[1] "JKL" "MNO" "PQR" "def"

Return only matched pattern from grep

given the follwing example in R:
my.list<-list(a='ivw_2014_abc.pdf',b='ivw_2014_def.pdf',c='ivw_2014_ghi.pdf')
grep('(?<=ivw_2014_)[a-z]*',my.list,perl=T,value=T)
returns
a b c
"ivw_2014_abc.pdf" "ivw_2014_def.pdf" "ivw_2014_ghi.pdf"
I would like to make it return only
[1] 'abc' 'def' 'ghi'
in bash I would use the -o option. How do I achieve this in R?
Without using any capturing groups,
> my.list<-list(a='ivw_2014_abc.pdf',b='ivw_2014_def.pdf',c='ivw_2014_ghi.pdf')
> gsub("^.*_|\\..*$", "", my.list, perl=T)
[1] "abc" "def" "ghi"
For example :
sub('.*_(.*)[.].*','\\1',my.list)
[1] "abc" "def" "ghi"
Following may be of interest:
as.character(unlist(data.frame(strsplit(as.character(unlist(data.frame(strsplit(as.character(my.list),'\\.'))[1,])), '_'))[3,]))
[1] "abc" "def" "ghi"
Following is easier to read:
as.character(
unlist(data.frame(strsplit(as.character(
unlist(data.frame(strsplit(as.character(
my.list),'\\.'))[1,])), '_'))[3,]))
[1] "abc" "def" "ghi"
Another option would be:
library(stringi)
stri_extract_first_regex(unlist(my.list), "[A-Za-z]+(?=\\.)")
#[1] "abc" "def" "ghi"
Look at the regmatches function. It works with regexpr rather than grep, but returns just the matched part of the string.

Chopping string everytime a ; is there R

I have a string in which the entries are delimited by a ;.
abc;def;tyu;poi;asf;ghl
Is there a function in R which allows me to split up this string into
abc
def
tyu
all being seperate objects so that I can seperately access and index them?
Is there a way to do this without a character argument? Ny directly indexing a cell in a data frame? The cell looks like the string shown above but if I try
strsplit(k[1,8],split=';')
there is an error non character argument.
items <- scan(text="abc;def;tyu;poi;asf;ghl", sep=";", what="")
Read 6 items
as.matrix(items)
#----------
[,1]
[1,] "abc"
[2,] "def"
[3,] "tyu"
[4,] "poi"
[5,] "asf"
[6,] "ghl"
If these are items in a data.frame, it's likely that you should have used read.delim with sep=";" and probably stringsAsFactors=FALSE. You can still extract from a factor value with as.character:
df <- data.frame(a="abc;def;tyu;poi;asf;ghl")
items <- scan(text=df[1,1], sep=";", what="")
Error in textConnection(text) : invalid 'text' argument
# Use as.character instead
items <- scan(text=as.character(df[1,1]), sep=";", what="")
Read 6 items
For this you can use the strsplit function.
> strsplit('abc;def;ghi', split = ';')
[[1]]
[1] "abc" "def" "ghi"
Note that strsplit is vectorized, and it returns a list of results, also if you just feed one string. To get only that one string:
strsplit('abc;def;ghi', split = ';')[[1]]
[1] "abc" "def" "ghi"
the advantage of the vectorization is that you can feed strsplit a vector of strings:
> strsplit(rep('abc;def;ghi', 10), split = ';')
[[1]]
[1] "abc" "def" "ghi"
[[2]]
[1] "abc" "def" "ghi"
[[3]]
[1] "abc" "def" "ghi"
[[4]]
[1] "abc" "def" "ghi"
[[5]]
[1] "abc" "def" "ghi"
[[6]]
[1] "abc" "def" "ghi"
[[7]]
[1] "abc" "def" "ghi"
[[8]]
[1] "abc" "def" "ghi"
[[9]]
[1] "abc" "def" "ghi"
[[10]]
[1] "abc" "def" "ghi"

How should I split and retain elements using strsplit?

What a strsplit function in R does is, match and delete a given regular expression to split the rest of the string into vectors.
>strsplit("abc123def", "[0-9]+")
[[1]]
[1] "abc" "" "" "def"
But how should I split the string the same way using regular expression, but also retain the matches? I need something like the following.
>FUNCTION("abc123def", "[0-9]+")
[[1]]
[1] "abc" "123" "def"
Using strapply("abc123def", "[0-9]+|[a-z]+") works here, but what if the rest of the string other than the matches cannot be captured by a regular expression?
Fundamentally, it seems to me that what you want is not to split on [0-9]+ but to split on the transition between [0-9]+ and everything else. In your string, that transition is not pre-existing. To insert it, you could pre-process with gsub and back-referencing:
test <- "abc123def"
strsplit( gsub("([0-9]+)","~\\1~",test), "~" )
[[1]]
[1] "abc" "123" "def"
You could use lookaround assertions.
> test <- "abc123def"
> strsplit(test, "(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)", perl=T)
[[1]]
[1] "abc" "123" "def"
You can use strapply from gsubfn package.
test <- "abc123def"
strapply(X=test,
pattern="([^[:digit:]]*)(\\d+)(.+)",
FUN=c,
simplify=FALSE)
[[1]]
[1] "abc" "123" "def"

Resources