Return only the matched pattern from grep in R

Given the following example in R:
my.list<-list(a='ivw_2014_abc.pdf',b='ivw_2014_def.pdf',c='ivw_2014_ghi.pdf')
grep('(?<=ivw_2014_)[a-z]*',my.list,perl=T,value=T)
returns
a b c
"ivw_2014_abc.pdf" "ivw_2014_def.pdf" "ivw_2014_ghi.pdf"
I would like to make it return only
[1] 'abc' 'def' 'ghi'
In bash I would use grep's -o option. How do I achieve this in R?

Without using any capturing groups,
> my.list<-list(a='ivw_2014_abc.pdf',b='ivw_2014_def.pdf',c='ivw_2014_ghi.pdf')
> gsub("^.*_|\\..*$", "", my.list, perl=T)
[1] "abc" "def" "ghi"

For example:
sub('.*_(.*)[.].*','\\1',my.list)
[1] "abc" "def" "ghi"

The following may be of interest:
as.character(unlist(data.frame(strsplit(as.character(unlist(data.frame(strsplit(as.character(my.list),'\\.'))[1,])), '_'))[3,]))
[1] "abc" "def" "ghi"
The same expression is easier to read when split across lines:
as.character(
  unlist(data.frame(strsplit(as.character(
    unlist(data.frame(strsplit(as.character(
      my.list), '\\.'))[1,])), '_'))[3,]))
[1] "abc" "def" "ghi"

Another option would be:
library(stringi)
stri_extract_first_regex(unlist(my.list), "[A-Za-z]+(?=\\.)")
#[1] "abc" "def" "ghi"

Look at the regmatches function. It works with regexpr rather than grep, but returns just the matched part of the string.
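A minimal sketch of that approach, using the data from the question:
files <- as.character(my.list)
regmatches(files, regexpr("(?<=ivw_2014_)[a-z]+", files, perl = TRUE))
# [1] "abc" "def" "ghi"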


Extracting every nth element of a vector of lists

I have the following ids:
ids <- c('a-000', 'b-001', 'c-002')
I want to extract the numeric part of them (000, 001, 002).
I tried this:
(str_split(ids, '-', n=2))[2]
It returns the following:
[[1]]
[1] "b" "001"
I don't want the second element of the list; I want the second element of each element in the list. I know this is a basic question, but how do I fix the syntax? Do I need an anonymous (lambda) function?
The split function is also available in base R as strsplit; combine it with sapply to pick out the second piece of each element:
sapply(strsplit(ids, "-"), `[`, 2)
# [1] "000" "001" "002"
You can also try gsub and substring.
gsub("\\D+", "", ids)
# [1] "000" "001" "002"
substring(ids, 3)
# [1] "000" "001" "002"
To continue with your attempt, you can use sapply:
sapply(stringr::str_split(ids, '-', n=2), `[`, 2)
#[1] "000" "001" "002"
It is better to use str_split_fixed here, though:
stringr::str_split_fixed(ids, '-', n=2)[, 2]
#[1] "000" "001" "002"
Or in base R:
sub('.*?-(.*)-?.*', '\\1', ids)
You could try str_remove(ids, "\\D+")
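A quick sketch with stringr (str_remove drops only the first match, which is enough here because all the non-digits sit at the front):
library(stringr)
str_remove(ids, "\\D+")
# [1] "000" "001" "002"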
With base R you can remove all the characters that are not digits:
ids <- c('a-000', 'b-001', 'c-002')
gsub("[^[:digit:]]", "", ids)
#> [1] "000" "001" "002"
[:digit:] is the regex class for digits, and the ^ inside the brackets negates it (everything that is not a digit), so you replace all other characters with the empty string "".
For more information see the documentation for gsub() and regex in R.
An option with str_replace
library(stringr)
str_replace(ids, "\\D+", "")
#[1] "000" "001" "002"

Append a list element-wise to elements of a nested list in R

I'm new to R and still trying to get my head around the apply family instead of using loops.
I have two lists, one nested, the other not, both composed of characters:
>lst1 <- list(c("ABC", "DEF", "GHI"), c("JKL", "MNO", "PQR"))
>lst2 <- c("abc", "def")
I would like to create a third list such that each element of lst2 is appended as the last element of the respective sublist in lst1. The desired output looks like this:
>lst3
[[1]]
[1] "ABC" "DEF" "GHI" "abc"
[[2]]
[1] "JKL" "MNO" "PQR" "def"
My experience thus far in R tells me there likely is a way of doing this without writing a loop explicitly.
You can use Map, which does exactly what mapply(..., SIMPLIFY = FALSE) does:
Map(c, lst1, lst2)
[[1]]
[1] "ABC" "DEF" "GHI" "abc"
[[2]]
[1] "JKL" "MNO" "PQR" "def"
You can definitely use lapply if you apply your function over the indices of lst1. This works:
lapply(1:length(lst1),function(i) append(lst1[[i]],lst2[[i]]))
[[1]]
[1] "ABC" "DEF" "GHI" "abc"
[[2]]
[1] "JKL" "MNO" "PQR" "def"
Alternatively, you can use a plain loop with append to do this:
list1 <- list(c("ABC","DEF","GHI"), c("JKL","MNO","PQR"))
list2 <- c("abc","def")
# the result we are aiming for, for reference
listcomplete <- list(c("ABC","DEF","GHI","abc"), c("JKL","MNO","PQR","def"))
for (i in 1:length(list2)) {
  list1[[i]] <- append(list1[[i]], list2[i])
}
Results:
> list1
[[1]]
[1] "ABC" "DEF" "GHI" "abc"
[[2]]
[1] "JKL" "MNO" "PQR" "def"

Chopping a string every time there is a ; in R

I have a string in which the entries are delimited by a ;.
abc;def;tyu;poi;asf;ghl
Is there a function in R which allows me to split up this string into
abc
def
tyu
all being separate objects so that I can separately access and index them?
Is there a way to do this without a character argument, by directly indexing a cell in a data frame? The cell looks like the string shown above, but if I try
strsplit(k[1,8],split=';')
I get a "non-character argument" error.
items <- scan(text="abc;def;tyu;poi;asf;ghl", sep=";", what="")
Read 6 items
as.matrix(items)
#----------
[,1]
[1,] "abc"
[2,] "def"
[3,] "tyu"
[4,] "poi"
[5,] "asf"
[6,] "ghl"
If these are items in a data.frame, it's likely that you should have used read.delim with sep=";" and probably stringsAsFactors=FALSE. You can still extract from a factor value with as.character:
df <- data.frame(a="abc;def;tyu;poi;asf;ghl")
items <- scan(text=df[1,1], sep=";", what="")
Error in textConnection(text) : invalid 'text' argument
# Use as.character instead
items <- scan(text=as.character(df[1,1]), sep=";", what="")
Read 6 items
For this you can use the strsplit function.
> strsplit('abc;def;ghi', split = ';')
[[1]]
[1] "abc" "def" "ghi"
Note that strsplit is vectorized and returns a list of results, even if you feed it just one string. To get only that one result:
strsplit('abc;def;ghi', split = ';')[[1]]
[1] "abc" "def" "ghi"
The advantage of the vectorization is that you can feed strsplit a vector of strings:
> strsplit(rep('abc;def;ghi', 10), split = ';')
[[1]]
[1] "abc" "def" "ghi"
[[2]]
[1] "abc" "def" "ghi"
[[3]]
[1] "abc" "def" "ghi"
[[4]]
[1] "abc" "def" "ghi"
[[5]]
[1] "abc" "def" "ghi"
[[6]]
[1] "abc" "def" "ghi"
[[7]]
[1] "abc" "def" "ghi"
[[8]]
[1] "abc" "def" "ghi"
[[9]]
[1] "abc" "def" "ghi"
[[10]]
[1] "abc" "def" "ghi"

How should I split and retain elements using strsplit?

What the strsplit function in R does is match and delete a given regular expression, splitting the rest of the string into a vector.
>strsplit("abc123def", "[0-9]+")
[[1]]
[1] "abc" "" "" "def"
But how can I split the string the same way using a regular expression, while also retaining the matches? I need something like the following.
>FUNCTION("abc123def", "[0-9]+")
[[1]]
[1] "abc" "123" "def"
Using strapply("abc123def", "[0-9]+|[a-z]+") works here, but what if the parts of the string other than the matches cannot themselves be captured by a regular expression?
Fundamentally, it seems to me that what you want is not to split on [0-9]+ but to split on the transition between [0-9]+ and everything else. In your string, that transition is not pre-existing. To insert it, you could pre-process with gsub and back-referencing:
test <- "abc123def"
strsplit( gsub("([0-9]+)","~\\1~",test), "~" )
[[1]]
[1] "abc" "123" "def"
You could use lookaround assertions.
> test <- "abc123def"
> strsplit(test, "(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)", perl=T)
[[1]]
[1] "abc" "123" "def"
You can use strapply from the gsubfn package.
library(gsubfn)
test <- "abc123def"
strapply(X = test,
         pattern = "([^[:digit:]]*)(\\d+)(.+)",
         FUN = c,
         simplify = FALSE)
[[1]]
[1] "abc" "123" "def"

R strsplit with multiple unordered split arguments?

Given the character strings
test_1<-"abc def,ghi klm"
test_2<-"abc, def ghi klm"
I wish to obtain
"abc"
"def"
"ghi"
However, using strsplit, one must know the order of the splitting values in the string, as strsplit uses the first value to do the first split, the second to do the second... and then recycles.
So none of the following give what I want:
strsplit(test_1, c(",", " "))
strsplit(test_2, c(" ", ","))
strsplit(test_2, split=c("[:punct:]","[:space:]"))[[1]]
I am looking to split the string wherever I find any of my splitting values in a single step.
Actually, strsplit takes regex patterns as well. (A comma is not actually a regex metacharacter, so the double escaping of the commas below, and the "\\s" further down, are more for readability than out of necessity):
> strsplit(test_1, "\\, |\\,| ") # three possibilities OR'ed
[[1]]
[1] "abc" "def" "ghi" "klm"
> strsplit(test_2, "\\, |\\,| ")
[[1]]
[1] "abc" "def" "ghi" "klm"
Without including both "\\, " (comma followed by a space; the trailing space is easy to miss) and "\\," you would have gotten some empty "" values. It might have been clearer if I had written:
> strsplit(test_2, "\\,\\s|\\,|\\s")
[[1]]
[1] "abc" "def" "ghi" "klm"
@Fojtasek is right: using character classes often simplifies the task because it creates an implicit logical OR:
> strsplit(test_2, "[, ]+")
[[1]]
[1] "abc" "def" "ghi" "klm"
> strsplit(test_1, "[, ]+")
[[1]]
[1] "abc" "def" "ghi" "klm"
In case you don't like regular expressions, you can call strsplit() multiple times:
strsplits <- function(x, splits, ...) {
  for (split in splits) {
    x <- unlist(strsplit(x, split, ...))
  }
  return(x[!x == ""])  # remove empty values
}
strsplits(test_1, c(" ", ","))
# "abc" "def" "ghi" "klm"
strsplits(test_2, c(" ", ","))
# "abc" "def" "ghi" "klm"
Updated for the added example:
strsplits(test_1, c("[[:punct:]]","[[:space:]]"))
# "abc" "def" "ghi" "klm"
strsplits(test_2, c("[[:punct:]]","[[:space:]]"))
# "abc" "def" "ghi" "klm"
But if you are going to use regular expressions, you might as well go with @DWin's approach:
strsplit(test_1, "[[:punct:][:space:]]+")[[1]]
# "abc" "def" "ghi" "klm"
strsplit(test_2, "[[:punct:][:space:]]+")[[1]]
# "abc" "def" "ghi" "klm"
You could go with strsplit(test_1, "\\W").
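For example (a quick sketch; note that with test_2 the single \\W leaves an empty string where the comma and the space sit next to each other, so \\W+ may be closer to what you want):
strsplit(test_2, "\\W")[[1]]
# [1] "abc" ""    "def" "ghi" "klm"
strsplit(test_2, "\\W+")[[1]]
# [1] "abc" "def" "ghi" "klm"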
test_1<-"abc def,ghi klm"
test_2<-"abc, def ghi klm"
key_words <- c("abc","def","ghi")
matches <- str_c(key_words, collapse ="|")
str_extract_all(test_1, matches)
str_extract_all(test_2, matches)
