Avoiding a loop on a strsplit list - r

I have a vector v where each entry is one or more strings (or possibly character(0)) separated by semicolons:
ABC
DEF;ABC;QWE
TRF
character(0)
ABC;GFD
I need to find the indices of the vector which contain "ABC" (1,2,5 or a logical vector T,T,F,F,T) after splitting on ";"
I am currently using a loop as follows:
toSelect=integer(0)
for(i in c(1:length(v))){
if(length(v[i])==0) next
words=strsplit(v[i],";")[[1]]
if(!is.na(match("ABC",words))) toSelect=c(toSelect,i)
}
Unfortunately, my vector has 450k entries, so this takes far too long. I would prefer to create a logical vector by doing something like
toSelect=(!is.na(match("ABC",strsplit(v,";"))))
But since strsplit returns a list, I can't find a way to properly format strsplit(v,";") as a vector (unlist won't do since it would ruin the indices). Does anybody have any ideas on how to speed up this code?
Thanks!

Use regular expressions:
v = list("ABC", "DEF;ABC;QWE", "TRF", character(0), "ABC;GFD")
grep("(^|;)ABC($|;)", v)
#[1] 1 2 5
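If a logical vector is more convenient than indices, grepl accepts the same pattern. A minimal sketch, reusing the list from above; the extra entry "XABC;ABCD" is a made-up value added only to show that the anchors (^|;) and ($|;) reject partial tokens:

```r
# Same data as above plus one hypothetical entry with near-miss tokens
v <- list("ABC", "DEF;ABC;QWE", "TRF", character(0), "ABC;GFD", "XABC;ABCD")
pattern <- "(^|;)ABC($|;)"

# grepl() returns one TRUE/FALSE per element instead of the matching indices
has_abc <- grepl(pattern, v)
has_abc

# which() converts the logical vector back to indices
which(has_abc)
```

Note that the pattern is tied to the literal token ABC; a token containing regex metacharacters would need escaping first.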

The tricky part is dealing with character(0), which @BlueMagister works around by replacing it with character(1) (this allows use of a vector, but no longer represents the original problem). Perhaps
v <- list("ABC", "DEF;ABC;QWE", "TRF", character(0), "ABC;GFD")
v[sapply(v, length) != 0] <- strsplit(unlist(v), ";", fixed=TRUE)
to do the string split. One might proceed in base R, but I'd recommend the IRanges package
install.packages("BiocManager")
BiocManager::install("IRanges")
to install, then
library(IRanges)
w = CharacterList(v)
which gives a list-like structure where all elements must be character vectors.
> w
CharacterList of length 5
[[1]] ABC
[[2]] DEF ABC QWE
[[3]] TRF
[[4]] character(0)
[[5]] ABC GFD
One can then do fun things like ask "are element members equal to ABC"
> w == "ABC"
LogicalList of length 5
[[1]] TRUE
[[2]] FALSE TRUE FALSE
[[3]] FALSE
[[4]] logical(0)
[[5]] TRUE FALSE
or "are any element members equal to ABC"
> any(w == "ABC")
[1] TRUE TRUE FALSE FALSE TRUE
This will scale very well. For operations not supported "out of the box", the (computationally cheap) strategy is to unlist, transform the result into a vector of equal length, and then relist using the original CharacterList as a skeleton. For instance, to apply reverse to each member:
> relist(reverse(unlist(w)), w)
CharacterList of length 5
[[1]] CBA
[[2]] FED CBA EWQ
[[3]] FRT
[[4]] character(0)
[[5]] CBA DFG
As @eddi points out, this is slower than grep. The motivation is (a) to avoid having to formulate complicated regular expressions while (b) gaining flexibility for other operations one might want to perform on data structured like this.
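The unlist-then-regroup idea can also be sketched in base R without IRanges. This is an illustration only, not part of the answer above; the names w, grp, hit and any_abc are chosen here:

```r
# Already-split data, including an empty element
w <- list("ABC", c("DEF", "ABC", "QWE"), "TRF", character(0), c("ABC", "GFD"))

# Group id of every member after flattening; empty elements contribute nothing
grp <- rep(seq_along(w), lengths(w))

# One vectorized comparison over all members at once
hit <- unlist(w) == "ABC"

# An element is selected if any of its members matched
any_abc <- seq_along(w) %in% grp[hit]
any_abc
```

The comparison runs once over the flattened vector, so no R-level function is called per element.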

Using strsplit with sapply and %in%:
v <- c("ABC","DEF;ABC;QWE","TRF",character(1),"ABC;GFD")
sapply(strsplit(v,";"),function(x) "ABC" %in% x)
#[1] TRUE TRUE FALSE FALSE TRUE
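If the data really is a list containing character(0), as in the question, the same idea might be written with vapply. A sketch under that assumption; contains_token and to_select are hypothetical names:

```r
v <- list("ABC", "DEF;ABC;QWE", "TRF", character(0), "ABC;GFD")

contains_token <- function(x, token) {
  # strsplit(character(0), ";") gives an empty list, so unlist() yields NULL
  # and %in% returns FALSE for that element
  token %in% unlist(strsplit(x, ";", fixed = TRUE))
}

# vapply() guarantees exactly one logical value per element
to_select <- vapply(v, contains_token, logical(1), token = "ABC")
which(to_select)
```

vapply is used instead of sapply so that the result is a plain logical vector regardless of the input.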

Related

How to apply list of regex pattern on list

I have a list of strings and a list of patterns
like:
links <- c(
"http://www.google.com"
,"google.com"
,"www.google.com"
,"http://google.com"
,"http://google.com/"
,"www.google.com/#"
,"www.google.com/xpto"
,"http://google.com/xpto"
,"http://google.com/xpto&utml"
,"www.google.com/gclid=102938120391820391+ajdakjsdsjkajasn_JAJSDSJA")
patterns <- c(".com$","/$")
What I want is to remove all links that match these patterns
and get this result:
"www.google.com/#"
"www.google.com/xpto"
"http://google.com/xpto"
"http://google.com/xpto&utml"
"www.google.com/gclid=102938120391820391+ajdakjsdsjkajasn_JAJSDSJA"
If I use
x <- lapply(patterns, grepl, links)
I get
[[1]]
[1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[[2]]
[1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
which leads me to these two results:
> links[!x[[2]]]
[1] "http://www.google.com" "google.com"
[3] "www.google.com" "http://google.com"
[5] "www.google.com/#" "www.google.com/xpto"
[7] "http://google.com/xpto" "http://google.com/xpto&utml"
[9] "www.google.com/gclid=102938120391820391+ajdakjsdsjkajasn_JAJSDSJA"
> links[!x[[1]]]
[1] "http://google.com/" "www.google.com/#"
[3] "www.google.com/xpto" "http://google.com/xpto"
[5] "http://google.com/xpto&utml" "www.google.com/gclid=102938120391820391+ajdakjsdsjkajasn_JAJSDSJA"
In this case each result has only one pattern removed, but I want a single result with all patterns removed. How can I apply all the regexes to get one result, or somehow merge the n boolean vectors so that TRUE always wins?
like:
b <- list(c(TRUE,FALSE,FALSE,TRUE,FALSE),
c(FALSE,FALSE,TRUE,TRUE,FALSE),
c(FALSE,FALSE,FALSE,FALSE,FALSE))
res <- somefunction(b)
res
TRUE,FALSE,TRUE,TRUE,FALSE
In most cases the best solution will be to merge the regular expression patterns, and to apply a single pattern search, as shown in Thomas’ answer.
However, it is also trivial to merge logical vectors by combining them with logical operations. In your case, you want to compute the element-wise logical disjunction. Between two vectors, this can be computed as x | y. For a list of multiple vectors, it can be computed using Reduce(`|`, logical_list).
In your case, this results in:
any_matching = Reduce(`|`, lapply(patterns, grepl, links))
result = links[! any_matching]
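Applied to the small example from the question, the reduction can be sketched as follows. The list b holds the three vectors from the question; res2 is an alternative added here for comparison and is not part of the answer above:

```r
b <- list(c(TRUE, FALSE, FALSE, TRUE, FALSE),
          c(FALSE, FALSE, TRUE, TRUE, FALSE),
          c(FALSE, FALSE, FALSE, FALSE, FALSE))

# Fold the list with element-wise OR: ((b[[1]] | b[[2]]) | b[[3]])
res <- Reduce(`|`, b)
res

# Alternative: stack the vectors as rows and test each column for any TRUE
res2 <- colSums(do.call(rbind, b)) > 0
```

Both forms assume that all vectors in the list have the same length.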
This should do what you want:
links[!sapply("(\\.com|/)$", grepl, links)]
Explanation:
You can use sapply so you get a vector and not a list
I'd use the pattern "(\\.com|/)$" (i.e. ends with .com OR /).
In the end I negate the resulting boolean vector using !.
You can try the base R code below, using grep
r <- grep(paste0(patterns,collapse = "|"),links,value = TRUE,invert = TRUE)
such that
> r
[1] "www.google.com/#"
[2] "www.google.com/xpto"
[3] "http://google.com/xpto"
[4] "http://google.com/xpto&utml"
[5] "www.google.com/gclid=102938120391820391+ajdakjsdsjkajasn_JAJSDSJA"
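One detail worth checking when merging patterns: in ".com$" the dot matches any character, so a string ending in, say, "telecom" would also be removed. A small sketch with an escaped dot; the shortened links vector and the "www.telecom" entry are made up for illustration:

```r
links <- c("http://www.google.com", "google.com/", "www.google.com/xpto", "www.telecom")

# Escape the dot so that only a literal ".com" at the end matches
patterns <- c("\\.com$", "/$")

# Wrap each pattern in a group before joining, so "|" cannot split a pattern
combined <- paste0("(", patterns, ")", collapse = "|")

kept <- links[!grepl(combined, links)]
kept

# For comparison: the unescaped pattern also matches "www.telecom"
grepl(".com$", "www.telecom")
```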
You can do this using the stringr::str_subset() function.
library(stringr)
str_subset(links, pattern = ".com$|/$", negate = TRUE)

grep exact match in vector inside a list in R

I have a list like this:
map_tmp <- list("ABC",
c("EGF", "HIJ"),
c("KML", "ABC-IOP"),
"SIN",
"KMLLL")
> grep("ABC", map_tmp)
[1] 1 3
> grep("^ABC$", map_tmp)
[1] 1 # by using regex, I get the index of "ABC" in the list
> grep("^KML$", map_tmp)
[1] 5 # I wanted 3, but I got 5. Claiming the end of a string by "$" didn't help in this case.
> grep("^HIJ$", map_tmp)
integer(0) # the regex does not return the index of a string inside the vector
How can I get the index of a string (exact match) in the list?
I'm OK with not using grep. Is there any way to get the index of a certain string (exact match) in the list? Thanks!
Using lapply:
which(lapply(map_tmp, function(x) grep("^HIJ$", x))!=0)
The lapply call returns, for each element of the list, the positions at which the pattern matches within that element (integer(0) if there is no match). Comparing the result with 0 and wrapping it in which() then gives the indices of the list elements in which your string occurs.
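Since an exact match is wanted, a regex is not strictly needed. A minimal sketch using %in%, with exact_index as a hypothetical helper name:

```r
map_tmp <- list("ABC", c("EGF", "HIJ"), c("KML", "ABC-IOP"), "SIN", "KMLLL")

exact_index <- function(lst, target) {
  # TRUE for every list element that contains target as a whole string
  found <- vapply(lst, function(x) target %in% x, logical(1))
  which(found)
}

exact_index(map_tmp, "KML")
exact_index(map_tmp, "HIJ")
exact_index(map_tmp, "ABC")
```

Because %in% compares whole strings, "ABC-IOP" and "KMLLL" are not counted as matches for "ABC" or "KML".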
Use either mapply or Map with str_detect to find the position. I have run it only for the string "KML"; you can run it the same way for the others. I hope this is helpful.
First, we pad the list elements to the same length so that they can be processed easily:
library(stringr)
### Making the list even
map_tmp_1 <- lapply(map_tmp, `length<-`, max(lengths(map_tmp)))
val <- t(mapply(str_detect,map_tmp_1,"^KML$"))
> which(val[,1] == T)
[1] 3
> which(val[,2] == T)
integer(0)
In case of "ABC" string:
val <- t(mapply(str_detect,map_tmp_1,"ABC"))
> which(val[,1] == T)
[1] 1
> which(val[,2] == T)
[1] 3
I had the same question. I cannot explain why grep works on a list with a plain character pattern but not with a regex. Anyway, the best way I found to match a character string in base R is:
map_tmp <- list("ABC",
c("EGF", "HIJ"),
c("KML", "ABC-IOP"),
"SIN",
"KMLLL")
sapply( map_tmp , match , 'ABC' )
It returns a list with a structure similar to the input, containing 'NA' or '1' depending on the result of the match test:
[[1]]
[1] 1
[[2]]
[1] NA NA
[[3]]
[1] NA NA
[[4]]
[1] NA
[[5]]
[1] NA
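A likely explanation for the grep behaviour is that grep coerces its input with as.character, which turns a multi-element vector inside a list into a single deparsed string. The sketch below shows that coercion and then collapses the per-element match results into list indices; coerced and hits are names chosen here:

```r
map_tmp <- list("ABC", c("EGF", "HIJ"), c("KML", "ABC-IOP"), "SIN", "KMLLL")

# What a regex actually sees for the third element: one string starting with c(
coerced <- as.character(map_tmp)
coerced[3]

# match() per element, then keep the elements with at least one non-NA result
hits <- lapply(map_tmp, match, "ABC")
which(vapply(hits, function(m) any(!is.na(m)), logical(1)))
```

An anchored pattern such as "^HIJ$" therefore cannot match the second element, because the coerced string begins with c( rather than with the value itself.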

Dimension of a function

Because this would simplify my code only slightly, I'm mainly asking out of curiosity. Say you have a function
f <- function(x) {
c(x[1] - x[2],
x[2] - x[3],
x[3] - x[1])
}
Is there a way of finding out the dimension of the input required, e.g.
dim(f) = 3
Here's a very questionable solution that involves computing on the language.
I've written three functions that allow matching, searching, and extracting pieces of parse trees based on a sort of "template" parse tree piece. Here they are:
ptlike <- function(lhs,rhs,wcs='.any.',wcf=NULL) if (is.symbol(rhs) && as.character(rhs)%in%wcs) !is.function(wcf) || isTRUE(wcf(lhs,as.character(rhs))) else typeof(lhs) == typeof(rhs) && length(lhs) == length(rhs) && if (is.call(lhs)) all(sapply(seq_along(lhs),function(i) ptlike(lhs[[i]],rhs[[i]],wcs,wcf))) else lhs == rhs;
ptfind <- function(ptspace,ptpat,wcs='.any.',wcf=NULL,locf=NULL,loc=integer(),ptspaceorig=ptspace) c(list(),if (ptlike(ptspace,ptpat,wcs,wcf) && (!is.function(locf) || isTRUE(locf(ptspaceorig,loc)))) list(loc),if (is.call(ptspace)) do.call(c,lapply(seq_along(ptspace),function(i) ptfind(ptspace[[i]],ptpat,wcs,wcf,locf,c(loc,i),ptspaceorig))));
ptextract <- function(ptspace,ptpat,gets='.get.',wcs='.any.',wcf=NULL,locf=NULL,getf=NULL) { getlocs <- do.call(c,lapply(gets,function(get) ptfind(ptpat,as.symbol(get),character()))); if (length(getlocs)==0L) getlocs <- list(integer()); c(list(),do.call(c,lapply(ptfind(ptspace,ptpat,unique(c(gets,wcs)),wcf,locf),function(loc) do.call(c,lapply(getlocs,function(getloc) { cloc <- c(loc,getloc); ptget <- if (length(cloc)>0) ptspace[[cloc]] else ptspace; if (!is.function(getf) || isTRUE(getf(if (missing(ptget)) substitute() else ptget,loc,getloc,as.character(ptpat[[getloc]])))) list(if (missing(ptget)) substitute() else ptget); }))))); };
ptlike() matches two parse tree pieces against each other, allowing for wildcards on the RHS to match anything. For example:
ptlike(1L,2L);
## [1] FALSE
ptlike(1L,1L);
## [1] TRUE
ptlike(quote(a),quote(b));
## [1] FALSE
ptlike(quote(a),quote(a));
## [1] TRUE
ptlike(quote(sum(a+1)),quote(sum(b+1)));
## [1] FALSE
ptlike(quote(sum(a+1)),quote(sum(.any.+1)));
## [1] TRUE
ptlike(quote(sum(a+1)),quote(.any.(a+1)));
## [1] TRUE
ptlike(quote(sum(a+1)),quote(.any.));
## [1] TRUE
ptfind() returns a recursive index vector for each match of a parse tree pattern (RHS, if you like) in a given parse tree space (LHS), combined into a list. For example:
sp <- quote({ a+b*c:d; e+f*g:h; });
ptfind(sp,quote(.any.:.any.));
## [[1]]
## [1] 2 3 3
##
## [[2]]
## [1] 3 3 3
##
sp[[c(2L,3L,3L)]];
## c:d
sp[[c(3L,3L,3L)]];
## g:h
ptextract() is similar to ptfind(), but returns the matched piece of the parse tree space or a subset thereof:
ptextract(sp,quote(c:d));
## [[1]]
## c:d
##
ptextract(sp,quote(.any.:.any.));
## [[1]]
## c:d
##
## [[2]]
## g:h
##
ptextract(sp,quote(.any.:.get.));
## [[1]]
## d
##
## [[2]]
## h
##
So what we can do is extract from (the parse tree that comprises the body of) your function all the subscripts of the x argument and get the maximum literal subscript value:
f <- function(x) c(x[1L]-x[2L],x[2L]-x[3L],x[3L]-x[1L]);
max(unlist(Filter(is.numeric,ptextract(body(f),quote(x[.get.])))));
## [1] 3
The Filter(is.numeric,...) piece isn't really necessary here, since all subscripts are literal numbers (doubles in your definition, integers in mine, although that's inconsequential), but if there were ever non-numeric-literal subscripts, then it would be necessary.
Caveats:
It's absurd.
It would not take into account variable subscripts that might rise above the maximum literal subscript value in the parse tree. (Although, strictly speaking, in the general case, it is impossible to statically analyze all possible subscripts that might occur at run-time, due to the halting problem or something.)
If the vector was ever assigned to a different variable name, say y, and then that variable was indexed with a different subscript, even a numeric literal subscript, this algorithm would not take that into account.
And there's probably some more.
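For comparison, the same idea (scan the function body for literal subscripts of x and take the maximum) can be sketched with a plain recursive walk instead of the pattern-matching helpers. This is a simplified illustration with the same caveats as above; max_literal_subscript and its arg parameter are hypothetical, and the walk assumes there are no empty subscripts such as x[, 1]:

```r
max_literal_subscript <- function(fn, arg = "x") {
  found <- numeric(0)
  walk <- function(e) {
    if (is.call(e)) {
      # A call of the form arg[<numeric literal>] contributes its subscript
      if (identical(e[[1]], as.name("[")) && length(e) == 3 &&
          identical(e[[2]], as.name(arg)) && is.numeric(e[[3]])) {
        found <<- c(found, e[[3]])
      }
      # Recurse into every component of the call
      for (i in seq_along(e)) walk(e[[i]])
    }
  }
  walk(body(fn))
  if (length(found) > 0) max(found) else NA_real_
}

f <- function(x) {
  c(x[1] - x[2],
    x[2] - x[3],
    x[3] - x[1])
}
max_literal_subscript(f)
```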

Get indices of all character elements matches in string in R

I want to get the indices of all occurrences of certain characters in some word. Assume the characters I am looking for are: l, e, a, z.
I tried the following regex in the grep function and tens of modifications of it, but I keep getting something other than what I want.
grep("/([leazoscnz]{1})/", "ylaf", value = F)
gives me
numeric(0)
where I would like:
[1] 2 3
To make grep work with the individual characters of a string, you first need to split the string into a vector of single characters. You can use strsplit for this:
strsplit("ylaf", split="")[[1]]
[1] "y" "l" "a" "f"
Next you need to simplify your regular expression, and try the grep again:
strsplit("ylaf", split="")[[1]]
grep("[leazoscnz]", strsplit("ylaf", split="")[[1]])
[1] 2 3
But it is easier to use gregexpr:
gregexpr("[leazoscnz]", "ylaf")
[[1]]
[1] 2 3
attr(,"match.length")
[1] 1 1
attr(,"useBytes")
[1] TRUE
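To get just the positions as a plain integer vector, the attributes can be dropped, or the regex can be avoided entirely. A small sketch; char_positions is a hypothetical helper name:

```r
# Drop the attributes of the gregexpr() result to keep only the positions
pos <- as.integer(gregexpr("[leaz]", "ylaf")[[1]])
pos

# Regex-free variant: split into single characters and test membership
char_positions <- function(word, chars) {
  which(strsplit(word, "")[[1]] %in% chars)
}
char_positions("ylaf", c("l", "e", "a", "z"))
```

Note that gregexpr returns -1 when nothing matches, whereas the which() variant returns integer(0).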

Vector-list comparison in R

I am currently trying to check whether a list (containing multiple vectors filled with values) is equal to a vector. Unfortunately, the following functions did not work for me: match(), any(), %in%. An example of what I am trying to achieve is given below:
Let's say:
lists=list(c(1,2,3,4),c(5,6,7,8),c(9,7))
vector=c(1,2,3,4)
answer=match(lists,vector)
When I execute this, it returns FALSE values instead of a positive result. Comparing a vector with a vector works, but comparing a vector with a list does not seem to work properly.
I would use intersect, something like this :
lapply(lists,intersect,vector)
[[1]]
[1] 1 2 3 4
[[2]]
numeric(0)
[[3]]
numeric(0)
I'm not completely sure what you want the result to be (for example do you care about vector order?) but regardless you'll need to think about lapply. For example,
##Create some data
R> lists=list(c(1,2,3,4),c(5,6,7,8),c(9,7))
R> vector=c(1,2,3,4)
then we use lapply to go through each list element and apply a function. In this case, I've used the match function (since you mentioned that in your question):
R> lapply(lists, function(i) all(match(i, vector)))
[[1]]
[1] TRUE
[[2]]
[1] NA
[[3]]
[1] NA
It's probably worth converting to a vector, so
R> unlist(lapply(lists, function(i) all(match(i, vector))))
[1] TRUE NA NA
and to change NA to FALSE, something like:
m = unlist(lapply(lists, function(i) all(match(i, vector))))
m[is.na(m)] = FALSE
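A variant that yields TRUE/FALSE directly, without the NA clean-up step, might look like the sketch below. Which of the two tests is appropriate depends on whether order and exact equality matter, which the question leaves open; subset_of and same_as are names chosen here:

```r
lists <- list(c(1, 2, 3, 4), c(5, 6, 7, 8), c(9, 7))
vector <- c(1, 2, 3, 4)

# TRUE if every value of the list element occurs somewhere in vector
subset_of <- vapply(lists, function(i) all(i %in% vector), logical(1))
subset_of

# TRUE only if the list element is exactly the same vector (values and order)
same_as <- vapply(lists, function(i) identical(i, vector), logical(1))
same_as
```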
