How to apply list of regex pattern on list - r

I have a list of strings and a list of patterns
like:
links <- c(
"http://www.google.com"
,"google.com"
,"www.google.com"
,"http://google.com"
,"http://google.com/"
,"www.google.com/#"
,"www.google.com/xpto"
,"http://google.com/xpto"
,"http://google.com/xpto&utml"
,"www.google.com/gclid=102938120391820391+ajdakjsdsjkajasn_JAJSDSJA")
patterns <- c(".com$","/$")
What I want is to remove all links that match any of these patterns
and get this result:
"www.google.com/#"
"www.google.com/xpto"
"http://google.com/xpto"
"http://google.com/xpto&utml"
"www.google.com/gclid=102938120391820391+ajdakjsdsjkajasn_JAJSDSJA"
If I use
x<-lapply (patterns, grepl, links)
I get
[[1]]
[1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[[2]]
[1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
which gives me these two lists:
> links[!x[[2]]]
[1] "http://www.google.com" "google.com"
[3] "www.google.com" "http://google.com"
[5] "www.google.com/#" "www.google.com/xpto"
[7] "http://google.com/xpto" "http://google.com/xpto&utml"
[9] "www.google.com/gclid=102938120391820391+ajdakjsdsjkajasn_JAJSDSJA"
> links[!x[[1]]]
[1] "http://google.com/" "www.google.com/#"
[3] "www.google.com/xpto" "http://google.com/xpto"
[5] "http://google.com/xpto&utml" "www.google.com/gclid=102938120391820391+ajdakjsdsjkajasn_JAJSDSJA"
In this case each result has only one pattern removed, but I want a single list with all patterns removed. How can I apply all the regexes to get one result, or somehow merge the n boolean vectors so that the result is TRUE wherever any of them is TRUE?
like:
b <- list(
c(TRUE,FALSE,FALSE,TRUE,FALSE),
c(FALSE,FALSE,TRUE,TRUE,FALSE),
c(FALSE,FALSE,FALSE,FALSE,FALSE))
res <- somefunction(b)
res
TRUE,FALSE,TRUE,TRUE,FALSE

In most cases the best solution will be to merge the regular expression patterns, and to apply a single pattern search, as shown in Thomas’ answer.
However, it is also straightforward to merge logical vectors by combining them with logical operations. In your case, you want to compute the element-wise logical disjunction. Between two vectors, this can be computed as x | y. Across a list of multiple vectors, it can be computed using Reduce(`|`, logical_list).
In your case, this results in:
any_matching = Reduce(`|`, lapply(patterns, grepl, links))
result = links[! any_matching]
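As a minimal sketch of the same idea on the toy vectors from the question (stored here as a list named b, which is an assumption about how the input would be held), Reduce folds the element-wise OR across all vectors:

```r
# Toy input from the question, held as a list of logical vectors
b <- list(
  c(TRUE, FALSE, FALSE, TRUE, FALSE),
  c(FALSE, FALSE, TRUE, TRUE, FALSE),
  c(FALSE, FALSE, FALSE, FALSE, FALSE)
)

# Element-wise OR across every vector in the list
res <- Reduce(`|`, b)

# Equivalent alternative: stack the vectors as rows and test each column
res2 <- apply(do.call(rbind, b), 2, any)

print(res)
# TRUE FALSE TRUE TRUE FALSE
```

Both forms give the same vector; Reduce avoids building the intermediate matrix.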

This should do what you want:
links[!sapply("(\\.com|/)$", grepl, links)]
Explanation:
You can use sapply so you get a vector and not a list
I'd use the pattern "(\\.com|/)$" (i.e. ends with .com OR /).
In the end I negate the resulting boolean vector using !.
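One detail behind the escaped dot in that pattern: in a regular expression an unescaped . matches any character, so ".com$" from the question also matches strings that merely end in "com". A small sketch with made-up example strings (the vector x is hypothetical, not from the question):

```r
# Hypothetical strings chosen to show the difference
x <- c("google.com", "example.org/telecom", "site.com/")

loose  <- grepl(".com$", x)        # "." matches any character
strict <- grepl("\\.com$", x)      # "\\." matches a literal dot only
both   <- grepl("(\\.com|/)$", x)  # ends with ".com" OR "/"

print(loose)   # TRUE TRUE FALSE
print(strict)  # TRUE FALSE FALSE
print(both)    # TRUE FALSE TRUE
```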

You can try the base R code below, using grep
r <- grep(paste0(patterns,collapse = "|"),links,value = TRUE,invert = TRUE)
such that
> r
[1] "www.google.com/#"
[2] "www.google.com/xpto"
[3] "http://google.com/xpto"
[4] "http://google.com/xpto&utml"
[5] "www.google.com/gclid=102938120391820391+ajdakjsdsjkajasn_JAJSDSJA"

You can do this using the stringr::str_subset() function.
library(stringr)
str_subset(links, pattern = ".com$|/$", negate = TRUE)

Related

stringr::str_starts returns TRUE when it shouldn't

I am trying to detect whether a string starts with either of the provided strings (separated by | )
name = "KKSWAP"
stringr::str_starts(name, "RTT|SWAP")
returns TRUE, but
str_starts(name, "SWAP|RTT")
returns FALSE
This behaviour seems wrong, as KKSWAP doesn't start with "RTT" or "SWAP". I would expect this to be false in both above cases.
The reason can be found in the code of the function :
function (string, pattern, negate = FALSE)
{
switch(type(pattern), empty = , bound = stop("boundary() patterns are not supported."),
fixed = stri_startswith_fixed(string, pattern, negate = negate,
opts_fixed = opts(pattern)), coll = stri_startswith_coll(string,
pattern, negate = negate, opts_collator = opts(pattern)),
regex = {
pattern2 <- paste0("^", pattern)
attributes(pattern2) <- attributes(pattern)
str_detect(string, pattern2, negate)
})
}
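You can see that it pastes '^' in front of the pattern, so in your example it looks for '^RTT|SWAP'. Because ^ binds more tightly than |, this means "starts with RTT" or "contains SWAP anywhere", and so it finds 'SWAP'.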
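The precedence issue can be reproduced with base R's grepl alone, without relying on any particular stringr version. This is only an illustrative sketch; the grouped pattern "^(RTT|SWAP)" is one way to make the anchor apply to both alternatives:

```r
name <- "KKSWAP"

a <- grepl("^RTT|SWAP", name)        # anchor applies to RTT only; SWAP may match anywhere
b <- grepl("^SWAP|RTT", name)        # anchor applies to SWAP only; RTT may match anywhere
c_grouped <- grepl("^(RTT|SWAP)", name)     # anchor applies to the whole group
d_grouped <- grepl("^(RTT|SWAP)", "SWAPKK") # a string that really starts with SWAP

print(c(a, b, c_grouped, d_grouped))
# TRUE FALSE FALSE TRUE
```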
If you want to look at more than one pattern you should use a vector:
name <- "KKSWAP"
stringr::str_starts(name, c("RTT","SWAP"))
# [1] FALSE FALSE
If you want just one answer, you can combine with any()
name <- "KKSWAP"
any(stringr::str_starts(name, c("RTT","SWAP")))
# [1] FALSE
The advantage of stringr::str_starts() is the vectorisation of the pattern argument, but if you don't need it grepl('^RTT|^SWAP', name), as suggested by TTS, is a good base R alternative.
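Alternatively, the base function startsWith() suggested by jpsmith is vectorized over its prefix argument. Note that it compares fixed strings rather than regular expressions, so "RTT|SWAP" is treated as one literal prefix and | is not an alternation: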
startsWith(name, c("RTT","SWAP"))
# [1] FALSE FALSE
startsWith(name, "RTT|SWAP")
# [1] FALSE
I'm not familiar with the stringr version, but the base R version startsWith returns your desired result. If you don't have to use stringr, this may be a solution:
startsWith(name, "RTT|SWAP")
startsWith(name, "SWAP|RTT")
startsWith(name, "KK")
# > startsWith(name, "RTT|SWAP")
# [1] FALSE
# > startsWith(name, "SWAP|RTT")
# [1] FALSE
# > startsWith(name, "KK")
# [1] TRUE
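A short sketch of why startsWith() returns FALSE here: it does literal prefix comparison, so the | is just a character. To test several prefixes, pass them as a vector and wrap the call in any(). The string "RTT|SWAP-x" below is a made-up example.

```r
name <- "KKSWAP"

# The pipe is matched literally, so only a string that really begins
# with the seven characters "RTT|SWAP" counts as a match
literal_hit  <- startsWith("RTT|SWAP-x", "RTT|SWAP")
literal_miss <- startsWith(name, "RTT|SWAP")

# Several alternative prefixes: vector of prefixes + any()
none_match <- any(startsWith(name, c("RTT", "SWAP")))
one_match  <- any(startsWith("SWAPKK", c("RTT", "SWAP")))

print(c(literal_hit, literal_miss, none_match, one_match))
# TRUE FALSE FALSE TRUE
```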
The help text describes str_starts: Detect the presence or absence of a pattern at the beginning or end of a string. This might be why it's not behaving quite as expected.
pattern is the Pattern with which the string starts or ends.
We can add ^ regex to make it search at the beginning of string and get the expected result.
name = 'KKSWAP'
str_starts(name, '^RTT|^SWAP')
I would prefer grepl in this instance because it seems less misleading.
grepl('^RTT|^SWAP', name)

Compare 2 strings in R

I have data as below:
vec <- c("ABC|ADC|1","ABC|ADG|2")
I need to check if below substring is present or not
"ADC|DFG", it should return false for this as I need to match exact pattern.
"ABC|ADC|1|5" should return True as this is a child element for the first element in vector.
I tried using grepl but it returns true if I just pass ADC as well, any help is appreciated.
grepl returns TRUE because the pipe character | is special in regex: a|b means match a or b. All you need to do is escape it.
frtest<-c("ABC|ADC","ABC|ADC|1|2","ABC|ADG","ABC|ADG|2|5")
# making the last number and its pipe optional
test <- gsub('(\\|\\d)$', '(\\1)?', frtest)
# escaping all pipes (each | becomes \|)
test <- gsub('\\|', '\\\\|', test)
# testing if any of the strings is in vec
res <- sapply(test, function(x) any(grepl(x, vec)) )
# reassigning the names so they're readable
names(res) <-frtest
#> ABC|ADC ABC|ADC|1|2 ABC|ADG ABC|ADG|2|5
#> TRUE TRUE TRUE TRUE
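To see the effect of the pipe in isolation, here is a small sketch comparing the unescaped pattern, grepl's fixed = TRUE argument (which turns off regex interpretation altogether) and a manually escaped pipe, using the vec from the question:

```r
vec <- c("ABC|ADC|1", "ABC|ADG|2")

as_regex   <- grepl("ADC|DFG", vec)               # regex: "ADC" OR "DFG"
as_literal <- grepl("ADC|DFG", vec, fixed = TRUE) # literal text "ADC|DFG"
as_escaped <- grepl("ADC\\|DFG", vec)             # regex with an escaped pipe

print(as_regex)    # TRUE FALSE
print(as_literal)  # FALSE FALSE
print(as_escaped)  # FALSE FALSE
```

fixed = TRUE is convenient when the whole search string is literal; escaping is needed when the pattern mixes literal pipes with other regex syntax, as in the answer above.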
For two vectors vec and test, this returns a vector which is TRUE if either the corresponding element of test is the start of one of the elements of vec, or one of the elements of vec is the start of the corresponding element of test.
vec <- c("ABC|ADC|1","ABC|ADG|2")
test <- c("ADC|DFG", "ABC|ADC|1|5", "ADC|1", "ABC|ADC")
colSums(sapply(test, startsWith, vec) | t(sapply(vec, startsWith, test))) > 0
# ADC|DFG ABC|ADC|1|5 ADC|1 ABC|ADC
# FALSE TRUE FALSE TRUE

Remove the numbers < 4 digits in list in a data frame in R

I have a data frame like this, and I need to remove the values with fewer than 4 digits in the item column:
department item
xyz009 c("1","676547","2","434567","3","567369","4","987654","6","54546676732")
Output
department item
xyz009 676547,434567,567369,987654,54546676732
Thank you for your help
Maybe you can try nchar+subset
> subset(v, nchar(v) >= 4)
[1] "676547" "434567" "567369"
[4] "987654" "54546676732"
DATA
v <- c("1","676547","2","434567","3","567369","4","987654","6","54546676732")
1.Create a minimal reproducible example
xyz009 <- c("1","676547","2","434567","3","567369","4","987654","6","54546676732")
2.Suggested solution using base R:
The vector xyz009 is of type character
typeof(xyz009)
[1] "character"
In order to do maths with it (i.e. use >) we have to cast it to numeric using as.numeric
num_xyz <- as.numeric(xyz009)
Now we can build a logical index that is TRUE for values with at least four digits (i.e. greater than 999):
test_result <- num_xyz > 999
The vector test_result consists of booleans
test_result
[1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
We can use these booleans as an 'index' (R keeps only values where the index is TRUE):
num_xyz[test_result]
This returns:
[1] 676547 434567 567369 987654 54546676732
Using base R you can use unlist, and lapply:
xyz009<-c("1","676547","2","434567","3","567369","4","987654","6","54546676732")
unlist(lapply(xyz009,function(x) x[nchar(x)>3]))
The result is:
[1] "676547" "434567" "567369" "987654" "54546676732"
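The answers above work on a plain vector. If item really is a list column of a data frame, as the question's layout suggests (this is an assumption; the question does not show how the data frame was built), the same nchar filter can be applied per row with lapply. The column name item_string is hypothetical.

```r
# Assumed structure: one row, with 'item' stored as a list column
df <- data.frame(department = "xyz009", stringsAsFactors = FALSE)
df$item <- list(c("1", "676547", "2", "434567", "3", "567369",
                  "4", "987654", "6", "54546676732"))

# Keep only entries with at least 4 characters, row by row
df$item <- lapply(df$item, function(x) x[nchar(x) >= 4])

# Optional: collapse each row's values into one comma-separated string
df$item_string <- vapply(df$item, paste, character(1), collapse = ",")

print(df$item_string)
# "676547,434567,567369,987654,54546676732"
```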

R match part of string against vector of strings

I have a comma separated character class
A = "123,456,789"
and I am trying to get a logical vector indicating which elements of a character array are present among the items in that string.
B <- as.array(c("456", "135", "789", "111"))
I am looking for logical result of size 4 (length of B)
[1] TRUE FALSE TRUE FALSE
Fairly new to R so any help would be appreciated. Thanks in advance.
You can use a combination of sapply and grepl, which returns a logical if matched
sapply(B, grepl, x=A)
Since your comparison vector is comma-separated, you can use this as a non-looping method.
B %in% strsplit(A, ",")[[1]]
# [1] TRUE FALSE TRUE FALSE
And one other looping method would be to use Vectorize with grepl. This uses mapply internally.
Vectorize(grepl, USE.NAMES = FALSE)(B, A)
# [1] TRUE FALSE TRUE FALSE
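One difference between these approaches is worth a quick sketch: grepl looks for the value anywhere in A, so a partial value can match inside a longer item, while the strsplit + %in% version compares whole items. The extra element "23" is added here purely for illustration.

```r
A <- "123,456,789"
B <- c("456", "135", "789", "111", "23")  # "23" is an added test value

# Substring search: "23" is found inside "123"
by_grepl <- unname(sapply(B, grepl, x = A))

# Whole-item comparison after splitting on the comma
by_in <- B %in% strsplit(A, ",", fixed = TRUE)[[1]]

print(by_grepl)  # TRUE FALSE TRUE FALSE TRUE
print(by_in)     # TRUE FALSE TRUE FALSE FALSE
```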

Avoiding a loop on a strsplit list

I have a vector v where each entry is one or more strings (or possibly character(0)) separated by semicolons:
ABC
DEF;ABC;QWE
TRF
character(0)
ABC;GFD
I need to find the indices of the vector which contain "ABC" (1,2,5 or a logical vector T,T,F,F,T) after splitting on ";"
I am currently using a loop as follows:
toSelect=integer(0)
for(i in c(1:length(v))){
if(length(v[i])==0) next
words=strsplit(v[i],";")[[1]]
if(!is.na(match("ABC",words))) toSelect=c(toSelect,i)
}
Unfortunately, my vector has 450k entries, so this takes far too long. I would prefer to create a logical vector by doing something like
toSelect=(!is.na(match("ABC",strsplit(v,";"))))
But since strsplit returns a list, I can't find a way to properly format strsplit(v,";") as a vector (unlist won't do since it would ruin the indices). Does anybody have any ideas on how to speed up this code?
Thanks!
Use regular expressions:
v = list("ABC", "DEF;ABC;QWE", "TRF", character(0), "ABC;GFD")
grep("(^|;)ABC($|;)", v)
#[1] 1 2 5
The tricky part is dealing with character(0), which @BlueMagister fudges by replacing it with character(1) (this allows use of a vector, but doesn't allow representation of the original problem). Perhaps
v <- list("ABC", "DEF;ABC;QWE", "TRF", character(0), "ABC;GFD")
v[sapply(v, length) != 0] <- strsplit(unlist(v), ";", fixed=TRUE)
to do the string split. One might proceed in base R, but I'd recommend the IRanges package
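# biocLite() has been retired; current Bioconductor releases install via BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("IRanges")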
to install, then
library(IRanges)
w = CharacterList(v)
which gives a list-like structure where all elements must be character vectors.
> w
CharacterList of length 5
[[1]] ABC
[[2]] DEF ABC QWE
[[3]] TRF
[[4]] character(0)
[[5]] ABC GFD
One can then do fun things like ask "are element members equal to ABC"
> w == "ABC"
LogicalList of length 5
[[1]] TRUE
[[2]] FALSE TRUE FALSE
[[3]] FALSE
[[4]] logical(0)
[[5]] TRUE FALSE
or "are any element members equal to ABC"
> any(w == "ABC")
[1] TRUE TRUE FALSE FALSE TRUE
This will scale very well. For operations not supported "out of the box", the strategy (computationally cheap) is to unlist then transform to an equal-length vector then relist using the original CharacterList as a skeleton, for instance to use reverse on each member:
> relist(reverse(unlist(w)), w)
CharacterList of length 5
[[1]] CBA
[[2]] FED CBA EWQ
[[3]] FRT
[[4]] character(0)
[[5]] CBA DFG
As @eddi points out, this is slower than grep. The motivation is (a) to avoid needing to formulate complicated regular expressions while (b) gaining flexibility for other operations one might like to do on data structured like this.
Using strsplit with sapply and %in%:
v <- c("ABC","DEF;ABC;QWE","TRF",character(1),"ABC;GFD")
sapply(strsplit(v,";"),function(x) "ABC" %in% x)
#[1] TRUE TRUE FALSE FALSE TRUE
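If the data must stay in a list so that character(0) is kept as-is, a sketch along the same lines is possible with vapply, guarding against the empty entries before splitting. The extra element "XABC;Y" is added only to show that partial matches are not counted.

```r
v <- list("ABC", "DEF;ABC;QWE", "TRF", character(0), "ABC;GFD", "XABC;Y")

has_abc <- vapply(v, function(x) {
  # character(0) has nothing to split, so treat it as "no match"
  length(x) > 0 && "ABC" %in% strsplit(x, ";", fixed = TRUE)[[1]]
}, logical(1))

idx <- which(has_abc)
print(idx)
# 1 2 5
```

vapply with logical(1) guarantees one TRUE/FALSE per entry, so the positions line up with the original list.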
