regex fails with dollar sign - r

In R, I'm trying to match a series of strings from a vector of file names. I only want the names that contain no letters before the extension (digits and underscores only) and that end with .tif
allfiles <- c("181129_16_00_class_mlc.tif", "181129_16_00.tif.aux.xml", "181129_17_00_01_19.tif", "181129_17_00_01_20.tif", "181129_17_00_01_23.tif", "181129_17_00_01_24.tif", "181129_17_00_01_25.tif", "181129_17_00_01_26.tif", "181129_17_00_01_27.tif", "181129_17_00_01_28.tif", "181129_17_00_01_29.tif", "181129_17_00_01_30.tif")
grepl("^[0-9_]+[.tif]", allfiles)
grepl("^[0-9_]+[.tif]$", allfiles)
That returns:
[1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Why does the dollar sign fail? The result I expected from the second grepl was:
[1] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

It's not the $ that fails but the use of square brackets: [.tif] is a character class that matches exactly one character. Instead you want
grepl("^[0-9_]+\\.tif$", allfiles)
# [1] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
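Meanwhile, ^[0-9_]+[.tif]$ means that after the run of digits and/or underscores, the string must end with exactly one character out of t, i, f, or a literal dot. This is also why the first pattern, without $, returns TRUE for "181129_16_00.tif.aux.xml": the digits are followed by a single ".", and nothing anchors the end of the string. For instance,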
grepl("^[0-9_]+[.tif]$", "1234t")
# [1] TRUE
grepl("^[0-9_]+[.tif]$", "1234tt")
# [1] FALSE
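As a minimal sketch of the difference, the snippet below runs the character-class pattern and two ways of writing a literal dot against a shortened copy of the file list. The variable names (one_char, escaped, bracketed) are only illustrative.

```r
# Shortened copy of the file names from the question
allfiles <- c("181129_16_00_class_mlc.tif",
              "181129_16_00.tif.aux.xml",
              "181129_17_00_01_19.tif")

# [.tif] is a character class: exactly ONE of ".", "t", "i", "f" before the end
one_char <- grepl("^[0-9_]+[.tif]$", allfiles)

# "\\." is a literal dot, followed by the literal letters t, i, f
escaped <- grepl("^[0-9_]+\\.tif$", allfiles)

# "[.]" is another way to write a literal dot (a class with a single member)
bracketed <- grepl("^[0-9_]+[.]tif$", allfiles)

print(one_char)   # FALSE FALSE FALSE
print(escaped)    # FALSE FALSE TRUE
print(bracketed)  # FALSE FALSE TRUE
```

Inside a character class the dot loses its "any character" meaning, which is why [.] and \\. behave the same here.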

Related

regex a before b with and without whitespaces

I am trying to find all the strings that include 'true' with no 'act' anywhere before it.
An example of possible vector:
vector = c("true","trueact","acttrue","act true","act really true")
What I have so far is this:
grepl(pattern="(?<!act)true", vector, perl=T, ignore.case = T)
[1] TRUE TRUE FALSE TRUE TRUE
What I'm hoping for is
[1] TRUE TRUE FALSE FALSE FALSE
Maybe this works: SKIP the match when 'act' appears as a preceding substring, but match 'true' otherwise
grepl("(act.*true)(*SKIP)(*FAIL)|\\btrue", vector,
perl = TRUE, ignore.case = TRUE)
[1] TRUE TRUE FALSE FALSE FALSE
Here is one way to do so:
grepl(pattern="^(.(?<!act))*?true", vector, perl=T, ignore.case = T)
[1] TRUE TRUE FALSE FALSE FALSE
^: start of the string
.: matches any character
(?<!...): negative lookbehind, checked after each character is consumed
act: matches act
*?: repeats .(?<!act) between 0 and unlimited times, as few times as possible (lazy)
true: matches true
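To make the contrast concrete, here is a small sketch that runs the original lookbehind and the anchored version side by side on the vector from the question. The names immediate and anywhere are only illustrative.

```r
vector <- c("true", "trueact", "acttrue", "act true", "act really true")

# Plain lookbehind: only inspects the 3 characters directly before "true",
# so "act true" and "act really true" still match
immediate <- grepl("(?<!act)true", vector, perl = TRUE, ignore.case = TRUE)

# Anchored version: every character consumed before "true" must not
# complete the substring "act", so "act" anywhere earlier blocks the match
anywhere <- grepl("^(.(?<!act))*?true", vector, perl = TRUE, ignore.case = TRUE)

print(immediate)  # TRUE TRUE FALSE TRUE TRUE
print(anywhere)   # TRUE TRUE FALSE FALSE FALSE
```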

(How) can multiple backreference be used in alternation patterns?

This question is a spin-off from the question "Function to count of consecutive digits in a string vector".
Assume I have strings such as x:
x <- c("555123", "57333", "21112", "12345", "22144", "44440")
and want to detect those strings where any number between 2 and 5 occurs in immediate duplication as many times as itself. That is, match if the string contains 22, 333, 4444, or 55555.
If I approach this task in small chunks using backreference, everything is fine:
str_detect(x, "(2)\\1{1}")
[1] FALSE FALSE FALSE FALSE **TRUE** FALSE
str_detect(x, "(3)\\1{2}")
[1] FALSE **TRUE** FALSE FALSE FALSE FALSE
str_detect(x, "(4)\\1{3}")
[1] FALSE FALSE FALSE FALSE FALSE **TRUE**
However, if I pursue a single solution for all matches using a vector with the allowed numbers:
digits <- 2:5
and an alternation pattern, such as this:
patt <- paste0("(", digits, ")\\1{", digits - 1, "}", collapse = "|")
patt
[1] "(2)\\1{1}|(3)\\1{2}|(4)\\1{3}|(5)\\1{4}"
and input patt into str_detect, this only detects the first alternative, namely (2)\\1{1}:
str_detect(x, patt)
[1] FALSE FALSE FALSE FALSE **TRUE** FALSE
Is it the backreference which cannot be used in alternation patterns? If so, then why does a for loop iterating through each option separately not work either?
res <- c()
for(i in 2:5){
res <- str_detect(x, paste0("(", i, ")\\1{", i - 1, "}"))
}
res
[1] FALSE FALSE FALSE FALSE FALSE FALSE
Advice on this matter is greatly appreciated!
In your pattern (2)\\1{1}|(3)\\1{2}|(4)\\1{3}|(5)\\1{4} every \\1 refers to the first capture group, (2). In the second to fourth alternatives that group has not taken part in the match, so the backreference cannot match. That is why only the first alternative ever matches.
Capture groups are numbered from left to right across the whole pattern, so each alternative can refer to its own group instead:
(2)\\1{1}|(3)\\2{2}|(4)\\3{3}|(5)\\4{4}
The (2)\\1{1} can be just (2)\\1, but this is OK as you are assembling the pattern dynamically
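A hedged sketch of building that pattern programmatically follows. It numbers the groups with seq_along(digits) rather than digits - 1, on the assumption that the digit set might not always start at 2. The variable group is introduced here for illustration only.

```r
x <- c("555123", "57333", "21112", "12345", "22144", "44440")
digits <- 2:5

# Group numbers run 1, 2, 3, ... in the order the groups appear in the pattern
group <- seq_along(digits)

# Each alternative: (d) followed by its OWN group repeated d - 1 more times
patt <- paste0("(", digits, ")\\", group, "{", digits - 1, "}", collapse = "|")
print(patt)
# "(2)\\1{1}|(3)\\2{2}|(4)\\3{3}|(5)\\4{4}"

hits <- grepl(patt, x, perl = TRUE)
print(hits)  # FALSE TRUE FALSE FALSE TRUE TRUE
```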
What about this?
> grepl(
+ paste0(sapply(2:5, function(i) sprintf("(%s)\\%s{%s}", i, i - 1, i - 1)), collapse = "|"),
+ x
+ )
[1] FALSE TRUE FALSE FALSE TRUE TRUE
or
> rowSums(sapply(2:5, function(i) grepl(sprintf("(%s)\\1{%s}", i, i - 1), x))) > 0
[1] FALSE TRUE FALSE FALSE TRUE TRUE
As mentioned in the comments, you need to update the regex:
patt = paste0(
"(", digits, ")\\", digits - 1, "{", digits - 1, "}",
collapse = "|"
)
str_detect(x, patt)
Output:
[1] FALSE TRUE FALSE FALSE TRUE TRUE
In your for loop, you are replacing res each time so when you print res at the end, you are seeing the result for when i is 5. If you use print() instead:
for(i in 2:5){
print(str_detect(x, paste0("(", i, ")\\1{", i - 1, "}")))
}
Output:
[1] FALSE FALSE FALSE FALSE TRUE FALSE
[1] FALSE TRUE FALSE FALSE FALSE FALSE
[1] FALSE FALSE FALSE FALSE FALSE TRUE
[1] FALSE FALSE FALSE FALSE FALSE FALSE
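If the goal is one combined result rather than four printed vectors, a minimal sketch is to accumulate the per-digit results with a logical OR instead of overwriting res. Base grepl is used here in place of str_detect only to keep the snippet free of package dependencies.

```r
x <- c("555123", "57333", "21112", "12345", "22144", "44440")

# Start with all FALSE, then OR in the result for each digit
res <- logical(length(x))
for (i in 2:5) {
  res <- res | grepl(paste0("(", i, ")\\1{", i - 1, "}"), x)
}
print(res)  # FALSE TRUE FALSE FALSE TRUE TRUE
```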
If you wanted to iterate explicitly, for example with map_lgl() from purrr (with stringr loaded as well):
map_lgl(x, function(str) {
any(map_lgl(
2:5,
~ str_detect(str, paste0("(", .x, ")\\1{", .x - 1, "}"))
))
})
Output:
[1] FALSE TRUE FALSE FALSE TRUE TRUE
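As a side note, because each digit in this task is a known literal, the same matches can be obtained without any backreference. The sketch below is an alternative technique, not the approach from the answers above: it repeats each digit with strrep() and joins the pieces with "|".

```r
x <- c("555123", "57333", "21112", "12345", "22144", "44440")
digits <- 2:5

# "2" twice, "3" three times, and so on: "22|333|4444|55555"
literal_patt <- paste(strrep(as.character(digits), digits), collapse = "|")
print(literal_patt)

literal_hits <- grepl(literal_patt, x)
print(literal_hits)  # FALSE TRUE FALSE FALSE TRUE TRUE
```

Like the backreference patterns, this only asks whether the run occurs somewhere in the string; it does not require the run to be exactly that long.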

Algorithm that returns false If the result has already come out before in R

I want to write an algorithm that returns FALSE if the value has already appeared earlier in the vector.
Below I attach the code I'm using, but it does not seem to be correct.
I'd appreciate any help.
x1 <- c("LMP","Dp","LMP","LMP","Dp")
x2 <- c("Dp","Dp","LMP","LMP","Dp")
for(i in 1:length(x)){
if(i==1){TRUE}else{
if(length(unique(x[1:i]))==1){FALSE}else{TRUE}}}
# The result that i want is
# for x1:
TRUE, TRUE, FALSE, FALSE, FALSE
# for x2:
TRUE, FALSE, TRUE, FALSE, FALSE
Use duplicated() instead of a for loop. duplicated() returns TRUE for each element that has already occurred earlier in the vector; negating it with ! turns TRUE into FALSE and vice versa, so first occurrences come out as TRUE.
!duplicated(x1)
[1] TRUE TRUE FALSE FALSE FALSE
!duplicated(x2)
[1] TRUE FALSE TRUE FALSE FALSE
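To illustrate what "first occurrence" means here, the following sketch reproduces the same result with match(), which returns the position of the first match of each element. This is only an illustration of the idea, not a claim about how duplicated() is implemented; the names first_seen1 and first_seen2 are hypothetical.

```r
x1 <- c("LMP", "Dp", "LMP", "LMP", "Dp")
x2 <- c("Dp", "Dp", "LMP", "LMP", "Dp")

# An element is a first occurrence when its first-match position is its own position
first_seen1 <- match(x1, x1) == seq_along(x1)
first_seen2 <- match(x2, x2) == seq_along(x2)

print(first_seen1)  # TRUE TRUE FALSE FALSE FALSE
print(first_seen2)  # TRUE FALSE TRUE FALSE FALSE
```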
It can be done in a for loop as well.
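f1 <- function(vec) {
  # values seen so far; start empty so that "" is not treated as already seen
  un1 <- character(0)
  out <- logical(length(vec))
  for (i in seq_along(vec)) {
    # mark the element TRUE only the first time its value appears
    if (!vec[i] %in% un1) {
      un1 <- c(un1, vec[i])
      out[i] <- TRUE
    }
  }
  out
}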
Testing:
> f1(x1)
[1] TRUE TRUE FALSE FALSE FALSE
> f1(x2)
[1] TRUE FALSE TRUE FALSE FALSE

Determining the infinite elements in a matrix-R

Let Loss be a 500x24 matrix. I check whether it contains any infinite element by typing:
> any(abs(Loss==Inf))
[1] TRUE
The result tells us there is at least one infinite element in the matrix.
So I look for the column(s) that contain the infinite element, using the code below:
n1<-dim(Loss)[2]
xx<-sapply(1:n1, function(i) {any(abs(Loss[i])==Inf)})
> xx
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
This time it seems that no column contains an infinite element.
Why is that? I will be very glad for any help. Thanks a lot.
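A likely explanation, offered as a sketch on a small stand-in matrix rather than the real 500x24 data: Loss[i] with a single index picks the i-th element (counting down the first column), not the i-th column. Selecting a column needs Loss[, i]. The matrix values and the names wrong and col_has_inf below are made up for illustration.

```r
# Small stand-in for Loss: 3 rows, 2 columns, Inf in the second column
Loss <- matrix(c(1, 2, 3, 4, 5, Inf), nrow = 3)

# Loss[i] is a single element (here Loss[1] = 1 and Loss[2] = 2), so no Inf is found
wrong <- sapply(seq_len(ncol(Loss)), function(i) any(is.infinite(Loss[i])))
print(wrong)        # FALSE FALSE

# Checking whole columns; is.infinite() also catches -Inf
col_has_inf <- apply(Loss, 2, function(col) any(is.infinite(col)))
print(col_has_inf)  # FALSE TRUE
print(which(col_has_inf))  # 2
```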

find where in boolean vector TRUE followed by FALSE

I have a boolean vector and need to find the positions where it changes from TRUE to FALSE.
input <- c(rep(TRUE,3), rep(FALSE,2), TRUE, FALSE)
input
[1] TRUE TRUE TRUE FALSE FALSE TRUE FALSE
The result should be c(4, 7). Does something for this already exist in base R? Thanks, J
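One possible base R sketch, not an established answer from this thread: treat the logical vector as 0/1, take successive differences with diff(), and look for a difference of -1, which marks a TRUE followed by a FALSE. The names drops and drops2 are only illustrative.

```r
input <- c(rep(TRUE, 3), rep(FALSE, 2), TRUE, FALSE)

# diff() on 0/1 values gives -1 exactly where TRUE is followed by FALSE;
# add 1 to report the position of the FALSE
drops <- which(diff(as.integer(input)) == -1) + 1
print(drops)  # 4 7

# Equivalent formulation: previous element TRUE and current element FALSE
drops2 <- which(input[-length(input)] & !input[-1]) + 1
print(drops2)  # 4 7
```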

Resources