Sapply grepl data frames exact/complete matches - r

I have the same problem as in :
How to apply grepl for data frame
But I'm getting undesired matches, as in :
Complete word matching using grepl in R
How do I apply the \< or \b solution in a sapply environment when grepl is looping through vectors?

You'd used an anonymous function to be applied to each element of the columns in the data frame.
vec1 <- c("I don't want to match this", "This is what I want to match")
vec2 <- c('Why would I match this?', "What is a good match for this?")
df <- data.frame(vec1,vec2)
sapply(df, function(x) grepl("\\<is\\>", x))
vec1 vec2
[1,] FALSE FALSE
[2,] TRUE TRUE

I found a solution myself.
It's sufficient to paste a blank space before and after each element in the vector to be matched with the sentences.
vector <- paste(" ", vector, " ")
matches <- sapply(vector, grepl, sentences, ignore.case=TRUE )

Related

find alphanumeric elements in vector

I have a vector
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
In this vector, I want to do two things:
Remove any numbers from an element that contains both numbers and letters and then
If a group of letters is followed by another group of letters, merge them into one.
So the above vector will look like this:
'1.2','asdgkd','232','4343','zyzfva','3213','1232','dasd'
I thought I will first find the alphanumeric elements and remove the numbers from them using gsub.
I tried this
gsub('[0-9]+', '', myVec[grepl("[A-Za-z]+$", myVec, perl = T)])
"asd" "gkd" ".zyz" "fva" "dasd"
i.e. it retains the . which I don't want.
This seems to return what you are after
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
clean <- function (x) {
is_char <- grepl("[[:alpha:]]", x)
has_number <- grepl("\\d", x)
mixed <- is_char & has_number
x[mixed] <- gsub("[\\d\\.]+","", x[mixed], perl=T)
grp <- cumsum(!is_char | (is_char & !c(FALSE, head(is_char, -1))))
unname(tapply(x, grp, paste, collapse=""))
}
clean(myVec)
# [1] "1.2" "asdgkd" "232" "4343" "zyzfva" "3213" "1232" "dasd"
Here we look for numbers and letters mixed together and remove the numbers. Then we defined groups for collapsing, looking for characters that come after other characters to put them in the same group. Then we finally collapse all the values in the same group.
Here's my regex-only solution:
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
# find all elemnts containing letters
lettrs = grepl("[A-Za-z]", myVec)
# remove all non-letter characters
myVec[lettrs] = gsub("[^A-Za-z]" ,"", myVec[lettrs])
# paste all elements together, remove delimiter where delimiter is surrounded by letters and split string to new vector
unlist(strsplit(gsub("(?<=[A-Za-z])\\|(?=[A-Za-z])", "", paste(myVec, collapse="|"), perl=TRUE), split="\\|"))

extract substring in R

Suppose I have list of string "S[+229]EC[+57]VDSTDNSSK[+229]PSSEPTSHVAR" and need to get a vector of string that contains only numbers with bracket like eg. [+229][+57].
Is there a convenient way in R to do this?
Using base R, then try it with
> unlist(regmatches(s,gregexpr("\\[\\+\\d+\\]",s)))
[1] "[+229]" "[+57]" "[+229]"
Or you can use
> gsub(".*?(\\[.*\\]).*","\\1",gsub("\\].*?\\[","] | [",s))
[1] "[+229] | [+57] | [+229]"
We can use str_extract_all from stringr
stringr::str_extract_all(x, "\\[\\+\\d+\\]")[[1]]
#[1] "[+229]" "[+57]" "[+229]"
Wrap it in unique if you need only unique values.
Similarly, in base R using regmatches and gregexpr
regmatches(x, gregexpr("\\[\\+\\d+\\]", x))[[1]]
data
x <- "S[+229]EC[+57]VDSTDNSSK[+229]PSSEPTSHVAR"
Seems like you want to remove the alphabetical characters, so
gsub("[[:alpha:]]", "", x)
where [:alpha:] is the class of alphabetical (lower-case and upper-case) characters, [[:alpha:]] says 'match any single alphabetical character', and gsub() says substitute, globally, any alphabetical character with the empty string "". This seems better than trying to match bracketed numbers, which requires figuring out which characters need to be escaped with a (double!) \\.
If the intention is to return the unique bracketed numbers, then the approach is to extract the matches (rather than remove the unwanted characters). Instead of using gsub() to substitute matches to a regular expression with another value, I'll use gregexpr() to identify the matches, and regmatches() to extract the matches. Since numbers always occur in [], I'll simplify the regular expression to match one or more (+) characters from the collection +[:digit:].
> xx <- regmatches(x, gregexpr("[+[:digit:]]+", x))
> xx
[[1]]
[1] "+229" "+57" "+229"
xx is a list of length equal to the length of x. I'll write a function that, for any element of this list, makes the values unique, surrounds the values with [ and ], and concatenates them
fun <- function(x)
paste0("[", unique(x), "]", collapse = "")
This needs to be applied to each element of the list, and simplified to a vector, a task for sapply().
> sapply(xx, fun)
[1] "[+229][+57]"
A minor improvement is to use vapply(), so that the result is robust (always returning a character vector with length equal to x) to zero-length inputs
> x = character()
> xx <- regmatches(x, gregexpr("[+[:digit:]]+", x))
> sapply(xx, fun) # Hey, this returns a list :(
list()
> vapply(xx, fun, "character") # vapply() deals with 0-length inputs
character(0)

How to grepl with two pattern objects in R

I have a vector called
vec <- c("16S_s95_S112_R2_101.fastq.gz",
"16S_s95_S112_R1_001.fastq.gz",
"16S_s94_S103_R2_021.fastq.gz",
"16S_s94_S103_R1_001.fastq.gz")
I want to grepl items with sample <- "_s95_" and R1 <- "R1".
I want to use sample and R1 objects while doing grepl and find something matching _s95_ and R1 strings both.
Result I want is 16S_s95_S112_R1_001.fastq.gz.
I tried grepl(pattern = sample&R1, x= vec) which did not work for me.
I can do this with multiple grepl's, but I am trying to find something neat to do this.
For your specific use case where you know the order of the patterns, it's almost certainly going to be faster to follow Jilber Urbina's suggestion to programmatically compose a single regex.
For a more general solution that works regardless of order and on any number of patterns, we can use sapply to loop across each pattern, and then use rowSums to count the number of pattern matches and find the rows where all of them match:
patterns = c("_s95_", 'R1')
sapply(patterns, function(x) grepl(x, vec))
_s95_ R1
[1,] TRUE FALSE
[2,] TRUE TRUE
[3,] FALSE FALSE
[4,] FALSE TRUE
vec[which(rowSums(sapply(patterns, function(x) grepl(x, vec))) == length(patterns))]
[1] "16S_s95_S112_R1_001.fastq.gz"
You need to work a bit more in your pattern in order to get the match, try:
> grep(paste0(".*", sample, ".*", R1), vec, value=TRUE)
[1] "16S_s95_S112_R1_001.fastq.gz"

Extract substring in R using grepl

I have a table with a string column formatted like this
abcdWorkstart.csv
abcdWorkcomplete.csv
And I would like to extract the last word in that filename. So I think the beginning pattern would be the word "Work" and ending pattern would be ".csv". I wrote something using grepl but not working.
grepl("Work{*}.csv", data$filename)
Basically I want to extract whatever between Work and .csv
desired outcome:
start
complete
I think you need sub or gsub (substitute/extract) instead of grepl (find if match exists). Note that when not found, it will return the entire string unmodified:
fn <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')
out <- sub(".*Work(.*)\\.csv$", "\\1", fn)
out
# [1] "start" "complete" "abcdNothing.csv"
You can work around this by filtering out the unchanged ones:
out[ out != fn ]
# [1] "start" "complete"
Or marking them invalid with NA (or something else):
out[ out == fn ] <- NA
out
# [1] "start" "complete" NA
With str_extract from stringr. This uses positive lookarounds to match any character one or more times (.+) between "Work" and ".csv":
x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")
library(stringr)
str_extract(x, "(?<=Work).+(?=\\.csv)")
# [1] "start" "complete"
Just as an alternative way, remove everything you don't want.
x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")
gsub("^.*Work|\\.csv$", "", x)
#[1] "start" "complete"
please note:
I have to use gsub. Because I first remove ^.*Work then \\.csv$.
For [\\s\\S] or \\d\\D ... (does not work with [g]?sub)
https://regex101.com/r/wFgkgG/1
Works with akruns approach:
regmatches(v1, regexpr("(?<=Work)[\\s\\S]+(?=[.]csv)", v1, perl = T))
str1<-
'12
.2
12'
gsub("[^.]","m",str1,perl=T)
gsub(".","m",str1,perl=T)
gsub(".","m",str1,perl=F)
. matches also \n when using the R engine.
Here is an option using regmatches/regexpr from base R. Using a regex lookaround to match all characters that are not a . after the string 'Work', extract with regmatches
regmatches(v1, regexpr("(?<=Work)[^.]+(?=[.]csv)", v1, perl = TRUE))
#[1] "start" "complete"
data
v1 <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')

Extract expressions with square brackets

The example I have is as follow:
toMatch <- c("[1]", "[2]", "[3]")
names <- c("apple[1]", "apple", "apple[3]")
I want extract the terms in names which contains one of the patterns in toMatch.
This is what I tried
grep(toMatch, names, value=T)
But, it didn't work for me. Any suggestions?
The problem is that [ character used in toMatch is an reserved character with special meaning in regex/pattern. Hence, we need to first replace [ character with \\[.
Now, collapse toMatch with | and then use it as pattern in grepl function to search matching character in names.
The solution results are:
#Just for indexes
grepl(paste0(gsub("(\\[)","\\\\[",toMatch), collapse = "|"), names)
#[1] TRUE FALSE TRUE
#For values
grep(paste0(gsub("(\\[)","\\\\[",toMatch), collapse = "|"), names, value = TRUE)
#[1] "apple[1]" "apple[3]"
Data:
toMatch <- c("[1]", "[2]", "[3]")
names <- c("apple[1]", "apple", "apple[3]")
We could also remove the letter part and create a logical vector with %in%
names[sub("^[^[]*", "", names) %in% toMatch]
#[1] "apple[1]" "apple[3]"

Resources