How to grepl with two pattern objects in R

I have a vector called
vec <- c("16S_s95_S112_R2_101.fastq.gz",
"16S_s95_S112_R1_001.fastq.gz",
"16S_s94_S103_R2_021.fastq.gz",
"16S_s94_S103_R1_001.fastq.gz")
I want to select the items that match both sample <- "_s95_" and R1 <- "R1".
I want to pass the sample and R1 objects to grepl and find the elements that contain both the _s95_ and R1 strings.
The result I want is 16S_s95_S112_R1_001.fastq.gz.
I tried grepl(pattern = sample&R1, x= vec), which did not work for me.
I can do this with multiple grepl calls, but I am looking for something neater.

For your specific use case where you know the order of the patterns, it's almost certainly going to be faster to follow Jilber Urbina's suggestion to programmatically compose a single regex.
For a more general solution that works regardless of order and on any number of patterns, we can use sapply to loop across each pattern, and then use rowSums to count the number of pattern matches and find the rows where all of them match:
patterns = c("_s95_", 'R1')
sapply(patterns, function(x) grepl(x, vec))
_s95_ R1
[1,] TRUE FALSE
[2,] TRUE TRUE
[3,] FALSE FALSE
[4,] FALSE TRUE
vec[which(rowSums(sapply(patterns, function(x) grepl(x, vec))) == length(patterns))]
[1] "16S_s95_S112_R1_001.fastq.gz"
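A more compact variant of the same idea, as a rough sketch: Reduce combines one logical vector per pattern with &, so no match matrix is needed. The name keep is only illustrative.

```r
vec <- c("16S_s95_S112_R2_101.fastq.gz",
         "16S_s95_S112_R1_001.fastq.gz",
         "16S_s94_S103_R2_021.fastq.gz",
         "16S_s94_S103_R1_001.fastq.gz")
patterns <- c("_s95_", "R1")

# lapply gives one logical vector per pattern;
# Reduce ANDs them together element-wise
keep <- Reduce(`&`, lapply(patterns, grepl, x = vec))
vec[keep]
# [1] "16S_s95_S112_R1_001.fastq.gz"
```

This works for any number of patterns and does not depend on the order in which they appear in the strings.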

You need to work a bit more on your pattern to get the match. Try:
> grep(paste0(".*", sample, ".*", R1), vec, value=TRUE)
[1] "16S_s95_S112_R1_001.fastq.gz"
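If the order of the two pieces is not known in advance, a single regex can still work by using lookaheads, which need perl = TRUE. This is only a sketch, and it assumes that sample and R1 contain no regex metacharacters; the names pattern and both are illustrative.

```r
vec <- c("16S_s95_S112_R2_101.fastq.gz",
         "16S_s95_S112_R1_001.fastq.gz",
         "16S_s94_S103_R2_021.fastq.gz",
         "16S_s94_S103_R1_001.fastq.gz")
sample <- "_s95_"
R1 <- "R1"

# each lookahead checks that its piece occurs somewhere in the string,
# so the two pieces may appear in either order
pattern <- paste0("(?=.*", sample, ")(?=.*", R1, ")")
both <- grep(pattern, vec, perl = TRUE, value = TRUE)
both
# [1] "16S_s95_S112_R1_001.fastq.gz"
```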

Related

how to create loop for multiple output vectors with grabl function in stringdist

I'm trying to apply the grabl function of stringdist to a large character vector "testref". I want to check whether the strings in another character vector "testtitle" can be found in "testref". However, grabl only allows a single string to be tested at a time.
How can I circumvent this limitation?
Example to reproduce
#in reality each of the elements contains a full bibliography of a scientific article
testref <- c("asdfd sfgdgags dgsd.dsfas.dfs.f.sfas.f My beatiful title asfsdf dsf asfd dsf dsfsdfdsfsd, fdsf sdfdf: fsd fsdfafsd (2000) dsdfsf sfda", "sdfasfdsd, sdfsddf, fsagsg: sfds sfasdf sdfsdf", "sadfsdf: sdfsdf sdfggsdg another title here sdfdfsds, asdgasg (2021) blablabal")
#the pattern vector can contain up to 500 titles of scientific articles that contain typos or formatting mistakes. Hence, I need to use approximate matching
testtitle <- c("holy cow", "random notes", "MI beautiful title", "quantitative research is hard", "an0ther title here")
What I want to get out of this is a list of logical TRUE/FALSE vectors
results_list
#[[1]]
#[1] FALSE FALSE FALSE
#[[2]]
#[1] FALSE FALSE FALSE
#[[3]]
#[1] TRUE FALSE FALSE
#[[4]]
#[1] FALSE FALSE FALSE
#[[5]]
#[1] FALSE FALSE TRUE
So far, I have tried looping over the titles as per @Rui Barradas's suggestion. Technically it works, but it takes a very long time.
results_list <- vector("list", length = 5)
for(i in 1:5) {
results_list[[i]] <- grabl(testref, testtitle[i], maxDist = 8)
}
I was wondering whether it is possible to use lapply in combination with the grabl function.
results_list <- lapply(testtitle, function(testtitle) grabl(testref, testtitle[], maxDist = 2))
But I get this error: Error in grabl(testref, testtitle[], maxDist = 2) :
could not find function "grabl"
I'm very grateful for your past suggestions and hope for more input!
Thank you!
Something like the following might do what you want. Untested, since there is no data.
# create a list to hold the results beforehand
results_list <- vector("list", length = 126)
for(i in 1:126) {
results_list[[i]] <- grabl(year2002$References, ref_year2002$Title[i], maxDist = 8)
}
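The same loop can be written with lapply. The sketch below uses base R's agrepl as a stand-in for grabl so that it runs without the stringdist package; that is a swapped-in technique, its distance measure is not the same as stringdist's, and max.distance = 2L is an arbitrary choice for illustration. With stringdist attached, the anonymous function would call grabl(testref, title, maxDist = ...) instead.

```r
testref <- c("asdfd sfgdgags dgsd.dsfas.dfs.f.sfas.f My beatiful title asfsdf dsf asfd dsf dsfsdfdsfsd, fdsf sdfdf: fsd fsdfafsd (2000) dsdfsf sfda",
             "sdfasfdsd, sdfsddf, fsagsg: sfds sfasdf sdfsdf",
             "sadfsdf: sdfsdf sdfggsdg another title here sdfdfsds, asdgasg (2021) blablabal")
testtitle <- c("holy cow", "random notes", "MI beautiful title",
               "quantitative research is hard", "an0ther title here")

# one logical vector per title, each of length(testref);
# agrepl (base R) does approximate substring matching
results_list <- lapply(testtitle, function(title) {
  agrepl(title, testref, max.distance = 2L)
})
length(results_list)
```

Note that lapply mainly tidies the code. It still makes one call per title, so it should not be expected to be much faster than the for loop.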

Find best match for multiple substrings across multiple candidates

I have the following sample data:
targets <- c("der", "das")
candidates <- c("sdassder", "sderf", "fongs")
Desired Output:
I would like to find sdassder as the output, since it contains the most matches for targets (as substrings).
What I tried:
x <- sapply(targets, function(target) sapply(candidates, grep, pattern = target)) > 0
which.max(rowSums(x))
Goal:
As you can see, I found some dirty code that technically yields the result, but I don't feel it's best practice. I hope this question fits here; otherwise I will move it to Code Review.
I tried mapply, do.call and outer, but didn't manage to find better code.
Edit:
Adding another Option myself, after seeing the current answers.
Using pipes:
sapply(targets, grepl, candidates) %>% rowSums %>% which.max %>% candidates[.]
You can simplify it a little, I think.
matches <- sapply(targets, grepl, candidates)
matches
# der das
# [1,] TRUE TRUE
# [2,] TRUE FALSE
# [3,] FALSE FALSE
And find the number of matches using rowSums:
rowSums(matches)
# [1] 2 1 0
candidates[ which.max(rowSums(matches)) ]
# [1] "sdassder"
(Note that this last part does not really inform about ties.)
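If ties matter, one possible way to keep all of them is to compare each count with the maximum instead of using which.max. In this sketch "dasder" is a made-up extra candidate, added only to create a tie.

```r
targets <- c("der", "das")
# "dasder" is hypothetical; it ties with "sdassder" at two matches
candidates <- c("sdassder", "sderf", "fongs", "dasder")

matches <- sapply(targets, grepl, candidates)
counts <- rowSums(matches)

# keep every candidate that reaches the maximum count
best <- candidates[counts == max(counts)]
best
# [1] "sdassder" "dasder"
```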
If you want to see the individual matches per-candidate, you can always apply the names manually, though this is only an aesthetic thing, adding very little to the work itself.
rownames(matches) <- candidates
matches
# der das
# sdassder TRUE TRUE
# sderf TRUE FALSE
# fongs FALSE FALSE
rowSums(matches)
# sdassder sderf fongs
# 2 1 0
which.max(rowSums(matches))
# sdassder
# 1 <------ this "1" indicates the index within the rowSums vector
names(which.max(rowSums(matches)))
# [1] "sdassder"
One stringr option could be:
library(stringr)
candidates[which.max(rowSums(outer(candidates, targets, str_detect)))]
[1] "sdassder"
We could paste the targets together and create a pattern to match.
library(stringr)
str_c(targets, collapse = "|")
#[1] "der|das"
Use it in str_count to count the number of times pattern was matched.
str_count(candidates, str_c(targets, collapse = "|"))
#[1] 2 1 0
Get the index of maximum value and subset it from original candidates
candidates[which.max(str_count(candidates, str_c(targets, collapse = "|")))]
#[1] "sdassder"
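One thing to keep in mind with the combined pattern: it counts occurrences, not distinct targets, so a candidate that repeats a single target can score as high as one that contains both. A base R sketch of the difference follows; the candidate "derder" is made up for illustration, and lengths(regmatches(...)) is used here as a rough base equivalent of str_count.

```r
targets <- c("der", "das")
# "derder" is hypothetical: it repeats one target twice
candidates <- c("derder", "sdassder", "fongs")

pattern <- paste(targets, collapse = "|")

# number of (non-overlapping) matches of "der|das" per candidate
occurrences <- lengths(regmatches(candidates, gregexpr(pattern, candidates)))

# number of distinct targets found per candidate
distinct <- unname(rowSums(sapply(targets, grepl, candidates)))

occurrences
# [1] 2 2 0
distinct
# [1] 1 2 0
```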

Sapply grepl data frames exact/complete matches

I have the same problem as in :
How to apply grepl for data frame
But I'm getting undesired matches, as in :
Complete word matching using grepl in R
How do I apply the \< or \b solution in a sapply environment when grepl is looping through vectors?
You can use an anonymous function that is applied to each column of the data frame.
vec1 <- c("I don't want to match this", "This is what I want to match")
vec2 <- c('Why would I match this?', "What is a good match for this?")
df <- data.frame(vec1,vec2)
sapply(df, function(x) grepl("\\<is\\>", x))
vec1 vec2
[1,] FALSE FALSE
[2,] TRUE TRUE
I found a solution myself.
It's sufficient to paste a blank space before and after each element in the vector to be matched with the sentences. Use paste0 here, because paste would also insert its default separator and surround each word with two spaces.
vector <- paste0(" ", vector, " ")
matches <- sapply(vector, grepl, sentences, ignore.case=TRUE )
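Padding with spaces misses words at the very start or end of a sentence, or next to punctuation. A possible alternative is to build a word-boundary pattern for each word inside the sapply call. This is a sketch only: the words vector is hypothetical and is assumed to contain no regex metacharacters.

```r
sentences <- c("I don't want to match this", "This is what I want to match")
words <- c("is", "match")  # hypothetical words to look for

# wrap each word in \\b ... \\b so that only whole words match
matches <- sapply(words, function(w) {
  grepl(paste0("\\b", w, "\\b"), sentences, ignore.case = TRUE, perl = TRUE)
})
matches
#         is match
# [1,] FALSE  TRUE
# [2,]  TRUE  TRUE
```

The "is" inside "this" is not matched, while "match" at the end of both sentences is.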

Finding number of r's in the vector (Both R and r) before the first u

rquote <- "R's internals are irrefutably intriguing"
chars <- strsplit(rquote, split = "")[[1]]
In the code above, we need to find the number of r's (both R and r) in rquote.
You could use substrings.
## find position of first 'u'
u1 <- regexpr("u", rquote, fixed = TRUE)
## get count of all 'r' or 'R' before 'u1'
lengths(gregexpr("r", substr(rquote, 1, u1), ignore.case = TRUE))
# [1] 5
This follows what you ask for in the title of the post. If you want the count of all the "r", case insensitive, then simplify the above to
lengths(gregexpr("r", rquote, ignore.case = TRUE))
# [1] 6
Then there's always stringi
library(stringi)
## count before first 'u'
stri_count_regex(stri_sub(rquote, 1, stri_locate_first_regex(rquote, "u")[,1]), "r|R")
# [1] 5
## count all R or r
stri_count_regex(rquote, "r|R")
# [1] 6
To get the number of R's before the first u, you need to make an intermediate step. (You probably don't need to. I'm sure akrun knows some incredibly cool regular expression to get the job done, but it won't be as easy to understand as this).
rquote <- "R's internals are irrefutably intriguing"
before_u <- gsub("u[[:print:]]+$", "", rquote)
length(stringr::str_extract_all(before_u, "(R|r)")[[1]])
You may try this,
> length(str_extract_all(rquote, '[Rr]')[[1]])
[1] 6
To get the count of all r's before the first u
> length(str_extract_all(rquote, perl('u.*(*SKIP)(*F)|[Rr]'))[[1]])
[1] 5
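The perl() helper comes from older stringr releases and may not be available in current ones. The same PCRE pattern can be tried in base R with perl = TRUE, roughly like this (the names m and n_before_u are illustrative):

```r
rquote <- "R's internals are irrefutably intriguing"

# from the first "u" onward everything is skipped and discarded,
# so only R/r before the first "u" are reported as matches
m <- gregexpr("u.*(*SKIP)(*F)|[Rr]", rquote, perl = TRUE)
n_before_u <- length(regmatches(rquote, m)[[1]])
n_before_u
# [1] 5
```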
EDIT: Just saw "before the first u". In that case, we can get the position of the first 'u' with either which or match.
Then, using the strsplit output from the OP's code, apply grepl with ignore.case=TRUE to 'chars' up to that position (ind) to get a logical index of the r's, and sum it.
ind <- which(chars=='u')[1]
Or
ind <- match('u', chars)
sum(grepl('r', chars[seq(ind)], ignore.case=TRUE))
#[1] 5
Or we can use two substitutions on the original string ('rquote'). The inner sub removes everything from the first u to the end of the string (u.*$), and the outer gsub matches every character except R and r ([^Rr]) and replaces it with ''. We can then use nchar to count the characters that remain.
nchar(gsub('[^Rr]', '', sub('u.*$', '', rquote)))
#[1] 5
Or if we want to count the 'r' in the entire string, gregexpr to get the position of matching characters from the original string ('rquote') and get the length
length(gregexpr('[rR]', rquote)[[1]])
#[1] 6

R: manipulating data.frames containing strings and booleans

I have a data.frame in R; it's called p. Each element in the data.frame is either TRUE or FALSE. My variable p has, say, m rows and n columns. In every row there is exactly one TRUE element.
It also has column names, which are strings. What I would like to do is the following:
1. Wherever a row of p contains a TRUE, replace it with the name of the corresponding column.
2. Collapse the data.frame, which now contains FALSEs and column names, into a single vector with m elements.
I would like to do this in an R-thonic manner, so as to continue my enlightenment in R and contribute to a world without for-loops.
I can do step 1 using the following for loop:
for (i in seq(length(colnames(p)))) {
p[p[,i]==TRUE,i]=colnames(p)[i]
}
but there's no beauty here and I have totally subscribed to this for-loops-in-R-are-probably-wrong mentality. Maybe wrong is too strong but they're certainly not great.
I don't really know how to do step 2. I kind of hoped that the sum of a string and FALSE would return the string but it doesn't. I kind of hoped I could use an OR operator of some kind but can't quite figure that out (Python responds to False or 'bob' with 'bob'). Hence, yet again, I appeal to you beautiful Rstats people for help!
Here's some sample data:
df <- data.frame(a=c(FALSE, TRUE, FALSE), b=c(TRUE, FALSE, FALSE), c=c(FALSE, FALSE, TRUE))
You can use apply to do something like this:
names(df)[apply(df, 1, which)]
Or without apply by using which directly:
idx <- which(as.matrix(df), arr.ind=T)
names(df)[idx[order(idx[,1]),"col"]]
Use apply to sweep your index through, and use that index to access the column names:
> df <- data.frame(a=c(TRUE,FALSE,FALSE),b=c(FALSE,FALSE,TRUE),
+ c=c(FALSE,TRUE,FALSE))
> df
a b c
1 TRUE FALSE FALSE
2 FALSE FALSE TRUE
3 FALSE TRUE FALSE
> colnames(df)[apply(df, 1, which)]
[1] "a" "c" "b"
>
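Another possibility, sketched here on the sample data from the question, is max.col, which returns the column index of the largest value in each row. It relies on the stated assumption that every row has exactly one TRUE; a row with no TRUE would silently get the first column.

```r
df <- data.frame(a = c(FALSE, TRUE, FALSE),
                 b = c(TRUE, FALSE, FALSE),
                 c = c(FALSE, FALSE, TRUE))

# multiply by 1 to get a numeric 0/1 matrix, then take the
# position of the row maximum (the single TRUE) in each row
idx <- max.col(as.matrix(df) * 1, ties.method = "first")
res <- names(df)[idx]
res
# [1] "b" "a" "c"
```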
