R grep: is there an AND operator?

Suppose I have the following data frame:
User.Id Tags
34234 imageUploaded,people.jpg,more,comma,separated,stuff
34234 imageUploaded
12345 people.jpg
How might I use grep (or some other tool) to only grab rows that include both "imageUploaded" and "people"? In other words, how might I create a subset that includes just the rows with the strings "imageUploaded" AND "people.jpg", regardless of order.
I have tried:
data.people<-data[grep("imageUploaded|people.jpg",results$Tags),]
data.people<-data[grep("imageUploaded?=people.jpg",results$Tags),]
Is there an AND operator? Or perhaps another way to get the intended result?

Thanks to this answer, this regex seems to work. You want to use grepl() which returns a logical to index into your data object. I won't claim to fully understand the inner workings of the regex, but regardless:
x <- c("imageUploaded,people.jpg,more,comma,separated,stuff", "imageUploaded", "people.jpg")
grepl("(?=.*imageUploaded)(?=.*people\\.jpg)", x, perl = TRUE)
#-----
[1] TRUE FALSE FALSE
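Since grepl() returns a logical vector, you can use it directly to index rows of the original data frame. A minimal sketch, assuming the data frame is called data and has a Tags column as in the question:
data.people <- data[grepl("(?=.*imageUploaded)(?=.*people\\.jpg)", data$Tags, perl = TRUE), ]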

I love @Chase's answer, and it makes good sense to me, but it can be a bit dangerous to use constructs that one doesn't totally understand.
This answer is meant to reassure anyone who'd like to use @thelatemail's more straightforward approach that it works just as well and is completely competitive speedwise. It's certainly what I'd use in this case. (It's also reassuring that the more sophisticated Perl-compatible regex pays no performance cost for its power and easy extensibility.)
library(rbenchmark)
x <- paste0(sample(letters, 1e6, replace=T),  ## A longer vector of
            sample(letters, 1e6, replace=T))  ## possible matches
## Both methods give identical results
tlm <- grepl("a", x, fixed=TRUE) & grepl("b", x, fixed=TRUE)
pat <- "(?=.*a)(?=.*b)"
Chase <- grepl(pat, x, perl=TRUE)
identical(tlm, Chase)
# [1] TRUE
## Both methods are similarly fast
benchmark(
tlm = grepl("a", x, fixed=TRUE) & grepl("b", x, fixed=TRUE),
Chase = grepl(pat, x, perl=TRUE))
# test replications elapsed relative user.self sys.self
# 2 Chase 100 9.89 1.105 9.80 0.10
# 1 tlm 100 8.95 1.000 8.47 0.48

For readability's sake, you could just do:
x <- c(
"imageUploaded,people.jpg,more,comma,separated,stuff",
"imageUploaded",
"people.jpg"
)
xmatches <- intersect(
grep("imageUploaded",x,fixed=TRUE),
grep("people.jpg",x,fixed=TRUE)
)
x[xmatches]
[1] "imageUploaded,people.jpg,more,comma,separated,stuff"

Below is an alternative to grep using Hadley's stringr::str_detect(). This avoids the use of perl = TRUE (thanks, @jan-stanstrup). Additionally, dplyr::filter() returns the matching rows within the data frame itself, so you never need to leave the df.
library(stringr)
library(dplyr)
x <- data.frame(User.Id = c(34234, 34234, 12345),
                Tags = c("imageUploaded,people.jpg,more,comma,separated,stuff",
                         "imageUploaded",
                         "people.jpg"))
data.people <- x %>% filter(str_detect(Tags,"(?=.*imageUploaded)(?=.*people\\.jpg)"))
data.people
# returns
# User.Id Tags
# 1 34234 imageUploaded,people.jpg,more,comma,separated,stuff
This is simpler and works if "people.jpg" always follows "imageUploaded":
str_extract(x$Tags, "imageUploaded.*people\\.jpg")
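If you'd rather avoid the lookahead regex entirely, a rough equivalent is to combine two fixed-string str_detect() calls with & inside filter() (a sketch using the x data frame defined above):
data.people <- x %>%
  filter(str_detect(Tags, fixed("imageUploaded")) & str_detect(Tags, fixed("people.jpg")))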

How to merge two list elements/strings of characters together in R [duplicate]

How can I concatenate (merge, combine) two values?
For example I have:
tmp = cbind("GAD", "AB")
tmp
# [,1] [,2]
# [1,] "GAD" "AB"
My goal is to concatenate the two values in "tmp" to one string:
tmp_new = "GAD,AB"
Which function can do this for me?
paste()
is the way to go. As the previous posters pointed out, paste can do two things:
concatenate values into one "string", e.g.
> paste("Hello", "world", sep=" ")
[1] "Hello world"
where the argument sep specifies the character(s) to be used between the arguments to concatenate,
or collapse character vectors
> x <- c("Hello", "World")
> x
[1] "Hello" "World"
> paste(x, collapse="--")
[1] "Hello--World"
where the argument collapse specifies the character(s) to be used between the elements of the vector to be collapsed.
You can even combine both:
> paste(x, "and some more", sep="|-|", collapse="--")
[1] "Hello|-|and some more--World|-|and some more"
help.search() is a handy function, e.g.
> help.search("concatenate")
will lead you to paste().
For the first non-paste() answer, we can look at stringr::str_c() (and then toString() below). It hasn't been around as long as this question, so I think it's useful to mention that it also exists.
Very simple to use, as you can see.
tmp <- cbind("GAD", "AB")
library(stringr)
str_c(tmp, collapse = ",")
# [1] "GAD,AB"
From its documentation file description, it fits this problem nicely.
To understand how str_c works, you need to imagine that you are building up a matrix of strings. Each input argument forms a column, and is expanded to the length of the longest argument, using the usual recycling rules. The sep string is inserted between each column. If collapse is NULL each row is collapsed into a single string. If non-NULL that string is inserted at the end of each row, and the entire matrix collapsed to a single string.
Added 4/13/2016: It's not exactly the same as your desired output (extra space), but no one has mentioned it either. toString() is basically a version of paste() with collapse = ", " hard-coded, so you can do
toString(tmp)
# [1] "GAD, AB"
As others have pointed out, paste() is the way to go. But it can get annoying to have to type paste(str1, str2, str3, sep='') every time you want a non-default separator.
You can very easily create wrapper functions that make life much simpler. For instance, if you find yourself concatenating strings with no separator really often, you can do:
p <- function(..., sep='') {
  paste(..., sep=sep, collapse=sep)
}
or if you often want to join strings from a vector (like implode() from PHP):
implode <- function(..., sep='') {
  paste(..., collapse=sep)
}
This allows you to do:
p('a', 'b', 'c')
#[1] "abc"
vec <- c('a', 'b', 'c')
implode(vec)
#[1] "abc"
implode(vec, sep=', ')
#[1] "a, b, c"
Also, there is the built-in paste0, which does the same thing as my p, but without allowing custom separators. It's slightly more efficient than paste().
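For instance, paste0() gives the same no-separator result without a wrapper:
paste0('a', 'b', 'c')
#[1] "abc"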
> tmp = paste("GAD", "AB", sep = ",")
> tmp
[1] "GAD,AB"
I found this from Google by searching for R concatenate strings: http://stat.ethz.ch/R-manual/R-patched/library/base/html/paste.html
Alternatively, if your objective is to output directly to a file or stdout, you can use cat:
cat(s1, s2, sep=", ")
Another way:
sprintf("%s you can add other static strings here %s",string1,string2)
It is sometimes more useful than the paste() function. %s denotes the place where the supplied strings will be inserted.
Note that this will come in handy as you try to build a path:
sprintf("/%s", paste("this", "is", "a", "path", sep="/"))
output
/this/is/a/path
You can create your own operator:
'%&%' <- function(x, y)paste0(x,y)
"new" %&% "operator"
[1] "newoperator"
You can also redefine the 'and' (&) operator:
'&' <- function(x, y)paste0(x,y)
"dirty" & "trick"
"dirtytrick"
Messing with baseline syntax is ugly, but so is using paste()/paste0(). If you work only with your own code, you can (almost always) replace the logical & operator with * and do multiplication of logical values instead of using logical 'and' (&).
Given the matrix, tmp, that you created:
paste(tmp[1,], collapse = ",")
I assume there is some reason why you're creating a matrix using cbind, as opposed to simply:
tmp <- "GAD,AB"
Consider the case where the strings are columns and the result should be a new column:
df <- data.frame(a = letters[1:5], b = LETTERS[1:5], c = 1:5)
df$new_col <- do.call(paste, c(df[c("a", "b")], sep = ", "))
df
# a b c new_col
#1 a A 1 a, A
#2 b B 2 b, B
#3 c C 3 c, C
#4 d D 4 d, D
#5 e E 5 e, E
Optionally, skip the [c("a", "b")] subsetting if all columns need to be pasted.
# you can also try str_c from stringr package as mentioned by other users too!
do.call(str_c, c(df[c("a", "b")], sep = ", "))
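For instance, pasting every column on a fresh copy of the data (df2 is just an illustrative name, used so the new_col created above doesn't get pasted in as well):
df2 <- data.frame(a = letters[1:5], b = LETTERS[1:5], c = 1:5)
do.call(paste, c(df2, sep = ", "))
# [1] "a, A, 1" "b, B, 2" "c, C, 3" "d, D, 4" "e, E, 5"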
glue is a new function, data class, and package that has been developed as part of the tidyverse, with a lot of extended functionality. It combines features from paste, sprintf, and the other answers above.
tmp <- tibble::tibble(firststring = "GAD", secondstring = "AB")
(tmp_new <- glue::glue_data(tmp, "{firststring},{secondstring}"))
#> GAD,AB
Created on 2019-03-06 by the reprex package (v0.2.1)
Yes, it's overkill for the simple example in this question, but powerful for many situations. (see https://glue.tidyverse.org/)
A quick example compared with paste() is below. The glue code was a bit easier to type and looks a bit easier to read.
tmp <- tibble::tibble(firststring = c("GAD", "GAD2", "GAD3"), secondstring = c("AB1", "AB2", "AB3"))
(tmp_new <- glue::glue_data(tmp, "{firststring} and {secondstring} went to the park for a walk. {firststring} forgot his keys."))
#> GAD and AB1 went to the park for a walk. GAD forgot his keys.
#> GAD2 and AB2 went to the park for a walk. GAD2 forgot his keys.
#> GAD3 and AB3 went to the park for a walk. GAD3 forgot his keys.
(with(tmp, paste(firststring, "and", secondstring, "went to the park for a walk.", firststring, "forgot his keys.")))
#> [1] "GAD and AB1 went to the park for a walk. GAD forgot his keys."
#> [2] "GAD2 and AB2 went to the park for a walk. GAD2 forgot his keys."
#> [3] "GAD3 and AB3 went to the park for a walk. GAD3 forgot his keys."
Created on 2019-03-06 by the reprex package (v0.2.1)
Another non-paste answer:
x <- capture.output(cat(data, sep = ","))
x
[1] "GAD,AB"
Where
data <- c("GAD", "AB")

How to grepl with two pattern objects in R

I have a vector called
vec <- c("16S_s95_S112_R2_101.fastq.gz",
"16S_s95_S112_R1_001.fastq.gz",
"16S_s94_S103_R2_021.fastq.gz",
"16S_s94_S103_R1_001.fastq.gz")
I want to grepl items with sample <- "_s95_" and R1 <- "R1".
I want to use sample and R1 objects while doing grepl and find something matching _s95_ and R1 strings both.
Result I want is 16S_s95_S112_R1_001.fastq.gz.
I tried grepl(pattern = sample&R1, x= vec) which did not work for me.
I can do this with multiple grepl's, but I am trying to find something neat to do this.
For your specific use case where you know the order of the patterns, it's almost certainly going to be faster to follow Jilber Urbina's suggestion to programmatically compose a single regex.
For a more general solution that works regardless of order and on any number of patterns, we can use sapply to loop across each pattern, and then use rowSums to count the number of pattern matches and find the rows where all of them match:
patterns = c("_s95_", 'R1')
sapply(patterns, function(x) grepl(x, vec))
_s95_ R1
[1,] TRUE FALSE
[2,] TRUE TRUE
[3,] FALSE FALSE
[4,] FALSE TRUE
vec[which(rowSums(sapply(patterns, function(x) grepl(x, vec))) == length(patterns))]
[1] "16S_s95_S112_R1_001.fastq.gz"
You need to work a bit more on your pattern in order to get the match; try:
> grep(paste0(".*", sample, ".*", R1), vec, value=TRUE)
[1] "16S_s95_S112_R1_001.fastq.gz"

format numeric without leading zero

What's the best way to format a numeric so that it does NOT show the leading zero? For example:
test = .006
sprintf/format/formatC( ??? ) # should result in ".006"
I believe I answered this once before but can't find it. You cannot tell sprintf() et al. about a format that drops the leading zero ... so you have to do it yourself, e.g. via substring():
R> val <- 0.006
R> aa <- substring(sprintf("%4.3f", val), 2)
R> aa
[1] ".006"
R>
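A small wrapper makes that reusable; drop_zero and its digits argument are just illustrative names, and like the answer above it assumes a non-negative value below 1:
drop_zero <- function(x, digits = 3) {
  # format to the requested precision, then drop the leading "0"
  substring(sprintf(paste0("%.", digits, "f"), x), 2)
}
drop_zero(0.006)
# [1] ".006"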
f <- function(x) gsub("^(\\s*[+|-]?)0\\.", "\\1.", as.character(x))
f(0.006)
# ".006"
f(-0.006)
# "-.006"
f("+0.006")
# "+.006"
f(" 0.006")
# " .006"
f(10.05)
# "10.05"
You can always fix it up yourself with regular expression search-and-replace:
library(stringr)
test = .006
str_replace(as.character(test), "^0\\.", ".")
Not the most elegant answer, but it works. Substitute whatever string conversion you like for as.character, such as sprintf with your preferred floating point format.
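For example, with sprintf() as the conversion step:
str_replace(sprintf("%.3f", test), "^0\\.", ".")
# [1] ".006"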

gsub and pad inside of a parenthesis

I have vector like this:
x <- c("20(0.23)", "15(0.2)", "16(0.09)")
and I don't want to mess with the numbers outside the parentheses, but I want to remove the leading zero on the numbers inside and make everything have 2 digits. The output will look like:
"20(.23)", "15(.20)", "16(.09)"
Useful information:
I can remove leading zero and retain 2 digits using the function below taken from: LINK
numformat <- function(val) { sub("^(-?)0.", "\\1.", sprintf("%.2f", val)) }
numformat(c(0.2, 0.26))
#[1] ".20" ".26"
I know gsub can be used but I don't know how. I'll provide a strsplit answer but that's hackish at best.
The gsubfn package allows you to replace anything matched by a regex with a function applied to the match. So we could use what you have with your numformat function
library(gsubfn)
# Note that I added as.numeric in because what will be passed in
# is a character string
numformat <- function(val){sub("^(-?)0.", "\\1.", sprintf("%.2f", as.numeric(val)))}
gsubfn("0\\.\\d+", numformat, x)
#[1] "20(.23)" "15(.20)" "16(.09)"
pad.fix <- function(x){
  y <- gsub('\\.(\\d)\\)', '\\.\\10\\)', x)
  gsub('0\\.', '\\.', y)
}
The first gsub adds a trailing zero where needed; the second gsub removes the leading zero.
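Applied to the vector x from the question, this gives:
pad.fix(x)
#[1] "20(.23)" "15(.20)" "16(.09)"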
That is yet another of these Tyler questions that seem to be complicated just for complication's sake :)
So here you go:
R> x <- c("20(0.23)", "15(0.2)", "16(0.09)")
R> sapply(strsplit(gsub("^(\\d+)\\((.*)\\)$", "\\1 \\2", x), " "),
+ function(x) sprintf("%2d(.%02d)",
+ as.numeric(x[1]),
+ as.numeric(x[2])*100))
[1] "20(.23)" "15(.20)" "16(.09)"
R>
We do a few things here:
The gsub() picks off the two numbers: first the one before the parens, then the one inside the parens. [With hindsight, should have picked after the decimal, see below.]
This prints them out just with whitespace, e.g. "20 0.23" for the first.
We then use a standard strsplit() on this.
We then use sapply to process the list we get from strsplit
We print the first number as a two-digit int.
The second one is more tricky -- the (s)printf() family cannot suppress a leading zero, so we print the decimal point and then print two digits of an integer -- and convert the second number accordingly.
It is all concise and in one line, but it would be clearer broken out.
Edit: I don't often provide the fastest solutions, but when I do, at least I can gloat:
R> dason <- function(x) { numformat <- function(val){sub("^(-?)0.", "\\1.", sprintf("%.2f", as.numeric(val)))}; gsubfn("0\\.\\d+", numformat, x) }
R> dirk <- function(x) { sapply(strsplit(gsub("^(\\d+)\\((.*)\\)$", "\\1 \\2", x), " "), function(x) sprintf("%2d(.%02d)", as.numeric(x[1]), as.numeric(x[2])*100)) }
R>
R> dason(x)
[1] "20(.23)" "15(.20)" "16(.09)"
R> dirk(x)
[1] "20(.23)" "15(.20)" "16(.09)"
R>
R> res <- benchmark(dason(x), dirk(x), replications=1000, order="relative")
R> res
test replications elapsed relative user.self sys.self user.child sys.child
2 dirk(x) 1000 0.133 1.000 0.132 0.000 0 0
1 dason(x) 1000 2.026 15.233 1.960 0.064 0 0
R>
So that's about 15 times faster. Not that it matters in this context, but speed never hurt anyone in the long run.
Non gsub answer that's ugly at best.
x <- c("20(0.23)", "15(0.2)", "16(0.09)")
numformat <- function(val) { sub("^(-?)0.", "\\1.", sprintf("%.2f", val)) }
z <- do.call(rbind, strsplit(gsub("\\)", "", x), "\\("))
z[, 2] <- numformat(as.numeric(z[, 2]))
paste0(z[, 1], "(", z[, 2], ")")

Matching multiple patterns

I want to see, if "001" or "100" or "000" occurs in a string of 4 characters of 0 and 1. For example, a 4 character string could be like "1100" or "0010" or "1001" or "1111". How do I match many strings in a string with a single command?
I know grep could be used for pattern matching, but using grep, I can check only one string at a time. I want to know if multiple strings can be used with some other command or with grep itself.
Yes, you can. The | in a grep pattern has the same meaning as or. So you can test for your pattern by using "001|100|000" as your pattern. At the same time, grep is vectorised, so all of this can be done in one step:
x <- c("1100", "0010", "1001", "1111")
pattern <- "001|100|000"
grep(pattern, x)
[1] 1 2 3
This returns an index of which of your vectors contained the matching pattern (in this case the first three.)
Sometimes it is more convenient to have a logical vector that tells you which of the elements in your vector were matched. Then you can use grepl:
grepl(pattern, x)
[1] TRUE TRUE TRUE FALSE
See ?regex for help about regular expressions in R.
Edit:
To avoid creating pattern manually we can use paste:
myValues <- c("001", "100", "000")
pattern <- paste(myValues, collapse = "|")
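The constructed pattern is identical to the hand-written one, so it can be passed straight to grep()/grepl():
pattern
[1] "001|100|000"
grepl(pattern, x)
[1] TRUE TRUE TRUE FALSE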
Here is one solution using stringr package
require(stringr)
mylist = c("1100", "0010", "1001", "1111")
str_locate(mylist, "000|001|100")
Use the -e argument to add additional patterns (note: this applies to the command-line grep utility rather than R's grep):
echo '1100' | grep -e '001' -e '110' -e '101'
If you want logical vector then you should check stri_detect function from stringi package. In your case the pattern is regex, so use this one:
stri_detect_regex(x, pattern)
## [1] TRUE TRUE TRUE FALSE
And some benchmarks:
require(microbenchmark)
test <- stri_paste(stri_rand_strings(100000, 4, "[0-1]"))
head(test)
## [1] "0001" "1111" "1101" "1101" "1110" "0110"
microbenchmark(stri_detect_regex(test, pattern), grepl(pattern, test))
Unit: milliseconds
expr min lq mean median uq max neval
stri_detect_regex(test, pattern) 29.67405 30.30656 31.61175 30.93748 33.14948 35.90658 100
grepl(pattern, test) 36.72723 37.71329 40.08595 40.01104 41.57586 48.63421 100
Sorry for making this an additional answer, but it is too many lines for a comment.
I just wanted to point out that the number of items that can be pasted together via paste(..., collapse = "|") and used as a single matching pattern is limited - see below. Maybe somebody can tell where exactly the limit is? Admittedly the number might not be realistic, but depending on the task to be performed it should not entirely be excluded from our considerations.
For a really large number of items, a loop would be required to check each item of the pattern.
set.seed(0)
samplefun <- function(n, x, collapse){
  paste(sample(x, n, replace=TRUE), collapse=collapse)
}
words <- sapply(rpois(10000000, 8) + 1, samplefun, letters, '')
text <- sapply(rpois(1000, 5) + 1, samplefun, words, ' ')
#since execution takes a while, I have commented out the following lines
#result <- grepl(paste(words, collapse = "|"), text)
# Error in grepl(pattern, text) :
# invalid regular expression
# 'wljtpgjqtnw|twiv|jphmer|mcemahvlsjxr|grehqfgldkgfu|
# ...
#result <- stringi::stri_detect_regex(text, paste(words, collapse = "|"))
# Error in stringi::stri_detect_regex(text, paste(words, collapse = "|")) :
# Pattern exceeds limits on size or complexity. (U_REGEX_PATTERN_TOO_BIG)
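A hedged sketch of the loop-based workaround mentioned above: split the patterns into chunks small enough for a single regex and OR the per-chunk results together (the chunk size of 1000 is arbitrary, and this will of course be slow for ten million words):
chunk_size <- 1000
chunks <- split(words, ceiling(seq_along(words) / chunk_size))
result <- Reduce(`|`, lapply(chunks, function(w) grepl(paste(w, collapse = "|"), text)))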
You can also use the %like% operator from the data.table library.
library(data.table)
# input
x <- c("1100", "0010", "1001", "1111")
pattern <- "001|100|000"
# check for pattern
x %like% pattern
[1] TRUE TRUE TRUE FALSE
