I'm trying to determine if a string is a substring of another string. For example:
chars <- "test"
value <- "es"
I want to return TRUE if "value" appears as part of the string "chars". In the following scenario, I would want to return false:
chars <- "test"
value <- "et"
Use the grepl function
grepl( needle, haystack, fixed = TRUE)
like so:
grepl(value, chars, fixed = TRUE)
# TRUE
Use ?grepl to find out more.
Answer
Sigh, it took me 45 minutes to find the answer to this simple question. The answer is: grepl(needle, haystack, fixed=TRUE)
# Correct
> grepl("1+2", "1+2", fixed=TRUE)
[1] TRUE
> grepl("1+2", "123+456", fixed=TRUE)
[1] FALSE
# Incorrect
> grepl("1+2", "1+2")
[1] FALSE
> grepl("1+2", "123+456")
[1] TRUE
Interpretation
grep is named after the Unix executable, whose name comes from "Global Regular Expression Print": it reads lines of input and prints the ones that match the arguments you gave. "Global" meant the search ran over every line of the input. I'll explain "Regular Expression" below, but the idea is that it's a smarter way to match a string (R calls this type "character", e.g. class("abc")). "Print" is because it's a command-line program, so emitting output means printing to its output stream.
Now, the grep program is basically a filter, from lines of input, to lines of output. And it seems that R's grep function similarly will take an array of inputs. For reasons that are utterly unknown to me (I only started playing with R about an hour ago), it returns a vector of the indexes that match, rather than a list of matches.
But, back to your original question, what we really want is to know whether we found the needle in the haystack, a true/false value. They apparently decided to name this function grepl, as in "grep" but with a "Logical" return value (they call true and false logical values, eg class(TRUE)).
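To make the difference between the return values concrete, here is a minimal sketch in base R. The input vector is hypothetical and chosen only for illustration; value = TRUE is the grep argument that returns the matching elements instead of their positions.

```r
# Hypothetical input vector, chosen only for illustration
haystacks <- c("test", "toast", "best")

# grep: positions of the elements that contain the needle
idx <- grep("es", haystacks, fixed = TRUE)

# grep with value = TRUE: the matching elements themselves
vals <- grep("es", haystacks, fixed = TRUE, value = TRUE)

# grepl: one TRUE/FALSE per element
found <- grepl("es", haystacks, fixed = TRUE)

print(idx)    # 1 3
print(vals)   # "test" "best"
print(found)  # TRUE FALSE TRUE
```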
So, now we know where the name came from and what it's supposed to do. Let's get back to regular expressions. The pattern argument, even though it is a string, is used to build a regular expression (henceforth: regex). A regex is a way to match a string (if this definition irritates you, let it go). For example, the regex a matches the character "a", the regex a* matches the character "a" 0 or more times, and the regex a+ matches the character "a" 1 or more times. Hence in the example above, the needle we are searching for, 1+2, when treated as a regex, means "one or more 1 followed by a 2"... but ours is followed by a plus!
So, if you used grepl without setting fixed, your needles would accidentally be treated as regexes, and that would accidentally work quite often; we can see it even works for the OP's example. But that's a latent bug! We need to tell it the input is a literal string, not a regex, which is what fixed is for. Why fixed? No clue; bookmark this answer because you're probably going to have to look it up 5 more times before you get it memorized.
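As a small sketch of that latent bug, the wrapper below always passes fixed = TRUE. The name contains_literal is hypothetical (it is not a base R function), and the example strings are made up for illustration.

```r
# contains_literal is a hypothetical helper name, not part of base R
contains_literal <- function(needle, haystack) {
  grepl(needle, haystack, fixed = TRUE)
}

# "." is a regex metacharacter meaning "any single character"
regex_hit   <- grepl("a.c", "abc")               # TRUE: "." matches "b"
literal_hit <- contains_literal("a.c", "abc")    # FALSE: "abc" has no literal "a.c"
both_hit    <- contains_literal("a.c", "xa.cx")  # TRUE: the literal text is present

print(c(regex_hit, literal_hit, both_hit))
```

Wrapping the call this way means the caller cannot forget the flag, which is the point the answer is making about the default.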
A few final thoughts
The better your code is, the less history you have to know to make sense of it. Every argument can have at least two interesting values (otherwise it wouldn't need to be an argument), the docs list 9 arguments here, which means there's at least 2^9=512 ways to invoke it, that's a lot of work to write, test, and remember... decouple such functions (split them up, remove dependencies on each other, string things are different than regex things are different than vector things). Some of the options are also mutually exclusive, don't give users incorrect ways to use the code, ie the problematic invocation should be structurally nonsensical (such as passing an option that doesn't exist), not logically nonsensical (where you have to emit a warning to explain it). Put metaphorically: replacing the front door in the side of the 10th floor with a wall is better than hanging a sign that warns against its use, but either is better than neither. In an interface, the function defines what the arguments should look like, not the caller (because the caller depends on the function, inferring everything that everyone might ever want to call it with makes the function depend on the callers, too, and this type of cyclical dependency will quickly clog a system up and never provide the benefits you expect). Be very wary of equivocating types, it's a design flaw that things like TRUE and 0 and "abc" are all vectors.
You want grepl:
> chars <- "test"
> value <- "es"
> grepl(value, chars)
[1] TRUE
> chars <- "test"
> value <- "et"
> grepl(value, chars)
[1] FALSE
This can also be done using the "stringr" package:
> library(stringr)
> chars <- "test"
> value <- "es"
> str_detect(chars, value)
[1] TRUE
### For multiple value case:
> value <- c("es", "l", "est", "a", "test")
> str_detect(chars, value)
[1] TRUE FALSE TRUE FALSE TRUE
Use stri_detect_fixed from the stringi package:
> stri_detect_fixed("test",c("et","es"))
[1] FALSE TRUE
Some benchmarks:
library(stringi)
set.seed(123L)
value <- stri_rand_strings(10000, ceiling(runif(10000, 1, 100))) # 10000 random ASCII strings
head(value)
chars <- "es"
library(microbenchmark)
microbenchmark(
grepl(chars, value),
grepl(chars, value, fixed=TRUE),
grepl(chars, value, perl=TRUE),
stri_detect_fixed(value, chars),
stri_detect_regex(value, chars)
)
## Unit: milliseconds
## expr min lq median uq max neval
## grepl(chars, value) 13.682876 13.943184 14.057991 14.295423 15.443530 100
## grepl(chars, value, fixed = TRUE) 5.071617 5.110779 5.281498 5.523421 45.243791 100
## grepl(chars, value, perl = TRUE) 1.835558 1.873280 1.956974 2.259203 3.506741 100
## stri_detect_fixed(value, chars) 1.191403 1.233287 1.309720 1.510677 2.821284 100
## stri_detect_regex(value, chars) 6.043537 6.154198 6.273506 6.447714 7.884380 100
If you would also like to check whether a string (or a set of strings) contains any of several substrings, you can put '|' between the substrings in the pattern. Note that this makes the pattern a regular expression, so it cannot be combined with fixed = TRUE.
> substring <- "as|at"
> string_vector <- c("ass", "ear", "eye", "heat")
> grepl(substring, string_vector)
You will get
[1] TRUE FALSE FALSE TRUE
since the first word contains the substring "as" and the last word contains the substring "at".
Use grep or grepl but be aware of whether or not you want to use regular expressions.
By default, grep and related functions take a regular expression to match, not a literal substring. If you're not expecting that and you try to match on an invalid regex, you get an error:
> grep("[", "abc[")
Error in grep("[", "abc[") :
invalid regular expression '[', reason 'Missing ']''
To do a true substring test, use fixed = TRUE.
> grep("[", "abc[", fixed = TRUE)
[1] 1
If you do want regex, great, but that's not what the OP appears to be asking.
You can use grep
grep("es", "Test")
[1] 1
grep("et", "Test")
integer(0)
Similar problem here: Given a string and a list of keywords, detect which, if any, of the keywords are contained in the string.
Recommendations from this thread suggest stringr's str_detect and grepl. Here are the benchmarks from the microbenchmark package:
Using
map_keywords = c("once", "twice", "few")
t = "yes but only a few times"
mapper1 <- function (x) {
r = str_detect(x, map_keywords)
}
mapper2 <- function (x) {
r = sapply(map_keywords, function (k) grepl(k, x, fixed = TRUE))
}
and then
microbenchmark(mapper1(t), mapper2(t), times = 5000)
we find
Unit: microseconds
expr min lq mean median uq max neval
mapper1(t) 26.401 27.988 31.32951 28.8430 29.5225 2091.476 5000
mapper2(t) 19.289 20.767 24.94484 23.7725 24.6220 1011.837 5000
As you can see, over 5,000 iterations of the keyword search on a practical string and vector of keywords, grepl is somewhat faster than str_detect (median about 24 versus 29 microseconds).
The outcome is the boolean vector r which identifies which, if any, of the keywords are contained in the string.
Therefore, I recommend using grepl to determine if any keywords are in a string.
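For reference, here is a minimal base R sketch of the same keyword check that also reports which keywords matched. The variable names sentence, hits and matched are introduced here for illustration; vapply is used instead of sapply only so the result type is guaranteed to be logical.

```r
map_keywords <- c("once", "twice", "few")
sentence <- "yes but only a few times"

# One literal substring test per keyword; the result is named by keyword
hits <- vapply(map_keywords, grepl, logical(1), x = sentence, fixed = TRUE)

# Keep only the keywords that were found
matched <- names(hits)[hits]

print(hits)
print(matched)  # "few"
```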
Related
I've had pretty good experience with the function "aregexec" for searching a string.
For example, I can adjust the tolerance by changing the parameters in "max" (which R partially matches to the max.distance argument).
By allowing 1 substitution and no insertions or deletions, I can extract "abcde" from "123accdefg":
srch<-aregexec("abcde", "123accdefg", max = list(sub=1,del=0,ins=0), ignore.case = TRUE)
> srch
[[1]]
[1] 4
attr(,"match.length")
[1] 5
Previously, I've worked with parameters like, max = list(sub=30,del=0,ins=0), and it worked fine. Today I had a moderately complicated task for it but it failed. The following 2 strings are of the same length with 3 substitutions.
srch<-aregexec("CAGACGCCCCCAAAA", "CAGACCCCTCCAAGA", max = list(sub=8,del=0,ins=0), ignore.case = TRUE)
srch
I simplified the question by showing only the problematic part. I allowed 8 substitutions, which is more than sufficient, but aregexec still returned nothing.
Instead, if I simplify the task a bit more, like removing the last two characters, it worked.
srch<-aregexec("CAGACGCCCCCAA", "CAGACCCCTCCAA", max = list(sub=8,del=0,ins=0), ignore.case = TRUE)
srch
Thank you for your time. I am looking for an alternative; an explanation would be even better.
The manual of agrep says: If ‘cost’ is not given, ‘all’ defaults to 10%. So you also have to give a value for all:
aregexec("CAGACGCCCCCAAAA", "CAGACCCCTCCAAGA",
max.distance = list(all=3,sub=3,del=0,ins=0), ignore.case = TRUE)
#[[1]]
#[1] 1
#attr(,"match.length")
#[1] 15
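A quick way to pick the limits is to count the mismatches first. The sketch below assumes both strings have equal length, so a position-by-position comparison is meaningful; the names n_mismatch, m and matched are introduced here for illustration, and the list components are the full names of the ones abbreviated above.

```r
pattern <- "CAGACGCCCCCAAAA"
text    <- "CAGACCCCTCCAAGA"

# Position-wise mismatch count (only meaningful for equal-length strings)
n_mismatch <- sum(strsplit(pattern, "")[[1]] != strsplit(text, "")[[1]])
print(n_mismatch)  # 3

# Allow exactly that many substitutions, and raise 'all' to the same value
m <- aregexec(pattern, text,
              max.distance = list(all = n_mismatch,
                                  substitutions = n_mismatch,
                                  insertions = 0,
                                  deletions = 0))

# regmatches extracts the matched text from the aregexec result
matched <- regmatches(text, m)
print(matched)
```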
I have the following vector in R and I would like to find all the strings containing A's and B's but not the number 2.
vec1<-c("A_cont_1", "A_cont_12", "B_treat_8", "AB_cont_22", "cont_21_Aa")
The following does not work:
grep("A|B|!2", vec1)
It gives me back all the strings:
[1] 1 2 3 4 5
The same is true for this example:
grep("A|B|-2", vec1)
What would be the correct syntax?
You can do this with a fairly simple regular expression:
grep("^[^2]*[AB][^2]*$", vec1)
In words, it means:
^ match the start of the string
[^2]* match anything except "2", zero or more times
[AB] match "A" or "B"
[^2]* match anything except "2", zero or more times
$ match the end of the string
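As a cross-check of the pattern described above, this sketch compares it with two separate tests (contains A or B, and does not contain "2") on the vector from the question. The names one_regex and two_tests are introduced here for illustration.

```r
vec1 <- c("A_cont_1", "A_cont_12", "B_treat_8", "AB_cont_22", "cont_21_Aa")

# Single anchored regex: an A or B somewhere, and no "2" anywhere
one_regex <- grepl("^[^2]*[AB][^2]*$", vec1)

# Same condition written as two independent tests
two_tests <- grepl("[AB]", vec1) & !grepl("2", vec1, fixed = TRUE)

print(identical(one_regex, two_tests))  # TRUE
print(which(two_tests))                 # 1 3
print(vec1[two_tests])                  # "A_cont_1" "B_treat_8"
```

The two-test form is longer but tends to be easier to read and to extend with further conditions.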
I would use two grep calls:
intersect(grep("A|B",vec1),grep("2",vec1,invert=TRUE))
#[1] 1 3
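OP, your attempt is pretty close, try this:
grep('^(A|B|[^2])*$', vec1)
Note that this only excludes strings containing "2"; because [^2] already matches A and B, it does not require an A or B to be present.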
grep generally does not work very well for doing a positive and a negative search in one invocation. You might be able to make it work with a complex regular expression, but you might be better off just doing:
grep '[AB]' somefile.txt | grep -v '2'
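The R equivalent of that would be the following (value = TRUE on both calls, so the result is the matching strings rather than indices into the filtered vector):
grep("2", grep("A|B", vec1, value = TRUE), invert = TRUE, value = TRUE)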
I extended the answer provided by @eddi. I have tested it in R and it works for me. I changed the last element in your example since all of them contained A or B.
# Create the vector from the OP with one change
vec1<-c("A_cont_1", "A_cont_12", "B_treat_8", "AB_cont_22", "cont_21_dd")
I then ran the following code. It shows which results to expect from each part of the grep call.
First, tell me which elements contain A or B
> grepl("A|B", vec1)
[1] TRUE TRUE TRUE TRUE FALSE
Now tell me which elements contain a "2"
> grepl("2", vec1)
[1] FALSE TRUE FALSE TRUE TRUE
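Combining the two, the elements that contain A or B and also a "2" are at indices 2 and 4 (add invert = TRUE to the outer call to get the ones without a "2", here 1 and 3):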
> grep("2", grep("A|B", vec1, value = T))
[1] 2 4
Done!
I want to see, if "001" or "100" or "000" occurs in a string of 4 characters of 0 and 1. For example, a 4 character string could be like "1100" or "0010" or "1001" or "1111". How do I match many strings in a string with a single command?
I know grep could be used for pattern matching, but using grep, I can check only one string at a time. I want to know if multiple strings can be used with some other command or with grep itself.
Yes, you can. The | in a grep pattern has the same meaning as or. So you can test for your pattern by using "001|100|000" as your pattern. At the same time, grep is vectorised, so all of this can be done in one step:
x <- c("1100", "0010", "1001", "1111")
pattern <- "001|100|000"
grep(pattern, x)
[1] 1 2 3
This returns the indices of the elements of your vector that contain the matching pattern (in this case the first three).
Sometimes it is more convenient to have a logical vector that tells you which of the elements in your vector were matched. Then you can use grepl:
grepl(pattern, x)
[1] TRUE TRUE TRUE FALSE
See ?regex for help about regular expressions in R.
Edit:
To avoid creating pattern manually we can use paste:
myValues <- c("001", "100", "000")
pattern <- paste(myValues, collapse = "|")
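Putting the pieces together, here is a short sketch that builds the pattern with paste and also shows which individual value matched each element. The names any_hit and per_value are introduced here for illustration; the per-value check uses fixed = TRUE because each value is a literal string.

```r
x <- c("1100", "0010", "1001", "1111")
myValues <- c("001", "100", "000")

# One alternation pattern for all values
pattern <- paste(myValues, collapse = "|")
any_hit <- grepl(pattern, x)
print(pattern)  # "001|100|000"
print(any_hit)  # TRUE TRUE TRUE FALSE

# Logical matrix: one row per element of x, one column per value
per_value <- sapply(myValues, grepl, x = x, fixed = TRUE)
print(per_value)
```

If the values could contain regex metacharacters, the alternation pattern would need escaping, whereas the per-value matrix would not.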
Here is one solution using the stringr package:
require(stringr)
mylist = c("1100", "0010", "1001", "1111")
str_locate(mylist, "000|001|100")
Use the -e argument to add additional patterns:
echo '1100' | grep -e '001' -e '110' -e '101'
If you want a logical vector, then you should check the stri_detect functions from the stringi package. In your case the pattern is a regex, so use this one:
stri_detect_regex(x, pattern)
## [1] TRUE TRUE TRUE FALSE
And some benchmarks:
require(microbenchmark)
test <- stri_paste(stri_rand_strings(100000, 4, "[0-1]"))
head(test)
## [1] "0001" "1111" "1101" "1101" "1110" "0110"
microbenchmark(stri_detect_regex(test, pattern), grepl(pattern, test))
Unit: milliseconds
expr min lq mean median uq max neval
stri_detect_regex(test, pattern) 29.67405 30.30656 31.61175 30.93748 33.14948 35.90658 100
grepl(pattern, test) 36.72723 37.71329 40.08595 40.01104 41.57586 48.63421 100
Sorry for making this an additional answer, but it is too many lines for a comment.
I just wanted to point out that the number of items that can be pasted together via paste(..., collapse = "|") and used as a single matching pattern is limited; see below. Maybe somebody can tell where exactly the limit is? Admittedly the number below might not be realistic, but depending on the task it should not be entirely excluded from consideration.
For a really large number of items, a loop would be required to check each item of the pattern.
set.seed(0)
samplefun <- function(n, x, collapse){
paste(sample(x, n, replace=TRUE), collapse=collapse)
}
words <- sapply(rpois(10000000, 8) + 1, samplefun, letters, '')
text <- sapply(rpois(1000, 5) + 1, samplefun, words, ' ')
#since execution takes a while, I have commented out the following lines
#result <- grepl(paste(words, collapse = "|"), text)
# Error in grepl(pattern, text) :
# invalid regular expression
# 'wljtpgjqtnw|twiv|jphmer|mcemahvlsjxr|grehqfgldkgfu|
# ...
#result <- stringi::stri_detect_regex(text, paste(words, collapse = "|"))
# Error in stringi::stri_detect_regex(text, paste(words, collapse = "|")) :
# Pattern exceeds limits on size or complexity. (U_REGEX_PATTERN_TOO_BIG)
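The loop mentioned above could look roughly like the sketch below. It assumes the items are literal strings (hence fixed = TRUE), and detect_any_fixed is a hypothetical helper name, not a function from base R or stringi. Elements that already matched are skipped on later iterations.

```r
# detect_any_fixed is a hypothetical helper, shown only as a sketch
detect_any_fixed <- function(needles, haystack) {
  hit <- logical(length(haystack))      # all FALSE to start with
  for (needle in needles) {
    remaining <- !hit                   # elements with no match so far
    if (!any(remaining)) break          # everything matched, stop early
    hit[remaining] <- grepl(needle, haystack[remaining], fixed = TRUE)
  }
  hit
}

x <- c("1100", "0010", "1001", "1111")
result <- detect_any_fixed(c("001", "100", "000"), x)
print(result)  # TRUE TRUE TRUE FALSE
```

Because each call uses a single short pattern, this avoids the pattern-size limit, at the cost of one pass over the remaining text per item.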
You can also use the %like% operator from the data.table package.
library(data.table)
# input
x <- c("1100", "0010", "1001", "1111")
pattern <- "001|100|000"
# check for pattern
x %like% pattern
[1] TRUE TRUE TRUE FALSE