Let's say I have a string "Hello." and I want to see if this string contains a period:
text <- "Hello."
results <- grepl(".", text)
This sets results to TRUE, but it also returns TRUE if text is "Hello" without the period.
I'm confused: I can't find anything about this in the documentation, and it only happens with the period.
Any ideas?
See the differences with these examples
> grepl("\\.", "Hello.")
[1] TRUE
> grepl("\\.", "Hello")
[1] FALSE
As SimonO101 pointed out, an unescaped . matches any single character. To look for a literal period you have to escape it as \\., which matches only a .
R's documentation on regular expressions is extensive (see ?regex); you can also take a look at this link to understand the use of the dot.
I usually use Jilber's approach, but here are two other ways:
> grepl("[.]", "Hello.")
[1] TRUE
> grepl("[.]", "Hello")
[1] FALSE
> grepl(".", "Hello.", fixed = TRUE)
[1] TRUE
> grepl(".", "Hello", fixed = TRUE)
[1] FALSE
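To compare the approaches side by side, here is a small sketch. The vector name samples is made up for illustration. It also shows why the unescaped "." matched "Hello": as a regex it matches any single character, so only an empty string fails.

```r
samples <- c("Hello.", "Hello", "")

# Unescaped "." is a regex wildcard: any single character matches
any_char <- grepl(".", samples)

# Three equivalent ways to look for a literal period
escaped  <- grepl("\\.", samples)
in_class <- grepl("[.]", samples)
literal  <- grepl(".", samples, fixed = TRUE)

print(any_char)  # TRUE TRUE FALSE
print(escaped)   # TRUE FALSE FALSE
```

fixed = TRUE is probably the least error-prone choice when the search term is never meant to be a pattern.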
Related
I'm working on a project where I define some nouns like Haus, Boot, Kampf, ... and want to detect every version (singular/plural) and every combination of these words in sentences. For example, the algorithm should return TRUE if a sentence contains one of: Häuser, Hausboot, Häuserkampf, Kampfboot, Hausbau, Bootsanleger, ....
Are you familiar with an algorithm that can do such a thing (preferably in R)? Of course I could implement this manually, but I'm pretty sure that something should already exist.
Thanks!
You can use base R's grepl function to test for the word, and str_count from the stringr library to count its occurrences, as in this example:
> # Toy example text
> text1 <- c(" This is an example where Hausbau appears twice (Hausbau)")
> text2 <- c(" Here it does not appear the name")
> # Load library
> library(stringr)
> # Does it appear "Hausbau"?
> grepl("Hausbau", text1)
[1] TRUE
> grepl("Hausbau", text2)
[1] FALSE
> # Number of "Hausbau" in the text
> str_count(text1, "Hausbau")
[1] 2
library(stringr)
check <- c("Der Häuser", "Das Hausboot ist", "Häuserkampf", "Kampfboot im Wasser", "NotMe", "Hausbau", "Bootsanleger", "Schauspiel")
base <- c("Haus", "Boot", "Kampf")
# Fold to ASCII, lower-case, then test each string against every base word
unlist(lapply(str_to_lower(stringi::stri_trans_general(check, "Latin-ASCII")), function(x) any(str_detect(x, str_to_lower(base)))))
# [1] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
Breaking it down
Note Roland's comment: this approach produces false positives for words like "Schauspiel", which contains "haus" without being related to Haus.
First, get rid of the special characters: stri_trans_general with "Latin-ASCII" translates them (for example, ä becomes a).
Next, convert the strings to lowercase so that, for example, Boot matches in Kampfboot.
Then apply over the strings to test and check each against the base list; if any of those values is TRUE, you have a match.
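The same steps can be sketched in base R only, as an alternative for when stringi and stringr are not available. This swaps stri_trans_general for explicit gsub replacements, so it only handles the German characters listed. The names fold_german, contains_stem and base_words are hypothetical, and the umlauts are written as \u escapes so the snippet does not depend on the file encoding.

```r
# Hypothetical helper: replace German umlauts and sharp s with ASCII
# letters, then lower-case the result
fold_german <- function(x) {
  x <- gsub("[\u00e4\u00c4]", "a", x)  # a-umlaut, both cases
  x <- gsub("[\u00f6\u00d6]", "o", x)  # o-umlaut, both cases
  x <- gsub("[\u00fc\u00dc]", "u", x)  # u-umlaut, both cases
  x <- gsub("\u00df", "ss", x)         # sharp s
  tolower(x)
}

# Hypothetical helper: TRUE where a text contains any of the stems
contains_stem <- function(texts, stems) {
  pattern <- paste(fold_german(stems), collapse = "|")
  grepl(pattern, fold_german(texts))
}

check <- c("Der H\u00e4user", "Das Hausboot ist", "NotMe", "Schauspiel")
base_words <- c("Haus", "Boot", "Kampf")
hits <- contains_stem(check, base_words)
print(hits)  # TRUE TRUE FALSE TRUE ("Schauspiel" is still a false positive)
```

Joining the stems with | assumes they contain no regex metacharacters, which holds for plain nouns like these.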
I am trying to detect whether a string starts with either of the provided strings (separated by | )
name = "KKSWAP"
stringr::str_starts(name, "RTT|SWAP")
returns TRUE, but
str_starts(name, "SWAP|RTT")
returns FALSE
This behaviour seems wrong, as KKSWAP doesn't start with "RTT" or "SWAP". I would expect this to be false in both above cases.
The reason can be found in the code of the function :
function (string, pattern, negate = FALSE)
{
switch(type(pattern), empty = , bound = stop("boundary() patterns are not supported."),
fixed = stri_startswith_fixed(string, pattern, negate = negate,
opts_fixed = opts(pattern)), coll = stri_startswith_coll(string,
pattern, negate = negate, opts_collator = opts(pattern)),
regex = {
pattern2 <- paste0("^", pattern)
attributes(pattern2) <- attributes(pattern)
str_detect(string, pattern2, negate)
})
}
You can see that it pastes '^' in front of the pattern, so in your example it looks for '^RTT|SWAP'. The anchor applies only to the first alternative, so the pattern matches 'SWAP' anywhere in the string.
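The precedence issue can be reproduced with base R's grepl, which avoids depending on stringr here. This is a sketch: the variable names are made up, and the grouped pattern is one possible fix rather than the package's recommendation.

```r
name <- "KKSWAP"

# "^RTT|SWAP" is read as (^RTT) or (SWAP): only RTT is anchored
unanchored <- grepl("^RTT|SWAP", name)

# Grouping the alternatives makes ^ apply to both of them
grouped     <- grepl("^(RTT|SWAP)", name)
grouped_hit <- grepl("^(RTT|SWAP)", "SWAPKK")

print(c(unanchored, grouped, grouped_hit))  # TRUE FALSE TRUE
```

Given the function source above, passing "(RTT|SWAP)" to str_starts should build this same grouped pattern once '^' is pasted in front.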
If you want to look at more than one pattern you should use a vector:
name <- "KKSWAP"
stringr::str_starts(name, c("RTT","SWAP"))
# [1] FALSE FALSE
If you want just one answer, you can combine it with any():
name <- "KKSWAP"
any(stringr::str_starts(name, c("RTT","SWAP")))
# [1] FALSE
The advantage of stringr::str_starts() is the vectorisation of the pattern argument, but if you don't need it grepl('^RTT|^SWAP', name), as suggested by TTS, is a good base R alternative.
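Alternatively, the base function startsWith() suggested by jpsmith is vectorized over the prefix. Note that it compares literal strings rather than regular expressions, so "RTT|SWAP" is treated as one literal prefix, and the second call below returns FALSE for that reason: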
startsWith(name, c("RTT","SWAP"))
# [1] FALSE FALSE
startsWith(name, "RTT|SWAP")
# [1] FALSE
I'm not familiar with the stringr version, but the base R version startsWith returns your desired result. If you don't have to use stringr, this may be a solution:
startsWith(name, "RTT|SWAP")
startsWith(name, "SWAP|RTT")
startsWith(name, "KK")
# > startsWith(name, "RTT|SWAP")
# [1] FALSE
# > startsWith(name, "SWAP|RTT")
# [1] FALSE
# > startsWith(name, "KK")
# [1] TRUE
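One caveat worth checking: startsWith() does no regex matching, so the | in "RTT|SWAP" is an ordinary character. The sketch below illustrates this with made-up test strings and shows the vector-plus-any() form for testing several prefixes.

```r
name <- "KKSWAP"

# The prefix is compared literally, so "|" is just a character here
literal_pipe <- startsWith("RTT|SWAPPED", "RTT|SWAP")  # TRUE
no_regex     <- startsWith("SWAPKK", "RTT|SWAP")       # FALSE

# To test several prefixes, pass a vector and reduce with any()
starts_any <- any(startsWith(name, c("RTT", "SWAP")))  # FALSE
starts_kk  <- any(startsWith(name, c("RTT", "KK")))    # TRUE
```

So the FALSE results above are correct for "KKSWAP", but a string such as "SWAPKK" would also give FALSE with the "RTT|SWAP" prefix.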
The help text describes str_starts: Detect the presence or absence of a pattern at the beginning or end of a string. This might be why it's not behaving quite as expected.
pattern is the Pattern with which the string starts or ends.
We can add the ^ anchor to each alternative so that both are matched only at the beginning of the string, which gives the expected result.
name = 'KKSWAP'
str_starts(name, '^RTT|^SWAP')
I would prefer grepl in this instance because it seems less misleading.
grepl('^RTT|^SWAP', name)
I am dealing with two strings like this below
x1 <- "Unknown, because not discussed"
x2 <- "Not at goal, no."
How do I use the grepl function to distinguish between these two strings?
When I use grepl("no", x1), it returns TRUE, which is not what I want. It is picking up the no in not or Unknown. How do I detect strings that contain the word no explicitly? Any advice is much appreciated.
You can use the word boundary \\b to distinguish them. \\bno\\b matches no only when it is not directly preceded or followed by a word character:
grepl("\\bno\\b", x1)
# [1] FALSE
grepl("\\bno\\b", x2)
# [1] TRUE
I can think of a couple of options for matching "no" but not "not":
Using the \b "word boundary" pattern:
> x = c("Unknown, because not discussed", "Not at goal, no.")
> grepl("\\bno\\b", x)
[1] FALSE TRUE
Using [^t] to exclude "not":
> grepl("\\bno[^t]", x)
[1] FALSE TRUE
For matching the word "no" by itself the word boundary option "\\bno\\b" is probably best.
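A short sketch of why the word-boundary version is the safer of the two. The last two test strings are made up for illustration: [^t] needs some character after "no", so it misses a "no" at the very end of a string, and it accepts longer words such as "nobody".

```r
x <- c("Unknown, because not discussed", "Not at goal, no.",
       "I said no", "nobody came")

whole_word <- grepl("\\bno\\b", x)   # "no" as a complete word
not_t      <- grepl("\\bno[^t]", x)  # "no" followed by any non-t character

print(whole_word)  # FALSE TRUE TRUE FALSE
print(not_t)       # FALSE TRUE FALSE TRUE
```

Both patterns are case-sensitive; adding ignore.case = TRUE would also catch "No".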
I have an R string, with the format
s = `"[some letters and numbers]_[a number]_[more numbers, letters, punctuation, etc, anything]"`
I simply want a way of checking if s contains "_2" in the first position. In other words, after the first _ symbol, is the single number a "2"? How do I do this in R?
I'm assuming I need some complicated regex expression?
Examples:
39820432_2_349802j_32hfh = TRUE
43lda821_9_428fj_2f = FALSE (notice there is a _2 there, but not in the right spot)
> grepl("^[^_]+_1",s)
[1] FALSE
> grepl("^[^_]+_2",s)
[1] TRUE
Basically, match everything at the beginning of the string that is not an _, and then the _2.
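One detail to watch: "^[^_]+_2" also accepts a second field that merely starts with 2, such as 21. If the field must be exactly 2, a trailing _ can be added to the pattern, or the string can be split on _. A sketch, where "abc_21_x" is a made-up extra test case:

```r
s <- c("39820432_2_349802j_32hfh", "43lda821_9_428fj_2f", "abc_21_x")

# Second field starts with 2
loose <- grepl("^[^_]+_2", s)

# Second field is exactly 2 (assumes another "_" follows it)
strict <- grepl("^[^_]+_2_", s)

# Same check without a regex: split on "_" and compare the second piece
second_field <- vapply(strsplit(s, "_", fixed = TRUE), `[`, character(1), 2)
by_split <- second_field == "2"

print(loose)     # TRUE FALSE TRUE
print(strict)    # TRUE FALSE FALSE
print(by_split)  # TRUE FALSE FALSE
```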
+1 to @Ananda_Mahto for suggesting grepl instead of grep.
I think it's worth answering the generic question "R - test if string contains string" here.
For that, use the grep function.
# example:
> if(length(grep("ab","aacd"))>0) print("found") else print("Not found")
[1] "Not found"
> if(length(grep("ab","abcd"))>0) print("found") else print("Not found")
[1] "found"
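For a plain "does this string contain that string" test, grepl with fixed = TRUE returns a logical directly and does not interpret the search term as a regex. A minimal sketch, where contains is a hypothetical wrapper name:

```r
# Hypothetical wrapper: TRUE where haystack contains needle literally
contains <- function(haystack, needle) {
  grepl(needle, haystack, fixed = TRUE)
}

found     <- contains("abcd", "ab")  # TRUE
not_found <- contains("aacd", "ab")  # FALSE

# Without fixed = TRUE, "." in the needle acts as a wildcard
as_regex   <- grepl("a.c", "abc")     # TRUE
as_literal <- contains("abc", "a.c")  # FALSE
```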
I have the following vector in R and I would like to find all the strings containing A's and B's but not the number 2.
vec1<-c("A_cont_1", "A_cont_12", "B_treat_8", "AB_cont_22", "cont_21_Aa")
The following does not work:
grep("A|B|!2", vec1)
It gives me back all the strings:
[1] 1 2 3 4 5
The same is true for this example:
grep("A|B|-2", vec1)
What would be the correct syntax?
You can do this with a fairly simple regular expression:
grep("^[^2]*[AB][^2]*$", vec1)
In words, it means:
^ match the start of the string
[^2]* match anything except "2", zero or more times
[AB] match "A" or "B"
[^2]* match anything except "2", zero or more times
$ match the end of the string
I would use two grep calls:
intersect(grep("A|B",vec1),grep("2",vec1,invert=TRUE))
#[1] 1 3
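The two-condition idea can also be written with logical vectors, which keeps the positive and the negative test separate and readable. A sketch using the vector from the question; keep is a made-up name:

```r
vec1 <- c("A_cont_1", "A_cont_12", "B_treat_8", "AB_cont_22", "cont_21_Aa")

# TRUE where the string has an A or B and has no 2
keep <- grepl("[AB]", vec1) & !grepl("2", vec1)

print(which(keep))  # 1 3
print(vec1[keep])   # "A_cont_1" "B_treat_8"
```

Because keep lines up with vec1, it can be used directly for subsetting.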
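OP, your attempt is pretty close, try this:
grep('^(A|B|[^2])*$', vec1)
Note that this pattern only requires that the string contains no "2"; it does not require an A or B. It gives the right answer here because every element of vec1 contains an A or B.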
grep generally does not work very well for doing a positive and a negative search in one invocation. You might be able to make it work with a complex regular expression, but you might be better off just doing:
grep '[AB]' somefile.txt | grep -v '2'
The R equivalent of that would be:
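grep("2", grep("A|B", vec1, value = TRUE), invert = TRUE)
Note that the indices returned refer to the filtered vector, not to vec1; add value = TRUE to the outer call to get the strings themselves.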
I extended the answer provided by @eddi. I have tested it in R and it works for me. I changed the last element in your example, since all of the original elements contained A or B.
# Create the vector from the OP with one change
vec1<-c("A_cont_1", "A_cont_12", "B_treat_8", "AB_cont_22", "cont_21_dd")
I then ran the following code. It will tell you which results you should expect from each section of grep.
First, tell me which elements contain A or B
> grepl("A|B", vec1)
[1] TRUE TRUE TRUE TRUE FALSE
Now tell me which elements contain a "2"
> grepl("2", vec1)
[1] FALSE TRUE FALSE TRUE TRUE
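The elements we want contain A or B but no "2", so invert the second match. The indices we want are 1 and 3:
> grep("2", grep("A|B", vec1, value = TRUE), invert = TRUE)
[1] 1 3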
Done!