I am dealing with two strings like these:
x1 <- "Unknown, because not discussed"
x2 <- "Not at goal, no."
How do I use the grepl function to distinguish between these two strings?
When I use grepl("no", x1), it returns TRUE, which is not what I want: it is picking up the "no" in "not" or "Unknown". How do I use a string parsing function to detect only strings that contain the word "no" on its own? Any advice is much appreciated.
You can use the word boundary \\b to distinguish them. \\bno\\b matches "no" only when it is not directly preceded or followed by a word character:
grepl("\\bno\\b", x1)
# [1] FALSE
grepl("\\bno\\b", x2)
# [1] TRUE
I can think of a couple of options for matching "no" but not "not":
Using the \b "word boundary" pattern:
> x = c("Unknown, because not discussed", "Not at goal, no.")
> grepl("\\bno\\b", x)
[1] FALSE TRUE
Using [^t] to exclude "not":
> grepl("\\bno[^t]", x)
[1] FALSE TRUE
For matching the word "no" by itself the word boundary option "\\bno\\b" is probably best.
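To see why the word boundary is the safer of the two options, here is a small sketch. The extra test strings ("no", "go north", "No.", "Not yet") are made up for illustration and are not from the question.

```r
x <- c("Unknown, because not discussed", "Not at goal, no.", "no", "go north")

# [^t] has to consume a character, so a bare "no" at the end of the string fails,
# and any word that starts with "no" but is not "not..." still matches
grepl("\\bno[^t]", x)
# [1] FALSE TRUE FALSE TRUE

# the word-boundary pattern matches only the standalone word
grepl("\\bno\\b", x)
# [1] FALSE TRUE TRUE FALSE

# ignore.case = TRUE also accepts "No" and "NO"
grepl("\\bno\\b", c("No.", "Not yet"), ignore.case = TRUE)
# [1] TRUE FALSE
```

If a capitalised "No" should count as well, ignore.case = TRUE is probably the simplest way to get it.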
I'm working on a project where I define some nouns like Haus, Boot, Kampf, ... and want to detect every form (singular/plural) and every compound of these words in sentences. For example, the algorithm should return true if a sentence contains one of: Häuser, Hausboot, Häuserkampf, Kampfboot, Hausbau, Bootsanleger, ....
Are you familiar with an algorithm that can do such a thing (preferably in R)? Of course I could implement this manually, but I'm pretty sure that something should already exist.
Thanks!
You can use the stringr library together with base R's grepl function, as in this example:
> # Toy example text
> text1 <- c(" This is an example where Hausbau appears twice (Hausbau)")
> text2 <- c(" Here it does not appear the name")
> # Load library
> library(stringr)
> # Does it appear "Hausbau"?
> grepl("Hausbau", text1)
[1] TRUE
> grepl("Hausbau", text2)
[1] FALSE
> # Number of "Hausbau" in the text
> str_count(text1, "Hausbau")
[1] 2
library(stringr)
check <- c("Der Häuser", "Das Hausboot ist", "Häuserkampf", "Kampfboot im Wasser", "NotMe", "Hausbau", "Bootsanleger", "Schauspiel")
base <- c("Haus", "Boot", "Kampf")
# Strip accents and lowercase, then test each string against every base word
check_ascii <- str_to_lower(stringi::stri_trans_general(check, "Latin-ASCII"))
sapply(check_ascii, function(x) any(str_detect(x, str_to_lower(base))), USE.NAMES = FALSE)
# [1] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
Breaking it down
Note Roland's comment: you will get false positives for words like "Schauspiel", which happens to contain "haus".
You need to get rid of the accented characters; stri_trans_general can transliterate them to Latin-ASCII.
You need to convert your strings to lowercase (e.g. to match Boot in Kampfboot).
Then apply over the strings to test and check whether any word from the base list occurs in each one. If any of those values is TRUE, you have a match.
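The same steps can also be sketched in base R only. This is a different technique from the stringi call above: the helper name has_base_word is hypothetical, chartr here handles only the six German umlaut characters, and the base words are assumed to contain no regex metacharacters.

```r
# Hypothetical helper: TRUE where a string contains any of the base words
has_base_word <- function(strings, base) {
  # map German umlauts to their base vowels (only these six characters are handled)
  plain <- chartr("\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc", "aouAOU", strings)
  # build one alternation pattern, e.g. "haus|boot|kampf"
  pattern <- paste(tolower(base), collapse = "|")
  grepl(pattern, tolower(plain))
}

check <- c("Der H\u00e4user", "Kampfboot im Wasser", "NotMe", "Schauspiel")
has_base_word(check, c("Haus", "Boot", "Kampf"))
# [1] TRUE TRUE FALSE TRUE
```

"Schauspiel" still comes back TRUE, so the false-positive problem is the same as with the stringr version; plain substring matching cannot tell a compound from a coincidence.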
I am working on data where the words are in French.
I want grepl to match the word whether or not the vowel has an accent.
Here is part of the code:
Here I want grepl to find all words that are radiothérapie or radiotherapie, that is, to ignore the accent:
ifelse(grepl("Radiothe(é)rapie",mydata$word),"yes","no")
ifelse(grepl("Radioth[eé]rapie", c("Radiotherapie", "Radiothérapie", "Radio")),"yes","no")
Well the brute force way of doing this would be to use a character class containing all variations of the letter, e.g.
ifelse(grepl("Radioth[eé]rapie", mydata$word), "yes", "no")
A possible solution: convert to Latin-ASCII before using grepl.
x <- c("radiothérapie", "radiotherapie")
grepl("radiotherapie", stringi::stri_trans_general(x,"Latin-ASCII"))
[1] TRUE TRUE
This should work for most (all?) accents.
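You can take a more general approach with iconv. Note that the result of //TRANSLIT depends on the platform's iconv implementation, so the output may differ between systems: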
string = "ábçdêfgàõp"
iconv(string, to='ASCII//TRANSLIT')
# [1] "abcdefgaop"
For your scenario
x <- "Radiotherapie"
y <- c("Radiotherapie", "Radiothérapie", "Radio")
grepl(iconv(x, to='ASCII//TRANSLIT'), iconv(y, to='ASCII//TRANSLIT'))
# [1] TRUE TRUE FALSE
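Here is another option using grepl with a character class.
Note that the caret here does not negate anything: a caret negates a character class only when it is the first character inside the brackets. In [é^e] it is a literal ^, so the class matches é, ^ or e, and [ée] would do the same job for this data: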
x <- c("radiothérapie", "radiotherapie")
grepl('radioth[é^e]rapie', x)
[1] TRUE TRUE
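A quick check of how the caret's position changes the meaning of the class. The last two test strings ("radioth^rapie" and "radiotharapie") are made up for illustration, and é is written as \u00e9 so the snippet does not depend on the file encoding.

```r
x <- c("radioth\u00e9rapie", "radiotherapie", "radioth^rapie", "radiotharapie")

# caret in the middle of the class: a literal "^" next to the two vowels
grepl("radioth[\u00e9^e]rapie", x)
# [1] TRUE TRUE TRUE FALSE

# caret first: the class is negated and matches anything except "e"
grepl("radioth[^e]rapie", x)
# [1] TRUE FALSE TRUE TRUE
```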
I have a dataset in which a column (the result variable) contains data in both numeric and character form [e.g. positive, negative, <0.1, 600, >1000 etc].
I want to extract only the numeric data in this column (i.e. <0.1, 600, >1000). Ideally without the use of any external packages.
I tried the following:
x<-gsub('\\D','', x)
But it removes the decimals or less than/more than sign (e.g. 1.56 became 156, <1.0 became 10)
I then tried the following:
x<-as.numeric(gsub("(\\D)\\.","", x))
This time round it keeps the decimal but coerced other values such as <0.1, >100 to become NAs instead.
So my question is: is there any way I can modify the function so that it keeps values containing '<' or '>' as they are, without replacement?
Meaning from
x = c("negative","positive","1.22","<1.0",">200")
I will be able to get back
x = c("","","1.22","<1.0",">200")
I would really appreciate it if someone could show me how to resolve this issue, thanks!
Do you need this?
> gsub("[^0-9.<>]", "", x)
[1] "" "" "1.22" "<1.0" ">200"
Does this work for you? Using grep we can find which items of the vector contain digits, and value=TRUE returns those items themselves. Another way is to use grepl to get a logical vector of matches. In your case \\D does not work because it matches every non-digit character, including the dot and the < and > signs.
grep('\\d+', x, value=TRUE)
would yield : [1] "1.22" "<1.0" ">200"
grepl('\\d+', x)
would yield: [1] FALSE FALSE TRUE TRUE TRUE
You may also try gsub using:
> gsub('[a-zA-Z]+', '', x)
[1] "" "" "1.22" "<1.0" ">200"
Using str_remove_all from stringr:
library(stringr)
str_remove_all(x, "[A-Za-z]+")
#[1] "" "" "1.22" "<1.0" ">200"
What about something like this? Find the elements that contain letters and set them to an empty string.
x[grep('[a-zA-Z]', x)] <- ""
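The approaches above work by deleting unwanted characters. An alternative sketch, still in base R, is to extract only what looks like a number with an optional < or > in front. The pattern and the extra "n.a." element are assumptions for illustration; gsub("[^0-9.<>]", "", "n.a.") would leave ".." behind, whereas extraction returns an empty string.

```r
x <- c("negative", "positive", "1.22", "<1.0", ">200", "n.a.")

# optional < or >, one or more digits, optional decimal part
pattern <- "[<>]?[0-9]+(\\.[0-9]+)?"

m <- regexpr(pattern, x)          # position of the first match, -1 if none
out <- character(length(x))       # start with all empty strings
out[m > 0] <- regmatches(x, m)    # fill in only the elements that matched
out
# [1] "" "" "1.22" "<1.0" ">200" ""
```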
I'm trying to remove all fields that have special characters (#?.* etc) in their text.
I think I should be using
Filter(function(x) {grepl('|[^[:punct:]]).*?', x)} == FALSE, data$V1)
where data$V1 contains my data. However, it seems like
grepl('|[^[:punct:]]).*?', x)
fails with trivial examples like
grepl('|[^[:punct:]]).*?', 'M')
which outputs TRUE even though M has no special characters. How should I be using grepl to remove fields with special characters from a column of data?
To search for "special characters", you can search for the negation of alphanumeric characters as such:
grepl('[^[:alnum:]_]+', c('m','m#','M9*'))
# [1] FALSE TRUE TRUE
or use the symbol \W
grepl('\\W+', c('m','m#','M9*'))
# [1] FALSE TRUE TRUE
\W is explained in the regular expression help:
"The symbol \w matches a ‘word’ character (a synonym for [[:alnum:]_], an extension) and \W is its negation ([^[:alnum:]_])."
Starting a regular expression with a | makes it useless: the empty alternative before the | matches any string.
See this JS example:
console.log('With the starting pipe => ' + /|([\W]).*?/.test('M'));
console.log('Without the starting pipe => ' + /([\W]).*?/.test('M'));
Simply put the characters you want to reject inside [...], pass that as the pattern argument to grepl, and negate the result.
data$V1[!grepl("[#?.*]", data$V1)]
For example,
> x <- c("M", "3#3", "8.*x")
> x[!grepl("[#?.*]", x)]
[1] "M"
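If listing every special character by hand is impractical, the POSIX class [[:punct:]] may be a reasonable substitute. The sketch below compares it with \\W on a made-up vector; the two differ on the underscore and on whitespace.

```r
x <- c("M", "3#3", "8.*x", "a_b", "ok 1")

# [[:punct:]] flags any punctuation character, including "_", but not the space
x[!grepl("[[:punct:]]", x)]
# [1] "M" "ok 1"

# \\W treats "_" as a word character but flags the space
x[!grepl("\\W", x)]
# [1] "M" "a_b"
```

Which one is right depends on whether underscores and spaces count as "special" in your data.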
I have an R string, with the format
s = `"[some letters and numbers]_[a number]_[more numbers, letters, punctuation, etc, anything]"`
I simply want a way of checking if s contains "_2" in the first position. In other words, after the first _ symbol, is the single number a "2"? How do I do this in R?
I'm assuming I need some complicated regex expression?
Examples:
39820432_2_349802j_32hfh = TRUE
43lda821_9_428fj_2f = FALSE (notice there is a _2 there, but not in the right spot)
> grepl("^[^_]+_1",s)
[1] FALSE
> grepl("^[^_]+_2",s)
[1] TRUE
Basically, match one or more characters other than _ at the start of the string, followed by _2.
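One caveat: the pattern above also accepts a second field that merely starts with 2, such as 25. If the field has to be exactly 2, a sketch that requires the closing underscore (the third test string is made up for illustration):

```r
s <- c("39820432_2_349802j_32hfh", "43lda821_9_428fj_2f", "abc_25_x")

# second field starts with 2
grepl("^[^_]+_2", s)
# [1] TRUE FALSE TRUE

# second field is exactly 2
grepl("^[^_]+_2_", s)
# [1] TRUE FALSE FALSE

# the same check without a regex: split on "_" and compare the second field
sapply(strsplit(s, "_", fixed = TRUE), function(p) p[2] == "2")
# [1] TRUE FALSE FALSE
```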
+1 to #Ananda_Mahto for suggesting grepl instead of grep.
I think it's worth answering the generic question "R - test if string contains string" here.
For that, use the grep function.
# example:
> if(length(grep("ab","aacd"))>0) print("found") else print("Not found")
[1] "Not found"
> if(length(grep("ab","abcd"))>0) print("found") else print("Not found")
[1] "found"