I'm trying to extract certain records from a dataframe with grepl.
This is based on a comparison between two columns, Results and Names. The values are built like "WordNumber", but the same word can have many numbers (more than 30), so when I use grepl to get, for instance, Word1, I also get results that I would like to avoid, like Word12.
Any ideas on how to fix this?
Names <- data.frame(name = c("Word1"), stringsAsFactors = FALSE)
Results <- c("Word1", "Word11", "Word12", "Word15")
Records <- c("ThisIsTheResultIWant", "notThis", "notThis", "notThis")
Relationships <- data.frame(Results, Records)
Relationships <- subset(Relationships, grepl(paste(Names$name, collapse = "|"), Relationships$Results))
This doesn't work. If I use fixed = TRUE then it doesn't return any result at all (which is weird). I have also tried concatenating the name part with other numbers like this, but with no success:
Relationships <- subset(Relationships, grepl(paste(paste(Names$name, '3', sep = ""), collapse = "|"), Relationships$Results))
Since I'm concatenating I'm not really sure of how to use the \b to enforce a full match.
Any suggestions?
In addition to @Richard's solution, there are multiple ways to enforce a full match.
\b
"\b" is an anchor that matches a word boundary, i.e. the position just before or just after a word
> grepl("\\bWord1\\b",c("Word1","Word2","Word12"))
[1] TRUE FALSE FALSE
\< & \>
"\<" is an escape sequence for the beginning of a word, and "\>" is used for the end
> grepl("\\<Word1\\>",c("Word1","Word2","Word12"))
[1] TRUE FALSE FALSE
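When several names are pasted together, as in the question, the alternation can be wrapped in a group so that the boundaries apply to every alternative. A minimal sketch, assuming a hypothetical character vector `wanted` and a hypothetical `results` vector (neither is in the original post):

```r
# Hypothetical inputs, for illustration only
wanted  <- c("Word1", "Word3")
results <- c("Word1", "Word11", "Word12", "Word3", "Word30")

# Group the alternatives so that \\b applies to each of them:
# the pattern becomes "\\b(Word1|Word3)\\b"
pattern <- paste0("\\b(", paste(wanted, collapse = "|"), ")\\b")

keep <- grepl(pattern, results)
results[keep]
```

Note that word boundaries still allow the name to appear inside a longer string (for example "see Word1 here"), whereas ^ and $ require the whole string to match.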
Use ^ to match the start of the string and $ to match the end of the string
Names <-c('^Word1$')
Or, to apply to the entire names vector
Names <-paste0('^',Names,'$')
I think this is just:
Relationships[Relationships$Results==Names,]
If you end up doing ^Word1$ you're just doing a straight subset.
If you have multiple names, then instead use:
Relationships[Relationships$Results %in% Names,]
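As a rough illustration of the exact-match approach on the data from the question, here is a sketch that assumes Names is a plain character vector and adds a hypothetical second name, "Word2". It also shows one likely reason fixed = TRUE returned nothing: with several names collapsed by "|", the bar is searched for literally.

```r
# Names as a plain character vector; "Word2" is a hypothetical extra name
Names   <- c("Word1", "Word2")
Results <- c("Word1", "Word11", "Word12", "Word15")
Records <- c("ThisIsTheResultIWant", "notThis", "notThis", "notThis")
Relationships <- data.frame(Results, Records, stringsAsFactors = FALSE)

# Exact matching: no regular expression involved
exact <- Relationships[Relationships$Results %in% Names, ]

# With fixed = TRUE the pattern "Word1|Word2" is treated as literal text,
# so no element of Results contains it
literal <- grepl(paste(Names, collapse = "|"), Relationships$Results, fixed = TRUE)
```

Here `exact` keeps only the "Word1" row, while `literal` is FALSE for every row.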
I have the following corpus:
corpus_rhyme <- c("helter-skelter", "lovey-dovey", "riff-raff", "hunter-gatherer",
"day-to-day", "second-hand", "chock-a-block")
Out of all of these words I only need words like "helter-skelter", "lovey-dovey" and "chock-a-block", which are rhyming reduplicatives with a change of consonants. They are usually spelled with a hyphen, and may have a medial component in between the elements, such as "a" in "chock-a-block". I only need to find rhyming reduplicative expressions that have the same number of syllables. For example, although "phoney-baloney" is a rhyming reduplicative, I do not need it.
I was using the following code to find rhyming reduplicatives:
rhyme<- grep("\\b(\\w*)(\\w{2,}?)-(\\w{1,}?-)?\\w*\\2\\b", corpus_rhyme,
ignore.case = TRUE, perl = TRUE, value = TRUE)
This code produces too many false positives. The output is as follows:
rhyme
[1] "helter-skelter" "lovey-dovey" "riff-raff" "hunter-gatherer" "day-to-day" "second-hand"
[7] "chock-a-block"
I was sifting out these false positives manually, which takes too much time. Can anyone advise a better version of this line of code to reduce the number of false positives? For example, "riff-raff" comes up because I want at least 2 last letters to be the same at the end of a reduplicative expression, otherwise I will miss expressions like "rat-tat". But can we specify in this code that these two letters have to be different from each other, so that "rat-tat" is found ("a" is different from "t"), but "riff-raff" ("f" and "f" are the same) does not come up?
Another possible improvement: how can I get rid of words like "day-to-day" where the two elements are exactly the same? I only need rhyming reduplicatives that have a difference in the initial consonants.
Finally, I am not sure if anything can be done about "hunter-gatherer", unless there is a way to calculate the number of syllables and make sure that both elements of the expression have the same number of syllables.
Base R solution, there is most probably an easier way:
# Count the number of hyphens in each element: hyphen_count => integer vector
hyphen_count <- lengths(regmatches(corpus_rhyme, gregexpr("\\-", corpus_rhyme)))
# Check if the expression rhymes: named logical vector => stdout(console)
vapply(ifelse(hyphen_count == 1,
gsub('^[^aeiouAEIOU]+([aeiou]+\\w+)\\s*\\-[^aeiouAEIOU]+([aeiou]+\\w+)', '\\1 \\2', corpus_rhyme),
gsub('^[^aeiouAEIOU]+([aeiou]+\\w+)\\s*\\-\\w+\\-[^aeiouAEIOU]+([aeiou]+\\w+)', '\\1 \\2', corpus_rhyme)
), function(x){
split_string <- unlist(strsplit(x, "\\s"))
identical(split_string[1], split_string[2])
}, logical(1)
)
Data:
corpus_rhyme <- c("helter-skelter", "lovey-dovey", "riff-raff", "hunter-gatherer",
"day-to-day", "second-hand", "chock-a-block")
If it is a general phenomenon that the last three or more letters of the first word are the same as the last three or more letters of the last word, then you could try this:
grep(".*(\\w{3,})-([aeiou]-)?(\\w+)(\\1)", corpus_rhyme,ignore.case = TRUE,value=TRUE)
#[1] "helter-skelter" "lovey-dovey" "chock-a-block"
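Another way to look at the problem, sketched here under simplifying assumptions: split each expression on hyphens, take the first and last elements, and compare everything from the first vowel onward. The helper names `rime` and `is_rhyming_pair` are hypothetical, and treating only a, e, i, o, u as vowels is an approximation. Because the compared parts must be identical, both elements end up with the same vowel groups, which roughly stands in for the syllable check.

```r
corpus_rhyme <- c("helter-skelter", "lovey-dovey", "riff-raff", "hunter-gatherer",
                  "day-to-day", "second-hand", "chock-a-block")

# Everything from the first vowel onward, lower-cased (the part that should rhyme)
rime <- function(word) sub("^[^aeiou]*", "", tolower(word))

is_rhyming_pair <- function(expr) {
  parts <- strsplit(expr, "-", fixed = TRUE)[[1]]
  first <- tolower(parts[1])
  last  <- tolower(parts[length(parts)])
  # The elements must differ (drops "day-to-day") but share a non-empty rime
  first != last && nzchar(rime(first)) && rime(first) == rime(last)
}

keep <- vapply(corpus_rhyme, is_rhyming_pair, logical(1), USE.NAMES = FALSE)
corpus_rhyme[keep]
```

On this corpus the sketch keeps "helter-skelter", "lovey-dovey" and "chock-a-block". It would also keep "rat-tat" and reject "phoney-baloney", since "oney" and "aloney" differ.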
I am trying to get to grips with the world of regular expressions in R.
I was wondering whether there was any simple way of combining the functionality of "grep" and "gsub"?
Specifically, I want to append some additional information to anything which matches a specific pattern.
For a generic example, let's say I have a character vector:
char_vec <- c("A","A B","123?")
Then let's say I want to append the following to any letter within any element of char_vec:
append <- "_APPEND"
Such that the result would be:
[1] "A_APPEND" "A_APPEND B_APPEND" "123?"
Clearly a gsub can replace the letters with append, but this does not keep the original expression (while grep would return the letters but not append!).
Thanks in advance for any / all help!
It seems you are not familiar with backreferences that you may use in the replacement patterns in (g)sub. Once you wrap a part of the pattern with a capturing group, you can later put this value back into the result of the replacement.
So, a mere gsub solution is possible:
char_vec <- c("A","A B","123?")
append <- "_APPEND"
gsub("([[:alpha:]])", paste0("\\1", append), char_vec)
## => [1] "A_APPEND" "A_APPEND B_APPEND" "123?"
Here, ([[:alpha:]]) matches and captures into Group 1 any letter and \1 in the replacement reinserts this value into the result.
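The same backreference idea extends to longer captures. A small sketch, assuming you might want to tag whole runs of letters rather than single letters (the `suffix` name and the third input element are made up for illustration):

```r
char_vec <- c("A", "A B", "abc 123?")
suffix   <- "_APPEND"   # hypothetical name, to avoid masking base::append

# One letter per match: every single letter gets the suffix
per_letter <- gsub("([[:alpha:]])", paste0("\\1", suffix), char_vec)

# One run of letters per match: each word gets the suffix once
per_word <- gsub("([[:alpha:]]+)", paste0("\\1", suffix), char_vec)
```

For "abc 123?", `per_letter` gives "a_APPENDb_APPENDc_APPEND 123?" while `per_word` gives "abc_APPEND 123?".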
Definitely not as slick as @Wiktor Stribiżew's answer, but here is another method I developed.
char_vars <- c('a', 'b', 'a b', '123')
grep('[A-Za-z]', char_vars)
gregexpr('[A-Za-z]', char_vars)
matches = regmatches(char_vars,gregexpr('[A-Za-z]', char_vars))
for(i in 1:length(matches)) {
for(found in matches[[i]]){
char_vars[i] = sub(pattern = found,
replacement = paste(found, "_append", sep=""),
x=char_vars[i])
}
}
I am looking to partially match a string using the %in% operator in R. When I run the code below, I get FALSE:
'I just want to partial match string' %in% 'partial'
FALSE
Expected Output is TRUE in above case (because it is matched partially)
Since you want a partial match within a sentence, you could try %like% from data.table:
library(data.table)
'I just want to partial match string' %like% 'partial'
TRUE
The output is TRUE
`%in_str%` <- function(pattern,s){
grepl(pattern, s)
}
Usage:
> 'a' %in_str% 'abc'
[1] TRUE
You need to strsplit the string so each word in it is its own element in a vector:
"partial" %in% unlist(strsplit('I just want to partial match string'," "))
[1] TRUE
strsplit takes a string and breaks it into a vector of shorter strings. In this case, it breaks on the space (that's the " " at the end), so that you get a vector of individual words. Unfortunately, strsplit returns its results as a list, which is why I wrapped it in an unlist, so we get a single vector.
Then we do the %in%, which works in the opposite direction from the one you used: you're trying to find out if string partial is %in% the sentence, not the other way around.
Of course, this is an annoying way of doing it, so it's probably better to go with a grep-based solution if you want to stay within base-R, or Priyanka's data.table solution above -- both of which will also be better at stuff like matching multiple-word strings.
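For reference, a base-R sketch of the grep-based route mentioned above. The variable names are illustrative; fixed = TRUE treats the search term as plain text rather than as a regular expression.

```r
sentence <- "I just want to partial match string"

# Substring test: is "partial" contained anywhere in the sentence?
has_partial <- grepl("partial", sentence, fixed = TRUE)

# grepl is vectorised over the strings being searched
sentences <- c(sentence, "nothing to see here")
hits <- grepl("partial", sentences, fixed = TRUE)

# Whole-word test: "part" occurs only inside "partial", so this is FALSE
whole_word <- grepl("\\bpart\\b", sentence)
```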
Let's say I have the following strings:
quiz.1.player.chat_results
and
partner_quiz.1.player.chat_results
I have hundreds of strings like this where the only difference is that one is prefixed with "partner" and the other is not. I'm trying to match one but not the other.
The specific pattern I'd like to match looks like so:
index <- grep('^(quiz.)[1-5]{1}.player.chat_results', names(data))
But this will match both strings. I'm guessing I have to use some negative lookahead like so:
^((?!partner).)
But I'm not sure where to use it.
I'll answer your title question, as it will be the most useful to other people finding this question.
How to match strings that do not contain a given pattern? Easy, match the pattern and invert it.
index <- grep('^partner', names(data), invert = TRUE)
Another approach: use str_detect from stringr
> library(stringr)
> str_detect(string, "partner", negate=TRUE)
[1] TRUE FALSE
You can even use one grepl and negate the result
> !grepl("partner", string)
[1] TRUE FALSE
Just for fun: you can split the string using as separator \\. or _ and then iterate over each element of the resulting list comparing each element to partner and finally invert the result
> sapply(strsplit(string, "\\.|_"), function(x) !"partner" %in% x)
[1] TRUE FALSE
We can use two grepl to avoid any confusion
grepl('quiz', names(data)) & !grepl('partner', names(data))
#[1] TRUE FALSE
For someone who is a bit regex-blind like myself, sub can help,
sub('_.*', '', x) != 'partner'
#[1] TRUE FALSE
If you want to match the pattern including the digits, you could use a word boundary \b followed by a negative lookahead (?!partner) to assert what is directly on the right is not partner.
Note that you need to escape the dot to match it literally, and you can omit {1}. If you are not using the value of the captured group around quiz, you might omit it as well.
To match the rest of the string, you might use \S* to match zero or more non-whitespace characters.
\b(?!partner)quiz\.[1-5]\.player\S*
For example
regmatches(txt, regexpr("\\b(?!partner)quiz\\.[1-5]\\.player\\S*", txt, perl = TRUE))
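To tie this back to column names, here is a sketch using a hypothetical `col_names` vector in place of names(data). It compares a fully anchored pattern with a lookahead version; the third name is made up to show the [1-5] restriction.

```r
# Hypothetical column names standing in for names(data)
col_names <- c("quiz.1.player.chat_results",
               "partner_quiz.1.player.chat_results",
               "quiz.7.player.chat_results")

# Anchoring at both ends with escaped dots already excludes the prefixed name
anchored <- grepl("^quiz\\.[1-5]\\.player\\.chat_results$", col_names)

# Negative lookahead (requires perl = TRUE): any word-character prefix is
# allowed before "quiz", unless the string starts with "partner"
lookahead <- grepl("^(?!partner)\\w*quiz\\.[1-5]\\.player\\.chat_results$",
                   col_names, perl = TRUE)

index <- which(anchored)
```

Both logical vectors flag only the first name here. The lookahead form mainly earns its keep when other prefixes should still be accepted.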