I have a dataframe with a variable called text, which includes news transcripts. I want to identify transcripts that include the word "Republican" OR "Democrat" AND one of a list of words in a given proximity (let's say within 5 words). For example, if one of the list of words is "Congress," I want to pick up these transcripts:
"Republicans in Congress today voted on a bill." (proximity < 5)
"Democrats in Congress today voted on a bill." (proximity < 5)
And I do NOT want to pick up these transcripts:
"Republicans today passed a bill to allocate funds for Congress." (proximity > 5)
"Democrats today passed a bill to allocate funds for Congress." (proximity > 5)
I can match the list of words without the proximity constraint like this:
transcripts <- data.frame(text=c("Republicans in congress today voted on a bill","Republicans today passed a bill to allocate funds for Congress"))
dictionary <- data.frame(word=c("Congress","Capitol"))
transcripts_subset <- transcripts %>%
filter(grepl(paste(dictionary$word, collapse="|"), text))
and I tried looking up the regex to do this correctly, but it throws an error:
transcripts_subset <- transcripts %>%
filter(grepl("\b(paste(dictionary$dehumanizing, collapse="|"))(?:\\W+\\w+){0,5}?\\W+(Republican|Democrat)\b", text))
Error in "\b ..." :
operations are possible only for numeric, logical or complex types
How can I make this work?
Your dplyr filter code looks fine, so here is just the regex bit:
dictionary <- data.frame(word=c("Congress","Capitol"), stringsAsFactors = FALSE)
pattern_after <- paste0("\\b(", paste0(dictionary$word, collapse="|"), ")\\W+(?:\\w+\\W+){0,5}?(Republican(s)*|Democrat(s)*)")
pattern_before <- paste0("\\b(Republican(s)*|Democrat(s)*)\\W+(?:\\w+\\W+){0,5}?(", paste0(dictionary$word, collapse="|"), ")")
pattern <- paste0(c(pattern_after, pattern_before), collapse="|")
pattern
#> [1] "\\b(Congress|Capitol)\\W+(?:\\w+\\W+){0,5}?(Republican(s)*|Democrat(s)*)|\\b(Republican(s)*|Democrat(s)*)\\W+(?:\\w+\\W+){0,5}?(Congress|Capitol)"
grepl(pattern, "Republicans in congress today voted on a bill", perl = TRUE, ignore.case = TRUE)
#> [1] TRUE
grepl(pattern, "Democrats today passed a bill to allocate funds for Congress", perl = TRUE, ignore.case = TRUE)
#> [1] FALSE
grepl(pattern, "A Democrat in Congress", perl = TRUE, ignore.case = TRUE)
#> [1] TRUE
Created on 2019-10-01 by the reprex package (v0.3.0)
To dissect this, the regex to find two words separated by 0 to 5 other words in R is
"\\bword1\\W+(?:\\w+\\W+){0,5}word2"
\\b is a word boundary: a zero-width position between a word character and a non-word character (including the start or end of the string when it borders a word character).
\\W+ is one or more non-word characters (whitespace, punctuation and the like).
\\w+ is one or more word characters, i.e. letters, digits or underscores.
(?:\\w+\\W+) is a non-capturing group consisting of word characters followed by non-word characters (i.e., a word plus its trailing separator).
{0,5} indicates the group is matched between 0 and 5 times.
You need to set perl = TRUE for this to work. "Republican(s)*" matches "Republican" followed by zero or more "s", so it covers both the singular and the plural. The two separate patterns make sure it works no matter whether the dictionary word or Republican/Democrat is mentioned first.
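The pattern construction explained above can be wrapped in a small helper. This is only a sketch: the function name proximity_pattern and its arguments are hypothetical, not part of the answer above. Each alternation is wrapped in its own group so that a "|" inside the word list cannot leak into the rest of the pattern.

```r
# Hypothetical helper: build a regex that matches any word in `anchors`
# within `max_gap` intervening words of any word in `targets`, in either order.
proximity_pattern <- function(targets, anchors, max_gap = 5) {
  t_alt <- paste0("(?:", paste(targets, collapse = "|"), ")")
  a_alt <- paste0("(?:", paste(anchors, collapse = "|"), ")")
  # separator, then up to max_gap "word + separator" units, matched lazily
  gap <- paste0("\\W+(?:\\w+\\W+){0,", max_gap, "}?")
  paste0("\\b", t_alt, gap, a_alt, "|\\b", a_alt, gap, t_alt)
}

pat <- proximity_pattern(c("Congress", "Capitol"),
                         c("Republicans?", "Democrats?"))

texts <- c("Republicans in Congress today voted on a bill",
           "Republicans today passed a bill to allocate funds for Congress",
           "Visitors toured the Capitol")

hits <- grepl(pat, texts, perl = TRUE, ignore.case = TRUE)
hits
#> [1]  TRUE FALSE FALSE
```

The third string checks that a dictionary word on its own does not count as a match.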
You can try the following, which splits your string and uses grep to find the positions of the dictionary words. If the position is < 5, the row is selected.
transcripts[sapply(strsplit(as.character(transcripts$text), " "), grep
, pattern=paste(dictionary$word, collapse="|"), ignore.case = TRUE) < 5,]
#[1] Republicans in congress today voted on a bill
#Or using sub to get the first 5 words
transcripts[sapply(sub("((\\S+\\s*){0,5}).*", "\\1", transcripts$text), grepl
, pattern=paste(dictionary$word, collapse="|"), ignore.case = TRUE),]
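If the distance between the two kinds of words matters rather than the position within the transcript, the same split-and-grep idea can be extended to compare word positions. The following is a minimal sketch under stated assumptions: within_n_words is a hypothetical helper, words are split on non-word characters, and "within n words" is taken to mean a difference in word position of at most n.

```r
# Hypothetical helper: TRUE if any anchor word and any target word are
# at most `n` word positions apart (in either order).
within_n_words <- function(text, anchors, targets, n = 5) {
  words <- tolower(strsplit(text, "\\W+")[[1]])
  # anchors are matched as prefixes, so "republican" also finds "republicans"
  a_pos <- grep(paste0("^(", paste(tolower(anchors), collapse = "|"), ")"), words)
  # targets must match the whole word
  t_pos <- grep(paste0("^(", paste(tolower(targets), collapse = "|"), ")$"), words)
  if (length(a_pos) == 0 || length(t_pos) == 0) return(FALSE)
  any(abs(outer(a_pos, t_pos, "-")) <= n)
}

texts <- c("Republicans in congress today voted on a bill",
           "Republicans today passed a bill to allocate funds for Congress")

keep <- vapply(texts, within_n_words, logical(1),
               anchors = c("Republican", "Democrat"),
               targets = c("Congress", "Capitol"),
               USE.NAMES = FALSE)
keep
#> [1]  TRUE FALSE
```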
I have a text string containing digits, letters and spaces. Some of its substrings are month abbreviations. I want to perform a condition-based pattern replacement, namely to enclose a month abbreviation in spaces if and only if a given condition is fulfilled. As an example, let the condition be as follows: "preceded by a digit and succeeded by a letter".
I tried the stringr package, but I failed to combine the functions str_replace_all() and str_locate_all():
# Input:
txt = "START1SEP2 1DECX JANEND"
# Desired output:
# "START1SEP2 1 DEC X JANEND"
# (A) What I could do without checking the condition:
library(stringr)
patt_month = paste("(", paste(toupper(month.abb), collapse = "|"), ")", sep='')
str_replace_all(string = txt, pattern = patt_month, replacement = " \\1 ")
# "START1 SEP 2 1 DEC X JAN END"
# (B) But I actually only need replacements inside the condition-based bounds:
str_locate_all(string = txt, pattern = paste("[0-9]", patt_month, "[A-Z]", sep=''))[[1]]
# start end
# [1,] 12 16
# To combine (A) and (B), I'm currently using an ugly for() loop not shown here and want to get rid of it
You are looking for lookarounds:
(?<=\d)DEC(?=[A-Z])
See a demo on regex101.com.
Lookarounds make sure a certain position is matched without consuming any characters. A lookbehind checks what comes immediately before the current position, and a lookahead checks what follows it. Each comes in a positive and a negative form, so there are four types (pos./neg. lookbehind/-ahead).
A short memo:
(?=...) is a pos. lookahead
(?!...) is a neg. lookahead
(?<=...) is a pos. lookbehind
(?<!...) is a neg. lookbehind
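Applied to the input from the question, the positive lookbehind and lookahead can be combined with the month alternation. This is a sketch in base R; the variable names months_alt, pat and out are chosen here for illustration. Backslashes and lookarounds need perl = TRUE in base R's gsub().

```r
txt <- "START1SEP2 1DECX JANEND"

# All upper-case month abbreviations joined with "|"
months_alt <- paste(toupper(month.abb), collapse = "|")

# A month preceded by a digit and followed by an upper-case letter;
# only the month itself is consumed, so only it gets padded with spaces
pat <- paste0("(?<=[0-9])(", months_alt, ")(?=[A-Z])")

out <- gsub(pat, " \\1 ", txt, perl = TRUE)
out
#> [1] "START1SEP2 1 DEC X JANEND"
```

SEP is left alone because it is followed by a digit, and JAN because it is preceded by a space.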
A Base R version
patt_month <- paste(toupper(month.abb), collapse = "|") # concatenate all month.abb with OR
pat <- paste0("(\\s\\d)(", patt_month, ")([A-Z]\\s)") # make it a three group thing
gsub(pattern = pat, replacement = "\\1 \\2 \\3", txt, perl = TRUE) # same result as above
Also works for txt2 <- "START1SEP2 1JANY JANEND" out of the box.
[1] "START1SEP2 1 JAN Y JANEND"
How can I find the index or position of a word in a given string? The code below gives the starting character position of the word and its length. After finding the position of the word, I want to extract the preceding and succeeding words in my project.
library(stringr)
Output_text <- c("applicable to any future potential contract termination disputes as the tepco dispute was somewhat unique")
word_pos <- regexpr('termination', Output_text)
Output:
[1] 45
attr(,"match.length")
[1] 11
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
45 is the character position at which "termination" starts (every character is counted).
11 is the length of the match.
Here, "termination" is the 7th word. How can I find that word position in R?
Appreciate your help.
Here it is:
library(stringr)
Output_text <- c("applicable to any future potential contract termination disputes as the tepco dispute was somewhat unique")
words <- unlist(str_split(Output_text, " "))
which(words == "termination")
[1] 7
Edit:
For multiple occurrences of the word in text and generating next and previous keywords:
# Adding a few random "termination" words to the string:
Output_text <- c("applicable to any future potential contract termination disputes as the tepco dispute was termination somewhat unique termination")
words <- unlist(str_split(Output_text, " "))
t1 <- which(words == "termination")
next_keyword <- words[t1+1]
previous_keywords <- words[t1-1]
> next_keyword
[1] "disputes" "somewhat" NA
> previous_keywords
[1] "contract" "was" "unique"
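One edge case with index arithmetic like t1-1 is a keyword that is the very first word: indexing with 0 silently drops that element, so the previous and next vectors no longer line up. A small sketch that guards against this follows; the helper name neighbours and the example sentence are made up for illustration.

```r
# Hypothetical helper: position, previous word and next word for every
# occurrence of `keyword`; NA where there is no neighbour.
neighbours <- function(text, keyword) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  pos <- which(words == keyword)
  prev_idx <- ifelse(pos > 1, pos - 1, NA)              # NA at the first word
  next_idx <- ifelse(pos < length(words), pos + 1, NA)  # NA at the last word
  data.frame(position = pos,
             previous = words[prev_idx],
             following = words[next_idx],
             stringsAsFactors = FALSE)
}

res <- neighbours("termination of contract termination disputes", "termination")
res
#>   position previous following
#> 1        1     <NA>        of
#> 2        4 contract  disputes
```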
You can do this without worrying about character indices using regular expressions without any external package.
# replace whole string by the words preceding and following 'termination'
(words <- sub("[\\S\\s]+ (\\S+) termination (\\S+) [\\S\\s]+", "\\1 \\2", Output_text, perl = T))
# [1] "contract disputes"
# Split the resulting string into two individual strings
(words <- unlist(strsplit(words, " ")))
# [1] "contract" "disputes"
The easiest way is to match termination and the surrounding words with str_extract and then remove termination with str_remove.
str_remove(str_extract(Output_text,"\\w+ termination \\w+"),"termination ")
[1] "contract disputes"
I am trying in R to find the Spanish words among a number of words. I have all the Spanish words in an Excel file that I don't know how to attach to the post (it has more than 80000 words), and I am trying to check whether some words are in it or not.
For example:
words = c("Silla", "Sillas", "Perro", "asdfg")
I tried to use this solution:
grepl(paste(spanish_words, collapse = "|"), words)
But there are too many Spanish words, and it gives me this error:
Error
So... how can I do it? I also tried this:
toupper(words) %in% toupper(spanish_words)
Result
As you can see, this option only gives TRUE for exact matches, and I need "Sillas" to also appear as TRUE (it is the plural of silla). That was the reason I tried grepl first, to get plurals as well.
Any idea?
As df:
library(dplyr)   # tibble(), mutate(), %>%
library(stringr) # str_detect()
df <- tibble(text = c("some words",
"more words",
"Perro",
"And asdfg",
"Comb perro and asdfg"))
Vector of words:
words <- c("Silla", "Sillas", "Perro", "asdfg")
words <- tolower(paste(words, collapse = "|"))
Then use mutate and str_detect:
df %>%
mutate(
text = tolower(text),
spanish_word = str_detect(text, words)
)
Returns:
text spanish_word
<chr> <lgl>
1 some words FALSE
2 more words FALSE
3 perro TRUE
4 and asdfg TRUE
5 comb perro and asdfg TRUE
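With a lexicon of more than 80000 words, one alternative to a single huge regex is to keep the fast %in% lookup and also try the candidate with a plural ending removed. The following is a rough sketch only: the three-word spanish_words vector is a stand-in for the real list, is_spanish is a hypothetical helper, and stripping "s" or "es" is a naive assumption about Spanish plurals that will miss irregular forms.

```r
# Stand-in for the real lexicon of more than 80000 words (assumption)
spanish_words <- c("silla", "perro", "flor")

# Hypothetical helper: exact lookup, then retry without a trailing "s" or "es"
is_spanish <- function(candidates, lexicon) {
  w <- tolower(candidates)
  lex <- tolower(lexicon)
  w %in% lex | sub("s$", "", w) %in% lex | sub("es$", "", w) %in% lex
}

res <- is_spanish(c("Silla", "Sillas", "Perro", "Flores", "asdfg"), spanish_words)
res
#> [1]  TRUE  TRUE  TRUE  TRUE FALSE
```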
I've used to great satisfaction quanteda's textstat_collocation() for extracting MWE. Now I'm trying to extract all matches that match a specific pattern, irrespective of their frequency.
My objective is to create a character vector by extracting featnames from a dfm() built with a regex pattern. I will then use this character vector in the "select" argument for building a dfm. I might also want to use this character vector to add to a dictionary I use as an ontology for building dfms at later stages of the pipeline.
The pattern is: "aged xx-xx" where x is a digit.
I used the regex pattern "aged\s([0-9]{2}-[0-9]{2})" here and got the desired matches. But when I try it in R (adding an additional "\" before "\s"), I don't get any matches.
When I do:
txt <- c("In India, male smokers aged 20-45 perceive brandX positively.",
"In Spain, female buyers aged 30-39 don't purchase brandY.")
ageGroups <- dfm(txt, select = "aged\\s([0-9]{2}-[0-9]{2})", valuetype = "regex")
featnames(ageGroups)
I get:
character(0)
However, when I try:
ageGroups <- dfm(txt, select = "([0-9]{2}-[0-9]{2})", valuetype = "regex")
featnames(ageGroups)
I get:
[1] "20-45" "30-39"
It seems I'm unable to capture the white space in the regex. I've gone through many similar questions in SO, with perhaps this being the most relevant, but still can't get to make my specific objective to work.
I also tried:
tokens <- tokens(txt, remove_punct = FALSE, remove_numbers = FALSE, remove_symbols = FALSE)
tokensCompunded <- tokens_compound(tokens, pattern = "aged\\s([0-9]{2}-[0-9]{2})", valuetype = "regex")
attr(tokensCompunded, "types")
But I get all tokens back:
[1] "In" " " "India" "," "male" "smokers" "aged" "20-45" "perceive"
[10] "brandX" "positively" "." "Spain" "female" "buyers" "30-39" "don't" "purchase"
[19] "brandY"
I think there might be several other more efficient approaches for extracting character vectors using regex (or glob) with quanteda, and I'm happy to learn new ways how to use this amazing R package.
Thanks for your help!
Edit to original question:
This other question in SO has a similar requirement, i.e. detecting multi-word phrases using kwic objects, and can be further expanded to achieve the objectives stated above with the following addition:
kwicObject <- kwic(corpus, pattern = phrase("aged ([0-9]{2}-[0-9]{2})"), valuetype = "regex")
unique(kwicObject$keyword)
The problem here is that the target text and the multi-word pattern (which contains white space) are not being tokenised the same way. In your example, you have applied a regex for multiple tokens (which includes the whitespace separator) but the target for search has already been split into individual tokens.
We devised a solution to this, a function called phrase(). From ?pattern:
Whitespace is not privileged, so that in a character vector, white
space is interpreted literally. If you wish to consider
whitespace-separated elements as sequences of tokens, wrap the
argument in phrase().
So in this case:
toks <- tokens(txt) # tokenise the example texts from the question
pat <- "aged [0-9]{2}-[0-9]{2}"
toks2 <- tokens_select(toks, pattern = phrase(pat), valuetype = "regex")
toks2
# tokens from 2 documents.
# text1 :
# [1] "aged" "20-45"
#
# text2 :
# [1] "aged" "30-39"
Here, we see that the selection worked, because the phrase() wrapper converted the pattern into a sequence of matches.
If you want to make these a single token, you can send the same pattern argument to tokens_compound():
toks3 <- tokens_compound(toks2, pattern = phrase(pat),
valuetype = "regex", concatenator = " ")
toks3
# tokens from 2 documents.
# text1 :
# [1] "aged 20-45"
#
# text2 :
# [1] "aged 30-39"
Finally, you can use that to construct a dfm, where each multi-word match is a feature. This cannot work unless you have first performed the concatenation at the tokens stage, since by definition a dfm has no order in its features.
dfm(toks3)
# Document-feature matrix of: 2 documents, 2 features (50% sparse).
# 2 x 2 sparse Matrix of class "dfm"
# features
# docs aged 20-45 aged 30-39
# text1 1 0
# text2 0 1
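The tokenisation mismatch described in this answer can be illustrated without quanteda. The sketch below uses base R only and splits on single spaces, which is an assumption and not how quanteda tokenises; it is meant to show the idea of matching a sequence of tokens, not to reproduce phrase().

```r
txt <- "In India, male smokers aged 20-45 perceive brandX positively."
pat <- "aged\\s[0-9]{2}-[0-9]{2}"

# Against the whole string the whitespace in the pattern can match
whole_match <- grepl(pat, txt)

# After splitting, no single token contains "aged", a space and the range
toks <- strsplit(txt, " ", fixed = TRUE)[[1]]
token_match <- any(grepl(pat, toks))

# Matching a sequence instead: token i is "aged", token i + 1 is the range
first <- grepl("^aged$", head(toks, -1))
second <- grepl("^[0-9]{2}-[0-9]{2}$", toks[-1])
seq_start <- which(first & second)
phrase_found <- paste(toks[seq_start], toks[seq_start + 1])

whole_match
#> [1] TRUE
token_match
#> [1] FALSE
phrase_found
#> [1] "aged 20-45"
```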
You can change the regex pattern:
select = "aged.*([0-9]{2}-[0-9]{2})"
I would like to count the number of words in a given string that start with a given letter.
For example, in that phrase: "that pattern is great but pigs likes milk"
If I want to find the number of words starting with "g", there is only 1 ("great"), but right now I get 2 ("great" and "pigs").
This is the code I use:
x <- "that pattern is great but pogintless"
sapply(regmatches(x, gregexpr("g", x)), length)
We need either a space or a word boundary to keep the letter from matching anywhere other than the start of a word. In addition, it may be better to use ignore.case = TRUE, as some words may begin with an uppercase letter.
lengths(regmatches(x, gregexpr("\\bg", x, ignore.case = TRUE)))
The above can be wrapped as a function
fLength <- function(str1, pat){
lengths(regmatches(str1, gregexpr(paste0("\\b", pat), str1, ignore.case = TRUE)))
}
fLength(x, "g")
#[1] 1
You can also do it with the stringr library
library(stringr)
str_count(x, "\\bg")