grepl for finding words - r

I am trying in R to find the Spanish words in a set of words. I have all the Spanish words in an Excel file that I don't know how to attach to the post (it has more than 80,000 words), and I am trying to check whether some words are in it or not.
For example:
words = c("Silla", "Sillas", "Perro", "asdfg")
I tried to use this solution:
grepl(paste(spanish_words, collapse = "|"), words)
But there are too many Spanish words, and it gives me this error:
Error
So... how can I do it? I also tried this:
toupper(words) %in% toupper(spanish_words)
Result
As you can see, this option only gives TRUE for exact matches, and I need "Sillas" to appear as TRUE as well (it is the plural of "silla"). That was the reason I first tried grepl: to get plurals as well.
Any idea?

As df:
library(dplyr)   # provides tibble(), mutate() and %>%
library(stringr) # provides str_detect()
df <- tibble(text = c("some words",
"more words",
"Perro",
"And asdfg",
"Comb perro and asdfg"))
Vector of words:
words <- c("Silla", "Sillas", "Perro", "asdfg")
words <- tolower(paste(words, collapse = "|"))
Then use mutate and str_detect:
df %>%
mutate(
text = tolower(text),
spanish_word = str_detect(text, words)
)
Returns:
text spanish_word
<chr> <lgl>
1 some words FALSE
2 more words FALSE
3 perro TRUE
4 and asdfg TRUE
5 comb perro and asdfg TRUE
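With a dictionary of 80,000 words, a single alternation pattern can become too large for the regex engine. A minimal base R sketch of an alternative is to keep the fast %in% lookup and also try each word with a plural ending removed. The helper name is_spanish, the small stand-in dictionary, and the naive rule "drop a trailing s or es" are assumptions made for illustration; real Spanish plurals have more cases.

```r
# Stand-in for the real 80,000-word dictionary (assumption for illustration)
spanish_words <- c("silla", "perro", "flor")
words <- c("Silla", "Sillas", "Perro", "asdfg", "Flores")

# Hypothetical helper: exact lookup, then retry with a naive plural ending removed
is_spanish <- function(w, dict) {
  w <- tolower(w)
  dict <- tolower(dict)
  w %in% dict |
    sub("s$", "", w) %in% dict |   # sillas -> silla
    sub("es$", "", w) %in% dict    # flores -> flor
}

is_spanish(words, spanish_words)
#[1]  TRUE  TRUE  TRUE FALSE  TRUE
```

Because every check is a vectorised %in% lookup, no large regular expression is ever built.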

Related

Count the amount of bigrams in sentences in a dataframe

I have a dataset that looks a bit like this:
sentences <- c("sample text in sentence 1", "sample text in sentence 2")
id <- c(1,2)
df <- data.frame(sentences, id)
I would like to have a count where I can see the occurrence of certain bigrams. So let's say I have:
trigger_bg_1 <- "sample text"
I expect an output of 2 (as there are two occurrences of "sample text" in the two sentences). I know how to do a word count like this:
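trigger_word <- "sample"   # example trigger word, chosen here for illustration
trigger_word_sentence <- 0 # running count of matches
for (i in 1:nrow(df)) {
  # split the current sentence into single words
  words <- strsplit(as.character(df$sentences[i]), " ")[[1]]
  for (w in words) {
    if (w == trigger_word) {
      trigger_word_sentence <- trigger_word_sentence + 1
    }
  }
}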
But I can't get something working for a bigram. Any thoughts on how I should change the code to get it working?
But as I have a long test of trigger-words which I need to count in over
In case you want to count the sentences where you have a match you can use grep:
length(grep(trigger_bg_1, sentences, fixed = TRUE))
#[1] 2
In case you want to count how many times you find trigger_bg_1 you can use gregexpr:
sum(unlist(lapply(gregexpr(trigger_bg_1, sentences, fixed = TRUE)
, function(x) sum(x>0))))
#[1] 2
You could also sum the result of grepl:
sum(grepl(trigger_bg_1, df$sentences))
[1] 2
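When there is a whole list of trigger bigrams rather than one, the same gregexpr idea can be applied to each trigger in turn. This is a minimal sketch; the helper name count_fixed and the example vector triggers are assumptions for illustration.

```r
sentences <- c("sample text in sentence 1", "sample text in sentence 2")
triggers <- c("sample text", "in sentence", "not present")  # example triggers

# Hypothetical helper: total number of literal occurrences of one pattern
count_fixed <- function(pattern, x) {
  hits <- gregexpr(pattern, x, fixed = TRUE)
  # gregexpr returns -1 when there is no match, so count positive positions only
  sum(vapply(hits, function(h) sum(h > 0), integer(1)))
}

counts <- vapply(triggers, count_fixed, integer(1), x = sentences)
counts
#sample text in sentence not present 
#          2           2           0 
```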
If you are really interested in bigrams rather than just set word combinations, the package quanteda can offer a more substantial and systematic way forward:
Data:
sentences <- c("sample text in sentence 1", "sample text in sentence 2")
id <- c(1,2)
df <- data.frame(sentences, id)
Solution:
library(quanteda)
# strip sentences down to words (removing punctuation):
words <- tokens(sentences, remove_punct = TRUE)
# make bigrams, tabulate them and sort them in decreasing order:
bigrams <- sort(table(unlist(as.character(tokens_ngrams(words, n = 2, concatenator = " ")))), decreasing = TRUE)
Result:
bigrams
in sentence sample text     text in  sentence 1  sentence 2
          2           2           2           1           1
If you want to inspect the frequency count of a specific bigram:
bigrams["in sentence"]
in sentence
2
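If installing a package is not an option, a similar bigram table can be sketched in base R by pairing each word with its successor. The helper name make_bigrams is an assumption for illustration, and splitting on spaces is a simplification that does not strip punctuation the way tokens() does.

```r
sentences <- c("sample text in sentence 1", "sample text in sentence 2")

# Hypothetical helper: all adjacent word pairs of one sentence
make_bigrams <- function(s) {
  w <- strsplit(s, " +")[[1]]
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))  # word i pasted with word i + 1
}

bigrams <- table(unlist(lapply(sentences, make_bigrams)))
bigrams[["sample text"]]
#[1] 2
```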

Exact match from list of words from a text in R

I have a list of words and I am looking for the ones that appear in the text.
The problem is that the last column always says "Found", because the search matches the words as patterns inside longer words. I am looking for exact matches of the words, not substrings. For the first three records it should be "Not Found".
Please guide me on where I am going wrong.
col_1 <- c(1,2,3,4,5)
col_2 <- c("work instruction change",
"technology npi inspections",
" functional locations",
"Construction has started",
" there is going to be constn coon")
df <- as.data.frame(cbind(col_1,col_2))
df$col_2 <- tolower(df$col_2)
words <- c("const","constn","constrction","construc",
"construct","construction","constructs","consttntype","constypes","ct","ct#",
"ct2"
)
pattern_words <- paste(words, collapse = "|")
df$result<- ifelse(str_detect(df$col_2, regex(pattern_words)),"Found","Not Found")
Use word boundaries around the words.
library(stringr)
pattern_words <- paste0('\\b', words, '\\b', collapse = "|")
df$result <- c('Not Found', 'Found')[str_detect(df$col_2, pattern_words) + 1]
#OR with `ifelse`
#df$result <- ifelse(str_detect(df$col_2, pattern_words), "Found", "Not Found")
df
# col_1 col_2 result
#1 1 work instruction change Not Found
#2 2 technology npi inspections Not Found
#3 3 functional locations Not Found
#4 4 construction has started Found
#5 5 there is going to be constn coon Found
You can also use grepl here to keep it in base R:
grepl(pattern_words, df$col_2)
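To see why the boundaries matter, here is a small comparison on two made-up strings (the vectors below are illustrative, not taken from the question's data). Without \\b the word "ct" is found inside "instruction"; with \\b it only matches as a whole word.

```r
words <- c("const", "ct")
text <- c("work instruction change", "ct scan booked")  # illustrative strings

plain <- paste(words, collapse = "|")
bounded <- paste0("\\b(", plain, ")\\b")

grepl(plain, text)                  # substring match
#[1] TRUE TRUE
grepl(bounded, text, perl = TRUE)   # whole-word match
#[1] FALSE  TRUE
```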

Find documents with two words in a given proximity in R

I have a dataframe with a variable called text, which includes news transcripts. I want to identify transcripts that include the word "Republican" OR "Democrat" AND one of a list of words in a given proximity (let's say within 5 words). For example, if one of the list of words is "Congress," I want to pick up these transcripts:
"Republicans in Congress today voted on a bill." (proximity < 5)
"Democrats in Congress today voted on a bill." (proximity < 5)
And I do NOT want to pick up these transcripts:
"Republicans today passed a bill to allocate funds for Congress." (proximity > 5)
"Democrats today passed a bill to allocate funds for Congress." (proximity > 5)
I can match the list of words without the proximity constraint like this:
transcripts <- data.frame(text=c("Republicans in congress today voted on a bill","Republicans today passed a bill to allocate funds for Congress"))
dictionary <- data.frame(word=c("Congress","Capitol"))
transcripts_subset <- transcripts %>%
filter(grepl(paste(dictionary$word, collapse="|"), text))
and I tried looking up the regex to do this correctly, but it throws an error:
transcripts_subset <- transcripts %>%
filter(grepl("\b(paste(dictionary$dehumanizing, collapse="|"))(?:\\W+\\w+){0,5}?\\W+(Republican|Democrat)\b", text))
Error in "\b ..." :
operations are possible only for numeric, logical or complex types
How can I make this work?
Your dplyr filter code looks fine, so here is just the regex bit:
dictionary <- data.frame(word=c("Congress","Capitol"), stringsAsFactors = FALSE)
# group the dictionary alternation so that "|" cannot split the pattern apart
dict_words <- paste0("(", paste0(dictionary$word, collapse="|"), ")")
pattern_after <- paste0("\\b", dict_words, "\\W+(?:\\w+\\W+){0,5}?(Republican(s)*|Democrat(s)*)")
pattern_before <- paste0("\\b(Republican(s)*|Democrat(s)*)\\W+(?:\\w+\\W+){0,5}?", dict_words)
pattern <- paste0(c(pattern_after, pattern_before), collapse="|")
pattern
#> [1] "\\b(Congress|Capitol)\\W+(?:\\w+\\W+){0,5}?(Republican(s)*|Democrat(s)*)|\\b(Republican(s)*|Democrat(s)*)\\W+(?:\\w+\\W+){0,5}?(Congress|Capitol)"
grepl(pattern, "Republicans in congress today voted on a bill", perl = TRUE, ignore.case = TRUE)
#> [1] TRUE
grepl(pattern, "Democrats today passed a bill to allocate funds for Congress", perl = TRUE, ignore.case = TRUE)
#> [1] FALSE
grepl(pattern, "A Democrat in Congress", perl = TRUE, ignore.case = TRUE)
#> [1] TRUE
Created on 2019-10-01 by the reprex package (v0.3.0)
To dissect this: the regex to find two words separated by 0 to 5 other words in R is
"\\bword1\\W+(?:\\w+\\W+){0,5}word2"
\\b is a word boundary: a zero-width position between a word character and a non-word character (or the start or end of the string).
\\W+ is one or more non-word characters (e.g., whitespace or punctuation).
\\w+ is one or more word characters, i.e., a sequence of letters, digits or underscores.
(?:\\w+\\W+) is a non-capturing group consisting of word characters followed by non-word characters (i.e., a word plus the separator after it).
{0,5} indicates the group is matched between 0 and 5 times.
You need to set perl = TRUE for this to work. "Republican(s)*" matches "Republican" followed by zero or more "s", so it covers both the singular and the plural. The two separate patterns make sure it works no matter whether the dictionary word or Republican/Democrat is mentioned first.
You can try the following, which splits your string and uses grep to find the position of the dictionary words. If the position is < 5, the transcript is selected.
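The pieces described above can be assembled by a small helper that builds both orderings for any pair of alternations. This is a sketch under assumptions: the function name near_pattern and its max_gap argument are hypothetical, and each alternation is wrapped in a non-capturing group so that "|" inside it cannot leak into the rest of the pattern.

```r
# Hypothetical helper: pattern for a and b at most max_gap words apart, in either order
near_pattern <- function(a, b, max_gap = 5) {
  gap <- sprintf("\\W+(?:\\w+\\W+){0,%d}?", max_gap)
  paste0("\\b(?:", a, ")", gap, "(?:", b, ")\\b",
         "|",
         "\\b(?:", b, ")", gap, "(?:", a, ")\\b")
}

p <- near_pattern("Republicans?|Democrats?", "Congress|Capitol")
texts <- c("Republicans in Congress today voted on a bill.",
           "Republicans today passed a bill to allocate funds for Congress.")
res <- grepl(p, texts, perl = TRUE, ignore.case = TRUE)
res
#[1]  TRUE FALSE
```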
transcripts[sapply(strsplit(as.character(transcripts$text), " "), grep
, pattern=paste(dictionary$word, collapse="|"), ignore.case = TRUE) < 5,]
#[1] Republicans in congress today voted on a bill
#Or using sub to get the first 5 words
transcripts[sapply(sub("((\\S+\\s*){0,5}).*", "\\1", transcripts$text), grepl
, pattern=paste(dictionary$word, collapse="|"), ignore.case = TRUE),]
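A variation on the splitting idea, sketched here under assumptions, compares word positions directly instead of looking only at the start of the text: tokenise each transcript, find the positions of the party words and of the dictionary words, and keep the transcript if any pair is at most n positions apart. The helper name within_n and the anchored patterns are hypothetical choices for illustration.

```r
# Hypothetical helper: TRUE if a word matching a and a word matching b
# are at most n token positions apart
within_n <- function(text, a, b, n = 5) {
  tok <- tolower(strsplit(text, "\\W+")[[1]])
  ia <- grep(a, tok)
  ib <- grep(b, tok)
  length(ia) > 0 && length(ib) > 0 && min(abs(outer(ia, ib, "-"))) <= n
}

texts <- c("Republicans in Congress today voted on a bill.",
           "Republicans today passed a bill to allocate funds for Congress.")
keep <- vapply(texts, within_n, logical(1),
               a = "^(republican|democrat)s?$", b = "^(congress|capitol)$",
               USE.NAMES = FALSE)
keep
#[1]  TRUE FALSE
```

Here the distance is counted in token positions, so adjacent words have distance 1; adjust n if "within 5 words" should mean five words in between.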

R count the number of words starts with given letter in a phrase

I would like to count how many words in a given string start with a given letter.
For example, in this phrase: "that pattern is great but pigs likes milk"
If I want to find the number of words starting with "g", there is only 1 ("great"), but right now I get 2 ("great" and "pigs").
This is the code I use:
x <- "that pattern is great but pogintless"
sapply(regmatches(x, gregexpr("g", x)), length)
We need either a space or a word boundary to keep the letter from matching anywhere other than the start of a word. In addition, it may be better to use ignore.case = TRUE, as some words may begin with an uppercase letter.
lengths(regmatches(x, gregexpr("\\bg", x, ignore.case = TRUE)))
The above can be wrapped as a function
fLength <- function(str1, pat){
lengths(regmatches(str1, gregexpr(paste0("\\b", pat), str1, ignore.case = TRUE)))
}
fLength(x, "g")
#[1] 1
You can also do it with the stringr library. str_count works on the string directly, so no splitting is needed:
library(stringr)
str_count(x, "\\bg")
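A regex-free sketch of the same count, for comparison: split the phrase into words and test each one with startsWith. The helper name count_starting is an assumption for illustration.

```r
x <- "that pattern is great but pigs likes milk"

# Hypothetical helper: number of words beginning with a given letter, case-insensitive
count_starting <- function(s, letter) {
  w <- strsplit(tolower(s), "\\s+")[[1]]
  sum(startsWith(w, tolower(letter)))
}

count_starting(x, "g")
#[1] 1
count_starting(x, "p")
#[1] 2
```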

stringr: extract words containing a specific word

Consider this simple example
dataframe <- data_frame(text = c('WAFF;WOFF;WIFF200;WIFF12',
'WUFF;WEFF;WIFF2;BIGWIFF'))
> dataframe
# A tibble: 2 x 1
text
<chr>
1 WAFF;WOFF;WIFF200;WIFF12
2 WUFF;WEFF;WIFF2;BIGWIFF
Here I want to extract the words containing WIFF, that is I want to end up with a dataframe like this
> output
# A tibble: 2 x 1
text
<chr>
1 WIFF200;WIFF12
2 WIFF2;BIGWIFF
I tried to use
dataframe %>%
mutate( mystring = str_extract(text, regex('\bwiff\b', ignore_case=TRUE)))
but this only returns NAs. Any ideas?
Thanks!
A classic, non-regex approach via base R would be,
sapply(strsplit(dataframe$text, ';', fixed = TRUE), function(i)
paste(grep('WIFF', i, value = TRUE, fixed = TRUE), collapse = ';'))
#[1] "WIFF200;WIFF12" "WIFF2;BIGWIFF"
You seem to want to remove all words that do not contain WIFF, along with the trailing ; if there is one. Use
> dataframe <- data.frame(text = c('WAFF;WOFF;WIFF200;WIFF12', 'WUFF;WEFF;WIFF2;BIGWIFF'))
> dataframe$text <- str_replace_all(dataframe$text, "(?i)\\b(?!\\w*WIFF)\\w+;?", "")
> dataframe
text
1 WIFF200;WIFF12
2 WIFF2;BIGWIFF
The pattern (?i)\\b(?!\\w*WIFF)\\w+;? matches:
(?i) - a case insensitive inline modifier
\\b - a word boundary
(?!\\w*WIFF) - the negative lookahead fails any match where a word contains WIFF anywhere inside it
\\w+ - 1 or more word chars
;? - an optional ; (? matches 1 or 0 occurrences of the pattern it modifies)
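The same pattern also works in base R through gsub, as long as perl = TRUE is set so that the lookahead is supported. The sketch below additionally strips a leftover trailing ";", which can remain when a word without WIFF comes last; the "WIFF1;WAFF" input is a made-up example for that case.

```r
text <- c('WAFF;WOFF;WIFF200;WIFF12', 'WUFF;WEFF;WIFF2;BIGWIFF')
pattern <- "(?i)\\b(?!\\w*WIFF)\\w+;?"

cleaned <- gsub(pattern, "", text, perl = TRUE)
cleaned
#[1] "WIFF200;WIFF12" "WIFF2;BIGWIFF"

# Made-up input where the unwanted word is last: a ";" is left behind, so remove it
tail_case <- sub(";$", "", gsub(pattern, "", "WIFF1;WAFF", perl = TRUE))
tail_case
#[1] "WIFF1"
```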
If for some reason you want to use str_extract, note that your regex could not work because \bWIFF\b matches a whole word WIFF and nothing else. You do not have such words in your DF. You may use "(?i)\\b\\w*WIFF\\w*\\b" to match any words with WIFF inside (case insensitively) and use str_extract_all to get multiple occurrences, and do not forget to join the matches into a single "string":
> df <- data.frame(text = c('WAFF;WOFF;WIFF200;WIFF12', 'WUFF;WEFF;WIFF2;BIGWIFF'))
> res <- str_extract_all(df$text, "(?i)\\b\\w*WIFF\\w*\\b")
> res
[[1]]
[1] "WIFF200" "WIFF12"
[[2]]
[1] "WIFF2" "BIGWIFF"
> df$text <- sapply(res, function(s) paste(s, collapse=';'))
> df
text
1 WIFF200;WIFF12
2 WIFF2;BIGWIFF
You may "shrink" the code by placing str_extract_all inside the sapply call; I separated them for better visibility.
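For reference, a base R sketch of the same extract-and-join step, using regmatches and gregexpr in place of str_extract_all (perl = TRUE is needed for the inline (?i) modifier):

```r
text <- c('WAFF;WOFF;WIFF200;WIFF12', 'WUFF;WEFF;WIFF2;BIGWIFF')

# All words containing WIFF, one character vector per input string
res <- regmatches(text, gregexpr("(?i)\\b\\w*WIFF\\w*\\b", text, perl = TRUE))

# Join each set of matches back into a single ";"-separated string
joined <- vapply(res, paste, character(1), collapse = ";")
joined
#[1] "WIFF200;WIFF12" "WIFF2;BIGWIFF"
```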

Resources