Matching two strings and extracting the matched words in R

Consider the input character vectors mentioned below:
text_input <- c("ADOPT", "A", "FAIL", "FAST")
test <- c("TEST", "INPUT", "FAIL", "FAST")
I would like to match both inputs and extract the words that occur in common in text_input; I would like to do something similar to str_extract.
I understand that str_extract uses a matching pattern or word to do this, but my test data consists of around 500,000 words. Any inputs would be really helpful.
Expected Outcome:
"FAIL", "FAST"
EDIT
Just adding one more question here: what happens when the input is plain running text, like the one provided below?
text_input <- c("‘Data Scientist’ has been named the sexiest job of the 21st century by Harvard Business Review. The same article tells us that “demand has raced ahead of supply” and that the lack of data scientists “is becoming a serious constraint in some sectors.” A 2011 study by McKinsey Global Institute found that “there will be a shortage of talent necessary for organizations to take advantage of big data” – a shortage to the tune of 140,000 to 190,000 in the United States alone by 2018.")
test <- c("Data Scientist", "McKinsey", "ORGANIZATIONS", "FAST")
Is it possible to perform a string match even in this case, as mentioned above?
Note: Changed the input and testing string.

If we need the characters to be extracted:
library(stringr)
str_extract(text_input, paste0("[", test, "]+"))
If we are looking for a full-string match:
library(data.table)
fintersect(data.table(col1 = text_input), data.table(col1 = test))
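On the toy input the two approaches return quite different things; a quick sketch of the output (note the character-class pattern matches letters, not whole words):
str_extract(text_input, paste0("[", test, "]+"))
# [1] "T"    NA     "FAIL" "FAST"
fintersect(data.table(col1 = text_input), data.table(col1 = test))
#    col1
# 1: FAIL
# 2: FAST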

For the easy example you may use intersect(), as already stated in the comments.
text_input1 <- c("ADOPT", "A", "FAIL", "FAST")
test1 <- c("TEST", "INPUT", "FAIL", "FAST")
intersect(text_input1, test1)
# [1] "FAIL" "FAST"
The long example is a little more complicated.
text_input2 <- c("‘Data Scientist’ has been named the sexiest job of the 21st century by Harvard Business Review. The same article tells us that “demand has raced ahead of supply” and that the lack of data scientists “is becoming a serious constraint in some sectors.” A 2011 study by McKinsey Global Institute found that “there will be a shortage of talent necessary for organizations to take advantage of big data” – a shortage to the tune of 140,000 to 190,000 in the United States alone by 2018.")
phrases <- c("Data Scientist", "McKinsey", "ORGANIZATIONS", "FAST")
The test string vector you've defined (I'll call it phrases) contains compound terms of two (or possibly more) words, i.e. terms containing spaces. Therefore, we need a regular expression rx1 that can handle them. It is not clear whether you want case-sensitive matches or not; for the latter you'd need to tolower() both the phrases and the text. Next, we test whether there is a match. If so, we extend the regex to rx2 so that we can use it with the replacement functionality of gsub(). We Vectorize() our function so that it can handle vectors of phrases.
matchPhrase <- Vectorize(function(phr, txt, tol=FALSE) {
  rx1 <- gsub(" ", "\\\\s", phr)  # handle spaces
  if (tol) {  # optional tolower
    rx1 <- tolower(rx1)
    txt <- tolower(txt)
  }
  if (regexpr(rx1, txt) > 0) {  # test for matches
    rx2 <- paste0(".*(", rx1, ").*")
    return(gsub(rx2, "\\1", txt))  # gsub extraction
  } else {
    return(NA)  # we want NA for no matches
  }
})
The default, tol=FALSE, matches case-sensitively.
matchPhrase(phrases, text_input2, tol=FALSE)
# Data Scientist McKinsey ORGANIZATIONS FAST
# "Data Scientist" "McKinsey" NA NA
Case-insensitive matching (tol=TRUE) also finds "organizations".
matchPhrase(phrases, text_input2, tol=TRUE)
# Data Scientist McKinsey ORGANIZATIONS FAST
# "data scientist" "mckinsey" "organizations" NA
For a clean output just do:
as.character(na.omit(matchPhrase(phrases, text_input2, tol=TRUE)))
# [1] "data scientist" "mckinsey" "organizations"
Note: You will probably need to adapt the function a few times for your specific needs/desired outputs. The quanteda package is actually quite sophisticated at this kind of task.
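For illustration, a minimal quanteda sketch (assuming the package is installed; kwic() plus phrase() handle multi-word, case-insensitive lookups):
library(quanteda)
toks <- tokens(text_input2)
# keywords-in-context for the multi-word patterns, ignoring case
kwic(toks, pattern = phrase(phrases), case_insensitive = TRUE)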

This can also be achieved using the fuzzyjoin package, which provides a way to join data frames based on regular expressions.
text_input <- c("ADOPT", "A", "FAIL", "FAST")
regex <- c("TEST", "INPUT", "FAIL", "FAST")
library(fuzzyjoin)
library(dplyr)
df <- tibble( text = text_input )
df.regex <- tibble( regex_name = regex )
# now we can regex match them
df %>%
  regex_left_join( df.regex, by = c( text = "regex_name" ) )
# # A tibble: 4 x 2
# text regex_name
# <chr> <chr>
# 1 ADOPT NA
# 2 A NA
# 3 FAIL FAIL
# 4 FAST FAST
#or only regex 'hits'
df %>%
  regex_inner_join( df.regex, by = c( text = "regex_name" ) )
# # A tibble: 2 x 2
# text regex_name
# <chr> <chr>
# 1 FAIL FAIL
# 2 FAST FAST
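One caveat (my addition, not part of the original answer): the terms are treated as regular expressions, so "FAIL" would also match inside a longer title such as "FAILURE". Wrapping each term in word boundaries avoids that:
df.regex <- tibble( regex_name = sprintf("\\b%s\\b", regex) )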

Related

String match error "invalid regular expression, reason 'Out of memory'"

I have a table that is shaped like this called df (the actual table is 16,263 rows):
title          date       brand
big farm house 2022-01-01 A
ranch modern   2022-01-01 A
town house     2022-01-01 C
Then I have a table like this called match_list (the actual list is 94,000 rows):
words_for_match
farm
town
clown
beach
city
pink
And I'm trying to filter the first table to just be rows where the title contains a word in the words_for_match list. So I do this:
match_list <- match_list$words_for_match
match_list <- paste(match_list, collapse = "|")
match_list <- sprintf("\\b(%s)\\b", match_list)
df %>%
  filter(grepl(match_list, title))
But then I get the following error:
Problem while computing `..1 = grepl(match_list, subject)`.
Caused by error in `grepl()`:
! invalid regular expression, reason 'Out of memory'
If I filter the table with 94,000 rows to just 1,000 then it runs, so it appears to just be a memory issue. So I'm wondering if there's a less memory-intensive way to do this or if this is an example of needing to look beyond my computer for computation. Advice on either pathway (or other options) is welcome. Thanks!
You could test the titles sequentially: say 10 titles match 'farm'; you no longer need to evaluate those titles against the remaining words.
Here is a simple implementation:
titles <- c("big farm house", "ranch modern", "town house")
words_for_match <- c("farm", "town", "clown", "beach", "city", "pink")
titles.to.keep <- c()
for (w in words_for_match) {
  w <- sprintf("\\b(%s)\\b", w)
  is.match <- grepl(w, titles)
  titles.to.keep <- c(titles.to.keep, titles[is.match])
  titles <- titles[!is.match]
  print(paste(length(titles), "remaining titles"))
}
titles.to.keep
If you have a prior on the frequency of words on match_list, it's better to start with the most frequent ones.
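For instance, with hypothetical frequency counts (made up here purely to illustrate the idea), you could sort before the loop so the frequent words eliminate titles early:
word_freq <- c(farm = 120, town = 80, clown = 2, beach = 15, city = 40, pink = 5)
words_for_match <- names(sort(word_freq, decreasing = TRUE))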
UPDATE
You can also combine this with your previous strategy to make it faster:
gr.size <- 20
gr.words <- split(words_for_match, ceiling(seq_along(words_for_match) / gr.size))
gr.words <- sapply(gr.words, function(words) {
  words <- paste(words, collapse = "|")
  sprintf("\\b(%s)\\b", words)
})
and then iterate over gr.words instead of words_for_match in the first code chunk, as sketched below.
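A minimal sketch of that combined loop (same logic as before, assuming a fresh copy of titles; the grouped patterns already contain their word boundaries):
titles.to.keep <- c()
for (w in gr.words) {
  is.match <- grepl(w, titles)
  titles.to.keep <- c(titles.to.keep, titles[is.match])
  titles <- titles[!is.match]
}
titles.to.keep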

R: find a specific string next to another string with for loop

I have the text of a novel in a single vector; it has been split into words (novel.vector.words). I am looking for all instances of the string "blood of". However, since the vector is split by words, each word is its own string, and I don't know how to search for adjacent strings in a vector.
I have a basic understanding of what for loops do, and following some instructions from a textbook, I can use this for loop to target all positions of "blood" and the context around it to create a tab-delimited KWIC display (key words in context).
node.positions <- grep("blood", novel.vector.words)
output.conc <- "D:/School/U Alberta/Classes/Winter 2019/LING 603/dracula_conc.txt"
cat("LEFT CONTEXT\tNODE\tRIGHT CONTEXT\n", file=output.conc) # tab-delimited header
# This establishes the range of how many words we can see in our KWIC display
context <- 10 # specify a window of ten words before and after the match
for (i in 1:length(node.positions)){ # access each match...
  # access the current match
  node <- novel.vector.words[node.positions[i]]
  # access the left context of the current match
  left.context <- novel.vector.words[(node.positions[i]-context):(node.positions[i]-1)]
  # access the right context of the current match
  right.context <- novel.vector.words[(node.positions[i]+1):(node.positions[i]+context)]
  # concatenate and print the results
  cat(left.context, "\t", node, "\t", right.context, "\n", file=output.conc, append=TRUE)
}
What I am not sure how to do, however, is use something like an if statement to only capture instances of "blood" followed by "of". Do I need another variable in the for loop? Basically, for every instance of "blood" it finds, I want to check whether the word that immediately follows it is "of". I want the loop to find all of those instances and tell me how many there are in my vector.
You can create an index using dplyr::lead to match 'of' following 'blood':
library(dplyr)
novel.vector.words <- c("blood", "of", "blood", "red", "blood", "of", "blue", "blood")
which(grepl("blood", novel.vector.words) & grepl("of", lead(novel.vector.words)))
[1] 1 5
In response to the question in the comments:
This certainly could be done with a loop based approach but there is little point in re-inventing the wheel when there are already packages better designed and optimized to do the heavy lifting in text mining tasks.
Here is an example of how to find how frequently the words 'blood' and 'of' appear within five words of each other in Bram Stoker's Dracula using the tidytext package.
library(tidytext)
library(dplyr)
library(stringr)
## Read Dracula into dataframe and add explicit line numbers
fulltext <- data.frame(text = readLines("https://www.gutenberg.org/ebooks/345.txt.utf-8", encoding = "UTF-8"), stringsAsFactors = FALSE) %>%
  mutate(line = row_number())
## Pair of words to search for and word distance
word1 <- "blood"
word2 <- "of"
word_distance <- 5
## Create ngrams using skip_ngrams token
blood_of <- fulltext %>%
  unnest_tokens(output = ngram, input = text, token = "skip_ngrams", n = 2, k = word_distance - 1) %>%
  filter(str_detect(ngram, paste0("\\b", word1, "\\b")) & str_detect(ngram, paste0("\\b", word2, "\\b")))
## Return count
blood_of %>%
  nrow()
[1] 54
## Inspect first six line number indices
head(blood_of$line)
[1] 999 1279 1309 2192 3844 4135

In R, how do you extract multiple matched terms as a string and match if TRUE with regex or grep?

I'm still a beginner in R. I need help with some code that searches a vector for terms in a list and returns TRUE. If TRUE, it should return a string of the matched terms.
I have it set up to tell me whether the terms match and to return the first matched term, but I'm not sure how to get the rest of the matched terms.
In the attached code, I have my Desired_Output and the imperfect Final_Output.
#create dataset of 2 columns/vectors. 1st column is "Job Title", 2nd column is "Work Experience"
'Work Experience' <- c("cooked food; cleaned house; made beds", "analyzed data; identified gaps; used sql, python, and r", "used tableau to make dashboards for clients; applied advanced macro excel functions", "financial planning and strategy; consulted with leaders and clients")
'Job Title' <- c("dad", "research analyst", "business intelligence consultant", "finance consultant")
Job_Hist <- data.frame(`Job Title`, `Work Experience`)
#create list of terms to search for in Job_Hist
Term_List <- c("python", " r", "sql", "tableau", "excel")
#use grepl to search the Work Experience vector for terms in Term_List, THEN return TRUE or FALSE
Term_TF <- grepl(paste(Term_List, collapse = '|'), Job_Hist$Work.Experience)
#add a new column to our final output dataframe that shows if the job experience matched our terms
Final_Output <- Job_Hist
Final_Output$Term_Test <- Term_TF
#Let's see which terms caused the TRUE flag in the Final_Output
m <- regexpr(paste(Term_List, collapse = '|'),
             Job_Hist$Work.Experience, perl = TRUE)
T_Match <- regmatches(Job_Hist$Work.Experience, m)
#Compare Final_Output to my Desired_Output and please help me :)
Desired_T_Match <- c("NA", "sql, python, r", "tableau, excel", "NA")
Desired_Output <- data.frame(`Job Title`, `Work Experience`, Term_TF, Desired_T_Match)
#I need 2 things:
#1) a way to tie T_Match back to Final_Output... something like: if TRUE, then match
#2) a way to return every term matched in a comma-delimited string. Example: research analyst | analyzed data... | TRUE | sql, python
You can use stringr::str_extract_all to get a list of matches from each row:
library(stringr)
library(tidyverse)
Job_Hist$matches <- str_extract_all(Job_Hist$Work.Experience,
                                    paste(Term_List, collapse = '|'), simplify = TRUE)
Work.Experience Term matches.1 matches.2
1 cooked food; cleaned house; made beds FALSE
2 analyzed data; identified gaps; used sql, python, and r TRUE sql python
3 used tableau to make dashboards for clients; applied advanced macro excel functions TRUE tableau excel
4 financial planning and strategy; consulted with leaders and clients FALSE
matches.3
1
2 r
3
4
Edit: if you'd rather have the matches in one column as a comma-separated string, you can use:
str_extract_all(Job_Hist$Work.Experience, paste(Term_List, collapse = '|')) %>%
  sapply(., paste, collapse = ", ")
matches
1
2 sql, python, r
3 tableau, excel
4
Note that if you use the default argument simplify = FALSE in str_extract_all, your matches column will look correct, like the result we get with sapply above. However, if you inspect it with str() you'd see it is actually a list-column whose elements are character vectors, which will cause problems for some types of analysis.
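A quick way to see the difference is to inspect the structure directly (a sketch; output abbreviated):
str(str_extract_all(Job_Hist$Work.Experience, paste(Term_List, collapse = '|')))
# List of 4
#  $ : chr(0)
#  $ : chr [1:3] "sql" "python" " r"
#  ...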

How to find the position of a word in one column with another column in same dataframe in R?

I have two columns in a dataframe train:
Subject                  | Keyword
the box is beautiful     | box
delivery reached in time | delivery
they serve well serve    | serve
How do I find the position of the keyword in the subject?
Currently, I am using a for loop:
for (k in 1:nrow(train)) {
  l <- unlist(gregexpr(train$keyword[k], train$subject[k], ignore.case = TRUE))
  train$position[k] <- l
}
Is there any other way?
No need for a loop, just use the locate functions in the stringr or stringi package.
train <- data.frame(subject = c("the box is beauty", "delivery reached on time", "they serve well"),
                    keyword = c("box", "delivery", "serve"),
                    stringsAsFactors = FALSE)
library(stringr)
train$position_stringr <- str_locate(train$subject, train$keyword)[,1]
#locate returns a matrix and we are just interested in the start of keyword.
library(stringi)
train$position_stringi <- stri_locate_first(train$subject, regex = train$keyword)[,1]
train
subject keyword position_stringr position_stringi
1 the box is beauty box 5 5
2 delivery reached on time delivery 1 1
3 they serve well serve 6 6
You could use the approach below.
#data.frame created using the below statements
Subject <- c("the box is beauty", "delivery reached on time", "they serve well")
Keyword <- c("box", "delivery", "serve")
train <- data.frame(Subject, Keyword)
#Solution
library(stringr)
for (k in 1:nrow(train)) {
  t1 <- as.character(train$Subject[k])
  t2 <- as.character(train$Keyword[k])
  locate_vector <- str_locate(t1, regex(t2, ignore_case = TRUE))[[1]]
  train$start_position[k] <- locate_vector
  # If the end position is also required, the second column from the
  # str_locate function could be used.
}

Split Speaker and Dialogue in RStudio

I have documents such as:
President Dr. Norbert Lammert: I declare the session open.
I will now give the floor to Bundesminister Alexander Dobrindt.
(Applause of CDU/CSU and delegates of the SPD)
Alexander Dobrindt, Minister for Transport and Digital Infrastructure:
Ladies and Gentleman. We will today start the biggest investment in infrastructure that ever existed, with over 270 billion Euro, over 1 000 projects and a clear financing perspective.
(Volker Kauder [CDU/CSU]: Genau!)
(Applause of the CDU/CSU and the SPD)
When I read those .txt documents, I would like to create a second column indicating the speaker's name.
So what I tried first was to create a list of all possible names and replace them:
library(qdap)
members <- c("Alexander Dobrindt, Minister for Transport and Digital Infrastructure:","President Dr. Norbert Lammert:")
members_r <- c("#Alexander Dobrindt, Minister for Transport and Digital Infrastructure:","#President Dr. Norbert Lammert:")
prok <- scan(".txt", what = "character", sep = "\n")
prok <- mgsub(members,members_r,prok)
prok <- as.data.frame(prok)
prok$speaker <- grepl("#[^\\#:]*:",prok$prok, ignore.case = T)
My plan was then to extract the name between # and : via regex where speaker == TRUE, and fill it downwards until a different name appears (and obviously remove all the applause/shout brackets), but that is the part I am not sure how to do.
Here is an approach:
require(qdap)
# text is the document text
# remove round brackets and the text between ()
a <- bracketX(text, "round")
names <- c("President Dr. Norbert Lammert", "Alexander Dobrindt")
searchString <- paste(names[1], names[2], sep = ".+")
# Get the string from names[1] up to names[2] with the help of searchString
string <- regmatches(a, regexpr(searchString, a))
# remove names[2] from the string
string <- gsub(names[2], "", string)
This code can be looped when there are more than 2 names, as sketched below.
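A rough sketch of such a loop (my assumptions: names lists the speakers in order of appearance, a is a single string, and every pattern actually matches; no error handling):
speeches <- character(length(names) - 1)
for (i in seq_len(length(names) - 1)) {
  searchString <- paste(names[i], names[i + 1], sep = ".+")
  m <- regmatches(a, regexpr(searchString, a))
  speeches[i] <- gsub(names[i + 1], "", m)
}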
Here is an approach leaning heavily on dplyr.
First, I added a sentence to your sample text to illustrate why we can't just use a colon to identify speaker names.
sampleText <-
"President Dr. Norbert Lammert: I declare the session open.
I will now give the floor to Bundesminister Alexander Dobrindt.
(Applause of CDU/CSU and delegates of the SPD)
Alexander Dobrindt, Minister for Transport and Digital Infrastructure:
Ladies and Gentleman. We will today start the biggest investment in infrastructure that ever existed, with over 270 billion Euro, over 1 000 projects and a clear financing perspective.
(Volker Kauder [CDU/CSU]: Genau!)
(Applause of the CDU/CSU and the SPD)
This sentence right here: it is an example of a problem"
I then split the text to simulate the format you appear to be reading it in (this also puts each speech into its own element of a list).
splitText <- strsplit(sampleText, "\n")
Then, I pull out all of the potential speakers (anything that precedes a colon):
allSpeakers <- lapply(splitText, function(thisText){
  grep(":", thisText, value = TRUE) %>%
    gsub(":.*", "", .) %>%
    gsub("\\(", "", .)
}) %>%
  unlist() %>%
  unique()
Which gives us:
[1] "President Dr. Norbert Lammert"
[2] "Alexander Dobrindt, Minister for Transport and Digital Infrastructure"
[3] "Volker Kauder [CDU/CSU]"
[4] "This sentence right here"
Obviously, the last one is not a legitimate name, so it should be excluded from our list of speakers:
legitSpeakers <- allSpeakers[-4]
Now we are ready to work through the speech. I have included stepwise comments in the code below instead of describing each step here.
speechText <- lapply(splitText, function(thisText){
  # Remove applause and interjections (things in parentheses)
  # along with any blank lines; though you could leave blanks if you want
  cleanText <-
    grep("(^\\(.*\\)$)|(^$)", thisText
         , value = TRUE, invert = TRUE)
  # Split each line on colons
  strsplit(cleanText, ":") %>%
    lapply(function(x){
      # Check if the first element is a legit speaker
      if(x[1] %in% legitSpeakers){
        # If so, set the speaker, and put the statement in a separate portion
        # taking care to re-collapse any breaks caused by additional colons
        out <- data.frame(speaker = x[1]
                          , text = paste(x[-1], collapse = ":"))
      } else{
        # If not a legit speaker, set speaker to NA and reset text as above
        out <- data.frame(speaker = NA
                          , text = paste(x, collapse = ":"))
      }
      # Return whichever version we made above
      return(out)
    }) %>%
    # Bind all of the rows together
    bind_rows %>%
    # Identify clusters of speech that go with a single speaker
    mutate(speakingGroup = cumsum(!is.na(speaker))) %>%
    # Group by those clusters
    group_by(speakingGroup) %>%
    # Collapse each cluster down into a single row
    summarise(speaker = speaker[1]
              , fullText = paste(text, collapse = "\n"))
})
This yields
[[1]]
speakingGroup speaker fullText
1 President Dr. Norbert Lammert I declare the session open.\nI will now give the floor to Bundesminister Alexander Dobrindt.
2 Alexander Dobrindt, Minister for Transport and Digital Infrastructure Ladies and Gentleman. We will today start the biggest investment in infrastructure that ever existed, with over 270 billion Euro, over 1 000 projects and a clear financing perspective.\nThis sentence right here: it is an example of a problem
If you prefer to have each line of text separately, replace the summarise at the end with mutate(speaker = speaker[1]) and you will get one line for each line of the speech, like this:
speaker text speakingGroup
President Dr. Norbert Lammert I declare the session open. 1
President Dr. Norbert Lammert I will now give the floor to Bundesminister Alexander Dobrindt. 1
Alexander Dobrindt, Minister for Transport and Digital Infrastructure 2
Alexander Dobrindt, Minister for Transport and Digital Infrastructure Ladies and Gentleman. We will today start the biggest investment in infrastructure that ever existed, with over 270 billion Euro, over 1 000 projects and a clear financing perspective. 2
Alexander Dobrindt, Minister for Transport and Digital Infrastructure This sentence right here: it is an example of a problem 2
This seems to work
library(qdap)
members <- c("Alexander Dobrindt, Minister for Transport and Digital Infrastructure:","President Dr. Norbert Lammert:")
members_r <- c("#Alexander Dobrindt, Minister for Transport and Digital Infrastructure:","#President Dr. Norbert Lammert:")
testprok <- read.table("txt", header = FALSE, quote = "\"", comment.char = "", sep = "\t")
testprok$V1 <- mgsub(members, members_r, testprok$V1)
testprok$V2 <- ifelse(grepl("#[^\\#:]*:", testprok$V1), testprok$V1, NA)
#### function from http://stackoverflow.com/questions/7735647/replacing-nas-with-latest-non-na-value
repeat.before <- function(x) {   # repeats the last non-NA value; keeps a leading NA
  ind <- which(!is.na(x))        # get positions of non-missing values
  if (is.na(x[1]))               # if it begins with a missing value, add the
    ind <- c(1, ind)             # first position to the indices
  rep(x[ind], times = diff(      # repeat the values at these indices;
    c(ind, length(x) + 1)))      # diffing the indices + length yields how often
}                                # they need to be repeated
testprok$V2 <- repeat.before(testprok$V2)
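As an aside (not part of the original answer), tidyr ships the same fill-down behaviour out of the box:
library(tidyr)
testprok <- fill(testprok, V2)  # replaces each NA in V2 with the last non-NA value above it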
