matching words in strings with variables in R - r

I have a data set, like the following:
cp<-data.frame("name"=c("billy", "jean", "jean", "billy","billy", "dawn", "dawn"),
"answer"=c("michael jackson is my favorite", "I like flowers", "flower is red","hey michael",
"do not touch me michael","i am a girl","girls have hair"))
Each value of name has a string attached to it, stored in the variable answer. I would like to find out which specific words, parts of words, or whole sentences in the answer variable are common to each name in name:
For example, the name "billy" would have "michael" connected to it.
EDIT:
A data frame with following variables called ddd:
name: debby answer: "did you go to dallas?"
name: debby answer: "debby did dallas"
function(name=debby,data=ddd) {...} ,
which gives output "did debby dallas".

Here's a (not very efficient) function I've made that uses pmatch in order to match partial matches. The problem with it is that it will also match a and am, or i and is, because they are also very close.
freqFunc <- function(x){
temp <- tolower(unlist(strsplit(as.character(x), " ")))
temp2 <- length(temp)
temp3 <- lapply(temp, function(x){
temp4 <- na.omit(temp[pmatch(rep(x, temp2), temp)])
temp4[length(temp4) > 1]
})
list(unique(unlist(temp3)))
}
library(data.table)
setDT(cp)[, lapply(.SD, freqFunc), by = name, .SDcols = "answer"]
# name answer
# 1: billy michael
# 2: jean i,is,flower,flowers
# 3: dawn a,am,girl,girls
If you are satisfied with just exact matches, this can be greatly simplified, which also improves performance (I also added tolower so it will match different cases too):
freqFunc2 <- function(x){
temp <- table(tolower(unlist(strsplit(as.character(x), " "))))
list(names(temp[temp > 1]))
}
library(data.table)
setDT(cp)[, lapply(.SD, freqFunc2), by = name, .SDcols = "answer"]
# name answer
# 1: billy michael
# 2: jean
# 3: dawn
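One possible way to keep partial matches such as flower/flowers while avoiding the a/am and i/is false positives mentioned above is to ignore very short words and only accept prefix relationships. The following is a minimal base-R sketch, not part of the original answer; the helper name prefixMatches and the minChars threshold are assumptions chosen for illustration.

```r
# Hypothetical helper: words that are repeated, or that are a prefix of
# (or have as prefix) another word in the same group of answers.
# Words shorter than minChars are ignored to avoid a/am or i/is matches.
prefixMatches <- function(x, minChars = 4) {
  words <- tolower(unlist(strsplit(as.character(x), " ", fixed = TRUE)))
  words <- words[nchar(words) >= minChars]
  hit <- vapply(seq_along(words), function(i) {
    others <- words[-i]
    # TRUE if another word starts with this one, or this one starts with another
    any(startsWith(others, words[i]) | startsWith(words[i], others))
  }, logical(1))
  unique(words[hit])
}

billy <- prefixMatches(c("michael jackson is my favorite", "hey michael",
                         "do not touch me michael"))
jean <- prefixMatches(c("I like flowers", "flower is red"))
dawn <- prefixMatches(c("i am a girl", "girls have hair"))
```

The length threshold is a blunt tool: it would also drop genuinely repeated short words, so it may need tuning for other data.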

With the caveat that I've understood correctly, I think this is what you're looking for. It doesn't handle plurals of words though, as David mentioned. This just finds words that are exactly the same.
billyAnswers<-cp$answer[cp$name=="billy"]
#output of billyAnswers
#[1] "michael jackson is my favorite" "hey michael"
#[3] "do not touch me michael"
Now we get all the words
allWords<-unlist(strsplit(as.character(billyAnswers), " "))
#output of allWords
# [1] "michael" "jackson" "is" "my" "favorite" "hey"
# [7] "michael" "do" "not" "touch" "me" "michael"
We can find the common ones
common<-allWords[duplicated(allWords)]
#output of common
#[1] "michael" "michael"
Of course there are two michaels because there are multiple instances of michael in billy's answers! So let's pare it down once more.
unique(common)
#[1] "michael"
And there you go, apply that to all names and you got it.
For jean and dawn, there are no common words in their answers, so this method returns two character vectors of length 0
#jean's words
#[1] "I" "like" "flowers" "flower" "is" "red"
#dawn's words
#[1] "i" "am" "a" "girl" "girls" "have" "hair"
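The "apply that to all names" step can be sketched in base R with split and lapply. This is only an illustration under the same exact-match assumption; the helper name commonWords is hypothetical, and a word repeated inside a single answer would also count as common.

```r
cp <- data.frame(
  name = c("billy", "jean", "jean", "billy", "billy", "dawn", "dawn"),
  answer = c("michael jackson is my favorite", "I like flowers", "flower is red",
             "hey michael", "do not touch me michael", "i am a girl",
             "girls have hair"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: words that occur more than once in a set of answers
commonWords <- function(answers) {
  words <- unlist(strsplit(answers, " ", fixed = TRUE))
  unique(words[duplicated(words)])
}

# split() groups the answers by name; lapply() runs the helper on each group
commonByName <- lapply(split(cp$answer, cp$name), commonWords)
commonByName$billy
```

For jean and dawn the corresponding list elements are empty character vectors, matching the description above.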

Related

R: using \\b and \\B in regex

I read about regex and came across word boundaries. I found a question that is about the difference between \b and \B. Using the code from this question does not give the expected output. Here:
grep("\\bcat\\b", "The cat scattered his food all over the room.", value= TRUE)
# I expect "cat" but it returns the whole string.
grep("\\B-\\B", "Please enter the nine-digit id as it appears on your color - coded pass-key.", value= TRUE)
# I expect "-" but it returns the whole string.
I use the code as described in the question but with two backslashes as suggested here. Using one backslash does not work either. What am I doing wrong?
You can use regexpr and regmatches to get the match. grep only tells you which elements contain a hit (or, with value = TRUE, returns those whole elements). You can also use sub.
x <- "The cat scattered his food all over the room."
regmatches(x, regexpr("\\bcat\\b", x))
#[1] "cat"
sub(".*(\\bcat\\b).*", "\\1", x)
#[1] "cat"
x <- "Please enter the nine-digit id as it appears on your color - coded pass-key."
regmatches(x, regexpr("\\B-\\B", x))
#[1] "-"
sub(".*(\\B-\\B).*", "\\1", x)
#[1] "-"
For more than 1 match use gregexpr:
x <- "1abc2"
regmatches(x, gregexpr("[0-9]", x))
#[[1]]
#[1] "1" "2"
grep returns the whole string because it only checks whether the match is present in the string. If you want to extract cat, you need to use other functions such as str_extract from the package stringr:
library(stringr)
str_extract("The cat scattered his food all over the room.", "\\bcat\\b")
[1] "cat"
The difference between \\b and \\B is that \\b marks word boundaries whereas \\B is its negation. That is, \\bcat\\b matches only if cat stands as a separate word (bounded by non-word characters such as white space or punctuation, or by the start or end of the string), whereas \\Bcat\\B matches only if cat is inside a word. For example:
str_extract_all("The forgot his education and scattered his food all over the room.", "\\Bcat\\B")
[[1]]
[1] "cat" "cat"
These two matches are from education and scattered.
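To make the boundary behaviour concrete, here is a small sketch that tests both patterns against a few strings with grepl. The test strings are made up for illustration, and perl = TRUE is used here as an assumption to get PCRE semantics; it is not required by the answers above.

```r
x <- c("cat", "the cat.", "scattered", "cat-like", "concat")

# \b requires a word/non-word transition on both sides of "cat"
wholeWord <- grepl("\\bcat\\b", x, perl = TRUE)

# \B requires that there is no such transition on either side
insideWord <- grepl("\\Bcat\\B", x, perl = TRUE)

data.frame(x, wholeWord, insideWord)
```

Note that "cat-like" counts as a whole-word match because the hyphen is a non-word character, while "concat" matches neither pattern: cat is attached to a word on the left but ends at a boundary on the right.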

How to extract text using delimiters when some delimiters missing

I am trying to extract text according to the headers in a semi-structured text document.
Input
Column<-"Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Report: Need to complete Conclusion: Dud"
The output here is
Order Subject Name Grade Report Conclusion
1223442 History Bilbo Johnson Bad Need to complete Dud
I can achieve this with the following (messy but it works) function:
dataframeIn<-data.frame(Column,stringsAsFactors=FALSE)
delim<-c("Order","Subject","Name","Grade","Report","Conclusion")
Extractor <- function(dataframeIn, Column, delim) {
dataframeInForLater<-dataframeIn
ColumnForLater<-Column
Column <- rlang::sym(Column)
dataframeIn <- data.frame(dataframeIn)
dataframeIn<-dataframeIn %>%
tidyr::separate(!!Column, into = c("added_name",delim),
sep = paste(delim, collapse = "|"),
extra = "drop", fill = "right")
names(dataframeIn) <- gsub(".", "", names(dataframeIn), fixed = TRUE)
dataframeIn<-data.frame(dataframeIn)
#Add the original column back in so have the original reference
dataframeIn<-cbind(dataframeInForLater[,ColumnForLater],dataframeIn)
dataframeIn<-data.frame(dataframeIn)
return(dataframeIn)
}
Extractor(dataframeIn, "Column", delim)
However, sometimes the delimiters are missing, e.g.
Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Conclusion: Dud
In which case the desired output is
Order Subject Name Grade Conclusion
1223442 History Bilbo Johnson Bad Dud
but the actual output becomes:
Order Subject Name Grade Report Conclusion
:1223442 :History Bilbo Johnson : Bad : Dud <NA>
How can I account for missing delimiters although they are in the same order (including delimiters that are missing in the middle of the text as well as the end as in the example above) ?
We may do the following (it's only text extraction, I leave constructing the output for you):
library(stringr)
Extractor <- function(x, delim) {
pattern <- paste0(delim, ":{0,1}(.*?)(", paste(c(delim, "$"), collapse = "|"), ")")
trimws(str_match(x, pattern)[, 2])
}
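# Column1 and Column2 are the two input strings from the question
Column1 <- "Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Report: Need to complete Conclusion: Dud"
Column2 <- "Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Conclusion: Dud"
Extractor(Column1, delim)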
# [1] "1223442" "History" "Bilbo Johnson" "Bad" "Need to complete" "Dud"
Extractor(Column2, delim)
# [1] "1223442" "History" "Bilbo Johnson" "Bad" NA "Dud"
Column3 <- "Subject:History Name Bilbo Johnson"
Extractor(Column3, delim)
# [1] NA "History" "Bilbo Johnson" NA NA NA
Since we have NAs, it's clear which delimiters were missing and which weren't.
The way it works in your case is that we have a series of patterns
pattern
# [1] "Order:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"
# [2] "Subject:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"
# [3] "Name:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"
# [4] "Grade:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"
# [5] "Report:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"
# [6] "Conclusion:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"
Then str_match nicely extracts the (.*?) part into the second output column, and we get rid of any surrounding spaces with trimws. We also use lazy matching in (.*?) so as not to match too much.
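Constructing the output was left open above. One way it might be done is sketched below in base R, using regexec and regmatches instead of stringr so the block is self-contained. The helper name extractRow is hypothetical, and the sketch assumes the delimiter words never occur inside the values themselves.

```r
delim <- c("Order", "Subject", "Name", "Grade", "Report", "Conclusion")

# Hypothetical helper: one-row data frame with a column per delimiter found in x
extractRow <- function(x, delim) {
  stopper <- paste(c(delim, "$"), collapse = "|")
  values <- vapply(delim, function(d) {
    # same idea as above: delimiter, optional colon, lazy capture, next delimiter
    pattern <- paste0(d, ":?(.*?)(?:", stopper, ")")
    m <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]]
    if (length(m) == 0) NA_character_ else trimws(m[2])
  }, character(1))
  # keep only the delimiters that were actually present
  as.data.frame(as.list(values[!is.na(values)]), stringsAsFactors = FALSE)
}

Column2 <- "Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Conclusion: Dud"
row2 <- extractRow(Column2, delim)
row2
```

Because each input can yield a different set of columns, combining many such rows into one table would need a step that fills absent columns with NA.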

Find the names contained in each sentence cycling through a large vector of names

This question is an extension of this one: Find the names contained in each sentence (not the other way around)
I'll write the relevant part here. From this:
> sentences
[1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
[2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21 with the help of Martin Luther"
[3] " He studied the Scripture, especially of Paul, and Evangelical doctrine"
[4] " He was present at the disputation of Leipzig (1519) as a spectator, but participated by his comments."
[5] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
toMatch <- c("Martin Luther", "Paul", "Melanchthon")
We obtained this result:
library(stringr)
lst <- str_extract_all(sentences, paste(toMatch, collapse="|"))
lst[lengths(lst)==0] <- NA
lst
#[[1]]
#[1] "Martin Luther"
#[[2]]
#[1] "Melanchthon" "Martin Luther"
#[[3]]
#[1] "Paul"
#[[4]]
#[1] NA
#[[5]]
#[1] "Melanchthon"
But for a large toMatch vector, concatenating its values with the OR operator might not be very efficient. So my question is: how can the same result be obtained using a function or a loop? Maybe this way a regular expression like \< or \b could be placed around the toMatch values so that the system only looks for whole words instead of arbitrary substrings.
I've tried this but don't know how to save the matches in lst to get the same result as above.
for(i in 1:length(sentences)){
for(j in 1:length(toMatch)){
lst<-str_extract_all(sentences[i], toMatch[j])
}}
Are you expecting something like this?
library(stringr)
sentences <- c(
"Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin",
" Melanchthon became professor of the Greek language in Wittenberg at the age of 21 with the help of Martin Luther",
" He studied the Scripture, especially of Paul, and Evangelical doctrine",
" He was present at the disputation of Leipzig (1519) as a spectator, but participated by his comments.",
" Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium")
toMatch <- c("Martin Luther", "Paul", "Melanchthon")
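lst <- vector("list", length(sentences))  # initialise the result list
for (i in seq_along(sentences)) {
  lst[[i]] <- rep(NA_character_, length(toMatch))
  for (j in seq_along(toMatch)) {
    tmp <- str_extract_all(sentences[i], toMatch[j])
    if (length(tmp[[1]]) > 0) {
      lst[[i]][j] <- tmp[[1]][1]  # keep the first match of this name
    }
  }
}
# drop the NA placeholders and store the result
lst <- lapply(lst, function(x) x[!is.na(x)])
lst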
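The whole-word idea raised in the question (wrapping each name in \b) could be sketched as follows in base R. This is an illustration rather than the answerer's method: the helper name findNames is hypothetical, the names are assumed to contain no regex metacharacters, and matches are returned in the order of toMatch rather than in order of appearance.

```r
sentences <- c(
  " Melanchthon became professor of the Greek language in Wittenberg at the age of 21 with the help of Martin Luther",
  " He studied the Scripture, especially of Paul, and Evangelical doctrine",
  " He was present at the disputation of Leipzig (1519) as a spectator, but participated by his comments."
)
toMatch <- c("Martin Luther", "Paul", "Melanchthon")

# Hypothetical helper: for each sentence, the names that occur as whole words
findNames <- function(sentences, toMatch) {
  lapply(sentences, function(s) {
    found <- vapply(toMatch, function(nm) {
      grepl(paste0("\\b", nm, "\\b"), s, perl = TRUE)
    }, logical(1))
    hits <- toMatch[found]
    if (length(hits) == 0) NA_character_ else hits
  })
}

res <- findNames(sentences, toMatch)
res
```

Whether this is faster than a single alternation pattern depends on the data, so it would be worth timing both on the real toMatch vector.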

Splitting merged words (with mini-dictionary)

I have a set of words: some of which are merged terms, and others that are just simple words. I also have a separate list of words that I am going to use to compare with my first list (as a dictionary) in order to 'un-merge' certain words.
Here's an example:
ListA <- c("dopamine", "andthe", "lowerswim", "other", "different")
ListB <- c("do", "mine", "and", "the", "lower", "owe", "swim")
My general procedure would be something like this:
search for pattern from ListB that occurs twice in a word in ListA where the merged terms are consecutive (no spare letters in the word). So for example, from ListA 'lowerswim' would match with 'lower' and 'swim' not 'owe' and 'swim'.
for each selected word, check if that word exists in ListB. If yes, then keep it in ListA. Otherwise, split the word into the two words matched with words from ListB
Does this sound sensible? And if so, how do I implement it in R? Maybe it sounds quite routine but at the moment I'm having trouble with:
searching for words inside words. I can match words from lists no problem but I'm not sure how I use grep or equivalent to go further than this
declaring that the words must be consecutive. I've been thinking about this for a while but I can't seem to find anything that works
Can anyone please send me in the right direction?
I think the first step would be to build all the combined pairs from ListB:
pairings <- expand.grid(ListB, ListB)
combos <- apply(pairings, 1, function(x) paste0(x[1], x[2]))
combos
# [1] "dodo" "minedo" "anddo" "thedo" "lowerdo" "owedo" "swimdo"
# [8] "domine" "minemine" "andmine" "themine" "lowermine" "owemine" "swimmine"
# [15] "doand" "mineand" "andand" "theand" "lowerand" "oweand" "swimand"
# [22] "dothe" "minethe" "andthe" "thethe" "lowerthe" "owethe" "swimthe"
# [29] "dolower" "minelower" "andlower" "thelower" "lowerlower" "owelower" "swimlower"
# [36] "doowe" "mineowe" "andowe" "theowe" "lowerowe" "oweowe" "swimowe"
# [43] "doswim" "mineswim" "andswim" "theswim" "lowerswim" "oweswim" "swimswim"
You can use str_extract from the stringr package to extract the element of combos that is contained within each element of ListA, if such an element exists:
library(stringr)
matches <- str_extract(ListA, paste(combos, collapse="|"))
matches
# [1] NA "andthe" "lowerswim" NA NA
Finally, you want to split the words in ListA that matched a pair of elements from ListB, unless this word is already in ListB. I suppose there are lots of ways to do this, but I'll use lapply and unlist:
newA <- unlist(lapply(seq_along(ListA), function(idx) {
if (is.na(matches[idx]) | ListA[idx] %in% ListB) {
return(ListA[idx])
} else {
return(as.vector(as.matrix(pairings[combos == matches[idx],])))
}
}))
newA
# [1] "dopamine" "and" "the" "lower" "swim" "other" "different"
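As an alternative to enumerating all pairs, each word can be cut at every position and both halves looked up in the dictionary. This is a different technique from the answer above, sketched under the assumption that a merged term consists of exactly two dictionary words spanning the whole word; the helper name splitMerged is hypothetical.

```r
ListA <- c("dopamine", "andthe", "lowerswim", "other", "different")
ListB <- c("do", "mine", "and", "the", "lower", "owe", "swim")

# Hypothetical helper: split a word into two dictionary words if possible
splitMerged <- function(word, dict) {
  # words already in the dictionary, or too short to split, are kept as-is
  if (word %in% dict || nchar(word) < 2) return(word)
  for (k in seq_len(nchar(word) - 1)) {
    left <- substr(word, 1, k)
    right <- substr(word, k + 1, nchar(word))
    if (left %in% dict && right %in% dict) return(c(left, right))
  }
  word  # no valid split found
}

newA2 <- unlist(lapply(ListA, splitMerged, dict = ListB))
newA2
```

Unlike the str_extract approach, this does not match a pair that is merely contained somewhere inside a longer word, and it avoids building the full set of pairs, which grows quadratically with the size of ListB.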

how to get value when a variable name is passed as a string

I wrote this code in R:
paste("a","b","c")
which returns the value "abc"
Variable abc has a value of 5 (say). How do I get "abc" to give me the value 5? Is there any function like as.value(paste("a","b","c")) which will give me the answer 5? I have simplified my question, but this is exactly what I want. Please help. Thanks in advance.
paste("a","b","c") gives "a b c" not "abc"
Anyway, I think you are looking for get():
> abc <- 5
> get("abc")
[1] 5
An addition to Sacha's answer. If you want to assign a value to an object "abc" using paste():
assign(paste("a", "b", "c", sep = ""), 5)
This is certainly one of the most-asked questions about the R language, along with its evil twin brother "How do I turn x='myfunc' into an executable function?"
In summary, get, parse, eval , expression are all good things to learn about. The most useful (IMHO) and least-well-known is do.call , which takes care of a lot of the string-to-object conversion work for you.
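A short sketch of the string-to-object conversions mentioned here, using only base R. The variable names are made up for illustration.

```r
abc <- 5

# Build the variable name as a string, then look the object up by name
varName <- paste0("a", "b", "c")   # "abc"
value <- get(varName)

# do.call accepts the function name as a string plus a list of arguments
funName <- "sum"
total <- do.call(funName, list(1, 2, value))

# match.fun turns a function name into the function itself
f <- match.fun("toupper")
shout <- f(varName)
```

Here do.call("sum", list(1, 2, value)) is equivalent to calling sum(1, 2, value) directly, which is why it covers many cases that would otherwise need parse and eval.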
Here is an example to demonstrate eval() and get(eval())
a <- 1
b <- 2
var_list <- c('a','b')
for(var in var_list)
{
print(paste(eval(var),' : ', get(eval(var))))
}
This gives:
[1] "a  :  1"
[1] "b  :  2"
Here is a purrr example to do this for multiple vectors
library(purrr)  # provides map() and re-exports %>%
text1 = "Somewhere over the rainbow"
text2 = "All I want for Christmas is you"
text3 = "All too well"
text4 = "Save your tears"
text5 = "Meet me at our spot"
songs = (map(paste0("text", 1:5), get) %>% unlist)
songs
This gives
[1] "Somewhere over the rainbow"
[2] "All I want for Christmas is you"
[3] "All too well"
[4] "Save your tears"
[5] "Meet me at our spot"
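The same lookup can be sketched without purrr using base R's mget, which fetches several objects by name at once. Only three of the variables are repeated here to keep the example short, and passing envir = environment() is an assumption made so the lookup happens in the environment where the variables were defined.

```r
text1 <- "Somewhere over the rainbow"
text2 <- "All I want for Christmas is you"
text3 <- "All too well"

# mget() returns a named list; unlist() flattens it to a character vector
songs <- unlist(mget(paste0("text", 1:3), envir = environment()),
                use.names = FALSE)
songs
```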
