Is there a way to find all words that are associated with a given phrase?
For example, say I want to find all of the words next to the word "illness" in a string. The string contains the word "illness" many times, and I want to find all of the terms surrounding it, such as "has illness," "does not have illness," "might have illness," etc.
If you just have a vector of strings (isolated or as a column in a data.frame), then perhaps:
s <- c("hello world illness quux bar not action illness",
"not an illness", "no positive", "illness", "illness goodbye")
ret <- lapply(list(gregexpr("(\\S+)(?= illness)", s, perl = TRUE),
gregexpr("(?<=illness )(\\S+)", s, perl = TRUE)),
regmatches, x = s)
Map(c, ret[[1]], ret[[2]])
# [[1]]
# [1] "world" "action" "quux"
# [[2]]
# [1] "an"
# [[3]]
# character(0)
# [[4]]
# character(0)
# [[5]]
# [1] "goodbye"
Each gregexpr finds a word (well, contiguous non-whitespace) followed by the literal " illness" or preceded by the literal "illness ". Because it returns a list with enough information to extract the substrings from the original, we use regmatches(x=s, ...) to extract the components.
(The Map command deals with the fact that the lapply result is a list, length 2, one for each regex. It merely concatenates the substrings from the first regex with the substrings from the second regex, "zipping" the matches for each string within the vector. If you look at ret[[1]] and/or ret as a whole, the benefit of this might make more sense.)
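For instance, ret[[1]] by itself (just the before-"illness" matches, assuming the code above has been run) has one component per input string:
ret[[1]]
# [[1]]
# [1] "world"  "action"
# [[2]]
# [1] "an"
# [[3]]
# character(0)
# [[4]]
# character(0)
# [[5]]
# character(0)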
If you don't care in which string in the vector/column the surrounding words are found, then you can simply unlist this:
unlist(ret)
# [1] "world" "action" "an" "quux" "goodbye"
Quite unclear what pattern exactly you want to match. So here are some options, using str_extract from the stringr package and positive lookahead:
dt <- c("I mean he's got to have some illness I suppose that",
"whether you've had a serious illness or [unclear]",
"No not what illnesses, terminal illness.",
"Oh, Terminal illness, oh, sorry.",
"Illness benefit on a joint life, last survivor plan?")
str_extract(dt, ".*(?=.(i|I)llness)") # any string prior to "illness"
[1] "I mean it's got to be some some" "whether you've had a serious" "No not what illnesses, terminal"
[4] "Oh, Terminal" NA
str_extract(dt, "(some|serious|terminal)(?=.(i|I)llness)") # specific words prior to "illness"
[1] "some" "serious" "terminal" NA NA
str_extract(dt, "\\w+\\b(?=.(i|I)llness)") # last word prior to "illness"
[1] "some" "serious" "what" "Terminal" NA
Using the kwic function from quanteda. You can specify the number of words surrounding the term you are looking for. In the example I'm using 3, but that can be anything. The advantage is that you can store the outcome in a data.frame and then do some more investigations.
Using the example from @Chris Ruehlemann:
library(quanteda)
dt <- c("I mean he's got to have some illness I suppose that",
"whether you've had a serious illness or [unclear]",
"No not what illnesses, terminal illness.",
"Oh, Terminal illness, oh, sorry.",
"Illness benefit on a joint life, last survivor plan?")
out <- kwic(dt, pattern = "illness", window = 3, valuetype = "regex")
out
[text1, 8] to have some | illness | I suppose that
[text2, 6] had a serious | illness | or [ unclear
[text3, 4] No not what | illnesses | , terminal illness
[text3, 7] illnesses, terminal | illness | .
[text4, 4] Oh, Terminal | illness | , oh,
[text5, 1] | Illness | benefit on a
data.frame(out)
docname from to pre keyword post pattern
1 text1 8 8 to have some illness I suppose that illness
2 text2 6 6 had a serious illness or [ unclear illness
3 text3 4 4 No not what illnesses , terminal illness illness
4 text3 7 7 illnesses , terminal illness . illness
5 text4 4 4 Oh , Terminal illness , oh , illness
6 text5 1 1 Illness benefit on a illness
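As a minimal example of such a follow-up investigation, you could tabulate which forms of the search term actually matched:
df <- data.frame(out)
table(tolower(df$keyword))
  illness illnesses
        5         1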
I am trying to count the number of | characters in a string. This is my code, but it gives the incorrect answer of 32 instead of 2. Why is this happening, and how do I get a function that returns 2? Thanks!
> levels
[1] "Completely|Partially|Not at all"
> str_count(levels, '|')
[1] 32
Also how do I separate the string by the | character? I would like the output to be a character vector of length 3: 'Completely', 'Partially', 'Not at all'.
The | is meaningful in regex as an "or"-like operator. Escape it with backslashes.
stringr::str_count("Completely|Partially|Not at all", "\\|")
# [1] 2
To show what | is normally used for, let's count the occurrences of el and al:
stringr::str_count("Completely|Partially|Not at all", "al")
# [1] 2
stringr::str_count("Completely|Partially|Not at all", "el")
# [1] 1
stringr::str_count("Completely|Partially|Not at all", "el|al")
# [1] 3
To look for the literal | symbol, it needs to be escaped.
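Alternatively, stringr's fixed() modifier treats the pattern as a literal string, so no escaping is needed:
stringr::str_count("Completely|Partially|Not at all", stringr::fixed("|"))
# [1] 2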
To split the string by the | symbol, we can use strsplit (base R) or stringr::str_split:
strsplit("Completely|Partially|Not at all", "\\|")
# [[1]]
# [1] "Completely" "Partially" "Not at all"
It's returned as a list, because the argument may be a vector. For instance, it might be clearer if we do
vec <- c("Completely|Partially|Not at all", "something|else")
strsplit(vec, "\\|")
# [[1]]
# [1] "Completely" "Partially" "Not at all"
# [[2]]
# [1] "something" "else"
The pipe | character is a regex metacharacter and needs to be escaped:
levels <- "Completely|Partially|Not at all"
str_count(levels, '\\|')
Another general trick you can use here is to compare the length of the input against the same with all pipes stripped:
nchar(levels) - nchar(gsub("|", "", levels, fixed=TRUE))
[1] 2
Addendum: Use strsplit:
unlist(strsplit(levels, "\\|"))
[1] "Completely" "Partially" "Not at all"
(This is a follow-up to Regex in R: match collocates of node word.)
I want to extract word combinations (collocates) to the left and to the right of a target word (node) and store the three elements in a dataframe.
Data:
GO <- c("This little sentence went on and went on. It was going on for quite a while. Going on for ages. It's still going on. And will go on and on, and go on forever.")
Aim:
The target word is the verb GO in any of its possible realizations, be it 'go', 'going', 'goes', 'gone', or 'went', and I'm interested in extracting 3 words to the left of GO and 3 words to the right. The three words can cross sentence boundaries, but the extracted strings should not include punctuation.
What I've tried so far:
To extract left-hand collocates I've used str_extract_all from stringr:
unlist(str_extract_all(GO, "((\\s)?\\w+\\b){1,3}(?=\\s((g|G)o(es|ing|ne)?|went))"))
[1] "This little sentence" " went on and" " It was" "s still"
[5] " And will" " and"
This captures most but not all matches and includes spaces.
The extraction of the node, by contrast, looks okay:
unlist(str_extract_all(GO, "(g|G)o(es|ing|ne)?|went"))
[1] "went" "went" "going" "Going" "going" "go" "go"
To extract the right hand collocates:
unlist(str_extract_all(GO, "(?<=(g|G)o(es|ing|ne)?|went)(\\s\\w+\\b){1,3}"))
[1] " on and went" " on" " on for quite" " on for ages" " on" " on and on"
[7] " on forever"
Again the matches are incomplete and unwanted spaces are included.
And finally assembling all the matches in a dataframe throws an error:
collocates <- data.frame(
Left = unlist(str_extract_all(GO, "((\\s)?\\w+\\b){1,3}(?=\\s((g|G)o(es|ing|ne)?|went))")),
Node = unlist(str_extract_all(GO, "(g|G)o(es|ing|ne)?|went")),
Right = unlist(str_extract_all(GO, "(?<=(g|G)o(es|ing|ne)?|went)(\\s\\w+\\b){1,3}"))); collocates
Error in data.frame(Left = unlist(str_extract_all(GO, "((\\s)?\\w+\\b){1,3}(?=\\s((g|G)o(es|ing|ne)?|went))")), :
arguments imply differing number of rows: 6, 7
Expected output:
Left Node Right
This little sentence went on and went
went on and went on It was
on It was going on for quite
quite a while Going on for ages
ages It’s still going on And will
on And will go on and on
and on and go on forever
Does anyone know how to fix this? Suggestions much appreciated.
If you use quanteda, you can get the following result. When you deal with texts, you usually want to work in lowercase, so I converted capital letters with tolower(). I also removed . and , with gsub(). Then I applied kwic() to the text. If you do not mind losing capital letters, dots, and commas, you get pretty much what you want.
library(quanteda)
library(dplyr)
library(splitstackshape)
myvec <- c("go", "going", "goes", "gone", "went")
mytext <- gsub(x = tolower(GO), pattern = "\\.|,", replacement = "")
mydf <- kwic(x = mytext, pattern = myvec, window = 3) %>%
as_tibble %>%
select(pre, keyword, post) %>%
cSplit(splitCols = c("pre", "post"), sep = " ", direction = "wide", type.convert = FALSE) %>%
select(contains("pre"), keyword, contains("post"))
pre_1 pre_2 pre_3 keyword post_1 post_2 post_3
1: this little sentence went on and went
2: went on and went on it was
3: on it was going on for quite
4: quite a while going on for ages
5: ages it's still going on and will
6: on and will go on and on
7: and on and go on forever <NA>
A little late but not too late for posterity or contemporaries doing collocation research on unannotated text, here's my own answer to my question. Full credit is given to @jazzurro's pointer to quanteda and his answer.
My question was: how to compute collocates of a given node in a text and store the results in a dataframe (that's the part not addressed by @jazzurro).
Data:
GO <- c("This little sentence went on and went on. It was going on for quite a while.
Going on for ages. It's still going on. And will go on and on, and go on forever.")
Step 1: Prepare data for analysis
go <- gsub("[.!?;,:]", "", tolower(GO)) # get rid of punctuation
go <- gsub("'", " ", tolower(go)) # separate clitics from host
Step 2: Extract KWIC using regex pattern and argument valuetype = "regex"
concord <- kwic(go, "go(es|ing|ne)?|went", window = 3, valuetype = "regex")
concord
[text1, 4] this little sentence | went | on and went
[text1, 7] went on and | went | on it was
[text1, 11] on it was | going | on for quite
[text1, 17] quite a while | going | on for ages
[text1, 24] it s still | going | on and will
[text1, 28] on and will | go | on and on
[text1, 33] and on and | go | on forever
Step 3: Identify strings with fewer collocates than defined by window:
# Number of collocates on the left:
concord$nc_l <- lengths(strsplit(concord$pre, " ")); concord$nc_l
[1] 3 3 3 3 3 3 3 # nothing missing here
# Number of collocates on the right:
concord$nc_r <- lengths(strsplit(concord$post, " ")); concord$nc_r
[1] 3 3 3 3 3 3 2 # last string has only two collocates
Step 4: Add NA to strings with missing collocates:
# define window:
window <- 3
# change string:
concord$post[!concord$nc_r == window] <- paste(concord$post[!concord$nc_r == window], NA, sep = " ")
Step 5: Fill the dataframe with slots for collocates and node, using str_extract from the stringr library as well as regex with lookarounds to determine the split points for collocates:
library(stringr)
L3toR3 <- data.frame(
L3 = str_extract(concord$pre, "^\\w+\\b"),
L2 = str_extract(concord$pre, "(?<=\\s)\\w+\\b(?=\\s)"),
L1 = str_extract(concord$pre, "\\w+\\b$"),
Node = concord$keyword,
R1 = str_extract(concord$post, "^\\w+\\b"),
R2 = str_extract(concord$post, "(?<=\\s)\\w+\\b(?=\\s)"),
R3 = str_extract(concord$post, "\\w+\\b$")
)
Result:
L3toR3
L3 L2 L1 Node R1 R2 R3
1 this little sentence went on and went
2 went on and went on it was
3 on it was going on for quite
4 quite a while going on for ages
5 it s still going on and will
6 on and will go on and on
7 and on and go on forever NA
Is there any way to replace a range of numbers with single numbers in a character string? The numbers can range from n-n, most probably around 1-15; 4-10 is also possible.
The range could be indicated a) with a hyphen -
a <- "I would like to buy 1-3 cats"
or b) with a word, for example: to, bis, jusqu'à
b <- "I would like to buy 1 jusqu'à 3 cats"
The results should look like
"I would like to buy 1,2,3 cats"
I found this: Replace range of numbers with certain number but could not really use it in R.
gsubfn in the gsubfn package is like gsub, but instead of replacing the match with a replacement string, it allows the user to specify a function (possibly in formula notation, as done here). It then passes the matches to the capture groups in the regular expression, i.e. the matches to the parenthesized parts of the regular expression, as separate arguments, and replaces the entire match with the output of the function. Thus we match "(\\d+)(-| to | bis | jusqu'à )(\\d+)", which results in three capture groups and hence 3 arguments to the function. In the function we use seq with the first and third of these. Note that seq can take character arguments and interpret them as numeric, so we did not have to convert the arguments to numeric.
Thus we get this one-liner:
library(gsubfn)
s <- c(a, b) # test input strings
gsubfn("(\\d+)(-| to | bis | jusqu'à )(\\d+)", ~ paste(seq(..1, ..3), collapse = ","), s)
giving:
[1] "I would like to buy 1,2,3 cats" "I would like to buy 1,2,3 cats"
Not the most efficient, but ...
s <- c("I would like to buy 1-3 cats",
"I would like to buy 1 jusqu'à 3 cats",
"foo 22-33",
"quux 11-3 bar")
# find each range substring, e.g. "1-3" or "1 jusqu'à 3"
gre <- gregexpr("([0-9]+(-| to | bis | jusqu'à )[0-9]+)", s)
# within each range substring, find the two numbers
gre2 <- gregexpr('[0-9]+', regmatches(s, gre))
# replace each range substring with the expanded, comma-separated sequence
regmatches(s, gre) <- lapply(regmatches(regmatches(s, gre), gre2),
                             function(a) paste(do.call(seq, as.list(as.integer(a))), collapse = ","))
s
# [1] "I would like to buy 1,2,3 cats" "I would like to buy 1,2,3 cats"
# [3] "foo 22,23,24,25,26,27,28,29,30,31,32,33" "quux 11,10,9,8,7,6,5,4,3 bar"
This is, in fact, a little tricky, unless someone has already written a package that does this (that I'm not aware of).
a <- "I would like to buy 1-3 cats"
# start positions of each "digits followed by non-digits" run
pos <- unlist(gregexpr("\\d+\\D+", a))
a_split <- unlist(strsplit(a, ""))
# single-digit endpoints assumed here: pick them out character by character
replacement <- paste(seq.int(as.integer(a_split[pos[1]]), as.integer(a_split[pos[2]])), collapse = ",")
gsub("\\d+\\D+\\d+", replacement, a)
# [1] "I would like to buy 1,2,3 cats"
EDIT: To show that the same solution works for arbitrary non-digit characters between two numbers:
b <- "I would like to buy 1 jusqu'à 3 cats"
pos_b <- unlist(gregexpr("\\d+\\D+", b))
b_split <- unlist(strsplit(b, ""))
replacement <- paste(seq.int(as.integer(b_split[pos_b[1]]), as.integer(b_split[pos_b[2]])), collapse = ",")
gsub("\\d+\\D+\\d+", replacement, b)
# [1] "I would like to buy 1,2,3 cats"
You can add arbitrary requirements for the run of non-digit characters if you'd like. If you need help with that, just share what the limits are on the words or symbols that can appear between the numbers!
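For example, here's a sketch that only treats a hyphen or one of the known words as a range marker (adjust the alternation to your data; this variant also handles multi-digit numbers):
sep <- "\\s*(-|to|bis|jusqu'à)\\s*"
rng <- regmatches(b, regexpr(paste0("\\d+", sep, "\\d+"), b))
nums <- as.integer(unlist(regmatches(rng, gregexpr("\\d+", rng))))
gsub(paste0("\\d+", sep, "\\d+"), paste(seq(nums[1], nums[2]), collapse = ","), b)
# [1] "I would like to buy 1,2,3 cats"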
I have a data set, like the following:
cp<-data.frame("name"=c("billy", "jean", "jean", "billy","billy", "dawn", "dawn"),
"answer"=c("michael jackson is my favorite", "I like flowers", "flower is red","hey michael",
"do not touch me michael","i am a girl","girls have hair"))
Every name has a string attached to it, stored in the variable answer. I would like to find out which specific words, parts of words, or whole sentences in the answer variable are common across the answers for each name in name:
For example, the name "billy" would have "michael" connected to it.
EDIT:
A data frame with following variables called ddd:
name: debby answer: "did you go to dallas?"
name: debby answer: "debby did dallas"
function(name=debby,data=ddd) {...} ,
which gives output "did debby dallas".
Here's a (not very efficient) function I've made that uses pmatch in order to match partial matches. The problem with it is that it will also match a and am, or i and is, because they are also very close.
freqFunc <- function(x){
  # split all answers into lowercase words
  temp <- tolower(unlist(strsplit(as.character(x), " ")))
  temp2 <- length(temp)
  # for each word, keep the words it partially matches (if more than itself)
  temp3 <- lapply(temp, function(x){
    temp4 <- na.omit(temp[pmatch(rep(x, temp2), temp)])
    temp4[length(temp4) > 1]
  })
  list(unique(unlist(temp3)))
}
library(data.table)
setDT(cp)[, lapply(.SD, freqFunc), by = name, .SDcols = "answer"]
# name answer
# 1: billy michael
# 2: jean i,is,flower,flowers
# 3: dawn a,am,girl,girls
If you're satisfied with just exact matches, this can be simplified considerably, which also improves performance (I also added tolower so it will match across different cases too):
freqFunc2 <- function(x){
temp <- table(tolower(unlist(strsplit(as.character(x), " "))))
list(names(temp[temp > 1]))
}
library(data.table)
setDT(cp)[, lapply(.SD, freqFunc2), by = name, .SDcols = "answer"]
# name answer
# 1: billy michael
# 2: jean
# 3: dawn
With the caveat of understanding correctly, I think this is what you're looking for. It doesn't handle plurals of words though, as David mentioned. This just finds words that are exactly the same.
billyAnswers<-cp$answer[cp$name=="billy"]
#output of billyAnswers
#[1] "michael jackson is my favorite" "hey michael"
#[3] "do not touch me michael"
Now we get all the words
allWords <- unlist(strsplit(billyAnswers, " "))
#output of allWords
# [1] "michael" "jackson" "is" "my" "favorite" "hey"
# [7] "michael" "do" "not" "touch" "me" "michael"
We can find the common ones
common<-allWords[duplicated(allWords)]
#output of common
#[1] "michael" "michael"
Of course there are two michaels because there are multiple instances of michael in billy's answers! So let's pare it down once more.
unique(common)
#[1] "michael"
And there you go: apply that to all names and you've got it (see the sketch after these outputs).
For jean and dawn, there are no common words in their answers, so this method returns two character vectors of length 0:
#jean's words
#[1] "I" "like" "flowers" "flower" "is" "red"
#dawn's words
#[1] "i" "am" "a" "girl" "girls" "have" "hair"
I have a file with several string (text) variables where each respondent has written a sentence or two for each variable. I want to be able to find the frequency of each combination of words (i.e. how often "capability" occurs with "performance").
My code so far goes:
#Setting up the data file
data.text <- scan("C:/temp/tester.csv", what="char", sep="\n")
#Change everything to lower text
data.text <- tolower(data.text)
#Split the strings into separate words
data.words.list <- strsplit(data.text, "\\W+", perl=TRUE)
data.words.vector <- unlist(data.words.list)
#List each word and frequency
data.freq.list <- table(data.words.vector)
This gives me a list of each word and how often it appears in the string variables. Now I want to see the frequency of every 2 word combination. Is this possible?
Thanks!
An example of the string data:
ID Reason_for_Dissatisfaction Reason_for_Likelihood_to_Switch
1 "not happy with the service" "better value at other place"
2 "poor customer service" "tired of same old thing"
3 "they are overchanging me" "bad service"
I'm not sure if this is what you mean, but rather than splitting on every two-word boundary (which I found a pain to try to regex), you could paste every two words together using the trusty head and tail slip trick...
# How I read your data
df <- read.table( text = 'ID Reason_for_Dissatisfaction Reason_for_Likelihood_to_Switch
1 "not happy with the service" "better value at other place"
2 "poor customer service" "tired of same old thing"
3 "they are overchanging me" "bad service"
' , h = TRUE , stringsAsFactors = FALSE )
# Split to words
wlist <- sapply( df[,-1] , strsplit , split = "\\W+", perl=TRUE)
# Paste word pairs together
outl <- sapply( wlist , function(x) paste( head(x,-1) , tail(x,-1) , sep = " ") )
# Table as per usual
table(unlist( outl ) )
are overchanging at other bad service better value customer service
1 1 1 1 1
happy with not happy of same old thing other place
1 1 1 1 1
overchanging me poor customer same old the service they are
1 1 1 1 1
tired of value at with the
1 1 1