Extract sentences from texts in data frame - r

I have a data frame with a column "text", and in each row "text" contains several sentences (maybe only two, maybe 100 or more). I would like to search the text in every row for specific keywords. If a keyword is found in a row's text, I would like to extract the sentences that contain keywords into a separate column, for example:
needles = c("first", "hope", "analyze", "happy")
mydata <- data.frame(
text = c("This is the first sentence. It is the beginning of this project",
"My second sentence is this. I hope this project will work fine. Then I will analyze the third sentence.",
"And this is the last sentence. Finally my work ends. I am really happy about that.",
"These sentences do not contain any relevant information. There is no keyword. And it is not relevant."),
findings = c("This is the first sentence.",
"I hope this project will work fine. Then I will analyze the third sentence.",
"I am really happy about that.",
NA)
)
So column "text" contains the sentences I want to check for keywords, "findings" is the result I would like to have in the end.
Can anyone help me apply this to all rows of the data frame?
Thank you!

We can work with list columns: split each row into its sentences, then look for the needles in each resulting sentence of each row.
The reduce calls flatten the nested lists: the first pastes multiple matching sentences of a row into one string, and the last collapses the list column into a character vector.
Code:
library(tidyverse)
needles <- c("first", "hope", "analyze", "happy")
mydata <- data.frame(
text = c(
"This is the first sentence. It is the beginning of this project",
"My second sentence is this. I hope this project will work fine. Then I will analyze the third sentence.",
"And this is the last sentence. Finally my work ends. I am really happy about that.",
"These sentences do not contain any relevant information. There is no keyword. And it is not relevant."
),
findings = c(
"This is the first sentence.",
"I hope this project will work fine. Then I will analyze the third sentence.",
"I am really happy about that.",
NA
)
)
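df <- as_tibble(mydata) %>%
  mutate(
    # split each text into sentences, keep those containing any needle
    findings = str_split(text, "\\.\\s") %>%
      map(~ str_subset(., rebus::or1(needles))) %>%
      # paste multiple matching sentences back into one string
      map_if(~ length(.) > 1, ~ reduce(., ~ paste(.x, .y, sep = ". "))),
    # rows without a match become NA; then flatten the list to a vector
    findings = map_if(findings, ~ length(.) == 0, ~NA) %>% reduce(c)
  )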
df
#> # A tibble: 4 × 2
#> text findings
#> <chr> <chr>
#> 1 This is the first sentence. It is the … This is the first sentence
#> 2 My second sentence is this. I hope thi… I hope this project will work fine. T…
#> 3 And this is the last sentence. Finally… I am really happy about that.
#> 4 These sentences do not contain any rel… <NA>
Created on 2021-11-27 by the reprex package (v2.0.1)

What about something like this:
find_sentence <- function(text, word){
require(stringr)
x <- c(str_split(text, "\\..", simplify=TRUE))
inds <- which(str_detect(x, word))
if(length(inds) > 0){
list(x[inds])
}else{
list(NA)
}
}
mydata %>%
rowwise %>%
mutate(res = find_sentence(text, "the")) %>%
unnest(res)
# # A tibble: 4 × 3
# text findings res
# <chr> <chr> <chr>
# 1 This is the first sentence. It is the beginning of this project This is the first sentence. This is the fi…
# 2 This is the first sentence. It is the beginning of this project This is the first sentence. It is the begi…
# 3 My second sentence is this. I hope this project will work fine. Then I will analyze the third sentence. I hope this project will wo… Then I will an…
# 4 And this is the last sentence. Finally my work ends. I am really happy about that. I am really happy about tha… And this is th…
This returns a new variable called res that has a separate row for each sentence containing the keyword. So, if two sentences contain the word (as in the first row of text), the text and findings columns are repeated for each of the relevant sentences in res.

With Base R,
lookup <- strsplit(as.character(mydata[,1]),"\\.")
out <- lapply(lookup,function(x) {
logic <- grepl(paste0(needles,collapse="|"),x)
paste0(x[logic],collapse=".")
})
data.frame(findings = do.call(rbind,out) )
gives,
# findings
#1 This is the first sentence
#2 I hope this project will work fine. Then I will analyze the third sentence
#3 I am really happy about that
#4
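The result above loses the final period of each match and returns an empty string rather than NA for the last row. A minimal sketch that addresses both, assuming every sentence ends with a period followed by whitespace; `extract_sentences` is a hypothetical helper name, not part of any package, and keywords are matched as plain substrings:

```r
# Hypothetical helper: keep the sentences that contain any keyword.
# Assumes sentences end in "." followed by whitespace.
extract_sentences <- function(text, needles) {
  pattern <- paste(needles, collapse = "|")
  # Split on whitespace that follows a period; the lookbehind keeps
  # the period attached to its sentence
  sentences <- strsplit(text, "(?<=\\.)\\s+", perl = TRUE)
  vapply(sentences, function(s) {
    hits <- s[grepl(pattern, s)]
    if (length(hits) == 0) NA_character_ else paste(hits, collapse = " ")
  }, character(1))
}

needles <- c("first", "hope", "analyze", "happy")
texts <- c(
  "This is the first sentence. It is the beginning of this project",
  "These sentences do not contain any relevant information. There is no keyword."
)
result <- extract_sentences(texts, needles)
print(result)
```

Because the function is vectorised over `text`, it could be assigned directly to a column, e.g. `mydata$findings <- extract_sentences(mydata$text, needles)`.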

This uses grep and strsplit to get the matches.
mydata$findings <- sapply( strsplit( as.character(mydata$text), "\\. " ), function(x)
x[unlist( lapply( needles, function(y) grep(y, x) ) )] )
text
1 This is the first sentence. It is the beginning of this project
2 My second sentence is this. I hope this project will work fine. Then I will analyze the third sentence.
3 And this is the last sentence. Finally my work ends. I am really happy about that.
4 These sentences do not contain any relevant information. There is no keyword. And it is not relevant.
findings
1 This is the first sentence
2 I hope this project will work fine, Then I will analyze the third sentence.
3 I am really happy about that.
4

Related

Stringr pattern to detect capitalized words

I am trying to write a function to detect words that are written entirely in capital letters.
My current code:
df <- data.frame(title = character(), id = numeric())%>%
add_row(title= "THIS is an EXAMPLE where I DONT get the output i WAS hoping for", id = 6)
df <- df %>%
mutate(sec_code_1 = unlist(str_extract_all(title," [A-Z]{3,5} ")[[1]][1])
, sec_code_2 = unlist(str_extract_all(title," [A-Z]{3,5} ")[[1]][2])
, sec_code_3 = unlist(str_extract_all(title," [A-Z]{3,5} ")[[1]][3]))
df
Where output is:
title                                                            id  sec_code_1  sec_code_2  sec_code_3
THIS is an EXAMPLE where I DONT get the output i WAS hoping for   6  DONT        WAS
The first capitalized word of 3-5 letters is "THIS"; the second match should skip "EXAMPLE" (more than 5 letters) and be "DONT"; the third should be "WAS".
i.e.:
title                                                            id  sec_code_1  sec_code_2  sec_code_3
THIS is an EXAMPLE where I DONT get the output i WAS hoping for   6  THIS        DONT        WAS
Does anyone know where I'm going wrong? Specifically, how can I express "space or beginning of string" or "space or end of string" using stringr?
If you run the code with your regex you'll realise 'THIS' is not included in the output at all.
str_extract_all(df$title," [A-Z]{3,5} ")[[1]]
#[1] " DONT " " WAS "
This is because you are extracting words with leading and trailing whitespace. 'THIS' has no leading whitespace because it is at the start of the sentence, hence it does not satisfy the regex pattern. You can use word boundaries (\\b) instead.
str_extract_all(df$title,"\\b[A-Z]{3,5}\\b")[[1]]
#[1] "THIS" "DONT" "WAS"
Your code would work if you use the above pattern in it.
Or you could also use :
library(tidyverse)
df %>%
mutate(code = str_extract_all(title,"\\b[A-Z]{3,5}\\b")) %>%
unnest_wider(code) %>%
rename_with(~paste0('sec_code_', seq_along(.)), starts_with('..'))
# title id sec_code_1 sec_code_2 sec_code_3
# <chr> <dbl> <chr> <chr> <chr>
#1 THIS is an EXAMPLE where I DONT get t… 6 THIS DONT WAS
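The question literally asks for "space or beginning of string" and "space or end of string". Word boundaries are usually enough, but they also fire next to punctuation such as hyphens. A sketch of the stricter whitespace-only variant using lookarounds, written in base R; `extract_caps` is a hypothetical helper and the second test string is made up for illustration:

```r
# Hypothetical helper: return all matches of a PCRE pattern in one string
extract_caps <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

title <- "THIS is an EXAMPLE where I DONT get the output i WAS hoping for"

# Word boundaries: match at start/end of string and next to any non-word character
bounded <- extract_caps(title, "\\b[A-Z]{3,5}\\b")

# Lookarounds: "not preceded / not followed by a non-whitespace character",
# i.e. whitespace or start/end of string on each side
spaced <- extract_caps(title, "(?<!\\S)[A-Z]{3,5}(?!\\S)")

# The two differ when punctuation touches a word (made-up example)
hyphen_bounded <- extract_caps("ABC-DEF is FINE", "\\b[A-Z]{3,5}\\b")
hyphen_spaced  <- extract_caps("ABC-DEF is FINE", "(?<!\\S)[A-Z]{3,5}(?!\\S)")

print(bounded)
print(spaced)
print(hyphen_bounded)
print(hyphen_spaced)
```

The same patterns should work unchanged in `str_extract_all()`, since stringr's regex engine supports these lookarounds as well.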

In R, compare string variable of two dataframes to create new flag variable indicating match in both dataframes, using a for-loop?

I have two dataframes which I would like to compare. One of them contains a complete list of sentences as a string variable as well as manually assigned codes of 0 and 1 (i.e. data.1). The second dataframe contains a subset of the sentences of the first dataframe and is reduced to those sentences that were matched by a dictionary.
This is, in essence, what these two datasets look like:
data.1 = data.frame(texts = c("This is a sentence", "This is another sentence", "This is not a sentence", "Yet another sentence"),
code = c(1,1,0,1))
data.2 = data.frame(texts = c("This is not a sentence", "This is a sentence"),
code = c(1,1))
I would like to merge the results of the data.2 into data.1 and ideally create a new code_2 variable there that indicates whether a sentence was matched by the dictionary. This would yield something like this:
> data.1
texts code code_2
1 This is a sentence 1 1
2 This is another sentence 1 0
3 This is not a sentence 0 1
4 Yet another sentence 1 0
To make this slightly more difficult, and as you can see above, the sentences in data.2 are not just a subset of data.1 but they may also be in a different order (e.g. "This is not a sentence" is in the third row of the first dataframe but in the first row of the second dataframe).
I was thinking that looping through all of the texts of data.1 would do the trick, but I'm not sure how to implement this.
for (i in 1:nrow(data.1)) {
# For each i in data.1...
# compare sentence to ALL sentences in data.2...
# create a new variable called "code_2"...
# assign a 1 if a sentence occurs in both dataframes...
# and a 0 otherwise (i.e. if that sentence only occurs in `data.1` but not in `data.2`).
}
Note: My question is similar to this one, where the string variable "Letter" corresponds to my "texts", yet the problem is somewhat different, since the matching of sentences itself is the basis for the creation of a new flag variable in my case (which is not the case in said other question).
Can you just join the data frames?
NOTE: Added replace_na to substitute missing values with 0.
data.1 = data.frame(texts = c("This is a sentence", "This is another sentence", "This is not a sentence", "Yet another sentence"),
code = c(1,1,0,1))
data.2 = data.frame(texts = c("This is not a sentence", "This is a sentence"),
code = c(1,1))
data.1 %>% dplyr::left_join(data.2, by = 'texts') %>%
dplyr::mutate(code.y = tidyr::replace_na(code.y, 0))
I believe the following match based solution does what the question asks for.
i <- match(data.2$texts, data.1$texts)
i <- sort(i)
data.1$code_2 <- 0L
data.1$code_2[i] <- data.2$code[seq_along(i)]
data.1
# texts code code_2
#1 This is a sentence 1 1
#2 This is another sentence 1 0
#3 This is not a sentence 0 1
#4 Yet another sentence 1 0
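Note that sorting `i` is only harmless here because every code in data.2 is 1; with mixed codes the values would no longer line up with their sentences. A sketch that aligns by text instead of by position, assuming exact string equality is the matching criterion; the column name `code_2_value` is introduced only for illustration:

```r
data.1 <- data.frame(
  texts = c("This is a sentence", "This is another sentence",
            "This is not a sentence", "Yet another sentence"),
  code = c(1, 1, 0, 1)
)
data.2 <- data.frame(
  texts = c("This is not a sentence", "This is a sentence"),
  code = c(1, 1)
)

# Flag: 1 if the sentence also occurs in data.2, 0 otherwise
data.1$code_2 <- as.integer(data.1$texts %in% data.2$texts)

# Variant: carry over data.2's own code, looked up by text
idx <- match(data.1$texts, data.2$texts)  # NA where there is no match
data.1$code_2_value <- ifelse(is.na(idx), 0, data.2$code[idx])

print(data.1)
```

No explicit loop is needed, because `%in%` and `match()` compare every sentence in data.1 against all sentences in data.2 at once.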

Exact match from list of words from a text in R

I have a list of words and I am looking for those words in the text.
The problem is that the last column always says "Found", because the search matches the patterns anywhere in the text. I am looking for exact matches of the words, not words that merely contain them. For the first three records the result should be "Not Found".
Please advise where I am going wrong.
col_1 <- c(1,2,3,4,5)
col_2 <- c("work instruction change",
"technology npi inspections",
" functional locations",
"Construction has started",
" there is going to be constn coon")
df <- as.data.frame(cbind(col_1,col_2))
df$col_2 <- tolower(df$col_2)
words <- c("const","constn","constrction","construc",
"construct","construction","constructs","consttntype","constypes","ct","ct#",
"ct2"
)
pattern_words <- paste(words, collapse = "|")
df$result<- ifelse(str_detect(df$col_2, regex(pattern_words)),"Found","Not Found")
Use word boundaries around the words.
library(stringr)
pattern_words <- paste0('\\b', words, '\\b', collapse = "|")
df$result <- c('Not Found', 'Found')[str_detect(df$col_2, pattern_words) + 1]
#OR with `ifelse`
#df$result <- ifelse(str_detect(df$col_2, pattern_words), "Found", "Not Found")
df
# col_1 col_2 result
#1 1 work instruction change Not Found
#2 2 technology npi inspections Not Found
#3 3 functional locations Not Found
#4 4 construction has started Found
#5 5 there is going to be constn coon Found
You can also use grepl here to keep it in base R :
grepl(pattern_words, df$col_2)
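An alternative that avoids regular expressions altogether is to split each text into tokens and compare whole tokens against the word list. This is a sketch under the assumption that words are separated by whitespace only; it uses a shortened version of the word list, and a token with attached punctuation (for example "construction,") would not match:

```r
col_2 <- tolower(c("work instruction change",
                   "technology npi inspections",
                   " functional locations",
                   "Construction has started",
                   " there is going to be constn coon"))
words <- c("const", "constn", "construction", "ct", "ct#", "ct2")

# Split each text on whitespace and test whole tokens against the word list
tokens <- strsplit(trimws(col_2), "\\s+")
found <- vapply(tokens, function(tok) any(tok %in% words), logical(1))
result <- ifelse(found, "Found", "Not Found")
print(result)
```

Because `%in%` compares strings literally, entries such as "ct#" need no escaping here.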

Regex in R: how to fill dataframe with multiple matches to left and right of target string

(This is a follow-up to Regex in R: match collocates of node word.)
I want to extract word combinations (collocates) to the left and to the right of a target word (node) and store the three elements in a dataframe.
Data:
GO <- c("This little sentence went on and went on. It was going on for quite a while. Going on for ages. It's still going on. And will go on and on, and go on forever.")
Aim:
The target word is the verb GO in any of its possible realizations, be it 'go', 'going', 'goes', 'gone', or 'went', and I'm interested in extracting 3 words to the left of GO and 3 words to the right of GO. The three words can cross sentence boundaries but the extracted strings should not include punctuation.
What I've tried so far:
To extract left-hand collocates I've used str_extract_all from stringr:
unlist(str_extract_all(GO, "((\\s)?\\w+\\b){1,3}(?=\\s((g|G)o(es|ing|ne)?|went))"))
[1] "This little sentence" " went on and" " It was" "s still"
[5] " And will" " and"
This captures most but not all matches and includes spaces.
The extraction of the node, by contrast, looks okay:
unlist(str_extract_all(GO, "(g|G)o(es|ing|ne)?|went"))
[1] "went" "went" "going" "Going" "going" "go" "go"
To extract the right hand collocates:
unlist(str_extract_all(GO, "(?<=(g|G)o(es|ing|ne)?|went)(\\s\\w+\\b){1,3}"))
[1] " on and went" " on" " on for quite" " on for ages" " on" " on and on"
[7] " on forever"
Again the matches are incomplete and unwanted spaces are included.
And finally assembling all the matches in a dataframe throws an error:
collocates <- data.frame(
Left = unlist(str_extract_all(GO, "((\\s)?\\w+\\b){1,3}(?=\\s((g|G)o(es|ing|ne)?|went))")),
Node = unlist(str_extract_all(GO, "(g|G)o(es|ing|ne)?|went")),
Right = unlist(str_extract_all(GO, "(?<=(g|G)o(es|ing|ne)?|went)(\\s\\w+\\b){1,3}"))); collocates
Error in data.frame(Left = unlist(str_extract_all(GO, "((\\s)?\\w+\\b){1,3}(?=\\s((g|G)o(es|ing|ne)?|went))")), :
arguments imply differing number of rows: 6, 7
Expected output:
Left Node Right
This little sentence went on and went
went on and went on It was
on It was going on for quite
quite a while Going on for ages
ages It’s still going on And will
on And will go on and on
and on and go on forever
Does anyone know how to fix this? Suggestions much appreciated.
If you use quanteda, you can get the following result. When working with text, it usually helps to convert everything to lowercase, so I converted capital letters with tolower(). I also removed . and , with gsub(). Then I applied kwic() to the text. If you do not mind losing capital letters, dots, and commas, you get pretty much what you want.
library(quanteda)
library(dplyr)
library(splitstackshape)
myvec <- c("go", "going", "goes", "gone", "went")
mytext <- gsub(x = tolower(GO), pattern = "\\.|,", replacement = "")
mydf <- kwic(x = mytext, pattern = myvec, window = 3) %>%
as_tibble %>%
select(pre, keyword, post) %>%
cSplit(splitCols = c("pre", "post"), sep = " ", direction = "wide", type.convert = FALSE) %>%
select(contains("pre"), keyword, contains("post"))
pre_1 pre_2 pre_3 keyword post_1 post_2 post_3
1: this little sentence went on and went
2: went on and went on it was
3: on it was going on for quite
4: quite a while going on for ages
5: ages it's still going on and will
6: on and will go on and on
7: and on and go on forever <NA>
A little late but not too late for posterity or contemporaries doing collocation research on unannotated text, here's my own answer to my question. Full credit is given to @jazzurro's pointer to quanteda and his answer.
My question was: how to compute collocates of a given node in a text and store the results in a dataframe (that's the part not addressed by @jazzurro).
Data:
GO <- c("This little sentence went on and went on. It was going on for quite a while.
Going on for ages. It's still going on. And will go on and on, and go on forever.")
Step 1: Prepare data for analysis
go <- gsub("[.!?;,:]", "", tolower(GO)) # get rid of punctuation
go <- gsub("'", " ", tolower(go)) # separate clitics from host
Step 2: Extract KWIC using regex pattern and argument valuetype = "regex"
concord <- kwic(go, "go(es|ing|ne)?|went", window = 3, valuetype = "regex")
concord
[text1, 4] this little sentence | went | on and went
[text1, 7] went on and | went | on it was
[text1, 11] on it was | going | on for quite
[text1, 17] quite a while | going | on for ages
[text1, 24] it s still | going | on and will
[text1, 28] on and will | go | on and on
[text1, 33] and on and | go | on forever
Step 3: Identify strings with fewer collocates than defined by window:
# Number of collocates on the left:
concord$nc_l <- lengths(strsplit(concord$pre, " ")); concord$nc_l
[1] 3 3 3 3 3 3 3 # nothing missing here
# Number of collocates on the right:
concord$nc_r <- lengths(strsplit(concord$post, " ")); concord$nc_r
[1] 3 3 3 3 3 3 2 # last string has only two collocates
Step 4: Add NA to strings with missing collocates:
# define window:
window <- 3
# change string:
concord$post[!concord$nc_r == window] <- paste(concord$post[!concord$nc_r == window], NA, sep = " ")
Step 5: Fill dataframe with slots for collocates and node, using str_extract from library stringr as well as regex with lookarounds to determine split points for collocates:
library(stringr)
L3toR3 <- data.frame(
L3 = str_extract(concord$pre, "^\\w+\\b"),
L2 = str_extract(concord$pre, "(?<=\\s)\\w+\\b(?=\\s)"),
L1 = str_extract(concord$pre, "\\w+\\b$"),
Node = concord$keyword,
R1 = str_extract(concord$post, "^\\w+\\b"),
R2 = str_extract(concord$post, "(?<=\\s)\\w+\\b(?=\\s)"),
R3 = str_extract(concord$post, "\\w+\\b$")
)
Result:
L3toR3
L3 L2 L1 Node R1 R2 R3
1 this little sentence went on and went
2 went on and went on it was
3 on it was going on for quite
4 quite a while going on for ages
5 it s still going on and will
6 on and will go on and on
7 and on and go on forever NA
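For comparison, the same table can be approximated in base R by tokenizing the text and indexing a fixed window around each node. This is only a sketch: it assumes whitespace tokenization is good enough, keeps clitics attached (so "it's" stays one token, unlike in the steps above), and pads with NA so that every node gets a full window:

```r
GO <- "This little sentence went on and went on. It was going on for quite a while. Going on for ages. It's still going on. And will go on and on, and go on forever."

# Tokenize: lower-case, drop punctuation (apostrophes kept), split on whitespace
tokens <- strsplit(gsub("[^a-z' ]", "", tolower(GO)), "\\s+")[[1]]

# Positions of the node word in any of its forms
nodes <- grep("^(go(es|ing|ne)?|went)$", tokens)
window <- 3

# Pad with NA on both sides so windows near the edges stay complete
padded <- c(rep(NA, window), tokens, rep(NA, window))
collocates <- do.call(rbind, lapply(nodes + window, function(i) {
  padded[(i - window):(i + window)]
}))
colnames(collocates) <- c("L3", "L2", "L1", "Node", "R1", "R2", "R3")
collocates <- as.data.frame(collocates)
print(collocates)
```

Changing `window` would also require adjusting the column names, which are hard-coded for a window of 3 in this sketch.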

Frequency of occurrence of two-pair combinations in text data in R

I have a file with several string (text) variables where each respondent has written a sentence or two for each variable. I want to be able to find the frequency of each combination of words (i.e. how often "capability" occurs with "performance").
My code so far goes:
#Setting up the data file
data.text <- scan("C:/temp/tester.csv", what="char", sep="\n")
#Change everything to lower text
data.text <- tolower(data.text)
#Split the strings into separate words
data.words.list <- strsplit(data.text, "\\W+", perl=TRUE)
data.words.vector <- unlist(data.words.list)
#List each word and frequency
data.freq.list <- table(data.words.vector)
This gives me a list of each word and how often it appears in the string variables. Now I want to see the frequency of every 2 word combination. Is this possible?
Thanks!
An example of the string data:
ID Reason_for_Dissatisfaction Reason_for_Likelihood_to_Switch
1 "not happy with the service" "better value at other place"
2 "poor customer service" "tired of same old thing"
3 "they are overchanging me" "bad service"
I'm not sure if this is what you mean, but rather than splitting on every two-word boundary (which I found a pain to do with a regex), you could paste every two adjacent words together using the trusty head and tail trick...
# How I read your data
df <- read.table( text = 'ID Reason_for_Dissatisfaction Reason_for_Likelihood_to_Switch
1 "not happy with the service" "better value at other place"
2 "poor customer service" "tired of same old thing"
3 "they are overchanging me" "bad service"
' , h = TRUE , stringsAsFactors = FALSE )
# Split to words
wlist <- sapply( df[,-1] , strsplit , split = "\\W+", perl=TRUE)
# Paste word pairs together
outl <- sapply( wlist , function(x) paste( head(x,-1) , tail(x,-1) , sep = " ") )
# Table as per usual
table(unlist( outl ) )
are overchanging at other bad service better value customer service
1 1 1 1 1
happy with not happy of same old thing other place
1 1 1 1 1
overchanging me poor customer same old the service they are
1 1 1 1 1
tired of value at with the
1 1 1
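If "combination" means any two words appearing in the same response rather than two adjacent words, the pairs can be generated with combn() instead. A sketch on made-up toy responses (not the data from the question), counting in how many responses each unordered pair of distinct words co-occurs:

```r
# Toy responses, invented for illustration only
responses <- c("poor customer service",
               "bad customer service",
               "bad service")

# For each response, list every unordered pair of distinct words
pair_list <- lapply(strsplit(tolower(responses), "\\W+"), function(w) {
  w <- sort(unique(w[nzchar(w)]))
  if (length(w) < 2) return(character(0))
  combn(w, 2, FUN = paste, collapse = " + ")
})

# Count the number of responses in which each pair occurs
pair_freq <- sort(table(unlist(pair_list)), decreasing = TRUE)
print(pair_freq)
```

Sorting the words before pairing makes "customer + service" and "service + customer" count as the same pair; the number of pairs grows quadratically with response length, so this suits short free-text answers best.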
