Extract words that are repeated from one sentence to the next in R

I have sentences from spoken conversation and would like to identify the words that are repeated from sentence to sentence; here's some illustrative data (in reproducible format below):
df
# A tibble: 10 x 1
Orthographic
<chr>
1 "like I don't understand sorry like how old's your mom"
2 "eh sixty-one"
3 "yeah (...) yeah yeah like I mean she's not like in the risk age group but still"
4 "yeah"
5 "HH"
6 "I don't know"
7 "yeah I talked to my grandparents last night and last time I talked to them it was like two weeks…
8 "yeah"
9 "she said you should come home probably "
10 "no and like why would you go to the airport where people have corona sit in the plane where peop…
I've had some success extracting the repeated words using a for loop, but I also get some strange results. Here's what I've been doing so far:
library(stringr)
# initialize pattern and new column `rept` in `df`:
pattern1 <- c()
df$rept <- NA
# for loop:
for (i in 2:nrow(df)) {
  pattern1[i-1] <- paste0("\\b(", paste0(unlist(str_split(df$Orthographic[i-1], " ")), collapse = "|"), ")\\b")
  df$rept[i] <- str_extract_all(df$Orthographic[i], pattern1[i-1])
}
The results are shown below; result #10 is strange/incorrect: it should be character(0). How can the code be improved so that no such strange results are obtained?
df$rept
[[1]]
[1] NA
[[2]]
character(0)
[[3]]
character(0)
[[4]]
[1] "yeah"
[[5]]
character(0)
[[6]]
character(0)
[[7]]
[1] "I" "I" "don't" "I" "I" "don't" "I"
[[8]]
[1] "yeah"
[[9]]
character(0)
[[10]]
[1] "" "" "" "" "" "" "" "" "" "" "you" "" "" "" "" ""
[17] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[33] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[49] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[65] "" "" "" "" "" "" "" "" "" "" "" ""
Reproducible data:
structure(list(Orthographic = c("like I don't understand sorry like how old's your mom",
"eh sixty-one", "yeah (...) yeah yeah like I mean she's not like in the risk age group but still",
"yeah", "HH", "I don't know", "yeah I talked to my grandparents last night and last time I talked to them it was like two weeks ago and they at that time they were already like maybe you should just get on a plane and come home and like you can't just be here and and then last night they were like are you sure you don't wanna come home and I was I don't think I can and my mom said the same thing",
"yeah", "she said you should come home probably ", "no and like why would you go to the airport where people have corona sit in the plane where people have corona to get there where people have corona and then go and take it to your family"
)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
))

When debugging regex issues like this one, where patterns are built dynamically and rely on word boundaries, there are several things to keep in mind. First, inspect the patterns your current loop actually generates:
for (i in 2:nrow(df)) {
  pattern1[i-1] <- paste0("\\b(", paste0(unlist(str_split(df$Orthographic[i-1], " ")), collapse = "|"), ")\\b")
  df$rept[i] <- str_extract_all(df$Orthographic[i], pattern1[i-1])
}
Here is the list of regexps:
[1] "\\b(like|I|don't|understand|sorry|like|how|old's|your|mom)\\b"
[1] "\\b(eh|sixty-one)\\b"
[1] "\\b(yeah|(...)|yeah|yeah|like|I|mean|she's|not|like|in|the|risk|age|group|but|still)\\b"
[1] "\\b(yeah)\\b"
[1] "\\b(HH)\\b"
[1] "\\b(I|don't|know)\\b"
[1] "\\b(yeah|I|talked|to|my|grandparents|last|night|and|last|time|I|talked|to|them|it|was|like|two|weeks|ago|and|they|at|that|time|they|were|already|like|maybe|you|should|just|get|on|a|plane|and|come|home|and|like|you|can't|just|be|here|and|and|then|last|night|they|were|like|are|you|sure|you|don't|wanna|come|home|and|I|was|I|don't|think|I|can|and|my|mom|said|the|same|thing)\\b"
[1] "\\b(yeah)\\b"
[1] "\\b(she|said|you|should|come|home|probably|)\\b"
Look at the second pattern: \b(eh|sixty-one)\b. What if the first word were sixty? The \b(sixty|sixty-one)\b regex would never match sixty-one, because sixty would match first and the other alternative would not even be considered. Whenever you use word boundaries and one alternative can be a prefix of another (for example, alternatives containing more than one word), you should sort the alternatives by length in descending order to ensure the longest alternative is always tried first. Here, you will not need to sort: once whitespace boundaries are used (see below), a shorter alternative can no longer match inside a longer token.
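To illustrate with a minimal sketch (hypothetical tokens, not from the data above):
alts <- c("sixty", "sixty-one")
str_extract("sixty-one", paste0("\\b(", paste(alts, collapse = "|"), ")\\b"))
# [1] "sixty"      (the shorter alternative wins)
alts_sorted <- alts[order(nchar(alts), decreasing = TRUE)]
str_extract("sixty-one", paste0("\\b(", paste(alts_sorted, collapse = "|"), ")\\b"))
# [1] "sixty-one"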
See the next pattern, which contains the |(...)| alternative. As a regex, (...) matches any three chars other than line break chars and captures them into a group. However, the string contained a literal (...) substring, where the parentheses and dots are literal characters. To match them with a regex, you need to escape all special characters.
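A quick sketch of the difference:
str_extract("yeah (...) yeah", "(...)")
# [1] "yea"      (any three characters; the parentheses only group)
str_extract("yeah (...) yeah", "\\(\\.\\.\\.\\)")
# [1] "(...)"    (escaped: matches the literal substring)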
Next, you treat "words" as non-whitespace chunks of characters, because you split with str_split(df$Orthographic[i-1], " "). This invalidates the \b approach altogether: you need whitespace boundaries instead, (?<!\S) at the start and (?!\S) at the end. Moreover, since you split on a single space only, you may get empty alternatives if there are two or more consecutive spaces in the input string; split on the \s+ pattern instead to split on one or more whitespace characters.
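Both points are easy to verify (a sketch; the (...) token comes from the data above):
str_detect("yeah (...) yeah", "\\b\\(\\.\\.\\.\\)\\b")           # FALSE: \b needs a word char on one side
str_detect("yeah (...) yeah", "(?<!\\S)\\(\\.\\.\\.\\)(?!\\S)")  # TRUE: whitespace boundaries work
str_split("she  said", " ")[[1]]     # "she" ""    "said"  (empty token from the double space)
str_split("she  said", "\\s+")[[1]]  # "she" "said"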
Next, there is a trailing space in the last-but-one string, and it creates an empty alternative. An empty alternative matches the empty string at every word boundary, which is exactly what produced the stream of "" matches in result #10. You need to trimws your input before splitting it into tokens/words.
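Again, a quick sketch:
str_split("come home ", "\\s+")[[1]]          # "come" "home" ""  (trailing empty token)
str_split(trimws("come home "), "\\s+")[[1]]  # "come" "home"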
This is what you need to do with the regex solution: add the escape.for.regex function:
## Escape for regex
escape.for.regex <- function(string) {
  gsub("([][{}()+*^${|\\\\?.])", "\\\\\\1", string)
}
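For example:
escape.for.regex("(...)")
# [1] "\\(\\.\\.\\.\\)"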
and then use it to escape the tokens that you obtain by splitting the trimmed df$Orthographic[i-1] with the \s+ regex, apply unique to remove duplicates (making the pattern shorter and more efficient), and add the whitespace boundaries:
for (i in 2:nrow(df)) {
  pattern1[i-1] <- paste0("(?<!\\S)(?:", paste0(escape.for.regex(unique(unlist(str_split(trimws(df$Orthographic[i-1]), "\\s+")))), collapse = "|"), ")(?!\\S)")
  df$rept[i] <- str_extract_all(df$Orthographic[i], pattern1[i-1])
}
See the list of regexps:
[1] "(?<!\\S)(?:like|I|don't|understand|sorry|how|old's|your|mom)(?!\\S)"
[1] "(?<!\\S)(?:eh|sixty-one)(?!\\S)"
[1] "(?<!\\S)(?:yeah|\\(\\.\\.\\.\\)|like|I|mean|she's|not|in|the|risk|age|group|but|still)(?!\\S)"
[1] "(?<!\\S)(?:yeah)(?!\\S)"
[1] "(?<!\\S)(?:HH)(?!\\S)"
[1] "(?<!\\S)(?:I|don't|know)(?!\\S)"
[1] "(?<!\\S)(?:yeah|I|talked|to|my|grandparents|last|night|and|time|them|it|was|like|two|weeks|ago|they|at|that|were|already|maybe|you|should|just|get|on|a|plane|come|home|can't|be|here|then|are|sure|don't|wanna|think|can|mom|said|the|same|thing)(?!\\S)"
[1] "(?<!\\S)(?:yeah)(?!\\S)"
[1] "(?<!\\S)(?:she|said|you|should|come|home|probably)(?!\\S)"
Output:
> df$rept
[[1]]
NULL
[[2]]
character(0)
[[3]]
character(0)
[[4]]
[1] "yeah"
[[5]]
character(0)
[[6]]
character(0)
[[7]]
[1] "I" "I" "don't" "I" "I" "don't" "I"
[[8]]
[1] "yeah"
[[9]]
character(0)
[[10]]
[1] "you"

Depending on whether it is sufficient to identify the repeated words, or whether you also need their repeat frequencies, you might want to modify the function, but here is one approach using the dplyr::lead function:
library(stringr)
library(dplyr)
# general function that identifies intersecting words from multiple strings
getRpt <- function(...){
  l <- lapply(list(...), function(x) unlist(unique(
    str_split(as.character(x), pattern = boundary(type = "word")))))
  Reduce(intersect, l)
}
df$rept <- mapply(getRpt, df$Orthographic, lead(df$Orthographic), USE.NAMES=FALSE)
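Note that with lead(), the repeats are stored on the row of the earlier sentence: rept[i] holds the words shared between sentence i and sentence i + 1. If you want them on the later row instead, as in the for loop above, a sketch pairing each sentence with the previous one via dplyr::lag (the rept_prev column name is just an illustration):
df$rept_prev <- mapply(getRpt, df$Orthographic, lag(df$Orthographic), USE.NAMES = FALSE)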

Related

Extracting non-capitalized words using Regex

I'm trying to extract non-capitalized words using regex in R. The data contains several columns (e.g. word, word duration, syllable, syllable duration, ...), and in the word column there are tons of words that are either capitalized (e.g. EAT), non-capitalized (e.g. see), or in curly brackets (e.g. {VAO}). I want to extract all the words in the word column that are not capitalized. The following is a small example data frame with the expected outcome.
file word
1 sp
2 WHAT
3 ISN'
4 'EM
5 O
6 {PPC}
OUTCOME:
"sp", "{PPC}"
> unique(full_dat$word[!grepl("^[A-Z].*[A-Z]|\\d", full_dat$word) & !grepl(" [[:punct:]] ", full_dat$word)])
The result is the following:
[1] "sp" "{OOV}"
[3] "O" "I"
[5] "A" NA
[7] "{XX}" "'S"
[9] "{LG}" "Y"
[11] "B" "'VE"
[13] "N" "{GAP_ANONYMIZATION_NAME}"
[15] "'EM" "W"
[17] "{GAP_ANONYMIZATION}" "K"
This looks good, since I can easily recognize the non-capitalized words, but there are still some capitalized words in this list. How can I modify the code so that it shows only lower-case words and curly-bracketed words?
With the stringr library you can simply do this:
library(stringr)
x <- c("HELLO WORLD", "hello world", "Hello World", "hello World", "HeLLo wOrlD")
str_extract(x, "[A-Z]+")
Which extracts the first run of uppercase letters found in each string:
[1] "HELLO" NA "H" "W" "H"
You can omit the NAs by applying the na.omit function; the result also records the positions of the NAs, that is, the positions where there are no capitalized words:
na.omit(str_extract(x, "[A-Z]+"))
[1] "HELLO" "H" "W" "H"
attr(,"na.action")
[1] 2
attr(,"class")
[1] "omit"
But you can also see directly in which positions there are no capitalized words:
is.na(str_extract(x, "[A-Z]+"))
[1] FALSE TRUE FALSE FALSE FALSE
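For the question as asked (keep only lower-case words and curly-bracketed tokens), here is a minimal sketch on the example data, assuming a token should be dropped whenever it contains an uppercase letter outside curly brackets:
word <- c("sp", "WHAT", "ISN'", "'EM", "O", "{PPC}")
word[!grepl("[A-Z]", word) | grepl("^\\{.*\\}$", word)]
# [1] "sp"    "{PPC}"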
I hope this is helpful 😀

R: Possible to extract groups of words from each sentence(rows)? and create data frame(or matrix)?

I created a list for each word to extract words from sentences, for example like this:
hello <- NULL
for (i in 1:length(text)){
  hello[i] <- as.character(regmatches(text[i], gregexpr("[H|h]ello?", text[i])))
}
But I have more than 25 word lists to extract, which makes for very long code. Is it possible to extract a whole group of characters (words) from text data at once? Below is just a pseudo set.
words<-c("[H|h]ello","you","so","tea","egg")
text=c("Hello! How's you and how did saturday go?",
"hello, I was just texting to see if you'd decided to do anything later",
"U dun say so early.",
"WINNER!! As a valued network customer you have been selected" ,
"Lol you're always so convincing.",
"Did you catch the bus ? Are you frying an egg ? ",
"Did you make a tea and egg?"
)
subsets<-NULL
for ( i in 1:length(text)){
.....???
}
Expected output as below
[1] Hello you
[2] hello you
[3] you
[4] you so
[5] you you egg
[6] you tea egg
In base R, you could do:
regmatches(text,gregexpr(sprintf("\\b(%s)\\b",paste0(words,collapse = "|")),text))
[[1]]
[1] "Hello" "you"
[[2]]
[1] "hello" "you"
[[3]]
[1] "so"
[[4]]
[1] "you"
[[5]]
[1] "you" "so"
[[6]]
[1] "you" "you" "egg"
[[7]]
[1] "you" "tea" "egg"
Or, depending on how you want the results, a gsub one-liner that keeps each matched word (the \1 capture) and discards everything else, with trimws cleaning up the leftover spaces:
trimws(gsub(sprintf(".*?\\b(%s).*?|.*$",paste0(words,collapse = "|")),"\\1 ",text))
[1] "Hello you" "hello you" "so" "you" "you so" "you you egg"
[7] "you tea egg"
You say that you have a long list of word-sets. Here's a way to turn each wordset into a regex, apply it to a corpus (a list of sentences), and pull out the hits as character vectors. It's case-insensitive, and it checks for word boundaries, so you don't pull age out of agent or rage.
wordsets <- c(
"oak dogs cheese age",
"fire open jail",
"act speed three product"
)
library(tidyverse)
harvSent <- read_table("SENTENCE
Oak is strong and also gives shade.
Cats and dogs each hate the other.
The pipe began to rust while new.
Open the crate but don't break the glass.
Add the sum to the product of these three.
Thieves who rob friends deserve jail.
The ripe taste of cheese improves with age.
Act on these orders with great speed.
The hog crawled under the high fence.
Move the vat over the hot fire.") %>%
pull(SENTENCE)
aWset builds the regexes from the wordsets and applies them to the sentences:
aWset <- function(harvSent, wordsets){
  # Turn out a vector of regex like "(?ix) \\b (oak|dogs|cheese) \\b"
  regexS <- paste0("(?ix) \\b (",
                   str_replace_all(wordsets, " ", "|"),
                   ") \\b")
  # Apply each regex to the sentences
  map(regexS,
      ~ str_extract_all(harvSent, .x, simplify = TRUE) %>%
        # str_extract_all returns a character matrix of hits; paste it together by row
        apply(MARGIN = 1,
              FUN = function(x){
                str_trim(paste(x, collapse = " "))
              }))
}
Giving us
aWset(harvSent , wordsets)
[[1]]
[1] "Oak" "dogs" "" "" "" "" "cheese age" ""
[9] "" ""
[[2]]
[1] "" "" "" "Open" "" "jail" "" "" "" "fire"
[[3]]
[1] "" "" "" "" "product three" "" ""

How to split a string in R with a regular expression when parts of the regular expression are to be kept in the resulting split strings?

I have a vector of character strings like this x = c("ABC", "ABC, EF", "ABC, DEF, 2 stems", "DE, other comments, and stuff").
I'd like to split each of these into two components: 1) the set of capital letters (2 or 3 letters, separated by commas), and 2) everything after the last "[A-Z][A-Z], ".
The results should be
[[1]]
[1] "ABC"
[[2]]
[1] "ABC, EF"
[[3]]
[1] "ABC, DEF" "2 stems"
[[4]]
[1] "DE" "other comments, and stuff"
I tried strsplit(x, "[A-Z][A-Z], [a-z0-9]") and strsplit(x, "(?:[A-Z][A-Z], )[a-z0-9]"), both of which returned
[[1]]
[1] "ABC"
[[2]]
[1] "ABC, EF"
[[3]]
[1] "ABC, D" " stems"
[[4]]
[1] "" "ther comments, and stuff"
The identification of where to split depends on a combination of the end of the first substring and the beginning of the second substring, and so those parts get excluded from the final result.
Any help appreciated in splitting as indicated above while including the relevant parts of the split regex in each substring!
One option would be str_split
library(stringr)
str_split(x, ", (?=[a-z0-9])", n = 2)
#[[1]]
#[1] "ABC"
#[[2]]
#[1] "ABC, EF"
#[[3]]
#[1] "ABC, DEF" "2 stems"
#[[4]]
#[1] "DE" "other comments, and stuff"

Removing "" elements from a list in R

I have a list where I would like to remove empty characters: "".
I seem to be subsetting the elements incorrectly:
> sample2[which(sample2 == "")]
list()
> sample2[which(sample2 != "")]
[[1]]
[1] "" "03JAN1990" "" "" ""
[6] "" "23.4" "0.4" "" ""
[11] "" "" "25.1" "0.3" ""
[16] "" "" "" "26.6" "0.0"
[21] "" "" "" "" "28.6"
[26] "0.3"
What should I do to subset and remove the empty characters?
From your output, it looks like sample2 is not a character vector but a list containing a character vector. You should be using:
sample2[[1]][which(sample2[[1]] != "")]
(It would help to include dput(sample2) just to confirm)
Or even better, take the character vector out of the list first
sample3 <- sample2[[1]]
# or maybe sample3 <- unlist(sample2)
sample3[which(sample3 != "")]
A very basic solution:
> lst = list(1,2,"dog","","boss","")
> x = unlist(lst)
> list(x[x!=""])
[[1]]
[1] "1" "2" "dog" "boss"

How to use the strsplit function with a period

I would like to split the following string by its periods. I tried strsplit() with "." in the split argument, but did not get the result I want.
s <- "I.want.to.split"
strsplit(s, ".")
[[1]]
[1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
The output I want is to split s into 4 elements in a list, as follows.
[[1]]
[1] "I" "want" "to" "split"
What should I do?
When using a regular expression in the split argument of strsplit(), you've got to escape the . with \\., or use a character class [.]. Otherwise, . takes on its special regex meaning, "any single character".
s <- "I.want.to.split"
strsplit(s, "[.]")
# [[1]]
# [1] "I" "want" "to" "split"
But the more efficient method here is to use the fixed argument in strsplit(). Using this argument will bypass the regex engine and search for an exact match of ".".
strsplit(s, ".", fixed = TRUE)
# [[1]]
# [1] "I" "want" "to" "split"
And of course, you can see help(strsplit) for more.
You need to either place the dot . inside a character class or precede it with two backslashes to escape it, since the dot is a character of special meaning in regex, meaning "match any single character (except newline)".
s <- 'I.want.to.split'
strsplit(s, '\\.')
# [[1]]
# [1] "I" "want" "to" "split"
Besides strsplit(), you can also use scan(). Try:
scan(what = "", text = s, sep = ".")
# Read 4 items
# [1] "I" "want" "to" "split"
