Frequency of occurrence of two-word combinations in text data in R

I have a file with several string (text) variables where each respondent has written a sentence or two for each variable. I want to be able to find the frequency of each combination of words (i.e. how often "capability" occurs with "performance").
My code so far goes:
#Setting up the data file
data.text <- scan("C:/temp/tester.csv", what="char", sep="\n")
#Change everything to lower text
data.text <- tolower(data.text)
#Split the strings into separate words
data.words.list <- strsplit(data.text, "\\W+", perl=TRUE)
data.words.vector <- unlist(data.words.list)
#List each word and frequency
data.freq.list <- table(data.words.vector)
This gives me a list of each word and how often it appears across the string variables. Now I want to see the frequency of every two-word combination. Is this possible?
Thanks!
An example of the string data:
ID Reason_for_Dissatisfaction Reason_for_Likelihood_to_Switch
1 "not happy with the service" "better value at other place"
2 "poor customer service" "tired of same old thing"
3 "they are overchanging me" "bad service"

I'm not sure if this is what you mean, but rather than splitting on every two-word boundary (which I found a pain to try and regex) you could paste every two words together using the trusty head and tail slip trick...
# How I read your data
df <- read.table( text = 'ID Reason_for_Dissatisfaction Reason_for_Likelihood_to_Switch
1 "not happy with the service" "better value at other place"
2 "poor customer service" "tired of same old thing"
3 "they are overchanging me" "bad service"
' , h = TRUE , stringsAsFactors = FALSE )
# Split to words
wlist <- sapply( df[,-1] , strsplit , split = "\\W+", perl=TRUE)
# Paste word pairs together
outl <- sapply( wlist , function(x) paste( head(x,-1) , tail(x,-1) , sep = " ") )
# Table as per usual
table(unlist( outl ) )
are overchanging at other bad service better value customer service
               1        1           1            1                1
happy with not happy of same old thing other place
         1         1       1         1           1
overchanging me poor customer same old the service they are
              1             1        1           1        1
tired of value at with the
       1        1        1
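To see why the head/tail trick yields consecutive word pairs, here is a minimal sketch on a single sentence:
x <- strsplit("not happy with the service", " ")[[1]]
head(x, -1)  # "not" "happy" "with" "the"      (drops the last word)
tail(x, -1)  # "happy" "with" "the" "service"  (drops the first word)
paste(head(x, -1), tail(x, -1))  # "not happy" "happy with" "with the" "the service"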


Extract sentences from texts in data frame

I have a data frame with a column "text", and in each row "text" contains several sentences (maybe only two, maybe 100 or more). I would like to analyze the text in every row of my data frame for specific keywords. If a keyword is found in a row's text, I would like to extract the sentences that contain keywords to a separate column, e.g.:
needles = c("first", "hope", "analyze", "happy")
mydata <- data.frame(
text = c("This is the first sentence. It is the beginning of this project",
"My second sentence is this. I hope this project will work fine. Then I will analyze the third sentence.",
"And this is the last sentence. Finally my work ends. I am really happy about that.",
"These sentences do not contain any relevant information. There is no keyword. And it is not relevant."),
findings = c("This is the first sentence.",
"I hope this project will work fine. Then I will analyze the third sentence.",
"I am really happy about that.",
NA)
)
So column "text" contains the sentences I want to check for keywords, "findings" is the result I would like to have in the end.
Can anyone help me apply a solution to all rows of the data frame?
Thank you!
We can work with list columns by splitting each row into separate sentences and looking for the needles inside each resulting sentence of each row.
The reduce calls are there to flatten the nesting levels of the resulting lists.
code:
library(tidyverse)
needles <- c("first", "hope", "analyze", "happy")
mydata <- data.frame(
text = c(
"This is the first sentence. It is the beginning of this project",
"My second sentence is this. I hope this project will work fine. Then I will analyze the third sentence.",
"And this is the last sentence. Finally my work ends. I am really happy about that.",
"These sentences do not contain any relevant information. There is no keyword. And it is not relevant."
),
findings = c(
"This is the first sentence.",
"I hope this project will work fine. Then I will analyze the third sentence.",
"I am really happy about that.",
NA
)
)
df <- as_tibble(mydata) %>%
mutate(findings = str_split(text, "\\.\\s") %>%
map(~str_subset(., rebus::or1(needles))) %>%
map_if(~length(.) > 1, ~reduce(., ~paste(.x, .y, sep = '. '))),
findings = map_if(findings, ~length(.) == 0, ~NA) %>% reduce(c))
df
#> # A tibble: 4 × 2
#> text findings
#> <chr> <chr>
#> 1 This is the first sentence. It is the … This is the first sentence
#> 2 My second sentence is this. I hope thi… I hope this project will work fine. T…
#> 3 And this is the last sentence. Finally… I am really happy about that.
#> 4 These sentences do not contain any rel… <NA>
Created on 2021-11-27 by the reprex package (v2.0.1)
What about something like this:
find_sentence <- function(text, word){
require(stringr)
x <- c(str_split(text, "\\..", simplify=TRUE))
inds <- which(str_detect(x, word))
if(length(inds) > 0){
list(x[inds])
}else{
list(NA)
}
}
mydata %>%
rowwise %>%
mutate(res = find_sentence(text, "the")) %>%
unnest(res)
# # A tibble: 4 × 3
# text findings res
# <chr> <chr> <chr>
# 1 This is the first sentence. It is the beginning of this project This is the first sentence. This is the fi…
# 2 This is the first sentence. It is the beginning of this project This is the first sentence. It is the begi…
# 3 My second sentence is this. I hope this project will work fine. Then I will analyze the third sentence. I hope this project will wo… Then I will an…
# 4 And this is the last sentence. Finally my work ends. I am really happy about that. I am really happy about tha… And this is th…
This returns a new variable called res that has a separate row for each occurrence of the keyword in a sentence. So, if two sentences contain the word (as in the first text), the text and findings columns are replicated for each of the relevant sentences in res.
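If you would rather keep a single row per text, the matched sentences can be glued back together instead of unnested; a minimal sketch reusing find_sentence:
mydata %>%
  rowwise() %>%
  mutate(res = paste(unlist(find_sentence(text, "the")), collapse = ". ")) %>%
  ungroup()
# rows with no match come back as the literal string "NA" here, which you may want to clean up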
With Base R,
lookup <- strsplit(as.character(mydata[,1]),"\\.")
out <- lapply(lookup,function(x) {
logic <- grepl(paste0(needles,collapse="|"),x)
paste0(x[logic],collapse=".")
})
data.frame(findings = do.call(rbind,out) )
gives,
# findings
#1 This is the first sentence
#2 I hope this project will work fine. Then I will analyze the third sentence
#3 I am really happy about that
#4
This uses grep and strsplit to get the matches.
mydata$findings <- sapply( strsplit( mydata$text, "\\. " ), function(x)
x[unlist( lapply( needles, function(y) grep(y, x) ) )] )
text
1 This is the first sentence. It is the beginning of this project
2 My second sentence is this. I hope this project will work fine. Then I will analyze the third sentence.
3 And this is the last sentence. Finally my work ends. I am really happy about that.
4 These sentences do not contain any relevant information. There is no keyword. And it is not relevant.
findings
1 This is the first sentence
2 I hope this project will work fine, Then I will analyze the third sentence.
3 I am really happy about that.
4

In R, compare string variable of two dataframes to create new flag variable indicating match in both dataframes, using a for-loop?

I have two dataframes which I would like to compare. One of them contains a complete list of sentences as a string variable as well as manually assigned codes of 0 and 1 (i.e. data.1). The second dataframe contains a subset of the sentences of the first dataframe and is reduced to those sentences that were matched by a dictionary.
This is, in essence, what these two datasets look like:
data.1 = data.frame(texts = c("This is a sentence", "This is another sentence", "This is not a sentence", "Yet another sentence"),
code = c(1,1,0,1))
data.2 = data.frame(texts = c("This is not a sentence", "This is a sentence"),
code = c(1,1))
I would like to merge the results of the data.2 into data.1 and ideally create a new code_2 variable there that indicates whether a sentence was matched by the dictionary. This would yield something like this:
> data.1
texts code code_2
1 This is a sentence 1 1
2 This is another sentence 1 0
3 This is not a sentence 0 1
4 Yet another sentence 1 0
To make this slightly more difficult, and as you can see above, the sentences in data.2 are not just a subset of data.1 but they may also be in a different order (e.g. "This is not a sentence" is in the third row of the first dataframe but in the first row of the second dataframe).
I was thinking that looping through all of the texts of data.1 would do the trick, but I'm not sure how to implement this.
for (i in 1:nrow(data.1)) {
# For each i in data.1...
# compare sentence to ALL sentences in data.2...
# create a new variable called "code_2"...
# assign a 1 if a sentence occurs in both dataframes...
# and a 0 otherwise (i.e. if that sentence only occurs in `data.1` but not in `data.2`).
}
Note: My question is similar to this one, where the string variable "Letter" corresponds to my "texts", yet the problem is somewhat different, since the matching of sentences itself is the basis for the creation of a new flag variable in my case (which is not the case in said other question).
Can you just join the dataframes?
NOTE: Added replace_na to substitute NAs with 0.
data.1 = data.frame(texts = c("This is a sentence", "This is another sentence", "This is not a sentence", "Yet another sentence"),
code = c(1,1,0,1))
data.2 = data.frame(texts = c("This is not a sentence", "This is a sentence"),
code = c(1,1))
data.1 %>% dplyr::left_join(data.2, by = 'texts') %>%
dplyr::mutate(code.y = tidyr::replace_na(code.y, 0))
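Note that since both frames have a code column, the join above returns code.x and code.y. If you want the exact code / code_2 layout from the question, a suffix argument gets there directly (a small tweak, not in the original answer):
data.1 %>%
  dplyr::left_join(data.2, by = "texts", suffix = c("", "_2")) %>%
  dplyr::mutate(code_2 = tidyr::replace_na(code_2, 0))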
I believe the following match-based solution does what the question asks for. (Note there is no need to sort the match indices; sorting would break the alignment with data.2$code.)
i <- match(data.2$texts, data.1$texts)
data.1$code_2 <- 0L
data.1$code_2[i] <- data.2$code
data.1
# texts code code_2
#1 This is a sentence 1 1
#2 This is another sentence 1 0
#3 This is not a sentence 0 1
#4 Yet another sentence 1 0
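For the flag alone, the same result also drops out of a one-liner with %in% (a minimal sketch that ignores data.2$code and just tests membership):
data.1$code_2 <- as.integer(data.1$texts %in% data.2$texts)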

Regex in R: how to fill dataframe with multiple matches to left and right of target string

(This is a follow-up to Regex in R: match collocates of node word.)
I want to extract word combinations (collocates) to the left and to the right of a target word (node) and store the three elements in a dataframe.
Data:
GO <- c("This little sentence went on and went on. It was going on for quite a while. Going on for ages. It's still going on. And will go on and on, and go on forever.")
Aim:
The target word is the verb GO in any of its possible realizations, be it 'go', 'goes', 'going', 'gone', or 'went', and I'm interested in extracting 3 words to the left of GO and 3 words to the right of GO. The three words can cross sentence boundaries but the extracted strings should not include punctuation.
What I've tried so far:
To extract left-hand collocates I've used str_extract_all from stringr:
unlist(str_extract_all(GO, "((\\s)?\\w+\\b){1,3}(?=\\s((g|G)o(es|ing|ne)?|went))"))
[1] "This little sentence" " went on and" " It was" "s still"
[5] " And will" " and"
This captures most but not all matches and includes spaces.
The extraction of the node, by contrast, looks okay:
unlist(str_extract_all(GO, "(g|G)o(es|ing|ne)?|went"))
[1] "went" "went" "going" "Going" "going" "go" "go"
To extract the right hand collocates:
unlist(str_extract_all(GO, "(?<=(g|G)o(es|ing|ne)?|went)(\\s\\w+\\b){1,3}"))
[1] " on and went" " on" " on for quite" " on for ages" " on" " on and on"
[7] " on forever"
Again the matches are incomplete and unwanted spaces are included.
And finally assembling all the matches in a dataframe throws an error:
collocates <- data.frame(
Left = unlist(str_extract_all(GO, "((\\s)?\\w+\\b){1,3}(?=\\s((g|G)o(es|ing|ne)?|went))")),
Node = unlist(str_extract_all(GO, "(g|G)o(es|ing|ne)?|went")),
Right = unlist(str_extract_all(GO, "(?<=(g|G)o(es|ing|ne)?|went)(\\s\\w+\\b){1,3}"))); collocates
Error in data.frame(Left = unlist(str_extract_all(GO, "((\\s)?\\w+\\b){1,3}(?=\\s((g|G)o(es|ing|ne)?|went))")), :
arguments imply differing number of rows: 6, 7
Expected output:
Left Node Right
This little sentence went on and went
went on and went on It was
on It was going on for quite
quite a while Going on for ages
ages It's still going on And will
on And will go on and on
and on and go on forever
Does anyone know how to fix this? Suggestions much appreciated.
If you use quanteda, you can get the following result. When you deal with text, you generally want to work in lower case, so I converted capital letters with tolower(). I also removed . and , with gsub(). Then I applied kwic() to the text. If you do not mind losing capital letters, periods, and commas, this gets you pretty much what you want.
library(quanteda)
library(dplyr)
library(splitstackshape)
myvec <- c("go", "going", "goes", "gone", "went")
mytext <- gsub(x = tolower(GO), pattern = "\\.|,", replacement = "")
mydf <- kwic(x = mytext, pattern = myvec, window = 3) %>%
as_tibble %>%
select(pre, keyword, post) %>%
cSplit(splitCols = c("pre", "post"), sep = " ", direction = "wide", type.convert = FALSE) %>%
select(contains("pre"), keyword, contains("post"))
pre_1 pre_2 pre_3 keyword post_1 post_2 post_3
1: this little sentence went on and went
2: went on and went on it was
3: on it was going on for quite
4: quite a while going on for ages
5: ages it's still going on and will
6: on and will go on and on
7: and on and go on forever <NA>
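One version caveat: in quanteda v3 and later, kwic() expects a tokens object rather than a raw character vector, so on a current install the call above would presumably need to become:
mydf <- kwic(x = tokens(mytext), pattern = myvec, window = 3)
with the rest of the pipeline unchanged.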
A little late but not too late for posterity or contemporaries doing collocation research on unannotated text, here's my own answer to my question. Full credit is given to @jazzurro's pointer to quanteda and his answer.
My question was: how to compute collocates of a given node in a text and store the results in a dataframe (that's the part not addressed by @jazzurro).
Data:
GO <- c("This little sentence went on and went on. It was going on for quite a while.
Going on for ages. It's still going on. And will go on and on, and go on forever.")
Step 1: Prepare data for analysis
go <- gsub("[.!?;,:]", "", tolower(GO)) # get rid of punctuation
go <- gsub("'", " ", tolower(go)) # separate clitics from host
Step 2: Extract KWIC using regex pattern and argument valuetype = "regex"
concord <- kwic(go, "go(es|ing|ne)?|went", window = 3, valuetype = "regex")
concord
[text1, 4] this little sentence | went | on and went
[text1, 7] went on and | went | on it was
[text1, 11] on it was | going | on for quite
[text1, 17] quite a while | going | on for ages
[text1, 24] it s still | going | on and will
[text1, 28] on and will | go | on and on
[text1, 33] and on and | go | on forever
Step 3: Identify strings with fewer collocates than defined by window:
# Number of collocates on the left:
concord$nc_l <- lengths(strsplit(concord$pre, " ")); concord$nc_l
[1] 3 3 3 3 3 3 3 # nothing missing here
# Number of collocates on the right:
concord$nc_r <- lengths(strsplit(concord$post, " ")); concord$nc_r
[1] 3 3 3 3 3 3 2 # last string has only two collocates
Step 4: Add NA to strings with missing collocates:
# define window:
window <- 3
# change string:
concord$post[!concord$nc_r == window] <- paste(concord$post[!concord$nc_r == window], NA, sep = " ")
Step 5: Fill dataframe with slots for collocates and node, using str_extract from library stringr as well as regex with lookarounds to determine split points for collocates:
library(stringr)
L3toR3 <- data.frame(
L3 = str_extract(concord$pre, "^\\w+\\b"),
L2 = str_extract(concord$pre, "(?<=\\s)\\w+\\b(?=\\s)"),
L1 = str_extract(concord$pre, "\\w+\\b$"),
Node = concord$keyword,
R1 = str_extract(concord$post, "^\\w+\\b"),
R2 = str_extract(concord$post, "(?<=\\s)\\w+\\b(?=\\s)"),
R3 = str_extract(concord$post, "\\w+\\b$")
)
Result:
L3toR3
L3 L2 L1 Node R1 R2 R3
1 this little sentence went on and went
2 went on and went on it was
3 on it was going on for quite
4 quite a while going on for ages
5 it s still going on and will
6 on and will go on and on
7 and on and go on forever NA
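As an aside, Steps 3 to 5 can be compressed by padding each split to the window size with real NA values, run on the Step 2 concord before the manual "NA" paste. A sketch, where the pad() helper is my own addition:
pad <- function(x, n = window) { length(x) <- n; x }  # extending the length pads with NA
left <- t(sapply(strsplit(concord$pre, " "), function(w) rev(pad(rev(w)))))  # pad on the left
right <- t(sapply(strsplit(concord$post, " "), pad))  # pad on the right
L3toR3 <- setNames(data.frame(left, concord$keyword, right),
                   c("L3", "L2", "L1", "Node", "R1", "R2", "R3"))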

grepl for finding words

I am trying in R to find the Spanish words in a number of words. I have all the Spanish words in an Excel file that I don't know how to attach to the post (it has more than 80,000 words), and I am trying to check whether some words are in it or not.
For example:
words = c("Silla", "Sillas", "Perro", "asdfg")
I tried to use this solution:
grepl(paste(spanish_words, collapse = "|"), words)
But there are too many Spanish words, and it gives me this error:
(error message attached as a screenshot in the original post)
So... how can I do it? I also tried this:
toupper(words) %in% toupper(spanish_words)
(result attached as a screenshot)
As you can see, this option only gives TRUE on exact matches, and I need "Sillas" to come back TRUE as well (it is the plural of "silla"). That was the reason I first tried grepl, to catch plurals too.
Any idea?
As a df:
library(tidyverse)
df <- tibble(text = c("some words",
"more words",
"Perro",
"And asdfg",
"Comb perro and asdfg"))
Vector of words:
words <- c("Silla", "Sillas", "Perro", "asdfg")
words <- tolower(paste(words, collapse = "|"))
Then use mutate and str_detect:
df %>%
mutate(
text = tolower(text),
spanish_word = str_detect(text, words)
)
Returns:
text spanish_word
<chr> <lgl>
1 some words FALSE
2 more words FALSE
3 perro TRUE
4 and asdfg TRUE
5 comb perro and asdfg TRUE
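As to the original grepl error: pasting 80,000 alternatives into one pattern can blow past R's regex size limits. A rough workaround is to match the dictionary in chunks and OR the results together, a sketch assuming spanish_words contains plain words with no regex metacharacters:
chunks <- split(spanish_words, ceiling(seq_along(spanish_words) / 500))  # 500 words per pattern
hits <- Reduce(`|`, lapply(chunks, function(chunk) {
  # leading \\b anchors each word's start; no trailing \\b, so plurals like "sillas" still match
  grepl(paste0("\\b(", paste(tolower(chunk), collapse = "|"), ")"), tolower(words))
}))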

count number of words per each line

I am trying to move some R code into Spark using sparklyr, and I am facing trouble with some of the functions I need for the following tasks:
-Count the total number of words in a row: for example
word = "Hello how are you", number of words: 4
-Count the total number of characters in the first word: for example
word = "Hello how are you", number of characters in the first word: 5
-Count the total number of characters in the second word: for example
word = "Hello how are you", number of characters in the second word: 3
I tried the dplyr and stringr packages but I can't get what I need.
I connect to a Spark session:
install.packages("DBI")
install.packages("ngram")
require(DBI)
require(sparklyr)
require(dplyr)
require(stringr)
require(stringi)
require(base)
require(ngram)
# Spark Config
config <- spark_config()
config$spark.executor.cores <- 2
config$spark.executor.memory <- "4G"
spark <- spark_connect(master = "yarn-client",version = "2.3.0",app_name = "Test", config=config)
Then I try to retrieve some data with an SQL statement
test_query<-sdf_sql(spark,"SELECT ID, NAME FROM table.name LIMIT 10")
NAME <- c('John Doe','Peter Gynn','Jolie Hope')
ID<-c(1,2,3)
test_query<-data.frame(NAME,ID) # (this is example data; here it is an R data frame, but mine is a Spark data frame)
When I try to do feature engineering, I get an error on the last line:
test_query<-test_query %>%
mutate(Total_char=nchar(NAME))%>% # this works fine
mutate(Name_has_numbers=str_detect(NAME,"[[:digit:]]"))%>% # works fine
mutate(Total_words=str_count(NAME, '\\w+')) # this line throws the error
The error message I am getting is this one: Error: org.apache.spark.sql.AnalysisException: Undefined function: 'STR_COUNT'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.
Working locally in R with the tidyverse, this gets the word count and the character count of the first word:
library(tidyverse)
test_query %>%
  mutate(NAME = as.character(NAME),
         word_count = str_count(NAME, "\\w+"),  # count the total number of words in a row
         N_char_first_word = nchar(gsub("(\\w+).*", "\\1", NAME))  # characters in the first word
  )
NAME ID word_count N_char_first_word
1 John Doe 1 2 4
2 Peter Gynn 2 2 5
3 Jolie Hope 3 2 5
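If the computation has to stay inside Spark rather than run locally, one option is to lean on functions Spark SQL itself provides, since sparklyr passes functions it does not recognize straight through to Spark SQL. A rough sketch, untested against a live cluster:
test_query %>%
  mutate(Total_words = size(split(NAME, " "))) %>%  # Spark SQL: split to array, then array size
  mutate(First_word = regexp_extract(NAME, "^[^ ]+", 0),  # first run of non-space characters
         Second_word = regexp_extract(NAME, "^[^ ]+ ([^ ]+)", 1)) %>%
  mutate(N_char_first_word = length(First_word),  # Spark SQL length() counts characters
         N_char_second_word = length(Second_word))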
