Count the amount of bigrams in sentences in a dataframe - r

I have a dataset that looks a bit like this:
sentences <- c("sample text in sentence 1", "sample text in sentence 2")
id <- c(1,2)
df <- data.frame(sentences, id)
I would like to have a count where I can see the occurrence of certain bigrams. So lets say I have:
trigger_bg_1 <- "sample text"
I expect the output of 2 (as there are two occurrences of "sample text" in the two sentences. I know how to do a word count like this:
trigger_word_sentence <- 0
for(i in 1:nrow(df)){
words <- df$sentences[i]
words = strsplit(words, " ")
for(i in unlist(words)){
if(i == trigger_word_sentence){
trigger_word_sentence = trigger_word_sentence + 1
}
}
}
But I cant get something working for a bigram. Any thoughts on how I should change the code to get it working?
But as I have a long test of trigger-words which I need to count in over

In case you want to count the sentences where you have a match you can use grep:
length(grep(trigger_bg_1, sentences, fixed = TRUE))
#[1] 2
In case you want to count how many times you find trigger_bg_1 you can use gregexpr:
sum(unlist(lapply(gregexpr(trigger_bg_1, sentences, fixed = TRUE)
, function(x) sum(x>0))))
#[1] 2

You could sum a grepl
sum(grepl(trigger_bg_1, df$sentences))
[1] 2

If you are really interested in bigrams rather than just set word combinations, the package quanteda can offer a more substantial and systematic way forward:
Data:
sentences <- c("sample text in sentence 1", "sample text in sentence 2")
id <- c(1,2)
df <- data.frame(sentences, id)
Solution:
library(quanteda)
# strip sentences down to words (removing punctuation):
words <- tokens(sentences, remove_punct = TRUE)
# make bigrams, tabulate them and sort them in decreasing order:
bigrams <- sort(table(unlist(as.character(tokens_ngrams(words, n = 2, concatenator = " ")))), decreasing = T)
Result:
bigrams
in sentence sample text text in sentence 1 sentence 2
2 2 2 1 1
If you want to inspect the frequency count of a specific bigram:
bigrams["in sentence"]
in sentence
2

Related

In R, compare string variable of two dataframes to create new flag variable indicating match in both dataframes, using a for-loop?

I have two dataframes which I would like to compare. One of them contains a complete list of sentences as a string variable as well as manually assigned codes of 0 and 1 (i.e. data.1). The second dataframe contains a subset of the sentences of the first dataframe and is reduced to those sentences that were matched by a dictionary.
This is, in essence, what these two datasets look like:
data.1 = data.frame(texts = c("This is a sentence", "This is another sentence", "This is not a sentence", "Yet another sentence"),
code = c(1,1,0,1))
data.2 = data.frame(texts = c("This is not a sentence", "This is a sentence"),
code = c(1,1))
I would like to merge the results of the data.2 into data.1 and ideally create a new code_2 variable there that indicates whether a sentence was matched by the dictionary. This would yield something like this:
> data.1
texts code code_2
1 This is a sentence 1 1
2 This is another sentence 1 0
3 This is not a sentence 0 1
4 Yet another sentence 1 0
To make this slightly more difficult, and as you can see above, the sentences in data.2 are not just a subset of data.1 but they may also be in a different order (e.g. "This is not a sentence" is in the third row of the first dataframe but in the first row of the second dataframe).
I was thinking that looping through all of the texts of data.1 would do the trick, but I'm not sure how to implement this.
for (i in 1:nrow(data.1)) {
# For each i in data.1...
# compare sentence to ALL sentences in data.2...
# create a new variable called "code_2"...
# assign a 1 if a sentence occurs in both dataframes...
# and a 0 otherwise (i.e. if that sentence only occurs in `data.1` but not in `data.2`).
}
Note: My question is similar to this one, where the string variable "Letter" corresponds to my "texts", yet the problem is somewhat different, since the matching of sentences itself is the basis for the creation of a new flag variable in my case (which is not the case in said other question).
can you just join the dataframes?
NOTE: Added replace_na to substitue with 0
data.1 = data.frame(texts = c("This is a sentence", "This is another sentence", "This is not a sentence", "Yet another sentence"),
code = c(1,1,0,1))
data.2 = data.frame(texts = c("This is not a sentence", "This is a sentence"),
code = c(1,1))
data.1 %>% dplyr::left_join(data.2, by = 'texts') %>%
dplyr::mutate(code.y = tidyr::replace_na(code.y, 0))
I believe the following match based solution does what the question asks for.
i <- match(data.2$texts, data.1$texts)
i <- sort(i)
data.1$code_2 <- 0L
data.1$code_2[i] <- data.2$code[seq_along(i)]
data.1
# texts code code_2
#1 This is a sentence 1 1
#2 This is another sentence 1 0
#3 This is not a sentence 0 1
#4 Yet another sentence 1 0

grepl for finding words

I am trying in R to find the spanish words in a number of words. I have all the spanish words from a excel that I donĀ“t know how to attach in the post (it has more than 80000 words), and I am trying to check if some words are on it, or not.
For example:
words = c("Silla", "Sillas", "Perro", "asdfg")
I tried to use this solution:
grepl(paste(spanish_words, collapse = "|"), words)
But there is too much spanish words, and gives me this error:
Error
So... who can i do it? I also tried this:
toupper(words) %in% toupper(spanish_words)
Result
As you can see with this option only gives TRUE in exactly matches, and I need that "Sillas" also appear as TRUE (it is the plural word of silla). That was the reason that I tried first with grepl, for get plurals aswell.
Any idea?
As df:
df <- tibble(text = c("some words",
"more words",
"Perro",
"And asdfg",
"Comb perro and asdfg"))
Vector of words:
words <- c("Silla", "Sillas", "Perro", "asdfg")
words <- tolower(paste(words, collapse = "|"))
Then use mutate and str_detect:
df %>%
mutate(
text = tolower(text),
spanish_word = str_detect(text, words)
)
Returns:
text spanish_word
<chr> <lgl>
1 some words FALSE
2 more words FALSE
3 perro TRUE
4 and asdfg TRUE
5 comb perro and asdfg TRUE

Classification based on list of words R

I have a data set with article titles and abstracts that I want to classify based on matching words.
"This is an example of text that I want to classify based on the words that are matched from a list. This would be about 2 - 3 sentences long. word4, word5, text, text, text"
Topic 1 Topic 2 Topic (X)
word1 word4 word(a)
word2 word5 word(b)
word3 word6 word(c)
Given that that text above matches words in Topic 2, I want to assign a new column with this label. Preferred if this could be done with "tidy-verse" packages.
Given the sentence as a string and the topics in a data frame you can do something like this
input<- c("This is an example of text that I want to classify based on the words that are matched from a list. This would be about 2 - 3 sentences long. word4, word5, text, text, text")
df <- data.frame(Topic1 = c("word1", "word2", "word3"),Topic2 = c("word4", "word5", "word6"))
## This splits on space and punctation (only , and .)
input<-unlist(strsplit(input, " |,|\\."))
newcol <- paste(names(df)[apply(df,2, function(x) sum(input %in% x) > 0)], collapse=", ")
Given I am unsure of the data frame you want to add this too I have made a vector newcol.
If you had a data frame of long sentences then you can use a similar approach.
inputdf<- data.frame(title=c("This is an example of text that I want to classify based on the words that are matched from a list. This would be about 2 - 3 sentences long. word4, word5, text, text, text", "word2", "word3, word4"))
input <- strsplit(as.character(inputdf$title), " |,|\\.")
inputdf$newcolmn <-unlist(lapply(input, function(x) paste(names(df)[apply(df,2, function(y) sum(x %in% y)>0)], collapse = ", ")))

stringr: extract words containing a specific word

Consider this simple example
dataframe <- data_frame(text = c('WAFF;WOFF;WIFF200;WIFF12',
'WUFF;WEFF;WIFF2;BIGWIFF'))
> dataframe
# A tibble: 2 x 1
text
<chr>
1 WAFF;WOFF;WIFF200;WIFF12
2 WUFF;WEFF;WIFF2;BIGWIFF
Here I want to extract the words containing WIFF, that is I want to end up with a dataframe like this
> output
# A tibble: 2 x 1
text
<chr>
1 WIFF200;WIFF12
2 WIFF2;BIGWIFF
I tried to use
dataframe %>%
mutate( mystring = str_extract(text, regex('\bwiff\b', ignore_case=TRUE)))
but this only retuns NAs. Any ideas?
Thanks!
A classic, non-regex approach via base R would be,
sapply(strsplit(me$text, ';', fixed = TRUE), function(i)
paste(grep('WIFF', i, value = TRUE, fixed = TRUE), collapse = ';'))
#[1] "WIFF200;WIFF12" "WIFF2;BIGWIFF"
You seem to want to remove all words containing WIFF and the trailing ; if there is any. Use
> dataframedataframe <- data.frame(text = c('WAFF;WOFF;WIFF200;WIFF12', 'WUFF;WEFF;WIFF2;BIGWIFF'))
> dataframe$text <- str_replace_all(dataframe$text, "(?i)\\b(?!\\w*WIFF)\\w+;?", "")
> dataframe
text
1 WIFF200;WIFF12
2 WIFF2;BIGWIFF
The pattern (?i)\\b(?!\\w*WIFF)\\w+;? matches:
(?i) - a case insensitive inline modifier
\\b - a word boundary
(?!\\w*WIFF) - the negative lookahead fails any match where a word contains WIFF anywhere inside it
\\w+ - 1 or more word chars
;? - an optional ; (? matches 1 or 0 occurrences of the pattern it modifies)
If for some reason you want to use str_extract, note that your regex could not work because \bWIFF\b matches a whole word WIFF and nothing else. You do not have such words in your DF. You may use "(?i)\\b\\w*WIFF\\w*\\b" to match any words with WIFF inside (case insensitively) and use str_extract_all to get multiple occurrences, and do not forget to join the matches into a single "string":
> df <- data.frame(text = c('WAFF;WOFF;WIFF200;WIFF12', 'WUFF;WEFF;WIFF2;BIGWIFF'))
> res <- str_extract_all(df$text, "(?i)\\b\\w*WIFF\\w*\\b")
> res
[[1]]
[1] "WIFF200" "WIFF12"
[[2]]
[1] "WIFF2" "BIGWIFF"
> df$text <- sapply(res, function(s) paste(s, collapse=';'))
> df
text
1 WIFF200;WIFF12
2 WIFF2;BIGWIFF
You may "shrink" the code by placing str_extract_all into the sapply function, I separated them for better visibility.

Frequency of occurrence of two-pair combinations in text data in R

I have a file with several string (text) variables where each respondent has written a sentence or two for each variable. I want to be able to find the frequency of each combination of words (i.e. how often "capability" occurs with "performance").
My code so far goes:
#Setting up the data file
data.text <- scan("C:/temp/tester.csv", what="char", sep="\n")
#Change everything to lower text
data.text <- tolower(data.text)
#Split the strings into separate words
data.words.list <- strsplit(data.text, "\\W+", perl=TRUE)
data.words.vector <- unlist(data.words.list)
#List each word and frequency
data.freq.list <- table(data.words.vector)
This gives me a list of each word and how often it appears in the string variables. Now I want to see the frequency of every 2 word combination. Is this possible?
Thanks!
An example of the string data:
ID Reason_for_Dissatisfaction Reason_for_Likelihood_to_Switch
1 "not happy with the service" "better value at other place"
2 "poor customer service" "tired of same old thing"
3 "they are overchanging me" "bad service"
I'm not sure if this is what yu mean, but rather than splitting on every two word boundaires (which I found a pain to try and regex) you could paste every two words together using the trusty head and tails slip trick...
# How I read your data
df <- read.table( text = 'ID Reason_for_Dissatisfaction Reason_for_Likelihood_to_Switch
1 "not happy with the service" "better value at other place"
2 "poor customer service" "tired of same old thing"
3 "they are overchanging me" "bad service"
' , h = TRUE , stringsAsFactors = FALSE )
# Split to words
wlist <- sapply( df[,-1] , strsplit , split = "\\W+", perl=TRUE)
# Paste word pairs together
outl <- sapply( wlist , function(x) paste( head(x,-1) , tail(x,-1) , sep = " ") )
# Table as per usual
table(unlist( outl ) )
are overchanging at other bad service better value customer service
1 1 1 1 1
happy with not happy of same old thing other place
1 1 1 1 1
overchanging me poor customer same old the service they are
1 1 1 1 1
tired of value at with the
1 1 1

Resources