Tokenization of Compound Words not Working in Quanteda - r

I'm trying to create a dataframe containing specific keywords-in-context using the kwic() function, but unfortunately I'm running into an error when attempting to tokenize the underlying dataset.
This is the subset of the dataset I'm using as a reproducible example:
test_cluster <- speeches_subset %>%
  filter(grepl('Schwester Agnes',
               speechContent,
               ignore.case = TRUE))
test_corpus <- corpus(test_cluster,
                      docid_field = "id",
                      text_field = "speechContent")
Here, test_cluster contains six observations of 12 variables, that is, six rows in which the column speechContent contains the compound word "Schwester Agnes". test_corpus transforms the underlying data into a quanteda corpus object.
When I then run the following code, I would expect, first, the content of the speechContent variable to be tokenized and, thanks to tokens_compound(), the compound word "Schwester Agnes" to be kept together as a single token. In a second step, I would expect the kwic() function to return a dataframe of six rows, with the keyword variable containing the compound word "Schwester Agnes". Instead, however, kwic() returns an empty dataframe of 0 observations of 7 variables. I think this is because of some mistake I'm making with tokens_compound(), but I'm not sure... Any help would be greatly appreciated!
test_tokens <- tokens(test_corpus,
                      remove_punct = TRUE,
                      remove_numbers = TRUE) %>%
  tokens_compound(pattern = phrase("Schwester Agnes"))
test_kwic <- kwic(test_tokens,
                  pattern = "Schwester Agnes",
                  window = 5)
EDIT: I realize that the examples above are not easily reproducible, so please refer to the reprex below:
speech = c("This is the first speech. Many words are in this speech, but only few are relevant for my research question. One relevant word, for example, is the word stack overflow. However there are so many more words that I am not interested in assessing the sentiment of", "This is a second speech, much shorter than the first one. It still includes the word of interest, but at the very end. stack overflow.", "this is the third speech, and this speech does not include the word of interest so I'm not interested in assessing this speech.")
data <- data.frame(id=1:3,
speechContent = speech)
test_corpus <- corpus(data,
docid_field = "id",
text_field = "speechContent")
test_tokens <- tokens(test_corpus,
remove_punct = TRUE,
remove_numbers = TRUE) %>%
tokens_compound(pattern = c("stack", "overflow"))
test_kwic <- kwic(test_tokens,
pattern = "stack overflow",
window = 5)

You need to apply phrase("stack overflow") and set concatenator = " " in tokens_compound().
require(quanteda)
#> Package version: 3.2.1
#> Unicode version: 13.0
#> ICU version: 69.1
speech <- c("This is the first speech. Many words are in this speech, but only few are relevant for my research question. One relevant word, for example, is the word stack overflow. However there are so many more words that I am not interested in assessing the sentiment of",
"This is a second speech, much shorter than the first one. It still includes the word of interest, but at the very end. stack overflow.",
"this is the third speech, and this speech does not include the word of interest so I'm not interested in assessing this speech.")
data <- data.frame(id = 1:3,
speechContent = speech)
test_corpus <- corpus(data,
docid_field = "id",
text_field = "speechContent")
test_tokens <- tokens(test_corpus,
remove_punct = TRUE,
remove_numbers = TRUE) %>%
tokens_compound(pattern = phrase("stack overflow"), concatenator = " ")
test_kwic <- kwic(test_tokens,
pattern = "stack overflow",
window = 5)
test_kwic
#> Keyword-in-context with 2 matches.
#> [1, 29] for example is the word | stack overflow | However there are so many
#> [2, 24] but at the very end | stack overflow |
Created on 2022-05-06 by the reprex package (v2.0.1)
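Alternatively (a sketch, not part of the reprex above), you could keep tokens_compound()'s default "_" concatenator and search kwic() for the joined token instead:
test_tokens2 <- tokens(test_corpus, remove_punct = TRUE, remove_numbers = TRUE) %>%
  tokens_compound(pattern = phrase("stack overflow"))
test_kwic2 <- kwic(test_tokens2, pattern = "stack_overflow", window = 5)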

Related

How to compare text from two data frames in a wordcloud using R's quanteda package?

Suppose I have two data frames (country_x and country_y) which contain similar columns. E.g.
text_country_x
hello
bye
and
text_country_y
see ya
great
Using the quanteda and quanteda.textplots packages, I have created a word cloud:
corpus_country_x <- corpus(country_x_df$text_country_x)
country_x_token <- tokens(corpus_country_x, remove_punct = TRUE, remove_numbers = TRUE)
country_x_token <- tokens_remove(country_x_token , stopwords("english"))
token_dfm_x <- dfm(country_x_token)
quanteda.textplots::textplot_wordcloud(token_dfm_x)
However, I want to create a wordcloud where half of it contains text from text_country_x and the other half contains text from text_country_y. Does anyone know how to do this?
I know there is the comparison = TRUE parameter but I'm not sure how to make it work in practice: https://quanteda.io/reference/textplot_wordcloud.html
Do it this way:
1) Form each corpus separately.
2) Set a docvar for each corpus to differentiate the country (below, I use the document variable set).
3) Combine the corpus objects using +.
4) Tokenise and form a dfm, then group the dfm using your set variable (country, in your example).
5) Plot the comparison wordcloud.
library("quanteda")
#> Package version: 3.2.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
corpus_country_x <- corpus_subset(data_corpus_inaugural, Party == "Democratic")
corpus_country_x$set <- "Dem"
corpus_country_y <- corpus_subset(data_corpus_inaugural, Party == "Republican")
corpus_country_y$set <- "Rep"
corp <- corpus_country_x + corpus_country_y
dfmat <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE) %>%
  tokens_remove(stopwords("en")) %>%
  dfm() %>%
  dfm_group(groups = set)
library("quanteda.textplots")
textplot_wordcloud(dfmat, max_words = 60, comparison = TRUE)
Created on 2022-04-27 by the reprex package (v2.0.1)

How to convert data frame to dfm in quanteda package in R?

Suppose I have a data frame with a tweets column which looks like:
tweets
#text
#text 2
#text 3
Using the quanteda package, I'm trying to count the number of hashtags in the data frame.
However, using the following code, I get an error:
tweet_dfm <- dfm(data, remove_punct = TRUE)
tag_dfm <- dfm_select(tweet_dfm, pattern = ('#*'))
toptag <- names(topfeatures(tag_dfm, 50))
head(toptag)
Error (on the first line of code):
Error in dfm.default(data, remove_punct = TRUE) : dfm() only works on character, corpus, dfm, tokens objects.
Any ideas how to fix?
You need to slice out the column of the data.frame called "tweets", using data$tweets. So:
library("quanteda")
## Package version: 2.1.2
data <- data.frame(tweets = c("#text", "#text 2", "#text 3"))
dfm(data$tweets, remove_punct = TRUE) %>%
  dfm_select(pattern = "#*") %>%
  sum()
## [1] 3
(since you wanted the total of all hashtags)
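If you also want the most frequent hashtags themselves, as in the question's own code, a sketch along the same lines (an addition, not part of the original answer, using the same quanteda 2.x-style calls as above) would be:
tag_dfm <- dfm(data$tweets, remove_punct = TRUE) %>%
  dfm_select(pattern = "#*")
names(topfeatures(tag_dfm, 50))
## [1] "#text"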
Note that remove_punct = TRUE is unnecessary here, although it does no harm: quanteda's built-in tokeniser recognises the difference between ordinary punctuation and the hashtag character, which other tokenisers might treat as punctuation.
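To see this behaviour directly, you can tokenise a single string; a quick check (not part of the original answer; the exact printed output may differ slightly across quanteda versions):
tokens("#text 2", remove_punct = TRUE)
## Tokens consisting of 1 document.
## text1 :
## [1] "#text" "2"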

R: find a specific string next to another string with for loop

I have the text of a novel in a single vector, novel.vector.words, which has been split into individual words. I am looking for all instances of the string "blood of". However, since the vector is split by words, each word is its own string and I don't know how to search for adjacent strings in a vector.
I have a basic understanding of what for loops do, and following some instructions from a textbook, I can use this for loop to target all positions of "blood" and the context around it to create a tab-delimited KWIC display (key words in context).
node.positions <- grep("blood", novel.vector.words)
output.conc <- "D:/School/U Alberta/Classes/Winter 2019/LING 603/dracula_conc.txt"
cat("LEFT CONTEXT\tNODE\tRIGHT CONTEXT\n", file = output.conc) # tab-delimited header
# This establishes the range of how many words we can see in our KWIC display
context <- 10 # specify a window of ten words before and after the match
for (i in 1:length(node.positions)) { # access each match...
  # access the current match
  node <- novel.vector.words[node.positions[i]]
  # access the left context of the current match
  left.context <- novel.vector.words[(node.positions[i] - context):(node.positions[i] - 1)]
  # access the right context of the current match
  right.context <- novel.vector.words[(node.positions[i] + 1):(node.positions[i] + context)]
  # concatenate and print the results
  cat(left.context, "\t", node, "\t", right.context, "\n", file = output.conc, append = TRUE)
}
What I am not sure how to do, however, is use something like an if statement to only capture instances of "blood" followed by "of". Do I need another variable in the for loop? Basically, for every instance of "blood" that it finds, I want to check whether the word that immediately follows it is "of". I want the loop to find all of those instances and tell me how many there are in my vector.
You can create an index using dplyr::lead to match 'of' following 'blood':
library(dplyr)
novel.vector.words <- c("blood", "of", "blood", "red", "blood", "of", "blue", "blood")
which(grepl("blood", novel.vector.words) & grepl("of", lead(novel.vector.words)))
[1] 1 5
In response to the question in the comments:
This certainly could be done with a loop-based approach, but there is little point in re-inventing the wheel when there are already packages better designed and optimized to do the heavy lifting in text mining tasks.
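For reference, a minimal loop-based version of the check described in the question might look like this (an illustrative sketch, not part of the original answer):
node.positions <- grep("blood", novel.vector.words)
count <- 0
for (i in node.positions) {
  # check that a following word exists and that it is exactly "of"
  if (i < length(novel.vector.words) && novel.vector.words[i + 1] == "of") {
    count <- count + 1
  }
}
count  # number of "blood" tokens immediately followed by "of"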
Here is an example of how to find how frequently the words 'blood' and 'of' appear within five words of each other in Bram Stoker's Dracula using the tidytext package.
library(tidytext)
library(dplyr)
library(stringr)
## Read Dracula into dataframe and add explicit line numbers
fulltext <- data.frame(text = readLines("https://www.gutenberg.org/ebooks/345.txt.utf-8", encoding = "UTF-8"),
                       stringsAsFactors = FALSE) %>%
  mutate(line = row_number())
## Pair of words to search for and word distance
word1 <- "blood"
word2 <- "of"
word_distance <- 5
## Create ngrams using skip_ngrams token
blood_of <- fulltext %>%
  unnest_tokens(output = ngram, input = text, token = "skip_ngrams", n = 2, k = word_distance - 1) %>%
  filter(str_detect(ngram, paste0("\\b", word1, "\\b")) & str_detect(ngram, paste0("\\b", word2, "\\b")))
## Return count
blood_of %>%
  nrow()
[1] 54
## Inspect first six line number indices
head(blood_of$line)
[1] 999 1279 1309 2192 3844 4135

Filtering text from numbers and stopwords in R (not for tdm)

I have a text corpus.
mytextdata = read.csv(path to texts.csv)
Mystopwords=read.csv(path to mystopwords.txt)
How can I filter this text? I must delete:
1) all numbers
2) filter out the stop words
3) remove the brackets
I will not work with a dtm; I just need to clean this text data of numbers and stopwords.
sample data:
112773-Tablet for cleaning the hydraulic system Jura (6 pcs.) 62715
"Jura" and "the" are stopwords.
As output I expect:
Tablet for cleaning hydraulic system
Since there is only one character string available in the question at the moment, I decided to create some sample data myself. I hope this is close to your actual data. As Nate suggested, using the tidytext package is one way to go. Here, I first removed numbers, punctuation, contents in the brackets, and the brackets themselves. Then, I split the words in each string using unnest_tokens(). Then, I removed stop words. Since you have your own stop words, you may want to create your own dictionary. I simply added jura in the filter() part. Grouping the data by id, I combined the words in order to create character strings in summarise(). Note that I used jura instead of Jura. This is because unnest_tokens() converts capital letters to lower case.
mydata <- data.frame(id = 1:2,
                     text = c("112773-Tablet for cleaning the hydraulic system Jura (6 pcs.) 62715",
                              "1234567-Tablet for cleaning the mambojumbo system Jura (12 pcs.) 654321"),
                     stringsAsFactors = FALSE)
library(dplyr)
library(tidytext)
data(stop_words)
mutate(mydata, text = gsub(x = text, pattern = "[0-9]+|[[:punct:]]|\\(.*\\)", replacement = "")) %>%
  unnest_tokens(input = text, output = word) %>%
  filter(!word %in% c(stop_words$word, "jura")) %>%
  group_by(id) %>%
  summarise(text = paste(word, collapse = " "))
#      id text
#   <int> <chr>
# 1     1 tablet cleaning hydraulic system
# 2     2 tablet cleaning mambojumbo system
Another way would be the following. In this case, I am not using unnest_tokens().
library(magrittr)
library(stringi)
library(tidytext)
data(stop_words)
gsub(x = mydata$text, pattern = "[0-9]+|[[:punct:]]|\\(.*\\)", replacement = "") %>%
  stri_split_regex(str = ., pattern = " ", omit_empty = TRUE) %>%
  lapply(function(x){
    foo <- x[which(!x %in% c(stop_words$word, "Jura"))] %>%
      paste(collapse = " ")
    foo
  }) %>%
  unlist()
#[1] "Tablet cleaning hydraulic system"  "Tablet cleaning mambojumbo system"
There are multiple ways of doing this. If you want to rely on base R only, you can transform #jazurro's answer a bit and use gsub() to find and replace the text patterns you want to delete.
I'll do this by using two regular expressions: the first one matches the content of the brackets and numeric values, whereas the second one will remove the stop words. The second regex will have to be constructed based on the stop words you want to remove. If we put it all in a function, you can easily apply it to all your strings using sapply:
mytextdata <- read.csv("123.csv", header=FALSE, stringsAsFactors=FALSE)
custom_filter <- function(string, stopwords = c()){
  string <- gsub("[-0-9]+|\\(.*\\) ", "", string)
  # Create something like: "\\b( the|Jura)\\b"
  new_regex <- paste0("\\b( ", paste0(stopwords, collapse = "|"), ")\\b")
  gsub(new_regex, "", string)
}
stopwords <- c("the", "Jura")
custom_filter(mytextdata[1], stopwords)
# [1] "Tablet for cleaning hydraulic system "

Efficient way to remove all proper names from corpus

Working in R, I'm trying to find an efficient way to search through a file of texts and remove or replace all instances of proper names (e.g., Thomas). I assume there is something available to do this but have been unable to locate it.
So, in this example the words "Susan" and "Bob" would be removed. This is a simplified example; in reality I would want this to apply to hundreds of documents and therefore a fairly large list of names.
texts <- as.data.frame(rbind(
  'This text stuff if quite interesting',
  'Where are all the names said Susan',
  'Bob wondered what happened to all the proper nouns'
))
names(texts)[1] <- "text"
Here's one approach based upon a data set of first names:
install.packages("gender")
library(gender)
install_genderdata_package()
sets <- data(package = "genderdata")$results[,"Item"]
data(list = sets, package = "genderdata")
stopwords <- unique(kantrowitz$name)
texts <- as.data.frame(rbind(
  'This text stuff if quite interesting',
  'Where are all the names said Susan',
  'Bob wondered what happened to all the proper nouns'
))
removeWords <- function(txt, words, n = 30000L) {
  l <- cumsum(nchar(words) + c(0, rep(1, length(words) - 1)))
  groups <- cut(l, breaks = seq(1, ceiling(tail(l, 1) / n) * n + 1, by = n))
  regexes <- sapply(split(words, groups),
                    function(words) sprintf("(*UCP)\\b(%s)\\b",
                                            paste(sort(words, decreasing = TRUE), collapse = "|")))
  for (regex in regexes) txt <- gsub(regex, "", txt, perl = TRUE, ignore.case = TRUE)
  return(txt)
}
removeWords(texts[,1], stopwords)
# [1] "This text stuff if quite interesting"
# [2] "Where are all the names said "
# [3] " wondered what happened to all the proper nouns"
It may need some tuning for your specific data set.
Another approach could be based upon part-of-speech tagging.
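A rough sketch of that part-of-speech route, using the udpipe package (my assumption, not part of the original answer; it downloads a small English model on first use and drops every token tagged PROPN):
library(udpipe)
ud_model <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(ud_model$file_model)
ann <- as.data.frame(udpipe_annotate(ud_model, x = as.character(texts[[1]])))
# keep everything that was not tagged as a proper noun, then rebuild each text
no_propn <- subset(ann, upos != "PROPN")
aggregate(token ~ doc_id, data = no_propn, FUN = paste, collapse = " ")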
