Defining synonyms within a corpus of documents using R

I have a corpus of documents on a very specific topic (e.g. sports/athletics). Within that corpus, I would like to define synonyms myself. The reason I want to define them myself is that the synonyms() function in the wordnet package sometimes does not recognise two words as synonyms, even though within the text they can be interpreted as such (for example, "fit" and "strong").
My idea is to use word associations with bigrams and trigrams and to define a synonym when words appear frequently in a phrase and have similar semantic content. For example, using the crude dataset from the tm package, I would do something like:
library(tm)
library(RWeka)  # provides NGramTokenizer() and Weka_control()

data(crude)
options(mc.cores = 1)

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

# note: removeSparseTerms() is a separate function, not a control option, so it is left out here
crudetdm <- TermDocumentMatrix(crude, control = list(stripWhitespace = TRUE,
                                                     removePunctuation = TRUE,
                                                     removeNumbers = TRUE,
                                                     stopwords = TRUE,
                                                     tokenize = BigramTokenizer))

ListAssoc <- lapply(crudetdm$dimnames$Terms, function(x) findAssocs(crudetdm, x, 0.9))
However, this returns (as expected) bigrams associated with bigrams, while my idea is to find individual words associated with the bigrams in crudetdm$dimnames$Terms (the same exercise with trigrams would be the next step). For example, using bigrams and the crude dataset, the ideal scenario would be to end up with a data.frame like:
Bigram          Associated Words
oil companies   policies, marketing, prices, measures, market, revenue...
Then I would go through the table myself and manually select those words that I believe can be considered synonyms in my dataset (my dataset is not that big). I can think of some workarounds, such as defining multiple data.frames of bigrams and trigrams and matching common words, but I am sure there is a more elegant and efficient way of doing this in R.
Overall, my question is: given a series of bigrams and trigrams, how can I find individual words that are associated with them?
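One way to get exactly that kind of table is to build a single term-document matrix whose tokenizer emits both unigrams and bigrams (min = 1, max = 2); findAssocs() on a bigram term then returns correlated terms of both kinds, and the individual words can be kept by filtering out any term that contains a space. A minimal sketch along those lines, reusing the crude example above (the 0.9 correlation cutoff is arbitrary, and looping over every bigram can be slow on larger matrices):
library(tm)
library(RWeka)

# tokenizer that emits both unigrams and bigrams
UniBiTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))

crudetdm2 <- TermDocumentMatrix(crude, control = list(removePunctuation = TRUE,
                                                      removeNumbers = TRUE,
                                                      stopwords = TRUE,
                                                      tokenize = UniBiTokenizer))

# bigrams are the terms that contain a space
terms   <- Terms(crudetdm2)
bigrams <- terms[grepl(" ", terms, fixed = TRUE)]

# for each bigram, keep only the associated single words
assocs <- findAssocs(crudetdm2, bigrams, 0.9)
single_word_assocs <- lapply(assocs, function(a) a[!grepl(" ", names(a), fixed = TRUE)])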

Related

Keep only sentences in corpus that contain specific key words (in R)

I have a corpus with .txt documents. I do not need all the sentences from these .txt documents; I only want to keep certain sentences that contain specific key words. From there on, I will perform similarity measures etc.
So, here is an example.
From the data_corpus_inaugural data set of the quanteda package, I only want to keep the sentences in my corpus that contain the words "future" and/or "children".
I load my packages and create the corpus:
library(quanteda)
library(stringr)
## corpus with data_corpus_inaugural of the quanteda package
corpus <- corpus(data_corpus_inaugural)
summary(corpus)
Then I want to keep only those sentences that contain my key words
## keep only those sentences of a document that contain the words future and/or children
First, let's see which documents contain these key words
## extract all matches of future or children
str_extract_all(corpus, pattern = "future|children")
So far, I only found out how to exclude the sentences that contain my key words, which is the opposite of what I want to do.
## exclude sentences that contain future or children or both (?)
corpustrim <- corpus_trimsentences(corpus, exclude_pattern = "future|children")
summary(corpustrim)
The above command excludes sentences containing my key words. What I actually want, with the corpus_trimsentences function, is the reverse: to exclude all sentences BUT those containing "future" and/or "children".
I tried to do this with regular expressions, but did not manage to get the result I want.
I looked into the corpus_reshape and corpus_subset functions of the quanteda package but I can't figure out how to use them for my purpose.
You are correct that it's corpus_reshape() and corpus_subset() that you want here. Here's how to use them.
First, reshape the corpus to sentences.
library("quanteda")
data_corpus_inauguralsents <-
    corpus_reshape(data_corpus_inaugural, to = "sentences")
data_corpus_inauguralsents
Then use stringr to create a logical (Boolean) vector, equal in length to the new sentence corpus, that indicates the presence or absence of the pattern.
containstarget <-
    stringr::str_detect(texts(data_corpus_inauguralsents), "future|children")
summary(containstarget)
##    Mode   FALSE    TRUE
## logical    4879     137
Then use corpus_subset() to keep only those with the pattern:
data_corpus_inauguralsentssub <-
    corpus_subset(data_corpus_inauguralsents, containstarget)
tail(texts(data_corpus_inauguralsentssub), 2)
## 2017-Trump.30
## "But for too many of our citizens, a different reality exists: mothers and children trapped in poverty in our inner cities; rusted-out factories scattered like tombstones across the landscape of our nation; an education system, flush with cash, but which leaves our young and beautiful students deprived of all knowledge; and the crime and the gangs and the drugs that have stolen too many lives and robbed our country of so much unrealized potential."
## 2017-Trump.41
## "And now we are looking only to the future."
Finally, if you want to put these selected sentences back into their original document containers, but without the sentences that did not contain the target words, then reshape again:
# reshape back to documents that contain only sentences with the target terms
corpus_reshape(data_corpus_inauguralsentssub, to = "documents")
## Corpus consisting of 49 documents and 3 docvars.
You need to use the tokens function.
library(quanteda)
corpus <- corpus(data_corpus_inaugural)
# tokens to keep
tok_to_keep <- tokens_select(tokens(corpus, what = "sentence"),
                             pattern = "future|children",
                             valuetype = "regex",
                             selection = "keep")
This returns a tokens object containing, for each speech, only the sentences in which the key words are present. You can then unlist tok_to_keep, or process it further, to get what you want, as shown below.
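For example, a minimal sketch of that flattening step, assuming the tok_to_keep object created above:
# flatten the tokens object into a single character vector of the matching sentences
kept_sentences <- unlist(as.list(tok_to_keep), use.names = FALSE)
head(kept_sentences, 2)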

Text Mining: Getting a Sentence-Term Matrix

I'm currently running into trouble finding anything relevant to creating a sentence-term matrix in R using text mining.
I'm using the tm package and the only thing that I can find is converting to a tdm or dtm.
I'm working with a single Excel file and I'm only interested in text mining one of its columns, which has about 1200 rows. I want to create a row (sentence)-term matrix, i.e. a matrix that tells me the frequency of words in each row (sentence).
I want to create a matrix of 1's and 0's that I can run a PCA analysis on later.
A dtm in my case is not helpful because, since I'm only using one file, it has a single row and its columns give the frequency of words in that whole document.
Instead, I want to treat the sentences as documents, if that makes sense, and get a matrix with the frequency of words in each sentence.
Thank you!
When using text2vec, you just need to feed the content of your column as a character vector into the tokenizer function; see the example below.
Concerning your downstream analysis, I would not recommend running PCA on count data / integer values; PCA is not designed for this kind of data. You should either apply normalization, tf-idf weighting, etc. to your dtm to turn it into continuous data before feeding it to PCA (see the sketch after the code below), or apply correspondence analysis instead.
library(text2vec)  # note: corrected from "text2vex"

docs <- c("the coffee is warm",
          "the coffee is cold",
          "the coffee is hot",
          "the coffee is warm",
          "the coffee is hot",
          "the coffee is perfect")

# generate the sentence-term matrix with text2vec
tokens = word_tokenizer(docs)
it = itoken(tokens,
            ids = paste0("sent_", 1:length(docs)),
            progressbar = FALSE)
vocab = create_vocabulary(it)
vectorizer = vocab_vectorizer(vocab)
dtm = create_dtm(it, vectorizer, type = "dgTMatrix")
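As mentioned above, one possible weighting step before PCA is tf-idf; a minimal sketch using text2vec's TfIdf model on the dtm created above:
# re-weight the raw counts with tf-idf before any PCA / dimensionality reduction
tfidf = TfIdf$new()
dtm_tfidf = fit_transform(dtm, tfidf)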
With the corpus library:
library(corpus)
library(Matrix)
corpus <- federalist # sample data
x <- term_matrix(text_split(corpus, "sentences"))
In your case, though, it sounds like you have already split the text into sentences. If that is true, then there is no need for the text_split call; just do
x <- term_matrix(data$your_column_with_sentences)
(replacing data$your_column_with_sentences with whatever is appropriate for your data).
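Since the question asks for a matrix of 1's and 0's rather than raw counts, a possible follow-up on either matrix above is to binarise it (a sketch, where x is the sentence-term matrix from above):
# turn counts into presence/absence indicators (1 = term occurs in the sentence)
x01 <- as.matrix(x > 0) * 1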
Can't add comments so here's a suggestion:
library(data.table)  # for fread()
library(stringr)     # for str_count()

# Read data from file using fread (for .csv, from the data.table package);
# add parameters as needed: col.names, nrows, etc.
dat <- fread(filename)

# count "the" in the column of interest for the selected rows
counts <- sapply(row_start:row_end, function(z) str_count(dat[z, selected_col_name], "the"))
This will give you the number of occurrences of "the" in the column of interest for the selected rows. You could also apply it to all rows at once (see below), or use other nested functions for different variations. Bear in mind that you would need to account for lowercase/uppercase letters; you can use tolower to achieve that. Hope this is helpful!
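For all rows at once, str_count is vectorised over the column, so the sapply loop can be dropped entirely. A sketch, where the file name and column name are placeholders, and a word-boundary pattern is used so that e.g. "there" is not counted as a match:
library(data.table)
library(stringr)

dat <- fread("yourfile.csv")
# count occurrences of the word "the" in every row of the column of interest at once
counts_all <- str_count(tolower(dat$selected_col_name), "\\bthe\\b")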

stri_replace_all_fixed slow on big data set - is there an alternative?

I'm trying to stem ~4000 documents in R using the stri_replace_all_fixed function. However, it is VERY slow, since my dictionary of stemmed words consists of approx. 300k words. I am doing this because the documents are in Danish and therefore the Porter stemmer is not useful (it is too aggressive).
I have posted the code below. Does anyone know an alternative for doing this?
Logic: look at each word in each document; if the word equals a word from the voc table, replace it with the corresponding tran word.
## Read in the dictionary
voc <- read.table("danish.csv", header = TRUE, sep = ";")

# Using the 'stringi' library to do the stemming
library(tm)
library(stringi)

# Put the word and stem columns of voc into separate character vectors
word <- as.character(voc$Word)
tran <- as.character(voc$Stem)

# Using stri_replace_all_fixed to stem the words in 'docs' (the tm corpus of documents)
## !! NOTE THAT THE FOLLOWING STEP MIGHT TAKE A FEW MINUTES DEPENDING ON THE SIZE !! ##
docs <- tm_map(docs, content_transformer(function(x)
  stri_replace_all_fixed(x, word, tran, vectorize_all = FALSE)))
Structure of "voc" data frame:
        Word             Stem
1       abandonnere      abandonner
2       abandonnerede    abandonner
3       abandonnerende   abandonner
...
313273  åsyns            åsyn
To make dictionary matching fast, you need to implement a clever data structure such as a prefix tree; 300,000 separate search-and-replace passes just do not scale.
I don't think this will be efficient in pure R; you will probably need to write a C or C++ extension. You have many tiny operations there, and the overhead of the R interpreter will kill you when trying to do this in pure R.
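One pure-R alternative worth trying before dropping down to C/C++ is to tokenize each document once and replace whole tokens through a hashed lookup with match(), so every word is looked up in the 300k-row table instead of running 300k separate search-and-replace passes over the text. A rough sketch, assuming voc is the data frame from the question and docs_text is a plain character vector with one element per document:
library(stringi)

# named lookup vector: names are the inflected forms, values are the stems
stem_lookup <- setNames(as.character(voc$Stem), as.character(voc$Word))

stem_document <- function(text) {
  # split into word tokens once (note: this drops punctuation)
  words <- unlist(stri_split_boundaries(text, type = "word", skip_word_none = TRUE))
  hits  <- match(words, names(stem_lookup))  # one hashed lookup per token
  words[!is.na(hits)] <- stem_lookup[hits[!is.na(hits)]]
  stri_join(words, collapse = " ")
}

docs_stemmed <- vapply(docs_text, stem_document, character(1))
If this is still too slow, fastmatch::fmatch() is a drop-in replacement for match() that caches the lookup table's hash between calls.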

R: tm package, aggregate / join docs

I could not find any previous questions posted on this, so perhaps you can help.
What is a good way to aggregate data in a tm corpus based on metadata (e.g. aggregate texts of different writers)?
There are at least two obvious ways it could be done:
A built-in function in tm that would allow a DocumentTermMatrix to be built on a metadata feature. Unfortunately, I haven't been able to find one.
A way to join documents within a corpus based on some external metadata in a table. It would just use metadata to replace document-ids.
So you would have a table that contains: DocumentId, AuthorName
And a tm-built corpus that contains an amount of documents. I understand it is not difficult to introduce the table as metadata for the corpus object.
A matrix can be built with the following code.
library(tm)  # version 0.6; you seem to be using an older version
corpus <- Corpus(DirSource("/directory-with-texts"),
                 readerControl = list(language = "lat"))
metadata <- data.frame(DocID, Author)

# A very crude way to enter metadata into the corpus (assumes the same sequence):
for (i in 1:length(corpus)) {
  attr(corpus[[i]], "Author") <- metadata$Author[i]
}

a_documenttermmatrix_by_DocId <- DocumentTermMatrix(corpus)
How would you build a matrix that shows frequencies for each author, possibly aggregating multiple documents per author, instead of for each document? It would be useful to do this at this stage and not in post-processing with only a few terms.
a_documenttermmatrix_by_Author <- ?
Many thanks!
A DocumentTermMatrix is really just a matrix with fancy dressing (a simple triplet matrix from the slam library) that contains term frequencies for each term and document. Aggregating data from multiple documents by author is really just summing up the rows for that author's documents. Consider converting the matrix to a standard R matrix and using standard subsetting / aggregating methods:
# Format the document-term matrix as a standard matrix
# (dtm is your DocumentTermMatrix, e.g. a_documenttermmatrix_by_DocId).
# The rownames of m are the document ids, the colnames are the individual terms.
m <- as.matrix(dtm)

# Group the rows (documents) by author and sum the term frequencies
# for each author, resulting in a list of named vectors.
author.list <- by(m, metadata$Author, colSums)

# Format the list as a matrix and do stuff with it
author.dtm <- matrix(unlist(author.list), nrow = length(author.list), byrow = TRUE)

# Add column names (terms) and row names (authors)
colnames(author.dtm) <- colnames(m)
rownames(author.dtm) <- names(author.list)

# View the resulting matrix
View(author.dtm[1:10, 1:10])
The resulting matrix will be a standard matrix where the rows are the Authors and the columns are the individual terms. You should be able to do whatever analysis you want at that point.
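If you would rather aggregate without ever leaving the sparse representation, the slam package (which tm builds on) also provides rollup(), which sums one margin of a simple triplet matrix by a grouping index. A one-line sketch, assuming the documents in the DocumentTermMatrix are in the same order as metadata$Author:
library(slam)
# sum the rows (documents) of the document-term matrix by author, staying sparse
author_dtm <- rollup(a_documenttermmatrix_by_DocId, 1L, INDEX = metadata$Author, FUN = sum)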
I have a very crude workaround for this if the corpus text can be found in a table. It does not help much with a large corpus already in 'tm' format, but it may be handy in other cases. Feel free to improve it, as it is very crude!
custom_term_matrix <- function(author_vector, text_vector)
{
  author_vector <- factor(author_vector)
  # one row per author, holding all of that author's texts pasted together
  temp <- data.frame(Author = levels(author_vector), Content = NA,
                     stringsAsFactors = FALSE)
  for (i in 1:length(temp$Author)) {
    temp$Content[i] <- paste(as.character(text_vector[author_vector ==
                               levels(author_vector)[i]]), collapse = " ")
  }

  m <- list(id = "Author", content = "Content")
  myReader <- readTabular(mapping = m)
  mycorpus <- Corpus(DataframeSource(temp), readerControl = list(reader = myReader))

  custom_matrix <<- DocumentTermMatrix(mycorpus, control =
                                         list(removePunctuation = TRUE))
}
There probably is a function internal to tm that I haven't been able to find, so I will be grateful for any help!

Creating topic models on frequency lists in R

I've been using the topicmodels package to create LDA models in R.
require(tm)
require(topicmodels)
textvector <- c("this is one sentence", "this is another one",
"a third sentence appears")
#and more, read in through a file
dtm <- DocumentTermMatrix(Corpus(VectorSource(textvector)))
lda.model <- LDA(dtm, 5)
But the only format in which it accepts documents is as actual, literal documents. I was wondering if there is a way to provide a map of frequencies instead:
[word1: 4, word2: 9, word3: 25, word5:3...]
This is obviously not a 'map' in R, but is there any data-structure representation (data frame, table, list of vectors) that allows the creation of topic models from word frequencies?
The reason I need this is that the topic models aren't being created on 'documents' and 'words' as such, but on analogous features in images, and a long-form representation needs far too much space.
You don't need to use tm's call to create the document-term matrix. You can create and send in your own, so long as the "documents" are in rows and the component "words" are in columns. However, you cannot simply supply overall frequency counts in a table, because LDA relies on knowing which words appear in which documents!
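If your counts live in a long-format table (one row per document/word pair with a frequency), one way to build such a matrix directly, without ever materialising the documents, is slam's simple_triplet_matrix(), which is the sparse format tm and topicmodels use internally. A minimal sketch with hypothetical counts:
library(slam)
library(tm)
library(topicmodels)

# hypothetical long-format counts: one row per (document, word) pair
counts <- data.frame(doc  = c("doc1", "doc1", "doc2", "doc2", "doc3", "doc3"),
                     word = c("word1", "word2", "word1", "word3", "word2", "word3"),
                     freq = c(4L, 9L, 25L, 3L, 7L, 2L))

docs  <- factor(counts$doc)
words <- factor(counts$word)

# sparse documents-in-rows, words-in-columns matrix built straight from the counts
stm <- simple_triplet_matrix(i = as.integer(docs),
                             j = as.integer(words),
                             v = counts$freq,
                             nrow = nlevels(docs),
                             ncol = nlevels(words),
                             dimnames = list(Docs = levels(docs), Terms = levels(words)))

dtm <- as.DocumentTermMatrix(stm, weighting = weightTf)
lda.model <- LDA(dtm, k = 2)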
