I'm currently running into trouble finding anything relevant to creating a sentence-term matrix in R using text mining.
I'm using the tm package and the only thing that I can find is converting to a tdm or dtm.
I'm using only one Excel file, and I'm only interested in text mining a single column of it. That column has about 1200 rows. I want to create a row (sentence)-term matrix, i.e. a matrix that tells me the frequency of words in each row (sentence).
I want to create a matrix of 1's and 0's that I can run a PCA analysis on later.
A dtm in my case is not helpful: since I'm only using one file, the number of rows is 1 and the columns are the frequencies of words in that whole document.
Instead, I want to treat the sentences as documents, if that makes sense. From there, I want a matrix giving the frequency of words in each sentence.
Thank you!
When using text2vec you just need to feed the content of your column as a character vector into the tokenizer function - see the example below.
Concerning your downstream analysis, I would not recommend running PCA on count data / integer values; PCA is not designed for this kind of data. You should either apply normalization, tf-idf weighting, etc. to your dtm to turn it into continuous data before feeding it to PCA, or otherwise apply correspondence analysis instead.
library(text2vec)
docs <- c("the coffee is warm",
"the coffee is cold",
"the coffee is hot",
"the coffee is warm",
"the coffee is hot",
"the coffee is perfect")
#Generate document term matrix with text2vec
tokens = docs %>%
word_tokenizer()
it = itoken(tokens
,ids = paste0("sent_", 1:length(docs))
,progressbar = FALSE)
vocab = create_vocabulary(it)
vectorizer = vocab_vectorizer(vocab)
dtm = create_dtm(it, vectorizer, type = "dgTMatrix")
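Building on the dtm above, here is a minimal sketch of both options: the 1/0 matrix you asked about, and a tf-idf weighted version that is better suited to PCA (the variable names are just illustrative).
library(Matrix)
# 1/0 matrix: 1 if the word occurs in the sentence, 0 otherwise
dtm_binary = sign(dtm)   # or: 1 * (dtm > 0)
# tf-idf weighting gives continuous values, better suited to PCA than raw counts
tfidf = TfIdf$new()
dtm_tfidf = fit_transform(dtm, tfidf)
# PCA on the weighted matrix (dense conversion is fine for ~1200 sentences)
pca = prcomp(as.matrix(dtm_tfidf), center = TRUE)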
With the corpus library:
library(corpus)
library(Matrix)
corpus <- federalist # sample data
x <- term_matrix(text_split(corpus, "sentences"))
Although, in your case, it sounds like you already split the text into sentences. If that is true, then there is no need for the text_split call; just do
x <- term_matrix(data$your_column_with_sentences)
(replacing data$your_column_with_sentences with whatever is appropriate for your data).
Can't add comments so here's a suggestion:
# Read data from file using fread (for .csv, from the data.table package)
library(data.table)
library(stringr)
dat <- fread(filename)  # add parameters as needed: col.names, nrows, etc.
counts <- sapply(row_start:row_end, function(z) str_count(dat$selected_col_name[z], "the"))
This will give you all occurrences of "the" in the column of interest for the selected rows. You could also use apply if it's for all rows, or other nested functions for different variations. Bear in mind that you would need to check for lowercase/uppercase letters - you can use tolower to achieve that. Hope this is helpful!
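For illustration, a small sketch of the same idea applied to every row at once and case-insensitively, reusing the placeholder column name selected_col_name from above:
library(stringr)
# Count occurrences of "the" in every row, ignoring case
counts <- str_count(tolower(dat$selected_col_name), fixed("the"))
# Or match whole words only, using a word-boundary regex
counts_whole <- str_count(tolower(dat$selected_col_name), "\\bthe\\b")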
I am looking for assistance with my R code when exporting a DocumentTermMatrix. The file size is too large to export, so I was curious if there is a way to set a frequency threshold on the DTM? For example, only return values in the DTM that have been used 5 or more times.
dtm <- DocumentTermMatrix(alltextclean)
write.csv(as.matrix(dtm), "dtm.csv")
The above produces too large a file; can I add a frequency threshold to this? I also tried the below, but I am left with a list of terms without a term count (this would also be useful).
termsonly <- findFreqTerms(dtm, 5)
write.csv(termsonly, "termsonly.csv")
Adding a frequency to the above would also be helpful.
Thanks for the help!
I guess you are looking for the total occurrence of each term, across all documents. Using an example dataset:
library(tm)
data(crude)
If your matrix is not so huge, you can do:
dtm = DocumentTermMatrix(crude)
Freq = colSums(as.matrix(dtm))
Otherwise, let's say we take terms with at least 5 occurrences:
termsonly <- findFreqTerms(dtm, 5)
Freq = colSums(as.matrix(dtm[,termsonly]))
Or you can cast it into a sparseMatrix and sum the columns:
library(Matrix)
Freq = colSums(sparseMatrix(i=dtm$i,j=dtm$j,x=dtm$v,dimnames=dtm$dimnames))
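If the goal is a CSV that is small enough to export, one option (a sketch building on Freq from above; the output file name is just an example) is to write out the term counts themselves rather than the full matrix:
freq_df <- data.frame(term = names(Freq), freq = Freq, row.names = NULL)
write.csv(freq_df[freq_df$freq >= 5, ], "term_frequencies.csv", row.names = FALSE)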
You can also check this post if you like a tidy solution.
I have been working on a machine learning project with tweets, including a classification problem.
As a consequence, I have a training set and a testing set of tweets.
On the training set, I have computed a TF-IDF matrix with "tm" R package:
library(tm)
text_matrix <- DocumentTermMatrix(myCorpus_2,
control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE)))
Now, I want to get a similar term document matrix for my test dataset, with the same words in columns.
And I do not have any idea how to generate a TF-IDF matrix while specifying the list of columns I want. Does any of you know how I could do this?
EDIT: Actually, I am looking for an equivalent of sklearn.feature_extraction.text.TfidfVectorizer in R.
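One approach I am considering is to pass the training vocabulary as a dictionary when building the test matrix, so both matrices share the same columns (a rough sketch, with myCorpus_test standing in for my preprocessed test corpus); though I think weightTfIdf would then recompute IDF from the test documents, so it is not an exact equivalent of TfidfVectorizer.
library(tm)
# Vocabulary of the training matrix
train_terms <- Terms(text_matrix)
# Test matrix restricted to the same columns (myCorpus_test is hypothetical)
test_matrix <- DocumentTermMatrix(myCorpus_test,
    control = list(dictionary = train_terms,
                   weighting = function(x) weightTfIdf(x, normalize = FALSE)))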
I'm trying to stem ~4000 documents in R, using the stri_replace_all_fixed function. However, it is VERY slow, since my dictionary of stemmed words consists of approx. 300k words. I am doing this because the documents are in Danish and therefore the Porter Stemmer Algorithm is not useful (it is too aggressive).
I have posted the code below. Does anyone know an alternative for doing this?
Logic: Look at each word in each document -> If word = word from voc-table, then replace with tran-word.
##Read in the dictionary
voc <- read.table("danish.csv", header = TRUE, sep=";")
# Using the 'stringi' library to do the stemming
library(stringi)
# Split the voc table and put the Word and Stem columns into separate corpora
word <- Corpus(VectorSource(voc))[1]
tran <- Corpus(VectorSource(voc))[2]
#Using stri_replace_all_fixed to stem words
## !! NOTE THAT THE FOLLOWING STEP MIGHT TAKE A FEW MINUTES DEPENDING ON THE SIZE !! ##
docs <- tm_map(docs, function(x) stri_replace_all_fixed(x, word, tran, vectorize_all = FALSE))
Structure of "voc" data frame:
Word Stem
1 abandonnere abandonner
2 abandonnerede abandonner
3 abandonnerende abandonner
...
313273 åsyns åsyn
To make dictionary matching fast, you need to implement some clever data structure such as a prefix tree. 300,000x search-and-replace just does not scale.
I don't think this can be made efficient in pure R; you will need to write a C or C++ extension. You have many tiny operations there, and the overhead of the R interpreter will kill you when trying to do this in pure R.
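That said, here is a hedged sketch of a middle ground in plain R: match() hashes the vocabulary in C code, so each document costs one tokenization and one lookup instead of 300,000 replacements (the function name and the toy call are made up for illustration; in practice pass word = voc$Word and tran = voc$Stem).
library(stringi)
stem_document <- function(text, word, tran) {
  # Tokenize into words, look each one up in the vocabulary, replace if found
  tokens <- stri_split_boundaries(text, type = "word", skip_word_none = TRUE)[[1]]
  idx <- match(tokens, word)
  tokens[!is.na(idx)] <- tran[idx[!is.na(idx)]]
  stri_join(tokens, collapse = " ")
}
# Toy example
stem_document("de abandonnerede huset", c("abandonnerede"), c("abandonner"))
# [1] "de abandonner huset"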
I've been using the topicmodels package to create LDA models in R.
require(tm)
require(topicmodels)
textvector <- c("this is one sentence", "this is another one",
"a third sentence appears")
#and more, read in through a file
dtm <- DocumentTermMatrix(Corpus(VectorSource(textvector)))
lda.model <- LDA(dtm, 5)
But the only format in which it accepts documents is as actual, literal documents. I was wondering if there is a way to provide a map of frequencies
[word1: 4, word2: 9, word3: 25, word5:3...]
This is obviously not a 'map' in R, but is there any data structure (data frame, table, list of vectors) representation that allows creation of topic models from word frequencies?
The reason I need this is because the topic models aren't being created on 'documents' and 'words' as such but analogous features in images, and a long-form representation needs way too much space.
You don't need to use tm's call to create the document-term matrix. You can create and send in your own, so long as the "documents" are in rows and the component "words" are represented in columns. However, you cannot simply supply frequency counts in a flat table, because LDA relies on knowing which words appear in which documents!
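To make that concrete, here is a minimal sketch with made-up counts; LDA() in topicmodels also accepts objects coercible to a simple_triplet_matrix with integer entries, so a plain integer matrix of per-document counts (documents in rows, words in columns) can be passed in directly.
require(topicmodels)
# Hypothetical word counts: rows are "documents" (here, images), columns are features
counts <- matrix(c(4, 9, 25, 3,
                   1, 2,  7, 5),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("img1", "img2"),
                                 c("word1", "word2", "word3", "word5")))
lda.model <- LDA(counts, k = 2)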
I have been using R's text mining package and it's really a great tool. I have not found retrieval support, or maybe there are functionalities I am missing.
How can a simple VSM model be implemented using R's text mining package?
# Sample R commands in support of my previous answer
require(fortunes)
require(tm)
sentences <- NULL
for (i in 1:10) sentences <- c(sentences,fortune(i)$quote)
d <- data.frame(textCol = sentences)
ds <- DataframeSource(d)
dsc <- Corpus(ds)
dtm <- DocumentTermMatrix(dsc, control = list(weighting = weightTf, stopwords = TRUE))
dictC <- Dictionary(dtm)
# The query below is created from words in fortune(1) and fortune(2)
newQry <- data.frame(textCol = "lets stand up and be counted seems to work undocumented")
newQryC <- Corpus(DataframeSource(newQry))
dtmNewQry <- DocumentTermMatrix(newQryC, control = list(weighting = weightTf, stopwords = TRUE, dictionary = dictC))
dictQry <- Dictionary(dtmNewQry)
# Below does a naive similarity (number of features in common)
apply(dtm,1,function(x,y=dictQry){length(intersect(names(x)[x!= 0],y))})
Assuming VSM = Vector Space Model, you can build a simple retrieval system in the following manner (a sketch follows the list below):
1. Create a document-term matrix of your collection/corpus.
2. Create a function for your similarity measure (Jaccard, Euclidean, etc.). There are packages available with these functions; RSiteSearch should help in finding them.
3. Convert your query to a document-term matrix (which will have 1 row and is mapped using the same dictionary as used for the first step).
4. Compute the similarity between the query and the matrix from the first step.
5. Rank the results and choose the top n.
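A rough end-to-end sketch of those steps using tm and cosine similarity (the documents and the query string here are made up):
library(tm)
docs <- c("the coffee is warm", "the coffee is cold", "green tea is hot")
query <- "hot coffee"
# Step 1: document-term matrix of the collection
corp <- Corpus(VectorSource(docs))
dtm <- DocumentTermMatrix(corp, control = list(weighting = weightTf))
# Step 3: query matrix over the same vocabulary, via the dictionary control
qdtm <- DocumentTermMatrix(Corpus(VectorSource(query)),
                           control = list(dictionary = Terms(dtm)))
m <- as.matrix(dtm)
q <- as.matrix(qdtm)[1, colnames(m)]
# Steps 2 and 4: cosine similarity between the query and every document
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
scores <- apply(m, 1, cosine, b = q)
# Step 5: rank and take the top n
head(order(scores, decreasing = TRUE), 2)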
A non-R method is to use a GIN index on a text column (rows are documents) of a table in PostgreSQL. Using the tsvector/tsquery full-text querying machinery, you can build a very fast retrieval system.