I have been working on a machine learning project with tweets, including a classification problem.
As a consequence, I have a training set and a testing set of tweets.
On the training set, I have computed a TF-IDF matrix with "tm" R package:
library(tm)
text_matrix <- DocumentTermMatrix(myCorpus_2,
control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE)))
Now, I want to get a similar term document matrix for my test dataset, with the same words in columns.
And I do not have any idea on how to generate a TF-IDF matrix while specifying the list of columns I want. Does any of you know how I could do ?
EDIT: Actually, I am looking for an equivalent of sklearn.feature_extraction.text.TfidfVectorizer in R.
Related
I am looking for assistance with my R code when exporting a DocumentTermMatrix. The file size is too large to export so I was curious if there is a way to set a Frequency to the DTM? For example, only return values in the DTM that have been used 5 or more times.
dtm <- DocumentTermMatrix(alltextclean)
write.csv(as.matrix(dtm), "dtm.csv")
The above produces too large of a file, can I add a frequency to this? I also tried the below but I am left with a list of terms but without a term count (this would also be useful).
termsonly <- findFreqTerms(dtm, 5)
write.csv(termsonly, "termsonly.csv")
Adding a frequency to the above would also be helpful.
Thanks for the help!
I guess you are looking for the total occurrence of each term, across all documents. Using an example dataset:
library(tm)
data(crude)
If your matrix is not so huge, you can do:
dtm = DocumentTermMatrix(crude)
Freq = colSums(as.matrix(dtm))
Otherwise, let's say we take terms with at least 5 occurences:
termsonly <- findFreqTerms(dtm, 5)
Freq = colSums(as.matrix(dtm[,termsonly]))
Or you cast it into a sparseMatrix and sum the columns:
library(Matrix)
Freq = colSums(sparseMatrix(i=dtm$i,j=dtm$j,x=dtm$v,dimnames=dtm$dimnames))
You can also check this post if you like a tidy solution.
I'm currently running into trouble finding anything relevant to creating a sentence-term matrix in R using text mining.
I'm using the tm package and the only thing that I can find is converting to a tdm or dtm.
I'm using only one excel file where I'm only interested in text mining one column of. That one column has about 1200 rows within it. I want to create a row (sentence) - term matrix. I want to create a matrix that tells me the frequency of words in each row (sentence).
I want to create a matrix of 1's and 0's that I can run a PCA analysis on later.
A dtm in my case is not helpful because since I'm only using one file, the number of rows is 1 and the columns are the frequency of words in that whole document.
Instead, I want to treat the sentences as documents if that makes sense. From there, I want a matrix which the frequency of words in each sentence.
Thank you!
When using text2vecyou just need to feed the content of your column as character vector into the tokenizer function - see below example.
Concerning your downstream analysis I would not recommend to run PCA on count data / integer values, PCA is not designed for this kind of data. You should either apply normalization, tfidf weighting, etc. on your dtm to turn it to continuous data before feeding it to PCA or otherwise apply correspondence analysis instead.
library(text2vex)
docs <- c("the coffee is warm",
"the coffee is cold",
"the coffee is hot",
"the coffee is warm",
"the coffee is hot",
"the coffee is perfect")
#Generate document term matrix with text2vec
tokens = docs %>%
word_tokenizer()
it = itoken(tokens
,ids = paste0("sent_", 1:length(docs))
,progressbar = FALSE)
vocab = create_vocabulary(it)
vectorizer = vocab_vectorizer(vocab)
dtm = create_dtm(it, vectorizer, type = "dgTMatrix")
With the corpus library:
library(corpus)
library(Matrix)
corpus <- federalist # sample data
x <- term_matrix(text_split(corpus, "sentences"))
Although, in your case, it sounds like you already split the text into sentences. If that is true, then there is no need for the text_split call; just do
x <- term_matrix(data$your_column_with_sentences)
(replacing data$your_column_with_sentences with whatever is appropriate for your data).
Can't add comments so here's a suggestion:
# Read Data from file using fread (for .csv from data.table package)
dat <- fread(filename, <add parameters as needed - col.namess, nrow etc>)
counts <- sapply(row_start:row_end, function(z) str_count(dat[z,.(selected_col_name)],"the"))
This will give you all occurances of "the" in the column of interested for the selected rows. You could also use apply if it's for all rows. Or other nested functions for different variations. Bear in mind that you would need to check for lowercast/uppercase letters - you can use tolower to achieve that. Hope this is helpful!
I could not find any previous questions posted on this, so perhaps you can help.
What is a good way to aggregate data in a tm corpus based on metadata (e.g. aggregate texts of different writers)?
There are at least two obvious ways it could be done:
A built-in function in tm, that would allow a DocumentTermMatrix to be built on a metadata feature. Unfortunately I haven't been able to uncover this.
A way to join documents within a corpus based on some external metadata in a table. It would just use metadata to replace document-ids.
So you would have a table that contains: DocumentId, AuthorName
And a tm-built corpus that contains an amount of documents. I understand it is not difficult to introduce the table as metadata for the corpus object.
A matrix can be built with a following function.
library(tm) # version 0.6, you seem to be using an older version
corpus <-Corpus(DirSource("/directory-with-texts"),
readerControl = list(language="lat"))
metadata <- data.frame(DocID, Author)
#A very crude way to enter metadata into the corpus (assumes the same sequence):
for (i in 1:length(corpus)) {
attr(corpus[[i]], "Author") <- metadata$Author[i]
}
a_documenttermmatrix_by_DocId <-DocumentTermMatrix(corpus)
How would you build a matrix that shows frequencies for each author possibly aggregating multiple documents instead of documents? It would be useful to do this just at this stage and not in post-processing with only a few terms.
a_documenttermmatrix_by_Author <- ?
Many thanks!
A DocumentTermMatrix is really just a matrix with fancy dressing (a Simple Triplet Matrix from the slam library) that contains term frequencies for each term and document. Aggregating data from multiple documents by author is really just adding up the columns for the author. Consider formatting the matrix as a standard R matrix and use standard subsetting / aggregating methods:
# Format the document term matrix as a standard matrix.
# The rownames of m become the document Id's
# The colnames of m become the individual terms
m <- as.matrix(dtm)
# Transpose matrix to use the "by" operator.
# Rows become individual terms
# Columns become document ids
# Group columns by Author
# Aggregate column sums (word frequencies) for each author, resulting in a list.
author.list <- by(t(m), metadata$Author, colSums)
# Format the list as a matrix and do stuff with it
author.dtm <- matrix(unlist(author.list), nrow = length(author.list), byrow = T)
# Add column names (term) and row names (author)
colnames(author.dtm) <- rownames(m)
rownames(author.dtm) <- names(author.list)
# View the resulting matrix
View(author.dtm[1:10, 1:10])
The resulting matrix will be a standard matrix where the rows are the Authors and the columns are the individual terms. You should be able to do whatever analysis you want at that point.
I have a very crude workaround for this if the corpus text can be found in a table. However this does not help a lot with a large corpus in a 'tm' format, however it may be handy in other cases. Feel free to improve it, as it is very crude!
custom_term_matrix <- function(author_vector, text_vector)
{
author_vector <- factor(author_vector)
temp <- data.frame(Author = levels(author_vector))
for (i in 1:length(temp$Author)){
temp$Content[i] <- paste(c(as.character(text_vector[author_vector ==
levels(author_vector)[i]])), sep=" ", collapse="")
}
m <- list(id = "Author", content = "Content")
myReader <- readTabular(mapping = m)
mycorpus <- Corpus(DataframeSource(data1), readerControl = list(reader = myReader))
custom_matrix <<- DocumentTermMatrix(mycorpus, control =
list(removePunctuation = TRUE))
}
There probably is a function internal to tm, that I haven't been able to find, so I will be grateful for any help!
Suppose I have text based training data and testing data. To be more specific, I have two data sets - training and testing - and both of them have one column which contains text and is of interest for the job at hand.
I used tm package in R to process the text column in the training data set. After removing the white spaces, punctuation, and stop words, I stemmed the corpus and finally created a document term matrix of 1 grams containing the frequency/count of the words in each document. I then took a pre-determined cut-off of, say, 50 and kept only those terms that have a count of greater than 50.
Following this, I train a, say, GLMNET model using the DTM and the dependent variable (which was present in the training data). Everything runs smooth and easy till now.
However, how do I proceed when I want to score/predict the model on the testing data or any new data that might come in the future?
Specifically, what I am trying to find out is that how do I create the exact DTM on new data?
If the new data set does not have any of the similar words as the original training data then all the terms should have a count of zero (which is fine). But I want to be able to replicate the exact same DTM (in terms of structure) on any new corpus.
Any ideas/thoughts?
tm has so many pitfalls... See much more efficient text2vec and vectorization vignette which fully answers to the question.
For tm here is probably one more simple way to reconstruct DTM matrix for second corpus:
crude2.dtm <- DocumentTermMatrix(crude2, control = list
(dictionary=Terms(crude1.dtm), wordLengths = c(3,10)) )
If I understand correctly, you have made a dtm, and you want to make a new dtm from new documents that has the same columns (ie. terms) as the first dtm. If that's the case, then it should be a matter of sub-setting the second dtm by the terms in the first, perhaps something like this:
First set up some reproducible data...
This is your training data...
library(tm)
# make corpus for text mining (data comes from package, for reproducibility)
data("crude")
corpus1 <- Corpus(VectorSource(crude[1:10]))
# process text (your methods may differ)
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(tolower, removePunctuation, removeNumbers,
stripWhitespace, skipWords)
crude1 <- tm_map(corpus1, FUN = tm_reduce, tmFuns = funcs)
crude1.dtm <- DocumentTermMatrix(crude1, control = list(wordLengths = c(3,10)))
And this is your testing data...
corpus2 <- Corpus(VectorSource(crude[15:20]))
# process text (your methods may differ)
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(tolower, removePunctuation, removeNumbers,
stripWhitespace, skipWords)
crude2 <- tm_map(corpus2, FUN = tm_reduce, tmFuns = funcs)
crude2.dtm <- DocumentTermMatrix(crude2, control = list(wordLengths = c(3,10)))
Here is the bit that does what you want:
Now we keep only the terms in the testing data that are present in the training data...
# convert to matrices for subsetting
crude1.dtm.mat <- as.matrix(crude1.dtm) # training
crude2.dtm.mat <- as.matrix(crude2.dtm) # testing
# subset testing data by colnames (ie. terms) or training data
xx <- data.frame(crude2.dtm.mat[,intersect(colnames(crude2.dtm.mat),
colnames(crude1.dtm.mat))])
Finally add to the testing data all the empty columns for terms in the training data that are not in the testing data...
# make an empty data frame with the colnames of the training data
yy <- read.table(textConnection(""), col.names = colnames(crude1.dtm.mat),
colClasses = "integer")
# add incols of NAs for terms absent in the
# testing data but present # in the training data
# following SchaunW's suggestion in the comments above
library(plyr)
zz <- rbind.fill(xx, yy)
So zz is a data frame of the testing documents, but has the same structure as the training documents (ie. same columns, though many of them contain NA, as SchaunW notes).
Is that along the lines of what you want?
I have been using R's text mining package and its really a great tool. I have not found retrieval support or maybe there are functionalities I am missing.
How can a simple VSM model be implemented using the R's text mining package?
# Sample R commands in support of my previous answer
require(fortunes)
require(tm)
sentences <- NULL
for (i in 1:10) sentences <- c(sentences,fortune(i)$quote)
d <- data.frame(textCol =sentences )
ds <- DataframeSource(d)
dsc<-Corpus(ds)
dtm<- DocumentTermMatrix(dsc, control = list(weighting = weightTf, stopwords = TRUE))
dictC <- Dictionary(dtm)
# The query below is created from words in fortune(1) and fortune(2)
newQry <- data.frame(textCol = "lets stand up and be counted seems to work undocumented")
newQryC <- Corpus(DataframeSource(newQry))
dtmNewQry <- DocumentTermMatrix(newQryC, control = list(weighting=weightTf,stopwords=TRUE,dictionary=dict1))
dictQry <- Dictionary(dtmNewQry)
# Below does a naive similarity (number of features in common)
apply(dtm,1,function(x,y=dictQry){length(intersect(names(x)[x!= 0],y))})
Assuming VSM = Vector Space Model, you can go about a simple retrieval system in the following manner:
Create a Document Term Matrix of your collection/corpus
Create a function for your similarity measure (Jaccard, Euclidean, etc.). There are packages available with these functions. RSiteSearch should help in finding them.
Convert your query to a Document Term Matrix (which will have 1 row and is mapped using the same dictionary as used for the first step)
Compute similarity with the query and the matrix from the first step.
Rank the results and choose the top n.
A non-R method is to use the GINI index on a text column (rows are documents) of a table in PostgreSQL. Using the ts_vector querying methods, you can have a very fast retrieval system.