I am working on counting the frequency of unique words in a text document in R 3.2.2. I have collapsed many articles into one single text document and framed it into a corpus using the tm package.
desc <- paste(column_input, collapse = " ")  # collapse all articles into one string
desrc <- VectorSource(desc)
decorp <- Corpus(desrc)
#dedtm <- DocumentTermMatrix(decorp)
#dedtm <- TermDocumentMatrix(decorp)
There are 12,000-odd terms in that one text document. To proceed with matrix operations, I am not quite sure which is the better method: a term-document matrix or a document-term matrix?
I suppose that depends on the context. Is it better to use a term-document matrix rather than a document-term matrix in the case of fewer documents with more terms? I just want to understand the logic behind this, so I hope there is no need for a reproducible example. Any suggestions would be greatly appreciated.
Thanks in advance,
Bala
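For what it's worth, the two matrices carry exactly the same information: one is the transpose of the other. tm's convention is documents as rows in a DocumentTermMatrix and terms as rows in a TermDocumentMatrix, so with one document and ~12,000 terms a TermDocumentMatrix is simply the taller shape. A quick sketch with tm:

```r
library(tm)

corp <- Corpus(VectorSource("one collapsed document with many unique terms"))

dtm <- DocumentTermMatrix(corp)  # documents as rows: 1 x nTerms
tdm <- TermDocumentMatrix(corp)  # terms as rows:     nTerms x 1

# Same counts, transposed layout:
all(as.matrix(tdm) == t(as.matrix(dtm)))  # TRUE
```

Choose whichever shape your downstream functions expect; converting between them is just t().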
I'd like to use the GloVe word embedding implemented in text2vec to perform supervised regression/classification. I read the helpful tutorial on the text2vec homepage on how to generate the word vectors. However, I'm having trouble grasping how to proceed further, namely how to apply or transform these word vectors and attach them to each document in such a way that each document is represented by a vector (derived from its component words' vectors, I assume), to be used as input to a classifier. I've run into some quick fixes online for short documents, but my documents are rather lengthy (movie subtitles) and there doesn't seem to be any guidance on how to proceed with such documents, or at least guidance matching my comprehension level; I have experience working with n-grams, dictionaries, and topic models, but word embeddings puzzle me.
Thank you!
If your goal is to classify documents, I doubt any doc2vec approach will beat bag-of-words/n-grams. If you still want to try, a common simple strategy for short documents (< 20 words) is to represent the document as a weighted sum/average of its word vectors.
You can obtain it with something like:
common_terms = intersect(colnames(dtm), rownames(word_vectors))
dtm_averaged = normalize(dtm[, common_terms], "l1")
# you can re-weight the dtm above with tf-idf instead of the "l1" norm
sentence_vectors = dtm_averaged %*% word_vectors[common_terms, ]
I'm not aware of any universal established methods to obtain good document vectors for long documents.
I'm working with a dataset that has 439 observations for text analysis in stm. When I use textProcessor, the number of observations changes to 438 for some reason. This creates problems later on, for example when using the findThoughts() function.
##############################################
#PREPROCESSING
##############################################
#Process the data for analysis.
temp<-textProcessor(sovereigncredit$Content,sovereigncredit, customstopwords = customstop, stem=FALSE)
meta<-temp$meta
vocab<-temp$vocab
docs<-temp$documents
length(docs) # QUESTION: WHY IS THIS 438 instead of 439, like the original dataset?
length(sovereigncredit$Content) # See, this original one is 439.
out <- prepDocuments(docs, vocab, meta)
docs<-out$documents
vocab<-out$vocab
meta <-out$meta
An example of this becoming a problem down the line is:
thoughts1<-findThoughts(sovereigncredit1, texts=sovereigncredit$Content,n=5, topics=1)
For which the output is:
"Error in findThoughts(sovereigncredit1, texts = sovereigncredit$Content, :
Number of provided texts and number of documents modeled do not match"
In which "sovereigncredit1" is a topic model based on "out" from above.
If my interpretation is correct (and I'm not making another mistake), the problem seems to be this one-observation difference in the number of observations pre- and post-textProcessor.
So far, I've looked at the original CSV and made sure there are in fact 439 valid observations and no empty rows. I'm not sure what's going on. Any help would be appreciated!
stm can't handle empty documents, so we simply drop them. textProcessor removes a lot of material from texts: custom stopwords, words shorter than 3 characters, numbers, etc. So what's happening here is that one of your documents (whichever one is dropped) is essentially losing all of its contents at some point during the various things textProcessor does.
You can work back to which document it was and decide what you want to do about it in this instance. In general, if you want more control over the text manipulation, I would strongly recommend the quanteda package, which has much more fine-grained tools than stm for manipulating texts into a document-term matrix.
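For instance, assuming the $docs.removed components returned by textProcessor and prepDocuments report the indices of dropped documents, a sketch of how to subset the original texts to match the modeled documents:

```r
library(stm)

temp <- textProcessor(sovereigncredit$Content, sovereigncredit,
                      customstopwords = customstop, stem = FALSE)

# Indices of documents that textProcessor emptied out and dropped:
kept_texts <- sovereigncredit$Content
if (length(temp$docs.removed) > 0) kept_texts <- kept_texts[-temp$docs.removed]

out <- prepDocuments(temp$documents, temp$vocab, temp$meta)
# prepDocuments may drop further documents; account for those too:
if (length(out$docs.removed) > 0) kept_texts <- kept_texts[-out$docs.removed]

# Now the texts align with the modeled documents:
thoughts1 <- findThoughts(sovereigncredit1, texts = kept_texts, n = 5, topics = 1)
```

This keeps findThoughts() happy because the number of provided texts matches the number of documents the model actually saw.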
I have a corpus that has words such as 5k, 50k, 7.5k, 75k, 10K, 100K.
So when I create a TDM using the tm package, terms such as 10k and 100k are extracted separately. However, 5k and 7.5k are not extracted as separate terms.
Now, I understand that after punctuation removal "7.5k" might be falling under the "75k" term, but what's going on with "5k"? Why is it not extracted as a term?
Basically, I would like to know if there is a way to force the tm package to look for specific words and extract them as key terms.
Any pointers would help!
Are you breaking words at punctuation? That is, is '.' a word-break character? If so, then the split of '7.5k' is ('7', '5k'), the second of which matches '5k'.
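If you want to force such patterns to survive, one option (a sketch, not the only way) is a custom tokenizer passed through the control list, with a regex that keeps digits, a decimal point, and a trailing k together:

```r
library(tm)

corp <- Corpus(VectorSource("prices were 5k, 7.5k, 50k, 75k, 10K and 100K"))

# Custom tokenizer: extract tokens like 7.5k / 100K explicitly,
# so punctuation handling never splits them apart.
kTokenizer <- function(x) {
  x <- as.character(x)
  unlist(regmatches(x, gregexpr("[0-9]+(\\.[0-9]+)?[kK]|[a-zA-Z]+", x)))
}

tdm <- TermDocumentMatrix(corp,
                          control = list(tokenize = kTokenizer, tolower = TRUE))
rownames(tdm)  # expect terms like "5k", "7.5k", "50k", "75k", "10k", "100k"
```

The regex here is just an illustration; adjust it to whatever "key term" shapes your corpus contains.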
In text2vec, the only function I could find relating to stopwords is create_vocabulary. But in a text-mining task, we usually need to eliminate the stopwords in the source documents and then build the corpus or do other further processing. How do we use stopwords to tackle the documents when building the corpus, dtm, and tcm with text2vec?
I've used tm for text mining before. It has a function for analyzing PDF documents, but it reads one paper as several vectors (one line, one vector), not each document as a single vector as I expected. Furthermore, the format-conversion functions in tm garble Chinese text (mojibake). If I use text2vec to read documents, can it read one paper into a single vector? (I.e., is a vector element large enough to hold a paper published in a journal?) Also, are the corpus and vectors built in text2vec compatible with those built in tm?
There are two ways to create document-term matrix:
Using feature hashing
Using vocabulary
See the text-vectorization vignette for details.
You are interested in the second option. This means you should build a vocabulary: the set of words/n-grams which will be used in all downstream tasks. create_vocabulary creates the vocabulary object, and only terms from this object will be used in further steps. So if you provide stopwords to create_vocabulary, it will remove them from the set of all observed words in the corpus. As you can see, you need to provide the stopwords only once; all downstream tasks will work with the vocabulary.
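A minimal sketch of that flow (toy documents and stopword list invented for illustration):

```r
library(text2vec)

docs <- c("this is the first document", "and this is another one")
it <- itoken(docs, tolower, word_tokenizer)

# Provide stopwords once, at vocabulary-creation time:
v <- create_vocabulary(it, stopwords = c("this", "is", "the", "and"))
vectorizer <- vocab_vectorizer(v)

dtm <- create_dtm(it, vectorizer)  # stopwords never appear as columns
tcm <- create_tcm(it, vectorizer)  # the same vocabulary drives the tcm
```

Everything built from the vectorizer (dtm, tcm, GloVe training, etc.) inherits the stopword filtering automatically.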
Answer to the second question:
text2vec doesn't provide high-level functions for reading PDF documents. However, it allows the user to provide a custom reader function. All you need is to read the full articles with some function and reshape them into a character vector where each element corresponds to the desired unit of information (full article, paragraph, etc.). For example, you can easily combine lines into a single element with the paste() function:
article = c("sentence 1.", "sentence 2")
full_article = paste(article, collapse = ' ')
# "sentence 1. sentence 2"
Hope this helps.
We have models for converting words to vectors (for example the word2vec model). Do similar models exist which convert sentences/documents into vectors, using perhaps the vectors learnt for the individual words?
1) Skip-gram method: paper here and the tool that uses it, Google's word2vec
2) Using LSTM-RNN to form semantic representations of sentences.
3) Representations of sentences and documents. The Paragraph vector is introduced in this paper. It is basically an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents.
4) Though this paper does not form sentence/paragraph vectors, it is simple enough to do so. One can just plug in the individual word vectors (GloVe word vectors are found to give the best performance) and then form a vector representation of the whole sentence/paragraph.
5) Using a CNN to summarize documents.
It all depends on:
which vector model you're using
what is the purpose of the model
your creativity in combining word vectors into a document vector
If you've generated the model using Word2Vec, you can either try:
Doc2Vec: https://radimrehurek.com/gensim/models/doc2vec.html
Wiki2Vec: https://github.com/idio/wiki2vec
Or you can do what some people do, i.e. sum the vectors of all content words in the document and normalise, e.g. https://github.com/alvations/oque/blob/master/o.py#L13 (note: lines 17-18 are a hack to reduce noise):
import numpy as np

def sent_vectorizer(sent, model):
    sent_vec = np.zeros(400)  # assumes 400-dimensional word vectors
    numw = 0
    for w in sent:
        try:
            sent_vec = np.add(sent_vec, model[w])
            numw += 1
        except KeyError:      # skip out-of-vocabulary words
            pass
    # note: the sum is L2-normalised rather than divided by numw
    return sent_vec / np.sqrt(sent_vec.dot(sent_vec))
A solution that is slightly less off the shelf, but probably hard to beat in terms of accuracy if you have a specific thing you're trying to do:
Build an RNN (with LSTM or GRU memory cells, comparison here) and optimize the error function of the actual task you're trying to accomplish. You feed it your sentence and train it to produce the output you want. The activations of the network after being fed your sentence are a representation of the sentence (although you might only care about the network's output).
You can represent the sentence as a sequence of one-hot encoded characters, as a sequence of one-hot encoded words, or as a sequence of word vectors (e.g. GloVe or word2vec). If you use word vectors, you can keep backpropagating into the word vectors, updating their weights, so you also get custom word vectors tweaked specifically for the task you're doing.
There are a lot of ways to answer this question. The answer depends on your interpretation of phrases and sentences.
Distributional models such as word2vec, which provide a vector representation for each word, can only show how a word is typically used in a window-based context in relation to other words. Based on this interpretation of context-word relations, you can take the average vector of all words in a sentence as the vector representation of the sentence. For example, in this sentence:
vegetarians eat vegetables .
we can take the normalised average of the word vectors as the sentence's vector representation.
The problem is the compositional nature of sentences. If you take the average of word vectors as above, these two sentences have the same vector representation:
vegetables eat vegetarians .
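The order-invariance is easy to see with toy numbers (the vectors below are invented purely for illustration):

```r
# Toy 2-d word vectors (made-up values, chosen to be exact in binary)
word_vectors <- rbind(
  vegetarians = c(1.00, 0.25),
  eat         = c(0.50, 0.50),
  vegetables  = c(0.25, 1.00)
)

s1 <- colMeans(word_vectors[c("vegetarians", "eat", "vegetables"), ])
s2 <- colMeans(word_vectors[c("vegetables", "eat", "vegetarians"), ])
identical(s1, s2)  # TRUE: averaging ignores word order
```

Any representation built purely by summing or averaging word vectors will conflate all reorderings of the same words.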
There is a lot of research in the distributional tradition on learning tree structures through corpus processing, for example Parsing With Compositional Vector Grammars. This video also explains the method.
Again, I want to emphasise interpretation. These sentence vectors probably have their own meanings in your application. For instance, in sentiment analysis in this project at Stanford, the meaning they are seeking is the positive/negative sentiment of a sentence. Even if you find a perfect vector representation for a sentence, there are philosophical debates that these are not the actual meanings of sentences if you cannot judge the truth condition (David Lewis, "General Semantics", 1970). That's why there are lines of work focusing on computer vision (this paper or this paper). My point is that it can depend entirely on your application and your interpretation of the vectors.
I hope you welcome an implementation. I faced a similar problem converting movie plots for analysis; after trying many other solutions, I stuck with an implementation that made my job easier. The code snippet is attached below.
Install 'spaCy' from the following link.
import spacy

nlp = spacy.load('en')        # load the English model
doc = nlp(YOUR_DOC_HERE)      # YOUR_DOC_HERE: your document as a string
vec = doc.vector              # average of the document's token vectors
Hope this helps.