Workflow for integrating Syntaxnet into the analysis of long(er) documents

I am trying to figure out what improvements in the text analysis of long documents can be obtained by using Syntaxnet, rather than something "dumb" like word counts, sentence length, etc.
The goal would be to get more accurate linguistic measures (such as "tone" or "sophistication"), for quantifying attributes of long(er) documents like newspaper articles or letters/memos.
What I am trying to figure out is what to do with Syntaxnet output once the POS tagging is concluded. What types of things do people use to process Syntaxnet output?
Ideally I am looking for an example workflow that transforms Syntaxnet output into something quantitative that can be used in statistical analysis.
Also, can someone point me to sources that show how the inferences drawn from a "smart" analysis with Syntaxnet compare to those that can be attained by word counts, sentence length, etc.?
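For concreteness, here is a rough sketch of the kind of post-processing I have in mind, assuming CoNLL-formatted Syntaxnet output; the column positions and the feature choices are just my illustration, not an established recipe:
    from collections import Counter

    def sentence_features(token_lines):
        # token_lines: tab-separated CoNLL rows for one sentence
        # (assumed columns: 0=ID, 3=UPOS, 6=HEAD, 7=DEPREL -- adjust to your output)
        rows = [line.split("\t") for line in token_lines]
        n = len(rows)
        pos_counts = Counter(r[3] for r in rows)
        dep_distances = [abs(int(r[0]) - int(r[6])) for r in rows if int(r[6]) > 0]
        return {
            "n_tokens": n,
            "adj_ratio": pos_counts["ADJ"] / n,                     # crude "descriptiveness" proxy
            "subordination": sum(1 for r in rows if r[7] in ("ccomp", "advcl", "acl")) / n,
            "mean_dep_distance": sum(dep_distances) / max(len(dep_distances), 1),
        }

    def document_features(parsed_sentences):
        # average the per-sentence features so each document becomes one row
        # of quantitative variables for a statistical model
        feats = [sentence_features(s) for s in parsed_sentences]
        return {k: sum(f[k] for f in feats) / len(feats) for k in feats[0]}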

Related

Is the information captured by Doc2Vec a subset of the information captured by BERT?

Both Doc2Vec and BERT are NLP models used to create vectors for text. The original BERT model used vectors of size 768, while the original Doc2Vec model used vectors of size 300. Would it be reasonable to assume that all the information captured by D2V is a subset of the information captured by BERT?
I ask, because I want to think about how to compare differences in representations for a set of sentences between models. I am thinking I could project the BERT vectors into a D2V subspace and compare those vectors to the D2V vectors for the same sentence, but this relies on the assumption that the subspace I'm projecting the BERT vectors into is actually comparable (i.e., the same type of information) to the D2V space.
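For concreteness, here is a rough numpy sketch of the projection-based comparison I'm imagining; the least-squares mapping and the variable names are just my illustration, assuming I already have matched sentence vectors from both models:
    import numpy as np

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # bert_train (n, 768) and d2v_train (n, 300): vectors for the same training sentences
    def fit_projection(bert_train, d2v_train):
        W, *_ = np.linalg.lstsq(bert_train, d2v_train, rcond=None)  # linear map, shape (768, 300)
        return W

    def alignment_scores(bert_test, d2v_test, W):
        projected = bert_test @ W                                   # BERT vectors mapped into D2V space
        return [cosine(p, d) for p, d in zip(projected, d2v_test)]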
The objective functions, while different, are quite similar. The Cloze task for BERT and the next-word prediction for D2V both try to create associations between a word and its surrounding words. BERT can look bidirectionally, while D2V can only look at a window, moving from the left to the right of a sentence. The same objective function doesn't necessarily mean they're capturing the same information, but it seems as though the covariates D2V uses are a subset of the covariates used by BERT.
Interested to hear other people's thoughts.
I'll assume by Doc2Vec you mean the "Paragraph Vector" algorithm, which is often called Doc2Vec (including in libraries like Python Gensim).
That Doc2Vec is closely related to word2vec: it's essentially word2vec with a synthetic floating pseudoword vector over the entire text. It models texts via a shallow network that can't really consider word-order, or the composite-meaning of word runs, except in a very general 'nearness' sense.
So, a Doc2Vec model will not generate realistic/grammatical completions/summaries from vectors (except perhaps in very-limited single-word tests).
What info Doc2Vec most captures can be somewhat influenced by parameter choices, especially choice-of-mode and window (in modes where that matters, like when co-training word-vectors).
BERT is a far deeper model with more internal layers and a larger default dimensionality of text-representations. Its training mechanisms give it the potential to differentiate between significant word-orderings – and thus be sensitive to grammar and composite phrases beyond what Doc2Vec can learn. It can generate plausible multi-word completions/summarizations.
You could certainly train a 768-dimension Doc2Vec model on the same texts as a BERT model & compare the results. The resulting summary text-vectors, from the 2 models, would likely perform quite differently on key tasks. If you need to detect subtle shifts in meaning in short texts – things like the reversal of meaning from the insertion of a single 'not' – I'd expect the BERT model to dominate (if sufficiently trained). On broader tasks less sensitive to grammar, like topic-classification, the Doc2Vec model might be competitive, or (given its simplicity) attractive in its ability to achieve certain targets with far less data or quicker training.
So, it'd be improper to assume that what Doc2Vec captures is a proper subset of what BERT does.
You could try learning a mapping from one model to the other (possibly including dimensionality-reduction), as there are surely many consistent correlations between the trained coordinate-spaces. But the act of creating such a mapping requires starting assumptions that certain vectors "should" line-up, or be in similar configurations.
If trying to understand what's unique/valuable across the two options, it's likely better to compare how the models rank a text's neighbors – do certain kinds of similarities dominate in one or the other? Or, try both as inputs to downstream classification/info-retrieval tasks, and see where they each shine.
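As a minimal sketch of that neighbor-comparison idea, assuming you already have per-document vectors from each model in plain dicts (names are illustrative, not any library's API):
    import numpy as np

    def top_neighbors(vectors, query_id, k=10):
        # vectors: dict of {doc_id: 1-D numpy array}
        q = vectors[query_id]
        sims = {i: float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
                for i, v in vectors.items() if i != query_id}
        return [i for i, _ in sorted(sims.items(), key=lambda kv: -kv[1])[:k]]

    def neighbor_overlap(d2v_vectors, bert_vectors, query_id, k=10):
        a = set(top_neighbors(d2v_vectors, query_id, k))
        b = set(top_neighbors(bert_vectors, query_id, k))
        return len(a & b) / k  # 1.0 means both models agree on this doc's neighborhood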
(With sufficient data & training time, I'd expect BERT as the more-sophisticated model to usually provide better results – especially if it's also allotted a larger representation. But for some tasks, and limited data/compute/time resources, Doc2Vec might shine.)

Compare vectors of a doc and just a word

So, I have to compare the vector of an article with the vector of a single word, and I don't have any idea how to do it. It looks like BERT and Doc2Vec work well with long texts, while Word2Vec works with single words. But how do I compare a long text with just a word?
Some modes of the "Paragraph Vector" algorithm (aka Doc2Vec in libraries like Python gensim) will train both doc-vectors and word-vectors into a shared coordinate space. (Specifically, any of the PV-DM dm=1 modes, or the PV-DBOW mode dm=0 if you enable the non-default interleaved word-vector training using dbow_words=1.)
In such a case, you can compare Doc2Vec doc-vectors with the co-trained word-vectors, with some utility. You can see some examples in the followup paper from the originators of the "Paragraph Vector" algorithm, "Document Embedding with Paragraph Vectors".
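For instance, a minimal gensim (4.x attribute names) sketch of such a co-trained mode, with the corpus/tokenization left as a placeholder:
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # tokenized_corpus: placeholder -- a list of token lists, one per document
    docs = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(tokenized_corpus)]
    model = Doc2Vec(docs, vector_size=300, dm=0, dbow_words=1,
                    window=5, min_count=5, epochs=20)

    # documents whose vectors sit closest to the word-vector for 'education'
    print(model.dv.most_similar(positive=[model.wv["education"]], topn=10))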
However, beware that single words, having been trained only in contexts of use, may not have vectors that match what we'd expect of those same words when intended as overarching categories. For example, education as used in many sentences wouldn't necessarily take on all the facets/breadth that you might expect from Education as a category-header.
Such single word-vectors might work better than nothing, and perhaps help serve as a bootstrapping tool. But it'd be better if you had expert-labelled examples of documents belonging to categories of interest. Then you could also use more advanced classification algorithms, sensitive to categories that wouldn't necessarily be summarized-by (and in a tight sphere around) any single vector point. In real domains-of-interest, that'd likely do better than using single-word-vectors as category-anchors.
For any other non-Doc2Vec method of vectorizing a text, you could conceivably get a comparable vector for a single word by supplying a single-word text to the method. (Even in a Doc2Vec mode that doesn't create word-vectors, like pure PV-DBOW, you could use that model's out-of-training-text inference capability to infer a doc-vector for a single-word doc, for known words.)
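A short sketch of that inference route, assuming a trained gensim Doc2Vec model named model and a word that is in its vocabulary:
    # treat the single word as a tiny one-word document and infer a doc-vector for it
    one_word_vec = model.infer_vector(["education"])
    print(model.dv.most_similar(positive=[one_word_vec], topn=10))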
But again, such simplified/degenerate single-word outputs might not well match the more general/textured categories you're seeking. The models are more typically used for larger contexts, and narrowing their output to a single word might reflect the peculiarities of that unnatural input case moreso than the usual import of the word in real context.
You can use BERT as-is for words too. A single word is just a really short sentence, so, in theory, you should be able to use any sentence embedding you like.
But if you don't have any supervised data, BERT is not the best option for you and there are better options out there!
I think it's best to first try Doc2Vec and, if that doesn't work, switch to something else like Skip-Thoughts or USE.
Sorry that I can't help you much, it's completely task and data dependent and you should test different things.
Based on your further comments that explain your problem a bit more, it sounds like you're actually trying to do Topic Modelling (categorizing documents by a given word is equivalent to labeling them with that topic). If this is what you're doing, I would recommend looking into LDA and variants of it (e.g. GuidedLDA).
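If you go that route, an illustrative gensim LDA sketch might look like this (tokenized_docs is a placeholder list of token lists; GuidedLDA has its own, separate API):
    from gensim import corpora
    from gensim.models import LdaModel

    # tokenized_docs: placeholder -- a list of token lists, one per document
    dictionary = corpora.Dictionary(tokenized_docs)
    bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lda = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=10, passes=10)
    print(lda.print_topics())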

how to train Word2Vec model properly for a special purpose

My question concerns the proper training of a Word2Vec model for a unique and really specific use. See Word2Vec details here.
I am working on identifying noun-adjective (or ) relationships within the word embeddings.
(E.g. we have 'nice car' in a sentence of the data-set. Given the word embeddings of the corpus and the nouns and adjectives all labeled, I am trying to design a technique to find the proper vector that connects 'nice' with 'car'.)
Of course I am not trying to connect only that pair of words; the technique should work for all relationships. A supervised approach is being taken at the moment, with the plan to then work towards designing an unsupervised method.
Now that you understand what I am trying to do, I will explain the problem. I obviously know that word2vec needs to be trained on large amounts of data, to learn the proper embeddings as accurately as possible, but I am afraid to give it more data than the data-set with labelled sentences (500-700).
I am afraid that if I give it more data to train on (e.g. the latest Wikipedia dump), it will learn better vectors, but the extra data will influence the positioning of my words, and the word relationships will then be biased by the extra training data (e.g. what if there is also 'nice Apple' in the extra training data? Then the positioning of the word 'nice' could be compromised).
Hopefully this makes sense and I am not making bad assumptions, but I am just in the dilemma of having bad vectors because of not enough training data, or having good vectors, but compromised vector positioning in the word embeddings.
What would be the proper way to train it? On as much training data as possible (billions of words), or just the labelled data-set (500-700 sentences)?
Thank you kindly for your time, and let me know if anything that I explained does not make sense.
As always in similar situations it is best to check...
I wonder if you have tested the difference between the results of training on the labelled dataset vs. the Wikipedia dataset. Do the issues you are afraid of actually appear?
I would just run an experiment and check if the vectors in both cases are indeed different (statistically speaking).
I suspect that you may introduce some noise with the larger corpus, but more data may be beneficial with respect to vocabulary coverage (larger corpus, more universal). It all depends on your expected use case. It is likely to be a trade-off between high precision with very low recall vs. so-so precision with relatively good recall.
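For example, a rough sketch of that experiment in gensim (labelled_sentences and wiki_sentences are placeholder lists of token lists; the parameters are illustrative, not tuned):
    from gensim.models import Word2Vec

    # train once on the small labelled set, once on the combined corpus
    small = Word2Vec(labelled_sentences, vector_size=100, window=5,
                     min_count=2, sg=1, epochs=20)
    combined = Word2Vec(labelled_sentences + wiki_sentences, vector_size=100, window=5,
                        min_count=5, sg=1, epochs=5)

    # do the neighborhoods of a probe word actually differ between the two models?
    for name, model in [("small", small), ("small+wiki", combined)]:
        print(name, model.wv.most_similar("nice", topn=10))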

word vector and paragraph vector query

I am trying to understand the relation between word2vec and doc2vec vectors in Gensim's implementation. In my application, I am tagging multiple documents with the same label (topic), and I am training a doc2vec model on my corpus using dbow_words=1 in order to train word vectors as well. I have been able to obtain similarities between word and document vectors in this fashion, which does make a lot of sense.
For example, getting document labels similar to a word:
doc2vec_model.docvecs.most_similar(positive=[doc2vec_model["management"]], topn=50)
My question, however, is about the theoretical interpretation of computing similarity between word2vec and doc2vec vectors. Would it be safe to assume that, when trained on the same corpus with the same dimensionality (d = 200), word vectors and document vectors can always be compared to find similar words for a document label or similar document labels for a word? Any suggestions/ideas are most welcome.
Question 2: My other question is about the impact of a word's high/low frequency on the final word2vec model. If wordA and wordB have similar contexts in a particular doc-label (set of documents) but wordA has a much higher frequency than wordB, would wordB have a higher similarity score with the corresponding doc label or not? I am trying to train multiple word2vec models by sampling the corpus in a temporal fashion, and want to know whether the hypothesis holds that, as words get more and more frequent (assuming the context stays relatively similar), the similarity score with a document label would also increase. Am I wrong to make this assumption? Any suggestions/ideas are very welcome.
Thanks,
Manish
In a training mode where word-vectors and doctag-vectors are interchangeably used during training, for the same surrounding-words prediction-task, they tend to be meaningfully comparable. (Your mode, DBOW with interleaved skip-gram word-training, fits this and is the mode used by the paper 'Document Embedding with Paragraph Vectors'.)
Your second question is abstract and speculative; I think you'd have to test those ideas yourself. The Word2Vec/Doc2Vec processes train the vectors to be good at certain mechanistic word-prediction tasks, subject to the constraints of the model and tradeoffs with other vectors' quality. That the resulting spatial arrangement happens to be then useful for other purposes – ranked/absolute similarity, similarity along certain conceptual lines, classification, etc. – is then just an observed, pragmatic benefit. It's a 'trick that works', and might yield insights, but many of the ways models change in response to different parameter choices or corpus characteristics haven't been theoretically or experimentally worked-out.

explain the algorithm for document summarization

I want to do a project on document summarization.
Can anyone please explain the algorithm for document summarization using graph based approach?
Also, can someone provide me links to a few good research papers?
Take a look at TextRank and LexRank.
LexRank is an algorithm essentially identical to TextRank, and both use this approach for document summarization. The two methods were developed by different groups at the same time, and LexRank simply focused on summarization, but could just as easily be used for keyphrase extraction or any other NLP ranking task.
In both algorithms, the sentences are ranked by applying PageRank to the resulting graph. A summary is formed by combining the top ranking sentences, using a threshold or length cutoff to limit the size of the summary.
https://en.wikipedia.org/wiki/Automatic_summarization#Unsupervised_approaches:_TextRank_and_LexRank
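For a feel of the mechanics, here is a bare-bones extractive summarizer in the TextRank/LexRank spirit; this is my own simplification using plain token-overlap similarity rather than the papers' exact weighting, and it requires networkx:
    import networkx as nx

    def summarize(sentences, top_n=3):
        # crude similarity: number of shared tokens between each pair of sentences
        tokens = [set(s.lower().split()) for s in sentences]
        graph = nx.Graph()
        graph.add_nodes_from(range(len(sentences)))
        for i in range(len(sentences)):
            for j in range(i + 1, len(sentences)):
                overlap = len(tokens[i] & tokens[j])
                if overlap:
                    graph.add_edge(i, j, weight=overlap)
        scores = nx.pagerank(graph, weight="weight")   # rank sentences by graph centrality
        best = sorted(scores, key=scores.get, reverse=True)[:top_n]
        return [sentences[i] for i in sorted(best)]    # keep original sentence order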
