I have a problem for network.
For one document I am extracting some information. I am drawing nice graphs for them. But in a document information flows. I am trying to depict it in graph like the way one reads a text flowing with text and then important most entity first and then the next important one.
To understand and grasp this problem what are the kinds of things I have to study or which aspect of network theory or graph theory deals with it.
If any one can kindly refer up.
Regs,
SK.
First of all, I'm not an expert in linguistic or study of languages. I think I understand what you're trying to do, and I don't know what's the best way to do it.
If I got it right, you want to determine some centrality measure for your words (that would explain the social network reference), to find those who are the most linked to others, is that it ?
The problem if you try that is that you will certainly find that the most central words are the most inintersting ones (the, if, then, some redundant adjectives...), if you don't apply a tokenization and lemmization procedure beforehand. Thus you could separate only nouns and stemming of verbs used, and then only you could try your approach.
Another problem that you must keep in mind is that words are important both by their presence and by their rarity (see tf-idf weight measure for instance).
To conclude, I did the following search on google :
"n gram graph language centrality word"
and found this paper that seems interesting for what you're asking (I might give it a look myself !) :
LexRank: Graph-based Lexical Centrality as Salience in Text Summarization
Related
I'm making a program that implements procedural content generation (PCG) to create maps in a 2d game.
I use the graph data structure as the basis. then the graph will be transformed into a map like in the example image I attached.
with graph specifications as follows:
-vertex can have more than 4 edges
-allowed the formation of cycles in the graph
any suggestions on what method I can use to transform the graph to a 2d map in a grid with space-tight results?
thanks
Uh, that is a tough one. The first problem you will encounter is whether this is even possible for the graph you use. See more below for that specific topic.
Let's say we ignore the fact that your graph could be impossible to map to a grid. I faced the same issue in my Master's Thesis a few years back. (PDF available here; 3.4 World Generation; page 25). I tried to find an algorithm, that could generate my world from a graph structure but ultimately failed. I tried placing one element after the other and implemented some backtracking in case it got stuck. But in the end you're facing a similar complexity to calculating chess moves. At some point you know you messed up, but you don't know how many steps you should go back/reverse, before trying the next one. If you try to solve this by brute force, you're not going to have a good time. And I did not come up with good heuristics to solve it in an adequate time.
My solution: I decided in the end to go with AnswerSet Programming. You're basically not solving the problem with an algorithm, but you find a (more or less) elegant logical representation of your problem and let a logic solver (program specifically made to find a valid solution to your logical problem-representation) do the work. Have a look in my thesis about the details, it was a few years ago and I didn't use one since. I remember however, that this process was not easy and it took me a few days to find a good logical representation of my problem.
Another question to ask: Could you work on the grid directly? Or maybe on a graph structure representing a grid? In the end a grid is nothing else than a graph; every cell is a node and neighbouring connections are the edges. I have quite some experience in the field and would be happy to help you, if you'd like to share what you want to achieve with your generator. I have also a vast collection of resources about procedural generation, maybe you find something helpful there, too.
More on the planarity of a graph: For your graph to be mappable to a plane, it needs to be planar, and checking so is also not trivial. The easiest way - if I'm not mistaken - is to prove the existence of a non-planar sub-graph, e.g. the K5 (the smallest non-planar a complete graph) or K3,3 (the smallest non-planar complete bipartite graph). And even if your graph is planar, it is not necessarily guaranteed that you can put it on your grid.
I tried to search but couldn't find much helpful information regarding this topic. That's why I am asking it here...
I know there are various methods to classify texts (like Logistic regression etc.) and also we have neural network.
But, I was wondering if it is possible to 'classify the texts into multiple classes' using graph theory?
If yes, how should I proceed? Please guide me.
Example:
I like jeansp -pos
I like toyota -pos
I it so-so place -neutral
I hated that trip -neg
I love that shirt -pos
that place was horrible -neg
I liked food but service was bad -neutral
Assume each document is a node, and each word is also a node. Documents have edges to words.
Now, some of your documents have labels and some don't.
You can use graph convolutional networks (GCN) to classify the unlabelled documents.
Take a look at the Python Geometric package that has implemented different versions of graph conovlutional networks. Create your input in a way that Python Geometric accepts, and you're done.
So, I have to compare vector of article and vector of single word. And I don't have any idea how to do it. Looks like that BERT and Doc2wec good work with long text, Word2vec works with single words. But how to compare long text with just a word?
Some modes of the "Paragraph Vector" algorithm (aka Doc2Vec in libraries like Python gensim) will train both doc-vectors and word-vectors into the a shared coordinate space. (Specifically, any of the PV-DM dm=1 modes, or the PV-DBOW mode dm=0 if you enable the non-default interleaved word-vector training using dbow_words=1.)
In such a case, you can compare Doc2Vec doc-vectors with the co-trained word-vectors, with some utility. You can see some examples in the followup paper form the originators of the "Paragraph Vector" algorithm, "Document Embedding with Paragraph Vectors".
However, beware that vectors for single words, having been trained in contexts of use, may not have vectors that match what we'd expect of those same words when intended as overarching categories. For example, education as used in many sentences wouldn't necessarily assume all facets/breadth that you might expect from Education as a category-header.
Such single word-vectors might work better than nothing, and perhaps help serve as a bootstrapping tool. But, it'd be better if you had expert-labelled examples of documents belonging to categories of interest. Then you you could also use more advanced classification algorithms, sensitive to categories that wouldn't necessarily be summarized-by (and in a tight sphere around) any single vector point. In real domains-of-interest, that'd likely do better than using single-word-vectors as category-anchors.
For any other non-Doc2Vec method of vectorizing a text, you could conceivably get a comparable vector for a single word by supplying a single-word text to the method. (Even in a Doc2Vec mode that doesn't create word-vectors, like pure PV-DBOW, you could use that model's out-of-training-text inference capability to infer a doc-vector for a single-word doc, for known words.)
But again, such simplified/degenerate single-word outputs might not well match the more general/textured categories you're seeking. The models are more typically used for larger contexts, and narrowing their output to a single word might reflect the peculiarities of that unnatural input case moreso than the usual import of the word in real context.
You can use BERT as is for words too. a single word is just a really short sentence. so, in theory, you should be able to use any sentence embedding as you like.
But if you don't have any supervised data, BERT is not the best option for you and there are better options out there!
I think it's best to first try doc2vec and if it didn't work then switch to something else like SkipThoughts or USE.
Sorry that I can't help you much, it's completely task and data dependent and you should test different things.
Based on your further comments that explain your problem a bit more, it sounds like you're actually trying to do Topic Modelling (categorizing documents by a given word is equivalent to labeling them with that topic). If this is what you're doing, I would recommend looking into LDA and variants of it (eg guidedLDA as an example).
I am quite new to NLP. My question is can I combine words of same meaning into one using NLP, for example, considering the following rows;
1. It’s too noisy here
2. Come on people whats up with all the chatter
3. Why are people shouting like crazy
4. Shut up people, why are you making so much noise
As one can notice, the common aspect here is that the people are complaining about the noise.
noisy, chatter, shouting, noise -> Noise
Is it possible to group the words using a common entity using NLP. I am using R to come up with a solution to this problem.
I have used a sample twitter data set and my expected output will be a table which contains;
Noise
It’s too noisy here
Come on people whats up with all the chatter
Why are people shouting like crazy
Shut up people, why are you making so much noise
I did search the web for reference before posting here. Any suggestion or valuable inputs will be of much help.
Thanks
The problem you mention is better known as paraphrasing, and it is not completetly solved. Maybe if you want a fast solution, you can start replacing synonyms, wordnet can help with that.
Other idea is calculate sentence similarity (just getting a vector representation of each sentence and use cosine distance to measure similarity to each other)
I think this paper could provide a good introduction for your problem.
I am using the InfoMap algorithm in the igraph package to perform community detection on a directed and non-weighted graph (34943 vertices, 206366 edges). In the graph, vertices represent websites and edges represent the existence of a hyperlink between websites.
A problem I have encountered after running the algorithm is that the majority of vertices have a membership in a single massive community (32920 or 94%). The rest of the vertices are dispersed into hundreds of other tiny communities.
I have tried different settings with the nb.trials parameter (i.e. 50, 100, and now running 500). However, this doesn't seem to change the result much.
I am feeling rather exasperated because the run-time on the algorithm is quite high, so I have to wait each time for the results (with no luck yet!!).
Many thanks.
Thanks for all the excellent comments. In the end, I got it working by downloading and running the source code for Infomap, which is available at: http://www.mapequation.org/code.html.
Due to licence issues, the latest code has not been integrated with igraph.
This solved the problem of too many nodes being 'lumped' into a single massive community.
Specifically, I used the following options from the command line: -N 10 --directed --two-level --map
Kudos to Martin Rosvall from the Infomap project for providing me with detailed help to resolve this problem.
For the interested reader, here is more information about this issue:
When a network collapses into one major cluster, it is most often because of a very dense and random link structure ... In the code for directed networks implemented in iGraph, teleportation is encoded. If many nodes have no outlinks, the effect of teleportation can be significant because it randomly connect nodes. We have made new code available here: http://www.mapequation.org/code.html that can cluster network without encoding the random teleportation necessary to make the dynamics ergodic. For details, see this paper: http://pre.aps.org/abstract/PRE/v85/i5/e056107
I was going to put this in a comment, but it ended up being too long and hard to read in that format, so this is a tangentially related answer.
One thing you should do is assess whether the algorithm is doing a good job at finding community structure. You can try to visualise your network to establish:
Is the algorithm returning community structures that make sense? Maybe there is one massive community?
If not does the visualisation provide insight as to why?
This will help inform your next steps. Maybe the structure of the network requires a different algorithm?
One thing I find useful for large networks is plotting your edges as a heatmap. This is simple to do if you have your edges stored in an adjacency matrix.
For this, you can use the image function, passing in your matrix of edges as the argument z. Hopefully this will allow you to see by eye the community structure.
However you also want to assess the correctness of your algorithm, so you want to sort the nodes (rows and columns of your adjacency matrix) by the community they've been assigned to.
Another thing to note is that if your edges are directed it may be more difficult to assess by eye as edges can appear on either side of the diagonal of the heatmap. One thing you can do is instead plot the underlying graph -- that is the adjacency matrix assuming your edges are undirected.
If your algorithm is doing a good job, you would expect to see square blocks along the diagonal, one for each detected community.