Text classification using graphs in natural language processing

I tried searching but couldn't find much helpful information on this topic, which is why I am asking here.
I know there are various methods to classify texts (logistic regression and so on), and we also have neural networks.
But I was wondering: is it possible to classify texts into multiple classes using graph theory?
If yes, how should I proceed? Please guide me.
Example:
I like jeans -pos
I like toyota -pos
It is a so-so place -neutral
I hated that trip -neg
I love that shirt -pos
that place was horrible -neg
I liked food but service was bad -neutral

Assume each document is a node, and each word is also a node. Documents have edges to words.
Now, some of your documents have labels and some don't.
You can use graph convolutional networks (GCN) to classify the unlabelled documents.
Take a look at the PyTorch Geometric package, which implements different versions of graph convolutional networks. Create your input in a format PyTorch Geometric accepts, and you're done.
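That answer stops short of code, so here is a minimal, self-contained sketch of what the setup could look like with PyTorch Geometric. The toy corpus, the one-hot node features, the two-layer GCN, and all hyperparameters are illustrative assumptions, not a prescribed pipeline; in practice you would use your full corpus and richer features.

```python
import torch
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv

# Toy corpus mirroring the question's examples; the last document is unlabelled.
docs = ["I like jeans", "I like toyota", "I hated that trip",
        "that place was horrible", "I love that shirt"]
labels = [0, 0, 1, 1]                                       # 0 = pos, 1 = neg

vocab = sorted({w for d in docs for w in d.lower().split()})
word_id = {w: i + len(docs) for i, w in enumerate(vocab)}   # word nodes come after doc nodes

# Document-word edges, added in both directions for message passing.
edges = []
for d_idx, doc in enumerate(docs):
    for w in doc.lower().split():
        edges.append((d_idx, word_id[w]))
        edges.append((word_id[w], d_idx))
edge_index = torch.tensor(edges, dtype=torch.long).t().contiguous()

num_nodes = len(docs) + len(vocab)
x = torch.eye(num_nodes)                                    # one-hot node features
y = torch.zeros(num_nodes, dtype=torch.long)
y[: len(labels)] = torch.tensor(labels)
train_mask = torch.zeros(num_nodes, dtype=torch.bool)
train_mask[: len(labels)] = True                            # loss only on labelled documents

data = Data(x=x, edge_index=edge_index, y=y, train_mask=train_mask)

class GCN(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, num_classes):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, num_classes)

    def forward(self, x, edge_index):
        return self.conv2(F.relu(self.conv1(x, edge_index)), edge_index)

model = GCN(num_nodes, 16, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
for _ in range(200):
    optimizer.zero_grad()
    out = model(data.x, data.edge_index)
    loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

# Predicted class for every node; the first len(docs) entries are the documents.
pred = model(data.x, data.edge_index).argmax(dim=1)
print(pred[: len(docs)])
```

This is essentially the transductive node-classification setting: unlabelled document nodes receive predictions because they share word nodes with labelled ones.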

Related

How to bundle related words with text mining in R

I have data of advertisements posted on a secondhand site to sell used smartphones. Each ad describes the product that is being sold. I want to know which parameters are most often described by sellers. For example: brand, model, colour, memory capacity, ...
By text mining all the text from the advertisements I would like to bundle similar words together in 1 category. For example: black, white, red, ... should be linked to each other as they all describe the colour of the phone.
Can this be done with clustering or categorisation and which text mining algorithms are equipped to do this?
Your best bet is something based on word2vec.
Clustering algorithms will not be able to reliably discover the human-language concept of color. So either you choose a supervised approach, or you need to find methods to first infer such concepts.
Word2vec is trained on the substitutability of words. Since in a sentence such as "I like the red color" you can substitute "red" with other colors, word2vec could theoretically help find such concepts in an unsupervised way, given lots and lots of data. But I'm sure you can also find counterexamples that break these concepts... Good luck... I doubt you'll manage to do this unsupervised.
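To make that concrete, here is a minimal gensim sketch. The library choice and the toy ads are assumptions (the question mentions R), and a real run would need far more text than this:

```python
from gensim.models import Word2Vec

ads = [
    "black iphone 64gb excellent condition",
    "white samsung galaxy 128gb",
    "red nokia phone small memory",
    "blue iphone cracked screen 64gb",
]
sentences = [ad.split() for ad in ads]          # word2vec expects tokenised sentences

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

# With enough data, colour words should end up close to each other because
# they appear in interchangeable contexts.
print(model.wv.most_similar("black", topn=3))
```

Once the vectors are trained, you could cluster them (e.g. with k-means) to get candidate "bundles" such as colours or memory sizes, and then inspect the clusters by hand.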

Graph Clustering

I've been searching for papers reviewing graph clustering methods, but haven't found one that satisfies me.
Please tell me what, in your opinion, is the best method for graph clustering. Sorry if my question is very general.
Thanks
With such an open question, I guess I can recommend that you try WEKA.
It has a nice set of user interfaces to let you import your dataset and then try and compare various classification and clustering algorithms on your data, without writing even one line of code.
After you identified an algorithm that works for your problem, you can then search for a nice and fast implementation in the programming language of your choice.
EDIT: since you mentioned the graph tag, maybe you should have a look at the Markov Cluster Algorithm; otherwise, you will have a hard time trying to represent your graph data in a format suitable for the distance-based clustering algorithms in WEKA.
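For intuition, here is a rough numpy sketch of the Markov Cluster iteration mentioned above (expansion followed by inflation); it illustrates the idea and is not the optimised reference implementation:

```python
import numpy as np

def mcl(adjacency, expansion=2, inflation=2, iterations=30):
    """Toy Markov Cluster iteration: alternate expansion and inflation steps."""
    M = adjacency.astype(float) + np.eye(len(adjacency))  # add self-loops
    M /= M.sum(axis=0)                                    # make columns stochastic
    for _ in range(iterations):
        M = np.linalg.matrix_power(M, expansion)          # expansion: spread random-walk flow
        M = M ** inflation                                # inflation: boost strong flows
        M /= M.sum(axis=0)                                # renormalise columns
    return M

# Two triangles joined by a single edge: MCL should separate them.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]])
flow = mcl(A)
clusters = {}
for node in range(len(A)):
    attractor = int(flow[:, node].argmax())   # nodes sharing an attractor share a cluster
    clusters.setdefault(attractor, []).append(node)
print(clusters)
```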

Community detection with InfoMap algorithm producing one massive module

I am using the InfoMap algorithm in the igraph package to perform community detection on a directed and non-weighted graph (34943 vertices, 206366 edges). In the graph, vertices represent websites and edges represent the existence of a hyperlink between websites.
A problem I have encountered after running the algorithm is that the majority of vertices end up in a single massive community (32920, or 94%). The rest of the vertices are dispersed into hundreds of other tiny communities.
I have tried different settings for the nb.trials parameter (i.e. 50, 100, and now 500). However, this doesn't seem to change the result much.
I am feeling rather exasperated because the run-time on the algorithm is quite high, so I have to wait each time for the results (with no luck yet!!).
Many thanks.
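For reference, the call being described looks roughly like this in python-igraph, where trials corresponds to the R interface's nb.trials; the random graph is only a stand-in for the real hyperlink network:

```python
from igraph import Graph

# Stand-in for the real 34943-node, 206366-edge hyperlink graph.
g = Graph.Erdos_Renyi(n=500, m=3000, directed=True)

clusters = g.community_infomap(trials=50)   # more trials = more independent Infomap attempts
print(clusters.summary())
print(sorted(clusters.sizes(), reverse=True)[:5])   # sizes of the largest communities
```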
Thanks for all the excellent comments. In the end, I got it working by downloading and running the source code for Infomap, which is available at: http://www.mapequation.org/code.html.
Due to licence issues, the latest code has not been integrated with igraph.
This solved the problem of too many nodes being 'lumped' into a single massive community.
Specifically, I used the following options from the command line: -N 10 --directed --two-level --map
Kudos to Martin Rosvall from the Infomap project for providing me with detailed help to resolve this problem.
For the interested reader, here is more information about this issue:
When a network collapses into one major cluster, it is most often because of a very dense and random link structure ... In the code for directed networks implemented in igraph, teleportation is encoded. If many nodes have no outlinks, the effect of teleportation can be significant because it randomly connects nodes. We have made new code available here: http://www.mapequation.org/code.html that can cluster networks without encoding the random teleportation necessary to make the dynamics ergodic. For details, see this paper: http://pre.aps.org/abstract/PRE/v85/i5/e056107
I was going to put this in a comment, but it ended up being too long and hard to read in that format, so this is a tangentially related answer.
One thing you should do is assess whether the algorithm is doing a good job at finding community structure. You can try to visualise your network to establish:
Is the algorithm returning community structures that make sense? Maybe there is one massive community?
If not does the visualisation provide insight as to why?
This will help inform your next steps. Maybe the structure of the network requires a different algorithm?
One thing I find useful for large networks is plotting your edges as a heatmap. This is simple to do if you have your edges stored in an adjacency matrix.
For this, you can use the image function, passing in your matrix of edges as the argument z. Hopefully this will allow you to see by eye the community structure.
However, you also want to assess the correctness of your algorithm, so you should sort the nodes (the rows and columns of your adjacency matrix) by the community they've been assigned to.
Another thing to note is that if your edges are directed it may be more difficult to assess by eye, as edges can appear on either side of the diagonal of the heatmap. One thing you can do instead is plot the underlying graph -- that is, the adjacency matrix assuming your edges are undirected.
If your algorithm is doing a good job, you would expect to see square blocks along the diagonal, one for each detected community.
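The answer has R's image() in mind; a rough Python/matplotlib equivalent of the same check (an assumption, not the answerer's code) could look like this:

```python
import numpy as np
import matplotlib.pyplot as plt

def community_heatmap(adjacency, membership):
    """Plot the adjacency matrix with nodes grouped by their community label."""
    order = np.argsort(membership)                  # group nodes community by community
    A = np.asarray(adjacency, dtype=float)
    A = np.maximum(A, A.T)                          # symmetrise: treat edges as undirected
    plt.imshow(A[np.ix_(order, order)], cmap="Greys", interpolation="nearest")
    plt.title("Adjacency matrix sorted by detected community")
    plt.show()

# Toy usage: a planted 3-block network where the blocks should appear as dark squares.
rng = np.random.default_rng(0)
blocks = np.kron(np.eye(3), np.ones((20, 20)))
A = (rng.random((60, 60)) < 0.02 + 0.3 * blocks).astype(int)
membership = np.repeat([0, 1, 2], 20)
community_heatmap(A, membership)
```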

Network Analysis

I have a problem involving networks.
From a single document I am extracting some information and drawing nice graphs from it. But in a document, information flows; I am trying to depict that flow in the graph, the way one reads a text: the most important entity first, and then the next most important one.
To understand and grasp this problem, what kinds of things should I study, or which aspects of network theory or graph theory deal with this?
If anyone can kindly point me to some references.
Regards,
SK.
First of all, I'm not an expert in linguistics or the study of languages. I think I understand what you're trying to do, but I don't know the best way to do it.
If I got it right, you want to compute some centrality measure for your words (that would explain the social network reference), to find those that are the most linked to others. Is that it?
The problem if you try that is that you will certainly find that the most central words are the most uninteresting ones (the, if, then, some redundant adjectives...), unless you apply a tokenization and lemmatization procedure beforehand. So you could keep only the nouns and the stems of the verbs, and only then try your approach.
Another problem to keep in mind is that words are important both by their presence and by their rarity (see the tf-idf weighting, for instance).
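To make the centrality idea concrete, here is a small sketch: build a word co-occurrence graph after dropping stopwords, then rank words with PageRank. The stopword list, window size, and toy sentences are illustrative assumptions.

```python
import networkx as nx

# A tiny stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "if", "then", "and", "or", "of", "to", "in", "is", "was"}

def word_graph(sentences, window=3):
    """Link words that co-occur within `window` tokens of each other (stopwords dropped)."""
    g = nx.Graph()
    for sent in sentences:
        tokens = [w for w in sent.lower().split() if w not in STOPWORDS]
        for i, w in enumerate(tokens):
            for other in tokens[i + 1 : i + window]:
                g.add_edge(w, other)
    return g

sentences = [
    "the judge reads the report before the trial",
    "the report describes the trial and the verdict",
    "the verdict surprised the judge",
]
ranking = nx.pagerank(word_graph(sentences))
print(sorted(ranking, key=ranking.get, reverse=True)[:5])   # most central content words
```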
To conclude, I did the following search on Google:
"n gram graph language centrality word"
and found this paper, which seems interesting for what you're asking (I might give it a look myself!):
LexRank: Graph-based Lexical Centrality as Salience in Text Summarization

Finding Large Example Graphs

I am doing a project that involves processing large, sparse graphs. Does anyone know of any publicly available data sets that can be processed into large graphs for testing? I'm looking for something like a Facebook friend network, or something a little smaller with the same flavor.
I found the Stanford Large Network Dataset Collection pretty useful.
If you asked nicely, you might be able to get Brian O'Meara's data set for treetapper. It's a pretty nice example of real-world data in that genre. Particularly, you'd probably be interested in the coauthorship data.
http://www.treetapper.org/
http://www.brianomeara.info/
GitHub's API is nice for building out graphs. I've messed around with the Python lib networkx to generate graphs of that network. Here's some sample code if you're interested.
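The linked sample code isn't reproduced here, but a minimal sketch of the idea (pull follower edges from the public GitHub REST API into networkx) might look like the following; unauthenticated requests are rate-limited and paginated, which this ignores for brevity.

```python
import networkx as nx
import requests

def followers(user):
    """First page of a user's followers via the public GitHub REST API."""
    resp = requests.get(f"https://api.github.com/users/{user}/followers")
    resp.raise_for_status()
    return [f["login"] for f in resp.json()]

def follower_graph(seed, depth=1):
    """Crawl a small follower neighbourhood around `seed` into a directed graph."""
    g = nx.DiGraph()
    frontier = [seed]
    for _ in range(depth):
        next_frontier = []
        for user in frontier:
            for follower in followers(user):
                g.add_edge(follower, user)        # edge points follower -> followed
                next_frontier.append(follower)
        frontier = next_frontier
    return g

g = follower_graph("octocat")
print(g.number_of_nodes(), "nodes,", g.number_of_edges(), "edges")
```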
DIMACS also has some data sets from their cluster challenge, and there's always the Graph500. The Boost Graph Library has a number of graph generators as well.
Depending on what you consider "large", there's the University of Florida Sparse Matrix Collection as well as some DIMACS Road Networks (mostly planar of course).
A few other ones:
Newman's page
Barabasi's page
Pajek software
Arena's page
Network Science
