I have data of advertisements posted on a secondhand site to sell used smartphones. Each ad describes the product that is being sold. I want to know which parameters are most often described by sellers. For example: brand, model, colour, memory capacity, ...
By text mining all the text from the advertisements I would like to bundle similar words together in 1 category. For example: black, white, red, ... should be linked to each other as they all describe the colour of the phone.
Can this be done with clustering or categorisation and which text mining algorithms are equipped to do this?
Your best bet is probably something based on word2vec.
Clustering algorithms will not be able to reliably discover the human-language concept of colour on their own. So either you choose some supervised approach, or you need to try methods that first infer such concepts.
Word2vec is trained on the substitutability of words. Since in a sentence such as "I like the red color" you can substitute red with other colours, word2vec could theoretically help find such concepts in an unsupervised way, given lots and lots of data. But I'm sure you can also find counterexamples that break these concepts... Good luck... I doubt you'll manage to do this unsupervised.
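If you do want to try the unsupervised route, here is a minimal sketch of that idea: train gensim's word2vec on the ad texts, then cluster the word vectors with k-means and hope that colours, brands, memory sizes etc. land in different clusters. The file name, cluster count and other parameters are just placeholders for illustration.

```python
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

# one tokenized advertisement per line (hypothetical file)
with open("ads.txt", encoding="utf-8") as f:
    sentences = [line.lower().split() for line in f]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)

words = model.wv.index_to_key        # vocabulary, most frequent first
vectors = model.wv[words]            # matrix of word vectors

kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(vectors)

# inspect the cluster a known colour word falls into
if "black" in model.wv:
    cluster_id = kmeans.predict(model.wv[["black"]])[0]
    neighbours = [w for w, c in zip(words, kmeans.labels_) if c == cluster_id]
    print(neighbours[:20])
```

Whether the "colour" cluster actually comes out clean depends entirely on how much data you have and how consistently sellers phrase things.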
I tried to search but couldn't find much helpful information regarding this topic. That's why I am asking it here...
I know there are various methods to classify texts (like logistic regression, etc.), and we also have neural networks.
But I was wondering: is it possible to classify texts into multiple classes using graph theory?
If yes, how should I proceed? Please guide me.
Example:
I like jeans -pos
I like toyota -pos
It is a so-so place -neutral
I hated that trip -neg
I love that shirt -pos
that place was horrible -neg
I liked food but service was bad -neutral
Assume each document is a node, and each word is also a node. Documents have edges to words.
Now, some of your documents have labels and some don't.
You can use graph convolutional networks (GCN) to classify the unlabelled documents.
Take a look at the PyTorch Geometric package, which has implemented different versions of graph convolutional networks. Create your input in a format that PyTorch Geometric accepts, and you're done.
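A minimal sketch of that document-word graph setup with PyTorch Geometric (installed as torch_geometric). The node counts, one-hot features, edges and labels below are tiny toy values just to show the wiring, not a recipe for your data.

```python
import torch
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv

# 3 document nodes (0-2) and 4 word nodes (3-6); undirected doc-word edges
edge_index = torch.tensor(
    [[0, 3, 0, 4, 1, 4, 2, 5, 2, 6],
     [3, 0, 4, 0, 4, 1, 5, 2, 6, 2]], dtype=torch.long)

x = torch.eye(7)                          # one-hot node features
y = torch.tensor([0, 1, 1, 0, 0, 0, 0])   # labels (only the documents' labels matter)
train_mask = torch.tensor([True, True, False, False, False, False, False])

data = Data(x=x, edge_index=edge_index, y=y, train_mask=train_mask)

class GCN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(data.num_node_features, 16)
        self.conv2 = GCNConv(16, 2)       # 2 classes

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index)

model = GCN()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(200):
    optimizer.zero_grad()
    out = model(data.x, data.edge_index)
    loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

# predicted class for the unlabelled document (node 2)
print(model(data.x, data.edge_index)[2].argmax().item())
```

In a real setup you would build the edge list from your corpus (documents connected to the words they contain, possibly with tf-idf edge weights) and use richer node features than one-hot identities.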
So, I have to compare the vector of an article with the vector of a single word, and I don't have any idea how to do it. It looks like BERT and Doc2Vec work well with long texts, while Word2vec works with single words. But how do I compare a long text with just a word?
Some modes of the "Paragraph Vector" algorithm (aka Doc2Vec in libraries like Python gensim) will train both doc-vectors and word-vectors into a shared coordinate space. (Specifically, any of the PV-DM dm=1 modes, or the PV-DBOW mode dm=0 if you enable the non-default interleaved word-vector training using dbow_words=1.)
In such a case, you can compare Doc2Vec doc-vectors with the co-trained word-vectors, with some utility. You can see some examples in the follow-up paper from the originators of the "Paragraph Vector" algorithm, "Document Embedding with Paragraph Vectors".
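For illustration, a small gensim sketch of comparing a co-trained doc-vector and word-vector in that shared space; the three tiny documents are placeholders.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    TaggedDocument(words=["free", "online", "education", "courses"], tags=["doc0"]),
    TaggedDocument(words=["university", "education", "policy", "reform"], tags=["doc1"]),
    TaggedDocument(words=["football", "match", "results", "today"], tags=["doc2"]),
]

# dm=1 (PV-DM) trains word-vectors and doc-vectors into the same space;
# dm=0 would need dbow_words=1 to get usable word-vectors at all
model = Doc2Vec(corpus, vector_size=50, dm=1, min_count=1, epochs=100)

# cosine similarity between a document vector and a word vector
print(model.wv.cosine_similarities(model.dv["doc0"],
                                   model.wv["education"].reshape(1, -1)))

# or: which documents are closest to the word vector?
print(model.dv.most_similar([model.wv["education"]], topn=2))
```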
However, beware that single words, having been trained in contexts of use, may not have vectors that match what we'd expect of those same words when intended as overarching categories. For example, education as used in many sentences wouldn't necessarily assume all the facets/breadth that you might expect from Education as a category header.
Such single word-vectors might work better than nothing, and perhaps help serve as a bootstrapping tool. But it'd be better if you had expert-labelled examples of documents belonging to categories of interest. Then you could also use more advanced classification algorithms, sensitive to categories that wouldn't necessarily be summarized by (and in a tight sphere around) any single vector point. In real domains of interest, that'd likely do better than using single word-vectors as category anchors.
For any other non-Doc2Vec method of vectorizing a text, you could conceivably get a comparable vector for a single word by supplying a single-word text to the method. (Even in a Doc2Vec mode that doesn't create word-vectors, like pure PV-DBOW, you could use that model's out-of-training-text inference capability to infer a doc-vector for a single-word doc, for known words.)
But again, such simplified/degenerate single-word outputs might not match well the more general/textured categories you're seeking. The models are more typically used for larger contexts, and narrowing their output to a single word might reflect the peculiarities of that unnatural input case more so than the usual import of the word in real contexts.
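A short sketch of that inference-based workaround, assuming model is an already-trained gensim Doc2Vec model (for example a pure PV-DBOW model with dm=0, which has no usable word-vectors of its own):

```python
# treat a single word as a one-token "document" and infer a vector for it
word_as_doc_vec = model.infer_vector(["education"])
article_vec = model.infer_vector("full text of the article ...".split())

# nearest trained documents to the single-word pseudo-document
print(model.dv.most_similar([word_as_doc_vec], topn=5))
```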
You can use BERT as-is for words too: a single word is just a really short sentence. So, in theory, you should be able to use any sentence embedding you like.
But if you don't have any supervised data, BERT is not the best option for you and there are better options out there!
I think it's best to first try doc2vec, and if it doesn't work, then switch to something else like Skip-Thoughts or USE.
Sorry that I can't help you much; it's completely task- and data-dependent, and you should test different things.
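If you do try the sentence-embedding route, here is a sketch with the sentence-transformers package; the model name is just one common choice, not a recommendation tuned to your data.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

article = "Long article text about schools, teachers and curricula ..."
word = "education"                      # treated as a very short "sentence"

emb_article, emb_word = model.encode([article, word], convert_to_tensor=True)
print(util.cos_sim(emb_article, emb_word))
```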
Based on your further comments that explain your problem a bit more, it sounds like you're actually trying to do topic modelling (categorizing documents by a given word is equivalent to labelling them with that topic). If this is what you're doing, I would recommend looking into LDA and variants of it (e.g. GuidedLDA).
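For reference, a minimal gensim LDA sketch; the toy documents are placeholders, and num_topics is something you would have to tune for your corpus.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [
    ["school", "teacher", "education", "exam"],
    ["election", "government", "policy", "vote"],
    ["education", "university", "student", "degree"],
]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=50, random_state=0)

for topic_id, words in lda.print_topics(num_words=5):
    print(topic_id, words)
```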
My question is inspired by the following Kaggle competition: https://www.kaggle.com/c/leaf-classification
I have a set of leaves which I would like to classify by how they look. I managed to do the classification part using Random Forests and K-means. However, I am more interested in the pre-processing part, so that I can replicate this analysis with my own set of pictures.
The characteristics that describe each leaf are given by:
id - an anonymous id unique to an image
margin_1, margin_2, margin_3, ... margin_64 - each of the 64 attribute vectors for the margin feature
shape_1, shape_2, shape_3, ..., shape_64 - each of the 64 attribute vectors for the shape feature
texture_1, texture_2, texture_3, ..., texture_64 - each of the 64 attribute vectors for the texture feature
So, focusing on the question: I would like to get these characteristics from a raw picture. I have tried with the jpeg R package, but I haven't succeeded. I am not showing any code I've tried, as this is a rather more theoretical question about how to tackle the issue; there is no need for code.
I would really appreciate any advice on how to proceed to get the best descriptors of each image.
The problem is more about which kinds of features you can extract based on the margin, shape and texture of the images you have (leaves). It depends: some plants can easily be identified by shape alone, while others need more features such as texture because they have similar shapes, so this is still an open area of research. A number of features have been proposed for plant species identification, aiming at performance, efficiency or usability, and good features must be invariant to scale, translation and rotation. Please refer to the link below for a state-of-the-art review of the feature extraction techniques that have been used in plant species identification:
https://www.researchgate.net/publication/312147459_Plant_Species_Identification_Using_Computer_Vision_Techniques_A_Systematic_Literature_Review
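To make the idea concrete, here is a rough Python (scikit-image) sketch of shape-, margin- and texture-style descriptors from a raw leaf photo. It is only an illustration of the kind of features involved, not the pipeline used for the Kaggle dataset, and it assumes the leaf is darker than its background.

```python
import numpy as np
from skimage import io, color, filters, measure, feature

def leaf_descriptors(path, n_bins=64):
    img = io.imread(path)
    gray = color.rgb2gray(img)
    mask = gray < filters.threshold_otsu(gray)       # leaf = dark region (assumption)

    # shape: centroid-to-contour distance signature, resampled to n_bins
    contour = max(measure.find_contours(mask.astype(float), 0.5), key=len)
    centroid = contour.mean(axis=0)
    dist = np.linalg.norm(contour - centroid, axis=1)
    shape_sig = np.interp(np.linspace(0, 1, n_bins),
                          np.linspace(0, 1, len(dist)), dist)
    shape_sig /= shape_sig.max()                      # rough scale invariance

    # margin: fine-scale variation of the contour signature
    margin_sig = np.abs(np.diff(shape_sig, append=shape_sig[0]))

    # texture: histogram of local binary patterns inside the leaf
    lbp = feature.local_binary_pattern((gray * 255).astype(np.uint8),
                                       P=8, R=1, method="uniform")
    texture_hist, _ = np.histogram(lbp[mask], bins=n_bins, density=True)

    return np.concatenate([margin_sig, shape_sig, texture_hist])

# features = leaf_descriptors("my_leaf.jpg")   # 3 * 64 = 192 values
```

The descriptors in the paper linked above are generally more carefully designed (and more invariant) than this, but the sketch shows roughly where margin, shape and texture vectors can come from.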
I am confused about how to use HTK for language modeling.
I followed the tutorial example from the Voxforge site
http://www.voxforge.org/home/dev/acousticmodels/linux/create/htkjulius/tutorial
After training and testing I got around 78% accuracy. I did this for my native language. Now I have to use HTK for language modeling.
Is there any tutorial available for doing the same? Please help me.
Thanks
If I understand your question correctly, you are trying to change from a "grammar" to an "n-gram language model" approach. These two methods are alternative ways of specifying what combinations of words are permissible in the responses that a recognizer will return. Having followed the Voxforge process you will probably have a grammar in place.
A language model comes from the analysis of a corpus of text which defines the probabilities of words appearing together. The text corpus used can be very specialized. There are a number of analysis tools such as SRILM (http://www.speech.sri.com/projects/srilm/) and MITLM (https://github.com/mitlm/mitlm) which will read a corpus and produce a model.
Since you are using words from your native language you will need a unique corpus of text to analyze. One way to get a test corpus would be to artificially generate a number of sentences from your existing grammar and use that as the corpus. Then with the new language model in place, you just point the recognizer at it instead of the grammar and hope for the best.
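As a toy illustration of generating such an artificial corpus (and of the raw bigram statistics that an n-gram toolkit like SRILM or MITLM would then turn into a smoothed ARPA model), here is a sketch with a made-up three-slot grammar:

```python
import itertools
from collections import Counter

# hypothetical tiny grammar: each slot lists its alternatives
grammar = [["call", "dial"], ["john", "mary", "the office"], ["now", "please"]]

sentences = [" ".join(words) for words in itertools.product(*grammar)]

# write the artificial corpus for an external LM toolkit ...
with open("corpus.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sentences))

# ... or peek at the raw bigram statistics yourself
bigrams = Counter()
for s in sentences:
    toks = ["<s>"] + s.split() + ["</s>"]
    bigrams.update(zip(toks, toks[1:]))
print(bigrams.most_common(5))
```

The real toolkits add discounting/smoothing and produce the ARPA-format file the recognizer expects, so use one of them rather than raw counts for the actual model.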
I have a problem related to networks.
For one document, I am extracting some information and drawing nice graphs for it. But in a document, information flows: I am trying to depict in a graph the way one reads a text, with the most important entity first and then the next most important one.
To understand and grasp this problem, what kinds of things do I have to study, or which aspect of network theory or graph theory deals with it?
If anyone can kindly point me to some references, I would appreciate it.
Regards,
SK.
First of all, I'm not an expert in linguistics or the study of languages. I think I understand what you're trying to do, but I don't know the best way to do it.
If I got it right, you want to determine some centrality measure for your words (that would explain the social network reference), to find those that are the most linked to others, is that it?
The problem if you try that is that you will certainly find that the most central words are the least interesting ones (the, if, then, some redundant adjectives...), unless you apply a tokenization and lemmatization procedure beforehand. Thus you could keep only nouns and the stems of the verbs used, and only then try your approach.
Another problem that you must keep in mind is that words are important both through their presence and through their rarity (see the tf-idf weighting measure, for instance).
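As a small illustration of that approach, here is a networkx sketch that filters stop words, links words co-occurring in the same sentence, and ranks them by PageRank centrality; the stop-word list and tokenization are deliberately crude.

```python
import itertools
import networkx as nx

stop_words = {"the", "a", "an", "if", "then", "and", "or", "is", "was", "of", "to", "in"}

sentences = [
    "the quick brown fox jumps over the lazy dog",
    "the lazy dog sleeps in the sun",
    "the fox hunts in the forest",
]

G = nx.Graph()
for s in sentences:
    tokens = [w for w in s.lower().split() if w not in stop_words]
    for u, v in itertools.combinations(set(tokens), 2):
        w = G.get_edge_data(u, v, default={"weight": 0})["weight"]
        G.add_edge(u, v, weight=w + 1)

centrality = nx.pagerank(G, weight="weight")
for word, score in sorted(centrality.items(), key=lambda kv: -kv[1])[:5]:
    print(word, round(score, 3))
```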
To conclude, I did the following search on Google:
"n gram graph language centrality word"
and found this paper, which seems interesting for what you're asking (I might give it a look myself!):
LexRank: Graph-based Lexical Centrality as Salience in Text Summarization