So I am trying to classify documents based on their text with Naive Bayes. Each document may belong to 1 to n categories (think of them as tags on a blog post).
My current approach is to provide R with a CSV that looks like this:
+-------------------------+---------+-------+-------+
| TEXT TO CLASSIFY | Tag 1 | Tag 2 | Tag 3 |
+-------------------------+---------+-------+-------+
| Some text goes here | Yes | No | No |
+-------------------------+---------+-------+-------+
| Some other text here | No | Yes | Yes |
+-------------------------+---------+-------+-------+
| More text goes here | Yes | No | Yes |
+-------------------------+---------+-------+-------+
Of course, the desired behaviour is to take an input like
Some new text to classify
And an output like
+------+------+-------+
| Tag 1| Tag 2| Tag 3 |
+------+------+-------+
| 0.12 | 0.75 | 0.65 |
+------+------+-------+
Then, based on a certain threshold, I would determine whether or not the given text belongs to tags 1, 2, and 3.
Now the question is: in the tutorials I have found, it looks like the input should be more like
+--------------------------+---------+
| TEXT TO CLASSIFY | Class |
+--------------------------+---------+
| Some other text here | No |
+--------------------------+---------+
| Some other text here | Yes |
+--------------------------+---------+
| Some other text here | Yes |
+--------------------------+---------+
That is, a ROW per text per class. With that, yes, I can train Naive Bayes and then use one-vs-all to determine which texts belong to which tags. The question is: can I do this in a more elegant way (that is, with the training data looking like the first example I mentioned)?
One of the examples I found is http://blog.thedigitalgroup.com/rajendras/2015/05/28/supervised-learning-for-text-classification/
There are conceptually two approaches.
1. You combine the tags into one combined tag. You would then model the joint probability. The main drawback is the combinatorial explosion, which also implies that you need much more training data.
2. You build an individual NB model for each tag.
As always in probabilistic modelling, the question is whether you assume that your tags are independent or not. In the spirit of Naive Bayes, the independence assumption would be very natural; in that case, approach 2 would be the way to go. If the independence assumption is not justified and you are afraid of the combinatorial explosion, you can use a standard Bayesian network. If you maintain certain assumptions, your performance will not be affected.
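For approach 2, here is a minimal sketch in R, assuming the tm and e1071 packages and the column layout from the question's first table. Note that e1071's naiveBayes would treat raw numeric counts as Gaussian, so the sketch converts the document-term matrix to presence/absence factors, giving a Bernoulli-style NB:

```r
library(tm)
library(e1071)

# Toy training data mirroring the question's first table
docs <- c("Some text goes here", "Some other text here", "More text goes here")
tags <- data.frame(Tag1 = c("Yes", "No", "Yes"),
                   Tag2 = c("No", "Yes", "No"),
                   Tag3 = c("No", "Yes", "Yes"))

dtm <- as.matrix(DocumentTermMatrix(Corpus(VectorSource(docs))))

# Presence/absence factors (Bernoulli-style features) for each term
as_features <- function(m) {
  as.data.frame(lapply(as.data.frame(m > 0), factor, levels = c(FALSE, TRUE)))
}
train <- as_features(dtm)

# One binary model per tag, with Laplace smoothing for unseen terms
models <- lapply(tags, function(y) naiveBayes(train, factor(y), laplace = 1))

# Represent a new text over the training vocabulary and collect the
# per-tag "Yes" posteriors, i.e. the desired output row
new_dtm <- as.matrix(DocumentTermMatrix(
  Corpus(VectorSource("Some new text to classify")),
  control = list(dictionary = colnames(dtm))))
sapply(models, function(m) predict(m, as_features(new_dtm), type = "raw")[, "Yes"])
```

Thresholding the resulting vector then gives the tag memberships, exactly as described in the question.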
However, you could also take a mixed approach.
You could use a hierarchical Naive Bayes model. If there is some logical structure in the tags, you can introduce a parent variable for the classes; basically, you have a value tag1/tag2 if both tags occur together.
The basic idea can be extended with a latent variable you do not observe, which can be trained using an EM scheme. This will slightly affect your training performance, as you need to run the training for multiple iterations, but it will probably give you the best results.
http://link.springer.com/article/10.1007%2Fs10994-006-6136-2#/page-1
Related
I am still working on a dataset of customer satisfaction and customers' online paths/journeys (here as mini-sequences/bi-grams). The customers were classified according to their usability satisfaction (4 classes).
Here are some example lines of my current data frame:
| bi-gram | satisfaction_class |
|:---- |:----:|
| "openPage1", "writeText" | 1 |
| "writeText", "writeText" | 2 |
| "openPage3", "writeText" | 4 |
| "writeText", "openPage1" | 3 |
...
Now I would like to know which bi-grams are significant/robust predictors for certain classes. Is that possible with only bi-grams, or do I need the whole customer path?
I have read that one can use TF-IDF or chi-square but I could not find the perfect code. :(
Thank you so much!
Marius
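A minimal sketch of the chi-square route, assuming a data frame df with the two columns shown above (names and values are illustrative; with real data each bi-gram would occur many times per class):

```r
# Toy data mirroring the table above
df <- data.frame(
  bigram = c("openPage1_writeText", "writeText_writeText",
             "openPage3_writeText", "writeText_openPage1",
             "openPage1_writeText", "writeText_writeText"),
  satisfaction_class = factor(c(1, 2, 4, 3, 1, 2))
)

# Contingency table: how often each bi-gram occurs in each class
tab <- table(df$bigram, df$satisfaction_class)
chisq.test(tab)  # overall association (warns on small expected counts)

# Per-bi-gram test: presence/absence of one bi-gram vs. the class
for (bg in rownames(tab)) {
  print(chisq.test(table(df$bigram == bg, df$satisfaction_class)))
}
```

A bi-gram with a small p-value (and enough observations) is a candidate predictor; whether bi-grams suffice or the whole path is needed is an empirical question the test alone cannot settle.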
I have generated a simple decision tree with rpart and displayed it with rpart.plot, like the following.
Is it possible to edit the look of the tree so it's "mirrored" like the following:
(e-100%)
____(yes)___|___(no)____
| |
| (e-53%)
(p-47%) __|__
| |
(p-1%) (e-52%)
Adding the parameter xflip=TRUE to the rpart.plot function flips the tree horizontally, as intended.
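For example, a minimal sketch using the kyphosis data that ships with rpart (xflip is passed through rpart.plot to prp):

```r
library(rpart)
library(rpart.plot)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
rpart.plot(fit)                # default orientation
rpart.plot(fit, xflip = TRUE)  # mirrored horizontally, as in the diagram above
```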
I have a data set which consists of a number of elements divided into two distinct categories (with an equal number of elements in each category) and described by two continuous variables, like so:
ID | Category | Variable_1 | Variable_2
--------------------------------------------
1 | Triangle | 4.3522 | 5.2321
2 | Triangle | 3.6423 | 6.3223
3 | Circle | 5.2331 | 3.2452
4 | Circle | 2.6334 | 7.3443
... | ... | ... | ...
Now, what I want to do is create a list of one-to-one pairings so that every element of category Triangle is paired with an element of category Circle, and so that the combined distance between the points within each pairing, in the 2D space defined by Variable_1 and Variable_2, is as small as possible. In other words, if I had to travel from each Triangle element to a Circle element (but never to the same Circle element twice), I want to find out how to minimize the total travelling distance.
Since I'm not really in the mood to brute-force this problem, I've been thinking that simulated annealing would probably be a suitable optimisation method to use. I'd also like to work in R.
The good news is that I've found a couple of tools for doing simulated annealing within R, for example the GenSA package and the optim function (with method = "SANN"). The bad news is that I don't really know how to use them for my specific input needs. That is, as input I would like to specify a list of integers denoting the elements of one category and the order in which they should be paired with the elements of the other category. However, this means that my simulated annealing algorithm should only ever work with integers, and that the same integer should never appear twice, which seems to go against how the tools above are implemented.
Is there some way I can make effective use of a pre-written simulated annealing package for R, or do I need to write my own methods for this problem?
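One way to stay within base R is a minimal sketch like the following: optim's method = "SANN" accepts a custom proposal function via the gr argument, so every candidate can stay a valid permutation (the data here is made up for illustration):

```r
set.seed(42)
n <- 10
triangles <- matrix(runif(2 * n), ncol = 2)  # Variable_1, Variable_2
circles   <- matrix(runif(2 * n), ncol = 2)

# Objective: total distance when triangle i is paired with circle perm[i]
total_dist <- function(perm) {
  sum(sqrt(rowSums((triangles - circles[perm, , drop = FALSE])^2)))
}

# Proposal: swap two randomly chosen positions, so the candidate is
# always a permutation (no circle used twice)
swap_two <- function(perm, ...) {
  idx <- sample(length(perm), 2)
  perm[idx] <- perm[rev(idx)]
  perm
}

res <- optim(par = sample(n), fn = total_dist, gr = swap_two,
             method = "SANN", control = list(maxit = 20000, temp = 10))
res$par    # circle paired with each triangle
res$value  # total travelling distance
```

That said, this particular problem is a linear assignment problem, which clue::solve_LSAP solves exactly (via the Hungarian algorithm) without any annealing at all.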
This question follows on from a previous one I asked this week.
Generally, my problem goes as follows:
I have a data stream of records entering R via a socket, and I want to do some analyses.
They come sequentially like this:
individual 1 | 1 | 2 | timestamp 1
individual 2 | 4 | 10 | timestamp 2
individual 1 | 2 | 4 | timestamp 3
I need to create a structure to maintain those records. The main idea is discussed in the previous question but generally I've created a structure that looks like:
*var1* *var2* *timestamp*
- individual 1 | [1,2,3] | [2,4,6] | [timestamp1, timestamp3...]
- individual 2 | [4,7,8] | [10,11,12] | [timestamp2, ...]
IMPORTANT: this structure is created and enlarged at runtime. I think this is not the best choice, as it takes too long to create. The main structure is a matrix, and inside each individual/variable pair I have lists of records.
The individuals are great in number and vary a lot over time, so without going through some records I don't have enough information to make a good analysis. I am thinking about creating some kind of cache at runtime in R by saving individuals' records to disk.
My full database is approximately 100 GB. I want to analyse it mainly in seasonal blocks within each individual (based on the timestamp variable).
Creating my structure takes longer and longer as the number of records I collect grows.
The idea of using a matrix of data with lists inside each individual/variable pair was adapted from a three-dimensional matrix, because I don't have observations at the same timestamps. I don't know if it was a good idea.
If anyone has any idea on this matter I would appreciate it.
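One pattern worth trying, sketched minimally here: buffer incoming records in a pre-allocated list (O(1) per record) and consolidate periodically with data.table, instead of enlarging a matrix on every record (names are illustrative):

```r
library(data.table)

buffer <- vector("list", 1e5)  # pre-allocated slots
n <- 0L

# Called for each record arriving from the socket
on_record <- function(id, var1, var2, timestamp) {
  n <<- n + 1L
  buffer[[n]] <<- list(id = id, var1 = var1, var2 = var2, timestamp = timestamp)
}

on_record("individual 1", 1, 2, "timestamp 1")
on_record("individual 2", 4, 10, "timestamp 2")
on_record("individual 1", 2, 4, "timestamp 3")

# Bind once (or every k records) instead of per record, then index
records <- rbindlist(buffer[seq_len(n)])
setkey(records, id, timestamp)    # fast per-individual, time-ordered slices
records[id == "individual 1"]     # all records for one individual
```

Completed seasonal blocks could then be written to disk per individual (e.g. with saveRDS), giving the kind of cache mentioned above.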
I have tabulated data, and I have to write some code to dynamically generate .pdf reports. Once I know how to make R read and publish only one row at a time, I will use Sweave to format it and make it look nice.
For example, if my data set looks like this:
Name    | Sport    | Country
Ronaldo | Football | Portugal
Federer | Tennis   | Switzerland
Woods   | Golf     | USA
My output would be composed of three .pdf files. The first one would say "Ronaldo plays football for Portugal", and so on for the other two rows.
I have started with a for-loop, but every forum I have trawled through talks about the advantages of the apply family of functions over it, and I don't know how to apply a function to every row of the data.
PS: This is my first post on stackoverflow.com. Excuse me if I am not following the community rules here. I will try my best to ensure that the question conforms to the guidelines based on your feedback.
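A minimal sketch of the row-wise apply (MARGIN = 1), with a data frame mirroring the example table; the .Rnw template name at the end is made up:

```r
df <- data.frame(Name = c("Ronaldo", "Federer", "Woods"),
                 Sport = c("Football", "Tennis", "Golf"),
                 Country = c("Portugal", "Switzerland", "USA"),
                 stringsAsFactors = FALSE)

# apply() with MARGIN = 1 hands each row to the function as a character vector
sentences <- apply(df, 1, function(row) {
  sprintf("%s plays %s for %s.", row["Name"], tolower(row["Sport"]), row["Country"])
})
sentences[1]  # "Ronaldo plays football for Portugal."

# For the PDFs, loop over the rows and render each sentence through a
# template, e.g. Sweave("report_template.Rnw") then tools::texi2pdf(),
# producing one output file per row.
```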