I have a data set consisting of a number of elements, divided into two distinct categories (with an equal number of elements in each category) and described by two continuous variables, like so:
ID | Category | Variable_1 | Variable_2
--------------------------------------------
1 | Triangle | 4.3522 | 5.2321
2 | Triangle | 3.6423 | 6.3223
3 | Circle | 5.2331 | 3.2452
4 | Circle | 2.6334 | 7.3443
... | ... | ... | ...
Now, what I want to do is create a list of one-to-one pairings so that every element of category Triangle is paired with an element of category Circle, and so that the combined distance between the points within each pairing, in the 2D space defined by Variable_1 and Variable_2, is as small as possible. In other words, if I had to travel from each Triangle element to a Circle element (but never to the same Circle element twice), I want to find out how to minimize the total traveling distance (see illustration below).
Since I'm not really in the mood to brute-force this problem, I've been thinking that simulated annealing would probably be a suitable optimisation method to use. I'd also like to work in R.
The good news is that I've found a couple of ways to do simulated annealing in R, for example the GenSA package and the base function optim (method "SANN"). The bad news is that I don't really know how to adapt them to my specific input needs. That is, as input I would like to specify a vector of integers denoting the elements of one category and the order in which they should be paired with the elements of the other category. This means that my simulated annealing algorithm should only ever work with integers, and that the same integer should never appear twice, which seems to go against how the tools above are implemented.
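To make that concrete, here is a rough, untested sketch of the cost function and the kind of swap move I have in mind (assuming my data sits in a data frame df with the columns shown above). From what I can tell, optim(method = "SANN") can accept a custom move function through its gr argument, but I'm not sure whether that is the intended way to handle the never-repeat-an-integer constraint, and I don't see an equivalent in GenSA:

```r
# State: a permutation 'perm' of 1..n, meaning Triangle i is paired with Circle perm[i].
triangles <- subset(df, Category == "Triangle")   # df and its column names are assumptions
circles   <- subset(df, Category == "Circle")

# Objective: total Euclidean distance over all pairings.
total_distance <- function(perm) {
  sum(sqrt((triangles$Variable_1 - circles$Variable_1[perm])^2 +
           (triangles$Variable_2 - circles$Variable_2[perm])^2))
}

# Proposal move: swap two positions, so the state always stays a valid permutation.
swap_move <- function(perm, ...) {
  i <- sample(length(perm), 2)
  perm[rev(i)] <- perm[i]
  perm
}

# Tentative: hand the move function to optim() via 'gr' (only used by method = "SANN").
res <- optim(par = sample(nrow(circles)), fn = total_distance, gr = swap_move,
             method = "SANN", control = list(maxit = 20000, temp = 10))
```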
Is there some way that I could make effective use of some pre-written Simulated annealing package for R, or do I need to write my own methods for this problem?
I am still working on a dataset of customer satisfaction and the customers' online paths/journeys (here as mini-sequences/bi-grams). The customers were classified according to their usability satisfaction (4 classes).
Here are some example rows of my current data frame:
| bi-gram | satisfaction_class |
|:---- |:----:|
| "openPage1", "writeText" | 1 |
| "writeText", "writeText" | 2 |
| "openPage3", "writeText" | 4 |
| "writeText", "openPage1" | 3 |
...
Now I would like to know which bi-gram is a significant/robust predictor for certain classes. Is it possible with only bigrams or do I need the whole customer path?
I have read that one can use TF-IDF or chi-square but I could not find the perfect code. :(
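For example, is something along these lines the right direction for the chi-square idea? (A minimal sketch; df and the column names bigram / satisfaction_class are just my assumptions about the data frame above.)

```r
# Sketch: test whether one particular bi-gram is associated with the satisfaction class.
df$has_target <- df$bigram == '"openPage1", "writeText"'   # presence of one chosen bi-gram

# Contingency table: chosen bi-gram (TRUE/FALSE) vs. satisfaction class (1-4)
tab <- table(df$has_target, df$satisfaction_class)

# Chi-square test of independence; a small p-value suggests the bi-gram's frequency
# differs between the classes. Repeating this for every distinct bi-gram (with a
# multiple-testing correction) would give a ranking of "robust" bi-grams.
chisq.test(tab)
```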
Thank you so much!
Marius
TL;DR
I want to produce multiple grouped datasets from a single table without hurting performance,
either by optimizing the QSortFilterProxyModel or by iterating over the table's data model myself (whichever performs better).
For example, the following main table:
+------+------+---------+
| Col1 | Col2 | Results |
+------+------+---------+
| a | b | 2 |
| a | c | 4 |
| v | b | 5 |
+------+------+---------+
can output multiple aggregated datasets by specifying some grouping conditions,
for example:
Condition: group and sum the "a" entries
Dataset result => a = 6
Condition: group and sum the "ab" entries
Dataset result => a = 2
Condition: group by Col1 and count the entries
Dataset result => a = 2, v = 1
Each result dataset will be displayed in its own table view.
I succeeded in achieving this by implementing a separate QSortFilterProxyModel for each group condition.
(I had to subclass QSortFilterProxyModel, set the group condition, and override the filterAcceptsRow function.)
But the issue is performance: on a large dataset, with multiple proxies,
the Qt proxy models will iterate (via filterAcceptsRow) over the whole table model X times, which slows everything down.
I want to create multiple datasets by iterating the model only once.
Is it possible to implement it by using the proxy model?
Or maybe I need to iterate the main table model myself and generate these custom models?
Note:
In my opinion it looks impossible to implement with QSortFilterProxyModel because of the model indexing:
if I store multiple datasets, each one can have a different rowCount() and the model indexing will be broken.
First of all, all you need to create a custom proxy is QAbstractItemModel. You don't need to derive from the filtering proxy classes at all. How you implement such a model is up to you, but don't be confined by thinking that the proxy needs to be anything other than an implementation of the abstract model. The proxy classes are for your convenience: when it is not convenient to use them, don't!
Furthermore, the viable approaches differ a bit depending on what sort of output arities you have. If each filter produces only one row of results, then having just one proxy generate all of them is fine; but are you viewing each single-row result in its own table view? Perhaps your UI demands that. If the groupings can produce multi-row data (e.g. group on Col1, output sum(Results)), then you'd need an individual view into each of the result sets.
Then I'd create a common proxy that interfaces to the data source, but that proxy isn't used directly. In fact, this proxy is just a QObject and doesn't derive from QAbstractItemModel at all. Instead, it would create QAbstractItemModel instances, as views into the data. They would forward requests to the common proxy that has all the data necessary to fulfill the request under any condition.
So I am trying to classify documents based on their text with Naive Bayes. Each document might belong to 1 to n categories (think of them as tags on a blog post).
My current approach is to provide R with a CSV looking like this:
+-------------------------+---------+-------+-------+
| TEXT TO CLASSIFY | Tag 1 | Tag 2 | Tag 3 |
+-------------------------+---------+-------+-------+
| Some text goes here | Yes | No | No |
+-------------------------+---------+-------+-------+
| Some other text here | No | Yes | Yes |
+-------------------------+---------+-------+-------+
| More text goes here | Yes | No | Yes |
+-------------------------+---------+-------+-------+
Of course the desired behaviour is to have an input looking like
Some new text to classify
And an output like
+------+------+-------+
| Tag 1| Tag 2| Tag 3 |
+------+------+-------+
| 0.12 | 0.75 | 0.65 |
+------+------+-------+
And then based on a certain threshold, determine whether or not the given text belongs to tags 1, 2, 3.
Now the question is, in the tutorials I have found, it looks like the input should be more like
+--------------------------+---------+
| TEXT TO CLASSIFY | Class |
+--------------------------+---------+
| Some other text here | No |
+--------------------------+---------+
| Some other text here | Yes |
+--------------------------+---------+
| Some other text here | Yes |
+--------------------------+---------+
That is, a ROW per text per class... Using that, yes, I can train Naive Bayes and then use one-vs-all in order to determine which texts belong to which tags. The question is, can I do this in a more elegant way (that is, with the training data looking like the first example I mentioned)?
One of the examples I found is http://blog.thedigitalgroup.com/rajendras/2015/05/28/supervised-learning-for-text-classification/
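For reference, the one-vs-all setup I have in mind looks roughly like this (a sketch only: it assumes the e1071 package, that the text has already been turned into a feature data frame features, and that the Yes/No columns sit in a data frame tags):

```r
library(e1071)

# One binary Naive Bayes model per tag column (one-vs-all).
tag_models <- lapply(tags, function(tag) naiveBayes(features, as.factor(tag)))

# Score a new text against every tag model and keep the probability of "Yes".
new_features <- features[1, , drop = FALSE]    # placeholder for the new text's features
probs <- sapply(tag_models, function(m) predict(m, new_features, type = "raw")[, "Yes"])

probs          # per-tag probabilities, as in the desired output table above
probs > 0.5    # then apply whatever threshold per tag
```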
There are conceptually two approaches.
1. You combine the tags into a single combined tag. Then you would get the joint probability. The main drawback is the combinatorial explosion, which implies that you also need much more training data.
2. You build an individual NB model for each tag.
As always in probabilistic modelling, the question is whether you assume that your tags are independent or not. In the spirit of Naive Bayes, the independence assumption would be very natural. In that case, approach 2 would be the way to go. If the independence assumption is not justified and you are afraid of the combinatorial explosion, you can use a standard Bayesian network. If you keep certain assumptions, your performance will not be impacted.
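For concreteness, a minimal sketch of approach 1 (the combined tag), assuming the e1071 package and that the Yes/No tag columns from the question are in a data frame tags and the text features in features:

```r
library(e1071)

# Approach 1: collapse the tag columns into one combined label and fit a single model.
combined <- interaction(tags, drop = TRUE)     # labels like "Yes.No.No", "No.Yes.Yes", ...

joint_model <- naiveBayes(features, combined)  # predicts the joint tag combination

# Caveat: with k binary tags there are up to 2^k combined labels, and each label
# needs enough training examples of its own (the combinatorial explosion above).
```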
However, you could also take a mixed approach.
You could use a hierarchical Naive Bayes model. If there is some logical structure in the tags, you can introduce a parent variable for the classes. Basically, you have a value tag1/tag2 if both tags occur together.
The basic idea can be extended towards a latent variable you do not observe. This can be trained using an EM scheme. It will slightly impact your training performance, since you need to run the training for multiple iterations; however, it will probably give you the best results.
http://link.springer.com/article/10.1007%2Fs10994-006-6136-2#/page-1
I have a set of data in the following format:
Resp | Q1 | Q2
P1 | 4 | 5
P2 | 1 | 2
P3 | 4 | 3
P4 | 6 | 4
I'd like to show the count and % of people who gave an answer greater than 3. So in this case, the output would be:
Question | Count | Percent
Q1 | 3 | 75%
Q2 | 2 | 50%
Any suggestions?
Although it sounds like a fairly easy thing, it is a bit more complicated.
Firstly, your data is not row-based, so you will have to pivot it.
Load your data into Tableau.
In the Data Source screen, select the columns Q1 and Q2, right-click on them and choose "Pivot".
Name the column with the answers "Answer" (just for clarity).
You should get a table that looks like this:
Now you need to create a calculated field (I called it Overthreshold) to check for your condition:
if [Answer] > 3 then
[Answer]
End
At this point you could substitute the 3 with a parameter in case you want to easily change that condition.
You can already drop the pills as follows to get the count:
Now if you want the percentage it gets a bit more complicated, since you have to determine the total count of answers per question and the count of the answers > 3, which is information stored in two different columns.
Create another calculated field with this calculation: COUNT([Overthreshold]) / AVG({FIXED [Question]: COUNT([Answer])})
Drop the created pill onto the Text mark or onto the Columns shelf to see the percentage values.
Right-click on the field and choose Default Properties / Number Format to show it as a percentage rather than a float.
To explain what the formula does:
It takes the count of the answers that are over the threshold and divides it by the count of answers for each question. The latter is done by the FIXED part of the formula, which counts the rows that have the same value in the Question column. The AVG is only there because Tableau needs an aggregation at that point. Since the value will be the same for every record of a question, you could also use MIN or MAX.
It feels like there should be an easier solution, but right now I cannot think of one.
Here is a variation on Alexander's correct answer. Some folks might find it slightly simpler, and it at least shows some of the Tableau features for calculating percentages.
Starting as in Alexander's answer, revise Overthreshold into a boolean-valued field, defined as [Answer] > 3.
Instead of creating a second calculated field for the percentage, drag Question, Overthreshold and SUM(Number Of Records) onto the viz as shown below.
Right click on SUM(Number of Records) and choose Quick Table Calculation->Percentage of Total
Double click Number of Records in the data pane on the left to add it to the sheet, which is a shortcut for bringing out the Measure Names and Measure Values meta-fields. Move Measure Names from Rows to Columns to get the view below, which also uses aliases on Measure Names to shorten the column titles.
If you don't want to show the below threshold data, simply right click on the column header False and choose Hide. (You can unhide it if needed by right clicking on the Overthreshold field)
Finally, to pretty it up a bit, you can move Overthreshold to the detail shelf (you can't remove it from the view though), and adjust the number formatting for the fields being displayed to get your result.
Technically, Alexander's solution uses LOD calculations to compute the percentages on the server side, while this solution uses table calculations to compute the percentage on the client side. Both are useful, and can have different performance impacts. This just barely scratches the surface of what you can do with each approach; each has power and complexity that you need to start to understand to use in more complex situations.
This question comes in the sequence of a previous one I asked this week.
But generally my problem goes as follows:
I have a data stream of records entering R via a socket, and I want to do some analyses on them.
They come sequentially like this:
individual 1 | 1 | 2 | timestamp 1
individual 2 | 4 | 10 | timestamp 2
individual 1 | 2 | 4 | timestamp 3
I need to create a structure to maintain those records. The main idea is discussed in the previous question but generally I've created a structure that looks like:
*individual* | *var1* | *var2* | *timestamp*
individual 1 | [1,2,3] | [2,4,6] | [timestamp1, timestamp3, ...]
individual 2 | [4,7,8] | [10,11,12] | [timestamp2, ...]
IMPORTANT: this structure is created and enlarged at runtime. I think this is not the best choice, as it takes too long to create. The main structure is a matrix, and inside each individual/variable pair I keep lists of records.
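To make this concrete, here is a simplified sketch of roughly how the structure grows (the IDs and values are made up, and the set of individuals is fixed here while in reality it also grows at runtime):

```r
# A matrix whose cells hold growing vectors, indexed by individual (rows)
# and variable (columns).
individuals <- c("individual 1", "individual 2")
store <- matrix(vector("list", length(individuals) * 3),
                nrow = length(individuals),
                dimnames = list(individuals, c("var1", "var2", "timestamp")))

# Append one incoming record to the structure.
append_record <- function(store, id, v1, v2, ts) {
  store[[id, "var1"]]      <- c(store[[id, "var1"]], v1)
  store[[id, "var2"]]      <- c(store[[id, "var2"]], v2)
  store[[id, "timestamp"]] <- c(store[[id, "timestamp"]], ts)
  store   # growing objects like this copies them, which is part of why it gets slow
}

store <- append_record(store, "individual 1", 1, 2, "timestamp 1")
store <- append_record(store, "individual 2", 4, 10, "timestamp 2")
store <- append_record(store, "individual 1", 2, 4, "timestamp 3")
```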
The individuals are great in number and vary a lot over time, so without going through some of the records I don't have enough information to make a good analysis. I'm thinking about creating some kind of cache at runtime in R by saving individuals' records to disk.
My full database is approximately 100 GB. I want to analyse it mainly in seasonal blocks within each individual (based on the timestamp variable).
Creating my structure takes longer and longer as the number of records I'm collecting grows.
The idea of using a matrix of data with lists inside each individual/variable pair was adapted from a three-dimensional matrix, because I don't have observations at the same timestamps. I don't know if it was a good idea.
If anyone has any idea on this matter I would appreciate it.