I have a rectangular planar grid, with each cell assigned an integer weight. I am looking for an algorithm to identify clusters of 3 to 6 adjacent cells with higher-than-average weight. These blobs should have an approximately circular shape.
For my case the average weight of cells not belonging to a cluster is around 6, and the average for cells inside a cluster is around 6+4, i.e. there is a "background weight" of roughly 6. The weights fluctuate with Poisson statistics.
For a small background, greedy or seeded algorithms perform pretty well, but this breaks down when the cluster cells have weights close to the fluctuations of the background, i.e. such algorithms tend to find a cluster even where there is none. Also, I cannot do a brute-force search looping through all possible configurations because my grid is large (something like 1000x1000) and I plan to do this very often (10^9 times). I have the impression there might be ways to tackle this with graph theory. I have heard of vertex covers and cliques, but am not sure how best to translate my problem into their language. I know that graph theory might have issues with the statistical nature of the input, but I would be interested to see what algorithms from that field could find, even if they cannot identify every cluster.
Here is an example clipping: the framed region has on average 10 entries per cell, all other cells have on average 6. Of course the grid extends further.
| 8| 8| 2| 8| 2| 3|
| 6| 4| 3| 6| 4| 4|
      ============
| 8| 3||13| 7|11|| 7|
|10| 4||10|12| 3|| 2|
| 5| 6||11| 6| 8||12|
      ============
| 9| 4| 0| 2| 8| 7|
For graph-theory approaches there are only a couple of sentences on Wikipedia, so you are probably best off posting on MathOverflow. This question might also be useful.
The traditional method in computing for solving these problems (and, given its ubiquity, probably the best) is raster analysis, well known in the world of GIS and remote sensing, so there are a number of tools that provide implementations. Keywords to use to find the one most suited to your problem are raster, nearest neighbor, resampling, and clustering. The GDAL library is often the basis for other tools.
E.g. http://clusterville.org/spatialtools/index.html
You could try checking out the GDAL library and source code to see if you can use it in your situation, or to see how it is implemented.
For checking for circular shapes you could convert the remaining values to polygons and inspect the resulting features:
http://www.gdal.org/gdal_polygonize.html
I'm not sure I see a graph theory analogy, but you can speed things up by pre-computing an area integral. This feels like a multi-scale thing.
A[i,j] = Sum[p[u,v], {u, 0, i}, {v, 0, j}];
Then the average brightness of the rectangular region with corners (a,b) and (c,d) is
(A[c,d] - (A[c,b] + A[a,d]) + A[a,b])/((c-a)(d-b))
Overflow is probably not your friend if you have big numbers in your cells.
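A minimal Python/NumPy sketch of this, in case it helps (it follows the same exclusive top-left-corner convention as the formula above; all names are placeholders):

import numpy as np

def area_table(grid):
    # Summed-area table: A[i, j] = sum of grid[0..i, 0..j] inclusive.
    # Accumulate in a wide dtype so large grids don't overflow.
    return np.cumsum(np.cumsum(grid.astype(np.int64), axis=0), axis=1)

def region_mean(A, a, b, c, d):
    # Mean of grid[a+1..c, b+1..d]; (a, b) is an exclusive corner, a < c, b < d.
    total = A[c, d] - A[a, d] - A[c, b] + A[a, b]
    return total / ((c - a) * (d - b))

# usage sketch on a Poisson background like the one described in the question
grid = np.random.poisson(6, size=(1000, 1000))
A = area_table(grid)
print(region_mean(A, 10, 10, 13, 13))   # mean of a 3x3 block

With the table in place, every candidate rectangle costs O(1) to evaluate, so scanning a few window sizes over the whole grid stays cheap.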
Use the union-find algorithm for clustering? It's very fast.
I guess the graph would result from considering each pair of neighboring high-valued cells as connected. Use the union-find algorithm to find all clusters, and accept all those above a certain size, perhaps with shape constraints too (e.g. based on average squared distance from the cluster center vs. cluster size). It's a trivial variation on the union-find algorithm to collect the statistics you would need for this as you go (count, sum of x, sum of x^2, sum of y, sum of y^2).
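A rough Python sketch of that idea (the threshold, the 4-neighbour adjacency and the shape statistic are all placeholder choices):

import numpy as np

def cluster_stats(grid, threshold):
    # Union-find over cells above `threshold`, using 4-neighbour adjacency.
    h, w = grid.shape
    parent = {}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    cells = [(i, j) for i in range(h) for j in range(w) if grid[i, j] > threshold]
    for c in cells:
        parent[c] = c
    for i, j in cells:
        for nb in ((i - 1, j), (i, j - 1)):          # only look "up" and "left"
            if nb in parent:
                union((i, j), nb)

    stats = {}   # root -> (count, sum_i, sum_j, sum_i^2, sum_j^2)
    for i, j in cells:
        r = find((i, j))
        n, si, sj, sii, sjj = stats.get(r, (0, 0, 0, 0, 0))
        stats[r] = (n + 1, si + i, sj + j, sii + i * i, sjj + j * j)

    clusters = []
    for n, si, sj, sii, sjj in stats.values():
        spread = (sii - si * si / n) + (sjj - sj * sj / n)   # sum of squared distances to centroid
        clusters.append({"size": n, "centroid": (si / n, sj / n), "mean_sq_dist": spread / n})
    return [c for c in clusters if 3 <= c["size"] <= 6]      # size window from the question

grid = np.random.poisson(6, size=(1000, 1000))
print(cluster_stats(grid, threshold=9)[:5])

A small mean squared distance relative to the cluster size is a cheap proxy for "roughly circular".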
If you are just looking for a way to translate your problem into a graph problem, here's what you could do.
From each cell, look at all of its neighbors (this could be the 8 adjacent cells or just the 4, depending on what you want). Find the neighbor with the maximum value; if it is larger than the cell itself, connect the cell to that neighbor, otherwise do nothing.
After this you will have a forest, or possibly a single tree (though I imagine that is unlikely). This is one possible translation of your matrix into a graph; I'm not sure it is the most useful one.
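For instance, a small Python sketch of that construction (8-neighbour adjacency assumed; ties go to whichever maximal neighbour is seen first):

import numpy as np

def steepest_ascent_forest(grid):
    # For each cell, add an edge to its largest strictly-larger neighbour.
    # Roots of the resulting forest are local maxima of the grid.
    h, w = grid.shape
    parent = {}                                   # (i, j) -> neighbour it points to
    for i in range(h):
        for j in range(w):
            best, best_val = None, grid[i, j]
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ni, nj = i + di, j + dj
                    if (di or dj) and 0 <= ni < h and 0 <= nj < w and grid[ni, nj] > best_val:
                        best, best_val = (ni, nj), grid[ni, nj]
            if best is not None:
                parent[(i, j)] = best
    return parent

grid = np.random.poisson(6, size=(50, 50))
edges = steepest_ascent_forest(grid)
print(len(edges), "edges; roots =", grid.size - len(edges))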
Good day!
I have a challenge/idea to improve the efficiency of one of the most widely used audio listening test formats – the “MUSHRA-like” multiple-stimulus grading task (all details about it are freely available in the ITU document: https://www.itu.int/rec/R-REC-BS.1534-3-201510-I/en).
[Picture of a Hulti-Gen MUSHRA interface example: the prompt “Rate the basic audio quality of each stimulus”, one vertical slider per stimulus set at roughly 50%, 75%, 100%, 35%, 5%, 50% and 35%, and a reference-sound button.]
Problem
MUSHRA test participants are presented with about 10 different sounds to compare with one another, in no particular order. While the interface is great for fine comparisons, it is terrible for the initial stage, when you just need to make sense of the presented alternatives and give them a reasonable initial grading relative to the references.
For example, a study comparing the sound of different headphones could have the following stimuli presented on a single page (randomized before presentation):
| Stimulus nr. | Ground-truth MUSHRA score we expect to measure |
| s1           | 95%                                            |
| s2           | 90% (top-quality hidden-reference anchor)      |
| s3           | 85%                                            |
| s4           | 85%                                            |
| s5           | 75%                                            |
| s6           | 60%                                            |
| s7           | 50% (main reference “benchmark to beat”)       |
| s8           | 40%                                            |
| s9           | 10% (low-quality hidden-reference anchor)      |
Idea
Before throwing people into the “listen and rate how you want” MUSHRA interface, the idea is to introduce a preliminary paired-comparison step in which a simple algorithm guides the user to roughly rank the stimuli before the 2nd half of the process.
Preliminary paired comparisons could have the following reply options:
“Please compare A with B”
◦ A >> B (A much better than B)
◦ A > B (A somewhat better than B)
◦ A ~= B (about equal)
◦ A < B (A somewhat worse than B)
◦ A << B (A much worse than B)
The 2nd half of the process is the normal “MUSHRA-like” interface, except that all stimuli will already be pre-sorted and pre-graded based on the preliminary results. Participants can then listen to the stimuli in sequence and make any adjustments to the automatically proposed scores.
Some specifics for the initial paired comparisons:
Only one of the two A-B stimuli should change from one comparison to the next. This reduces listening fatigue compared to presenting two new stimuli in every comparison. Any biases introduced by this rule can be ignored.
The ranking results will be scaled into MUSHRA score values relative to the included low-, mid- and high-quality anchors, which have pre-defined ratings.
Consequently it is important to know how different one stimulus is from another (not only the ranking but also the effect size).
A fair amount of low confidence, bias, undecided cases and logical errors can be accepted in exchange for the smallest possible number of paired comparisons before the 2nd half of the process; a rough sketch of the flow I have in mind follows below.
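To make this concrete, here is a rough Python sketch rather than a finished design (the binary-insertion ordering, the anchor handling and all names are placeholder choices, and only the sign of each answer is used so far):

STEP = {"A>>B": 2, "A>B": 1, "A~=B": 0, "A<B": -1, "A<<B": -2}

def rough_rank(stimuli, compare):
    # compare(a, b) returns one of the STEP keys. Each new stimulus is
    # binary-inserted into the already-ranked list, so it stays fixed across
    # its own comparisons and the total stays near n*log2(n) comparisons.
    ranked = [stimuli[0]]
    for s in stimuli[1:]:
        lo, hi = 0, len(ranked)
        while lo < hi:
            mid = (lo + hi) // 2
            if STEP[compare(s, ranked[mid])] > 0:     # s judged better than ranked[mid]
                hi = mid
            else:
                lo = mid + 1
        ranked.insert(lo, s)
    return ranked                                     # best first

def initial_scores(ranked, anchor_scores):
    # Spread initial MUSHRA scores linearly (by rank position) between the
    # anchors whose scores are pre-defined, e.g. {"s2": 90, "s7": 50, "s9": 10}.
    pos = {s: i for i, s in enumerate(ranked)}
    anchors = sorted((pos[s], v) for s, v in anchor_scores.items())
    scores = dict(anchor_scores)
    for s, i in pos.items():
        if s in scores:
            continue
        (i0, v0), (i1, v1) = anchors[0], anchors[-1]  # fallback: overall anchor span
        for seg in zip(anchors, anchors[1:]):
            if seg[0][0] <= i <= seg[1][0]:
                (i0, v0), (i1, v1) = seg
                break
        t = (i - i0) / (i1 - i0) if i1 != i0 else 0.0
        scores[s] = round(max(0.0, min(100.0, v0 + t * (v1 - v0))))
    return scores

# toy run where a fake "listener" answers from the ground-truth table above
truth = {"s1": 95, "s2": 90, "s3": 85, "s4": 85, "s5": 75,
         "s6": 60, "s7": 50, "s8": 40, "s9": 10}

def listener(a, b):
    return "A>B" if truth[a] > truth[b] else ("A<B" if truth[a] < truth[b] else "A~=B")

ranked = rough_rank(list(truth), listener)
print(initial_scores(ranked, {"s2": 90, "s7": 50, "s9": 10}))

The >>/<< magnitudes are not used yet; they could later be used to spread the proposed scores further apart where differences are judged large.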
Help
The idea is relatively simple, but I am still struggling to define an efficient algorithm – I would appreciate any suggestions for a simple state machine that produces a ROUGH ranking of the stimuli in the smallest number of paired comparisons. Most methods I looked at are made for more complex scenarios and often require quite advanced algorithms that are hard to re-implement from scratch without libraries. For example I looked into Ranking from pairwise comparisons, https://en.wikipedia.org/wiki/Elo_rating_system, https://en.wikipedia.org/wiki/Scale_(social_sciences)#Comparative_scaling_techniques, https://en.wikipedia.org/wiki/Ranked_pairs and many more pages.
If such a rough paired-comparison stage is added to MUSHRA, we can significantly speed up testing by ensuring that all participants use an efficient strategy to arrive at their results, make the whole process more fun, and mitigate scaling biases by suggesting initial scores based on a common algorithm.
(so far my concepts are too broken to present for feedback)
I was asked to find the asymptotic complexity of the given function using a recursion tree, but I'm struggling to find the correct complexity at each level.
Let's draw out the first two levels of the recursion tree:
+------------------+
| Input size n |
| Work done: n^2 |
+------------------+
/ \
+--------------------+ +--------------------+
| Input size: 3n/4 | | Input size: n/3 |
| Work done: 9n^2/16 | | Work done: n^2/9 |
+--------------------+ +--------------------+
Once we've done that, let's sum up the work done by each layer. The top layer does n^2 work. The next layer does
(9/16)n^2 + (1/9)n^2 = (97/144)n^2
total work. Notice that the work done by this second level is (97/144)ths of the work done in the level just above it. If you expand out a few more levels of the recursion tree, you'll find that the next level does (97/144)^2 n^2 work, the level below that does (97/144)^3 n^2 work, and that more generally the work done by level l of the tree is (97/144)^l n^2. (Convince yourself of this - don't just take my word for it!)
From there, you can compute the total amount of work done by the recursion tree by summing up the work done per level across all the levels of the tree. As a hint, you're looking at the sum of a geometric series that decays from one term to the next - does this remind you of any of the cases of the Master Theorem?
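Spelling that hint out, in case it helps: bounding the finite tree by the infinite geometric series gives

T(n) \le \sum_{k=0}^{\infty}\left(\frac{97}{144}\right)^{k} n^{2} = \frac{n^{2}}{1-\frac{97}{144}} = \frac{144}{47}\,n^{2},

and since the root alone already does n^2 work, T(n) = Θ(n^2).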
I know that vw can handle very raw data (e.g. raw text), but should one, for instance, consider scaling numerical features before feeding the data to vw?
Consider the following line:
1 |n age: 80.0 height: 180.0 |c male london |d the:1 cat:2 went:3 out:4
Assuming that typical age ranges from 1 to 100 and height (in centimeters) may range from 140 to 220, is it better to transform/scale the age and height so they share a common range? I think many algorithms may need this kind of preprocessing of their input data, for example linear regression.
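For illustration, this is the kind of preprocessing I mean - a hand-rolled min-max scaling of the two numeric features using the ranges assumed above (nothing vw-specific):

def minmax(value, lo, hi):
    # scale a value from the range [lo, hi] into [0, 1]
    return (value - lo) / (hi - lo)

age, height = 80.0, 180.0
print(minmax(age, 1, 100))       # ~0.798
print(minmax(height, 140, 220))  # 0.5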
vw's SGD is highly enhanced compared to vanilla naive SGD, so pre-scaling isn't needed.
If you have very few instances (small data-set), pre-scaling may help somewhat.
vw does automatic normalization for scale by remembering the range of each feature as it goes, so pre-scaling is rarely needed to achieve good results.
Normalization for scale, rarity and importance is applied by default. The relevant vw options are:
--normalized
--adaptive
--invariant
By default all three are applied; if any of them appears explicitly on the command line, the others are not applied.
See also: this stackoverflow answer
The paper explaining the enhanced SGD algorithm in vw is:
Online Importance Weight Aware Updates - Nikos Karampatziakis & John Langford
     2          1
 1---------2---------4
 |         |         |
 | 3       | 3       | 1
 |    6    |         |
 3---------5---------+
OK, so this is the graph. My source node is 1 and my destination node is 5.
My question is:
Are both algorithms going to give the same output or not?
That is, will both return 1->2->4->5? (Except that negative weights are not allowed in Dijkstra's.)
Thanks in advance for help.
The Bellman-Ford algorithm is a single-source shortest-path algorithm that allows negative edge weights and can detect negative cycles in a graph.
Dijkstra's algorithm is also a single-source shortest-path algorithm; however, all edge weights must be non-negative.
For your case, as far as the total cost is concerned, there will be no difference, since all edges in the graph have non-negative weights. However, Dijkstra's algorithm is usually preferred, since the typical implementation with a binary heap has Θ((|E|+|V|) log |V|) time complexity, while the Bellman-Ford algorithm has O(|V|·|E|) complexity.
If there is more than one path with minimum cost, the actual path returned is implementation-dependent (even for the same algorithm).
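As a quick sanity check, here is a small Python sketch of both algorithms on the edge list read off the diagram above; both report cost 4 along 1->2->4->5:

import heapq

# undirected edges from the diagram: (u, v, weight)
EDGES = [(1, 2, 2), (2, 4, 1), (1, 3, 3), (2, 5, 3), (4, 5, 1), (3, 5, 6)]
NODES = {1, 2, 3, 4, 5}

def dijkstra(src):
    adj = {n: [] for n in NODES}
    for u, v, w in EDGES:
        adj[u].append((v, w))
        adj[v].append((u, w))
    dist, prev, pq = {src: 0}, {}, [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue                                  # stale queue entry
        for v, w in adj[u]:
            if d + w < dist.get(v, float("inf")):
                dist[v], prev[v] = d + w, u
                heapq.heappush(pq, (d + w, v))
    return dist, prev

def bellman_ford(src):
    directed = EDGES + [(v, u, w) for u, v, w in EDGES]   # both directions
    dist = {n: float("inf") for n in NODES}
    dist[src], prev = 0, {}
    for _ in range(len(NODES) - 1):                   # relax all edges |V|-1 times
        for u, v, w in directed:
            if dist[u] + w < dist[v]:
                dist[v], prev[v] = dist[u] + w, u
    return dist, prev

def path(prev, src, dst):
    p = [dst]
    while p[-1] != src:
        p.append(prev[p[-1]])
    return p[::-1]

for algo in (dijkstra, bellman_ford):
    dist, prev = algo(1)
    print(algo.__name__, dist[5], path(prev, 1, 5))   # both: 4 [1, 2, 4, 5]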
In Dijkstra's algorithm, each vertex works with information about the whole network; it is not the case that every vertex only cares about itself and its neighbors. In the Bellman-Ford algorithm, on the other hand, a node only needs the information related to it: which neighbor nodes it can connect to and which node a relation comes from. Dijkstra's algorithm is faster than Bellman-Ford's, but the second algorithm can be more useful for some problems, such as graphs with negative edge weights.
Dijkstra's algorithm is a greedy technique, whereas implementing the Bellman-Ford algorithm requires a dynamic-programming approach.
In Dijkstra's algorithm we relax the edges of each vertex as it is extracted in the main loop, whereas in Bellman-Ford we relax all edges |V|-1 times.
Dijkstra's algorithm is suitable when all edge weights are positive and fails on graphs with negative edges, whereas Bellman-Ford has the advantage that it works even when some edge weights are negative.
Bellman-Ford can also detect whether a solution exists at all (i.e. whether the given directed graph has a negative-weight cycle), which Dijkstra's algorithm cannot do.
The time complexity of Dijkstra's algorithm is O(V^2) when implemented with a linear array and O(E log V) when implemented with a binary heap or Fibonacci heap, whereas the Bellman-Ford algorithm has O(|V|·|E|) time complexity.
I'm looking for something that I guess is rather sophisticated and might not exist publicly, but hopefully it does.
I basically have a database with lots of items, each of which has values (y) that correspond to other values (x). E.g. one of these items might look like:
x | 1 | 2 | 3 | 4 | 5
y | 12 | 14 | 16 | 8 | 6
This is just a random example. Now, there are thousands of these items, each with its own set of x and y values. The spacing between one x and the next is not fixed and may differ for every item.
What I'm looking for is a library where I can plug in all these sets of Xs and Ys and ask it to return things like the most common item (sets of x and y that follow a comparable curve / progression), and the ability to check whether a certain set is at least x% comparable with another set.
By comparable I mean the slope of the curve if you were to draw a graph of the data. So, not actually the static values, but rather the detection of events, such as a sharp increase followed by a slow decrease, etc.
Due to my limited experience in mathematics I'm not quite sure what the thing I'm looking for is called, and thus have trouble explaining what I need. Hopefully I have given enough pointers for someone to point me in the right direction.
I'm mostly interested in a library for JavaScript, but if there is no such thing, any library would help; maybe I can port what I need.
About Markov Cluster(ing) again, of which I happen to be the author, and your application: you mention you are interested in trend similarity between objects. This is typically computed using Pearson correlation. If you use the mcl implementation from http://micans.org/mcl/, you'll also obtain the program 'mcxarray'. This can be used to compute Pearson correlations between e.g. rows in a table, so it might be useful to you. It is able to handle missing data in a simplistic way: it just computes correlations on those indices for which values are available in both rows. If you have further questions I am happy to answer them -- with the caveat that I usually like to cc replies to the mcl mailing list so that they are archived and available for future reference.
What you're looking for is an implementation of Markov clustering. It is often used for finding groups of similar sequences. Porting it to JavaScript, well... If you're really serious about this analysis, drop JavaScript as soon as possible and move to R. JavaScript is not meant for this kind of calculation and is far too slow for it. R is a statistical package with a lot already implemented. It is also designed specifically for very fast matrix calculations, and most of the language is vectorized (meaning you don't need for-loops to apply a function over a vector of values; it happens automatically).
For the Markov clustering, check http://www.micans.org/mcl/
An example of an implementation : http://www.orthomcl.org/cgi-bin/OrthoMclWeb.cgi
Now you also need to define a "distance" between your sets. As you are interested in the events and not the values, you could give every item an extra attribute: a vector of the differences y[i] - y[i-1] (in R: diff(y)). The distance between two items can then be calculated as the sum of squared differences between these difference vectors, element by element.
This allows you to construct a distance matrix of your items, and on that one you can call the mcl algorithm. Unless you work on Linux, you'll have to port that one too.
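Language aside, that distance is easy to compute; a small Python sketch with toy items (the first one is the example from the question), assuming for simplicity that the items share the same x positions:

import numpy as np

def diff_distance_matrix(items):
    # distance = sum of squared differences between first-difference vectors
    diffs = [np.diff(np.asarray(y, dtype=float)) for y in items]
    n = len(diffs)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = np.sum((diffs[i] - diffs[j]) ** 2)
    return D

items = [[12, 14, 16, 8, 6], [11, 13, 15, 7, 5], [3, 2, 2, 9, 14]]
print(diff_distance_matrix(items))   # first two items are close, third is far away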
What you want to do is ANOVA, or ANalysis Of VAriance. If you run the numbers through an ANOVA test, it will give you information about the dataset that helps you compare one item to another. I was unable to locate a JavaScript library that performs ANOVA, but there are plenty of programs capable of it. Excel can perform ANOVA with a plugin. R is a free stats package that can also perform ANOVA.
Hope this helps.
Something simple would be the following (assuming all the graphs have 5 points, and x = 1,2,3,4,5 always):
Take u1 = the first point of y, i.e. y1
Take u2 = y2 - y1
...
Take u5 = y5 - y4
Now consider the vector u as a point in 5-dimensional space. You can use simple clustering algorithms, like k-means.
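A small sketch of that encoding plus k-means, using scikit-learn purely as an example (the toy items and the number of clusters are placeholders):

import numpy as np
from sklearn.cluster import KMeans

def encode(y):
    # u = (y1, y2 - y1, ..., y5 - y4), as described above
    y = np.asarray(y, dtype=float)
    return np.concatenate(([y[0]], np.diff(y)))

items = [[12, 14, 16, 8, 6], [11, 13, 15, 7, 5], [3, 2, 2, 9, 14], [4, 3, 2, 10, 15]]
U = np.array([encode(y) for y in items])
labels = KMeans(n_clusters=2, n_init=10).fit_predict(U)
print(labels)   # similar curves should land in the same cluster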
EDIT: You should not aim for something too complicated as long as you stay with JavaScript. If you are willing to use Java, I can suggest something based on PCA (requiring the use of singular value decomposition, which is too complicated to implement efficiently in JS).
Basically, it goes like this: as before, take a (possibly large) linear representation of the data, perhaps differences of components of x, of y, or absolute values. For instance you could take
u = (x1, x2 - x1, ..., x5 - x4, y1, y2 - y1, ..., y5 - y4)
You compute the vector u for each sample. Call ui the vector u for the ith sample. Now, form the matrix
M_{ij} = dot product of ui and uj
and compute its SVD. Now, the N most significant singular values (i.e. those above some "similarity threshold") give you N clusters.
The corresponding columns of the matrix U in the SVD give you an orthonormal family B_k, k = 1..N. The squared i-th component of B_k gives you the probability that the i-th sample belongs to cluster k.
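A rough numpy sketch of that recipe (the similarity threshold is a placeholder, and this follows the description above literally rather than a textbook spectral-clustering formulation):

import numpy as np

def svd_cluster_memberships(samples, threshold=1.0):
    # samples: (n, d) array of the u vectors described above
    X = np.asarray(samples, dtype=float)
    M = X @ X.T                                  # M[i, j] = dot product of u_i and u_j
    U, s, _ = np.linalg.svd(M)
    N = int(np.sum(s > threshold))               # number of "significant" singular values
    B = U[:, :N]                                 # orthonormal family B_k
    return B ** 2                                # squared components ~ cluster memberships

# toy usage: two visibly different groups of samples
samples = [[1, 2, 1, 0], [1.1, 2.1, 0.9, 0], [-3, 0, 4, 2], [-2.9, 0.1, 4.2, 2]]
print(svd_cluster_memberships(samples))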
If it is OK to use Java, you really should have a look at Weka. It is possible to access all its features via Java code. Maybe you will find a Markov clustering there, but if not, they have a lot of other clustering algorithms and it's really easy to use.