Can I choose clustering methods based on internal validation results? - r

From what I understand, it is usually difficult to select the best clustering method for your data a priori, so we can use cluster validity indices to compare the results of different clustering algorithms and choose the one with the best validation scores.
I used an internal validation function from the R stats package on my clustering results (the clustering methods were igraph's fast.greedy and walk.trap).
The outcome is a list of many validation scores.
In the list, the fast greedy method has better scores on almost every index; only on entropy does the walk trap method score better.
Can I use this validation result list as one of my reasons to explain to others why I chose the fast greedy method rather than the walk trap method?
Also, is there any way to validate a disconnected graph?

Short answer: NO!
You can't use an internal index to justify the choice of one algorithm over another. Why?
Because evaluation indices were designed to evaluate clustering results, i.e., partitions and hierarchies. You can only use them to assess the quality of a clustering and therefore justify its choice over the other options. But again, you can't use them to justify choosing a particular algorithm to apply to a different dataset based on a single previous experiment.
For that task, several benchmarks are needed to determine which algorithms are generally better and should be tried first. Here is a paper about it: Community detection algorithms: a comparative analysis.
Edit: What I am saying is, your validation indices may show that fast.greedy's solution is better than walk.trap's. However, they do not explain why you chose these algorithms instead of any others. Only your data, your assumptions, and your constraints can do that.
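To make that distinction concrete, here is a minimal sketch using python-igraph (the Python binding; you used the R interface), with the Zachary karate club graph standing in for your real data. The index scores each partition on this graph, nothing more:

    import igraph as ig

    g = ig.Graph.Famous("Zachary")  # stand-in for your own graph

    fg = g.community_fastgreedy().as_clustering()
    wt = g.community_walktrap().as_clustering()

    # An internal index (modularity here) scores each *partition* on this
    # graph; it does not say which algorithm is better in general.
    print("fast greedy modularity:", fg.modularity)
    print("walktrap modularity:   ", wt.modularity)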
Also, is there any way to validate a disconnected graph?
Theoretically, any evaluation index can do this. Technically, some implementations don't handle disconnected components.
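For instance (a sketch, again assuming the python-igraph binding; the same caveat about implementations applies to the R interface):

    import igraph as ig

    g = ig.Graph.Famous("Zachary")
    disconnected = g + g  # disjoint union: two separate components

    for name, method in [("fastgreedy", disconnected.community_fastgreedy),
                         ("walktrap", disconnected.community_walktrap)]:
        try:
            clustering = method().as_clustering()
            print(name, "modularity:", clustering.modularity)
        except ig.InternalError as err:
            print(name, "cannot handle disconnected input:", err)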

Related

r documentation on methods used by a test?

R provides a number of .test functions (t.test, prop.test, and so on) for various applications. It is often the case that the literature offers many methods for a given test. For example, a test of a population proportion probably has at least a half-dozen candidate procedures. Each of these has different properties, and they may make different assumptions.
The base R documentation does not always say precisely which method a test uses. For example, ?prop.test does not mention whether the Wald method, the Wilson method, or some other method is used.
Is this information documented somewhere? How can I find out more about which methods are being used by a particular test in R?
One option is to view and dissect the source code: getAnywhere(prop.test)
While possibly tedious, it gives an unambiguous explanation for what is actually happening when you run a function.

Avoiding singularity in analysis - does OpenMDAO automatically enable 'fully-simultaneous' solution?

Turbulent boundary layer calculations break down at the point of flow separation when solved with a prescribed boundary layer edge velocity, ue, in what is called the direct method.
This can be alleviated by solving the system in a fully-simultaneous or quasi-simultaneous manner. Details about both methods are available here (https://www.rug.nl/research/portal/files/14407586/root.pdf), from page 38 onwards. Essentially, the fully-simultaneous method combines the inviscid and viscous equations into a single large system of equations and solves them with Newton iteration.
I have currently implemented an inviscid panel solver entirely in ExplicitComponents. I intend to implement the boundary layer solver also entirely with ExplicitComponents. I am unsure whether coupling these two groups would then result in an execution procedure like the direct method, or whether it would work like the fully-simultaneous method. I note that in the OpenMDAO paper, it is stated that the components are solved "as a single nonlinear system of equations", and that the reformulation from explicit components to the implicit system is handled automatically by OpenMDAO.
Does this mean that if I couple my two analyses (again, consisting purely of ExplicitComponents) and set the group to solve with the Newton solver, I'll get a fully-simultaneous solution 'for free'? This seems too good to be true, as ultimately the component that integrates the boundary layer equations will have to take some prescribed ue as an input, and then will run into the singularity in the execution of its compute() method.
If doing the above would instead make it execute like the direct method and lead to the singularity, (briefly) what changes would I need to make to avoid it? Would it require defining the boundary layer components implicitly?
Despite seeming too good to be true, you can in fact change the structure of your system by changing out the top-level solver.
If you used a NonlinearBlockGS solver at the top, it would solve in the weak form. If you used a NewtonSolver at the top, it would solve as one large monolithic system. This property does indeed derive from the unique way OpenMDAO stores things.
There are some caveats. I would guess that your panel code is implemented as a set of intermediate calculations broken up across several components. If that's the case, then the NewtonSolver will treat each intermediate variable as if it were its own state variable. In other words, you would have more than just delta and u_e as states, but all the intermediate calculations too.
This might be somewhat unstable (though it might work just fine, so try it!). You might need a hybrid between the weak and strong forms, which can be achieved via the solve_subsystems option on the NewtonSolver. This approach is called the Hierarchical Newton Method in section 5.1.2 of the OpenMDAO paper. It will do a sub-iteration of NLBGS for every top-level Newton iteration, which acts as a form of nonlinear preconditioner that can help stabilize the strong form. You can limit how many sub-iterations are done, and in your case you may want to use just 2 or 3 because of the risk of singularity.
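A minimal sketch of that setup (the group and component names here are hypothetical; the solver configuration is the point):

    import openmdao.api as om

    prob = om.Problem()
    coupled = prob.model.add_subsystem('coupled', om.Group())
    # coupled.add_subsystem('inviscid', InviscidPanelGroup())       # hypothetical
    # coupled.add_subsystem('boundary_layer', BoundaryLayerGroup()) # hypothetical

    # Strong (monolithic) form: Newton over the whole coupled system,
    # with a few NLBGS sub-iterations acting as a nonlinear preconditioner.
    coupled.nonlinear_solver = om.NewtonSolver(solve_subsystems=True)
    coupled.nonlinear_solver.options['max_sub_solves'] = 3
    coupled.nonlinear_solver.options['maxiter'] = 20
    coupled.linear_solver = om.DirectSolver()

    # Weak form instead: coupled.nonlinear_solver = om.NonlinearBlockGS()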

Is having multiple features for the same data bad practice (e.g. use both ordinal and binarized time series data)?

I'm trying to train a learning model on real estate sale data that includes dates. I've looked into 1-to-K binary encoding, per the advice in this thread; however, my initial assessment is that it may not handle data that is not predictably cyclic well. While real estate value crashes are recurring, I'm concerned (maybe wrongly so, you tell me) that 1-to-K encoding will inadvertently overfit on potentially irrelevant features if the recurrence is not explainable by a combination of year-month-day.
That said, I think there is potentially value in that method. I think that there is also merit to the argument of converting time series data to ordinal, as also recommended in the same thread. Which brings me to the real question: is it bad practice to duplicate the same initial feature (the date data) in two different forms in the same training data? I'm concerned if I use methods that rely on the assumption of feature independence I may be violating this by doing so.
If so, what are suggestions for how to best get the maximal information from this date data?
Edit: Please leave a comment on how I can improve this question instead of down-voting.
Is it bad practice?
No. Transformations often make a feature more accessible to your algorithm, and by that line of thought converting features is completely fine.
Does it skew your algorithm?
Concerning runtime, it can be better not to have to transform your data every time. Depending on your algorithm and the type of transformation, you might also lose some interpretability (if that matters to you).
Also, if you want to restrict the set of features your algorithm uses, adding transformed features introduces redundant information.
So what should you do?
Transform your data and features as much and as often as you want.
That doesn't hurt anyone; rather, it helps by enriching the feature space. But after you do so, run a PCA or something similar to find the redundancies among your features and reduce the feature space again.
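A minimal sketch of that workflow (hypothetical dates; scikit-learn for the encoding and PCA):

    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # Hypothetical sale dates
    df = pd.DataFrame({"sale_date": pd.to_datetime(
        ["2015-03-01", "2016-07-15", "2017-11-30", "2018-02-20", "2019-06-05"])})

    # Two redundant encodings of the same raw feature
    ordinal = df["sale_date"].map(pd.Timestamp.toordinal).to_numpy().reshape(-1, 1)
    month_1hot = OneHotEncoder(sparse_output=False).fit_transform(
        df["sale_date"].dt.month.to_numpy().reshape(-1, 1))

    X = np.hstack([StandardScaler().fit_transform(ordinal), month_1hot])

    # PCA collapses the redundancy back down
    X_reduced = PCA(n_components=0.95).fit_transform(X)
    print(X.shape, "->", X_reduced.shape)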
Note:
I tried to keep this general; obviously it depends heavily on the kind of algorithm you're using.

Heuristic function meaning

I've searched for the meaning of "heuristic function", but all I found is that it is a function that ranks alternatives in search algorithms. I suppose that is not the full definition of a heuristic. For example, the rank heuristic is used in the disjoint-set union problem, but there is no searching there!
I still don't understand what "heuristic" means. Do you know any mathematical definitions?
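(For reference, the rank heuristic I mean is the greedy rule in a disjoint-set forest that keeps trees shallow; a minimal sketch:)

    class DisjointSet:
        def __init__(self, n):
            self.parent = list(range(n))
            self.rank = [0] * n

        def find(self, x):
            while self.parent[x] != x:
                self.parent[x] = self.parent[self.parent[x]]  # path halving
                x = self.parent[x]
            return x

        def union(self, a, b):
            ra, rb = self.find(a), self.find(b)
            if ra == rb:
                return
            # Rank heuristic: attach the shallower tree under the deeper one.
            # No search involved -- just a cheap rule that keeps trees flat.
            if self.rank[ra] < self.rank[rb]:
                ra, rb = rb, ra
            self.parent[rb] = ra
            if self.rank[ra] == self.rank[rb]:
                self.rank[ra] += 1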
In its broadest sense, heuristics is the art of obtaining adequate results in a short time from incomplete information.
As an example, the best known algorithms that solve the travelling salesman problem exactly (i.e., finding the cheapest Hamiltonian cycle in an edge-weighted graph) have exponential time complexity. A heuristic algorithm for this problem is one that will often find a Hamiltonian cycle that is not much more expensive than the optimal solution, but will use only polynomial time.
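A minimal sketch of such a heuristic, the classic nearest-neighbour rule for TSP (O(n^2) time; usually a decent tour, rarely the optimal one):

    import math

    def nearest_neighbour_tour(points):
        # Greedy heuristic: always visit the closest unvisited city next.
        unvisited = set(range(1, len(points)))
        tour = [0]
        while unvisited:
            last = points[tour[-1]]
            nxt = min(unvisited, key=lambda i: math.dist(last, points[i]))
            tour.append(nxt)
            unvisited.remove(nxt)
        return tour

    print(nearest_neighbour_tour([(0, 0), (3, 0), (3, 4), (0, 4)]))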
A heuristic is a problem-solving technique that comes from intuition, common sense, and experience; it is basically an easier and shorter route to a solution.
It applies to humans as well as machines.
For example, you choose one of the fruits from the bucket because it looks fresh and ripe; what you thought before picking up the fruit was a heuristic.
A heuristic function can be explained with the example of a multiple-choice question.
Suppose you face a question with 4 options, of which only one is correct, and you do not know the answer.
You look at the options and start thinking: the 1st is least probable, so is the 3rd, and so on.
You are rejecting the least probable answers based on your experience and knowledge of those options.
Thus it can be said that you are using a heuristic function.
Similarly, for computers a heuristic function can be designed by keeping in mind the constraints of what you want your program to do.
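A classic concrete instance for machines is the ranking function in A*-style search, for example Manhattan distance on a grid (a minimal sketch):

    def manhattan(cell, goal):
        # Admissible heuristic for grid pathfinding: it never overestimates
        # the true remaining cost, so A* using it still finds optimal paths.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])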

What is the network analog of a recursive function?

This is an ambitious question from a Wolfram Science Conference: Is there such a thing as a network analog of a recursive function? Maybe a kind of iterative "map-reduce" pattern? If we add interaction to iteration, things become complicated: continuous iteration of a large number of interacting entities can produce very complex results. It would be nice to have a way of seeing the consequences of the myriad interactions that define a complex system. Can we find a counterpart of a recursive function in an iterative network of connected nodes which contain nested propagation loops?
One of the basic patterns of distributed computation is map-reduce: it can be found in cellular automata (CA) and neural networks (NN). Neurons in an NN collect information through their synapses (reduce) and send the result on to other neurons (map). Cells in a CA act similarly: they gather information from their neighbors (reduce), apply a transition rule, and offer the result to their neighbors again (map). Thus, if there is a network analog of a recursive function, then map-reduce is certainly an important part of it. What kinds of iterative map-reduce patterns exist? Do certain kinds of map-reduce patterns result in certain kinds of streams, or even vortices or whirls? Can we formulate a calculus for map-reduce patterns?
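To make the pattern concrete, here is a one-dimensional cellular automaton step phrased as gather-then-apply (a minimal sketch; rule 110 is just an example transition rule):

    def step(cells, rule):
        n = len(cells)
        # reduce: gather each cell's neighbourhood; map: apply the rule everywhere
        return [rule(cells[(i - 1) % n], cells[i], cells[(i + 1) % n])
                for i in range(n)]

    # Rule 110 as a transition function on a binary neighbourhood
    rule110 = lambda l, c, r: int((l, c, r) not in {(1, 1, 1), (1, 0, 0), (0, 0, 0)})

    state = [0, 0, 0, 1, 0, 0, 0, 0]
    for _ in range(4):
        state = step(state, rule110)
        print(state)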
I'll take a stab at the question about recursion in neural networks, but I really don't see how map-reduce plays into this at all. I get that neural networks can perform distributed computation and then reduce it to a more local representation, but the term map-reduce refers to a very specific brand of this distributed/local piping, mainly associated with Google and Hadoop.
Anyway, the simple answer to your question is that there is no general method for recursion in neural networks; in fact, the very related and simpler problem of implementing general-purpose role-value bindings in neural networks is still an open question.
The general reason why things like role binding and recursion in artificial neural networks (ANNs) are so hard is that ANNs are very interdependent by nature; indeed, that is where most of their computational power derives from. Function calls and variable bindings, by contrast, are sharply delineated operations: what they include is an all-or-nothing affair, and that discreteness is a valuable property in many cases. So implementing one inside the other without sacrificing any computational power is very tricky indeed.
Here is a small sampling of papers that try their hand at partial solutions. Lucky for you, a great many people find this problem very interesting!
Visual Segmentation and the Dynamic Binding Problem: Improving the Robustness of an Artificial Neural Network Plankton Classifier (1993)
A Solution to the Binding Problem for Compositional Connectionism
A (Somewhat) New Solution to the Binding Problem
