reusable holdout in mlr - r

How can I change the cross-validation or holdout procedures in mlr so that, before testing on the validation set, that same validation set is modified according to a given procedure, namely the reusable holdout procedure?
Procedure:
http://insilico.utulsa.edu/wp-content/uploads/2016/10/Dwork_2015_Science.pdf

Short answer: mlr doesn't support that.
Long answer: My experience with differential privacy for machine learning is that in practice it doesn't work as well as advertised. In particular, to apply thresholdout you need (a) copious amounts of data and (b) the a priori probability that a given classifier will overfit on the given data -- something you can't easily determine in practice. The paper you reference does come with example code showing that thresholdout works in that particular case, but the amount of noise added in the code looks like it was determined on an ad hoc basis; its relationship to the thresholdout algorithm described in the paper isn't clear.
Until differential privacy can be applied robustly in practical scenarios like this, mlr won't support it.
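For anyone who wants to experiment regardless, below is a minimal Python sketch of a single thresholdout query, written from the description in the linked paper. The threshold and noise scales here are illustrative assumptions (the paper and its example code use varying scales), not validated settings.

```python
import numpy as np

def thresholdout(train_vals, holdout_vals, threshold=0.04, sigma=0.01, rng=None):
    """One thresholdout query, sketched from Dwork et al. (2015).

    train_vals / holdout_vals: per-example values of the queried statistic
    (e.g. 0/1 correctness of a classifier) on the training and holdout sets.
    threshold and sigma are analyst-chosen; the guarantees in the paper
    depend on picking them relative to the number of queries and the
    holdout size -- exactly the part that is hard to do in practice.
    """
    rng = np.random.default_rng() if rng is None else rng
    train_mean = np.mean(train_vals)
    holdout_mean = np.mean(holdout_vals)
    # Compare the train/holdout gap against a noisy threshold.
    if abs(train_mean - holdout_mean) > threshold + rng.laplace(0, 2 * sigma):
        # Overfitting suspected: answer with a noise-perturbed holdout estimate.
        return holdout_mean + rng.laplace(0, sigma)
    # Otherwise the training-set estimate is considered representative as-is.
    return train_mean
```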

Using discrete elements to optimize the design of another element - Expanding on Circuit Example

I've read through most of the questions tagged OpenMDAO here that concern discrete variables, and I have reviewed the documentation here, but I cannot find an answer to my question.
Here is a description of my use-case:
Let's start with the circuit example here, and let's assume that I have a set of R values I would like to use: perhaps in my box of hardware I have 3 types of resistors available that I must take advantage of.
With the resistors available, I would like to find a configuration that constrains the net current to 0 but minimizes the voltages at the nodes. Is OpenMDAO capable of taking in sets of discrete variables to select an optimal design for the other components? What would be the proper methods for this use-case? Is there any documentation or publication that I could use as a reference for this type of effort?
Overall I'm looking to use OpenMDAO to define bespoke hardware requirements in cooperation with available COTS components to meet a performance need. Am I barking up the right tree?
There is some work on discrete optimization in OpenMDAO. There is specifically some support for discrete variables in components. The SimpleGADriver also supports discrete variables. If you are looking for something more advanced than that, you can check out the AMIEGO driver.
I'm not entirely sure how to pose the optimization problem you've described, but there is some relevant OpenMDAO code for circuit analysis in the Zappy library that you might want to check out.
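As a starting point, here is a minimal sketch of one common way to handle a parts catalog with SimpleGADriver: encode the choice as an integer index into the catalog. The ResistorPicker component, the three-value catalog, and the stand-in ExecComp objective are illustrative assumptions, not part of the actual circuit example or Zappy.

```python
import numpy as np
import openmdao.api as om

# Hypothetical catalog: three COTS resistor values on hand (ohms).
R_CATALOG = np.array([100.0, 470.0, 1000.0])

class ResistorPicker(om.ExplicitComponent):
    """Maps an integer-valued index design variable to a catalog value."""
    def setup(self):
        self.add_input('idx', val=0.0)          # driven by the GA
        self.add_output('R', val=R_CATALOG[0])  # selected resistance

    def compute(self, inputs, outputs):
        i = int(round(inputs['idx'].item()))
        i = min(max(i, 0), len(R_CATALOG) - 1)  # clamp to valid indices
        outputs['R'] = R_CATALOG[i]

prob = om.Problem()
prob.model.add_subsystem('pick', ResistorPicker(), promotes=['*'])
# Stand-in for the real circuit analysis; replace with your own model.
prob.model.add_subsystem('circuit', om.ExecComp('V = 0.01 * R'), promotes=['*'])

prob.driver = om.SimpleGADriver()
prob.driver.options['bits'] = {'idx': 2}  # 4 levels over [0, 2]; rounding maps them to indices

prob.model.add_design_var('idx', lower=0, upper=len(R_CATALOG) - 1)
prob.model.add_objective('V')

prob.setup()
prob.run_driver()
print(prob.get_val('R'), prob.get_val('V'))
```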

Avoiding singularity in analysis - does OpenMDAO automatically enable 'fully-simultaneous' solution?

Turbulent boundary layer calculations break down at the point of flow separation when solved with a prescribed boundary layer edge velocity, ue, in what is called the direct method.
This can be alleviated by solving the system in a fully-simultaneous or quasi-simultaneous manner. Details about both methods are available here (https://www.rug.nl/research/portal/files/14407586/root.pdf), pages 38 onwards. Essentially, the fully-simultaneous method combines the inviscid and viscous equations into a single large system of equations, and solves them with Newton iteration.
I have currently implemented an inviscid panel solver entirely in ExplicitComponents. I intend to implement the boundary layer solver also entirely with ExplicitComponents. I am unsure whether coupling these two groups would then result in an execution procedure like the direct method, or whether it would work like the fully-simultaneous method. I note that in the OpenMDAO paper, it is stated that the components are solved "as a single nonlinear system of equations", and that the reformulation from explicit components to the implicit system is handled automatically by OpenMDAO.
Does this mean that if I couple my two analyses (again, consisting purely of ExplicitComponents) and set the group to solve with the Newton solver, I'll get a fully-simultaneous solution 'for free'? This seems too good to be true, as ultimately the component that integrates the boundary layer equations will have to take some prescribed ue as an input, and then will run into the singularity in the execution of its compute() method.
If doing the above would instead make it execute like the direct method and lead to the singularity, (briefly) what changes would I need to make to avoid it? Would it require defining the boundary layer components implicitly?
Despite seeming too good to be true, you can in fact change the structure of your system by swapping out the top-level solver.
If you used a NonlinearBlockGS solver at the top, it would solve in the weak form. If you used a NewtonSolver at the top, it would solve as one large monolithic system. This property does indeed derive from the unique way OpenMDAO stores things.
There are some caveats. I would guess that your panel code is implemented as a set of intermediate calculations broken up across several components. If that's the case, then the NewtonSolver will treat each intermediate variable as if it were its own state variable. In other words, you would have more than just delta and u_e as states, but all the intermediate calculations too.
This might be somewhat unstable (though it might work just fine, so try it!). You might need a hybrid between the weak and strong forms, which can be achieved via the solve_subsystems option on the NewtonSolver. This approach is called the hierarchical Newton method in section 5.1.2 of the OpenMDAO paper. It will do a sub-iteration of NLBGS for every top-level Newton iteration, which acts as a form of nonlinear preconditioner that can help stabilize the strong form. You can limit how many sub-iterations are done; in your case you may want to use just 2 or 3 because of the risk of singularity.
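To make that concrete, here is a minimal sketch with two toy ExecComps standing in for the inviscid panel group and the boundary-layer group; the equations are placeholders chosen only so the coupled system converges. In a real model, the child groups could carry their own NLBGS solvers, which is what the solve_subsystems sub-iterations would run.

```python
import openmdao.api as om

prob = om.Problem()
coupled = prob.model.add_subsystem('coupled', om.Group())

# Placeholders for the real analyses: the inviscid solver produces ue
# from delta, the boundary-layer solver produces delta from ue (a cycle).
coupled.add_subsystem('inviscid', om.ExecComp('ue = 1.0 - 0.5 * delta'), promotes=['*'])
coupled.add_subsystem('viscous', om.ExecComp('delta = 0.2 * ue'), promotes=['*'])

# Strong (fully-simultaneous) form: Newton over the whole coupled group,
# with a few sub-solves acting as a nonlinear preconditioner.
coupled.nonlinear_solver = om.NewtonSolver(solve_subsystems=True)
coupled.nonlinear_solver.options['max_sub_solves'] = 3
coupled.linear_solver = om.DirectSolver()

prob.setup()
prob.run_model()
print(prob.get_val('coupled.ue'), prob.get_val('coupled.delta'))
```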

Is having multiple features for the same data bad practice (e.g. use both ordinal and binarized time series data)?

I'm trying to train a learning model on real estate sale data that includes dates. I've looked into 1-to-K binary encoding, per the advice in this thread, however my initial assessment is that it may have the weakness of not being able to train well on data that is not predictably cyclic. While real estate value crashes are recurring, I'm concerned (maybe wrongfully so, you tell me) that doing 1-to-K encoding will inadvertently overtrain on potentially irrelevant features if the recurrence is not explainable by a combination of year-month-day.
That said, I think there is potentially value in that method. I think that there is also merit to the argument of converting time series data to ordinal, as also recommended in the same thread. Which brings me to the real question: is it bad practice to duplicate the same initial feature (the date data) in two different forms in the same training data? I'm concerned if I use methods that rely on the assumption of feature independence I may be violating this by doing so.
If so, what are suggestions for how to best get the maximal information from this date data?
Edit: Please leave a comment on how I can improve this question instead of down-voting.
Is it bad practice?
No; transformations can make a feature more accessible to your algorithm, and by that line of reasoning converting features is completely fine.
Does it skew your algorithm?
Concerning runtime, it might be better not to have to transform your data every time. Depending on your algorithm, you might also lose some interpretability (if that matters to you), depending on the type of transformation.
Also, if you want to restrict the number or set of features your algorithm should use, adding transformed features may introduce information redundancies.
So what should you do?
Transform your data and features as much as you want and as often as you want.
That's not hurting anyone; rather, it helps by enriching the feature space. But after you do so, run a PCA or something similar to find redundancies among your features and reduce the feature space again.
Note:
I tried to keep this general; obviously it all depends heavily on the kind of algorithm you're using.
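As a concrete illustration of that transform-then-reduce workflow, here is a small Python/scikit-learn sketch. The dates, the ordinal encoding, and the month-only one-hot encoding are made-up stand-ins for the real estate data, and sparse_output assumes scikit-learn >= 1.2.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical sale dates; encode the same raw column two different ways.
dates = pd.to_datetime(["2014-03-01", "2014-07-15", "2015-01-30",
                        "2015-06-10", "2016-11-05"])
ordinal = dates.map(pd.Timestamp.toordinal).to_numpy().reshape(-1, 1)
months = OneHotEncoder(sparse_output=False).fit_transform(
    dates.month.to_numpy().reshape(-1, 1))

# Stack both encodings, standardize, then let PCA expose the redundancy.
X = StandardScaler().fit_transform(np.hstack([ordinal, months]))
pca = PCA(n_components=0.95)  # keep components explaining 95% of variance
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)
```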

Can I choose clustering methods based on internal validation results?

From what I understand, it is usually difficult to select the best possible clustering method for your data a priori, so we can use cluster validity measures to compare the results of different clustering algorithms and choose the one with the best validation scores.
I ran an internal validation function from the R stats package on my clustering results (for clustering methods I used the R igraph fast.greedy and walk.trap).
The outcome is a list of many validation scores.
In the list, fast.greedy has better scores than walk.trap on almost every validation measure; only on entropy does walk.trap score better.
Can I use this list of validation results as one of my reasons to explain to others why I chose the fast.greedy method rather than the walk.trap method?
Also, is there any way to validate a disconnected graph?
Short answer: NO!
You can't use an internal index to justify choosing one algorithm over another. Why?
Because evaluation indexes were designed to evaluate clustering results, i.e., partitions and hierarchies. You can only use them to assess the quality of a clustering, and therefore to justify choosing it over the other options. But you can't use them to justify picking a particular algorithm for a different dataset based on a single previous experiment.
For that task, several benchmarks are needed to determine which algorithms are generally better and should be tried first. Here is a paper about it: Community detection algorithms: a comparative analysis.
Edit: What I am saying is, your validation indexes may show that fast.greedy's solution is better than walk.trap's. However, they do not explain why you chose these two algorithms instead of any others. Only your data, your assumptions, and your constraints can do that.
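To illustrate the distinction in code, here is a small python-igraph sketch; community_fastgreedy and community_walktrap are that library's counterparts of the R functions you used, and the Zachary karate-club graph is just a placeholder for your data. The modularity values score these two partitions of this graph, nothing more.

```python
import igraph as ig

# Placeholder graph; substitute your own network here.
g = ig.Graph.Famous("Zachary")

fg = g.community_fastgreedy().as_clustering()
wt = g.community_walktrap().as_clustering()

# Modularity is one internal index: it evaluates each partition of this
# particular graph, not the algorithms in general.
print("fast.greedy modularity:", fg.modularity)
print("walk.trap   modularity:", wt.modularity)
```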
Also, is there any way to validate a disconnected graph?
Theoretically, any evaluation index can do this. Technically, some implementations don't handle disconnected components.

Custom Graph Partitioning algorithms in Giraph

There have been mentions of using custom partitioning algorithms for Giraph applications, but it is not clearly explained anywhere. As Castagna pointed out in How to partition graph for Pregel to maximize processing speed?, there may be no need for such partitioning, since the HashPartitioner will by itself be very good in most cases:
The problem of partitioning a graph 'intelligently' in order to minimize execution time is an interesting one, however it's not simple and it depends on your data and your algorithm. You might find also that, in practice, it's not necessary and a random partitioning is sufficiently good.
For example, if you are interested in exploring Pregel-like approaches, you can have a look at Apache Giraph and experiment with different partitioning techniques.
However, for the purpose of learning it would be good to see live examples, and I haven't found any so far -- for example, the standard k-way partitioning algorithm (Kernighan-Lin) being executed in Giraph, or at least the direction in which I should implement it.
All the Google results were from the Apache Giraph pages, where there are only definitions of the functions and various options for using them.
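For learning purposes, one option is to prototype the partitioning logic outside Giraph first and port it into a custom Java Partitioner afterwards. Below is a hedged Python sketch of Kernighan-Lin bisection using networkx (the karate-club graph is a placeholder, and recursing on the halves gives a k-way partition); it is not Giraph code, only a way to see the algorithm run.

```python
import networkx as nx
from networkx.algorithms.community import kernighan_lin_bisection

# Placeholder graph standing in for the application's input graph.
G = nx.karate_club_graph()

# One Kernighan-Lin bisection; recurse on each half for k-way partitioning.
left, right = kernighan_lin_bisection(G, seed=42)
cut_edges = sum(1 for u, v in G.edges if (u in left) != (v in left))
print(len(left), len(right), "cut edges:", cut_edges)
```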
