How to normalize benchmark results to obtain distribution of ratios correctly?

To give a bit of context: I am measuring the performance of virtual machines (VMs), or systems software in general, and usually want to compare different optimizations for a performance problem. Performance is measured as absolute runtime for a number of benchmarks, and usually for a number of VM configurations varying in the number of CPU cores used, different benchmark parameters, etc. To get reliable results, each configuration is measured about 100 times. Thus, I end up with quite a number of measurements across all kinds of different parameters, and I am usually interested in the speedup for all of them, comparing the VM with and without a certain optimization.
What I currently do is pick one specific series of measurements. Let's say the measurements for a VM with and without an optimization (VM-norm/VM-opt) running benchmark A on 1 core.
Since I want to compare the results across different benchmarks and numbers of cores, I cannot use absolute runtime, but need to normalize it somehow. Thus, I pair up the 100 measurements for benchmark A on 1 core for VM-norm with the corresponding 100 measurements of VM-opt to calculate the VM-opt/VM-norm ratios.
When I do that, taking the measurements just in the order I got them, I obviously have quite a high variation in my 100 resulting VM-opt/VM-norm ratios. So I thought: let's assume the variation in my measurements comes from non-deterministic effects, and the same effects cause variation in the same way for VM-opt and VM-norm. So, naively, it should be OK to sort the measurements before pairing them up. And, as expected, that of course reduces the variation.
However, my half-knowledge tells me that this is not the best way and perhaps not even correct.
Since I am eventually interested in the distribution of those ratios, to visualize them with beanplots, a colleague suggested using the Cartesian product instead of pairing sorted measurements. That sounds like it would better account for the random nature of two arbitrary measurements being paired up for comparison. But I am still wondering what a statistician would suggest for such a problem.
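For concreteness, a small R sketch of the three pairing schemes I mean (the variable names and the placeholder numbers are just for illustration, not actual measurements):
set.seed(1)
vm_norm <- rnorm(100, mean = 10, sd = 0.5)  # placeholder runtimes, VM without optimization
vm_opt  <- rnorm(100, mean = 8,  sd = 0.5)  # placeholder runtimes, VM with optimization
paired_ratios    <- vm_opt / vm_norm                        # pair in measurement order
sorted_ratios    <- sort(vm_opt) / sort(vm_norm)            # pair after sorting
cartesian_ratios <- as.vector(outer(vm_opt, vm_norm, "/"))  # all 100 x 100 combinations
summary(paired_ratios); summary(sorted_ratios); summary(cartesian_ratios)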
In the end, I am really interested in plotting the distribution of the ratios with R as bean or violin plots. Simple boxplots, or just mean + stddev, tell me too little about what is going on. These distributions usually point at artifacts produced by the complex interactions in these much too complex computers, and that is what I am interested in.
Any pointers to approaches for how to work with and how to produce such ratios in a correct way are very welcome.
PS: This is a repost, the original was posted at https://stats.stackexchange.com/questions/15947/how-to-normalize-benchmark-results-to-obtain-distribution-of-ratios-correctly

I found it puzzling that you got such a minimal response on Cross Validated. This does not seem like a specific R question, but rather a request for how to design an analysis. Perhaps the audience there thought you were asking too broad a question; if that is the case, then the [R] forum is even worse, since we generally tackle problems where data is actually provided and deal with requests for implementing a construction in our language. I agree that violin plots are preferable to boxplots for examining distributions (when there is sufficient data, and I am not sure that 100 samples per group makes the grade in that instance), but in any case that means the "R answer" is that you just need to refer to the proper R help pages:
library(lattice)
?xyplot
?panel.violin
Further comments would require more details and preferably some data examples constructed in R. You may want to refer to the page where "great question design is outlined".
One further graphical method: if you are interested in the ratios of two paired variates but do not want to "commit" to just x/y, then you can examine them by plotting one against the other and overlaying iso-ratio lines with repeated calls to abline(a=0, b= ). I think 100 samples is pretty "thin" for doing density estimates, but there are 2d density methods if you can gather more data.
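As a minimal sketch (the variable names and placeholder data are assumptions, not your measurements), a lattice violin plot of the ratios plus a base-graphics scatter with iso-ratio lines might look like this:
library(lattice)
set.seed(1)
vm_norm <- rnorm(100, mean = 10, sd = 0.5)  # placeholder runtimes, VM without optimization
vm_opt  <- rnorm(100, mean = 8,  sd = 0.5)  # placeholder runtimes, VM with optimization
ratios <- data.frame(ratio = vm_opt / vm_norm, config = "benchmark A, 1 core")
# violin plot of the ratio distribution (panel.violin drawn under a narrow bwplot)
bwplot(ratio ~ config, data = ratios,
       panel = function(..., box.ratio) {
           panel.violin(..., col = "lightblue", box.ratio = box.ratio)
           panel.bwplot(..., box.ratio = 0.1)
       })
# scatter of the raw pairs with iso-ratio reference lines
plot(vm_norm, vm_opt, xlab = "VM-norm runtime", ylab = "VM-opt runtime")
for (r in c(0.5, 0.8, 1, 1.25)) abline(a = 0, b = r, lty = 2)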

Related

How do I numerically compare the Dymos solution to the simulated solution?

I want to conduct a convergence study for my Dymos optimization results where I vary the number of nodes and compare the simulated solution to the optimization solution. From what I understand, Dymos fits polynomials to the system dynamics to represent the timeseries solution. What is the best way to compare the polynomial trajectory of the optimization solution to the trajectory of the simulated solution? I specifically want to evaluate the difference between the two trajectories away from the collocation/control nodes... to show that the polynomial fitting actually represents the simulated solution. How would I access the polynomial fitting data?
Thanks in advance.
For some of the testing we have an assert_timeseries_near_equal function that treats the denser time series as the truth and tests that the less dense timeseries (usually the discrete solution) is reasonably close to it.
We're actually working on making this method a bit more explicit right now, so that it's a little easier for users to apply in general cases, such as comparing discrete solutions from two different cases.
In general, there are a few different ways you can test your results against an explicit integration. You could just verify that the final states of the two solutions are reasonably close. Since the error tends to increase over the course of the trajectory, this is often good enough for a quick check. The downside of this approach is that it doesn't test that both solutions took the same path to the final condition.
To test the solution away from the nodes I'd recommend the following: add a second timeseries output to the relevant phase that contains more segments or higher-order segments. This timeseries will have more nodes, and Dymos will interpolate from the solution's collocation grid onto this denser timeseries output grid. When comparing this against the explicit simulation, the times, controls, and parameters should still match exactly, and you'll better capture the interpolating state polynomials versus the explicitly simulated results.
There are other statistical methods out there for comparing timeseries that you can bring to bear, but visualizing the explicit trajectory plus some error bound as a "tube" into which we want to fit the discrete solution is usually how I handle it.

Suggested Neural Network for small, highly varying dataset?

I am currently working with a small dataset of training values, no more than 20, and am getting large MSE. The input data vectors themselves consist of 16 parameters, many of which are binary variables. Across all the training values, a majority of the 16 parameters stay the same (but not all). The remaining input variables, across all the exemplars, vary a lot amongst one another. This is to say, two exemplars might appear to be the same except for two parameters in which they differ, one parameter being a binary variable, and another being a continuous variable, where the difference could be greater than a single standard deviation (for that variable's set of values).
My single output variable (as of now) can either be a continuous variable, or, depending on the true difficulty of reducing the error in my situation, I can make this a classification problem instead with 12 different classes.
I have been researching neural networks other than my current implementation of a feed-forward MLP, and have read about stochastic NNs, ladder NNs, and many forms of recurrent NNs. I am stuck on which one I should investigate, as I do not have time to try every NN available.
While my description may be vague, could anyone make a suggestion as to which network I should investigate to minimize my cost function (as of now, MSE) the most?
If my current setup is rendered impracticable by the sheer difficulty of predicting correct outputs for such a small set of highly variant training values, which network would work best should my dataset be expanded to the order of thousands of exemplars (at the cost of having a significantly more redundant, seemingly homogeneous set of input values)?
Any help is most certainly appreciated.
20 samples is very small, especially if you have 16 input variables. It will be hard to determine which of those inputs is responsible for your output value. If you keep your network simple (fewer layers), you may be able to get by with roughly as many samples as you would need for traditional regression.
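As a point of comparison, here is a sketch of a penalized-regression baseline on 20 samples and 16 predictors; the made-up data, the glmnet package, and the variable names are my choices, not from the question:
library(glmnet)
set.seed(1)
n <- 20; p <- 16
X <- matrix(rnorm(n * p), n, p)      # placeholder inputs (the real data mixes binary and continuous)
y <- X[, 1] - 2 * X[, 2] + rnorm(n)  # placeholder continuous output
fit <- cv.glmnet(X, y, alpha = 0, nfolds = 5)    # ridge regression, lambda chosen by cross-validation
mean((predict(fit, X, s = "lambda.min") - y)^2)  # in-sample MSE as a simple baseline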

Cluster your time-series data

I have time-series data for 12 consumers, named a through l.
I want to cluster these consumers so that I can tell which of them have the most similar consumption behavior. Accordingly, I found the clustering method pamk, which automatically determines the number of clusters in the input data.
I assume that I have only two options for calculating the distance between any two time series, i.e., Euclidean and DTW. I tried both of them and I get different clusters. Now the question is: which one should I rely upon, and why?
Using Euclidean distance I get one set of clusters, and using DTW distance I get a different one (cluster plots omitted).
Conclusion: how would you decide which clustering approach is best in this case?
Note: I have asked the same question on Cross-Validated also.
None of the time series above look similar to me. Do you see any pattern? Maybe there is no pattern?
The clustering visualizations also indicate that there are no clusters. b and l appear to be the most unusual outliers, followed by d, e, and h; but there are no clusters there.
Also try hierarchical clustering. The dendrogram may be more understandable.
But either way, there may be no clusters. You need to be prepared for this outcome and consider it a valid hypothesis. Double-check any result. As you have seen, pam will always return a result, and you have absolutely no means to decide which result is more "correct" than the other (most likely, neither is correct, and you should rely on neither, to answer your question).
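For reference, a minimal sketch of both distance choices plus hierarchical clustering; the matrix ts_mat, the placeholder data, and the package choices are my assumptions:
library(fpc)    # pamk
library(dtw)    # registers the "DTW" distance with the proxy package
set.seed(1)
ts_mat <- matrix(rnorm(12 * 24), nrow = 12,
                 dimnames = list(letters[1:12], NULL))  # one row per consumer (placeholder)
d_euc <- dist(ts_mat)                         # Euclidean distance matrix
d_dtw <- proxy::dist(ts_mat, method = "DTW")  # DTW distance matrix
pamk(d_euc)$nc                                # suggested number of clusters under each metric
pamk(d_dtw)$nc
plot(hclust(d_euc), main = "Euclidean")       # dendrograms make "no real clusters" easier to see
plot(hclust(d_dtw), main = "DTW")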

How do you conduct a power analysis for logistic regression in R?

I'm familiar with G*Power as a tool for power analyses, but have yet to find a resource on the internet describing how to compute a power analysis for logistic regression in R. The pwr package doesn't list logistic regression as an option.
You will very likely need to "roll your own".
Specify your hypothesized relationship between predictors and outcome.
Specify what values of your predictors you are likely to observe in your study. Will they be correlated?
Specify the effect size you would like to detect, e.g., odds ratios corresponding to two specific settings of your predictors.
Specify a power level, e.g., 1 - beta = 0.80.
For different sample sizes n:
Simulate predictors as specified
Simulate outcomes
Run your analysis
Record whether you detect a statistically significant effect
Do these steps many times, on the order of 1000 or more. Count how often you detected an effect. If you detected an effect more than (e.g.) 80% of the time, you are overpowered: reduce n and start over. If you detected an effect less than 80% of the time, you are underpowered: increase n and start over. Rinse and repeat until you have a good n (a minimal simulation sketch is shown below).
And then think some more about whether all your assumptions really make sense. Vary them a bit. Is the resulting value of n sensitive to your assumptions?
Yes, this will be quite a bit of work. But it will be worth it. On the one hand, it will keep you from running an over- or underpowered study. On the other hand, as I wrote, this will force you to think deeply about your assumptions, and this is the path to enlightenment. (Which is a painful path to travel. Sorry.)
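As a minimal sketch of that simulation loop, assuming one continuous predictor, a hypothesized odds ratio of 1.5 per SD of the predictor, and alpha = 0.05 (all of these numbers are illustrative, not recommendations):
power_logistic <- function(n, or = 1.5, nsim = 1000, alpha = 0.05) {
  beta1 <- log(or)
  hits <- replicate(nsim, {
    x <- rnorm(n)                                    # simulate predictor as specified
    p <- plogis(-0.5 + beta1 * x)                    # hypothesized predictor-outcome relationship
    y <- rbinom(n, 1, p)                             # simulate outcomes
    fit <- glm(y ~ x, family = binomial)             # run the analysis
    summary(fit)$coefficients["x", "Pr(>|z|)"] < alpha   # record whether the effect was detected
  })
  mean(hits)                                         # estimated power at this n
}
sapply(c(100, 200, 400), power_logistic)             # increase n until estimated power >= 0.80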
If you don't get any better answers specifically helping you to do this in R, you may want to look to CrossValidated for more help. Good luck!
This question and its answers on Cross Validated discuss power for logistic regression and include R code as well as additional discussion and links for more information.

When and why is crossover beneficial in differential evolution?

I implemented a differential evolution algorithm for a side project I was doing. Because the crossover step seemed to involve a lot of parameter choices (e.g. crossover probabilities), I decided to skip it and just use mutation. The method seemed to work ok, but I am unsure whether I would get better performance if I introduced crossover.
Main Question: What is the motivation behind introducing crossover to differential evolution? Can you provide a toy example where introducing crossover out-performs pure mutation?
My intuition is that crossover will produce something like the following in two dimensions. Say we have two parent vectors (red); uniform crossover could produce a new trial vector at one of the blue points.
I am not sure why this kind of exploration would be expected to be beneficial. In fact, it seems like this could make performance worse if high-fitness solutions follow some linear trend. In the figure below, let's say the red points are the current population, and the optimal solution is towards the lower right corner. The population is traveling down a valley such that the upper right and lower left corners produce bad solutions. The upper left corner produces "okay" but suboptimal solutions. Notice how uniform crossover produces trials (in blue) that are orthogonal to the direction of improvement. I've used a crossover probability of 1 and neglected mutation to illustrate my point (see code). I imagine this situation could arise quite frequently in optimization problems, but I could be misunderstanding something.
Note: In the above example, I am implicitly assuming that the population was randomly initialized (uniformly) across this space, and has begun to converge to the correct solution down the central valley (top left to bottom right).
This toy example is convex, and thus differential evolution wouldn't even be the appropriate technique. However, if this motif was embedded in a multi-modal fitness landscape, it seems like crossover might be detrimental. While crossover does support exploration, which could be beneficial, I am not sure why one would choose to explore in this particular direction.
R code for the example above:
N <- 50
# Current population: points scattered along a descending valley
x1 <- rnorm(N, mean = 2, sd = 0.5)
x2 <- -x1 + 4 + rnorm(N, mean = 0, sd = 0.1)
plot(x1, x2, pch = 21, col = 'red', bg = 'red', ylim = c(0, 4), xlim = c(0, 4))
# Uniform crossover with probability 1 (mutation neglected):
# each trial keeps its own x1 but takes x2 from a random population member
x1_cx <- numeric(N)
x2_cx <- numeric(N)
for (i in 1:N) {
  x1_cx[i] <- x1[i]
  x2_cx[i] <- x2[sample(1:N, 1)]
}
points(x1_cx, x2_cx, pch = 4, col = 'blue', lwd = 4)
Follow-up Question: If crossover is beneficial in certain situations, is there a sensible approach to a) determining if your specific problem would benefit from crossover, and b) how to tune the crossover parameters to optimize the algorithm?
A related stackoverflow question (I am looking for something more specific, with a toy example for instance): what is the importance of crossing over in Differential Evolution Algorithm?
A similar question, but not specific to differential evolution: Efficiency of crossover in genetic algorithms
I am not particularly familiar with the specifics of the DE algorithm, but in general the point of crossover is that if you have two very different individuals with high fitness, it will produce an offspring that is intermediate between them without being particularly similar to either. Mutation only explores the local neighbourhood of each individual without taking the rest of the population into account. If you think of genomes as points in some high-dimensional vector space, then a mutation is a shift in a random direction. Therefore mutation needs to take small steps, since if you are starting from a significantly better-than-random position, a long step in a random direction is almost certain to make things worse, because it is essentially just introducing entropy into an evolved genome. You can think of a crossover as a step from one parent towards the other. Since the other parent is also better than random, it is more promising to take a longer step in that direction. This allows for faster exploration of the promising parts of the fitness landscape.
In real biological organisms the genome is often organized such that genes that depend on each other are close together on the same chromosome. This means that crossover is unlikely to break synergistic gene combinations. Real evolution actually moves genes around to achieve this (though this is much slower than the evolution of individual genes), and sometimes the higher-order structure of the genome (the three-dimensional shape of the DNA) evolves to prevent crossovers in particularly sensitive areas. These mechanisms are rarely modeled in evolutionary algorithms, but you will get more out of crossover if you order your genome in a way that puts genes that are likely to interact close to each other.
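For concreteness, here is a minimal sketch of one generation of classic DE/rand/1/bin, showing how binomial crossover mixes each target vector with a mutant built from other population members; the objective function and the F and CR values are illustrative assumptions:
de_step <- function(pop, f_obj, F = 0.8, CR = 0.9) {
  N <- nrow(pop); D <- ncol(pop)
  new_pop <- pop
  for (i in 1:N) {
    idx <- sample(setdiff(1:N, i), 3)                  # three distinct partners
    mutant <- pop[idx[1], ] + F * (pop[idx[2], ] - pop[idx[3], ])
    jrand  <- sample(1:D, 1)                           # ensure at least one gene comes from the mutant
    cross  <- runif(D) < CR | seq_len(D) == jrand      # binomial crossover mask
    trial  <- ifelse(cross, mutant, pop[i, ])          # mix mutant and target component-wise
    if (f_obj(trial) <= f_obj(pop[i, ]))               # greedy selection
      new_pop[i, ] <- trial
  }
  new_pop
}
# Example: minimize the sphere function; CR near 1 keeps mostly the mutant,
# CR near 0 keeps mostly the parent, so varying CR shows the effect of crossover
sphere <- function(x) sum(x^2)
pop <- matrix(runif(20 * 5, -5, 5), nrow = 20)
for (g in 1:100) pop <- de_step(pop, sphere, CR = 0.9)
min(apply(pop, 1, sphere))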
No. Crossover is not useful. There I said it. :P
I've never found a need for crossover. People seem to think it does some kind of magic. But it doesn't (and can't) do anything more useful than simple mutation. Large mutations can be used to explore the entire problem space and small mutations can be used to exploit niches.
And all the explanations I've read are (to put it mildly) unsatisfactory. Crossover only complicates your algorithms. Drop it asap. Your life will be simpler. .... IMHO.
As Daniel says, crossover is a way to take larger steps across the problem landscape, allowing you to escape local maxima that a single mutation would be unable to escape.
Whether it is appropriate or not will depend on the complexity of the problem space, how the genotype -> phenotype expression works (will related genes be close together), etc.
More formally, this is the concept of 'connectivity' in local search algorithms: providing strong enough operators that the local search neighbourhood is sufficiently large to escape local minima.
