Finding the Formula for a Curve - math

Is there a program that will take "response curve" values from me, and provide a formula that approximates the response curve?
It would be cool if such a program would take a numeric "percent correct" (perhaps with a standard deviation) so that it returns simplified formulas when laxity is permissible, and more precise (viz. complex) formulas when the curve needs to be approximated closely.
My interest is to play with the response curve values and "laxity" factor, until such a tool spits out a curve-fit formula simple enough that I know it will be high performance during machine computations.

Check out Eureqa, a free (as in beer) utility from Cornell University.
What's particularly interesting about Eureqa is that it uses genetic algorithms to fit the input curve you specify, and you can say what functions to allow or not in the fit. So if you wanted to stay away from sine and cosine, for instance, it wouldn't even consider those. It will also show you the best approximation with the fewest steps, and the most accurate approximation (regardless of steps). You can also run the fitting tool across multiple networked computers to speed up getting your results.
It's a very interesting tool -- check out their how-to videos.

MATLAB, Mathematica, Octave, Maple, NumPy, and Scilab are just six among thousands of programs that will do this.

SigmaPlot - does exactly what you're looking for. Statistics and visualization of data.
(source: sigmaplot.com)

Related

Are series representations of functions ever practically used to graph in computer science?

As you probably know, functions can be represented as an infinite series. For example, f(x) = cos x can be represented by its power series. My question is whether this is ever used practically in programming for any type of application. I know it can be used; I was just wondering if it actually is for serious projects.
Aside from infinite series, there are other representations for functions which can be useful for computing approximations. Asymptotic series, identities involving other "elementary" functions, and interpolation in a table of values are all used in different contexts. Take a look at Abramowitz & Stegun "Handbook of Mathematical Functions" to get an idea of the variety of possibilities. Also look for the source code for popular libraries or systems such as R, Numpy, Scipy, or Octave to see what approaches have been used by the authors of that software.
Specifically about series approximations for trigonometric functions, I think that might be a reasonable thing to do, but only if the range of the argument is reduced (via identities) so that it is as small as possible.
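As a rough illustration of the point about range reduction, here is a minimal Python sketch that first folds the argument into [0, pi/2] via identities and only then sums a truncated series. The function name cos_taylor and the default of 10 terms are assumptions made for this example; production math libraries generally use more refined approximations than a plain truncated series.

```python
import math

def cos_taylor(x, terms=10):
    """Approximate cos(x) with a truncated Maclaurin series after range reduction."""
    # Periodicity: cos(x) = cos(x - 2*pi*k), so fold x into [-pi, pi].
    x = math.remainder(x, 2.0 * math.pi)
    # Evenness and cos(x) = -cos(pi - x) shrink the range further to [0, pi/2].
    x = abs(x)
    sign = 1.0
    if x > math.pi / 2:
        x = math.pi - x
        sign = -1.0
    # Sum (-1)^n * x^(2n) / (2n)!, deriving each term from the previous one.
    term = 1.0
    total = 1.0
    x2 = x * x
    for n in range(1, terms):
        term *= -x2 / ((2 * n - 1) * (2 * n))
        total += term
    return sign * total
```

Because the reduced argument never exceeds pi/2, the terms shrink quickly and a handful of them is enough; without the reduction, a large argument would need many more terms and would lose accuracy to cancellation.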
Approximation of functions is a great topic; good luck and have fun.

Maximal Information Coefficient vs Hierarchical Agglomerative Clustering

What is the difference between the Maximal Information Coefficient and Hierarchical Agglomerative Clustering in identifying functional and non-functional dependencies?
Which of them can identify duplicates better?
This question doesn't make a lot of sense, sorry.
The MIC and HAC have close to zero in common.
The MIC is a crippled form of "correlation" with a very crude heuristic search and plenty of promotional videos and news announcements, and it received some pretty harsh reviews from statisticians. You can file it in the category "if it had been submitted to an appropriate journal, it would have been rejected as-is; better expert reviewers would have demanded major changes". (Science is quite unspecific and overrated, and probably shouldn't publish such topics at all, or should at least get better reviewers from the subject domains. It's not the first Science article of this quality.) See, e.g.,
Noah Simon and Robert Tibshirani, Comment on “Detecting Novel Associations in Large Data Sets” by Reshef et al., Science Dec. 16, 2011
"As one can see from the Figure, MIC has lower power than dcor, in every case except the somewhat pathological high-frequency sine wave. MIC is sometimes less powerful than Pearson correlation as well, the linear case being particularly worrisome."
And "tibs" is a highly respected author. And this is just one of many surprised that such things get accepted in such a high reputation journal. IIRC, the MIC authors even failed to compare to "ancient" alternatives such as Spearman, to modern alternatives like dCor, or to properly conduct a test of statistical power of their method.
MIC works much worse than advertised when studied with statistical scrutiny:
Gorfine, M., Heller, R., & Heller, Y. (2012). Comment on "detecting novel associations in large data sets"
"under the majority of the noisy functionals and non-functional settings, the HHG and dCor tests hold very large power advantages over the MIC test, under practical sample sizes; "
As a matter of fact, MIC gives wildly inappropriate results on some trivial data sets such as a checkerboard uniform distribution ▄▀, which it considers maximally correlated (as correlated as y=x), by design. Their grid-based design is overfitted to the rather special scenario with the sine curve. It has some interesting properties, but these are IMHO captured better by earlier approaches such as Spearman and dCor.
The failure by the MIC authors to compare to Spearman is IMHO a severe omission, because their own method is also purely rank-based if I recall correctly. Spearman is Pearson-on-ranks, yet they compare only to Pearson. The favorite example of MIC (another questionable choice) is the sine wave (which after rank transformation is actually just a zigzag curve, not a sine anymore). I consider this to be "cheating" to make Pearson look bad, by not using the rank transformation with Pearson, too. Good reviewers would have demanded such a comparison.
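To make "Spearman is Pearson-on-ranks" concrete, here is a small Python sketch. The helper names pearson, ranks and spearman are made up for this illustration, and ties are given their average rank, which is the usual convention but an assumption here.

```python
import math

def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson applied to the ranks of the data."""
    return pearson(ranks(xs), ranks(ys))
```

For a monotone but nonlinear relationship such as y = x^3, spearman reports a perfect correlation while pearson on the raw values does not, which is why comparing a rank-based method only against raw Pearson is lopsided.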
Now all of these complaints are essentially unrelated to HAC. HAC is not trying to define any form of "correlation", but it can be used with any distance or similarity (including correlation similarity).
HAC is something completely different: a clustering algorithm. It analyzes a larger number of rows, not two (!) columns.
You could even combine them: if you compute the MIC for every pair of variables (but I'd rather use Pearson correlation, Spearman correlation, or distance correlation dCor instead), you can use HAC to cluster variables.
For finding actual duplicates, neither is a good choice. Just sort your data, and duplicates will follow each other. (Or, if you sort columns, next to each other).
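The sort-then-scan idea can be sketched in a few lines of Python. The function name find_duplicates is an assumption for this example, and the rows are assumed to be mutually comparable (for instance tuples of numbers).

```python
def find_duplicates(rows):
    """Return each value that occurs more than once, found by sorting."""
    ordered = sorted(rows)
    duplicates = []
    # After sorting, equal rows sit next to each other.
    for previous, current in zip(ordered, ordered[1:]):
        if current == previous and (not duplicates or duplicates[-1] != current):
            duplicates.append(current)
    return duplicates
```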

How can I do blind fitting on a list of x, y value pairs if I don't know the form of f(x) = y?

If I have a function f(x) = y that I don't know the form of, and if I have a long list of x and y value pairs (potentially thousands of them), is there a program/package/library that will generate potential forms of f(x)?
Obviously there's a lot of ambiguity to the possible forms of any f(x), so something that produces many non-trivial unique answers (in reduced terms) would be ideal, but something that could produce at least one answer would also be good.
If x and y are derived from observational data (i.e. experimental results), are there programs that can create approximate forms of f(x)? On the other hand, if you know beforehand that there is a completely deterministic relationship between x and y (as in the input and output of a pseudo random number generator) are there programs that can create exact forms of f(x)?
Soooo, I found the answer to my own question. Cornell has released a piece of software for doing exactly this kind of blind fitting called Eureqa. It has to be one of the most polished pieces of software that I've ever seen come out of an academic lab. It's seriously pretty nifty. Check it out:
It's even got turnkey integration with Amazon's EC2 clusters, so you can offload some of the heavy computational lifting from your local computer onto the cloud at the push of a button for a very reasonable fee.
I think that I'm going to have to learn more about GUI programming so that I can steal its interface.
(This is more of a numerical methods question.) If there is some kind of observable pattern (you can kinda see the function), then yes, there are several ways you can approximate the original function, but they'll be just that, approximations.
What you want to do is called interpolation. Two very simple (and not very good) methods are Newton's method and Lagrange's method of interpolation. They both work on the same principle but they are implemented differently (Lagrange's is iterative, Newton's is recursive, for one).
If there's not much going on between any two of your data points (ie, the actual function doesn't have any "bumps" whose "peaks" are not represented by one of your data points), then the spline method of interpolation is one of the best choices you can make. It's a bit harder to implement, but it produces nice results.
Edit: Sometimes, depending on your specific problem, these methods above might be overkill. Sometimes, you'll find that linear interpolation (where you just connect points with straight lines) is a perfectly good solution to your problem.
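Linear interpolation over a table of points is short enough to sketch directly. In this Python example the name lerp_table is an assumption, and the x values are assumed to be sorted and strictly increasing.

```python
from bisect import bisect_right

def lerp_table(xs, ys, x):
    """Piecewise-linear interpolation; xs must be strictly increasing."""
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x is outside the tabulated range")
    # Locate the segment [xs[i], xs[i + 1]] that contains x.
    i = min(bisect_right(xs, x) - 1, len(xs) - 2)
    t = (x - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + t * (ys[i + 1] - ys[i])
```

The binary search keeps each lookup at O(log n), so this stays cheap even with thousands of data points.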
It depends.
If you're using data acquired from the real world, then statistical regression techniques can provide you with some tools to evaluate the best fit; if you have several hypotheses for the form of the function, you can use statistical regression to discover the "best" fit, though you may need to be careful about over-fitting a curve: sometimes the best fit (highest correlation) for a specific dataset completely fails to work for future observations.
If, on the other hand, the data was generated synthetically (say, you know it was generated by a polynomial), then you can use polynomial curve fitting methods that will give you the exact answer you need.
Yes, there are such things.
If you plot the values and see that there's some functional relationship that makes sense, you can use least squares fitting to calculate the parameter values that minimize the error.
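For the simplest case, where the plot suggests a straight line, the least-squares parameters have a closed form. The following Python sketch assumes the model y = slope * x + intercept; the name fit_line is made up for this example, and other model shapes need a more general solver.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and cross-deviations about the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```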
If you don't know what the function should look like, you can use simple spline or interpolation schemes.
You can also use software to guess what the function should be. Maybe something like Maxima can help.
Wolfram Alpha can help you guess:
http://blog.wolframalpha.com/2011/05/17/plotting-functions-and-graphs-in-wolframalpha/
Polynomial Interpolation is the way to go if you have a totally random set
http://en.wikipedia.org/wiki/Polynomial_interpolation
If your set is nearly linear, then regression will give you a good approximation.
Creating exact form from the X's and Y's is mostly impossible.
Notice that what you are trying to achieve is at the heart of many machine learning algorithms, and therefore you might find what you are looking for in some specialized libraries.
A list of x/y values N items long can always be generated by a polynomial of degree at most N-1 (assuming no x values are the same). See this article for more details:
http://en.wikipedia.org/wiki/Polynomial_interpolation
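A polynomial through N points with distinct x values (of degree at most N-1) can be evaluated directly with the Lagrange form, without solving for coefficients. This Python sketch uses exact rational arithmetic, which assumes integer or rational inputs; the name lagrange_eval is hypothetical.

```python
from fractions import Fraction

def lagrange_eval(points, x):
    """Evaluate the interpolating polynomial through `points` at x."""
    x = Fraction(x)
    total = Fraction(0)
    for i, (xi, yi) in enumerate(points):
        # Basis polynomial: equals 1 at xi and 0 at every other xj.
        basis = Fraction(1)
        for j, (xj, _) in enumerate(points):
            if j != i:
                basis *= (x - xj) / Fraction(xi - xj)
        total += yi * basis
    return total
```

With exact inputs this reproduces the generating polynomial exactly; with noisy measurements it will still pass through every point, which is usually a sign of over-fitting rather than a useful model.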
Some lists may also match other function types, such as exponential, sinusoidal, and many others. It is impossible to find the 'simplest' matching function, but the best you can do is go through a list of common ones like exponential, sinusoidal, etc. and if none of them match, interpolate the polynomial.
I'm not aware of any software that can do this for you, though.

Calculus, How can you find an equation from a series of numbers?

I'm analyzing financial data and would like to find the inflection points of a line. I know I can do this using derivatives, but first I need an equation. Is there a way to generate an equation based off of a series of numbers? I would need to do this programmatically.
Spline interpolation is probably more useful for you than polynomial interpolation: if you fit a polynomial, it must inevitably head off to +/- infinity outside your data range.
You will also want a method which allows a slightly loose fit: financial data is often a bit noisy which can result in very weird curves if you try to fit it exactly.
There are established procedures for turning a set of existing data points into a polynomial; this is called Polynomial Interpolation. This article in Wikipedia: http://en.wikipedia.org/wiki/Polynomial_interpolation
explains it mathematically. You can probably Google for algorithms easily enough.
Given enough points, your polynomial tracks the original, unknown function reasonably well, so the polynomial's inflection points should roughly coincide with the peaks and troughs of your data.
On the other hand, we all know there's not really a function behind financial data. So if I were you I'd scan along those points and find every point that has a smaller value to either side of it, and declare that a high; and vice versa for lows. Force-fitting this data into a fictitious function isn't going to make it any more useful.
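The scan described here needs no curve fitting at all. A minimal Python sketch (the name local_extrema is an assumption, and flat runs of equal values are deliberately not reported as highs or lows):

```python
def local_extrema(values):
    """Return (highs, lows): indices of strict local maxima and minima."""
    highs, lows = [], []
    for i in range(1, len(values) - 1):
        left, mid, right = values[i - 1], values[i], values[i + 1]
        if mid > left and mid > right:
            highs.append(i)
        elif mid < left and mid < right:
            lows.append(i)
    return highs, lows
```

On noisy data this flags every small wiggle, so smoothing the series first may be needed in practice.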
Update: Tom Smith advises that spline interpolation is to be preferred to polynomial interpolation for this kind of thing, and Wikipedia bears him out. Or rather, it's bullish on his answer.
What you are thinking of is analytical calculus; with discrete data (e.g. points), you have to do it numerically. Now, a line usually doesn't have inflection points, so I guess you're thinking of a curve. You can either interpolate some kind of curve through the points and then calculate the first derivative (also numerically, but at a larger number of points), or you can calculate the first derivative directly from the points you have. Which is better depends on how many points you actually have.
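Working directly from the points can be sketched with finite differences. This Python example assumes evenly spaced samples with spacing h; all three function names are made up for the illustration. A sign change in the second difference is one rough indicator of an inflection point.

```python
def central_differences(ys, h=1.0):
    """First-derivative estimates at the interior sample points."""
    return [(ys[i + 1] - ys[i - 1]) / (2.0 * h) for i in range(1, len(ys) - 1)]

def second_differences(ys, h=1.0):
    """Second-derivative estimates at the interior sample points."""
    return [(ys[i + 1] - 2.0 * ys[i] + ys[i - 1]) / (h * h)
            for i in range(1, len(ys) - 1)]

def sign_changes(values):
    """Indices k where values[k] and values[k + 1] have opposite signs."""
    return [k for k in range(len(values) - 1) if values[k] * values[k + 1] < 0]
```

Note that entry k of either derivative list belongs to sample k + 1 of the original data, since the endpoints have no two-sided neighbours.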
But really, this is just theory since we don't know the nature of data, or the language or anything.
For more on the subject search: numerical analysis on wiki, and go from there.
I think curve fitting might help you in this case. Here is a discussion which might be handy.
cheers

Which particular software development tasks have you used math for? And which branch of math did you use?

I'm not looking for a general discussion on if math is important or not for programming.
Instead I'm looking for real world scenarios where you have actually used some branch of math to solve some particular problem during your career as a software developer.
In particular, I'm looking for concrete examples.
I frequently find myself using De Morgan's theorem, as well as general Boolean algebra, when trying to simplify conditionals.
I've also occasionally written out truth tables to verify changes, as in the example below (found during a recent code review)
(showAll and s.ShowToUser are both of type bool.)
// Before
(showAll ? (s.ShowToUser || s.ShowToUser == false) : s.ShowToUser)
// After!
showAll || s.ShowToUser
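The truth-table check for this simplification can also be automated. The following is a Python sketch, not the original C#-style code; before and after are hypothetical stand-ins for the two expressions above.

```python
from itertools import product

def before(show_all, show_to_user):
    # Mirrors: showAll ? (s.ShowToUser || s.ShowToUser == false) : s.ShowToUser
    return (show_to_user or show_to_user is False) if show_all else show_to_user

def after(show_all, show_to_user):
    # Mirrors: showAll || s.ShowToUser
    return show_all or show_to_user

# Enumerate all four input combinations and compare both versions.
rows = [(a, s, before(a, s), after(a, s))
        for a, s in product([False, True], repeat=2)]
for row in rows:
    print(row)
assert all(old == new for _, _, old, new in rows)
```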
I also used some basic right-angle trigonometry a few years ago when working on some simple graphics - I had to rotate and centre a text string along a line that could be at any angle.
Not revolutionary...but certainly maths.
Linear algebra for 3D rendering and also for financial tools.
Regression analysis for the same financial tools, like correlations between financial instruments and indices, and such.
Statistics, I had to write several methods to get statistical values, like the F Probability Distribution, the Pearson product moment coefficient, and some Linear Algebra correlations, interpolations and extrapolations for implementing the Arbitrage pricing theory for asset pricing and stocks.
Discrete math for everything, linear algebra for 3D, analysis for physics especially for calculating mass properties.
[Linear algebra for everything]
Projective geometry for camera calibration
Identification of time series / statistical filtering for sound & image processing
(I guess) basic mechanics and hence calculus for game programming
Computing sizes of caches to optimize performance. Not as simple as it sounds when this is your critical path, and you have to go back and work out the times saved by using the cache relative to its size.
I'm in medical imaging, and I use mostly linear algebra and basic geometry for anything related to 3D display, anatomical measurements, etc...
I also use numerical analysis for handling real-world noisy data, and a good deal of statistics to prove algorithms, design support tools for clinical trials, etc...
Games with trigonometry and AI with graph theory in my case.
Graph theory to create a weighted graph to represent all possible paths between two points and then find the shortest or most efficient path.
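One common way to find the shortest path in such a weighted graph is Dijkstra's algorithm; the answer does not say which algorithm was actually used, so this Python sketch is only an illustration. The adjacency-list layout and the name shortest_path are assumptions, and edge weights must be non-negative.

```python
import heapq

def shortest_path(graph, start, goal):
    """graph maps node -> list of (neighbor, weight); returns (cost, path) or None."""
    best = {start: 0}
    previous = {}
    queue = [(0, start)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == goal:
            break
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry; a shorter route was already found
        for neighbor, weight in graph.get(node, []):
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                previous[neighbor] = node
                heapq.heappush(queue, (candidate, neighbor))
    if goal not in best:
        return None
    # Walk the predecessor links back from the goal to rebuild the path.
    path = [goal]
    while path[-1] != start:
        path.append(previous[path[-1]])
    return best[goal], path[::-1]
```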
Also statistics for plotting graphs and risk calculations. I used both normal distribution and cumulative normal distribution calculations. Pretty commonly used functions in Excel, I would guess, but I actually had to write them myself since there is no built-in support in the .NET libraries. Sadly, the built-in math support in .NET seems pretty basic.
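For reference, both functions are short when an error function is available. This is a Python sketch rather than the .NET code the answer refers to; the names normal_pdf and normal_cdf are assumptions, and math.erf is Python's standard-library error function.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution with mean mu and std dev sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Cumulative normal distribution, expressed through the error function."""
    z = (x - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))
```

On a platform without an error function, the cumulative distribution would need a numerical approximation instead.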
I've used trigonometry the most and also a small amount of calculus, working on overlays for GIS (mapping) software, comparing objects in 3D space, and converting between coordinate systems.
A general mathematical understanding is very useful if you're using 3rd party libraries to do calculations for you, as you often need to appreciate their limitations.
I often use math and programming together, but the goal of my work IS the math, so I use software to achieve that.
As for the math I use: mostly calculus (FFTs, analysing continuous and discrete signals) with a splash of linear algebra (CORDIC) to do trig on an MCU with no floating point chip.
I used analytic geometry for a simple 3D engine in OpenGL, in a hobby project in high school.
I used some geometry computations for dynamically printed reports, where the layout was at an angle other than 90°.
A year ago I used some derivatives and integrals for store analysis (product item movement in store).
But all of these computations can be found on the internet or in a high-school textbook.
Statistics: mean and standard deviation, for our analysts.
Linear algebra - particularly gauss-jordan elimination and
Calculus - derivatives in the form of difference tables for generating polynomials from a table of (x, f(x))
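One way a difference table yields a polynomial from a table of (x, f(x)) is Newton's forward-difference formula. The answer gives no details, so this Python sketch is an assumption about the technique meant; it requires evenly spaced x values, and both function names are hypothetical.

```python
def difference_table(ys):
    """Rows of successive forward differences; row 0 is ys itself."""
    table = [list(ys)]
    while len(table[-1]) > 1:
        prev = table[-1]
        table.append([b - a for a, b in zip(prev, prev[1:])])
    return table

def newton_forward_eval(x0, h, ys, x):
    """Evaluate the forward-difference polynomial through samples spaced h apart."""
    s = (x - x0) / h
    total = 0.0
    coeff = 1.0  # generalized binomial coefficient C(s, k), built incrementally
    for k, row in enumerate(difference_table(ys)):
        total += coeff * row[0]
        coeff *= (s - k) / (k + 1)
    return total
```

When the samples come from a polynomial of degree d, the rows beyond the d-th are all zeros, which is also a quick way to spot the degree.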
Linear algebra and complex analysis in electronic engineering.
Statistics in analysing data and translating it into other units (different project).
I used probability and log odds (log of the ratio of two probabilities) to classify incoming emails into multiple categories. Most of the heavy lifting was done by my colleague Fidelis Assis.
Real world scenarios: better rostering of staff, more efficient scheduling of flights, shortest paths in road networks, optimal facility/resource locations.
Branch of maths: Operations Research. Vague definition: construct a mathematical model of a (normally complex) real world business problem, and then use mathematical tools (e.g. optimisation, statistics/probability, queuing theory, graph theory) to interrogate this model to aid in the making of effective decisions (e.g. minimise cost, maximise efficiency, predict outcomes etc).
Statistics for scientific data analyses such as:
calculation of distributions, z-standardisation
Fisher's Z
Reliability (Alpha, Kappa, Cohen)
Discriminant analyses
scale aggregation, poling, etc.
In actual software development I've only really used quite trivial linear algebra, geometry and trigonometry. Certainly nothing more advanced than the first college course in each subject.
I have however written lots of programs to solve really quite hard math problems, using some very advanced math. But I wouldn't call any of that software development since I wasn't actually developing software. By that I mean that the end result wasn't the program itself, it was an answer. Basically someone would ask me what is essentially a math question and I'd write a program that answered that question. Sure I’d keep the code around for when I get asked the question again, and sometimes I’d send the code to someone so that they could answer the question themselves, but that still doesn’t count as software development in my mind. Occasionally someone would take that code and re-implement it in an application, but then they're the ones doing the software development and I'm the one doing the math.
(Hopefully this new job I’ve started will actually let me do both, so we’ll see how that works out)

Resources