Approximating two different curves in R - r

I have two different density plots in R- one of them is the observed data (x1), and the other is randomly generated data from a Poisson distribution with the observed mean (x2). I would like to approximate the curves, i.e. make the expected curve look more like the observed data as it is over and under-estimated in certain areas. How do I go about doing this? I know you can get the absolute value between the curves by using
abs (x1 - x2)
However I'm not too sure how to proceed. Anybody have any ideas?

I think if you want to find an analytical solution, you might just have to play with the functions for a while. Otherwise, it seems that you could use calculus of variations to do this. That is, you take the difference between the area under both of your functions, and then minimize that (take the derivative). Formally, you need to take the second derivative to find if it's a max, min, or inflection point. However, you don't need to in this case if the function fits the data. I'm not sure what the best program would be for finding an analytical solution, but maybe that will put you on the right track. Just an idea to bounce around

Related

Find the radius of a cluster, given that its center is the average of the centers of two other clusters

I do not know if it is possible to find it, but I am using Kmeans clustering with Mahout, and I am stuck to the following.
In my implementation, I create with two different threads the following clusters:
CL-1{n=4 c=[1.75] r=[0.82916]}
CL-1{n=2 c=[4.5] r=[0.5]}
So, I would like to finally combine these two clusters into one final cluster.
In my code, I manage to find that for the final cluster the total points are n=6, the new average of the centers is c=2.666 but I am not able to find the final combined radius.
I know that the radius is the Population Standard Deviation, and I can calculate it if I previously know each point that belongs to the cluster.
However, in my case I do not have previous knowledge of the points, so I need the "average" of the 2 radius I mentioned before, in order to finally have this: CL-1{n=6 c=[2.666] r=[???]}.
Any ideas?
Thanks for your help.
It's not hard. Remember how the "radius" (not a very good name) is computed.
It's probably the standard deviation; so if you square this value and multiply it by the number of objects, you get the sum of squares. You can aggregate the sum of squares, and then reverse this process to get a standard deviation again. It's pretty basic statistic knowledge; you want to compute the weighted quadratic mean, just like you computed the weighted arithmetic mean for the center.
However, since your data is 1 dimensional, I'm pretty sure it will fit into main memory. As long as your data fits into memory, stay away from Mahout. It's slooooow. Use something like ELKI instead, or SciPy, or R. Run benchmarks. Mahout will perform several orders of magnitude slower than all the others. You won't need all of this Canopy-thing then either.

Fitting a binormal distribution in R

As from title, I have some data that is roughly binormally distributed and I would like to find its two underlying components.
I am fitting to the data distribution the sum of two normal with means m1 and m2 and standard deviations s1 and s2. The two gaussians are scaled by a weight factor such that w1+w2 = 1
I can succeed to do this using the vglm function of the VGAM package such as:
fitRes <- vglm(mydata ~ 1, mix2normal1(equalsd=FALSE),
iphi=w, imu=m1, imu2=m2, isd1=s1, isd2=s2))
This is painfully slow and it can take several minutes depending on the data, but I can live with that.
Now I would like to see how the distribution of my data changes over time, so essentially I break up my data in a few (30-50) blocks and repeat the fit process for each of those.
So, here are the questions:
1) how do I speed up the fit process? I tried to use nls or mle that look much faster but mostly failed to get good fit (but succeeded in getting all the possible errors these function could throw on me). Also is not clear to me how to impose limits with those functions (w in [0;1] and w1+w2=1)
2) how do I automagically choose some good starting parameters (I know this is a $1 million question but you'll never know, maybe someone has the answer)? Right now I have a little interface that allow me to choose the parameters and visually see what the initial distribution would look like which is very cool, but I would like to do it automatically for this task.
I thought of relying on the x corresponding to the 3rd and 4th quartiles of the y as starting parameters for the two mean? Do you thing that would be a reasonable thing to do?
First things first:
did you try to search for fit mixture model on RSeek.org?
did you look at the Cluster Analysis + Finite Mixture Modeling Task View?
There has been a lot of research into mixture models so you may find something.

Looking for interesting formula

I'm creating a game where players can make an alloy. To make it less predictable and more interesting, I thought that the durability and hardness of an alloy should not be calculated by a simple formula, because it will be extremely easy to find extrema, where alloy have best statistics.
So the questions is, is there any formula for a function where extrema can be found only by investigating all points? Input values will be in percents: 0.0%-100.0%. I think it should look like this: half sound wave
A very simple way would be a couple of sin function, just vary the constants and the sign for each new player. Here is one example (sin(1.1*x) + sin(x) + sin(0.9 *x))^2
If you use this between 10pi and 20pi you have an by average increasing function with local minima.
Modulating a simple linear or exponential function with trigonometric functions whose frequency and amplitude are dependent on the input should get you what you want.
You don't need a formula, I think — throw a bunch of random values around your domain, and then interpolate (linear interpolation will do) between them. Then you can even change the "formula" completely each time the game is run, or once in a while, or change it slowly with time, etc, etc.
If you want something that is very hard to predict then I would suggest involving a random number generator with the same seed every time. You can use it as an envelope for whatever function you come up with (trig functions or what not) to make it more jagged.
An interesting formula to use would be that of gamma of the Black-Scholes options pricing model. It goes as follows:
You can easily replace the variables, here's a graph of how the function looks:
alt text http://www.sqbimmer.com/aalex/gamma.png

Calculus, How can you find an equation from a series of numbers?

I'm analyzing financial data and would like to find the inflection points of a line. I know I can do this using derivatives, but first I need an equation. Is there a way to generate an equation based off of a series of numbers. I would need to do this programmaticly.
Spline interpolation is probably more useful for you than polynomial interpolation: if you fit a polynomial, it must inevitably head off to +/- infinity outside your data range.
You will also want a method which allows a slightly loose fit: financial data is often a bit noisy which can result in very weird curves if you try to fit it exactly.
There are established procedures for turning a set of existing data points into a polynomial; this is called Polynomial Interpolation. This article in Wikipedia: http://en.wikipedia.org/wiki/Polynomial_interpolation
explains it mathematically. You can probably Google for algorithms easily enough.
Given enough points, your polynomial tracks the original, unknown function reasonably well, so the polynomial's inflection points should roughly coincide with the peaks and troughs of your data.
On the other hand, we all know there's not really a function behind financial data. So if I were you I'd scan along those points and find every point that has a smaller value to either side of it, and declare that a high; and vice versa for lows. Force-fitting this data into a fictitious function isn't going to make it any more useful.
Update: Tom Smith advises that spline interpolation is to be preferred to polynomial interpolation for this kind of thing, and Wikipedia bears him out. Or rather, it's bullish on his answer.
What you are thinking is analytical calculus ... when having discrete data (e.g. points), you have to do it numerically. Now, a line usually doesn't have inflection points, so I guess you're thinking of a curve. You can either interpolate some kind of it through the points, then calculate the first derivative (also numerically, but for a larger number of points), or you can just calculate the first derivation from the points you have (which will be better depends on how many points you actually have).
But really, this is just theory since we don't know the nature of data, or the language or anything.
For more on the subject search: numerical analysis on wiki, and go from there.
I think curve fitting might help you in this case. Here is a discussion which might be handy.
cheers

Correct way to standardize/scale/normalize multiple variables following power law distribution for use in linear combination

I'd like to combine a few metrics of nodes in a social network graph into a single value for rank ordering the nodes:
in_degree + betweenness_centrality = informal_power_index
The problem is that in_degree and betweenness_centrality are measured on different scales, say 0-15 vs 0-35000 and follow a power law distribution (at least definitely not the normal distribution)
Is there a good way to rescale the variables so that one won't dominate the other in determining the informal_power_index?
Three obvious approaches are:
Standardizing the variables (subtract mean and divide by stddev). This seems it would squash the distribution too much, hiding the massive difference between a value in the long tail and one near the peak.
Re-scaling variables to the range [0,1] by subtracting min(variable) and dividing by max(variable). This seems closer to fixing the problem since it won't change the shape of the distribution, but maybe it won't really address the issue? In particular the means will be different.
Equalize the means by dividing each value by mean(variable). This won't address the difference in scales, but perhaps the mean values are more important for the comparison?
Any other ideas?
You seem to have a strong sense of the underlying distributions. A natural rescaling is to replace each variate with its probability. Or, if your model is incomplete, choose a transformation that approximately acheives that. Failing that, here's a related approach: If you have a lot of univariate data from which to build a histogram (of each variate), you could convert each to a 10 point scale based on whether it is in the 0-10% percentile or 10-20%-percentile ...90-100% percentile. These transformed variates have, by construction, a uniform distribution on 1,2,...,10, and you can combine them however you wish.
you could translate each to a percentage and then apply each to a known qunantity. Then use the sum of the new value.
((1 - (in_degee / 15) * 2000) + ((1 - (betweenness_centrality / 35000) * 2000) = ?
Very interesting question. Could something like this work:
Lets assume that we want to scale both the variables to a range of [-1,1]
Take the example of betweeness_centrality that has a range of 0-35000
Choose a large number in the order of the range of the variable. As an example lets choose 25,000
create 25,000 bins in the original range [0-35000] and 25,000 bins in the new range [-1,1]
For each number x-i find out the bin# it falls in the original bin. Let this be B-i
Find the range of B-i in the range [-1,1].
Use either the max/min of the range of B-i in [-1,1] as the scaled version of x-i.
This preserves the power law distribution while also scaling it down to [-1,1] and does not have the problem as experienced by (x-mean)/sd.
normalizing to [0,1] would be my short answer recommendation to combine the 2 values as it will maintain the distribution shape as you mentioned and should solve the problem of combining the values.
if the distribution of the 2 variables is different which sounds likely this won't really give you what i think your after, which is a combined measure of where each variable is within its given distribution. you would have to come up with a metric which determines where in the given distribution the value lies, this could be done many ways, one of which would be to determine how many standard deviations away from the mean the given value is, you could then combine these 2 values in some way to get your index. (addition may no longer be sufficient)
you'd have to work out what makes the most sense for the data sets your looking at. standard deviations may well be meaningless for your application, but you need to look at statistical measures that related to the distribution and combine those, rather than combing absolute values, normalized or not.

Resources