I ran into an interesting problem that I'd like to solve. I need to get a random number from a coefficient (min + abs(max - min) * coef), but not a uniformly random one: the distribution of the coefficient should follow a given curve. Say the generator most often gives a coefficient less than 0.5, and less often one greater than 0.5. I have provided an illustration of the graph below. How can this be done? (Please give examples in JS, Python, C++, C# or Java.)
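Since the graph itself isn't reproduced here, here is a minimal sketch of one standard approach, inverse-transform sampling. The density f(c) = 2(1 - c) on [0, 1] and the function names are my assumptions for illustration, not from the question; this density makes coefficients below 0.5 three times as likely as ones above (Python):

    import random

    def skewed_coef():
        # Inverse-transform sampling for the assumed density f(c) = 2*(1 - c)
        # on [0, 1]: the CDF is F(c) = 2c - c^2, and inverting F(c) = u
        # gives c = 1 - sqrt(1 - u) for a uniform u in [0, 1).
        u = random.random()
        return 1.0 - (1.0 - u) ** 0.5

    def skewed_random(min_val, max_val):
        # Plug the skewed coefficient into the formula from the question.
        return min_val + abs(max_val - min_val) * skewed_coef()

Any other target curve works the same way: integrate it to get the CDF, normalize, and invert (numerically if no closed form is available).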
I'm trying to generate a random double in R between 0 and 100, with 0 being a possible result, and normally runif(1, min, max) would do what I need. However, if I understand it correctly, runif will only give results strictly between min and max and never the limits themselves.
Is there a way in R to generate a random double that includes only one of the limits? (In this case, 0≤x<100)
josliber created a custom function that includes both limits (https://stackoverflow.com/a/24070116/6429759), but I'm afraid that I don't know if this can be modified to only include min.
I do realise this would only change the outcome an extremely small fraction of the time, but it's part of a function that will be run extremely frequently, so it's not for nothing.
Ignoring R for the moment, the probability of obtaining exactly 0 when sampling from the uniform distribution is 0 - so for all practical purposes, drawing from the open interval and the closed interval are essentially the same.
Now, in R (or any computer-based system, for that matter), we cannot actually represent an infinite number of numbers, because we're working with a finite representational system. So you technically are drawing from a finite population, and there is a non-zero probability of drawing a boundary point. However, a good random number generator (and R has several) will do a pretty good job of mimicking reality - which means that even if you drew from a closed interval instead of an open interval, the probability of actually drawing 0 is negligible.
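If you nevertheless want to guarantee the half-open interval, a simple resampling guard works; a sketch in Python (the function name is mine, and the same idea carries over directly to R's runif):

    import random

    def uniform_half_open(min_val=0.0, max_val=100.0):
        # random.random() is in [0.0, 1.0) by construction, but floating-point
        # rounding in the scaling can still land exactly on max_val, so guard it.
        while True:
            x = min_val + (max_val - min_val) * random.random()
            if x < max_val:
                return x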
I have an arbitrary curve (defined by a set of points) and I would like to generate a polynomial that fits that curve to an arbitrary precision. What is the best way to tackle this problem, or is there already a library or online service that performs this task?
Thanks!
If your "arbitrary curve" is described by a set of points (x_i,y_i) where each x_i is unique, and if you mean by "fits" the calculation of the best least-squares polynomial approximation of degree N, you can simply obtain the coefficients b of the polynomial using
b = polyfit(X,Y,N)
where X is the vector of x_i values and Y is the vector of y_i values. In this way you can increase N until you obtain the accuracy you require. Of course you can achieve zero approximation error by calculating the interpolating polynomial. However, data fitting often requires some thought beforehand: you need to consider what you want the approximation to achieve. There are a variety of mathematical ways of assessing approximation error (using different norms), the choice of which will depend on your requirements for the resulting approximation. There are also many potential pitfalls (such as overfitting), and blindly attempting to fit curves may result in an approximation that is theoretically sound but utterly useless to you in practical terms. I would suggest doing a little research on approximation theory if the above method does not meet your requirements, as has been suggested in the comments on your question.
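As a minimal sketch of that call using NumPy's polyfit (the sample data here is made up for illustration):

    import numpy as np

    # Made-up sample points; substitute your own (x_i, y_i) data.
    X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    Y = np.array([1.0, 2.7, 5.8, 11.1, 17.9])

    N = 2                    # degree of the fitted polynomial
    b = np.polyfit(X, Y, N)  # least-squares coefficients, highest power first
    residual = np.max(np.abs(np.polyval(b, X) - Y))
    print(b, residual)       # increase N until the residual is acceptable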
I'm trying to create an interpolation of a list of points.
I have some points with coordinates (t_i, x_i), where the t_i are timestamps and the x_i are associated values. I want to create a function that passes through these points and allows me to find the x value corresponding to a generic t that lies in the interval.
I want to interpolate them with a third-order (cubic) interpolation. I've seen something like Catmull-Rom interpolation, but it works only if the points are equally spaced in time (equidistant t_i).
For example here http://www.mvps.org/directx/articles/catmull/ you can find that the timestamp points are equidistant, like also here http://www.cs.cmu.edu/~462/projects/assn2/assn2/catmullRom.pdf .
Is there some way to apply cubic interpolation to non-regularly spaced points?
Unequal spacing of the arguments is not a problem as long as they are all distinct. As you probably know, if you have four distinct times t[i], then there exists a unique polynomial interpolant of corresponding values x[i] having degree at most 3 (cubic or lower order).
There are two main approaches to computing the interpolant, Newton's divided-differences and Lagrange's method of interpolation.
Keeping in mind that just finding the polynomial is not the point, but rather evaluating it at another time in the interval, there are some programming tradeoffs to consider.
If the times t[i] are fixed but the values x[i] are changed repeatedly, you might be well off using Lagrange's method. It basically constructs four cubic polynomials, each of which takes roots at three of the four points and the normalized value 1 at the remaining point. Once you have those four polynomials, interpolating the values x[i] is just a matter of taking the corresponding linear combination of them. Note that the Lagrange and Newton forms produce the same unique interpolating polynomial; they differ only in how the computation is organized.
However if the times t[i] keep changing, or perhaps you are evaluating the interpolating polynomial for a number of intermediate points for the same t[i], x[i] data, then Newton's divided differences may be better. If accuracy is important, one can vary the order that the times t[i] appear in the divided-difference tableau so that the evaluation is localized around the closest times to the intermediate time where the value is needed.
It's not hard to find sample code for Newton's divided difference method on the Web, e.g. in C++, Python, or Java.
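For reference, a minimal Python sketch of the divided-difference approach (function names are mine):

    def divided_differences(t, x):
        # Compute the Newton-form coefficients in place: after the loops,
        # c[j] holds the divided difference x[t_0, ..., t_j].
        n = len(t)
        c = list(x)
        for j in range(1, n):
            for i in range(n - 1, j - 1, -1):
                c[i] = (c[i] - c[i - 1]) / (t[i] - t[i - j])
        return c

    def newton_eval(t, c, point):
        # Evaluate the Newton-form polynomial at `point` with Horner's scheme.
        result = c[-1]
        for i in range(len(c) - 2, -1, -1):
            result = result * (point - t[i]) + c[i]
        return result

    # Cubic through four unequally spaced times:
    t = [0.0, 0.5, 2.0, 3.0]
    x = [1.0, 2.0, 0.5, 4.0]
    print(newton_eval(t, divided_differences(t, x), 1.0))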
One way might be to fit a least-squares cubic through the points. I've found that approach to be robust and practical, even with a small number of points.
As per the title, I have some data that is roughly binormally distributed and I would like to find its two underlying components.
I am fitting the data distribution with the sum of two normal densities with means m1 and m2 and standard deviations s1 and s2. The two Gaussians are scaled by weight factors w1 and w2 such that w1 + w2 = 1.
I can succeed to do this using the vglm function of the VGAM package such as:
fitRes <- vglm(mydata ~ 1, mix2normal1(equalsd=FALSE),
               iphi=w, imu1=m1, imu2=m2, isd1=s1, isd2=s2)
This is painfully slow and it can take several minutes depending on the data, but I can live with that.
Now I would like to see how the distribution of my data changes over time, so essentially I break my data up into a few (30-50) blocks and repeat the fit process for each of them.
So, here are the questions:
1) How do I speed up the fit process? I tried nls and mle, which look much faster, but mostly failed to get a good fit (though I succeeded in getting every possible error those functions could throw at me). It's also not clear to me how to impose limits with those functions (w in [0;1] and w1+w2=1).
2) How do I automagically choose good starting parameters (I know this is a $1 million question, but you never know, maybe someone has the answer)? Right now I have a little interface that allows me to choose the parameters and visually see what the initial distribution would look like, which is very cool, but I would like to do it automatically for this task.
I thought of relying on the x values corresponding to the 3rd and 4th quartiles of y as starting parameters for the two means. Do you think that would be reasonable?
First things first:
Did you try searching for fit mixture model on RSeek.org?
Did you look at the Cluster Analysis + Finite Mixture Modeling Task View?
There has been a lot of research into mixture models so you may find something.
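For comparison outside R, a sketch of the same two-component fit in Python using scikit-learn's GaussianMixture, which fits by EM and is typically fast (the synthetic data here is for illustration only):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Synthetic bimodal data standing in for mydata.
    mydata = np.concatenate([np.random.normal(0, 1, 500),
                             np.random.normal(5, 2, 500)])

    gm = GaussianMixture(n_components=2).fit(mydata.reshape(-1, 1))
    print(gm.weights_)                       # w1, w2 (sum to 1 by construction)
    print(gm.means_.ravel())                 # m1, m2
    print(np.sqrt(gm.covariances_).ravel())  # s1, s2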
I'd like to combine a few metrics of nodes in a social network graph into a single value for rank ordering the nodes:
in_degree + betweenness_centrality = informal_power_index
The problem is that in_degree and betweenness_centrality are measured on different scales, say 0-15 vs 0-35000, and follow a power-law distribution (or at least definitely not a normal distribution).
Is there a good way to rescale the variables so that one won't dominate the other in determining the informal_power_index?
Three obvious approaches, sketched in code after this list, are:
Standardizing the variables (subtract the mean and divide by the standard deviation). This seems like it would squash the distribution too much, hiding the massive difference between a value in the long tail and one near the peak.
Re-scaling the variables to the range [0,1] by subtracting min(variable) and dividing by max(variable) - min(variable). This seems closer to fixing the problem, since it won't change the shape of the distribution, but maybe it won't really address the issue? In particular, the means will be different.
Equalize the means by dividing each value by mean(variable). This won't address the difference in scales, but perhaps the mean values are more important for the comparison?
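For concreteness, the three options as a sketch in Python/NumPy (v is one metric's values across all nodes; the function names are mine):

    import numpy as np

    def standardize(v):
        # Option 1: z-scores (subtract mean, divide by standard deviation).
        return (v - v.mean()) / v.std()

    def min_max(v):
        # Option 2: rescale to [0, 1]; the distribution's shape is unchanged.
        return (v - v.min()) / (v.max() - v.min())

    def mean_scale(v):
        # Option 3: equalize the means by dividing each value by the mean.
        return v / v.mean()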
Any other ideas?
You seem to have a strong sense of the underlying distributions. A natural rescaling is to replace each variate with its probability. Or, if your model is incomplete, choose a transformation that approximately achieves that. Failing that, here's a related approach: if you have a lot of univariate data from which to build a histogram (of each variate), you could convert each to a 10-point scale based on whether it is in the 0-10th percentile, the 10th-20th percentile, ..., or the 90th-100th percentile. These transformed variates have, by construction, a uniform distribution on 1, 2, ..., 10, and you can combine them however you wish.
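A sketch of that decile construction in Python/NumPy (the function name is mine):

    import numpy as np

    def decile_score(v):
        # Rank every value (0-based), then map ranks onto the deciles 1..10,
        # so a value in the bottom 10% scores 1 and one in the top 10% scores 10.
        ranks = v.argsort().argsort()
        return 1 + (ranks * 10) // len(v)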
You could translate each to a percentage and then apply each to a known quantity, then use the sum of the new values:
((1 - in_degree / 15) * 2000) + ((1 - betweenness_centrality / 35000) * 2000) = ?
Very interesting question. Could something like this work:
Let's assume that we want to scale both variables to a range of [-1,1].
Take the example of betweenness_centrality, which has a range of 0-35000.
Choose a large number on the order of the range of the variable; as an example, let's choose 25,000.
Create 25,000 bins in the original range [0,35000] and 25,000 bins in the new range [-1,1].
For each number x_i, find the bin it falls into in the original range; call this B_i.
Find the corresponding bin for B_i in the new range [-1,1].
Use either the max or the min of that bin in [-1,1] as the scaled version of x_i.
This preserves the power-law distribution while also scaling it down to [-1,1], and does not have the problem experienced by (x - mean)/sd.
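A sketch of those steps in Python (the parameter defaults follow the example numbers above):

    def bin_rescale(x, old_min=0.0, old_max=35000.0, n_bins=25000,
                    new_min=-1.0, new_max=1.0):
        # Find the bin index B_i of x in the original range, then return the
        # lower edge of the corresponding bin in the new range.
        b = int((x - old_min) / (old_max - old_min) * n_bins)
        b = min(b, n_bins - 1)  # keep x == old_max inside the last bin
        return new_min + (new_max - new_min) * b / n_bins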
Normalizing to [0,1] would be my short-answer recommendation to combine the two values, as it maintains the distribution shape, as you mentioned, and should solve the problem of combining the values.
If the distributions of the two variables are different, which sounds likely, this won't really give you what I think you're after: a combined measure of where each variable sits within its own distribution. You would have to come up with a metric that determines where in its distribution a given value lies. This could be done many ways; one would be to determine how many standard deviations away from the mean the value is, and you could then combine those two values in some way to get your index (simple addition may no longer be sufficient).
You'd have to work out what makes the most sense for the data sets you're looking at. Standard deviations may well be meaningless for your application, but you need to look at statistical measures that relate to the distribution and combine those, rather than combining absolute values, normalized or not.