Don't understand code that gets random number between two values - math

So, I understand this seems like a stupid question. That said, in my classes on writing proper code I came upon this: Min + (Math.random() * ((Max - Min) + 1)). Essentially the code says: take your minimum value and add to it a random number between 0.0 and 1.0 multiplied by (the max value minus the minimum value, plus 1). The book presents this code as a base for retrieving a random value within certain bounds, e.g. with max=40 and min=20 it would give a value between 20 and 40.
The thing is, I know what the code is saying and doing. I was using this to generate a random character by adding (char) to the front and using 'a' and 'z' as the values. The thing is though, I don't understand how, mathematically speaking, this even works. I understand it makes me a pretty poor programmer on my part. I never claimed to be great or brilliant. I know algebra and some basic higher math concepts but there are some stupidly basic formulas like this that leave me scratching my head.
In terms of programming logic this isn't so much an issue for me, but seeing concepts like this leaves me confused. I don't get the mathematical logic of this code. Am I missing anything? I mean, with a random value between 0.0 and 1.0, I don't see how it produces a value between the minimum and maximum value. Would anybody be willing to give me a layman's explanation of how this works?

It is called linear interpolation (or linear extrapolation, when the parameter falls outside the input range). Anyway, the idea behind changing the dynamic range is this:
Let us have:
x = < x0 , x1 > // input range
And we want to change them to
y = < y0 , y1 > // output range
So let me derive it step by step:
// equation                        range               next operation
y = x                              // < x0 , x1 >      -x0
y = x-x0                           // < 0 , x1-x0 >    /(x1-x0)
y = (x-x0)/(x1-x0)                 // < 0 , 1 >        *(y1-y0)
y = (y1-y0)(x-x0)/(x1-x0)          // < 0 , y1-y0 >    +y0
y = y0 + (y1-y0)(x-x0)/(x1-x0)     // < y0 , y1 >
Now I suspect x=Math.random() returns values in x=<0,1> and we want the result in <y0,y1> = <min,max>, so:
y = min + (max-min)(x-0)/(1-0)
y = min + (max-min)*x
The +1 stretches the range to <min,max+1>. If your random() returns values strictly below 1, as Java's Math.random() does, then truncating the result to an integer (for example with a cast to int or char) turns <min,max+1) back into whole numbers in <min,max>. I am assuming Java or something similar here; I am more of a C++ guy.
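A minimal Python sketch of the formula from the question, assuming a generator that returns values in [0, 1) the way Java's Math.random() does. The helper name random_int_between is made up for this illustration.

```python
import math
import random

def random_int_between(lo, hi, rng=random.random):
    """Map a uniform sample from [0, 1) to an integer in [lo, hi] inclusive."""
    # rng() * (hi - lo + 1) lies in [0, hi - lo + 1); flooring gives 0 .. hi - lo.
    return lo + math.floor(rng() * ((hi - lo) + 1))

random.seed(1)
samples = [random_int_between(20, 40) for _ in range(10_000)]
print(min(samples), max(samples))

# The same idea for a random lowercase letter, as in the question.
letter = chr(random_int_between(ord("a"), ord("z")))
print(letter)
```

Without the +1 the largest value could only appear if the generator returned exactly 1, which a [0, 1) generator never does.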
For simplicity linear interpolation/extrapolation is to obtain values between two edges/points/values linearly with some parameter t=<0,1>
x(t) = x0 + (x1-x0)*t
if (t=0) then x(0)=x0
if (t=1) then x(1)=x1
if (t=0.5) then x(0.5)= middle between x0 and x1
If t=<0,1> then we are talking about linear interpolation. If t is outside this range then we are talking about linear extrapolation (equation is the same).
Linear means that when you sample points/values with a constant t step, the resulting values will also have a constant distance between them. They also lie on a single line.
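The interpolation formula and the range change derived earlier can be sketched in Python as follows. The function names lerp and remap are only illustrative.

```python
def lerp(x0, x1, t):
    """Linear interpolation (0 <= t <= 1) or extrapolation (t outside that range)."""
    return x0 + (x1 - x0) * t

def remap(x, x0, x1, y0, y1):
    """Map x from the input range <x0, x1> to the output range <y0, y1>."""
    t = (x - x0) / (x1 - x0)   # normalize to <0, 1>
    return lerp(y0, y1, t)     # scale and shift into <y0, y1>

print(lerp(20, 40, 0.0))   # left edge
print(lerp(20, 40, 0.5))   # midpoint
print(lerp(20, 40, 1.0))   # right edge
print(lerp(20, 40, 1.5))   # extrapolation past the right edge
print(remap(5, 0, 10, 100, 200))
```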
Hope it is clear now.

Imagine rubber fiber spanned between points 0 and 1 (line segment).
Sprinkle some dye drops on it - you have generated random values on 0..1 interval.
Now fix left point and stretch this fiber until its length becomes Max - Min.
Now shift it right by Min.
You can see some color points (random values) on interval Min..Max
In general this is linear transformation of one interval (0..1) into another (Min..Max). Note that initial interval might be arbitrary.

Related

Calculating error in PCA

I have a question about a result which I did not expect when doing PCA.
I have successfully calculated the principal components using reference data, and then as a check to ensure that what's going on is what I think is going on, I've projected the reference data onto the entire basis of its eigenfunctions (kept all components) and then transformed back. (This is in Python, so it's pca.fit(ref_data), followed by ref_data_transform = pca.transform(ref_data), followed by pca.inverse_transform(ref_data_transform).) I get the exact same data. This is not a surprise.
What is also not a surprise is that as I choose fewer and fewer principal components, the point to point difference between the original data and that which has been projected onto a smaller basis and then projected back increases. That is, if you plot the original data and "filtered" data, it looks different, with the difference increasing as you reduce the size of the subspace onto which you're projecting. I can capture the difference between each data point in a vector called, say, difference_vec.
What IS a surprise (to me at least) is that when I sum over any column of difference_vec it always equals zero. That is, while the actual differences between any original data point and the corresponding one filtered by some number of principal components grow larger as I project onto a smaller and smaller subspace, the TOTAL error is always zero.
I very much appreciate any insight that one may have into whether I'm making some mistake here and, if not, why this erstwhile "projection induced error" metric doesn't work.
Thanks.
This happens because ref_data and what I’ll call inv_data = pca.inverse_transform(pca.transform(ref_data)) both have the same mean (taken along the first dimension, i.e., averaging over samples).
To see this, take a look at the code for transform:
transform = lambda X: dot(X - mu, V.T)
whereas inverse_transform can be defined as:
inverse_transform = lambda X: dot(X, V) + mu
where mu is the mean of ref_data and V are the first N eigenvectors of covariance(ref_data).
So if you follow the chain of data and its mean:
ref_data with mean mu;
transform(ref_data) has mean 0 (see the equivalent definition above: X-mu has zero mean, and projecting the result linearly onto some coordinate reference only rotates/shears/flips those zero-mean points without altering their mean);
Finally, inv_data = inverse_transform(transform(ref_data)) adds mu back so it has mu-mean;
you see that ref_data and inv_data both have mean mu.
Finally, each column sum of ref_data - inv_data equals num_samples * mean(ref_data - inv_data), which by linearity is num_samples * (mu - mu), which is 0.
That’s a lot of words, sorry, but the idea, now that I see it, is really simple. As I mentioned in my comment, in cases like this you want to use a matrix norm, like the Frobenius norm, to measure a distance between two matrices, not just sum(A - B) 😅!
Sample code:
import numpy as np
from sklearn.decomposition import PCA
ref_data = np.random.randn(20, 3)
pca = PCA(n_components=1)
pca.fit(ref_data)
trans_data = pca.transform(ref_data)
inv_data = pca.inverse_transform(trans_data)
np.mean(inv_data, 0) # array([ 0.03664149, 0.51348007, 0.0360179 ])
np.mean(ref_data, 0) # array([ 0.03664149, 0.51348007, 0.0360179 ])
np.mean(trans_data, 0) # array([ -2.49800181e-17]) meanwhile ...
np.sum(inv_data - ref_data) # -1.3877787807814457e-15 !
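To see the contrast between the two error measures without any library, here is a rough pure-Python sketch. It is not the scikit-learn implementation: it assumes 2-D data, keeps one component, and uses the closed-form direction of the leading eigenvector of a 2x2 covariance matrix. The function name pca_1d_reconstruct is made up.

```python
import math
import random

def pca_1d_reconstruct(points):
    """Project 2-D points onto their first principal axis and map them back."""
    n = len(points)
    mu_x = sum(p[0] for p in points) / n
    mu_y = sum(p[1] for p in points) / n
    centered = [(x - mu_x, y - mu_y) for x, y in points]
    # Entries of the 2x2 scatter matrix (a constant factor does not change the axis).
    sxx = sum(cx * cx for cx, _ in centered)
    syy = sum(cy * cy for _, cy in centered)
    sxy = sum(cx * cy for cx, cy in centered)
    # Closed-form direction of the leading eigenvector of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    vx, vy = math.cos(theta), math.sin(theta)
    recon = []
    for cx, cy in centered:
        score = cx * vx + cy * vy                              # transform: (X - mu) . v
        recon.append((score * vx + mu_x, score * vy + mu_y))   # inverse: score * v + mu
    return recon

random.seed(0)
ref_data = [(random.gauss(0, 3), random.gauss(5, 1)) for _ in range(50)]
inv_data = pca_1d_reconstruct(ref_data)
diff = [(a - c, b - d) for (a, b), (c, d) in zip(ref_data, inv_data)]
column_sums = (sum(d[0] for d in diff), sum(d[1] for d in diff))
frobenius = math.sqrt(sum(dx * dx + dy * dy for dx, dy in diff))
print(column_sums)   # both entries are zero up to rounding
print(frobenius)     # clearly non-zero: this is the usable error measure
```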

Combining two normal random variables

suppose I have the following 2 random variables :
X where mean = 6 and stdev = 3.5
Y where mean = -42 and stdev = 5
I would like to create a new random variable Z based on the first two and knowing that : X happens 90% of the time and Y happens 10% of the time.
It is easy to calculate the mean for Z : 0.9 * 6 + 0.1 * -42 = 1.2
But is it possible to generate random values for Z in a single function?
Of course, I could do something along those lines :
if (randIntBetween(1,10) > 1)
    GenerateRandomNormalValue(6, 3.5);
else
    GenerateRandomNormalValue(-42, 5);
But I would really like to have a single function that would act as a probability density function for such a random variable (Z) that is not necessarily normal.
sorry for the crappy pseudo-code
Thanks for your help!
Edit: here is one concrete question:
Let's say we add up 5 consecutive values drawn from Z. What would be the probability of ending up with a number higher than 10?
"But I would really like to have a single function that would act as a probability density function for such a random variable (Z) that is not necessarily normal."
Okay, if you want the density, here it is:
rho = 0.9 * density_of_x + 0.1 * density_of_y
But you cannot sample from this density if you don't 1) compute its CDF (cumbersome, but not infeasible) 2) invert it (you will need a numerical solver for this). Or you can do rejection sampling (or variants, eg. importance sampling). This is costly, and cumbersome to get right.
So you should go for the "if" statement (ie. call the generator 3 times), except if you have a very strong reason not to (using quasi-random sequences for instance).
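As a small illustration, the density rho above can be written out in Python with the means and standard deviations from the question. The function names are made up, and the Riemann sum at the end is only a sanity check that the mixture integrates to 1.

```python
import math

def normal_pdf(x, mean, stdev):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / stdev
    return math.exp(-0.5 * z * z) / (stdev * math.sqrt(2.0 * math.pi))

def mixture_pdf(x):
    """rho = 0.9 * density_of_x + 0.1 * density_of_y for the values in the question."""
    return 0.9 * normal_pdf(x, 6.0, 3.5) + 0.1 * normal_pdf(x, -42.0, 5.0)

# Crude Riemann sum over [-100, 100] to check that the density integrates to 1.
step = 0.01
area = sum(mixture_pdf(-100.0 + i * step) for i in range(20_000)) * step
print(area)
```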
If a normal random variable is denoted x=(mean,stdev) then the following algebra applies to sums of independent variables
number * x = ( number*mean, number*stdev )
x1 + x2 = ( mean1+mean2, sqrt(stdev1^2+stdev2^2) )
so for the case of X = (mx,sx), Y= (my,sy) the linear combination is
Z = w1*X + w2*Y = (w1*mx,w1*sx) + (w2*my,w2*sy) =
( w1*mx+w2*my, sqrt( (w1*sx)^2+(w2*sy)^2 ) ) =
( 1.2, 3.19 )
link: Normal Distribution look for Miscellaneous section, item 1.
PS. Sorry for the weird notation. The new standard deviation is calculated by something similar to the Pythagorean theorem. It is the square root of the sum of squares.
This is the form of the distribution:
ListPlot[BinCounts[Table[If[RandomReal[] < .9,
RandomReal[NormalDistribution[6, 3.5]],
RandomReal[NormalDistribution[-42, 5]]], {1000000}], {-60, 20, .1}],
PlotRange -> Full, DataRange -> {-60, 20}]
It is NOT Normal, as you are not adding Normal variables, but just choosing one or the other with certain probability.
Edit
This is the curve for adding five vars with this distribution:
The upper and lower peaks represent taking one of the distributions alone, and the middle peak accounts for the mixing.
The most straightforward and generically applicable solution is to simulate the problem:
Run the piecewise function you have 1,000,000 times (just a high number) and generate a histogram of the results: split them into bins and divide the count for each bin by your N (1,000,000 in my example) and by the bin width. This will leave you with an approximation of the PDF of Z at every given bin.
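The same simulation idea also answers the concrete question from the edit. The following is a sketch using Python's standard random module; the function names, trial counts and seeds are arbitrary choices for the illustration.

```python
import random

def sample_z(rng):
    """Draw from X ~ N(6, 3.5) with probability 0.9, otherwise from Y ~ N(-42, 5)."""
    if rng.random() < 0.9:
        return rng.gauss(6.0, 3.5)
    return rng.gauss(-42.0, 5.0)

def estimate_prob_sum_exceeds(threshold, terms=5, trials=100_000, seed=0):
    """Monte Carlo estimate of P(Z_1 + ... + Z_terms > threshold)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if sum(sample_z(rng) for _ in range(terms)) > threshold:
            hits += 1
    return hits / trials

rng = random.Random(42)
draws = [sample_z(rng) for _ in range(100_000)]
print(sum(draws) / len(draws))          # sample mean of Z, should be near 1.2
print(estimate_prob_sum_exceeds(10.0))  # estimated P(sum of 5 draws > 10)
```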
Lots of unknowns here, but essentially you just wish to add the two (or more) probability functions to one another.
For any given probability function you could calculate a random number with that density by calculating the area under the probability curve (the integral) and then generating a random number between 0 and that area. Then move along the curve until the area is equal to your random number and use that as your value.
This process can then be generalized to any function (or sum of two or more functions).
Elaboration:
Suppose you have a distribution function f(x) defined for x from 0 to 1. You could calculate a random number based on the distribution by calculating the integral of f(x) from 0 to 1, giving you the area under the curve; let's call it A.
Now, you generate a random number between 0 and A, let's call that number, r. Now you need to find a value t, such that the integral of f(x) from 0 to t is equal to r. t is your random number.
This process can be used for any probability density function f(x). Including the sum of two (or more) probability density functions.
I'm not sure what your functions look like, so I'm not sure if you are able to calculate analytic solutions for all this, but in the worst case you could use numeric techniques to approximate the effect.
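A numeric version of this procedure might look like the sketch below. It assumes the density is given as a Python function on a finite interval and tabulates the running area with the midpoint rule; make_sampler and its parameters are made-up names, and the step count is an arbitrary choice.

```python
import bisect
import random

def make_sampler(f, lo, hi, steps=10_000):
    """Build a sampler for an (unnormalized) density f on [lo, hi] via a tabulated CDF."""
    width = (hi - lo) / steps
    cumulative = [0.0]
    for i in range(steps):
        mid = lo + (i + 0.5) * width            # midpoint rule for the area of one slice
        cumulative.append(cumulative[-1] + f(mid) * width)
    total_area = cumulative[-1]                 # the area A under the curve

    def sample(rng=random):
        r = rng.random() * total_area                 # random number between 0 and A
        i = bisect.bisect_right(cumulative, r) - 1    # slice where the running area reaches r
        i = min(i, steps - 1)
        # Interpolate inside the slice to find t with integral(lo..t) close to r.
        slice_area = cumulative[i + 1] - cumulative[i]
        frac = (r - cumulative[i]) / slice_area if slice_area > 0 else 0.0
        return lo + (i + frac) * width

    return sample

random.seed(0)
sample = make_sampler(lambda x: 2.0 * x, 0.0, 1.0)   # density rising linearly on [0, 1]
values = [sample() for _ in range(50_000)]
print(sum(values) / len(values))   # the exact mean of this density is 2/3
```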

How do you create a formula that has diminishing returns?

I guess this is a math question and not a programming question, but what is a good way to create a formula that has diminishing returns?
Here are some example points on how I want the curve to look like.
f(1) = 1
f(1.5)= .98
f(2) = .95
f(2.5) = .9
f(3) = .8
f(4) = .7
f(5) = .6
f(10) = .5
f(20) = .25
Notice that as the input gets higher, the percentage decreases rapidly. Is there any way to model a function that has a very smooth and accurate curve that says this?
Another way to say it is by using a real example. You know in Diablo II they have Magic Find? There are diminishing returns for magic find. If you get 100%, the real magic find is still 100%. But the more you get, the more your actual magic find is reduced, so much so that if you had 1200%, your real magic find would probably be 450%. So they have a function like:
actualMagicFind(magicFind) = // some way to reduced magic find
f(x) = f(0)e^(-rx)
Where r is the rate of compounded diminishing return
This is just exponential decay
f(1) = 1
f(1.5)= .98
f(2) = .95
f(2.5) = .9
f(3) = .8
f(4) = .7
f(5) = .6
f(10) = .5
f(20) = .25
This doesn't make sense: for 3-5, adding one each time subtracts .1. With a real curve the output wouldn't be evenly spaced between any evenly spaced inputs. In other words, your curve is not a curve, as you can see by the graph: https://docs.google.com/spreadsheets/d/1EEbRxTyYalPSyQ93rcIbKiIvmYRX1lhRvkRh_HyofJc/edit?usp=sharing
So let's just ignore your "curve"; There are several ways to create diminishing returns. One of my favorites is:
f(x) = (x*a) / (x+b) + c
You can make a, b, and c anything you want. With this format a + c essentially* becomes your maximum possible output, c is your minimum, and b controls how quickly the output values scale and its effectiveness is relative to the value of a. This curve, of course increases the output as the input increases, whereas your example wants to decrease the output as the input increases. To fix this, you can swap the numerator and denominator:
f(x) = (x+b) / (x*a) + c
This makes the minimum output value equal to 1/a + c, the maximum output value approaches infinity as the input value approaches 0. b once again controls how quickly the output scales and its effectiveness is relative to the value of a.
Another approach would be to use something like what was mentioned by @Pierreten, though I'm not sure why he explicitly uses e:
a^(-bx)
Both a and b have a profound impact on how quickly the curve scales. Assuming b is positive: if a is greater than 0 and less than 1, the output will increase as the input increases, and it will do so with increasing returns, not diminishing ones. If a is greater than 1, then you'll see the desired effect of the output decreasing as the input increases with diminishing returns. The following is the closest thing I found to the numbers you described:
f(x) = 1.01^(-6.96607x)
f(0) = 1
f(1) = 0.933
f(3) = 0.812
f(10) = 0.5
f(20) = 0.25
There are several other options as well, but this is long enough.
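For anyone who wants to play with these, here is a small Python sketch of two of the formulas above. The function names and the sample parameters a=100, b=50, c=0 are arbitrary choices for the illustration.

```python
def rational_returns(x, a, b, c):
    """f(x) = (x*a) / (x+b) + c: rises from c toward a + c as x grows."""
    return (x * a) / (x + b) + c

def exponential_decay(x, base=1.01, rate=6.96607):
    """f(x) = base^(-rate*x): the fitted curve quoted above."""
    return base ** (-rate * x)

for x in (0, 1, 3, 10, 20):
    print(x, round(exponential_decay(x), 3))

# Each extra unit of input adds less output than the previous one.
gains = [rational_returns(x + 1, 100, 50, 0) - rational_returns(x, 100, 50, 0)
         for x in range(0, 500, 100)]
print(gains)
```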
Any inverse power function, such as f(x) = 1/(x^2). Modify the exponent to adjust the steepness of the curve.
This is the best I have been able to do turning this into Excel. Any help improving it is appreciated; it works, I just cannot get it to behave the way I expect.
Formula in M column is:
Output increases as input increases
=((I6*J6)/(I6+K6))+L6
Output decreases as input increases
=((I12+K12)/(I12*J12))+L12
Image of Excel Table
The idea behind this is to calculate a diminishing return on discounts: the more volume of product a customer purchases, the lower the rate of discount he gets. He still gets some, but it's a lot less.
Full thread of the excel quest to get this formula working here (for those interested)

What is O value for naive random selection from finite set?

This question on getting random values from a finite set got me thinking...
It's fairly common for people to want to retrieve X unique values from a set of Y values. For example, I may want to deal a hand from a deck of cards. I want 5 cards, and I want them to all be unique.
Now, I can do this naively, by picking a random card 5 times, and try again each time I get a duplicate, until I get 5 cards. This isn't so great, however, for large numbers of values from large sets. If I wanted 999,999 values from a set of 1,000,000, for instance, this method gets very bad.
The question is: how bad? I'm looking for someone to explain an O() value. Getting the xth number will take y attempts...but how many? I know how to figure this out for any given value, but is there a straightforward way to generalize this for the whole series and get an O() value?
(The question is not: "how can I improve this?" because it's relatively easy to fix, and I'm sure it's been covered many times elsewhere.)
Variables
n = the total amount of items in the set
m = the amount of unique values that are to be retrieved from the set of n items
d(i) = the expected amount of tries needed to achieve a value in step i
i = denotes one specific step. i ∈ [0, n-1]
T(m,n) = expected total amount of tries for selecting m unique items from a set of n items using the naive algorithm
Reasoning
The first step, i=0, is trivial. No matter which value we choose, we get a unique one at the first attempt. Hence:
d(0) = 1
In the second step, i=1, we at least need 1 try (the try where we pick a valid unique value). On top of this, there is a chance that we choose the wrong value. This chance is (amount of previously picked items)/(total amount of items). In this case 1/n. In the case where we picked the wrong item, there is a 1/n chance we may pick the wrong item again. Multiplying this by 1/n, since that is the combined probability that we pick wrong both times, gives (1/n)^2. To understand this, it is helpful to draw a decision tree. Having picked a non-unique item twice, there is a probability that we will do it again. This results in the addition of (1/n)^3 to the total expected amount of tries in step i=1. Each time we pick the wrong number, there is a chance we might pick the wrong number again. This results in:
d(1) = 1 + 1/n + (1/n)^2 + (1/n)^3 + (1/n)^4 + ...
Similarly, in the general i:th step, the chance to pick the wrong item in one choice is i/n, resulting in:
d(i) = 1 + i/n + (i/n)^2 + (i/n)^3 + (i/n)^4 + ...
= sum( (i/n)^k ), where k ∈ [0,∞)
This is a geometric series and hence it is easy to compute its sum:
d(i) = (1 - i/n)^(-1)
The overall complexity is then computed by summing the expected amount of tries in each step:
T(m,n) = sum( d(i) ), where i ∈ [0,m-1]
= 1 + (1 - 1/n)^(-1) + (1 - 2/n)^(-1) + (1 - 3/n)^(-1) + ... + (1 - (m-1)/n)^(-1)
Extending the fractions in the series above by n, we get:
T(m,n) = n/n + n/(n-1) + n/(n-2) + n/(n-3) + ... + n/(n-m+2) + n/(n-m+1)
We can use the fact that:
n/n ≤ n/(n-1) ≤ n/(n-2) ≤ n/(n-3) ≤ ... ≤ n/(n-m+2) ≤ n/(n-m+1)
Since the series has m terms, and each term satisfies the inequality above, we get:
T(m,n) ≤ n/(n-m+1) + n/(n-m+1) + n/(n-m+1) + n/(n-m+1) + ... + n/(n-m+1) + n/(n-m+1)
= m*n/(n-m+1)
It might be(and probably is) possible to establish a slightly stricter upper bound by using some technique to evaluate the series instead of bounding by the rough method of (amount of terms) * (biggest term)
Conclusion
This would mean that the Big-O order is O(m*n/(n-m+1)). I see no possible way to simplify this expression from the way it is.
Looking back at the result to check if it makes sense, we see that, if n is constant, and m gets closer and closer to n, the results will quickly increase, since the denominator gets very small. This is what we'd expect, if we for example consider the example given in the question about selecting "999,999 values from a set of 1,000,000". If we instead let m be constant and n grow really, really large, the complexity will converge towards O(m) in the limit n → ∞. This is also what we'd expect, since while choosing a constant number of items from a "close to" infinitely sized set the probability of choosing a previously chosen value is basically 0. I.e. we need m tries independently of n since there are no collisions.
If you already have chosen i values then the probability that you pick a new one from a set of y values is
(y-i)/y.
Hence the expected number of trials to get (i+1)-th element is
y/(y-i).
Thus the expected number of trials to choose x unique element is the sum
y/y + y/(y-1) + ... + y/(y-x+1)
This can be expressed using harmonic numbers as
y (H_y - H_(y-x)).
From the wikipedia page you get the approximation
H_x = ln(x) + gamma + O(1/x)
Hence the number of necessary trials to pick x unique elements from a set of y elements
is
y (ln(y) - ln(y-x)) + O(y/(y-x)).
If you need then you can get a more precise approximation by using a more precise approximation for H_x. In particular, when x is small it is possible to
improve the result a lot.
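The exact sum is easy to check against a simulation of the naive method. A Python sketch follows; the function and parameter names are made up, with set_size playing the role of y and picks the role of x.

```python
import random

def expected_draws(set_size, picks):
    """Exact expectation y/y + y/(y-1) + ... + y/(y-x+1) = y * (H_y - H_(y-x))."""
    return sum(set_size / (set_size - i) for i in range(picks))

def simulate_draws(set_size, picks, rng):
    """Count the draws the naive retry-on-duplicate method needs for `picks` unique values."""
    seen = set()
    draws = 0
    while len(seen) < picks:
        seen.add(rng.randrange(set_size))
        draws += 1
    return draws

rng = random.Random(0)
trials = 20_000
average = sum(simulate_draws(6, 6, rng) for _ in range(trials)) / trials
print(expected_draws(6, 6))   # 6 * H_6, about 14.7
print(average)                # simulated average for the same case
print(expected_draws(1_000_000, 999_999))
```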
If you're willing to make the assumption that your random number generator will always find a unique value before cycling back to a previously seen value for a given draw, this algorithm is O(m^2), where m is the number of unique values you are drawing.
So, if you are drawing m values from a set of n values, the 1st value will require you to draw at most 1 to get a unique value. The 2nd requires at most 2 (you see the 1st value, then a unique value), the 3rd 3, ... the mth m. Hence in total you require 1 + 2 + 3 + ... + m = [m*(m+1)]/2 = (m^2 + m)/2 draws. This is O(m^2).
Without this assumption, I'm not sure how you can even guarantee the algorithm will complete. It's quite possible (especially with a pseudo-random number generator which may have a cycle), that you will keep seeing the same values over and over and never get to another unique value.
==EDIT==
For the average case:
On your first draw, you will make exactly 1 draw.
On your 2nd draw, you expect to make 1 (the successful draw) + 1/n (the "partial" draw which represents your chance of drawing a repeat)
On your 3rd draw, you expect to make 1 (the successful draw) + 2/n (the "partial" draw...)
...
On your mth draw, you expect to make 1 + (m-1)/n draws.
Thus, you will make 1 + (1 + 1/n) + (1 + 2/n) + ... + (1 + (m-1)/n) draws altogether in the average case.
This equals the sum from i=0 to (m-1) of [1 + i/n]. Let's denote that sum(1 + i/n, i, 0, m-1).
Then:
sum(1 + i/n, i, 0, m-1) = sum(1, i, 0, m-1) + sum(i/n, i, 0, m-1)
= m + sum(i/n, i, 0, m-1)
= m + (1/n) * sum(i, i, 0, m-1)
= m + (1/n)*[(m-1)*m]/2
= (m^2)/(2n) - (m)/(2n) + m
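We drop the constants, and we get that this is O(m + m^2/n), where m is the number to be drawn and n is the size of the list. Since m ≤ n, the m^2/n part never exceeds m. Note that 1 + i/n only counts the first repeat at each step, so this estimate is accurate only while m is small compared to n.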
There's a beautiful O(n) algorithm for this. It goes as follows. Say you have n items, from which you want to pick m items. I assume the function rand() yields a random real number between 0 and 1. Here's the algorithm:
items_left=n
items_left_to_pick=m
for j=1,...,n
    if rand()<=(items_left_to_pick/items_left)
        Pick item j
        items_left_to_pick=items_left_to_pick-1
    end
    items_left=items_left-1
end
It can be proved that this algorithm does indeed pick each subset of m items with equal probability, though the proof is non-obvious. Unfortunately, I don't have a reference handy at the moment.
Edit The advantage of this algorithm is that it takes only O(m) memory (assuming the items are simply integers or can be generated on-the-fly) compared to doing a shuffle, which takes O(n) memory.
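A runnable Python sketch of the pseudocode above, using 0-based item numbers. As far as I can tell this is the technique Knuth calls selection sampling (Algorithm S in The Art of Computer Programming, vol. 2), but treat that attribution as my assumption. The function name is made up, and the early exit once everything has been picked is a small addition.

```python
import random

def selection_sample(n, m, rng=random):
    """Pick m of the items 0 .. n-1 in one pass, each m-subset equally likely."""
    picked = []
    items_left_to_pick = m
    for j in range(n):
        items_left = n - j
        # Pick item j with probability (still needed) / (still available).
        if rng.random() * items_left < items_left_to_pick:
            picked.append(j)
            items_left_to_pick -= 1
            if items_left_to_pick == 0:
                break   # nothing left to pick, so the rest of the pass is not needed
    return picked

random.seed(3)
hand = selection_sample(52, 5)
print(hand)
```

The result comes out in increasing order, so shuffle it afterwards if the order of the picked items matters.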
Your actual question is actually a lot more interesting than what I answered (and harder). I've never been any good at statistics (and it's been a while since I did any), but intuitively, I'd say that the run-time complexity of that algorithm would probably be something like an exponential. As long as the number of elements picked is small enough compared to the size of the array the collision rate will be so small that it will be close to linear time, but at some point the number of collisions will probably grow fast and the run-time will go down the drain.
If you want to prove this, I think you'd have to do something moderately clever with the expected number of collisions as a function of the wanted number of elements. It might be possible to do by induction as well, but I think going by that route would require more cleverness than the first alternative.
EDIT: After giving it some thought, here's my attempt:
Given an array of m elements, and looking for n random and different elements. It is then easy to see that when we want to pick the ith element, the odds of picking an element we've already visited are (i-1)/m. This is then the expected number of collisions for that particular pick. For picking n elements, the expected number of collisions will be the sum of the number of expected collisions for each pick. We plug this into Wolfram Alpha (sum (i-1)/m, i=1 to n) and we get the answer (n**2 - n)/2m. The average number of picks for our naive algorithm is then n + (n**2 - n)/2m.
Unless my memory fails me completely (which is entirely possible, actually), this gives an average-case run-time O(n**2).
The worst case for this algorithm is clearly when you're choosing the full set of N items. This is equivalent to asking: On average, how many times must I roll an N-sided die before each side has come up at least once?
Answer: N * H_N, where H_N is the Nth harmonic number,
a value famously approximated by log(N).
This means the algorithm in question is N log N.
As a fun example, if you roll an ordinary 6-sided die until you see one of each number, it will take on average 6 H_6 = 14.7 rolls.
Before being able to answer this question in detail, let's define the framework. Suppose you have a collection {a_1, a_2, ..., a_n} of n distinct objects, and want to pick m distinct objects from this set, such that the probability of a given object a_j appearing in the result is equal for all objects.
If you have already picked k items, and randomly pick an item from the full set {a_1, a_2, ..., a_n}, the probability that the item has not been picked before is (n-k)/n. This means that the number of samples you have to take before you get a new object is (assuming independence of random sampling) geometric with parameter (n-k)/n. Thus the expected number of samples to obtain one extra item is n/(n-k), which is close to 1 if k is small compared to n.
Concluding, if you need m unique objects, randomly selected, this algorithm gives you
n/n + n/(n-1) + n/(n-2) + n/(n-3) + .... + n/(n-(m-1))
which, as Alderath showed, can be bounded from above by
m*n / (n-m+1).
You can see a little bit more from this formula:
* The expected number of samples to obtain a new unique element increases as the number of already chosen objects increases (which sounds logical).
* You can expect really long computation times when m is close to n, especially if n is large.
In order to obtain m unique members from the set, use a variant of Donald Knuth's algorithm for obtaining a random permutation. Here, I'll assume that the n objects are stored in an array.
for i = 1..m
    k = randInt(i, n)
    exchange(i, k)
end
here, randInt samples an integer from {i, i+1, ... n}, and exchange flips two members of the array. You only need to shuffle m times, so the computation time is O(m), whereas the memory is O(n) (although you can adapt it to only save the entries such that a[i] <> i, which would give you O(m) on both time and memory, but with higher constants).
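In Python the partial shuffle could be sketched like this, with 0-based indices. The function name pick_unique is made up, and the sketch copies the input first, which the pseudocode does not do.

```python
import random

def pick_unique(items, m, rng=random):
    """Return m distinct random items using the first m steps of a Fisher-Yates shuffle."""
    pool = list(items)            # the copy costs O(n); shuffle in place to keep O(m) time
    n = len(pool)
    for i in range(m):
        k = rng.randrange(i, n)   # uniform index from i .. n-1
        pool[i], pool[k] = pool[k], pool[i]
    return pool[:m]

random.seed(7)
cards = pick_unique(range(1, 53), 5)
print(cards)
```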
Most people forget that looking up whether the number has already been drawn also takes a while.
The number of tries necessary can, as described earlier, be evaluated from:
T(n,m) = n(H(n)-H(n-m)) ⪅ n(ln(n)-ln(n-m))
which goes to n*ln(n) for interesting values of m
However, for each of these 'tries' you will have to do a lookup. This might be a simple O(n) runthrough, or something like a binary tree. This will give you a total performance of n^2*ln(n) or n*ln(n)^2.
For smaller values of m (m < n/2), you can do a very good approximation for T(n,m) using the inequality between the harmonic and arithmetic means (HM ≤ AM), yielding the formula:
2*m*n/(2*n-m+1)
As m goes to n, this gives a lower bound of O(n) tries and performance O(n^2) or O(n*ln(n)).
All the results are, however, far better than I would ever have expected, which shows that the algorithm might actually be just fine in many non-critical cases, where you can accept occasional longer running times (when you are unlucky).

Can I force two components in a three-way linear regression to be positive?

I'm sorry if I'm not using the correct mathematical terms, but I hope you'll understand what I'm trying to accomplish.
My problem:
I'm using linear regression (currently least squares method) on the values from two vectors x and y against the result z. This is to be done in matlab, and I'm using the \-operator to perform the regression. My dataset will contain a few thousand observations (up to about 50000 at max).
The x-values will be in the area of 10-300 (most between 60 and 100) and the y-values in the 1-3 area.
My code looks like this:
X = [ones(size(x,1),1) x y];
parameters = X\z;
The output "parameters" are then the three factors a0, a1 and a2 which is used in this formula:
a0 * 1 + a1 * xi + a2 * yi = zi
(The i's are supposed to be subscripted)
This works as expected, although I want the two parameters a1 and a2 to ALWAYS be positive values, even when the vector z is negative (this means that a0 will be negative, of course), since this is what the real model looks like (z is always positively correlated to x and y). Is this possible using the least squares method? I'm also open to other algorithms for linear regression.
Let me try and rephrase to clarify. According to your model z is always positively correlated with x and y. However, sometimes when you solve the linear regression for the coefficients this gives you a negative value.
If you are right about the data, this should only happen when the correct coefficient is small, and noise happens to take it negative. You could just assign it to zero, but then the means wouldn't match properly.
In which case the correct solution is as jpalacek says, but explained with more detail here:
Try to regress z against x and y. If both coefficients are positive, take the result.
If a1 is negative, assume it should be zero and regress z against y only. If a2 is positive, then take a1 as 0, and a0 and a2 from this regression.
If a2 is negative, assume it should be zero too. Regress z against 1, and take this as a0. Let a1 and a2 be 0.
This should give you what you want.
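The steps above can be sketched in plain Python. This is a variation rather than a literal transcription: it tries all four combinations of "slope free" and "slope fixed at zero" and keeps the admissible fit with the smallest squared error. It solves the normal equations directly, which is fine for three columns but not a numerically robust method in general. All function names are made up.

```python
def solve_linear(a, b):
    """Solve a small dense system a @ x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def least_squares(columns, z):
    """Ordinary least squares via the normal equations (fine for a few columns)."""
    gram = [[sum(p * q for p, q in zip(ci, cj)) for cj in columns] for ci in columns]
    rhs = [sum(p * q for p, q in zip(ci, z)) for ci in columns]
    return solve_linear(gram, rhs)

def fit_nonnegative_slopes(x, y, z):
    """Fit z = a0 + a1*x + a2*y with a1 >= 0 and a2 >= 0; a0 is unconstrained."""
    ones = [1.0] * len(z)
    best = None
    for use_x in (True, False):          # is a1 free, or fixed at zero?
        for use_y in (True, False):      # is a2 free, or fixed at zero?
            cols = [ones] + ([x] if use_x else []) + ([y] if use_y else [])
            coef = least_squares(cols, z)
            a0 = coef[0]
            a1 = coef[1] if use_x else 0.0
            a2 = coef[-1] if use_y else 0.0
            if a1 < 0 or a2 < 0:
                continue                 # violates the sign constraint
            sse = sum((a0 + a1 * xi + a2 * yi - zi) ** 2
                      for xi, yi, zi in zip(x, y, z))
            if best is None or sse < best[0]:
                best = (sse, [a0, a1, a2])
    return best[1]

# Made-up data whose unconstrained fit would give a negative a2.
x = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
y = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
z = [5.0 + 2.0 * xi - 0.5 * yi for xi, yi in zip(x, y)]
coef = fit_nonnegative_slopes(x, y, z)
print(coef)
```

The case with both slopes fixed at zero is always admissible, so the function always returns a fit.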
The simple solution is to use a tool designed to solve it. That is, use lsqlin, from the optimization toolbox. Set a lower bound constraint for two of the three parameters.
Thus, assuming x, y, and z are all COLUMN vectors,
A = [ones(length(x),1),x,y];
lb = [-inf, 0, 0];
a = lsqlin(A,z,[],[],[],[],lb);
This will constrain only the second and third unknown parameters.
Without the optimization toolbox, use lsqnonneg, which is part of matlab itself. Here too the solution is easy enough.
A = [ones(length(x),1),x,y];
a = lsqnonneg(A,z);
Your model will be
z = a(1) + a(2)*x + a(3)*y
If a(1) is essentially zero, i.e., it is within a tolerance of zero, then assume that the first parameter was constrained by the bound at zero. In that case, solve a second problem by changing the sign on the column of ones in A.
A(:,1) = -1;
a = lsqnonneg(A,z);
If this solution has a(1) significantly non-zero, then the second solution must be better than the first. Your model will now be
z = -a(1) + a(2)*x + a(3)*y
It costs you at most two calls to lsqnonneg, and the second call is only made some fraction of the time (lacking any information about your problem, the odds of needing the second call are 50%).
