R code iteration blues - r

I am very new to R and am trying to use it to visualize things. To make a long story short, I'm exploring a conjecture I have on the economic theory of public goods. (I'm in my mid 50s, so bear with me.)
As far as R is concerned, I need to create a matrix from two vectors: one with E(W)/max(W), and a second with stdev(W)/E(W). The trick is that the sample space of W, my random variable, keeps expanding by 1. To make this clearer, here's the probability distribution of W for the first 4 iterations:
W p
0 2/3
1 1/3

W p
0 3/6
1 2/6
2 1/6

W p
0 4/10
1 3/10
2 2/10
3 1/10

W p
0 5/15
1 4/15
2 3/15
3 2/15
4 1/15
...
I need to iterate this 20 times or so. Of course, I could do this manually by copying, pasting, and adjusting simple code, but it would be bulky and ugly, and I'm a bit concerned about, you know, elegance.
With good help from this community, I learned how to program R to generate the denominator of the probabilities:
R code iteration
I thought (foolishly) I could take it from there, but after a few hours of scratching my bald head, I'm still stuck. The problem is that I don't understand well how to program less simple procedures that iterate. :/
I'll appreciate any help, especially clues setting me on the right track.
Thanks in advance!

You're just dividing by the sum, and the sum of 1 to k is k*(k+1)/2. So...
R> k <- 3
R> k:1 / (k^2 + k) * 2   # probabilities k/T, (k-1)/T, ..., 1/T with T = k*(k+1)/2
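For example, k <- 5 reproduces the probabilities of the fourth table above:
R> k <- 5
R> k:1 / (k^2 + k) * 2   # 5/15 4/15 3/15 2/15 1/15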

I assume you mean that you want a matrix of 20 or so rows, with each row holding the values of your two requested quantities given that the distribution has N possible values (so max(W) = N - 1).
t(vapply(seq_len(20) + 1, function(N) {
  w <- 0:(N - 1)                        # support of W for this iteration
  p <- seq(N, 1) / (N * (N + 1) / 2)    # decreasing probabilities N/T, ..., 1/T
  mu <- sum(w * p)                      # E(W)
  sdev <- sqrt(sum((w - mu)^2 * p))     # stdev(W)
  c(mu / max(w), sdev / mu)             # E(W)/max(W) and stdev(W)/E(W)
}, numeric(2)))
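If the vapply call feels opaque, the same table can be built with an explicit for loop; this is only a sketch of the iteration idea, not part of the answer above:
res <- matrix(NA_real_, nrow = 20, ncol = 2,
              dimnames = list(NULL, c("E/max", "sd/E")))
for (i in seq_len(20)) {
  N  <- i + 1                          # number of values in this iteration
  w  <- 0:(N - 1)                      # support of W
  p  <- seq(N, 1) / (N * (N + 1) / 2)  # decreasing probabilities
  mu <- sum(w * p)
  s  <- sqrt(sum((w - mu)^2 * p))
  res[i, ] <- c(mu / max(w), s / mu)
}
res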

Related

Count the Number of 6s Rolled on a Number of Dice in R

I am trying to develop code that will tell me the likelihood of rolling at least one six given 1 through 20 dice. Specifically, I am trying to build a single piece of code that loops through the problem space and generates this information. The question has left me at a loss.
I have tried using the sample function and looked at contingency tables.
n <- 10000                              # number of simulated rolls
die1 <- sample(1:6, n, replace = TRUE)
die2 <- sample(1:6, n, replace = TRUE)
sum_of_dice <- die1 + die2
counts <- table(sum_of_dice)
proba_empiric <- counts / sum(counts)
barplot(proba_empiric)
The above provides the basis for a probability, but not for the joint probability of two dice.
The final code should be able to tell me the likelihood of rolling a six on 1 die, 2 dice, 3 dice, all the way up to twenty dice.
One way to simulate the probability of rolling at least one 6 using 1 to 20 dice is to use rbinom():
sapply(1:20, function(x) mean(rbinom(10000, x, 1/6) > 0))
[1] 0.1675 0.3008 0.4174 0.5176 0.5982 0.6700 0.7157 0.7704 0.8001 0.8345 0.8643 0.8916 0.9094 0.9220 0.9310
[16] 0.9471 0.9547 0.9623 0.9697 0.9718
If I am understanding you correctly, you have 20 dice and you want to know the probability of at least one six appearing among them.
We can write a function to roll one die
roll_die <- function() sample(6, 1)
Then write another function which rolls 20 dice and checks if there is at least one six among them
roll_20_die <- function() {
  any(replicate(20, roll_die()) == 6)
}
and replicate this function a sufficient number of times to estimate the probability:
n <- 10000
table(replicate(n, roll_20_die()))/n
# FALSE TRUE
#0.0244 0.9756
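For reference, the exact answer has a closed form: each die misses the six with probability 5/6, so with n independent dice P(at least one six) = 1 - (5/6)^n. A quick check against the simulated values above:
n_dice <- 1:20
round(1 - (5/6)^n_dice, 4)   # exact P(at least one six) for 1 to 20 dice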

R: approximating `e = exp(1)` using `(1 + 1 / n) ^ n` gives absurd result when `n` is large

So, I was just playing around with manually calculating the value of e in R and I noticed something that was a bit disturbing to me.
The value of e using R's exp() command...
exp(1)
#[1] 2.718282
Now, I'll try to manually calculate it using x = 10000
x <- 10000
y <- (1 + (1 / x)) ^ x
y
#[1] 2.718146
Not quite but we'll try to get closer using x = 100000
x <- 100000
y <- (1 + (1 / x)) ^ x
y
#[1] 2.718268
Warmer but still a bit off...
x <- 1000000
y <- (1 + (1 / x)) ^ x
y
#[1] 2.71828
Now, let's try it with a huge one
x <- 5000000000000000
y <- (1 + (1 / x)) ^ x
y
#[1] 3.035035
Well, that's not right. What's going on here? Am I overflowing the data type and need to use a certain package instead? If so, are there no warnings when you overflow a data type?
You've got a problem with machine precision. As soon as (1 / x) drops below about 2.22e-16, 1 + (1 / x) is just 1. The mathematical limit breaks down in finite-precision numerical computation. Your final x in the question is already 5e+15, very close to this brink. Try x <- x * 10, and your y will be 1.
This is neither "overflow" nor "underflow" as there is no difficulty in representing a number as small as 1e-308. It is the problem of the loss of significant digits during floating-point arithmetic. When you do 1 + (1 / x), the bigger x is, the fewer significant digits in the (1 / x) part can be preserved when you add it to 1, and eventually you lose that (1 / x) term altogether.
## valid 16 significant digits
1 + 1.23e-01 = 1.123000000000000|
1 + 1.23e-02 = 1.012300000000000|
... ...
1 + 1.23e-15 = 1.000000000000001|
1 + 1.23e-16 = 1.000000000000000|
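You can see the cut-off directly in R; .Machine$double.eps is the 2.22e-16 quoted above (a small illustrative sketch):
.Machine$double.eps   # about 2.220446e-16
1 + 1e-15 == 1        # FALSE: 1e-15 still survives the addition
1 + 1e-17 == 1        # TRUE: 1e-17 is lost entirely
(1 + 1/5e16)^5e16     # the limit computation collapses to 1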
Any numerical analysis book would tell you the following.
Avoid adding a large number and a small number. In floating-point addition a + b = a * (1 + b / a); if b / a < 2.22e-16, then a + b = a. This implies that when adding up a number of positive numbers, it is more stable to accumulate them from the smallest to the largest (see the sketch after this list).
Avoid subtracting one number from another of nearly the same magnitude, or you may get cancellation error; the classic example is computing the roots of a quadratic with the standard quadratic formula.
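Here is a small sketch of the first point, using Reduce for naive left-to-right accumulation (R's own sum() may use a wider accumulator internally, which can hide the effect):
x <- c(1, rep(1e-16, 10))
Reduce(`+`, x)        # large value first: each 1e-16 is rounded away, result is 1
Reduce(`+`, rev(x))   # small values first: they accumulate to about 1e-15 before 1 is added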
You are also advised to read Approximation to constant "pi" does not get any better after 50 iterations, a question asked a few days after yours. Using a series to approximate an irrational number is numerically stable, as you won't get the absurd behavior seen in your question. But the finite number of valid significant digits imposes a different problem: numerical convergence; that is, you can only approximate the target value up to a certain number of significant digits. MichaelChirico's answer using the Taylor series converges after 19 terms, since 1 / factorial(19) is already numerically 0 when added to 1.
Multiplication / division between floating-point numbers doesn't cause problems with significant digits; it may cause "overflow" or "underflow". However, given the wide range of representable floating-point values (about 1e-308 to 1e+308), "overflow" and "underflow" should be rare. The real difficulty is with addition / subtraction, where significant digits can easily be lost. See Can I stably invert a Vandermonde matrix with many small values in R? for an example involving matrix computations. It is not impossible to get higher precision, but the work is probably more involved. For example, the OP of the matrix example eventually used GMP (the GNU Multiple Precision Arithmetic Library) and the associated R packages to proceed: How to put Rmpfr values into a function in R?
You might also try the Taylor series approximation to exp(1), namely
e^x = \sum_{k=0}^{\infty} x^k / k!
Thus we can approximate e = e^1 by truncating this sum; in R:
sprintf('%.20f', exp(1))
# [1] "2.71828182845904509080"
sprintf('%.20f', sum(1/factorial(0:10)))
# [1] "2.71828180114638451315"
sprintf('%.20f', sum(1/factorial(0:100)))
# [1] "2.71828182845904509080"

Find out if a solution exists for multiple equations (in N) [duplicate]

This question already has answers here: Algorithm for solving systems of linear inequalities (5 answers). Closed 8 years ago.
Consider the following equations:
X > Y
X + Y > 7
Y <= 10
X >= 0
Y >= 0
I want to find out if there exists a solution that fulfills all of them (natural numbers).
I don't care about the exact solution; I just want to know if there is a solution at all.
I have read about Microsoft Solver Foundation and other linear programming libraries, but I'm not sure if they can solve problems like this.
In particular, I'm not sure if they can solve inequalities with variables on each side, like
X > Y, or X + Y > Z
most examples are of the form:
X * 10 + Y * 30 > constant
I need it to be able to solve systems with a maximum of 4-8 variables, all in the range 0-100.
Another important constraint: the library needs to be fast. I need to be able to solve systems of around 7 equations in something like 0.00001 seconds.
Interesting question. Feels a lot like the integer-knapsack problem.
First of all, whether variables are on each side is irrelevant, since an equation like
X + Y > Z
can be rewritten to
X + Y - Z > 0
So let's assume that all constraints are of the format
(const1 * var1) + ... + (const8 * var8) > const
To support fewer variables, just use the value 0 for the unused constants.
The way to visualize this is to see the case of 2 variables as determining a convex region bounded by the 'lines' corresponding to the constraints. So each constraint can be drawn as a 2D line, and only values on one side of the line are allowed.
To visualize this for 3 variables, it becomes the question of whether the convex region bounded by the 'planes' determined by the constraints contains any grid points ('natural numbers').
The trouble in this case is the fact that the solution may use only natural numbers: this makes normal linear algebra impossible, since a grid is imposed. I don't know of any library supporting such restrictions.
But it would not be too difficult to write a solution yourself: the idea is to find a solution by trying every value while pruning aggressively.
So in your example: test all X in the range 0 to 100. Now go to the next variable, and determine the valid range for the free variable based on the constraints. Worked out for x == 8: then the range for y would be:
0 .. 7 because of constraint x > y
0 .. 100 because of constraint x + y > 7 (since x is already 8)
0 .. 10 because of constraint y <= 10
...and we repeat this for all constraints. The final range for y is then 0 .. 7, because that is the tightest constraint. Now repeat this process for the left-over unbound variables, and you're done if you find at least one solution.
I expect this code to be about 100 lines with dynamic programming; computation time very much depends on the input and can vary wildly.
For example, a set of equations which would take a long time:
A + B + C + D + E + F + G + H > 400.5
A + B + C + D + E + F + G + H < 400.6
As humans, we can deduce that since we're requiring natural numbers, there is no solution to these equations. However, this case is not prunable using the method described above; all combinations of A .. G will have to be tested before it can be concluded that there is no fitting H. Therefore it will look at just about all possibilities. Not really pleasant, but unavoidable.
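For the small two-variable example above, a brute-force feasibility check is trivial (sketched here in R; the pruning approach described above is what you would need once you have 4-8 variables):
grid <- expand.grid(X = 0:100, Y = 0:100)            # all natural-number candidates
feasible <- with(grid, X > Y & X + Y > 7 & Y <= 10)  # the constraints from the question
any(feasible)                                        # TRUE: at least one solution exists
head(grid[feasible, ])                               # e.g. X = 8, Y = 0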

Generate 3 random numbers that sum to 1 in R

I am hoping to create 3 (non-negative) quasi-random numbers that sum to one, and repeat over and over.
Basically I am trying to partition something into three random parts over many trials.
While I am aware of
a = runif(3,0,1)
But these of course don't sum to one. I was thinking that I could use 1-a as the max in the next runif, but it seems messy. Any thoughts, oh wise stackoverflow-ers?
This question involves subtler issues than might be at first apparent. After looking at the following, you may want to think carefully about the process that you are using these numbers to represent:
## My initial idea (and commenter Anders Gustafsson's):
## Sample 3 random numbers from [0,1], sum them, and normalize
jobFun <- function(n) {
  m <- matrix(runif(3 * n, 0, 1), ncol = 3)
  m <- sweep(m, 1, rowSums(m), FUN = "/")
  m
}
## Andrie's solution. Sample 1 number from [0,1], then break upper
## interval in two. (aka "Broken stick" distribution).
andFun <- function(n) {
  x1 <- runif(n)
  x2 <- runif(n) * (1 - x1)
  matrix(c(x1, x2, 1 - (x1 + x2)), ncol = 3)
}
## ddzialak's solution (vectorized by me)
ddzFun <- function(n) {
  a <- runif(n, 0, 1)
  b <- runif(n, 0, 1)
  rand1 <- pmin(a, b)
  rand2 <- abs(a - b)
  rand3 <- 1 - pmax(a, b)
  cbind(rand1, rand2, rand3)
}
## Simulate 10k triplets using each of the functions above
JOB <- jobFun(10000)
AND <- andFun(10000)
DDZ <- ddzFun(10000)
## Plot the distributions of values
par(mfcol=c(2,2))
hist(JOB, main="JOB")
hist(AND, main="AND")
hist(DDZ, main="DDZ")
Just draw 2 random numbers from (0, 1); if you call them a and b, then you get:
rand1 = min(a, b)
rand2 = abs(a - b)
rand3 = 1 - max(a, b)
When you want to randomly generate numbers that add to 1 (or some other value) then you should look at the Dirichlet Distribution.
There is an rdirichlet function in the gtools package and running RSiteSearch('Dirichlet') brings up quite a few hits that could easily lead you to tools for doing this (and it is not hard to code by hand either for simple Dirichlet distributions).
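For example (assuming the gtools package is installed), Dirichlet(1, 1, 1) is the uniform distribution over triplets that sum to 1:
library(gtools)
x <- rdirichlet(5, alpha = c(1, 1, 1))  # 5 triplets, uniform over the simplex
x
rowSums(x)                              # each row sums to 1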
I guess it depends on what distribution you want on the numbers, but here is one way:
diff(c(0, sort(runif(2)), 1))
Use replicate to get as many sets as you want:
> x <- replicate(5, diff(c(0, sort(runif(2)), 1)))
> x
[,1] [,2] [,3] [,4] [,5]
[1,] 0.66855903 0.01338052 0.3722026 0.4299087 0.67537181
[2,] 0.32130979 0.69666871 0.2670380 0.3359640 0.25860581
[3,] 0.01013117 0.28995078 0.3607594 0.2341273 0.06602238
> colSums(x)
[1] 1 1 1 1 1
I would simply randomly select 3 numbers from the uniform distribution and then divide by their sum:
n <- 3
x <- runif(n, 0, 1)
y <- x / sum(x)
sum(y) == 1
n could be any number you like.
This problem and the different solutions proposed intrigued me. I did a little test of the three basic algorithms suggested and what average values they would yield for the numbers generated.
choose_one_and_divide_rest
means: [ 0.49999212 0.24982403 0.25018384]
standard deviations: [ 0.28849948 0.22032758 0.22049302]
time needed to fill array of size 1000000 was 26.874945879 seconds
choose_two_points_and_use_intervals
means: [ 0.33301421 0.33392816 0.33305763]
standard deviations: [ 0.23565652 0.23579615 0.23554689]
time needed to fill array of size 1000000 was 28.8600130081 seconds
choose_three_and_normalize
means: [ 0.33334531 0.33336692 0.33328777]
standard deviations: [ 0.17964206 0.17974085 0.17968462]
time needed to fill array of size 1000000 was 27.4301018715 seconds
The time measurements are to be taken with a grain of salt, as they might be more influenced by Python memory management than by the algorithm itself. I'm too lazy to do it properly with timeit. I did this on a 1 GHz Atom, which explains why it took so long.
Anyway, choose_one_and_divide_rest is the algorithm suggested by Andrie and the poster of the question him/herself (AND): you choose one value a in [0,1], then one in [a,1], and then you look at what you have left. It adds up to one, but that's about it; the first part is on average twice as large as the other two. One might have guessed as much...
choose_two_points_and_use_intervals is the accepted answer by ddzialak (DDZ). It takes two points in the interval [0,1] and uses the size of the three sub-intervals created by these points as the three numbers. Works like a charm and the means are all 1/3.
choose_three_and_normalize is the solution by Anders Gustafsson and Josh O'Brien (JOB). It just generates three numbers in [0,1] and normalizes them back to a sum of 1. Works just as well and surprisingly a little bit faster in my Python implementation. The variance is a bit lower than for the second solution.
There you have it. I have no idea which beta distribution these solutions correspond to, or which set of parameters in the paper I referred to in a comment, but maybe someone else can figure that out.
The simplest solution is the wakefield package's probs() function.
probs(3) will yield a vector of three values with a sum of 1.
Given that, you can use replicate(x, probs(3)), where x is "over and over".
No drama.
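A sketch of that suggestion (assuming the wakefield package is installed; note that replicate(), not rep(), is what draws a fresh vector each time):
library(wakefield)
probs(3)                 # one vector of three values summing to 1
replicate(5, probs(3))   # five such vectors, one per column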

Implementing additional constraints in R's nnls

I am using the R interface to the Lawson-Hanson NNLS implementation of an algorithm for non-negative linear least squares, which minimizes ||A x - b||^2 with the constraint that all elements of the vector x ≥ 0. This works fine, but I would like to add further constraints. Of interest to me are:
1. Also minimize the "energy" of x:
||A x - b||^2 + m ||x||^2
2. Minimize the "energy in the x derivative":
||A x - b||^2 + m ||H x||^2, where H is the sum of the identity and a matrix with -1 on the first off-diagonal
3. Most generally, minimize ||A x - b||^2 + m ||H x - f||^2.
Is there a way to coax nnls into doing this by some clever restatement of problems 1-3 above? The reason I have hope for such a thing is a little throw-away comment in a paper by Whitall et al. (sorry for the paywall) that claims that "fortunately, NNLS can be adopted from the original form above to accommodate something in problem 3".
I take it m is a scalar, right? Consider the simple case m = 1; you can generalize to other values of m by letting H* = sqrt(m) H and f* = sqrt(m) f and using the solution method given below.
So now you're trying to minimise ||A x - b||^2 + ||H x - f||^2.
Let A* = [A' | H']' and let b* = [b' | f']' (i.e. stack up A on top of H and b on top of f) and solve the original problem of
non-negative linear least squares on ||A* x - b*||^2 with the constraint that all elements of vector x ≥ 0 .
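Here is a sketch of that stacking trick with the nnls package, on made-up toy data, taking H as the identity (i.e. the plain ||x||^2 penalty of problem 1) and f = 0:
library(nnls)                          # Lawson-Hanson NNLS
set.seed(1)
A <- matrix(rnorm(50 * 10), 50, 10)    # toy design matrix
b <- rnorm(50)                         # toy response
m <- 0.1                               # penalty weight
H <- diag(10)                          # identity: plain ||x||^2 penalty
f <- rep(0, 10)
A_star <- rbind(A, sqrt(m) * H)        # stack A on top of sqrt(m) * H
b_star <- c(b, sqrt(m) * f)            # stack b on top of sqrt(m) * f
fit <- nnls(A_star, b_star)            # ordinary NNLS on the augmented system
fit$x                                  # non-negative solution of the penalized problem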
