Algorithm to generate stars-and-bars-like solutions from a known one? - r

While implementing a Simulated Annealing heuristic to solve a problem, I am trying to find the best way to generate neighbours of my current proposed solution.
Solutions come in the form of a vector of integer positions (p1, ..., pN), which I read as a binary chain
0 0 0 0 1 0 ... 0 0 1 0 ... 0 1 0 0
        p1          pj        pN
With the restrictions that pj - p(j-1) > D for all j, p1 > D/2, and length - pN > D/2.
Now, my idea is to use something similar to the Levenshtein distance to create new solutions, so if I have [0,1,0,0,1,0] (D=3) and I want a new state within a distance less than or equal to 1, then I can get [1,0,0,0,1,0], for example, but not [1,0,0,1,0,0].
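In code, the feasibility condition I am imposing looks roughly like this (isFeasible, p and len are just illustrative names, not part of my actual program):

isFeasible <- function(p, D, len) {
  # p1 > D/2, p_j - p_(j-1) > D for all j, and len - pN > D/2
  all(diff(p) > D) && p[1] > D/2 && (len - p[length(p)]) > D/2
}
isFeasible(c(2, 6, 10), D = 3, len = 12)  # TRUE: gaps of 4 and enough slack at both ends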
What I do (in R) is the following:
GenNewSeq <- function(seq, D, dist){
  for(i in 1:dist){
    Diffs <- c((ceiling(D/2)+seq[1]), diff(seq), (ceiling(D/2)+seq[length(seq)]))
    position <- sample((2:length(seq))[Diffs > D], size=1)
    move <- sample(c(-1*as.integer(Diffs[position-1]>D), 0, 1*as.integer(Diffs[position]>D)), size = 1)
    seq[position-1] <- seq[position-1]+move
  }
  seq
}
Maybe it is a bit obscure; if you want, I can explain better what it does. The problem is that this is 1) slow (I don't know how to avoid the for loop), and 2) not working as intended: it tends to move only the last positions, and/or it keeps moving the same element back and forth, so I get biased results in my Simulated Annealing.
I have thought of removing the distance restriction and putting it into the fitness function instead (something like exp(D-(pj-p(j-1)))), so that I can simply perturb the positions with normal noise, or move them all together and let them oscillate... and I am starting to think that would be the easiest way. However, I would very much appreciate a reference on how to write an efficient and reliable algorithm that does what I ask for; I don't mind if I have to do it in C. I have checked this but I wasn't able to resolve my doubts.
Thank you very much for your help.

The bug in your program is this. When you select position at random, you are choosing a segment at random from the set of segments of length greater than D. The element you are going to end up moving is the right-hand endpoint of this segment.
And, although it seems as though you are choosing the direction of the move at random, in fact the move is more likely to be in the downward direction than upward. This is because Diffs[position-1] is guaranteed to be greater than D (due to the way position was selected), but Diffs[position] is not. This means that in some cases move is chosen at random from c(-1,0,1) and in other cases from c(-1,0,0). So, over time, downward moves occur more often than upward moves.
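A rough illustration of that drift (not your exact code, just the two candidate move sets mixed evenly):

set.seed(1)
bothFree <- sample(c(-1, 0, 1), 1e5, replace = TRUE)  # both neighbouring gaps have room
onlyDown <- sample(c(-1, 0, 0), 1e5, replace = TRUE)  # the upward move is blocked
mean(c(bothFree, onlyDown))  # noticeably below 0, i.e. a net downward drift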
Your algorithm can be fixed by selecting at random among all points for which at least one adjacent segment has length greater than D; that way there won't be any bias in the move direction:
GenNewSeq2 <- function(seq, D, dist){
  for(i in 1:dist){
    Diffs <- c((ceiling(D/2)+seq[1]), diff(seq))
    bigGaps <- Diffs > D
    moveable <- bigGaps[-1] | head(bigGaps, -1)
    position <- sample(which(moveable), 1)
    move <- sample(c(-1*(Diffs[position]>D), 1*(Diffs[position+1]>D)), size = 1)
    seq[position] <- seq[position]+move
  }
  seq
}
It is also possible to generate a random new sequence without a for loop. Here is one possible implementation:
newseq <- function(seq, D, dist){
  diffs <- c((ceiling(D/2)+seq[1]), diff(seq))  # gap before each element (first entry is the slack at the left edge)
  bigGaps <- diffs > D                          # gaps larger than D
  selected <- sample(which(bigGaps), min(length(bigGaps), dist))  # pick up to dist of those gaps
  directions <- sample(c(-1, 1), length(selected), T)             # random direction for each pick
  down <- directions < 0
  up <- directions > 0
  selected[up] <- selected[up] - 1  # for an upward move, the element to the left of the gap is the one that moves
  move <- rep(0, length(seq))
  move[selected[up]] <- 1
  move[selected[down]] <- move[selected[down]] - 1
  move[length(seq)] <- 0  ## the last element of seq stays fixed always
  seq + move
}
This implementation is more efficient, and it doesn't slow down nearly as much when dist grows.
> set.seed(123)
> seq <- sort(sample(1000, 20))
> microbenchmark(newseq(seq, 20, 3), GenNewSeq2(seq, 20, 3))
Unit: microseconds
                   expr     min       lq   median      uq     max neval
     newseq(seq, 20, 3)  53.503  55.0965   56.026  56.761  68.804   100
 GenNewSeq2(seq, 20, 3) 183.091 188.0490  189.492 191.249 367.094   100
> microbenchmark(newseq(seq, 20, 6), GenNewSeq2(seq, 20, 6))
Unit: microseconds
                   expr     min       lq   median       uq     max neval
     newseq(seq, 20, 6)  54.027  56.4960  57.3865  58.2955  70.258   100
 GenNewSeq2(seq, 20, 6) 368.306 373.7745 377.5225 381.4565 559.037   100
We can also verify that GenNewSeq2 and newseq don't drift towards zero by running the following code for each of the three functions, and then plotting the mean of seq over time:
set.seed(12345)
seq <- sort(sample(1000, 20))
x <- rep(0, 20000)
for(i in 1:20000){
  x[i] <- mean(seq)
  seq <- GenNewSeq(seq, 20, 3)
}
plot(x, type='l')

Related

R: draw from a vector using custom probability function

Forgive me if this has been asked before (I feel it must have, but could not find precisely what I am looking for).
How can I draw one element of a vector of whole numbers (from 1 through, say, 10) using a probability function that specifies different chances for the elements? If I want equal probabilities I use runif() to get a number between 1 and 10:
ceiling(runif(1,1,10))
How do I similarly sample from, e.g., the exponential distribution to get a number between 1 and 10 (such that 1 is much more likely than 10), or from a logistic probability function (if I want a sigmoid, increasing probability from 1 through 10)?
The only "solution" I can come up with is to first draw 1e6 numbers from, say, the sigmoid distribution and then scale the min and max to 1 and 10, but this looks clumsy.
UPDATE:
This awkward solution (and I don't feel it is very "correct") would go like this:
# Draw enough values from a distribution, here the exponential
x <- rexp(1e3)
# Scale them to e.g. 1-10
scaler <- function(vector, min, max){
  (((vector - min(vector)) * (max - min)) / (max(vector) - min(vector))) + min
}
x_scale <- scaler(x, 1, 10)
# And sample once (and round it)
round(sample(x_scale, 1))
Are there not better solutions around?
I believe sample() is what you are looking for, as @HubertL mentioned in the comments. You can apply an increasing function (here a logistic sigmoid, named logit() below) to the vector v you want to sample from, and then use the output of that function as the vector of probability weights p. See the code below.
logit <- function(x) {
  return(exp(x)/(exp(x)+1))
}
v <- c(seq(1,10,1))
p <- logit(seq(1,10,1))
sample(v, 1, prob = p, replace = TRUE)
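The same approach covers the exponential case in the question: pass exponentially decaying weights as prob (dexp() is just one convenient choice of decreasing function here):

# 1 is much more likely than 10: weights decay exponentially over 1:10
sample(1:10, 1, prob = dexp(1:10, rate = 0.5))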

Fastest way to sample real values using a proportional probability

Given a numeric vector with N real numbers, what's the fastest way to sample k values, such that higher values have greater probability of being selected?
mathematically
prob(X) > prob(Y) when X > Y (Linearly)
This is easy with sample() when all entries are positive, just use the prob arg:
N = 1000
k = 600
x = runif(N, 0, 10)
results = sample(x, k, replace = TRUE, prob = x)
But it doesn't work in my case, because some values might be negative. I cannot drop or ignore the negative values; that's the problem.
So, what's the fastest (in terms of code speed) way of doing this? Obviously I know how to solve it; the issue is speed, and I guess one method will be slower than the other:
1 - Normalize the x vector (a call to `range()` plus a division would be necessary)
2 - Add max(x) to x (a call to `max()`, then an addition)
Thanks.
A few comments. First, it's still not exactly clear what you want. Obviously, you want larger numbers to be chosen with higher probability, but there are a lot of ways of doing this. For example, either rank(x) or x-min(x) will produce a vector of non-negative weights which are monotonic in x.
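Either of those can be fed straight into sample() even when x contains negative values; a quick sketch:

set.seed(1)
x <- rnorm(10)                                   # contains negative values
sample(x, 5, replace = TRUE, prob = rank(x))     # rank-based weights
sample(x, 5, replace = TRUE, prob = x - min(x))  # shifted ("affine") weights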
Another point, you don't need to normalize the weights, because sample will do that for you, provided that the weights are non-negative:
> set.seed(1)
> sample(1:10,prob=1:10)
[1] 9 8 6 2 10 3 1 5 7 4
> set.seed(1)
> sample(1:10,prob=(1:10)/sum(1:10))
[1] 9 8 6 2 10 3 1 5 7 4
On edit: The OP is now asking for a weighting function which is "linear" in the input vector. Technically this is impossible, because linear functions are of the form f(X)=cX, so if a vector x contains both positive and negative values, then any linear function of x will also contain both positive and negative values, unless c=0, in which case it still does not give a valid vector of probability weights.
I think what you mean by "linear" is simply x-min(x). This is not a linear function, but an affine function. Moreover, even if you had specified that you wanted P(X) to vary as an affine function of X, that still would not have uniquely determined the probability weights, because there are an infinite number of possible affine functions that would yield valid weights (e.g. x-min(x)+1, etc.)
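A tiny illustration of that last point, with made-up numbers: two affine weightings of the same x normalize to different probabilities, so "affine" alone does not pin the weights down:

x  <- c(-2, 0, 3)
w1 <- x - min(x)       # weights 0, 2, 5
w2 <- x - min(x) + 1   # weights 1, 3, 6
rbind(w1 / sum(w1), w2 / sum(w2))  # 0.000 0.286 0.714  vs  0.100 0.300 0.600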
In any case, assuming x-min(x) is what you want, the question now becomes, what is the fastest way to compute x-min(x) in R. And I'm pretty sure that the answer is just x-min(x).
Finally, for constants anywhere near what you have in your example, there is not much point in trying to optimize the calculation of weights, because the random sampling is going to take much longer anyway. For example:
> x <- rnorm(1000)
> k <- 600
> p <- x - min(x)
> microbenchmark(x - min(x), sample(x, k, T, p))
Unit: microseconds
               expr   min      lq  median      uq    max neval
         x - min(x)  6.56  6.9105  7.0895  7.2515 13.629   100
 sample(x, k, T, p) 50.30 51.4360 51.7695 52.1970 66.196   100

Plot a table of binomial distributions in R

For a game design issue, I need to inspect binomial distributions more closely. Using R, I need to build a two-dimensional table that, given fixed parameters 'pool' (the number of dice rolled) and 'sides' (the number of sides of each die), has:
In rows --> minimum for a success (ranging from 0 to sides, it's a discrete distribution)
In columns --> number of successes (ranging from 0 to pool)
I know how to calculate a single entry, but I'm not sure how to iterate to fill the entire table.
EDIT: I forgot to say that I want to calculate the probability p of getting at least that number of successes.
OK, I think this could be a simple solution. It has the ratio of successes on the rows and the success thresholds for the die roll (p) on the columns.
poolDistribution <- function(n, sides=10, digits=2, roll.Under=FALSE){
  m <- 1:sides
  names(m) <- paste(m, ifelse(roll.Under, "-", "+"), sep="")
  s <- 1:n
  names(s) <- paste(s, n, sep="/")
  sapply(m, function(m.value)
    round((if(roll.Under) (1 - pbinom(s - 1, n, (m.value)/sides))*100
           else (1 - pbinom(s - 1, n, (sides - m.value + 1)/sides))*100),
          digits=digits))
}
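Assuming the closing brace is added as above, a call looks like this and returns a table of percentages with one row per success count and one column per threshold:

poolDistribution(5)              # 5 dice, default d10: rows 1/5..5/5, columns 1+..10+
poolDistribution(3, sides = 20)  # 3 d20s instead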
This gets you half of the way.
If you are new to R, you might miss out on the fact that a very powerful feature is that you can use a vector of values as an index to another vector. This makes part of the problem trivially easy:
pool <- 3
sides <- 20 # <cough>D&D<cough>
# you need to store the values somewhere, use a vector
NumberOfRollsPerSide <- rep(0, sides)
names(NumberOfRollsPerSide) <- 1:sides # this will be useful in table
## Repeat so long as there are still zeros,
## i.e., so long as there is a side that has not come up yet
while (any(NumberOfRollsPerSide == 0)) {
  # roll once
  oneRoll <- sample(1:sides, pool, TRUE)
  # add (+1) to each side's total rolls
  # note that you can use the roll outcome to index the vector. R is great.
  NumberOfRollsPerSide[oneRoll] <- NumberOfRollsPerSide[oneRoll] + 1
}
# These are your results:
NumberOfRollsPerSide
All you have left to do now is count, for each side, in which roll number it first came up.
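One rough way to do that counting (a sketch that re-runs the rolls while recording them, since the loop above only keeps totals; firstSeen and rollNumber are names I made up):

firstSeen  <- rep(NA_integer_, sides)  # roll number at which each side first appeared
rollNumber <- 0
while (any(is.na(firstSeen))) {
  rollNumber <- rollNumber + 1
  oneRoll <- sample(1:sides, pool, TRUE)
  firstSeen[oneRoll[is.na(firstSeen[oneRoll])]] <- rollNumber  # only record sides not seen before
}
firstSeen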

Generate 3 random numbers that sum to 1 in R

I am hoping to create 3 (non-negative) quasi-random numbers that sum to one, and repeat over and over.
Basically I am trying to partition something into three random parts over many trials.
While I am aware of
a = runif(3,0,1)
I was thinking that I could use 1-a as the max in the next runif, but it seems messy.
But these of course don't sum to one. Any thoughts, oh wise stackoverflow-ers?
This question involves subtler issues than might be at first apparent. After looking at the following, you may want to think carefully about the process that you are using these numbers to represent:
## My initial idea (and commenter Anders Gustafsson's):
## Sample 3 random numbers from [0,1], sum them, and normalize
jobFun <- function(n) {
  m <- matrix(runif(3*n, 0, 1), ncol=3)
  m <- sweep(m, 1, rowSums(m), FUN="/")
  m
}
## Andrie's solution. Sample 1 number from [0,1], then break the upper
## interval in two. (aka "broken stick" distribution).
andFun <- function(n){
  x1 <- runif(n)
  x2 <- runif(n)*(1-x1)
  matrix(c(x1, x2, 1-(x1+x2)), ncol=3)
}
## ddzialak's solution (vectorized by me)
ddzFun <- function(n) {
  a <- runif(n, 0, 1)
  b <- runif(n, 0, 1)
  rand1 <- pmin(a, b)
  rand2 <- abs(a - b)
  rand3 <- 1 - pmax(a, b)
  cbind(rand1, rand2, rand3)
}
## Simulate 10k triplets using each of the functions above
JOB <- jobFun(10000)
AND <- andFun(10000)
DDZ <- ddzFun(10000)
## Plot the distributions of values
par(mfcol=c(2,2))
hist(JOB, main="JOB")
hist(AND, main="AND")
hist(DDZ, main="DDZ")
Just draw 2 random numbers from (0, 1); if we call them a and b, then you get:
rand1 = min(a, b)
rand2 = abs(a - b)
rand3 = 1 - max(a, b)
When you want to randomly generate numbers that add to 1 (or some other value) then you should look at the Dirichlet Distribution.
There is an rdirichlet function in the gtools package and running RSiteSearch('Dirichlet') brings up quite a few hits that could easily lead you to tools for doing this (and it is not hard to code by hand either for simple Dirichlet distributions).
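A minimal sketch, assuming gtools is installed; alpha = c(1, 1, 1) gives the uniform (flat) Dirichlet over triplets that sum to 1:

library(gtools)
x <- rdirichlet(5, alpha = c(1, 1, 1))  # 5 triplets; each row sums to 1
rowSums(x)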
I guess it depends on what distribution you want on the numbers, but here is one way:
diff(c(0, sort(runif(2)), 1))
Use replicate to get as many sets as you want:
> x <- replicate(5, diff(c(0, sort(runif(2)), 1)))
> x
           [,1]       [,2]      [,3]      [,4]       [,5]
[1,] 0.66855903 0.01338052 0.3722026 0.4299087 0.67537181
[2,] 0.32130979 0.69666871 0.2670380 0.3359640 0.25860581
[3,] 0.01013117 0.28995078 0.3607594 0.2341273 0.06602238
> colSums(x)
[1] 1 1 1 1 1
I would simply draw 3 numbers from the uniform distribution and then divide them by their sum:
n <- 3
x <- runif(n, 0, 1)
y <- x / sum(x)
sum(y) == 1
n could be any number you like.
This problem and the different solutions proposed intrigued me. I did a little test of the three basic algorithms suggested and what average values they would yield for the numbers generated.
choose_one_and_divide_rest
means: [ 0.49999212 0.24982403 0.25018384]
standard deviations: [ 0.28849948 0.22032758 0.22049302]
time needed to fill array of size 1000000 was 26.874945879 seconds
choose_two_points_and_use_intervals
means: [ 0.33301421 0.33392816 0.33305763]
standard deviations: [ 0.23565652 0.23579615 0.23554689]
time needed to fill array of size 1000000 was 28.8600130081 seconds
choose_three_and_normalize
means: [ 0.33334531 0.33336692 0.33328777]
standard deviations: [ 0.17964206 0.17974085 0.17968462]
time needed to fill array of size 1000000 was 27.4301018715 seconds
The time measurements are to be taken with a grain of salt, as they might be influenced more by Python's memory management than by the algorithm itself. I'm too lazy to do it properly with timeit. I ran this on a 1 GHz Atom, which explains why it took so long.
Anyway, choose_one_and_divide_rest is the algorithm suggested by Andrie and by the poster of the question him/herself (AND): you choose one value a in [0,1], then one in [a,1], and then you take what is left. It adds up to one, but that's about it; the first part is on average twice as large as the other two. One might have guessed as much...
choose_two_points_and_use_intervals is the accepted answer by ddzialak (DDZ). It takes two points in the interval [0,1] and uses the size of the three sub-intervals created by these points as the three numbers. Works like a charm and the means are all 1/3.
choose_three_and_normalize is the solution by Anders Gustafsson and Josh O'Brien (JOB). It just generates three numbers in [0,1] and normalizes them back to a sum of 1. Works just as well and surprisingly a little bit faster in my Python implementation. The variance is a bit lower than for the second solution.
There you have it. I have no idea which beta distribution these solutions correspond to, or which set of parameters in the paper I referred to in a comment, but maybe someone else can figure that out.
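If you would rather reproduce the comparison in R than in Python, the three functions defined earlier in this thread (jobFun, andFun, ddzFun) can be reused directly; a quick sanity check of the means might look like this:

set.seed(1)
round(rbind(AND = colMeans(andFun(1e5)),
            DDZ = colMeans(ddzFun(1e5)),
            JOB = colMeans(jobFun(1e5))), 3)
# AND ~ (0.5, 0.25, 0.25); DDZ and JOB ~ (1/3, 1/3, 1/3)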
The simplest solution is the probs() function from the wakefield package: probs(3) will yield a vector of three values with a sum of 1. Given that, you can use rep(probs(3), x), where x is your "over and over" count. No drama.

Political Science Programming Question

I am sorry for being very descriptive here, but I hope you can help me with the following problem (I am trying to program this in R):
Let's say we have an array where rows are parties and columns are the parties' issue positions (measured as the distance from the median issue position across all parties). I want to model parties announcing an issue platform. This goes like this: start with the issue on which the distance from the median issue position is smallest, and announce that platform with probability (1 minus the issue's distance from the median); a party announces an issue as its platform with probability 1 if it is the median party on that issue. If rbinom(1, 1, prob) == 1, the party announces that issue (i.e., that column indicator) as its platform. If rbinom(1, 1, prob) == 0, it moves on to the issue with the second-smallest distance from the median issue position (and draws from a binomial distribution again), and so forth until a platform is announced. All parties go through the same steps to find an issue platform for that run of the model, but parties differ in the issues on which they are closest to the median party.
Would you have advice on how to program such a set-up?
I built a toy model that can compute what you want. Assuming there are n parties and k issues, below I provide code for computing the choice of one party. It should be fairly straightforward to generalize the code to all parties; you just need to add a loop, which I leave as an exercise to you =) (though see the sketch at the end of this answer):
n = 4  # example with n = 4 parties
k = 3  # k = 3 issues
party_position = matrix(runif(k*n), nrow=n, ncol=k) # matrix with party positions on each issue
med = apply(party_position, 2, median)              # compute the median of each column

gen.pos = function(party_position, median=med, k) {
  if (k == 1) {
    # base case: all previous draws were 0, so only one platform is left; pick it
    issue.announcing = which.max(abs(party_position[1,] - median))
    return(issue.announcing)
  } else {
    dif = abs(party_position[1,] - median) # difference between party position and median
    value = min(dif)                       # smallest difference
    pos = which.min(dif)                   # issue with that smallest difference
    decision = rbinom(1, 1, 1 - value)     # announce with probability 1 - difference
    if (decision < 1) {                    # rbinom returned 0: reject this issue and move on
      k = k - 1                                            # one issue fewer to consider
      party_position = party_position[, -pos, drop=FALSE]  # drop the rejected issue
      median = median[-pos]                                # keep the medians aligned with the remaining issues
      return(gen.pos(party_position, median=median, k))    # recurse on the reduced matrix
    } else {                               # decision was 1: pick pos as the issue platform
      issue.announcing = pos
      return(issue.announcing)
    }
  }
}
# note: once columns have been dropped, the returned index refers to the issues
# remaining at that point in the recursion
The function gen.pos will find the party platform for party one (row one). I guess you just need to apply a for loop to generate positions for all parties. Note that the function is recursive, which is, btw, the reason I spent my time on this: I really like writing recursive functions!
ps.: Check my function. It seemed to work here and I think it's correct, but as some people say, 'trust, but check'.
ps.2: the function returns the position (i.e., the column) for party one. If you need the value, not the position, just use:
position.final = gen.pos(party_position, med, k)
platform = party_position[1, position.final]
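The loop over all parties that was left as an exercise could look roughly like this (a sketch: it simply feeds each party's row, as a one-row matrix, into gen.pos):

platforms <- sapply(1:n, function(i)
  gen.pos(party_position[i, , drop = FALSE], median = med, k = k))
platforms  # announced issue (column index) for each party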
