We have the code below for solving an optimization problem in which we want to maximize sales subject to constraints on profit and on the number of items.
We want to apply this profit threshold as a percentage of the revenue generated by the selected 200 items only.
We have done it in Excel Solver by applying a formula on profit to a changing variable, using the GRG Nonlinear algorithm. We want a similar alternative in R.
Is there any way to assign a changing variable in R?
Dataset
item  sales  profit
A      1200     120
B      5600      45
C       450       0
D       990     -90
E      1000      80
F       560     120
G       500      23
H      2000     350
Code
library(lpSolveAPI)
dataset <- read.csv("Dataset.csv", header=T, na.strings='NA', stringsAsFactors=F)
dataset$keep_flag <- 1
# all the functions in lpSolveAPI
ls("package:lpSolveAPI")
summary(dataset)
# Passing the parameters
ncol <- nrow(dataset)
# you have eight rows that can be picked or dropped from the solution set
lp_rowpicker <- make.lp(ncol=ncol)
set.type(lp_rowpicker, columns=1:ncol, type = c("binary"))
# checking the model
lp_rowpicker
# setting objective
obj_vals <- dataset$Revenue_1hr.Projected
#obj_vals <- dataset[, 2]
obj_vals
set.objfn(lp_rowpicker, obj_vals)
lp.control(lp_rowpicker, sense='max')
# Adding constraints
# Profit constraint
xt <- (dataset$Profit_1hr.Projected)
add.constraint(lp_rowpicker, xt, ">=", 100)
xt
# No. of items to be kept
xt <- (dataset$keep_flag)
add.constraint(lp_rowpicker, xt, "=", 4)
xt
# model check
lp_rowpicker
# solving equation
solve(lp_rowpicker)
# Maximised revenue
get.objective(lp_rowpicker)
# The one with binary as 1 is our item
dataset$keep_flag <- get.variables(lp_rowpicker)
final_flag <- dataset$keep_flag
sum(dataset$keep_flag)
final_set <- cbind(dataset, final_flag)
final_set <- final_set[which(final_set$final_flag == 1), ]
final_set$keep_flag <- NULL
final_set$final_flag <- NULL
This code snippet applies the profit threshold across all items rather than only to the items selected by the solver.
Edit
This is the model that was created when I ran @Karsten W.'s code:
C1 C2 C3 C4 C5 C6 C7 C8
Maximize 1200 5600 450 990 1000 560 500 2000
R1 120 45 0 -90 80 120 23 350 >= 100
R2 1 1 1 1 1 1 1 1 = 4
Kind Std Std Std Std Std Std Std Std
Type Int Int Int Int Int Int Int Int
Upper 1 1 1 1 1 1 1 1
Lower 0 0 0 0 0 0 0 0
And the output obtained is:
item sales profit
1 A 1200 120
1.1 A 1200 120
1.2 A 1200 120
1.3 A 1200 120
The same item is returned four times. I want 4 unique items. In addition, I want to apply the profit constraint as a percentage of the sales generated by those 4 items.
By the way, we kept 'keep_flag' to serve a similar function to your 'nitems'. It is a changing variable that takes binary values.
Your code seems OK to me, except that the variable names do not match the dataset you provided. In particular, it is not clear to me what keep_flag stands for; is it some sort of preselection?
The profit constraint in the cleaned-up code below is applied only to the four variables selected by the solver.
Here is your code, a bit cleaned up.
library(lpSolveAPI)
dataset <- data.frame(item=LETTERS[1:8], sales=c(1200, 5600, 450, 990, 1000, 560, 500, 2000), profit=c(120, 45, 0, -90, 80, 120, 23, 350))
nitems <- nrow(dataset)
# make lp
lprec <- make.lp(0, ncol=nitems)
set.type(lprec, columns=seq.int(nitems), type="binary")
# set objective
lp.control(lprec, sense="max", bb.rule="gap", timeout=30)
set.objfn(lprec, obj=dataset[, "sales"])
# constraints
min_rel_profit <- 0.10 # min. 10% profit
add.constraint(lprec, dataset[, "profit"]-min_rel_profit*dataset[,"sales"], ">=", 0) # required profit
add.constraint(lprec, rep(1, nitems), "=", 4) # four products
print(lprec)
solve(lprec)
dataset[get.variables(lprec)==1,]
The profit constraint is derived as follows (p is the vector of profits, s is the vector of sales, x is the 0/1 decision vector, all of length nitems, and minp is the minimum relative profit):
sum(profit) / sum(sales) >= minp translates to p'x / s'x >= minp,
which, since s'x > 0 for any non-empty selection, is equivalent to (p - minp * s)'x >= 0.
Hence the minimum profit has to appear as part of the coefficients on the LHS.
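To see the equivalence on the example data, here is a minimal check (the selection vector sel is an arbitrary illustration; minp is the 10% threshold used above):
sel  <- c(1, 0, 0, 0, 1, 1, 0, 1)             # hypothetical selection: items A, E, F, H
p    <- dataset$profit
s    <- dataset$sales
minp <- 0.10
sum(p[sel == 1]) / sum(s[sel == 1]) >= minp   # ratio form of the constraint
sum((p - minp * s) * sel) >= 0                # linearized form passed to add.constraint
Both expressions return the same logical value for any 0/1 selection with positive total sales.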
If you are encountering long solving times, you can fine-tune the solver parameters; see ?lp.control.options for more details. Use timeout to set a time limit while testing. For this kind of problem (a MIP) the bb.rule parameter is helpful. Given your example data, a solution for a 9.5% threshold was found in less than one second.
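As a minimal sketch (the specific gap values below are illustrative only), the branch-and-bound tolerances can be loosened like this:
# allow a 1% relative MIP gap and stop after 30 seconds; values are examples, not recommendations
lp.control(lprec, bb.rule = "gap", mip.gap = c(1e-9, 0.01), timeout = 30)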
I would look at a few options and choose the best:
lpSolve https://cran.r-project.org/web/packages/lpSolve/lpSolve.pdf
This is a simple linear solver. It is pretty similar to lpSolveAPI, but I find it much easier to use (a minimal sketch with it follows this list).
minqa https://cran.r-project.org/web/packages/minqa/minqa.pdf
This is a derivative-free optimizer based on quadratic approximation that works mostly for nonlinear problems.
Gurobi http://www.gurobi.com/products/modeling-languages/r
This is a commercial solver (free for academic use), comparable to IBM's CPLEX. Very good and capable.
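For the lpSolve package mentioned above, a minimal sketch of the same item-selection model could look like this (the 10% profit threshold and the four-item requirement are taken from the earlier example):
library(lpSolve)
sales  <- c(1200, 5600, 450, 990, 1000, 560, 500, 2000)
profit <- c(120, 45, 0, -90, 80, 120, 23, 350)
minp   <- 0.10
res <- lp(direction    = "max",
          objective.in = sales,
          const.mat    = rbind(profit - minp * sales,   # linearized profit constraint
                               rep(1, length(sales))),  # exactly four items
          const.dir    = c(">=", "="),
          const.rhs    = c(0, 4),
          all.bin      = TRUE)
res$objval                 # maximized sales
which(res$solution == 1)   # indices of the selected items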
Related
I am trying to maximize a log-likelihood function to get coefficients for a conditional logit model. I have a big data frame with about 9M rows (300k choice sets) and about 40 parameters to be estimated. It looks like this:
ChoiceSet Choice SKU Price Caramel etc.
1 1 1234 1.0 1 ...
1 0 145 2.0 1 ...
1 0 5233 2.0 0 ...
2 0 1432 1.5 1 ...
2 0 5233 2.0 0 ...
2 1 8320 2.0 0 ...
3 0 1234 1.5 1 ...
3 1 145 1.0 1 ...
3 0 8320 1.0 0 ...
Here ChoiceSet is the set of products available in the store at the moment of purchase, and Choice=1 when the SKU is chosen.
Since choice sets can vary, I use the log-likelihood function:
library(data.table)
library(foreach)  # %dopar% also requires a registered parallel backend, e.g. doParallel
clogit.ll <- function(beta,X) { #### This is a function to be maximized
X <- as.data.table(X)
setkey(X,ChoiceSet,Choice)
sum((as.matrix(X[J(t(as.vector(unique(X[,1,with=F]))),1),3:ncol(X),with=F]))%*%beta)-
sum(foreach(chset=unique(X[,list(ChoiceSet)])$ChoiceSet, .combine='c', .packages='data.table') %dopar% {
Z <- as.matrix(X[J(chset,0:1),3:ncol(X), with=F])
Zb <- Z%*%beta
e <- exp(Zb)
log(sum(e))
})
}
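For reference, the standard conditional logit log-likelihood that this function computes (assuming exactly one chosen alternative per choice set) is, in LaTeX notation:
\log L(\beta) = \sum_{s} \left( x_{c(s)}^{\top}\beta - \log \sum_{j \in s} \exp\left(x_{j}^{\top}\beta\right) \right)
where s indexes choice sets, x_j is the attribute vector of alternative j, and c(s) is the chosen alternative in set s; the first term corresponds to the first sum in the code and the log-sum-exp term to the %dopar% loop.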
Create new data frame without SKU (it's not needed) and zero vector:
X0 <- Data[,-3]
b0 <- rep(0,ncol(X0)-2)
I maximize this function with the help of the maxLik package, where I supply the gradient to make the calculation faster:
grad.clogit.ll <- function(beta,X) { ###It is a gradient of likelihood function
X <- as.data.table(X)
setkey(X,ChoiceSet,Choice)
colSums(foreach(chset=unique(X[,list(ChoiceSet)])$ChoiceSet, .combine='rbind',.packages='data.table') %dopar% {
Z <- as.matrix(X[J(chset,0:1),3:ncol(X), with=F])
Zb <- Z%*%beta
e <- exp(Zb)
as.vector(X[J(chset,1),3:ncol(X),with=F]-t(as.vector(X[J(chset,0:1),3:ncol(X),with=F]))%*%(e/sum(e)))
})
}
The maximization call is the following:
fit <- maxLik(logLik = clogit.ll, grad = grad.clogit.ll, start=b0, X=X0, method="NR", tol=10^(-6), iterlim=100)
Generally, it works fine for small samples, but takes too long for big ones:
Number of choice sets   Duration of computation
300                     4.5 min
400                     10.5 min
1000                    25 min
But when I do it for 5,000+ choice sets, R terminates the session.
So (if you are still reading) how can I maximize this function if I have 300,000+ choice sets and 1.5 weeks to finish my coursework? Please help, I have no idea.
I have a simple (indeed standard in economics) nonlinear constrained discrete maximisation problem to solve in R and am having trouble. I found solutions for parts of the problem (nonlinear maximisation; discrete maximisation) but not for the union of all the problems.
Here is the problem. A consumer wants to buy three products (ananas, banana, cookie), knows the prices and has a budget of 20€. He likes variety (i.e., he wants to have all three products if possible) and his satisfaction is decreasing in the amount consumed (he likes his first cookie way more than his 100th).
The function he wishes to maximise is
u(a, b, c) = sqrt(a) * sqrt(b) * sqrt(c)
where a, b and c are the quantities of ananas, bananas and cookies bought, and of course, since each product has a price and he has a limited budget, he maximises this function under the constraint that
p_a * a + p_b * b + p_c * c <= 20
where p_a, p_b and p_c are the prices given below.
What I want to do is to find the optimal buying list (N ananas, M bananas, K cookies) that satisfies the constraint.
If the problem were linear, I would simply use linprog::solveLP(). But the objective function is nonlinear.
If the problem were of a continuous nature, there would be a simple analytic solution to it.
The question being discrete and nonlinear, I do not know how to proceed.
Here is some toy data to play with.
df <- data.frame(rbind(c("ananas",2.17),c("banana",0.75),c("cookie",1.34)))
names(df) <- c("product","price")
I'd like to have an optimization routine that gives me an optimal buying list of (N,M,K).
Any hints?
1) no packages This can be done by brute force. Using df from the question as input, ensure that price is numeric (it is a factor in the df of the question) and calculate the largest feasible count mx for each product. Then create a grid g of counts and compute the total price and the associated objective for each combination, giving gg. Now sort gg in descending order of objective and keep those solutions satisfying the budget constraint. head will show the top few solutions.
price <- as.numeric(as.character(df$price))
mx <- ceiling(20/price)
g <- expand.grid(ana = 0:mx[1], ban = 0:mx[2], cook = 0:mx[3])
gg <- transform(g, total = as.matrix(g) %*% price, objective = sqrt(ana * ban * cook))
best <- subset(gg[order(-gg$objective), ], total <= 20)
giving:
> head(best) # 1st row is best soln, 2nd row is next best, etc.
ana ban cook total objective
1643 3 9 5 19.96 11.61895
1929 3 7 6 19.80 11.22497
1346 3 10 4 19.37 10.95445
1611 4 6 5 19.88 10.95445
1632 3 8 5 19.21 10.95445
1961 2 10 6 19.88 10.95445
2) dplyr This can also be nicely expressed using the dplyr package. Using g and price from above:
library(dplyr)
g %>%
mutate(total = c(as.matrix(g) %*% price), objective = sqrt(ana * ban * cook)) %>%
filter(total <= 20) %>%
arrange(desc(objective)) %>%
top_n(6)
giving:
Selecting by objective
ana ban cook total objective
1 3 9 5 19.96 11.61895
2 3 7 6 19.80 11.22497
3 3 10 4 19.37 10.95445
4 4 6 5 19.88 10.95445
5 3 8 5 19.21 10.95445
6 2 10 6 19.88 10.95445
If you do not mind using a "by hand" solution:
uf <- function(x) prod(x)^.5     # utility of a bundle
bf <- function(x, pr) {          # budget (cost) of a bundle, or of each row of a matrix of bundles
  if (!is.null(dim(x))) apply(x, 1, bf, pr) else x %*% pr
}
budget <- 20
df <- data.frame(product=c("ananas","banana","cookie"),
                 price=c(2.17,0.75,1.34), stringsAsFactors = FALSE)
an <- 0:(budget/df$price[1])  # include 0 for all possibilities
bn <- 0:(budget/df$price[2])
co <- 0:(budget/df$price[3])
X <- expand.grid(an, bn, co)
colnames(X) <- df$product
EX <- apply(X, 1, bf, pr=df$price)
psX <- X[which(EX <= budget), ]                          # 1st restriction: within budget
psX <- psX[apply(psX, 1, function(z) sum(z==0)) == 0, ]  # 2nd restriction: at least one of each product
Ux <- apply(psX, 1, uf)
cbind(psX, Ux)
(sol <- psX[which.max(Ux), ])
uf(sol)            # utility
bf(sol, df$price)  # budget
> (sol=psX[which.max(Ux),])
ananas banana cookie
1444 3 9 5
> uf(sol) # utility
[1] 11.61895
> bf(sol,df$price) #budget
1444
19.96
I think this problem is very similar in nature to this question (Solve indeterminate equation system in R). The answer by Richie Cotton was the basis for this possible solution:
df <- data.frame(product=c("ananas","banana","cookie"),
price=c(2.17,0.75,1.34),stringsAsFactors = F)
FUN <- function(w, price=df$price){
total <- sum(price * w)
errs <- c((total-20)^2, -(sqrt(w[1]) * sqrt(w[2]) * sqrt(w[3])))
sum(errs)
}
init_w <- rep(10,3)
res <- optim(init_w, FUN, lower=rep(0,3), method="L-BFGS-B")
res
res$par # 3.140093 9.085182 5.085095
sum(res$par*df$price) # 20.44192
Notice that the total cost (i.e. price) for this solution is 20.44, which exceeds the budget of 20. To address this, we can weight the error terms to put more emphasis on the first term, which relates to the total cost:
### weighting of error terms
FUN2 <- function(w, price=df$price){
total <- sum(price * w)
errs <- c(100*(total-20)^2, -(sqrt(w[1]) * sqrt(w[2]) * sqrt(w[3]))) # 1st term weighted by 100
sum(errs)
}
init_w <- rep(10,3)
res <- optim(init_w, FUN2, lower=rep(0,3), method="L-BFGS-B")
res
res$par # 3.072868 8.890832 4.976212
sum(res$par*df$price) # 20.00437
As LyzandeR remarked there is no nonlinear integer programming solver available in R. Instead, you can use the R package rneos that sends data to one of the NEOS solvers and returns the results into your R process.
Select one of the solvers for "Mixed Integer Nonlinearly Constrained Optimization" on the NEOS Solvers page, e.g., Bonmin or Couenne. For your example above, send the following files in the AMPL modeling language to one of these solvers:
[Note that maximizing the product x1 * x2 * x3 is the same as maximizing the product sqrt(x1) * sqrt(x2) * sqrt(x3).]
Model file:
param p{i in 1..3};
var x{i in 1..3} integer >= 1;
maximize profit: x[1] * x[2] * x[3];
subject to restr: sum{i in 1..3} p[i] * x[i] <= 20;
Data file:
param p:= 1 2.17 2 0.75 3 1.34 ;
Command file:
solve;
display x;
and you will receive the following solution:
x [*] :=
1 3
2 9
3 5
;
This approach will work for more extended examples where solutions "by hand" are not reasonable and rounded optim solutions are not correct.
To look at a more demanding example, let me propose the following problem:
Find an integer vector x = (x_i), i=1,...,10, that maximizes x1 * ... * x10, such that p1*x1 + ... + p10*x10 <= 10, where p = (p_i), i=1,...,10, is the following price vector
p <- c(0.85, 0.22, 0.65, 0.73, 0.91, 0.11, 0.31, 0.47, 0.93, 0.71)
Using constrOptim for this nonlinear optimization problem with a linear inequality constraint, I get solutions like 900 for different starting points, but never the optimal solution, which is 960!
I just started learning genetic algorithms in R using this example and I've run into an interesting problem trying to apply it. I have a dataset of shops, supply centers, and distance (miles) between shops and supply centers in the data frame dataset:
Shop Center Distance DistanceSave
A 1 700 300
A 2 200 800
A 3 300 700
B 1 400 600
B 2 100 900
B 3 150 850
C 1 600 400
C 2 500 500
C 3 200 800
I'm trying to minimize Distance (or, equivalently, maximize DistanceSave, which is 1000 minus Distance) subject to the constraint that each shop must be tied to a center, and I'm having trouble coding that last part:
#Generate Evaluation Function
library(genalg)
evalFunc <- function(x) {
  current_solution_savings <- x %*% dataset$DistanceSave
  current_solution_shop <- length(unique(dataset$Shop[x==1]))
  #Set Conditions in Function
  if (current_solution_shop != length(unique(dataset$Shop)))
    return(0) else return(-current_solution_savings)
}
#Run GA Algorithm with 100 iterations
iter = 100
genes <- nrow(dataset)  # one binary gene per shop/center row
GAmodel <- rbga.bin(size = genes, popSize = 200, iters = iter, mutationChance = 0.01,
    elitism = T, evalFunc = evalFunc)
cat(summary.rbga(GAmodel))
I thought the current_solution_shop != length(unique(dataset$Shop)) condition would be enough, but unfortunately it's not; sometimes it still assigns the same shop twice.
EDIT: It looks like the Facility Location Problem is what I need to research, but can anyone recommend a multi-facility approach for R or Python?
If you're trying to assign each shop to exactly one center, and aren't allowed to assign multiple shops to a particular center, then this is called the assignment problem, and it can be solved exactly and efficiently using linear programming.
Here's an approach using the lp.assign function from the lpSolve package:
# Cost matrix (centers as rows, shops as columns)
(dist.mat <- matrix(dataset$Distance, nrow=3))
# [,1] [,2] [,3]
# [1,] 700 400 600
# [2,] 200 100 500
# [3,] 300 150 200
# Solve the assignment problem
library(lpSolve)
sol <- lp.assign(dist.mat, "min")
sol$solution
# [,1] [,2] [,3]
# [1,] 0 1 0
# [2,] 1 0 0
# [3,] 0 0 1
sol$objval
# [1] 800
The optimal solution assigns store A to center 2, store B to center 1, and store C to center 3, with cost 800.
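As a small follow-up sketch (assuming, as in the cost matrix above, centers as rows and shops as columns), the 0/1 solution matrix can be turned into a readable assignment table:
idx <- which(sol$solution == 1, arr.ind = TRUE)   # row = center, col = shop
data.frame(shop = LETTERS[idx[, "col"]], center = idx[, "row"])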
I am using lpSolveAPI in RStudio. When I type the name of a model with few decision variables, I get a printout of the current constraints in the model. For example
> lprec
Model name:
COLONE COLTWO COLTHREE COLFOUR
Minimize 1 3 6.24 0.1
THISROW 0 78.26 0 2.9 >= 92.3
THATROW 0.24 0 11.31 0 <= 14.8
LASTROW 12.68 0 0.08 0.9 >= 4
Type Real Real Real Real
Upper Inf Inf Inf 48.98
Lower 28.6 0 0 18
But when I make a model that has more than 9 decision variables, it no longer gives the full summary and I instead see:
> lprec
Model name:
a linear program with 13 decision variables and 258 constraints
Does anyone know how I can see the same detailed summary of the model when there are large numbers of decision variables?
Bonus Question: Is RStudio the best console for working with R?
Here is an example:
>lprec <- make.lp(0,5)
This makes a new model called lprec, with 0 constraints and 5 variables. Even if you call the name now you get:
>lprec
Model name:
C1 C2 C3 C4 C5
Minimize 0 0 0 0 0
Kind Std Std Std Std Std
Type Real Real Real Real Real
Upper Inf Inf Inf Inf Inf
Lower 0 0 0 0 0
The C columns correspond to the 5 variables. Right now there are no constraints and the objective function is 0.
You can add a constraint with
>add.constraint(lprec, c(1,3,4,2,-8), "<=", 0)
This is the constraint C1 + 3*C2 + 4*C3 + 2*C4 - 8*C5 <= 0. Now the print out is:
Model name:
C1 C2 C3 C4 C5
Minimize 0 0 0 0 0
R1 1 3 4 2 -8 <= 0
Kind Std Std Std Std Std
Type Real Real Real Real Real
Upper Inf Inf Inf Inf Inf
Lower 0 0 0 0 0
Anyway, the point is that no matter how many constraints there are, if there are more than 9 variables then I don't get the full printout.
>lprec <- make.lp(0,15)
>lprec
Model name:
a linear program with 15 decision variables and 0 constraints
Write it out to a file for examination
When I work with LPs using lpSolveAPI, I prefer to write them out to a file. The lp format works fine for my needs. I then examine the LP model using any text editor. If you click on the output file in the "Files" panel in RStudio, it will open it too, and you can inspect it.
write.lp(lprec, "lpfilename.lp", "lp") #write it to a file in LP format
You can also write it out as MPS format if you so choose.
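For example (the filename is arbitrary):
write.lp(lprec, "lpfilename.mps", "mps")   # or type = "freemps" for free-format MPS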
Here's the help file on write.lp().
Hope that helps.
Since it is an S3 object of class lpExtPtr, the function called to display it is print.lpExtPtr. If you check its code, you will see that it displays the object differently depending on its size -- details for very big objects would not be very useful. Unfortunately, the threshold cannot be changed.
class(r)
# [1] "lpExtPtr"
print.lpExtPtr
# function (x, ...)
# {
# (...)
# if (n > 8) {
# cat(paste("Model name: ", name.lp(x), "\n", " a linear program with ",
# n, " decision variables and ", m, " constraints\n",
# sep = ""))
# return(invisible(x))
# }
# (...)
You can access the contents of the object with the various get.* functions,
as the print method does.
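A minimal sketch of that approach (the model here is a throwaway example; the exact output shapes may differ from what print() shows):
library(lpSolveAPI)
lprec <- make.lp(0, 15)
add.constraint(lprec, rep(1, 15), "<=", 10)
get.constr.type(lprec)   # direction of each constraint
get.rhs(lprec)           # right-hand-side values
get.bounds(lprec)        # variable lower/upper bounds
get.mat(lprec, 1, 3)     # coefficient of variable 3 in constraint 1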
Alternatively, you can just change the print method.
# A function to modify functions
patch <- function( f, before, after ) {
f_text <- capture.output(dput(f))
g_text <- gsub( before, after, f_text )
g <- eval( parse( text = g_text ) )
environment(g) <- environment(f)
g
}
# Sample data
library(lpSolveAPI)
r <- make.lp(0,5)
r # Shows the details
r <- make.lp(0,20)
r # Does not show the details
# Set the threshold to 800 variables instead of 8
print.lpExtPtr <- patch( print.lpExtPtr, "8", "800" )
r # Shows the details
I asked this question a year ago and got code for this "probability heatmap":
numbet <- 32
numtri <- 1e5
prob=5/6
#Fill a matrix
xcum <- matrix(NA, nrow=numtri, ncol=numbet+1)
for (i in 1:numtri) {
x <- sample(c(0,1), numbet, prob=c(prob, 1-prob), replace = TRUE)
xcum[i, ] <- c(i, cumsum(x)/cumsum(1:numbet))
}
colnames(xcum) <- c("trial", paste("bet", 1:numbet, sep=""))
mxcum <- reshape(data.frame(xcum), varying=1+1:numbet,
idvar="trial", v.names="outcome", direction="long", timevar="bet")
library(plyr)
mxcum2 <- ddply(mxcum, .(bet, outcome), nrow)
mxcum3 <- ddply(mxcum2, .(bet), summarize,
ymin=c(0, head(seq_along(V1)/length(V1), -1)),
ymax=seq_along(V1)/length(V1),
fill=(V1/sum(V1)))
head(mxcum3)
library(ggplot2)
p <- ggplot(mxcum3, aes(xmin=bet-0.5, xmax=bet+0.5, ymin=ymin, ymax=ymax)) +
geom_rect(aes(fill=fill), colour="grey80") +
scale_fill_gradient("Outcome", formatter="percent", low="red", high="blue") +
scale_y_continuous(formatter="percent") +
xlab("Bet")
print(p)
(May need to change this code slightly because of this)
This is almost exactly what I want, except that each vertical shaft should have a different number of bins, i.e. the first should have 2, the second 3, the third 4 (N+1). In the graph, shafts 6 and 7 have the same number of bins (7), whereas shaft 7 should have 8 (N+1).
If I'm right, the reason the code does this is that it bins the observed data, and if I ran more trials we would get more bins. I don't want to rely on the number of trials to get the correct number of bins.
How can I adapt this code to give the correct number of bins?
I have used R's dbinom to generate the frequency of heads for n=1:32 trials and plotted the graph below; it is what you would expect. I have read some of your earlier posts here on SO and on math.stackexchange, but I still don't understand why you'd want to simulate the experiment rather than generate from a binomial R.V. If you could explain it, that would be great! I'll try to work on the simulated solution from @Andrie to check whether I can match the output shown below. For now, here's something you might be interested in.
set.seed(42)
numbet <- 32
numtri <- 1e5
prob=5/6
require(plyr)
out <- ldply(1:numbet, function(idx) {
outcome <- dbinom(idx:0, size=idx, prob=prob)
bet <- rep(idx, length(outcome))
N <- round(outcome * numtri)
ymin <- c(0, head(seq_along(N)/length(N), -1))
ymax <- seq_along(N)/length(N)
data.frame(bet, fill=outcome, ymin, ymax)
})
require(ggplot2)
p <- ggplot(out, aes(xmin=bet-0.5, xmax=bet+0.5, ymin=ymin, ymax=ymax)) +
geom_rect(aes(fill=fill), colour="grey80") +
scale_fill_gradient("Outcome", low="red", high="blue") +
xlab("Bet")
The plot:
Edit: Explanation of how your old code from Andrie works and why it doesn't give what you intend.
Basically, what Andrie did (or rather, one way to look at it) is to use the idea that if you have two binomial distributions, X ~ B(n, p) and Y ~ B(m, p), where n, m = size and p = probability of success, then their sum X + Y ~ B(n + m, p) (1). So, the purpose of xcum is to obtain the outcome for all n = 1:32 tosses, but to explain it better, let me construct the code step by step. Along with the explanation, the code for xcum will also become obvious, and it can be constructed in no time (without any need for a for-loop or for constructing a cumsum every time).
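A quick simulated check of property (1) (the sizes 3 and 4 and the seed are arbitrary choices):
set.seed(1)
x <- rbinom(1e5, size = 3, prob = 1/6)
y <- rbinom(1e5, size = 4, prob = 1/6)
round(table(x + y) / 1e5, 3)                 # empirical distribution of the sum (rare high counts may be absent)
round(dbinom(0:7, size = 7, prob = 1/6), 3)  # theoretical B(7, 1/6) probabilities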
If you have followed me so far, then our idea is first to create a numtri * numbet matrix, with each column (length = numtri) having 0's and 1's with probability 5/6 and 1/6 respectively. That is, if you have numtri = 1000, then you'll have ~834 0's and ~166 1's in each of the numbet columns (= 32 here). Let's construct this and test it first.
numtri <- 1e3
numbet <- 32
set.seed(45)
xcum <- t(replicate(numtri, sample(0:1, numbet, prob=c(5/6,1/6), replace = TRUE)))
# check for count of 1's
> apply(xcum, 2, sum)
[1] 169 158 166 166 160 182 164 181 168 140 154 142 169 168 159 187 176 155 151 151 166
163 164 176 162 160 177 157 163 166 146 170
# So, the count of 1's are "approximately" what we expect (around 166).
Now, each of these columns is a sample (of size numtri) from a binomial distribution with n = 1. If we were to add the first two columns and replace the second column with this sum, then, from (1), since the probabilities are equal, we'll end up with a binomial distribution with n = 2. Similarly, if instead you had added the first three columns and replaced the 3rd column by this sum, you would have obtained a binomial distribution with n = 3, and so on...
The concept is that if you cumulatively add each column, then you end up with numbet number of binomial distributions (1 to 32 here). So, let's do that.
xcum <- t(apply(xcum, 1, cumsum))
# you can verify that the second column has similar probabilities by this:
# calculate the frequency of all values in 2nd column.
> table(xcum[,2])
0 1 2
694 285 21
> round(numtri * dbinom(2:0, 2, prob=5/6))
[1] 694 278 28
# more or less identical, good!
If you divide the xcum we have generated thus far by cumsum(1:numbet) over each row in this manner:
xcum <- xcum/matrix(rep(cumsum(1:numbet), each=numtri), ncol = numbet)
it will be identical to the xcum matrix that comes out of the for-loop (if you generate it with the same seed). However, I don't quite understand the reason for this division by Andrie, as it is not necessary to generate the graph you require. I suppose it has something to do with the frequency values you talked about in an earlier post on math.stackexchange.
Now on to why you have difficulties obtaining the graph I had attached (with n+1 bins):
For a binomial distribution with n = 1:32 trials, 5/6 as the probability of tails (failure) and 1/6 as the probability of heads (success), the probability of k heads in n trials is given by:
choose(n, k) * (1/6)^k * (5/6)^(n-k)
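A quick check of this formula against dbinom (the values of n and k here are arbitrary examples):
n <- 7; k <- 2
choose(n, k) * (1/6)^k * (5/6)^(n - k)   # closed form, ~0.234
dbinom(k, size = n, prob = 1/6)          # same value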
For the test data we've generated, for n=7 and n=8 (trials), the observed proportions of k heads are:
# n=7
0 1 2 3 4 5
.278 .394 .233 .077 .016 .002
# n=8
0 1 2 3 4 5
.229 .375 .254 .111 .025 .006
Why do they both have 6 bins, rather than 8 and 9 bins respectively? Of course, this has to do with the value of numtri = 1000. To understand why this happens, let's look at the probabilities of each of these 8 and 9 bins by generating them directly from the binomial distribution using dbinom.
# n = 7
dbinom(7:0, 7, prob=5/6)
# output rounded to 3 decimal places
[1] 0.279 0.391 0.234 0.078 0.016 0.002 0.000 0.000
# n = 8
dbinom(8:0, 8, prob=5/6)
# output rounded to 3 decimal places
[1] 0.233 0.372 0.260 0.104 0.026 0.004 0.000 0.000 0.000
You see that the probabilities corresponding to k=6,7 (for n=7) and k=6,7,8 (for n=8) are ~0. They are very low in value; the minimum here is about 5.95 * 1e-7 (n=8, k=8). This means you would expect to see that outcome only about once in every 1.7 million simulations. If you check the same for n=32 and k=32, the value is 1.256493 * 1e-25. So, you'd have to simulate that many values to get at least 1 result where all 32 outcomes are heads for n=32.
This is why your results were not having values for certain bins because the probability of having it is very low for the given numtri. And for the same reason, generating the probabilities directly from the binomial distribution overcomes this problem/limitation.
I hope I've managed to write with enough clarity for you to follow. Let me know if you have trouble going through it.
Edit 2:
When I simulate the code I've just edited above with numtri = 1e6, I get this for n=7 and n=8, counting the number of heads for k=0:7 and k=0:8:
# n = 7
0 1 2 3 4 5 6 7
279347 391386 233771 77698 15763 1915 117 3
# n = 8
0 1 2 3 4 5 6 7 8
232835 372466 259856 104116 26041 4271 392 22 1
Note that there are now counts for k=6 and k=7 for n=7 and n=8. Also, for n=8, you have a value of 1 for k=8. With increasing numtri you'll obtain more of the other missing bins, but it will require a huge amount of time/memory (if it is feasible at all).