Not sure what is wrong here. I'm building the model along with the example:
LBsPD <- c()
for (i in 1:5000) {
FishCaught <- sample(x=c(7,4,2), size=1, prob = c(.1,.6,.3),replace = TRUE)
YellowPercent <- sample(x=c(0,.25,.35), size=1, prob = c(.25,.5,.25),replace = TRUE)
BluePercent <- 1-YellowPercent
BlueLBs <- rnorm(n=365, mean=35, sd=18)
YellowLBs <- rnorm(n=365, mean=30, sd=18)
LBsPerDay <- FishCaught * ((BluePercent * BlueLBs[BlueLBs > 20]) + (YellowPercent * YellowLBs[YellowLBs > 20]))
LBsPD[i] <- LBsPerDay
}
I keep getting 50+ of the warning "number of items to replace is not a multiple of replacement length", but in the example it is the same.
Here's a drawn-out explanation that @AndrewChisholm started.
I'll start by (1) setting the random seed, so you can repeat this in your console, and (2) stepping through the for loop's first iteration.
set.seed(42) # since this is a random process
i <- 1 # first pass in the loop
FishCaught <- sample(x=c(7,4,2), size=1, prob = c(.1,.6,.3),replace = TRUE)
YellowPercent <- sample(x=c(0,.25,.35), size=1, prob = c(.25,.5,.25),replace = TRUE)
BluePercent <- 1-YellowPercent
BlueLBs <- rnorm(n=365, mean=35, sd=18)
YellowLBs <- rnorm(n=365, mean=30, sd=18)
Now, let's look at the components of your next expression:
# FishCaught * ((BluePercent * BlueLBs[BlueLBs > 20]) +
# (YellowPercent * YellowLBs[YellowLBs > 20]))
str(BlueLBs[BlueLBs > 20])
# num [1:291] 24.8 41.5 46.4 42.3 33.1 ...
str(YellowLBs[YellowLBs > 20])
# num [1:255] 64.1 22.5 36.3 59.3 31.6 ...
It doesn't matter here that BluePercent is 1 and YellowPercent is 0, since 0*somevec still has the length of the vector, so you are effectively trying to add vectors of different lengths. What does this mean to you?
c(1,3,5,7,9) + c(1,1000,1)
# Warning in c(1, 3, 5, 7, 9) + c(1, 1000, 1) :
# longer object length is not a multiple of shorter object length
# [1] 2 1003 6 8 1009
The bigger problem here is that R does not consider this a problem: it warns you that it is suspect, but it happily "recycles" the values for you. So this is not what is causing your error.
LBsPerDay <- FishCaught * ((BluePercent * BlueLBs[BlueLBs > 20]) + (YellowPercent * YellowLBs[YellowLBs > 20]))
# Warning in (BluePercent * BlueLBs[BlueLBs > 20]) + (YellowPercent * YellowLBs[YellowLBs > :
# longer object length is not a multiple of shorter object length
str(LBsPerDay)
# num [1:291] 174 291 325 296 232 ...
This is not a syntax error, but you should treat it as a "data is now corrupt" error: you really don't know for certain which numbers were added to or multiplied by which other numbers. (See the previous example with c(1,1000,1): if we think things should be aligned but they aren't, the results will lead to incorrect logical conclusions from this process.)
Here's the real problem:
LBsPD[i] <- LBsPerDay
# Warning in LBsPD[i] <- LBsPerDay :
# number of items to replace is not a multiple of replacement length
First, some clarification:
This is a Warning, not an Error. The only way I get an error from it is if I had previously set options(warn=2) (which, by the way, is not a bad idea here). Warnings can often be ignored if you expect them, but in this case you should really pay attention to it and treat it as an error.
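For reference, the one-liner that promotes warnings to errors for the current session (handy while debugging this kind of problem):
options(warn = 2)   # every warning now stops execution as an error
# options(warn = 0) # restores the default behaviour when you're done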
LBsPerDay is length 291, but you are trying to cram 291 numbers into one position of the vector, LBsPD[i]. That is, the length of the LHS using [i] is always the length of i, which is 1, whereas the length of the RHS is (in this case) 291.
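A minimal reproduction of that warning, using made-up values just for illustration:
x <- numeric(5)     # a small numeric vector standing in for LBsPD
x[1] <- c(1, 2, 3)  # try to put 3 numbers into 1 slot
# Warning message:
# In x[1] <- c(1, 2, 3) :
#   number of items to replace is not a multiple of replacement length
x
# [1] 1 0 0 0 0     # only the first value survives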
Options:
I'm inferring that your BlueLBs[BlueLBs > 20] might be a filter so that fish below 20 (pounds? kilos? grams?) are not "caught". In that case, let's just replace those at or under 20 with 0 ... but please check me on this logic ... a blue/yellow at or below 20 is changed to 0, effectively "not caught":
LBsPerDay <- FishCaught * ((BluePercent * replace(BlueLBs, BlueLBs <= 20, 0)) + (YellowPercent * replace(YellowLBs, YellowLBs <= 20, 0)))
str(LBsPerDay)
# num [1:365] 174 291 325 296 232 ...
(No warning, no error.)
If you intend LBsPD to contain all of the weights for each iteration in your simulation, then start with LBsPD <- list(), in which case you'll eventually use
LBsPD <- list()
for (i in 1:5000) {
# ...
LBsPD[[i]] <- LBsPerDay
}
where after 3 (of 5000) iterations, your LBsPD looks like:
str(LBsPD)
# List of 3
# $ : num [1:365] 174 291 325 296 232 ...
# $ : num [1:365] 160.7 97 161.5 145 99.6 ...
# $ : num [1:365] 30.3 121.4 111.7 210.8 139.7 ...
BTW, you might notice that both BlueLBs and YellowLBs contain negatives ... not sure if that's a problem, but negative pounds seem suspect. (Because normal distributions are by definition unbounded, many things labeled "normally distributed" really aren't once you get into the tails. For quick simulations like this, I often fall back on a triangular distribution, which may be normal-enough for some applications and never produces negative or extremely large weights.)
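As an illustration (not part of the original model), here's a minimal sketch of a symmetric triangular draw built from two uniforms; the 5 and 65 lb bounds are made-up values chosen only so the mode sits at 35:
# Hedged sketch: symmetric triangular distribution on [lo, hi] as the sum of
# two uniforms; lo = 5 and hi = 65 are illustrative bounds, not model values.
rtri <- function(n, lo = 5, hi = 65) {
  runif(n, lo / 2, hi / 2) + runif(n, lo / 2, hi / 2)
}
BlueLBs <- rtri(365)   # never negative, never absurdly large
range(BlueLBs)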
Related
I have a dataframe that contains 7 columns.
str(df)
'data.frame': 8760 obs. of 7 variables:
$ G1_d20_2014.SE1_ : num 25.1 25.1 25 25 25.1 ...
$ G1_d20_2014.SE4_ : num 42.4 42.3 42.3 42.3 42.3 ...
$ G1_d20_2014.SE7_ : num 34.4 34.4 34.4 34.4 34.4 ...
$ G1_d20_2014.SE22_: num 42.5 42.4 42.3 42.4 42.3 ...
$ G1_d20_2014.SE14_: num 52.5 52.5 52.5 52.5 52.4 ...
$ G1_d20_2014.SE26 : num 40.8 40.8 40.8 40.8 40.8 ...
Each column represents a unique sensor and contains that sensor's measurement data. Some of the columns contain missing values. I want to fill the data gaps in each column by linear regression. I already did this manually, but there is one very important condition, and I'm looking for a function that handles it on its own, because doing this for every column by hand would take too much time. Here's the condition:
Let's say G1_d20_2014_SE1 contains missing data. Then I want to fill the gaps in that sensor with data from the complete sensor whose correlation coefficient with it is highest.
Here is how I did that manually:
I created a function that builds an indicator variable: it is 1 if the value is not NA and 0 if it is NA. Then I added this variable as a column to the dataset:
Indvar <- function(t) {
x <- dim(length(t))
x[which(!is.na(t))] = 1
x[which(is.na(t))] = 0
return(x)
}
df$I <- Indvar(df$G1_d20_2014.SE1_)
Next I checked which sensor has the highest correlation coefficient with sensor 1 (in this case the correlation is highest between SE1 and SE14). Then I computed the linear regression, took the equation from it, and put it into a for loop that fills the NA values according to the equation whenever the indicator variable is 0:
lm(df$G1_d20_2014.SE1_ ~ df$G1_d20_2014.SE14_, data = df)
for (i in 1:nrow(df)) {
if (df$I[i] == 0)
{
df$G1_d20_2014.SE1_[i] = 8.037 + 0.315*df$G1_d20_2014.SE14_[i]
}
}
This works perfectly fine, but it takes too much time because I have a lot of dataframes that look like the one above.
I already tried using impute_lm from the simputation package but unfortunately it does not seem to care about where the correlation is highest before filling the data gaps. Here is what I wrote:
impute_fun <- impute_lm(df,
formula = SE1_ + SE4_ ~ SE14_ + SE26)
As I wrote SE14_ + SE26_, I checked whether it uses the values from SE14 for imputing the values in SE1, but it doesn't, as the result differs from my manual result.
Is there any function that does what I want? I'm really frustrated because I've been looking for this for over 2 weeks now. I'd really really appreciate some help!
EDIT/Answer to @jay.sf
So I tried to make a function (see below) out of it, but there's something I struggle with:
I don't know how to specify in the function that I want to do this for every column, and that it should remove the name of the sensor I want to fill from the sapply(c("SE1_", "SE2_", ...)) call. Obviously, if I do this for SE1_ and SE1_ is still in that vector, the correlation will be 1 and nothing happens. As you can see, this is also a problem for the rest of the code, e.g. in the line cor(df$SE1_, df[, x], use = "complete.obs")), since it says df$SE1_ there. The same goes for the df$SE1_imp <- ... line.
Of course I could just delete that sensor from the sapply(...) call so the first problem does not occur; I'm just wondering if there's a nicer way to do this. The same applies to the df$SE1_ parts: if I want to impute the values for SE2_, I'd have to change df$SE1_ to df$SE2_, and so on.
I tried to run the code like this (but without the SE1_ in the sapply(...) of course) and I got the error: Error in df[, x] : incorrect number of dimensions.
Any ideas how to solve these issues?
impFUN <- function(df) {
corr <- sapply(c("SE1_", "SE2_", "SE4_", "SE5_","SE6_",
"SE7_", "SE12_", "SE13_","SE14_", "SE15_",
"SE16_", "SE22_","SE23", "SE24", "SE25",
"SE26", "SE33", "SE34", "SE35", "SE36",
"SE37", "SE46", "SE51", "SE52", "SE53",
"SE54", "SE59", "SE60", "SE61", "SE62",
"SE68", "SE69", "SE70", "SE71", "SE72",
"SE73","SE74", "SE82", "SE83", "SE84",
"SE85", "SE86", "SE87", "SE99","SE100",
"SE101", "SE102", "SE103","SE104",
"SE106", "SE107","SE121"), function(x)
cor(df$SE1_, df[, x], use = "complete.obs"))
imp.use <- names(which.max(corr))
regr.model <- lm(reformulate(imp.use, "SE1_"))
df$SE1_imp <-
ifelse(is.na(df$SE1_), lm.cf[1] + df[[imp.use]]*lm.cf[2], df$SE1_)
}
What about this? First check which sensor correlates most with sensor 1.
corr <- sapply(c("sensor.2", "sensor.3", "sensor.4"), function(x)
cor(dat$sensor.1, dat[,x], use="complete.obs"))
# sensor.2 sensor.3 sensor.4
# 0.04397132 0.26880412 -0.06487781
imp.use <- names(which.max(corr))
# [1] "sensor.3"
Calculate the regression model,
lm.cf <- lm(reformulate(imp.use, "sensor.1"), dat)$coef
and to impute sensor 1 use the coefficients in an ifelse like this:
dat$sensor.1.imp <-
ifelse(is.na(dat$sensor.1), lm.cf[1] + dat[[imp.use]]*lm.cf[2], dat$sensor.1)
Result
head(dat)
# sensor.1 sensor.2 sensor.3 sensor.4 sensor.1.imp
# 1 2.0348728 -0.6374294 2.0005714 0.03403394 2.0348728
# 2 -0.8830567 -0.8779942 0.7914632 -0.66143678 -0.8830567
# 3 NA 1.2481243 -0.9897785 -0.36361831 -0.1943438
# 4 NA -0.1162450 0.6672969 -2.84821295 0.2312968
# 5 1.0407590 0.1906306 0.3327787 1.16064011 1.0407590
# 6 0.5817020 -0.6133034 0.5689318 0.71543751 0.5817020
Toy data:
library('MASS')
set.seed(42)
M <- mvrnorm(n=1e2, mu=c(0, 0, 0, 0),
Sigma=matrix(c(1, .2, .3, .1,
.2, 1, 0, 0,
.3, 0, 1, 0,
.1, 0, 0, 1), nrow=4),
empirical=TRUE)
dat <- as.data.frame(`colnames<-`(M, paste0("sensor.", 1:4)))
dat[sample(1:nrow(dat), 30), "sensor.1"] <- NA ## generate 30% missings
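To address the edit above about doing this for every column without hard-coding SE1_, here is a minimal sketch along the same lines; impute_col() is a hypothetical helper (not from any package), and it assumes every other column of df is a candidate predictor:
# Hedged sketch: impute one target column from its best-correlated partner.
impute_col <- function(df, target) {
  candidates <- setdiff(names(df), target)        # drop the target itself
  corr <- sapply(candidates, function(x)
    cor(df[[target]], df[[x]], use = "complete.obs"))
  imp.use <- names(which.max(corr))               # best-correlated sensor
  lm.cf <- lm(reformulate(imp.use, target), df)$coef
  ifelse(is.na(df[[target]]),
         lm.cf[1] + df[[imp.use]] * lm.cf[2],
         df[[target]])
}
# Example use over all columns (the RHS is evaluated on the original df):
# df[] <- lapply(names(df), function(nm) impute_col(df, nm))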
Scope:
[Code UPDATE.] Calculate confusionMatrix() sensitivity, specificity, and accuracy over a loop, or sequence, of threshold values, seq(0.1, 0.9, by=0.1).
Problem with confusionMatrix()
[Code UPDATE.] Using the caret::confusionMatrix function, I build the confusion matrix from variables inside the function compute_seq_accuracy.func. The tryCatch does not show any errors, but the function does NOT create the confusionMatrix when caret::confusionMatrix(csa.func.p, csa.func.confusion_table) is called.
This is not the major problem to solve in this code; the other errors were fixed when I changed the dataset.
Goal:
Iterate over the values 0.1 to 0.9, by 0.1, calculating sensitivity, specificity, and accuracy from a custom confusion-matrix function that handles the level mismatch that made caret::confusionMatrix error out when the levels differed.
Null records have been removed.
R Code WIP Solution
This is the R code, a work in progress. The function compute_seq_accuracy.func() executes without error; however, the contingency table that is created inside compute_confusion_matrix.func() does not get created and returned from the function. The following data trace comes from an internal print statement that shows the contingency table created for each threshold evaluation:
## function
compute_seq_accuracy.func <- function(value) {
tryCatch({
csa.func.p <- factor(ifelse(loans_predict < value, 0, 1))
csa.func.confusion_table <- compute_confusion_matrix.func(loans_train_data$statusRank, csa.func.p)
tryCatch({
csa.cmt <- compute_matrix.func(csa.func.p, csa.func.confusion_table)
},
error = function(e) return(e)
)
return(csa.cmt$overall['Accuracy'])
},
error = function(e) return(e)
)
}
compute_matrix.func <- function(p, t) {
tryCatch({
cm.func.confusion_matrix <- caret::confusionMatrix(p, t)
return(cm.func.confusion_matrix) ### $overall['Accuracy'])
},
error = function(e) return(e)
)
}
## function
compute_confusion_matrix.func <- function(y, p) {
ccm.func.confusion_table <- table(y, p)
if(nrow(ccm.func.confusion_table)!=ncol(ccm.func.confusion_table)){
missings <- setdiff(colnames(ccm.func.confusion_table),rownames(ccm.func.confusion_table))
missing_mat <- mat.or.vec(nr = length(missings), nc = ncol(ccm.func.confusion_table))
ccm.func.confusion_table <- as.table(rbind(as.matrix(ccm.func.confusion_table), missing_mat))
rownames(ccm.func.confusion_table) <- colnames(ccm.func.confusion_table)
}
return(ccm.func.confusion_table)
}
## process run
compute_for_values = seq(0.1,0.9, by=0.1)
csa_computed_accuracies <- sapply(compute_for_values, compute_seq_accuracy.func, simplify = FALSE)
The returned variable, csa_computed_accuracies, is created, but it contains an error message, which reads as follows:
> csa_computed_accuracies
[[1]]
<simpleError in dimnames(x) <- dn: length of 'dimnames' [1] not equal to array extent>
Data Trace
tryCatch is set, and there are no warning messages. However, when the line csa.func.confusion_matrix <- caret::confusionMatrix() is invoked, no confusionMatrix object is created, and no tryCatch error or warning is issued:
> csa_computed_accuracies <- sapply(compute_for_values, compute_seq_accuracy.func, simplify = FALSE)
p
y 0 1
Bad 4 6009
Good 0 21411
p
y 0 1
Bad 38 5975
Good 15 21396
p
y 0 1
Bad 225 5788
Good 133 21278
p
y 0 1
Bad 702 5311
Good 533 20878
p
y 0 1
Bad 1575 4438
Good 1614 19797
p
y 0 1
Bad 2836 3177
Good 4002 17409
p
y 0 1
Bad 4382 1631
Good 8646 12765
p
y 0 1
Bad 5627 386
Good 15856 5555
>
> csa_computed_accuracies
[[1]]
NULL
[[2]]
NULL
[[3]]
NULL
[[4]]
NULL
[[5]]
NULL
[[6]]
NULL
[[7]]
NULL
[[8]]
NULL
[[9]]
NULL
>
Partial Correction
Corrected Data set:
head(loans_predict,50)
11413 2561 25337 1643 14264 24191 33989 28193 21129 7895 29007 26622 3065
0.8375821 0.7516343 0.8375704 0.7671279 0.7201578 0.7917037 0.8980501 0.8259884 0.8604232 0.8664207 0.7609676 0.7753622 0.9321958
11423 3953 5789 30150 6070 1486 13195 30344 26721 716 24609 22196 10770
0.8325967 0.9459098 0.5903160 0.5997290 0.9045176 0.6782181 0.7546154 0.8381577 0.7943421 0.7198638 0.4522069 0.7129170 0.8632025
18042 3710 21750 23492 10680 5088 10434 3228 8696 29688 33847 2997 24772
0.8941667 0.6445716 0.7659989 0.2616490 0.7402274 0.7115220 0.8985310 0.7300686 0.8737217 0.6712457 0.7037675 0.6868837 0.7534947
28396 6825 27619 26433 25542 33853 32926 33585 20362 6895 20634
0.7516796 0.7261610 0.8437550 0.8662871 0.8620579 0.9355447 0.6786310 0.6017286 0.9340776 0.9022817 0.7832571
>
> compute_for_values
[1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
Consider wrapping your method in tryCatch to catch exceptions and return NULL or the error condition on failure; that lets you investigate further which threshold value (such as 0.1) causes the error, and the failed elements can be removed with Filter at the end. The code below also uses sapply (a wrapper around lapply), which returns a named list when a character vector is used as input.
compute_seq_accuracy.func <- function(value) {
tryCatch({
p <- factor(ifelse(loans_predict_fcm < as.numeric(value), 'Bad', 'Good'))
confusion_table <- compute_confusion_matrix(loans_train_data$statusRank, p)
c_matrix <- confusionMatrix(confusion_table)
return(c_matrix$overall['Accuracy'])
},
# RETURN ERROR MESSAGE
error = function(e) return(e)
)
}
compute_for_values <- as.character(seq(0.1, 0.9, by=0.1))
## WIP error in !all.equal(nrow(data, ncol(data)))
computed_accuracies <- sapply(compute_for_values, compute_seq_accuracy.func, simplify = FALSE)
# REMOVE NULLs FROM LIST
computed_accuracies <- Filter(length, computed_accuracies)
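Note that the error handler in the code above returns the condition object rather than NULL, so one hedged way to see which thresholds failed before filtering is:
# Which thresholds produced errors? (elements that are error conditions)
failed <- Filter(function(x) inherits(x, "error"), computed_accuracies)
names(failed)   # the threshold values, as characters, that errored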
I'm a beginner in R programming. I was doing an exercise for an online course and I came up with a question (which does not concern the exercise itself). I have the following DF:
> str(DF)
'data.frame': 915 obs. of 3 variables:
$ sal: int 22000000 15714286 13650000 13571429 13350000 13050000 13000000 12600000 12500000 12500000 ...
$ AB : int 632 36 574 503 80 529 NA 614 577 364 ...
$ OBP: num 0.399 0.154 0.408 0.384 0.143 ...
I had to make a function that returns the first three consecutive values of the column $sal whose sum is at most 15 million. Here's what I did:
> for(i in DF){
+ x <- 1
+ y <- 3
+ while(sum(i[x:y]) > 15*10^6){
+ x <- x + 1
+ y <- y + 1
+ if(sum(i[x:y]) <= 15*10^6){
+ print(c(x:y))
+ }
+ }
+ }
[1] 138 139 140
This works, but of course, the for will run through the entire DF. If I wanted it to run only through a specific column, I'd set for(i in DF$column). However, when I do this, I get the following error:
Error in while (sum(i[x:y]) > 15 * 10^6) { :
missing value where TRUE/FALSE needed
This is my question: why is this happening? Hope my question was clear.
The problem is that you have missing values in your data (see your column AB).
Try replacing sum(...) by sum(..., na.rm = TRUE).
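Applied to a single column, a minimal sketch might look like this (taking the column as a vector rather than looping over its elements; DF$sal is just the column from the question):
i <- DF$sal   # work on one column directly
x <- 1
y <- 3
while (sum(i[x:y], na.rm = TRUE) > 15 * 10^6) {
  x <- x + 1
  y <- y + 1
  if (sum(i[x:y], na.rm = TRUE) <= 15 * 10^6) {
    print(x:y)
  }
}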
By the way, what you could do instead of looping is something like:
test <- round(runif(100) * 1e7)
test[1] <- NA
test2 <- RcppRoll::roll_sum(test, n = 3, na.rm = TRUE)
sapply(which(test2 > 15e6), function(x) x + 0:2)
# outer(0:2, which(test2 > 15e6), '+') may be faster
After creating my CART with rpart, I proceed to convert it to a party object with the as.party function from the partykit package. The following error appears:
as.party(tree.hunterpb1)
Error in partysplit(varid = which(rownames(obj$split)[j] == names(mf)), :
‘index’ has less than two elements
I can only assume that it's referring to the partitioning made by factor variables, as I've understood from the literature, since the index applies to factors. My tree looks like this:
tree.hunterpb1
n= 354
node), split, n, deviance, yval
* denotes terminal node
1) root 354 244402.100 75.45134
2) hr.11a14>=49.2125 19 3378.322 33.44274 *
3) hr.11a14< 49.2125 335 205592.400 77.83391
6) month=April,February,June,March,May 141 58656.390 68.57493 *
7) month=August,December,January,July,November,October,September 194 126062.800 84.56338
14) presion.11a14>=800.925 91 74199.080 81.32755
28) month=January,November,October 16 9747.934 63.13394 *
29) month=August,December,July,September 75 58025.190 85.20885 *
15) presion.11a14< 800.925 103 50069.100 87.42223 *
The traceback shows that the first partition's conversion to party class is done correctly, but the second one, based on the factor variables, fails and produces said error.
Previously, when working on similar data, this error did not appear. I can only assume that the as.party function isn't finding the indices. Any advice on how to solve this will be appreciated.
Possibly, the problem is caused by the following situation. (Thanks to Yan Tabachek for e-mailing me a similar example.) If one of the partitioning variables passed on to rpart() is a character variable, then it is processed as if it were a factor by rpart() but not by the conversion in as.party(). As a simple example consider this small data set:
d <- data.frame(y = c(1:10, 101:110))
d$x <- rep(c("a", "b"), each = 10)
Fitting the rpart() tree treats the character variable x as a factor:
library("rpart")
(rp <- rpart(y ~ x, data = d))
## n= 20
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 20 50165.0 55.5
## 2) x=a 10 82.5 5.5 *
## 3) x=b 10 82.5 105.5 *
However, the as.party() conversion does not work:
library("partykit")
as.party(rp)
## Error in partysplit(varid = which(rownames(obj$split)[j] == names(mf)), :
## 'index' has less than two elements
The best fix is to transform x to a factor variable and re-fit the tree. Then the conversion also works smoothly:
d$x <- factor(d$x)
rp <- rpart(y ~ x, data = d)
as.party(rp)
## Model formula:
## y ~ x
##
## Fitted party:
## [1] root
## | [2] x in a: 5.500 (n = 10, err = 82.5)
## | [3] x in b: 105.500 (n = 10, err = 82.5)
##
## Number of inner nodes: 1
## Number of terminal nodes: 2
I also added a fix in the development version of partykit on R-Forge to avoid the problem in the first place. It will be included in the next CRAN release (probably 1.0-1 for which a release date has not yet been scheduled).
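More generally, if a data frame contains several character predictors, one small sketch to convert them all before fitting (using the example data frame d from above) is:
# Convert every character column to a factor before calling rpart()
d[] <- lapply(d, function(col) if (is.character(col)) factor(col) else col)
str(d)   # x is now a factor with levels "a" and "b"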
I asked this question a year ago and got code for this "probability heatmap":
numbet <- 32
numtri <- 1e5
prob=5/6
#Fill a matrix
xcum <- matrix(NA, nrow=numtri, ncol=numbet+1)
for (i in 1:numtri) {
x <- sample(c(0,1), numbet, prob=c(prob, 1-prob), replace = TRUE)
xcum[i, ] <- c(i, cumsum(x)/cumsum(1:numbet))
}
colnames(xcum) <- c("trial", paste("bet", 1:numbet, sep=""))
mxcum <- reshape(data.frame(xcum), varying=1+1:numbet,
idvar="trial", v.names="outcome", direction="long", timevar="bet")
library(plyr)
mxcum2 <- ddply(mxcum, .(bet, outcome), nrow)
mxcum3 <- ddply(mxcum2, .(bet), summarize,
ymin=c(0, head(seq_along(V1)/length(V1), -1)),
ymax=seq_along(V1)/length(V1),
fill=(V1/sum(V1)))
head(mxcum3)
library(ggplot2)
p <- ggplot(mxcum3, aes(xmin=bet-0.5, xmax=bet+0.5, ymin=ymin, ymax=ymax)) +
geom_rect(aes(fill=fill), colour="grey80") +
scale_fill_gradient("Outcome", formatter="percent", low="red", high="blue") +
scale_y_continuous(formatter="percent") +
xlab("Bet")
print(p)
(May need to change this code slightly because of this)
This is almost exactly what I want, except each vertical shaft should have a different number of bins, i.e. the first should have 2, the second 3, the third 4 (N+1). In the graph, shafts 6 and 7 have the same number of bins (7), whereas shaft 7 should have 8 (N+1).
If I'm right, the reason the code does this is that it uses the observed data, and if I ran more trials I would get more bins. I don't want to rely on the number of trials to get the correct number of bins.
How can I adapt this code to give the correct number of bins?
I have used R's dbinom to generate the frequency of heads for n=1:32 trials and plotted the graph; it should be what you expect. I have read some of your earlier posts here on SO and on math.stackexchange, and I still don't understand why you'd want to simulate the experiment rather than generate from a binomial random variable. If you could explain it, that would be great! I'll try to work on the simulated solution from @Andrie to check whether I can match the output shown below. For now, here's something you might be interested in.
set.seed(42)
numbet <- 32
numtri <- 1e5
prob=5/6
require(plyr)
out <- ldply(1:numbet, function(idx) {
outcome <- dbinom(idx:0, size=idx, prob=prob)
bet <- rep(idx, length(outcome))
N <- round(outcome * numtri)
ymin <- c(0, head(seq_along(N)/length(N), -1))
ymax <- seq_along(N)/length(N)
data.frame(bet, fill=outcome, ymin, ymax)
})
require(ggplot2)
p <- ggplot(out, aes(xmin=bet-0.5, xmax=bet+0.5, ymin=ymin, ymax=ymax)) +
geom_rect(aes(fill=fill), colour="grey80") +
scale_fill_gradient("Outcome", low="red", high="blue") +
xlab("Bet")
The plot:
Edit: Explanation of how your old code from Andrie works and why it doesn't give what you intend.
Basically, what Andrie did (or rather, one way to look at it) is use the fact that if you have two binomial distributions, X ~ B(n, p) and Y ~ B(m, p), where n and m are the sizes and p is the probability of success, then their sum X + Y ~ B(n + m, p) (1). So the purpose of xcum is to obtain the outcomes for all n = 1:32 tosses, but to explain it better, let me construct the code step by step. Along with the explanation, the code for xcum will also become obvious, and it can be constructed in no time (without any need for a for-loop or for computing a cumsum every time).
If you have followed me so far, the idea is first to create a numtri * numbet matrix, with each column (length = numtri) holding 0's and 1's with probability 5/6 and 1/6 respectively. That is, if you have numtri = 1000, you'll have ~834 0's and ~166 1's for each of the numbet columns (= 32 here). Let's construct this and test it first.
numtri <- 1e3
numbet <- 32
set.seed(45)
xcum <- t(replicate(numtri, sample(0:1, numbet, prob=c(5/6,1/6), replace = TRUE)))
# check for count of 1's
> apply(xcum, 2, sum)
[1] 169 158 166 166 160 182 164 181 168 140 154 142 169 168 159 187 176 155 151 151 166
163 164 176 162 160 177 157 163 166 146 170
# So, the count of 1's are "approximately" what we expect (around 166).
Now, each of these columns holds numtri samples from a binomial distribution with size n = 1. If we were to add the first two columns and replace the second column with that sum, then, from (1), since the probabilities are equal, we'd end up with a binomial distribution with n = 2. Similarly, if you instead added the first three columns and replaced the 3rd column with that sum, you would obtain a binomial distribution with n = 3, and so on...
The concept is that if you cumulatively add each column, you end up with numbet binomial distributions (sizes 1 to 32 here). So, let's do that.
xcum <- t(apply(xcum, 1, cumsum))
# you can verify that the second column has similar probabilities by this:
# calculate the frequency of all values in 2nd column.
> table(xcum[,2])
0 1 2
694 285 21
> round(numtri * dbinom(2:0, 2, prob=5/6))
[1] 694 278 28
# more or less identical, good!
If you divide the xcum we have generated thus far by cumsum(1:numbet) over each row, in this manner:
xcum <- xcum/matrix(rep(cumsum(1:numbet), each=numtri), ncol = numbet)
this will be identical to the xcum matrix that comes out of the for-loop (if you generate it with the same seed). However, I don't quite understand the reason for this division by Andrie, as it is not necessary to generate the graph you require. I suppose it has something to do with the frequency values you talked about in an earlier post on math.stackexchange.
Now, on to why you have difficulty obtaining the graph I attached (with n+1 bins):
For a binomial distribution with n=1:32 trials, 5/6 as probability of tails (failures) and 1/6 as the probability of heads (successes), the probability of k heads is given by:
nCk * (1/6)^k * (5/6)^(n-k) # where nCk is n choose k
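A quick check of that formula against dbinom(), with illustrative values n = 7 and k = 2:
n <- 7; k <- 2
choose(n, k) * (1/6)^k * (5/6)^(n - k)   # ~0.2344
dbinom(k, size = n, prob = 1/6)          # same value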
For the test data we've generated, the observed proportions of k heads for n=7 and n=8 (trials) are:
# n=7
0 1 2 3 4 5
.278 .394 .233 .077 .016 .002
# n=8
0 1 2 3 4 5
.229 .375 .254 .111 .025 .006
Why do they both have 6 bins, and not 8 and 9 bins? Of course this has to do with the value of numtri=1000. Let's see what the probabilities of each of these 8 and 9 bins are by generating them directly from the binomial distribution using dbinom, to understand why this happens.
# n = 7
dbinom(7:0, 7, prob=5/6)
# output rounded to 3 decimal places
[1] 0.279 0.391 0.234 0.078 0.016 0.002 0.000 0.000
# n = 8
dbinom(8:0, 8, prob=5/6)
# output rounded to 3 decimal places
[1] 0.233 0.372 0.260 0.104 0.026 0.004 0.000 0.000 0.000
You see that the probabilities corresponding to k=6,7 (for n=7) and k=6,7,8 (for n=8) are ~0. They are very low values. The minimum value here is actually about 5.95 * 1e-7 (n=8, k=8). This means you'd expect to see such a value once in roughly 1/(5.95 * 1e-7), i.e. about 1.7 million simulations. If you check the same for n=32 and k=32, the value is 1.256493 * 1e-25, so you'd have to simulate on that order of trials to get at least one result where all 32 outcomes are heads for n=32.
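A quick check of those numbers directly from dbinom():
dbinom(8, 8, prob = 1/6)     # ~5.95e-07, i.e. roughly 1 in 1.7 million trials
dbinom(32, 32, prob = 1/6)   # ~1.256e-25 for all 32 heads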
This is why your results had no values for certain bins: the probability of landing in them is very low for the given numtri. For the same reason, generating the probabilities directly from the binomial distribution overcomes this limitation.
I hope I've managed to write with enough clarity for you to follow. Let me know if you have trouble following it.
Edit 2:
When I run the simulation code edited above with numtri=1e6 and count the number of heads for k=0:7 and k=0:8, I get this for n=7 and n=8:
# n = 7
0 1 2 3 4 5 6 7
279347 391386 233771 77698 15763 1915 117 3
# n = 8
0 1 2 3 4 5 6 7 8
232835 372466 259856 104116 26041 4271 392 22 1
Note that there are now counts for k=6 and k=7 for n=7 and n=8. Also, for n=8, you have a count of 1 for k=8. With increasing numtri you'll fill in more of the missing bins, but it will require a huge amount of time/memory (if it's feasible at all).