I have a classic dice simulation problem, which I'm struggling to implement because I'm new to R syntax. The function (which I have called simu) works as follows:
Start with 0 points
Simulate n random draws of three six-sided dice
For each draw:
If sum of three dice >12 --> +1 point
If sum of three dice <6 --> -1 point
Otherwise (ie sum between 6 and 12):
If three dice have same number --> +5 points
Otherwise --> 0 points
Return total # of points obtained at the end of n simulations
Having tried a number of different methods I seem to be pretty close with:
simu <- function(n){
k <- 0
for(i in 1:n) {
a <- sample(y,1,replace=TRUE)
b <- sample(y,1,replace=TRUE)
c <- sample(y,1,replace=TRUE)
if ((a + b + c) > 12) {
k <- k+1
} else if ((a + b + c) < 6) {
k <- k-1
} else if ((a == b) & (b == c)) {
k <- k+5
} else k <- 0
}
return(k)
}
The problem seems to be that I am failing to iterate over new simulations (for a, b, c) for each "i" in the function.
I have commented out the only issue I found: the final else, which reset k to 0 on every draw that scored no points. It should have been k <- k + 0, which does nothing, so the branch can simply be removed. Note that y also has to be defined before simu is called.
y <- seq(1,6) # 6-sided dice
simu <- function(n){
k <- 0
for(i in 1:n) {
a <- sample(y,1,replace=TRUE)
b <- sample(y,1,replace=TRUE)
c <- sample(y,1,replace=TRUE)
if ((a + b + c) > 12) {
k <- k+1
} else if ((a + b + c) < 6) {
k <- k-1
} else if ((a == b) & (b == c)) {
k <- k+5
} #else k <- 0
}
return(k)
}
The results look quite fine :
> simu(1000)
[1] 297
> simu(100)
[1] 38
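As a sanity check on these totals, the expected score per draw can be computed exactly by enumerating all 216 equally likely outcomes. This is a minimal sketch that only assumes the scoring rules stated in the question; the names g, s, same and score are illustrative and not part of the original code.

```r
# Enumerate all 6^3 = 216 equally likely outcomes of three dice
g <- expand.grid(a = 1:6, b = 1:6, c = 1:6)
s <- rowSums(g)                          # sum of each outcome
same <- g$a == g$b & g$b == g$c          # all three dice equal

# Same rules as in simu: +1 above 12, -1 below 6, +5 for a triple otherwise
score <- ifelse(s > 12, 1, ifelse(s < 6, -1, ifelse(same, 5, 0)))

expected_per_draw <- mean(score)         # 61/216, roughly 0.282
expected_per_draw * 1000                 # expected total for simu(1000)
```

With about 0.28 points per draw, totals near 280 for n = 1000 and near 28 for n = 100 are what one would expect, so the simulated values above look plausible.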
If you are going to use R, then you should learn to create vectorized operations instead of 'for' loops. Here is a simulation of 1 million rolls of the dice that took less than 1 second to calculate. I am not sure how long the 'for' loop approach would have taken.
n <- 1000000 # trials
start <- proc.time() # time how long it takes
result <- matrix(0L, ncol = 6, nrow = n)
colnames(result) <- c('d1', 'd2', 'd3', 'sum', 'same', 'total')
# initialize the rolls of the three dice
result[, 1:3] <- sample(6L, n * 3L, replace = TRUE)
# compute row sum
result[, 'sum'] <- as.integer(rowSums(result[, 1:3]))
# check for being the same
result[, 'same'] <- result[,1L] == result[, 2L] & result[, 2L] == result[, 3L]
result[, 'total'] <- ifelse(result[, 'sum'] > 12L,
1L,
ifelse(result[, 'sum'] < 6L,
-1L,
ifelse(result[, 'same'] == 1L,
5L,
0L
)
)
)
table(result[, 'total'])
-1 0 1 5
46384 680762 259083 13771
cat("simulation took:", proc.time() - start, '\n')
simulation took: 0.7 0.1 0.8 NA NA
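If only the total score is needed, as in the original simu, the same vectorized idea can be wrapped in a small function. This is a sketch under the question's scoring rules; simu_vec is a hypothetical name, not something from the answer above.

```r
# Hypothetical helper: total score over n draws of three six-sided dice
simu_vec <- function(n) {
  rolls <- matrix(sample(6L, n * 3L, replace = TRUE), ncol = 3)
  s <- rowSums(rolls)
  same <- rolls[, 1] == rolls[, 2] & rolls[, 2] == rolls[, 3]
  # +1 for sums above 12, -1 for sums below 6, +5 for triples in between
  sum(s > 12) - sum(s < 6) + 5 * sum(same & s >= 6 & s <= 12)
}

set.seed(1)
simu_vec(1000)
```

It skips the intermediate result matrix and the nested ifelse calls, which keeps memory use lower for large n.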
I am not sure that's what you need, but you can try something like that:
# Draw the dice(s) - returns vector of length == n_dices
draw <- function(sides = 6, dices = 3){
sample(1:sides, dices, replace = T)
}
# score one draw x: returns -1, 0, 1 or 5
test <- function(x){
(sum(x) > 12)*1 + (sum(x) < 6)*(-1) + (sum(x) >= 6 &
sum(x) <= 12 &
var(x) == 0)*5
}
# simulate n draws of x dices with y sides
simu <- function(sides = 6, dices = 3, n = 100){
sum(replicate(n, test(draw(sides, dices))))
}
# run simulations of 100 draws for dice with 1, 2, ..., 12 sides (3 dice per draw)
dt <- lapply(1:12, function(side) replicate(100, simu(side, 3, 100)))
# plot distribution of scores
par(mfrow = c(3,4))
lapply(1:length(dt), function(i) hist(dt[[i]],
main = sprintf("%i sides dice", i),
xlab = "Score"
)
)
I want to calculate how many values are taken until the cumulative sum reaches a certain value.
This is my vector: myvec = seq(0,1,0.1)
I started with coding the cumulative sum function:
cumsum_for <- function(x)
{
y = 1
for(i in 2:length(x)) # pardon the case where x is of length 1 or 0
{x[i] = x[i-1] + x[i]
y = y+1}
return(y)
}
Now, with the limit
cumsum_for <- function(x, limit)
{
y = 1
for(i in 2:length(x)) # pardon the case where x is of length 1 or 0
{x[i] = x[i-1] + x[i]
if(x >= limit) break
y = y+1}
return(y)
}
which unfortunately produces warnings:
myvec = seq(0,1,0.1)
cumsum_for(myvec, 0.9)
[1] 10
Warning messages:
1: In if (x >= limit) break :
the condition has length > 1 and only the first element will be used
[...]
What about this? You can use cumsum to compute the cumulative sum, and then count the number of values that are below a certain threshold n:
f <- function(x, n) sum(cumsum(x) <= n)
f(myvec, 4)
#[1] 9
f(myvec, 1.1)
#[1] 5
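If what is wanted is instead the position at which the running total first reaches the limit, a small variation on the same cumsum idea may help. This is only a sketch; first_reach is a hypothetical name, and returning NA when the limit is never reached is an assumption about the desired behaviour.

```r
# Hypothetical helper: index of the first element at which cumsum(x) >= limit
first_reach <- function(x, limit) {
  hit <- which(cumsum(x) >= limit)
  if (length(hit) == 0) NA_integer_ else hit[1]
}

myvec <- seq(0, 1, 0.1)
first_reach(myvec, 0.9)   # 5: the running total goes 0, 0.1, 0.3, 0.6, 1.0
first_reach(myvec, 100)   # NA: the total never gets that high
```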
You can put a while loop in a function. This stops further calculation of the cumsum if the limit is reached.
cslim <- function(v, l) {
s <- 0
i <- 0L
while (s < l) {
i <- i + 1
s <- s + v[i] # running total; avoids re-summing v[1:i] on every pass
}
i - 1
}
cslim(myvec, .9)
# [1] 4
Especially useful for longer vectors, e.g.
v <- seq(0, 3e7, 0.1)
I need help with a code to generate random numbers according to constraints.
Specifically, I am trying to simulate random numbers ALFA and BETA from, respectively, a Normal and a Gamma distribution such that ALFA - BETA < 1.
Here is what I have written but it does not work at all.
set.seed(42)
n <- 0
repeat {
n <- n + 1
a <- rnorm(1, 10, 2)
b <- rgamma(1, 8, 1)
d <- a - b
if (d < 1)
alfa[n] <- a
beta[n] <- b
l = length(alfa)
if (l == 10000) break
}
Due to vectorization, it will be faster to generate the numbers "all at once" rather than in a loop:
set.seed(42)
N = 1e5
a = rnorm(N, 10, 2)
b = rgamma(N, 8, 1)
d = a - b
alfa = a[d < 1]
beta = b[d < 1]
length(alfa)
# [1] 36436
This generated 100,000 candidates, 36,436 of which met your criteria. If you want n samples, try setting N = 4 * n, which will probably generate more than enough, and then keep the first n.
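The "oversample, then keep the first n" idea can be made robust by topping up in batches until enough pairs have been accepted. This is a sketch, not the answerer's code: draw_constrained and its oversample argument are hypothetical, and the distribution parameters are simply copied from the question.

```r
# Hypothetical helper: n pairs (alfa, beta) with alfa - beta < 1
draw_constrained <- function(n, oversample = 4) {
  alfa <- numeric(0)
  beta <- numeric(0)
  while (length(alfa) < n) {
    N <- oversample * (n - length(alfa))   # candidates for this batch
    a <- rnorm(N, 10, 2)
    b <- rgamma(N, 8, 1)
    ok <- (a - b) < 1                      # keep only pairs meeting the constraint
    alfa <- c(alfa, a[ok])
    beta <- c(beta, b[ok])
  }
  data.frame(alfa = alfa[seq_len(n)], beta = beta[seq_len(n)])
}

set.seed(42)
res <- draw_constrained(10000)
nrow(res)
```

Vectors are only extended once per batch, so the cost of growing them stays small compared with appending one value at a time.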
Your loop has 2 problems: (a) you need curly braces to enclose multiple lines after an if statement. (b) you are using n as an attempt counter, but it should be a success counter. As written, your loop will only stop if the 10000th attempt is a success. Move n <- n + 1 inside the if statement to fix:
set.seed(42)
n <- 0
alfa = numeric(0)
beta = numeric(0)
repeat {
a <- rnorm(1, 10, 2)
b <- rgamma(1, 8, 1)
d <- a - b
if (d < 1) {
n <- n + 1
alfa[n] <- a
beta[n] <- b
l = length(alfa)
if (l == 500) break
}
}
But the first way is better... due to "growing" alfa and beta in the loop, and generating numbers one at a time, this method takes longer to generate 500 numbers than the code above takes to generate 30,000.
As commented by @Gregor Thomas, your attempt fails because the curly braces needed to enclose the body of the if statement are missing. If you would like to skip {} for the if control, maybe you can try the code below
set.seed(42)
r <- list()
repeat {
a <- rnorm(1, 10, 2)
b <- rgamma(1, 8, 1)
d <- a - b
if (d < 1) r[[length(r)+1]] <- cbind(alfa = a, beta = b)
if (length(r) == 100000) break
}
r <- do.call(rbind,r)
such that
> head(r)
alfa beta
[1,] 9.787751 12.210648
[2,] 9.810682 14.046190
[3,] 9.874572 11.499204
[4,] 6.473674 8.812951
[5,] 8.720010 8.799160
[6,] 11.409675 10.602608
I have a piece of working code that is taking too many hours (days?) to compute.
I have a sparse matrix of 1s and 0s, I need to subtract each row from any other row, in all possible combinations, multiply the resulting vector by another vector, and finally average the values in it so to get a single scalar which I need to insert in a matrix. What I have is:
m <- matrix(
c(0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0), nrow=4,ncol=4,
byrow = TRUE)
b <- c(1,2,3,4)
d <- matrix(NA_real_, nrow(m), ncol(m)) # result matrix; must exist before d[i,j] is assigned
for (j in 1:dim(m)[1]){
for (i in 1:dim(m)[1]){
a <- m[j,] - m[i,]
a[i] <- 0L
a[a < 0] <- 0L
c <- a*b
d[i,j] <- mean(c[c > 0])
}
}
The desired output is matrix with the same dimensions of m, where each entry is the result of these operations.
This loop works, but are there any ideas on how to make this more efficient? Thank you
My stupid solution is to use apply or sapply function, instead of for loop to do the iterations:
sapply(1:dim(m)[1], function(k) {z <- t(apply(m, 1, function(x) m[k,]-x)); diag(z) <- 0; z[z<0] <- 0; apply(t(apply(z, 1, function(x) x*b)),1,function(x) mean(x[x>0]))})
I tried to compare your solution and this in terms of running time in my computer, yours takes
t1 <- Sys.time()
d1 <- m
for (j in 1:dim(m)[1]){
for (i in 1:dim(m)[1]){
a <- m[j,] - m[i,]
a[i] <- 0L
a[a < 0] <- 0L
c <- a*b
d1[i,j] <- mean(c[c > 0])
}
}
Sys.time()-t1
Yours gives a time difference of 0.02799988 secs. Mine is a bit faster, though not by much (a time difference of 0.01899815 secs), when you run
t2 <- Sys.time()
d2 <- sapply(1:dim(m)[1], function(k) {z <- t(apply(m, 1, function(x) m[k,]-x)); diag(z) <- 0; z[z<0] <- 0; apply(t(apply(z, 1, function(x) x*b)),1,function(x) mean(x[x>0]))})
Sys.time()-t2
You can try it on your own computer with larger matrix, good luck!
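For a dense matrix that fits in memory, the double loop can also be expressed with two matrix products. This is a sketch under assumptions that are not stated in the question: m is square and contains only 0s and 1s, and every element of b is strictly positive (so that c > 0 is equivalent to a > 0). The names Q, num, cnt and d_fast are illustrative.

```r
m <- matrix(c(0, 1, 1, 0,
              1, 0, 1, 1,
              0, 1, 0, 0,
              1, 1, 0, 0), nrow = 4, ncol = 4, byrow = TRUE)
b <- c(1, 2, 3, 4)

# Q[i, k] is 1 where row i has a 0 in column k; the diagonal is zeroed
# to mirror the line a[i] <- 0L in the loop
Q <- 1 - m
diag(Q) <- 0

# (b * t(m))[k, j] is b[k] * m[j, k], so num[i, j] sums b over the columns
# where row j has a 1 and row i has a 0
num <- Q %*% (b * t(m))
cnt <- Q %*% t(m)          # how many such columns there are

d_fast <- num / cnt        # 0/0 gives NaN, like mean() of an empty vector
d_fast
```

Note that Q is dense even when m is sparse, so this form is not a drop-in replacement for very large sparse inputs.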
1) create test sparse matrix:
nc <- nr <- 100
p <- 0.001
require(Matrix)
M <- Matrix(0L, nr, nc, sparse = T) # 0 matrix
n1 <- ceiling(p * (prod(dim(M)))) # 1 count
M[1:n1] <- 1L # fill only first column, to approximate max non 0 row count
# (each row has at maximum 1 positive element)
sum(M)/(prod(dim(M)))
b <- 1:ncol(M)
sum(rowSums(M))
So, if the proportion given is correct then we have at most 10 rows that contain non 0 elements
Based on this fact and your supplied calculations:
# a <- m[j, ] - m[i, ]
# a[i] <- 0L
# a[a < 0] <- 0L
# c <- a*b
# mean(c[c > 0])
we can see that the result will be meaningful only for rows m[j, ] which have at least one non-zero element
==> we can skip calculations for all rows m[j, ] which contain only 0s, so:
minem <- function() { # write as function
t1 <- proc.time() # timing
require(data.table)
i <- CJ(1:nr, 1:nr) # generate all combinations
k <- rowSums(M) > 0L # get index where at least 1 element is greater than 0
i <- i[data.table(V1 = 1:nr, k), on = 'V1'] # merge
cat('at most', i[, sum(k)/.N*100], '% of rows needs to be calculated \n')
i[k == T, rowN := 1:.N] # add row nr for 0 subset
i2 <- i[k == T] # subset only those indexes who need calculation
a <- M[i2[[1]],] - M[i2[[2]],] # operate on all combinations at once
a <- drop0(a) # clean up 0
ids <- as.matrix(i2[, .(rowN, V2)]) # ids for 0 subset
a[ids] <- 0L # your line: a[i] <- 0L
a <- drop0(a) # clean up 0
a[a < 0] <- 0L # the same as your line
a <- drop0(a) # clean up 0
c <- t(t(a)*b) # multiply each row with vector
c <- drop0(c) # clean up 0
c[c < 0L] <- 0L # for mean calculation
c <- drop0(c) # clean up 0
r <- rowSums(c)/rowSums(c > 0L) # row means
i[k == T, result := r] # assign results to data.table
i[is.na(result), result := NaN] # set rest to NaN
d2 <- matrix(i$result, nr, nr, byrow = F) # create resulting matrix
t2 <- proc.time() # timing
cat(t2[3] - t1[3], 'sec \n')
d2
}
d2 <- minem()
# at most 10 % of rows needs to be calculated
# 0.05 sec
Test on a smaller example to check that the results match:
d <- matrix(NA, nrow(M), ncol(M))
for (j in 1:dim(M)[1]) {
for (i in 1:dim(M)[1]) {
a <- M[j, ] - M[i, ]
a[i] <- 0L
a[a < 0] <- 0L
c <- a*b
d[i, j] <- mean(c[c > 0])
}
}
all.equal(d, d2)
Can we get results for your real data size?:
# generate data:
nc <- nr <- 6663L
b <- 1:nr
p <- 0.0001074096 # proportion of 1s
M <- Matrix(0L, nr, nc, sparse = T) # 0 matrix
n1 <- ceiling(p * (prod(dim(M)))) # 1 count
M[1:n1] <- 1L
object.size(as.matrix(M))/object.size(M)
# storing this data in usual matrix uses 4000+ times more memory
# calculation:
d2 <- minem()
# at most 71.57437 % of rows needs to be calculated
# 28.33 sec
So you need to convert your matrix to a sparse one with
M <- Matrix(m, sparse = T)
From a sequence of TRUEs and FALSEs, I wanted to make a function that returns TRUE if there is a series of at least n1 TRUEs somewhere in the sequence. Here is that function:
fun_1 = function(TFvec, n1){
nbT = 0 # length of the current run of TRUEs
for (i in seq_along(TFvec)){ # iterate over the argument, not the global x
if (TFvec[i]){
nbT = nbT + 1
if (nbT == n1){
return(T)
}
} else {
nbT = 0
}
}
return (F)
}
Test:
x = c(T,F,T,T,F,F,T,T,T,F,F,T,F,F)
fun_1(x,3) # TRUE
fun_1(x,4) # FALSE
Then, I needed a function that returns TRUE if, in a given boolean vector, there is a series of at least n1 TRUEs wrapped by two series (one on each side) of at least n2 FALSEs. Here is the function:
fun_2 = function(TFvec, n1, n2){
if (n2 == 0){
return(fun_1(TFvec, n1)) # no FALSE padding required: reduces to fun_1
}
nbFB = 0
nbFA = 0
nbT = 0
solution = -1
last = F
for (i in 1:length(TFvec)){
if(TFvec[i]){
nbT = nbT + 1
if (nbT == n1 & nbFB >= n2){
solution = i-n1+1
}
last = T
} else {
if (last){
nbFB = 0
nbFA = 0
}
nbFB = nbFB + 1
nbFA = nbFA + 1
nbT = 0
if (nbFA == n2 & solution!=-1){
return(T)
}
last = F
}
}
return(F)
}
It is maybe not a very efficient function though! And I haven't tested it 100 times but it looks like it works fine!
Test:
x = c(T,F,T,T,F,F,T,T,T,F,F,T,F,F)
fun_2(x, 3, 2) # TRUE
fun_2(x, 3, 3) # FALSE
Now, believe it or not, I'd like to make a function (fun_3) that returns TRUE if the boolean vector contains a series of at least n1 TRUEs wrapped between two series (one on each side) of at least n2 FALSEs, where the whole thing (the three series) is in turn wrapped between two series (one on each side) of at least n3 TRUEs. And as I am afraid I will have to take this problem even further, I am asking here for help to create a function fun_n that takes two arguments, TFvec and list_n, where list_n is a vector of thresholds of any length.
Can you help me to create the function fun_n?
For convenience, record the number of thresholds
n = length(list_n)
Represent the vector of TRUE and FALSE as a run-length encoding, remembering the length of each run for convenience
r = rle(TFvec); l = r$lengths
Find possible starting locations
idx = which(l >= list_n[1] & r$values)
Make sure the starting locations are embedded enough to satisfy all tests
idx = idx[idx > n - 1 & idx + n - 1 <= length(l)]
Then check that lengths of successively remote runs are consistent with the condition, keeping only those starting points that are
for (i in seq_len(n - 1)) {
if (length(idx) == 0)
break # no solution
thresh = list_n[i + 1]
test = (l[idx + i] >= thresh) & (l[idx - i] >= thresh)
idx = idx[test]
}
If there are any values left in idx, then these are the indexes into the rle satisfying the condition; the starting point(s) in the initial vector are cumsum(l)[idx - 1] + 1.
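To make the run-length steps concrete, here is a small worked example on the test vector from the question, with thresholds c(3, 2) (at least 3 TRUEs flanked by at least 2 FALSEs on each side). It is only an illustration of the steps described above; the variable names ok and start are hypothetical.

```r
x <- c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE,
       FALSE, FALSE, TRUE, FALSE, FALSE)

r <- rle(x)
r$lengths   # 1 1 2 2 3 2 1 2
r$values    # TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE

list_n <- c(3, 2)

# Runs of TRUE that are long enough: only run number 5
idx <- which(r$lengths >= list_n[1] & r$values)

# The neighbouring runs on both sides must be at least 2 long
ok <- r$lengths[idx + 1] >= list_n[2] & r$lengths[idx - 1] >= list_n[2]

# Position in x where the matching run of TRUEs starts
start <- cumsum(r$lengths)[idx[ok] - 1] + 1   # 7
```

Because runs alternate between TRUE and FALSE by construction, checking only the lengths of the neighbouring runs is enough.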
Combined:
runfun = function(TFvec, list_n) {
## setup
n = length(list_n)
r = rle(TFvec); l = r$lengths
## initial condition
idx = which(l >= list_n[1] & r$values)
idx = idx[idx > n - 1 & idx + n - 1 <= length(l)]
## adjacent conditions
for (i in seq_len(n - 1)) {
if (length(idx) == 0)
break # no solution
thresh = list_n[i + 1]
test = (l[idx + i] >= thresh) & (l[idx - i] >= thresh)
idx = idx[test]
}
## starts = cumsum(l)[idx - 1] + 1
## any luck?
length(idx) != 0
}
This is fast and allows for runs >= the threshold, as stipulated in the question; for example
x = sample(c(TRUE, FALSE), 1000000, TRUE)
system.time(runfun(x, rep(2, 5)))
completes in less than 1/5th of a second.
A fun generalization allows for flexible condition, e.g., runs of exactly list_n, as in the rollapply solution
runfun = function(TFvec, list_n, cond=`>=`) {
## setup
n = length(list_n)
r = rle(TFvec); l = r$lengths
## initial condition
idx = which(cond(l, list_n[1]) & r$values)
idx = idx[idx > n - 1 & idx + n - 1 <= length(l)]
## adjacent conditions
for (i in seq_len(n - 1)) {
if (length(idx) == 0)
break # no solution
thresh = list_n[i + 1]
test = cond(l[idx + i], thresh) & cond(l[idx - i], thresh)
idx = idx[test]
}
## starts = cumsum(l)[idx - 1] + 1
## any luck?
length(idx) != 0
}
Create a template, tpl of zeros and ones, convert it to a regex pattern pat. Convert x to a single string of zeros and ones and use grepl to match pat to it. No packages are used.
fun_n <- function(x, lens) {
n <- length(lens)
reps <- c(rev(lens), lens[-1])
TF <- if (n == 1) 1 else if (n %% 2) 1:0 else 0:1
tpl <- paste0(rep(TF, length.out = length(reps)), "{", reps, ",}") # alternate 0/1 across all 2n - 1 runs
pat <- paste(tpl, collapse = "")
grepl(pat, paste(x + 0, collapse = ""))
}
# test
x <- c(F, T, T, F, F, T, T, T, F, F, T, T, T, F)
fun_n(x, 3:1)
## TRUE
fun_n(x, 1:3)
## FALSE
fun_n(x, 100)
## FALSE
fun_n(x, 3)
## TRUE
fun_n(c(F, T, F), c(1, 1))
## [1] TRUE
fun_n(c(F, T, T, F), c(1, 1))
## [1] TRUE
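To see what the regex approach matches against, the pattern and the string can be built step by step for one case. This is a standalone sketch for lens = c(3, 2, 1) only, where the number of thresholds is odd and the pattern therefore starts with a run of 1s; the names bits, pat and s are illustrative.

```r
lens <- c(3, 2, 1)

# Required run lengths from the outside in: 1, 2, 3, 2, 1
reps <- c(rev(lens), lens[-1])

# Alternate 1 and 0 so that the middle run (length >= 3) is made of 1s
bits <- rep(c(1, 0), length.out = length(reps))

# One "{n,}" quantifier per run, glued into a single pattern
pat <- paste0(bits, "{", reps, ",}", collapse = "")
pat   # "1{1,}0{2,}1{3,}0{2,}1{1,}"

x <- c(F, T, T, F, F, T, T, T, F, F, T, T, T, F)
s <- paste(x + 0, collapse = "")
s     # "01100111001110"

grepl(pat, s)   # TRUE
```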
Run time is not as fast as runfun on the example below but still quite fast running 10,000 instances of the example shown in slightly over 2 seconds on my laptop. Also the code is relatively short in length and loop-free.
> library(rbenchmark)
> benchmark(runfun(x, 1:3), fun_n(x, 1:3), replications = 10000)[1:4]
test replications elapsed relative
2 fun_n(x, 1:3) 10000 2.29 1.205
1 runfun(x, 1:3) 10000 1.90 1.000
Suppose that I have a 20 X 5 matrix, I would like to select subsets of the matrix and do some computation with them. Further suppose that each sub-matrix is 7 X 5. I could of course do
ncomb <- combn(20, 7)
which gives me all possible combinations of 7 row indices, and I can use these to obtain sub-matrices. But with a small, 20 X 5 matrix, there are already 77520 possible combinations. So I would like to instead randomly sample some of the combinations, e.g., 5000 of them.
One possibility is the following:
ncomb <- combn(20, 7)
ncombsub <- ncomb[, sample(77520, 5000)]
In other words, I obtain all possible combinations, and then randomly select only 5000 of the combinations. But I imagine it would be problematic to compute all possible combinations if I had a larger matrix - say, 100 X 7.
So I wonder if there is a way to get subsets of combinations without first obtaining all possible combinations.
Your approach:
op <- function(){
ncomb <- combn(20, 7)
ncombsub <- ncomb[, sample(choose(20,7), 5000)]
return(ncombsub)
}
A different strategy that simply samples seven rows from the original matrix 5000 times (replacing any duplicate samples with a new sample until 5000 unique row combinations are found):
me <- function(){
rowsample <- replicate(5000,sort(sample(1:20,7,FALSE)),simplify=FALSE)
while(length(unique(rowsample))<5000){
rowsample <- unique(rowsample)
rowsample <- c(rowsample,
replicate(5000-length(rowsample),
sort(sample(1:20,7,FALSE)),simplify=FALSE))
}
return(do.call(cbind,rowsample))
}
This should be more efficient because it prevents you from having to calculate all of the combinations first, which will get costly as the matrix gets larger.
And yet, some benchmarking reveals that is not the case. At least on this matrix:
library(microbenchmark)
microbenchmark(op(),me())
Unit: milliseconds
expr min lq median uq max neval
op() 184.5998 201.9861 206.3408 241.430 299.9245 100
me() 411.7213 422.9740 429.4767 474.047 490.3177 100
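The benchmark above is for 20 rows, where enumerating everything is cheap. With 100 rows there are about 1.6e10 combinations of 7, so enumeration is no longer practical and direct sampling is the only one of the two approaches that still runs. Below is a sketch of that sampling idea written for arbitrary sizes; sample_combs is a hypothetical helper, and using comma-joined strings as keys to detect duplicates is an assumption of this sketch rather than part of the answer.

```r
# Hypothetical helper: k distinct combinations of m out of n indices,
# returned as an m x k matrix (one combination per column)
sample_combs <- function(n, m, k) {
  stopifnot(k <= choose(n, m))
  seen <- character(0)   # keys of the combinations accepted so far
  out <- list()
  while (length(out) < k) {
    # Draw only as many candidates as are still missing
    cand <- replicate(k - length(out), sort(sample.int(n, m)), simplify = FALSE)
    keys <- vapply(cand, paste, character(1), collapse = ",")
    new <- !duplicated(keys) & !(keys %in% seen)
    out <- c(out, cand[new])
    seen <- c(seen, keys[new])
  }
  do.call(cbind, out)
}

set.seed(123)
idx <- sample_combs(100, 7, 5000)
dim(idx)   # 7 5000
```

When k is small relative to choose(n, m), duplicates are rare and the loop usually finishes after one or two passes.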
I ended up doing what @Roland suggested, by modifying combn(), and byte-compiling the code:
combn_sub <- function (x, m, nset = 5000, seed=123, simplify = TRUE, ...) {
stopifnot(length(m) == 1L)
if (m < 0)
stop("m < 0", domain = NA)
if (is.numeric(x) && length(x) == 1L && x > 0 && trunc(x) ==
x)
x <- seq_len(x)
n <- length(x)
if (n < m)
stop("n < m", domain = NA)
m <- as.integer(m)
e <- 0
h <- m
a <- seq_len(m)
len.r <- length(r <- x[a] )
count <- as.integer(round(choose(n, m)))
if( count < nset ) nset <- count
dim.use <- c(m, nset)
##-----MOD 1: Change the output matrix size--------------
out <- matrix(r, nrow = len.r, ncol = nset)
if (m > 0) {
i <- 2L
nmmp1 <- n - m + 1L
##----MOD 2: Select a subset of indices
set.seed(seed)
samp <- sort(c(1, sample( 2:count, nset - 1 )))
##----MOD 3: Start a counter.
counter <- 2L
while (a[1L] != nmmp1 ) {
if (e < n - h) {
h <- 1L
e <- a[m]
j <- 1L
}
else {
e <- a[m - h]
h <- h + 1L
j <- 1L:h
}
a[m - h + j] <- e + j
#-----MOD 4: Whenever the counter matches an index in samp,
#a combination of row indices is produced and stored in the matrix `out`
if(samp[i] == counter){
out[, i] <- x[a]
if( i == nset ) break
i <- i + 1L
}
#-----Increase the counter by 1 for each iteration of the while-loop
counter <- counter + 1L
}
}
array(out, dim.use)
}
library("compiler")
combn_sub <- cmpfun(combn_sub)