Performance R applying indicator and rbinding xts

Performance R applying indicator and rbinding xts - r

I'm new to R and currently poking that thing with a stick till it does, what I need to be done. Unfortunately I hit a wall with some performance issues.
My problem is, that I need a CCI indicator calculated on minute periods but refreshed every second for the "actual" minute of the iteration.
My implementation works but is incredibly slow. For 4 days of forex data on EUR/USD, based on second periods, I need almost 15 minutes to apply the indicator.
I did read some stuff about preallocation and slow rbind operations. I already reduced my rbind calls by refactoring the loops. But this didn't improve the performance. So I assume I'm loosing the time elsewhere.
Since I don't know anyone who is fit in R, I post my code here and hope for some help.
What I do is basically looping my second data, accumulate the seconds to minutes, calculate CCI and after I once did that for periode I then refresh the last minutebar every second.
addCCIToData <- function(bars, periode) {
#bars is OHLC based on second periods
#periode is number of periods for cci calculation
require(xts)
require(quantmod)
bars <- as.xts(bars)
bars$CCI <- 0
x <- 1
##scope is time of observation == periode
scope <- list()
for (i in 1:periode ) {
scope[[i]] <- 1 # save time by preallocating?
}
y <- nrow(bars)
lastminute <- 0
createdBarCount <- 1
enoughData <- FALSE
zeit1 <- as.POSIXlt(time(bars[x]))
while(x < y) {
zeit <- as.POSIXlt(time(bars[x]))
if(zeit$min != lastminute) {
zeit2 <-zeit
lastminute <- zeit$min
zeit1 <- as.POSIXlt(time(bars[x])) #reset zeit1 because of new 1 minute bar
createdBarCount <- createdBarCount + 1
if(createdBarCount > periode && enoughData == FALSE) {
enoughData = TRUE
i = 2
dataPeriodeMinus1 = scope[[1]]
while(i <= periode-1) {
dataPeriodeMinus1 = rbind(dataPeriodeMinus1, scope[[i]])
i = i + 1
}
createdBarCount <- periode
}
else if(enoughData == TRUE) {
newScope <- list()
for(i in 1:periode-1) {
newScope[i] <- scope[i+1]
}
scope = newScope
i = 2
dataPeriodeMinus1 = scope[[1]]
while(i <= periode-1) {
dataPeriodeMinus1 = rbind(dataPeriodeMinus1, scope[[i]])
i = i + 1
}
createdBarCount <- periode
}
}
a <- as.character(zeit1)
b <- as.character(as.POSIXlt(time(bars[x])))
c <- paste(a, b, sep = "::")
scope[createdBarCount] <- list(to.minutes(OHLC(bars[c]), 1,"CCI")) #merge the seconds to minutes
if(enoughData == TRUE) {
data = rbind(dataPeriodeMinus1, scope[[periode]])
# i = 2
# while(i <= periode) { #improve!, we need only the last bar to be binded here
# data = rbind(data, scope[[i]]) #internet says this is slow
# i = i + 1
# }
#bars[[x,5]] = SMA(data$CCI.Close, periode)[[periode]][[1]]
bars[[x,5]] = CCI(data[,c("CCI.High","CCI.Close", "CCI.Low")], periode, SMA)[[periode]]
}
x <- x+1
}
bars
}
Edit: fixed code.
Edit2: Testdata can be optained from here Tesdata
It can be loaded using the command "load("path/to/file)". Then just call addCCIToData(bars_seconds["2015-01-05 00:00:00::2015-01-05 02:00:00"], 14) after sourcing the above function. I really do think, that the continous merging of seconds to minutebars is the timeconsuming task. How can I optimize that?
Edit 3: Seems that calculation of CCI is also taking some time...
For the complete set of testdata I need
357s without cci calculation
902s with cci calculation
Thank you very much!

Related

R: Problems with updating list objects using a while-loop

I am currently trying to simulate a (random) market in R using a while-loop that runs while the market is open: while the time is less than 600 minutes.
On this market only one of four events may happen at any time: birth of a supply, birth of a demand, death of a supply or death of a demand.
These are all drawn from exponential distributions using the rexp()-command with their own intensity. Their amounts and respective prices are each drawn from their own normal distribution (only values greater than 0), and the time is then updated depending on which of the events is drawn first.
Then I would like to update these intensities (using cox-regression), and for this to happen I need to store previous information about each of the events, preferably in a list, so that I can for example draw samples from the living supplies and remove them, to imitate a purchase. I basically want to keep track of what is "alive" on the market at a given time. Here is some of my code:
TIME <- 0
count <- 1
...
my.stores <- c()
while(TIME < 600){
time.supply.birth <- rexp(1, intensity1)
time.supply.death <- rexp(1, intensity2)
time.demand.birth <- rexp(1, intensity3)
time.demand.death <- rexp(1, intensity4)
case1 <- time.supply.birth == min(time.supply.birth, time.demand.birth, time.supply.death, time.demand.death)
case2 <- time.supply.death == min(time.supply.birth, time.demand.birth, time.supply.death, time.demand.death)
case3 <- time.demand.birth == min(time.supply.birth, time.demand.birth, time.supply.death, time.demand.death)
case4 <- time.demand.death == min(time.supply.birth, time.demand.birth, time.supply.death, time.demand.death)
TIME <- TIME + time.supply.birth*case1 + time.supply.death*case2 + time.demand.birth*case3 + time.demand.death*case4
if(case1 == T){
amount.supply.birth <- rnorm() # with values
price.supply.birth <- rnorm()
count.supply.birth.event <- count.supply.birth.event + 1
my.stores[[count]]$amount.supply.birth <- c(my.stores[[count-1]]$amount.supply.birth, amount.supply.birth)
my.stores[[count]]$price.supply.birth <- c(my.stores[[count-1]]$price.supply.birth, price.supply.birth)
} else if(case2 ==T) {
# Death supply event: here a sample from the living supplies should be drawn
} else if(case3 == T){
# Similar to case 1
} else if(case4 == T){
# similar to case 2
} else{
}
count <- count + 1
}
My problem is that I cannot even store any information in the list, since the while-loop breaks immediately after one iteration, which results in the length of the list my.stores to be only 1 - I bet it is something about my indexing in the list, but I'm not sure how to get around it. I get the following warning:
Error in my.stores[[count - 1]] :
attempt to select less than one element in get1index <real>
and when I print the list I get the following:
> my.stores[[1]]
$amount.demand.birth
[1] 6.044815
Say I draw a demand.birth with an amount and a price, and then the next iteration I similarly draw a supply.birth, I would have liked something like:
> my.stores[[1]]
$amount.demand.birth
[1] 6.044815
$amount.supply.birth
[1] 0
$price.demand.birth
[1] 50.78
$price.supply.birth
[1] 0
> my.stores[[2]]
$amount.demand.birth
[1] 6.044815
[2] 6.044815
$amount.supply.birth
[1] 0
[2] 7.1312
$price.demand.birth
[1] 50.78
[2] 50.78
$price.supply.birth
[1] 0
[2] 95.00
Anyone who can help me with this or who has other suggestions?
Sorry about the long post.
Cheers!

Since my.stores[[0]] is not valid, did you try :
if (count==1) {
my.stores[[count]]$amount.supply.birth <- amount.supply.birth
my.stores[[count]]$price.supply.birth <- price.supply.birth
}
else {
my.stores[[count]]$amount.supply.birth <- c(my.stores[[count-1]]$amount.supply.birth, amount.supply.birth)
my.stores[[count]]$price.supply.birth <- c(my.stores[[count-1]]$price.supply.birth, price.supply.birth)
}

Changing start dates of schedules to optimize resources

I have a bunch of work that needs to be performed at specific time intervals. However, we have limited resources to do that work, each day. Therefore, I am trying to optimize the start time dates (start time dates can only be moved forward not backward) so that resources used everyday are more less similar to what we have budgeted for.
These functions are used in example below::
# Function to shift/rotate a vector
shifter <- function(x, n = 1) {
if (n == 0) x else c(tail(x, -n), head(x, n))
}
# Getting a range of dates
get_date_range <- function(current_date = Sys.Date(), next_planned_date = Sys.Date() + 5)
{
seq.Date(as.Date(current_date), as.Date(next_planned_date), "days")
}
Assume a toy example dataset :: Here task P1 starts on 14th while P2 starts on 15th. Value of zero means that no work is done for that task on that day.
# EXAMPLE TOY DATASET
datain = data.frame(dated = c("2018-12-14", "2018-12-15", "2018-12-16", "2018-12-17"),
P1 = c(1,2,0,3), P2 = c(0,4,0,6)) %>%
mutate(dated = as.character(dated))
#The amount of resources that can be used in a day
max_work = 4
# We will use all the possible combination of start dates to
# search for the best one
possible_start_dates <- do.call(expand.grid, date_range_of_all)
# Utilisation stores the capacity used during each
# combination of start dates
# We will use the minimum of thse utilisation
utilisation <- NULL # utilisation difference; absolute value
utilisation_act <- NULL # actual utilisation including negative utilisation
# copy of data for making changes
ndatain <- datain
# Move data across possible start dates and
# calculate the possible utilisation in each movements
for(i in 1:nrow(possible_start_dates)) # for every combination
{
for(j in 1:ncol(possible_start_dates)) # for every plan
{
# Number of days that are different
days_diff = difftime(oriz_start_date[["Plan_Start_Date"]][j],
possible_start_dates[i,j], tz = "UTC", units = "days" ) %>% as.numeric()
# Move the start dates
ndatain[, (j+1)] <- shifter(datain[, (j+1)], days_diff)
}
if(is.null(utilisation)) # first iteration
{
# calculate the utilisation
utilisation = c(i, abs(max_work - rowSums(ndatain %>% select(-dated))))
utilisation_act <- c(i, max_work - rowSums(ndatain %>% select(-dated)))
}else{ # everything except first iteration
utilisation = rbind(utilisation, c(i,abs(max_work - rowSums(ndatain %>% select(-dated)))))
utilisation_act <- rbind(utilisation_act, c(i, max_work - rowSums(ndatain %>% select(-dated))))
}
}
# convert matrix to dataframe
row.names(utilisation) <- paste0("Row", 1:nrow(utilisation))
utilisation <- as.data.frame(utilisation)
row.names(utilisation_act) <- paste0("Row", 1:nrow(utilisation_act))
utilisation_act <- as.data.frame(utilisation_act)
# Total utilisation
tot_util = rowSums(utilisation[-1])
# replace negative utilisation with zero
utilisation_act[utilisation_act < 0] <- 0
tot_util_act = rowSums(utilisation_act[-1])
# Index of all possible start dates producing minimum utilization changes
indx_min_all = which(min(tot_util) == tot_util)
indx_min_all_act = which(min(tot_util_act) == tot_util_act)
# The minimum possible dates that are minimum of actual utilisation
candidate_dates <- possible_start_dates[intersect(indx_min_all, indx_min_all_act), ]
# Now check which of them are closest to the current starting dates; so that the movement is not much
time_diff <- c()
for(i in 1:nrow(candidate_dates))
{
# we will add this value in inner loop so here we
timediff_indv <- 0
for(j in 1:ncol(candidate_dates))
{
diff_days <- difftime(oriz_start_date[["Plan_Start_Date"]][j],
candidate_dates[i,j], tz = "UTC", units = "days" ) %>% as.numeric()
# print(oriz_start_date[["Plan_Start_Date"]][j])
# print(candidate_dates[i,j])
#
# print(diff_days)
timediff_indv <- timediff_indv + diff_days
}
time_diff <- c(time_diff, timediff_indv)
}
# Alternatives
fin_dates <- candidate_dates[min(time_diff) == time_diff, ]
The above code runs well and produces the expected output; however it does not scale well. I have very large dataset (Two years worth of work and for more than thousand different tasks repeating in intervals) and searching through every possible combination is not a viable option. Are there ways I can formulate this problem as a standard optimization problem and use Rglpk or Rcplex or some even better solution. Thanks for inputs.

Here comes my longest StackOverflow answer ever, but I really like optimization problems. This is a variant of the so called job shop problem with a single machine, which you might be able to solve with Rcplex if you first formulate it as a LP-model. However, these formulations often scale poorly and computational times can grow exponentially, dependent on the formulation. For big problems, it is very common to use a heuristic, for example a genetic algorithm, which is what I often use in cases like this. It does not guarantee to give the optimal solution, but it gives us more control over performance vs runtime and the solution usually scales very well. Basically, it works by creating a large set of random solutions, called the population. Then we iteratively update this population by combining the solutions to make 'offspring', where better solutions should have a higher probability of creating offspring.
As a scoring function (to determine which solutions are 'better'), I used the sum of squares of the overcapacity per day, which penalizes very large overcapacity on a day. Note that you can use any scoring function that you want, so you could also penalize under-utilization of capacity if you deem that important.
The code for the example implementation is shown below. I generated some data of 200 days and 80 tasks. It runs in about 10 seconds on my laptop, improving the score of the random solution by over 65% from 2634 to 913. With an input of 700 days and 1000 tasks, the algorithm still runs within a matter of minutes with the same parameters.
Best solution score per iteration:
I also included use_your_own_sample_data, which you can set to TRUE to have the algorithm solve a simpler and smaller example to confirm that it gives the expected output:
dated P1 P2 P3 P4 P5 dated P1 P2 P3 P4 P5
2018-12-14 0 0 0 0 0 2018-12-14 0 0 3 1 0
2018-12-15 0 0 0 0 0 2018-12-15 0 3 0 0 1
2018-12-16 0 0 0 0 0 ----> 2018-12-16 0 0 3 1 0
2018-12-17 0 3 3 1 1 2018-12-17 0 3 0 0 1
2018-12-18 4 0 0 0 0 2018-12-18 4 0 0 0 0
2018-12-19 4 3 3 1 1 2018-12-19 4 0 0 0 0
I hope this helps! Let me know if you have more questions regarding this.
CODE
### PARAMETERS -------------------------------------------
n_population = 100 # the number of solutions in a population
n_iterations = 100 # The number of iterations
n_offspring_per_iter = 80 # number of offspring to create per iteration
max_shift_days = 20 # Maximum number of days we can shift a task forward
frac_perm_init = 0.25 # fraction of columns to change from default solution while creating initial solutions
early_stopping_rounds = 100 # Stop if score not improved for this amount of iterations
capacity_per_day = 4
use_your_own_sample_data = FALSE # set to TRUE to use your own test case
### SAMPLE DATA -------------------------------------------------
# datain should contain the following columns:
# dated: A column with sequential dates
# P1, P2, ...: columns with values for workload of task x per date
n_days = 200
n_tasks = 80
set.seed(1)
if(!use_your_own_sample_data)
{
# my sample data:
datain = data.frame(dated = seq(Sys.Date()-n_days,Sys.Date(),1))
# add some random tasks
for(i in 1:n_tasks)
{
datain[[paste0('P',i)]] = rep(0,nrow(datain))
rand_start = sample(seq(1,nrow(datain)-5),1)
datain[[paste0('P',i)]][seq(rand_start,rand_start+4)] = sample(0:5,5,replace = T)
}
} else
{
# your sample data:
library(dplyr)
datain = data.frame(dated = c("2018-12-14", "2018-12-15", "2018-12-16", "2018-12-17","2018-12-18","2018-12-19"),
P1 = c(0,0,0,0,4,4), P2 = c(0,0,0,3,0,3), P3=c(0,0,0,3,0,3), P4=c(0,0,0,1,0,1),P5=c(0,0,0,1,0,1)) %>%
mutate(dated = as.Date(dated,format='%Y-%m-%d'))
}
tasks = setdiff(colnames(datain),c("dated","capacity")) # a list of all tasks
# the following vector contains for each task its maximum start date
max_date_per_task = lapply(datain[,tasks],function(x) datain$dated[which(x>0)[1]])
### ALL OUR PREDEFINED FUNCTIONS ----------------------------------
# helper function to shift a task
shifter <- function(x, n = 1) {
if (n == 0) x else c(tail(x, n), head(x, -n))
}
# Score a solution
# We calculate the score by taking the sum of the squares of our overcapacity (so we punish very large overcapacity on a day)
score_solution <- function(solution,tasks,capacity_per_day)
{
cap_left = capacity_per_day-rowSums(solution[,tasks]) # calculate spare capacity
over_capacity = sum(cap_left[cap_left<0]^2) # sum of squares of overcapacity (negatives)
return(over_capacity)
}
# Merge solutions
# Get approx. 50% of tasks from solution1, and the remaining tasks from solution 2.
merge_solutions <- function(solution1,solution2,tasks)
{
tasks_from_solution_1 = sample(tasks,round(length(tasks)/2))
tasks_from_solution_2 = setdiff(tasks,tasks_from_solution_1)
new_solution = cbind(solution1[,'dated',drop=F],solution1[,tasks_from_solution_1,drop=F],solution2[,tasks_from_solution_2,drop=F])
return(new_solution)
}
# Randomize solution
# Create an initial solution
randomize_solution <- function(solution,max_date_per_task,tasks,tasks_to_change=1/8)
{
# select some tasks to reschedule
tasks_to_change = max(1, round(length(tasks)*tasks_to_change))
selected_tasks <- sample(tasks,tasks_to_change)
for(task in selected_tasks)
{
# shift task between 14 and 0 days forward
new_start_date <- sample(seq(max_date_per_task[[task]]-max_shift_days,max_date_per_task[[task]],by='day'),1)
new_start_date <- max(new_start_date,min(solution$dated))
solution[,task] = shifter(solution[,task],as.numeric(new_start_date-max_date_per_task[[task]]))
}
return(solution)
}
# sort population based on scores
sort_pop <- function(population)
{
return(population[order(sapply(population,function(x) {x[['score']]}),decreasing = F)])
}
# return the scores of a population
pop_scores <- function(population)
{
sapply(population,function(x) {x[['score']]})
}
### RUN SCRIPT -------------------------------
# starting score
print(paste0('Starting score: ',score_solution(datain,tasks,capacity_per_day)))
# Create initial population
population = vector('list',n_population)
for(i in 1:n_population)
{
# create initial solutions by making changes to the initial solution
solution = randomize_solution(datain,max_date_per_task,tasks,frac_perm_init)
score = score_solution(solution,tasks,capacity_per_day)
population[[i]] = list('solution' = solution,'score'= score)
}
population = sort_pop(population)
score_per_iteration <- score_solution(datain,tasks,capacity_per_day)
# Run the algorithm
for(i in 1:n_iterations)
{
print(paste0('\n---- Iteration',i,' -----\n'))
# create some random perturbations in the population
for(j in 1:10)
{
sol_to_change = sample(2:n_population,1)
new_solution <- randomize_solution(population[[sol_to_change]][['solution']],max_date_per_task,tasks)
new_score <- score_solution(new_solution,tasks,capacity_per_day)
population[[sol_to_change]] <- list('solution' = new_solution,'score'= new_score)
}
# Create offspring, first determine which solutions to combine
# determine the probability that a solution will be selected to create offspring (some smoothing)
probs = sapply(population,function(x) {x[['score']]})
if(max(probs)==min(probs)){stop('No diversity in population left')}
probs = 1-(probs-min(probs))/(max(probs)-min(probs))+0.2
# create combinations
solutions_to_combine = lapply(1:n_offspring_per_iter, function(y){
sample(seq(length(population)),2,prob = probs)})
for(j in 1:n_offspring_per_iter)
{
new_solution <- merge_solutions(population[[solutions_to_combine[[j]][1]]][['solution']],
population[[solutions_to_combine[[j]][2]]][['solution']],
tasks)
new_score <- score_solution(new_solution,tasks,capacity_per_day)
population[[length(population)+1]] <- list('solution' = new_solution,'score'= new_score)
}
population = sort_pop(population)
population= population[1:n_population]
print(paste0('Best score:',population[[1]]['score']))
score_per_iteration = c(score_per_iteration,population[[1]]['score'])
if(i>early_stopping_rounds+1)
{
if(score_per_iteration[[i]] == score_per_iteration[[i-10]])
{
stop(paste0("Score not improved in the past ",early_stopping_rounds," rounds. Halting algorithm."))
}
}
}
plot(x=seq(0,length(score_per_iteration)-1),y=score_per_iteration,xlab = 'iteration',ylab='score')
final_solution = population[[1]][['solution']]
final_solution[,c('dated',tasks)]
And indeed, as we expect, the algorithm turns out to be very good in reducing the number of days with a very high overcapacity:
final_solution = population[[1]][['solution']]
# number of days with workload higher than 10 in initial solution
sum(rowSums(datain[,tasks])>10)
> 19
# number of days with workload higher than 10 in our solution
sum(rowSums(final_solution[,tasks])>10)
> 1

Optimising a calculation on every cumulative subset of a vector in R

I have a collection of DNA sequencing reads of various lengths, sorted from longest to shortest. I would like to know the largest number of reads I can include in a set such that the N50 of that set is above some threshold t
For any given set of reads, the total amount of data is just the cumulative sum of the lengths of the reads. The N50 is defined as the length of the read such that half of the data are contained in reads at least that long.
I have a solution below, but it is slow for very large read sets. I tried vectorising it, but this was slower (probably because my threshold is usually relatively large, such that my solution below stops calculating fairly early on).
Here's a worked example:
df = data.frame(l = 100:1) # read lengths
df$cs = cumsum(df$l) # getting the cumulative sum is easy and quick
t = 95 # let's imagine that this is my threshold N50
for(i in 1:nrow(df)){
N50 = df$l[min(which(df$cs>df$cs[i]/2))]
if(N50 < t){ break }
}
# the loop will have gone one too far, so I subtract one
number.of.reads = as.integer(i-1)
This works fine on small datasets, but my actual data are more like 5m reads that vary from ~200,000 to 1 in length (longer reads are rarer), and I'm interested in an N50 of 100,000, then it gets pretty slow.
This example is closer to something that's realistic. It takes ~15s on my desktop.
l = ceiling(runif(100000, min = 0, max = 19999))
l = sort(l, decreasing = T)
df = data.frame(l = l)
df$cs = cumsum(df$l)
t = 18000
for(i in 1:nrow(df)){
n = df$l[min(which(df$cs>df$cs[i]/2))]
if(n < t){ break }
}
result = as.integer(i-1)
So, I'm interested in any ideas, tips, or tricks to noticeably optimise this. It seems like this should be possible, but I'm out of ideas.

As n is decreasing with i, you should use a binary search algorithm.
binSearch <- function(min, max) {
print(mid <- floor(mean(c(min, max))))
if (mid == min) {
if (df$l[min(which(df$cs>df$cs[min]/2))] < t) {
return(min - 1)
} else {
return(max - 1)
}
}
n = df$l[min(which(df$cs>df$cs[mid]/2))]
if (n >= t) {
return(binSearch(mid, max))
} else {
return(binSearch(min, mid))
}
}
Then, just call
binSearch(1, nrow(df))

Since your data are ordered by DNA/read length, maybe you could avoid testing every single row. On the contrary, you can iterate and test a limited number of rows (reasonably spaced) at each iteration (using while() for example), and so get progressively closer to your solution. This should make things much faster. Just make sure that once you get close to the solution, you stop iterating.
This is your solution
set.seed(111)
l = ceiling(runif(100000, min = 0, max = 19999))
l = sort(l, decreasing = T)
df = data.frame(l = l)
df$cs = cumsum(df$l)
t = 18000
for(i in 1:nrow(df)){
n = df$l[min(which(df$cs>df$cs[i]/2))]
if(n < t){ break }
}
result = as.integer(i-1)
result
# 21216, in ~29 seconds
Instead of testing every row, let's set a range
i1 <- 1
i2 <- nrow(df)
i.range <- as.integer(seq(i1, i2, length.out = 10))
Now, test only these 10 rows. Get the closest one and "focus in" by re-defining the range. Stop when you cannot increase granularity.
while(sum(duplicated(i.range))==0){
for(i in 1:length(i.range)){
N50 = df$l[min(which(df$cs>df$cs[i.range[i]]/2))]
if(N50 < t){ break }
}
#update i1 and i2
i1 <- i.range[(i-1)]
i2 <- i.range[i]
i.range <- as.integer(seq(i1, i2, length.out = 10))
}
i.range <- seq(i1, i2, by=1)
for(i in i.range){
N50 = df$l[min(which(df$cs>df$cs[i]/2))]
if(N50 < t){ break }
}
result <- as.integer(i-1)
result
#21216, in ~ 0.06 seconds
Same result in a fraction of the time.

How to use a while() loop within a for() loop in R

I'm new to R, so most of my code is most likely wrong. However, I was wondering how to use a while() loop within a for() loop. I'm trying to simulate rolling a pair of dice several times if the total 2,3,7,11,or 12 then I stop. If the total 4,5,6,8,9, or 10 then I continue to the roll the dice until the initial total appears or 7. I'm trying to find the average number of rolls it take to end the game
count = 0
x = NULL
for (i in 1:10) {
x[i] = c(sample(1:6,1) +sample(1:6,1))
if(x[i] == c(2||3||7||11||12)) {
if(TRUE) {count = count +1}
} else { while(x[i] == c(4||5||6||8||9||10)) {
x[i +1] = c(sample(1:6,1)+sample(1:6,1))
if(x[i+1] == c(x[i]||7)) {
if(TRUE){count = count + x[i+1]}
}
}
}
}
print(count)

I think there are a few issues with your logic. I'm not quite sure what you're trying to do in your code, but this is my interpretation of your description of your problem ... this only runs a single round of your game -- it should work if you embed it in a for loop though (just don't reset count or reset the random-number seed in side your loop -- then count will give you the total number of rolls, and you can divide by the number of rounds to get the average)
Setup:
count = 0
sscore <- c(2,3,7,11,12)
set.seed(101)
debug = TRUE
Running a single round:
x = sample(1:6,1) +sample(1:6,1) ## initial roll
count = count + 1
if (x %in% sscore) {
## don't need to do anything if we hit,
## as the roll has already been counted
if (debug) cat("hit",x[i],"\n")
} else {
## initialize while loop -- try one more time
y = c(sample(1:6,1)+sample(1:6,1))
count = count + 1
if (debug) cat("initial",x,"next",y,"\n")
while(!(y %in% c(x,7))) {
y = c(sample(1:6,1)+sample(1:6,1))
count = count+1
if (debug) cat("keep trying",y,"\n")
} ## end while
} ## end if-not-hit
print(count)
I tried embedding this in a for loop and got a mean of 3.453 for 1000 rounds, close to #PawelP's answer.
PS I hope this isn't homework, as I prefer not to answer homework questions ...

EDIT: I had a bug - forgot to remove if negation. Now the below seems to be 100% true to your description of the problem.
This is my implementation of the game you've described. It calculates the average number of rolls it took to end the game over a TOTAL_GAMES many games.
TOTAL_GAMES = 1000
counts = rep(0, TOTAL_GAMES)
x = NULL
for (i in 1:TOTAL_GAMES) {
x_start = c(sample(1:6,1) +sample(1:6,1))
counts[i] = counts[i] + 1
x = x_start
if(x %in% c(2, 3, 7, 11, 12)){
next
}
repeat {
x = c(sample(1:6,1)+sample(1:6,1))
counts[i] = counts[i] + 1
if(x %in% c(x_start, 7)){
break
}
}
}
print(mean(counts))
It seems that the average number of rolls is around 3.38

Here's one approach to this question - I made a function that runs a single trial, and another function which conducts a variable number of these trials and returns the cumulative average.
## Single trial
rollDice <- function(){
init <- sample(1:6,1)+sample(1:6,1)
rolls <- 1
if( init %in% c(2,3,7,11,12) ){
return(1)
} else {
Flag <- TRUE
while( Flag ){
roll <- sample(1:6,1)+sample(1:6,1)
rolls <- rolls + 1
if( roll %in% c(init,7) ){
Flag <- FALSE
}
rolls
}
}
return(rolls)
}
## Multiple trials
simAvg <- function(nsim = 100){
x <- replicate(nsim,rollDice())
Reduce("+",x)/nsim
}
##
## Testing
nTrial <- seq(1,1000,25)
Results <- sapply(nTrial, function(X){ simAvg(X) })
##
## Plot over varying number of simulations
plot(x=nTrial,y=Results,pch=20)
As #Ben Bolker pointed out, you had a couple of syntax errors with ||, which is understandable for someone new to R. Also, you'll probably hear it a thousand times, but for and while loops are pretty inefficient in R so you generally want to avoid them if possible. In the case of the while loop in the above rollDice() function, it probably isn't a big deal because the probability of the loop executing a large number of times is very low. I used the functions Reduce and replicate to serve the role of a for loop in the second function. Good question though, it was fun to work on.

Doubling time on R

I have a problem with calculating the doubling time for cancer growth in R. The data contains multiple scans of the same patient taken over 5 years. There seem to be, however, cases where the patient has been scanned multiple times in a year. I want to calculate the doubling time of the mass of nodes for all patients for 1 scan and the last scan.
I have calculated the doubling time of a node of the last patient, but I need to get the doubling time for all patients.
The code i have used:
Nod <- read.table("NoData270513.txt" , header = T)
Nod$CoNo <- 10*Nod$StNo + Nod$LeNo
length(Nod$CoNo); length(unique (Nod$CoNo))
Nod$CoNo <- factor(Nod$CoNo)
Nod$CTDato <- as.Date(Nod$CTDato)
NodTyp1 <- rep(NA, length(unique(Nod$CoNo)))
i <- 0; i1 <- 0; i2 <- 0
for (j in unique(Nod$CoNo)) { temp <- Nod[Nod$CoNo==j, ]
i <- i + 1; i1 <- i2 + 1; i2 <- i2 + length(temp$CoNo)
NodTyp1[1:20]
vdt <- rep(NA, 1216)
if (length(temp$Age) > 1 )
{
vdt[j] <- (as.numeric(temp$CTDato[length(temp$Age)]) - as.numeric(temp$CTDato[1])) * log(2)/log((temp$SDia[length(temp$Age)]/temp$SDia[1]))
}

If I got it right, the only thing you need is to create a function that takes data filename and returns what you need. Then just iterate through all data files.
It seems that will be the patern:
# declare function for one patient
calculate.doub.time <- function(filename){
Nod <- read.table(filename , header = T)
# ...
# ...
# return what you want
}
# calculate all data files
all.data <- list.files() # assuming your working directory contains all data
result <- sapply(all.data, calculate.doub.time)
Sorry in advance if I misunderstood what you want to achieve.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Performance R applying indicator and rbinding xts - r

Related

R: Problems with updating list objects using a while-loop

Changing start dates of schedules to optimize resources

Optimising a calculation on every cumulative subset of a vector in R

How to use a while() loop within a for() loop in R

Doubling time on R

Categories

Resources