Minimax with alpha-beta pruning for tic-tac-toe in R

I need help adding alpha-beta pruning to a minimax algorithm in R. I have implemented plain minimax, but it is only usable for a 3x3 board; 4x4 boards do not finish because the run time is far too long.
I copied the code over from the 3x3 board, but I realize I cannot provide a search depth, so I assume it explores every possible continuation of the 4x4 board. What do I need to change in the minimax section to implement alpha-beta pruning? Since I am fairly new to this field, I am trying to modify existing code to understand what each part does.
# draw the board for tic tac toe
draw_board <- function(board) {
  xo <- c("X", " ", "O") # symbols
  par(mar = rep(0, 4))
  plot.new()
  plot.window(xlim = c(0, 40), ylim = c(0, 40))
  abline(h = c(10, 20, 30), col = "darkgrey", lwd = 4)
  abline(v = c(10, 20, 30), col = "darkgrey", lwd = 4)
  pieces <- xo[board + 2]
  text(rep(c(5, 15, 25, 35), 4), c(rep(35, 4), rep(25, 4), rep(15, 4), rep(5, 4)), pieces, cex = 6)
  # identify location of any three in a row
  square <- t(matrix(board, nrow = 4))
  hor <- abs(rowSums(square))
  if (any(hor == 4))
    hor <- (5 - which(hor == 4)) * 10 - 5
  else
    hor <- 0
  ver <- abs(colSums(square))
  if (any(ver == 4))
    ver <- which(ver == 4) * 10 - 5
  else
    ver <- 0
  diag1 <- sum(diag(square))
  diag2 <- sum(diag(t(apply(square, 2, rev))))
  # draw winning lines
  if (hor > 0) lines(c(0, 40), rep(hor, 2), lwd = 10, col = "red")
  if (ver > 0) lines(rep(ver, 2), c(0, 40), lwd = 10, col = "red")
  if (abs(diag1) == 4) lines(c(2, 38), c(38, 2), lwd = 10, col = "red")
  if (abs(diag2) == 4) lines(c(2, 38), c(2, 38), lwd = 10, col = "red")
}
# Human player enters a move
move_human <- function(game) {
  text(4, 0, "Click on screen to move", col = "grey", cex = .7)
  empty <- which(game == 0)
  move <- 0
  while (!move %in% empty) {
    coords <- locator(n = 1) # add lines
    coords$x <- floor(abs(coords$x) / 10) + 1
    coords$y <- floor(abs(coords$y) / 10) + 1
    move <- coords$x + 4 * (4 - coords$y) # 4 is the number of rows/columns --> needs to be a square
  }
  return(move)
}
# Evaluate winner function
eval_winner <- function(game_1, player) {
  game <- matrix(game_1, nrow = 3, byrow = T)
  hor <- rowSums(game)
  ver <- colSums(game)
  diag <- c(sum(diag(game)), sum(diag(apply(game, 1, rev))))
  if (-4 %in% c(hor, ver, diag))
    return(-10)
  if (4 %in% c(hor, ver, diag))
    return(10)
  else
    return(0)
}
# Minimax AI function
minimax <- function(game_1, player) {
  free <- which(game_1 == 0)
  if (length(free) == 1) {
    game_1[free] <- player
    return(list(move = free, U = eval_winner(game_1, player)))
  }
  poss.results <- rep(0, 16)
  for (i in free) {
    game <- game_1
    game[i] <- player
    poss.results[i] <- eval_winner(game, player)
  }
  mm <- ifelse(player == -1, "which.min", "which.max")
  if (any(poss.results == (player * 10))) {
    move <- do.call(mm, list(poss.results))
    return(list(move = move, U = poss.results[move]))
  }
  for (i in free) {
    game <- game_1
    game[i] <- player
    poss.results[i] <- minimax(game, -player)$U
  }
  random <- runif(16, 0, 0.1)
  poss.results[-free] <- 100 * -player
  poss.results <- poss.results + (player * random)
  move <- do.call(mm, list(poss.results))
  return(list(move = move, U = poss.results[move]))
}
# Main game engine human versus randomly choosing computer!
tic_tac_toe <- function(player1 = "human", player2 = "computer") {
  game <- rep(0, 16) # Empty board
  winner <- FALSE # Define winner
  player <- 1 # First player
  #players <- c(player1, player2)
  players <- c("human", "computer")
  draw_board(game)
  while (0 %in% game & winner == 0) { # Keep playing until win or full board
    if (players[(player + 3) %% 3] == "human") # Human player
      move <- move_human(game)
    else { # Computer player
      move <- minimax(game, player)
      move <- move$move
    }
    game[move] <- player # Change board
    draw_board(game)
    winner <- max(eval_winner(game, 1), abs(eval_winner(game, -1))) == 6 # Winner, winner, chicken dinner?
    player <- -player # Change player
  }
  if (winner == 1)
    print("Human has won")
  else if (winner == 2)
    print("Computer has won")
  else
    print("Play ended in a draw")
}

Going to reiterate what TheWhiteRabbit said: you should have a look at https://stackoverflow.com/help/how-to-ask
If you are more specific and include your code in the future, we can be way more helpful, but I'll give you some general suggestions based on what you have provided.
I suspect your issue is one or more of the following:
You are not limiting your depth. You are trying to search all the way to the endgame. Minimax should only search enough turns ahead that your hardware can handle the strain.
Your scoring function is too inefficient. The score function is often the majority of your computation time in a Minimax search. If it is inefficient, you will pay for it.
Similarly, your code that generates a list of valid moves might be inefficient.
You might be considering invalid moves, causing your tree to branch way more than it should.
You are not generalizing your code enough. It doesn't work for 4x4 because you have hardcoded something to rely on a 3x3 board without realizing it.
Your Alpha-Beta pruning is incorrect. You are pruning nothing.
From my experience implementing MiniMax + variants, these tend to be some of the failure points.
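To make the pruning part concrete, here is a minimal, depth-limited minimax sketch with alpha-beta pruning written against the board representation in the question (a length-16 vector of -1/0/1). It is only a starting point under two assumptions: eval_winner() has been generalized to the 4x4 board (matrix(game_1, nrow = 4), wins scored on sums of +/-4), and the names alphabeta and depth are mine, not part of the original code:
alphabeta <- function(board, player, depth = 6, alpha = -Inf, beta = Inf) {
  score <- eval_winner(board, player)
  free <- which(board == 0)
  # stop at a decided game, a full board, or the depth limit
  if (score != 0 || length(free) == 0 || depth == 0)
    return(list(move = NA, U = score))
  best_move <- free[1]
  if (player == 1) {                    # maximizing player
    best <- -Inf
    for (i in free) {
      child <- board
      child[i] <- player
      val <- alphabeta(child, -player, depth - 1, alpha, beta)$U
      if (val > best) { best <- val; best_move <- i }
      alpha <- max(alpha, best)
      if (beta <= alpha) break          # prune: the minimizer will never allow this line
    }
  } else {                              # minimizing player
    best <- Inf
    for (i in free) {
      child <- board
      child[i] <- player
      val <- alphabeta(child, -player, depth - 1, alpha, beta)$U
      if (val < best) { best <- val; best_move <- i }
      beta <- min(beta, best)
      if (beta <= alpha) break          # prune
    }
  }
  list(move = best_move, U = best)
}
In tic_tac_toe() the computer's move would then be move <- alphabeta(game, player)$move instead of calling minimax(). The if (beta <= alpha) break lines are the pruning itself: once one side already has a guaranteed value at least as good as anything the other side can force here, the remaining children of that node cannot change the result and are skipped.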

Related

How to rotate nodes of a time-calibrated phylogenetic tree to match a particular order in R?

I have a time-calibrated phylogenetic tree from BEAST and I would like to make a figure in which its nodes are rotated to match an arbitrary ordering. The following code works perfectly to plot the tree with the nodes in the order they are in the input file.
library("phytools")
library("phyloch")
library("strap")
library("coda")
t <- read.beast("mcctree.tre") # I couldn't upload the file here
t$root.time <- t$height[1]
num_taxa <- length(t$tip.label)
display_all_node_bars <- TRUE
names_list <-vector()
for (name in t$tip){
v <- strsplit(name, "_")[[1]]
if(display_all_node_bars){
names_list = c(names_list, name)
}
else if(v[length(v)]=="0"){
names_list = c(names_list, name)
}
}
nids <- vector()
pos <- 1
len_nl <- length(names_list)
for(n in names_list){
for(nn in names_list[pos:len_nl]){
if(n != nn){
m <- getMRCA(t,c(n, nn))
if(m %in% nids == FALSE){
nids <- c(nids, m)
}
}
}
pos <- pos+1
}
pdf("tree.pdf", width = 20, height = 20)
geoscalePhylo(tree = t,
x.lim = c(-2,21),
units = c("Epoch"),
tick.scale = "myr",
boxes = FALSE,
width = 1,
cex.tip = 2,
cex.age = 3,
cex.ts = 2,
erotate = 0,
label.offset = 0.1)
lastPP <- get("last_plot.phylo", envir = .PlotPhyloEnv)
for(nv in nids){
bar_xx_a <- c(lastPP$xx[nv]+t$height[nv-num_taxa]-t$"height_95%_HPD_MIN"[nv-num_taxa],
lastPP$xx[nv]-(t$"height_95%_HPD_MAX"[nv-num_taxa]-t$height[nv-num_taxa]))
lines(bar_xx_a, c(lastPP$yy[nv], lastPP$yy[nv]), col = rgb(0, 0, 1, alpha = 0.3), lwd = 12)
}
t$node.label <- t$posterior
p <- character(length(t$node.label))
p[t$node.label >= 0.95] <- "black"
p[t$node.label < 0.95 & t$node.label >= 0.75] <- "gray"
p[t$node.label < 0.75] <- "white"
nodelabels(pch = 21, cex = 1.5, bg = p)
dev.off()
The following code is my attempt to rotate the nodes in the way I want (following this tutorial: http://blog.phytools.org/2015/04/finding-closest-set-of-node-rotations.html). And it works for rotating the nodes. However, the blue bars indicating the confidence intervals of the divergence time estimates get out of their correct place - this is what I would like help to correct. This will be used in much larger files with hundreds of branches - the example here is simplified.
new.order <- c("Sp8","Sp9","Sp10","Sp7","Sp6","Sp5","Sp4","Sp2","Sp3","Ou1","Ou2","Sp1")
t2 <- setNames(1:Ntip(t), new.order)
new.order.tree <- minRotate(t, t2)
new.order.tree$root.time <- t$root.time
new.order.tree$height <- t$height
new.order.tree$"height_95%_HPD_MIN" <- t$"height_95%_HPD_MIN"
new.order.tree$"height_95%_HPD_MAX" <- t$"height_95%_HPD_MAX"
pdf("reordered_tree.pdf", width = 20, height = 20)
geoscalePhylo(tree = new.order.tree,
x.lim = c(-2,21),
units = c("Epoch"),
tick.scale = "myr",
boxes = FALSE,
width = 1,
cex.tip = 2,
cex.age = 3,
cex.ts = 2,
erotate = 0,
label.offset = 0.1)
lastPP <- get("last_plot.phylo", envir = .PlotPhyloEnv)
for(nv in nids){
bar_xx_a <- c(lastPP$xx[nv]+new.order.tree$height[nv-num_taxa]-new.order.tree$"height_95%_HPD_MIN"[nv-num_taxa],
lastPP$xx[nv]-(new.order.tree$"height_95%_HPD_MAX"[nv-num_taxa]-new.order.tree$height[nv-num_taxa]))
lines(bar_xx_a, c(lastPP$yy[nv], lastPP$yy[nv]), col = rgb(0, 0, 1, alpha = 0.3), lwd = 12)
}
new.order.tree$node.label <- t$posterior
p <- character(length(new.order.tree$node.label))
p[new.order.tree$node.label >= 0.95] <- "black"
p[new.order.tree$node.label < 0.95 & new.order.tree$node.label >= 0.75] <- "gray"
p[new.order.tree$node.label < 0.75] <- "white"
nodelabels(pch = 21, cex = 1.5, bg = p)
dev.off()
I've found several similar questions here and in other forums, but none dealing specifically with time-calibrated trees - which is the core of the problem described above.
The short answer is that phytools::minRotate() doesn't recognize the confidence intervals as associated with nodes. If you contact the phytools maintainers, they may well be able to add this functionality quite easily.
Meanwhile, you can correct this yourself.
I don't know how read.beast() saves confidence intervals – let's say they're saved in t$conf.int. (Type unclass(t) at the R command line to see the full structure; you should be able to identify the appropriate property.)
If the tree's node labels are unique, then you can infer the new sequence of nodes using match():
library("phytools")
new.order <- c("Sp8","Sp9","Sp10","Sp7","Sp6","Sp5","Sp4","Sp2","Sp3","Ou1","Ou2","Sp1")
# Set up a fake initial tree -- you would load the tree from a file
tree <- rtree(length(new.order))
tree$tip.label <- sort(new.order)
tree$node.label <- seq_len(tree$Nnode)
tree$conf.int <- seq_len(tree$Nnode) * 10
# Plot tree
par(mfrow = c(1, 2), mar = rep(0, 4), cex = 0.9) # Create space
plot(tree, show.node.label = TRUE)
nodelabels(tree$conf.int, adj = 1) # Annotate "correct" intervals
# Re-order nodes with minRotate
noTree <- minRotate(tree, setNames(seq_along(new.order), new.order))
plot(noTree, show.node.label = TRUE)
# Move confidence intervals to correct node
tree$conf.int <- tree$conf.int[match(noTree$node.label, tree$node.label)]
nodelabels(tree$conf.int, adj = 1)
If you can't guarantee that the node labels are unique, you can always overwrite them in a temporary object:
# Find node order
treeCopy <- tree
treeCopy$node.label <- seq_len(tree$Nnode)
nodeOrder <- match(minRotate(treeCopy, setNames(seq_along(new.order), new.order))$node.label, treeCopy$node.label)
# Apply node order
tree$conf.int <- tree$conf.int[nodeOrder]

Infinite loop in coin-flipping game

Consider the following coin-flipping game:
A single play of the game consists of repeatedly flipping a fair coin until the difference between the number of heads tossed and the number of tails is 4.
You are required to pay 1 dollar for each flip of the coin, and you may not quit during the play of the game.
You receive 10 dollars at the end of each play of the game. The “winnings” from the game is defined as the 10 received at the end minus the amount paid. a. Simulate this game to estimate the expected winnings from many plays of the game. b. Suppose we use a biased coin. Find value(s) of P(tail) that make the game fair, meaning the expected winnings is 0 dollar.
This is the question that I'm supposed to answer, and here is my attempt:
h <- function() {
  A <- c("H", "T")
  s <- sample(A, 4, replace = T)
  heads <- length(which(s == "H"))
  tails <- length(which(s == "T"))
  w <- heads - tails
  counter <- 4
  while (w != 4) {
    s <- sample(A, 1)
    w <- heads - tails
    heads <- length(which(s == "H"))
    tails <- length(which(s == "T"))
    counter <- counter + 1
  }
  return(counter)
}
h()
But I think this gives me an infinite loop. Can anyone help, please?
You are recomputing w in every iteration of the loop based on the current values of heads and tails. But these values will always be 1 and 0 (or 0 and 1), so w is always either -1 or 1, never any other value.
Another error in your code is that you only stop when heads is 4 ahead. But according to the rules, the game should also stop when tails is 4 ahead: only the absolute difference matters.
The logic of your code could be fixed, but a much simpler logic would work (note that the following code uses self-explanatory variable names, which makes the resulting code much more readable):
h = function () {
  sides = c('H', 'T')
  diff = 0L
  cost = 0L
  repeat {
    cost = cost + 1L
    flip = sample(sides, 1L)
    if (flip == 'H') diff = diff + 1L
    else diff = diff - 1L
    if (abs(diff) == 4L) return(cost)
  }
}
You can simplify this further because the labels of the coin sides don’t actually matter. All you care about is a coin toss that returns one of two results.
We can implement that as a separate function. The return value of the function isn’t very important, as long as we have a fixed convention: it could be in c('H', 'T'), or c(FALSE, TRUE), or c(0L, 1L), etc. For our purposes, it would be convenient to return either -1L or 1L, so that our function h could directly add that value to diff:
coin_toss = function () {
  sample(c(-1L, 1L), 1L)
}
But there’s a different way of obtaining a coin toss: a Bernoulli trial of size 1. And using a Bernoulli trial has a nice property: we can trivially extend our function to allow unfair (biased) coin tosses. So here’s the same function, but with an optional bias parameter (by default the coin toss is fair):
coin_toss = function (bias = 0.5) {
  rbinom(1L, 1L, prob = bias) * 2L - 1L
}
(rbinom(…) returns either 0L or 1L. To transform the domain of values into c(-1L, 1L), we multiply by 2 and subtract 1.)
Now let’s change h to use this function:
h = function (bias = 0.5) {
  cost = 0L
  diff = 0L
  repeat {
    cost = cost + 1L
    diff = diff + coin_toss(bias)
    if (abs(diff) == 4L) return(cost)
  }
}
coin_toss(bias) returns either -1 or 1, so each iteration either decrements or increments diff by one until the absolute difference reaches 4.
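To actually answer part a) with this function, you can simulate many plays and average the winnings (a quick usage sketch, not part of the answer above; with a fair coin the mean comes out near -6, matching the theoretical expected game length of 4^2 = 16 flips):
set.seed(1)
winnings <- 10 - replicate(10000, h())
mean(winnings)  # roughly -6 for a fair coin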
I'd like to answer both parts, a) and b), of your question. I'll reuse my own code to save time.
It's a cool game, and simulation proves very helpful here.
The bare bones of the game is a loop that runs until the absolute difference between the number of heads and tails equals 4, at which point the payoff is recorded. As Konrad Rudolph mentioned, the flips are Bernoulli trials. The game is simulated with the code below:
n_games <- 1000 # number of games to play
bias <- 0.5
game_payoff <- c()
for (i in seq_len(n_games)) {
  cost <- 0
  flip_record <- c()
  payoff <- c()
  repeat {
    cost <- cost + 1
    flip <- rbinom(1, 1, prob = bias)
    flip_record <- c(flip_record, flip)
    n_tails <- length(flip_record) - sum(flip_record) # number of 0s/tails
    n_heads <- sum(flip_record) # number of 1s/heads
    if (abs(n_tails - n_heads) == 4) {
      game_payoff <- c(game_payoff, 10 - cost) # record game payoff
      print(paste0("single game payoff: ", 10 - cost)) # print game payoff
      break
    }
  }
}
With a large number of runs, e.g. another loop over this loop, we learn that the expected value is very close to -6; the game therefore has a negative expected value. That follows from this code:
library(ggplot2)
seed <- 122334
# simulation
n_runs <- 100
n_games <- 10000
bias <- 0.5
game_payoff <- c()
expected_value_record <- c()
for (j in seq_len(n_runs)) {
  for (i in seq_len(n_games)) {
    cost <- 0
    flip_record <- c()
    payoff <- c()
    repeat {
      cost <- cost + 1
      flip <- rbinom(1, 1, prob = bias)
      flip_record <- c(flip_record, flip)
      # print(flip_record)
      n_tails <- length(flip_record) - sum(flip_record) # number of 0s/tails
      n_heads <- sum(flip_record) # number of 1s/heads
      if (abs(n_tails - n_heads) == 4) {
        game_payoff <- c(game_payoff, 10 - cost) # record game payoff
        print(paste0("single game payoff: ", 10 - cost))
        break
      }
    }
  }
  expected_value_record <- c(expected_value_record, mean(game_payoff))
  game_payoff <- c()
}
# plot expected value
expected_value_record <- cbind.data.frame("run" = seq_len(length(expected_value_record)), expected_value_record)
ggplot(data = expected_value_record) +
  geom_line(aes(x = run, y = expected_value_record)) +
  scale_x_continuous(breaks = c(seq(1, max(expected_value_record$run), by = 3), max(expected_value_record$run))) +
  labs(
    title = "Coin flip experiment: expected value in each run. ",
    caption = paste0("Number of runs: ", n_runs, ". ", "Number of games in each run: ", n_games, "."),
    x = "Run",
    y = "Expected value") +
  geom_hline(yintercept = mean(expected_value_record$expected_value_record), size = 1.4, color = "red") +
  annotate(
    geom = "text",
    x = 0.85 * n_runs,
    y = max(expected_value_record$expected_value_record),
    label = paste0("Mean across runs: ", mean(expected_value_record$expected_value_record)),
    color = "red") +
  theme(plot.title = element_text(hjust = 0.5), plot.caption = element_text(hjust = 0.5))
Graphics: [plot of the expected value in each run, with the mean across runs drawn as a red horizontal line]
Let's now look at part b) of the question with another simulation. The loop has been wrapped into a function, which we then run over a sequence of probabilities with the help of sapply:
library(ggplot2)
seed <- 122334
# simulation function
coin_game <- function(n_runs, n_games, bias = 0.5) {
  game_payoff <- c()
  expected_value_record <- c()
  for (j in seq_len(n_runs)) {
    for (i in seq_len(n_games)) {
      cost <- 0
      flip_record <- c()
      payoff <- c()
      repeat {
        cost <- cost + 1
        flip <- rbinom(1, 1, prob = bias)
        flip_record <- c(flip_record, flip)
        # print(flip_record)
        n_tails <- length(flip_record) - sum(flip_record) # number of 0s/tails
        n_heads <- sum(flip_record) # number of 1s/heads
        if (abs(n_tails - n_heads) == 4) {
          game_payoff <- c(game_payoff, 10 - cost) # record game payoff
          break
        }
      }
    }
    expected_value_record <- c(expected_value_record, mean(game_payoff))
    game_payoff <- c()
  }
  return(expected_value_record)
}
# run coin_game() on a vector of probabilities - introduce bias to find fair game conditions
n_runs = 1
n_games = 1000
expected_value_record <- sapply(seq(0.01, 0.99, by = 0.01), coin_game, n_runs = n_runs, n_games = n_games)
# plot expected value
expected_value_record <- cbind.data.frame("run" = seq_len(length(expected_value_record)), "bias" = c(seq(0.01, 0.99, by = 0.01)), expected_value_record)
ggplot(data = expected_value_record) +
  geom_line(aes(x = bias, y = expected_value_record)) +
  scale_x_continuous(breaks = c(seq(min(expected_value_record$bias), max(expected_value_record$bias), by = 0.1), max(expected_value_record$bias))) +
  scale_y_continuous(breaks = round(c(0, seq(min(expected_value_record$expected_value_record), max(expected_value_record$expected_value_record), length.out = 10)), digits = 4)) +
  labs(
    title = "Coin flip experiment: expected value for each probability level",
    caption = paste0("Number of runs per probability level: ", n_runs, ". ", "Number of games in each run: ", n_games, "."),
    x = "Probability of success in Bernoulli trial",
    y = "Expected value") +
  geom_hline(yintercept = 0, size = 1.4, color = "red") +
  geom_text(aes(x = 0.1, y = 0, label = "Fair game", hjust = 1, vjust = -1), size = 4, color = "red") +
  theme(plot.title = element_text(hjust = 0.5), plot.caption = element_text(hjust = 0.5))
Graphics: [plot of the expected value at each probability level, with the fair-game level of 0 drawn as a red horizontal line]
Examination of the expected_value_record data frame suggests that the game is roughly fair when the success probability lies in the range 0.32-0.33 or 0.68-0.69.
It's easy to tweak the last code to squeeze more robust numbers out of it.
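As a cross-check of part b) that is not in the original answer, the fair bias can also be found analytically with a standard gambler's-ruin/Wald argument: with P(head) = p and r = (1 - p) / p, the expected number of flips until |heads - tails| = 4 is 4 * (1 - r^4) * (1 + r) / ((1 + r^4) * (1 - r)) (which is 16 for a fair coin), so the game is fair when this equals 10. A small sketch:
expected_flips <- function(p) {
  r <- (1 - p) / p
  if (abs(r - 1) < 1e-12) return(16) # fair coin: 4^2 = 16 flips on average
  4 * (1 - r^4) * (1 + r) / ((1 + r^4) * (1 - r))
}
uniroot(function(p) expected_flips(p) - 10, c(0.51, 0.99))$root
# ~ 0.682, i.e. a bias of about 0.682 / 0.318, close to the simulated 0.68-0.69 / 0.32-0.33 ranges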

How to plot points aligned at different angles and connect these by a line?

I am trying to simulate the shape of the rings from a trunk section in R, but each time I try to approach the real shape it gets more difficult. I started with four radius measurements and got a nice solution (see here).
However, now I want to plot more than four radii but at different angles, and connect these points with a line simulating the rings like this sketch that I did:
My first approach was to rotate the matrix of data, but I could not make all radii start at the same position (0,0). I also tried to rotate the axes, without success.
That is why I would like to ask for some direction to do it, and finally calculate the area of each ring.
Any help will be welcome
I am using the spline.poly function from here.
spline.poly
spline.poly <- function(xy, vertices, k = 3, ...) {
  # Assert: xy is an n by 2 matrix with n >= k.
  # Wrap k vertices around each end.
  n <- dim(xy)[1]
  if (k >= 1) {
    data <- rbind(xy[(n - k + 1):n, ], xy, xy[1:k, ])
  } else {
    data <- xy
  }
  # Spline the x and y coordinates.
  data.spline <- spline(1:(n + 2 * k), data[, 1], n = vertices, ...)
  x <- data.spline$x
  x1 <- data.spline$y
  x2 <- spline(1:(n + 2 * k), data[, 2], n = vertices, ...)$y
  # Retain only the middle part.
  cbind(x1, x2)[k < x & x <= n + k, ]
}
DATA
df = data.frame(A = c(1, 4, 5, 8, 10),
                B = c(1, 3, 7, 9, 10),
                C = c(2, 6, 8, 9, 10),
                D = c(1, 3, 4, 7, 9),
                E = c(1, 2, 3, 4, 5))
DRAW
#Calculate angles based on number of columns
angles = 0:(NCOL(df) - 1) * 2*pi/NCOL(df)
#Calculate x and y corresponding to each radial distance
toplot = lapply(1:NCOL(df), function(i){
  data.frame(x = df[,i]*cos(angles[i]),
             y = df[,i]*sin(angles[i]))
})
#Split toplot and merge back together the same rows
toplot2 = lapply(toplot, function(x) data.frame(x, ID = sequence(NROW(x))))
toplot2 = do.call(rbind, toplot2)
toplot2 = split(toplot2, toplot2$ID)
#Create empty plot
graphics.off()
plot(do.call(rbind, toplot), type = "n", axes = FALSE, ann = FALSE, asp = 1)
#Allow drawing outside the plot region just in case
par(xpd = TRUE)
#Draw polygons
lapply(toplot2, function(a){
  polygon(spline.poly(xy = cbind(a$x, a$y), vertices = 100, k = 3))
})
#Draw points
lapply(toplot, function(a){
  points(a)
})
#Draw radial lines
lapply(toplot, function(a){
  lines(a)
})
AREA
area_data = lapply(toplot2, function(a){
  spline.poly(xy = cbind(a$x, a$y), vertices = 100, k = 3)
})
library(geometry)
lapply(area_data, function(P) polyarea(P[,1], P[,2]))
#$`1`
#[1] 4.35568
#$`2`
#[1] 38.46985
#$`3`
#[1] 96.41331
#$`4`
#[1] 174.1584
#$`5`
#[1] 240.5837
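If what is needed is the area of each annular ring (the band between consecutive polygons) rather than the area of each full polygon, a small follow-up sketch is to take successive differences of the polygon areas computed above (this step is my addition, not part of the original answer):
poly_areas <- unlist(lapply(area_data, function(P) polyarea(P[, 1], P[, 2])))
ring_areas <- diff(c(0, poly_areas)) # innermost disc first, then each band outward
ring_areas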

Stepfun function markov

Don't be scared by my long code. What I am wondering about is the last part, the plot(stepfun(...)) call. When I enter this into RStudio I get the error "stepfun: 'x' must be ordered increasingly".
Does anyone here know what I have to do to finish this correctly?
bd_process <- function(lambda, mu, initial_state = 0, steps = 100) {
  time_now <- 0
  state_now <- initial_state
  time <- 0
  state <- initial_state
  for (i in 1:steps) {
    if (state_now == 3) {
      lambda_now <- 0
    } else {
      lambda_now <- lambda
    }
    if (state_now == 0) {
      mu_now <- 0
    } else {
      mu_now <- mu
    }
    time_to_transition <- rexp(mu, rate = 1) + rexp(lambda, rate = 1)
    X <- rexp(mu, rate = 1)
    Y <- rexp(lambda, rate = 1)
    if (X < Y) {
      state_now <- state_now - 1
    } else {
      state_now <- state_now + 1
    }
    time_now <- time_now + time_to_transition
    time <- c(time, time_now)
    state <- c(state, state_now)
  }
  list(time = time, state = state)
}
set.seed(19930628)
proposal1 <- bd_process(lambda = 2, mu = 10)
proposal2 <- bd_process(lambda = 6, mu = 10)
proposal3 <- bd_process(lambda = 10, mu = 10)
time1 <- proposal1$time
state1 <- proposal1$state
plot(stepfun(time1[-1], state1),
     do.points = FALSE,
     xlab = "Tid",
     ylab = "Tillstånd",
     main = "",
     yaxt = "n")
axis(2, at = c(0, 1, 2, 3), las = 2)
I don't know what your code is doing, but you've asked us not to worry about that. At the moment it appears that you have only constructed "time intervals", and you now need to stack them together (integrate them) along a proper time axis. To plot a simulated step function you should be using cumsum to construct an increasing time1 vector. Because the "time" and "state" variables end up with very different lengths, a quick fix is to trim the time1 vector to the correct length for the state1 variable; then you get no error with:
plot(stepfun(cumsum(time1[2:101]), state1),
     do.points = FALSE,
     xlab = "Tid",
     ylab = "Tillstånd",
     main = "",
     yaxt = "n")
axis(2, at = c(0, 1, 2, 3), las = 2)
Maybe if you "march step-by-step" through the code and explain the code (to yourself and the rest of us) using comments you will figure out why you have 10 times as many time1's as you have state1's. I suspect it may have something to do with using "mu" as the first argument to rexp(mu, rate = 1). The first argument to random number generators in R is usually a positive integer that determines length (the number of random numbers) from the distribution.

n-armed bandit simulation in R

I'm using Sutton & Barto's ebook Reinforcement Learning: An Introduction to study reinforcement learning. I'm having some issues trying to emulate the results (plots) on the action-value page.
More specifically, how can I simulate the greedy value for each task? The book says:
...we can plot the performance and behavior of various methods as
they improve with experience over 1000 plays...
So I guess I have to keep track of the exploratory values as better ones are found. The issue is how to do this using the greedy approach - since there are no exploratory moves, how do I know what is a greedy behavior?
Thanks for all the comments and answers!
UPDATE: See code on my answer.
I finally got this right. The eps player should beat the greedy player because of the exploratory moves, as pointed out in the book.
The code is slow and needs some optimization, but here it is:
get.testbed = function(arms = 10, plays = 500, u = 0, sdev.arm = 1, sdev.rewards = 1){
  optimal = rnorm(arms, u, sdev.arm)
  rewards = sapply(optimal, function(x) rnorm(plays, x, sdev.rewards))
  list(optimal = optimal, rewards = rewards)
}
play.slots = function(arms = 10, plays = 500, u = 0, sdev.arm = 1, sdev.rewards = 1, eps = 0.1){
  testbed = get.testbed(arms, plays, u, sdev.arm, sdev.rewards)
  optimal = testbed$optimal
  rewards = testbed$rewards
  optim.index = which.max(optimal)
  slot.rewards = rep(0, arms)
  reward.hist = rep(0, plays)
  optimal.hist = rep(0, plays)
  pulls = rep(0, arms)
  probs = runif(plays)
  # TODO: vectorize this loop
  for (i in 1:plays){
    ## dont use ifelse() in this case
    ## idx = ifelse(probs[i] < eps, sample(arms, 1), which.max(slot.rewards))
    idx = if (probs[i] < eps) sample(arms, 1) else which.max(slot.rewards)
    reward.hist[i] = rewards[i, idx]
    if (idx == optim.index)
      optimal.hist[i] = 1
    slot.rewards[idx] = slot.rewards[idx] + (rewards[i, idx] - slot.rewards[idx])/(pulls[idx] + 1)
    pulls[idx] = pulls[idx] + 1
  }
  list(slot.rewards = slot.rewards, reward.hist = reward.hist, optimal.hist = optimal.hist, pulls = pulls)
}
do.simulation = function(N = 100, arms = 10, plays = 500, u = 0, sdev.arm = 1, sdev.rewards = 1, eps = c(0.0, 0.01, 0.1)){
  n.players = length(eps)
  col.names = paste('eps', eps)
  rewards.hist = matrix(0, nrow = plays, ncol = n.players)
  optim.hist = matrix(0, nrow = plays, ncol = n.players)
  colnames(rewards.hist) = col.names
  colnames(optim.hist) = col.names
  for (p in 1:n.players){
    for (i in 1:N){
      play.results = play.slots(arms, plays, u, sdev.arm, sdev.rewards, eps[p])
      rewards.hist[, p] = rewards.hist[, p] + play.results$reward.hist
      optim.hist[, p] = optim.hist[, p] + play.results$optimal.hist
    }
  }
  rewards.hist = rewards.hist/N
  optim.hist = optim.hist/N
  optim.hist = apply(optim.hist, 2, function(x) cumsum(x)/(1:plays))
  ### Plot helper ###
  plot.result = function(x, n.series, colors, leg.names, ...){
    for (i in 1:n.series){
      if (i == 1)
        plot.ts(x[, i], ylim = 2*range(x), col = colors[i], ...)
      else
        lines(x[, i], col = colors[i], ...)
      grid(col = 'lightgray')
    }
    legend('topleft', leg.names, col = colors, lwd = 2, cex = 0.6, box.lwd = NA)
  }
  ### Plot helper ###
  #### Plots ####
  require(RColorBrewer)
  colors = brewer.pal(n.players + 3, 'Set2')
  op <- par(mfrow = c(2, 1), no.readonly = TRUE)
  plot.result(rewards.hist, n.players, colors, col.names, xlab = 'Plays', ylab = 'Average reward', lwd = 2)
  plot.result(optim.hist, n.players, colors, col.names, xlab = 'Plays', ylab = 'Optimal move %', lwd = 2)
  #### Plots ####
  par(op)
}
To run it just call
do.simulation(N = 100, arms = 10, eps = c(0, 0.01, 0.1))
You could also choose to make use of the R package "contextual", which aims to ease the implementation and evaluation of both context-free (as described in Sutton & Barto) and contextual (such as for example LinUCB) Multi-Armed Bandit policies.
The package actually offers a vignette on how to replicate all Sutton & Barto bandit plots. For example, to generate the ε-greedy plots, just simulate EpsilonGreedy policies against a Gaussian bandit:
library(contextual)
set.seed(2)
mus <- rnorm(10, 0, 1)
sigmas <- rep(1, 10)
bandit <- BasicGaussianBandit$new(mu_per_arm = mus, sigma_per_arm = sigmas)
agents <- list(Agent$new(EpsilonGreedyPolicy$new(0), bandit, "e = 0, greedy"),
               Agent$new(EpsilonGreedyPolicy$new(0.1), bandit, "e = 0.1"),
               Agent$new(EpsilonGreedyPolicy$new(0.01), bandit, "e = 0.01"))
simulator <- Simulator$new(agents = agents, horizon = 1000, simulations = 2000)
history <- simulator$run()
plot(history, type = "average", regret = FALSE, lwd = 1, legend_position = "bottomright")
plot(history, type = "optimal", lwd = 1, legend_position = "bottomright")
Full disclosure: I am one of the developers of the package.
This is what I have so far, based on our chat:
set.seed(1)
getRewardsGaussian <- function(arms, plays) {
  ## assuming each action has a normal distribution
  # first generate new means
  QStar <- rnorm(arms, 0, 1)
  # then for each mean, generate `play`-many samples
  sapply(QStar, function(u)
    rnorm(plays, u, 1))
}
CalculateRewardsPerMethod <- function(arms = 7, epsi1 = 0.01, epsi2 = 0.1,
                                      plays = 1000, methods = c("greedy", "epsi1", "epsi2")) {
  # names for easy handling
  names(methods) <- methods
  arm.names <- paste0("Arm", ifelse((1:arms) < 10, 0, ""), 1:arms)
  # this could be different if not all actions' rewards have a gaussian dist.
  rewards.source <- getRewardsGaussian(arms, plays)
  # Three dimensional array to track running averages of each method
  running.avgs <-
    array(0, dim = c(plays, arms, length(methods)),
          dimnames = list(PlayNo. = NULL, Arm = arm.names, Method = methods))
  # Three dimensional array to track the outcome of each play, according to each method
  rewards.received <-
    array(NA_real_, dim = c(plays, 2, length(methods)),
          dimnames = list(PlayNo. = seq(plays), Outcome = c("Arm", "Reward"), Method = methods))
  # define the function internally to not have to pass running.avgs
  chooseAnArm <- function(p) {
    # Note that in a tie, which.max returns the lowest value, which is what we want
    maxes <- apply(running.avgs[p, , methods, drop = FALSE], 3, which.max)
    # Note: deliberately drawing two separate random numbers and keeping this as
    # two lines of code to accent that the two draws should not be related
    if (runif(1) < epsi1)
      maxes["epsi1"] <- sample(arms, 1)
    if (runif(1) < epsi2)
      maxes["epsi2"] <- sample(arms, 1)
    return(maxes)
  }
  ## TODO: Perform each action at least once, then select according to algorithm
  ## Starting points. Everyone starts at machine 3
  choice <- c(3, 3, 3)
  reward <- rewards.source[1, choice]
  ## First run, slightly different
  rewards.received[1, , ] <- rbind(choice, reward)
  running.avgs[1, choice, ] <- reward # if different starting points, this needs to change like below
  ## HERE IS WHERE WE START PULLING THE LEVERS ##
  ## ----------------------------------------- ##
  for (p in 2:plays) {
    choice <- chooseAnArm(p)
    reward <- rewards.source[p, choice]
    # Note: When dropping a dim, the methods will be the columns
    # and the Outcome info will be the rows. Use `rbind` instead of `cbind`.
    rewards.received[p, , names(choice)] <- rbind(choice, reward)
    ## Update the running averages.
    ## For each method, the current running averages are the same as the
    ## previous for all arms, except for the one chosen this round.
    ## Thus start with last round's averages, then update the one arm.
    running.avgs[p, , ] <- running.avgs[p - 1, , ]
    # The updating is the only involved part (due to lots of array-indexing)
    running.avgs[p, , ][cbind(choice, 1:3)] <-
      sapply(names(choice), function(m)
        # Update the running average for the selected arm (for the current play & method)
        mean(rewards.received[1:p, , , drop = FALSE][rewards.received[1:p, "Arm", m] == choice[m], "Reward", m])
      )
  } # end for-loop
  ## DIFFERENT RETURN OPTIONS ##
  ## ------------------------ ##
  ## All rewards received, in simplified matrix (dropping information on arm chosen)
  # return(rewards.received[, "Reward", ])
  ## All rewards received, along with which arm chosen:
  # return(rewards.received)
  ## Running averages of the rewards received by method
  return(apply(rewards.received[, "Reward", ], 2, cumsum) / (1:plays))
}
### EXECUTION (AND SIMULATION)
## PARAMETERS
arms <- 10
plays <- 1000
epsi1 <- 0.01
epsi2 <- 0.1
simuls <- 50 # 2000
methods=c("greedy", "epsi1", "epsi2")
## Single Iteration:
### we can run system time to get an idea for how long one will take
tme <- system.time( CalculateRewardsPerMethod(arms=arms, epsi1=epsi1, epsi2=epsi2, plays=plays) )
cat("Expected run time is approx: ", round((simuls * tme[["elapsed"]]) / 60, 1), " minutes")
## Multiple iterations (simulations)
rewards.received.list <- replicate(simuls, CalculateRewardsPerMethod(arms=arms, epsi1=epsi1, epsi2=epsi2, plays=plays), simplify="array")
## Compute average across simulations
rewards.received <- apply(rewards.received.list, 1:2, mean)
## RESULTS
head(rewards.received, 17)
MeanRewards <- rewards.received
## If using an alternate return method in `Calculate..` use the two lines below to calculate running avg
# CumulRewards <- apply(rewards.received, 2, cumsum)
# MeanRewards <- CumulRewards / (1:plays)
## PLOT
plot.ts(MeanRewards[, "greedy"], col = 'red', lwd = 2, ylim = range(MeanRewards), ylab = 'Average reward', xlab="Plays")
lines(MeanRewards[, "epsi1"], col = 'orange', lwd = 2)
lines(MeanRewards[, "epsi2"], col = 'navy', lwd = 2)
grid(col = 'darkgray')
legend('bottomright', c('greedy', paste("epsi1 =", epsi1), paste("epsi2 =", epsi2)), col = c('red', 'orange', 'navy'), lwd = 2, cex = 0.8)
You may also want to check this link:
https://www.datahubbs.com/multi_armed_bandits_reinforcement_learning_1/
Below is a copy of the relevant code from that source. It does not use R; it is Python, using np.random.rand() from NumPy.
import numpy as np  # the snippet assumes numpy is imported as np

class eps_bandit:
    '''
    epsilon-greedy k-bandit problem
    Inputs
    =====================================================
    k: number of arms (int)
    eps: probability of random action 0 < eps < 1 (float)
    iters: number of steps (int)
    mu: set the average rewards for each of the k-arms.
        Set to "random" for the rewards to be selected from
        a normal distribution with mean = 0.
        Set to "sequence" for the means to be ordered from
        0 to k-1.
        Pass a list or array of length = k for user-defined
        values.
    '''
    def __init__(self, k, eps, iters, mu='random'):
        # Number of arms
        self.k = k
        # Search probability
        self.eps = eps
        # Number of iterations
        self.iters = iters
        # Step count
        self.n = 0
        # Step count for each arm
        self.k_n = np.zeros(k)
        # Total mean reward
        self.mean_reward = 0
        self.reward = np.zeros(iters)
        # Mean reward for each arm
        self.k_reward = np.zeros(k)
        if type(mu) == list or type(mu).__module__ == np.__name__:
            # User-defined averages
            self.mu = np.array(mu)
        elif mu == 'random':
            # Draw means from probability distribution
            self.mu = np.random.normal(0, 1, k)
        elif mu == 'sequence':
            # Increase the mean for each arm by one
            self.mu = np.linspace(0, k-1, k)

    def pull(self):
        # Generate random number
        p = np.random.rand()
        if self.eps == 0 and self.n == 0:
            a = np.random.choice(self.k)
        elif p < self.eps:
            # Randomly select an action
            a = np.random.choice(self.k)
        else:
            # Take greedy action
            a = np.argmax(self.k_reward)
        reward = np.random.normal(self.mu[a], 1)
        # Update counts
        self.n += 1
        self.k_n[a] += 1
        # Update total
        self.mean_reward = self.mean_reward + (
            reward - self.mean_reward) / self.n
        # Update results for a_k
        self.k_reward[a] = self.k_reward[a] + (
            reward - self.k_reward[a]) / self.k_n[a]

    def run(self):
        for i in range(self.iters):
            self.pull()
            self.reward[i] = self.mean_reward

    def reset(self):
        # Resets results while keeping settings
        # (using self.k / self.iters here; the source snippet referred to bare k and iters)
        self.n = 0
        self.k_n = np.zeros(self.k)
        self.mean_reward = 0
        self.reward = np.zeros(self.iters)
        self.k_reward = np.zeros(self.k)
