I'm wondering what I should be doing here (please refer to the image). I have already defined two vectors, k = c(0, 1) and v = c(runif(2, 0.3, 0.7)), where alpha = v[1] and beta = v[2].
Afterwards, I used an if statement, if(Xn == k[1]){...}, but this is where I am stuck. According to the question, if Xn == k[1], I have to assign Xn+1 = k[1] with probability alpha and Xn+1 = k[2] with probability 1 - alpha; if Xn == k[2], then Xn+1 = k[1] has probability beta and Xn+1 = k[2] has probability 1 - beta.
So my question is: how do you assign the values 0 and 1 to Xn+1 with probabilities [alpha, 1 - alpha] and [beta, 1 - beta]? After assigning them, how do you then run a simulation of 500 observations, X1 to X500, of the random variable using a for loop? This is similar to the coin toss experiment, except that the probabilities of heads and tails are decided by [alpha, beta] = runif(2, 0.3, 0.7).
Here is a base R solution.
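toss <- function(n = 500L){
  # Draw the two transition parameters once per chain
  a <- runif(2, min = 0.3, max = 0.7)
  alpha <- a[1]
  beta <- a[2]
  x <- integer(n)
  # Initial state; the question does not specify one, so this choice is arbitrary
  x[1] <- rbinom(1, size = 1, prob = alpha)
  for(i in seq_len(n - 1)){
    # From state 0 the next state is 1 with probability 1 - alpha,
    # from state 1 the next state is 1 with probability 1 - beta
    if(x[i] == 0)
      x[i + 1L] <- rbinom(1, size = 1, prob = 1 - alpha)
    else
      x[i + 1L] <- rbinom(1, size = 1, prob = 1 - beta)
  }
  list(x = x, alpha = alpha, beta = beta)
}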
set.seed(2021)
X <- toss()
table(X$x)
#
# 0 1
#277 223
mean(X$x)
#[1] 0.446
X$alpha
#[1] 0.4805069
X$beta
#[1] 0.6135119
To run the function repeatedly, use replicate. The code below draws a histogram of the proportion of successes over 1000 runs.
Y <- replicate(1000, mean(toss()$x))
hist(Y, xlab = "Proportion of successes")
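As a sanity check on the simulated proportion, the long-run share of 1s can be worked out from the transition probabilities in the question. The sketch below is only an illustration: the helper name stationary is made up here, and it assumes the chain has run long enough for the starting state not to matter.

```r
# Hypothetical helper: transition matrix and stationary distribution of the
# two-state chain with P(0 -> 0) = alpha and P(1 -> 0) = beta.
stationary <- function(alpha, beta) {
  P <- matrix(c(alpha, 1 - alpha,
                beta,  1 - beta),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("0", "1"), c("0", "1")))
  # Solving pi %*% P = pi for two states gives the share of 1s in closed form
  pi1 <- (1 - alpha) / (1 - alpha + beta)
  list(P = P, pi = c("0" = 1 - pi1, "1" = pi1))
}

s <- stationary(alpha = 0.4805069, beta = 0.6135119)
s$pi                 # long-run proportions of 0s and 1s
drop(s$pi %*% s$P)   # unchanged by one more step of the chain
```

With the alpha and beta printed above, the long-run share of 1s comes out at roughly 0.46, which is in line with the simulated mean of 0.446.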
I am trying to integrate the following function using Monte Carlo integration. The region I want to integrate over is the unit square, gridded as x <- seq(0, 1, by = 0.01) and y <- seq(0, 1, by = 0.01).
my.f <- function(x, y){
result = x^2 + sin(x) + exp(cos(y))
return(result)
}
I calculated the integral using the cubature package.
library(cubature)
library(plotly)
# Rewriting the function, so it can be integrated
cub.function <- function(x){
result = x[1]^2 + sin(x[1]) + exp(cos(x[2]))
return(result)
}
cub.integral <- adaptIntegrate(f = cub.function, lowerLimit = c(0,0), upperLimit = c(1,1))
The result is 3.134606. But when I use my Monte Carlo Integration Code, see below, my result is about 1.396652. My code is wrong by more than a factor of 2!
What I did:
Since I need a volume to conduct a Monte Carlo Integration, I calculated the function values on the mentioned interval. This will give me an estimation of the maximum and minimum of the function.
# My data range
x <- seq(0, 1, by = 0.01)
y <- seq(0, 1, by = 0.01)
# The matrix, where I save the results
my.f.values <- matrix(0, nrow = length(x), ncol = length(y))
# Calculation of the function values
for(i in 1:length(x)){
for(j in 1:length(y)){
my.f.values[i,j] <- my.f(x = x[i], y = y[j])
}
}
# The maximum and minimum of the function values
max(my.f.values)
min(my.f.values)
# Plotting the surface, but this is not necessary
plot_ly(y = x, x = y, z = my.f.values) %>% add_surface()
So, the volume that we need is simply the maximum of the function values, since 1 * 1 * 4.559753 is simply 4.559753.
# Now, the Monte Carlo Integration
# I found the code online and modified it a bit.
monte = function(x){
tests = rep(0,x)
hits = 0
for(i in 1:x){
y = c(runif(2, min = 0, max = 1), # y[1] is x; y[2] is y
runif(1, min = 0, max = max(my.f.values))) # y[3] is z
if(y[3] < y[1]**2+sin(y[1])*exp(cos(y[2]))){
hits = hits + 1
}
prop = hits / i
est = prop * max(my.f.values)
tests[i] = est
}
return(tests)
}
size = 10000
res = monte(size)
plot(res, type = "l")
lines(x = 1:size, y = rep(cub.integral$integral, size), col = "red")
So, the result is completely wrong. But if I change the function a bit, it suddenly works.
monte = function(x){
tests = rep(0,x)
hits = 0
for(i in 1:x){
x = runif(1)
y = runif(1)
z = runif(1, min = 0, max = max(my.f.values))
if(z < my.f(x = x, y = y)){
hits = hits + 1
}
prop = hits / i
est = prop * max(my.f.values)
tests[i] = est
}
return(tests)
}
size = 10000
res = monte(size)
plot(res, type = "l")
lines(x = 1:size, y = rep(cub.integral$integral, size), col = "red")
Can somebody explain why the result suddenly changes? To me, both functions seem to do the exact same thing.
In your (first) code for monte, this line is in error:
y[3] < y[1]**2+sin(y[1])*exp(cos(y[2]))
Given your definition of my.f, it should surely be
y[3] < y[1]**2 + sin(y[1]) + exp(cos(y[2]))
Or..., given that you shouldn't be repeating yourself unnecessarily:
y[3] < my.f(y[1], y[2])
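For comparison, here is a minimal vectorised sketch of the same hit-or-miss idea, next to the plain sample-mean estimator. The function name mc_integrate and the choice of n are assumptions made for illustration; the bounding height uses the fact that my.f is increasing in x and decreasing in y on the unit square, so its maximum sits at (1, 0).

```r
my.f <- function(x, y) x^2 + sin(x) + exp(cos(y))

# Hypothetical helper: two Monte Carlo estimates of the integral over [0, 1]^2
mc_integrate <- function(n = 1e5) {
  x <- runif(n)
  y <- runif(n)
  z_max <- my.f(1, 0)                    # height of the bounding box
  z <- runif(n, min = 0, max = z_max)
  c(hit_or_miss = mean(z < my.f(x, y)) * z_max,  # share of points under the surface
    sample_mean = mean(my.f(x, y)))              # average height over a unit area
}

set.seed(1)
est <- mc_integrate()
est
```

Both estimates should land close to the cubature result of 3.134606. The sample-mean version needs no bounding box at all, which removes one place where the function can be mistyped.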
I need help developing an alpha-beta pruning minimax algorithm in R. Currently I have implemented the minimax algorithm, but it is only usable for a 3x3 board. 4x4 boards do not finish because the run time is too long.
I have copied the code from the 3x3 board, but I realize I cannot provide a depth, so I assume it searches every possible continuation of a 4x4 game. What do I need to change to implement alpha-beta pruning in the minimax code section? Since I am fairly new to this field, I am trying to modify existing code to understand what each part is doing.
# draw the board for tic tac toe
draw_board <- function(board) {
xo <- c("X", " ", "O") # symbols
par(mar = rep(0, 4))
plot.new()
plot.window(xlim = c(0, 40), ylim = c(0, 40))
abline(h = c(10, 20, 30), col = "darkgrey", lwd = 4)
abline(v = c(10, 20, 30), col = "darkgrey", lwd = 4)
pieces <- xo[board + 2]
text(rep(c(5, 15, 25, 35), 4), c(rep(35, 4), rep(25, 4), rep(15, 4), rep(5, 4)), pieces, cex = 6)
# identify location of any three in a row
square <- t(matrix(board, nrow = 4))
hor <- abs(rowSums(square))
if(any(hor == 4))
hor <- (5 - which(hor == 4)) * 10 - 5
else
hor <- 0
ver <- abs(colSums(square))
if(any(ver == 4))
ver <- which(ver == 4) * 10 - 5
else
ver <- 0
diag1 <- sum(diag(square))
diag2 <- sum(diag(t(apply(square, 2, rev))))
# draw winning lines
if(hor > 0) lines(c(0, 40), rep(hor, 2), lwd = 10, col = "red")
if(ver > 0) lines(rep(ver, 2), c(0, 40), lwd = 10, col = "red")
if(abs(diag1) == 4) lines(c(2, 38), c(38, 2), lwd = 10, col = "red")
if(abs(diag2) == 4) lines(c(2, 38), c(2, 38), lwd = 10, col = "red")
}
# Human player enters a move
move_human <- function(game) {
text(4, 0, "Click on screen to move", col = "grey", cex=.7)
empty <- which(game == 0)
move <- 0
while (!move %in% empty) {
coords <- locator(n = 1) # add lines
coords$x <- floor(abs(coords$x) / 10) + 1
coords$y <- floor(abs(coords$y) / 10) + 1
move <- coords$x + 4 * (4 - coords$y) # 4 is the number of rows/columns --> needs to be a square
}
return (move)
}
# Evaluate winner function
eval_winner <- function(game_1, player) {
game <- matrix(game_1, nrow = 3, byrow = T)
hor <- rowSums(game)
ver <- colSums(game)
diag <- c(sum(diag(game)), sum(diag(apply(game, 1, rev))))
if (-4 %in% c(hor, ver, diag))
return(-10)
if (4 %in% c(hor, ver, diag))
return(10)
else
return(0)
}
# Minimax AI function
minimax <- function(game_1, player) {
free <- which(game_1 == 0)
if(length(free) == 1) {
game_1[free] <- player
return(list(move = free, U = eval_winner(game_1, player)))
}
poss.results <- rep(0, 16)
for(i in free) {
game <- game_1
game[i] <- player
poss.results[i] <- eval_winner(game, player)
}
mm <- ifelse(player == -1, "which.min", "which.max")
if(any(poss.results == (player * 10))) {
move <- do.call(mm, list(poss.results))
return(list(move = move, U = poss.results[move]))
}
for(i in free) {
game <- game_1
game[i] <- player
poss.results[i] <- minimax(game, -player)$U
}
random <- runif(16, 0, 0.1)
poss.results[-free] <- 100 * -player
poss.results <- poss.results + (player * random)
move <- do.call(mm, list(poss.results))
return(list(move = move, U = poss.results[move]))
}
# Main game engine human versus randomly choosing computer!
tic_tac_toe <- function(player1 = "human", player2 = "computer") {
game <- rep(0, 16) # Empty board
winner <- FALSE # Define winner
player <- 1 # First player
#players <- c(player1, player2)
players <- c("human", "computer")
draw_board(game)
while (0 %in% game & winner == 0) { # Keep playing until win or full board
if (players[(player + 3) %% 3] == "human") # Human player
move <- move_human(game)
else { # Computer player
move <- minimax(game, player)
move <- move$move
}
game[move] <- player # Change board
draw_board(game)
winner <- max(eval_winner(game, 1), abs(eval_winner(game, -1))) == 6 # Winner, winner, chicken dinner?
player <- -player # Change player
}
if (winner == 1)
print("Human has won")
else if (winner == 2)
print("Computer has won")
else
print("Play ended in a draw")
}
Going to reiterate what TheWhiteRabbit said: you should have a look at https://stackoverflow.com/help/how-to-ask
If you are more specific and include your code in the future, we can be way more helpful, but I'll give you some general suggestions based on what you have provided.
I theorize your issue might be some of the following:
You are not limiting your depth. You are trying to search all the way to the endgame. Minimax should only search enough turns ahead that your hardware can handle the strain.
Your scoring function is too inefficient. The score function is often the majority of your computation time in a Minimax search. If it is inefficient, you will pay for it.
Similarly, your code that generates a list of valid moves might be inefficient.
You might be considering invalid moves, causing your tree to branch way more than it should.
You are not generalizing your code enough. It doesn't work for 4x4 because you have hardcoded something to rely on a 3x3 board without realizing it.
Your Alpha-Beta pruning is incorrect. You are pruning nothing.
From my experience implementing MiniMax + variants, these tend to be some of the failure points.
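To make the points about depth limits, generalising the board size, and pruning concrete, here is a minimal sketch of depth-limited minimax with alpha-beta pruning. It is not a drop-in replacement for the code in the question: the function names score_board and alphabeta, the n and depth arguments, and the choice to score an unfinished position as 0 at the depth limit are all assumptions made for illustration.

```r
# Score an n x n board stored row by row as a vector of -1 (O), 0 (empty), 1 (X)
score_board <- function(board, n) {
  m <- matrix(board, nrow = n, byrow = TRUE)
  sums <- c(rowSums(m), colSums(m), sum(diag(m)), sum(diag(m[, n:1])))
  if (any(sums == n)) return(10)
  if (any(sums == -n)) return(-10)
  0
}

alphabeta <- function(board, player, n, depth, alpha = -Inf, beta = Inf) {
  u <- score_board(board, n)
  free <- which(board == 0)
  # Stop at a win, a full board, or the depth limit
  if (u != 0 || length(free) == 0 || depth == 0)
    return(list(move = NA_integer_, U = u))
  best <- -Inf * player   # -Inf for the maximiser (1), Inf for the minimiser (-1)
  best_move <- free[1]
  for (i in free) {
    board[i] <- player
    val <- alphabeta(board, -player, n, depth - 1, alpha, beta)$U
    board[i] <- 0
    if (player == 1) {
      if (val > best) { best <- val; best_move <- i }
      alpha <- max(alpha, best)
    } else {
      if (val < best) { best <- val; best_move <- i }
      beta <- min(beta, best)
    }
    if (alpha >= beta) break   # the opponent already has a better option elsewhere
  }
  list(move = best_move, U = best)
}

# X can win immediately by completing the top row
board <- c( 1,  1, 0,
           -1, -1, 0,
            0,  0, 0)
res_x <- alphabeta(board, player = 1, n = 3, depth = 9)
res_o <- alphabeta(board, player = -1, n = 3, depth = 9)
```

Because n is a parameter, the same function can be called with a 16-element board and n = 4; a small depth keeps the search affordable there, at the cost of the crude 0 score for positions that are cut off.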
I am analysing a time series signal. I set a threshold to separate the signal from the baseline noise. In order to identify the properties of each signal sequence (duration, amplitude, maximum signal...), I built a function that groups all contiguous signal points into separate "peaks". Although this function does what I want, I was wondering if anyone can help me make it more efficient, e.g. through vectorization, because I aim to run the function on a data.table of more than 1M rows. Here is some sample data with the function:
library(zoo)     # for zoo()
library(ggplot2) # for the plot at the end
# Generate dummy data
x <- sin(seq(from = 0, to = 20, length.out = 200)) + rnorm(200, 0,0.1)
x <- zoo(x)
plot(x)
# Label each point as signal (1) or noise (0)
y <- ifelse(x > 0.5, 1, 0)
# Function to label each peak
peak_labeler <- function(x) {
tmp <- NULL
for (i in seq_along(x)) {
if (x[i] == 0) { tmp[i] <- 0 } # If baseline, mark as 0
if (x[i] == 1) {
# If x[n] belongs to a peak
if (i == 1) {tmp[i] <- 1} # Label as 1 at t0
else{
if (!exists("Peak")) {Peak <- 0}
if (x[i - 1] == 0) {
# if previous point is no peak, add as peak
Peak <- Peak + 1
tmp[i] <- Peak
}
if (x[i - 1] == 1) {
tmp[i] <- Peak
}
}
}
}
return(tmp)
rm(tmp, Peak, i) # Garbage collection
}
# Label peaks
dummy <- data.frame(t = 1:200, x,y,tmp = peak_labeler(y))
# Show data
ggplot(dummy, aes(x = t, y = x)) +
geom_point(aes(col = as.factor(tmp), group = 1))
Here's an approach using dplyr.
The test in the cross_threshold line works by evaluating whether y is on a different side of 0.5 than the prior y. If so, the sign of the two terms y - threshold and lag(y) - threshold will be different, leading to a TRUE, which is multiplied by 1 to become 1. If they're on the same side of 0.5, you'll get a FALSE and a 0. The default = 0 part deals with the first line, where lag(y) is undefined. Then we add up how many cumulative crosses there have been to define the tmp group.
library(dplyr)
threshold = 0.5
dummy <- data.frame(t = 1:200, x, y) %>%
mutate(cross_threshold = 1 * (sign(y - threshold) != sign(lag(y, default = 0) - threshold)),
# Line above now optional, just if we want to label all crossings
up = 1 * ((y > threshold) & (lag(y, default = 0) < threshold)), # default = 0 avoids an NA in the first row
tmp = if_else(y > threshold, cumsum(up), 0))
ggplot(dummy, aes(x = t, y = x)) +
geom_point(aes(col = as.factor(tmp), group = 1)) +
geom_point(data = filter(dummy, cross_threshold == 1), shape = 21, size = 5)
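The same "count the upward crossings" idea can also be written in base R without any packages. This is a minimal sketch, assuming the input is the 0/1 vector y from the question; the function name label_peaks is made up for illustration.

```r
# Hypothetical helper: label each run of 1s with a running peak number, 0 elsewhere
label_peaks <- function(is_signal) {
  is_signal <- as.integer(is_signal)
  # A peak starts wherever the series steps from 0 to 1 (a leading 0 is assumed)
  starts <- diff(c(0L, is_signal)) == 1L
  ifelse(is_signal == 1L, cumsum(starts), 0L)
}

labels <- label_peaks(c(0, 1, 1, 0, 0, 1, 0, 1, 1))
labels
# 0 1 1 0 0 2 0 3 3
```

Everything here is a single pass of vectorised operations, so it should scale to a column of a million rows far better than the element-by-element loop.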
I'm using Sutton & Barto's ebook Reinforcement Learning: An Introduction to study reinforcement learning. I'm having some issues trying to emulate the results (plots) on the action-value page.
More specifically, how can I simulate the greedy value for each task? The book says:
...we can plot the performance and behavior of various methods as
they improve with experience over 1000 plays...
So I guess I have to keep track of the exploratory values as better ones are found. The issue is how to do this using the greedy approach - since there are no exploratory moves, how do I know what is a greedy behavior?
Thanks for all the comments and answers!
UPDATE: See code on my answer.
I finally got this right. The eps player should beat the greedy player because of the exploratory moves, as pointed out in the book.
The code is slow and needs some optimization, but here it is:
get.testbed = function(arms = 10, plays = 500, u = 0, sdev.arm = 1, sdev.rewards = 1){
optimal = rnorm(arms, u, sdev.arm)
rewards = sapply(optimal, function(x)rnorm(plays, x, sdev.rewards))
list(optimal = optimal, rewards = rewards)
}
play.slots = function(arms = 10, plays = 500, u = 0, sdev.arm = 1, sdev.rewards = 1, eps = 0.1){
testbed = get.testbed(arms, plays, u, sdev.arm, sdev.rewards)
optimal = testbed$optimal
rewards = testbed$rewards
optim.index = which.max(optimal)
slot.rewards = rep(0, arms)
reward.hist = rep(0, plays)
optimal.hist = rep(0, plays)
pulls = rep(0, arms)
probs = runif(plays)
# TODO: vectorize
for (i in 1:plays){
## dont use ifelse() in this case
## idx = ifelse(probs[i] < eps, sample(arms, 1), which.max(slot.rewards))
idx = if (probs[i] < eps) sample(arms, 1) else which.max(slot.rewards)
reward.hist[i] = rewards[i, idx]
if (idx == optim.index)
optimal.hist[i] = 1
slot.rewards[idx] = slot.rewards[idx] + (rewards[i, idx] - slot.rewards[idx])/(pulls[idx] + 1)
pulls[idx] = pulls[idx] + 1
}
list(slot.rewards = slot.rewards, reward.hist = reward.hist, optimal.hist = optimal.hist, pulls = pulls)
}
do.simulation = function(N = 100, arms = 10, plays = 500, u = 0, sdev.arm = 1, sdev.rewards = 1, eps = c(0.0, 0.01, 0.1)){
n.players = length(eps)
col.names = paste('eps', eps)
rewards.hist = matrix(0, nrow = plays, ncol = n.players)
optim.hist = matrix(0, nrow = plays, ncol = n.players)
colnames(rewards.hist) = col.names
colnames(optim.hist) = col.names
for (p in 1:n.players){
for (i in 1:N){
play.results = play.slots(arms, plays, u, sdev.arm, sdev.rewards, eps[p])
rewards.hist[, p] = rewards.hist[, p] + play.results$reward.hist
optim.hist[, p] = optim.hist[, p] + play.results$optimal.hist
}
}
rewards.hist = rewards.hist/N
optim.hist = optim.hist/N
optim.hist = apply(optim.hist, 2, function(x)cumsum(x)/(1:plays))
### Plot helper ###
plot.result = function(x, n.series, colors, leg.names, ...){
for (i in 1:n.series){
if (i == 1)
plot.ts(x[, i], ylim = 2*range(x), col = colors[i], ...)
else
lines(x[, i], col = colors[i], ...)
grid(col = 'lightgray')
}
legend('topleft', leg.names, col = colors, lwd = 2, cex = 0.6, box.lwd = NA)
}
### Plot helper ###
#### Plots ####
require(RColorBrewer)
colors = brewer.pal(n.players + 3, 'Set2')
op <-par(mfrow = c(2, 1), no.readonly = TRUE)
plot.result(rewards.hist, n.players, colors, col.names, xlab = 'Plays', ylab = 'Average reward', lwd = 2)
plot.result(optim.hist, n.players, colors, col.names, xlab = 'Plays', ylab = 'Optimal move %', lwd = 2)
#### Plots ####
par(op)
}
To run it just call
do.simulation(N = 100, arms = 10, eps = c(0, 0.01, 0.1))
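The line that updates slot.rewards inside play.slots is an incremental average: each new reward nudges the estimate by the error divided by the number of pulls so far. A small sketch, using made-up rewards, shows that this matches the ordinary mean without storing the reward history:

```r
set.seed(42)
rewards <- rnorm(25, mean = 1, sd = 1)  # illustrative rewards for a single arm

q <- 0      # running estimate of the arm's value
pulls <- 0  # number of times the arm has been pulled
for (r in rewards) {
  q <- q + (r - q) / (pulls + 1)  # same update as in play.slots
  pulls <- pulls + 1
}

all.equal(q, mean(rewards))
```

For the greedy player (eps = 0) the action at each play is simply which.max of these running estimates, so no exploratory moves are needed to define greedy behaviour.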
You could also choose to make use of the R package "contextual", which aims to ease the implementation and evaluation of both context-free (as described in Sutton & Barto) and contextual (such as for example LinUCB) Multi-Armed Bandit policies.
The package actually offers a vignette on how to replicate all Sutton & Barto bandit plots. For example, to generate the ε-greedy plots, just simulate EpsilonGreedy policies against a Gaussian bandit:
library(contextual)
set.seed(2)
mus <- rnorm(10, 0, 1)
sigmas <- rep(1, 10)
bandit <- BasicGaussianBandit$new(mu_per_arm = mus, sigma_per_arm = sigmas)
agents <- list(Agent$new(EpsilonGreedyPolicy$new(0), bandit, "e = 0, greedy"),
Agent$new(EpsilonGreedyPolicy$new(0.1), bandit, "e = 0.1"),
Agent$new(EpsilonGreedyPolicy$new(0.01), bandit, "e = 0.01"))
simulator <- Simulator$new(agents = agents, horizon = 1000, simulations = 2000)
history <- simulator$run()
plot(history, type = "average", regret = FALSE, lwd = 1, legend_position = "bottomright")
plot(history, type = "optimal", lwd = 1, legend_position = "bottomright")
Full disclosure: I am one of the developers of the package.
this is what I have so far based on our chat:
set.seed(1)
getRewardsGaussian <- function(arms, plays) {
## assuming each action has a normal distribution
# first generate new means
QStar <- rnorm(arms, 0, 1)
# then for each mean, generate `play`-many samples
sapply(QStar, function(u)
rnorm(plays, u, 1))
}
CalculateRewardsPerMethod <- function(arms=7, epsi1=0.01, epsi2=0.1
, plays=1000, methods=c("greedy", "epsi1", "epsi2")) {
# names for easy handling
names(methods) <- methods
arm.names <- paste0("Arm", ifelse((1:arms)<10, 0, ""), 1:arms)
# this could be different if not all actions' rewards have a gaussian dist.
rewards.source <- getRewardsGaussian(arms, plays)
# Three dimensional array to track running averages of each method
running.avgs <-
array(0, dim=c(plays, arms, length(methods))
, dimnames=list(PlayNo.=NULL, Arm=arm.names, Method=methods))
# Three dimensional array to track the outcome of each play, according to each method
rewards.received <-
array(NA_real_, dim=c(plays, 2, length(methods))
, dimnames=list(PlayNo.=seq(plays), Outcome=c("Arm", "Reward"), Method=methods))
# define the function internally to not have to pass running.avgs
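chooseAnArm <- function(p) {
# Note that in a tie, which.max returns the lowest index, which is what we want.
# Row p of running.avgs is still all zeros when this is called, so the greedy
# arm is picked from the averages as of the previous play (p - 1).
maxes <- apply(running.avgs[p - 1, ,methods, drop=FALSE], 3, which.max)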
# Note: deliberately drawing two separate random numbers and keeping this as
# two lines of code to accent that the two draws should not be related
if(runif(1) < epsi1)
maxes["epsi1"] <- sample(arms, 1)
if(runif(1) < epsi2)
maxes["epsi2"] <- sample(arms, 1)
return(maxes)
}
## TODO: Perform each action at least once, then select according to algorithm
## Starting points. Everyone starts at machine 3
choice <- c(3, 3, 3)
reward <- rewards.source[1, choice]
## First run, slightly different
rewards.received[1,,] <- rbind(choice, reward)
running.avgs[1, choice, ] <- reward # if different starting points, this needs to change like below
## HERE IS WHERE WE START PULLING THE LEVERS ##
## ----------------------------------------- ##
for (p in 2:plays) {
choice <- chooseAnArm(p)
reward <- rewards.source[p, choice]
# Note: When dropping a dim, the methods will be the columns
# and the Outcome info will be the rows. Use `rbind` instead of `cbind`.
rewards.received[p,,names(choice)] <- rbind(choice, reward)
## Update the running averages.
## For each method, the current running averages are the same as the
## previous for all arms, except for the one chosen this round.
## Thus start with last round's averages, then update the one arm.
running.avgs[p,,] <- running.avgs[p-1,,]
# The updating is the only involved part (due to lots of array-indexing)
running.avgs[p,,][cbind(choice, 1:3)] <-
sapply(names(choice), function(m)
# Update the running average for the selected arm (for the current play & method)
mean( rewards.received[ 1:p,,,drop=FALSE][ rewards.received[1:p,"Arm",m] == choice[m],"Reward",m])
)
} # end for-loop
## DIFFERENT RETURN OPTIONS ##
## ------------------------ ##
## All rewards received, in simplified matrix (dropping information on arm chosen)
# return(rewards.received[, "Reward", ])
## All rewards received, along with which arm chosen:
# return(rewards.received)
## Running averages of the rewards received by method
return( apply(rewards.received[, "Reward", ], 2, cumsum) / (1:plays) )
}
### EXECUTION (AND SIMULATION)
## PARAMETERS
arms <- 10
plays <- 1000
epsi1 <- 0.01
epsi2 <- 0.1
simuls <- 50 # 2000
methods=c("greedy", "epsi1", "epsi2")
## Single Iteration:
### we can run system time to get an idea for how long one will take
tme <- system.time( CalculateRewardsPerMethod(arms=arms, epsi1=epsi1, epsi2=epsi2, plays=plays) )
cat("Expected run time is approx: ", round((simuls * tme[["elapsed"]]) / 60, 1), " minutes")
## Multiple iterations (simulations)
rewards.received.list <- replicate(simuls, CalculateRewardsPerMethod(arms=arms, epsi1=epsi1, epsi2=epsi2, plays=plays), simplify="array")
## Compute average across simulations
rewards.received <- apply(rewards.received.list, 1:2, mean)
## RESULTS
head(rewards.received, 17)
MeanRewards <- rewards.received
## If using an alternate return method in `Calculate..` use the two lines below to calculate running avg
# CumulRewards <- apply(rewards.received, 2, cumsum)
# MeanRewards <- CumulRewards / (1:plays)
## PLOT
plot.ts(MeanRewards[, "greedy"], col = 'red', lwd = 2, ylim = range(MeanRewards), ylab = 'Average reward', xlab="Plays")
lines(MeanRewards[, "epsi1"], col = 'orange', lwd = 2)
lines(MeanRewards[, "epsi2"], col = 'navy', lwd = 2)
grid(col = 'darkgray')
legend('bottomright', c('greedy', paste("epsi1 =", epsi1), paste("epsi2 =", epsi2)), col = c('red', 'orange', 'navy'), lwd = 2, cex = 0.8)
You may also want to check this link
https://www.datahubbs.com/multi_armed_bandits_reinforcement_learning_1/
Copy of the relevant code from the above source
It is written in Python rather than R and draws its random numbers with NumPy (for example np.random.rand()).
import numpy as np


class eps_bandit:
    '''
    epsilon-greedy k-bandit problem

    Inputs
    =====================================================
    k: number of arms (int)
    eps: probability of random action 0 < eps < 1 (float)
    iters: number of steps (int)
    mu: set the average rewards for each of the k-arms.
        Set to "random" for the rewards to be selected from
        a normal distribution with mean = 0.
        Set to "sequence" for the means to be ordered from
        0 to k-1.
        Pass a list or array of length = k for user-defined
        values.
    '''

    def __init__(self, k, eps, iters, mu='random'):
        # Number of arms
        self.k = k
        # Search probability
        self.eps = eps
        # Number of iterations
        self.iters = iters
        # Step count
        self.n = 0
        # Step count for each arm
        self.k_n = np.zeros(k)
        # Total mean reward
        self.mean_reward = 0
        self.reward = np.zeros(iters)
        # Mean reward for each arm
        self.k_reward = np.zeros(k)

        if type(mu) == list or type(mu).__module__ == np.__name__:
            # User-defined averages
            self.mu = np.array(mu)
        elif mu == 'random':
            # Draw means from probability distribution
            self.mu = np.random.normal(0, 1, k)
        elif mu == 'sequence':
            # Increase the mean for each arm by one
            self.mu = np.linspace(0, k-1, k)

    def pull(self):
        # Generate random number
        p = np.random.rand()
        if self.eps == 0 and self.n == 0:
            # Greedy player: the very first action is chosen at random
            a = np.random.choice(self.k)
        elif p < self.eps:
            # Randomly select an action
            a = np.random.choice(self.k)
        else:
            # Take greedy action
            a = np.argmax(self.k_reward)

        reward = np.random.normal(self.mu[a], 1)

        # Update counts
        self.n += 1
        self.k_n[a] += 1

        # Update total
        self.mean_reward = self.mean_reward + (
            reward - self.mean_reward) / self.n

        # Update results for a_k
        self.k_reward[a] = self.k_reward[a] + (
            reward - self.k_reward[a]) / self.k_n[a]

    def run(self):
        for i in range(self.iters):
            self.pull()
            self.reward[i] = self.mean_reward

    def reset(self):
        # Resets results while keeping settings
        # (uses self.k and self.iters; bare k and iters are undefined here)
        self.n = 0
        self.k_n = np.zeros(self.k)
        self.mean_reward = 0
        self.reward = np.zeros(self.iters)
        self.k_reward = np.zeros(self.k)