Recently, I learned how to write a loop that initializes a number and then randomly generates guesses until that number is guessed (recording the number of guesses it took), such that no number is guessed twice:
# https://stackoverflow.com/questions/73216517/making-sure-a-number-isnt-guessed-twice
all_games <- vector("list", 100)
for (i in 1:100){
  guess_i = 0
  correct_i = sample(1:100, 1)
  guess_sets <- 1:100 ## initialize a set
  trial_index <- 1
  while(guess_i != correct_i){
    guess_i = sample(guess_sets, 1) ## sample from this set
    guess_sets <- setdiff(guess_sets, guess_i) ## remove it from the set
    trial_index <- trial_index + 1
  }
  ## no need to store `i` or `guess_i` (the same as `correct_i`), right?
  game_results_i <- data.frame(i, trial_index, guess_i, correct_i)
  all_games[[i]] <- game_results_i
}
all_games <- do.call("rbind", all_games)
I am now trying to modify the above code to create the following two loops:
(Deterministic) Loop 1 always guesses the midpoint (rounded up) and is told whether the guess is smaller or bigger than the correct number. It then takes the midpoint again (i.e., between its guess and the current floor/ceiling) until it reaches the correct number.
(Semi-Deterministic) Loop 2 first makes a random guess and is told whether the guess is bigger or smaller than the number. It then divides the remaining difference in half and makes its next guess randomly within that smaller range, repeating the process until it reaches the correct number.
I tried to write a sketch of the code:
#Loop 2:
correct = sample(1:100, 1)
guess_1 = sample(1:100, 1)
guess_2 = ifelse(guess_1 > correct, sample(50:guess_1, 1), sample(guess_1:100, 1))
guess_3 = ifelse(guess_2 > correct, sample(50:guess_2, 1), sample(guess_2:100, 1))
guess_4 = ifelse(guess_3 > correct, sample(50:guess_3, 1), sample(guess_3:100, 1))
#etc
But I am not sure if I am doing this correctly.
Can someone please help me with this?
Thank you!
Example: Suppose I pick the number 68
Loop 1: first guess (the midpoint) = 51, (100-51)/2 + 51 = 75, (75-50)/2 + 50 = 63, (75 - 63)/2 + 63 = 69, (69 - 63)/2 + 63 = 66, etc.
Loop 2: first random guess = 53, rand_between(53,100) = 71, rand_between(51,71) = 65, rand(65,71) = 70, etc.
I don't think you need a for loop for this; you can create the structures from the start with sample, sapply and which:
## correct values can repeat, so we set replace to TRUE
corrects <- sample(1:100, 100, replace = TRUE)
## replace is by default FALSE in sample(), if you don't want repeated guesses
## sapply() creates a matrix
guesses <- sapply(1:100, function(x) sample(1:100, 100))
## constructing game_results_i equal to yours, but could be simplified
game_results_i <- data.frame(
  i = 1:100,
  trial_index = sapply(
    1:100,
    function(x) which(
      ## which() returns the indices of TRUE elements; each guess appears only once here, so there is a single match
      guesses[, x] == corrects[x]
    )
  ),
  guess_i = corrects,
  correct_i = corrects # guess_i and correct_i are obviously equal
)
Ok, let's see if I now match question and answer properly :)
If I understood your intentions correctly, in both loops you are setting increasingly tighter lower and upper bounds: each guess reduces the search space. However, this interpretation does not always match your description exactly, so please double-check whether it is acceptable for your purposes.
I wrote two functions, guess_bisect for the deterministic loop_1 and guess_sample for loop_2:
guess_bisect <- function(correct, n = 100) {
  lb <- 0
  ub <- n + 1
  trial_index <- 1
  guess <- round((ub - lb) / 2) + lb
  while (guess != correct) {
    # cat(lb, ub, guess, "\n") # uncomment to print the guess iteration
    if (guess < correct)
      lb <- guess
    else
      ub <- guess
    guess <- round((ub - lb) / 2) + lb
    trial_index <- trial_index + 1
  }
  trial_index
}
guess_sample <- function(correct, n = 100) {
  lb <- 0
  ub <- n + 1
  trial_index <- 1
  guess <- sample((lb + 1):(ub - 1), 1)
  while (guess != correct) {
    # cat(lb, ub, guess, "\n") # uncomment to print the guess iteration
    if (guess < correct)
      lb <- guess
    else
      ub <- guess
    guess <- sample((lb + 1):(ub - 1), 1)
    trial_index <- trial_index + 1
  }
  trial_index
}
Obviously, guess_bisect always produces the same results with the same input, guess_sample changes randomly instead.
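For example, with the number 68 from the question (these calls are just an added usage illustration; the values returned by guess_sample will of course differ between runs):
guess_bisect(68)               # always the same trial count for a given input
guess_sample(68)               # varies between runs
replicate(5, guess_sample(68)) # five independent random runs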
By plotting the results in a simple chart, it seems that the deterministic bisection is on average much better, as the random sampling may happen to pick guesses that only improve the bound on the less useful side. The x-axis is the correct number, spanning 1 to 100, and the y-axis is the trial index; guess_bisect gives the red curve, and many attempts of guess_sample give the blue curves.
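The original plotting code is not shown here; a minimal sketch of how such a chart could be produced (the colours, the ylim and the 20 repetitions of guess_sample are my own assumptions):
correct_values <- 1:100
bisect_trials <- sapply(correct_values, guess_bisect)
plot(correct_values, bisect_trials, type = "l", col = "red", ylim = c(0, 40),
     xlab = "correct number", ylab = "trial index")
for (k in 1:20) {
  lines(correct_values, sapply(correct_values, guess_sample), col = "blue")
}
lines(correct_values, bisect_trials, col = "red", lwd = 2) # redraw the red curve on top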
Related
I'm looking to make a set of two random numbers (e.g., [1,2], [3,12]) with the first number between 1-12, and the second between 1-4. I know how to sample the two numbers independently using:
sample(1:12, 1, replace = T)
sample(1:4, 1, replace = T)
but don't know how to create a system to determine if the pairing of the two numbers has already been rolled, and if so, roll again. Any tips!?
Thanks :)
While this doesn't scale happily (in case you need large-scale simulation), you can do this:
set.seed(42)
di2 <- sample(setdiff(1:4, di1 <- sample(1:12, size = 1)), size = 1)
c(di1, di2)
# [1] 1 2
The inner (di1) assignment draws the first number from 1:12, so far so good.
We then setdiff it out of 1:4 so that the second sampling only has candidates that are not equal to di1;
the outer (di2) assignment therefore samples from 1:4 without di1, if di1 happened to fall within 1-4.
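A small illustration of that setdiff step (the values 3 and 7 are just examples I picked):
setdiff(1:4, 3) # di1 = 3 falls inside 1:4, so 3 is removed: 1 2 4
setdiff(1:4, 7) # di1 = 7 is outside 1:4, so nothing changes: 1 2 3 4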
While not an authoritative proof of correctness,
rand <- replicate(100000, local({ di2 <- sample(setdiff(1:4, di1 <- sample(1:12, size=1)), size = 1); c(di1, di2); }))
dim(rand)
# [1] 2 100000
any(rand[1,] == rand[2,])
# [1] FALSE
Are you looking for something like:
library(tidyverse)
expand.grid(1:12, 1:4) %>%
  as.data.frame() %>%
  slice_sample(n = 5, replace = FALSE)
Just for fun and to practice R, I tried to prove the Monty Hall game rule (changing your choice after one gate is opened gives you a higher probability of winning). I wrote this reproducible code (the explanation of every step is within the code):
## First I set the seed
set.seed(4)
## Then I model the presence of the prize as a random variable among gates 1, 2, 3
randomgates <- ceiling(runif(10000, min = 0, max = 3))
## I do the same with the random choice.
randomchoice <- ceiling(runif(10000, min = 0, max = 3))
## As the opening of a gate depends on the gate you chose (the chosen gate cannot be opened),
## I model the opening of the gate as a variable which cannot be equal to the choice.
options <- c(1:3)
randomopen <- rep(1,10000)
for (i in 1:length(randomgates)) {
  realoptions <- options[options != randomchoice[i]]
  randomopen[i] <- realoptions[ceiling(runif(1,min = 0, max = 2))]
}
## Just to make the data easier to handle, I build a dataset
dataset <- cbind(randomgates, randomchoice, randomopen)
## Then I create a dataset which only keeps the realizations of the games in which we carry on
## (the opened gate wasn't the one with the prize in it)
steptwo <- dataset[randomopen != randomgates,]
## The next step is just to check that the probability of carrying on is 2/3, which it indeed is
carryon <- randomopen != randomgates
sum(carryon)/length(randomgates)
## I format the dataset as a data frame
steptwo <- as.data.frame(steptwo)
## Now we check what happens if we hold our initial choice when the game carries on
prizesholding <- steptwo$randomgates == steptwo$randomchoice
sum(prizesholding)
## creating a vector of the switched option, dependent on the opened gate, in the dataset that
## keeps only the cases in which we carried on playing (the opened gate wasn't the one with the prize)
switchedchoice <- rep(1,length(steptwo$randomgates))
for (i in 1:length(steptwo$randomgates)) {
  choice <- options[options != steptwo$randomchoice[i]]
  switchedchoice[i] <- choice[ceiling(runif(1,min = 0, max = 2))]
}
## Now we check how many times you guess the prize gate when you switch your initial choice
prizesswitching <- steptwo$randomgates == switchedchoice
sum(prizesswitching)/length(steptwo$randomgates)
When I check the probability without changing my initial choice, in the cases in which the game carried on (the opened gate didn't match the one with the prize), I obtain what I expected (close to a 1/3 probability of winning the prize), which refers to the following instruction:
carryon <- randomopen != randomgates
sum(carryon)/length(randomgates)
My problem arises when I check the probability of winning the prize after changing my choice (conditional, obviously, on not having opened the door which holds the prize): instead of getting 1/2 as Monty Hall states, I get 1/4. It refers to the following instruction:
prizesswitching <- steptwo$randomgates == switchedchoice
sum(prizesswitching)/length(steptwo$randomgates)
I know that I am doing something wrong, because it is already more than proven that Monty Hall holds, but I am not able to detect the flaw. Does anyone know what it is?
If you don't know what the Monty Hall problem is, you can find easy-to-read information on Wikipedia:
Monty Hall Game
Edit: As @Dason pointed out, one of the problems was that I was introducing some kind of randomness in the changing of the initial choice, which doesn't make sense as there is only one option left.
The other problem was that I was not approaching the problem under the assumption that Monty Hall knows where the prize is. I changed my code from the initial version to this, and the problem is solved:
# Prepare each variable for 10000 experiments
## First I set the seed
set.seed(4)
## Then I model the presence of the prize as a random variable among gates 1, 2, 3
randomgates <- ceiling(runif(10000, min = 0, max = 3))
## I do the same with the random choice.
randomchoice <- ceiling(runif(10000, min = 0, max = 3))
## As the opening of a gate depends on the gate you chose (neither the chosen gate nor the one
## with the prize can be opened), I model the opening of the gate as a variable which can equal neither of them.
options <- c(1:3)
randomopen <- rep(1,10000)
for (i in 1:length(randomgates)) {
  randomopen[i] <- options[options != randomchoice[i] & options != randomgates[i]]
}
## Just to make the data easier to handle, I build a dataset
dataset <- cbind(randomgates, randomchoice, randomopen)
## I format the dataset as a data frame
steptwo <- as.data.frame(dataset)
## Now we check what happens if we hold our initial choice when the game carries on
steptwo$prizesholding <- steptwo$randomgates == steptwo$randomchoice
with(steptwo, sum(prizesholding))
## creating a vector of the switched option, dependent on the opened gate, in the dataset that
## keeps only the cases in which we carried on playing (the opened gate wasn't the one with the prize)
steptwo$switchedchoice <- rep(1,length(steptwo$randomgates))
for (i in 1:length(steptwo$randomgates)) {
  steptwo$switchedchoice[i] <- options[options != steptwo$randomchoice[i] & options != steptwo$randomopen[i]]
}
## Now we check how many times you guess the prize gate when you switch your initial choice
steptwo$prizesswitching <- steptwo$randomgates == steptwo$switchedchoice
with(steptwo, sum(prizesswitching)/length(randomgates))
Each round, there is a prize_door and a chosen_door. Monty Hall will open a door that is neither the prize_door nor the chosen_door (setdiff between 1:3 and the vector (prize_door, chosen_door), with a random choice between the two if the setdiff has two elements). The switch door is then the door that was neither chosen nor opened.
n <- 1e4
set.seed(2020)
df <-
  data.frame(
    prize_door = sample(1:3, n, replace = TRUE),
    chosen_door = sample(1:3, n, replace = TRUE))
df$opened_door <-
  mapply(function(x, y){
    available <- setdiff(1:3, c(x, y))
    available[sample(length(available), 1)]
  }, df$prize_door, df$chosen_door)
df$switch_door <-
  mapply(function(x, y) setdiff(1:3, c(x, y)),
         df$chosen_door, df$opened_door)
with(df, mean(prize_door == chosen_door))
# [1] 0.3358
with(df, mean(prize_door == switch_door))
# [1] 0.6642
Plot of probabilities as n increases
probs <-
  data.frame(
    chosen_p = with(df, cumsum(prize_door == chosen_door))/(1:n),
    switch_p = with(df, cumsum(prize_door == switch_door))/(1:n))
plot(probs$switch_p, type = 'l', ylim = c(0, 1))
lines(probs$chosen_p, col = 'red')
abline(h = 1/3)
abline(h = 2/3)
This seems to do the trick:
n_iter <- 10000
set.seed(4)
doors <- 1:3
prizes <- sample.int(n = 3, size = n_iter, replace = TRUE)
your_pick <- sample.int(n = 3, size = n_iter, replace = TRUE)
open_door <- rep(0, n_iter)
switched_door <- rep(0, n_iter)
for (i in 1:n_iter) {
  remaining_choices <- setdiff(doors, c(your_pick[i], prizes[i]))
  if (length(remaining_choices) > 1) {
    open_door[i] <- sample(remaining_choices, size = 1)
  } else {
    open_door[i] <- remaining_choices
  }
  switched_door[i] <- setdiff(doors, c(your_pick[i], open_door[i]))
}
> mean(your_pick == prizes)
[1] 0.3305
> mean(switched_door == prizes)
[1] 0.6695
The sample.int and sample base functions help simplify things a bit. The remaining_choices item contains the possible doors that can be opened by the game show host, which has a length of 1 or 2 depending on your original choice. If the length is 2, we sample from that vector, and if it's 1, that door is automatically opened.
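As a small aside (this is base R's documented sample() behaviour, not code from the answer above), this is why that length check matters: when its first argument is a single number, sample() samples from 1:n rather than from that number itself.
sample(c(2, 3), size = 1) # picks 2 or 3, as intended
sample(3, size = 1)       # samples from 1:3, NOT always 3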
I have a collection of DNA sequencing reads of various lengths, sorted from longest to shortest. I would like to know the largest number of reads I can include in a set such that the N50 of that set is above some threshold t.
For any given set of reads, the total amount of data is just the cumulative sum of the lengths of the reads. The N50 is defined as the length of the read such that half of the data are contained in reads at least that long.
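For reference, a minimal sketch of computing the N50 of a single set of read lengths under that definition (whether the halfway point itself counts as "at least half" is my assumption; the solution below uses a strict >):
n50 <- function(lengths) {
  lengths <- sort(lengths, decreasing = TRUE)
  cs <- cumsum(lengths)
  lengths[min(which(cs >= sum(lengths) / 2))]
}
n50(100:1) # the N50 of the worked example below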
I have a solution below, but it is slow for very large read sets. I tried vectorising it, but this was slower (probably because my threshold is usually relatively large, such that my solution below stops calculating fairly early on).
Here's a worked example:
df = data.frame(l = 100:1) # read lengths
df$cs = cumsum(df$l) # getting the cumulative sum is easy and quick
t = 95 # let's imagine that this is my threshold N50
for(i in 1:nrow(df)){
  N50 = df$l[min(which(df$cs>df$cs[i]/2))]
  if(N50 < t){ break }
}
# the loop will have gone one too far, so I subtract one
number.of.reads = as.integer(i-1)
This works fine on small datasets, but my actual data are more like 5m reads that vary from ~200,000 down to 1 in length (longer reads are rarer), and when I'm interested in an N50 of 100,000 it gets pretty slow.
This example is closer to something that's realistic. It takes ~15s on my desktop.
l = ceiling(runif(100000, min = 0, max = 19999))
l = sort(l, decreasing = T)
df = data.frame(l = l)
df$cs = cumsum(df$l)
t = 18000
for(i in 1:nrow(df)){
  n = df$l[min(which(df$cs>df$cs[i]/2))]
  if(n < t){ break }
}
result = as.integer(i-1)
So, I'm interested in any ideas, tips, or tricks to noticeably optimise this. It seems like this should be possible, but I'm out of ideas.
As n is decreasing with i, you should use a binary search algorithm.
binSearch <- function(min, max) {
  print(mid <- floor(mean(c(min, max))))
  if (mid == min) {
    if (df$l[min(which(df$cs>df$cs[min]/2))] < t) {
      return(min - 1)
    } else {
      return(max - 1)
    }
  }
  n = df$l[min(which(df$cs>df$cs[mid]/2))]
  if (n >= t) {
    return(binSearch(mid, max))
  } else {
    return(binSearch(min, mid))
  }
}
Then, just call
binSearch(1, nrow(df))
Since your data are ordered by read length, maybe you can avoid testing every single row. Instead, you can iterate and test a limited number of rows (reasonably spaced) at each iteration (using while(), for example), and so get progressively closer to your solution. This should make things much faster. Just make sure that once you get close to the solution, you stop iterating.
This is your solution
set.seed(111)
l = ceiling(runif(100000, min = 0, max = 19999))
l = sort(l, decreasing = T)
df = data.frame(l = l)
df$cs = cumsum(df$l)
t = 18000
for(i in 1:nrow(df)){
  n = df$l[min(which(df$cs>df$cs[i]/2))]
  if(n < t){ break }
}
result = as.integer(i-1)
result
# 21216, in ~29 seconds
Instead of testing every row, let's set a range
i1 <- 1
i2 <- nrow(df)
i.range <- as.integer(seq(i1, i2, length.out = 10))
Now, test only these 10 rows. Get the closest one and "focus in" by re-defining the range. Stop when you cannot increase granularity.
while(sum(duplicated(i.range))==0){
  for(i in 1:length(i.range)){
    N50 = df$l[min(which(df$cs>df$cs[i.range[i]]/2))]
    if(N50 < t){ break }
  }
  # update i1 and i2
  i1 <- i.range[(i-1)]
  i2 <- i.range[i]
  i.range <- as.integer(seq(i1, i2, length.out = 10))
}
i.range <- seq(i1, i2, by=1)
for(i in i.range){
  N50 = df$l[min(which(df$cs>df$cs[i]/2))]
  if(N50 < t){ break }
}
result <- as.integer(i-1)
result
#21216, in ~ 0.06 seconds
Same result in a fraction of the time.
I am trying to solve an ODE in R using deSolve. With the following code, I expected the parameter gamma0 to take the value 5 at time steps 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10, and 0 otherwise. However, print(gamma0) shows that gamma0 stays at 0.
Here is my ODE:
library(deSolve)
param <- c(a = 0.1, b = 1)
yini <- c(alpha0 = 6, beta0 = 2)
mod <- function(times, yini, param) {
  with(as.list(c(yini, param)), {
    gamma0 <- ifelse(times %in% seq(0,10,1), 5, 0)
    ## print(gamma0)
    dalpha0 <- - a*alpha0 + gamma0
    dbeta0 <- a*alpha0 - b*beta0
    return(list(c(dalpha0, dbeta0)))
  })
}
times <- seq(from = 0, to = 10, by = 1/24)
out <- ode(func = mod, times = times, y = yini, parms = param)
plot(out, lwd = 2, xlab = "day")
What am I doing wrong?
This is a really simple modification of your function. If you're interested in knowing what you are doing wrong you can look below.
mod <- function(times, yini, param) {
  dt = times[2] - times[1]
  with(as.list(c(yini, param)), {
    gamma0 <- ifelse(times <= 10*dt, 5, 0)
    ## print(gamma0)
    dalpha0 <- - a*alpha0 + gamma0
    dbeta0 <- a*alpha0 - b*beta0
    return(list(c(dalpha0, dbeta0)))
  })
}
EDIT
As in G5W's answer, the problem is what you are comparing times to.
When you are writing
times %in% seq(0,10,1)
you are not referring to time steps. You simply refer to the values of times.
So, if you want to have it for the first 10 time steps you just need to go with my code or anything that considers the dt.
But here's a question for you:
If you do not need gamma0 to change according to times, and want it to be 5 at the first 11 (10) time steps, why are you comparing it to times? Why not simply set it to 5 for those time steps?
I get a slightly different result from you. If I uncomment the print(gamma0), it prints out 5 twice and then prints out 513 zeros. It is not hard to trace why, at least superficially, although you may want more detail than I will offer here.
Where you have the (commented out) statement print(gamma0) instead, put the line:
cat("g:", gamma0, " t:", times, "\n")
and run the code. You will see that the first two times it displays are 0. Since those are in your list seq(0,10,1), gamma0 is 5. After that, the times values displayed change. Notice that none of the printed values are from your original list of times, seq(from = 0, to = 10, by = 1/24), and none of them are integers, so none meet your condition to set gamma0 to 5. ode is doing something with the times (interpolating?) but it is not simply using the values that you provided. In fact, it does not print out 241 values of gamma0 and times; it prints out 515 such values. I note that the result out does have 241 values.
I think from your question that you assumed ode would actually evaluate the function at your times. It does not. It is treating times like a continuous variable. But your condition
gamma0 <- ifelse(times %in% seq(0,10,1), 5, 0)
only tests for 11 specific values - not ranges of values. A continuous variable is quite unlikely to hit exactly those values.
M--'s answer doesn't work for me; what if you just try this?
mod <- function(times, yini, param) {
  with(as.list(c(yini, param)), {
    if (times < 10) {
      gamma0 = 5
    } else if (times >= 10) {
      gamma0 = 0
    }
    dalpha0 <- -a * alpha0 + gamma0
    dbeta0 <- a * alpha0 - b * beta0
    return(list(c(dalpha0, dbeta0)))
  })
}
This works like a Heaviside step function.
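Along the same step-function lines, a hedged sketch (not taken from the answers above) of how a time-dependent input is often fed to deSolve: build a forcing function with approxfun() and evaluate it at whatever time the solver requests. Here I assume the intent is gamma0 = 5 for t in [0, 10) and 0 afterwards; gamma_fun and mod_forced are names I introduce for the illustration.
library(deSolve)
## step function: 5 on [0, 10), 0 from t = 10 onwards (rule = 2 extends the end values)
gamma_fun <- approxfun(x = c(0, 10), y = c(5, 0), method = "constant", rule = 2)
mod_forced <- function(times, yini, param) {
  with(as.list(c(yini, param)), {
    gamma0 <- gamma_fun(times) # evaluated at the solver's own time points
    dalpha0 <- -a * alpha0 + gamma0
    dbeta0 <- a * alpha0 - b * beta0
    return(list(c(dalpha0, dbeta0)))
  })
}
out <- ode(func = mod_forced, times = seq(0, 10, by = 1/24),
           y = c(alpha0 = 6, beta0 = 2), parms = c(a = 0.1, b = 1))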
I am trying to bucket coordinates into their nearest coordinate. In a sense, I am doing one iteration of kmeans clustering, with 1222 centroids. Below I have a function that does this, imperfectly, and too slowly as well. I am looking for help on improving this function:
discretizeCourt <- function(x_loc, y_loc) {
  # create the dataframe of points that I want to round coordinates to
  y <- seq(0, 50, by = 2)
  x1 <- seq(1, 93, by = 2)
  x2 <- seq(2, 94, by = 2)
  x <- c(x1, x2)
  coordinates <- data.frame(
    x = rep(x, 13),
    y = rep(y, each = length(x1)),
    count = 0
  )
  # loop over each point in x_loc and y_loc
  # increment the count column whenever a point is 'near' that coordinate
  for(i in 1:length(x_loc)) {
    this_x = x_loc[i]
    this_y = y_loc[i]
    coordinates[coordinates$x > this_x-1 &
                coordinates$x < this_x+1 &
                coordinates$y > this_y-1 &
                coordinates$y < this_y+1, ]$count =
      coordinates[coordinates$x > this_x-1 &
                  coordinates$x < this_x+1 &
                  coordinates$y > this_y-1 &
                  coordinates$y < this_y+1, ]$count + 1
  }
  coordinates # return the counts
}
Here is some test data that I'm working with:
> dput(head(x_loc, n = 50))
c(13.57165, 13.61702, 13.66478, 13.70833, 13.75272, 13.7946,
13.83851, 13.86792, 13.8973, 13.93906, 13.98099, 14.02396, 14.06338,
14.10872, 14.15412, 14.2015, 14.26116, 14.30871, 14.35056, 14.39536,
14.43964, 14.48442, 14.5324, 14.57675, 14.62267, 14.66972, 14.71443,
14.75383, 14.79012, 14.82455, 14.85587, 14.87557, 14.90737, 14.9446,
14.97763, 15.01079, 15.04086, 15.06752, 15.09516, 15.12394, 15.15191,
15.18061, 15.20413, 15.22896, 15.25411, 15.28108, 15.3077, 15.33578,
15.36507, 15.39272)
> dput(head(y_loc, n = 50))
c(25.18298, 25.17431, 25.17784, 25.18865, 25.20188, 25.22865,
25.26254, 25.22778, 25.20162, 25.25191, 25.3044, 25.35787, 25.40347,
25.46049, 25.5199, 25.57132, 25.6773, 25.69842, 25.73877, 25.78383,
25.82168, 25.86067, 25.89984, 25.93067, 25.96943, 26.01083, 26.05861,
26.11965, 26.18428, 26.25347, 26.3352, 26.35756, 26.4682, 26.55412,
26.63745, 26.72157, 26.80021, 26.8691, 26.93522, 26.98879, 27.03783,
27.07818, 27.03786, 26.9909, 26.93697, 26.87916, 26.81606, 26.74908,
26.67815, 26.60898)
My actual x_loc and y_loc files have ~60000 coordinates each, and I have thousands of files each with ~60000 coordinates, so it's a lot of work. I am pretty certain that the reason the function runs slowly is the way I am indexing / incrementing.
The counting is imperfect. A technically better approach would be to loop over all 60000 points (only 50 points above, for the example), and for each point calculate the distance between that point and each point in the coordinates dataframe (1222 points). However, that's 60000 * 1222 calculations, just for this one set of points, which is too high.
Would greatly appreciate any help on this!
Thanks,
EDIT: I'm working on converting my dataframes / vectors to 2 matrices, and vectorizing the whole approach, will let you know if it works.
If you want to process your data faster than your current solution does, consider using the data.table library. Please see the example below:
library(data.table)
df <- data.table(x_loc, y_loc) # Your data.frame is turned into a data.table
df$row.idx <- 1:nrow(df) # This column is used as an ID for each sample point.
Now we can find the right coordinate for each point; later we can calculate how many points belong to a certain coordinate. We first set up the coordinates data frame, as before:
y <- seq(0, 50, by = 2)
x1 <- seq(1, 93, by = 2)
x2 <- seq(2, 94, by = 2)
x <- c(x1, x2)
coordinates <- data.frame(
  x = rep(x, 13),
  y = rep(y, each = length(x1)),
  count = 0
)
coordinates$row <- 1:nrow(coordinates) # Similar to yours. However, this time we are interested in seeing which points belong to this coordinate.
Now, we define a function which checks the coordinates and returns the one within one unit distance of the point in question.
f <- function(this_x, this_y, coordinates) {
  res <- coordinates[coordinates$x > this_x-1 &
                     coordinates$x < this_x+1 &
                     coordinates$y > this_y-1 &
                     coordinates$y < this_y+1, ]$row
  res
}
For each point, we find its right coordinate:
df[, coordinate.idx := f(x_loc, y_loc), by = row.idx]
df[, row.idx := NULL]
df contains the following variables: (x_loc, y_loc, coordinate.idx). You can populate coordinates$count using this. Even for 60000 points, it should not take more than 1 second.
for(i in 1:nrow(coordinates)) {
  coordinates$count[i] <- length(which(df$coordinate.idx == i))
}
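A hedged alternative to that counting loop (my own sketch, assuming each point has been assigned exactly one coordinate.idx): tabulate() does the same count in a single vectorised call.
coordinates$count <- tabulate(df$coordinate.idx, nbins = nrow(coordinates))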