Monty Hall game in R with base functions - r

Just for fun and to practice R, I tried to verify the Monty Hall rule (switching your choice after one gate is opened gives you a higher probability of winning). I wrote this reproducible code (each step is explained in the comments):
## First I set the seed
set.seed(4)
## Then I model the location of the prize as a random variable over gates 1, 2, 3
randomgates <- ceiling(runif(10000, min = 0, max = 3))
## and do the same for the random choice.
randomchoice <- ceiling(runif(10000, min = 0, max = 3))
## As the opening of a gate depends on the gate you chose (the gate you chose cannot be opened),
## I model the opened gate as a variable which cannot be equal to the choice.
options <- c(1:3)
randomopen <- rep(1,10000)
for (i in 1:length(randomgates)) {
realoptions <- options[options != randomchoice[i]]
randomopen[i] <- realoptions[ceiling(runif(1,min = 0, max = 2))]
}
## Just to make the data easier to handle, I make a dataset
dataset <- cbind(randomgates, randomchoice, randomopen)
## Then I create a dataset which only keeps the games in which we carry on
## (the opened gate wasn't the one with the prize)
steptwo <- dataset[randomopen != randomgates,]
## The next step is just to check that the probability of carrying on is 2/3, which it is
carryon <- randomopen != randomgates
sum(carryon)/length(randomgates)
## I format the dataset as a data frame
steptwo <- as.data.frame(steptwo)
## Now we check what happens if we hold our initial choice when game carries on
prizesholding <- steptwo$randomgates == steptwo$randomchoice
sum(prizesholding)
## creating a vector of changing option, dependant on the opened gate, in the dataset that
## keeps only the cases in which we carried on playing (the opened gate wasn't the one with the prize)
switchedchoice <- rep(1,length(steptwo$randomgates))
for (i in 1:length(steptwo$randomgates)) {
choice <- options[options != steptwo$randomchoice[i]]
switchedchoice[i] <- choice[ceiling(runif(1,min = 0, max = 2))]
}
## Now we check how many times you guess the prize gate when you switch your initial choice
prizesswitching <- steptwo$randomgates == switchedchoice
sum(prizesswitching)/length(steptwo$randomgates)
When I check the probability of winning without changing my initial choice, in the cases in which the game carried on (the opened gate was not the one with the prize), I obtain what I expected (close to 1/3 probability of winning the prize). This refers to the following instruction:
carryon <- randomopen != randomgates
sum(carryon)/length(randomgates)
My problem arises when I check the probability of winning the prize after changing my choice (conditional, obviously, on not having opened the door that holds the prize). Instead of getting 1/2 as Monty Hall states, I get 1/4. This refers to the following instruction:
prizesswitching <- steptwo$randomgates == switchedchoice
sum(prizesswitching)/length(steptwo$randomgates)
I know that I am doing something wrong, because the Monty Hall result has been proven many times over, but I am not able to detect the flaw. Does anyone know what it is?
If you don't know what Monty Hall problem is, you can find easy-to-read information at wikipedia:
Monty Hall Game
Edit: As @Dason pointed out, one problem was that I was introducing randomness into the change of the initial choice, which doesn't make sense as there is only one option left.
The other problem was that I was not working under the assumption that Monty Hall knows where the prize is. I changed my initial code to the following, and the problem is solved:
# Prepare each variable for 10000 experiments
## First I set the seed
set.seed(4)
## Then I model the location of the prize as a random variable over gates 1, 2, 3
randomgates <- ceiling(runif(10000, min = 0, max = 3))
## and do the same for the random choice.
randomchoice <- ceiling(runif(10000, min = 0, max = 3))
## As the opening of a gate depends on the gate you chose (the gate you chose cannot be opened,
## and neither can the one with the prize), I model the opened gate as a variable which can equal neither the choice nor the prize gate.
options <- c(1:3)
randomopen <- rep(1,10000)
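for (i in 1:length(randomgates)) {
## one or two gates qualify; pick by position so a single candidate is not expanded to 1:x by sample()
candidates <- options[options != randomchoice[i] & options != randomgates[i]]
randomopen[i] <- candidates[sample.int(length(candidates), 1)]
}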
## Just to make the data easier to handle, I make a dataset
dataset <- cbind(randomgates, randomchoice, randomopen)
## I format the dataset as a data frame
steptwo <- as.data.frame(dataset)
## Now we check what happens if we hold our initial choice when game carries on
steptwo$prizesholding <- steptwo$randomgates == steptwo$randomchoice
with(steptwo, sum(prizesholding))
## Create the switched choice: the one gate that is neither the initial choice
## nor the opened gate (here every game carries on, since the prize gate is never opened)
steptwo$switchedchoice <- rep(1,length(steptwo$randomgates))
for (i in 1:length(steptwo$randomgates)) {
steptwo$switchedchoice[i] <- options[options != steptwo$randomchoice[i] & options != steptwo$randomopen[i]]
}
## Now we check how many times you guess the prize gate when you switch your initial choice
steptwo$prizesswitching <- steptwo$randomgates == steptwo$switchedchoice
with(steptwo, sum(prizesswitching)/length(randomgates))
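As a cross-check on the numbers above, the 1/4 from the first version can be derived by hand. This is a sketch under the assumptions of that first version: the host opens one of the two unchosen gates uniformly at random, and the switch is also drawn uniformly from the two unchosen gates, independently of which gate was opened. Here $C$ denotes the event that the game carries on (the opened gate does not hold the prize) and $W$ the event that the switched choice holds the prize; both symbols are introduced only for this derivation.

```latex
\begin{align*}
P(C) &= P(\text{prize} = \text{choice}) + P(\text{prize} \neq \text{choice}) \cdot \tfrac{1}{2}
      = \tfrac{1}{3} + \tfrac{2}{3} \cdot \tfrac{1}{2} = \tfrac{2}{3}, \\
P(W \cap C) &= P(\text{prize} \neq \text{choice}) \cdot \tfrac{1}{2} \cdot \tfrac{1}{2} = \tfrac{1}{6}, \\
P(W \mid C) &= \frac{1/6}{2/3} = \tfrac{1}{4}.
\end{align*}
```

With a deterministic switch to the one remaining closed gate, the same conditioning gives (1/3)/(2/3) = 1/2. Only when the host knowingly avoids the prize does switching win with probability 2/3, as in the corrected code.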

Each round, there is a prize_door and a chosen_door. Monty Hall will open a door that is not a prize_door or chosen_door (setdiff between 1:3 and the vector (prize_door, chosen_door), with a random choice between the two if the setdiff is two elements). Then the switch door is the door not chosen or opened.
n <- 1e4
set.seed(2020)
df <-
data.frame(
prize_door = sample(1:3, n, replace = TRUE),
chosen_door = sample(1:3, n, replace = TRUE))
df$opened_door <-
mapply(function(x, y){
available <- setdiff(1:3, c(x, y))
available[sample(length(available), 1)]
}, df$prize_door, df$chosen_door)
df$switch_door <-
mapply(function(x, y) setdiff(1:3, c(x, y)),
df$chosen_door, df$opened_door)
with(df, mean(prize_door == chosen_door))
# [1] 0.3358
with(df, mean(prize_door == switch_door))
# [1] 0.6642
Plot of probabilities as n increases
probs <-
data.frame(
chosen_p = with(df, cumsum(prize_door == chosen_door))/(1:n),
switch_p = with(df, cumsum(prize_door == switch_door))/(1:n))
plot(probs$switch_p, type = 'l', ylim = c(0, 1))
lines(probs$chosen_p, col = 'red')
abline(h = 1/3)
abline(h = 2/3)

This seems to do the trick:
n_iter <- 10000
set.seed(4)
doors <- 1:3
prizes <- sample.int(n = 3, size = n_iter, replace = TRUE)
your_pick <- sample.int(n = 3, size = n_iter, replace = TRUE)
open_door <- rep(0, n_iter)
switched_door <- rep(0, n_iter)
for (i in 1:n_iter) {
remaining_choices <- setdiff(doors, c(your_pick[i], prizes[i]))
if (length(remaining_choices) > 1) {
open_door[i] <- sample(remaining_choices, size = 1)
} else {
open_door[i] <- remaining_choices
}
switched_door[i] <- setdiff(doors, c(your_pick[i], open_door[i]))
}
> mean(your_pick == prizes)
[1] 0.3305
> mean(switched_door == prizes)
[1] 0.6695
The sample.int and sample base functions help simplify things a bit. The remaining_choices item contains the possible doors that can be opened by the game show host, which has a length of 1 or 2 depending on your original choice. If the length is 2, we sample from that vector, and if it's 1, that door is automatically opened.

Related

While Loops and Midpoints

Recently, I learned how to write a loop that initializes some number, and then randomly generates numbers until the initial number is guessed (while recording the number of guesses it took) such that no number will be guessed twice:
# https://stackoverflow.com/questions/73216517/making-sure-a-number-isnt-guessed-twice
all_games <- vector("list", 100)
for (i in 1:100){
guess_i = 0
correct_i = sample(1:100, 1)
guess_sets <- 1:100 ## initialize a set
trial_index <- 1
while(guess_i != correct_i){
guess_i = sample(guess_sets, 1) ## sample from this set
guess_sets <- setdiff(guess_sets, guess_i) ## remove it from the set
trial_index <- trial_index + 1
}
## no need to store `i` and `guess_i` (as same as `correct_i`), right?
game_results_i <- data.frame(i, trial_index, guess_i, correct_i)
all_games[[i]] <- game_results_i
}
all_games <- do.call("rbind", all_games)
I am now trying to modify the above code to create the following two loops:
(Deterministic) Loop 1 always guesses the midpoint (rounded up) and is told whether the guess is smaller or bigger than the correct number. It then takes the midpoint again (e.g. between the guess and the floor/ceiling) until it reaches the correct number.
(Semi-Deterministic) Loop 2 first makes a random guess and is told whether the guess is bigger or smaller than the number. It then halves the difference and makes its next guess randomly in a smaller range. It repeats this process until it reaches the correct number.
I tried to write a sketch of the code:
#Loop 2:
correct = sample(1:100, 1)
guess_1 = sample(1:100, 1)
guess_2 = ifelse(guess_1 > correct, sample(50:guess_1, 1), sample(guess_1:100, 1))
guess_3 = ifelse(guess_2 > correct, sample(50:guess_2, 1), sample(guess_2:100, 1))
guess_4 = ifelse(guess_3 > correct, sample(50:guess_3, 1), sample(guess_3:100, 1))
#etc
But I am not sure if I am doing this correctly.
Can someone please help me with this?
Thank you!
Example : Suppose I pick the number 68
Loop 1: first random guess = 51, (100-51)/2 + 51 = 75, (75-50)/2 + 50 = 63, (75 - 63)/2 + 63 = 69, (69 - 63)/2 + 63 = 66, etc.
Loop 2: first random guess = 53, rand_between(53,100) = 71, rand_between(51,71) = 65, rand(65,71) = 70, etc.
I don't think you need a for loop for this, you can create structures since the beginning, with sample, sapply and which:
## correct values can repeat, so we set replace to TRUE
corrects <- sample(1:100, 100, replace = TRUE)
## replace is by default FALSE in sample(), if you don't want repeated guesses
## sapply() creates a matrix
guesses <- sapply(1:100, function(x) sample(1:100, 100))
## constructing game_results_i equal to yours, but could be simplified
game_results_i <- data.frame(
i = 1:100,
trial_index = sapply(
1:100,
function(x) which(
## which() returns the indices where the predicate is TRUE; each column of guesses is a permutation, so exactly one index matches
guesses[, x] == corrects[x]
)
),
guess_i = corrects,
correct_i = corrects # guess_i and correct_i are obviously equal
)
Ok, let's see if now I match question and answer properly :)
If I understood your intentions correctly, in both loops you are setting increasingly tighter lower and upper bounds, so each guess reduces the search space. However, this interpretation does not always match your description, so please double-check whether it is acceptable for your purposes.
I wrote two functions, guess_bisect for the deterministic loop_1 and guess_sample for loop_2:
guess_bisect <- function(correct, n = 100) {
lb <- 0
ub <- n + 1
trial_index <- 1
guess <- round((ub - lb) / 2) + lb
while (guess != correct) {
# cat(lb, ub, guess, "\n") # uncomment to print the guess iteration
if (guess < correct)
lb <- guess
else
ub <- guess
guess <- round((ub - lb) / 2) + lb
trial_index <- trial_index + 1
}
trial_index
}
guess_sample <- function(correct, n = 100) {
lb <- 0
ub <- n + 1
trial_index <- 1
# draw from (lb + 1):(ub - 1) by offset; sample(x, 1) would draw from 1:x if only one candidate were left
guess <- lb + sample.int(ub - lb - 1, 1)
while (guess != correct) {
# cat(lb, ub, guess, "\n") # uncomment to print the guess iteration
if (guess < correct)
lb <- guess
else
ub <- guess
guess <- lb + sample.int(ub - lb - 1, 1)
trial_index <- trial_index + 1
}
trial_index
}
Obviously, guess_bisect always produces the same result for the same input, whereas guess_sample varies randomly.
Plotting the results in a simple chart suggests that the deterministic bisection is much better on average, as the random sampling may happen to pick guesses that narrow the range only slightly. The x-axis is the correct number, spanning 1 to 100, and the y-axis is the trial index; guess_bisect gives the red curve, and many runs of guess_sample give the blue curves.

Monte Carlo simulation in R for Monty Hall problem not working?

I'm writing a function in R to perform a Monte Carlo simulation for the Monty Hall problem. The function works when the doors are not switched (switch == FALSE), but when I call mean(replicate(10000, monty_hall(switch = TRUE))), the expected answer is about 0.66 but I actually get around 0.25.
Here is the code to the function:
monty_hall = function(switch = logical()){
doors <- c(1,2,3)
names(doors) <- rep(c("goat", "car"), c(2,1))
prize_door <- doors[3]
guess <- sample(doors, 1)
revealed_door <- sample(doors[!doors %in% c(guess, prize_door)],1)
if(switch){
switched_door <- sample(doors[!doors %in% c(guess, revealed_door)],1)
prize_door == switched_door
} else {
prize_door == guess
}
}
What changes should I make to get the correct output, which is around 0.66?
Just change the doors vector to characters
monty_hall = function(switch = logical()){
doors <- c("1","2","3")
names(doors) <- rep(c("goat", "car"), c(2,1))
prize_door <- doors[3]
guess <- sample(doors, 1)
revealed_door <- sample(doors[!doors %in% c(guess, prize_door)],1)
if(switch){
switched_door <- sample(doors[!doors %in% c(guess, revealed_door)],1)
prize_door == switched_door
} else {
prize_door == guess
}
}
Suppose the person chose door number 1 and the prize is in door number 2, so what is left to be revealed is door number 3.
You will have revealed_door <- sample(3,1), which doesn't work as you expect: it is treated as revealed_door <- sample(c(1,2,3),1).
From the function documentation (just type ?sample):
If x has length 1, is numeric (in the sense of is.numeric) and x >= 1,
sampling via sample takes place from 1:x. Note that this convenience
feature may lead to undesired behaviour when x is of varying length in
calls such as sample(x)
I think the easiest fix is changing to characters, but if you must use numerical values, check the vector's length and return the value directly if the length is 1, or sample otherwise.
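The numeric variant could look roughly like the sketch below. The helper name pick_one is hypothetical (it is not part of base R); it indexes by position with sample.int, so a length-1 vector is returned as-is instead of being expanded to 1:x. The rest mirrors the function from the question, with the prize fixed behind door 3 as in the original.

```r
# pick_one is a hypothetical helper name, not part of base R.
# Indexing by position avoids sample()'s 1:x shortcut for length-1 numeric input.
pick_one <- function(x) x[sample.int(length(x), 1)]

monty_hall <- function(switch = FALSE) {
  doors <- 1:3
  prize_door <- 3
  guess <- pick_one(doors)
  # The host opens a door that is neither the guess nor the prize.
  revealed_door <- pick_one(doors[!doors %in% c(guess, prize_door)])
  if (switch) {
    # Exactly one door remains after removing the guess and the revealed door.
    switched_door <- doors[!doors %in% c(guess, revealed_door)]
    prize_door == switched_door
  } else {
    prize_door == guess
  }
}

set.seed(1)
switch_rate <- mean(replicate(10000, monty_hall(switch = TRUE)))
stay_rate <- mean(replicate(10000, monty_hall(switch = FALSE)))
```

With this change, switch_rate should land near 2/3 and stay_rate near 1/3, without converting the doors to characters.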

RStudio not responding with large Linear Optimization

I have a large two-step optimization problem that I've tried to simplify for this question. The first step is to choose 10 elements to maximize utility with certain constraints. I need 200 of these sets, but due to the nature of what I'm trying to do, there needs to be 600 generated so that the correct combinations can manifest.
Wrapping around these mini-optimization problems is a larger constraint where each individual element can only be used within a certain range. The first optimization tunes each element's utility so that each one is relatively close to the bounds, but it's not possible for all of them to be within their bounds. Therefore, the second step is to choose 200 of the 600 sets such that each individual element's min/max usage is satisfied. This is what I need help with.
I made a function using lpSolve that works, but over 80% of the time it freezes RStudio and it's just becoming too much of a hassle - I need to either improve my current approach, or need a completely new approach. I don't know if lpSolve is really the best approach to begin with. While I do have an overall set-score that I can maximize, all I really care about is having each element within the bounds. I've made a simplified example to get at the essence of my problem.
I'm in charge of making 200 meals from a set of 80 different fruits. Each meal uses 10 fruits and cannot contain more than 1 of the same fruit. I'm limited in the number of fruits that I have (and my boss is making me use a minimum of each fruit, otherwise they'll go bad), so usage needs to be within certain bounds. I have a list of 600 meals already created (Meals), and each one has its own unique Health-Score. Ideally I would like to maximize the Health-Score, but obviously the most important piece is that each fruit is used the correct number of times, otherwise the meals can't be made in the first place.
Here's my code to 1) set up the 600 Meals (random), 2) set the min/max times each fruit must be used (random), and 3) run a linear optimization to select 200 of the 600 meals such that the individual fruit constraints are fulfilled. The program tries to choose 200 of the 600, but if the constraints don't allow it, then it loosens the constraints (e.g. if the solver doesn't work the first time, then I'll decrease the minimum number of times an Apple can be used, and increase the maximum number of times it can be used). It does this one fruit at a time, rather than all at once. Eventually the constraints should be loosened so much that any 200 of the 600 will work (i.e. when all fruit minPercent is less than 0 and all fruit maxPercent is greater than 100), but it doesn't matter because R freezes up.
library(stringr)
library(dplyr)
library(lpSolve)
# Inputs
MealsNeeded <- 200
Buffer <- 3
# Setup the meals (this is the output of another optimizer in my actual program. Considered "Step 1" as I mentioned above)
Meals <- data.frame()
for(i in 1:(MealsNeeded*Buffer)){
run <- i
meal <- sample(fruit, 10)
healthFactor <- round(runif(1, 10, 30), 0) #(Health factor for the entire meal)
df <- data.frame(Run = run, Fruit = meal, healthFactor = healthFactor, stringsAsFactors = FALSE)
Meals <- rbind(Meals, df)
}
# The minimum/maximum number of times each fruit must be used across all 200 meals (these would be inputs in my program)
set.seed(11)
fruitDF <- data.frame(Name = fruit, minSelectPct = round(runif(length(fruit), .05, .1)*100, 0), stringsAsFactors = FALSE) %>%
mutate(maxSelectPct = round(minSelectPct/2 + runif(length(fruit), .05, .1)*100, 0))
#### Actual Program Start
# Get objective
obj <- Meals %>%
distinct(Run, healthFactor) %>%
ungroup() %>%
select(healthFactor) %>%
pull()
# Dummy LU - for each fruit give 1/0 whether or not they were in the meal
dummyLUInd <- data.frame(FruitName = fruitDF$Name, stringsAsFactors = FALSE)
for(i in unique(Meals$Run)){
selectedFruit <- Meals %>%
filter(Run == i) %>%
select(Fruit) %>%
mutate(Indicator = 1)
dummyLUIndTemp <- fruitDF %>%
left_join(selectedFruit, by = c('Name' = 'Fruit')) %>%
mutate(Indicator = ifelse(is.na(Indicator), 0, Indicator)) %>%
select(Indicator)
dummyLUInd <- cbind(dummyLUInd, dummyLUIndTemp)
}
## Table create
dummyLUInd <- rbind(dummyLUInd, dummyLUInd)[,-1]
dummyLUInd <- as.data.frame(t(dummyLUInd))
dummyLUInd$Total = 1
## Directions
dirLT <- c(rep('<=', (ncol(dummyLUInd)-1)/2))
dirGT <- c(rep('>=', (ncol(dummyLUInd)-1)/2))
## Multiply percentages by total Meals
MinExp = round(fruitDF$minSelectPct/100 * MealsNeeded - 0.499, 0)
MaxExp = round(fruitDF$maxSelectPct/100 * MealsNeeded + 0.499, 0)
# Setup constraints like # of tries
CounterMax <- 10000
LPSum = 0
Counter = 0
# Create DF to make it easier to change constraints for each run
MinExpDF <- data.frame(Place = 1:length(MinExp), MinExp = MinExp)
MaxExpDF <- data.frame(Place = 1:length(MaxExp), MaxExp = MaxExp)
cat('\nStarting\n')
Sys.sleep(2)
# Try to get the 200 of 600 Meals that satisfy the constraints for the individual Fruit.
# If the solution doesn't exist, loosen the constraints for each fruit (one at a time) until it does work
while (LPSum == 0 & Counter <= CounterMax) {
rowUse <- Counter %% length(MaxExp)
# Knock one of minimum, starting with highest exposure, one at a time
MinExpDF <- MinExpDF %>%
mutate(Rank = rank(-MinExp, na.last = FALSE, ties.method = "first"),
MinExp = ifelse(Rank == rowUse, MinExp - 1, MinExp)
)
MinExp <- MinExpDF$MinExp
# Add one of maximum, starting with highest exposure, one at a time
MaxExpDF <- MaxExpDF %>%
mutate(Rank = rank(-MaxExp, na.last = FALSE, ties.method = "first"),
MaxExp = ifelse(Rank == rowUse, MaxExp + 1, MaxExp))
MaxExp <- MaxExpDF$MaxExp
# Solve
dir <- 'max'
f.obj <- obj
f.mat <- t(dummyLUInd)
f.dir <- c(dirGT, dirLT, '==')
f.rhs <- c(MinExp, MaxExp, MealsNeeded)
Sol <- lp(dir, f.obj, f.mat, f.dir, f.rhs, all.bin = T)$solution
LPSum <- sum(Sol)
Counter = Counter + 1
if(Counter %% 10 == 0) cat(Counter, ', ', sep = '')
}
# Get the Run #'s from the lpSolve
if(Counter >= CounterMax){
cat("Unable to find right exposure, returning all Meals\n")
MealsSolved <- Meals
} else {
MealsSolved <- data.frame(Run = unique(Meals$Run))
MealsSolved$selected <- Sol
MealsSolved <- MealsSolved[MealsSolved$selected == 1,]
}
# Final Meals
FinalMeals <- Meals %>%
filter(Run %in% MealsSolved$Run)
If you run this code enough times, eventually RStudio will freeze up on you (at least it does for me, if it doesn't for you I suppose increase the number of Meals). It happens during the actual lp, so there's really not much you can do since it's really C code. This is where I'm lost.
A part of me thinks this isn't really an lpSolve issue since I'm not really trying to maximize anything (Health-Factor isn't all too important). My real "loss function" is the number of times each fruit goes above/below their min/max exposure, but I can't think of how to set something like this up. Can my current approach work, or do I need to do something different completely?

How to get a random observation point at a specific time over multiple trials in R?

I am working on spike trains, and my code to generate spike trains for 20 trials is written below. The image is representative of 5 trials.
fr = 100
dt = 1/1000 # time step in seconds (1 ms)
duration = 2 # duration in s
nBins = 2000 #10msSpikeTrain
nTrials = 20 #NumberOfSimulations
MyPoissonSpikeTrain = function(p, fr= 100) {
p = runif(nBins)
q = ifelse(p < fr*dt, 1, 0)
return(q)
}
set.seed(1)
SpikeMat <- t(replicate(nTrials, MyPoissonSpikeTrain()))
plot(x=-1,y=-1, xlab="time (s)", ylab="Trial",
main="Spike trains",
ylim=c(0.5, nTrials+1), xlim=c(0, duration))
for (i in 1: nTrials)
{
clip(x1 = 0, x2= duration, y1= (i-0.2), y2= (i+0.4))
abline(h=i, lwd= 1/4)
abline(v= dt*which( SpikeMat[i,]== 1))
}
Each trial has spikes occurring at random time points. What I am working towards is getting a random sample time point that applies to all 20 trials, and then the vector of lengths of the intervals this point falls into, one for each trial. The code to get the time vector for the points where the spikes occur is:
A <- numeric()
for (i in 1: nTrials)
{
ISI <- function(i){
spike_times <- c(dt*which( SpikeMat[i, ]==1))
ISI1vec <- c(diff(spike_times))
A <- c(A, ISI1vec)
return(A)}
}
Then you call ISI(i) for whichever trial you wish to see the Interspike interval vector for. A visual representation of what I want is:
I want to get a vector that has the lengths of the intervals this point falls into, for each trial. I want to figure out its distribution as well, but that's for later. Can anybody help me figure out how to code my way to this? Any help is appreciated, even if it's just about how to start/where to look.
Your data
set.seed(1)
SpikeMat <- t(replicate(nTrials, MyPoissonSpikeTrain()))
I suggest transforming your sparse matrix data into a list of indices where spikes occur
L <- lapply(seq_len(nrow(SpikeMat)), function(i) setNames(which(SpikeMat[i, ] == 1), seq_along(which(SpikeMat[i, ] == 1))))
Grab random timepoint
set.seed(1)
RT <- round(runif(1) * ncol(SpikeMat))
# 531
Result
distances contains the distances to the 2 nearest spikes - each element of the list is a named vector where the values are the distances (to RT) and their names are their positions in the vector. nearest_columns shows the original timepoint (column number) of each spike in SpikeMat.
bookend_values <- function(vec) {
lower_val <- head(sort(vec[sign(vec) == 1]), 1)
upper_val <- head(sort(abs(vec[sign(vec) == -1])), 1)
return(c(lower_val, upper_val))
}
distances <- lapply(L, function(i) bookend_values(RT-i))
nearest_columns <- lapply(seq_along(distances), function(i) L[[i]][names(distances[[i]])])
Note that the inter-spike interval of the two nearest spikes that bookend RT can be obtained with
sapply(distances, sum)
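As an alternative sketch of the same idea, base R's findInterval can locate the pair of spikes that bracket the random time point directly. This is a minimal, self-contained illustration, not the method above: the names spike_mat, interval_length and lengths_at_rt are made up for the example, the spike probability 0.1 corresponds to fr * dt from the question, and a time point that falls exactly on a spike is counted as the start of the following interval (the code above excludes zero distances instead).

```r
set.seed(1)
n_bins <- 2000
n_trials <- 20
p_spike <- 0.1  # fr * dt = 100 * 1/1000, as in the question

# Logical matrix: one row per trial, TRUE where a spike occurs.
spike_mat <- matrix(runif(n_trials * n_bins) < p_spike, nrow = n_trials)

# Length (in bins) of the inter-spike interval that contains time bin rt.
interval_length <- function(spikes, rt) {
  idx <- which(spikes)         # spike positions, already sorted
  k <- findInterval(rt, idx)   # number of spikes at or before rt
  # rt lies before the first spike or at/after the last one: no enclosing interval
  if (k == 0 || k == length(idx)) return(NA_integer_)
  idx[k + 1] - idx[k]
}

rt <- 531
lengths_at_rt <- apply(spike_mat, 1, interval_length, rt = rt)
```

The result is in bins; multiplying by dt converts it to seconds. Repeating this over many random time points would give the empirical distribution of the enclosing interval length.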

Speed up while loop in R

As part of a project I made a smoother to fill in missing data. I use the slope of the last data points to calculate new values. After calculating each new point, I use it to calculate the next value (and so on). Hence I used a while loop to calculate each value (both from left to right and from right to left, to eventually take an average of the 2 values). This script works fine!
Although I expect that I could significantly accelerate this with a function from the apply family, I still want to use this while loop. The script is, however, really slow (3 days for ~2,500,000 data points). Do you have tips for speeding up the current script?
#Loop from: bottom -> top
number_rows <- nrow(weight_id)
i <- nrow(weight_id)
while (i >= 1){
j = as.integer(weight_id[i,1])
prev1 <- temp[j+1,]$new_MAP_bottom
if(j<max(weight_id)){
previous_slope <- ifelse((temp[j+2,]$duration-temp[j+1,]$duration)>0,prev1-temp[j+2,]$new_MAP_bottom,0)
}else{
previous_slope <- 0
}
new_MAP <- round(prev1+((previous_slope-(factor*temp[j,]$steps))/(1+factor)), digit=2)
temp[j,]$new_MAP_bottom <- new_MAP
i <- i-1
}
#Loop from: top -> bottom
weight_factor <- 0
i <- 1
while (i <= nrow(weight_id)) {
j = as.integer(weight_id[i,1])
prev1 <- temp[j-1,]$new_MAP_top
if(j>2){
previous_slope <- ifelse((temp[j-1,]$duration-temp[j-2,]$duration)>0,prev1-temp[j-2,]$new_MAP_top,0)
}else{
previous_slope <- 0
}
new_MAP <- round(prev1+((previous_slope+(factor*temp[j,]$steps))/(1+factor)), digit=2)
temp[j,]$new_MAP_top <- new_MAP
#Take weighted average of two approaches (top -> bottom/bottom -> top)
if(weight_factor < 1){ weight_factor = temp[j,]$weight-1 }
weight_top <- weight_factor
weight_bottom <- temp[j,]$weight-weight_factor
if(weight_top>weight_bottom){ weight_top<-weight_top-1 }
if(weight_top<weight_bottom){ weight_bottom<-weight_bottom-1}
temp[j,]$MAP <- round(((new_MAP*weight_top)+(temp[j,]$new_MAP_bottom*weight_bottom))/(weight_top+weight_bottom),digit=0)
weight_factor <- weight_factor-1
i <- i+1
}
I did not read all of your code, especially without example data, but from the textual description, it is only linear approximation. Please check whether the built-in functions approx and approxfun already do what you are trying to implement yourself, as they will be more optimized than anything you can write with reasonable effort.
par(mfrow=c(2,1))
example <- data.frame(x = 1:14,
y = c(3,4,5,NA, NA, NA, 6,7,8.1, 8.2, NA, 8.4, 8.5, NA))
plot(example)
f <- approxfun(example)
plot(example$x, f(example$x))
The apply family tends to give you shorter, more succinct code, but not necessarily much more speed than loops. If you are after speed, first check whether somebody else has already implemented what you need, then try vectorization.
Edit:
The following runs in about a second on my computer. If this does something close enough to your own "linear smoother" so that you can replace yours with this, that is a speed increase of about 3 days.
n <- 2500000
example <- data.frame(x = 1:n,
y = sample(1:1000, n, replace = TRUE))
example$y[sample(1:n, n/5)] <- NA
print(Sys.time())
f <- approxfun(example)
mean(f(example$x))
print(Sys.time())
