Suppose that I want to create in R a binary tree on the interval (0,1) with maximum depth 3 in the following way:
First, we have a pool of potential cut-offs for the tree, t = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7). A cut-off splits an interval in two: for example, if we randomly choose the value 0.4, we split the interval (0,1) into (0,0.4) and (0.4,1).
The steps that I want to do are:
1) Start with the whole interval (0,1).
2) Randomly choose a cut-off from t, denoted t_1.
3) Split the interval (0,1) at the chosen cut-off, i.e. into the subintervals (0,t_1) and (t_1,1).
4) Randomly choose one of the intervals (0,t_1) and (t_1,1).
5) For the chosen interval, randomly sample from the cut-offs a point t_2 that makes sense, i.e. a point that lies inside the interval.
6) Continue the procedure until we reach the maximum depth.
I'm totally clueless about where or how to start. Is this the right forum to post such a question?
Creating a tree structure like this requires a recursive function (i.e. a function that calls itself). The following function creates a list of nodes, where each branch node contains a split value and two daughter nodes called left and right. The leaf nodes contain the final range encompassed within the leaf. Note that it draws each split uniformly from the current range with runif() rather than from the pool t.
make_node <- function(min = 0, max = 1, desired_depth = 3, depth = 0) {
  if (depth < desired_depth) {
    split <- runif(1, min, max)
    list(split = split,
         left = make_node(min, split, desired_depth, depth + 1),
         right = make_node(split, max, desired_depth, depth + 1))
  } else {
    list(range = c(min, max))
  }
}
It works like this. Let's create a reproducible tree:
set.seed(1)
tree <- make_node()
To get the initial splitting value, we do:
tree$split
#> [1] 0.2655087
So the right branch deals with all values between 0.2655087 and 1. To see where it splits this range, we do
tree$right$split
#> [1] 0.4136423
So this branch splits into values between [0.2655087, 0.4136423] on the left and [0.4136423, 1] on the right. Let's examine the left node:
tree$right$left$split
#> [1] 0.3985904
This has now split the [0.2655087, 0.4136423] branch into a left [0.2655087, 0.3985904] branch and a right [0.3985904, 0.4136423] branch.
If we take this right branch, we have now reached depth 3, so we get a leaf and can confirm its final range:
tree$right$left$right
#> $range
#> [1] 0.3985904 0.4136423
Of course, to make all this easier you probably want some kind of function to walk the tree to classify a particular number.
walk_tree <- function(value, tree) {
  result <- paste("Value:", value, "\n")
  while(is.null(tree$range)) {
    if(value >= tree$split) {
      result <- paste(result, "\nGreater than split of", tree$split)
      tree <- tree$right
    } else {
      result <- paste(result, "\nLess than split of", tree$split)
      tree <- tree$left
    }
  }
  result <- paste0(result, "\nValue falls into leaf node with range [",
                   tree$range[1], ",", tree$range[2], "]\n")
  cat(result)
}
So, for example, we get
walk_tree(value = 0.4, tree)
#> Value: 0.4
#>
#> Greater than split of 0.2655086631421
#> Less than split of 0.413642294289884
#> Greater than split of 0.398590389362078
#> Value falls into leaf node with range [0.398590389362078,0.413642294289884]
You may prefer that this function returns a vector of 0s and 1s, or you may be looking for it to draw the tree, which is trickier to do, but possible.
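As a rough illustration of the "vector of 0s and 1s" idea, here is a minimal sketch. The helper name path_bits and the small hand-built toy_tree are assumptions made for this example only; the tree simply mimics the list shape that make_node() produces (branch nodes with split, left and right; leaves with range).

```r
# Hypothetical helper: returns the path to the leaf as 0 (went left) / 1 (went right)
path_bits <- function(value, tree) {
  bits <- integer(0)
  while (is.null(tree$range)) {          # keep descending until we reach a leaf
    go_right <- value >= tree$split
    bits <- c(bits, as.integer(go_right))
    tree <- if (go_right) tree$right else tree$left
  }
  bits
}

# Small hand-built tree in the same shape as the output of make_node()
toy_tree <- list(
  split = 0.5,
  left  = list(range = c(0, 0.5)),
  right = list(split = 0.75,
               left  = list(range = c(0.5, 0.75)),
               right = list(range = c(0.75, 1)))
)

path_bits(0.6, toy_tree)
#> [1] 1 0
```

The same loop as walk_tree() is used; only the bookkeeping changes, so it should also work on a tree returned by make_node().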
Created on 2022-03-09 by the reprex package (v2.0.1)
Perhaps we can use Reduce to generate intervals in the binary-tree manner
v <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7)  # pool of cut-offs (called t in the question)
Reduce(
function(interval, k) {
lb <- min(interval)
ub <- max(interval)
x <- v[v > lb & v < ub]
if (!length(x)) {
return(c(NA, NA))
}
p <- sample(x, 1)
list(c(lb, p), c(p, ub))[[sample(1:2, 1)]]
},
1:3,
init = c(0, 1),
accumulate = TRUE
)
and you will see a result like this (the draws are random, so yours will differ):
[[1]]
[1] 0 1
[[2]]
[1] 0.0 0.6
[[3]]
[1] 0.0 0.2
[[4]]
[1] 0.0 0.1
which indicates the selected interval in each iteration from top to bottom.
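If you would rather stop as soon as no cut-off fits inside the current interval, instead of carrying NA values forward, the same idea can be written as a plain loop. This is only a sketch under that assumption; the function name random_branch and its arguments are made up for illustration.

```r
# Hypothetical helper: follow one random branch, drawing cut-offs from a fixed pool
random_branch <- function(cutoffs, max_depth = 3, lower = 0, upper = 1) {
  path <- list(c(lower, upper))
  for (d in seq_len(max_depth)) {
    # cut-offs strictly inside the current interval
    candidates <- cutoffs[cutoffs > lower & cutoffs < upper]
    if (length(candidates) == 0) break   # nothing left to split on: stop early
    # index-based draw avoids the sample(x, 1) surprise when x has length one
    p <- candidates[sample.int(length(candidates), 1)]
    # keep the left or the right half with equal probability
    if (sample.int(2, 1) == 1) upper <- p else lower <- p
    path[[length(path) + 1]] <- c(lower, upper)
  }
  path
}

set.seed(42)
random_branch(c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7))
```

The returned list has the same meaning as the Reduce output: the selected interval at each depth, from top to bottom, but it may be shorter than max_depth + 1 when the pool runs out.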
I have one vector like this:
years <- c(2021:2091)
And I want to create another vector to bind to it, based on an initial value that increases compound-like for every row by an arbitrary rate (such as 10%, 15%, 20%):
number = x
rep(x*(1 + .10)^n, length(years))
How do I match the length of years for the second vector while increasing the exponent each time? Say there are 71 rows in years; I need n to start at 1 and run through 71.
I have tried:
rep(x*(1 + .10)^(1:71), length(years))
But this does it 71*71 times. I just need one value for each exponent!
Hopefully this makes sense, thanks in advance!
Here is how you could do it with a function:
future_value = function(years, x = 1, interest = 0.1) {
  # one exponent per element of years: 1, 2, ..., length(years)
  x * (1 + interest) ^ seq_along(years)
}
Example outputs:
> future_value(2021:2025)
[1] 1.10000 1.21000 1.33100 1.46410 1.61051
> future_value(2021:2025, x = 2, interest = 0.15)
[1] 2.300000 2.645000 3.041750 3.498012 4.022714
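To bind the result to the years, as the question asks, something along these lines might work. It is a sketch only: the starting value of 100 and the 10% rate are placeholders, and the names initial, rate and projection are made up for the example.

```r
years   <- 2021:2091
initial <- 100    # hypothetical starting value
rate    <- 0.10   # hypothetical growth rate

# ^ is vectorised, so one exponent per year is enough; no rep() is needed
projection <- data.frame(
  year  = years,
  value = initial * (1 + rate)^seq_along(years)
)
head(projection, 3)

# The same numbers as a running product, in case the rate changes from year to year
same <- initial * cumprod(rep(1 + rate, length(years)))
```

The reason rep() produced 71*71 values in the attempt above is that x*(1 + .10)^(1:71) already has length 71 before it is repeated.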
I am looking to randomly generate a vector of numbers in R, with a specific sum but which also has a restriction for some specific members of the generated vector, e.g. that the 4th number (say, in a vector of 5) cannot exceed 50.
I am doing this within a for loop with millions of iterations in order to simulate election vote changes, where I am adding votes to one party and taking them away from other parties with equal probability. However, my issue is that in many iterations, votes turn out to be negative, which is illogical. I have figured out how to do the "sums up to X" part from other answers here, and I have made a workaround for the second restriction as follows:
library(data.table)
parties <- data.table(party = c("red", "green", "blue", "brown", "yellow"), votes = c(657, 359, 250, 80, 7))
votes_to_reallocate <- 350
immune_party <- "green"
parties_simulation <- copy(parties)
parties_simulation[party != immune_party,
votes := votes - as.vector(rmultinom(1, size=votes_to_reallocate, prob=rep(1, nrow(parties)-1)))
]
# Most likely there are negative votes for the last party, perhaps even the last two.
# While loop is supposed to correct this
while (any(parties_simulation[, votes]<0)) {
  negative_parties <- parties_simulation[votes < 0, party]
  for (i in seq_along(negative_parties)) {
    votes_to_correct <- parties_simulation[party == negative_parties[i], abs(votes)]
    parties_to_change <- parties_simulation[party != immune_party & !party %in% negative_parties, .N]
    parties_simulation[party != immune_party & !party %in% negative_parties,
      votes := votes - as.vector(rmultinom(1, size=votes_to_correct, prob=rep(1, parties_to_change)))
    ]
    parties_simulation[party == negative_parties[i], votes := votes + votes_to_correct]
  }
}
However, this seems to be a huge bottleneck as each simulation has to be corrected by the while loop. I am curious as to whether there is a solution for this that would generate the random numbers with the restriction already imposed (for instance, generate 4 random numbers, adding up to 350, and with the fourth number not exceeding 7). If not, perhaps there is a more efficient way to solve this?
Maybe I'm missing something, but would this work:
const_rng <- function(n, const, total){
  consts <- sapply(const, function(x)sample(1:x, 1))
  rest <- rmultinom(1, total - sum(consts), prob = rep(1/(n-length(consts)), (n-length(consts))))
  res <- rep(NA, n)
  res[as.numeric(names(const))] <- consts
  res[-as.numeric(names(const))] <- rest
  return(res)
}
out <- const_rng(5, const=c("4" = 7), 350)
out
# [1] 90 76 88 5 91
sum(out)
# [1] 350
First, it draws the constrained values from the integers 1:const. Then it draws the remainder (total minus the sum of the constrained draws) from a multinomial distribution that gives every other outcome equal probability. The const argument is a named vector where each name is the position to be constrained and each value is the upper bound of the draw. For example, const = c("4" = 7) constrains the fourth value to lie between 1 and 7; change sample(1:x, 1) to sample(0:x, 1) in the function if zero should also be allowed.
I have the following code for a random walk, in which I start from i and add up cumulatively for each line.
However, I need to limit my random walk on each line. One way I thought of doing this would be to replace with null every value from index j onwards, where j is the first position on the line whose value is less than or equal to 0 or greater than or equal to t.
simulate_binomial = function(cenarios, rodadas, p){
return(matrix(data=rbinom(cenarios*rodadas, 1, p), nrow=cenarios, ncol=rodadas))
}
i = 2
t = 10
p = 0.8
max_walk = 100
samples = simulate_binomial(1000, max_walk, p)
samples[samples==0] = -1
walk = t(apply(cbind(i, samples), 1, cumsum))
walk1 = apply(walk, 1, function(x) (which((x <= 0) | (x >= t))[1]))
So walk1 holds, for each line, the first index whose value is less than or equal to zero or greater than or equal to t. However, I don't know how to assign null from this index onwards in the line.
My intention is to assign null so that I can plot each line without the null part and see the effect of the ruin on each line / "scenario".
Can anyone help me, please?
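You can change your last apply to the following, which sets everything after the first boundary hit to NA (R's missing value, which plots skip):
walk1 <- t(apply(walk, 1, function(x) {
  # first index at which the walk hits 0 or t (NA if it never does)
  inds <- which((x <= 0) | (x >= t))[1]
  # skip rows with no hit, or with a hit in the very last column
  if (!is.na(inds) && inds < length(x)) x[(inds + 1):length(x)] <- NA
  x
}))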
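A vectorised alternative is also possible. The sketch below is self-contained and uses a small simulated walk with made-up names (start, upper, steps, stop_at, masked); it relies on col() and on R recycling a per-row vector down the columns of a matrix.

```r
set.seed(123)
start <- 2    # starting value (i in the question)
upper <- 10   # upper barrier (t in the question)

# 5 scenarios, 30 steps each, +1 with probability 0.8 and -1 otherwise
steps <- matrix(sample(c(-1, 1), 5 * 30, replace = TRUE, prob = c(0.2, 0.8)),
                nrow = 5)
walk <- t(apply(cbind(start, steps), 1, cumsum))

# first column in each row where the walk touches 0 or the upper barrier
stop_at <- apply(walk, 1, function(x) which(x <= 0 | x >= upper)[1])
stop_at[is.na(stop_at)] <- ncol(walk)   # rows that never hit a barrier stay intact

# col(masked) holds each cell's column number; stop_at is recycled row by row
masked <- walk
masked[col(masked) > stop_at] <- NA

# matplot(t(masked), type = "l") would then draw each scenario up to its stopping point
```

Because the comparison is done on the whole matrix at once, no per-row indexing such as (inds + 1):length(x) is needed.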
I have the following vector:
wss <- c(23265.2302840678, 4917.06943551649, 1330.49917983449, 288.050702912287,
216.182464712486, 203.769578557051, 151.991297068931, 139.635571841227,
118.285305833194, 117.164567420633, 105.397722980407, 95.4682187817563,
116.448588269066, 88.1287299776581, 83.9345098736843)
If we plot it with the following code
plot(1:15, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares")
we get a plot of the within-groups sum of squares against the number of clusters (image not shown).
By eye, we can see that at x-axis point 4 the values stop changing drastically and begin to plateau.
My question is: given the vector wss, how can we automatically detect the index 4 without looking at the plot?
Edit: This works better:
#change relative to the maximum change
threshold <- 0.1
d1 <- diff(wss)
# this assumes that the first value is the highest
## you could use max(d1) instead of d1[1]
which.max((d1 / d1[1]) < threshold) #results in 3
d1 <- diff(wss2) # wss2 is the second data set, defined in the graphing code below
which.max(d1 / d1[1] < threshold) #results in 5
Second Edit: This is somewhat subjective, but here's how my three methods compare for your two data sets. While it's easy to visualize what a plateau is, you need to be able to describe in math terminology what a plateau is in order to automate it.
Original: If you know that the second derivative will flip from positive to negative, you can do this:
sec_der <- diff(wss, differences = 2)
inflection_pt <- which.min(sign(sec_der))
inflection_pt
For this data set, the result is 5, which corresponds to index 7 of the original data set (i.e., 151.991), because each round of differencing shortens the vector by one element.
Instead of looking at inflection points, you could instead look at some relative percent threshold.
threshold <- 0.06
which.min(sign(abs(diff(wss)) / wss[1:(length(wss)-1)] - threshold))
This results in 5 as well using the first derivative approach.
Regardless, using the diff() function would be a key part of figuring this out in base R. Also see:
Finding the elbow/knee in a curve
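Another common heuristic, which is not one of the three methods compared here, is to pick the point farthest from the straight line joining the first and last points after rescaling both axes to [0, 1]. The sketch below is only an illustration of that idea; the function name elbow_by_chord is made up.

```r
wss <- c(23265.2302840678, 4917.06943551649, 1330.49917983449, 288.050702912287,
         216.182464712486, 203.769578557051, 151.991297068931, 139.635571841227,
         118.285305833194, 117.164567420633, 105.397722980407, 95.4682187817563,
         116.448588269066, 88.1287299776581, 83.9345098736843)

# Hypothetical helper: index of the point farthest from the chord between
# the first and last points, with both axes rescaled to [0, 1]
elbow_by_chord <- function(y) {
  x  <- seq_along(y)
  xs <- (x - min(x)) / (max(x) - min(x))
  ys <- (y - min(y)) / (max(y) - min(y))
  n  <- length(y)
  # chord written as a*x + b*y + c0 = 0
  a  <- ys[n] - ys[1]
  b  <- xs[1] - xs[n]
  c0 <- xs[n] * ys[1] - xs[1] * ys[n]
  which.max(abs(a * xs + b * ys + c0) / sqrt(a^2 + b^2))
}

elbow_by_chord(wss)
#> [1] 3
```

For this wss vector the sketch picks index 3 rather than the 4 chosen by eye, which again shows that the answer depends on how "plateau" is defined.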
Code to create graphs:
wss <- c(23265.2302840678, 4917.06943551649, 1330.49917983449, 288.050702912287,
216.182464712486, 203.769578557051, 151.991297068931, 139.635571841227,
118.285305833194, 117.164567420633, 105.397722980407, 95.4682187817563,
116.448588269066, 88.1287299776581, 83.9345098736843)
wss2 <- c(1970.08410513303, 936.826421218935, 463.151086710784, 310.219800983285, 227.747583214178, 191.601552329558, 159.703151798393, 146.881710048563, 138.699803963718, 134.534334658148)
data_list <- list(wss, wss2)
# Potential_methods -------------------------------------------------------
plateau_method = list(thresh_to_max = function(x) which.max(diff(x) / diff(x)[1] < threshold)
, inflection_pt = function(x) which.min(sign(diff(x, differences = 2)))
, deriv_to_raw = function(x) which.min(sign(abs(diff(x)) / x[1:(length(x)-1)] - threshold))
)
threshold <- 0.1
results <- t(sapply(plateau_method, mapply, data_list))
# graphing ----------------------------------------------------------------
par(mfrow = c(3,2))
apply(results, 1, function (x) {
for (i in seq_along(x)) {
plot(data_list[[i]],ylab="Within groups sum of squares", type = 'b', xlab = 'Number of Clusters')
abline(v = x[i])
}
} )
lapply(seq_along(names(plateau_method))
, function (i) {
mtext(paste(names(plateau_method)[i]
, "- \n"
, substring(plateau_method[i], 15))
, side = 3, line = -18*(i)+15, outer = TRUE)
})
mtext('Threshold = 0.1', side = 3, line = -53, outer = T)
I have been trying to implement a random walk on the n-cycle algorithm in R.
By n-cycle I mean the integers modulo n, Zn. Basically, it’s example 5.3.1 from the book “Markov Chains and Mixing Times”, by Levin, Peres and Wilmer. The intention is as follows: consider two chains modeling the movement of two particles X and Y on Zn with starting points X1 and Y1. By the flip of a fair coin we decide which particle will move (the particles cannot move simultaneously unless they have coupled); the direction is decided by another flip of a fair coin.
Once the two particles collide, they move together thereafter. It is part of a study project to implement a CFTP algorithm, so the length of the chains should have a pre-defined value, say T.
The code does not run and an error message appears. The error is “object ‘res’ not found”. However, I had previously defined “res” as a list to store the output of the function. Why does this happen and how could it be fixed?
I have two scripts: in the first one the code is split in smaller helper functions; the second one may be messier, as I tried to put all the helper functions within one single function.
Any help will be much appreciated.
This one is script 2.
# X1 - initial state of chain X
# Y1 - initial state of chain Y
# T - "length" of a chain, number of steps the chains will run for.
# n - length of the n-cycle, i.e., Zn.
Main_Function <- function (X1 = 8, Y1 = 4 , T = 20, n = 6){
X <- rep( X1, T) %% n # X, Y and res will store the results
Y <- rep( Y1, T) %% n
res <- list(X,Y) # Here I defined the object res. Later on R encounters an error "object 'res' not found".
ps <- TakeOneStep() # TakeOneStep is a function defined below
return(ps)
}
TakeOneStep <- function(){
  incr_same <- sample(c(-1, 0, 1), size = 1, prob = c(1/4, 1/2, 1/4)) # direction of the particles after they have coupled
  incr_dif <- sample(c(-1,1), size = 1, prob = c(1/2, 1/2)) # direction of the particles before coupling occurred
  choice <- runif(T) # determines which chain moves, before coupling occurred
  for(t in 2:T){
    if(res[[1]][t-1]%%n == res[[2]][t-1]%%n){
      res[[1]][t] <- (res[[1]][t-1] + incr_same) %% n
      res[[2]][t] <- (res[[2]][t-1] + incr_same) %% n
    }else{
      if(choice[t] < 0.5) {
        res[[1]][t] <- (res[[1]][t-1] + incr_dif) %% n
      }else{
        res[[2]][t] <- (res[[2]][t-1] + incr_dif) %% n
      }
    }
  }
  return(res)
}
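A likely explanation, offered as a sketch rather than a definitive fix: R uses lexical scoping, so a function looks up free variables in the environment where it was defined, not in the function that called it. TakeOneStep is defined at top level, so it cannot see res or n, which exist only inside Main_Function, and T resolves to the built-in TRUE. One way around this is to keep all state in a single function, or pass it as arguments. The version below is an assumption-laden rewrite: the name coupled_walk and its arguments are made up, the increments are drawn afresh at every step, and the particle that does not move keeps its previous position.

```r
# Hypothetical rewrite: all state is local to one function
coupled_walk <- function(x1 = 8, y1 = 4, n_steps = 20, n = 6) {
  stopifnot(n_steps >= 2)
  x <- rep(x1 %% n, n_steps)
  y <- rep(y1 %% n, n_steps)
  for (s in 2:n_steps) {
    if (x[s - 1] == y[s - 1]) {
      # coupled: both particles take the same lazy step, as in the question's code
      move <- sample(c(-1, 0, 1), 1, prob = c(1/4, 1/2, 1/4))
      x[s] <- (x[s - 1] + move) %% n
      y[s] <- x[s]
    } else {
      # not yet coupled: one coin picks the particle, another picks the direction
      move <- sample(c(-1, 1), 1)
      if (runif(1) < 0.5) {
        x[s] <- (x[s - 1] + move) %% n
        y[s] <- y[s - 1]
      } else {
        x[s] <- x[s - 1]
        y[s] <- (y[s - 1] + move) %% n
      }
    }
  }
  list(X = x, Y = y)
}

set.seed(1)
coupled_walk()
```

If you prefer to keep a separate helper, passing res, n and the number of steps as arguments to it and returning the updated res would address the "object 'res' not found" error in the same way.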