Create conditional probability tree in R with data.tree

I am creating a 3 x 3 data.tree of conditional probabilities and would like the last branch of each node to equal 1 minus the sum of its siblings. The idea is that I will tweak individual nodes of the tree and want to make sure the probabilities at each level do not sum to more than 100%.
I have looked at https://cran.r-project.org/web/packages/data.tree/vignettes/data.tree.html and see there is a way to define custom fields as a function, but how do I take such an example and allow the function to look at the nodes at the same level as itself?
library(data.tree)

Probability <- function(node) {
  result <- node$prob
  if (node$position == 3)
    result <- 1 - node$parent$children[[1]]$prob - node$parent$children[[2]]$prob
  else
    sapply(node$children, Probability)
  return(result)
}
df <- data.frame(pathString = c("R/1", "R/2", "R/3", "R/1/1", "R/1/2", "R/1/3",
                                "R/2/1", "R/2/2", "R/2/3", "R/3/1", "R/3/2", "R/3/3"),
                 prob = c(0.1, 0.2, 0, 0.2, 0.2, 0, 0.5, 0.35, 0, 0.35, 0.35, 0))
tree <- as.Node(df)
tree$prob <- Probability(tree)
The output I get with
print(tree, "prob")
levelName prob
1 R NA
2 ¦--1 0.10
3 ¦ ¦--1 0.20
4 ¦ ¦--2 0.20
5 ¦ °--3 0.00
6 ¦--2 0.20
7 ¦ ¦--1 0.50
8 ¦ ¦--2 0.35
9 ¦ °--3 0.00
10 °--3 0.00
11 ¦--1 0.35
12 ¦--2 0.35
13 °--3 0.00
The 3rd branch is still 0, whereas I had hoped it would be what I expected, namely 1 - prob[sibling 1] - prob[sibling 2]. So:
R/3 should be 0.7 (= 1 - 0.10 - 0.20)
R/1/3 should be 0.6 (= 1 - 0.20 - 0.20)
R/2/3 should be 0.15 (= 1 - 0.50 - 0.35)
R/3/3 should be 0.30 (= 1 - 0.35 - 0.35)
I am new to data.tree. Any help would be greatly appreciated!

I have managed to do this using recursion and the Cumulate function:
refresh.probs <- function(node) {
  if (node$isRoot) {
    sapply(node$children, refresh.probs)
  } else {
    if (node$position == node$parent$count) {
      # Zero this node out first so it does not contribute to the cumulative sum
      node$prob <- 0
      # Cumulate() sums "prob" over this node and its preceding siblings
      node$prob <- 1 - Cumulate(node, "prob", sum)
    }
    if (!node$isLeaf)
      sapply(node$children, refresh.probs)
  }
}
refresh.probs(tree)
This seems to do the trick.
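For comparison, the same refresh can be expressed without explicit recursion using data.tree's Do() with a filterFun. This is only a sketch against the tree built above, not part of the solution itself:
tree$Do(function(node) {
  # Overwrite the last sibling's prob with 1 minus the sum of its siblings
  siblings <- node$parent$children[-node$position]
  node$prob <- 1 - sum(sapply(siblings, function(s) s$prob))
}, filterFun = function(node) !node$isRoot && node$position == node$parent$count)
print(tree, "prob")   # e.g. R/3 = 0.7, R/1/3 = 0.6, R/2/3 = 0.15, R/3/3 = 0.3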

Related

Loop results in wrong position/order

I need to calculate the results of a very simple formula (a weighted average) that uses two variables (A and B) and two weight factors (A_prop and B_prop). The calculation is to be performed on a very large data set, and the weight factors are stored in another data frame that I call grid here.
My approach was first to create repetitions of the data for each combination of weight factors and then perform the calculations. Nothing strange so far. However, I then thought about calculating the values inside a loop. Everything seemed to be in place, but when I checked the results of both approaches they did not match. The results from the calculation inside the loop are incorrect.
I know I should just move on and keep the approach that gives me the correct results, especially because the number of lines is quite small. No big problem. However... I just can't live with this. I'm about to tear my hair out.
Can anyone explain to me why the results are not the same? What's wrong with the loop calculation?
Also, any suggestion for a more elegant approach would be welcome.
(Note: my first time using a reprex. I hope it is as it should be.)
>require(tidyverse)
>require(magicfor)
>require(readxl)
>require(reprex)
> dput(dt)
structure(list(X = 1:5, A = c(83.73, 50.4, 79.59, 62.96, 0),
B = c(100, 86.8, 80.95, 81.48, 0), weight = c(201.6, 655,
220.5, 280, 94.5), ind = c(733L, 26266L, 6877L, 8558L, 16361L
)), class = "data.frame", row.names = c(NA, -5L))
> dput(grid)
structure(list(A_prop = c(0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8,
0.85, 0.9, 0.95, 1), B_prop = c(0.5, 0.45, 0.4, 0.35, 0.3, 0.25,
0.2, 0.15, 0.1, 0.05, 0), id = 1:11, tag = structure(1:11, .Label = c("Aprop_0.5",
"Aprop_0.55", "Aprop_0.6", "Aprop_0.65", "Aprop_0.7", "Aprop_0.75",
"Aprop_0.8", "Aprop_0.85", "Aprop_0.9", "Aprop_0.95", "Aprop_1"
), class = "factor")), class = "data.frame", row.names = c(NA,
-11L))
> foo<-function(data,i){
+ val<-(data$A*grid[i,1])+(data$B*grid[i,2])
+ return(val)
+ }
> magic_for(print, progress=FALSE,silent = TRUE)
> for(i in grid$id){
+
+ score<-(dt$A*grid[i,1])+(dt$B*grid[i,2])
+
+ weight=dt$weight
+ A<-dt$A
+ B<-dt$B
+
+ ind=dt$ind
+
+ print(score)
+ print(weight)
+ print(ind)
+ print(A)
+ print(B)
+ }
> rest<-magic_result_as_dataframe()
> magic_free()
> rest2<-left_join(rest,grid,by=c("i"="id"))%>%
+ arrange(ind,tag)%>%
+ mutate(score2=(A*A_prop)+(B*B_prop))
> head(rest2)
i score weight ind A B A_prop B_prop tag score2
1 1 91.8650 201.6 733 83.73 100 0.50 0.50 Aprop_0.5 91.8650
2 2 84.5435 201.6 733 83.73 100 0.55 0.45 Aprop_0.55 91.0515
3 3 86.1705 201.6 733 83.73 100 0.60 0.40 Aprop_0.6 90.2380
4 4 87.7975 201.6 733 83.73 100 0.65 0.35 Aprop_0.65 89.4245
5 5 89.4245 201.6 733 83.73 100 0.70 0.30 Aprop_0.7 88.6110
6 6 91.0515 201.6 733 83.73 100 0.75 0.25 Aprop_0.75 87.7975
The problem is actually your left_join and NOT the for loop. For future posts, I would recommend that you also provide a minimal(istic) example.
I will demonstrate what went wrong in your code.
Say, we have these data frames, which should be similar to your real-world data:
dt <- data.frame(
  A = c(2,3,4),
  B = c(20,30,40)
)
grid <- data.frame(
  A_prop = c(0.5, 0.6),
  B_prop = c(0.5, 0.4),
  id = c(1,2),
  tag = c("A_prop0.5", "A_prop0.6")
)
We expect the following outputs:
Expected Output dt[1,] & A_prop 0.5 and B_prop 0.5
2 * 0.5 + 20 * 0.5 #= 11
Expected Output dt[2,] & A_prop 0.5 and B_prop 0.5
3 * 0.5 + 30 * 0.5 #= 16.5
Expected Output dt[3,] & A_prop 0.5 and B_prop 0.5
4 * 0.5 + 40 * 0.5 #= 22
Expected Output dt[1,] & A_prop 0.6 and B_prop 0.4
2 * 0.6 + 20 * 0.4 #= 9.2
Expected Output dt[2,] & A_prop 0.6 and B_prop 0.4
3 * 0.6 + 30 * 0.4 #= 13.8
Expected Output dt[3,] & A_prop 0.6 and B_prop 0.4
4 * 0.6 + 40 * 0.4 #= 18.4
I have never used the "magicfor" library, but the problem lies in your way of joining i and id.
I would write the for loop as follows:
l <- list()
for(i in grid$id){
  score <- (dt$A*grid[i,1]) + (dt$B*grid[i,2])
  A <- dt$A
  B <- dt$B
  iteration <- rep(i, 3) # to keep track of the iteration in which the result was created
  l[[i]] <- list(
    score = score,
    A = A,
    B = B,
    iteration = iteration
  )
}
Now I bind the list to a data frame and do the left_join as you did in your example:
l <- bind_rows(l)
l_merged <- grid %>% left_join(l, by = c("id"="iteration")) %>%
mutate(score2 = (A*A_prop + B*B_prop))
Then we test that score and score2 are the same:
transmute(l_merged, identical = score == score2)
identical
1 TRUE
2 TRUE
3 TRUE
4 TRUE
5 TRUE
6 TRUE
Now to the actual problem
I have adapted your code a little bit. I have added the iteration number to the output.
magic_for(print, progress = FALSE, silent = TRUE)
for(i in grid$id){
  score <- (dt$A*grid[i,1]) + (dt$B*grid[i,2])
  A <- dt$A
  B <- dt$B
  iteration <- rep(i, 3)
  print(score)
  print(A)
  print(B)
  print(iteration)
}
rest <- magic_result_as_dataframe()
magic_free()
Now, if we look at the output and compare i and iteration, we can see that these are not identical. Therefore your left_join() has produced a confusing result.
rest %>% arrange(i)
i score A B iteration
1 1 11.0 2 20 1
2 1 22.0 4 40 1
3 1 13.8 3 30 2
4 2 16.5 3 30 1
5 2 9.2 2 20 2
6 2 18.4 4 40 2
To finalise, we can test it:
grid %>% left_join(rest, by = c("id"="i")) %>% # using i for the join
mutate(score2 = (A*A_prop + B*B_prop)) %>%
transmute(identical = score == score2)
identical
1 TRUE
2 TRUE
3 FALSE
4 FALSE
5 TRUE
6 TRUE
The join with i does not produce the correct results.
But the join with iteration will:
grid %>% left_join(rest, by = c("id"="iteration")) %>% # using the "manually" produced iteration for the join
mutate(score2 = (A*A_prop + B*B_prop)) %>%
transmute(identical = score == score2)
identical
1 TRUE
2 TRUE
3 TRUE
4 TRUE
5 TRUE
6 TRUE
I am not sure why the i from "magicfor" differs from the manually created iteration. I certainly understand your confusion...
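For what it's worth, the join (and magicfor) can be avoided entirely by building all dt-by-grid combinations up front, which is the repetition approach described in the question. A minimal sketch, assuming the dt and grid objects from the dput() output above:
library(dplyr)

# merge() with by = NULL returns the Cartesian product: every row of dt
# paired with every row of grid. The weighted average is then one mutate().
rest3 <- merge(dt, grid, by = NULL) %>%
  mutate(score = A * A_prop + B * B_prop) %>%
  arrange(ind, tag)
head(rest3)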

Optimization function across multiple factors

I am trying to identify the appropriate thresholds for two activities which generate the greatest success rate.
Listed below is an example of what I am trying to accomplish. For each location I am trying to identify the thresholds to use for activities 1 & 2, so that if either criterion is met we would guess 'yes' (1). I then need to make sure that we are guessing 'yes' for only a certain percentage of the total volume for each location, and that we are maximizing our accuracy (our guess of 'yes' matches an 'outcome' of 1).
location <- c(1,2,3)
set.seed(145)
testFile <- data.frame(location = rep.int(location, 20),
                       activity1 = round(rnorm(20, mean = 10, sd = 3)),
                       activity2 = round(rnorm(20, mean = 20, sd = 3)),
                       outcome = rbinom(20,1,0.5))
act_1_thresholds <- seq(7,12,1)
act_2_thresholds <- seq(19,24,1)
I was able to accomplish this by creating a table that contains all of the possible unique combinations of thresholds for activities 1 & 2, and then merging it with each observation in the sample data set. However, with ~200 locations in the actual data set, each with thousands of observations, I quickly ran out of space.
I would like to create a function that takes the location id and the sets of possible thresholds for activities 1 and 2, and then calculates how often we would have guessed yes (i.e. the values in 'activity1' or 'activity2' exceed their respective thresholds being tested) to ensure our application rate stays within our desired range (50% - 75%). Then, for each set of thresholds which produces an application rate within the desired range, we would want to store only the set which maximizes accuracy, along with its location id, application rate, and accuracy rate. The desired output is listed below.
location act_1_thresh act_2_thresh application_rate accuracy_rate
1 1 13 19 0.52 0.45
2 2 11 24 0.57 0.53
3 3 14 21 0.67 0.42
I had tried writing this into a for loop, but was not able to navigate my way through the number of nested arguments I would have to make in order to account for all of these conditions. I would appreciate assistance from anyone who has attempted a similar problem. Thank you!
An example of how to calculate the application and accuracy rate for a single set of thresholds is listed below.
library(data.table)
### Create location IDs
location <- c(1,2,3)
### Create a single set of thresholds
single_act_1_threshold <- 12
single_act_2_threshold <- 20
### Calculate the simulated application and accuracy rate of the thresholds above using historical data
as.data.table(testFile)[,
  list(
    application_rate = round(sum(ifelse(single_act_1_threshold <= activity1 | single_act_2_threshold <= activity2, 1, 0)) /
                               nrow(testFile), 2),
    accuracy_rate = round(sum(ifelse((single_act_1_threshold <= activity1 | single_act_2_threshold <= activity2) & (outcome == 1), 1, 0)) /
                            sum(ifelse(single_act_1_threshold <= activity1 | single_act_2_threshold <= activity2, 1, 0)), 2)
  ),
  by = location]
Consider expand.grid, which builds a data frame of all combinations between both thresholds. Then use Map to iterate elementwise over the two columns of that data frame to build a list of data tables (which now include columns for each threshold).
act_1_thresholds <- seq(7,12,1)
act_2_thresholds <- seq(19,24,1)
# ALL COMBINATIONS
thresholds_df <- expand.grid(th1=act_1_thresholds, th2=act_2_thresholds)
# USER-DEFINED FUNCTION
calc <- function(th1, th2)
as.data.table(testFile)[, list(
act_1_thresholds = th1, # NEW COLUMN
act_2_thresholds = th2, # NEW COLUMN
application_rate = round(sum(ifelse(th1 <= activity1 | th2 <= activity2, 1, 0)) /
nrow(testFile),2),
accuracy_rate = round(sum(ifelse((th1 <= activity1 | th2 <= activity2) & (outcome == 1), 1, 0)) /
sum(ifelse(th1 <= activity1 | th2 <= activity2, 1, 0)),2)
), by = location]
# LIST OF DATA TABLES
dt_list <- Map(calc, thresholds_df$th1, thresholds_df$th2)
# NAME ELEMENTS OF LIST
names(dt_list) <- paste(thresholds_df$th1, thresholds_df$th2, sep="_")
# SAME RESULT AS POSTED EXAMPLE
dt_list$`12_20`
# location act_1_thresholds act_2_thresholds application_rate accuracy_rate
# 1: 1 12 20 0.23 0.5
# 2: 2 12 20 0.23 0.5
# 3: 3 12 20 0.23 0.5
And if you need to append all elements use data.table's rbindlist:
final_dt <- rbindlist(dt_list)
final_dt
# location act_1_thresholds act_2_thresholds application_rate accuracy_rate
# 1: 1 7 19 0.32 0.47
# 2: 2 7 19 0.32 0.47
# 3: 3 7 19 0.32 0.47
# 4: 1 8 19 0.32 0.47
# 5: 2 8 19 0.32 0.47
# ---
# 104: 2 11 24 0.20 0.42
# 105: 3 11 24 0.20 0.42
# 106: 1 12 24 0.15 0.56
# 107: 2 12 24 0.15 0.56
# 108: 3 12 24 0.15 0.56
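The question also asks to keep, for each location, only the threshold pair whose application rate falls within the 50% - 75% window and whose accuracy rate is highest. A possible follow-up sketch on final_dt (not part of the answer above):
# Keep threshold pairs inside the desired application-rate window, then
# retain the highest-accuracy row per location.
in_range <- final_dt[application_rate >= 0.5 & application_rate <= 0.75]
best_dt  <- in_range[order(-accuracy_rate), .SD[1], by = location]
best_dt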

Comparing changes across two matrices

I'm performing some biogeographic analyses in R and the result is encoded as a pair of matrices. Columns represent geographic regions, rows indicate nodes in a phylogenetic tree and values in the matrix are the probability that the branching event occurred in the geographic region indicated by the column. A very simple example would be:
> One_node <- matrix(c(0,0.8,0.2,0),
+                    nrow=1, ncol=4,
+                    dimnames = list(c("node_1"),
+                                    c("A","B","C","D")))
> One_node
        A   B   C D
node_1  0 0.8 0.2 0
In this case, the most probable location for node_1 is region B. In reality, the output of the analysis is encoded as two separate 79x123 matrices. The first is the probabilities of a node occupying a given region before an event and the second is the probabilities of a node occupying a given region after an event (rowSums=1). Some slightly more complicated examples:
before<-matrix(c(0,0,0,0,0.9,
0.8,0.2,0.6,0.4,0.07,
0.2,0.8,0.4,0.6,0.03,
0,0,0,0,0),
nrow=5, ncol=4,
dimnames = list(c("node_1","node_2","node_3","node_4","node_5"),
c("A","B","C","D")))
after<-matrix(c(0,0,0,0,0.9,
0.2,0.8,0.4,0.6,0.03,
0.8,0.2,0.6,0.4,0.07,
0,0,0,0,0),
nrow=5, ncol=4,
dimnames = list(c("node_1","node_2","node_3","node_4","node_5"),
c("A","B","C","D")))
> before
A B C D
node_1 0.0 0.80 0.20 0
node_2 0.0 0.20 0.80 0
node_3 0.0 0.60 0.40 0
node_4 0.0 0.40 0.60 0
node_5 0.9 0.07 0.03 0
> after
A B C D
node_1 0.0 0.20 0.80 0
node_2 0.0 0.80 0.20 0
node_3 0.0 0.40 0.60 0
node_4 0.0 0.60 0.40 0
node_5 0.9 0.03 0.07 0
Specifically, I'm only interested in extracting row numbers where column B is the highest in before and column C is the highest in after and vice versa as I'm trying to extract node numbers in a tree where taxa have moved B->C or C->B.
So the output I'm looking for would be something like:
> BC
[1] 1 3
> CB
[1] 2 4
There will be rows where B>C or C>B but where neither is the highest in the row (node_5) and I need to ignore these. The row numbers are then used to query a separate dataframe that provides the data I want.
I hope this all makes sense. Thanks in advance for any advice!
You could do something like this...
maxBefore <- apply(before, 1, which.max) #find highest columns before (by row)
maxAfter <- apply(after, 1, which.max) #and highest columns after
BC <- which(maxBefore==2 & maxAfter==3) #rows with B highest before, C after
CB <- which(maxBefore==3 & maxAfter==2) #rows with C highest before, B after
BC
node_1 node_3
1 3
CB
node_2 node_4
2 4
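If plain row numbers are needed to index a separate data frame, note that which() returns a named vector, so the names can simply be dropped. A small usage sketch (node_info is a hypothetical data frame with one row per node):
unname(BC)              # [1] 1 3
unname(CB)              # [1] 2 4
node_info[unname(BC), ] # hypothetical lookup of the B->C nodes by row number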

Backtesting stock returns for Moving Average rule in R

I am trying to backtest stock returns given a 10-month moving average rule. The rule being: if the price is above the 10-month average, buy; if it is below the 10-month average, hold the value constant.
I know how to do this in excel very easily, but I am having trouble in R.
Below is my approach in R:
#Download financial data
library(Quandl)
SPY <- Quandl("YAHOO/INDEX_GSPC", type = "xts", collapse = "monthly")
head(SPY)
#Calculate log returns
SPY$log_ret <- diff(log(SPY$Close))
#Calculate moving average for Closing price
SPY$MA.10 <- rollapply(SPY$Close, width = 10, FUN = mean)
#Create binary rule to determine when to buy and when to hold
#1 = Buy
SPY$Action <- ifelse(SPY$MA.10 < SPY$Close, 1, 0)
#Create default value in a new column to backtest returns
SPY$Hit <- 100
#Calculate cumulative returns
SPY$Hit <- ifelse(SPY$Action == 1, SPY[2:n, "Hit"] * (1 + SPY$log_ret),
                  lag.xts(SPY$Hit, k=1))
Returns do get calculated correctly for an Action of 1, but when the Action is not 1, I find that SPY$Hit only lags 1 time, then defaults to the 100 value, while I would like it to hold the value from the last Action == 1 time.
This formula works very well in MS Excel and is very easy to implement, but the issue in R seems to be that I cannot keep the value constant from the last Action == 1. How can I do this so that I can see how well this simple trading strategy would work?
Please let me know if I can clarify this further, thank you.
Sample of the desired output:
Action Return Answer
[1,] 0 0.00 100.00000
[2,] 1 0.09 109.00000
[3,] 1 0.08 117.72000
[4,] 1 -0.05 111.83400
[5,] 1 -0.03 108.47898
[6,] 0 -0.02 108.47898
[7,] 0 0.01 108.47898
[8,] 0 0.06 108.47898
[9,] 1 -0.03 105.22461
[10,] 0 0.10 105.22461
[11,] 1 -0.05 99.96338
Here's my guess, let me know what you think.
# Looping
Hit <- matrix(100, nrow = nrow(SPY))
for(row in 11:nrow(SPY)){ # 11 since you have NA's from your moving average
if(SPY$Action[row] == 1){
Hit[row] = Hit[row-1] * (1 + SPY$log_ret[row]) # here we needed row-1
} else {
Hit[row] = Hit[row-1]
}
}
SPY$Hit <- Hit
cbind(SPY$Action, SPY$Hit)
For your sample:
x <- data.frame(Action = c(0,1,1,1,1,0,0,0,1,0,1))
x$Return <- c(0,0.09,0.08,-0.05,-0.03,-0.02,0.01,0.06,-0.03,0.10,-0.05)
x$Answer <- matrix(100, nrow = nrow(x))
for(row in 2:nrow(x)){ # start at 2 so that row - 1 exists
if(x$Action[row] == 1){
x$Answer[row] = x$Answer[row-1] * (1 + x$Return[row])
} else {
x$Answer[row] = x$Answer[row-1]
}
}
x
Action Return Answer
1 0 0.00 100.00000
2 1 0.09 109.00000
3 1 0.08 117.72000
4 1 -0.05 111.83400
5 1 -0.03 108.47898
6 0 -0.02 108.47898
7 0 0.01 108.47898
8 0 0.06 108.47898
9 1 -0.03 105.22461
10 0 0.10 105.22461
11 1 -0.05 99.96338
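If you would rather avoid the loop, the same series can be produced in one vectorized step: an Action of 0 contributes a growth factor of 1, an Action of 1 contributes 1 + Return, and 100 times the running product of those factors reproduces the Answer column. A sketch using the same x data frame as above:
# Growth factor per row: 1 when holding, 1 + Return when invested.
x$Answer2 <- 100 * cumprod(ifelse(x$Action == 1, 1 + x$Return, 1))
all.equal(as.numeric(x$Answer), x$Answer2)  # should be TRUE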
In Excel there are two ways to attain this:
1. Go to the Data tab, find Data Analysis, and choose Moving Average. In the dialog box enter the input data range, the interval (in your case 10), and the output cell address. Once you have the result, write this formula:
=IF(A2 > B2, "Buy", "Hold")
where A2 holds the price and B2 holds the 10-month moving average value.
2. Anywhere in the sheet, number cells horizontally 1 to 10 (the month number). In the row below, put each month's value. In the row below that, calculate the 10-month average. Finally, write the formula above to determine Buy or Hold.

R alternate two/four vectors after every two/four values

Say I have two vectors with values that come from formulas:
A <- c(0.11, -0.11, -.20, .20, -0.18, 0.18)
B <- c(-0.11, 0.11, .20, -.20, 0.18, -0.18)
What I wish to accomplish is to merge the vectors into one vector where I take the first two values of A, then the 3rd and 4th values of B, then the 5th and 6th values of A (in the actual data set the vectors are 96 elements long), to end up with:
V <- c(0.11, -0.11, .20, -.20, -0.18, 0.18)
I wish to accomplish the same with four vectors, where it switches between vectors every 4 values. Seeing as the vectors are long, I don't want to have to resort to the use of indices.
I've fumbled around a lot with combinations of the c() and rbind() functions, but always end up merging incorrectly.
An example of code I've tried (with called objects substituted with possible values):
c(rbind(1.2 - (1.2 + 1.2/2),
        1.2 - 1.2/2),
  rbind(1.2 - 1.2/2,
        1.2 - (1.2 + 1.2/2)))
This would end up with the vectors being merged after the first one ends. I've tried different combinations, but none worked out for me.
Does anybody have a nifty trick up their sleeve?
Here's a wrapper function that will accept any number of vectors and give you the desired result (though the vectors are assumed to be of the same length):
Myfunc <- function(...){
  temp <- cbind(...)
  len <- ncol(temp)
  suppressWarnings(temp[cbind(seq(nrow(temp)), rep(seq(len), each = len))])
}
Myfunc(A, B)
## [1] 0.11 -0.11 0.20 -0.20 -0.18 0.18
On 4 vectors (Provided by OP in comments)
A <- 1:16 ; B <- 21:36 ; C <- 41:56 ; D <- 61:76
Myfunc(A, B, C, D)
## [1] 1 2 3 4 25 26 27 28 49 50 51 52 73 74 75 76
Another option is to build a logical index with the modulo operator:
> (1:6) %% 4 %in% c(1,2)
[1] TRUE TRUE FALSE FALSE TRUE TRUE
> (1:12) %% 8 %in% c(1,2,3,4)
[1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
etc.
or for your example
> D <- rep(0,6)
> D[(1:6) %% 4 %in% c(1,2)] <- A[(1:6) %% 4 %in% c(1,2)]
> D[!(1:6) %% 4 %in% c(1,2)] <- B[!(1:6) %% 4 %in% c(1,2)]
> D
[1] 0.11 -0.11 0.20 -0.20 -0.18 0.18
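The same modulo idea generalizes to four vectors switching every four values. A sketch that reuses the data from the Myfunc example above and reproduces its output:
A <- 1:16 ; B <- 21:36 ; C <- 41:56 ; D <- 61:76
grp <- ((seq_along(A) - 1) %/% 4) %% 4 + 1        # 1,1,1,1, 2,2,2,2, 3,3,3,3, 4,4,4,4
V <- cbind(A, B, C, D)[cbind(seq_along(A), grp)]  # pick column grp[i] in row i
V
## [1]  1  2  3  4 25 26 27 28 49 50 51 52 73 74 75 76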
