Find the maximum value in a decision tree - R

I created a decision tree with the party package in R.
I'm trying to get the route/branch with the maximum value.
This value can be the mean shown in each terminal-node box plot of a regression tree,
or the class probability shown in each terminal node of a binary classification tree
(source: rdatamining.com).

This can be done pretty easily, actually, though note that while your definition of "maximum value" is clear for a regression tree, it is not so clear for a classification tree, as each level can have its own maximum in a given node.
Either way, here's a pretty simple helper function that will return the predictions for each type of tree:
GetPredicts <- function(ct){
  # prediction stored in each node of a "BinaryTree" object
  f <- function(ct, i) nodes(ct, i)[[1]]$prediction
  # ids of the terminal nodes
  Terminals <- unique(where(ct))
  Predictions <- sapply(Terminals, f, ct = ct)
  if (is.matrix(Predictions)) {   # classification tree: one row per class
    colnames(Predictions) <- Terminals
    return(Predictions)
  } else {                        # regression tree: one value per node
    return(setNames(Predictions, Terminals))
  }
}
Now, luckily you've taken your trees from the examples in ?ctree, so we can test them (next time, please provide the code you used yourself).
Regression tree (your first tree)
## load the package and create the tree
library(party)
airq <- subset(airquality, !is.na(Ozone))
airct <- ctree(Ozone ~ ., data = airq,
               controls = ctree_control(maxsurrogate = 3))
plot(airct)
Now, test the function
res <- GetPredicts(airct)
res
# 5 3 6 9 8
# 18.47917 55.60000 31.14286 48.71429 81.63333
So we've got the predictions for each terminal node. You can easily proceed with which.max(res) from here (I'll leave it for you to decide).
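For instance, a minimal sketch of that last step, reusing the res object from above:
which.max(res)
# 8
# 5
That is, terminal node 8 (the fifth element of res) holds the largest mean prediction, 81.63.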
Classification tree (your second tree)
irisct <- ctree(Species ~ ., data = iris)
plot(irisct, type = "simple")
Run the function
res <- GetPredicts(irisct)
res
# 2 5 6 7
# [1,] 1 0.00000000 0.0 0.00000000
# [2,] 0 0.97826087 0.5 0.02173913
# [3,] 0 0.02173913 0.5 0.97826087
Now the output is a bit harder to read, because each class has its own probabilities. You could make this a bit more readable using
row.names(res) <- levels(iris$Species)
res
# 2 5 6 7
# setosa 1 0.00000000 0.0 0.00000000
# versicolor 0 0.97826087 0.5 0.02173913
# virginica 0 0.02173913 0.5 0.97826087
Then, you could do something like the following in order to get the overall maximum value
which(res == max(res), arr.ind = TRUE)
# row col
# setosa 1 1
For column/row maxes, you could do
matrixStats::colMaxs(res)
# [1] 1.0000000 0.9782609 0.5000000 0.9782609
matrixStats::rowMaxs(res)
# [1] 1.0000000 0.9782609 0.9782609
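If what you want instead is the predicted class per terminal node, one option is the sketch below (note that which.max resolves ties, such as the 0.5/0.5 split in node 6, to the first level):
setNames(levels(iris$Species)[apply(res, 2, which.max)], colnames(res))
#            2            5            6            7
#     "setosa" "versicolor" "versicolor"  "virginica"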
But, again, I'll leave it to you to decide how to proceed from here.


split() returns "longer object length is not a multiple of shorter object length"

Context
I asked this question recently:
Comparing partitions from split() using a nested for loop containing an if statement
where I needed to compare partitions generated by split() from a distance matrix, using the code fix provided by @robertdj:
set.seed(1234) # set random seed for reproducibility
# generate random normal variates
x <- rnorm(5)
y <- rnorm(5)
df <- data.frame(x, y) # merge vectors into dataframe
d <- dist(x) # generate distance matrix
splt <- split(d, 1:5) # split data into 5 partitions (2 values each)
for (i in 1:length(splt)) {
  for (j in 1:length(splt)) {
    if (i != j) {
      a <- length(which(splt[[i]] >= min(splt[[j]]))) / length(splt[[i]])
      b <- length(which(splt[[j]] <= max(splt[[i]]))) / length(splt[[j]])
    }
  }
}
I generated a MWE where each split contained the same number of elements. I did this just for illustrative purposes, fully knowing that this would not necessarily hold for real data.
As per @Robert Hacken's comment, if I instead do
d <- na.omit(d[lower.tri(d)])
I get partitions of unequal length.
Real Data
However, my real data does not have this "same size" property: it contains many more partitions than the 5 in my MWE.
Here is my code
splt <- split(dist_matrix, sub("(?:(.*)\\|){2}(\\w+)\\|(\\w+)\\|.*?$", "\\1-\\2", colnames(dist_matrix)))
The distance matrix dist_matrix contains FASTA headers from which I extract the species names.
I then use splt above in the doubly nested loop.
For instance, splt[[4]] contains 5 values, whereas splt[[10]] contains 9.
splt[[4]]
[1] 0.1316667 0.1383333 0.1166667 0.1333333 0.1216667
splt[[10]]
[1] 0.1450000 0.1483333 0.1316667 0.1316667 0.1333333 0.1333333 0.1166667 0.1166667 0.1200000
Expected Output
For my real problem, each partition corresponds to the distances from a single species to all other unique species. So, if Species X has two DNA sequences representing it and there are 10 species in total, the partition for Species X should contain 20 distances. However, I don't want the partition to include the distance between the two sequences representing Species X itself.
splt would thus contain 10 partitions (each not necessarily of the same length), one per species.
The expected output for a and b is a number between 0 and 1, inclusive. I think these numbers should be small in my real example, but they come out large when I run my code, which I think is a consequence of the warning.
What I've Done
I've read on SO that %in% is typically used to resolve the warning
In splt[[i]] == splt[[j]] :
longer object length is not a multiple of shorter object length
except in my case, I believe I would need `%notin%` <- Negate(`%in%`).
However, %notin% gives the error from my original post:
the condition has length > 1
Question
How can my nested loop be altered to remove the warning?
I'm going to go out on a limb by interpreting parts of what you say, discarding your code, and seeing what I can come up with. If nothing else, it may spark a conversation about which of my interpretations are correct (and which are incorrect).
Starting with the splt as generated by the random data, then replacing elements 4 and 5 with longer vectors,
set.seed(1234)
x <- rnorm(5)
y <- rnorm(5)
df <- data.frame(x, y)
d <- dist(x)
splt <- split(d, 1:5)
splt[[4]] <- rnorm(4)
splt[[5]] <- rnorm(10)
We have:
splt <- list(
  "1" = c(1.48449499149608, 2.62312694474001),
  "2" = c(2.29150692606848, 0.15169544670039),
  "3" = c(1.13863195324393, 3.43013887931241),
  "4" = c(-0.477192699753547, -0.998386444859704, -0.77625389463799, 0.0644588172762693),
  "5" = c(-0.693720246937475, -1.44820491038647, 0.574755720900728, -1.02365572296388,
          -0.0151383003641817, -0.935948601168394, 1.10229754620026, -0.475593078869057,
          -0.709440037512506, -0.501258060594761)
)
splt
# $`1`
# [1] 1.484495 2.623127
# $`2`
# [1] 2.2915069 0.1516954
# $`3`
# [1] 1.138632 3.430139
# $`4`
# [1] -0.47719270 -0.99838644 -0.77625389 0.06445882
# $`5`
# [1] -0.6937202 -1.4482049 0.5747557 -1.0236557 -0.0151383 -0.9359486 1.1022975 -0.4755931 -0.7094400 -0.5012581
You reference expressions like which(splt[[i]] >= min(splt[[j]])), which I'm interpreting to mean "what is the ratio of splt[[i]] that is above the max value in splt[[j]]?" (and its counterpart with min). Since we're comparing (for example) splt[[1]] with all of splt[[2]] through splt[[5]] here, and likewise for the others, we're going to have a square matrix where the diagonal is splt[[i]]-vs-splt[[i]] (likely not interesting).
Some quick math so we know what we should end up with:
splt[[1]]
# [1] 1.484495 2.623127
range(splt[[2]])
# [1] 0.1516954 2.2915069
Since 1 of the 2 values in [[1]] is greater than [[2]]'s max of 2.29, we expect 0.5 in a comparison between the two (for >= max(.)); similarly, none of [[1]] is below [[2]]'s min of 0.15, so we expect a 0 there.
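Verifying that pair directly, with the same pattern used for [[5]] and [[4]] below:
sum(splt[[1]] >= max(splt[[2]])) / length(splt[[1]])
# [1] 0.5
sum(splt[[1]] <= min(splt[[2]])) / length(splt[[1]])
# [1] 0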
Similarly, [[5]] over [[4]]:
splt[[5]]
# [1] -0.6937202 -1.4482049 0.5747557 -1.0236557 -0.0151383 -0.9359486 1.1022975 -0.4755931 -0.7094400 -0.5012581
range(splt[[4]])
# [1] -0.99838644 0.06445882
### 2 of 10 are greater than the max
sum(splt[[5]] >= max(splt[[4]])) / length(splt[[5]])
# [1] 0.2
### 2 of 10 are at or below the min
sum(splt[[5]] <= min(splt[[4]])) / length(splt[[5]])
# [1] 0.2
We can use outer, but sometimes that can be confusing, especially since in this case we'd need to Vectorize the anonymous function passed to it. I'll adapt your double-for-loop premise into nested sapply calls.
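(For completeness, a minimal sketch of what that outer() route could look like; note that its rows index i and its columns index j, i.e., the transpose of the orientation used below:)
f <- Vectorize(function(i, j) sum(splt[[i]] >= max(splt[[j]])) / length(splt[[i]]))
outer(seq_along(splt), seq_along(splt), f)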
Greater than the other's max
sapply(splt, function(y) {
  sapply(setNames(splt, paste0("max", seq_along(splt))),
         function(z) sum(y >= max(z)) / length(y))
})
# 1 2 3 4 5
# max1 0.5 0.0 0.5 0.00 0.0
# max2 0.5 0.5 0.5 0.00 0.0
# max3 0.0 0.0 0.5 0.00 0.0
# max4 1.0 1.0 1.0 0.25 0.2
# max5 1.0 0.5 1.0 0.00 0.1
Interpretation and subset validation:
1 with max of 2: comparing [[1]] (first column) with the max value from [[2]] (second row), half of 1's values are greater, so we have 0.5 (as expected).
5 with max of 4: comparing [[5]] (fifth column) with the max value from [[4]] (fourth row), 0.2 meet the condition.
Less than the other's min
sapply(splt, function(y) {
  sapply(setNames(splt, paste0("min", seq_along(splt))),
         function(z) sum(y <= min(z)) / length(y))
})
# 1 2 3 4 5
# min1 0.5 0.5 0.5 1.00 1.0
# min2 0.0 0.5 0.0 1.00 0.8
# min3 0.0 0.5 0.5 1.00 1.0
# min4 0.0 0.0 0.0 0.25 0.2
# min5 0.0 0.0 0.0 0.00 0.1
Same two pairs:
1 with min of 2 (row 2, column 1) is 0, as expected
5 with min of 4 (row 4, column 5) is 0.2, as expected
Edit: @compbiostats pointed out that while sum(..) should produce the same results as length(which(..)), the latter may be more robust to missing data (e.g., NA values; cf. Difference between sum(), length(which()), and nrow() in R). For sum(..) to share that resilience, we should add na.rm=TRUE to sum(.) as well as to the inner max(.)/min(.) calls above. Thanks @compbiostats!
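For instance, the "greater than the other's max" call above with that protection added would become (just a restatement of the earlier call; with NA-free data like this example it returns the same matrix):
sapply(splt, function(y) {
  sapply(setNames(splt, paste0("max", seq_along(splt))),
         function(z) sum(y >= max(z, na.rm = TRUE), na.rm = TRUE) / length(y))
})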

igraph: summarize each node's neighbours characteristics

With an igraph object, I would like to capture some features of each node's neighbours, for example the average degree of its neighbours.
I came up with this code, which is inelegant and quite slow.
How should I rethink it for large and complex networks?
library(igraph)
# Toy example
set.seed(123)
g <- erdos.renyi.game(10,0.2)
# Loop to calculate average degree of each node's neighbourhood
s <- character(0)
for (i in 1:gorder(g)) {
  n <- ego_size(g, nodes = i, order = 1, mindist = 1)
  node_of_interest <- unique(unlist(ego(g, nodes = i, order = 1, mindist = 1)))
  m <- mean(degree(g, v = node_of_interest, loops = TRUE, normalized = FALSE) - 1)
  s <- rbind(s, data.frame(node = i, neighbours = n, mean = m))
}
Growing the data structure with rbind in a loop can get quite slow in R, because at every step R needs to allocate space for the new object and then copy the old one into it (see section 24.6 here). Also, you might be computing the degree of a node many times, if it is the neighbour of multiple nodes.
A possibly better alternative could be:
# add vertex id (not really necessary)
V(g)$name <- V(g)
# add degree to the graph
V(g)$degree <- degree(g, loops = TRUE, normalized = FALSE)
# get a list of neighbours, for each node
g_ngh <- neighborhood(g, mindist = 1)
# write a function that gets the means
get.mean <- function(x){
  mean(V(g)$degree[x] - 1)
}
# apply the function, add result to the graph
V(g)$av_degr_nei <- sapply(g_ngh, get.mean)
# get data into dataframe, if necessary
d_vert_attr <- as_data_frame(g, what = "vertices")
d_vert_attr
name degree av_degr_nei
1 1 0 NaN
2 2 1 2.0000000
3 3 2 1.0000000
4 4 1 1.0000000
5 5 2 1.0000000
6 6 1 1.0000000
7 7 3 0.6666667
8 8 1 0.0000000
9 9 1 0.0000000
10 10 0 NaN
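As an aside, igraph also ships a built-in for this particular statistic: knn() returns, in its $knn component, the average degree of each vertex's neighbours. Note that it averages the raw degrees, without the -1 adjustment used above, so pick whichever definition fits your problem:
knn(g)$knn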

Create multiple confusion matrices in R using loops

I am trying to create multiple confusion matrices from one dataframe, with each matrix generated based on a different condition in the dataframe.
So for the dataframe below, I want a confusion matrix for when Value = 1, Value = 2, and Value = 3:
observed predicted Value
1 1 1
0 1 1
1 0 2
0 0 2
1 1 3
0 0 3
and see the results like:
Value Sensitivity Specificity PPV NPV
1 .96 .71 .84 .95
2 .89 .63 .30 .45
3 .88 .95 .28 .80
This is what I tried, with a reproducible example. I am trying to write a loop that looks at every row, determines if Age = 1, and then pulls the values from the predicted and observed columns to generate a confusion matrix. Then I manually pull the values out of the confusion matrix to compute sensitivity, specificity, PPV and NPV, and try to combine all the matrices together. Then the loop starts again with Age = 2.
data(scat)
df<-scat %>% transmute(observed=ifelse(Site=="YOLA","case", "control"), predicted=ifelse(Location=="edge","case", "control"),Age)
x<-1 #evaluate at ages 1 through 5
for (i in dim(df)[1]) { #for every row in df
  while(x<6) { #loop stops at Age=5
    if(x=df$Age) {
      q<-confusionMatrix(data = df$predicted, reference = df$observed, positive = "case")
      sensitivity = q$table[1,1]/(q$table[1,1]+q$table[2,1])
      specificity = q$table[2,2]/(q$table[2,2]+q$table[1,2])
      ppv = q$table[1,1]/(q$table[1,1]+q$table[1,2])
      npv = q$table[2,2]/(q$table[2,2]+q$table[2,1])
      matrix(c(sensitivity, specificity, ppv, npv),ncol=4,byrow=TRUE)
    }
  }
  x <- x + 1 #confusion matrix at next Age value
}
final<- rbind(matrix) #combine all the matrices together
However, this loop is completely non-functional. I'm not sure where the error is.
Your code can be simplified and the desired output achieved like this:
library(caret)
library(dplyr)
data(scat)
df <- scat %>%
  transmute(observed = factor(ifelse(Site == "YOLA", "case", "control")),
            predicted = factor(ifelse(Location == "edge", "case", "control")),
            Age)
final <- t(sapply(sort(unique(df$Age)), function(i) {
  q <- confusionMatrix(data = df$predicted[df$Age == i],
                       reference = df$observed[df$Age == i],
                       positive = "case")$table
  c(sensitivity = q[1, 1] / (q[1, 1] + q[2, 1]),
    specificity = q[2, 2] / (q[2, 2] + q[1, 2]),
    ppv = q[1, 1] / (q[1, 1] + q[1, 2]),
    npv = q[2, 2] / (q[2, 2] + q[2, 1]))
}))
Resulting in
final
#> sensitivity specificity ppv npv
#> [1,] 0.0 0.5625000 0.00000000 0.8181818
#> [2,] 0.0 1.0000000 NaN 0.8000000
#> [3,] 0.2 0.5882353 0.06666667 0.8333333
#> [4,] 0.0 0.6923077 0.00000000 0.6923077
#> [5,] 0.5 0.6400000 0.25000000 0.8421053
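If you want the conditioning value as a column, as in your desired output (here the condition is Age rather than Value), a small follow-up sketch reusing the final object above:
final_df <- data.frame(Age = sort(unique(df$Age)), final)
final_df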
However, it's nice to know why your own code didn't work, so here are a few issues that might be useful to consider:
You need factor columns rather than character columns for confusionMatrix.
You were iterating over the rows of df, but you need one iteration for each unique age, not one per row of your data frame.
Your line to increment x sits outside of the while loop, so x never increments and the loop never terminates; the console just hangs.
You are doing if(x = df$Age), but you need a == to test equality.
It doesn't make sense to compare x to df$Age anyway, because x is length 1 and df$Age is a long vector.
You have unnecessary repetition in writing q$table each time. You can just make q equal to q$table to make your code more readable and less error-prone.
You call matrix at the end of the loop, but you don't store the result anywhere, so the whole loop doesn't actually do anything.
You are trying to rbind an object called matrix in the last line, but no such object was ever created (matrix is just the base R function).
The lack of spaces between math operators, commas and variables makes the code less readable and harder to debug. I'm not just saying this as a stylistic point; it is a major source of the errors I see frequently here on SO.

interpretation of the output of R function bs() (B-spline basis matrix)

I often use B-splines for regression. Up to now I've never needed to understand the output of bs in detail: I would just choose the model I was interested in and fit it with lm. However, I now need to reproduce a B-spline model in external (non-R) code. So, what's the meaning of the matrix generated by bs? Example:
library(splines)
x <- c(0.0, 11.0, 17.9, 49.3, 77.4)
bs(x, df = 3, degree = 1) # generate degree-1 (linear) B-splines with 2 internal knots
# 1 2 3
# [1,] 0.0000000 0.0000000 0.0000000
# [2,] 0.8270677 0.0000000 0.0000000
# [3,] 0.8198433 0.1801567 0.0000000
# [4,] 0.0000000 0.7286085 0.2713915
# [5,] 0.0000000 0.0000000 1.0000000
# attr(,"degree")
# [1] 1
# attr(,"knots")
# 33.33333% 66.66667%
# 13.30000 38.83333
# attr(,"Boundary.knots")
# [1] 0.0 77.4
# attr(,"intercept")
# [1] FALSE
# attr(,"class")
# [1] "bs" "basis" "matrix"
OK, so degree is 1, as I specified in input. knots tells me that the two internal knots are at x = 13.3000 and x = 38.8333 respectively. I was a bit surprised to see that the knots are at fixed quantiles; I had hoped R would find the best knot locations for my data, but of course that would make the model nonlinear, and it also wouldn't be possible without knowing the response data. intercept = FALSE means that no intercept was included in the basis (is that a good thing? I've always been taught not to fit linear models without an intercept... well, I guess lm is just adding one anyway).
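(As an aside, if the default quantile placement is not what you want, bs() also accepts explicit knot locations through its knots argument; the values below are arbitrary, purely for illustration:)
bs(x, knots = c(20, 40), degree = 1)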
However, what about the matrix? I don't really understand how to interpret it. With three columns, I would think it means that there are three basis functions. This makes sense: if I have two internal knots K1 and K2, I will have a spline between the left boundary knot B1 and K1, another spline between K1 and K2, and a final one between K2 and B2, so... three basis functions, OK. But which are the basis functions exactly? For example, what does this column mean?
# 1
# [1,] 0.0000000
# [2,] 0.8270677
# [3,] 0.8198433
# [4,] 0.0000000
# [5,] 0.0000000
EDIT: this is similar to, but not precisely the same as, this question. That question asks about the interpretation of the regression coefficients, but I'm a step before that: I would like to understand the meaning of the model matrix coefficients. If I try to make the same plots as suggested in the first answer, I get a messed-up plot:
b <- bs(x, df = 3, degree = 1)
b1 <- b[, 1] ## basis 1
b2 <- b[, 2] ## basis 2
b3 <- b[, 3] ## basis 3
par(mfrow = c(1, 3))
plot(x, b1, type = "l", main = "basis 1: b1")
plot(x, b2, type = "l", main = "basis 2: b2")
plot(x, b3, type = "l", main = "basis 3: b3")
These can't be the B-spline basis functions, because they have too many knots (each function should only have one).
The second answer would actually allow me to reconstruct my model outside R, so I guess I could go with that. However, that answer doesn't exactly explain what the elements of the b matrix are either: it deals with the coefficients of a linear regression, which I haven't yet introduced here. It's true that that is my final goal, but I wanted to understand this intermediate step too.
The matrix b
# 1 2 3
# [1,] 0.0000000 0.0000000 0.0000000
# [2,] 0.8270677 0.0000000 0.0000000
# [3,] 0.8198433 0.1801567 0.0000000
# [4,] 0.0000000 0.7286085 0.2713915
# [5,] 0.0000000 0.0000000 1.0000000
is actually just the matrix of the values of the three basis functions at each point of x, which should have been obvious to me, since it's exactly the same interpretation as for a polynomial linear model. As a matter of fact, since the boundary knots are
bknots <- attr(b,"Boundary.knots")
# [1] 0.0 77.4
and the internal knots are
iknots <- attr(b,"knots")
# 33.33333% 66.66667%
# 13.30000 38.83333
then the three basis functions, as shown here, are:
knots <- c(bknots[1], iknots, bknots[2])
y1 <- c(0, 1, 0, 0)
y2 <- c(0, 0, 1, 0)
y3 <- c(0, 0, 0, 1)
par(mfrow = c(1, 3))
plot(knots, y1, type = "l", main = "basis 1: b1")
plot(knots, y2, type = "l", main = "basis 2: b2")
plot(knots, y3, type = "l", main = "basis 3: b3")
Now, consider b[,1]
# 1
# [1,] 0.0000000
# [2,] 0.8270677
# [3,] 0.8198433
# [4,] 0.0000000
# [5,] 0.0000000
These must be the values of b1 at x <- c(0.0, 11.0, 17.9, 49.3, 77.4). As a matter of fact, b1 is 0 at knots[1] = 0 and 1 at knots[2] = 13.3000, meaning that at x[2] (11.0) the value must be 11/13.3 = 0.8270677, as expected. Similarly, since b1 is back to 0 at knots[3] = 38.83333, the value at x[3] (17.9) must be (38.83333-13.3)/17.9 = 0.8198433 (but see the correction below). Since x[4], x[5] > knots[3] = 38.83333, b1 is 0 there. A similar interpretation can be given for the other two columns.
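(Side note: the "messed up" plots in the question come from evaluating the basis at just the 5 data points. A minimal sketch that evaluates the same basis on a dense grid, using the predict method that splines provides for bs objects, shows the expected tent shapes:)
xg <- seq(0, 77.4, length.out = 200)
matplot(xg, predict(b, xg), type = "l", lty = 1, xlab = "x", ylab = "basis value")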
Just a small correction to the excellent answer by @DeltaIV above (it looks like I cannot comment).
In b1, when he calculated b1(x[3]), it should be (38.83333-17.9)/(38.83333-13.3) = 0.8198433, by linear interpolation. Everything else is perfect.
Note that b1 should look like this:
$$b_1(t) = \frac{t}{13.3}\,I(0 \le t < 13.3) + \frac{38.83333 - t}{38.83333 - 13.3}\,I(13.3 \le t < 38.83333)$$

calculate cumulative probability from binomial logit output in R

I have a fitted binomial logit model and want to calculate the cumulative probability of experiencing an event at or below some value of a covariate.
For example, if I have a fitted model that predicts an outcome based on a continuous distance range (0-8.5 km), I might want to find out the cumulative probability for distance <= 4.5 km.
I have vectors of estimated probabilities and the associated distances, as below:
dat <- structure(list(km = c(0, 0.447368421052632, 0.894736842105263,
1.34210526315789, 1.78947368421053, 2.23684210526316, 2.68421052631579,
3.13157894736842, 3.57894736842105, 4.02631578947368, 4.47368421052632,
4.92105263157895, 5.36842105263158, 5.81578947368421, 6.26315789473684,
6.71052631578947, 7.15789473684211, 7.60526315789474, 8.05263157894737,
8.5), prob = c(0.99010519543441, 0.985413663823809, 0.97854588563623,
0.968547716962174, 0.954108659036907, 0.933496091194704, 0.904551377544634,
0.864833064332603, 0.81202174997839, 0.744668375529677, 0.663191827576796,
0.570704402277059, 0.47300143764816, 0.377323442817887, 0.290336664745317,
0.216433162546689, 0.157174982015906, 0.111825887625402, 0.0783449309507567,
0.054275681518511)), .Names = c("km", "prob"), row.names = c(NA,
-20L), class = "data.frame")
What I ultimately want to say is "x% of observations within distance x are predicted to experience an event". Is this the right way to go about that?
Also, is there an easy way to calculate the distance that encompasses 50% of the cumulative probability?
Thanks, Tim
There is probably some way to extract this from your model, but if you were doing it from scratch, I would try to fit your data to a distribution, then extract the relevant data points.
First define an error function:
rmse <- function(x,y) sqrt(sum((x-y)^2)/length(x)) # or some other error fxn
Now let's say your data sort of looks like a gamma distribution, so try:
gdf <- function(x, d=dat$km) pgamma(d,shape=x[1], scale=x[2])
So your function to optimize will be the error function of your data and the fit distribution:
error_fxn <- function(x) rmse(rev(dat$prob),gdf(x)) # rev data to make ascending
Now optimize this function to get your parameters for the distribution of interest:
rr <- optim(c(1,1),error_fxn)
And let's see how good the fit is (just OK...):
rr
# $par
# [1] 3.108392 1.112584
# $value
# [1] 0.0333369
# $counts
# function gradient
#      119       NA
# $convergence
# [1] 0
# $message
# NULL
Or graphically:
with(dat, plot(km, prob, xlim = c(10, 0)))
with(dat, lines(rev(km), pgamma(km, shape = rr$par[1], scale = rr$par[2]), col = 'red'))
Take a look at the values for the CDF:
kms <- seq(0,8.5,0.5)
data.frame(dist = kms, cdf = pgamma(kms,shape=rr$par[1], scale=rr$par[2]))
# dist cdf
# 1 0.0 0.000000000
# 2 0.5 0.008634055
# 3 1.0 0.053615340
# 4 1.5 0.137291689
# 5 2.0 0.245961242
# 6 2.5 0.363956061
# 7 3.0 0.479070721
# 8 3.5 0.583659363
# 9 4.0 0.673982194
# 10 4.5 0.749075757
# 11 5.0 0.809691054
# 12 5.5 0.857478086
# 13 6.0 0.894431622
# 14 6.5 0.922551998
# 15 7.0 0.943661710
# 16 7.5 0.959325076
# 17 8.0 0.970830577
# 18 8.5 0.979207658
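So, for the example in your question (distance <= 4.5 km), the fitted cumulative probability can be read straight off this CDF:
pgamma(4.5, shape = rr$par[1], scale = rr$par[2])
# [1] 0.7490758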
And to answer your final question, get the distance at 50% of the CDF:
qgamma(0.5,shape=rr$par[1], scale=rr$par[2])
# [1] 3.095395
