I know that randomForest is supposed to be a black box, and that most people are interested in the ROC curve of the classifier as a whole, but I'm working on a problem in which I need to inspect individual trees of RF. I'm not very experienced with R so what's an easy way to plot ROC curves for the individual trees generated by RF?
I don't think you can generate a ROC curve from a single tree from a random forest generated by the randomForest package. You can access the output of each tree from a prediction, for example over the training set.
# caret for an example data set
library(caret)
library(randomForest)
data(GermanCredit)
# use only 50 rows for demonstration
nrows = 50
# extract the first 9 columns and 50 rows as training data (column 10 is "Class", the target)
x = GermanCredit[1:nrows, 1:9]
y = GermanCredit$Class[1:nrows]
# build the model
rf_model = randomForest(x = x, y = y, ntree = 11)
# Compute the prediction over the training data. Note predict.all = TRUE
rf_pred = predict(rf_model, newdata = x, predict.all = TRUE, type = "prob")
You can access the predictions of each tree with
rf_pred$individual
However, the prediction of a single tree is only the most likely label. For a ROC curve you need class probabilities, so that changing the decision threshold changes the predicted class to vary true and false positive rates.
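To see why hard labels are not enough, here is a minimal base-R sketch of how ROC points arise from sweeping a threshold over scores. The helper `roc_points` and the toy `scores` and `labels` vectors are made up for illustration; they are not part of randomForest or ROCR.

```r
# Hypothetical helper: one (FPR, TPR) pair per distinct threshold
roc_points = function(scores, labels, positive) {
  is_pos = labels == positive
  # Sweep thresholds from high to low; predict positive when score >= threshold
  thresholds = sort(unique(scores), decreasing = TRUE)
  tpr = sapply(thresholds, function(t) sum(scores >= t & is_pos) / sum(is_pos))
  fpr = sapply(thresholds, function(t) sum(scores >= t & !is_pos) / sum(!is_pos))
  data.frame(threshold = thresholds, fpr = fpr, tpr = tpr)
}

# Toy data (made up): probability-like scores give several ROC points
labels = factor(c("Good", "Good", "Bad", "Good", "Bad", "Bad"))
scores = c(0.9, 0.8, 0.7, 0.6, 0.3, 0.1)
roc_soft = roc_points(scores, labels, positive = "Good")

# Hard 0/1 predictions leave only two thresholds to sweep
hard = as.numeric(scores >= 0.5)
roc_hard = roc_points(hard, labels, positive = "Good")
```

With six distinct scores the sweep yields six points, while the hard labels collapse to two, which is why a single tree that only emits labels gives a degenerate curve.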
As far as I can tell, at least in package randomForest there is no way to make the leaves output probabilities instead of labels. If you inspect a tree with getTree(), you will see that the prediction is binary; use getTree(rf_model, k = 1, labelVar = TRUE) and you'll see the labels in plain text.
What you can do, though, is to retrieve individual predictions via predict.all = TRUE and then manually compute class labels on subsets of the whole forest. This you can then input into a function to compute ROC curves like those from the ROCR package.
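As a rough sketch of that idea, the per-tree labels can be turned into a score by taking the share of trees in a chosen subset that vote for the positive class. The `votes` matrix below is a made-up stand-in for `rf_pred$individual`, and `vote_fraction` is a hypothetical helper, not a randomForest function.

```r
# Made-up stand-in for rf_pred$individual: rows = data points, columns = trees
votes = matrix(
  c("Good", "Good", "Bad",  "Good", "Bad",
    "Bad",  "Bad",  "Bad",  "Good", "Bad",
    "Good", "Good", "Good", "Good", "Good"),
  nrow = 3, byrow = TRUE
)

# Hypothetical helper: share of the chosen trees that vote for `positive`
vote_fraction = function(votes, trees, positive = "Good") {
  rowMeans(votes[, trees, drop = FALSE] == positive)
}

score_all   = vote_fraction(votes, trees = 1:5)  # whole forest
score_first = vote_fraction(votes, trees = 1:3)  # a sub-forest of three trees
```

Scores like these, together with the true labels, are the kind of input that ROCR's `prediction()` expects. The more trees in the subset, the more distinct score values there are and the finer the resulting curve.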
Edit: OK, the link you provided in your comment gave me an idea of how a ROC curve can be obtained. First, we extract one particular tree and pass each data point through it, counting the occurrences of the success class in each terminal node as well as the total number of data points in that node. The ratio gives the node's probability for the success class. Next, we pass each data point through the tree again, but this time record the probability of the node it ends up in. This way we can compare the class probabilities with the true labels.
Here is the code:
# libraries we need
library(caret) # only needed for the GermanCredit example data
library(randomForest)
library(ROCR)
# Set fixed seed for reproducibility
set.seed(54321)
# Define function to read out output node of a tree for a given data point.
# In getTree() output, values <= split point go to the left daughter.
travelTree = function(tree, data_row) {
  node = 1
  while (tree[node, "status"] != -1) {
    split_value = data_row[, tree[node, "split var"]]
    if (split_value <= tree[node, "split point"]) {
      node = tree[node, "left daughter"]
    } else {
      node = tree[node, "right daughter"]
    }
  }
  return(node)
}
# define number of data rows
nrows = 100
ntree = 11
# load example data
data(GermanCredit)
# Easier access of variables
x = GermanCredit[1:nrows, 1:9]
y = GermanCredit$Class[1:nrows]
# Build RF model
rf_model = randomForest(x = x, y = y, ntree = ntree, nodesize = 10)
# Extract single tree and add variables we need to compute class probs
single_tree = getTree(rf_model, k = 2, labelVar = TRUE)
single_tree$"split var" = as.character(single_tree$"split var")
single_tree$sum_good = 0
single_tree$sum = 0
single_tree$pred_prob = 0
for (zeile in 1:nrow(x)) {
out_node = travelTree(single_tree, x[zeile, ])
single_tree$sum_good[out_node] = single_tree$sum_good[out_node] + (y[zeile] == "Good")
single_tree$sum[out_node] = single_tree$sum[out_node] + 1
}
# Compute class probabilities from count of "Good" data points in each node.
# Make sure we do not divide by zero
idcs = single_tree$sum != 0
single_tree$pred_prob[idcs] = single_tree$sum_good[idcs] / single_tree$sum[idcs]
# Compute prediction by inserting again data set into tree, but read out
# previously computed probs
single_tree_pred = rep(0, nrow(x))
for (zeile in 1:nrow(x)) {
out_node = travelTree(single_tree, x[zeile, ])
single_tree_pred[zeile] = single_tree$pred_prob[out_node]
}
# Et voila: The ROC curve for single tree!
plot(performance(prediction(single_tree_pred, y), "tpr", "fpr"))
I am trying to replicate a simulation from an article for a class project. I have created simulated matrix x and a simulated vector y. I want to first reduce the model size from p=1000 to p=n/log(n) using SIS. I want to then further reduce the vector space to p' by using the Primal Dual Dantzig selector.
There is no direct predict function in this package so I am plugging the p' selected predictors and their coefficients into a lm function to get predictions. I am using the predict.lm function to get the predicted values. However, the returned vector is all NA.
The following code may illustrate the problem further. I am not getting any errors after running the following code. I have to follow this procedure of using SIS and then Dantzig Selector to match the simulation in the article.
library(SIS) #import the SIS library
library(fastclime)
errors_DS_small = c() #create empty array to store errors from each simulation
model_sizes_DS_small = c() #create empty array to store selected model size from each simulation
start_time = Sys.time()
set.seed(123*1) #re-create results at a later time but different seed for each data set
n = 200 #sample size
p = 1000 #variables
x = matrix(rnorm(n*p, mean=0, sd=1), n, p) #creating IID standard Gaussian random predictors
# gaussian response
set.seed(456*1) #re-create results at a later time but different seed for each data set
s = 8
u = rbinom(s, 1, 0.4)
z = rnorm(s, mean=0, sd=1)
a = 4*log(n)/sqrt(n)
b= ((-1)**(u))*(a + abs(z))
y=x[, 1:s]%*%b
#creating SIS-DS model and gaussian response. iter=FALSE means not doing ISIS
modelSIS_small = SIS(x, y, family='gaussian', iter = FALSE, nsis=(n/log(n)))
modelDS = dantzig(x[,modelSIS_small$sis.ix0], y)
selectDS = dantzig.selector(modelDS$lambdalist, modelDS$BETA0, lambda=min(modelDS$lambdalist))
linearMod = lm(y~x[,modelSIS_small$sis.ix0])
linearMod$coefficients = selectDS
newx = x[,modelSIS_small$sis.ix0]
predTest = predict(linearMod, data=newx) #create predictions using test data test
a = modelDS$BETA0[, 9] != 0
mse = mean((y - predTest)^2) #compare predictions to real values of test y
rmse = sqrt(mse) #Square root of MSE (Square-loss function)
errors_DS_small[1] = rmse #store the RMSE
model_sizes_DS_small[1] = table(a)["TRUE"] #Store model size including intercept
print(c(1, errors_DS_small[1], model_sizes_DS_small[1])) #print results
end_time = Sys.time()
DS_total_time_small = end_time - start_time
DS_error_small_median = median(errors_DS_small)
DS_model_Size_small_median = median(model_sizes_DS_small)
print(c("DS", DS_model_Size_small_median, DS_error_small_median, DS_total_time_small))
I perform a hierarchical cluster analysis based on 'average linkage'. In base R, I use
dist_mat <- dist(cdata, method = "euclidean")
hclust_avg <- hclust(dist_mat, method = "average")
I want to calculate the gap statistic to decide the optimal number of clusters. I use the 'cluster' library and the clusGap function. Since I can neither pass the hclust solution nor specify average-linkage hierarchical clustering in the clusGap function, I use these lines:
cluster_fun <- function(x, k) list(cluster = cutree(hclust(dist(x, method = "euclidean"), method="average"), k = k))
gap_stat <- clusGap(cdata, FUN=cluster_fun, K.max=10, B=50)
print(gap_stat)
However, here I can't check the cluster solution. So, my question is - can I be sure that the gap statistic is calculated on the same solution as hclust_avg?
Is there a better way of doing this?
Yes, it should be the same. The clusGap function calls cluster_fun for each k up to K.max, then calculates the pooled within-cluster sum of squares around the cluster means, as described in the paper.
This is the bit of code called inside clusGap that calls your custom function:
W.k <- function(X, kk) {
clus <- if (kk > 1)
FUNcluster(X, kk, ...)$cluster
else rep.int(1L, nrow(X))
0.5 * sum(vapply(split(ii, clus), function(I) {
xs <- X[I, , drop = FALSE]
sum(dist(xs)^d.power/nrow(xs))
}, 0))
}
And from here, the gap statistic is calculated.
You can calculate the gap statistic using some custom code, but for the sake of reproducibility, etc, it might be easier to use this?
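One way to check the claim for the observed data is to compare the partitions directly. The sketch below uses made-up data in place of `cdata` and only base R; it says nothing about the reference data sets that clusGap simulates internally.

```r
set.seed(1)
# Made-up data standing in for cdata: two loose groups in two dimensions
cdata <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
               matrix(rnorm(40, mean = 4), ncol = 2))

# The tree built once up front, as in the question
hclust_avg <- hclust(dist(cdata, method = "euclidean"), method = "average")

# The function handed to clusGap, which rebuilds the tree on every call
cluster_fun <- function(x, k) {
  list(cluster = cutree(hclust(dist(x, method = "euclidean"), method = "average"), k = k))
}

# Compare the partition from the stored tree with the one cluster_fun builds
same_partition <- vapply(2:10, function(k) {
  identical(cutree(hclust_avg, k = k), cluster_fun(cdata, k)$cluster)
}, logical(1))
all(same_partition)
```

Because hclust is deterministic for a given distance matrix and method, rebuilding the tree inside `cluster_fun` reproduces the stored one, so the cut at each k matches.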
Thanks for solving it. I must say this is a good enough solution, but you can try the code given below as well.
# Gap Statistic for K means
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def optimalK(data, nrefs=3, maxClusters=15):
    """
    Calculates KMeans optimal K using Gap Statistic
    Params:
        data: ndarray of shape (n_samples, n_features)
        nrefs: number of sample reference datasets to create
        maxClusters: upper bound (exclusive) on the number of clusters to test
    Returns: (optimalK, resultsdf)
    """
    gaps = np.zeros((len(range(1, maxClusters)),))
    rows = []
    for gap_index, k in enumerate(range(1, maxClusters)):
        # Holder for reference dispersion results
        refDisps = np.zeros(nrefs)
        # For n references, generate random sample and perform kmeans getting resulting dispersion of each loop
        for i in range(nrefs):
            # Create new random reference set
            randomReference = np.random.random_sample(size=data.shape)
            # Fit to it
            km = KMeans(k)
            km.fit(randomReference)
            refDisp = km.inertia_
            refDisps[i] = refDisp
        # Fit cluster to original data and create dispersion
        km = KMeans(k)
        km.fit(data)
        origDisp = km.inertia_
        # Calculate gap statistic
        gap = np.log(np.mean(refDisps)) - np.log(origDisp)
        # Assign this loop's gap statistic to gaps
        gaps[gap_index] = gap
        # Collect rows in a list; DataFrame.append was removed in pandas 2.0
        rows.append({'clusterCount': k, 'gap': gap})
    resultsdf = pd.DataFrame(rows, columns=['clusterCount', 'gap'])
    return (gaps.argmax() + 1, resultsdf)

# cluster_df is the data to cluster, shape (n_samples, n_features)
score_g, df = optimalK(cluster_df, nrefs=5, maxClusters=30)
plt.plot(df['clusterCount'], df['gap'], linestyle='--', marker='o', color='b')
plt.xlabel('K')
plt.ylabel('Gap Statistic')
plt.title('Gap Statistic vs. K')
I want to analyse the type of spatial pattern shown by an animal (i.e. random, clustered, uniform), taking into consideration the underlying spatial pattern of its available habitat. The animals in question roost in trees, so a standard analysis of the animal spp will always show a clustered distribution (i.e. clustering around trees), but I want to test whether there is clustering between trees or whether they distribute randomly throughout trees. To provide a visual, I want to be able to differentiate between the following scenarios in the image:
https://imgur.com/a/iE3nAoh (image not allowed because I'm new to stack overflow, but it's available through the link)
Here is a reproducible data frame. The scenario here is of uniform habitat (25 areas of habitat) and uniform animals (16 animals per habitat):
library(spatstat)
data <- data.frame(matrix(ncol = 4, nrow = 25))
x <- c("habitat", "x", "y", "animalcount")
colnames(data) <- x
data$habitat <- 1:25
data$x <- seq(from=2, to=20, by=4)
data$y[1:5] <- 2
data$y[6:10] <- 6
data$y[11:15] <- 10
data$y[16:20] <- 14
data$y[21:25] <- 18
data$animalcount <- 16
Set up conditions for the spatial analysis:
plot.win <- owin(c(0,20), c(0,20)) # set the plot window as 20x20m
nS <- 499 # number of simulations
cd <- 5 # cluster distance
ed <- 50 # envelope distance
incr.dist <- 0.5 # increment distance for envelopes
Create the point pattern for the habitat:
habitat <- ppp(x = data$x, y = data$y, window = plot.win)
Create the point pattern for the animals. To do this, first make a new dataframe with repeated rows for the number in animal count, so that points are individual animals. Jitter x/y so that x/y coordinates are not exactly the same:
data <-data[which(data$animalcount>0),]
duplicate_rows <- function(habitat, x, y, animalcount) {
expanded <- paste0("animal-", 1:animalcount)
repeated_rows <- data.frame("habitat" = habitat, "x" = x, "y" = y, "animalcount" = expanded)
repeated_rows
}
expanded_rows <- Map(f = duplicate_rows, data$habitat, data$x, data$y, data$animalcount)
animal_data <- do.call(rbind, expanded_rows)
animal_data$xan <- jitter(animal_data$x)
animal_data$yan <- jitter(animal_data$y)
animal <- ppp(x = animal_data$xan, y = animal_data$yan, window = plot.win)
Now test Complete Spatial Randomness of animals regardless of habitat. This should come out as clustered:
an.csr <- envelope(animal, Kest, nsim = nS, savepatterns = TRUE, r = seq(0, ed, incr.dist), correction=c("Ripley"), verbose = FALSE) #CSR fit and determine the number of simulations
an.dclf <- dclf.test(an.csr, rinterval = c(0,cd), verbose = FALSE) #calculate the summary statistics of the CSR null model fit (dclf.test)
plot(an.csr, sqrt(./pi)-r~r, ylab="L(r)-r", xlab="r (meters)", xlim=c(0,ed), legend="NULL", main=paste("Animal - CSR", sep = "")) #plot 0-centered fit with the confidence bounds
clarkevans(animal)[2] #R > 1 suggests ordering, < 1 suggests clustering
clarkevans.test(animal, "Donnelly")$p
Now test Complete Spatial Randomness of animals, given the available habitat. This should come out not clustered. But simply adding habitat as a covariate clearly isn't the appropriate way to do it:
an.csr <- envelope(animal, covariates = animal_data[,2:3], Kest, nsim = nS, savepatterns = TRUE, r = seq(0, ed, incr.dist), correction=c("Ripley"), verbose = FALSE)
an.dclf <- dclf.test(an.csr, rinterval = c(0,cd), verbose = FALSE)
plot(an.csr, sqrt(./pi)-r~r, ylab="L(r)-r", xlab="r (meters)", xlim=c(0,ed), legend="NULL", main=paste("Animal - CSR", sep = ""))
clarkevans(animal)[2]
clarkevans.test(animal, "Donnelly")$p
I also tried running the test of Complete Spatial Randomness on a fitted Point Process Model, where the animal point pattern could be predicted by x&y, but this also did not change outcomes:
animalppm<-ppm(animal~x+y)
an.csr <- envelope(animalppm, Kest, nsim = nS, savepatterns = TRUE, r = seq(0, ed, incr.dist), correction=c("Ripley"), verbose = FALSE)
an.dclf <- dclf.test(an.csr, rinterval = c(0,cd), verbose = FALSE)
plot(an.csr, sqrt(./pi)-r~r, ylab="L(r)-r", xlab="r (meters)", xlim=c(0,ed), legend="NULL", main=paste("Animal - CSR", sep = ""))
clarkevans(animalppm)[2] #R > 1 suggests ordering, < 1 suggests clustering
clarkevans.test(animalppm, "Donnelly")$p
From there I would run tests of aggregation models, but the logic of adding the second point pattern should be similar.
I would appreciate any suggestions on ways to deal with this. I cannot think of an effective way to google this, and am coming up short on clever coding solutions in R. Thanks in advance!
You can model the intensity as depending on the distance to the
habitat pattern. Here is a simple example where the animals follow a Poisson
point process with intensity function which decays log-linearly with distance
to the habitat:
library(spatstat)
data <- expand.grid(x = seq(2, 18, by=4), y = seq(2, 18, by=4))
data$animalcount <- 16
plot.win <- owin(c(0,20), c(0,20)) # set the plot window as 20x20m
habitat <- ppp(x = data$x, y = data$y, window = plot.win)
d <- distmap(habitat)
plot(d)
lam <- exp(3-2*d)
plot(lam)
animal <- rpoispp(lam)
plot(animal)
fit <- ppm(animal ~ d)
fit
#> Nonstationary Poisson process
#>
#> Log intensity: ~d
#>
#> Fitted trend coefficients:
#> (Intercept) d
#> 2.952048 -1.974381
#>
#> Estimate S.E. CI95.lo CI95.hi Ztest Zval
#> (Intercept) 2.952048 0.07265533 2.809646 3.094450 *** 40.63085
#> d -1.974381 0.07055831 -2.112673 -1.836089 *** -27.98226
Taking the underlying non-homogeneous intensity into account
there is no sign of departure from the Poisson model in the
(inhomogeneous) K-function:
plot(Kinhom(animal, lambda = fit))
#> Warning: The behaviour of Kinhom when lambda is a ppm object has changed
#> (in spatstat 1.37-0 and later). See help(Kinhom)
You don't have to have simple log-linear dependence on distance. You could also make a threshold model where you have one intensity within e.g. distance 1 of the habitat and another intensity outside this distance. You can make all kinds of derived covariates from e.g. the distance for use in your model.
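The threshold idea can be sketched in base R, without spatstat, to show what such a derived covariate amounts to. For a Poisson process with two intensity levels, each level is estimated by the number of points in its zone divided by the zone's area. The animal locations, the helper `nearest_dist`, and the grid resolution below are all made up for illustration.

```r
set.seed(42)
# Habitat points on the same 5 x 5 grid used above, window 20 x 20
habitat_xy <- expand.grid(x = seq(2, 18, by = 4), y = seq(2, 18, by = 4))

# Hypothetical helper: distance from each query point to its nearest habitat point
nearest_dist <- function(px, py, hx, hy) {
  sqrt(apply(outer(px, hx, "-")^2 + outer(py, hy, "-")^2, 1, min))
}

# Made-up animal locations: 300 spread uniformly over the window
animals <- data.frame(x = runif(300, 0, 20), y = runif(300, 0, 20))

# Derived covariate: is the location within distance 1 of habitat?
near_animal <- nearest_dist(animals$x, animals$y, habitat_xy$x, habitat_xy$y) <= 1

# Approximate the area of the "near" zone on a fine grid of pixel centres
grid <- expand.grid(x = seq(0.05, 19.95, by = 0.1), y = seq(0.05, 19.95, by = 0.1))
near_grid <- nearest_dist(grid$x, grid$y, habitat_xy$x, habitat_xy$y) <= 1
area_near <- mean(near_grid) * 400
area_far  <- 400 - area_near

# Two-level intensity estimates: counts per unit area in each zone
lambda_near <- sum(near_animal) / area_near
lambda_far  <- sum(!near_animal) / area_far
```

Since these toy animals ignore the habitat, the two estimates should come out similar; animals that really favour the habitat would give a clearly larger `lambda_near`.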
If animals is the point pattern of animals, and trees is the point pattern of trees (both objects of class "ppp" in spatstat) then you could do
d <- distfun(trees)
f <- rhohat(animals, d)
plot(f)
to get an idea of how the concentration of animals depends on distance to nearest tree. You can use
berman.test(animals, d)
to perform a hypothesis test of dependence on the trees.
I am trying to carry out an analysis similar to that of McNeil & Frey in their paper 'Estimation of tail-related risk measures for heteroscedastic financial time series: an extreme value approach', but I am stuck on a problem when implementing the models.
The approach is to fit an AR(1)-GARCH(1,1) model in order to estimate the one-day-ahead forecast of the VaR using a window of 1000 observations.
I have simulated data that should work fine with my model, and I assume that if I were doing this correctly, the observed coverage rate should be close to the theoretical one. However, it is always below the theoretical coverage rate, and I don't know why.
I believe that this is how the estimated VaR is calculated:
VaR_hat = mu_hat + sigma_hat * qnorm(alpha)
but I might be wrong. I have tried to find related questions here on Stack Overflow but I have not found any.
How I approach this can be summarized in three steps.
1. Simulate 2000 AR(1)-GARCH(1,1) observations, fit a corresponding model, and extract the one-day prediction of the conditional mean and standard deviation using a window of 1000 observations (thereby making 1000 predictions).
2. Use the predicted values and the normal quantile to calculate the VaR for the wanted confidence level.
3. Check if the coverage rate is close to the theoretical one.
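Steps 2 and 3 can be checked in isolation with a base-R sketch that skips the model fitting. It assumes the forecasts `mu_hat` and `sigma_hat` are exactly right, which a fitted model never is, and all the numbers are made up. The point is only that the VaR formula gives the expected exceedance rate when each forecast is compared with the observation for the same time index.

```r
set.seed(2024)
alpha <- 0.95   # quantile level, so about 5% of observations should exceed VaR
n <- 100000

# Made-up stand-ins for the one-step forecasts: a changing mean and volatility
mu_hat    <- rnorm(n, mean = 0, sd = 0.001)
sigma_hat <- runif(n, min = 0.005, max = 0.02)

# The observation at time t is drawn from the forecast distribution for time t
x <- mu_hat + sigma_hat * rnorm(n)

# VaR forecast and exceedance indicator for the SAME time index
VaR_hat <- mu_hat + sigma_hat * qnorm(alpha)
exceed  <- x > VaR_hat

coverage <- mean(exceed)
```

Here `coverage` lands close to `1 - alpha`. If a rolling-window version drifts away from that, one thing worth checking is whether the observation being compared lies inside or just after the window the forecast was made from.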
If someone could help me I would be extremely thankful, and if I'm unclear in my formulation please just tell me and I'll try to come up with a better explanation of the problem.
The code I'm using is attached below.
Thank you in advance
library(fGarch)
nObs <- 2000 # Number of observations.
quantileLevel <- 0.95 # Since we expect 5% exceedances.
from <- seq(1,1000) # Lower index vector for observations in model.
to <- seq(1001,2000) # Upper index vector for observations in model.
VaR_vec <- rep(0,(nObs-1000)) # Empty vector for storage of 1000 VaR estimates.
# Specs for simulated data (including AR(1) component and all components for GARCH(1,1)).
spec = garchSpec(model = list(omega = 1e-6, alpha = 0.08, beta = 0.91, ar = 0.10),
cond.dist = 'norm')
# Simulate nObs = 2000 data points (after a burn-in of 1000).
data_sim <- c(garchSim(spec, n = nObs, n.start = 1000))
for (i in 1:1000){
# The rolling window of 1000 observations.
data_insert <- data_sim[from[i]:to[i]]
# Fitting an AR(1)-GARCH(1,1) model with normal cond.dist.
fitted_model <- garchFit(~ arma(1,0) + garch(1,1), data_insert,
trace = FALSE,
cond.dist = "norm")
# One day ahead forecast of conditional mean and standard deviation.
predict(fitted_model, n.ahead = 1)
prediction_model <- predict(fitted_model, n.ahead = 1)
mu_pred <- prediction_model$meanForecast
sigma_pred <- prediction_model$standardDeviation
# Calculate VaR forecast
VaR_vec[i] <- mu_pred + sigma_pred*qnorm(quantileLevel)
if (length(to)-i != 0){
print(c('Countdown, just',(length(to) - i),'iterations left'))
} else {
print(c('Done!'))
}
}
# Extract only the observations related to the forecasts.
compare_data_sim <- data_sim[1001:length(data_sim)]
hit <- rep(0,length(VaR_vec))
# Count the amount of exceedances.
for (i in 1:length(VaR_vec)){
hit[i] <- sum(VaR_vec[i] <= compare_data_sim[i])
}
plot(data_sim[1001:2000], type = 'l',
ylab = 'Simulated data', main = 'Illustration of one day ahead prediction of 95%-VaR')
lines(VaR_vec, col = 'red')
cover_prop <- sum(hit)/length(hit)
print(sprintf("Diff theoretical level and VaR coverage = %f", (1-quantileLevel) - cover_prop))
I am trying to permute (column-wise only) my data matrix 1000 times and then do hierarchical clustering in R, so that I have the final tree on my data after 1000 randomizations.
This is where I am lost. I have this loop
for(i in 1:1000)
{
permuted <- test2_matrix[,sample(ncol(test2_matrix), 12, replace=TRUE)]; # this permutes my columns
d = dist(permuted, method = "euclidean", diag = FALSE, upper = FALSE, p = 2);
clust = hclust(d, method = "complete", members=NULL);
}
png (filename="cluster_dendrogram_bootstrap.png", width=1024, height=1024, pointsize=10)
plot(clust)
I am not sure if the final tree is a product after the 1000 randomizations or just the last tree that it calculated in the loop. Also If I want to display the bootstrap values on the tree how should I go about it?
Many thanks!!
The value of clust in your example is indeed the final tree calculated in the loop. Here's a way of making and saving 1000 permutations of your matrix
make.permuted.clust <- function(i){ # this argument is not used
permuted <- test2_matrix[,sample(ncol(test2_matrix), 12, replace=TRUE)]
d <- dist(permuted, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)
clust <- hclust(d, method = "complete", members=NULL)
clust # return value
}
all.clust <- lapply(1:1000, make.permuted.clust) # 1000 hclust trees
The second part of your question should be answered here.
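As a rough illustration of what a bootstrap value on a dendrogram means, the sketch below counts how often each cluster of the tree built on the full data reappears in trees built on resampled columns. It uses only base R; the data, the helper names `clades` and `cluster_rows`, and the choice of 200 resamples are made up for illustration.

```r
set.seed(7)
# Made-up data standing in for test2_matrix: 8 rows to cluster, 12 columns
test2_matrix <- matrix(rnorm(8 * 12), nrow = 8)

# Hypothetical helper: the set of row indices under each internal node of a tree
clades <- function(hc) {
  n_merge <- nrow(hc$merge)
  members <- vector("list", n_merge)
  for (i in seq_len(n_merge)) {
    # Negative entries are single observations, positive ones earlier merges
    parts <- lapply(hc$merge[i, ], function(j) if (j < 0) -j else members[[j]])
    members[[i]] <- sort(unlist(parts))
  }
  vapply(members, paste, character(1), collapse = ",")
}

cluster_rows <- function(m) hclust(dist(m), method = "complete")

# Clusters of the tree built on the full data
reference <- clades(cluster_rows(test2_matrix))

# Resample columns with replacement and record which reference clusters reappear
n_boot <- 200
found <- vapply(seq_len(n_boot), function(b) {
  resampled <- test2_matrix[, sample(ncol(test2_matrix), replace = TRUE)]
  reference %in% clades(cluster_rows(resampled))
}, logical(length(reference)))

# Support per internal node, in the row order of the reference tree's merge matrix
support <- rowMeans(found)
```

The last entry of `support` belongs to the root, which contains every row and is therefore always recovered. The other entries are the fractions one would write next to the corresponding branches.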
You may be interested in the RandomForest method implemented in the randomForest package, which implements both bootstrapping of the data and of the splitting variables and allows you to save trees and get a consensus tree.
library(randomForest)
The original random forest (in FORTRAN 77) developers site
The package manual