I am building a tree using the partykit R package, and I am wondering if there is a simple, efficient way to determine the depth number at each internal node. For example, the root node would have depth 0, the first two kid nodes have depth 1, the next kid nodes have depth 2, and so forth. This will eventually be used to calculate the minimal depth of a variable. Below is a very basic example (taken from vignette("constparty", package="partykit")):
library("partykit")
library("rpart")
data("Titanic", package = "datasets")
ttnc<-as.data.frame(Titanic)
ttnc <- ttnc[rep(1:nrow(ttnc), ttnc$Freq), 1:4]
names(ttnc)[2] <- "Gender"
rp <- rpart(Survived ~ ., data = ttnc)
ttncTree<-as.party(rp)
plot(ttncTree)
#This is one of my many attempts which does NOT work
internalNodes<-nodeids(ttncTree)[-nodeids(ttncTree, terminal = TRUE)]
depth(ttncTree)-unlist(nodeapply(ttncTree, ids=internalNodes, FUN=function(n){depth(n)}))
In this example, I want to output something similar to:
nodeid = 1 2 4 7
depth = 0 1 2 1
I apologize if my question is too specific.
Here's a possible solution which should be efficient enough as usually the trees have no more than several dozens of nodes.
I'm ignoring node #1, as it is always 0 an hence no point neither calculating it or showing it (IMO)
Inters <- nodeids(ttncTree)[-nodeids(ttncTree, terminal = TRUE)][-1]
table(unlist(sapply(Inters, function(x) intersect(Inters, nodeids(ttncTree, from = x)))))
# 2 4 7
# 1 2 1
I had to revisit this problem recently. Below is a function to determine the depth of each node. I count the depth based on the number of times a vertical line | appears running the print.party() function.
library(stringr)
idDepth <- function(tree) {
outTree <- capture.output(tree)
idCount <- 1
depthValues <- rep(NA, length(tree))
names(depthValues) <- 1:length(tree)
for (index in seq_along(outTree)){
if (grepl("\\[[0-9]+\\]", outTree[index])) {
depthValues[idCount] <- str_count(outTree[index], "\\|")
idCount = idCount + 1
}
}
return(depthValues)
}
> idDepth(ttncTree)
1 2 3 4 5 6 7 8 9
0 1 2 2 3 3 1 2 2
There definitely seems to be a simpler, faster solution, but this is faster than using the intersect() function. Below is an example of the computation time for a large tree (around 1,500 nodes)
# Compare computation time for large tree #
library(mlbench)
set.seed(470174)
dat <- data.frame(mlbench.friedman1(5000))
rp <- rpart(as.formula(paste0("y ~ ", paste(paste0("x.", 1:10), collapse=" + "))),
data=dat, control = rpart.control(cp = -1, minsplit=3, maxdepth = 10))
partyTree <- as.party(rp)
> length(partyTree) #Number of splits
[1] 1503
>
> # Intersect() computation time
> Inters <- nodeids(partyTree)[-nodeids(partyTree, terminal = TRUE)][-1]
> system.time(table(unlist(sapply(Inters, function(x) intersect(Inters, nodeids(partyTree, from = x))))))
user system elapsed
22.38 0.00 22.44
>
> # Proposed computation time
> system.time(idDepth(partyTree))
user system elapsed
2.38 0.00 2.38
Related
With an igraph object I would like to capture some features of each node's neighbours, for example the average degree of its neighbours.
I come up with this code, which is inelegant and quite slow.
How should I rethink it for large and complex networks?
library(igraph)
# Toy example
set.seed(123)
g <- erdos.renyi.game(10,0.2)
# Loop to calculate average degree of each node's neighbourhood
s <- character(0)
for(i in 1:gorder(g)){
n <- ego_size(g, nodes = i, order = 1, mindist = 1)
node_of_interest <- unique(unlist(ego(g, nodes = i, order = 1, mindist = 1)))
m <- mean(degree(g, v = node_of_interest, loops = TRUE, normalized = FALSE)-1)
s <- rbind(s,data.frame(node = i, neighbours = n, mean = m))
}
Expanding the data structure with rbind in a loop can get quite slow in R, because at every step it needs to allocate the space for the new object, and then copy it (see section 24.6 here). Also, you might be computing the degree of a node many times, if it s the neighbor of multiple nodes.
A possibly better alternative could be:
# add vertex id (not really necessary)
V(g)$name <- V(g)
# add degree to the graph
V(g)$degree <- degree(g, loops = TRUE, normalized = FALSE)
# get a list of neighbours, for each node
g_ngh <- neighborhood(g, mindist = 1)
# write a function that gets the means
get.mean <- function(x){
mean(V(g)$degree[x]-1)
}
# apply the function, add result to the graph
V(g)$av_degr_nei <- sapply(g_ngh, get.mean)
# get data into dataframe, if necessary
d_vert_attr <- as_data_frame(g, what = "vertices")
d_vert_attr
name degree av_degr_nei
1 1 0 NaN
2 2 1 2.0000000
3 3 2 1.0000000
4 4 1 1.0000000
5 5 2 1.0000000
6 6 1 1.0000000
7 7 3 0.6666667
8 8 1 0.0000000
9 9 1 0.0000000
10 10 0 NaN
I'm new to R and programming in general, and I'm struggling with a for-loop for building the lx function in a life table.
I have the age function x, the death function qx (the probability that someone aged exactly x will die before reaching age x+1), and the surviving function px = 1 - qx.
I want to write a function that returns a vector with all the lx values from first to last age in my table. The function is simple...
I've defined cohort = 1000000. The first age in my table is x = 5, so, considering x = 5...
l_(x) = cohort
And, from now on, l_(x+n) = l_(x+n-1)*p_(x+n-1)
I've searched about for-loops, and I can only get my code working for lx[1] and lx[2], and I get nothing for lx[n] if n > 2.
I wrote that function:
living_x <- function(px, cohort){
result <- vector("double", length(px))
l_x <- vector("double", length(px))
for (i in 1:length(px)){
if (i == 1){
l_x[i] = cohort
}
else l_x[i] = l_x[i-1]*px[i-1]
result[i] = l_x
print(result)
}
}
When I run it, I get several outputs (more than length(px)) and "There were 50 or more warnings (use warnings() to see the first 50)".
When I run warnings(), I get "In result[i] <- l_x : number of items to replace is not a multiple of replacement length" for every number.
Also, everything I try besides it give me different errors or only calculate lx for lx[1] and lx[2]. I know there's something really wrong with my code, but I still couldn't identify it. I'd be glad if someone could give me a hint to find out what to change.
Thank you!
Here's an approach using dplyr from the tidyverse packages, to use px to calculate lx. This can be done similarly in "Base R" using excerpt$lx = 100000 * cumprod(1 - lag(excerpt$qx)).
lx is provided in the babynames package, so we can check our work:
library(tidyverse)
library(babynames)
# Get excerpt with age, qx, and lx.
excerpt <- lifetables %>%
filter(year == 2010, sex == "F") %>%
select(x, qx_given = qx, lx_given = lx)
excerpt
# A tibble: 120 x 3
x qx_given lx_given
<dbl> <dbl> <dbl>
1 0 0.00495 100000
2 1 0.00035 99505
3 2 0.00022 99471
4 3 0.00016 99449
5 4 0.00012 99433
6 5 0.00011 99421
7 6 0.00011 99410
8 7 0.0001 99399
9 8 0.0001 99389
10 9 0.00009 99379
# ... with 110 more rows
Using that data to estimate lx_calc:
est_lx <- excerpt %>%
mutate(px = 1 - qx_given,
cuml_px = cumprod(lag(px, default = 1)),
lx_calc = cuml_px * 100000)
And finally, comparing visually the given lx with the one calculated based on px. They match exactly.
est_lx %>%
gather(version, val, c(lx_given, lx_calc)) %>%
ggplot(aes(x, val, color = version)) + geom_line()
I could do it in a very simple way after thinking for some minutes more.
lx = c()
for (i in 2:length(px)){
lx[1] = 10**6
lx[i] = lx[i-1]*px[i-1]
}
When using the matchit-function for full matching, the results differ by the order of the input dataframe. That is, if the order of the data is changed, results change, too. This is surprising, because in my understanding, the optimal full algorithm should yield only one single best solution.
Am I missing something or is this an error?
Similar differences occur with the optimal algorithm.
Below you find a reproducible example. Subclasses should be identical for the two data sets, which they are not.
Thank you for your help!
# create data
nr <- c(1:100)
x1 <- rnorm(100, mean=50, sd=20)
x2 <- c(rep("a", 20),rep("b", 60), rep("c", 20))
x3 <- rnorm(100, mean=230, sd=2)
outcome <- rnorm(100, mean=500, sd=20)
group <- c(rep(0, 50),rep(1, 50))
df <- data.frame(x1=x1, x2=x2, outcome=outcome, group=group, row.names=nr, nr=nr)
df_neworder <- df[order(outcome),] # re-order data.frame
# perform matching
model_oldorder <- matchit(group~x1, data=df, method="full", distance ="logit")
model_neworder <- matchit(group~x1, data=df_neworder, method="full", distance ="logit")
# store matching results
matcheddata_oldorder <- match.data(model_oldorder, distance="pscore")
matcheddata_neworder <- match.data(model_neworder, distance="pscore")
# Results based on original data.frame
head(matcheddata_oldorder[order(nr),], 10)
x1 x2 outcome group nr pscore weights subclass
1 69.773776 a 489.1769 0 1 0.5409943 1.0 27
2 63.949637 a 529.2733 0 2 0.5283582 1.0 32
3 52.217666 a 526.7928 0 3 0.5028106 0.5 17
4 48.936397 a 492.9255 0 4 0.4956569 1.0 9
5 36.501507 a 512.9301 0 5 0.4685876 1.0 16
# Results based on re-ordered data.frame
head(matcheddata_neworder[order(matcheddata_neworder$nr),], 10)
x1 x2 outcome group nr pscore weights subclass
1 69.773776 a 489.1769 0 1 0.5409943 1.0 25
2 63.949637 a 529.2733 0 2 0.5283582 1.0 31
3 52.217666 a 526.7928 0 3 0.5028106 0.5 15
4 48.936397 a 492.9255 0 4 0.4956569 1.0 7
5 36.501507 a 512.9301 0 5 0.4685876 2.0 14
Apparently, the assignment of objects to subclasses differs. In my understanding, this should not be the case.
The developers of the optmatch package (which the matchit function calls) provided useful help:
I think what we're seeing here is the result of the tolerance argument
that fullmatch has. The matching algorithm requires integer distances,
so we have to scale then truncate floating point distances. For a
given set of integer distances, there may be multiple matchings that
achieve the minimum, so the solver is free to pick among these
non-unique solutions.
Developing your example a little more:
> library(optmatch)
> nr <- c(1:100) x1 <- rnorm(100, mean=50, sd=20)
> outcome <- rnorm(100, mean=500, sd=20) group <- c(rep(0, 50),rep(1, 50))
> df_oldorder <- data.frame(x1=x1, outcome=outcome, group=group, row.names=nr, nr=nr) > df_neworder <- df_oldorder[order(outcome),] # > re-order data.frame
> glm_oldorder <- match_on(glm(group~x1, > data=df_oldorder), data = df_oldorder)
> glm_neworder <- > match_on(glm(group~x1, data=df_neworder), data = df_neworder)
> fm_old <- fullmatch(glm_oldorder, data=df_oldorder)
> fm_new <- fullmatch(glm_neworder, data=df_neworder)
> mean(sapply(matched.distances(fm_old, glm_oldorder), mean))
> ## 0.06216174
> mean(sapply(matched.distances(fm_new, glm_neworder), mean))
> ## 0.062058 mean(sapply(matched.distances(fm_old, glm_oldorder), mean)) -
> mean(sapply(matched.distances(fm_new, glm_neworder), mean))
> ## 0.00010373
which we can see is smaller than the default tolerance of 0.001. You can always decrease the tolerance level, which may
require increased run time, in order to get closer to the true
floating put minimum. We found 0.001 seemed to work well in practice,
but there is nothing special about this value.
I have the following dataset (obtained here):
----------item survivalpoints weight
1 pocketknife 10 1
2 beans 20 5
3 potatoes 15 10
4 unions 2 1
5 sleeping bag 30 7
6 rope 10 5
7 compass 30 1
I can cluster this dataset into three clusters with kmeans() using a binary string as my initial choice of centers. For eg:
## 1 represents the initial centers
chromosome = c(1,1,1,0,0,0,0)
## exclude first column (kmeans only support continous data)
cl <- kmeans(dataset[, -1], dataset[chromosome == 1, -1])
## check the memberships
cl$clusters
# [1] 1 3 3 1 2 1 2
Using this fundamental concept, I tried it out with GA package to conduct the search where I am trying to optimize(minimize) Davies-Bouldin (DB) Index.
library(GA) ## for ga() function
library(clusterSim) ## for index.DB() function
## defining my fitness function (Davies-Bouldin)
DBI <- function(x) {
## converting matrix to vector to access each row
binary_rep <- split(x, row(x))
## evaluate the fitness of each chromsome
for(each in 1:nrow(x){
cl <- kmeans(dataset, dataset[binary_rep[[each]] == 1, -1])
dbi <- index.DB(dataset, cl$cluster, centrotypes = "centroids")
## minimizing db
return(-dbi)
}
}
g<- ga(type = "binary", fitness = DBI, popSize = 100, nBits = nrow(dataset))
Of course (I have no idea what's happening), I received error message of
Warning messages:
Error in row(x) : a matrix-like object is required as argument to 'row'
Here are my questions:
How can correctly use the GA package to solve my problem?
How can I make sure the randomly generated chromosomes contains the same number of 1s which corresponds to k number of clusters (eg. if k=3 then the chromosome must contain exactly three 1s)?
I can't comment on the sense of combining k-means with ga, but I can point out that you had issue in your fitness function. Also, errors are produced when all genes are on or off, so fitness is only calculated when that is not the case:
DBI <- function(x) {
if(sum(x)==nrow(dataset) | sum(x)==0){
score <- 0
} else {
cl <- kmeans(dataset[, -1], dataset[x==1, -1])
dbi <- index.DB(dataset[,-1], cl=cl$cluster, centrotypes = "centroids")
score <- dbi$DB
}
return(score)
}
g <- ga(type = "binary", fitness = DBI, popSize = 100, nBits = nrow(dataset))
plot(g)
g#solution
g#fitnessValue
Looks like several gene combinations produced the same "best" fitness value
I have some data showing a long list of regions, the population of each region and the number of people in each region with a certain disease. I'm trying to show the confidence intervals for each proportion (but I'm not testing whether the proportions are statistically different).
One approach is to manually calculate the standard errors and confidence intervals but I'd like to use a built-in tool like prop.test, because it has some useful options. However, when I use prop.test with vectors, it runs a chi-square test across all the proportions.
I've solved this with a while loop (see dummy data below), but I sense there must be a better and simpler way to approach this problem. Would apply work here, and how? Thanks!
dat <- data.frame(1:5, c(10, 50, 20, 30, 35))
names(dat) <- c("X", "N")
dat$Prop <- dat$X / dat$N
ConfLower = 0
x = 1
while (x < 6) {
a <- prop.test(dat$X[x], dat$N[x])$conf.int[1]
ConfLower <- c(ConfLower, a)
x <- x + 1
}
ConfUpper = 0
x = 1
while (x < 6) {
a <- prop.test(dat$X[x], dat$N[x])$conf.int[2]
ConfUpper <- c(ConfUpper, a)
x <- x + 1
}
dat$ConfLower <- ConfLower[2:6]
dat$ConfUpper <- ConfUpper[2:6]
Here's an attempt using Map, essentially stolen from a previous answer here:
https://stackoverflow.com/a/15059327/496803
res <- Map(prop.test,dat$X,dat$N)
dat[c("lower","upper")] <- t(sapply(res,"[[","conf.int"))
# X N Prop lower upper
#1 1 10 0.1000000 0.005242302 0.4588460
#2 2 50 0.0400000 0.006958623 0.1485882
#3 3 20 0.1500000 0.039566272 0.3886251
#4 4 30 0.1333333 0.043597084 0.3164238
#5 5 35 0.1428571 0.053814457 0.3104216