R h2o: examining the nodes on a deeplearning neuralnet model

I would like to be able to examine the node structure of my neural networks. Specifically, I use L1 and L2 regularisation and would like to know what proportion of my weights have atrophied to zero. Does my trained neural net use every single node, or could I get away with far fewer hidden nodes? That sort of thing.
Here's a toy problem:
library(h2o)
h2o.init()
x <- seq(-10, 10, by = 0.0002)
y <- dnorm(x,sd=2)*sin(2*x) # The function the neuralnet will attempt to fit
plot(x,y,type="l")
df <- data.frame(x=x,y=y)
df2 <- as.h2o(df)
model <- h2o.deeplearning(df2,
                          x = 1,
                          y = 2,
                          hidden = c(200, 100, 50, 10, 5), # way more hidden nodes than it needs
                          l1 = 0.0000001, # regularisation to reduce the number of unnecessary nodes
                          l2 = 0.0000001,
                          activation = "Tanh",
                          export_weights_and_biases = TRUE)
P <- as.data.frame(h2o.predict(model,df2))
lines(x,P$predict,col=2)
legend("topleft",legend=c("Training data","nn output"),col=c(1,2),lwd=1)
Is there a function within h2o that will give me the information on what all the weights are?
(I've tried h2o.weights(), but it only appears to give me the first layer.)
Failing that, given that the model is an S4 object, what are the ways of inspecting an S4 object?
Bonus question: Is there any ability within h2o for visualising the node structure?

The h2o.weights() function returns the first layer weights by default, as an H2OFrame. You can get an arbitrary layer by changing the matrix_id argument.
Some examples:
> h2o.weights(model)
x
1 0.3520632
2 0.5535296
3 -0.4793063
4 0.3828013
5 -0.3253765
6 0.7234487
[200 rows x 1 column]
> h2o.weights(model, matrix_id = 5)
C1 C2 C3 C4 C5 C6 C7 C8 C9 C10
1 0.7233770 0.7793967 -1.4486042 -0.8187707 0.8667952 1.0290394 0.26213858 0.02487412 0.3342296 0.39383927
2 0.4528885 0.2434976 0.5963052 0.9640941 -0.4480562 -0.1976745 -0.63002998 0.17428128 -0.9241131 0.13199258
3 -0.5477357 0.4918589 -0.7991062 -0.6445754 0.3618000 0.1324274 0.60856968 -0.35876906 -0.0655041 0.21673690
4 -0.3147219 0.2541574 -0.5886489 -0.9993418 0.3042635 0.4107490 -0.08639368 -1.11863077 0.8755589 -0.06117416
5 -0.7028044 0.4625969 -0.3838535 -0.6484048 -0.6975272 0.2663548 -0.17953268 0.14127040 -0.6482394 -0.04426440
> hidden <- c(200,100,50,10,5)
> for (i in 1:(length(hidden) + 1)) {
+ print(dim(h2o.weights(model, matrix_id = i)))
+ }
[1] 200 1
[1] 100 200
[1] 50 100
[1] 10 50
[1] 5 10
[1] 1 5
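Coming back to the original questions about atrophied weights and S4 inspection, here is a rough sketch (hedged: the 1e-6 cutoff for "zero" is an arbitrary illustration rather than an h2o convention, and slotNames()/str() are just the generic S4 inspection tools, nothing h2o-specific):
slotNames(model)   # list the S4 slots of the model object; str(model, max.level = 2) drills further
hidden <- c(200, 100, 50, 10, 5)
for (i in 1:(length(hidden) + 1)) {
  W <- as.matrix(as.data.frame(h2o.weights(model, matrix_id = i)))
  cat(sprintf("layer %d: %.1f%% of weights are (near) zero\n",
              i, 100 * mean(abs(W) < 1e-6)))
}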


Why do results of matching depend on order of data (MatchIt package)?

When using the matchit function for full matching, the results differ depending on the order of the input data frame. That is, if the order of the data is changed, the results change too. This is surprising because, in my understanding, optimal full matching should yield a single best solution.
Am I missing something or is this an error?
Similar differences occur with the optimal algorithm.
Below you find a reproducible example. Subclasses should be identical for the two data sets, which they are not.
Thank you for your help!
# create data
nr <- c(1:100)
x1 <- rnorm(100, mean=50, sd=20)
x2 <- c(rep("a", 20),rep("b", 60), rep("c", 20))
x3 <- rnorm(100, mean=230, sd=2)
outcome <- rnorm(100, mean=500, sd=20)
group <- c(rep(0, 50),rep(1, 50))
df <- data.frame(x1=x1, x2=x2, outcome=outcome, group=group, row.names=nr, nr=nr)
df_neworder <- df[order(outcome),] # re-order data.frame
# perform matching
model_oldorder <- matchit(group~x1, data=df, method="full", distance ="logit")
model_neworder <- matchit(group~x1, data=df_neworder, method="full", distance ="logit")
# store matching results
matcheddata_oldorder <- match.data(model_oldorder, distance="pscore")
matcheddata_neworder <- match.data(model_neworder, distance="pscore")
# Results based on original data.frame
head(matcheddata_oldorder[order(nr),], 10)
x1 x2 outcome group nr pscore weights subclass
1 69.773776 a 489.1769 0 1 0.5409943 1.0 27
2 63.949637 a 529.2733 0 2 0.5283582 1.0 32
3 52.217666 a 526.7928 0 3 0.5028106 0.5 17
4 48.936397 a 492.9255 0 4 0.4956569 1.0 9
5 36.501507 a 512.9301 0 5 0.4685876 1.0 16
# Results based on re-ordered data.frame
head(matcheddata_neworder[order(matcheddata_neworder$nr),], 10)
x1 x2 outcome group nr pscore weights subclass
1 69.773776 a 489.1769 0 1 0.5409943 1.0 25
2 63.949637 a 529.2733 0 2 0.5283582 1.0 31
3 52.217666 a 526.7928 0 3 0.5028106 0.5 15
4 48.936397 a 492.9255 0 4 0.4956569 1.0 7
5 36.501507 a 512.9301 0 5 0.4685876 2.0 14
Apparently, the assignment of objects to subclasses differs. In my understanding, this should not be the case.
The developers of the optmatch package (which the matchit function calls) provided useful help:
I think what we're seeing here is the result of the tolerance argument that fullmatch has. The matching algorithm requires integer distances, so we have to scale then truncate floating point distances. For a given set of integer distances, there may be multiple matchings that achieve the minimum, so the solver is free to pick among these non-unique solutions.
Developing your example a little more:
library(optmatch)
nr <- c(1:100)
x1 <- rnorm(100, mean=50, sd=20)
outcome <- rnorm(100, mean=500, sd=20)
group <- c(rep(0, 50), rep(1, 50))
df_oldorder <- data.frame(x1=x1, outcome=outcome, group=group, row.names=nr, nr=nr)
df_neworder <- df_oldorder[order(outcome),] # re-order data.frame
glm_oldorder <- match_on(glm(group ~ x1, data=df_oldorder), data = df_oldorder)
glm_neworder <- match_on(glm(group ~ x1, data=df_neworder), data = df_neworder)
fm_old <- fullmatch(glm_oldorder, data=df_oldorder)
fm_new <- fullmatch(glm_neworder, data=df_neworder)
mean(sapply(matched.distances(fm_old, glm_oldorder), mean))
## 0.06216174
mean(sapply(matched.distances(fm_new, glm_neworder), mean))
## 0.062058
mean(sapply(matched.distances(fm_old, glm_oldorder), mean)) -
  mean(sapply(matched.distances(fm_new, glm_neworder), mean))
## 0.00010373
which we can see is smaller than the default tolerance of 0.001. You can always decrease the tolerance level, which may require increased run time, in order to get closer to the true floating point minimum. We found 0.001 seemed to work well in practice, but there is nothing special about this value.
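If you want to act on that advice from within MatchIt, here is a hedged sketch (it assumes, as I read the MatchIt documentation, that extra arguments to matchit() with method = "full" are passed through to optmatch::fullmatch(), which accepts a tol argument):
# Tighter tolerance: the solver works closer to the floating-point optimum,
# at the cost of longer run time.
model_tight <- matchit(group ~ x1, data = df, method = "full",
                       distance = "logit", tol = 1e-6)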

How are proportions of clades and proportions of bipartitions calculated in `ape`?

Consider the following dataset:
fictional.df <- data.frame(L1 = c(0,0,0,0,0,0,0,0),
                           L2 = c(0,1,0,0,0,1,1,0),
                           L3 = c(1,1,0,1,1,1,1,1),
                           L4 = c(0,0,1,1,0,0,0,0))
I converted this to a phyDat object and then created a pairwise distance matrix as follows:
fictional.phydat <- as.phyDat(fictional.df,
                              type = "USER", levels = c("1","0"),
                              names = names(fictional.df))
fictional.hamming <- dist.hamming(fictional.phydat)
From this distance matrix, I then estimated a UPGMA tree:
fictional.upgma <- upgma(fictional.hamming)
I then created bootstrap datasets:
set.seed(187)
fictional.upgma.bs <- bootstrap.phyDat(fictional.phydat, FUN =
function(xx) upgma(dist.hamming(xx)), bs=100)
I then calculated the proportion of partitions in the bootstrap set:
upgma.bs.part <- prop.part(fictional.upgma.bs)
So far so good. Here is where I would appreciate some help. When I call the function prop.clades, I do not understand the result:
prop.clades(fictional.upgma,fictional.upgma.bs)
[1] 100 NA 71
Why does this function return NA when there is evidence for that clade in the set of bootstrap trees?
A second question:
prop.clades(fictional.upgma,part=upgma.bs.part)
[1] 100 49 112
If there are only 100 bootstrap samples, why is the value for the final clade 112?
Your tree fictional.upgma is rooted, and by default prop.clades returns how often each bipartition occurs. In a rooted tree the two edges leading to the root both refer to the same bipartition or split:
prop.clades(unroot(fictional.upgma), fictional.upgma.bs)
[1] 100 71
For rooted trees you sometimes want to count the number of identical clades instead:
prop.clades(fictional.upgma, fictional.upgma.bs, rooted=TRUE)
[1] 100 49 71
This seems to be a bug, and you had best report it to Emmanuel Paradis:
prop.clades(fictional.upgma,part=upgma.bs.part)
[1] 100 49 112
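As a small follow-up (standard ape plotting, not part of the original answer), you can attach the clade counts to the plotted tree so it is obvious which internal node each value belongs to:
# Label the internal nodes of the rooted UPGMA tree with their bootstrap counts
bs <- prop.clades(fictional.upgma, fictional.upgma.bs, rooted = TRUE)
plot(fictional.upgma)
nodelabels(bs)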

Depth of a node in partykit

I am building a tree using the partykit R package, and I am wondering if there is a simple, efficient way to determine the depth number at each internal node. For example, the root node would have depth 0, the first two kid nodes have depth 1, the next kid nodes have depth 2, and so forth. This will eventually be used to calculate the minimal depth of a variable. Below is a very basic example (taken from vignette("constparty", package="partykit")):
library("partykit")
library("rpart")
data("Titanic", package = "datasets")
ttnc<-as.data.frame(Titanic)
ttnc <- ttnc[rep(1:nrow(ttnc), ttnc$Freq), 1:4]
names(ttnc)[2] <- "Gender"
rp <- rpart(Survived ~ ., data = ttnc)
ttncTree<-as.party(rp)
plot(ttncTree)
#This is one of my many attempts which does NOT work
internalNodes<-nodeids(ttncTree)[-nodeids(ttncTree, terminal = TRUE)]
depth(ttncTree)-unlist(nodeapply(ttncTree, ids=internalNodes, FUN=function(n){depth(n)}))
In this example, I want to output something similar to:
nodeid = 1 2 4 7
depth = 0 1 2 1
I apologize if my question is too specific.
Here's a possible solution, which should be efficient enough since such trees usually have no more than a few dozen nodes.
I'm ignoring node #1, as its depth is always 0 and hence there is no point in calculating or showing it (IMO).
Inters <- nodeids(ttncTree)[-nodeids(ttncTree, terminal = TRUE)][-1]
table(unlist(sapply(Inters, function(x) intersect(Inters, nodeids(ttncTree, from = x)))))
# 2 4 7
# 1 2 1
I had to revisit this problem recently. Below is a function to determine the depth of each node. I count the depth based on the number of times a vertical line | appears when running the print.party() function.
library(stringr)
idDepth <- function(tree) {
  outTree <- capture.output(tree)
  idCount <- 1
  depthValues <- rep(NA, length(tree))
  names(depthValues) <- 1:length(tree)
  for (index in seq_along(outTree)) {
    if (grepl("\\[[0-9]+\\]", outTree[index])) {
      depthValues[idCount] <- str_count(outTree[index], "\\|")
      idCount <- idCount + 1
    }
  }
  return(depthValues)
}
> idDepth(ttncTree)
1 2 3 4 5 6 7 8 9
0 1 2 2 3 3 1 2 2
There may well be a simpler, faster solution, but this is already faster than the intersect() approach. Below is an example of the computation time for a large tree (around 1,500 nodes).
# Compare computation time for large tree #
library(mlbench)
set.seed(470174)
dat <- data.frame(mlbench.friedman1(5000))
rp <- rpart(as.formula(paste0("y ~ ", paste(paste0("x.", 1:10), collapse=" + "))),
data=dat, control = rpart.control(cp = -1, minsplit=3, maxdepth = 10))
partyTree <- as.party(rp)
> length(partyTree) #Number of splits
[1] 1503
>
> # Intersect() computation time
> Inters <- nodeids(partyTree)[-nodeids(partyTree, terminal = TRUE)][-1]
> system.time(table(unlist(sapply(Inters, function(x) intersect(Inters, nodeids(partyTree, from = x))))))
user system elapsed
22.38 0.00 22.44
>
> # Proposed computation time
> system.time(idDepth(partyTree))
user system elapsed
2.38 0.00 2.38
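For completeness, here is another sketch that avoids parsing printed output altogether by walking the node structure recursively. It relies on the partykit accessors node_party(), kids_node() and id_node(); treat it as an untested alternative rather than part of the timed comparison above.
nodeDepths <- function(tree) {
  depths <- integer(0)
  recurse <- function(node, d) {
    # record the depth of this node under its id, then descend into the kids
    depths[as.character(id_node(node))] <<- d
    for (kid in kids_node(node)) recurse(kid, d + 1L)
  }
  recurse(node_party(tree), 0L)
  depths[order(as.integer(names(depths)))]
}
nodeDepths(ttncTree)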

Creating an R function to use mclapply from the multicore package

I need to analyze some simulated data with the following structure:
h c x1 y1 x1c10
1 0 37.607056431 104.83097593 5
1 1 27.615251557 140.85532974 10
1 0 34.68915314 114.59312842 2
1 1 30.090387454 131.60485642 9
1 1 39.274429397 106.76042522 10
1 0 33.839385007 122.73681319 2
...
where h ranges from 1 to 2500 and indexes the Monte Carlo sample, each sample with 1000 observations. I'm analysing these data with the following code, which gives me two objects (fdN1, fdQB101):
mc <- 2500 ## number of Monte Carlo samples
fdN1 <- matrix(0, mc, 1000)
fdQB101 <- matrix(0, mc, 1000) ## create 2500x1000 storage matrices, elements zero
for (j in 1:mc) {
  fdN1[j,] <- dnorm(residuals(lm(x1 ~ c, data = s[s$h == j,])),
                    mean(residuals(lm(x1 ~ c, data = s[s$h == j,]))),
                    sd(residuals(lm(x1 ~ c, data = s[s$h == j,]))))
  x1c10 <- as.matrix(subset(s, s$h == j, select = x1c10))
  fdQB100 <- as.matrix(predict(polr(as.factor(x1c10) ~ c,
                                    method = "logistic", data = s[s$h == j,]),
                               type = "probs"))
  indx10 <- as.matrix(cbind(as.vector(seq(1:nrow(fdQB100))), x1c10))
  fdQB101[j,] <- fdQB100[indx10]
}
The objects fdN1 and fdQB101 are 2500x1000 matrices with predicted probabilities as elements. I need to create a function out of this loop that I can call with lapply() or mclapply(). When I wrap this in the following function command:
ndMC <- function(mc) {
  for (j in 1:mc) {
    ...
  }
  return(list(fdN1, fdQB101))
}
lapply(mc, ndMC)
the objects fdN1 and fdQB101 are each returned as 2500x1000 matrices of zeros, instead of the predicted probabilities. What am I doing wrong?
You should be able to do this with the data.table package. Here is an example:
library(data.table)
dt <- data.table(h = rep(1L, 6), c = c(0L, 1L, 0L, 1L, 1L, 0L),
                 X1 = c(37.607056431, 27.615251557, 34.68915314, 30.090387454, 39.274429397, 33.839385007),
                 y1 = c(104.83097593, 140.85532974, 114.59312842, 131.60485642, 106.76042522, 122.73681319),
                 x1c10 = c(5L, 10L, 2L, 9L, 10L, 2L))
## Create a linear model for every grouping of variable h:
fdN1.partial<-dt[,list(lm=list(lm(X1~c))),by="h"]
## Retrieve the linear model for h==1:
fdN1.partial[h==1,lm]
## [[1]]
##
## Call:
## lm(formula = X1 ~ c)
##
## Coefficients:
## (Intercept) c
## 35.379 -3.052
You could also write a function to generalize this solution:
f.dnorm <- function(y, x) {
  f <- lm(y ~ x)
  out <- list(dnorm(residuals(f), mean(residuals(f)), sd(residuals(f))))
  return(out)
}
## Generate two dnorm lists for every grouping of variable h:
dt.lm<-dt[,list(dnormX11=list(f.dnorm(X1,rep(1,length(X1)))), dnormX1c=list(f.dnorm(X1,c))),by="h"]
## Retrieve one of the dnorm lists for h==1:
unlist(dt.lm[h==1,dnormX11])
## 1 2 3 4 5 6
## 0.06296194 0.03327407 0.08884549 0.06286739 0.04248756 0.09045784
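Coming back to the original mclapply question: the loop fills fdN1 and fdQB101 by side effect, which is lost once the code is wrapped in a function, and lapply(mc, ndMC) only calls that function once (with mc = 2500) rather than once per sample. A hedged sketch of the usual pattern is to have each call process a single Monte Carlo sample and return its row, then bind the results; only the fdN1 part is shown here, using parallel::mclapply (the successor of the multicore package):
library(parallel)  # mclapply now lives here
## one Monte Carlo sample: density values of the residuals of x1 ~ c
one_sample <- function(j, s) {
  r <- residuals(lm(x1 ~ c, data = s[s$h == j, ]))
  dnorm(r, mean(r), sd(r))
}
res  <- mclapply(1:mc, one_sample, s = s, mc.cores = 2)
fdN1 <- do.call(rbind, res)  # 2500 x 1000 matrix, as in the loop version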

Lasso r code - what is wrong with it?

I am attempting to carry out lasso regression using the lars package but cannot seem to get the lars call to work. I have entered the following code:
diabetes<-read.table("diabetes.txt", header=TRUE)
diabetes
library(lars)
diabetes.lasso = lars(diabetes$x, diabetes$y, type = "lasso")
However, I get the error message:
Error in rep(1, n) : invalid 'times' argument.
I have tried entering it like this:
diabetes<-read.table("diabetes.txt", header=TRUE)
library(lars)
data(diabetes)
diabetes.lasso = lars(age+sex+bmi+map+td+ldl+hdl+tch+ltg+glu, y, type = "lasso")
But then I get the error message:
Error in lars(age + sex + bmi + map + td + ldl + hdl + tch + ltg + glu, y, type = "lasso") :
  object 'age' not found
Where am I going wrong?
EDIT: Data - as below but with another 5 columns.
ldl hdl tch ltg glu
1 -0.034820763 -0.043400846 -0.002592262 0.019908421 -0.017646125
2 -0.019163340 0.074411564 -0.039493383 -0.068329744 -0.092204050
3 -0.034194466 -0.032355932 -0.002592262 0.002863771 -0.025930339
4 0.024990593 -0.036037570 0.034308859 0.022692023 -0.009361911
5 0.015596140 0.008142084 -0.002592262 -0.031991445 -0.046640874
I think some of the confusion may have to do with the fact that the diabetes data set that comes with the lars package has an unusual structure.
library(lars)
data(diabetes)
sapply(diabetes,class)
## x y x2
## "AsIs" "numeric" "AsIs"
sapply(diabetes,dim)
## $x
## [1] 442 10
##
## $y
## NULL
##
## $x2
## [1] 442 64
In other words, diabetes is a data frame containing "columns" which are themselves matrices. In this case, with(diabetes,lars(x,y,type="lasso")) or lars(diabetes$x,diabetes$y,type="lasso") work fine. (But just lars(x,y,type="lasso") won't, because R doesn't know to look for the x and y variables within the diabetes data frame.)
However, if you are reading in your own data, you'll have to separate the response variable and the predictor matrix yourself, something like
X <- as.matrix(mydiabetes[, names(mydiabetes) != "y"])
mydiabetes.lasso = lars(X, mydiabetes$y, type = "lasso")
Or you might be able to use
X <- model.matrix(y~.,data=mydiabetes)
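One caveat with the model.matrix() route (a hedged note, not from the original answer): model.matrix() adds an "(Intercept)" column, which you would normally drop before handing the matrix to lars():
X <- model.matrix(y ~ ., data = mydiabetes)[, -1]  # drop the intercept column
mydiabetes.lasso <- lars(X, mydiabetes$y, type = "lasso")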
lars::lars does not appear to have a formula interface, which means you cannot use the formula specification for the column names (and furthermore it does not accept a "data=" argument). For more information on this and other "data mining" topics, you might want to get a copy of the classic text: "Elements of Statistical Learning". Try this:
# this obviously assumes require(lars) and data(diabetes) have been executed.
> diabetes.lasso = with( diabetes, lars(x, y, type = "lasso"))
> summary(diabetes.lasso)
LARS/LASSO
Call: lars(x = x, y = y, type = "lasso")
Df Rss Cp
0 1 2621009 453.7263
1 2 2510465 418.0322
2 3 1700369 143.8012
3 4 1527165 86.7411
4 5 1365734 33.6957
5 6 1324118 21.5052
6 7 1308932 18.3270
7 8 1275355 8.8775
8 9 1270233 9.1311
9 10 1269390 10.8435
10 11 1264977 11.3390
11 10 1264765 9.2668
12 11 1263983 11.0000
