I would like to be able to examine the node structure of my neural networks. Specifically, I use L1 and L2 regularisation and would like to know what proportion of my weights have atrophied to zero. Does my trained neural net use every single node, or could I get away with far fewer hidden nodes? That sort of thing.
Here's a toy problem:
library(h2o)
h2o.init()
x <-seq(-10,10,by=0.0002)
y <- dnorm(x,sd=2)*sin(2*x) # The function the neuralnet will attempt to fit
plot(x,y,type="l")
df <- data.frame(x=x,y=y)
df2 <- as.h2o(df)
model <- h2o.deeplearning(df2,
x=1,
y=2,
hidden=c(200,100,50,10,5), # way more hidden nodes than it needs
l1=0.0000001, # regularisation to reduce the number of unnecessary nodes
l2=0.0000001,
activation="Tanh",
export_weights_and_biases=TRUE)
P <- as.data.frame(h2o.predict(model,df2))
lines(x,P$predict,col=2)
legend("topleft",legend=c("Training data","nn output"),col=c(1,2),lwd=1)
Is there a function within h2o that will give me the information on what all the weights are?
(I've tried h2o.weights(), but it only appears to give me the first layer.)
Failing that, given that the model is an S4 object, what are the ways of inspecting an S4 object?
Bonus question: Is there any ability within h2o for visualising the node structure?
The h2o.weights() function returns the first layer weights by default, as an H2OFrame. You can get an arbitrary layer by changing the matrix_id argument.
Some examples:
> h2o.weights(model)
x
1 0.3520632
2 0.5535296
3 -0.4793063
4 0.3828013
5 -0.3253765
6 0.7234487
[200 rows x 1 column]
> h2o.weights(model, matrix_id = 5)
C1 C2 C3 C4 C5 C6 C7 C8 C9 C10
1 0.7233770 0.7793967 -1.4486042 -0.8187707 0.8667952 1.0290394 0.26213858 0.02487412 0.3342296 0.39383927
2 0.4528885 0.2434976 0.5963052 0.9640941 -0.4480562 -0.1976745 -0.63002998 0.17428128 -0.9241131 0.13199258
3 -0.5477357 0.4918589 -0.7991062 -0.6445754 0.3618000 0.1324274 0.60856968 -0.35876906 -0.0655041 0.21673690
4 -0.3147219 0.2541574 -0.5886489 -0.9993418 0.3042635 0.4107490 -0.08639368 -1.11863077 0.8755589 -0.06117416
5 -0.7028044 0.4625969 -0.3838535 -0.6484048 -0.6975272 0.2663548 -0.17953268 0.14127040 -0.6482394 -0.04426440
> hidden <- c(200,100,50,10,5)
> for (i in 1:(length(hidden) + 1)) {
+ print(dim(h2o.weights(model, matrix_id = i)))
+ }
[1] 200 1
[1] 100 200
[1] 50 100
[1] 10 50
[1] 5 10
[1] 1 5
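To get at the original question about atrophied weights, here is a minimal sketch (not an h2o built-in) that loops over every weight matrix and reports the proportion of weights whose magnitude falls below an arbitrary "effectively zero" threshold:
## illustrative only: the 1e-4 cutoff is an arbitrary choice, not an h2o default
hidden <- c(200, 100, 50, 10, 5)
near_zero <- sapply(seq_len(length(hidden) + 1), function(i) {
  w <- as.data.frame(h2o.weights(model, matrix_id = i))
  mean(abs(as.matrix(w)) < 1e-4)
})
near_zero  # proportion of near-zero weights in each weight matrix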
I am using the mob() function from the partykit package and I am having some problems parsing the obtained model.
My main aim is to check approximately how large a sample size needs to be in order to detect the real structure of a data generating process (DGP) when breaks are present.
The code below performs a Monte Carlo simulation of data with breaks and tries to identify whether the break was captured by the M-fluctuation test or not.
More specifically, I want to count the number of times, over the total number of simulations (nreps), that the model actually captures the DGP, conditional on a fixed sample size (N), to get a feeling for how much data I need to capture the real DGP.
At the end of the code, you can see the list that I get out of my simulations. The problem is that I cannot recover the information displayed on the console.
Additionally, I have some doubts about how to count the "correctly identified models". For now, what I have in mind is to count a model as positive if it has a break in the correct variable (z1) around the specified region (z1 > 0), with some tolerance on the break point; for example, a break at z1 > 0.1 or z1 > -0.4 would also count as a positive for me. Therefore, is there any simple way of counting the models that have these characteristics?
I hope my example is clear enough for you to help me out. Thank you a lot in advance.
Simulate model
Ingredients to generate the DGP.
library("partykit")
library(data.table) ## optional, but what I'll use to coerce the list into a DT
library(future.apply) ## for parallel stuff
plan(multisession) ## use all available cores
#sample size
N <- 300
#coeficients
betas <- list()
betas$b0 <- 1
betas$b1_up <- 2.4
betas$b1_down <- 2
betas$b2_up <- 2.4
betas$b2_down <- 2
#mob() ingredients
ols_formula <- y ~ x1 + x2 | z1 + z2
# the ""0 +"" ---> supress the 'double' interecept
ols <- function(y, x, start = NULL, weights = NULL, offset = NULL, ...) {lm(y ~ 0 + x)}
Function that generates the data and fits OLS using the mob() algorithm.
reg_simulation_mob <- function(...){
#data
data <- data.frame(
x1 = rnorm(N),
x2 = rnorm(N),
z1 = rnorm(N),
z2 = rnorm(N),
e = rnorm(N))
#dependent variable
data$y <- betas$b0 + with(data, ifelse(z1>0,
betas$b1_up * x1 + betas$b2_up * x2 ,
betas$b1_down * x1 + betas$b2_down * x2 )
+ e )
#Estimate mob()-OLS
ols_mob <- mob(ols_formula,
data = data,
fit = ols)
# return(ols$coefficients)
return(ols_mob)
}
Monte Carlo simulation (only 2 trials) of the setup described above.
# N repetitions
nreps <- 2
## Parallel version
results <- future_lapply(1:nreps, reg_simulation_mob, future.seed = 1234L)
Obtained result
As you can see below, in the first trial (results[[1]]) the model finds the correct break, but in the second (results[[2]]) it fails to find it.
> results
[[1]]
Model-based recursive partitioning (ols)
Model formula:
y ~ x1 + x2 | z1 + z2
Fitted party:
[1] root
| [2] z1 <= 0.00029: n = 140
| x(Intercept) xx1 xx2
| 0.9597894 1.7552122 2.1360788
| [3] z1 > 0.00029: n = 160
| x(Intercept) xx1 xx2
| 0.9371795 2.4745728 2.5087608
Number of inner nodes: 1
Number of terminal nodes: 2
Number of parameters per node: 3
Objective function: 422.2329
[[2]]
Model-based recursive partitioning (ols)
Model formula:
y ~ x1 + x2 | z1 + z2
Fitted party:
[1] root: n = 300
x(Intercept) xx1 xx2
1.015224 2.175625 2.200746
Number of inner nodes: 0
Number of terminal nodes: 1
Number of parameters per node: 3
Objective function: 422.3085
In the picture below, you can observe the structure of the results list, in which I cannot find the information displayed on the console (e.g. the number of nodes, parameters, threshold values, etc.).
First, I would recommend using the lmtree() function rather than vanilla mob(). The former is faster, comes with better defaults for printing and plotting, and has more options for predictions.
Second, I recommend that you consult the vignette("partykit", package = "partykit") which explains how party objects are constructed and which classes and methods are involved.
Third, to determine which variable (if any) was used for splitting in the root node it is probably of interest to extract the results from all parameter instability tests. There is a dedicated sctest() (structural change test) method for obtaining this:
library("strucchange")
sctest(results[[1]], node = 1)
## z1 z2
## statistic 4.072483e+01 6.1762164
## p.value 5.953672e-07 0.9153013
sctest(results[[2]], node = 1)
## z1 z2
## statistic 12.2810548 10.1944484
## p.value 0.2165527 0.4179998
The corresponding partysplit object for the $split (if any) in the root $node is probably easiest to extract manually:
results[[1]]$node$split
## $varid
## [1] 4
##
## $breaks
## [1] 0.0002853492
##
## $index
## NULL
##
## $right
## [1] TRUE
##
## $prob
## NULL
##
## $info
## NULL
##
## attr(,"class")
## [1] "partysplit"
results[[2]]$node$split
## NULL
The variable id pertains to the order of the variables in:
names(results[[1]]$data)
## [1] "y" "x1" "x2" "z1" "z2"
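Putting these pieces together, here is a minimal sketch (using the results list and the objects shown above, with an arbitrary tolerance) of how you could count the "correctly identified" models across simulations:
## illustrative only: count models whose root split uses z1 with a break
## point within a tolerance of the true threshold (0 in the question's DGP)
tol <- 0.4
is_correct <- sapply(results, function(m) {
  sp <- m$node$split
  if (is.null(sp)) return(FALSE)              # no split found at the root
  split_var <- names(m$data)[sp$varid]        # map varid to a variable name
  split_var == "z1" && abs(sp$breaks) <= tol  # correct variable, break close to 0
})
sum(is_correct)                               # number of correctly identified models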
Finally, as for the question what to evaluate: Everything depends on identifying the correct variable for splitting. If this is done correctly, then the split point estimates converge fast to the true values, and then the parameter estimates also converge. See for example our recent arXiv paper (https://arxiv.org/abs/1906.10179) which contains a larger simulation study and also provides replication code.
Therefore, I typically either evaluate the probability of selecting the correct variable in the first split, or I look at the RMSE of the estimated vs. true coefficients for each observation.
Update: Beyond the root node you can use nodeapply() to extract information from the various nodes. However, I do not recommend evaluating all splits, because it becomes increasingly difficult to determine which estimated split matches which of the true splits. Instead, we often assess how similar the fitted partition is to the true partition, e.g., using the adjusted Rand index. Again, you can find an example of this in the arXiv paper mentioned above.
This answer is based on the article that Professor @AchimZeileis referenced in his answer (https://arxiv.org/abs/1906.10179), and it is devoted to the second part of my original question: how to count the correctly specified models (trees)?
The short answer.
The article divides the problem into two different types of data generating process (DGP). In the first one, the data have only one break in one variable (the "stump" case), and the authors count the number of correctly identified models based on the number of times the M-fluctuation test correctly identified the variable generating the break (just one real break in one variable, plus 9 noise variables that are candidate splitting variables with no break). The second DGP is a model with two breaks (the "tree" case), and they use the Adjusted Rand Index (ARI) to assess the model's performance as a metric of the similarity between the real tree and the predicted one.
The very long answer.
Let's break down the ARI for 6 illustrative trees that can be obtained at different sample sizes. The code used here is based heavily on the supplementary material of the article recommended by @AchimZeileis.
Data generating process: Tree structure
The real DGP has 2 breaks, as illustrated in the picture below. The first one is generated by the variable z2 and the second one by z1. In the code snippet below, delta is equal to 1. The threshold value for the first break (depending on z2) is equal to 0.3, and the threshold value for the second break (depending on z1) is equal to -0.3 (the values can be seen in the object xi = c(-0.3, 0.3)).
#function from https://arxiv.org/src/1906.10179v1/anc
dgp_tree <- function(nobs = 1000, delta = 1, xi = c(-0.3, 0.3),
sigma = 1, seed = 7,
changetype = "abrupt",
variation = "all",
beta0 = NULL, beta1 = NULL)
{
# check input values
if(variation != "all") stop("variation can only be set to 'all' in dgp_tree")
if(changetype != "abrupt") stop("changetype can only be abrupt in dgp_tree")
if(!is.null(beta0) | !is.null(beta1)) warning("values for beta0 or beta1 are ignored since variation='all' for dgp_tree")
set.seed(seed)
if(length(xi)==1){
xi1 <- xi2 <- xi
} else {
xi1 <- xi[1]
xi2 <- xi[2]
}
z1 <- runif(nobs,-1,1)
z2 <- runif(nobs,-1,1)
z3 <- rnorm(nobs, 0, 1)
z4 <- rnorm(nobs, 0, 1)
z5 <- rnorm(nobs, 0, 1)
z6 <- rnorm(nobs, 0, 1)
z7 <- rnorm(nobs, 0, 1)
z8 <- runif(nobs, -1, 1)
z9 <- runif(nobs, -1, 1)
z10 <- runif(nobs, -1, 1)
id <- numeric(length(z1))
x <- runif(nobs, min = -1, max = 1)
beta0 <- delta * (-1)^(z1<xi1) * 0^(z2<xi2)
beta1 <- delta * (-1)^(z2>=xi2)
id <- 1 + (z2>=xi2) + (z2>=xi2)*(z1>=xi1)
mu <- beta0 + beta1 * x
y <- rnorm(nobs, mu, sigma)
d <- data.frame(y = y, x = x,
z1 = z1, z2 = z2, z3 = z3, z4 = z4, z5 = z5, z6 = z6, z7 = z7, z8 = z8, z9 = z9, z10 = z10,
beta0 = beta0, beta1 = beta1, mu = mu, sigma = rep.int(sigma, times = length(y)), id = id)
return(d)
}
The Adjusted Rand Index
Among the functions included in the article's supplementary material, there is one to compute the ARI; it is listed below and used in the following examples. It follows the usual ARI notation almost letter by letter.
# function to compute adjusted Rand Index from https://arxiv.org/src/1906.10179v1/anc
adj_rand_index <- function(x, y) {
tab <- table(x, y)
a <- rowSums(tab)
b <- colSums(tab)
M <- sum(choose(tab, 2))
N <- choose(length(x), 2)
A <- sum(choose(a, 2))
B <- sum(choose(b, 2))
c(ARI = (M - (A * B) / N) / (0.5 * (A + B) - (A * B) / N))
}
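As a quick sanity check of the helper above: two identical partitions (up to relabelling) give an ARI of 1, while lumping everything into a single group gives 0.
adj_rand_index(c(1, 1, 2, 2, 3, 3), c(2, 2, 3, 3, 1, 1))  # same partition, relabelled -> 1
adj_rand_index(c(1, 1, 2, 2, 3, 3), rep(1, 6))            # everything in one group -> 0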
Simulating the data
library(partykit)
library(future.apply) ## for parallel stuff
plan(multisession) ## use all available cores
ols_formula <- y ~ x | z1 + z2 + z3 + z4 + z5 + z6 + z7 + z8 + z9 + z10
ols <- function(y, x, start = NULL, weights = NULL, offset = NULL, ...) {lm(y ~ 0 + x)}
sim_ari <- function(n){
tree_data <- dgp_tree(nobs = n)
ols_mob <- mob(ols_formula,
data = tree_data,
fit = ols)
prednode <- predict(ols_mob ,
type = "node")
cross_table <- table(prednode,tree_data$id)
ari <- adj_rand_index(prednode,
tree_data$id)
print(n)
print(ari)
return(
list(
ols_mob = ols_mob,
cross_table = cross_table,
ari=ari,
data = tree_data)
)
}
n_levels <- c(55, ## no break
87, ## only one break
123, ## Correct structure, but poor performance
199, ## Nested break in second leaf
667, ## Additional break in first leaf
5000 ## Perfect model
)
ari <- future_lapply(n_levels, sim_ari, future.seed = 1234L)
A Tale of Six Trees.
The following six cases are analyzed in terms of how well the ARI captures the degree of similarity between the correct and the estimated tree. The key to comparing the trees is id, which records which leaf of the true tree each observation should belong to. For example, if an observation has an id value of 1, it meets the requirements assigned to node number 2 in the figure above. On the other hand, if id is equal to 2, the observation should be assigned to node number 4 in the same picture. Finally, if id is equal to 3, it is assigned to node number 5. You can check this reasoning against the line id <- 1 + (z2>=xi2) + (z2>=xi2)*(z1>=xi1) in dgp_tree(), and in the quick check below.
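As an illustrative check (using the dgp_tree() function defined above), cross-tabulating id against the two true splitting rules shows that id is completely determined by them:
chk <- dgp_tree(nobs = 1000)
with(chk, table(id, z2_split = z2 >= 0.3, z1_split = z1 >= -0.3))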
The First Tree (n=55): No breaks
The first case corresponds to no breaks being identified. In this case, the ARI is equal to 0.
##### First Tree (n=55): No break ####
ari[[1]][[1]]
## Fitted party:
## [1] root: n = 55
## x(Intercept) xx
## -0.01309586 0.39291089
##
## Number of inner nodes: 0
## Number of terminal nodes: 1
## Number of parameters per node: 2
## Objective function: 95.58631
Here it is interesting to note that all the observations are assigned to the root node. Therefore, when we cross-tabulate the predicted nodes prednode_1 against id, we see that all possible id values fall into the root node [1] of the predicted tree (basically, because there is no other option). Using adj_rand_index(), you can check that this leads to an ARI equal to 0.
#data first tree (n=55)
data_1 <- ari[[1]][[4]]
#predicted node first iteration
data_1$prednode_1 <- predict(ari[[1]][[1]], type = "node")
#Cross table
with(data_1, table(prednode_1 ,id))
## id
## prednode_1 1 2 3
## 1 37 7 11
#adj_rand_index
ari[[1]][[3]]
The second tree (n=87): Just one break identified
This case is interesting because it partially identifies the tree's structure (i.e., the break on z1 is missing).
##### Second Tree (n=87): Just one break identified ####
ari[[2]][[1]]
# Fitted party:
# [1] root
# | [2] z2 <= 0.29288: n = 57
# | x(Intercept) xx
# | 0.133293 1.082701
# | [3] z2 > 0.29288: n = 30
# | x(Intercept) xx
# | 0.2598309 -1.8014133
#
# Number of inner nodes: 1
# Number of terminal nodes: 2
# Number of parameters per node: 2
# Objective function: 122.0116
Additionally, when we cross-tabulate the predicted and the real nodes, we see that some observations meet the criteria even in this imperfect tree: 57 observations are correctly assigned to the first node and 9 are correctly assigned to the second branch, while 30 were misassigned because the last node was not identified at all. This leads to an ARI equal to 0.8577366, which is a huge improvement over the first tree.
#data second iteration (n=87)
data_2 <- ari[[2]][[4]]
#predicted node first iteration
data_2$prednode_2 <- predict(ari[[2]][[1]], type = "node")
#Cross table
with(data_2, table(prednode_2 ,id))
# id
# prednode_2 1 2 3
# 2 57 0 0
# 3 1 9 20
#adj_rand_index
ari[[2]][[3]]
# > ari[[2]][[3]]
# ARI
# 0.8577366
The third tree (n=123): Correct structure but poor performance
This case is interesting because it does recover the real tree structure, but it has worse performance (in ARI terms) than the previous tree, which only partially identified the structure.
##### Third Tree (n=123): Correct structure but poor performance ####
ari[[3]][[1]]
# Fitted party:
# [1] root
# | [2] z2 <= 0.07319: n = 60
# | x(Intercept) xx
# | -0.1723388 1.1071878
# | [3] z2 > 0.07319
# | | [4] z1 <= -0.35485: n = 22
# | | x(Intercept) xx
# | | -0.7166565 -0.6791717
# | | [5] z1 > -0.35485: n = 41
# | | x(Intercept) xx
# | | 0.7096033 -0.8605967
#
# Number of inner nodes: 2
# Number of terminal nodes: 3
# Number of parameters per node: 2
# Objective function: 156.4397
Below, when we cross-tabulate the predicted and real nodes, we see that 16 (10 + 6) observations were incorrectly classified, which leads to an ARI of 0.6117612.
#data third iteration (n=123)
data_3 <- ari[[3]][[4]]
#predicted node first iteration
data_3$prednode_3 <- predict(ari[[3]][[1]], type = "node")
#Cross table
with(data_3, table(prednode_3 ,id))
# id
# prednode_3 1 2 3
# 2 60 0 0
# 4 6 16 0
# 5 10 0 31
#adj_rand_index
ari[[3]][[3]]
# > ari[[3]][[3]]
# ARI
# 0.6117612
The fourth tree (n=199): Extra leaf at node [5]
The tree identified here deviates from the original because it has an extra split below node [5], which does not exist in the real DGP.
##### Fourth Tree (n=199): Extra leaf at node[5] ####
ari[[4]][[1]]
# Fitted party:
# [1] root
# | [2] z2 <= -0.19806: n = 79
# | x(Intercept) xx
# | 0.06455217 1.51512672
# | [3] z2 > -0.19806
# | | [4] z1 <= -0.27127: n = 44
# | | x(Intercept) xx
# | | -0.4863122 -0.3860951
# | | [5] z1 > -0.27127
# | | | [6] z2 <= 0.17481: n = 23
# | | | x(Intercept) xx
# | | | -0.1335096 0.2046050
# | | | [7] z2 > 0.17481: n = 53
# | | | x(Intercept) xx
# | | | 1.0868488 -0.0290925
#
# Number of inner nodes: 3
# Number of terminal nodes: 4
# Number of parameters per node: 2
# Objective function: 282.6727
Here the cross-tabulation of the real and predicted nodes is interesting because nodes [6] and [7] do not exist in the real data, yet they receive observations that in reality belong to the first group (id = 1): 23 and 7 observations respectively. This misallocation brings the ARI down to 0.4649789.
#data forth iteration (n=199)
data_4 <- ari[[4]][[4]]
#predicted node first iteration
data_4$prednode_4 <- predict(ari[[4]][[1]], type = "node")
#Cross table
with(data_4, table(prednode_4 ,id))
# id
# prednode_4 1 2 3
# 2 79 0 0
# 4 16 27 1
# 6 23 0 0
# 7 7 0 46
#adj_rand_index
ari[[4]][[3]]
# ARI
# 0.4649789
The fifth tree (n=667): Extra leaf at node [2]
This is another example of a tree with an incorrect structure, where an extra leaf (based on a split on z5, which is incorrect!) is attached below node [2].
##### Fifth Tree (n=667): Extra leaf at node[2] ####
ari[[5]][[1]]
# Fitted party:
# [1] root
# | [2] z2 <= 0.28476
# | | [3] z5 <= 0.76285: n = 322
# | | x(Intercept) xx
# | | -0.1322881 0.9535337
# | | [4] z5 > 0.76285: n = 96
# | | x(Intercept) xx
# | | 0.1686863 1.3878776
# | [5] z2 > 0.28476
# | | [6] z1 <= -0.32001: n = 89
# | | x(Intercept) xx
# | | -0.9139858 -0.7957158
# | | [7] z1 > -0.32001: n = 160
# | | x(Intercept) xx
# | | 0.7661154 -0.8656553
#
# Number of inner nodes: 3
# Number of terminal nodes: 4
# Number of parameters per node: 2
# Objective function: 927.9088
The cross-tabulation of the predicted and correct nodes shows that most of the observations (322) that in reality belong to the first group (id = 1) were assigned to the predicted node [3]. This imperfect structure leads to an ARI of 0.6932132.
#data third iteration (n=667)
data_5 <- ari[[5]][[4]]
#predicted node first iteration
data_5$prednode_5 <- predict(ari[[5]][[1]], type = "node")
#Cross table
with(data_5, table(prednode_5 ,id))
# id
# prednode_5 1 2 3
# 3 322 0 0
# 4 96 0 0
# 6 0 89 0
# 7 3 3 154
#adj_rand_index
ari[[5]][[3]]
# ARI
# 0.6932132
The sixth tree (n=5000): The Golden Tree!
This final tree recovers the DGP perfectly, both in the tree structure and in allocating the observations to their leaves.
##### Sixth Tree (n=5000): Perfect model ####
ari[[6]][[1]]
# Fitted party:
# [1] root
# | [2] z2 <= 0.29971: n = 3187
# | x(Intercept) xx
# | -0.008719923 1.022232280
# | [3] z2 > 0.29971
# | | [4] z1 <= -0.30286: n = 609
# | | x(Intercept) xx
# | | -0.9488846 -0.9813765
# | | [5] z1 > -0.30286: n = 1204
# | | x(Intercept) xx
# | | 1.0281410 -0.9565637
#
# Number of inner nodes: 2
# Number of terminal nodes: 3
# Number of parameters per node: 2
# Objective function: 6992.848
Here we can see from the cross-tabulation of the predicted and real nodes that every observation is allocated to the leaf where it belongs, leading to an ARI equal to 1.
#data sixth iteration (n=5000)
data_6 <- ari[[6]][[4]]
#predicted node first iteration
data_6$prednode_6 <- predict(ari[[6]][[1]], type = "node")
#Cross table
with(data_6, table(prednode_6 ,id))
# id
# prednode_6 1 2 3
# 2 3187 0 0
# 4 0 609 0
# 5 0 0 1204
#adj_rand_index
ari[[6]][[3]]
# ARI
# 1
The conclusions.
Some important takeaways can be recovered from the illustration above.
1.- The ARI is useful for assessing the degree of similarity between a predicted tree and the real tree of the data generating process, even when their structures are very different.
2.- Recovering the tree's correct structure does not by itself lead to an ARI equal to one.
3.- Incorrect trees will not necessarily have an ARI equal to zero.
A final Simulation.
To conclude, here is a small simulation to see how the ARI index behaves when the sample size increases.
### Final simulation
n_levels <- seq(from = 10, to = 2000, by = 5)
ari_sim <- lapply(n_levels, sim_ari)  # stored as ari_sim so the helper below can find it
ari_models<- function(i){
ari <- ari_sim[[i]]$ari
n <- nobs(ari_sim[[i]]$ols_mob)
return(
list(ari = ari , n= n )
)
}
ari_n_list <- lapply(1:length(ari_sim), ari_models)
df <- data.frame(matrix(unlist(ari_n_list), nrow=length(ari_n_list), byrow=T))
colnames(df) <- c("ARI" , "N")
library(ggplot2)
ggplot(df, aes(N)) +
geom_line(aes(y = ARI, colour = "ARI"))
I've got three variables a, b and c. Each is a factor with three categories. I wanted to fit a multinomial regression with multinom() (which comes from the nnet package).
require(car)
require(nnet)  # multinom() comes from nnet
a <- sample(3, 100, TRUE)
b <- sample(3, 100, TRUE)
c <- sample(3, 100, TRUE)
a <- as.factor(a)
b <- as.factor(b)
c <- as.factor(c)
testus <- multinom(c ~ a + b)
predictors <-
expand.grid(b=c("1","2","3","4","5"),a=c("1","2","3","4","5"))
p.fit <- predict(testus, predictors, type='probs')
probabilities<-data.frame(predictors,p.fit)
Now I have the predicted probabilities for c given a and b.
head(probabilities)
  b a         X1         X2         X3         X4        X5
1 1 1 0.10609054 0.22599152 0.20107167 0.21953158 0.2473147
2 2 1 0.20886614 0.27207108 0.08613633 0.18276394 0.2501625
3 3 1 0.17041268 0.24995975 0.16234240 0.13111518 0.2861700
4 4 1 0.23704078 0.21179521 0.08493274 0.03135092 0.4348804
5 5 1 0.09494071 0.09659144 0.24162612 0.21812449 0.3487172
6 1 2 0.14059489 0.17793438 0.29272452 0.26104833 0.1276979
The first two columns show the categories of the independent variables a and b. The next five columns show the conditional probabilities, e.g. P(c=1 | b==1 & a==1) = 0.10609.
I need the variance-covariance matrix, so I did:
vcov(testus)
2:(Intercept) 2:b2 2:b3 2:c2 2:c3 ....
2:(Intercept) .......................................
2:b2 ................................
2:b3 .................
2:c2 ..............
2:c3 .............
3:(Intercept) .............
....
Sorry for pasting only part of the matrix, but otherwise it would be too long. What I would like to have is a variance-covariance matrix for the simultaneous observation of two variables (vcov(a, b&c)). That means I would like to get the variance (covariance) between my variable a and the simultaneous observation of b and c, as I created with "probabilities". I would like to get output like:
2:(Intercept) 2:b2&c2 2:b2&c3 ....
2:(Intercept) .......................................
2:b2&c2 ................................
2:b3&c3 .................
3:(Intercept) .............
....
Is this possible?
Perhaps:
testus <- multinom(c ~ a : b)
vcov(testus)
I say 'perhaps' because there is also the possibility of using the c ~ a*b model, and it's not clear what you want exactly. (The statistical question has not been defined, and I would not think this is a sufficient number of observations for a stable estimate.) At any rate:
colnames( vcov(testus))
#-----------
[1] "2:(Intercept)" "2:a1:b1" "2:a2:b1"
[4] "2:a3:b1" "2:a1:b2" "2:a2:b2"
[7] "2:a3:b2" "2:a1:b3" "2:a2:b3"
[10] "2:a3:b3" "3:(Intercept)" "3:a1:b1"
[13] "3:a2:b1" "3:a3:b1" "3:a1:b2"
[16] "3:a2:b2" "3:a3:b2" "3:a1:b3"
[19] "3:a2:b3" "3:a3:b3"
rownames( vcov(testus))
#--------
[1] "2:(Intercept)" "2:a1:b1" "2:a2:b1"
[4] "2:a3:b1" "2:a1:b2" "2:a2:b2"
[7] "2:a3:b2" "2:a1:b3" "2:a2:b3"
[10] "2:a3:b3" "3:(Intercept)" "3:a1:b1"
[13] "3:a2:b1" "3:a3:b1" "3:a1:b2"
[16] "3:a2:b2" "3:a3:b2" "3:a1:b3"
[19] "3:a2:b3" "3:a3:b3"
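If that parameterisation is what you are after, you can read individual covariances between the "combination" coefficients straight out of vcov() by name (the names printed above). A small illustration, assuming the c ~ a:b model fitted above:
V <- vcov(testus)
V["2:a2:b2", "2:a3:b3"]                                 # covariance between two combination coefficients
V[grep("^2:", rownames(V)), grep("^2:", colnames(V))]   # the full block for outcome level 2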
I am trying to simulate three small datasets, each containing x1, x2, x3, x4, trt and IND.
However, when I try to split the simulated data by IND using split() in R, I get warning messages and the outputs are not correct. Could someone please give me a hint about what I did wrong in my R code?
# Step 2: simulate data
Alpha = 0.05
S = 3 # number of replicates
x = 8 # number of covariates
G = 3 # number of treatment groups
N = 50 # number of subjects per dataset
tot = S*N # total subjects for a simulation run
# True parameters
alpha = c(0.5, 0.8) # intercepts
b1 = c(0.1,0.2,0.3,0.4) # for pi_1 of trt A
b2 = c(0.15,0.25,0.35,0.45) # for pi_2 of trt B
b = c(1.1,1.2,1.3,1.4);
##############################################################################
# Scenario 1: all covariates are independent standard normally distributed #
##############################################################################
set.seed(12)
x1 = rnorm(n=tot, mean=0, sd=1);x2 = rnorm(n=tot, mean=0, sd=1);
x3 = rnorm(n=tot, mean=0, sd=1);x4 = rnorm(n=tot, mean=0, sd=1);
###############################################################################
p1 = exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4)/
(1+exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4) +
exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4))
p2 = exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4)/
(1+exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4) +
exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4))
p3 = 1/(1+exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4) +
exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4))
# To assign subjects to one of treatment groups based on response probabilities
tmp = function(x){sample(c("A","B","C"), 1, prob=x, replace=TRUE)}
trt = apply(cbind(p1,p2,p3),1,tmp)
IND=rep(1:S,each=N) #create an indicator for split simulated data
sim=data.frame(x1,x2,x3,x4,trt, IND)
Aset = subset(sim, trt=="A")
Bset = subset(sim, trt=="B")
Cset = subset(sim, trt=="C")
Anew = split(Aset, f = IND)
Bnew = split(Bset, f = IND)
Cnew = split(Cset, f = IND)
The warning message:
> Anew = split(Aset, f = IND)
Warning message:
In split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) :
data length is not a multiple of split variable
and the output becomes
$`2`
x1 x2 x3 x4 trt IND
141 1.0894068 0.09765185 -0.46702047 0.4049424 A 3
145 -1.2953113 -1.94291045 0.09926239 -0.5338715 A 3
148 0.0274979 0.72971804 0.47194731 -0.1963896 A 3
$`3`
[1] x1 x2 x3 x4 trt IND
<0 rows> (or 0-length row.names)
I have checked my R code several times; however, I can't figure out what I did wrong. Many thanks in advance.
IND is the global variable for the full data, sim. You want to use the specific one for the subset, e.g.
Anew <- split(Aset, f = Aset$IND)
It's a warning, not an error, which means split executed successfully, but may not have done what you wanted to do.
From the "details" section of the help file:
f is recycled as necessary and if the length of x is not a multiple of
the length of f a warning is printed. Any missing values in f are
dropped together with the corresponding values of x.
Try checking the length of your IND against the size of your dataframe, maybe.
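For example, with the objects from the question (illustrative; the exact row count of Aset depends on the random assignment):
length(IND)   # 150, the indicator built for the full data set 'sim'
nrow(Aset)    # roughly 50 (only the trt == "A" rows), so f gets recycled with a warning
nrow(sim)     # 150, so split(sim, IND) is fine while split(Aset, IND) is not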
Not sure what your goal is once you have your data split, but this sounds like a good candidate for the plyr package.
> library(plyr)
> ddply(sim, .(trt,IND), summarise, x1mean=mean(x1), x2sum=sum(x2), x3min=min(x3), x4max=max(x4))
trt IND x1mean x2sum x3min x4max
1 A 1 -0.49356448 -1.5650528 -1.016615 2.0027822
2 A 2 0.05908053 5.1680463 -1.514854 0.8184445
3 A 3 0.22898716 1.8584443 -1.934188 1.6326763
4 B 1 0.01531230 1.1005720 -2.002830 2.6674931
5 B 2 0.17875088 0.2526760 -1.546043 1.2021935
6 B 3 0.13398967 -4.8739380 -1.565945 1.7887837
7 C 1 -0.16993037 -0.5445507 -1.954848 0.6222546
8 C 2 -0.04581149 -6.3230167 -1.491114 0.8714535
9 C 3 -0.41610973 0.9085831 -1.797661 2.1174894
>
You can substitute summarise and its following arguments with any function that returns a data.frame, or something that can be coerced to one. If lists are the target, dlply is your friend.
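For instance, a minimal sketch of the list-returning variant (dlply() shares the same interface; here the function just returns each subset unchanged):
library(plyr)
sim_list <- dlply(sim, .(trt, IND), identity)  # one data.frame per trt/IND combination
length(sim_list)                               # 9 with the question's data (3 trt x 3 IND)
str(sim_list[[1]])                             # each element is the corresponding subset of sim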