I am using the nnet function package from the nnet package in R. I am trying to set the MaxNWts parameter and was wondering if there is any disadvantage to setting this number to a large value like 10^8 etc. The documentation says
"The maximum allowable number of weights. There is no intrinsic limit
in the code, but increasing MaxNWts will probably allow fits that are
very slow and time-consuming."
I also calculate the size parameter by the following calculation
size = Math.Sqrt(%No of Input Nodes% * %No of Output Nodes%)
The problem is that if I set "MaxNWts" to a value like 10000 , it fails sometimes because the number of coefficients is > 10000 when working with huge data sets.
EDIT
Is there a way to calculate the number of wts( to the get the same number calculated by R nnet) somehow so that I can explicitly set it every time without worrying about the failures?
Suggestions?
This is what I have seen in the Applied Predictive Modeling:
Kuhn M, Johnson K. Applied predictive modeling. 1st edition. New York: Springer. 2013.
for the MaxNWts= argument you can pass either one of:
5 * (ncol(predictors) + 1) + 5 + 1)
or
10 * (ncol(predictors) + 1) + 10 + 1
or
13 * (ncol(predictors) + 1) + 13 + 1
predictors is the matrix of your predictors
I think it is empirical based on your data, it is a regularization term like the idea behind shrinkage methods, ridge regression (L2) term for instance. It's main goal is to prevent the model from over fitting as is the case often with neural networks because of the over-parameterization of its computation.
Related
Let me preface this by saying that I do think this question is a coding question, not a statistics question. It would almost surely be closed over at Stats.SE.
The leaps package in R has a useful function for model selection called regsubsets which, for any given size of a model, finds the variables that produce the minimum residual sum of squares. Now I am reading the book Linear Models with R, 2nd Ed., by Julian Faraway. On pages 154-5, he has an example of using the AIC for model selection. The complete code to reproduce the example runs like this:
data(state)
statedata = data.frame(state.x77, row.names=state.abb)
require(leaps)
b = regsubsets(Life.Exp~.,data=statedata)
rs = summary(b)
rs$which
AIC = 50*log(rs$rss/50) + (2:8)*2
plot(AIC ~ I(1:7), ylab="AIC", xlab="Number of Predictors")
The rs$which command produces the output of the regsubsets function and allows you to select the model once you've plotted the AIC and found the number of parameters that minimizes the AIC. But here's the problem: while the typed-up example works fine, I'm having trouble with the wrong number of elements in the array when I try to use this code and adapt it to other data. For example:
require(faraway)
data(odor, package='faraway')
b=regsubsets(odor~temp+gas+pack+
I(temp^2)+I(gas^2)+I(pack^2)+
I(temp*gas)+I(temp*pack)+I(gas*pack),data=odor)
rs=summary(b)
rs$which
AIC=50*log(rs$rss/50) + (2:10)*2
produces a warning message:
Warning message:
In 50 * log(rs$rss/50) + (2:10) * 2 :
longer object length is not a multiple of shorter object length
Sure enough, length(rs$rss)=8, but length(2:10)=9. Now what I need to do is model selection, which means I really ought to have an RSS value for each model size. But if I choose b$rss in the AIC formula, it doesn't work with the original example!
So here's my question: what is summary() doing to the output of the regsubsets() function? The number of RSS values is not only not the same, but the values themselves are not the same.
Ok, so you know the help page for regsubsets says
regsubsets returns an object of class "regsubsets" containing no
user-serviceable parts. It is designed to be processed by
summary.regsubsets.
You're about to find out why.
The code in regsubsets calls Alan Miller's Fortran 77 code for subset selection. That is, I didn't write it and it's in Fortran 77. I do understand the algorithm. In 1996 when I wrote leaps (and again in 2017 when I made a significant modification) I spent enough time reading the code to understand what the variables were doing, but regsubsets mostly followed the structure of the Fortran driver program that came with the code.
The rss field of the regsubsets object has that name because it stores a variable called RSS in the Fortran code. This variable is not the residual sum of squares of the best model. RSS is computed in the setup phase, before any subset selection is done, by the subroute SSLEAPS, which is commented 'Calculates partial residual sums of squares from an orthogonal reduction from AS75.1.' That is, RSS describes the RSS of the models with no selection fitted from left to right in the design matrix: the model with just the leftmost variable, then the leftmost two variables, and so on. There's no reason anyone would need to know this if they're not planning to read the Fortran so it's not documented.
The code in summary.regsubsets extracts the residual sum of squares in the output from the $ress component of the object, which comes from the RESS variable in the Fortran code. This is an array whose [i,j] element is the residual sum of squares of the j-th best model of size i.
All the model criteria are computed from $ress in the same loop of summary.regsubsets, which can be edited down to this:
for (i in ll$first:min(ll$last, ll$nvmax)) {
for (j in 1:nshow) {
vr <- ll$ress[i, j]/ll$nullrss
rssvec <- c(rssvec, ll$ress[i, j])
rsqvec <- c(rsqvec, 1 - vr)
adjr2vec <- c(adjr2vec, 1 - vr * n1/(n1 + ll$intercept -
i))
cpvec <- c(cpvec, ll$ress[i, j]/sigma2 - (n1 + ll$intercept -
2 * i))
bicvec <- c(bicvec, (n1 + ll$intercept) * log(vr) +
i * log(n1 + ll$intercept))
}
}
cpvec gives you the same information as AIC, but if you want AIC it would be straightforward to do the same loop and compute it.
regsubsets has a nvmax parameter to control the "maximum size of subsets to examine". By default this is 8. If you increase it to 9 or higher, your code works.
Please note though, that the 50 in your AIC formula is the sample size (i.e. 50 states in statedata). So for your second example, this should be nrow(odor), so 15.
I try to understand the differences between the two methods bayes and mle in the bn.fit function of the package bnlearn.
I know about the debate between the frequentist and the bayesian approach on understanding probabilities. On a theoretical level I suppose the maximum likelihood estimate mle is a simple frequentist approach setting the relative frequencies as the probability. But what calculations are done to get the bayes estimate? I already checked out the bnlearn documenation, the description of the bn.fit function and some application examples, but nowhere there's a real description of what's happening.
I also tried to understand the function in R by first checking out bnlearn::bn.fit, leading to bnlearn:::bn.fit.backend, leading to bnlearn:::smartSapply but then I got stuck.
Some help would be really appreciated as I use the package for academic work and therefore I should be able to explain what happens.
Bayesian parameter estimation in bnlearn::bn.fit applies to discrete variables. The key is the optional iss argument: "the imaginary sample size used by the bayes method to estimate the conditional probability tables (CPTs) associated with discrete nodes".
So, for a binary root node X in some network, the bayes option in bnlearn::bn.fit returns (Nx + iss / cptsize) / (N + iss) as the probability of X = x, where N is your number of samples, Nx the number of samples with X = x, and cptsize the size of the CPT of X; in this case cptsize = 2. The relevant code is in the bnlearn:::bn.fit.backend.discrete function, in particular the line: tab = tab + extra.args$iss/prod(dim(tab))
Thus, iss / cptsize is the number of imaginary observations for each entry in a CPT, as opposed to N, the number of 'real' observations. With iss = 0 you would be getting a maximum likelihood estimate, as you would have no prior imaginary observations.
The higher iss with respect to N, the stronger the effect of the prior on your posterior parameter estimates. With a fixed iss and a growing N, the Bayesian estimator and the maximum likelihood estimator converge to the same value.
A common rule of thumb is to use a small non-zero iss so that you avoid zero entries in the CPTs, corresponding to combinations that were not observed in the data. Such zero entries could then result in a network which generalizes poorly, such as some early versions of the Pathfinder system.
For more details on Bayesian parameter estimation you can have a look at the book by Koller and Friedman. I suppose many other Bayesian network books also cover the topic.
I am using h2o kmeans in R to divide my population. The method need to be audited, so I would like to explain the threshold used in the h2o's kmeans.
In the documentation of h2o kmeans (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/k-means.html), it is said :
H2O uses proportional reduction in error (PRE) to determine when to
stop splitting. The PRE value is calculated based on the sum of
squares within (SSW).
PRE=(SSW[before split]−SSW[after split])/SSW[before split]
H2O stops splitting when PRE falls below a threshold, which is a
function of the number of variables and the number of cases as
described below:
threshold takes the smaller of these two values:
either 0.8 or [0.02 + 10/number_of_training_rows +
2.5/(number_of_model_features)^2]
The source code (https://github.com/h2oai/h2o-3/blob/master/h2o-algos/src/main/java/hex/kmeans/KMeans.java) is given as :
final double rel_improvement_cutoff = Math.min(0.02 + 10. /
_train.numRows() + 2.5 / Math.pow(model._output.nfeatures(), 2), 0.8);
Where does this threshold come from ? Are there scientific papers about it ?
I am responsible for that threshold. I developed it by running numerous datasets -- artificial and real -- through the k-means algorithm. I began some years ago working with SSW improvement and testing it as a chi-square variable, as recommended by John Hartigan. This criterion failed in a number of instances, so I switched to PRE. The equation above is the result of fitting a nonlinear model to results on datasets with a known number of clusters. When I wrote the k-means program for Tableau, I used this same PRE criterion. After I left Tableau for H2O, they substituted the Calinski-Harabasz index for my PRE rule, producing similar results.
Leland Wilkinson, Chief Scientist, H2O.
I have a data set with 500,000 observations with events and a competing risk as well as a time-to-event variable (survival analysis).
I want to run a survival random forest.
The R-package randomForestSRC is great for it, however, it is impossible to use more than 100,000 rows due to memory limitation (100'000 uses 40GB of RAM) even though I limit my number of predictors to 15 to 20.
I have a hard time finding a solution. Does anyone have a recommendation?
I looked at h2o and spark mllib, both of which do not support survival random forests.
Ideally I am looking for a somewhat R-based solution but I am happy to explore anything else if anyone knows a way to use large data + competing risk random forest.
In general, the memory profile for an RF-SRC data set is n x p x 8 on a 64-bit machine. With n=500,000 and p=20, RAM usage is approximately 80MB. This is not large.
You also need to consider the size of the forest, $nativeArray. With the default nodesize = 3, you will have n / 3 = 166,667 terminal nodes. Assuming symmetric trees for convenience, the total number of interanal/external nodes will approximately be 2 * n / 3 = 333,333. With the default ntree = 1000, and assuming no factors, $nativeArray will be of dimensions [2 * n / 3 * ntree] x [5]. A simple example will show you why we have [5] columns in the $nativeArray to tag the split parameter, and split value. Memory usage for the forest will be thus be 2 * n / 3 * ntree * 5 * 8 = 1.67GB.
So now we are getting into some serious memory usage.
Next consider the ensembles. You haven't said how many events you have in your competing risk data set, but let's assume there are two for simplicity.
The big arrays here are the cause-specific hazard function (CSH) and the cause-specific cumulative incidence function (CIF). These are both of dimension [n] x [time.interest] x [2]. In a worst case scenario, if all your times are distinct, and there are no censored events, time.interest = n. So each of these outputs is n * n * 2 * 8 bytes. This will blow up most machines. It's time.interest that is your enemy. In big-n situations, you need to constrain the time.interest vector to a subset of the actual event times. This can be controlled with the parameter ntime.
From the documentation:
ntime: Integer value used for survival families to constrain ensemble calculations to a grid of time values of no more than ntime time points. Alternatively if a vector of values of length greater than one is supplied, it is assumed these are the time points to be used to constrain the calculations (note that the constrained time points used will be the observed event times closest to the user supplied time points). If no value is specified, the default action is to use all observed event times.
My suggestion would be to start with a very small value of ntime, just to test whether the data set can be analyzed in its entirety without issue. Then increase it gradually and observe your RAM usage. Note that if you have missing data, then RAM usage will be much larger. Also note that I did not mention other arrays such as the terminal node statistics that also lead to heavy RAM usage. I only considered the ensembles, but the reality is that each terminal node will contain arrays of dimension [time.interest] x 2 for the node specific estimator of the CSH and CIF that is used in creating the forest ensemble.
In the future, we will be implementing a Big Data option that will suppress ensembles and optimize the memory profile of the package to better accommodate big-n scenarios. In the meantime, you will have to be diligent in using the existing options like ntree, nodesize, and ntime to reduce your RAM usage.
Some questions came to me as I read a paper 'Batch Normalization : Accelerating Deep Network Training by Reducing Internal Covariate Shift'.
In the paper, it says:
Since m examples from training data can estimate mean and variance of
all training data, we use mini-batch to train batch normalization
parameters.
My question is :
Are they choosing m examples and then fitting batch norm parameters concurrently, or choosing different set of m examples for each input dimension?
E.g. training set is composed of x(i) = (x1,x2,...,xn) : n-dimension
for fixed batch M = {x(1),x(2),...,x(N)}, perform fitting all gamma1~gamman and beta1~betan.
vs
For gamma_i, beta_i picking different batch M_i = {x(1)_i,...,x(m)_i}
I haven't found this question on cross-validated and data-science, so I can only answer it here. Feel free to migrate if necessary.
The mean and variance are computed for all dimensions in each mini-batch at once, using moving averages. Here's how it looks like in code in TF:
mean, variance = tf.nn.moments(incoming, axis)
update_moving_mean = moving_averages.assign_moving_average(moving_mean, mean, decay)
update_moving_variance = moving_averages.assign_moving_average(moving_variance, variance, decay)
with tf.control_dependencies([update_moving_mean, update_moving_variance]):
return tf.identity(mean), tf.identity(variance)
You shouldn't worry about technical details, here's what's going on:
First the mean and variance of the whole batch incoming are computed, along batch axis. Both of them are vectors (more precisely, tensors).
Then current values moving_mean and moving_variance are updated by an assign_moving_average call, which basically computes this: variable * decay + value * (1 - decay).
Every time batchnorm gets executed, it knows one current batch and some statistic of previous batches.