neuralnet package in R with a simple structure taking a very long time, what is the issue here?

I have a problem with the neuralnet function from the neuralnet package in R.
I designed a simple structure with 82 features as input, a single hidden layer with 10 neurons, and a 20-class output. I left the neuralnet call below running for over 4 hours and it didn't finish!
This is the code:
nn = neuralnet(f, data = train, hidden = 10, err.fct = "sse", threshold = 1,
               learningrate = .05, rep = 1, linear.output = FALSE)

Training a neural network can take arbitrarily long. What affects this time?
Complexity of the network (not a problem here, as your network is quite small)
Size of the training data - even a few thousand samples can take quite a while; moreover, the number of features also significantly increases computation time
Training algorithm and its hyperparameters - in particular for SGD-based solutions, a learning rate that is too small (or too big, which causes oscillation)
Type of stopping criterion - there are many ways of checking whether to stop training a NN, some more expensive (validation score) than others (amplitude of the gradient / number of epochs)
In your particular example the training takes at most 100,000 steps and you use rprop+ learning, so the most probable problem is the size of the training data. You can try setting stepmax to a much smaller value to see how much time it needs and how good the resulting model is, as in the sketch below.
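For instance, a minimal sketch (the same call as in the question, with an arbitrarily chosen stepmax):
# Cap the number of training steps; neuralnet stops at stepmax (with a
# warning if the threshold has not been reached) instead of running for hours
nn = neuralnet(f, data = train, hidden = 10, err.fct = "sse", threshold = 1,
               learningrate = .05, rep = 1, linear.output = FALSE,
               stepmax = 5000)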
In general, neural networks are hard and slow to train; you have to deal with that or switch to other models.

You can easily predict the computation time and complexity of your code before running it on the full data with the GuessCompx package.
Create fake data with the same characteristics as yours, a 20-class Y vector, and a wrapper function:
train = data.frame(matrix(rnorm(300000*82, 3), ncol=82))
train['Y'] = as.character(round(runif(300000, 1, 20)))
nn_test = function(data) {
  nn = neuralnet(formula = Y~., data = data, hidden = 10, err.fct = "sse",
                 threshold = 1, learningrate = .05, rep = 1,
                 linear.output = FALSE)
}
And then do the audit:
library(GuessCompx) # get it by running: install.packages("GuessCompx")
library(neuralnet)
CompEst(train, nn_test)
#### $`TIME COMPLEXITY RESULTS`$best.model
#### [1] "NLOGN"
#### $`TIME COMPLEXITY RESULTS`$computation.time.on.full.dataset
#### [1] "1M 4.86S"
#### $`MEMORY COMPLEXITY RESULTS`$best.model
#### [1] "LINEAR"
#### $`MEMORY COMPLEXITY RESULTS`$memory.usage.on.full.dataset
#### [1] "55535 Mb"
#### $`MEMORY COMPLEXITY RESULTS`$system.memory.limit
#### [1] "16282 Mb"
See how the computation time is not a problem, but the memory usage and limitations might be hitting your computer's limits and causing the long delay: the nn output object alone takes more than 4 Gb to store!

Related

Consensus clustering with diceR package

I am supposed to perform combined K-means + Gaussian mixture models to determine a set of consensus clusters for a fixed number of clusters (k = 4). My data is composed of 231 cells from 4 different types of tumor, with a total of 19,177 variables (genes in this case).
I have never tried to perform this before, and I tried to follow the instructions from this R package: https://search.r-project.org/CRAN/refmans/diceR/html/consensus_cluster.html
However, I must have done something wrong, since when I try to run the code:
cc <- consensus_cluster(data, nk = 4, algorithms =c("gmm", "km"), progress = F )
it takes way too much time and ends with this error:
Error: cannot allocate vector of size 11.0 Gb
So clearly my generated vector is too heavy, and I must have misunderstood something in the tutorial.
Is anyone familiar with the diceR package who could explain to me whether there is a way to make it work?
consensus_cluster "eats up" the memory of the R session during its execution. You have so many variables that handling them cannot be allocated in memory.
So you have two choices: increase physical memory, or use a partial sample instead of the full data. Let's assume that a physical memory increase is not feasible. Then you should use the prep.data = "sample" option. However, you'll need to wait: I modeled data, and for GMM the wait was 8 hours.
Please see below:
library(diceR)
observ = 23
variables = 19177
dat <- matrix(rnorm(observ * variables), ncol = variables)
cc <- consensus_cluster(dat, nk = 4, algorithms = c("gmm", "km"),
                        progress = TRUE, prep.data = "sample")
Output (I was not patient enough to wait for it to finish):
Clustering Algorithm 1 of 2: GMM (k = 4) [---------------------------------] 1% eta: 8h

How to apply machine learning techniques / how to use model outputs

I am a plant scientist new to machine learning. I have had success writing code and following tutorials of machine learning techniques. My issue is understanding how to actually apply these techniques to answer real-world questions; I don't really understand how to use the model outputs to answer questions.
I recently followed a tutorial creating an algorithm to detect credit card fraud. All of the models ran nicely and I understand how to build them; but how in the world do I take this information and translate it into a definitive answer? Following the same example, let's say I wrote this code for my job: how would I then take real credit card data and screen it using this algorithm? I really want to establish a link between running these models and generating a useful output from real data.
Thank you all.
In the name of being concise, I will highlight some specific examples using the same data set, found here:
https://drive.google.com/file/d/1CTAlmlREFRaEN3NoHHitewpqAtWS5cVQ/view
# Import
library(readr) # read_csv() comes from the readr package
creditcard_data <- read_csv('PATH')
# Restructure
creditcard_data$Amount=scale(creditcard_data$Amount)
NewData=creditcard_data[,-c(1)]
head(NewData)
#Split
library(caTools)
set.seed(123)
data_sample = sample.split(NewData$Class,SplitRatio=0.80)
train_data = subset(NewData,data_sample==TRUE)
test_data = subset(NewData,data_sample==FALSE)
1) Decision Tree
library(rpart)
library(rpart.plot)
decisionTree_model <- rpart(Class ~ . , creditcard_data, method = 'class')
predicted_val <- predict(decisionTree_model, creditcard_data, type = 'class')
probability <- predict(decisionTree_model, creditcard_data, type = 'prob')
rpart.plot(decisionTree_model)
2) Artificial Neural Network
library(neuralnet)
ANN_model = neuralnet(Class ~ ., train_data, linear.output = FALSE)
plot(ANN_model)
predANN=compute(ANN_model,test_data)
resultANN=predANN$net.result
resultANN=ifelse(resultANN>0.5,1,0)
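To see how well this does (a quick check I am adding, not part of the original tutorial), compare the thresholded predictions against the true labels:
# Confusion matrix of predicted vs. actual class, plus overall accuracy
table(Predicted = resultANN, Actual = test_data$Class)
mean(resultANN == test_data$Class)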
3) Gradient Boosting
library(gbm, quietly=TRUE)
# train GBM model
system.time(
model_gbm <- gbm(Class ~ .
, distribution = "bernoulli"
, data = rbind(train_data, test_data)
, n.trees = 100
, interaction.depth = 2
, n.minobsinnode = 10
, shrinkage = 0.01
, bag.fraction = 0.5
, train.fraction = nrow(train_data) / (nrow(train_data) + nrow(test_data))
)
)
# best iteration
gbm.iter = gbm.perf(model_gbm, method = "test")
model.influence = relative.influence(model_gbm, n.trees = gbm.iter, sort. = TRUE)
# plot
plot(model_gbm)
# predict on the test set and compute AUC
gbm_test = predict(model_gbm, newdata = test_data, n.trees = gbm.iter)
library(pROC) # roc() comes from the pROC package
gbm_auc = roc(test_data$Class, gbm_test, plot = TRUE, col = "red")
print(gbm_auc)
You develop your model with, preferably, three data sets: training, testing, and validation. (Sometimes different terminology is used.)
Here, the train and test sets are used to develop the model.
The model you decide upon must never see any of the validation set. This set is used to see how good your model is; in effect, it simulates the real-world new data that may come to you in the future. Once you decide your model performs at an acceptable level, you can go back and run all your data through it to produce the final operational model. Then any new 'live' data of interest is fed to the model and produces an output. In the case of fraud detection, it would output some probability; here you need human input to decide at what level you would flag the event as fraudulent enough to warrant further investigation, as in the sketch below.
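For instance, a minimal sketch using the decision tree fitted above (new_transactions is a hypothetical data frame of incoming, unlabeled transactions with the same predictor columns):
# Score fresh transactions with the trained tree
new_probs <- predict(decisionTree_model, newdata = new_transactions, type = 'prob')
# Flag anything above a business-chosen threshold for human review
# (the "1" column holds the fraud-class probability, assuming Class is 0/1)
flagged <- new_transactions[new_probs[, "1"] > 0.9, ]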
At periodic intervals, as new data arrives, or when your model performance weakens (fraudsters may become more cunning!), you would repeat the whole process.

How to choose the nrounds using `catboost`?

If I understand catboost correctly, we need to tune nrounds just like in xgboost, using CV. I see the following code in the official tutorial (In [8]):
params_with_od <- list(iterations = 500,
                       loss_function = 'Logloss',
                       train_dir = 'train_dir',
                       od_type = 'Iter',
                       od_wait = 30)
model_with_od <- catboost.train(train_pool, test_pool, params_with_od)
which results in best iterations = 211.
My questions are:
Is it correct that this command uses the test_pool to choose the best iterations, instead of using cross-validation?
If yes, does catboost provide a command to choose the best iterations from CV, or do I need to do it manually?
Catboost uses the test_pool to determine the optimum number of iterations (a holdout evaluation rather than true cross-validation). Both train_pool and test_pool are datasets that include the target variable. Earlier in the tutorial they write:
train_path = '../R-package/inst/extdata/adult_train.1000'
test_path = '../R-package/inst/extdata/adult_test.1000'
column_description_vector = rep('numeric', 15)
cat_features <- c(3, 5, 7, 8, 9, 10, 11, 15)
for (i in cat_features)
  column_description_vector[i] <- 'factor'
train <- read.table(train_path, head=F, sep="\t", colClasses=column_description_vector)
test <- read.table(test_path, head=F, sep="\t", colClasses=column_description_vector)
target <- c(1)
train_pool <- catboost.from_data_frame(data=train[,-target], target=train[,target])
test_pool <- catboost.from_data_frame(data=test[,-target], target=test[,target])
When you execute catboost.train(train_pool, test_pool, params_with_od), train_pool is used for training and test_pool is used as the evaluation set to determine the optimum number of iterations.
Now you are right to be confused, since later on in the tutorial they again use test_pool and the fitted model to make a prediction (model_best is similar to model_with_od, but uses a different overfitting detector, IncToDec):
prediction_best <- catboost.predict(model_best, test_pool, type = 'Probability')
This might be bad practice. They might get away with it with their IncToDec overfitting detector - I am not familiar with the mathematics behind it - but for the Iter type overfitting detector you would need separate train, validation and test data sets (and if you want to be on the safe side, do the same for the IncToDec overfitting detector). However, it is only a tutorial showing the functionality, so I wouldn't be too pedantic about which data they have already used for what.
Here is a link to a little more detail on the overfitting detectors:
https://tech.yandex.com/catboost/doc/dg/concepts/overfitting-detector-docpage/
It is a very poor decision to base your number of iterations on one test_pool and the best iterations from catboost.train(). In doing so, you are tuning your parameters to one specific test set and your model will not work well with new data. You are therefore correct in presuming that, like XGBoost, you need to apply CV to find the optimal number of iterations.
There is indeed a CV function in catboost. What you should do is specify a large number of iterations and stop the training after a certain number of rounds without improvement by using the early_stopping_rounds parameter. Unlike LightGBM, catboost unfortunately doesn't seem to have the option of automatically returning the optimal number of boosting rounds after CV to apply in catboost.train(). Therefore, it requires a bit of a workaround. Here is an example which should work:
library(catboost)
library(data.table)
n_cores = parallel::detectCores() - 1  # number of threads to use
parameter = list(
  thread_count = n_cores,
  loss_function = "RMSE",
  eval_metric = c("RMSE", "MAE", "R2"),
  iterations = 10^5,            # Train up to 10^5 rounds
  early_stopping_rounds = 100   # Stop after 100 rounds of no improvement
)
# Apply 6-fold CV
model = catboost.cv(
  pool = train_pool,
  fold_count = 6,
  params = parameter
)
# Transform the CV output to a data.table
setDT(model)
model[, iterations := .I]
# Order from lowest to highest test RMSE
setorder(model, test.RMSE.mean)
# Select the number of iterations with the lowest RMSE
parameter$iterations = model[1, iterations]
# Train the final model with the optimal number of iterations
model = catboost.train(
  learn_pool = train_pool,
  test_pool = test_pool,
  params = parameter
)
I think this is a general question for xgboost and catboost: the choice of nrounds interacts with the choice of learning rate.
Thus, I recommend a higher number of rounds (1000+) and a low learning rate.
After you find the best hyperparameters, retry with a lower learning rate to check that the hyperparameters you chose are stable.
And I find #nikitxskv's answer misleading.
In the R tutorial, In [12] just chooses learning_rate = 0.1 without multiple choices, so there is no hint for nrounds tuning.
Actually, In [12] just uses the function expand.grid to find the best hyperparameters. It operates on the selections of depth, gamma and so on.
And in practice, we don't use this approach to find a proper nrounds (it takes too long).
And now for the two questions.
Is it correct that this command uses the test_pool to choose the best iterations, instead of using cross-validation?
Yes, but you can use CV.
If yes, does catboost provide a command to choose the best iterations from CV, or do I need to do it manually?
It depends on you. If you have a great aversion to boosting overfitting, I recommend you try it. There are a lot of packages to solve this problem; I recommend the tidymodels packages, as in the sketch below.
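For instance, a minimal sketch of tuning the number of trees by CV with tidymodels (using the xgboost engine, since catboost is not a built-in parsnip engine; the dataset and grid values are illustrative):
library(tidymodels)

# Boosted-tree spec with the number of trees marked for tuning
spec <- boost_tree(trees = tune(), learn_rate = 0.05) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

folds <- vfold_cv(mtcars, v = 5)           # 5-fold CV on a toy dataset
grid <- tibble(trees = c(100, 500, 1000))  # candidate numbers of rounds

res <- workflow() %>%
  add_model(spec) %>%
  add_formula(mpg ~ .) %>%
  tune_grid(resamples = folds, grid = grid)

select_best(res, metric = "rmse")          # best number of trees by CV RMSE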

Consistent results with multiple runs of h2o deeplearning

For a certain combination of parameters in the deeplearning function of h2o, I get different results each time I run it.
args <- list(list(hidden = c(200, 200, 200),
                  loss = "CrossEntropy",
                  hidden_dropout_ratio = c(0.1, 0.1, 0.1),
                  activation = "RectifierWithDropout",
                  epochs = EPOCHS))
run <- function(extra_params) {
  model <- do.call(h2o.deeplearning,
                   modifyList(list(x = columns, y = c("Response"),
                                   validation_frame = validation,
                                   distribution = "multinomial",
                                   l1 = 1e-5, balance_classes = TRUE,
                                   training_frame = training), extra_params))
}
model <- lapply(args, run)
What would I need to do in order to get consistent results for the model each time I run this?
Deep learning with H2O will not be reproducible if it is run on more than a single core. The results and performance metrics may vary slightly each time you train the deep learning model. The implementation in H2O uses a technique called "Hogwild!" which increases the speed of training at the cost of reproducibility on multiple cores.
So if you want reproducible results, you will need to restrict H2O to run on a single core and make sure to use a seed in the h2o.deeplearning call.
Edit based on comment by Darren Cook:
I forgot to include the reproducible = TRUE parameter, which needs to be set in combination with the seed to make it truly reproducible. Note that this will make it a lot slower to run, and it is not advisable to do this with a large dataset.
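A minimal sketch (reusing the objects from the question; the seed value is arbitrary):
library(h2o)
h2o.init(nthreads = 1)  # restrict H2O to a single core

model <- h2o.deeplearning(x = columns, y = "Response",
                          training_frame = training,
                          validation_frame = validation,
                          hidden = c(200, 200, 200),
                          activation = "RectifierWithDropout",
                          hidden_dropout_ratios = c(0.1, 0.1, 0.1),
                          distribution = "multinomial",
                          seed = 123,           # fix the seed...
                          reproducible = TRUE)  # ...and force reproducible mode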
More information on "Hogwild!"

R problem with randomForest classification with raster package

I am having an issue with randomForest and the raster package. First, I create the classifier:
library(raster)
library(randomForest)
# Set some user variables
fn = "image.pix"
outraster = "classified.pix"
training_band = 2
validation_band = 1
original_classes = c(125,126,136,137,151,152,159,170)
reclassd_classes = c(122,122,136,137,150,150,150,170)
# Get the training data
myraster = stack(fn)
training_class = subset(myraster, training_band)
# Reclass the training data classes as required
training_class = subs(training_class, data.frame(original_classes,reclassd_classes))
# Find pixels that have training data and prepare the data used to create the classifier
is_training = Which(training_class != 0, cells=TRUE)
training_predictors = extract(myraster, is_training)[,3:nlayers(myraster)]
training_response = as.factor(extract(training_class, is_training))
remove(is_training)
# Create and save the forest, use odd number of trees to avoid breaking ties at random
r_tree = randomForest(training_predictors, y=training_response, ntree = 201, keep.forest=TRUE) # Runs out of memory, does not allow more trees than this...
remove(training_predictors, training_response)
Up to this point, all is good. I can see that the forest was created correctly by looking at the error rates, confusion matrix, etc. When I try to classify some data, however, I run into trouble with the following, which returns all NAs in predictions:
# Classify the whole image
predictor_data = subset(myraster, 3:nlayers(myraster))
layerNames(predictor_data) = layerNames(myraster)[3:nlayers(myraster)]
predictions = predict(predictor_data, r_tree, type='response', progress='text')
And gives this warning:
Warning messages:
1: In `[<-.factor`(`*tmp*`, , value = c(1, 1, 1, 1, 1, 1, ... :
invalid factor level, NAs generated
(keeps going like this)...
However, calling predict.randomForest directly works fine and returns the expected predictions (this is not a good option for me because the image is large, and I cannot store the whole matrix in memory):
# Classify the whole image and write it to file
predictor_data = subset(myraster, 3:nlayers(myraster))
layerNames(predictor_data) = layerNames(myraster)[3:nlayers(myraster)]
predictor_data = extract(predictor_data, extent(predictor_data))
predictions = predict(r_tree, newdata=predictor_data)
How can I get it to work directly with the "raster" version? I know that this is possible, as shown in the examples of predict{raster}.
You could try nesting predict.randomForest within the writeRaster function and writing the matrix as a raster in chunks, as per the PDF vignette included with the raster package; see the sketch below. Before that, try the argument na.rm=TRUE when calling predict in the raster function. You might also assign dummy values to the NAs in the predictor rasters and later rewrite them as NAs using functions in the raster package.
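A minimal sketch of that chunked approach (reusing the objects from the question, and the raster package's writeStart/writeValues/writeStop block I/O):
out <- raster(predictor_data)     # empty raster with the same geometry
bs <- blockSize(predictor_data)   # suggested chunking
out <- writeStart(out, outraster, overwrite = TRUE)
for (i in seq_len(bs$n)) {
  # read one chunk of predictor values as a data frame
  vals <- as.data.frame(getValues(predictor_data, row = bs$row[i], nrows = bs$nrows[i]))
  pred <- rep(NA_integer_, nrow(vals))
  ok <- complete.cases(vals)      # predict only on rows without NAs
  if (any(ok))
    pred[ok] <- as.integer(as.character(predict(r_tree, newdata = vals[ok, ])))
  out <- writeValues(out, pred, bs$row[i])
}
out <- writeStop(out)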
As for memory problems when calling RFs: I've had a plethora of memory issues dealing with BRTs. They're immense on disk and in memory! (Should a model be more complex than the data?) I've not had them run reliably on 32-bit machines (WinXP or Linux). Sometimes tweaking Windows' memory allotment to applications has helped, and moving to Linux has helped more, but I get the most from 64-bit Windows or Linux machines, since they impose a higher (or no) limit on the amount of memory applications can take. You may be able to increase the number of trees you can use by doing this.
