R problem with randomForest classification with raster package

I am having an issue with randomForest and the raster package. First, I create the classifier:
library(raster)
library(randomForest)
# Set some user variables
fn = "image.pix"
outraster = "classified.pix"
training_band = 2
validation_band = 1
original_classes = c(125,126,136,137,151,152,159,170)
reclassd_classes = c(122,122,136,137,150,150,150,170)
# Get the training data
myraster = stack(fn)
training_class = subset(myraster, training_band)
# Reclass the training data classes as required
training_class = subs(training_class, data.frame(original_classes,reclassd_classes))
# Find pixels that have training data and prepare the data used to create the classifier
is_training = Which(training_class != 0, cells=TRUE)
training_predictors = extract(myraster, is_training)[,3:nlayers(myraster)]
training_response = as.factor(extract(training_class, is_training))
remove(is_training)
# Create and save the forest, use odd number of trees to avoid breaking ties at random
r_tree = randomForest(training_predictors, y=training_response, ntree = 201, keep.forest=TRUE) # Runs out of memory, does not allow more trees than this...
remove(training_predictors, training_response)
Up to this point, all is good. I can see that the forest was created correctly by looking at the error rates, confusion matrix, etc. When I try to classify some data, however, I run into trouble with the following, which returns all NA's in predictions:
# Classify the whole image
predictor_data = subset(myraster, 3:nlayers(myraster))
layerNames(predictor_data) = layerNames(myraster)[3:nlayers(myraster)]
predictions = predict(predictor_data, r_tree, type='response', progress='text')
And gives this warning:
Warning messages:
1: In `[<-.factor`(`*tmp*`, , value = c(1, 1, 1, 1, 1, 1, ... :
invalid factor level, NAs generated
(keeps going like this)...
However, calling predict.randomForest directly works fine and returns the expected predictions (this is not a good option for me because the image is large, and I cannot store the whole matrix in memory):
# Classify the whole image and write it to file
predictor_data = subset(myraster, 3:nlayers(myraster))
layerNames(predictor_data) = layerNames(myraster)[3:nlayers(myraster)]
predictor_data = extract(predictor_data, extent(predictor_data))
predictions = predict(r_tree, newdata=predictor_data)
How can I get it to work directly with the "raster" version? I know that this is possible, as shown in the examples of predict{raster}.

You could try nesting predict.randomForest within the writeRaster function and writing the matrix out as a raster in chunks, as described in the PDF included with the raster package. Before that, try the argument 'na.rm=TRUE' when calling predict in the raster function. You might also assign dummy values to the NAs in the predictor rasters, then rewrite them as NAs afterwards using functions in the raster package.
As for memory problems when calling RFs, I've had a plethora of memory issues dealing with BRTs. They're immense on disk and in memory! (Should a model be more complex than the data?) I've not had them run reliably on 32-bit machines (Windows XP or Linux). Sometimes tweaking the Windows memory allotment to applications has helped, and moving to Linux has helped more, but I get the most from 64-bit Windows or Linux machines, since they impose a higher (or no) limit on the amount of memory applications can take. You may be able to increase the number of trees you can use by doing this.
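The chunked approach suggested in this answer could be sketched roughly as follows. This is a minimal sketch, not a tested fix for the warning in the question: `predict_block` and `classify_in_blocks` are hypothetical helper names, and the sketch assumes the class labels are numeric codes (such as 122 or 150) and that the layer names of the predictor raster match the names the forest was trained on. It relies on the raster functions blockSize, writeStart, getValues, writeValues and writeStop.

```r
# Predict one block of pixel values, leaving rows that contain any NA as NA.
# `block` is a matrix (rows = pixels, columns = predictor layers).
# `predict_fun` is any function(model, newdata) returning one label per row;
# with randomForest loaded, the default `predict` dispatches to predict.randomForest.
predict_block <- function(model, block, predict_fun = predict) {
  out <- rep(NA_real_, nrow(block))
  ok <- stats::complete.cases(block)
  if (any(ok)) {
    labels <- predict_fun(model, block[ok, , drop = FALSE])
    # Factor labels such as "122" must go through character to keep their value
    out[ok] <- as.numeric(as.character(labels))
  }
  out
}

# Hypothetical driver: classify a multi-layer raster block by block and
# write the result to `filename` without holding the whole image in memory.
classify_in_blocks <- function(predictors, model, filename) {
  out <- raster::raster(predictors, layer = 0)   # empty single-layer template
  bs <- raster::blockSize(predictors)
  out <- raster::writeStart(out, filename, overwrite = TRUE)
  for (i in seq_len(bs$n)) {
    block <- raster::getValues(predictors, row = bs$row[i], nrows = bs$nrows[i])
    out <- raster::writeValues(out, predict_block(model, block), bs$row[i])
  }
  raster::writeStop(out)
}
```

With the objects from the question, a call might look like classify_in_blocks(predictor_data, r_tree, outraster). Converting the factor to numeric before writing is a design choice that sidesteps factor handling in the output raster entirely.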

Related

Consensus clustering with diceR package

I am supposed to combine K-means and Gaussian mixture models to determine a set of consensus clusters for a fixed number of clusters (k = 4). My data consist of 231 cells from 4 different types of tumor, with a total of 19,177 variables (genes in this case).
I have never tried to perform this and I tried to follow the instructions from this R package : https://search.r-project.org/CRAN/refmans/diceR/html/consensus_cluster.html
However I must have done something wrong since when I try to run the code:
cc <- consensus_cluster(data, nk = 4, algorithms =c("gmm", "km"), progress = F )
it takes far too long and ends with this error:
Error: cannot allocate vector of size 11.0 Gb
So clearly the vector being generated is too large, and I must have misunderstood something in the tutorial.
Is anyone familiar with the diceR package who could explain whether there is a way to make it work?
During its execution, consensus_cluster eats up the memory of the R session. You have so many variables that the objects needed to handle them cannot be allocated in memory.
So you have two choices: increase physical memory, or use a partial sample of the data rather than the full data. Let's assume that increasing physical memory is not feasible. Then you should use the prep.data = "sample" option. However, you will need to wait: I ran it on simulated data, and for GMM the estimated time was 8 hours.
Please see below:
library(diceR)
observ = 23
variables = 19177
dat <- matrix(rnorm(observ * variables), ncol = variables)
cc <- consensus_cluster(dat, nk = 4, algorithms =c("gmm", "km"), progress = TRUE,
prep.data = "sample")
Output (I was not patient enough to wait for it to finish):
Clustering Algorithm 1 of 2: GMM (k = 4) [---------------------------------] 1% eta: 8h
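As a rough sanity check on the error message in the question, the arithmetic below shows that 11.0 Gb is about the size of a 19177 x 19177 x 4 array of doubles, for example one variable-by-variable matrix per cluster. This is only an assumption about where the allocation comes from; nothing in the question or answer says which object failed to allocate.

```r
p <- 19177           # number of variables (genes) in the question
k <- 4               # number of clusters requested
bytes_per_double <- 8

# Size of a hypothetical p x p x k array of doubles
bytes <- p^2 * k * bytes_per_double
gib <- bytes / 2^30  # R reports sizes in units of 2^30 bytes as "Gb"

round(gib, 1)        # 11
```

If that reading is right, memory use grows with the square of the number of variables, so reducing the number of variables would matter far more than reducing the number of cells.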

bartMachine in caret train error : incorrect number of dimensions

I encounter a strange problem when trying to train a model in R using caret:
> bart <- train(x = cor_data, y = factor(outcome), method = "bartMachine")
Error in tuneGrid[!duplicated(tuneGrid), , drop = FALSE] :
incorrect number of dimensions
However, when using rf, xgbTree, glmnet, or svmRadial instead of bartMachine, no error is raised.
Moreover, dim(cor_data) and length(outcome) return [1] 3056 134 and [1] 3056 respectively, which indicates that there is indeed no issue with the dimensions of my dataset.
I have tried changing the tuneGrid parameter in train, which resolved the problem but caused this issue instead:
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "pool-89-thread-1"
My dataset includes no NA, and all variables are either numerical or binary.
My goal is to extract the most important variables in the bart model. For example, I use for random forests:
rf <- train(x = cor_data, y = factor(outcome), method = "rf")
rfImp <- varImp(rf)
rf_select <- row.names(rfImp$importance[order(- rfImp$importance$Overall)[1:43], , drop = FALSE])
Thank you in advance for your help.
Since your goal is to extract the most important variables in the bart model, I will assume you are willing to bypass the caret wrapper and work directly with the bartMachine package, which is the only way I could run it successfully.
For my system, solving the memory issue required 2 further things:
Restart R and, before loading anything, allow the JVM up to 8 GB of memory like so:
options(java.parameters = "-Xmx8g")
When running bartMachineCV, turn off mem_cache_for_speed:
library(bartMachine)
set_bart_machine_num_cores(16)
bart <- bartMachineCV(X = cor_data, y = factor(outcome), mem_cache_for_speed = FALSE)
This will iterate through 3 values of k (2, 3 and 5) and 2 values of m (50 and 200), running 5 cross-validation folds each time, and then build a bartMachine using the best hyperparameter combination. You may also have to reduce the number of cores depending on your system; for reference, this took about an hour on a training set of 20,000 observations x 12 variables using 16 cores. You could also reduce the number of hyperparameter combinations it tests using the k_cvs and num_tree_cvs arguments.
Then to get the variable importance:
vi <- investigate_var_importance(bart, num_replicates_for_avg = 20)
print(vi)
You can also use it as a predictive model with predict(bart, new_data=new), similar to the object normally returned by caret::train(). This worked on R 4.0.5, bartMachine_1.2.6 and rJava_1.0-4.
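To get a feel for how much work the k_cvs and num_tree_cvs arguments mentioned above can save, the number of cross-validation fits can be counted before running anything. This is a minimal sketch in base R: `count_fits` is a hypothetical helper, it assumes the grid and fold count described in the answer, and it ignores the final model built on the full data.

```r
# Grid described in the answer: 3 values of k, 2 values of m, 5 folds
k_cvs <- c(2, 3, 5)
num_tree_cvs <- c(50, 200)
k_folds <- 5

# Hypothetical helper: number of models fitted during cross-validation
count_fits <- function(k_cvs, num_tree_cvs, k_folds) {
  nrow(expand.grid(k = k_cvs, m = num_tree_cvs)) * k_folds
}

full_fits <- count_fits(k_cvs, num_tree_cvs, k_folds)  # 6 combinations x 5 folds
small_fits <- count_fits(c(2, 3), 50, k_folds)          # 2 combinations x 5 folds

# The reduced grid would then be passed to bartMachineCV (not run here,
# since it needs Java and the asker's data):
# bart <- bartMachineCV(X = cor_data, y = factor(outcome),
#                       k_cvs = c(2, 3), num_tree_cvs = 50,
#                       mem_cache_for_speed = FALSE)
```

Under these assumptions the reduced grid needs a third of the fits of the full one, at the cost of exploring fewer hyperparameter values.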

How to predict the labels for the test set when using a custom Iterator in MXnet?

I have a big dataset (around 20GB for training and 2GB for testing) and I want to use MXnet and R. Due to lack of memory, I looked for a custom iterator to load the training and test sets in batches, and I found this solution.
Now, I can train the model using the code on this page, but the problem comes when I read the test set with the same iterator as follows:
test.iter <- CustomCSVIter$new(iter = NULL, data.csv = "test.csv", data.shape = 480, batch.size = batch.size)
Then the prediction command does not work, and there is no prediction example on the page:
preds <- predict(model, test.iter)
So, my specific problem is: if I build my model using the code on the page, how can I read my test set and predict its labels for the evaluation process? My test set and training set are in this format.
Thank you for your help
It actually works exactly as you explained. You just call predict with model and iterator:
preds = predict(model, test.iter)
The only trick here is that the predictions are displayed column-wise. By that I mean, if you take the whole sample you are referring to, execute it and add the following lines:
test.iter <- CustomCSVIter$new(iter = NULL, data.csv = "mnist_train.csv", data.shape = 28, batch.size = batch.size)
preds = predict(model, test.iter)
preds[,1] # index of the sample to see in the column position
You receive:
[1] 5.882561e-11 2.826923e-11 7.873914e-11 2.760162e-04 1.221306e-12 9.997239e-01 4.567645e-11 3.177564e-08 1.763889e-07 3.578671e-09
This shows the softmax output for the 1st element of the training set. If you try to print everything by just writing preds, you will see only empty values because of the RStudio print limit of 1000; the real data will have no chance to appear.
Notice that I reuse the training data for prediction. I do so because I don't want to adjust the iterator's code, which would need to be able to consume data both with and without a label in front (training and test sets). In a real-world scenario you would need to adjust the iterator so that it works with and without a label.
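Since the predictions come back with one column per sample, the predicted label for each sample is the row with the largest softmax value in that column. A minimal sketch in base R, using a small made-up matrix in place of the real predict output and assuming that row i corresponds to class i - 1 (as with the MNIST digits 0 to 9):

```r
# Toy stand-in for the matrix returned by predict(model, test.iter):
# one row per class, one column per sample (values are made up).
preds <- matrix(c(0.1, 0.7, 0.2,
                  0.8, 0.1, 0.1,
                  0.2, 0.2, 0.6), nrow = 3)

# Row index of the largest probability in each column, shifted to 0-based labels
pred_label <- apply(preds, 2, which.max) - 1

pred_label  # 1 0 2
```

The resulting vector can be compared with the true labels of the test set to compute accuracy.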

(R) function: object not found: environment depth fine?

I'm puzzled by a function error & would appreciate some insight.
The function, very briefly, automates the multiple processes involved in Boosted Regression Trees using gbm.step & other gbm's.
"gbm.auto" <- function (grids, samples, 3 parameters) {
starts 2 counters, require(gbm), does various small processing jobs with grids & samples
for parameter 1{
for parameter 2{
for parameter 3{
Runs 2 BRTs per parameter-combination loop, generates & iteratively updates a 'best' BRT for each, adds to counters. Extensive use of samples.
}}}
closes the loops, function continues as the first } is still open.
The next BRT can't find samples, even though it's at the same environment depth (1?) as the pre-loop processing jobs which used it successfully. Furthermore, adding "globalsamples<<-samples" after the }}} loop successfully saves the object, suggesting that samples is still available. Adding env1,2 & 3<<-environment() before the {{{ loop, within it & after it results in Environment for all three. Also suggesting it's all the same function environment & samples should be available.
What am I missing here? Thanks in advance!
Edit: exact message:
Error in eval(expr, envir, enclos) : object 'samples' not found
Function - loads removed & compacted but still gives same error message:
"gbm.auto" <-
function (samples, expvar, resvar, tc, lr, bf)
{ # open function
require(gbm)
require(dismo)
# create binary (0/1) response variable, for bernoulli BRTs
samples$brv <- ifelse(samples[resvar] > 0, 1, 0)
brvcol <- which(colnames(samples)=="brv") # brv column number for BRT
for(j in tc){ # permutations of tree complexity
for(k in lr){ # permutations of learning rate
for(l in bf){ # permutations of bag fraction
Bin_Best_Model<- gbm.step(data=samples,gbm.x = expvar, gbm.y = brvcol, family = "bernoulli", tree.complexity = j, learning.rate = k, bag.fraction = l)
}}} # close loops, producing all BRT/GBM objects & continue through model selection
Bin_Best_Simp_Check <- gbm.simplify(Bin_Best_Model) # simplify model
# if best number of variables to remove isn't 0 (i.e. it's worth simplifying), re-run the best model (Bin_Best_Model, using gbm.call to get its values)
# with just-calculated best number of variables to remove, removed. gbm.x asks which number of drops has the minimum mean (lowest point on the line)
# & that calls up the list of predictor variables with those removed, from $pred.list
if(min(Bin_Best_Simp_Check$deviance.summary$mean) < 0)
assign("Bin_Best_Simp", gbm.step(data = samples,
gbm.x = Bin_Best_Simp_Check$pred.list[[which.min(Bin_Best_Simp_Check$deviance.summary$mean)]],
gbm.y = brvcol, family = "bernoulli", tree.complexity = j, learning.rate = k, bag.fraction = l))
}
Read in data:
mysamples<-data.frame(response=round(sqrt(rnorm(5000, mean= 2.5, sd=1.5)^2)),
depth=sqrt(rnorm(5000, mean= 35, sd=24)^2),
temp=rnorm(5000, mean= 15, sd=1.2),
sal=rnorm(5000, mean= 34, sd=0.34))
Run this: gbm.auto(expvar=c(2,3,4),resvar=1,samples=mysamples,tc=2,lr=0.00000000000000001,bf=0.5)
Problem now: this causes a different error because my fake data are somehow wrong. ARGHG!
Edit: rounded the response data to integers and kept shrinking the learning rate until it runs. If it doesn't work for you, add zeroes until it does.
Edit: so this worked on my computer but reading it back to a clean sheet from online fails on a DIFFERENT count:
Error in var(cv.cor.stats, use = "complete.obs") :
no complete element pairs
In cor(y_i, u_i) : the standard deviation is zero
Is it allowed to attach or link to a csv of a small clip of my data? I'm currently burrowing deeper & deeper into bugfixing problems created by using fake data which I'm only using for this question, & thus getting off topic from the actual problem. Exasperation mode on!
Cheers
Edit2: if this is allowed: 1000row 4column csv link here: https://drive.google.com/file/d/0B6LsdZetdypkaC1WYXpKU3ZScjQ

rgdal efficiently reading large multiband rasters

I am working on an image classification script in R using the rgdal package. The raster in question is a PCIDSK file with 28 channels: a training data channel, a validation data channel, and 26 spectral data channels. The objective is to populate a data frame containing the values of each pixel which is not zero in the training data channel, plus the associated spectral values in the 26 bands.
In Python/Numpy, I can easily import all the bands for the entire image into a multi-dimensional array. In R, however, due to memory limitations the only option seems to be importing this data block by block, which is very slow:
library(rgdal)
raster = "image.pix"
training_band = 2
validation_band = 1
BlockWidth = 500
BlockHeight = 500
# Get some metadata about the whole raster
myinfo = GDALinfo(raster)
ysize = myinfo[[1]]
xsize = myinfo[[2]]
numbands = myinfo[[3]]
# Iterate through the image in blocks and retrieve the training data
column = 0
training_data = NULL
while(column < xsize){
if(column + BlockWidth > xsize){
BlockWidth = xsize - column
}
row = 0
while(row < ysize){
if(row + BlockHeight > ysize){
BlockHeight = ysize - row
}
# Do stuff here
myblock = readGDAL(raster, region.dim = c(BlockHeight,BlockWidth), offset = c(row, column), band = c(training_band,3:numbands), silent = TRUE)
blockdata = matrix(NA, dim(myblock)[1], dim(myblock)[2])
for(i in 1:(dim(myblock)[2])){
bandname = paste("myblock", names(myblock)[i], sep="$")
blockdata[,i]= as.matrix(eval(parse(text=bandname)))
}
blockdata = as.data.frame(blockdata)
blockdata = subset(blockdata, blockdata[,1] > 0)
if (dim(blockdata)[1] > 0){
training_data = rbind(training_data, blockdata)
}
row = row + BlockHeight
}
column = column + BlockWidth
}
remove(blockdata, myblock, BlockHeight, BlockWidth, row, column)
Is there a faster/better way of doing the same thing without running out of memory?
The next step after this training data is collected is to create the classifier (randomForest package) which also requires a lot of memory, depending on the number of trees requested. This brings me to my second problem, which is that creating a forest of 500 trees is not possible given the amount of memory already occupied by the training data:
myformula = formula(paste("as.factor(V1) ~ V3:V", dim(training_data)[2], sep=""))
r_tree = randomForest(formula = myformula, data = training_data, ntree = 500, keep.forest=TRUE)
Is there a way to allocate more memory? Am I missing something? Thanks...
[EDIT]
As suggested by Jan, using the "raster" package is much faster; however as far as I can tell, it does not solve the memory problem as far as gathering the training data is concerned because it eventually needs to be in a dataframe, in memory:
library(raster)
library(randomForest)
# Set some user variables
fn = "image.pix"
training_band = 2
validation_band = 1
# Get the training data
myraster = stack(fn)
training_class = subset(myraster, training_band)
training_class[training_class == 0] = NA
training_class = Which(training_class != 0, cells=TRUE)
training_data = extract(myraster, training_class)
training_data = as.data.frame(training_data)
So while this is much faster (and takes less code), it still does not solve the issue of not having enough free memory to create the classifier... Is there some "raster" package function that I have not found that can accomplish this? Thanks...
Check out the raster package. It provides a handy wrapper for rgdal that does not load the data into memory.
http://raster.r-forge.r-project.org/
Hopefully this helps.
The 'raster' package deals with basic spatial raster (grid) data access and manipulation. It defines raster classes; can deal with very large files (stored on disk); and includes standard raster functions such as overlay, aggregation, and merge.
The purpose of the 'raster' package is to provide easy to use functions for raster manipulation and analysis. These include high level functions such as overlay, merge, aggregate, projection, resample, distance, polygon to raster conversion. All these functions work for very large raster datasets that cannot be loaded into memory. In addition, the package provides lower level functions such as row by row reading and writing (to many formats via rgdal) for building other functions.
By using the Raster package you can avoid filling your memory before using randomForest.
[EDIT] To solve the memory problem with randomForest, it may help to grow the individual trees within the random forest on subsamples (of size << n) rather than bootstrap samples (of size n).
I think the key here is this: "a data frame containing the values of each pixel which is not zero in the training data channel". To find out whether the resulting data.frame is small enough to hold in memory, you could read just that band, trim it to only the non-zero values, and then try to create a data.frame with that many rows and the total number of columns you want.
Can you run this?
training_band = 2
df = readGDAL("image.pix", band = training_band)
df = as.data.frame(df[!df[,1] == 0, ])
Then you could populate the data.frame's columns one by one by reading each band separately and trimming as for the training band.
If that data.frame is too big then you're stuck - I don't know if randomForest can use the memory-mapped data objects in "ff", but it might be worth trying.
EDIT: some example code, and note that raster gives you memory-mapped access but the problem is whether randomForest can use memory mapped data structures. You can possibly read in only the data you need, one band at a time - you would want to try to build the full data.frame first, rather than append columns.
Also, if you can generate the full data.frame from the start then you'll know if it should work. By rbind()ing your way through as your code does you need increasingly larger chunks of contiguous memory, and that might be avoidable.
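The idea of allocating the full table up front and filling it one band at a time could be sketched as below. This is a minimal sketch in base R: `collect_training` and `read_band` are hypothetical names, where `read_band(i)` stands for whatever function returns all pixel values of band i as a plain vector (for instance a wrapper around readGDAL or the raster package).

```r
# Build the training table with a single allocation, then fill it band by band.
# `read_band` is a hypothetical function(i) returning the values of band i.
collect_training <- function(read_band, training_band, predictor_bands) {
  cls <- read_band(training_band)
  keep <- which(cls != 0)              # which() also drops NA pixels
  out <- matrix(NA_real_, nrow = length(keep),
                ncol = 1 + length(predictor_bands),
                dimnames = list(NULL,
                                c("class", paste0("band", predictor_bands))))
  out[, "class"] <- cls[keep]
  rm(cls)                              # free the full band before reading more
  for (b in predictor_bands) {
    # Only one full band is held in memory at a time
    out[, paste0("band", b)] <- read_band(b)[keep]
  }
  as.data.frame(out)
}
```

If the initial matrix() call fails, the complete training set will not fit in memory and no amount of block-wise reading will change that; if it succeeds, the loop never needs more than one extra band at a time.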
