RSNNS neural network prediction for raster image classification in R

I'm trying to harness the power of neural networks for image classification of big rasters using the RSNNS package in R.
As for the data preparation and training of the model, everything works perfectly fine and the accuracies look quite promising.
Subsequently, I'm trying to classify the raster values with the predict function and the trained model. Since I have quite a large amount of data (a raster stack with dimensions 10980 x 10980 x 16), I'm processing it block by block. And here's the problem:
The prediction of the class values is extremely slow. I'm working on a fairly powerful machine (Windows x64, 32 GB RAM, i7 quad-core at 3.4 GHz), but the process still almost literally takes ages. I have already reduced the size of my blocks, but the amount of time needed is still unacceptable. Currently I split the data into blocks of 64 rows each, which results in a total of 172 blocks. Assuming a linear processing time per block (33 minutes in my case!), it would take almost 95 hours to process the whole image. That cannot be right.
I've tried other neural network packages and for instance nnet classifies bigger blocks like these in under one minute.
So please, if you have any pointers on what I'm doing wrong, I'd greatly appreciate it.
Here's a working example similar to my code:
library(RSNNS)
#example data for training and testing
dat <- matrix(runif(702720),ncol = 16)
#example data to classify
rasval <- matrix(runif(11243520),ncol = 16)
dat <- as.data.frame(dat)
#example class labels from 1 to 11
classes <- sample(1:11, nrow(dat), replace = TRUE)
dat$classes <- classes
#shuffle dataset
dat <- dat[sample(nrow(dat)),]
datValues <- dat[,1:16]
datTargets <- decodeClassLabels(dat[,17])
#split dataset
dat <- splitForTrainingAndTest(datValues, datTargets, ratio=0.15)
#normalize data
dat <- normTrainingAndTestSet(dat)
#extract normalization variables
ncolmeans <- attributes(dat$inputsTrain)$normParams$colMeans
ncolsds <- attributes(dat$inputsTrain)$normParams$colSds
#train model
model <- mlp(dat$inputsTrain, dat$targetsTrain, size = 1, learnFunc = "SCG",
             learnFuncParams = c(0, 0, 0, 0), maxit = 400,
             inputsTest = dat$inputsTest, targetsTest = dat$targetsTest)
#normalize raster data
rasval <- sweep(sweep(rasval,2,ncolmeans),2,ncolsds,'/')
#Predict classes ##Problem##
pred <- predict(model,rasval)

Yes, unfortunately RSNNS can be very slow when predicting. The new version 0.4-8 (not on CRAN yet, but you can get it from GitHub) should speed things up a bit, but the general problem is that every row of data needs to be passed separately into the SNNS kernel, and resolving this issue would mean reimplementing some things in the kernel. Not impossible, but some work to do.
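In the meantime, one possible workaround for a small MLP like the one above is to do the forward pass manually with matrix operations: one matrix product per layer instead of one kernel call per row. This is only a rough sketch under several assumptions, namely the default logistic activations of mlp(), and that extractNetInfo() exposes the trained weights in $fullWeightMatrix (rows assumed to index source units, columns target units) and the biases in $unitDefinitions; always verify against predict() on a few rows before trusting it.
sigmoid <- function(x) 1 / (1 + exp(-x))
info  <- extractNetInfo(model)
units <- info$unitDefinitions
W     <- info$fullWeightMatrix
#unit numbers per layer (assumed type labels)
in_id  <- units$unitNo[units$type == "UNIT_INPUT"]
hid_id <- units$unitNo[units$type == "UNIT_HIDDEN"]
out_id <- units$unitNo[units$type == "UNIT_OUTPUT"]
mlp_forward <- function(x) {
  h <- sigmoid(sweep(x %*% W[in_id, hid_id, drop = FALSE], 2, units$unitBias[hid_id], "+"))
  sigmoid(sweep(h %*% W[hid_id, out_id, drop = FALSE], 2, units$unitBias[out_id], "+"))
}
#sanity check against the SNNS kernel on a few rows before using it
stopifnot(isTRUE(all.equal(mlp_forward(rasval[1:5, ]), predict(model, rasval[1:5, ]),
                           check.attributes = FALSE, tolerance = 1e-6)))
pred <- mlp_forward(rasval)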


Using Amelia and decision trees

I have a panel dataset (countries and years) with a lot of missing data, so I've decided to use multiple imputation. The goal is to see the relationship between the proportion of women in management (managerial_value) and total fatal workplace injuries (total_fatal).
From what I've read online, Amelia is the best option for panel data, so I used it like so:
amelia_data <- amelia(spdata, ts = "year", cs = "country", polytime = 1,
                      intercs = FALSE)
where spdata is my original dataset.
This imputation process worked, but I'm unsure of how to proceed with forming decision trees using the imputed data (an object of class 'amelia').
I originally tried creating a function (amelia2df) to turn each of the 5 imputed datasets into a data frame:
amelia2df <- function(amelia_data, which_imp = 1) {
  stopifnot(inherits(amelia_data, "amelia"), is.numeric(which_imp))
  imps <- amelia_data$imputations[[which_imp]]
  as.data.frame(imps)
}
one_amelia <- amelia2df(amelia_data, which_imp = 1)
two_amelia <- amelia2df(amelia_data, which_imp = 2)
three_amelia <- amelia2df(amelia_data, which_imp = 3)
four_amelia <- amelia2df(amelia_data, which_imp = 4)
five_amelia <- amelia2df(amelia_data, which_imp = 5)
where one_amelia is the data frame for the first imputed dataset, two_amelia is the second, and so on.
I then combined them using rbind():
total_amelia <- rbind(one_amelia, two_amelia, three_amelia, four_amelia, five_amelia)
And used the new combined dataset total_amelia to construct a decision tree:
set.seed(300)
tree_data <- total_amelia
I_index <- sample(1:nrow(tree_data), size = 0.75*nrow(tree_data), replace=FALSE)
I_train <- tree_data[I_index,]
I_test <- tree_data[-I_index,]
fatal_tree <- rpart(total_fatal ~ managerial_value, I_train)
rpart.plot(fatal_tree)
fatal_tree
This "works" as in it doesn't produce an error, but I'm not sure that it is appropriately using the imputed data.
I found a couple of resources explaining how to apply least squares, logit, etc., but nothing about decision trees. I'm under the impression I'd need the 5 imputed datasets to be combined into one data frame, but I have not been able to find a way to do that.
I've also looked into Zelig and bind_rows but haven't found anything that returns one data frame that I can then use to form a decision tree.
Any help would be appreciated!
As already indicated by @Noah, you would set up the multiple imputation workflow differently than you currently do.
Multiple imputation is not really a tool to improve your results or to make them more correct.
It is a method that enables you to quantify the uncertainty that the missing data adds to your analysis.
All the datasets created by multiple imputation are plausible imputations; because of the uncertainty, you don't know which one is correct.
You would therefore use multiple imputation the following way:
Create your m imputed datasets
Build your trees on each imputed dataset separately
Do your analysis on each tree separately
In your final paper, you can then state how much uncertainty is caused by the missing values/imputation
This means you get, e.g., 5 different analysis results for m = 5 imputed datasets. At first this looks confusing, but it enables you to give bounds between which the correct result probably lies. And if you get completely different results for each imputed dataset, you know there is too much uncertainty caused by the missing values to give reliable results.
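A minimal sketch of steps 2 and 3, reusing the amelia_data object, rpart call, and variable names from the question (the value 0.3 below is just an illustrative data point):
library(rpart)
#one tree per imputed dataset
trees <- lapply(amelia_data$imputations, function(imp) {
  rpart(total_fatal ~ managerial_value, data = as.data.frame(imp))
})
#compare the m trees on the same new data point to see how much the
#imputation uncertainty moves the prediction
newpoint <- data.frame(managerial_value = 0.3)
sapply(trees, predict, newdata = newpoint)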

number of trees in h2o.gbm

In the traditional gbm package, we can use
predict.gbm(model, newdata = ..., n.trees = ...)
so that I can compare results for different numbers of trees on the test data.
In h2o.gbm, although there is an n.tree argument to set, it seems to have no effect on the result; everything comes out the same as the default model:
h2o.test.pred <- as.vector(h2o.predict(h2o.gbm.model, newdata = test.frame, n.tree = 100))
R2(h2o.test.pred, test.mat$y)
[1] -0.00714109
h2o.test.pred <- as.vector(h2o.predict(h2o.gbm.model, newdata = test.frame, n.tree = 10))
R2(h2o.test.pred, test.mat$y)
[1] -0.00714109
Does anybody have a similar problem? How can it be solved? h2o.gbm is much faster than gbm, so if it could give detailed results for each number of trees, that would be great.
I don't think H2O supports what you are describing.
BUT, if what you are after is to get the performance against the number of trees used, that can be done at model building time.
library(h2o)
h2o.init()
iris <- as.h2o(iris)
parts <- h2o.splitFrame(iris, c(0.8, 0.1))
train <- parts[[1]]
valid <- parts[[2]]
test <- parts[[3]]
m <- h2o.gbm(1:4, 5, train,
             validation_frame = valid,
             ntrees = 100,              #max desired
             score_tree_interval = 1)
h2o.scoreHistory(m)
plot(m)
The score history will show the evaluation after adding each new tree. plot(m) will show a chart of this. Looks like 20 is plenty for iris!
BTW, if your real purpose was to find out the optimum number of trees to use, then switch early stopping on, and it will do that automatically for you. (Just make sure you are using both validation and test data frames.)
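For example, a sketch of that early-stopping variant (the metric and tolerance values below are illustrative, not recommendations):
m2 <- h2o.gbm(1:4, 5, train,
              validation_frame = valid,
              ntrees = 1000,                #generous upper bound
              score_tree_interval = 1,
              stopping_rounds = 5,          #stop after 5 scoring rounds without improvement
              stopping_metric = "logloss",
              stopping_tolerance = 1e-4)
h2o.scoreHistory(m2)                        #shows where training actually stopped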
As of 3.20.0.6, H2O does support this. The method you are looking for is staged_predict_proba. For classification models it produces predicted class probabilities after each iteration (tree), for every observation in your test frame. For regression models (i.e. when the response is numerical), although not really documented, it produces the actual prediction for every observation in your test frame.
From these predictions it is also easy to compute various performance metrics (AUC, R2, etc.), assuming that's what you're after.
Python API:
staged_predict_proba = model.staged_predict_proba(test)
R API:
staged_predict_proba <- h2o.staged_predict_proba(model, prostate.test)
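A rough sketch of turning these staged predictions into a per-iteration metric, for the regression case. It assumes the staged frame holds one column of predictions per tree (verify the layout on your own model first); AGE is a hypothetical numeric response column in prostate.test:
staged <- as.matrix(as.data.frame(h2o.staged_predict_proba(model, prostate.test)))
actual <- as.vector(prostate.test[["AGE"]])
rss <- colSums((actual - staged)^2)
r2 <- 1 - rss / sum((actual - mean(actual))^2)
plot(r2, type = "l", xlab = "number of trees", ylab = "R2 on the test frame")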

KNN using R - in Production

I have some dummy data that consists of 99 rows; one column is free-text data and one column is the category. Each row has been categorised as either Customer Service or Not Customer Service related.
I passed the 99 rows of data into my R script, created a corpus, cleaned and parsed my data, and converted it to a DocumentTermMatrix. I then converted my DTM to a data frame to make it easier to view, and bound the category to this new data frame. I then split it roughly 50/50, so 50 rows went into my training set and 49 into my testing set. I also pulled out the category.
train <- sample(nrow(mat.df), ceiling(nrow(mat.df) * .5))
test <- (1:nrow(mat.df))[- train]
cl <- mat.df[, "category"]
I then created a data frame (modeldata) with the category column stripped out and passed it to my KNN:
library(class)  #knn() lives in the class package
knn.pred <- knn(modeldata[train, ], modeldata[test, ], cl[train])
conf.mat <- table("Predictions" = knn.pred, Actual = cl[test])
conf.mat
I can then work out the accuracy, generate a cross table or export the predictions to test the accuracy of the model.
The bit I am struggling to get my head around at the moment is how to use the model going forward for new data.
So if I then have 10 new rows of free-text data that haven't been manually classified, how do I run the KNN model I have just created to classify this additional data?
Maybe I am just misunderstanding the next step.
Thanks,
The same way you just found the hold-out test performance:
knn.pred.newdata <- knn(modeldata[train, ], completely_new_data, cl[train])
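For the free-text case in the question, completely_new_data has to be built with the same cleaning pipeline and the same vocabulary as the training DTM. A rough sketch, assuming tm was used for the preprocessing; new_texts (a character vector of new documents) and train_terms (the column names of the training DTM) are hypothetical names:
library(tm)
library(class)
new_corpus <- VCorpus(VectorSource(new_texts))
new_corpus <- tm_map(new_corpus, content_transformer(tolower))
new_corpus <- tm_map(new_corpus, removePunctuation)
#project the new documents onto the training vocabulary
new_dtm <- DocumentTermMatrix(new_corpus, control = list(dictionary = train_terms))
new_df <- as.data.frame(as.matrix(new_dtm))
new_df <- new_df[, colnames(modeldata), drop = FALSE]  #align column order with training data
knn.pred.newdata <- knn(modeldata[train, ], new_df, cl[train])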
In a KNN model, your training data is intrinsically part of your model. Since it's just finding the nearest training points, how do you know which those are if you don't have their coordinates?
That said, why do you want to use a KNN model instead of something more modern (SVMs, random forests, boosted trees, neural networks)? KNN models scale extremely poorly with the number of data points.

Random Forest with caret package: Error: cannot allocate vector of size 153.1 Gb

I was trying to build a random forest model for a dataset on Kaggle. I always do machine learning with the caret package. The dataset has 1.5 million+ rows and 46 variables with no missing values (about 150 MB in size); 40+ variables are categorical, and the outcome I am trying to predict is binary. After some pre-processing with dplyr, I started building the model with the caret package, but I got this error message when I tried to run the train function: "Error: cannot allocate vector of size 153.1 Gb". Here is my code:
## load packages
require(tidyr)
require(dplyr)
require(readr)
require(ggplot2)
require(ggthemes)
require(caret)
require(parallel)
require(doParallel)
## prepare for parallel processing
n_Cores <- detectCores()
n_Cluster <- makeCluster(n_Cores)
registerDoParallel(n_Cluster)
## import original datasets
people_Dt <- read_csv("people.csv",col_names = TRUE)
activity_Train <- read_csv("act_train.csv",col_names = TRUE)
### join two sets together and remove variables not to be used
first_Try <- people_Dt %>%
  left_join(activity_Train, by = "people_id") %>%
  select(-ends_with("y")) %>%
  filter(!is.na(outcome))
## try with random forest
in_Tr <- createDataPartition(first_Try$outcome,p=0.75,list=FALSE)
rf_Train <- first_Try[in_Tr,]
rf_Test <- first_Try[-in_Tr,]
## set model cross validation parameters
model_Control <- trainControl(method = "repeatedcv",repeats=2,number=2,allowParallel = TRUE)
rf_RedHat <- train(outcome ~ .,
                   data = rf_Train,
                   method = "rf",
                   tuneLength = 10,
                   importance = TRUE,
                   trControl = model_Control)
My computer is a fairly powerful machine with E3 processors and 32 GB of RAM. I have two questions:
1. Where did a vector as large as 150 GB come from? Is it because of some code I wrote?
2. I cannot get a machine with that much RAM; are there any workarounds so that I can move on with my model building?
the dataset has 1.5 million + rows and 46 variables with no missing values (about 150 mb in size)
To be clear here, you most likely don't need 1.5 million rows to build a model. Instead, you should take a smaller subset that doesn't cause the memory problems. If you are concerned about reducing the size of your sample data, you can run some descriptive stats on the 40 predictors on a smaller set and make sure the behaviour appears to be the same.
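A minimal sketch of that subset-first approach, reusing the objects from the question (the 100,000-row size is an arbitrary starting point):
set.seed(42)
sub_idx <- sample(nrow(rf_Train), 1e5)
rf_Sub <- train(outcome ~ .,
                data = rf_Train[sub_idx, ],
                method = "rf",
                trControl = model_Control)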
The problem is probably related to caret's one-hot encoding of your categorical variables. Since you have a lot of categorical variables, this blows up your dataset in a huge way: one-hot encoding creates a new column for every level of every categorical variable you have.
Maybe you could try something like the h2o package, which handles categorical variables natively, so your dataset does not explode when the model is run.
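A minimal sketch of that route, reusing rf_Train from the question (the ntrees value is just an arbitrary starting point); H2O keeps factor columns as factors instead of one-hot expanding them in R:
library(h2o)
h2o.init()
train_h2o <- as.h2o(rf_Train)
train_h2o$outcome <- as.factor(train_h2o$outcome)  #binary outcome as factor for classification
rf_h2o <- h2o.randomForest(y = "outcome",
                           x = setdiff(names(train_h2o), "outcome"),
                           training_frame = train_h2o,
                           ntrees = 200)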

How to sample rows in the randomForest package

I have a dataset with 1 million rows and 100 columns. randomForest is quite slow for data this big, so I would like to train each tree on a subset of, say, 50,000 rows.
How do I achieve this with the randomForest function? Do I have to hack something together manually? I am not able to find any instructions on this in the vignette.
Do you mean that the sample for each tree should be different?
To start with, I would consider sampling before calling randomForest. Indeed, the fact that you take different samples for each tree could have an impact on the final result, and the importance matrix would probably be partly biased.
You can achieve this by doing the following:
numrow <- nrow(data)
subset <- sample(numrow, 50000)
learn <- data[subset,]
test <- data[-subset,]
model_rf <- randomForest(formula=[...], data=learn, importance=T)
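For completeness, a minimal sketch of the per-tree variant the question actually asks about, using randomForest's sampsize argument, so each tree is grown on its own draw of 50,000 rows; x_data and y_response are hypothetical stand-ins for your predictors and outcome:
library(randomForest)
model_rf <- randomForest(x = x_data, y = y_response,
                         ntree = 500,        #number of trees
                         sampsize = 50000,   #rows drawn for each tree
                         replace = FALSE,
                         importance = TRUE)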
