rgdal efficiently reading large multiband rasters - r

I am working on an image classification script in R using the rgdal package. The raster in question is a PCIDSK file with 28 channels: a training data channel, a validation data channel, and 26 spectral data channels. The objective is to populate a data frame containing the values of each pixel which is not zero in the training data channel, plus the associated spectral values in the 26 bands.
In Python/Numpy, I can easily import all the bands for the entire image into a multi-dimensional array, however, due to memory limitations the only option in R seems to be importing this data block by block, which is very slow:
library(rgdal)
raster = "image.pix"
training_band = 2
validation_band = 1
BlockWidth = 500
BlockHeight = 500
# Get some metadata about the whole raster
myinfo = GDALinfo(raster)
ysize = myinfo[[1]]
xsize = myinfo[[2]]
numbands = myinfo[[3]]
# Iterate through the image in blocks and retrieve the training data
column = 0
training_data = NULL
while(column < xsize){
if(column + BlockWidth > xsize){
BlockWidth = xsize - column
}
row = 0
while(row < ysize){
if(row + BlockHeight > ysize){
BlockHeight = ysize - row
}
# Do stuff here
myblock = readGDAL(raster, region.dim = c(BlockHeight,BlockWidth), offset = c(row, column), band = c(training_band,3:numbands), silent = TRUE)
blockdata = matrix(NA, dim(myblock)[1], dim(myblock)[2])
for(i in 1:(dim(myblock)[2])){
bandname = paste("myblock", names(myblock)[i], sep="$")
blockdata[,i]= as.matrix(eval(parse(text=bandname)))
}
blockdata = as.data.frame(blockdata)
blockdata = subset(blockdata, blockdata[,1] > 0)
if (dim(blockdata)[1] > 0){
training_data = rbind(training_data, blockdata)
}
row = row + BlockHeight
}
column = column + BlockWidth
}
remove(blockdata, myblock, BlockHeight, BlockWidth, row, column)
Is there a faster/better way of doing the same thing without running out of memory?
The next step after this training data is collected is to create the classifier (randomForest package) which also requires a lot of memory, depending on the number of trees requested. This brings me to my second problem, which is that creating a forest of 500 trees is not possible given the amount of memory already occupied by the training data:
myformula = formula(paste("as.factor(V1) ~ V3:V", dim(training_data)[2], sep=""))
r_tree = randomForest(formula = myformula, data = training_data, ntree = 500, keep.forest=TRUE)
Is there a way to allocate more memory? Am I missing something? Thanks...
[EDIT]
As suggested by Jan, using the "raster" package is much faster; however as far as I can tell, it does not solve the memory problem as far as gathering the training data is concerned because it eventually needs to be in a dataframe, in memory:
library(raster)
library(randomForest)
# Set some user variables
fn = "image.pix"
training_band = 2
validation_band = 1
# Get the training data
myraster = stack(fn)
training_class = subset(myraster, training_band)
training_class[training_class == 0] = NA
training_class = Which(training_class != 0, cells=TRUE)
training_data = extract(myraster, training_class)
training_data = as.data.frame(training_data)
So while this is much faster (and takes less code), it still does not solve the issue of not having enough free memory to create the classifier... Is there some "raster" package function that I have not found that can accomplish this? Thanks...

Check out the Raster package. The Raster package provides a handy wrapper for Rgdal without loading it into memory.
http://raster.r-forge.r-project.org/
Hopefully this help.
The 'raster' package deals with basic
spatial raster (grid) data access and
manipulation. It defines raster
classes; can deal with very large
files (stored on disk); and includes
standard raster functions such as
overlay, aggregation, and merge.
The purpose of the 'raster' package is
to provide easy to use functions for
raster manipulation and analysis.
These include high level functions
such as overlay, merge, aggregate,
projection, resample, distance,
polygon to raster conversion. All
these functions work for very large
raster datasets that cannot be loaded
into memory. In addition, the package
provides lower level functions such as
row by row reading and writing (to
many formats via rgdal) for building
other functions.
By using the Raster package you can avoid filling your memory before using randomForest.
[EDIT] To solve the memory problem with randomForest maybe it helps if you could learn the individual trees within the random forest on subsamples (of size << n) rather than bootstrap samples (of size n).

I think the key here is this: " a data frame containing the values of each pixel which is not zero in the training data channel". If the resulting data.frame is small enough to hold in memory you could determine this by reading just that band, then trimming to only those non-zero values, then try to create a data.frame with that many rows and the total number of colums you want.
Can you run this?
training_band = 2
df = readGDAL("image.pix", band = training_band)
df = as.data.frame(df[!df[,1] == 0, ])
Then you could populate the data.frame's columns one by one by reading each band separately and trimming as for the training band.
If that data.frame is too big then you're stuck - I don't know if randomForest can use the memory-mapped data objects in "ff", but it might be worth trying.
EDIT: some example code, and note that raster gives you memory-mapped access but the problem is whether randomForest can use memory mapped data structures. You can possibly read in only the data you need, one band at a time - you would want to try to build the full data.frame first, rather than append columns.
Also, if you can generate the full data.frame from the start then you'll know if it should work. By rbind()ing your way through as your code does you need increasingly larger chunks of contiguous memory, and that might be avoidable.

Related

Normalising the data by scale() function changes the correlation in R

I have some data in data frame format and I want to see the correlation between two groups of these data (observation and estimation) in the original form, and in the normalized forms. The data is available here data. Here are the codes:
joint_df_estimation_observation <- data.frame(df_estimation = as.vector(as.matrix(rowSums(estimation_data))), df_observation = rowSums(observation_data))
joint_df_estimation_observation
cor.test(joint_df_estimation_observation$df_estimation, joint_df_estimation_observation$df_observation)
and I got the results:
normalised_estimation_data =
as.data.frame(scale(as.matrix(t(as.matrix(estimation_data))), scale = TRUE))
normalised_estimation_data = as.data.frame(t(as.matrix(normalised_estimation_data)))
normalised_observation_data = as.data.frame(scale(as.matrix(t(as.matrix(observation_data))), scale = TRUE))
normalised_observation_data = as.data.frame(t(as.matrix(normalised_observation_data)))
joint_df_estimation_observation_normalisation <- data.frame(df_estimation = as.vector(as.matrix(rowSums(normalised_estimation_data))), df_observation = rowSums(normalised_observation_data))
joint_df_estimation_observation_normalisation
cor.test(joint_df_estimation_observation_normalisation$df_estimation, joint_df_estimation_observation_normalisation$df_observation)
and I got the result
I do not understand why the correlation is so much different? I thought that the correlation will remain same even if the data are normalized. Is there anything wrong that I have done? I have considered the case there maybe NAs that are generated if there is a constant value across all columns for some rows in the original data, which caused this issue. However, I have removed them and It actually does not change the results.

Consensus clustering with diceR package

I am supposed to perform a combined K-means + Gaussian mixture Models to determine a set of consensus clusters for a fixes number of clusters (k = 4). My data is composed of 231 cells from 4 different types of tumor which have a total of 19'177 variables (genes in this case).
I have never tried to perform this and I tried to follow the instructions from this R package : https://search.r-project.org/CRAN/refmans/diceR/html/consensus_cluster.html
However I must have done something wrong since when I try to run the code:
cc <- consensus_cluster(data, nk = 4, algorithms =c("gmm", "km"), progress = F )
it takes way too much time and ends up saying this error:
Error: cannot allocate vector of size 11.0 Gb
So clearly my generated vector is too heavy and I must have understood things wrong in the tutorial.
Is someone familiar with diceR package and could explain to me if there is a way to make it work?
The consensus_cluster during it's execution "eats up" memory of R session. You have so many variables that their handling cannot be allocated in the memory.
So you have two choices: increase physical memory or use not full data, but its partial sample. Let's assume that physical memory increase is not feasible. Then you should use prep.data = "sample" option. However you'll need to wait. I model data and for GMM it was 8 hours to wait.
Please see below:
library(diceR)
observ = 23
variables = 19177
dat <- matrix(rnorm(observ * variables), ncol = variables)
cc <- consensus_cluster(dat, nk = 4, algorithms =c("gmm", "km"), progress = TRUE,
prep.data = "sample")
Output (was not so patient to wait):
Clustering Algorithm 1 of 2: GMM (k = 4) [---------------------------------] 1% eta: 8h

Issue with h2o Package in R using subsetted dataframes leading to near perfect prediction accuracy

I have been stumped on this problem for a very long time and cannot figure it out. I believe the issue stems from subsets of data.frame objects retaining information of the parent but I also feel it's causing issues when training h2o.deeplearning models on what I think is just my training set (though this may not be true). See below for sample code. I included comments to clarify what I'm doing but it's fairly short code:
dataset = read.csv("dataset.csv")[,-1] # Read dataset in but omit the first column (it's just an index from the original data)
y = dataset[,1] # Create response
X = dataset[,-1] # Create regressors
X = model.matrix(y~.,data=dataset) # Automatically create dummy variables
y=as.factor(y) # Ensure y has factor data type
dataset = data.frame(y,X) # Create final data.frame dataset
train = sample(length(y),length(y)/1.66) # Create training indices -- A boolean
test = (-train) # Create testing indices
h2o.init(nthreads=2) # Initiate h2o
# BELOW: Create h2o.deeplearning model with subset of dataset.
mlModel = h2o.deeplearning(y='y',training_frame=as.h2o(dataset[train,,drop=TRUE]),activation="Rectifier",
hidden=c(6,6),epochs=10,train_samples_per_iteration = -2)
predictions = h2o.predict(mlModel,newdata=as.h2o(dataset[test,-1])) # Predict using mlModel
predictions = as.data.frame(predictions) # Convert predictions to dataframe object. as.vector() caused issues for me
predictions = predictions[,1] # Extract predictions
mean(predictions!=y[test])
The problem is that if I evaluate this against my test subset I get almost 0% error:
[1] 0.0007531255
Has anyone encountered this issue? Have an idea of how to alleviate this problem?
It will be more efficient to use the H2O functions to load the data and split it.
data = h2o.importFile("dataset.csv")
y = 2 #Response is 2nd column, first is an index
x = 3:(ncol(data)) #Learn from all the other columns
data[,y] = as.factor(data[,y])
parts = h2o.splitFrame(data, 0.8) #Split 80/20
train = parts[[1]]
test = parts[[2]]
# BELOW: Create h2o.deeplearning model with subset of dataset.
mlModel = h2o.deeplearning(x=x, y=y, training_frame=train,activation="Rectifier",
hidden=c(6,6),epochs=10,train_samples_per_iteration = -2)
h2o.performance(mlModel, test)
It is hard to say what the problem with your original code is, without seeing the contents of dataset.csv and being able to try it. My guess is that train and test are not being split, and it is actually being trained on the test data.

Conditional simulation (with Kriging) in R with parallelization?

I am using gstat package in R to generate sequential gaussian simulations. My pc have 4 cores and I tried to parallelize the krige() function using the parallel package following the script provided by Guzmán to answer the question How to achieve parallel Kriging in R to speed up the process?.
The resulting simulations are, however, different from the ones using only one core at the time (no parallelization). It looks a geometry problem, but i can't find out how to fix it.
Next i will provide an example (using 4 cores) generating 2 simulations. You will see that after running the code, the simulated maps derived from parallelization show some artifacts (like vertical lines), and are different from the ones using only one core at the time.
The code needs the libraries gstat, sp, raster, parallel and spatstat. If any of the lines library() do not work, run install.packages() first.
library(gstat)
library(sp)
library(raster)
library(parallel)
library(spatstat)
# create a regular grid
nx=100 # number of columns
ny=100 # number of rows
srgr <- expand.grid(1:ny, nx:1)
names(srgr) <- c('x','y')
gridded(srgr)<-~x+y
# generate a spatial process (unconditional simulation)
g<-gstat(formula=z~x+y, locations=~x+y, dummy=T, beta=15, model=vgm(psill=3, range=10, nugget=0,model='Exp'), nmax=20)
sim <- predict(g, newdata=srgr, nsim=1)
r<-raster(sim)
# generate sample data (Poisson process)
int<-0.02
rpp<-rpoispp(int,win=owin(c(0,nx),c(0,ny)))
df<-as.data.frame(rpp)
coordinates(df)<-~x+y
# assign raster values to sample data
dfpp <-raster::extract(r,df,df=TRUE)
smp<-cbind(coordinates(df),dfpp)
smp<-smp[complete.cases(smp), ]
coordinates(smp)<-~x+y
# fit variogram to sample data
vs <- variogram(sim1~1, data=smp)
m <- fit.variogram(vs, vgm("Exp"))
plot(vs, model = m)
# generate 2 conditional simulations with one core processor
one <- krige(formula = sim1~1, locations = smp, newdata = srgr, model = m,nmax=12,nsim=2)
# plot simulation 1 and 2: statistics (min, max) are ok, simulations are also ok.
spplot(one["sim1"], main = "conditional simulation")
spplot(one["sim2"], main = "conditional simulation")
# generate 2 conditional with parallel processing
no_cores<-detectCores()
cl<-makeCluster(no_cores)
parts <- split(x = 1:length(srgr), f = 1:no_cores)
clusterExport(cl = cl, varlist = c("smp", "srgr", "parts","m"), envir = .GlobalEnv)
clusterEvalQ(cl = cl, expr = c(library('sp'), library('gstat')))
par <- parLapply(cl = cl, X = 1:no_cores, fun = function(x) krige(formula=sim1~1, locations=smp, model=m, newdata=srgr[parts[[x]],], nmax=12, nsim=2))
stopCluster(cl)
# merge all parts
mergep <- maptools::spRbind(par[[1]], par[[2]])
mergep <- maptools::spRbind(mergep, par[[3]])
mergep <- maptools::spRbind(mergep, par[[4]])
# create SpatialPixelsDataFrame from mergep
mergep <- SpatialPixelsDataFrame(points = mergep, data = mergep#data)
# plot mergep: statistics (min, max) are ok, but simulated maps show "vertical lines". i don't understand why.
spplot(mergep[1], main = "conditional simulation")
spplot(mergep[2], main = "conditional simulation")
I have tried your code and I think the problem lies with the way you split the work:
parts <- split(x = 1:length(srgr), f = 1:no_cores)
On my dual core machine that meant that all odd indices in srgr where handled by one process and all even indices where handled by the other process. This is probably the source of the vertical artifacts you are seeing.
A better way should be to split the data into consecutive chunks like this:
parts <- parallel::splitIndices(length(srgr), no_cores)
Using this splitting with the rest of your code I get results that look comparable to the sequential ones. At least to my untrained eyes ...
Original answer, which is only a minor effect. It still might make sense to fix the seed with set.seed for sequential and clusterSetRNGStream for parallel processing.
From what I have read about Kriging it requires you to draw random numbers. These random numbers will be different with parallel processing. See section 6 of the parallel vignette (vignette("parallel")) for more details.

R problem with randomForest classification with raster package

I am having an issue with randomForest and the raster package. First, I create the classifier:
library(raster)
library(randomForest)
# Set some user variables
fn = "image.pix"
outraster = "classified.pix"
training_band = 2
validation_band = 1
original_classes = c(125,126,136,137,151,152,159,170)
reclassd_classes = c(122,122,136,137,150,150,150,170)
# Get the training data
myraster = stack(fn)
training_class = subset(myraster, training_band)
# Reclass the training data classes as required
training_class = subs(training_class, data.frame(original_classes,reclassd_classes))
# Find pixels that have training data and prepare the data used to create the classifier
is_training = Which(training_class != 0, cells=TRUE)
training_predictors = extract(myraster, is_training)[,3:nlayers(myraster)]
training_response = as.factor(extract(training_class, is_training))
remove(is_training)
# Create and save the forest, use odd number of trees to avoid breaking ties at random
r_tree = randomForest(training_predictors, y=training_response, ntree = 201, keep.forest=TRUE) # Runs out of memory, does not allow more trees than this...
remove(training_predictors, training_response)
Up to this point, all is good. I can see that the forest was created correctly by looking at the error rates, confusion matrix, etc. When I try to classify some data, however, I run into trouble with the following, which returns all NA's in predictions:
# Classify the whole image
predictor_data = subset(myraster, 3:nlayers(myraster))
layerNames(predictor_data) = layerNames(myraster)[3:nlayers(myraster)]
predictions = predict(predictor_data, r_tree, type='response', progress='text')
And gives this warning:
Warning messages:
1: In `[<-.factor`(`*tmp*`, , value = c(1, 1, 1, 1, 1, 1, ... :
invalid factor level, NAs generated
(keeps going like this)...
However, calling predict.randomForest directly works fine and returns the expected predictions (this is not a good option for me because the image is large, and I cannot store the whole matrix in memory):
# Classify the whole image and write it to file
predictor_data = subset(myraster, 3:nlayers(myraster))
layerNames(predictor_data) = layerNames(myraster)[3:nlayers(myraster)]
predictor_data = extract(predictor_data, extent(predictor_data))
predictions = predict(r_tree, newdata=predictor_data)
How can I get it to work directly with the "raster" version? I know that this is possible, as shown in the examples of predict{raster}.
You could try nesting predict.randomForest within the writeRaster function and write the matrix as a raster in chunks as per the pdf included in the raster package. Before that, try the argument 'na.rm=TRUE' when calling predict in the raster function. You might also assign dummy values to the NAs in the predict rasters, then later rewriting them as NAs using functions in the raster package.
As for memory problems when calling RFs, I've had a plethora of memory issues dealing with BRTs. They're immense on disk and in memory! (Should a model be more complex than the data?) I've not had them run reliably on 32-bit machines (WinXp or Linux). Sometimes tweaking Windows memory allotment to applications has helped, and moving to Linux has helped more, but I get the most from 64-bit Windows or Linux machines, since they impose a higher (or no) limit on the amount of memory applications can take. You may be able to increase the number of trees you can use by doing this.

Resources