PCA using raster datasets in R

I have several large rasters that I want to process in a PCA (to produce summary rasters).
I have seen several examples whereby people seem to be simply calling prcomp or princomp. However, when I do this, I get the following error message:
Error in as.vector(data): no method for coercing this S4 class to a vector
Example code:
files <- list.files() # a set of rasters
layers <- stack(files) # using the raster package
pca <- prcomp(layers)
I have tried using a raster brick instead of a stack, but that doesn't seem to fix the issue. What method do I need to provide so that the command can convert the raster data to vector format? I understand that there are ways to sample the raster and run the PCA from that, but I would really like to understand why the above method is not working.
Thanks!

The above method is not working simply because prcomp does not know how to deal with a raster object. It only knows how to deal with vectors, and coercing to vector does not work, hence the error.
What you need to do is read each of your files into a vector, and put each of the rasters in a column of a matrix. Each row will then be a time series of values at a single spatial location, and each column will be all the pixels at a certain time step. Note that the exact spatial coordinates are not needed in this approach. This matrix serves as the input of prcomp.
Reading the files can be done with readGDAL, followed by as.data.frame to cast the spatial data to a regular data.frame.
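For instance, a minimal sketch (assuming single-band GeoTIFFs; readGDAL stores the values of a single-band file in a column named band1):
library(rgdal) # provides readGDAL
files <- list.files(pattern = "\\.tif$") # hypothetical single-band rasters
# one column per raster, one row per pixel
mat <- sapply(files, function(f) as.data.frame(readGDAL(f))$band1)
pca <- prcomp(mat, scale. = TRUE)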

Answer to my own question: I ended up doing something slightly different. Rather than using every raster cell as input (a very large dataset), I took a sample of points, ran the PCA and then saved the output model so that I could make predictions for each grid cell…maybe not the best solution, but it works:
rasters <- stack(myRasters)
sr <- sampleRandom(rasters, 5000) # sample 5000 random grid cells
# run PCA on random sample with correlation matrix
# retx=FALSE means don't save PCA scores
pca <- prcomp(sr, scale=TRUE, retx=FALSE)
# write PCA model to file
dput(pca, file=paste("./climate/", name, "/", name, "_pca.csv", sep=""))
x <- predict(rasters, pca, index=1:6) # create new rasters based on PCA predictions

There is a rasterPCA function in the RStoolbox package: http://bleutner.github.io/RStoolbox/rstbx-docu/rasterPCA.html
For example:
library('raster')
library('RStoolbox')
rasters <- stack(myRasters)
pca1 <- rasterPCA(rasters)
pca2 <- rasterPCA(rasters, nSamples = 5000) # sample 5000 random grid cells
pca3 <- rasterPCA(rasters, norm = FALSE) # without normalization
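If I read the documentation correctly, the returned object bundles the fitted princomp model and the score rasters, so (building on pca1 above) you can inspect them like this:
summary(pca1$model) # loadings and explained variance
plot(pca1$map) # one raster layer per principal component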

Here is a working solution:
library(raster)
filename <- system.file("external/rlogo.grd", package="raster")
r1 <- stack(filename)
pca <- princomp(r1[], cor = TRUE) # r1[] extracts the cell values as a matrix
res <- predict(pca, r1[])
Display result:
r2 <- raster(filename)
r2[] <- res[, 1]
plot(r2)

Yet another option would be to extract the values from the raster stack, i.e.:
rasters <- stack(my_rasters)
values <- getValues(rasters)
pca <- prcomp(values, scale = TRUE)
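To get the scores back onto the grid, a minimal sketch (assuming the stack contains no NA cells, so the rows of values line up with the raster cells):
pc1 <- raster(rasters) # empty raster with the same extent and resolution
values(pc1) <- pca$x[, 1] # scores of the first principal component
plot(pc1)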

Here is another approach that expands on the getValues approach proposed by @Daniel. The result is a raster stack. The index (idx) references non-NA positions so that NA values are accounted for.
library(raster)
r <- stack(system.file("external/rlogo.grd", package="raster"))
r.val <- getValues(r)
idx <- which(complete.cases(r.val)) # cells with no NA in any layer
pca <- princomp(na.omit(r.val), cor=T)
ncomp <- 2 # first two principal components
r.pca <- r[[1:ncomp]]
for(i in 1:ncomp) { r.pca[[i]][idx] <- pca$scores[,i] }
plot(r.pca)

Related

igraph error long vectors not supported yet when trying to create adjacency matrix

I'm trying to perform a social network analysis in R, and I'm having some trouble creating adjacency matrices from very large matrices using the igraph package. One of the main matrices has 10998555876 elements (82 GB), created from a dataset with 176881 rows.
The error I get when running:
adjacency_matrix <- graph.adjacency(one_mode_matrix, mode = "undirected", weighted = TRUE, diag = TRUE)
is as follows:
Error in graph.adjacency.dense(adjmatrix, mode = mode, weighted = weighted, :
long vectors not supported yet: ../../src/include/Rinlinedfuns.h:519
The data is two-mode, so I've had to transpose it to get the one-mode matrix with the units I'm interested in. The code used before to create the matrix is:
graph <- graph.data.frame(data, directed = FALSE) # Making a graph object from the dataframe.
types <- bipartite.mapping(graph)$type
matrix <- as_incidence_matrix(graph, types = types) # Creating a two-mode matrix.
one_mode_matrix <- tcrossprod(matrix) # Transposing to get one-mode matrix.
diag(one_mode_matrix) <- 0
mode(one_mode_matrix) <- "numeric"
adjacency_matrix <- graph.adjacency(one_mode_matrix, mode = "undirected", weighted = TRUE, diag = FALSE) # This is where things break down.
Having done some research, e.g. in this thread https://github.com/igraph/rigraph/issues/255, it looks like a limitation in base R. It seems to me (without being an expert on these things) that igraph is trying to create a dense object that R cannot handle because it is too big. Does anybody know how to handle this issue? Perhaps there are other packages for creating adjacency matrices that would do a better job on a large matrix?
Solution to anybody who might be interested:
I discovered that igraph can handle sparse matrices. Convert the matrix to a sparse matrix using the Matrix package like so:
sparse_matrix <- as(one_mode_matrix, "sparseMatrix")
Then make it into a graph object like this:
g <- graph_from_adjacency_matrix(sparse_matrix)
And revel in all the functionality igraph has to offer.
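A small self-contained sketch of the whole workflow (with a made-up incidence matrix standing in for the real data); note that calling tcrossprod on the sparse incidence matrix keeps the one-mode projection sparse, so the dense 82 GB matrix is never materialized:
library(Matrix)
library(igraph)
set.seed(1)
# made-up two-mode incidence matrix (5 actors x 10 events)
incidence <- Matrix(rbinom(50, 1, 0.3), nrow = 5, sparse = TRUE)
one_mode <- tcrossprod(incidence) # sparse one-mode projection
diag(one_mode) <- 0
g <- graph_from_adjacency_matrix(one_mode, mode = "undirected", weighted = TRUE)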

Conditional simulation (with Kriging) in R with parallelization?

I am using the gstat package in R to generate sequential Gaussian simulations. My PC has 4 cores, and I tried to parallelize the krige() function using the parallel package, following the script provided by Guzmán in answer to the question "How to achieve parallel Kriging in R to speed up the process?".
The resulting simulations are, however, different from the ones obtained using only one core at a time (no parallelization). It looks like a geometry problem, but I can't find out how to fix it.
Below I provide an example (using 4 cores) generating 2 simulations. You will see that after running the code, the simulated maps derived from parallelization show some artifacts (like vertical lines) and are different from the ones obtained using only one core at a time.
The code needs the libraries gstat, sp, raster, parallel and spatstat. If any of the library() lines do not work, run install.packages() first.
library(gstat)
library(sp)
library(raster)
library(parallel)
library(spatstat)
# create a regular grid
nx=100 # number of columns
ny=100 # number of rows
srgr <- expand.grid(1:ny, nx:1)
names(srgr) <- c('x','y')
gridded(srgr) <- ~x+y
# generate a spatial process (unconditional simulation)
g <- gstat(formula=z~x+y, locations=~x+y, dummy=TRUE, beta=15, model=vgm(psill=3, range=10, nugget=0, model='Exp'), nmax=20)
sim <- predict(g, newdata=srgr, nsim=1)
r <- raster(sim)
# generate sample data (Poisson process)
int <- 0.02
rpp <- rpoispp(int, win=owin(c(0,nx), c(0,ny)))
df <- as.data.frame(rpp)
coordinates(df) <- ~x+y
# assign raster values to sample data
dfpp <- raster::extract(r, df, df=TRUE)
smp <- cbind(coordinates(df), dfpp)
smp <- smp[complete.cases(smp), ]
coordinates(smp) <- ~x+y
# fit variogram to sample data
vs <- variogram(sim1~1, data=smp)
m <- fit.variogram(vs, vgm("Exp"))
plot(vs, model = m)
# generate 2 conditional simulations with one core processor
one <- krige(formula = sim1~1, locations = smp, newdata = srgr, model = m,nmax=12,nsim=2)
# plot simulation 1 and 2: statistics (min, max) are ok, simulations are also ok.
spplot(one["sim1"], main = "conditional simulation")
spplot(one["sim2"], main = "conditional simulation")
# generate 2 conditional with parallel processing
no_cores <- detectCores()
cl <- makeCluster(no_cores)
parts <- split(x = 1:length(srgr), f = 1:no_cores)
clusterExport(cl = cl, varlist = c("smp", "srgr", "parts","m"), envir = .GlobalEnv)
clusterEvalQ(cl = cl, expr = c(library('sp'), library('gstat')))
par <- parLapply(cl = cl, X = 1:no_cores, fun = function(x) krige(formula=sim1~1, locations=smp, model=m, newdata=srgr[parts[[x]],], nmax=12, nsim=2))
stopCluster(cl)
# merge all parts
mergep <- maptools::spRbind(par[[1]], par[[2]])
mergep <- maptools::spRbind(mergep, par[[3]])
mergep <- maptools::spRbind(mergep, par[[4]])
# create SpatialPixelsDataFrame from mergep
mergep <- SpatialPixelsDataFrame(points = mergep, data = mergep@data)
# plot mergep: statistics (min, max) are ok, but the simulated maps show "vertical lines". I don't understand why.
spplot(mergep[1], main = "conditional simulation")
spplot(mergep[2], main = "conditional simulation")
I have tried your code and I think the problem lies with the way you split the work:
parts <- split(x = 1:length(srgr), f = 1:no_cores)
On my dual-core machine that meant that all odd indices in srgr were handled by one process and all even indices were handled by the other process. This is probably the source of the vertical artifacts you are seeing.
A better way should be to split the data into consecutive chunks like this:
parts <- parallel::splitIndices(length(srgr), no_cores)
Using this splitting with the rest of your code I get results that look comparable to the sequential ones. At least to my untrained eyes ...
Original answer, which addresses only a minor effect: it still might make sense to fix the seed with set.seed for sequential and clusterSetRNGStream for parallel processing.
From what I have read about Kriging it requires you to draw random numbers. These random numbers will be different with parallel processing. See section 6 of the parallel vignette (vignette("parallel")) for more details.
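A sketch of what that could look like here, reusing smp, srgr, m and no_cores from the question (the seed value 42 is arbitrary):
set.seed(42) # sequential run: one fixed seed
one <- krige(formula = sim1~1, locations = smp, newdata = srgr, model = m, nmax = 12, nsim = 2)
cl <- makeCluster(no_cores)
clusterSetRNGStream(cl, iseed = 42) # parallel run: one reproducible stream per worker
# ... clusterExport(), clusterEvalQ() and parLapply() as in the question ...
stopCluster(cl)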

Accessing R Self Organizing Map Codebook Vectors

I am working to use SOMs to help analyze the variability of weather forecast model ensembles. To do so, I access a 20-member ensemble of a global weather forecast model over a specific geographic domain. I convert the 20 x Nlat x Nlon array to a 20 x (Nlat*Nlon) matrix and present it to the kohonen package's som function. I then seek to access the som "codebook vector" output and transform it back to the latitude/longitude grid. I am, however, receiving an error message at this step.
The error message I receive is:
Error in var.som$codes[i, ] : incorrect number of dimensions
Here var.som is the kohonen object, and I loop over i = 1:Nsom, where Nsom is the number of "maps" specified in the call to the som function.
The attribute data for var.som indicates that the size of the list var.som$codes is "num [1:4, 1:500]", suggesting two dimensions, which is why I think my code should work. I have tried different permutations to access the list data (e.g. var.som$codes[1] and var.som$codes[[1]]) but they do not solve the problem; var.som$codes[1,1] yields NULL.
In the R script below, I have reduced the process to only the essential steps. A random number generator replaces the accessing of the weather model data. In the code, I indicate the location of where the error occurs and what the error message is.
Help and guidance for how to access the var.som$codes one codebook vector at a time is appreciated.
# An R script that provides an example of using a Self Organizing Map to calculate a SOM from latitude/longitude
# data. An error occurs when accessing the SOM codebook vectors.
library("kohonen")
# Set a few parameters
Nlon <- 20 # Number of longitude points
Nlat <- 25 # Number of latitude points
Nens <- 20 # number of ensemble members
Nsom <- 4 # number of "maps" in SOM
t2m.en <- as.list(rep(0,Nens))
# Generate Nlon * Nlat random numbers for Nens ensembles
for (i in 1:Nens) {
t2m.en[[i]] <- runif(Nlon*Nlat, -5, 5)
}
#array containing ensemble data
t2m.ens <- array(unlist(t2m.en), dim=c(Nens,Nlon,Nlat))
t2m.vec <- matrix(t2m.ens, nrow=Nens, ncol=Nlat*Nlon, byrow=TRUE)
# remove the column mean from each column of data (i.e. each grid point)
t2m.scaled <- apply(t2m.vec, 2, scale, scale=FALSE, center=TRUE)
rm(t2m.en)
# LOOP OVER THE VARIABLES TO PLOT
# Conduct the SOM analysis
var.som <- som(t2m.scaled, grid = somgrid(2, 2, "rectangular"))
var.vecc <- mat.or.vec(Nlat*Nlon, Nsom)
#populate var.vecc with the SOM output maps
for (i in 1:Nsom) {
print(i)
## THIS IS WHERE THE ERROR IS
var.vecc[,i] <- var.som$codes[i,]
## The Error Message is:
## Error in var.som$codes[i, ] : incorrect number of dimensions
}
#var.som$codes[1]
# Plot data from var.vecc on a map
Try: var.vecc[, i] <- var.som$codes[[1]][i, ]. In recent versions of kohonen, $codes is a list holding one codebook matrix per data layer, so you have to index into the list before selecting a row.
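A minimal sketch of the fixed loop (assuming a single-layer SOM, so the first and only element of the codes list holds the codebook):
codes <- var.som$codes[[1]] # Nsom x (Nlat*Nlon) codebook matrix
for (i in 1:Nsom) {
  var.vecc[, i] <- codes[i, ]
}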

Cross-validation for kriging in R: how to include the trend while reestimating the variogram using xvalid?

I have a question very specific to the function xvalid (package geoR) in R, which is used in spatial statistics only, so I hope it's not too specific for someone to be able to answer. In any case, suggestions for alternative functions/packages are welcome too.
I would like to compute a variogram, fit it, and then perform cross-validation. Function xvalid seems to work pretty nicely for the cross-validation. It works when I set reestimate=TRUE (so it reestimates the variogram for every point removed from the dataset in cross-validation) and it also works when using a trend. However, it does not seem to work when combining these two...
Here is an example using the classical Meuse dataset:
library(geoR)
library(sp)
data(meuse) # import data
coordinates(meuse) = ~x+y # make SpatialPointsDataFrame
meuse@proj4string <- CRS("+init=epsg:28992") # add projection
meuse_geo <- as.geodata(meuse) # create object of class geodata for geoR compatibility
meuse_geo$data <- meuse@data # attach all data (incl. covariates) to meuse_geo
meuse_vario <- variog(geodata=meuse_geo, data=meuse_geo$data$lead, trend= ~meuse_geo$data$elev) # variogram
meuse_vfit <- variofit(meuse_vario, nugget=0.1, fix.nugget=T) # fit
# cross-validation works fine:
xvalid(geodata=meuse_geo, data=meuse_geo$data$lead, model=meuse_vfit, variog.obj = meuse_vario, reestimate=F)
# cross-validation does not work when reestimate = T:
xvalid(geodata=meuse_geo, data=meuse_geo$data$lead, model=meuse_vfit, variog.obj = meuse_vario, reestimate=T)
The error I get is:
Error in variog(coords = cv.coords, data = cv.data, uvec = variog.obj$uvec, : coords and trend have incompatible sizes
It seems to remove the point from the dataset during cross-validation, but it doesn't seem to remove the point from the covariates/trend data. Any ideas on solving this or using a different package?
Thanks a lot in advance!

Perform operation on each imputed dataset in R's MICE

How can I perform an operation (like subsetting or adding a calculated column) on each imputed dataset in an object of class mids from R's package mice? I would like the result to still be a mids object.
Edit: Example
library(mice)
data(nhanes)
# create imputed datasets
imput = mice(nhanes)
The imputed values are stored as a list of data frames:
imput$imp
where each element corresponds to one variable and contains rows only for the observations with imputations for that variable.
The original (incomplete) dataset is stored here:
imput$data
For example, how would I create a new variable calculated as chl/2 in each of the imputed datasets, yielding a new mids object?
This can be done easily as follows:
Use complete() to convert a mids object to a long-format data.frame:
long1 <- complete(midsobj1, action='long', include=TRUE)
Perform whatever manipulations are needed:
long1$new.var <- long1$chl/2
long2 <- subset(long1, age >= 2) # e.g. keep the older age groups (nhanes codes age as 1, 2, 3)
Use as.mids() to convert the manipulated data back to a mids object:
midsobj2 <- as.mids(long2)
Now you can use midsobj2 as required. Note that include=TRUE (which keeps the original data with missing values) is needed for as.mids() to compress the long-format data properly. Note also that prior to mice v2.25 there was a bug in the as.mids() function (see this post: https://stats.stackexchange.com/a/158327/69413).
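From there the usual mice workflow applies, e.g. fitting a model on each imputed dataset and pooling (lm(bmi ~ new.var) is just a hypothetical model for illustration):
fit <- with(midsobj2, lm(bmi ~ new.var))
summary(pool(fit))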
EDIT: According to this answer https://stackoverflow.com/a/34859264/4269699 (from what is essentially a duplicate question), you can also edit the mids object directly by accessing $data and $imp. So, for example:
midsobj2 <- midsobj1
midsobj2$data$new.var <- midsobj2$data$chl/2
midsobj2$imp$new.var <- midsobj2$imp$chl/2
You will run into trouble though if you want to subset $imp or if you want to use $call, so I wouldn't recommend this solution in general.
Another option is to calculate the variables before the imputation and place restrictions on them.
library(mice)
# Create the additional variable - this will have missing
nhanes$extra <- nhanes$chl / 2
# Change the method of imputation for extra, so that it always equals chl/2
# Change the predictor matrix so only chl predicts extra
ini <- mice(nhanes, maxit = 0, print = FALSE)
meth <- ini$meth
meth["extra"] <- "~I(chl / 2)"
pred <- ini$pred # extra isn't used to predict
pred["extra", "chl"] <- 1
# Imputations
imput <- mice(nhanes, seed = 1, pred = pred, meth = meth, print = FALSE)
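A quick sanity check on the first completed dataset (the passively imputed extra values should equal chl/2 exactly):
d1 <- complete(imput, 1)
all.equal(d1$extra, d1$chl / 2) # should be TRUE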
There are examples in the paper "mice: Multivariate Imputation by Chained Equations in R".
There is an overload of with() that can help you here:
with(imput, chl/2)
The documentation is given at ?with.mids. Note that this returns a mira object (one result per imputed dataset) rather than a mids object.
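For example:
halves <- with(imput, chl/2) # one result per imputed dataset
halves$analyses[[1]] # chl/2 computed on the first completed dataset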
There's a function for this in the basecamb package:
library(basecamb)
apply_function_to_imputed_data(mids_object, fun) # schematic: your mids object and the function to apply
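A hedged sketch for the chl/2 example (the function name is real, but check ?apply_function_to_imputed_data for the exact argument names):
library(basecamb)
add_half_chl <- function(df) { # helper: add the derived column to one data.frame
  df$new.var <- df$chl / 2
  df
}
midsobj2 <- apply_function_to_imputed_data(imput, add_half_chl)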
