calculate mean for each cell in grid in R [duplicate]

Does somebody know whether there is a sliding-window method in R for 2D matrices, not just vectors? I need to apply the median function to an image stored in a matrix.

The function focal() in the excellent raster package is good for this. It takes several arguments beyond those shown in the example below, and can be used to specify a non-rectangular sliding window if that's needed.
library(raster)
## Create some example data
m <- matrix(1, ncol=10, nrow=10)
diag(m) <- 2
r <- raster(m) # Coerce matrix to RasterLayer object
## Apply a function that returns a single value when passed values of cells
## in a 3-by-3 window surrounding each focal cell
rmean <- focal(r, w=matrix(1/9, ncol=3, nrow=3), fun=mean)
rmedian <- focal(r, w=matrix(1/9, ncol=3, nrow=3), fun=median)
## Plot the results to confirm that this behaves as you'd expect
par(mfcol=c(1,3))
plot(r)
plot(rmean)
plot(rmedian)
## Coerce results back to a matrix, if you so desire
mmean <- as(rmean, "matrix")

I know this is an old question, but I have come across it many times when looking to solve a similar problem. While the focal function in the raster package IS very straightforward and convenient, I have found it to be very slow when working with large rasters. There are many ways to address this, but one that worked for me is calling WhiteboxTools, a command-line-driven set of raster analysis tools, through system commands. Its main advantage is that it executes the tools in parallel and really takes advantage of multi-core CPUs. I know R has many cluster functions and packages (which I use for randomForest model raster prediction), but I have struggled with much of the parallel-computing implementation in R. WhiteboxTools has discrete functions for mean, max, majority, median, etc. filters (not to mention loads of terrain-processing tools, which is great for DEM-centric analyses).
Here is some example code for how I implemented a modal (majority) filter with a 3x3 window in R on a large classified land-cover raster (nrow=3793, ncol=6789, ncell=25750677) using WhiteboxTools:
system('C:/WBT2/target/release/whitebox_tools --wd="D:/Temp" ^
--run=MajorityFilter -v --input="input_rast.tif" ^
--output="maj_filt_rast.tif" --filterx=3 --filtery=3',
wait = T, timeout=0, show.output.on.console = T)
The above code took less than 3.5 seconds to execute, whereas the equivalent call to the raster package's focal function with fun=modal took 5 minutes to complete, coded as:
maj_filt_rast<- focal(input_rast, fun=modal, w=matrix(1,nrow=3,ncol=3))
Getting WhiteboxTools compiled and installed IS a bit annoying, but good instructions are provided. In my opinion it is well worth the effort: it makes raster processes that were previously prohibitively slow in R run amazingly fast, and it lets me keep all the coding inside R via system commands.

Related

How slow is too slow when kriging with gstat in R

I am trying to use the krige function in the gstat package of R to interpolate some spatial ocean-depth data. I am finding that for more than roughly 1,000 points, the function starts taking unreasonable amounts of time to finish (i.e., hours to days, or never finishing at all). Is this normal, or am I doing something wrong? I am particularly concerned because my eventual goal is to do spatio-temporal kriging of a very large dataset (>30,000 data points), and I am worried that it just won't be feasible given these run times.
I am running gstat-1.1-3 and R-3.3.2. Below is the code I am running:
library(sp); library(raster); library(gstat)
v.utm # SpatialPointsDataFrame with >30,000 points
# Remove points with identical positons
zd = zerodist(v.utm)
nzd = v.utm[-zd[,1],] # Layer with no identical positions
# Make a raster layer covering point layer
resolution=1e4
e = extent(as.matrix(v.utm@coords)) + resolution
r = raster(e,resolution=resolution)
proj4string(r) = proj4string(v.utm)
# r is a 181x157 raster
# Fit variogram
depth.fit = fit.variogram(variogram(AVGDEPTH~1, nzd), model=vgm(6000,"Exp",1,5e5,1))
# Krige on random sample of 500 points - works fine
size=500
ss=nzd[sample.int(nrow(nzd),size),]
depth.krig = krige(AVGDEPTH~1,ss,as(r,"SpatialPixelsDataFrame"),
model=depth.fit)
# Krige on random sample of 5000 points - never seems to end
size=5000
ss=nzd[sample.int(nrow(nzd),size),]
depth.krig = krige(AVGDEPTH~1,ss,as(r,"SpatialPixelsDataFrame"),
model=depth.fit)
The complexity of the Cholesky decomposition (or similar) is O(n^3), meaning that if you multiply the number of points by 10, the time it takes increases by a factor of 1000. There are two ways out of this problem, at least as far as gstat is concerned:
Install an optimized BLAS (e.g. OpenBLAS or MKL). This does not solve the O(n^3) problem, but may give a speed-up of at most a factor n, with n the number of cores available.
Avoid decomposing the full covariance matrix by choosing local neighbourhoods (arguments maxdist and/or nmax), as sketched below.
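A minimal sketch of the local-neighbourhood route, reusing the objects from the question (nzd, r and the fitted variogram, here called depth.fit); the specific nmax and maxdist values are illustrative assumptions, not recommendations:
library(gstat)
## Restrict each prediction to a local neighbourhood so gstat never has to
## factor the full covariance matrix of all points
depth.krig.local <- krige(AVGDEPTH ~ 1, nzd,
                          as(r, "SpatialPixelsDataFrame"),
                          model = depth.fit,
                          nmax = 50,     # use at most the 50 nearest observations
                          maxdist = 1e5) # ignore observations beyond 1e5 map units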
A much faster alternative to kriging for large datasets is griddify in the marmap package. It took me a while to find it, but it works well. It uses bilinear interpolation, and although it is designed for bathymetric maps, it works with any xyz data.
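A rough sketch of that route; the interface shown (a three-column xyz data frame plus target grid dimensions nlon and nlat) is written from memory, so please check ?griddify before relying on it:
library(marmap)
## Build a three-column x / y / z table from the de-duplicated points
xyz <- data.frame(x = coordinates(nzd)[, 1],
                  y = coordinates(nzd)[, 2],
                  z = nzd$AVGDEPTH)
## Interpolate onto a regular grid roughly matching r (181 x 157 cells)
depth.grid <- griddify(xyz, nlon = 157, nlat = 181)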

R, issue with a Hierarchical clustering after a Multiple correspondence analysis

I want to cluster a dataset of 600,000 observations, and for each cluster I want to get the principal components.
Each observation consists of one email address and 30 qualitative variables.
Each qualitative variable has 4 levels: 0, 1, 2 and 3.
So the first thing I do is load the FactoMineR library and my data:
library(FactoMineR)
mydata = read.csv("/home/tom/Desktop/ACM/acm.csv")
Then I'm setting my variables as qualitative (I'm excluding the variable 'email' though):
for(n in 1:length(mydata)){mydata[[n]] <- factor(mydata[[n]])}
I'm removing the emails from my vectors:
mydata2 = mydata[2:31]
And I run an MCA on this new dataset:
mca.res <- MCA(mydata2)
I now want to cluster my dataset using the hcpc function:
res.hcpc <- HCPC(mca.res)
But I got the following error message:
Error: cannot allocate vector of size 1296.0 Gb
What do you think I should do? Is my dataset too large? Am I using the HCPC function correctly?
Since it uses hierarchical clustering, HCPC needs to compute the lower triangle of a 600,000 x 600,000 distance matrix (~180 billion elements). You simply don't have the RAM to store this object, and even if you did, the computation would likely take hours if not days to complete.
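A quick back-of-the-envelope check (my own arithmetic, not part of the original answer) shows why the reported ~1296 Gb allocation is inevitable:
n <- 6e5
n_pairs <- n * (n - 1) / 2  # ~1.8e11 pairwise distances in the lower triangle
n_pairs * 8 / 2^30          # ~1340 GiB as double-precision values, the same
                            # order of magnitude as the error message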
There have been various discussions on Stack Overflow/Cross Validated on clustering large datasets; some with solutions in R include:
k-means clustering in R on very large, sparse matrix? (bigkmeans)
Cluster Big Data in R and Is Sampling Relevant? (clara)
If you want to use one of these alternative clustering approaches, you would apply it to mca.res$ind$coord in your example.
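For instance, a minimal sketch of the clara route on the MCA coordinates (the number of clusters and of subsamples below are illustrative assumptions, not recommendations):
library(cluster)
coords <- mca.res$ind$coord              # individuals' coordinates from the MCA
set.seed(42)
cl <- clara(coords, k = 5, samples = 50) # k-medoids on subsamples; scales to large n
table(cl$clustering)                     # cluster sizes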
Another idea, suggested in response to the problem clustering very large dataset in R, is to first use k means to find a certain number of cluster centres and then use hierarchical clustering to build the tree from there. This method is actually implemented via the kk argument of HCPC.
For example, using the tea data set from FactoMineR:
library(FactoMineR)
data(tea)
## run MCA as in ?MCA
res.mca <- MCA(tea, quanti.sup = 19, quali.sup = c(20:36), graph = FALSE)
## run HCPC for all 300 individuals
hc <- HCPC(res.mca, kk = Inf, consol = FALSE)
## run HCPC from 30 k means centres
res.consol <- NULL ## bug work-around
hc2 <- HCPC(res.mca, kk = 30, consol = FALSE)
The consol argument offers the option to consolidate the clusters from the hierarchical clustering using k-means; this option is not available when kk is set to a finite number, hence consol is set to FALSE here. The object res.consol is set to NULL to work around a minor bug in FactoMineR 1.27.
The following plot shows the clusters based on all 300 individuals (kk = Inf) and on the 30 k-means centres (kk = 30), with the data plotted on the first two MCA axes:
It can be seen that the results are very similar. You should easily be able to apply this to your data with 600 or 1,000 k-means centres, perhaps up to 6,000 with 8 GB of RAM. If you wanted to use a larger number, you'd probably want to code a more efficient version using bigkmeans, SpatialTools::dist1 and fastcluster::hclust; see the sketch below.
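A rough sketch of that two-stage idea, written with base kmeans and hclust for clarity; biganalytics::bigkmeans and fastcluster::hclust have broadly the same interfaces and are what I'd swap in for very large data. The numbers of centres and of final clusters are illustrative assumptions:
coords <- mca.res$ind$coord                  # MCA coordinates of the individuals
set.seed(1)
km <- kmeans(coords, centers = 1000, iter.max = 50) # stage 1: many small centres
## stage 2: hierarchical clustering of the 1000 centres only
hc <- hclust(dist(km$centers), method = "ward.D2")
grp <- cutree(hc, k = 5)                     # cut the tree into 5 clusters
final <- grp[km$cluster]                     # map each individual to a final cluster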
That error message usually indicates that R does not have enough RAM at its disposal to complete the command. I guess you are running this within 32-bit R, possibly under Windows? If that is the case, then killing other processes and deleting unused R variables might help: for example, you might try to delete mydata and mydata2 with
rm(mydata, mydata2)
(as well as all other unneeded R variables) before executing the command that generates the error. However, the ultimate solution in general is to switch to 64-bit R, preferably under 64-bit Linux and with a decent amount of RAM; also see these threads:
R memory management / cannot allocate vector of size n Mb
R Memory Allocation "Error: cannot allocate vector of size 75.1 Mb"
http://r.789695.n4.nabble.com/Error-cannot-allocate-vector-of-size-td3629384.html
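As a quick first check in that 32-bit Windows scenario, something like the following shows how much headroom R actually has (memory.limit() only ever existed on Windows and was removed in R 4.2, so treat this purely as a sketch for older setups):
gc()            # trigger a garbage collection and report memory currently in use
memory.limit()  # allocation cap in MB (Windows-only, pre-R 4.2)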

What can I do about svmpath needing so much memory in one gulp?

I am trying out the svmpath package, which is supposed to find optimal hyperparameters for a trained SVM without requiring multiple runs over different subsets of the data. More importantly, it's supposed to be less computationally complex (according to its docs).
However, it seems to ask for a lot of memory all at once.
Minimal working example:
library(data.table)
library(svmpath)
# Loaded svmpath 0.953
features <- data.table(matrix(runif(100000*16),ncol=16))
labels <- (runif(100000) > 0.7)
svmpath(x=features,y=labels)
# Error in x %*% t(y) : requires numeric/complex matrix/vector arguments
svmpath(x=as.matrix(features),y=labels)
# Error: cannot allocate vector of size 74.5 Gb
library(kernlab)
ksvm(as.matrix(features), y=labels, kernel="vanilladot")
# runs
Inspecting the training function, only one line pops out as possibly big: Kscript <- K * outer(y, y). Indeed, this seems to be the culprit: runif(100000) %o% runif(100000) produces the same error.
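A quick calculation (mine, not from the package docs) confirms the size: a 100000-by-100000 matrix of doubles needs about 74.5 GiB, which is exactly the allocation the error reports.
n <- 100000
n^2 * 8 / 2^30  # 8 bytes per double; result is ~74.5 (GiB)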
Are there any quick fixes that are easy to implement in R?
Apparently, svmpath doesn't find the optimal C (cost) value itself. But it lists all the C values that you should try in order to find the best one, using N-fold cross-validation or a test dataset.
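For example, here is a minimal sketch of that tuning step using kernlab on a subsample, reusing features and labels from the question; the candidate C grid is an illustrative assumption, and in practice you would plug in the values svmpath reports:
library(kernlab)
## 5-fold cross-validation over a grid of C values on a manageable subsample
idx <- sample(nrow(features), 5000)
x_sub <- as.matrix(features)[idx, ]
y_sub <- factor(labels[idx])
c_grid <- c(0.01, 0.1, 1, 10, 100)  # replace with the values svmpath suggests
cv_err <- sapply(c_grid, function(C)
  cross(ksvm(x_sub, y_sub, kernel = "vanilladot", C = C, cross = 5)))
c_grid[which.min(cv_err)]           # C with the lowest cross-validation error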

Interpreting the phom R package - persistent homology - topological analysis of data - Cluster analysis

I am learning to analyze the topology of data with the pHom package of R.
I would like to understand (characterize) a set of data stored in a matrix with 3,500 rows and 10 columns. To achieve this, the R package phom runs a persistent-homology computation that describes the data.
(Reference: the following 4-minute video describes what we are seeking to do with homology in topology: http://www.youtube.com/embed/XfWibrh6stw?rel=0&autoplay=1)
Using the R package "phom" (link: http://cran.r-project.org/web/packages/phom/phom.pdf), the following example can be run.
I need help in order to properly understand how the phom function works and how to interpret the data (plot).
Here is Example 1 from the reference manual of the phom package, run in R.
Load Packages
library(phom)
library(Rcpp)
Example 1
x <- runif(100)
y <- runif(100)
points <- t(as.matrix(rbind(x, y)))
max_dim <- 2
max_f <- 0.2
intervals <- pHom(points, max_dim, max_f, metric="manhattan")
plotPersistenceDiagram(intervals, max_dim, max_f,
title="Random Points in Cube with l_1 Norm")
I would kindly appreciate it if someone could help me with the following questions:
a.) What does the value max_f mean, and where does it come from? Does it come from my data, or do I set it myself?
b.) The plot produced by plotPersistenceDiagram (if you run the example in R you will see it): how do I interpret it?
Thank you.
Note: in order to run the "phom" package you need the "Rcpp" package and a recent version of R (3.0.3 at the time of writing). The previous example was run in R after loading the "phom" and "Rcpp" packages.
This is totally the wrong venue for this question, but just in case you're still struggling with it a year later I happen to know the answer.
Computing persistent homology has two steps:
Turn the point cloud into a filtration of simplicial complexes
Compute the homology of the simplicial complex
The "filtration" part of step 1 means you have to compute a simplicial complex for a whole range of parameters. The parameter in this case is epsilon, the distance threshold within which points are connected. The max_f variable caps the range of epsilon sweep from zero to max_f.
plotPersistenceDiagram displays the homological "persistence barcodes" as points instead of lines. The x-coordinate of the point is the birth time of that topological feature (the value of epsilon for which it first appears), and the y-coordinate is the death time (the value of epsilon for which it disappears).
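As a concrete (if simplified) illustration that does not use phom at all: every 0-dimensional feature is born at epsilon = 0, and a connected component dies at the epsilon where it merges into another one, which is exactly the merge height reported by single-linkage clustering:
## Illustration only: H0 death times coincide with single-linkage merge heights
set.seed(1)
pts <- cbind(runif(100), runif(100))
hc <- hclust(dist(pts, method = "manhattan"), method = "single")
head(sort(hc$height))  # the epsilon values at which connected components disappear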

Efficient moving window statistics for matrix and/or spatial data (neighborhood statistics) in R

I'm using the raster and related packages in R to do a bit of remote sensing work. For a number of the functions I'm writing, I'd love to rapidly compute neighborhood / moving window statistics. Unfortunately, any R implementations I or others write are very, very slow.
I know that the caTools package offers this functionality written in C for vectors / time series, which yields a 10X+ time savings. Is anyone familiar with a similar package or function that provides this functionality for matrices and spatial data?
Quick example:
# Generate a raster with random values
r <- raster(nrows=100, ncols=100)
values(r) <- rbinom(dim(r)[1] * dim(r)[2], 1, 0.1)
# Now generate a raster highlighting the original values plus immediate neighbors
# (By default ngb yields a queen-esque weighting system)
r.neighbor <- focal(r, ngb=3, fun=max)
# system.time() of the above function for a 100x100 raster takes 0.8 seconds on my laptop
# and takes over 15 seconds for a 1000x1000 raster
Ideally, I'd like to do this faster and for much larger rasters.
Much thanks,
Nick
P.S. There's some interesting discussion of the massive speed differences across R functions for doing moving-window operations on vectors here: http://tolstoy.newcastle.edu.au/R/help/04/10/5161.html
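Not a full answer, but a hedged sketch of one way to reuse caTools' fast 1-D routines for a 2-D window: for separable statistics such as max (or min), a 3x3 moving maximum is just a 1-D running maximum applied along the rows and then along the columns (edge handling will differ from focal's defaults, and this trick does not apply to the median):
library(caTools)
m <- matrix(rbinom(1000 * 1000, 1, 0.1), nrow = 1000)
row_max <- t(apply(m, 1, runmax, k = 3))    # 1 x 3 running max along each row
win_max <- apply(row_max, 2, runmax, k = 3) # then 3 x 1 down each column = 3 x 3 max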
