matrix index matching for large raster data - r

I have a large raster dataset (X) with dimensions of 32251*51333. The values of X are repetitions of another array (Y), which has a size of 3*10^6.
Now I want to change the values of X by matching them against each value of Y. For example, I can program it like this,
for (i in 1:length(Y)){
X[X==Y[i]] = Z[i] #Z is just another array with the same size as Y
}
The problem is that, first, the index matching X[X==Y[i]] = Z[i] does not work because X is too large. After a few minutes the program just stops with the error "Error: cannot allocate vector of size 6.2 Gb".
Second, looping from 1 to length(Y), given that Y has about 3*10^6 elements, may take "forever" to complete.
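That 6.2 Gb figure is consistent with the logical mask that X == Y[i] must allocate on every iteration, one 4-byte logical per cell (a back-of-the-envelope check I am adding, not from the original post):
32251 * 51333 * 4 / 1024^3  # number of cells times 4 bytes per logical, in GiB
#> ~6.17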
One approach that came to mind is to split X into small chunks and then do the index matching for each chunk, but I feel this would still take a lot of time.
Is there a better way to achieve the above goal?
1st Update:
Thanks to the example provided by @Lyngbakr, I will elaborate on this question further. Because the raster I'm working with is very large (32251*51333), it doesn't seem possible to upload it. The example given by @Lyngbakr is very similar to what I want, except that the raster it creates is too small. Following that example, I ran two tests with a much larger raster of dimension 3000*2700. See the code below.
#Method 1: Use subs
start_time <- Sys.time()
Y <- 1:9
Z <- 91:99
X <- raster(matrix(rep(Y, 3), nrow=3000,ncol = 2700))
df <- data.frame(Y, Z)
X <- subs(X, df)
end_time <- Sys.time()
end_time - start_time
#Time difference of 2.248908 mins
#Method 2: Use for loop
start_time <- Sys.time()
Y <- 1:9
Z <- 91:99
X <- raster(matrix(rep(Y, 3), nrow=3000,ncol = 2700))
for (i in 1:length(Y)){
X[X==Y[i]]=Z[i] #this indexing of R seems not efficient if X becomes large
}
end_time <- Sys.time()
end_time - start_time
#Time difference of 10.22717 secs
As you can see, a simple for loop is actually more efficient than the subs function. Remember, the raster in this example is still much smaller than the one I work with (about 100 times smaller), and the array Y in the example is very small. So the question becomes: how can Method 2, which is just a simple for loop, be sped up?
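One vectorized alternative worth noting (a sketch I am adding, not from the post; it assumes the raster's values fit in memory and that every cell value of X occurs in Y) is to replace the loop with a single match() lookup:
# Vectorized recoding: find each cell's position in Y and take the matching Z
vals <- values(X)              # pull all cell values into a vector
values(X) <- Z[match(vals, Y)] # cells not found in Y would become NA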

You're looking for the subs function. I don't know whether it works with large rasters, but here's how you'd try it.
I load the raster package and create some dummy data. (It would be really helpful if you provided data in your question.) Then, I plot the results.
# Load library
library(raster)
#> Loading required package: sp
# Z holds values that will replace Y
Y <- 1:9
Z <- 91:99
# Create dummy raster
X <- raster(matrix(rep(Y, 3), ncol = 9))
# Examine raster
plot(X)
As you can see, X is just a bunch of Y vectors patched together. Next, I bind Y and Z together into a data frame df.
# Combine y & z into a data frame
df <- data.frame(Y, Z)
Finally, I use subs to replace Y values with Z values.
# Substitute Z for Y in X
X <- subs(X, df)
A quick look at the raster shows that the values have been replaced correctly.
# Examine raster
plot(X)
Created on 2019-06-25 by the reprex package (v0.2.1.9000)
Update
Rcpp is really helpful when performance is an issue. Below, I compare three methods:
Looping in R (from the question)
Using subs from the raster package
Looping in C++ using Rcpp
By the way, Sys.time() isn't a great way to examine performance, so I'd recommend microbenchmark instead.
# Load library
library(raster)
# Define vectors and raster
Y <- 1:9
Z <- 91:99
X <- raster(matrix(rep(Y, 3), nrow = 3000, ncol = 2700))
method_1 is the subs function.
# Using subs function
method_1 <- function(){
df <- data.frame(Y, Z)
X <- subs(X, df)
}
method_2 is your original looping approach.
# Using R loop
method_2 <- function(){
for (i in 1:length(Y)){
X[X==Y[i]]=Z[i]
}
X
}
method_3 is the looping approach implemented in C++.
# Using Rcpp loops
src <-
"Rcpp::NumericMatrix subs_cpp(Rcpp::NumericMatrix X, Rcpp::NumericVector Y, Rcpp::NumericVector Z){
  for(int i = 0; i < Y.length(); ++i){
    for(int j = 0; j < X.ncol(); ++j){
      for(int k = 0; k < X.nrow(); ++k){
        if(X(k, j) == Y(i)){
          X(k, j) = Z(i);
        }
      }
    }
  }
  return X;
}"
Rcpp::cppFunction(src)
method_3 <- function(){
subs_cpp(as.matrix(X), Y, Z)
}
And here I benchmark the approaches.
# Run benchmarking
microbenchmark::microbenchmark(method_1(), method_2(), method_3(), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# method_1() 16861.5447 17737.2124 19321.5674 18628.8573 20117.0159 25506.208 10
# method_2() 671.2223 677.6029 1111.3935 738.6216 1657.0542 2163.137 10
# method_3() 316.9810 319.1484 481.3548 320.2337 326.7133 1477.454 10
As you can see, the Rcpp approach is by far the fastest.
You can also compare the outputs, using a smaller raster, to ensure the three methods produce the same result.
# Examine all three outputs with smaller raster
X <- raster(matrix(rep(Y, 3), ncol = 9))
plot(method_1(), main = "Method 1")
plot(method_2(), main = "Method 2")
plot(raster(method_3()), main = "Method 3") # Needs to be converted back into a raster
And they all look alike. Note that for the third method, the result needs to be converted back to a raster from a matrix.
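For a raster as large as the original one (32251*51333 cells), whose values may not fit in memory at once, the same lookup can also be done block by block with raster's writeStart/getValues/writeValues pattern. This is a sketch I am adding, not part of the answer above; it assumes every cell value occurs in Y, and the output filename is arbitrary:
library(raster)
out <- raster(X)                    # empty raster with the same geometry as X
bs  <- blockSize(X)                 # suggested row blocks for chunked processing
out <- writeStart(out, filename = "X_recoded.grd", overwrite = TRUE)
for (b in seq_len(bs$n)) {
  v   <- getValues(X, row = bs$row[b], nrows = bs$nrows[b])  # read one block of cells
  out <- writeValues(out, Z[match(v, Y)], bs$row[b])         # vectorized lookup, write block back
}
out <- writeStop(out)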

Related

Assigning value to dataframe in R - for loop speed

I have the following code:
n <- 1e6
no_clm <- rpois(n,30)
hold <- data.frame("x" = double(n))
c = 1
for (i in no_clm){
ctl <- sum(rgamma(i,30000)-2000)
hold[c,1] <- ctl
#hold <- rbind(hold,df)
c = c +1
}
Unfortunately this code is quite slow. I've narrowed the slowness down to hold[c,1] <- ctl; if I remove this, the code runs near instantly.
How can I make this efficient? I need to store the results in some sort of data frame or list in a fast fashion. In reality the actual code is more complex than this, but the slow point is the assignment.
Note that the above is just an example, in reality I have multiple calculations on the rgamma samples and each of these calculations are then stored in a large dataframe.
Try this
hold = data.frame(sapply(no_clm, function(x) {
  return(sum(rgamma(x, 30000) - 2000))
}))
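A related option (a sketch I am adding, not part of the answer above) is to keep the explicit loop but write into a preallocated numeric vector and build the data frame once at the end; the repeated hold[c, 1] <- assignment on a data frame is what makes the original loop slow:
x <- numeric(n)                              # preallocated plain vector
for (i in seq_len(n)) {
  x[i] <- sum(rgamma(no_clm[i], 30000) - 2000)
}
hold <- data.frame(x = x)                    # build the data frame once at the end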
It looks like you can just use one call to rgamma, as you are iterating over the number of observations parameter.
So if you do one call and then split the vector into the lengths required (no_clm), you can just iterate over that list and sum:
n <- 1e6
no_clm <- rpois(n, 30)
hold <- data.frame("x" = double(n))
# total observations to use for rgamma
total_clm <- sum(no_clm)
# get values
gammas <- rgamma(total_clm, 30000) - 2000
# split into list of lengths dictated by no_clm
hold$x <- sapply(split(gammas, cumsum(sequence(no_clm) == 1)), sum)
This took 5.919892 seconds
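The grouping trick inside the split() call may be easier to see on a tiny input (an illustrative aside I am adding):
no_clm_small <- c(2, 3, 1)
sequence(no_clm_small)               # 1 2 1 2 3 1
cumsum(sequence(no_clm_small) == 1)  # 1 1 2 2 2 3 -> group id for each gamma draw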
Move the computation into an sapply() loop instead of a for loop, and then realise that 2000 * no_clm can be moved outside the loop (to minimise the number of function calls).
n <- 1e6
no_clm <- rpois(n, 30)
hold <- data.frame(x = sapply(no_clm, function(i) sum(rgamma(i, 30000))) - 2000 * no_clm)
You may observe a speed pickup using data.table:
dt = data.table(no_clm)
dt[, hold := sapply(no_clm, function(x) sum(rgamma(x, 30000)-2000))]

"Sapply" function in R counterpart in MATLAB to convert a code from R to MATLAB

I want to convert this R code to MATLAB (not execute the R code from within MATLAB).
The code in R is as follows:
data_set <- read.csv("lab01_data_set.csv")
# get x and y values
x <- data_set$x
y <- data_set$y
# get number of classes and number of samples
K <- max(y)
N <- length(y)
# calculate sample means
sample_means <- sapply(X = 1:K, FUN = function(c) {mean(x[y == c])})
# calculate sample deviations
sample_deviations <- sapply(X = 1:K, FUN = function(c) {sqrt(mean((x[y == c] - sample_means[c])^2))})
To implement it in MATLAB I write the following:
%% Reading Data
% read data into memory
X=readmatrix("lab01_data_set(ViaMatlab).csv");
% get x and y values
x_read=X(1,:);
y_read=X(2,:);
% get number of classes and number of samples
K = max(y_read);
N = length(y_read);
% Calculate sample mean - 1st method
% funct1 = @(c) mean(c);
% G1=findgroups(y_read);
% sample_mean=splitapply(funct1,x_read,G1)
% Calculate sample mean - 2nd method
for m=1:3
    sample_mean(1,m) = mean(x_read(y_read == m));
end
sample_mean;
% Calculate sample deviation - 2nd method
for m=1:3
    sample_mean = mean(x_read(y_read == m));
    sample_deviation(1,m) = sqrt(mean((x_read(y_read == m) - sample_mean).^2));
    sample_mean1(1,m) = sample_mean;
end
sample_deviation;
sample_mean1;
As you can see, I get how to use a for loop in MATLAB instead of sapply in R (the 2nd method in the code), but I do not know how to use a function (possibly splitapply or any other).
PS: Do not know how to upload the data, so sorry for that part.
The MATLAB equivalent to R sapply is arrayfun - and its relatives cellfun, structfun and varfun depending on what data type your input is.
For example, in R:
> sapply(1:3, function(x) x^2)
[1] 1 4 9
is equivalent to MATLAB:
>> arrayfun(@(x) x^2, 1:3)
ans =
1 4 9
Note that if the result of the function you pass to arrayfun, cellfun, etc. doesn't have identical type and size for every input, you'll need to specify 'UniformOutput', false.
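For comparison (an aside I am adding, not from the answer), R's sapply behaves analogously: when the per-element results do not have uniform length it falls back to returning a list instead of simplifying to a vector or matrix:
sapply(1:3, function(x) x^2)        # uniform results: simplified to a numeric vector
sapply(1:3, function(x) seq_len(x)) # ragged results: returned as a list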

Avoiding a loop when populating data frames in R

I have an empty data frame T_modelled with 2784 columns and 150 rows.
T_modelled <- data.frame(matrix(ncol = 2784, nrow = 150))
names(T_modelled) <- paste0("t=", t_sec_ERT)
rownames(T_modelled) <- paste0("z=", seq(from = 0.1, to = 15, by = 0.1))
where
t_sec_ERT <- seq(from = -23349600, to = 6706800, by = 10800)
z <- seq(from = 0.1, to = 15, by = 0.1)
I filled T_modelled by column with a nested for loop, based on a formula:
for (i in 1:ncol(T_modelled)) {
  col_tmp <- colnames(T_modelled)[i]
  for (j in 1:nrow(T_modelled)) {
    z_tmp <- z[j]-0.1
    T_tmp <- MANSRT+As*e^(-z_tmp*(omega/(2*K))^0.5)*sin(omega*t_sec_ERT[i]-((omega/(2*K))^0.5)*z_tmp)
    T_modelled[j, col_tmp] <- T_tmp
  }
}
where
MANSRT <- -2.051185
As <- 11.59375
omega <- (2*pi)/(347.875*24*60*60)
c <- 790
k <- 0.00219
pb <- 2600
K <- (k*1000)/(c*pb)
e <- exp(1)
I do get the desired results, but I keep thinking there must be a more efficient way of filling that data frame. The loop is quite slow and looks cumbersome to me. I guess there is an opportunity to take advantage of R's vectorized way of calculating; I just cannot see how to incorporate the formula in an easier way to fill T_modelled.
Anyone got any ideas how to get the same result in a faster, more "R-like" manner?
I believe this does it.
Run this first instruction right after creating T_modelled; it will be needed to test that the results are equal.
Tm <- T_modelled
Now run your code then run the code below.
z_tmp <- z - 0.1
for (i in 1:ncol(Tm)) {
T_tmp <- MANSRT + As*exp(-z_tmp*(omega/(2*K))^0.5)*sin(omega*t_sec_ERT[i]-((omega/(2*K))^0.5)*z_tmp)
Tm[ , i] <- T_tmp
}
all.equal(T_modelled, Tm)
#[1] TRUE
You don't need the inner loop, that's the only difference.
(I also used exp directly but that is of secondary importance.)
Much like the accepted solution to your previous question, consider simply using sapply, iterating through the vector t_sec_ERT, which has the same length as your desired data frame's number of columns. But first adjust every element of z by 0.1. Also, there's no need to create an empty data frame beforehand.
z_adj <- z - 0.1
T_modelled2 <- data.frame(sapply(t_sec_ERT, function(ert)
MANSRT+As*e^(-z_adj*(omega/(2*K))^0.5)*sin(omega*ert-((omega/(2*K))^0.5)*z_adj)))
colnames(T_modelled2) <- paste0("t=", t_sec_ERT)
rownames(T_modelled2) <- paste0("z=", z)
all.equal(T_modelled, T_modelled2)
# [1] TRUE
Rui is of course correct; I just want to suggest a way of reasoning when writing a loop like this.
You have two numeric vectors. Functions for numerics in R are usually vectorized. By which I mean you can do stuff like this
x <- c(1, 6, 3)
sum(x)
not needing something like this
x_ <- 0
for (i in x) {
x_ <- i + x_
}
x_
That is, there is no need for looping in R. Of course looping takes place nonetheless; it just happens in the underlying C, Fortran, etc. code, where it can be done more efficiently. This is usually what we mean when we call a function vectorized: looping takes place "under the hood", as it were. The output of Vectorize() thus isn't strictly vectorized by this definition.
When you have two numeric vectors you want to loop over, first check whether the constituent functions are vectorized, usually by reading the docs.
If they are, continue by constructing the central vectorized compound function and start testing it with one vector and one scalar. In your case it would be something like this (testing with just the first element of t_sec_ERT).
z_tmp <- z - 0.1
i <- 1
T_tmp <- MANSRT + As *
exp(-z_tmp*(omega/(2*K))^0.5) *
sin(omega*t_sec_ERT[i] - ((omega/(2*K))^0.5)*z_tmp)
Looks OK. Then you start looping over the elements of t_sec_ERT.
T_tmp <- matrix(nrow=length(z), ncol=length(t_sec_ERT))
for (i in 1:length(t_sec_ERT)) {
T_tmp[, i] <- MANSRT + As *
exp(-z_tmp*(omega/(2*K))^0.5) *
sin(omega*t_sec_ERT[i] - ((omega/(2*K))^0.5)*z_tmp)
}
Or you can do it with sapply() which is often neater.
f <- function(x) {
MANSRT + As *
exp(-z_tmp*(omega/(2*K))^0.5) *
sin(omega*x - ((omega/(2*K))^0.5)*z_tmp)
}
T_tmp <- sapply(t_sec_ERT, f)
I would prefer to put the data in a long format, with all combinations of z and t_sec_ERT as two columns, in order to take advantage of vectorization. Although I usually prefer tidyr for switching between long and wide formats, I've tried to keep this as a base solution:
t_sec_ERT <- seq(from = -23349600, to = 6706800, by = 10800)
z <- seq(from = 0.1, to = 15, by = 0.1)
v <- expand.grid(t_sec_ERT, z)
names(v) <- c("t_sec_ERT", "z")
v$z_tmp <- v$z-0.1
v$T_tmp <- MANSRT+As*e^(-v$z_tmp*(omega/(2*K))^0.5)*sin(omega*v$t_sec_ERT-((omega/(2*K))^0.5)*v$z_tmp)
T_modelled <- data.frame(matrix(v$T_tmp, nrow = length(z), ncol = length(t_sec_ERT), byrow = TRUE))
names(T_modelled) <- paste0("t=", t_sec_ERT)
rownames(T_modelled) <- paste0("z=", seq(from = 0.1, to = 15, by = 0.1))

Efficiently building a large (200 MM line) dataframe

I am attempting to build a large (~200 MM line) dataframe in R. Each entry in the dataframe will consist of approximately 10 digits (e.g. 1234.12345). The code is designed to walk through a list, subtracting the item at position [i] from every item after [i], but not from the items before [i] (if I were putting the output into a matrix, it would be a triangular matrix). The code is simple and works fine on smaller lists, but I am wondering if there is a faster or more efficient way to do this? I assume the first part of the answer is going to entail "don't use a nested for loop," but I am not sure what the alternatives are.
The idea is that this will be an "edge list" for a social network analysis graph. Once I have outlist I will reduce the number of edges based on some criteria (<, >, ==), so the final list (and graph) won't be quite so ponderous.
#Fake data of same approximate dimensions as real data
dlist<-sample(1:20,20, replace=FALSE)
#purge the output list before running the loop
rm(outlist)
outlist<-data.frame()
for(i in 1:(length(dlist)-1)){
  for(j in (i+1):length(dlist)){
    outlist <- rbind(outlist, c(dlist[i], dlist[j], dlist[j]-dlist[i]))
  }
}
IIUC your final dataset will be ~200 million rows by 3 columns, all of type numeric, which takes a total space of:
200e6 (rows) * 3 (cols) * 8 (bytes) / (1024 ^ 3)
# ~ 4.5GB
That's quite a lot of data, so it's essential to avoid copies wherever possible.
Here's a method that uses the data.table package's unexported (internal) vecseq function (written in C, so it's fast and memory efficient) and makes use of its assignment-by-reference operator :=, to avoid copies.
fn1 <- function(x) {
  require(data.table) ## 1.9.2
  lx = length(x)
  vx = as.integer(lx * (lx-1)/2)
  # R v3.1.0 doesn't copy on doing list(.) - so it should be even faster there
  ans = setDT(list(v1 = rep.int(head(x,-1L), (lx-1L):1L),
                   v2 = x[data.table:::vecseq(2:lx, (lx-1L):1, vx)]))
  ans[, v3 := v2-v1]
}
Benchmarking:
I'll benchmark against functions from the other answers on your data dimensions. Note that my benchmark is on R v3.0.2, but fn1() should give better performance (both speed and memory) on R v3.1.0, because list(.) no longer results in a copy there.
fn2 <- function(x) {
diffmat <- outer(x, x, "-")
ss <- which(upper.tri(diffmat), arr.ind = TRUE)
data.frame(v1 = x[ss[,1]], v2 = x[ss[,2]], v3 = diffmat[ss])
}
fn3 <- function(x) {
idx <- combn(seq_along(x), 2)
out2 <- data.frame(v1=x[idx[1, ]], v2=x[idx[2, ]])
out2$v3 <- out2$v2-out2$v1
out2
}
set.seed(45L)
x = runif(20e3L)
system.time(ans1 <- fn1(x)) ## 18 seconds + ~8GB (peak) memory usage
system.time(ans2 <- fn2(x)) ## 158 seconds + ~19GB (peak) memory usage
system.time(ans3 <- fn3(x)) ## 809 seconds + ~12GB (peak) memory usage
Note that fn2(), due to its use of outer, requires quite a lot of memory (peak memory usage was >= 19GB) and is slower than fn1(). fn3() is just very, very slow (due to combn and unnecessary copies).
Another way to create that data is
#Sample Data
N <- 20
set.seed(15) #for reproducibility
dlist <- sample(1:N,N, replace=FALSE)
we could do
idx <- combn(1:N,2)
out2 <- data.frame(i=dlist[idx[1, ]], j=dlist[idx[2, ]])
out2$dist <- out2$j-out2$i
This uses combn to create all pairs of indices in the dataset rather than using loops. This allows us to build the data.frame all at once rather than adding a row at a time.
We compare that to
out1 <- data.frame()
for(i in 1:(length(dlist)-1)){
  for(j in (i+1):length(dlist)){
    out1 <- rbind(out1, c(dlist[i], dlist[j], dlist[j]-dlist[i]))
  }
}
we see that
all(out1==out2)
# [1] TRUE
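The microbenchmark below compares loops() and combdata(), which the answer does not define explicitly; presumably they are just the two approaches above wrapped as functions, along these lines:
loops <- function() {
  out1 <- data.frame()
  for (i in 1:(length(dlist)-1)) {
    for (j in (i+1):length(dlist)) {
      out1 <- rbind(out1, c(dlist[i], dlist[j], dlist[j]-dlist[i]))
    }
  }
  out1
}
combdata <- function() {
  idx <- combn(1:N, 2)
  out2 <- data.frame(i = dlist[idx[1, ]], j = dlist[idx[2, ]])
  out2$dist <- out2$j - out2$i
  out2
}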
Plus, if we compare with microbenchmark we see that
microbenchmark(loops(), combdata())
# Unit: microseconds
# expr min lq median uq max neval
# loops() 30888.403 32230.107 33764.7170 34821.2850 82891.166 100
# combdata() 684.316 800.384 873.5015 940.9215 4285.627 100
The method that doesn't use loops is much faster.
You can always start with a triangular matrix and then make your dataframe directly from that:
vec <- 1:10
diffmat <- outer(vec,vec,"-")
ss <- which(upper.tri(diffmat),arr.ind = TRUE)
data.frame(one = vec[ss[,1]],
two = vec[ss[,2]],
diff = diffmat[ss])
You need to preallocate outlist; this will significantly increase the speed of your code. By preallocating I mean creating an output structure that already has the desired size, but filled with, for example, NA's.
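A minimal sketch of that preallocation for the loop in the question (filling a matrix of the right size and converting it to a data frame once at the end; the helper names are mine):
n_pairs <- length(dlist) * (length(dlist) - 1) / 2     # number of (i, j) pairs with j > i
outmat  <- matrix(NA_real_, nrow = n_pairs, ncol = 3)  # preallocated output, filled with NA
row <- 1
for (i in 1:(length(dlist) - 1)) {
  for (j in (i + 1):length(dlist)) {
    outmat[row, ] <- c(dlist[i], dlist[j], dlist[j] - dlist[i])
    row <- row + 1
  }
}
outlist <- as.data.frame(outmat)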

R: fast sliding window with given coordinates

I have a data table with nrow being around a million or two and ncol of about 200.
Each entry in a row has a coordinate associated with it.
Tiny portion of the data:
[1,] -2.80331471 -0.8874522 -2.34401863 -3.811584 -2.1292443
[2,] 0.03177716 0.2588624 0.82877467 1.955099 0.6321881
[3,] -1.32954665 -0.5433407 -2.19211837 -2.342554 -2.2142461
[4,] -0.60771429 -0.9758734 0.01558774 1.651459 -0.8137684
Coordinates for the first 4 rows:
9928202 9928251 9928288 9928319
What I would like is a function that given the data and window-size would return a data table of the same size with a mean sliding window applied on each column. Or in other words - for each row entry i it would find entries with coordinates between coords[i]-windsize and coords[i]+windsize and replace the initial value with the mean of the values inside that interval (separately for each column).
Speed is the main issue here.
Here is my first take at such a function.
doSlidingWindow <- function(intensities, coords, windsize) {
  windHalfSize <- ceiling(windsize/2)
  ### whole range inds
  RANGE <- integer(max(coords)+windsize)
  RANGE[coords] <- c(1:length(coords)[1])
  ### get indices of rows falling in each window
  COORDS <- as.list(coords)
  WINDOWINDS <- sapply(COORDS, function(crds){ unique(RANGE[(crds-windHalfSize):
                                                            (crds+windHalfSize)]) })
  ### do windowing
  wind_ints <- intensities
  wind_ints[] <- 0
  for(i in 1:length(coords)) {
    wind_ints[i,] <- apply(as.matrix(intensities[WINDOWINDS[[i]],]), 2, mean)
  }
  return(wind_ints)
}
The code before the last for loop is quite fast, and it gets me a list of the indices I need to use for each entry. However, everything then falls apart, since I need to grind through the for loop a million times, take subsets of my data table, and make sure that I have more than one row so that I can work with all the columns at once inside apply.
My second approach is to just stick the actual values into the RANGE list, fill the gaps with zeroes, and run rollmean from the zoo package on each column. But this is redundant, since rollmean will go through all the gaps and I will only be using the values at the original coordinates in the end.
Any help to make it faster without going to C would be much appreciated.
Data generation:
N <- 1e5 # rows
M <- 200 # columns
W <- 10 # window size
set.seed(1)
intensities <- matrix(rnorm(N*M), nrow=N, ncol=M)
coords <- 8000000 + sort(sample(1:(5*N), N))
Original function with minor modifications I used for benchmarks:
doSlidingWindow <- function(intensities, coords, windsize) {
  windHalfSize <- ceiling(windsize/2)
  ### whole range inds
  RANGE <- integer(max(coords)+windsize)
  RANGE[coords] <- c(1:length(coords)[1])
  ### get indices of rows falling in each window
  ### NOTE: each element of WINDOWINDS holds a zero. Not a big problem though.
  WINDOWINDS <- sapply(coords, function(crds) ret <- unique(RANGE[(crds-windHalfSize):(crds+windHalfSize)]))
  ### do windowing
  wind_ints <- intensities
  wind_ints[] <- 0
  for(i in 1:length(coords)) {
    # CORRECTION: when there is only one row in the window there was trouble
    wind_ints[i,] <- apply(matrix(intensities[WINDOWINDS[[i]],], ncol=ncol(intensities)), 2, mean)
  }
  return(wind_ints)
}
POSSIBLE SOLUTIONS:
1) data.table
data.table is known to be fast at subsetting, but this page (and others related to sliding windows) suggests that this is not the case here. Indeed, the data.table code is elegant, but unfortunately very slow:
require(data.table)
require(plyr)
dt <- data.table(coords, intensities)
setkey(dt, coords)
aaply(1:N, 1, function(i) dt[WINDOWINDS[[i]], sapply(.SD,mean), .SDcols=2:(M+1)])
2) foreach+doSNOW
The basic routine is easy to run in parallel, so we can benefit from that:
require(doSNOW)
doSlidingWindow2 <- function(intensities, coords, windsize) {
  NC <- 2 # number of nodes in cluster
  cl <- makeCluster(rep("localhost", NC), type="SOCK")
  registerDoSNOW(cl)
  N <- ncol(intensities)  # total number of columns
  chunk <- ceiling(N/NC)  # number of columns sent to a single node
  result <- foreach(i=1:NC, .combine=cbind, .export=c("doSlidingWindow")) %dopar% {
    start <- (i-1)*chunk+1
    end <- ifelse(i!=NC, i*chunk, N)
    doSlidingWindow(intensities[,start:end], coords, windsize)
  }
  stopCluster(cl)
  return(result)
}
The benchmark shows a notable speed-up on my dual-core processor:
system.time(res <- doSlidingWindow(intensities, coords, W))
# user system elapsed
# 306.259 0.204 307.770
system.time(res2 <- doSlidingWindow2(intensities, coords, W))
# user system elapsed
# 1.377 1.364 177.223
all.equal(res, res2, check.attributes=FALSE)
# [1] TRUE
3) Rcpp
Yes, I know you asked "without going to C". But, please, take a look. This code is inline and rather straightforward:
require(Rcpp)
require(inline)
doSlidingWindow3 <- cxxfunction(signature(intens="matrix", crds="numeric", wsize="numeric"), plugin="Rcpp", body='
#include <vector>
  Rcpp::NumericMatrix intensities(intens);
  const int N = intensities.nrow();
  const int M = intensities.ncol();
  Rcpp::NumericMatrix wind_ints(N, M);
  std::vector<int> coords = as< std::vector<int> >(crds);
  int windsize = ceil(as<double>(wsize)/2);
  for(int i=0; i<N; i++){
    // Simple search for window range (begin:end in coords)
    // Assumed that coords are non-decreasing
    int begin = (i-windsize)<0?0:(i-windsize);
    while(coords[begin]<(coords[i]-windsize)) ++begin;
    int end = (i+windsize)>(N-1)?(N-1):(i+windsize);
    while(coords[end]>(coords[i]+windsize)) --end;
    for(int j=0; j<M; j++){
      double result = 0.0;
      for(int k=begin; k<=end; k++){
        result += intensities(k,j);
      }
      wind_ints(i,j) = result/(end-begin+1);
    }
  }
  return wind_ints;
')
Benchmark:
system.time(res <- doSlidingWindow(intensities, coords, W))
# user system elapsed
# 306.259 0.204 307.770
system.time(res3 <- doSlidingWindow3(intensities, coords, W))
# user system elapsed
# 0.328 0.020 0.351
all.equal(res, res3, check.attributes=FALSE)
# [1] TRUE
I hope the results are quite motivating. While the data fits in memory, the Rcpp version is pretty fast. For example, with N <- 1e6 and M <- 100 I got:
user system elapsed
2.873 0.076 2.951
Naturally, once R starts using swap, everything slows down. With really large data that doesn't fit in memory, you should consider sqldf, ff or bigmemory.
rollapply works great with a small dataset. However, if you are working with several million rows (e.g. genomics data) it is quite slow.
The following function is super fast:
data <- c(runif(100000, min=0, max=.1),runif(100000, min=.05, max=.1),runif(10000, min=.05, max=1), runif(100000, min=0, max=.2))
slideFunct <- function(data, window, step){
  total <- length(data)
  spots <- seq(from=1, to=(total-window), by=step)
  result <- vector(length = length(spots))
  for(i in 1:length(spots)){
    result[i] <- mean(data[spots[i]:(spots[i]+window)])
  }
  return(result)
}
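For example, a quick usage sketch with arbitrary window and step sizes:
smoothed <- slideFunct(data, window = 1000, step = 100)
length(smoothed)   # one mean per window start position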
Details here.
