Speed up WMA (Weighted Moving Average) calculation - r

I am trying to calculate an exponential moving average on 15-day bars, but want to see the "evolution" of the 15-day-bar EMA at the end of each day/bar. So I have 15-day bars, and when new data comes in daily I would like to recalculate the EMA using the new information. In other words, after each complete 15-day bar a new bar starts to grow day by day, and this partial bar should be used in the EMA calculation together with the previous full 15-day bars.
Let's say we start at 2012-01-01 (we have data for each calendar day in this example); at the end of 2012-01-15 we have the first complete 15-day bar. After 4 completed 15-day bars, on 2012-03-01, we can start calculating the 4-bar EMA (EMA(x, n=4)). At the end of 2012-03-02 we use the information available up to that moment and calculate the EMA, treating the cumulative OHLC as of 2012-03-02 as the 15-day bar in progress. So we take the 4 complete bars plus the bar in progress on 2012-03-02 and calculate EMA(x, n=4). We then wait another day, see what happened to the 15-day bar in progress (see the function to.period.cumulative below for details) and calculate a new value for the EMA, and so on for the next 15 days onwards. See the function EMA.cumulative below for details.
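To make the arithmetic concrete, here is a minimal sketch of the recursion this amounts to on a single day (the closes are made up; seeding with a plain average of the first n values mirrors the TTR::EMA default):
n     <- 4
ratio <- 2 / (n + 1)                     # 0.4
fullBarCloses <- c(100, 102, 101, 103)   # hypothetical closes of 4 complete 15-day bars
emaFull    <- mean(fullBarCloses)        # seed value: simple average of the first n closes
inProgress <- 104                        # close of the 15-day bar still in progress
emaToday   <- ratio * inProgress + (1 - ratio) * emaFull
emaToday                                 # provisional EMA for the current day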
Below is what I have been able to come up with so far. The performance is not acceptable to me, and I cannot make it any faster with my limited R knowledge.
library(quantmod)
do.call.rbind <- function(lst) {
  while (length(lst) > 1) {
    idxlst <- seq(from=1, to=length(lst), by=2)
    lst <- lapply(idxlst, function(i) {
      if (i == length(lst)) { return(lst[[i]]) }
      return(rbind(lst[[i]], lst[[i+1]]))
    })
  }
  lst[[1]]
}
to.period.cumulative <- function(x, name=NULL, period="days", numPeriods=15) {
  if (is.null(name))
    name <- deparse(substitute(x))
  cnames <- c("Open", "High", "Low", "Close")
  if (has.Vo(x))
    cnames <- c(cnames, "Volume")
  cnames <- paste(name, cnames, sep=".")
  if (quantmod:::is.OHLCV(x)) {
    x <- OHLCV(x)
    out <- do.call.rbind(
      lapply(split(x, f=period, k=numPeriods),
             function(x) cbind(rep(first(x[,1]), NROW(x[,1])),
                               cummax(x[,2]), cummin(x[,3]), x[,4], cumsum(x[,5]))))
  } else if (quantmod:::is.OHLC(x)) {
    x <- OHLC(x)
    out <- do.call.rbind(
      lapply(split(x, f=period, k=numPeriods),
             function(x) cbind(rep(first(x[,1]), NROW(x[,1])),
                               cummax(x[,2]), cummin(x[,3]), x[,4])))
  } else {
    stop("Object does not have OHLC(V).")
  }
  colnames(out) <- cnames
  return(out)
}
EMA.cumulative <- function(cumulativeBars, nEMA = 4, period="days", numPeriods=15) {
  barsEndptCl <- Cl(cumulativeBars[endpoints(cumulativeBars, on=period, k=numPeriods)])
  # TODO: This is sloooooooooooooooooow...
  outEMA <- do.call.rbind(
    lapply(split(Cl(cumulativeBars), period),
           function(x) {
             previousFullBars <- barsEndptCl[index(barsEndptCl) < last(index(x)), ]
             if (NROW(previousFullBars) >= (nEMA - 1)) {
               last(EMA(last(rbind(previousFullBars, x), n=(nEMA + 1)), n=nEMA))
             } else {
               xts(NA, order.by=index(x))
             }
           }))
  colnames(outEMA) <- paste("EMA", nEMA, sep="")
  return(outEMA)
}
getSymbols("SPY", from="2010-01-01")
SPY.cumulative <- to.period.cumulative(SPY, , name="SPY")
system.time(
SPY.EMA <- EMA.cumulative(SPY.cumulative)
)
On my system it takes
user system elapsed
4.708 0.000 4.410
An acceptable execution time would be less than one second... Is it possible to achieve this using pure R?
This post is linked to my earlier question, "Optimize moving averages calculation - is it possible?", where I received no answers. I have now been able to create a reproducible example with a more detailed explanation of what I want to speed up. I hope the question makes more sense now.
Any ideas on how to speed this up are highly appreciated.

I did not find a satisfactory solution to my question using pure R, so I reached for the old tool, the C language, and the results are better than I would ever have expected. Thanks for "pushing" me to use these great tools: Rcpp, inline, etc. Amazing. I guess that whenever I have performance requirements in the future that cannot be met in R, I will add C to the R code, and the performance will be there. So, please see below my code and the resolution of the performance issues.
# How to speedup cumulative EMA calculation
#
###############################################################################
library(quantmod)
library(Rcpp)
library(inline)
library(rbenchmark)
do.call.rbind <- function(lst) {
  while (length(lst) > 1) {
    idxlst <- seq(from=1, to=length(lst), by=2)
    lst <- lapply(idxlst, function(i) {
      if (i == length(lst)) { return(lst[[i]]) }
      return(rbind(lst[[i]], lst[[i+1]]))
    })
  }
  lst[[1]]
}
to.period.cumulative <- function(x, name=NULL, period="days", numPeriods=15) {
  if (is.null(name))
    name <- deparse(substitute(x))
  cnames <- c("Open", "High", "Low", "Close")
  if (has.Vo(x))
    cnames <- c(cnames, "Volume")
  cnames <- paste(name, cnames, sep=".")
  if (quantmod:::is.OHLCV(x)) {
    x <- quantmod:::OHLCV(x)
    out <- do.call.rbind(
      lapply(split(x, f=period, k=numPeriods),
             function(x) cbind(rep(first(x[,1]), NROW(x[,1])),
                               cummax(x[,2]), cummin(x[,3]), x[,4], cumsum(x[,5]))))
  } else if (quantmod:::is.OHLC(x)) {
    x <- OHLC(x)
    out <- do.call.rbind(
      lapply(split(x, f=period, k=numPeriods),
             function(x) cbind(rep(first(x[,1]), NROW(x[,1])),
                               cummax(x[,2]), cummin(x[,3]), x[,4])))
  } else {
    stop("Object does not have OHLC(V).")
  }
  colnames(out) <- cnames
  return(out)
}
EMA.cumulative <- function(cumulativeBars, nEMA = 4, period="days", numPeriods=15) {
  barsEndptCl <- Cl(cumulativeBars[endpoints(cumulativeBars, on=period, k=numPeriods)])
  # TODO: This is sloooooooooooooooooow...
  outEMA <- do.call.rbind(
    lapply(split(Cl(cumulativeBars), period),
           function(x) {
             previousFullBars <- barsEndptCl[index(barsEndptCl) < last(index(x)), ]
             if (NROW(previousFullBars) >= (nEMA - 1)) {
               last(EMA(last(rbind(previousFullBars, x), n=(nEMA + 1)), n=nEMA))
             } else {
               xts(NA, order.by=index(x))
             }
           }))
  colnames(outEMA) <- paste("EMA", nEMA, sep="")
  return(outEMA)
}
EMA.c.c.code <- '
  /* Initialize loop and PROTECT counters */
  int i, P=0;
  /* Ensure that cumbars is double */
  if(TYPEOF(cumbars) != REALSXP) {
    PROTECT(cumbars = coerceVector(cumbars, REALSXP)); P++;
  }
  /* Pointers to function arguments */
  double *d_cumbars = REAL(cumbars);
  int i_nper = asInteger(nperiod);
  int i_n = asInteger(n);
  double d_ratio = asReal(ratio);
  /* Input object length */
  int nr = nrows(cumbars);
  /* Initialize result R object */
  SEXP result;
  PROTECT(result = allocVector(REALSXP,nr)); P++;
  double *d_result = REAL(result);
  /* Find first non-NA input value */
  int beg = i_n*i_nper - 1;
  d_result[beg] = 0;
  for(i = 0; i <= beg; i++) {
    /* Account for leading NAs in input */
    if(ISNA(d_cumbars[i])) {
      d_result[i] = NA_REAL;
      beg++;
      d_result[beg] = 0;
      continue;
    }
    /* Set leading NAs in output */
    if(i < beg) {
      d_result[i] = NA_REAL;
    }
    /* Raw mean to start EMA - but only on full bars */
    if ((i != 0) && (i%i_nper == (i_nper - 1))) {
      d_result[beg] += d_cumbars[i] / i_n;
    }
  }
  /* Loop over non-NA input values */
  int i_lookback = 0;
  for(i = beg+1; i < nr; i++) {
    i_lookback = i%i_nper;
    if (i_lookback == 0) {
      i_lookback = 1;
    }
    /* Previous result should be based only on full bars */
    d_result[i] = d_cumbars[i] * d_ratio + d_result[i-i_lookback] * (1-d_ratio);
  }
  /* UNPROTECT R objects and return result */
  UNPROTECT(P);
  return(result);
'
EMA.c.c <- cfunction(signature(cumbars="numeric", nperiod="numeric", n="numeric", ratio="numeric"), EMA.c.c.code)
EMA.cumulative.c <- function(cumulativeBars, nEMA = 4, period="days", numPeriods=15) {
  ratio <- 2/(nEMA+1)
  outEMA <- EMA.c.c(cumbars=Cl(cumulativeBars), nperiod=numPeriods, n=nEMA, ratio=ratio)
  outEMA <- reclass(outEMA, Cl(cumulativeBars))
  colnames(outEMA) <- paste("EMA", nEMA, sep="")
  return(outEMA)
}
getSymbols("SPY", from="2010-01-01")
SPY.cumulative <- to.period.cumulative(SPY, name="SPY")
system.time(
  SPY.EMA <- EMA.cumulative(SPY.cumulative)
)
system.time(
  SPY.EMA.c <- EMA.cumulative.c(SPY.cumulative)
)
res <- benchmark(EMA.cumulative(SPY.cumulative), EMA.cumulative.c(SPY.cumulative),
                 columns=c("test", "replications", "elapsed", "relative", "user.self", "sys.self"),
                 order="relative",
                 replications=10)
print(res)
EDIT: To give an indication of the performance improvement over my cumbersome R version (I am sure it can be made better, since in effect I created a double for loop), here is a printout:
> print(res)
                              test replications elapsed relative user.self
2 EMA.cumulative.c(SPY.cumulative)           10   0.026    1.000     0.024
1   EMA.cumulative(SPY.cumulative)           10  57.732 2220.462    56.755
So, by my standards, a science-fiction level of improvement...

Related

How to find the smallest circumcircle of an irregular polygon on R project?

I was wondering how to find the smallest circumcircle of an irregular polygon. I have worked with spatial polygons in R.
I want to reproduce some of the FRAGSTATS metrics in vector mode because I had a hard time with the 'landscapemetrics' package for a huge amount of data. Specifically, I would like to implement the CIRCLE metric (http://www.umass.edu/landeco/research/fragstats/documents/Metrics/Shape%20Metrics/Metrics/P11%20-%20CIRCLE.htm). So far, I could not find a formula or script for the smallest circumcircle.
All your comments are more than welcome.
Thank you
As I mentioned in a comment, I don't know of existing R code for this, but a brute-force search should be fast enough if you don't have too many points that need to be in the circle. I just wrote this one. The center() function is based on code from Wikipedia for drawing a circle around a triangle; circumcircle() is the function you want, which works by a brute-force search through all circles that pass through 2 or 3 points of the set. On my laptop it takes about 4 seconds to handle 100 points. If you have somewhat bigger sets, you can probably get tolerable results by translating it to C++, but the growth rate is O(n^4), so you'll need a better solution for a really large set.
center <- function(D) {
  if (NROW(D) == 0)
    matrix(numeric(), ncol = 2)
  else if (NROW(D) == 1)
    D
  else if (NROW(D) == 2) {
    (D[1,] + D[2,])/2
  } else if (NROW(D) == 3) {
    B <- D[2,] - D[1,]
    C <- D[3,] - D[1,]
    Dprime <- 2*(B[1]*C[2] - B[2]*C[1])
    if (Dprime == 0) {
      drop <- which.max(c(sum((B-C)^2), sum(C^2), sum(B^2)))
      center(D[-drop,])
    } else
      c((C[2]*sum(B^2) - B[2]*sum(C^2))/Dprime,
        (B[1]*sum(C^2) - C[1]*sum(B^2))/Dprime) + D[1,]
  } else
    center(circumcircle(D))
}
radius <- function(D, U = center(D))
  sqrt(sum((D[1,] - U)^2))
circumcircle <- function(P) {
  n <- NROW(P)
  if (n < 3)
    return(P)
  P <- P[sample(n),]
  bestset <- NULL
  bestrsq <- Inf
  # Brute force search over 2 point sets
  for (i in 1:(n-1)) {
    for (j in (i+1):n) {
      D <- P[c(i,j),]
      U <- center(D)
      rsq <- sum((D[1,] - U)^2)
      if (rsq >= bestrsq)
        next
      failed <- FALSE
      for (k in (1:n)[-j][-i]) {
        Pk <- P[k,,drop = FALSE]
        if (sum((Pk - U)^2) > rsq) {
          failed <- TRUE
          break
        }
      }
      if (!failed) {
        bestset <- c(i,j)
        bestrsq <- rsq
      }
    }
  }
  # Look for the best 3 point set
  for (i in 1:(n-2)) {
    for (j in (i+1):(n-1)) {
      for (l in (j+1):n) {
        D <- P[c(i,j,l),]
        U <- center(D)
        rsq <- sum((D[1,] - U)^2)
        if (rsq >= bestrsq)
          next
        failed <- FALSE
        for (k in (1:n)[-l][-j][-i]) {
          Pk <- P[k,,drop = FALSE]
          if (sum((Pk - U)^2) > rsq) {
            failed <- TRUE
            break
          }
        }
        if (!failed) {
          bestset <- c(i,j,l)
          bestrsq <- rsq
        }
      }
    }
  }
  P[bestset,]
}
showP <- function(P, ...) {
  plot(P, asp = 1, type = "n", ...)
  text(P, labels = seq_len(nrow(P)))
}
showD <- function(D) {
  U <- center(D)
  r <- radius(D, U)
  theta <- seq(0, 2*pi, len = 100)
  lines(U[1] + r*cos(theta), U[2] + r*sin(theta))
}
n <- 100
P <- cbind(rnorm(n), rnorm(n))
D <- circumcircle(P)
showP(P)
showD(D)
Running showP(P) followed by showD(D) plots the points with the smallest enclosing circle drawn on top.
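As a quick sanity check (a small addition of mine, using the P, D, center() and radius() objects defined above), every point should lie inside or on the computed circle, up to floating-point tolerance:
U <- center(D)
r <- radius(D, U)
all(sqrt(rowSums(sweep(P, 2, U)^2)) <= r + 1e-8)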

Small bug in backpropagation algorithm in r

I've been trying to implement backpropagation in R, but I've been getting some strange results. It appears that after 1000 iterations of backprop, the program predicts 1 for all values. I was hoping it was a problem in the test function, but testing on smaller numbers of iterations shows that 0 is predicted as an output value in some instances. It seems that somewhere in iterating through the dataset, the weight updates tend toward increasing the error, when they should tend toward reducing it.
I apologize that the code is difficult to read in spots. I'm working on this with a partner, and I dislike the way he names variables. It's also not as fully commented as I'd like. Any help is appreciated.
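For reference while debugging, here is a minimal, self-contained sketch of the standard gradient-descent update for a single sigmoid output weight under squared error; all names here are illustrative and not taken from the code below:
sigmoid <- function(s) 1/(1 + exp(-s))
x      <- 0.8   # input feeding the weight
w      <- 0.3   # current weight
target <- 0     # desired output
lr     <- 0.1   # learning rate
out    <- sigmoid(w * x)
delta  <- out * (1 - out) * (out - target)  # error term, same form as in the code below
w      <- w - lr * delta * x                # note the minus sign: descend the gradient
w
With the error term written as (out - target), the update has to subtract; adding it instead moves the weight in the direction that increases the error.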
# initialize a global output vector and a global vector of data frames
createNeuralNet <- function(numberOfInputNodes,hiddenLayers,nodesInHiddenLayer){
L <<- initializeWeightDataFrames(numberOfInputNodes,nodesInHiddenLayer,hiddenLayers)
# print(L)
OutputList <<- initializeOutputVectors(hiddenLayers)
}
# creates a list of weight data frames
# each weight data frame uses the row as an index of the "tail" for a connection
# the "head" of the connection (where the arrow points) is in the column index
# the value in the weight data frame is the weight of that connection
# the last row is the weight between the bias and a particular node
initializeWeightDataFrames <- function(numberOfInputNodes, nodesPerHiddenLayer, numberOfHiddenLayers) {
weights <- vector("list", numberOfHiddenLayers + 1)
# this code simply creates empty data frames of the proper size so that they may
first <- read.csv(text=generateColumnNamesCSV(nodesPerHiddenLayer))
middle <- read.csv(text=generateColumnNamesCSV(nodesPerHiddenLayer))
# assume binary classifier, so output layer has 1 node
last <- read.csv(text=generateColumnNamesCSV(1))
first <- assignWeights(first, numberOfInputNodes + 1)
weights[[1]] <- first
# assign random weights to each row
if (numberOfHiddenLayers != 1) {
for (i in 1:numberOfHiddenLayers - 1) {
middle <- assignWeights(middle, nodesPerHiddenLayer + 1)
weights[[i+1]] <- middle
}
}
last <- assignWeights(last, nodesPerHiddenLayer + 1)
weights[[length(weights)]] <- last
return(weights)
}
# generate a comma-separated string of column names c1 thru cn for creating arbitrary size data frame
generateColumnNamesCSV <- function(n) {
namesCSV <- ""
if (n==1) {
return("c1")
}
for (i in 1:(n-1)) {
namesCSV <- paste0(namesCSV, "c", i, ",")
}
namesCSV <- paste0(namesCSV, "c", n)
return(namesCSV)
}
assignWeights <- function(weightDF, numRows) {
modifiedweightDF <- weightDF
for (rowNum in 1:numRows) {
# creates a bunch of random numbers from -1 to 1, used to populate a row
rowVector <- runif(length(weightDF))
for (i in 1:length(rowVector)) {
sign <- (-1)^round(runif(1))
rowVector[i] <- sign * rowVector[i]
}
modifiedweightDF[rowNum,] <- rowVector
}
return(modifiedweightDF)
}
# create an empty list of the right size, will hold vectors of node outputs in the future
initializeOutputVectors <- function(numberOfHiddenLayers) {
numberOfLayers <- numberOfHiddenLayers + 1
outputVectors <- vector("list", numberOfLayers)
return(outputVectors)
}
# this is the main loop that does feed-forward and back prop
trainNeuralNet <- function(trainingData,target,iterations){
count <- 0
# iterations is a constant for how many times the dataset should be iterated through
while(count<iterations){
print(count)
for(row in 1:nrow(trainingData)) { # for each row in the data set
#Feed Forward
# instance is the current row that's being looked at
instance <- trainingData[row,]
# print(instance)
for (l in 1:length(L)) { # for each weight data frame
# w is the current weights
w <- L[[l]]
#print(w)
Output <- rep(NA, length(w))
if (l!=1) {
# x is the values in the previous layer
# can't access the previous layer if you're on the first layer
x <- OutputList[[l-1]]
#print(x)
}
for (j in 1:ncol(w)) { # for each node j in the "head" layer
s <- 0
for (i in 1:(nrow(w)-1)) {
# calculate the weighted sum s of connection weights and node values
# this is used to calculate a node in the next layer
# check the instance if on the first layer
if (l==1) {
# print(i)
# print(instance[1,i])
# i+1 skips over the target column
s <- s + instance[1,i+1]*w[i,j]
# print(s)
# if the layer is 2 or more
}else{
# print(i)
#print(j)
# print(w)
# print(w[i,j])
s <- s + x[i]*w[i,j] # weighted sum
# sigmoid activation function value for node j
}
}
#print(s)
s <- s + w[nrow(w),j] # add weighted bias
# print("s")
# print(s)
# print("sigmoid s")
# print(sigmoid(s))
Output[j] <- sigmoid(s)
}
OutputList[[l]] <- Output
}
# print(OutputList)
# print("w")
# print(L)
# print("BAck prop Time")
#Back Propagation
out <- OutputList[length(OutputList)]
#print(OutputList)
outputError <- rep(NA, length(w))
outputErrorPresent <- rep(NA, length(w))
outputError[1] <- out[[1]]*(1-out[[1]])*(out[[1]]-target[row])
for (h in (length(L)):1) { # for each weight matrix in hidden area h (going backwards)
hiddenOutput <- OutputList[h]
#print("hiddenOutput")
#print(h)
if (row==1||row==2) {
# print("h")
# print(h)
# print("output error Present")
# print(outputErrorPresent)
}
if (h!=(length(L))) {
outputError <- outputErrorPresent
}
w <- L[[h]]
for (j in 1:(nrow(w))) { # for each node j in hidden layer h
#print("length w")
#print(length(w))
if (row==1||row==2) {
# print("j")
# print(j)
}
errSum <- 0
nextLayerNodes <- L[[h]]
# print(nextLayerNodes)
#print(class(nextLayerNodes))
for (k in 1:ncol(nextLayerNodes)) {
errSum <- errSum + outputError[k]*nextLayerNodes[j,k]
}
m <- 0
if (h == 1) {
m <- as.numeric(instance)
m <- m[-1]
} else {
m <- OutputList[h-1][[1]]
}
deltaWeight <- 0
for (k in 1:ncol(nextLayerNodes)) {
hiddenNodeError <- hiddenOutput[[1]][k]*(1- hiddenOutput[[1]][k])*errSum
if (j == nrow(w)) {
deltaWeight <- learningRate*hiddenNodeError
} else {
deltaWeight <- learningRate*hiddenNodeError*m[j]
}
# print(deltaWeight)
w[j,k] <- w[j,k] + deltaWeight
}
if (j != nrow(w)) {
outputErrorPresent[j] <- hiddenNodeError
}
}
L[[h]] <<- w
}
# print(OutputList)
}
count <- count +1
# print(L)
#calculate global error
}
########################repeat
# print("w")
}
sigmoid <- function(s){
sig <- 1/(1+exp(-s))
return(sig)
}
testNeuralNetwork <- function(testingData,testTarget){
correctCount <- 0
# run the same code as feed forward
# this time run it on testing examples and compare the outputs
for(row in 1:nrow(testingData)) { # for each test instance
#Feed Forward
instance <- testingData[row,]
#print(instance)
for (l in 1:length(L)) { # for each layer l
w <- L[[l]]
#print(w)
Output <- rep(NA, length(w))
if (l!=1) {
x <- OutputList[[l-1]]
#print(x)
}
for (j in 1:ncol(w)) { # for each node j in layer l
s <- 0
for (i in 1:(nrow(w)-1)) {
if (l==1) {
# i+1 skips over the target column
s <- s + instance[1,i+1]*w[i,j]
# print(s)
}else{
# print(i)
#print(j)
# print(w)
# print(w[i,j])
s <- s + x[i]*w[i,j] # weighted sum
# sigmoid activation function value for node j
}
}
#print(s)
s <- s + w[nrow(w),j] # add weighted bias
Output[j] <- sigmoid(s)
#print(sigmoid(s))
}
OutputList[[l]] <- Output
}
# print(OutputList)
outputVal <- threshold(OutputList[[length(OutputList)]])
if (outputVal==testTarget[row]) {
print(paste0(" ", outputVal, " Correct!"))
correctCount <- correctCount + 1
}else{
print(paste0(" ", outputVal, " Wrong."))
}
#print()
#print(paste0("s2 ",str))
}
}
# convert real-valued output to a binary classification
threshold <- function(value){
if (value>=0.5) {
return(1)
}else{
return(0)
}
}
# this modifies df by removing 30 random rows
# this means that the same df will be changed permanently, so be careful of that
# it also returns the 30 random rows as a test set
makeTestSet <- function(df, size) {
len <- 1:length(df[,1])
randRows <- sample(len, size, replace=F)
return(randRows)
}
Data <- read.csv(file = "Downloads/numericHouse-votes-84.csv", head = TRUE, sep = ",")
learningRate <<- 0.1
# assume that the first column of the data is the column that is to be predicted
# thus the number of inputs is 1 less than the number of columnns
numberOfInputNodes <- ncol(Data) - 1
randRows <- makeTestSet(Data,30) #change this to 30
testData <- Data[randRows,]
trainingData <- Data[-randRows,]
testTarget <- testData[,1]
#trainingData <- Data[,1:numberOfInputNodes]
trainingTarget <- trainingData[,1]
createNeuralNet(numberOfInputNodes,1,numberOfInputNodes)
iterations <- 100
trainNeuralNet(trainingData,trainingTarget,iterations)
testNeuralNetwork(testData,testTarget)
L

A faster function to lower the resolution of a raster R

I am using the raster package to lower the resolution of big rasters, using the function aggregate like this
require(raster)
x <- matrix(rpois(1000000, 2),1000)
a <-raster(x)
plot(a)
agg.fun <- function(x,...)
  if (sum(x)==0) {
    return(NA)
  } else {
    which.max(table(x))
  }
a1 <- aggregate(a, fact=10, fun=agg.fun)
plot(a1)
The raster images I have to aggregate are much bigger (34000 x 34000), so I would like to know if there is a faster way to implement the agg.fun function.
You can use gdalUtils::gdalwarp for this. For me, it's less efficient than @JosephWood's fasterAgg.Fun for rasters with 1,000,000 cells, but for Joseph's larger example it's much faster. It requires that the raster exist on disk, so factor writing time into the timings below if your raster is in memory.
Below, I've used the modification of fasterAgg.Fun that returns the most frequent value, rather than its index in the block.
library(raster)
x <- matrix(rpois(10^8, 2), 10000)
a <- raster(x)
fasterAgg.Fun <- function(x,...) {
  myRle.Alt <- function (x1) {
    n1 <- length(x1)
    y1 <- x1[-1L] != x1[-n1]
    i <- c(which(y1), n1)
    x1[i][which.max(diff(c(0L, i)))]
  }
  if (sum(x)==0) {
    return(NA)
  } else {
    myRle.Alt(sort(x, method="quick"))
  }
}
system.time(a2 <- aggregate(a, fact=10, fun=fasterAgg.Fun))
##   user  system elapsed
##  67.42    8.82   76.38
library(gdalUtils)
writeRaster(a, f <- tempfile(fileext='.tif'), datatype='INT1U')
system.time(a3 <- gdalwarp(f, f2 <- tempfile(fileext='.tif'), r='mode',
                           multi=TRUE, tr=res(a)*10, output_Raster=TRUE))
##   user  system elapsed
##   0.00    0.00    2.93
Note that there is a slight difference in the definition of the mode when there are ties: gdalwarp selects the highest value, while the functions passed to aggregate above (via which.max's behaviour) select the lowest (e.g., see which.max(table(c(1, 1, 2, 2, 3, 4)))).
Also, storing the raster data as integer is important (when applicable). If the data are stored as float (the writeRaster default), for example, the gdalwarp operation above takes ~14 sec on my system. See ?dataType for available types.
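For example, a hypothetical comparison sketch (the file names f3 and f4 are made up here, and times will vary):
writeRaster(a, f3 <- tempfile(fileext='.tif'))   # defaults to a float datatype
system.time(
  a4 <- gdalwarp(f3, f4 <- tempfile(fileext='.tif'), r='mode',
                 multi=TRUE, tr=res(a)*10, output_Raster=TRUE)
)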
Try this:
fasterAgg.Fun <- function(x,...) {
  myRle.Alt <- function (x1) {
    n1 <- length(x1)
    y1 <- x1[-1L] != x1[-n1]
    i <- c(which(y1), n1)
    which.max(diff(c(0L, i)))
  }
  if (sum(x)==0) {
    return(NA)
  } else {
    myRle.Alt(sort(x, method="quick"))
  }
}
library(rbenchmark)
benchmark(FasterAgg=aggregate(a, fact=10, fun=fasterAgg.Fun),
          AggFun=aggregate(a, fact=10, fun=agg.fun),
          replications=10,
          columns = c("test", "replications", "elapsed", "relative"),
          order = "relative")
       test replications elapsed relative
1 FasterAgg           10  12.896    1.000
2    AggFun           10  30.454    2.362
For a larger test object, we have:
x <- matrix(rpois(10^8,2),10000)
a <- raster(x)
system.time(a2 <- aggregate(a, fact=10, fun=fasterAgg.Fun))
user system elapsed
111.271 22.225 133.943
system.time(a1 <- aggregate(a, fact=10, fun=agg.fun))
user system elapsed
282.170 24.327 308.112
If you want the actual values, as @digEmAll says in the comments above, simply change the return value in myRle.Alt from which.max(diff(c(0L, i))) to x1[i][which.max(diff(c(0L, i)))].
Just for fun, I also created an Rcpp function (not much faster than @JosephWood's):
########### original function
# (modified to return the most frequent value instead of its index)
agg.fun <- function(x,...){
  if (sum(x)==0) {
    return(NA)
  } else {
    as.integer(names(which.max(table(x))))
  }
}
########### @JosephWood function
fasterAgg.Fun <- function(x,...) {
  myRle.Alt <- function (x1) {
    n1 <- length(x1)
    y1 <- x1[-1L] != x1[-n1]
    i <- c(which(y1), n1)
    x1[i][which.max(diff(c(0L, i)))]
  }
  if (sum(x)==0) {
    return(NA)
  } else {
    myRle.Alt(sort(x, method="quick"))
  }
}
########### Rcpp function
library(Rcpp)
library(inline)
aggrRcpp <- cxxfunction(signature(values='integer'), '
  Rcpp::IntegerVector v(clone(values));
  std::sort(v.begin(), v.end());
  int n = v.size();
  double sum = 0;
  int currentValue = 0, currentCount = 0, maxValue = 0, maxCount = 0;
  for(int i=0; i < n; i++) {
    int value = v[i];
    sum += value;
    if(i==0 || currentValue != value){
      if(currentCount > maxCount){
        maxCount = currentCount;
        maxValue = currentValue;
      }
      currentValue = value;
      currentCount = 0;
    }else{
      currentCount++;
    }
  }
  if(sum == 0){
    return Rcpp::IntegerVector::create(NA_INTEGER);
  }
  if(currentCount > maxCount){
    maxCount = currentCount;
    maxValue = currentValue;
  }
  return wrap( maxValue ) ;
', plugin="Rcpp", verbose=FALSE,
   includes='')
# wrap it to support "..." argument
aggrRcppW <- function(x,...) aggrRcpp(x)
Benchmark:
require(raster)
set.seed(123)
x <- matrix(rpois(10^8, 2), 10000)
a <- raster(x)
system.time(a1<-aggregate(a,fact=100,fun=agg.fun))
# user system elapsed
# 35.13 0.44 35.87
system.time(a2<-aggregate(a,fact=100,fun=fasterAgg.Fun))
# user system elapsed
# 8.20 0.34 8.59
system.time(a3<-aggregate(a,fact=100,fun=aggrRcppW))
# user system elapsed
# 5.77 0.39 6.22
########### all equal ?
all(TRUE,all.equal(a1,a2),all.equal(a2,a3))
# > [1] TRUE
If your goal is aggregation, wouldn't you want the max function?
library(raster)
x <- matrix(rpois(1000000, 2), 1000)
a <- raster(x)
a <- aggregate(a, fact=10, fun=max)

Performance suggestions in R

I have this piece of code:
library("GO.db")
lookParents <- function(x) {
  parents <- subset(yy[x][[1]], labels(yy[x][[1]])=="is_a")
  for (parent in parents) {
    m[index,1] <<- Term(x)
    m[index,2] <<- Term(parent)
    m[index,3] <<- -log2(go_freq[x,1]/go_freq_all)
    m[index,4] <<- log2(go1_freq2[x])
    m[index,5] <<- x
    m[index,6] <<- parent
    index <<- index + 1
  }
  if (is.null(parents)) {
    return(c())
  } else {
    return(parents)
  }
}
getTreeMap <- function(GOlist, xx, m) {
  print(paste("Input list has", length(GOlist), "terms", sep=" "))
  count <- 1
  for (go in GOlist) {
    parents <- lookParents(go)
    if (count %% 100 == 0) {
      print(count)
    }
    while (length(parents) != 0) {
      x <- parents[1]
      parents <- parents[-1]
      parents <- c(lookParents(x), parents)
    }
    count <- count + 1
  }
}
xx <- c(as.list(GOBPANCESTOR), as.list(GOCCANCESTOR), as.list(GOMFANCESTOR))
go1_freq2 <- table(as.character(unlist(xx[go1])))
xx <- c(as.list(GOBPPARENTS), as.list(GOCCPARENTS), as.list(GOMFPARENTS))
m <- as.data.frame(matrix(nrow=1,ncol=6))
m[1,] <- c("all", "null", 0, 0, "null","null")
##biological processes
index <- 2
getTreeMap(BP, xx, m)
but it is really slow. BP is simply a vector. Do you have any performance suggestions I could apply? I would like to make it run faster, but this is the best I can do at the moment.
I suggest the following improvements (a short sketch of the first two follows this list):
add your functions to RProfile.site and byte-compile them using compiler::cmpfun
use foreach with %dopar% instead of a normal for loop
always delete the variables you no longer need and then call the garbage collector with gc()
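A minimal sketch of the first two suggestions, assuming a doParallel backend; the toy function slowSum is illustrative and not taken from the question:
library(compiler)
library(foreach)
library(doParallel)
# 1. Byte-compile a hot function
slowSum <- function(x) { s <- 0; for (i in seq_len(x)) s <- s + i; s }
fastSum <- cmpfun(slowSum)
# 2. Run independent iterations in parallel with foreach/%dopar%
cl <- makeCluster(2)
registerDoParallel(cl)
res <- foreach(i = 1:8, .combine = c) %dopar% fastSum(i)
stopCluster(cl)
res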

looping through a matrix with a function

I'd like to perform this function on a matrix 100 times. How can I do this?
v = 1
m <- matrix(0,10,10)
rad <- function(x) {
  idx <- sample(length(x), size=1)
  flip = sample(0:1,1,rep=T)
  if(flip == 1) {
    x[idx] <- x[idx] + v
  } else if(flip == 0) {
    x[idx] <- x[idx] - v
    return(x)
  }
}
This is what I have so far, but it doesn't work.
for (i in 1:100) {
  rad(m)
}
I also tried this, which seemed to work, but it gave me an output of around 5226 rows for some reason. The output should just be a 10x10 matrix with values changed according to the conditions of the function.
reps <- unlist(lapply(seq_len(100), function(x) rad(m)))
OK, I think I've got it.
The return statement in your function is only inside one branch of an if statement, so it returns a matrix with a probability of ~50%, while in the other case it does not return anything. You should change the function to this:
rad <- function(x) {
  idx <- sample(length(x), size=1)
  flip = sample(0:1,1,rep=T)
  if(flip == 1) {
    x[idx] <- x[idx] + v
  } else if(flip == 0) {
    x[idx] <- x[idx] - v
  }
  return(x)
}
Then you can do:
for (i in 1:n) {
  m <- rad(m)
}
Note that this is semantically equivalent to:
for (i in 1:n) {
  tmp <- rad(m)  # return a modified version of m (m is not changed yet)
                 # and put it into tmp
  m <- tmp       # set m equal to tmp, so in the next iteration we will
                 # start from the modified m
}
When you run rad(m), it does not change m.
Why?
The function works on a local copy of the m matrix, and when the function ends, that copy disappears.
So you need to save what the function returns.
As @digEmAll writes, the right code is:
for (i in 1:100) {
  m <- rad(m)
}
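To illustrate the copy semantics described above (the function f and object m2 are illustrative, not from the question):
f  <- function(x) { x[1] <- 99; x }
m2 <- c(1, 2, 3)
f(m2)   # returns c(99, 2, 3)
m2      # still c(1, 2, 3): the caller's object is unchanged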
You don't need a loop here. The whole operation can be vectorized.
v <- 1
m <- matrix(0,10,10)
n <- 100 # number of random replacements
idx <- sample(length(m), n, replace = TRUE) # indices
flip <- sample(c(-1, 1), n, replace = TRUE) # subtract or add
newVal <- aggregate(v * flip ~ idx, FUN = sum) # calculate new values for indices
m[newVal[[1]]] <- m[newVal[[1]]] + newVal[[2]] # add new values
