How to optimize this function in R? - r

I have about 977 obs in top500Stocks which contains name of 977 stocks.
head(top500Stocks,10)
ï..Symbol
1 RELIANCE
2 TCS
3 HDFCBANK
4 INFY
5 HINDUNILVR
6 HDFC
7 ICICIBANK
8 KOTAKBANK
9 SBIN
10 BAJFINANCE
and I have Date, OHLC, Adj.Close, Volume and Ret for each stock from top500Stocks in stocksRetData.
head(stocksRetData[[1]],3)
Date Open High Low Close Adj.Close Volume Ret
1 20000103 28.18423 29.86935 28.18423 38.94457 29.86935 28802010 0.000
2 20000104 30.66445 32.26056 29.82188 42.06230 32.26056 61320457 0.080
3 20000105 30.45677 34.16522 30.45677 43.71014 33.52440 173426953 0.039
Now, for a given lookbackPeriod and holdPeriod, I am trying to run the function below, but it takes about a minute. How can I make it faster? Because I have to run it for many combinations of lookbackPeriod and holdPeriod, it will take forever to complete.
# Requires zoo (rollapply) and dplyr (lead, bind_rows)
library(zoo)
library(dplyr)

CalC.MOD_MScore.Ret.High <- function(lookbackPeriod, holdPeriod, fnoStocks,
                                     stocksRetData, totalTestPeriod) {
  # We go through each stock and calculate modified M-scores,
  # giving more importance to recent data
  WeeklyData <- list()
  wmean <- function(x, k) mean(seq(k)/k * x)
  for (i in 1:nrow(fnoStocks)) {
    out <- stocksRetData[[i]]
    out <- tail(out, totalTestPeriod)
    if (nrow(out) == totalTestPeriod) {
      tempDF <- transform(out, wtMean = rollapply(Ret, lookbackPeriod, wmean,
                                                  k = lookbackPeriod, align = "right",
                                                  fill = NA))
      tempDF <- transform(tempDF, ExitVal = rollapply(lead(High, holdPeriod),
                                                      holdPeriod, max,
                                                      align = "right",
                                                      fill = NA))
      tempDF$NWeekRet <- (tempDF$ExitVal - tempDF$Adj.Close) / tempDF$Adj.Close
      tempDF <- tempDF[!is.na(tempDF$wtMean), ]
      tempDF <- tempDF[!is.na(tempDF$ExitVal), ]
      tempDF$StockName <- fnoStocks[i, 1]
      tempDF$WeekNum <- lookbackPeriod:(nrow(tempDF) + lookbackPeriod - 1)
      WeeklyData[[i]] <- data.frame(
        StockName = tempDF$StockName,
        WeekNum = tempDF$WeekNum,
        M_Score = tempDF$wtMean,
        NWeekRet = tempDF$NWeekRet,
        stringsAsFactors = FALSE
      )
    }
  } # i ends here
  return(bind_rows(WeeklyData))
}
This takes more than a minute to complete.
a <- CalC.MOD_MScore.Ret.High(4,14,fnoStocks = top500Stocks, stocksRetData = stocksRetData, 2000)

First of all, I wouldn't recommend for-loops in R for this. I would rewrite your loop with an lapply, like
CalC.MOD_MScore.Ret.High <- function(lookbackPeriod, holdPeriod, fnoStocks,
                                     stocksRetData, totalTestPeriod) {
  # We go through each stock and calculate modified M-scores,
  # giving more importance to recent data
  wmean <- function(x, k) mean(seq(k)/k * x)
  WeeklyData <- lapply(1:nrow(fnoStocks), function(i) {
    out <- stocksRetData[[i]]
    out <- tail(out, totalTestPeriod)
    if (nrow(out) != totalTestPeriod) return(NULL)
    tempDF <- transform(out, wtMean = rollapply(Ret, lookbackPeriod, wmean,
                                                k = lookbackPeriod, align = "right",
                                                fill = NA))
    tempDF <- transform(tempDF, ExitVal = rollapply(lead(High, holdPeriod),
                                                    holdPeriod, max,
                                                    align = "right",
                                                    fill = NA))
    tempDF$NWeekRet <- (tempDF$ExitVal - tempDF$Adj.Close) / tempDF$Adj.Close
    tempDF <- tempDF[!is.na(tempDF$wtMean), ]
    tempDF <- tempDF[!is.na(tempDF$ExitVal), ]
    tempDF$StockName <- fnoStocks[i, 1]
    tempDF$WeekNum <- lookbackPeriod:(nrow(tempDF) + lookbackPeriod - 1)
    data.frame(
      StockName = tempDF$StockName,
      WeekNum = tempDF$WeekNum,
      M_Score = tempDF$wtMean,
      NWeekRet = tempDF$NWeekRet,
      stringsAsFactors = FALSE
    )
  })
  return(bind_rows(WeeklyData))
}
Using an lapply makes it easier to apply parallelization tools on top of it.
Have a look at the parallel package. With it, you can parallelize the work and make use of multiple cores on your machine. You first need to set up a cluster, which produces some overhead, but I think it will pay off in your case. Set up a cluster via cl <- parallel::makeCluster(parallel::detectCores()); detectCores() returns the number of available cores on your machine. Then you can change the lapply to
WeeklyData <- parallel::parLapply(cl = cl, 1:nrow(fnoStocks), function(i) {
...
})
After all your calculations have finished, call parallel::stopCluster(cl) to stop the cluster.
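For completeness, a minimal self-contained sketch of that workflow, shown here on a toy rolling weighted mean over stocksRetData rather than on the full function (the cluster setup, export and teardown are the parts that carry over):
library(parallel)

cl <- makeCluster(detectCores())

# Workers are fresh R sessions: export the data the worker function
# touches and load zoo on each worker so rollapply() is available.
clusterExport(cl, "stocksRetData")
clusterEvalQ(cl, library(zoo))

# Toy illustration: rolling weighted mean of Ret for every stock, in parallel
res <- parLapply(cl, seq_along(stocksRetData), function(i) {
  wmean <- function(x, k) mean(seq(k)/k * x)
  rollapply(stocksRetData[[i]]$Ret, 4, wmean, k = 4,
            align = "right", fill = NA)
})

stopCluster(cl)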

Related

Vectorized calculation of adjacency matrix

I have the following function:
CFC_GLM <- function(data, frequency_bins){
  adj_mat <- matrix(0, nrow = dim(data)[1], ncol = dim(data)[1])
  bf_filters <- list()
  combs <- combinations(length(frequency_bins), 2, repeats.allowed = T)
  all_adj_mat <- list()
  for (z in 1:length(frequency_bins)) {
    bf_filters[[z]] <- butter(3, c(frequency_bins[[z]][1]/1200,
                                   frequency_bins[[z]][2]/1200), type = "pass")
  }
  for (f in 1:nrow(combs)) {
    for (i in 1:dim(data)[1]) {
      for (j in 1:dim(data)[1]) {
        sensor_1 <- data[i,]
        sensor_2 <- data[j,]
        sensor_1_filt = filtfilt(bf_filters[[combs[f,1]]], sensor_1)
        sensor_2_filt = filtfilt(bf_filters[[combs[f,2]]], sensor_2)
        a_y <- abs(hilbert(sensor_2_filt, 1200))
        a_x <- abs(hilbert(sensor_1_filt, 1200))
        theta_x <- angle(hilbert(sensor_1_filt, 1200)) %% 2*pi
        a_x_norm <- (a_x - mean(a_x))/std(a_x)
        a_y_norm <- (a_y - mean(a_y))/std(a_y)
        theta_x_norm <- (theta_x - mean(theta_x))/std(theta_x)
        fit <- lm(a_y_norm ~ sin(theta_x_norm) + cos(theta_x_norm) + a_x_norm)
        summ <- summary(fit)
        r <- sqrt(summ$r.squared)
        adj_mat[i,j] <- r
      }
    }
    all_adj_mat[[f]] <- adj_mat
  }
  return(all_adj_mat)
}
Just to summarize, the function takes a matrix of signals (246 sensors by 2400 samples), performs some signal processing, and then fits a GLM between every possible pair of sensors. This process is repeated for 4 frequency bandwidths and their combinations (within- and cross-frequency coupling). Right now, this code seems terribly inefficient and takes a really long time to run. Is there a way to vectorize/parallelize this function? I have researched this question extensively and cannot seem to find an answer.
I am not sure whether to make some of the tasks within the function parallel or just make the whole function able to be called by parApply (vectorized). My intuition is the latter but I am not sure how to approach this. Any help is greatly appreciated.
Reproducible Example
test_data <- structure(c(-347627.104358097, 821947.421444641, 496824.676355433,
-178091.364312102, -358842.250713998, 234666.210462063,
-1274153.04141668,
1017066.42839987, -158388.137875357, 191691.279588641,
-16231.2106151229,
378249.600546794, 1080850.88212858, -688841.640871254,
-616713.991288002,
639401.465180969, -1625802.44142751, 472370.867686569,
-631863.239075449,
-598755.248911174, 276422.966753179, -44010.9403226763,
1569374.08537143,
-1138797.2585617, -824232.849278583, 955783.332556046,
-1943384.98409094,
-54443.829280377, -1040354.44654998, -1207674.05255178,
496481.331429747,
-417435.356472725, 1886817.1254085, -1477199.59091112,
-947353.716505171,
1116336.49812969, -2173805.84111182, -574875.152250742,
-1343996.2219146,
-1492260.06197604, 626856.67540728, -713761.48191904, 1987730.27341334,
-1673384.77863935, -968522.886481198, 1089458.71433614,
-2274932.19262517,
-1096749.79392427, -1520842.86946059, -1390794.61065106,
669864.477272507,
-906096.822125892, 1863506.59188299, -1720956.06310511,
-889359.420058576,
885300.628410276, -2224340.54992297, -1619386.88041896,
-1570131.07127786,
-934848.556063722, 644671.113108699, -973418.329437102,
1541962.53750178,
-1636863.31666018, -728992.972371437, 551297.997356909,
-2026413.5471505,
-2129730.49230266, -1511423.25789691, -236962.889589694,
580683.399845852,
-906261.700784793, 1080101.95011954, -1455931.89179814,
-518630.187846405,
158846.288141661, -1715610.22092989, -2601349.5081924,
-1380068.64260811,
541310.557194977, 509125.333244057, -711696.682554995,
551748.792106809,
-1222430.29467688, -293847.487823853, -215078.751157158,
-1354005.89576504,
-2997647.23289805, -1220136.14918605, 1231169.98678596,
455388.081391798,
-415489.975542684, 32724.7895795912, -980848.930757441,
-86618.5594163355,
-506333.915891838, -1022235.58829567, -3279232.01820961,
-1076344.95091665,
1696655.88400158), .Dim = c(10L, 10L))
frequency_bins <- list(band1 = c(2,4), band2 = c(4,12), band3 =
c(12,30), band4 = c(30,100))
system.time(test_result <- CFC_GLM(test_data, frequency_bins))
user system elapsed
1.839 0.009 1.849
I'm not sure how to include the result in a manageable way. Sorry for the naivety. This is only with 10 sensors by 10 samples, to have a manageable test set.
Right off the bat I would suggest predeclaring the length of your lists.
bf_filters <- rep(list(NA), length(frequency_bins))
all_adj_mat <- rep(list(NA), nrow(combs))
#this is your function to be applied
i_j_fun <- function(perms) {
  sensor_1_filt = filtfilt(bf_filters[[combs[f,1]]], data[perms[1],])
  sensor_2_filt = filtfilt(bf_filters[[combs[f,2]]], data[perms[2],])
  a_y <- abs(hilbert(sensor_2_filt, 1200))
  a_x <- abs(hilbert(sensor_1_filt, 1200))
  theta_x <- angle(hilbert(sensor_1_filt, 1200)) %% 2*pi
  a_x_norm <- (a_x - mean(a_x))/std(a_x)
  a_y_norm <- (a_y - mean(a_y))/std(a_y)
  theta_x_norm <- (theta_x - mean(theta_x))/std(theta_x)
  fit <- lm(a_y_norm ~ sin(theta_x_norm) + cos(theta_x_norm) + a_x_norm)
  summ <- summary(fit)
  r <- sqrt(summ$r.squared)
  return(r)
}
Your i and j for loops can be turned into a function and used with apply.
#perms acts like the for loop; allow repeats so the i == j diagonal is included
perms <- permutations(dim(data)[1], 2, seq_len(dim(data)[1]), repeats.allowed = TRUE)
for (f in 1:nrow(combs)) {
  all_adj_mat[[f]] <- matrix(apply(perms, 1, i_j_fun),
                             nrow = dim(data)[1], ncol = dim(data)[1], byrow = TRUE)
}
That should do it.
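If you then want to parallelize, the apply over perms is a natural target. Here is a hedged sketch using the parallel package, reusing the objects above; the exact packages that provide filtfilt(), hilbert(), angle() and std() depend on your setup and are assumed, not verified:
library(parallel)

cl <- makeCluster(detectCores() - 1)

# i_j_fun reads data, bf_filters, combs and f from the global environment,
# so export them to the workers along with the function itself
clusterExport(cl, c("data", "bf_filters", "combs", "i_j_fun"))
clusterEvalQ(cl, library(signal))  # plus whatever packages provide hilbert(), angle(), std()

for (f in 1:nrow(combs)) {
  clusterExport(cl, "f")  # refresh f on the workers each iteration
  all_adj_mat[[f]] <- matrix(parApply(cl, perms, 1, i_j_fun),
                             nrow = dim(data)[1], ncol = dim(data)[1],
                             byrow = TRUE)
}

stopCluster(cl)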

R - split large dataframe into list in parallel

I have a large transaction dataset (around 5 million rows), and I need to split all transactions by ID (around 1 million unique IDs). The expected result is a list per unique ID containing that ID's items.
I tried the most simple and direct way to split the transaction dataset (referring to Why is split inefficient on large data frames with many groups?), and I know that converting the data frame into a data.table might be more efficient.
Sample source df
set.seed(123)
n = 500000 #number of sample data (500k as trial)
x <- data.frame(ID = paste(LETTERS[1:8],sample(1:round(n/3), n, replace = TRUE),sep = ""),
Item= sample(c('apple','orange','lemon','tea','rice'), n, replace=TRUE)
)
Set the column types (ID as character, Item as factor)
x$ID <- as.character(x$ID)
x$Item <- as.factor(x$Item)
Convert df into dt, then split dt into lists
library(data.table)
x <- as.data.table(x)
system.time(
xx <- split(x$Item, x$ID)
)
Expected results in lists
head(xx, 2)
#$A100
#[1] tea orange
#Levels: apple lemon orange rice tea
#$A101
#[1] rice
#Levels: apple lemon orange rice tea
Problem: After running for 2 hours on my machine (4 cores, 16 GB RAM, Win10, R 3.4.3), it is still running and never completes. I checked the CPU usage while it was running; it only consumed 35-40% of the CPU.
My idea:
I'm wondering whether there is any way to fully utilize the computational power of my machine (run the split in parallel), using only detectCores() - 1 = 3 cores.
1st: Split the large transaction dataset by ID into 3 smaller partitions (smaller datasets).
2nd: Use a foreach loop to split the 3 partitions into lists in parallel, then append (row-bind) the lists from each iteration at the end.
Question: Is my idea practical? I did read about mclapply and its mc.cores argument, but it seems mc.cores = 1 is the only option on Windows, so it won't help in my case. Is there a better, more efficient way to split a large dataset? Any comment is welcome, thanks!
Surprisingly and interestingly, consider by (the object-oriented wrapper to tapply), which operates similarly to split on data frames, with the added feature of passing each split into a function call. The equivalent of split would be to return the argument unchanged, i.e. call identity.
by(x$Item, x$ID, function(x) x)
by(x$Item, x$ID, identity)
Do note, the return of by is a by class object which essentially is a list with additional attributes.
Using your random data frame example, base::split did not finish after 1 hour, but base::by finished well below 5 minutes on my machine with 64 GB RAM! I usually assumed by would carry more overhead, being a sibling of the apply family, but my opinion may soon change.
50K ROW EXAMPLE
set.seed(123)
n = 50000 #number of sample data (50k as trial)
x <- data.frame(ID = paste(LETTERS[1:8],sample(1:round(n/3), n, replace = TRUE),sep = ""),
Item= sample(c('apple','orange','lemon','tea','rice'), n, replace=TRUE)
)
system.time( xx <- split(x$Item, x$ID) )
# user system elapsed
# 20.09 0.00 20.09
system.time( xx2 <- by(x$Item, x$ID, identity) )
# user system elapsed
# 1.55 0.00 1.55
all.equal(unlist(xx), unlist(xx2))
# [1] TRUE
identical(unlist(xx), unlist(xx2))
# [1] TRUE
500K ROW EXAMPLE
set.seed(123)
n = 500000 #number of sample data (500k as trial)
x <- data.frame(ID = paste(LETTERS[1:8],sample(1:round(n/3), n, replace = TRUE),sep = ""),
Item= sample(c('apple','orange','lemon','tea','rice'), n, replace=TRUE)
)
system.time( xx <- split(x$Item, x$ID) )
# DID NOT FINISH AFTER 1 HOUR
system.time( xx2 <- by(x$Item, x$ID, identity) )
# user system elapsed
# 23.00 0.06 23.09
The source code reveals that split.data.frame does more of its work at the R level (rather than in compiled C or Fortran code), with a for loop across factor levels:
getAnywhere(split.data.frame)
function (x, f, drop = FALSE, sep = ".", lex.order = FALSE, ...)
{
if (!missing(...))
.NotYetUsed(deparse(...), error = FALSE)
if (is.list(f))
f <- interaction(f, drop = drop, sep = sep, lex.order = lex.order)
else if (!is.factor(f))
f <- as.factor(f)
else if (drop)
f <- factor(f)
storage.mode(f) <- "integer"
if (is.null(attr(x, "class")))
return(.Internal(split(x, f)))
lf <- levels(f)
y <- vector("list", length(lf))
names(y) <- lf
ind <- .Internal(split(seq_along(x), f))
for (k in lf) y[[k]] <- x[ind[[k]]]
y
}
Conversely, source code for by.data.frame reveals a call to tapply which itself is a wrapper to lapply:
getAnywhere(by.data.frame)
function (data, INDICES, FUN, ..., simplify = TRUE)
{
if (!is.list(INDICES)) {
IND <- vector("list", 1L)
IND[[1L]] <- INDICES
names(IND) <- deparse(substitute(INDICES))[1L]
}
else IND <- INDICES
FUNx <- function(x) FUN(data[x, , drop = FALSE], ...)
nd <- nrow(data)
structure(eval(substitute(tapply(seq_len(nd), IND, FUNx,
simplify = simplify)), data), call = match.call(), class = "by")
}
The factors seem to be the key here. I don't have 64 GB RAM 😆 but maybe you can try again with stringsAsFactors = F. My results for a smaller test are below, and it seems split is quite a bit faster when not using factors.
n <- 50000
x <- data.frame(ID = paste(LETTERS[1:8],sample(1:round(n/3), n, replace = TRUE),sep = ""),
Item= sample(c('apple','orange','lemon','tea','rice'), n, replace=TRUE),
stringsAsFactors = T
)
x2 <- data.frame(ID = paste(LETTERS[1:8],sample(1:round(n/3), n, replace = TRUE),sep = ""),
Item= sample(c('apple','orange','lemon','tea','rice'), n, replace=TRUE),
stringsAsFactors = F)
splitFactor <- function() split(x$Item, x$ID)
byFactor <- function() by(x$Item, x$ID, identity)
splitNotFactor <- function() split(x2$Item, x2$ID)
byNotFactor <- function() by(x2$Item, x2$ID, identity)
a <- microbenchmark::microbenchmark(splitFactor(),
byFactor(),
splitNotFactor(),
byNotFactor(),
times = 3
)
Unit: milliseconds
expr min lq mean median uq max neval cld
splitFactor() 51743.1633 51936.7261 52025.1205 52130.2889 52166.0990 52201.9091 3 d
byFactor() 1963.0673 1987.7360 2030.5779 2012.4048 2064.3332 2116.2616 3 b
splitNotFactor() 399.7618 401.6796 412.4632 403.5973 418.8139 434.0306 3 a
byNotFactor() 2410.3804 2518.3651 2578.3501 2626.3499 2662.3349 2698.3199 3 c
splitNotFactor() should also result in an object with much smaller memory footprint than the other functions.
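Since the question already mentions converting to a data.table, one more option worth sketching (my own assumption, not part of the answers above) is to do the grouping inside data.table itself rather than calling split at all:
library(data.table)

# x is the sample data frame from the question
dt <- as.data.table(x)

# One row per ID, with that ID's items collected into a list column
xx_dt <- dt[, .(Items = list(Item)), by = ID]

# e.g. the items of the first ID
xx_dt$Items[[1]]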

how to optimise my code to run on a Windows 2012 Server

My code is running very slowly on my laptop, and I have access to a Windows 2012 Server x64 with 256 GB RAM.
I have the server set up running R and have this code working, but after 48 hours it is only 25% done.
From what I have learnt, this is because it only uses one core.
Currently I'm exploring a foreach loop but getting nowhere slowly.
library("sp")
library("rgeos")
library("geosphere")
library("gdistance")
# Data
dna <- data.frame(cbind(rnorm(400) * 2 + 13, rnorm(400) + 48))
dna$ID <- seq.int(nrow(dna))
match <- data.frame(cbind(rnorm(4000) * 2 + 13, rnorm(4000) + 48))
match$ID <- seq.int(nrow(match))
##Set row id
RID2 <- 1
#create output table
tablelength <- print (nrow(dna))
match1 = data.frame(UPRN = rep(0, tablelength), Long = rep(0, tablelength),
                    Lats = rep(0, tablelength), MatchID = rep(0, tablelength),
                    Longm = rep(0, tablelength), Latsm = rep(0, tablelength),
                    distance = rep(0, tablelength))
#start loop
for (RID2 in dna[,3]) {
  #Set UPRN and Exchange Name
  Name <- paste(dna[RID2,3])
  set1 <- data.frame(dna[RID2,1:2])
  set2 <- data.frame(match[,1:2])
  set1sp <- SpatialPoints(set2)
  set2sp <- SpatialPoints(set1)
  set1$ID <- apply(gDistance(set1sp, set2sp, byid=TRUE), 1, which.min)
  ID <- paste(apply(gDistance(set1sp, set2sp, byid=TRUE), 1, which.min))
  #insert Row
  match1[RID2, ] = c(Name, set1[,1], set1[,2], paste(match[ID,3]), set2[ID,1], set2[ID,2],
                     distVincentyEllipsoid(c(set1[,1], set1[,2]), c(set2[ID,1], set2[ID,2]),
                                           a=6378137, b=6356752.3142, f=1/298.257223563))
  remove(set1, set2, set1sp, set2sp)
}
The output is what i am looking for but ideally with a sub 1 day runtime (currently at 8)
This works for me, and cuts calculation time (on your sample data) in half on my machine.
set.seed(123)
# Data
dna <- data.frame(cbind(rnorm(400) * 2 + 13, rnorm(400) + 48))
dna$ID <- seq.int(nrow(dna))
match <- data.frame(cbind(rnorm(4000) * 2 + 13, rnorm(4000) + 48))
match$ID <- seq.int(nrow(match))
###
library( sf )
library( data.table )
dna.sf <- st_as_sf( x = dna,
                    coords = c( "X1", "X2"),
                    crs = "+proj=longlat +datum=WGS84" )
match.sf <- st_as_sf( x = match,
                      coords = c( "X1", "X2"),
                      crs = "+proj=longlat +datum=WGS84" )
#create data.tables
setDT(dna)
setDT(match)
#add suffixes to identify columns later (after join)
setnames(dna, names(dna), paste0(names(dna),".dna"))
setnames(match, names(match), paste0(names(match),".match"))
#create distance matrix
m <- round( st_distance( dna.sf, match.sf ), digits = 0 )
colnames( m ) <- match.sf$ID
rownames( m ) <- dna.sf$ID
#get colname of min to nearest (remember, colname = match-ID ;-) )
dna$nearest <- apply( m, 1, which.min )
#get the min distance
dna$dist <- apply( m, 1, min )
#now left-join to get the coordinates of match, use data.table for speed
library( data.table )
result <- match[dna, on = c("ID.match==nearest") ]
The results seem to be the same as when using your 'old' method, but calculation time is roughly cut in half (7.5 -> 4 secs)
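As a side note (not part of the answer above), recent sf versions also offer st_nearest_feature(), which skips building the full distance matrix; a hedged sketch reusing dna.sf and match.sf from above:
# Index of the nearest match point for each dna point
nearest_idx <- sf::st_nearest_feature(dna.sf, match.sf)

# Distance to that nearest point only (element-wise, no full matrix)
dist_to_nearest <- sf::st_distance(dna.sf, match.sf[nearest_idx, ],
                                   by_element = TRUE)

dna$nearest <- nearest_idx
dna$dist <- as.numeric(dist_to_nearest)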
You can already get a good speed boost by simply optimizing the code and removing redundant parts. For example, this is more or less twice as fast on the test data, and is easily parallelizable.
library("sp")
library("rgeos")
library("geosphere")
library("gdistance")
# Data
dna <- data.frame(cbind(rnorm(400) * 2 + 13, rnorm(400) + 48))
dna$ID <- seq.int(nrow(dna))
match <- data.frame(cbind(rnorm(40000) * 2 + 13, rnorm(40000) + 48))
match$ID <- seq.int(nrow(match))
##Set row id
RID2 <- 1
#create output table
tablelength <- nrow(dna)
matchlist <- list()
set2 <- match[,1:2]
set1sp <- SpatialPoints(set2)
for (RID2 in dna[,3]) {
  set1 <- dna[RID2,1:2]
  set2sp <- SpatialPoints(set1)
  ID <- which.min(gDistance(set1sp, set2sp, byid=TRUE))
  #insert Row
  matchlist[[RID2]] = data.frame(UPRN = dna[RID2,3],
                                 Long = set1[,1],
                                 Lats = set1[,2],
                                 matchid = match[ID,3],
                                 Longm = set2[ID,1],
                                 Latsm = set2[ID,2],
                                 distance = distVincentyEllipsoid(set1, set2[ID,],
                                                                  a=6378137, b=6356752.3142,
                                                                  f=1/298.257223563))
}
match1 <- data.table::rbindlist(matchlist)
Thank you all for your input; I will be reading through the different styles to further my R learning.
I have used a solution posted in the Reddit thread I also made at the same time.
require(foreach)
require(doParallel)
cl <- makeCluster(4)
registerDoParallel(cl)
temp <- foreach(I = 1:nrow(dna), .combine = "c", .packages = c("rgeos","sp")) %dopar% {
  return(c(which.min(
    gDistance(
      SpatialPoints(data.frame(dna[I,1:2])),
      SpatialPoints(data.frame(match[,1:2])),
      byid = TRUE
    ))))
}
stopCluster(cl)  # release the workers when done
https://old.reddit.com/r/rstats/comments/aebamb/how_do_i_use_all_the_cores_on_a_server_to_match/
Again thank you for the help :-D

How to vectorize a function in R

I need some help vectorizing the following code because I believe it will become more efficient. However, I do not know how to begin... I created a loop that goes through z. z has 3 columns and 112847 rows, which might be one reason it takes a long time. The 3 columns contain numbers that are used in the MACD() function...
library(quantmod)
library(TTR)
# get stock data
getSymbols('LUNA')
#Choose the Adjusted Close of a Symbol
stock <- Ad(LUNA)
#Create matrix for returns only
y <- stock
#Create a "MATRIX" by choosing the Adjusted Close
Nudata3 <- stock
#Sharpe Ratio Matrix
SR1<- matrix(NA, nrow=1)
# I want to create a table with all possible combinations from the ranges below
i = c(2:50)
k = c(4:50)
j = c(2:50)
# stores possible combinations into z
z <- expand.grid(i,k,j)
colnames(z)<- c("one","two","three")
n = 1
stretches <- length(z[,1])
while (n < stretches) {
  # I am trying to go through all the values in "z"
  Nuw <- MACD((stock), nFast=z[n,1], nSlow=z[n,2], nSig=z[n,3], maType="EMA")
  colnames(Nuw) <- c("MACD","Signal") #change the col names to create signals
  x <- na.omit(merge((stock), Nuw))
  x$sig <- NA
  # Create trading signals
  sig1 <- Lag(ifelse((x$MACD <= x$Signal),-1, 0)) # short when MACD < SIGNAL
  sig2 <- Lag(ifelse((x$MACD >= x$Signal),1, 0)) # long when MACD > SIGNAL
  x$sig <- sig1 + sig2
  #calculate Returns
  ret <- na.omit(ROC(Ad(x))*x$sig)
  colnames(ret) <- c(paste(z[n,1],z[n,2],z[n,3],sep=","))
  x <- merge(ret,x)
  y <- merge(y,ret) #This creates a MATRIX with RETURNs ONLY
  Nudata3 <- merge(Nudata3, x)
  ((mean(ret)/sd(ret)) * sqrt(252)) -> ANNUAL # Creates a Ratio
  ANNUAL -> Shrat # stores Ratio into ShRat
  SR1 <- cbind(SR1,Shrat) # binds all ratios as it loops
  n <- (n+1)
}
I would like to know how to vectorize the MACD() function, to speed up the process since the length of stretches is approx. 112847. It takes my computer quite some time to go through the loop itself.
First and foremost - case specific optimization - remove the cases where nFast > nSlow as it doesn't make sense technically.
Secondly - you are creating objects and copying them over and over again. This is very expensive.
Thirdly - you can perhaps code this better by creating a matrix of signals in one loop and doing the rest of the operations in a vectorized manner.
I would code what you are doing something like this.
Please read the help pages of mapply, do.call, merge and sapply if anything is unclear.
require(quantmod)
getSymbols("LUNA")
#Choose the Adjusted Close of a Symbol
stock <- Ad(LUNA)
# I want to create a table with all possible combinations from the ranges below
i = c(2:50)
k = c(4:50)
j = c(2:50)
# stores possible combinations into z
z <- expand.grid(i,k,j)
IMO : This is where your first optimization should be. Remove cases where i > k
z <- z[z[,1]<z[,2], ]
It reduces the number of cases from 112847 to 57575
#Calculate only once. No need to calculate this in every iteration.
stockret <- ROC(stock)
getStratRet <- function(nFast, nSlow, nSig, stock, stockret) {
  x <- MACD((stock), nFast=nFast, nSlow=nSlow, nSig=nSig, maType="EMA")
  x <- na.omit(x)
  sig <- Lag(ifelse((x$macd <= x$signal),-1, 0)) + Lag(ifelse((x$macd >= x$signal),1, 0))
  return(na.omit(stockret * sig))
}
RETURNSLIST <- do.call(merge, mapply(FUN = getStratRet, nFast = z[,1], nSlow = z[,2], nSig = z[,3], MoreArgs = list(stock = stock, stockret = stockret), SIMPLIFY = TRUE))
getAnnualSharpe <- function(ret) {
  ret <- na.omit(ret)
  return((mean(ret)/sd(ret)) * sqrt(252))
}
SHARPELIST <- sapply(RETURNSLIST, FUN = getAnnualSharpe)
Results will be as below. Which column belongs to which combo of i, j, k is trivial.
head(RETURNSLIST[, 1:3])
## LUNA.Adjusted LUNA.Adjusted.1 LUNA.Adjusted.2
## 2007-01-10 0.012739026 -0.012739026 0
## 2007-01-11 -0.051959739 0.051959739 0
## 2007-01-12 -0.007968170 -0.007968170 0
## 2007-01-16 -0.007905180 -0.007905180 0
## 2007-01-17 -0.005235614 -0.005235614 0
## 2007-01-18 0.028315920 -0.028315920 0
SHARPELIST
## LUNA.Adjusted LUNA.Adjusted.1 LUNA.Adjusted.2 LUNA.Adjusted.3 LUNA.Adjusted.4 LUNA.Adjusted.5 LUNA.Adjusted.6
## 0.04939150 -0.07428392 NaN 0.02626382 -0.06789803 -0.22584987 -0.07305477
## LUNA.Adjusted.7 LUNA.Adjusted.8 LUNA.Adjusted.9
## -0.05831643 -0.08864845 -0.08221986
system.time(
+ RETURNSLIST <- do.call(merge, mapply(FUN = getStratRet, nFast = z[1:100,1], nSlow = z[1:100,2], nSig = z[1:100,3], MoreArgs = list(stock = stock, stockret = stockret), SIMPLIFY = TRUE)),
+ SHARPELIST <- sapply(RETURNSLIST, FUN = getAnnualSharpe)
+ )
user system elapsed
2.28 0.00 2.29
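If the remaining ~57k combinations are still too slow, the per-combination work above is also easy to parallelize. A hedged sketch with a PSOCK cluster (works on Windows too), reusing z, stock, stockret, getStratRet and getAnnualSharpe from above:
library(parallel)

cl <- makeCluster(detectCores() - 1)
clusterEvalQ(cl, library(quantmod))  # quantmod pulls in TTR/xts for MACD, Lag, ROC
clusterExport(cl, c("getStratRet", "stock", "stockret", "z"))

# One strategy-return series per parameter combination, computed in parallel
ret_list <- parLapply(cl, seq_len(nrow(z)), function(r) {
  getStratRet(z[r, 1], z[r, 2], z[r, 3], stock, stockret)
})
stopCluster(cl)

RETURNSLIST <- do.call(merge, ret_list)
SHARPELIST <- sapply(RETURNSLIST, FUN = getAnnualSharpe)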

R: Row resampling loop speed improvement

I'm subsampling rows from a data frame with columns c("x","y","density") over a range of c("s_size","reps"), where reps = number of replicates and s_size = number of rows subsampled from the whole data frame.
> head(data_xyz)
x y density
1 6 1 0
2 7 1 17600
3 8 1 11200
4 12 1 14400
5 13 1 0
6 14 1 8000
#Subsampling###################
subsample_loop <- function(s_size, reps, int) {
  tm1 <- system.time( #start timer
    {
      subsample_bound = data.frame()
      #Perform Subsampling of the general
      for (s_size in seq(1, s_size, int)) {
        for (reps in 1:reps) {
          subsample <- sample.df.rows(s_size, data_xyz)
          assign(paste("sample", "_", "n", s_size, "_", "r", reps, sep=""), subsample)
          subsample_replicate <- subsample[,] #temporary variable
          subsample_replicate <- cbind(subsample, rep(s_size, (length(subsample_replicate[,1]))),
                                       rep(reps, (length(subsample_replicate[,1]))))
          subsample_bound <- rbind(subsample_bound, subsample_replicate)
        }
      }
    }) #end timer
  colnames(subsample_bound) <- c("x","y","density","s_size","reps")
  subsample_bound
} #end function
Here's the function call:
source("R/functions.R")
subsample_data <- subsample_loop(s_size=206, reps=5, int=10)
Here's the row subsample function:
# Samples a number of rows in a dataframe, outputs a dataframe of the same # of columns
# df Data Frame
# N number of samples to be taken
sample.df.rows <- function (N, df, ...)
{
df[sample(nrow(df), N, replace=FALSE,...), ]
}
It's way too slow, I've tried a few times with apply functions and had no luck. I'll be doing somewhere around 1,000-10,000 replicates for each s_size from 1:250.
Let me know what you think! Thanks in advance.
=========================================================================
UPDATE EDIT: Sample data from which to sample:
https://www.dropbox.com/s/47mpo36xh7lck0t/density.csv
Joran's code in a function (in a sourced function.R file):
foo <- function(i, j, data){
  res <- data[sample(nrow(data), i, replace = FALSE),]
  res$s_size <- i
  res$reps <- rep(j, i)
  res
}
resampling_custom <- function(dat, s_size, int, reps) {
  ss <- rep(seq(1, s_size, by = int), each = reps)
  id <- rep(seq_len(reps), times = s_size/int)
  out <- do.call(rbind, mapply(foo, i = ss, j = id, MoreArgs = list(data = dat), SIMPLIFY = FALSE))
}
Calling the function
set.seed(2)
out <- resampling_custom(dat=retinal_xyz, s_size=206, int=5, reps=10)
outputs data, unfortunately with this warning message:
Warning message:
In mapply(foo, i = ss, j = id, MoreArgs = list(data = dat), SIMPLIFY = FALSE) :
longer argument not a multiple of length of shorter
I put very little thought into actually optimizing this; I was just concentrating on doing something that's at least reasonable while matching your procedure.
Your big problem is that you are growing objects via rbind and cbind. Basically, any time you see someone write data.frame() or c() and expand that object using rbind, cbind or c, you can be very sure that the resulting code will essentially be the slowest possible way of doing whatever task is being attempted.
s_size <- 200
int <- 10
reps <- 30
ss <- rep(seq(1,s_size,by = int),each = reps)
id <- rep(seq_len(reps),times = s_size/int)
foo <- function(i, j, data){
  res <- data[sample(nrow(data), i, replace = FALSE),]
  res$s_size <- i
  res$reps <- rep(j, i)
  res
}
out <- do.call(rbind,mapply(foo,i = ss,j = id,MoreArgs = list(data = dat),SIMPLIFY = FALSE))
The best part about R is that not only is this way, way faster, it's also way less code.
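Regarding the warning in the update above: it most likely comes from ss and id having different lengths when s_size is not a multiple of int (seq(1, 206, by = 5) has 42 elements, while 206/5 is not a whole number, so id ends up shorter than ss). A hedged fix, reusing foo from above, is to derive both vectors from the same sequence:
resampling_custom <- function(dat, s_size, int, reps) {
  sizes <- seq(1, s_size, by = int)
  ss <- rep(sizes, each = reps)                    # one entry per replicate
  id <- rep(seq_len(reps), times = length(sizes))  # same length as ss by construction
  do.call(rbind, mapply(foo, i = ss, j = id,
                        MoreArgs = list(data = dat), SIMPLIFY = FALSE))
}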
