I need some help vectorizing the following code, because I believe it would make it more efficient. However, I do not know how to begin. I created a loop that goes through z, which has 3 columns and 112847 rows; that size might be one reason it takes so long. The 3 columns contain the parameters that are passed to the MACD() function.
library(quantmod)
library(TTR)
# get stock data
getSymbols('LUNA')
#Choose the Adjusted Close of a Symbol
stock <- Ad(LUNA)
#Create matrix for returns only
y <- stock
#Create a "MATRIX" by choosing the Adjusted Close
Nudata3 <- stock
#Sharpe Ratio Matrix
SR1<- matrix(NA, nrow=1)
# I want to create a table with all possible combinations from the ranges below
i = c(2:50)
k = c(4:50)
j = c(2:50)
# stores possible combinations into z
z <- expand.grid(i,k,j)
colnames(z)<- c("one","two","three")
n = 1
stretches <- length(z[,1])
while (n < stretches){
# I am trying to go through all the values in "z"
Nuw <- MACD((stock), nFast=z[n,1], nSlow=z[n,2], nSig=z[n,3], maType="EMA")
colnames(Nuw) <- c("MACD","Signal") #change the col names to create signals
x <- na.omit(merge((stock), Nuw))
x$sig <- NA
# Create trading signals
sig1 <- Lag(ifelse((x$MACD <= x$Signal),-1, 0)) # short when MACD < SIGNAL
sig2 <- Lag(ifelse((x$MACD >= x$Signal),1, 0)) # long when MACD > SIGNAL
x$sig <- sig1 + sig2
#calculate Returns
ret <- na.omit(ROC(Ad(x))*x$sig)
colnames(ret)<- c(paste(z[n,1],z[n,2],z[n,3],sep=","))
x <- merge(ret,x)
y <- merge(y,ret) #This creates a MATRIX with RETURNs ONLY
Nudata3 <- merge(Nudata3, x)
ANNUAL <- (mean(ret)/sd(ret)) * sqrt(252) # annualized Sharpe ratio
Shrat <- ANNUAL # store the ratio in Shrat
SR1 <- cbind(SR1, Shrat) # bind all ratios as the loop runs
n <- (n+1)
}
I would like to know how to vectorize the MACD() call to speed up the process, since stretches is approximately 112847 and it takes my computer quite some time to get through the loop.
First and foremost, a case-specific optimization: remove the cases where nFast >= nSlow, since they make no sense technically.
Secondly, you are creating and copying objects over and over again, which is very expensive.
Thirdly, you can code this better by building the matrix of signals in one pass and doing the rest of the operations in a vectorized manner.
I would code what you are doing something like this.
Please read the help pages of mapply, do.call, merge and sapply if anything is unclear.
require(quantmod)
getSymbols("LUNA")
#Choose the Adjusted Close of a Symbol
stock <- Ad(LUNA)
# I want to create a table with all possible combinations from the ranges below
i = c(2:50)
k = c(4:50)
j = c(2:50)
# stores possible combinations into z
z <- expand.grid(i,k,j)
IMO, this is where your first optimization should be: remove the cases where i >= k.
z <- z[z[,1]<z[,2], ]
It reduces the number of cases from 112847 to 57575
#Calculate only once. No need to calculate this in every iteration.
stockret <- ROC(stock)
getStratRet <- function(nFast, nSlow, nSig, stock, stockret) {
x <- MACD((stock), nFast=nFast, nSlow=nSlow, nSig=nSig, maType="EMA")
x <- na.omit(x)
sig <- Lag(ifelse((x$macd <= x$signal),-1, 0)) + Lag(ifelse((x$macd >= x$signal),1, 0))
return(na.omit(stockret * sig))
}
RETURNSLIST <- do.call(merge, mapply(FUN = getStratRet, nFast = z[,1], nSlow = z[,2], nSig = z[,3], MoreArgs = list(stock = stock, stockret = stockret), SIMPLIFY = TRUE))
getAnnualSharpe <- function(ret) {
ret <- na.omit(ret)
return ((mean(ret)/sd(ret)) * sqrt(252))
}
SHARPELIST <- sapply(RETURNSLIST, FUN = getAnnualSharpe)
The results will look like this. Which column belongs to which combination of i, k, j follows from the row order of z (see the labelling sketch below).
head(RETURNSLIST[, 1:3])
## LUNA.Adjusted LUNA.Adjusted.1 LUNA.Adjusted.2
## 2007-01-10 0.012739026 -0.012739026 0
## 2007-01-11 -0.051959739 0.051959739 0
## 2007-01-12 -0.007968170 -0.007968170 0
## 2007-01-16 -0.007905180 -0.007905180 0
## 2007-01-17 -0.005235614 -0.005235614 0
## 2007-01-18 0.028315920 -0.028315920 0
SHARPELIST
## LUNA.Adjusted LUNA.Adjusted.1 LUNA.Adjusted.2 LUNA.Adjusted.3 LUNA.Adjusted.4 LUNA.Adjusted.5 LUNA.Adjusted.6
## 0.04939150 -0.07428392 NaN 0.02626382 -0.06789803 -0.22584987 -0.07305477
## LUNA.Adjusted.7 LUNA.Adjusted.8 LUNA.Adjusted.9
## -0.05831643 -0.08864845 -0.08221986
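If you want to trace a column back to its parameter combination, one possible approach (my own sketch, not part of the original answer) is to label the columns and Sharpe ratios from the rows of z, which are in the same order as the merged columns:
# hypothetical labelling step: one label per surviving row of z
combo_labels <- apply(z, 1, paste, collapse = ",")
colnames(RETURNSLIST) <- combo_labels
names(SHARPELIST) <- combo_labels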
system.time({
  RETURNSLIST <- do.call(merge, mapply(FUN = getStratRet, nFast = z[1:100,1], nSlow = z[1:100,2], nSig = z[1:100,3], MoreArgs = list(stock = stock, stockret = stockret), SIMPLIFY = TRUE))
  SHARPELIST <- sapply(RETURNSLIST, FUN = getAnnualSharpe)
})
   user  system elapsed
   2.28    0.00    2.29
I'm trying to filter GPS location data based on distance (UTM coordinates) and time (H:M:S) criteria, both independently and in combination. Here's the data structure:
head(collar)
FID animal date time zone easting northing
1 URAM01_2012 6/24/2012 10:00:00 AM 13S 356664 3971340
2 URAM01_2012 6/24/2012 1:02:00 PM 13S 356760 3971480
3 URAM01_2012 6/24/2012 4:01:00 PM 13S 357482 3972325
4 URAM01_2012 6/24/2012 7:01:00 PM 13S 356882 3971327
5 URAM01_2012 6/25/2012 4:01:00 AM 13S 356574 3971765
6 URAM01_2012 6/25/2012 7:01:00 AM 13S 357796 3972231
Right now I'm filtering by distance only, but I'm having some issues. The code should calculate the distance between FID[1] and FID[2] and assign that distance to FID[1] in a new column ($step.length). After all distances have been calculated, the data is subsetted based on a distance rule; right now it keeps all locations that are >200 m from the next location. Once subsetted, the process is repeated until the distance between all consecutive locations is >200 m. Here's the code I've written, which accomplishes only a portion of what I'd like to do:
reps <- 10
#Begin loop for the number of reps. Right now it's at 10 just to see if the code works.
for(rep in 1:reps){
#Begin loop for the number of GPS locations in the file
for(i in 1:length(collar$FID)){
#Calculate the distance between a GPS location and the next GPS locations. the formula is the hypotenuse of the Pythagorean theorem.
collar$step.length[i] <- sqrt(((collar$easting[i] - collar$easting[i+1])^2) + ((collar$northing[i] - collar$northing[i+1])^2))
}
#Subset the data. Select all locations that are >200m from the next GPS location.
collar <- subset(collar, step.length >200)
}
Now, the code isn't perfect and I would like to add two conditions:
1.) Animal ID isn't considered, so a distance for the last location of one animal is calculated using the first location of the next animal, when that distance should be NA. I thought using for(i in 1:unique(collar$animal)) might work, but it didn't (shocking), and I'm not sure what to do, since for(i in length(collar$animal)) doesn't use only unique values.
2.) I'd also like to break out of the for loop once all consecutive locations are >200m apart. I'm sure there has to be a better way of doing this, but I thought I'd set reps to something large (e.g., 10000) and break once the criterion is met:
if(collar$step.length > 200){
break }
Yet, since the if condition has length > 1, only its first element is used. I haven't thought about time or distance/time yet, but if anyone has suggestions for those, I'd appreciate the advice. Thanks for your help and guidance.
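For what it's worth, a minimal sketch of that break condition (assuming step.length already exists at that point in the loop): if() only looks at the first element of a logical vector, so collapse the per-row comparison with all() first.
# all() turns the per-row comparison into a single TRUE/FALSE for if()
if (all(collar$step.length > 200, na.rm = TRUE)) {
  break
}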
I don't quite understand what you are trying to do with the reps, but you can take advantage of the split and unsplit functions to work on each individual animal.
First, I created a distance() function that uses the easting and northing columns of the object to build a vector of distances. Then we split collar by animal and apply the distance function to each animal. We attach this list of distances to the list of animals with some mapply code and then unsplit the results to put everything back together.
Let me know what you want to do with the ">200" step.
distance <- function(x){
easting <- x$easting
northing <- x$northing
easting2 <- c(easting[-1], NA)
northing2 <- c(northing[-1], NA)
sqrt((easting - easting2)^2 + (northing - northing2)^2)
}
s <- split(collar, collar$animal)
distances <- lapply(s, distance)
s2 <- mapply(cbind, s, "Distance" = distances, SIMPLIFY = F)
collar.new <- unsplit(s2, collar$animal)
EDIT:
Apologies if this is cumbersome; I'm sure it can be made shorter, but for now let me know if it works for you. I would also be curious how fast it runs, since I have been making up my own data.
filterout <- function(input, value = NULL){
# requirements of the input object
stopifnot(all(c("FID","animal","easting","northing") %in% colnames(input)))
distance <- function(x){ # internal distance function
e1 <- x$easting; e2 <- c(NA, e1[-nrow(x)])
n1 <- x$northing; n2 <- c(NA, n1[-nrow(x)])
sqrt((e1 - e2)^2 + (n1 - n2)^2)
}
nc <- ncol(input) # save so we can "rewrite" Distance values each reiteration
f <- function(input){ # the recursive function (will run until condition is met)
z <- split(input[,-(nc+1)], input$animal) # split by animal & remove (if any) prior Distance column
distances <- lapply(z, distance) # collect distances
z2 <- mapply(cbind, z, "Distance" = distances, SIMPLIFY = F) # attach distances
r1 <- lapply(z2, function(x) { # delete first row under criteria
a <- x$Distance < value # CRITERIA
a[is.na(a)] <- FALSE # Corrects NA values into FALSE so we don't lose them
first <- which(a == T)[1] # we want to remove one at a time
`if`(is.na(first), integer(0), x$FID[first]) # returns FIDs to remove
})
z3 <- unsplit(z2, input$animal)
# Whether to keep going or not
if(length(unlist(r1)) != 0){ # if list of rows under criteria is not empty
remove <- which(z3$FID %in% unlist(r1, use.names = F)) # remove them
print(unlist(r1, use.names = F)) # OPTIONAL*** printing removed FIDs
f(z3[-remove,]) # and run again
} else {
return(z3) # otherwise return the final list
}
}
f(input)
}
And the function can be used as follows:
filterout(input = collar, value = 200)
filterout(input = collar, value = 400)
filterout(input = collar, value = 600)
EDIT2:
I opened a bounty question to figure out how to do one particular step, but hopefully this answer helps. It might take around a minute to process 37k rows, but let me know.
x <- collar
skipdistance <- function(x, value = 200){
d <- as.matrix(dist(x[,c("easting","northing")]))
d[lower.tri(d)] <- 0
pick <- which(d > value, arr.ind = T) # pick[order(pick[,"row"]),] # visual clarity
findConnectionsBase <- function(m) {
n <- nrow(m)
myConnections <- matrix(integer(0), nrow = n, ncol = 2)
i <- j <- 1L
k <- 2L
while (i <= n) {
myConnections[j, ] <- m[i, ]
while (k <= n && m[i, 2] != m[k, 1]) {k <- k + 1L}
i <- k
j <- j + 1L
}
myConnections[!is.na(myConnections[,1]), ]
}
keep.ind <- findConnectionsBase(pick)
keep.row <- unique(c(keep.ind))
cbind(x[keep.row,], Distance = c(NA,d[keep.ind]))
}
a <- do.call(rbind,lapply(split(x, x$animal), skipdistance, value = 200))
dim(a)
Edit #3:
library(lubridate) # great package for string -> dates
# changed to give just rows that satisfy greater than value criteria
skip <- function(dist.var, value = 200){
d <- as.matrix(dist(dist.var))
d[lower.tri(d)] <- 0
pick <- which(d > value, arr.ind = T) # pick[order(pick[,"row"]),] # visual clarity
findConnectionsBase <- function(m) {
n <- nrow(m)
myConnections <- matrix(integer(0), nrow = n, ncol = 2)
i <- j <- 1L
k <- 2L
while (i <= n) {
myConnections[j, ] <- m[i, ]
while (k <= n && m[i, 2] != m[k, 1]) {k <- k + 1L}
i <- k
j <- j + 1L
}
myConnections[!is.na(myConnections[,1]), ]
}
unique(c(findConnectionsBase(pick)))
}
collar <- structure(list(FID = 1:8, animal = c("URAM01_2012", "URAM01_2012", "URAM01_2012", "URAM01_2012", "URAM01_2013", "URAM01_2013", "URAM01_2013", "URAM01_2013"), date = c("6/24/2012", "6/24/2012", "6/24/2012", "6/24/2012", "6/25/2012", "6/25/2012", "6/25/2012", "6/25/2012" ), time = c("10:00:00AM", "1:02:00PM", "4:01:00PM", "7:01:00PM", "4:01:00AM", "7:01:00AM", "7:01:00AM", "7:01:00AM"), zone = c("13S", "13S", "13S", "13S", "13S", "13S", "13S", "13S"), easting = c(356664L,
356760L, 356762L, 356882L, 356574L, 357796L, 357720L, 357300L), northing = c(3971340L, 3971480L, 3971498L, 3971498L, 3971765L, 3972231L, 3972230L, 3972531L)), .Names = c("FID", "animal", "date", "time", "zone", "easting", "northing"), class = "data.frame", row.names = c(NA, -8L))
collar[skip(dist.var = collar[,c("easting","northing")],
value = 200),]
# dist function works on dates, but it makes sense to convert to hours
dist(lubridate::mdy_hms(paste(collar$date, collar$time)))
hours <- 2.99
collar[ skip(dist.var = lubridate::mdy_hms(paste(collar$date, collar$time)),
value = hours * 3600), ]
Big thanks and a shout-out to Evan for all of his hard work. Obviously, the code he wrote is a bit different from what I proposed, but that's the great thing about this community: people share solutions we might not think of ourselves. See Edit #2 for the final code, which filters GPS collar data by the distance between consecutive points.
I am working on calculating a new raster (the output raster) based on 2 input rasters and a 'stratum' raster. The stratum raster values (1 to 4) refer to the rows in the bias and weight data frames. Stratum value 4 was used to fill any NA in the stratum raster, otherwise the function would crash. The following input is required:
# load library
library(raster)
# reproducing the bias and weight data.frames
bias <- data.frame(
ras_1 = c(56,-7,-30,0),
ras_2 = c(29,18,-52,0),
ras_3 = c(44,4,-15,0)
)
rownames(bias) <- c("Strat 1","Strat 2","Strat 3","Strat 4")
weight <- data.frame(
ras_1 = c(0.56,0.66,0.23,0.33),
ras_2 = c(0.03,0.18,0.5,0.33),
ras_3 = c(0.41,0.16,0.22,0.34)
)
rownames(weight) <- c("Strat 1","Strat 2","Strat 3","Strat 4")
The following function (fusion) adds a 'bias' value to the input rasters. After the bias has been added, the two corrected input raster cell values are multiplied by a weight value depending on which stratum they belong to.
The two weighted values are then summed and returned using 'calc'.
## Create raster data for input
# create 2 rasters
r1 <- raster(ncol=10,nrow=10)
r2 <- raster(ncol=10,nrow=10)
r1[] <- sample(seq(from = 1, to = 500, by = 1), size = 100, replace = TRUE)
r2[] <- sample(seq(from = 1, to = 500, by = 1), size = 100, replace = TRUE)
r2[1:2] <- NA # include NA in input maps for example purpose
# Create strata raster (4 strata's)
r3 <- raster(ncol=10,nrow=10)
r3[] <- sample(seq(from = 1, to = 4, by = 1), size = 100, replace = TRUE)
Strata.n <- 4 # number of strata values in this example
fusion <- function(x) {
result <- matrix(NA, dim(x)[1], 1)
for (n in 1:Strata.n) {
ok <- !is.na(x[,3]) & x[,3] == n
a <- x[ok,1] + bias[n,1] # add bias to first input raster value
b <- x[ok,2] + bias[n,2] # add bias to second input raster value
result[ok] <- a * weight[n,1] + b * weight[n,2] # Multiply values by weight
}
return(result)
}
s <- stack(r1,r2,r3)
Fused.map <- calc(s, fun = fusion, progress = 'text')
The problem with the above function is that:
It is only suited for 2 rasters
If one raster has NA, then the result will be NA for that cell
is.na(Fused.map@data@values) # check for NA in the fused map
What I would like to have is:
A function that takes any number of input rasters
It can work with NA values (ignores NA values in the rasters)
Re-adjusts the 'weight' if a raster has a NA value, so that the remaining weight values add up to 1
EDIT
The following function does what I need, but it is significantly slower than the function above on large rasters: fusion finishes in about 10 seconds, while the fusion2 function below needs roughly 8 hours on large rasters.
fusion2 <- function(x) {
m <- matrix(x, nrow= 1, ncol=3) # Create matrix per stack of cells
n <- m[,3] # get the stratum
g <- m[1:(Strata.n-1)] + as.matrix(bias[n,]) # add bias to raster values
g[g < 0] <- 0 # set values below 0 to 0
w <- weight[n,1:(Strata.n-1)] # get correct strata weight values
w[is.na(g)]<- NA # set weight to NA if (g) raster values are NA
p <- sum(w, na.rm = T) # calculate sum of weight values
pp <- w/p # divide weight values by sum to get the proportion to == 1
pp <- as.numeric(pp)
result <- as.integer(round(sum(pp*g, na.rm = T))) # return raster value
return(result)
}
Fused.map <- calc(s, fun = fusion2, progress = 'text')
Is there any way to optimize the fusion2 function so that it performs similarly to fusion?
> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Thank you for your time!
There seem to be a lot of unnecessary format conversions going on, and using the simplest data structures available is fastest. The argument that calc passes to your function is a numeric vector, so you can use numeric vectors everywhere. Also, rounding and then casting to integer is redundant.
fusion3 <- function(x) {
n <- x[3] # get the stratum
g <- x[1:(Strata.n-1)] + as.numeric(bias[n,]) # add bias to raster values
g[g < 0] <- 0 # set values below 0 to 0
w <- as.numeric(weight[n,1:(Strata.n-1)]) # get correct strata weight values
w[is.na(g)]<- NA # set weight to NA if (g) raster values are NA
p <- sum(w, na.rm = T) # calculate sum of weight values
pp <- w/p # divide weight values by sum to get the proportion to == 1
result <- as.integer(sum(pp*g, na.rm = T)+0.5) # return raster value
return(result)
}
On a 100x100 raster, your original functions take:
system.time(Fused.map <- calc(s, fun = fusion, progress = 'text'))
user system elapsed
0.015 0.000 0.015
system.time(Fused.map <- calc(s, fun = fusion2, progress = 'text'))
user system elapsed
8.270 0.078 8.312
The modified function is already about four times faster:
system.time(Fused.map3 <- calc(s, fun = fusion3, progress = 'text'))
user system elapsed
1.970 0.026 1.987
Next, precompute matrices from the data frames so you don't need to do that for each pixel:
bias_matrix = as.matrix(bias)
weight_matrix = as.matrix(weight)
fusion3 <- function(x) {
n <- x[3] # get the stratum
g <- x[1:(Strata.n-1)] + bias_matrix[n,] # add bias to raster values
g[g < 0] <- 0 # set values below 0 to 0
w <- weight_matrix[n,1:(Strata.n-1)] # get correct strata weight values
w[is.na(g)]<- NA # set weight to NA if (g) raster values are NA
p <- sum(w, na.rm = T) # calculate sum of weight values
pp <- w/p # divide weight values by sum to get the proportion to == 1
result <- as.integer(sum(pp*g, na.rm = T)+0.5) # return raster value
return(result)
}
We get:
system.time(Fused.map3 <- calc(s, fun = fusion3, progress = 'text'))
user system elapsed
0.312 0.008 0.318
And finally, also precompute 1:(Strata.n-1):
bias_matrix = as.matrix(bias)
weight_matrix = as.matrix(weight)
Strata.minus1 = 1:(Strata.n-1)
fusion3 <- function(x) {
n <- x[3] # get the stratum
g <- x[Strata.minus1] + bias_matrix[n,] # add bias to raster values
g[g < 0] <- 0 # set values below 0 to 0
w <- weight_matrix[n,Strata.minus1] # get correct strata weight values
w[is.na(g)]<- NA # set weight to NA if (g) raster values are NA
p <- sum(w, na.rm = T) # calculate sum of weight values
pp <- w/p # divide weight values by sum to get the proportion to == 1
result <- as.integer(sum(pp*g, na.rm = T)+0.5) # return raster value
return(result)
}
We get:
system.time(Fused.map3 <- calc(s, fun = fusion3, progress = 'text'))
user system elapsed
0.252 0.011 0.262
That's not quite 0.015 yet, but bear in mind that your original function does not output integers, does not set values below 0 to 0, does not make the proportions sum to 1, and, as you mentioned, does not deal with NAs.
Mind you, this function still works with only two rasters, because you hardcode the stratum as layer 3. You should instead use raster::overlay with two arguments, the stratum raster and the layers themselves (or use calc with the stratum raster as layer 1, but that's not what calc is designed for).
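As a rough sketch of that suggestion (my illustration, not tested code from this answer; it assumes bias_matrix and weight_matrix from above and still handles just two value rasters), the stratum raster can be passed to overlay explicitly instead of being hardcoded as layer 3:
# hypothetical sketch: fun must be vectorized over cell values, as overlay() expects
fusion_overlay <- function(v1, v2, strat) {
  g1 <- pmax(v1 + bias_matrix[strat, 1], 0)  # bias-correct raster 1, floor at 0
  g2 <- pmax(v2 + bias_matrix[strat, 2], 0)  # bias-correct raster 2, floor at 0
  w1 <- ifelse(is.na(g1), NA, weight_matrix[strat, 1])  # drop weights where the value is NA
  w2 <- ifelse(is.na(g2), NA, weight_matrix[strat, 2])
  tot <- rowSums(cbind(w1, w2), na.rm = TRUE)           # rescale remaining weights to sum to 1
  rowSums(cbind(w1 * g1, w2 * g2), na.rm = TRUE) / tot
}
Fused.overlay <- overlay(r1, r2, r3, fun = fusion_overlay)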
I am currently working my way through the book 'R for Data Science'.
I am trying to solve this exercise question (21.2.1 Q1.4) but have not been able to determine the correct output before starting the for loop.
Write a for loop to:
Generate 10 random normals for each of μ= −10, 0, 10 and 100.
Like the previous questions in the book, I have been trying to store the results in a vector called output, but for this example it appears I need the output to be a data frame?
This is my code so far:
values <- c(-10,0,10,100)
output <- vector("double", 10)
for (i in seq_along(values)) {
output[[i]] <- rnorm(10, mean = values[[i]])
}
I know the output is wrong but am unsure how to create the format I need here. Any help much appreciated. Thanks!
There are many ways of doing this. Here is one. See inline comments.
set.seed(357) # to make things reproducible, set random seed
N <- 10 # number of loops
xy <- vector("list", N) # create an empty list into which values are to be filled
# run the loop N times and on each loop...
for (i in 1:N) {
# generate a data.frame with 4 columns, and add a random number into each one
# random number depends on the mean specified
xy[[i]] <- data.frame(um10 = rnorm(1, mean = -10),
u0 = rnorm(1, mean = 0),
u10 = rnorm(1, mean = 10),
u100 = rnorm(1, mean = 100))
}
# result is a list of data.frames with 1 row and 4 columns
# you can bind them together into one data.frame using do.call
# rbind means they will be merged row-wise
xy <- do.call(rbind, xy)
um10 u0 u10 u100
1 -11.241117 -0.5832050 10.394747 101.50421
2 -9.233200 0.3174604 9.900024 100.22703
3 -10.469015 0.4765213 9.088352 99.65822
4 -9.453259 -0.3272080 10.041090 99.72397
5 -10.593497 0.1764618 10.505760 101.00852
6 -10.935463 0.3845648 9.981747 100.05564
7 -11.447720 0.8477938 9.726617 99.12918
8 -11.373889 -0.3550321 9.806823 99.52711
9 -7.950092 0.5711058 10.162878 101.38218
10 -9.408727 0.5885065 9.471274 100.69328
Another way would be to pre-allocate a matrix, add in values and coerce it to a data.frame.
xy <- matrix(NA, nrow = N, ncol = 4)
for (i in 1:N) {
xy[i, ] <- rnorm(4, mean = c(-10, 0, 10, 100))
}
# notice that I set the column names after the fact
colnames(xy) <- c("um10", "u0", "u10", "u100")
xy <- as.data.frame(xy)
As this is a learning question, I will not provide the solution directly.
> values <- c(-10,0,10,100)
> for (i in seq_along(values)) {print(i)} # Checking we iterate by position
[1] 1
[1] 2
[1] 3
[1] 4
> output <- vector("double", 10)
> output # Checking the place where the output will be
[1] 0 0 0 0 0 0 0 0 0 0
> for (i in seq_along(values)) { # Testing the full code
+ output[[i]] <- rnorm(10, mean = values[[i]])
+ }
Error in output[[i]] <- rnorm(10, mean = values[[i]]) :
more elements supplied than there are to replace
As you can see, the error says more elements are supplied than there is room for: each iteration generates 10 random numbers (40 in total), and you only have 10 slots. Consider using a data structure that can store several values per iteration.
So that:
> output <- ??
> for (i in seq_along(values)) { # Testing the full code
+ output[[i]] <- rnorm(10, mean = values[[i]])
+ }
> output # Should have length 4 and each element all the 10 values you created in the loop
# set the number of rows
rows <- 10
# vector with the values
means <- c(-10,0,10,100)
# generating output matrix
output <- matrix(nrow = rows,
ncol = 4)
# setting seed and looping through the number of rows
set.seed(222)
for (i in 1:rows){
output[i,] <- rnorm(length(means),
mean=means)
}
#printing the output
output
Using a database with numeric ranges defined by two columns, start and end, I am trying to look up the factor (the code column) associated with each numeric value in a separate vector, identityCodes.
database <- data.frame(start = seq(1, 150000000, 1000),
end = seq(1000, 150000000, 1000),
code = paste0(sample(LETTERS, 15000, replace = TRUE),
sample(LETTERS, 15000, replace = TRUE)))
identityCodes <- sample(1:15000000, 1000)
I've come up with a method for finding the corresponding codes using a for loop and subsetting:
fun <- function (x, y) {
z <- rep(NA, length(x))
for (i in 1:length(x)){
z[i] <- as.character(y[y["start"] <= x[i] & y["end"] >= x[i], "code"])
}
return(z)
}
a <- fun(identityCodes, database)
But the method is slow, especially if I am to scale it:
system.time(fun(identityCodes, database))
user system elapsed
15.36 0.00 15.50
How can I find the code associated with each identityCode faster? Is there a better way to go about this than a for loop and subsetting?
Here's my attempt using data.table. Very fast - even though I am sure I am not leveraging it efficiently.
Given function:
# method 1
system.time(result1 <- fun(identityCodes, database))
user system elapsed
8.99 0.00 8.98
Using data.table
# method 2
require(data.table)
# x: a data.frame with columns start, end, code
# y: a vector with lookup codes
dt_comb <- function(x, y) {
# convert x to a data.table and set 'start' and 'end' as keys
DT <- setDT(x)
setkey(DT, start, end)
# create a lookup data.table where start and end are the identityCodes
DT2 <- data.table(start=y, end=y)
# overlap join where DT2 start & end are within DT start and end
res <- foverlaps(DT2, DT[, .(start, end)], type="within")
# store i as row number and key (for sorting later)
res[, i:=seq_len(nrow(res))]
setkey(res, i)
# merge the joined table to the original to get codes
final <- merge(res, DT, by=c("start", "end"))[order(i), .(code)]
# export as character the codes
as.character(final[[1]])
}
system.time(result2 <- dt_comb(x=database, y=identityCodes))
user system elapsed
0.08 0.00 0.08
identical(result1, result2)
[1] TRUE
edit: trimmed a couple lines from the function
This is about 45% faster on my machine:
result = lapply(identityCodes, function(x) {
data.frame(identityCode=x,
code=database[database$start <= x & database$end >= x, "code"])
})
result = do.call(rbind, result)
Here's a sample of the output:
identityCode code
1 6836845 OK
2 14100352 RB
3 2313115 NK
4 8440671 XN
5 11349271 TI
6 14467193 VL
I'm subsampling rows from a data frame with columns c("x","y","density") over a range of s_size and reps values, where reps is the number of replicates and s_size is the number of rows subsampled from the whole data frame.
> head(data_xyz)
x y density
1 6 1 0
2 7 1 17600
3 8 1 11200
4 12 1 14400
5 13 1 0
6 14 1 8000
#Subsampling###################
subsample_loop <- function(s_size, reps, int) {
tm1 <- system.time( #start timer
{
subsample_bound = data.frame()
#Perform Subsampling of the general
for (s_size in seq(1,s_size,int)){
for (reps in 1:reps) {
subsample <- sample.df.rows(s_size, data_xyz)
assign(paste("sample" ,"_","n", s_size, "_", "r", reps , sep=""), subsample)
subsample_replicate <- subsample[,] #temporary variable
subsample_replicate <- cbind(subsample, rep(s_size,(length(subsample_replicate[,1]))),
rep(reps,(length(subsample_replicate[,1]))))
subsample_bound <- rbind(subsample_bound, subsample_replicate)
}
}
}) #end timer
colnames(subsample_bound) <- c("x","y","density","s_size","reps")
subsample_bound
} #end function
Here's the function call:
source("R/functions.R")
subsample_data <- subsample_loop(s_size=206, reps=5, int=10)
Here's the row subsample function:
# Samples a number of rows in a dataframe, outputs a dataframe of the same # of columns
# df Data Frame
# N number of samples to be taken
sample.df.rows <- function (N, df, ...)
{
df[sample(nrow(df), N, replace=FALSE,...), ]
}
It's way too slow, and I've tried a few times with apply functions and had no luck. I'll be doing somewhere around 1,000-10,000 replicates for each s_size from 1 to 250.
Let me know what you think! Thanks in advance.
=========================================================================
UPDATE EDIT: Sample data from which to sample:
https://www.dropbox.com/s/47mpo36xh7lck0t/density.csv
Joran's code in a function (in a sourced function.R file):
foo <- function(i,j,data){
res <- data[sample(nrow(data),i,replace = FALSE),]
res$s_size <- i
res$reps <- rep(j,i)
res
}
resampling_custom <- function(dat, s_size, int, reps) {
ss <- rep(seq(1,s_size,by = int),each = reps)
id <- rep(seq_len(reps),times = s_size/int)
out <- do.call(rbind,mapply(foo,i = ss,j = id,MoreArgs = list(data = dat),SIMPLIFY = FALSE))
}
Calling the function
set.seed(2)
out <- resampling_custom(dat=retinal_xyz, s_size=206, int=5, reps=10)
This outputs data, but unfortunately with this warning message:
Warning message:
In mapply(foo, i = ss, j = id, MoreArgs = list(data = dat), SIMPLIFY = FALSE) :
longer argument not a multiple of length of shorter
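One likely cause of that warning (my reading, not stated in the thread) is that ss and id end up with different lengths when s_size is not a multiple of int, because rep() truncates a fractional times argument. A sketch of one way to keep them aligned inside resampling_custom:
# hypothetical fix: derive both lengths from the same sequence so they always match
sizes <- seq(1, s_size, by = int)
ss <- rep(sizes, each = reps)
id <- rep(seq_len(reps), times = length(sizes))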
I put very little thought into actually optimizing this; I was just concentrating on doing something at least reasonable that matches your procedure.
Your big problem is that you are growing objects via rbind and cbind. Basically, any time you see someone create an object with data.frame() or c() and then expand it with rbind, cbind or c, you can be fairly sure the resulting code is close to the slowest possible way of doing whatever task is being attempted.
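As a toy illustration of the difference (not code from this thread), compare growing a data frame inside a loop with collecting the pieces in a pre-allocated list and binding them once:
# growing: every rbind() re-copies everything accumulated so far
grown <- data.frame()
for (i in 1:1000) grown <- rbind(grown, data.frame(id = i, x = rnorm(1)))
# collecting: fill a pre-allocated list, then bind a single time at the end
pieces <- vector("list", 1000)
for (i in 1:1000) pieces[[i]] <- data.frame(id = i, x = rnorm(1))
bound <- do.call(rbind, pieces)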
This version is around 12-13 times faster, and I'm sure you could squeeze some more out of this if you put some real thought into it:
s_size <- 200
int <- 10
reps <- 30
ss <- rep(seq(1,s_size,by = int),each = reps)
id <- rep(seq_len(reps),times = s_size/int)
foo <- function(i,j,data){
res <- data[sample(nrow(data),i,replace = FALSE),]
res$s_size <- i
res$reps <- rep(j,i)
res
}
out <- do.call(rbind,mapply(foo,i = ss,j = id,MoreArgs = list(data = dat),SIMPLIFY = FALSE))
The best part about R is that not only is this way, way faster, it's also way less code.