Make this simple loop more efficient in R?

I am trying to clean a large dataset. I have a matrix of prices in date order with the most recent date in the first row and different stocks along the columns. If a price is missing or NA on a given day for a given stock, I use the price from the day before. If the price on the final day is NA, I leave it.
I started by looping over the whole matrix with an IF statement for each (i,j) pair, which was extremely slow. My next approach, shown below for the matrix of prices b, uses indexing to find the indices which are NA (or zero) and deals only with these.
for (j in 1:ncol(b)) {
  Index <- which(is.na(b[, j]) | b[, j] == 0)
  if (length(Index) == 0) {
    print("0 Missings")
    Index <- c(1)  # to ensure it's not empty
  }
  for (k in length(Index):1) {  # backwards, to fill from the bottom
    i <- Index[k]
    # If the oldest date is missing, set it to '#N/A' so that a missing value,
    # as opposed to 0, is passed forward.
    if (i == nrow(b) && (is.na(b[[i, j]]) || b[[i, j]] == 0)) {
      b[[i, j]] <- '#N/A'
    } else if (is.na(b[[i, j]]) || b[[i, j]] == 0) {
      b[[i, j]] <- b[[i + 1, j]]  # take the price from the day before
    }
  }
}
This is a little faster, but not much. It still takes over an hour for a 400x6000 matrix. I was hoping for a fully vectorised approach, where I did something like:
b[Index, j] <- b[Index + 1, j]
However, I don't think R will use sequentially updated values. By this I mean it won't progressively update from the bottom so that newly filled values are used. This matters when I have 2 NA entries in a row, since the vectorised approach above will only fill one, whereas efficient sequential code would update the first and then use it to update the second. Any ideas?
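To illustrate (a minimal sketch): with two NAs in a row, a single shifted assignment fills only the lower one.
v <- c(NA, NA, 5)                  # most recent date first; two missing in a row
Index <- which(is.na(v))
Index <- Index[Index < length(v)]  # the oldest row has no day before it
v[Index] <- v[Index + 1]
v                                  # NA 5 5 -- the top NA would need a second pass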
Many thanks for your efforts

Here's a possibility using the MESS package; it's essentially no different from #Roland's comment above, so I'm only including it here so you can see the formatting. The filldown function is written in C++, so it's rather fast:
x <- matrix(c(1, 2, 3, 4, NA, 6, NA, NA, NA, NA, 11, 12, 13, 14, 15, NA, 17, 18, NA, 20), nrow=5)
x
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   NA
[2,]    2   NA   12   17
[3,]    3   NA   13   18
[4,]    4   NA   14   NA
[5,]   NA   NA   15   20
and then use
library(MESS)
apply(x, 2, filldown)
which produces
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   NA
[2,]    2    6   12   17
[3,]    3    6   13   18
[4,]    4    6   14   18
[5,]    4    6   15   20
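For reference, a similar fill can be done with zoo's na.locf ("last observation carried forward"), which is presumably the approach #Roland's comment referred to; a sketch:
library(zoo)
apply(x, 2, na.locf, na.rm = FALSE)  # na.rm = FALSE keeps the leading NA in column 4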

If you want a pure R version then here's one possibility for you:
First, define a proper, big test set:
set.seed(42)
nRow <- 1000
nCol <- 500
test <- matrix(rnorm(nRow * nCol),
               nrow = nRow,
               ncol = nCol)
test[sample(nRow * nCol, nRow)] <- NA
Then write some code that passes values down in the way you want using vectorisation (applied to each column). Please note that the horrible Excel '#N/A' has been converted to NaN, which maintains the storage mode (i.e. numeric).
innerF <- function(x){
  # Nothing to change
  if(!any(idx <- is.na(x) | x == 0))
    return(x)
  # Alter first value
  if(is.na(x[1]) | x[1] == 0)
    x[1] <- NaN
  # First value altered
  idx[1] <- FALSE
  # Pass down
  x[idx] <- x[which(idx) - 1]
  # Return
  x
}
Then define a calling strategy:
outerF <- function(x){
  # Run once
  y <- innerF(x)
  # Run till done
  while(any((is.na(y) & !is.nan(y)) |
            (!is.na(y) & y == 0L))){
    y <- innerF(y)
  }
  # Return
  y
}
Then test it against the alternative, and weep.... (hint: use MESS and its C++):
library(microbenchmark)
library(MESS)
microbenchmark(apply(test, 2, outerF), times = 100)
#Unit: microseconds
# expr min lq mean median uq max neval
# apply(test, 2, outerF) 630.07 652.4505 806.4808 670.6965 686.234 3253.27 100
microbenchmark(apply(test, 2, filldown), times = 100)
#Unit: microseconds
# expr min lq mean median uq max neval
# apply(test, 2, filldown) 107.482 110.048 134.9092 112.329 114.895 1980.016 100

My RStudio does not allow me to install the packages MESS and zoo, so I have to settle for a solution similar to na.locf. The code is here in case anybody wants to use this approach:
start.time <- Sys.time()
for (j in 2:ncol(b)) {
  ColumnReversed <- rev(b[, j])  # so we fill from the bottom - oldest date first
  Index <- !is.na(as.numeric(matrix(ColumnReversed, ncol = 1)))  # TRUE for valid, FALSE for missing
  ValidVals <- c("NA", ColumnReversed[Index])  # [NA, oldest known, next known, ..., most recent known]
  FilledIndex <- cumsum(Index) + 1  # [0,0,...,1,1,...,2,2,...,k,k] + 1
  # FilledIndex holds, for each row, the index into ValidVals of the price to carry forward
  b[, j] <- rev(matrix(ValidVals[FilledIndex], ncol = 1))  # store in reversed order
}
Timing improved from 90 minutes to 65 seconds. AMAZING!
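For anyone who wants to keep the matrix numeric (the c("NA", ...) above coerces each column to character), here is a sketch of the same cumsum-indexing trick for one column using NA_real_:
fill_down_rev <- function(col) {
  col <- rev(col)                    # oldest date first
  valid <- !is.na(col)               # TRUE where a real price exists
  filled <- c(NA_real_, col[valid])[cumsum(valid) + 1]
  rev(filled)                        # back to most-recent-first order
}
This would then be applied over columns with apply(b[, -1], 2, fill_down_rev), assuming the first column is excluded as in the loop above.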

Related

R - Calculate rolling mean of previous k non-NA values

I'm trying to calculate the rolling mean of the previous k non-NA values within the dplyr/tidyverse framework. I've written a function that seems to work but was wondering if there's already a function from some package (which will probably be much more efficient than my attempt) doing exactly this. An example dataset:
tmp.df <- data.frame(
  x = c(NA, 1, 2, NA, 3, 4, 5, NA, NA, NA, 6, 7, NA)
)
Let's say I want the rolling mean of the previous 3 non-NA values. Then the output y should be:
    x  y
1  NA NA
2   1 NA
3   2 NA
4  NA NA
5   3 NA
6   4  2
7   5  3
8  NA  4
9  NA  4
10 NA  4
11  6  4
12  7  5
13 NA  6
The first 5 elements of y are NAs because the first time x has 3 previous non-NA values is on row 6 and the average of those 3 elements is 2. The next y elements are self-explanatory. Row 9 gets a 4 because the 3 previous non-NA values of x are in rows 5, 6, and 7 and so on.
My attempt is this:
roll_mean_previous_k <- function(x, k){
  require(dplyr)
  res <- NA
  lagged_vector <- dplyr::lag(x)
  lagged_vector_without_na <- lagged_vector[!is.na(lagged_vector)]
  previous_k_values <- tail(lagged_vector_without_na, k)
  if (length(previous_k_values) >= k) res <- mean(previous_k_values)
  res
}
to be used as follows (using the slide_dbl function from the slider package):
library(dplyr)
tmp.df %>%
  mutate(
    y = slider::slide_dbl(x, roll_mean_previous_k, k = 3, .before = Inf)
  )
which gives the desired output. However, I'm wondering if there's a ready-made, and (as mentioned before) more efficient way of doing this. I should mention that I know of rollmean and roll_mean from the zoo and RcppRoll packages respectively, but unless I'm mistaken, they seem to work on a fixed rolling window with the option of dealing with NA values (e.g ignoring them). In my case, I want to "extend" my window to include k non-NA values.
Any thoughts/suggestions are welcome.
EDIT - SIMULATION RESULTS
Thank you to all contributors. First of all, I had not mentioned that my datasets are indeed much larger and run often, so any performance improvements are most welcome. I therefore ran the following simulation to check execution times before deciding which answer to accept. Note that some of the answers needed small tweaks to return the desired output, but if you feel that your solution is misrepresented (and therefore is less efficient than intended) feel free to let me know and I'll edit accordingly. I've used G. Grothendieck's trick from his answer below to remove the need for if-else checks regarding the length of the lagged, non-NA vector.
So here's the simulation code:
library(tidyverse)
library(runner)
library(zoo)
library(slider)
library(purrr)
library(microbenchmark)
set.seed(20211004)
test_vector <- sample(x = 100, size = 1000, replace = TRUE)
test_vector[sample(1000, size = 250)] <- NA
# Based on GoGonzo's answer and the runner package
f_runner <- function(z, k){
  runner(
    x = z,
    f = function(x) {
      mean(`length<-`(tail(na.omit(head(x, -1)), k), k))
    }
  )
}
# Based on my initial answer (but simplified), also mentioned by GoGonzo
f_slider <- function(z, k){
  slide_dbl(
    z,
    function(x) {
      mean(`length<-`(tail(na.omit(head(x, -1)), k), k))
    },
    .before = Inf
  )
}
# Based on helios' answer. Returns the correct results, but with a warning.
f_helios <- function(z, k){
  reduced_vec <- na.omit(z)
  unique_means <- rollapply(reduced_vec, width = k, mean)
  start <- which(!is.na(z))[k] + 1
  repeater <- which(is.na(z)) + 1
  repeater_cut <- repeater[(repeater > start - 1) & (repeater <= length(z))]
  final <- as.numeric(rep(NA, length(z)))
  index <- start:length(z)
  final[setdiff(index, repeater_cut)] <- unique_means
  final[start:length(final)] <- na.locf(final)
  final
}
# Based on G. Grothendieck's answer (but I couldn't get it to run with the performance improvements)
f_zoo <- function(z, k){
  rollapplyr(
    z,
    seq_along(z),
    function(x, k){
      mean(`length<-`(tail(na.omit(head(x, -1)), k), k))
    },
    k)
}
# Based on AnilGoyal's answer
f_purrr <- function(z, k){
  map_dbl(
    seq_along(z),
    ~ ifelse(
      length(tail(na.omit(z[1:(.x - 1)]), k)) == k,
      mean(tail(na.omit(z[1:(.x - 1)]), k)),
      NA
    )
  )
}
# Check if all are identical #
all(
  sapply(
    list(
      # f_helios(test_vector, 10),
      f_purrr(test_vector, 10),
      f_runner(test_vector, 10),
      f_zoo(test_vector, 10)
    ),
    FUN = identical,
    f_slider(test_vector, 10)
  )
)
# Run benchmarking #
microbenchmark(
  # f_helios(test_vector, 10),
  f_purrr(test_vector, 10),
  f_runner(test_vector, 10),
  f_slider(test_vector, 10),
  f_zoo(test_vector, 10)
)
And the results:
Unit: milliseconds
expr min lq mean median uq max neval cld
f_purrr(test_vector, 10) 31.9377 37.79045 39.64343 38.53030 39.65085 104.9613 100 c
f_runner(test_vector, 10) 23.7419 24.25170 29.12785 29.23515 30.32485 98.7239 100 b
f_slider(test_vector, 10) 20.6797 21.71945 24.93189 26.52460 27.67250 32.1847 100 a
f_zoo(test_vector, 10) 43.4041 48.95725 52.64707 49.59475 50.75450 122.0793 100 d
Based on the above, and unless the code can be further improved, it seems that the slider and runner solutions are faster. Any final suggestions are more than welcome.
Many thanks for your time!!
With runner it will be something like the mean of a 3-element tail window of non-NA values. You can achieve the same result with slider:
library(runner)
tmp.df <- data.frame(
  x = c(NA, 1, 2, NA, 3, 4, 5, NA, NA, NA, 6, 7, NA)
)
# using runner
tmp.df$y_runner <- runner(
  x = tmp.df$x,
  f = function(x) {
    mean(
      tail(
        x[!is.na(x)],
        3
      )
    )
  }
)
# using slider
tmp.df$y_slider <- slider::slide_dbl(
  tmp.df$x,
  function(x) {
    mean(
      tail(
        x[!is.na(x)],
        3
      )
    )
  },
  .before = Inf
)
tmp.df
# x y_runner y_slider
# 1 NA NaN NaN
# 2 1 1.0 1.0
# 3 2 1.5 1.5
# 4 NA 1.5 1.5
# 5 3 2.0 2.0
# 6 4 3.0 3.0
# 7 5 4.0 4.0
# 8 NA 4.0 4.0
# 9 NA 4.0 4.0
# 10 NA 4.0 4.0
# 11 6 5.0 5.0
# 12 7 6.0 6.0
# 13 NA 6.0 6.0
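Note that this output differs slightly from the expected output in the question, because the current value is included and no minimum of 3 prior values is enforced; the f_runner variant in the question's edit handles both, roughly:
tmp.df$y_runner2 <- runner(
  x = tmp.df$x,
  f = function(x) mean(`length<-`(tail(na.omit(head(x, -1)), 3), 3))  # drop current value, pad to 3
)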
rollapplyr. Regarding the comment about rollmean in the question, zoo also has rollapply and rollapplyr (right aligned), and those allow different widths (and offsets) for each component of the input by specifying a vector (as we do here) or list for width -- see ?rollapply for more info. We use a relatively naive vector of widths below and also show some improved width vectors which run faster.
Operation. Create a Mean function which takes a vector, removes the last element and all NAs, and takes the last k elements of what is left, extending it to k elements with NAs as needed. Finally, take the mean of that. We use rollapplyr to apply it to x with a width of seq_along(x).
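The `length<-` call is what makes the k-length requirement implicit: assigning a longer length pads the vector with NAs, and mean() of anything containing NA is NA. A quick illustration:
v <- c(4, 5)
`length<-`(v, 3)        # 4 5 NA -- padded to length 3
mean(`length<-`(v, 3))  # NA, since fewer than 3 prior values were available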
Performance improvements. With this small data the following likely don't make much difference but if you have larger data you could try these which might improve the speed:
replace na.omit with na_rm from the collapse package
replace the second argument of rollapplyr with the code shown here.
The idea here is that the sum of the lengths of the k+1 longest runs of NA, plus k+1, bounds the number of elements that we need to consider. This (plus using na_rm) ran about 25% faster than the code in the question when I tried it on a problem with 1300 rows (formed from 100 copies of the data in the question) and does not add much extra code.
pmin(with(rle(is.na(x)), sum(tail(sort(lengths[values]), k+1)))+k+1, seq_along(x))
replace the second argument of rollapplyr with w, where w is shown here. The idea here is to use findInterval to find the element k non-NAs back, which provides an even tighter bound. This one (plus using na_rm) ran nearly twice as fast as the code in the question when tried with the same 1300 rows, at the expense of adding 2 more lines of code.
tt <- length(x) - rev(cumsum(rev(!is.na(x))))
w <- seq_along(tt) - findInterval(tt - k - 1, tt)
Code. With the data in the question, the code below (not using the above improvements) ran slightly faster (not a lot) than the code in the question based on my benchmarking, and it is only two lines of code.
library(dplyr)
library(zoo)
Mean <- function(x, k) mean(`length<-`(tail(na.omit(head(x, -1)), k), k))
tmp.df %>% mutate(y = rollapplyr(x, seq_along(x), Mean, k = 3))
giving:
    x  y
1  NA NA
2   1 NA
3   2 NA
4  NA NA
5   3 NA
6   4  2
7   5  3
8  NA  4
9  NA  4
10 NA  4
11  6  4
12  7  5
13 NA  6
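For reference, one way the pieces above might be assembled (a sketch only -- the simulation in the question notes these improvements were hard to get running, so treat it as untested):
library(collapse)  # for na_rm
k <- 3
x <- tmp.df$x
tt <- length(x) - rev(cumsum(rev(!is.na(x))))
w <- seq_along(tt) - findInterval(tt - k - 1, tt)
Mean2 <- function(x, k) mean(`length<-`(tail(na_rm(head(x, -1)), k), k))
tmp.df$y <- rollapplyr(x, w, Mean2, k = k)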
Since I am not aware of a ready-made way of computing your output in any standard library, I came up with the implementation roll_mean_k_efficient below, which seems to speed up your computations considerably. Note that this implementation makes use of the rollapply and the na.locf methods from the zoo package.
rm(list = ls())
library("zoo")
library("rbenchmark")
library("dplyr")
x <- rep(c(NA, 1, 2, NA, 3, 4, 5, NA, NA, NA, 6, 7, NA), 100)
# your sample (extended)
tmp.df <- data.frame(
  x = rep(c(NA, 1, 2, NA, 3, 4, 5, NA, NA, NA, 6, 7, NA), 100)
)
# enhanced implementation
roll_mean_k_efficient <- function(x, k){
  reduced_vec <- na.omit(x)
  unique_means <- rollapply(reduced_vec, width = k, mean)
  start <- which(!is.na(x))[k] + 1
  repeater <- which(is.na(x)) + 1
  repeater_cut <- repeater[(repeater > start - 1) & (repeater <= length(x))]
  final <- as.numeric(rep(NA, length(x)))
  index <- start:length(x)
  final[setdiff(index, repeater_cut)] <- unique_means
  final[start:length(final)] <- na.locf(final)
  final
}
# old implementation
roll_mean_previous_k <- function(x, k){
  res <- NA
  lagged_vector <- dplyr::lag(x)
  lagged_vector_without_na <- lagged_vector[!is.na(lagged_vector)]
  previous_k_values <- tail(lagged_vector_without_na, k)
  if (length(previous_k_values) >= k) res <- mean(previous_k_values)
  res
}
# wrapper function for the benchmarking below
roll_mean_benchmark <- function(){
  res <- tmp.df %>%
    mutate(
      y = slider::slide_dbl(x, roll_mean_previous_k, k = 3, .before = Inf)
    )
  return(res)
}
# some benchmarking
benchmark(roll_mean_k_efficient(x = x, k = 3),
          roll_mean_benchmark(),
          columns = c('test', 'elapsed', 'replications'),
          replications = 100)
Furthermore, I extended your example vector x to get some more reliable benchmark results via the benchmark function from the rbenchmark package.
In my case the benchmark runtimes that are printed after running the code are:
                                 test elapsed replications
2               roll_mean_benchmark()   4.463          100
1 roll_mean_k_efficient(x = x, k = 3)   0.039          100
Without using zoo. In tidyverse fashion, you can also do it using purrr::map:
tmp.df %>%
  mutate(y = map(seq_along(x),
                 ~ ifelse(length(tail(na.omit(tmp.df$x[1:(.x - 1)]), 3)) == 3,
                          mean(tail(na.omit(tmp.df$x[1:(.x - 1)]), 3)),
                          NA)))
    x  y
1  NA NA
2   1 NA
3   2 NA
4  NA NA
5   3 NA
6   4  2
7   5  3
8  NA  4
9  NA  4
10 NA  4
11  6  4
12  7  5
13 NA  6

Creating iterator for a data frame in R

I'm trying to create a function in R using an iterator from the iterators package: https://cran.r-project.org/web/packages/iterators/iterators.pdf to help iterate through each row of a data frame.
Given a table like this:
data <- matrix(c(1, 0, 0, NA, NA, 1, 1, NA, 0), ncol = 3, byrow = TRUE)
> data
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]   NA   NA    1
[3,]    1   NA    0
I want it to go through each row and return the first non-NA value from left to right, or NA if all values are NA. So with the above data, it should return 1, 1, 1.
The general idea I have right now is to use the iter() function from the package like so:
library(iterators)
firstNonNA <- function(row){
  iterRow <- iter(row)            # creates an iterator object for one row of data
  i <- 1
  while (i <= iterRow$length) {   # iterRow$length gives # of columns essentially
    temp <- nextElem(iterRow)     # set temp to the next element of the iterator
    if (!is.na(temp)) {           # if the value is not NA, return it immediately
      return(temp)
    }
    i <- i + 1
  }
  NA                              # otherwise return NA
}
firstNonNA(data[x, ])             # first non-NA value of row x
The data I'm working with will be up to millions of rows long, so ideally I would like to vectorise the function. I'm stuck on how to apply that idea across the whole data frame.
Would it work to make the function like this:
iterateRows<- function(dataFrame){
...
}
with the data frame i'm working with as the argument.
I also know C++, so if using C++ to write a similar function would be easier, I could also do that. Any help will be appreciated!
Start with a simple approach. Here's a function to do what you want to each row:
first_not_na = function(x) if(all(is.na(x))) NA else x[!is.na(x)][1]
Here are a couple of simple ways to apply that to every row of data.
# apply
results = apply(data, 1, first_not_na)
# for loop
results = numeric(nrow(data))
for (i in 1:nrow(data)) results[i] = first_not_na(data[i, ])
Here's a benchmark comparing timings on pretty large data:
row = 1e6
col = 5
data = matrix(sample(c(1, 0, NA), size = row * col, replace = T), nrow = row)
microbenchmark::microbenchmark(
  apply = {results = apply(data, 1, first_not_na)},
  loop = {
    results = numeric(row)
    for (i in 1:row) results[i] = first_not_na(data[i, ])
  },
  times = 5
)
# Unit: seconds
# expr min lq mean median uq max neval cld
# apply 2.140379 2.249405 2.399239 2.480180 2.524667 2.601563 5 a
# loop 1.970481 1.982853 2.160342 2.090484 2.264797 2.493095 5 a
A simple for loop takes about 2 seconds for 1M rows and 5 columns. If you want to speed up more, you could certainly use foreach to parallelize. Only if that's still not fast enough should you look for more complex solutions like iterators or an implementation in C++.
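If it does come to that, one fully vectorised alternative (a sketch, using base R's max.col rather than iterators or C++) avoids the per-row function call entirely:
first_not_na_vec <- function(m) {
  j <- max.col(!is.na(m), ties.method = "first")  # first non-NA column per row
  m[cbind(seq_len(nrow(m)), j)]                   # all-NA rows map to column 1, which is NA anyway
}
results_vec <- first_not_na_vec(data)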

R keep randomly generating numbers until all numbers within specified range are present

My aim is to randomly generate a vector of integers using R, populated by numbers between 1 and 8. However, I want to keep growing the vector until all the numbers from 1:8 are represented at least once, e.g. 1,4,6,2,2,3,5,1,4,7,6,8.
I am able to generate single numbers or a sequence of numbers using sample:
x=sample(1:8,1, replace=T)
>x
[1] 6
I have played around with the repeat function to see how it might work with sample and I can at least get the generation to stop when one specific number occurs, e.g.
repeat {
  print(x)
  x = sample(1:8, 1, replace=T)
  if (x == 3){
    break
  }
}
Which gives:
[1] 3
[1] 6
[1] 6
[1] 6
[1] 6
[1] 6
[1] 2
I am struggling now to work out how to stop number generation once all numbers between 1:8 are present. Additionally, I know that the above code is only printing the sequence as it is generated and not storing it as a vector. Any advice pointing me in the right direction would be really appreciated!
This is fine for 1:8 but might not always be a good idea.
foo = integer(0)
set.seed(42)
while(TRUE){
  foo = c(foo, sample(1:8, 1))
  if(all(1:8 %in% foo)) break
}
foo
# [1] 8 8 3 7 6 5 6 2 6 6 4 6 8 3 4 8 8 1
If you have more than 1:8, it may be better to obtain the average number of tries (N) required to get all the numbers at least once (the coupon-collector expectation, n*(1/1 + 1/2 + ... + 1/n)) and then sample N numbers such that all numbers are sampled at least once.
set.seed(42)
vec = 1:8
N = ceiling(sum(length(vec)/(1:length(vec))))
foo = sample(c(vec, sample(vec, N - length(vec), TRUE)))
foo
# [1] 3 6 8 3 8 8 6 4 5 6 1 6 4 6 6 3 5 7 2 2 7 8
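By construction every value appears at least once here, since vec itself goes into the pool before shuffling; a quick check:
all(vec %in% foo)  # TRUE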
Taking a cue from d.b, here's a slightly more verbose method that is more memory-efficient (and a little faster too, though I doubt speed is your issue):
Differences:
pre-allocate memory in chunks (size 100 here), which mitigates the cost of extend-by-one vector work; allocating and extending 100 (or even 1000) at a time is much cheaper
compare only the newest number instead of all numbers each time (the first n-1 numbers have already been tabulated, no need to do that again)
Code:
library(microbenchmark)
n <- 8  # upper end of the range to cover
microbenchmark(
  r2evans = {
    emptyvec100 <- integer(100)
    counter <- 0
    out <- integer(0)
    unseen <- seq_len(n)
    set.seed(42)
    repeat {
      if (counter %% 100 == 0) out <- c(out, emptyvec100)
      counter <- counter + 1
      num <- sample(n, size = 1)
      unseen <- unseen[unseen != num]
      out[counter] <- num
      if (!length(unseen)) break
    }
    out <- out[1:counter]
  },
  d.b = {
    foo = integer(0)
    set.seed(42)
    while(TRUE){
      foo = c(foo, sample(1:n, 1))
      if(all(1:n %in% foo)) break
    }
  }, times = 1000, unit = 'us')
# Unit: microseconds
# expr min lq mean median uq max neval
# r2evans 1090.007 1184.639 1411.531 1228.947 1320.845 11344.24 1000
# d.b 1242.440 1372.264 1835.974 1441.916 1597.267 14592.74 1000
(This is intended neither as code-golf nor speed-optimization. My primary goal is to argue against extend-by-one vector work, and suggest a more efficient comparison technique.)
As d.b further suggested, this works fine for 1:8 but may run into trouble with larger numbers as n extends upwards.
(Edit: with d.b's code changes, the execution times are much closer, and not nearly as exponential-looking. Apparently the removal of unique had significant benefits to his code.)

Custom calculation of deviation between columns in R

I am a newbie in R.
I am trying to calculate the deviation between columns, with several rules applied:
deviation is calculated by current value minus previous value
if current value is NA, then return NA without calculation
if previous value is NA, then current value minus the value before previous value, until minus a valid value
the value in the first column is always valid
For example:
start = c(1, 2, 3, 4)
a = c(2, NA, 5, 6)
b = c(4, 5, NA, 8)
test <- data.frame(start, a, b)
test
  start  a  b
1     1  2  4
2     2 NA  5
3     3  5 NA
4     4  6  8
Expected:
result
  a_delta b_delta
1       1       2
2      NA       3
3       2      NA
4       2       2
Note:
cell (2, 1) in result is NA because cell (2, 2) in test is NA
cell (2, 2) in result is 3 because cell (2, 3) minus cell (2, 1) in test gives 3
Here is my broken code. Any suggestions are welcomed:
f <- function(data){
  cn <- colnames(data)
  cl <- ncol(data)
  for (i in 2:cl) {
    if (is.na(data[[cn[i]]])) {
      a <- NA
    } else if (!is.na(data[[cn[i - 1]]])) {
      data[[paste0(cn[i], "_delta")]] <- data[[cn[i]]] - data[[cn[i - 1]]]
    } else { # check whether the previous value is NA, repeatedly
      t <- i - 1
      while (is.na(data[[cn[t]]])) {
        t <- t - 1
      }
      data[[paste0(cn[i], "_delta")]] <- data[[cn[i]]] - data[[cn[t]]]
    }
  }
}
f(test)
Main function:
CalcDev <- function(x){
  x <- unlist(x)
  if (!any(is.na(x))){
    return(diff(x, 1))
  } else {
    tmp <- x[-1]
    res <- diff(na.omit(x), 1)
    tmp[!is.na(tmp)] <- res
    return(tmp)
  }
}
And how to use:
start = c(1, 2, 3, 4)
a = c(2, NA, 5, 6)
b = c(4, 5, NA, 8)
test <- data.frame(start, a, b)
plyr::adply(test, 1, CalcDev)[, -1]
Result:
   a  b
1  1  2
2 NA  3
3  2 NA
4  2  2
You just need to rename columns.
I was unable to run your code, so no benchmark.
EDIT:
Answering your comment, you can use the CalcDev function in a dplyr chain if you vectorize it:
CalcDev.Vect <- Vectorize(CalcDev)
test %>%
  CalcDev.Vect %>%
  .[, -1] %>%
  as.data.frame
You will get similar results, and it will be much faster, especially for bigger data sets.
There are two alternatives: using CalcDev inside do({}) or adply directly in the chain, but both will be slower solutions.
Benchmarks for a small data set:
expr min lq mean median uq max neval cld
foo.plyr(test) 2240.34 2392.08 2511.3 2490.13 2577.32 3199.16 100 b
foo.do_dplyr(test) 2680.34 2933.70 3104.4 3015.15 3109.48 5771.83 100 d
foo.plyr_in_dplyr(test) 2471.51 2635.04 2805.7 2702.99 2802.29 9422.46 100 c
foo.Vect(test) 441.55 490.58 539.7 539.92 564.74 928.41 100 a
And for bigger data sets the difference in evaluation time will be more drastic.

R Improve performance of function(s)

This question is related to my previous one. Here is a small sample data. I have used both data.table and data.frame to find a faster solution.
library(data.table)
test.dt <- data.table(strt = c(1,1,2,3,5,2), end = c(2,1,5,5,5,4),
                      a1.2 = c(1,2,3,4,5,6), a2.3 = c(2,4,6,8,10,12),
                      a3.4 = c(3,1,2,4,5,1), a4.5 = c(5,1,15,10,12,10),
                      a5.6 = c(4,8,2,1,3,9))
test.dt[, rown := as.numeric(row.names(test.dt))]
test.df <- data.frame(strt = c(1,1,2,3,5,2), end = c(2,1,5,5,5,4),
                      a1.2 = c(1,2,3,4,5,6), a2.3 = c(2,4,6,8,10,12),
                      a3.4 = c(3,1,2,4,5,1), a4.5 = c(5,1,15,10,12,10),
                      a5.6 = c(4,8,2,1,3,9))
test.df$rown <- as.numeric(row.names(test.df))
> test.df
  strt end a1.2 a2.3 a3.4 a4.5 a5.6 rown
1    1   2    1    2    3    5    4    1
2    1   1    2    4    1    1    8    2
3    2   5    3    6    2   15    2    3
4    3   5    4    8    4   10    1    4
5    5   5    5   10    5   12    3    5
6    2   4    6   12    1   10    9    6
I want to use the start and end column values to determine the range of columns to subset (columns from a1.2 to a5.6) and obtain the mean. For example, in the first row, since strt=1 and end=2, I need to get the mean of a1.2 and a2.3; in the third row, I need to get the mean of a2.3, a3.4, a4.5, and a5.6
The output should be a vector like this
> k
       1        2        3        4        5        6 
1.500000 2.000000 6.250000 5.000000 3.000000 7.666667
Here, is what I tried:
Solution 1: This uses the data.table and applies a function over it.
func.dt <- function(rown, x, y) {
  tmp <- paste0("a", x, ".", x + 1)
  tmp1 <- paste0("a", y, ".", y + 1)
  rowMeans(test.dt[rown, get(tmp):get(tmp1), with = FALSE])
}
k <- test.dt[, func.dt(rown, strt, end), by = .(rown)]
Solution 2: This uses the data.frame and applies a function over it.
func.df <- function(rown, x, y) {
  rowMeans(test.df[rown, (x + 2):(y + 2), drop = FALSE])
}
k1 <- mapply(func.df, test.df$rown, test.df$strt, test.df$end)
Solution 3: This uses the data.frame and loops through it.
test.ave <- rep(NA, length(test.df$strt))
for (i in 1:length(test.df$strt)) {
  test.ave[i] <- rowMeans(test.df[i, as.numeric(test.df[i, 1] + 2):as.numeric(test.df[i, 2] + 2), drop = FALSE])
}
Benchmarking shows that Solution 2 is the fastest.
test replications elapsed relative user.self sys.self user.child sys.child
1 sol1 100 0.67 4.786 0.67 0 NA NA
2 sol2 100 0.14 1.000 0.14 0 NA NA
3 sol3 100 0.15 1.071 0.16 0 NA NA
But, this is not good enough for me. Given the size of my data, these functions would need to run for a few days before I get the output. I am sure that I am not fully utilizing the power of data.table and I also know that my functions are crappy (they refer to the dataset in the global environment without passing it). Unfortunately, I am out of my depth and do not know how to fix these issues and make my functions fast. I would greatly appreciate any suggestions that help in improving my function(s) or point to alternate solutions.
I was curious how fast I could make this without resorting to writing custom C or C++ code. The best I could come up with is below. Note that using mean.default will provide greater precision, since it does a second pass over the data for error correction.
f_jmu <- compiler::cmpfun({function(m) {
  # remove start/end columns from 'm' matrix
  ma <- m[, -(1:2)]
  # column index for each row in 'ma' matrix
  cm <- col(ma)
  # logical index of whether we need the column for each row
  i <- cm >= m[, 1L] & cm <= m[, 2L]
  # multiply the input matrix by the index matrix and sum it
  # divide by the sum of the index matrix to get the mean
  rowSums(i*ma) / rowSums(i)
}})
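Applied to the small example from the question it reproduces k (a sketch; f_jmu expects a numeric matrix whose first two columns are strt and end, so the rown column is dropped):
f_jmu(as.matrix(test.df[, 1:7]))
# [1] 1.500000 2.000000 6.250000 5.000000 3.000000 7.666667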
The Rcpp function is still faster (not surprisingly), but the function above gets respectably close. Here's an example on 50 million observations on my laptop with an i7-4600U and 12GB of RAM.
set.seed(21)
N <- 5e7
test.df <- data.frame(strt = 1L,
                      end = sample(5, N, replace = TRUE),
                      a1.2 = sample(3, N, replace = TRUE),
                      a2.3 = sample(7, N, replace = TRUE),
                      a3.4 = sample(14, N, replace = TRUE),
                      a4.5 = sample(8, N, replace = TRUE),
                      a5.6 = sample(30, N, replace = TRUE))
test.df$strt <- pmax(1L, test.df$end - sample(3, N, replace = TRUE) + 1L)
test.m <- as.matrix(test.df)
Also note that I take care to ensure that test.m is an integer matrix. That helps reduce the memory footprint, which can help make things faster.
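A quick way to confirm that (a sketch):
storage.mode(test.m)  # "integer" -- sample() and 1L keep every column integer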
R> system.time(st1 <- MYrcpp(test.m))
user system elapsed
0.900 0.216 1.112
R> system.time(st2 <- f_jmu(test.m))
user system elapsed
6.804 0.756 7.560
R> identical(st1, st2)
[1] TRUE
Unless you can think of a way to do this with a clever subsetting approach, I think you've reached R's speed barrier. You'll want to use a low-level language like C++ for this problem. Fortunately, the Rcpp package makes interfacing with C++ in R simple. Disclaimer: I've never written a single line of C++ code in my life. This code may be very inefficient.
library(Rcpp)
cppFunction('NumericVector MYrcpp(NumericMatrix x) {
  int nrow = x.nrow(), ncol = x.ncol();
  NumericVector out(nrow);
  for (int i = 0; i < nrow; i++) {
    double avg = 0;
    int start = x(i,0);
    int end = x(i,1);
    int N = end - start + 1;
    while (start <= end) {
      avg += x(i, start + 1);
      start = start + 1;
    }
    out[i] = avg/N;
  }
  return out;
}')
For this code I'm going to pass the data.frame as a matrix (i.e. testM <- as.matrix(test.df))
Let's see if it works...
MYrcpp(testM)
[1] 1.500000 2.000000 6.250000 5.000000 3.000000 7.666667
How fast is it?
Unit: microseconds
expr min lq mean median uq max neval
f2() 1543.099 1632.3025 2039.7350 1843.458 2246.951 4735.851 100
f3() 1859.832 1993.0265 2642.8874 2168.012 2493.788 19619.882 100
f4() 281.541 315.2680 364.2197 345.328 375.877 1089.994 100
MYrcpp(testM) 3.422 10.0205 16.7708 19.552 21.507 56.700 100
Where f2(), f3() and f4() are defined as
f2 <- function(){
  func.df <- function(rown, x, y) {
    rowMeans(test.df[rown, (x+2):(y+2), drop = FALSE])
  }
  k1 <- mapply(func.df, test.df$rown, test.df$strt, test.df$end)
}
f3 <- function(){
  test.ave <- rep(NA, length(test.df$strt))
  for (i in 1:length(test.df$strt)) {
    test.ave[i] <- rowMeans(test.df[i, as.numeric(test.df[i,1]+2):as.numeric(test.df[i,2]+2), drop = FALSE])
  }
}
f4 <- function(){
  lapply(
    apply(test.df, 1, function(x){
      x[(x[1]+2):(x[2]+2)]}),
    mean)
}
That's roughly a 20x increase over the fastest.
Note, to implement the above code you'll need a C compiler which R can access. For Windows, look into Rtools. For more on Rcpp read this
Now let's see how it scales.
N <- 5e3
test.df <- data.frame(strt = 1,
                      end = sample(5, N, replace = TRUE),
                      a1.2 = sample(3, N, replace = TRUE),
                      a2.3 = sample(7, N, replace = TRUE),
                      a3.4 = sample(14, N, replace = TRUE),
                      a4.5 = sample(8, N, replace = TRUE),
                      a5.6 = sample(30, N, replace = TRUE))
test.df$rown <- as.numeric(row.names(test.df))
test.dt <- as.data.table(test.df)
testM <- as.matrix(test.df)  # rebuild the matrix from the new, larger data
microbenchmark(f4(), MYrcpp(testM))
Unit: microseconds
expr min lq mean median uq max neval
f4() 88647.256 108314.549 125451.4045 120736.073 133487.5295 259502.49 100
MYrcpp(testM) 196.003 216.533 242.6732 235.107 261.0125 499.54 100
With 5e3 rows MYrcpp is now 550x faster. This is partially due to the fact that f4() is not going to scale well, as Richard discusses in the comment. f4() is essentially invoking a nested for loop by calling an apply within a lapply. Interestingly, the C++ code is also invoking a nested loop by utilizing a while loop inside a for loop. The speed disparity is due in large part to the fact that the C++ code is already compiled and does not need to be interpreted into something the machine can understand at run time.
I'm not sure how big your data set is, but when I run MYrcpp on a data.frame with 1e7 rows, which is the largest data.frame I could allocate on my crummy laptop, it ran in 500 milliseconds.
Update: R equivalent of C++ code
MYr <- function(x){
  nrow <- nrow(x)
  ncol <- ncol(x)
  out <- matrix(NA, nrow = 1, ncol = nrow)
  for (i in 1:nrow){
    avg <- 0
    start <- x[i,1]
    end <- x[i,2]
    N <- end - start + 1
    while (start <= end){
      avg <- avg + x[i, start + 2]
      start <- start + 1
    }
    out[i] <- avg/N
  }
  out
}
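A quick agreement check between the two versions (a sketch; MYr returns a 1-row matrix, so both results are flattened first):
all.equal(as.numeric(MYr(testM)), as.numeric(MYrcpp(testM)))  # should be TRUE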
Both MYrcpp and MYr are similar in many ways. Let me discuss a couple of the differences
The first line of MYrcpp is different from MYr's. In words, the first line of MYrcpp, NumericVector MYrcpp(NumericMatrix x), means that we are defining a function whose name is MYrcpp, which returns an output of class NumericVector and takes an input x of class NumericMatrix.
In C++ you have to define the class of a variable when you introduce it, i.e. int nrow = x.nrow() is a variable whose name is nrow, whose class is int (i.e. integer), and which is assigned x.nrow(), i.e. the number of rows of x. (IGNORE if you're overwhelmed: nrow() is a method for instances of class NumericMatrix. Like in Python, you call a method by attaching it to the instance. The R equivalent is S3 and S4 methods.)
When you subset in C++ you use () instead of [] like in R. Also, indexing begins at zero (like in Python). For example, x(0,1) in C++ is equivalent to x[1,2] in R
++ is an operator that means increment by 1, i.e. j++ is the same as j + 1. += is an operator that means add together and assign, i.e. a += b is the same as a = a + b
My solution is the first one in the benchmark:
library(microbenchmark)
microbenchmark(
  lapply(
    apply(test.df, 1, function(x){
      x[(x[1]+2):(x[2]+2)]}),
    mean),
  test.dt[, func.dt(rown, strt, end), by = .(rown)]
)
min lq mean median uq max neval
138.654 175.7355 254.6245 201.074 244.810 3702.443 100
4243.641 4747.5195 5576.3399 5252.567 6247.201 8520.286 100
It seems to be 25 times faster, but this is a small dataset. I am sure there is a better way to do this than what I have done.
