"Truncate" number to lie in a range quickly in R - r

Is there any simple built-in R function, or a quick hack, to clamp a given number so it lies in a given range if it isn't already there? I want something that does this:
truncate <- function(x, a, b) min(max(x, a), b)
which is of course simple, but it seems like it should be built in. I know truncation has a different meaning in R (cutting off decimals), but the word carries my meaning in, e.g., probability and statistics.

Some (vectorised) options, including a microbenchmark:
set.seed(2019)
x <- sample(1:1000, 10^6, replace = TRUE)
# Option 1: Using replace
truncate1 <- function(x, a, b) {
  replace(replace(x, x < a, a), x > b, b)
}
# Option 2: Direct indexing
truncate2 <- function(x, a, b) {
  x[x < a] <- a
  x[x > b] <- b
  x
}
# Option 3: Using vectorised min/max
truncate3 <- function(x, a, b) pmin(pmax(x, a), b)
# Option 4: Rcpp version
library(Rcpp)
cppFunction("
NumericVector truncate4(NumericVector x, double a, double b) {
x[x < a] = a;
x[x > b] = b;
return(x);
}
")
library(microbenchmark)
res <- microbenchmark(
  truncate1 = truncate1(x, 20, 50),
  truncate2 = truncate2(x, 20, 50),
  truncate3 = truncate3(x, 20, 50),
  truncate4 = truncate4(x, 20, 50))
# Unit: milliseconds
#      expr      min       lq     mean   median       uq      max neval
# truncate1 21.11311 24.72446 28.53179 26.93042 30.36175 67.24833   100
# truncate2 18.92472 23.04515 28.72290 26.33326 29.27687 76.96418   100
# truncate3 15.25429 17.89847 23.96331 21.31570 24.56693 63.58878   100
# truncate4 24.37332 27.33767 32.30241 29.97932 33.41832 72.65265   100
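As a quick sanity check (a sketch, not part of the original benchmark), the four implementations can be compared on a small vector; all.equal() is used because truncate4 always returns doubles:
y <- c(-5, 0, 20, 35, 50, 70)
stopifnot(all.equal(truncate1(y, 20, 50), truncate3(y, 20, 50)),
          all.equal(truncate2(y, 20, 50), truncate3(y, 20, 50)),
          all.equal(truncate4(y, 20, 50), truncate3(y, 20, 50)))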

Related

More efficient function and for loop

I am trying to write a more efficient for loop. I know about sapply, lapply, etc., but I don't know how to apply them in my code.
Here is my function; I don't know whether it is very efficient. I think it could be improved, but I don't know how.
myfun <- function(a, b, c) {
  sum <- 0
  iter <- 0
  while (sum < c) {
    nr <- runif(1, a, b)
    sum <- sum + nr
    iter <- iter + 1
  }
  return(iter)
}
And here is the part where I would like to use sapply or something similar.
a <- 0
b <- 1
c <- 2
x <- 0
for (i in 1:10^9) {
  x <- x + myfun(a, b, c)
}
Also, I would like to write an sapply call similar to this:
sapply(1:10^9, functie(a, b, c))
but sapply passes 1:10^9 as the argument, instead of a, b, c.
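(A minimal sketch of the usual idiom, assuming myfun as defined above: wrap the call in an anonymous function that simply ignores the index.)
# i is ignored; a, b and c are found in the enclosing environment
x <- sum(sapply(1:10^5, function(i) myfun(a, b, c)))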
I think replicate() is what you may be looking for (I changed your n to something smaller).
set.seed(1234)
n <- 10^2
y <- replicate(n, myfun(a,b,c))
sum(y)
# [1] 462
This matches your prior result.
set.seed(1234)
a <- 0
b <- 1
c <- 2
x <- 0
for (i in 1:n) {
  x <- x + myfun(a, b, c)
}
x
# [1] 462
Recursion
Here is a recursive function f() that does the same job as myfun().
f <- function(s = 0) {
  if (s[length(s)] >= 2) {
    return(length(s) - 1L)
  } else {
    f(c(s, s[length(s)] + runif(1, 0L, 1L)))
  }
}
set.seed(42)
f()
# [1] 3
replicate(8, f())
# [1] 4 5 4 4 3 5 3 5
stopifnot(all.equal({set.seed(42);f()}, {set.seed(42);myfun(0, 1, 2)}))
However (and most likely for that reason), it's just cooler, not faster.
Rcpp
Learning from that, we can implement the while loop in Rcpp.
library(Rcpp)
cppFunction('
double myfun_cpp() {
  double s = 0;
  int i = 0;
  while (s < 2) {
    s = s + R::runif(0, 1);
    i++;
  }
  return i;
}
')
set.seed(42)
myfun_cpp()
# [1] 3
replicate(8, myfun_cpp())
# [1] 4 5 4 4 3 5 3 5
stopifnot(all.equal({set.seed(42);myfun_cpp()}, {set.seed(42);myfun(0, 1, 2)}))
Now it's lightning fast:
set.seed(42); R <- 1e3
microbenchmark::microbenchmark(
  f = replicate(R, f()),
  myfun_cpp = replicate(R, myfun_cpp()),
  myfun = replicate(R, myfun(0, 1, 2)),
  times = 1e2L,
  control = list(warmup = 1e1L))
# Unit: milliseconds
#      expr       min        lq      mean    median        uq      max neval cld
#         f 22.244076 22.639439 25.345203 22.927777 24.089196 35.82683   100   c
# myfun_cpp  3.204448  3.260542  3.843632  3.294618  3.347971 13.71213   100 a
#     myfun 16.823981 17.125346 20.605663 17.516248 27.385791 28.63267   100  b
Here are some options.
A base R recursion method:
f_TIC <- function(x, y, z) ifelse(z <= 0, 0, f_TIC(x, y, z - runif(1, x, y)) + 1)
Rcpp implementation of f_TIC
library(Rcpp)
cppFunction("
int f_TIC_cpp(double x, double y, double z) {
if (z <= 0) {
return 0;
} else {
return f_TIC_cpp(x, y, z- R::runif(0,1))+1;
}
}
")
Benchmarking
library(Rcpp)
f <- function(s = 0) {
  if (s[length(s)] >= 2) {
    return(length(s) - 1L)
  } else {
    f(c(s, s[length(s)] + runif(1, 0L, 1L)))
  }
}
f_TIC <- function(x, y, z) ifelse(z <= 0, 0, f_TIC(x, y, z - runif(1, x, y)) + 1)
cppFunction("
double myfun_cpp() {
  double s = 0;
  int i = 0;
  while (s < 2) {
    s = s + R::runif(0, 1);
    i++;
  }
  return i;
}
")
cppFunction("
int f_TIC_cpp(double x, double y, double z) {
  if (z <= 0) {
    return 0;
  } else {
    return f_TIC_cpp(x, y, z - R::runif(x, y)) + 1;
  }
}
")
myfun <- function(a, b, c) {
  sum <- 0
  iter <- 0
  while (sum < c) {
    nr <- runif(1, a, b)
    sum <- sum + nr
    iter <- iter + 1
  }
  return(iter)
}
set.seed(42)
R <- 1e3
microbenchmark::microbenchmark(
  f = replicate(R, f()),
  f_TIC = replicate(R, f_TIC(0, 1, 2)),
  f_TIC_cpp = replicate(R, f_TIC_cpp(0, 1, 2)),
  myfun_cpp = replicate(R, myfun_cpp()),
  myfun = replicate(R, myfun(0, 1, 2)),
  times = 1e2L,
  control = list(warmup = 1e1L)
)
and we will see:
Unit: milliseconds
      expr     min       lq      mean   median       uq     max neval
         f 11.9342 12.50330 14.161982 13.02100 14.96575 22.7116   100
     f_TIC 20.1925 21.69420 23.678240 22.28255 24.86350 34.1577   100
 f_TIC_cpp  2.0293  2.10080  2.639625  2.17505  2.36190  7.9715   100
 myfun_cpp  1.7351  1.79415  2.094577  1.83810  2.00495  6.7481   100
     myfun  9.1408  9.45240 11.783504 10.32355 14.68815 19.5400   100
I would probably solve this using purrr::map_dbl(), e.g. like this:
c(1:1e9) %>%
  purrr::map_dbl(
    ~ myfun(a, b, c)
  ) %>%
  sum()
This first calls myfun() as many times as the length of c(1:1e9) and stores the results in a numeric vector; then sum() adds the results together.
My tests show it's a bit faster than using replicate().
You're doing it right, in my honest opinion.
Since you don't need to return a vectorized or multi-dimensional result but instead update an existing object at each iteration, the for loop you're suggesting is more than adequate.
If you want to take a look at some great discussion of this topic, I suggest you look at this link: https://r4ds.had.co.nz/iteration.html
Edit: just to address the speed argument
start <- Sys.time()
purrr::map_dbl(1:1000, function(x) myfun(a, b, c)) %>% sum
end <- Sys.time()
end - start
# Time difference of 0.02593184 secs
start <- Sys.time()
y <- replicate(1000, myfun(a,b,c))
cumsum(y)[1000]
end <- Sys.time()
end - start
# Time difference of 0.01755929 secs
y <- 0
start <- Sys.time()
for (i in 1:1000) {
  y <- y + myfun(a, b, c)
}
end <- Sys.time()
end - start
# Time difference of 0.01459098 secs

How to implement coalesce efficiently in R

Background
Several SQL dialects (I mostly use PostgreSQL) have a function called coalesce, which returns the first non-null element per row across a set of columns. This can be very efficient when tables contain a lot of NULL elements.
I run into the same need in a lot of scenarios in R when dealing with less structured data that has a lot of NAs in it.
I have made a naive implementation myself, but it is ridiculously slow.
coalesce <- function(...) {
  apply(cbind(...), 1, function(x) {
    x[which(!is.na(x))[1]]
  })
}
Example
a <- c(1, 2, NA, 4, NA)
b <- c(NA, NA, NA, 5, 6)
c <- c(7, 8, NA, 9, 10)
coalesce(a,b,c)
# [1] 1 2 NA 4 6
Question
Is there any efficient way to implement coalesce in R?
On my machine, using Reduce gets a 5x performance improvement:
coalesce2 <- function(...) {
  Reduce(function(x, y) {
    i <- which(is.na(x))
    x[i] <- y[i]
    x
  }, list(...))
}
> microbenchmark(coalesce(a, b, c), coalesce2(a, b, c))
Unit: microseconds
               expr    min       lq   median       uq     max neval
  coalesce(a, b, c) 97.669 100.7950 102.0120 103.0505 243.438   100
 coalesce2(a, b, c) 19.601  21.4055  22.8835  23.8315  45.419   100
Looks like the name coalesce1 is still available:
coalesce1 <- function(...) {
  ans <- ..1
  for (elt in list(...)[-1]) {
    i <- is.na(ans)
    ans[i] <- elt[i]
  }
  ans
}
which is faster still (but more-or-less a hand re-write of Reduce, so less general)
> identical(coalesce(a, b, c), coalesce1(a, b, c))
[1] TRUE
> microbenchmark(coalesce(a, b, c), coalesce1(a, b, c), coalesce2(a, b, c))
Unit: microseconds
               expr     min       lq   median       uq     max neval
  coalesce(a, b, c) 336.266 341.6385 344.7320 355.4935 538.348   100
 coalesce1(a, b, c)   8.287   9.4110  10.9515  12.1295  20.940   100
 coalesce2(a, b, c)  37.711  40.1615  42.0885  45.1705  67.258   100
Or, for larger data, compare
coalesce1a <- function(...) {
  ans <- ..1
  for (elt in list(...)[-1]) {
    i <- which(is.na(ans))
    ans[i] <- elt[i]
  }
  ans
}
showing that which() can sometimes be effective, even though it implies a second pass through the index.
> aa <- sample(a, 100000, TRUE)
> bb <- sample(b, 100000, TRUE)
> cc <- sample(c, 100000, TRUE)
> microbenchmark(coalesce1(aa, bb, cc),
+                coalesce1a(aa, bb, cc),
+                coalesce2(aa, bb, cc), times = 10)
Unit: milliseconds
                   expr       min        lq    median        uq       max neval
  coalesce1(aa, bb, cc) 11.110024 11.137963 11.145723 11.212907 11.270533    10
 coalesce1a(aa, bb, cc)  2.906067  2.953266  2.962729  2.971761  3.452251    10
  coalesce2(aa, bb, cc)  3.080842  3.115607  3.139484  3.166642  3.198977    10
From data.table >= 1.12.3 you can use fcoalesce.
library(data.table)
fcoalesce(a, b, c)
# [1] 1 2 NA 4 6
fcoalesce can also take "a single plain list, data.table or data.frame". Thus, if the vectors above were columns in a data.frame (or a data.table), we could simply supply the name of the data set:
d = data.frame(a, b, c)
# or d = data.table(a, b, c)
fcoalesce(d)
# [1] 1 2 NA 4 6
For more info, including a benchmark, see NEWS item #18 for development version 1.12.3.
Using the dplyr package:
library(dplyr)
coalesce(a, b, c)
# [1] 1 2 NA 4 6
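A common pattern (a small sketch, assuming the vectors are columns of a data frame) is to call coalesce() inside mutate():
library(dplyr)
d <- data.frame(a, b, c)
d %>% mutate(first_non_na = coalesce(a, b, c))
# first_non_na is 1 2 NA 4 6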
Benchmark; not as fast as the accepted solution:
coalesce2 <- function(...) {
  Reduce(function(x, y) {
    i <- which(is.na(x))
    x[i] <- y[i]
    x
  }, list(...))
}
microbenchmark::microbenchmark(
  coalesce(a, b, c),
  coalesce2(a, b, c)
)
# Unit: microseconds
#               expr    min     lq     mean median      uq     max neval cld
#  coalesce(a, b, c) 21.951 24.518 27.28264 25.515 26.9405 126.293   100   b
# coalesce2(a, b, c)  7.127  8.553  9.68731  9.123  9.6930  27.368   100  a
But on a larger dataset, it is comparable:
aa <- sample(a, 100000, TRUE)
bb <- sample(b, 100000, TRUE)
cc <- sample(c, 100000, TRUE)
microbenchmark::microbenchmark(
  coalesce(aa, bb, cc),
  coalesce2(aa, bb, cc))
# Unit: milliseconds
#                  expr      min       lq     mean   median       uq      max neval cld
#  coalesce(aa, bb, cc) 1.708511 1.837368 5.468123 3.268492 3.511241 96.99766   100   a
# coalesce2(aa, bb, cc) 1.474171 1.516506 3.312153 1.957104 3.253240 91.05223   100   a
I have a ready-to-use implementation called coalesce.na in my misc package. It seems to be competitive, but not the fastest.
It also works for vectors of different lengths, and has special treatment for vectors of length one:
                    expr        min          lq      median          uq         max neval
    coalesce(aa, bb, cc) 990.060402 1030.708466 1067.000698 1083.301986 1280.734389    10
   coalesce1(aa, bb, cc)  11.356584   11.448455   11.804239   12.507659   14.922052    10
  coalesce1a(aa, bb, cc)   2.739395    2.786594    2.852942    3.312728    5.529927    10
   coalesce2(aa, bb, cc)   2.929364    3.041345    3.593424    3.868032    7.838552    10
 coalesce.na(aa, bb, cc)   4.640552    4.691107    4.858385    4.973895    5.676463    10
Here's the code:
coalesce.na <- function(x, ...) {
  x.len <- length(x)
  ly <- list(...)
  for (y in ly) {
    y.len <- length(y)
    if (y.len == 1) {
      x[is.na(x)] <- y
    } else {
      if (x.len %% y.len != 0)
        warning('object length is not a multiple of first object length')
      pos <- which(is.na(x))
      x[pos] <- y[(pos - 1) %% y.len + 1]
    }
  }
  x
}
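A quick illustration of the length-one and recycling behaviour (a sketch; outputs worked out by hand):
coalesce.na(c(1, NA, NA, 4), 0)        # length-one y fills every NA: 1 0 0 4
coalesce.na(c(NA, 2, NA, 4), c(8, 9))  # y is recycled by position: 8 2 8 4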
Of course, as Kevin pointed out, an Rcpp solution might be faster by orders of magnitude.
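For instance, a minimal two-vector sketch (the name coalesce_cpp and its signature are illustrative, not from any package), which can be combined with Reduce() for more vectors:
library(Rcpp)
cppFunction("
NumericVector coalesce_cpp(NumericVector x, NumericVector y) {
  // copy x so the caller's vector is not modified in place
  NumericVector out = clone(x);
  for (int i = 0; i < out.size(); i++) {
    if (NumericVector::is_na(out[i])) out[i] = y[i];
  }
  return out;
}
")
Reduce(coalesce_cpp, list(a, b, c))
# [1] 1 2 NA 4 6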
A very simple solution is to use the ifelse function from the base package:
coalesce3 <- function(x, y) {
  ifelse(is.na(x), y, x)
}
Although it appears to be slower than coalesce2 above:
test <- function(a, b, func) {
  for (i in 1:10000) {
    func(a, b)
  }
}
system.time(test(a, b, coalesce2))
   user  system elapsed
   0.11    0.00    0.10
system.time(test(a, b, coalesce3))
   user  system elapsed
   0.16    0.00    0.15
You can use Reduce to make it work for an arbitrary number of vectors:
coalesce4 <- function(...) {
  Reduce(coalesce3, list(...))
}
Here is my solution:
coalesce <- function(x) {
  y <- head(x[!is.na(x)], 1)
  return(y)
}
It returns the first value that is not NA, and it works with data.table; for example, if you want to coalesce a few columns whose names are in a character vector:
column_names <- c("col1", "col2", "col3")
How to use it:
ranking[, coalesce_column := coalesce(mget(column_names)), by = 1:nrow(ranking)]
Another apply method, with mapply.
mapply(function(...) {temp <- c(...); temp[!is.na(temp)][1]}, a, b, c)
[1] 1 2 NA 4 6
This selects the first non-NA value if more than one exists. The last non-missing element could be selected using tail.
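For example (a sketch: reversing the combined values keeps the [1] trick working when every value is NA):
mapply(function(...) {temp <- rev(c(...)); temp[!is.na(temp)][1]}, a, b, c)
[1] 7 8 NA 9 10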
Maybe a bit more speed could be squeezed out of this alternative using the bare bones .mapply function, which looks a little different.
unlist(.mapply(function(...) {temp <- c(...); temp[!is.na(temp)][1]},
               dots = list(a, b, c), MoreArgs = NULL))
[1] 1 2 NA 4 6
.mapply differs in important ways from its non-dotted cousin:
it returns a list (like Map), and so must be wrapped in a function such as unlist or c to return a vector;
the set of arguments to be fed in parallel to the function in FUN must be given as a list to the dots argument;
and, unlike in mapply, the MoreArgs argument has no default, so it must explicitly be given NULL.
Another option is to use do.call and pmin. Note that pmin(..., na.rm = TRUE) returns the smallest non-NA value at each position, not the first, so it matches coalesce here only because the earlier vectors happen to contain the smaller values:
do.call(pmin, c(list(a, b, c), list(na.rm = TRUE)))
Output
[1] 1 2 NA 4 6
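A hypothetical counterexample (using coalesce2 from above) shows where the two differ:
a2 <- c(9, NA)
b2 <- c(1, 5)
do.call(pmin, c(list(a2, b2), list(na.rm = TRUE)))
# [1] 1 5  (smallest non-NA)
coalesce2(a2, b2)
# [1] 9 5  (first non-NA)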
