How to best combine unique and match in R? - r

I found myself often writing code such as
#' @param x input vector
#' @param ... passed to [slow_fun()]
fast_fun <- function(x, ...) {
u <- unique(x)
i <- match(x, u)
v <- slow_fun(u, ...)
v[i]
}
I use this pattern to accelerate a slow vectorized "pure" function, where each input entry could in principle be computed individually and the input is expected to contain many duplicates.
Now I wonder whether this is the best way to achieve such a speedup, or whether there is some function (preferably in base R or the tidyverse) that does something like unique and match at the same time.
Benchmarks so far
Thanks for the provided answers. I've written a small benchmark suite to compare the approaches:
method <- list(
brute = slow_fun,
unique_match = function(x, ...) {
u <- unique(x)
i <- match(x, u)
v <- slow_fun(u, ...)
v[i]
},
unique_factor = function(x, ...) {
if (is.character(x)) {
x <- factor(x)
i <- as.integer(x)
u <- levels(x)
} else {
u <- unique(x)
i <- as.integer(factor(x, levels = u))
}
v <- slow_fun(u, ...)
v[i]
},
unique_match_df = function(x, ...) {
u <- unique(x)
i <- if (is.numeric(x)) {
match(data.frame(t(round(x, 10))), data.frame(t(round(u, 10))))
} else {
match(data.frame(t(x)), data.frame(t(u)))
}
v <- slow_fun(u, ...)
v[i]
},
rcpp_uniquify = function(x, ...) {
iu <- uniquify(x)
v <- slow_fun(iu[["u"]], ...)
v[iu[["i"]]]
}
)
exprs <- lapply(method, function(fun) substitute(fun(x), list(fun = fun)))
settings$bench <- lapply(seq_len(nrow(settings)), function(i) {
cat("\rBenchmark ", i, " / ", nrow(settings), sep = "")
x <- switch(
settings$type[i],
integer = sample.int(
n = settings$n_distinct[i],
size = settings$n_total[i],
replace = TRUE
),
double = sample(
x = runif(n = settings$n_distinct[i]),
size = settings$n_total[i],
replace = TRUE
),
character = sample(
x = stringi::stri_rand_strings(
n = settings$n_distinct[i],
length = 20L
),
size = settings$n_total[i],
replace = TRUE
)
)
microbenchmark::microbenchmark(
list = exprs
)
})
library(tidyverse)
settings %>%
mutate(
bench = map(bench, summary)
) %>%
unnest(bench) %>%
group_by(n_distinct, n_total, type) %>%
mutate(score = median / min(median)) %>%
group_by(expr) %>%
summarise(mean_score = mean(score)) %>%
arrange(mean_score)
Currently, the Rcpp-based approach is the fastest in all tested settings on my machine, but it only narrowly beats the unique-then-match method.
I suspect the advantage grows the longer x becomes, because unique-then-match needs two passes over the data while uniquify() needs only one.
|expr | mean_score|
|:---------------|----------:|
|rcpp_uniquify | 1.018550|
|unique_match | 1.027154|
|unique_factor | 5.024102|
|unique_match_df | 36.613970|
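The benchmark code above assumes that slow_fun and settings already exist. The post does not show them, so the following is only a minimal sketch with hypothetical stand-ins. The column names are chosen to match the settings$type, settings$n_distinct and settings$n_total references in the benchmark; the grid values and the body of slow_fun are made up.

```r
# Hypothetical stand-in for an expensive, vectorised, pure function:
# each element is processed independently of the others.
slow_fun <- function(x, ...) {
  vapply(
    x,
    function(xi) {
      Sys.sleep(0.001)            # simulate expensive per-element work
      nchar(as.character(xi))
    },
    integer(1),
    USE.NAMES = FALSE
  )
}

# Hypothetical benchmark grid. stringsAsFactors = FALSE matters because
# switch(settings$type[i], ...) expects a character string, not a factor.
settings <- expand.grid(
  type       = c("integer", "double", "character"),
  n_distinct = c(10L, 100L),
  n_total    = c(1000L, 10000L),
  stringsAsFactors = FALSE
)

# Sanity check: unique-then-match reproduces the brute-force result
x <- sample(c("a", "bb", "ccc"), 50, replace = TRUE)
u <- unique(x)
stopifnot(identical(slow_fun(u)[match(x, u)], slow_fun(x)))
```

With these definitions in place, the benchmark loop should run as written, although the absolute timings will of course depend on the stand-in chosen for slow_fun.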
|brute | 45.106015|

Maybe you can try factor + as.integer like below
as.integer(factor(x))
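One thing to keep in mind with this approach, sketched below on a made-up vector: as.integer(factor(x)) indexes into the sorted levels(factor(x)), whereas match(x, unique(x)) indexes into the values in order of first appearance. The two index vectors therefore generally differ, even though both reconstruct x when paired with the right lookup vector.

```r
x <- c("b", "a", "b", "c")   # made-up example data

f <- factor(x)
i_factor <- as.integer(f)        # positions in levels(f), which are sorted
i_match  <- match(x, unique(x))  # positions in unique(x), first-appearance order

i_factor  # 2 1 2 3
i_match   # 1 2 1 3

# Both index vectors reconstruct the input when paired with the right lookup
stopifnot(identical(levels(f)[i_factor], x))
stopifnot(identical(unique(x)[i_match], x))
```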

I found a cool, and fast, answer recently,
match(data.frame(t(x)), data.frame(t(y)))
As always, beware when working with floats. I recommend something like
match(data.frame(t(round(x, 10))), data.frame(t(round(y, 10))))
in such cases.

I've finally managed to beat unique() and match() using Rcpp to hand-code the algorithm in C++ using a std::unordered_map as core bookkeeping data structure.
Here is the source code, which can be used in R by writing it into a file and running Rcpp::sourceCpp on it.
#include <Rcpp.h>
#include <unordered_map>
using namespace Rcpp;
template <int T>
List uniquify_impl(Vector<T> x) {
IntegerVector idxes(x.length());
typedef typename Rcpp::traits::storage_type<T>::type storage_t;
std::unordered_map<storage_t, int> unique_map;
int n_unique = 0;
// 1. Pass through x once
for (int i = 0; i < x.length(); i++) {
storage_t curr = x[i];
int idx = unique_map[curr];
if (idx == 0) {
unique_map[curr] = ++n_unique;
idx = n_unique;
}
idxes[i] = idx;
}
// 2. Write the unique values out in order of first appearance (index stored as map value)
Vector<T> uniques(unique_map.size());
for (auto &pair : unique_map) {
uniques[pair.second - 1] = pair.first;
}
return List::create(
_["u"] = uniques,
_["i"] = idxes
);
}
// [[Rcpp::export]]
List uniquify(RObject x) {
switch (TYPEOF(x)) {
case INTSXP: {
return uniquify_impl(as<IntegerVector>(x));
}
case REALSXP: {
return uniquify_impl(as<NumericVector>(x));
}
case STRSXP: {
return uniquify_impl(as<CharacterVector>(x));
}
default: {
warning(
"Invalid SEXPTYPE %d (%s).\n",
TYPEOF(x), type2name(x)
);
return R_NilValue;
}
}
}

Related

Rewriting R's cummin() function using Rcpp and allowing for NAs

I'm learning Rcpp. In this example, I'm attempting to roll my own cummin() function like base R's cummin(), except I'd like my version to have a na.rm argument. This is my attempt
cummin.cpp
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector cummin_cpp(NumericVector x, bool narm = false){
// Given a numeric vector x, returns a vector of the
// same length representing the cumulative minimum value
// if narm = true, NAs will be ignored (The result may
// contain NAs if the first values of x are NA.)
// if narm = false, the resulting vector will return the
// cumulative min until the 1st NA value is encountered
// at which point all subsequent entries will be NA
if(narm){
// Ignore NAs
for(int i = 1; i < x.size(); i++){
if(NumericVector::is_na(x[i]) | (x[i-1] < x[i])) x[i] = x[i-1];
}
} else{
// Don't ignore NAs
for(int i = 1; i < x.size(); i++){
if(NumericVector::is_na(x[i-1]) | NumericVector::is_na(x[i])){
x[i] = NA_REAL;
} else if(x[i-1] < x[i]){
x[i] = x[i-1];
}
}
}
return x;
}
foo.R
library(Rcpp)
sourceCpp("cummin.cpp")
x <- c(3L, 1L, 2L)
cummin(x) # 3 1 1
cummin_cpp(x) # 3 1 1
class(cummin(x)) # integer
class(cummin_cpp(x)) # numeric
I have a few questions:
1. R's standard variable name is na.rm, not narm as I've done. However, it seems I can't use a dot in the C++ variable name. Is there a way around this so I can be consistent with R's convention?
2. I don't know ahead of time if the user's input is going to be a numeric vector or an integer vector, so I've used Rcpp's NumericVector type. Unfortunately, if the input is integer, the output is cast to numeric, unlike base R's cummin() behavior. How do people usually deal with this issue?
3. The line if(NumericVector::is_na(x[i]) | (x[i-1] < x[i])) x[i] = x[i-1]; seems silly, but I don't know a better way to do this. Suggestions here?
I would use this:
template<typename T, int RTYPE>
Vector<RTYPE> cummin_cpp2(Vector<RTYPE> x, bool narm){
Vector<RTYPE> res = clone(x);
int i = 1, n = res.size();
T na;
if(narm){
// Ignore NAs
for(; i < n; i++){
if(ISNAN(res[i]) || (res[i-1] < res[i])) res[i] = res[i-1];
}
} else{
// Do not ignore NAs
for(; i < n; i++){
if(ISNAN(res[i-1])) {
na = res[i-1];
break;
} else if(res[i-1] < res[i]){
res[i] = res[i-1];
}
}
for(; i < n; i++){
res[i] = na;
}
}
return res;
}
// [[Rcpp::export]]
SEXP cummin_cpp2(SEXP x, bool narm = false) {
switch (TYPEOF(x)) {
case INTSXP: return cummin_cpp2<int, INTSXP>(x, narm);
case REALSXP: return cummin_cpp2<double, REALSXP>(x, narm);
default: Rcpp::stop("SEXP Type Not Supported.");
}
}
Try this on:
x <- c(NA, 7, 5, 4, NA, 2, 4)
x2 <- as.integer(x)
cummin_cpp(x, narm = TRUE)
x
cummin_cpp(x2)
x2
x <- c(NA, 7, 5, 4, NA, 2, 4)
x2 <- as.integer(x)
x3 <- replace(x, is.na(x), NaN)
cummin_cpp2(x, narm = TRUE)
x
cummin_cpp2(x2)
x2
cummin_cpp2(x3)
x3
Explanation:
Joran's advice is good, just wrap that in an R function
I use a dispatcher as Joseph Wood suggested
Beware that x is passed by reference and is modified in place if its type matches the type you declared (see these 2 slides)
You need to handle NA as well as NaN
You can use || instead of | to evaluate only the first condition if it is true.
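The first point (wrapping the compiled function in an R function so the argument can be called na.rm) can be illustrated without any C++ at all. The following is a pure-R sketch, not the answer's implementation: cummin_narm is a hypothetical name, and it relies on base cummin() after replacing missing values with a sentinel. It may also serve as a reference when testing a compiled version.

```r
# Hypothetical pure-R reference: cumulative minimum with an na.rm argument.
# Preserves the input type (integer stays integer, double stays double).
cummin_narm <- function(x, na.rm = FALSE) {
  if (!na.rm) return(cummin(x))
  miss <- is.na(x)                       # TRUE for both NA and NaN
  # Sentinel that can never lower the running minimum
  big <- if (is.integer(x)) .Machine$integer.max else Inf
  out <- cummin(replace(x, miss, big))
  # Before the first observed value there is nothing to report
  out[cumsum(!miss) == 0L] <- NA
  out
}

x <- c(NA, 7, 5, 4, NA, 2, 4)
cummin_narm(x, na.rm = TRUE)               # NA 7 5 4 4 2 2
class(cummin_narm(c(3L, NA, 1L, 2L), na.rm = TRUE))  # "integer"
```

Because the function never modifies its argument in place, it also sidesteps the pass-by-reference issue mentioned above.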

Output to pdf not working with ReferenceClasses methods in R?

This is an example taken from the ReferenceClasses R doc, with some minor
modification:
mEdit = setRefClass("mEdit", fields = list(data="matrix", edits="list"))
mEdit$methods(
edit = function(i, j, value) {
backup = list(i, j, data[i, j])
data[i, j] <<- value
edits <<- c(edits, list(backup))
invisible(value)
}
)
mEdit$methods(
undo = function() {
prev = edits
if(length(prev)) {
prev = prev[[length(prev)]]
}
else {
stop("No more edits to undo!")
}
edit(prev[[1]], prev[[2]], prev[[3]])
length(edits) <<- length(edits) - 2
invisible(prev)
}
)
mEdit$methods(
show = function() {
message("ClassName: ", classLabel(class(.self)))
message("Data:")
methods::show(data)
message("Undo list length: ", length(edits))
}
)
mEdit$methods(
.DollarNames.mEdit = function(x, pattern) {
grep(pattern, getRefClass(class(x))$methods(), value=TRUE)
}
)
x = matrix(1:24, 3, 8)
xx = mEdit(data=x)
xx$edit(2,2,0)
xx$show()
xx$edit(3, 5, 1)
xx$show()
xx$undo()
xx$show()
mv = setRefClass(
"matrixViewer",
fields=c("viewerDevice", "viewerFile"),
contains="mEdit"
)
mv$methods(
.DollarNames.mEdit = function(x, pattern) {
grep(pattern, getRefClass(class(x))$methods(), value=TRUE)
}
)
mv$methods(
view = function() {
## dd = dev.cur();
## dev.set(viewerDevice)
## devAskNewPage(FALSE)
image(
data,
main=paste("After", length(edits), "edits")
)
## dev.set(dd)
}
)
mv$methods(
edit = function(i,j, value) {
callSuper(i,j, value)
view()
}
)
mv$methods(
initialize = function(file="./mv.pdf", ...) {
viewerFile <<- file
## pdf(viewerFile)
## viewerDevice <<- dev.cur()
## dev.set(dev.prev())
callSuper(...)
}
)
mv$methods(
finalize = function() {
dev.off(viewerDevice)
}
)
x = matrix(rnorm(64, 0, 34), 8, 8)
xx = mv(file="/tmp/x.pdf", data=x)
xx$edit(2,2,0)
xx$edit(3, 5, 1)
xx$edit(4, 4, 2.3)
xx$undo()
xx$view()
Note that I have commented out those lines concerning switch
of output devices, so it uses the default device all through,
otherwise when the view method
is called, the plot is not written to the pdf file at all.
Any idea why this is happening?
Call rm on xx and then call garbage collection. finalize will then be called which will invoke dev.off and the pdf will be written. This assumes everything is uncommented.
rm(xx)
gc()
Also your .DollarNames should be
.DollarNames.mEdit = function(x, pattern) {
grep(pattern, getRefClass(class(x))$methods(), value=TRUE)
}
.DollarNames.matrixViewer = function(x, pattern) {
grep(pattern, getRefClass(class(x))$methods(), value=TRUE)
}
These are not methods of the reference class; they are ordinary functions defined separately from the reference classes.
So the main takeaway here is that finalize is not called until the object is garbage collected.
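If relying on garbage collection feels fragile, one alternative is to close the device explicitly. The following is a minimal sketch under that assumption, not a rewrite of the class from the question: PdfViewer, open_device, view and close_device are hypothetical names, and the output path is a temporary file.

```r
library(methods)

# Hypothetical viewer that owns a pdf device and closes it on request,
# instead of waiting for finalize() to run during garbage collection.
PdfViewer <- setRefClass(
  "PdfViewer",
  fields = list(path = "character", device = "numeric"),
  methods = list(
    open_device = function() {
      pdf(path)
      device <<- as.numeric(dev.cur())   # remember which device is ours
      invisible(.self)
    },
    view = function(m) {
      dev.set(device)                    # draw on our device, not the default
      image(m)
      invisible(.self)
    },
    close_device = function() {
      dev.off(device)                    # flushes and finalises the PDF file
      invisible(.self)
    }
  )
)

v <- PdfViewer(path = tempfile(fileext = ".pdf"))
v$open_device()
v$view(matrix(1:24, 3, 8))
v$close_device()
file.size(v$path) > 0   # the file has content once the device is closed
```

The trade-off is that the caller has to remember to call close_device(), whereas finalize runs automatically but at an unpredictable time.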

How to make my program faster for pattern matching?

I don't like the way I have coded this and I think there should be far faster algorithms than this or R might have facilities to work this out.
four.in.a.row = function(player, v, debug=TRUE) {
if (debug) {
cat(sep="", "four.in.a.row(player=", player, ", v=", v, ")\n")
}
for (i in 1:(length(v)-4))
if (v[i]==player)
if (v[i]==v[i+1])
if (v[i+1]==v[i+2])
if (v[i+2]==v[i+3])
return(TRUE)
# ...
return(FALSE)
}
# four.in.a.row("X", c("O","X","X","X","X","O"))
# is TRUE, while
# four.in.a.row("O", c("O","X","X","X","X","O"))
# is FALSE.
Please let me know how the complexity can be improved. Also can you please tell me what is the complexity of the current algorithm?
Also when I use the following I receive an error:
four.in.a.row = function(player, v, debug=TRUE) {
if (debug) {
cat(sep="", "four.in.a.row(player=", player, ", v=", v, ")\n")
}
for (i in 1:(length(v)-4))
{
count=1
if (v[i]==player)
for (j in i+1:i+3)
if (v[i]==v[j])
count=count+1
if (count==4)
return(TRUE)
}
# ...
return(FALSE) # correct this return() statement
}
Error in if (v[i] == v[j]) count = count + 1 :
missing value where TRUE/FALSE needed
You can use rle:
four.in.a.row <- function(player, v) {
with(rle(v), any(lengths >= 4 & values == player))
}
Here is a more cryptic approach that should be even faster:
four.in.a.row <- function(player, v) {
x <- player == v
in.a.row <- seq_along(x) - cummax(seq_along(x) * !x)
any(in.a.row >= 4)
}
Finally, you could use regular expressions:
four.in.a.row = function(player, v) {
x <- paste(v, collapse = "")
pattern <- paste(rep(player, 4), collapse = "")
grepl(pattern, x)
}
You would have to test on your data to see which approach is faster. If fast enough, I would go with the rle approach for its conciseness.
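As for the error in the second version of the question's code: it most likely comes from operator precedence. In R, `:` binds more tightly than `+`, so `i+1:i+3` is evaluated as `i + (1:i) + 3` and quickly indexes past the end of v, which yields NA in the comparison. The sketch below shows the difference and a corrected loop (four_in_a_row_loop is a hypothetical name). It also uses length(v) - 3 as the upper bound, since the last window of four starts there. The loop, like the rle version, is linear in length(v).

```r
i <- 2
i+1:i+3        # parsed as i + (1:i) + 3, giving 6 7
(i+1):(i+3)    # the intended range, giving 3 4 5

# Hypothetical corrected loop version of the question's function
four_in_a_row_loop <- function(player, v) {
  n <- length(v)
  if (n < 4) return(FALSE)
  for (i in 1:(n - 3)) {
    # check the window of four cells starting at position i
    if (all(v[i:(i + 3)] == player)) return(TRUE)
  }
  FALSE
}

four_in_a_row_loop("X", c("O", "X", "X", "X", "X", "O"))  # TRUE
four_in_a_row_loop("O", c("O", "X", "X", "X", "X", "O"))  # FALSE
```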

Right (or left) side trimmed mean

Using:
mean (x, trim=0.05)
Removes 5% of the observations from each end of the distribution (trim is the fraction trimmed from each end), which is fine for symmetrical two-tailed data. But if I have one-tailed or highly asymmetric data, I would like to be able to remove just one side of the distribution. Is there a function for this, or do I have to write a new one myself? If so, how?
Just create a modified mean.default. First look at mean.default:
mean.default
Then modify it to accept a new argument:
mean.default <-
function (x, trim = 0, na.rm = FALSE, ..., side="both")
{
if (!is.numeric(x) && !is.complex(x) && !is.logical(x)) {
warning("argument is not numeric or logical: returning NA")
return(NA_real_)
}
if (na.rm)
x <- x[!is.na(x)]
if (!is.numeric(trim) || length(trim) != 1L)
stop("'trim' must be numeric of length one")
n <- length(x)
if (trim > 0 && n) {
if (is.complex(x))
stop("trimmed means are not defined for complex data")
if (any(is.na(x)))
return(NA_real_)
if (trim >= 0.5)
return(stats::median(x, na.rm = FALSE))
lo <- if( side=="both" || side=="right" ){ floor(n * trim) + 1 }else{1}
hi <- if( side=="both" || side=="left" ){ n + 1 - (floor(n * trim) + 1 ) }else{ n}
x <- sort.int(x, partial = unique(c(lo, hi)))[lo:hi]
cat(c(length(x), lo , hi) )
}
.Internal(mean(x))
}
I don't know of a function. Something like the following would trim off the upper tail of the distribution before taking the mean.
upper.trim.mean <- function(x,trim) {
x <- sort(x)
mean(x[1:floor(length(x)*(1-trim))])
}
This should account for either side, or both sides for trimming.
trim.side.mean <- function(x, trim, type="both"){
if (type == "both") {
mean(x,trim)}
else if (type == "right") {
x <- sort(x)
mean(x[1:floor(length(x)*(1-trim))])}
else if (type == "left"){
x <- sort(x)
mean(x[max(1,floor(length(x)*trim)):length(x)])}}
one.sided.trim.mean <- function(x, trim, upper=TRUE) {
# upper = TRUE drops the upper tail; upper = FALSE drops the lower tail
if (upper) mean(x[x < quantile(x, 1 - trim)])
else mean(x[x > quantile(x, trim)])
}
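I found that none of the posted answers matched my results when checked manually, so I wrote my own. It's long but simple enough to understand. Note that trim here is the number of observations to drop from each trimmed side, not a fraction.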
get_trim <- function(x,trim,type)
{
x <- sort(x)
ans<-0
if (type=="both")
{
for (i in (trim+1):(length(x)-trim))
{
ans=ans+x[i];
}
print(ans/(length(x)-(2*trim)))
}
else if(type=="left")
{
for (i in (trim+1):(length(x)))
{
ans=ans+x[i];
}
print(ans/(length(x)-trim))
}
else if (type=="right")
{
for (i in 1:(length(x)-trim))
{
ans=ans+x[i];
}
print(ans/(length(x)-trim))
}
}
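The same idea can be written without explicit loops. The following is a sketch rather than a drop-in replacement for any of the answers above: trim_side_mean is a hypothetical name, trim is a fraction as in mean(), and side names the tail that gets trimmed. For side = "both" it uses the same index arithmetic as mean.default, so it should agree with mean(x, trim).

```r
# Hypothetical one-sided trimmed mean; `side` is the tail that is removed.
trim_side_mean <- function(x, trim = 0, side = c("both", "left", "right")) {
  side <- match.arg(side)
  stopifnot(trim >= 0, trim < 0.5)
  x <- sort(x)                 # note: sort() silently drops NA values
  n <- length(x)
  k <- floor(n * trim)         # number of observations dropped per trimmed tail
  lo <- if (side %in% c("both", "left")) k + 1 else 1
  hi <- if (side %in% c("both", "right")) n - k else n
  mean(x[lo:hi])
}

x <- c(1:9, 100)
trim_side_mean(x, 0.1, "right")  # drops 100, giving 5
trim_side_mean(x, 0.1, "left")   # drops 1, giving 16
trim_side_mean(x, 0.1, "both")   # same as mean(x, trim = 0.1), giving 5.5
```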

Efficiently replace a fixed position substring with a string of equal or larger length

What would be an efficient way of replacing a fixed position substring with another string of equal or larger length?
For example, the following replaces the substring "abc" by finding the position of "abc" first and then replacing it:
sub("abc", "123", "iabc.def", fixed = TRUE)
#[1] "i123.def"
sub("abc", "1234", "iabc.def", fixed = TRUE)
#[1] "i1234.def"
However, we know that the substring "abc" is ALWAYS in character positions 2, 3 and 4. In this case, is there a way of specifying those positions so that the string matching doesn't need to be performed and the character indices used instead?
I did try to use substr() but it didn't work as I had hoped when the replacement string is larger than the substring being replaced:
x <- "iabc.def"
substr(x, 2, 4) <- "123"
#[1] "i123.def"
x <- "iabc.def"
substr(x, 2, 4) <- "1234"
#[1] "i123.def"
Many thanks in advance for your time,
Tony Breyal
P.S. The above may be the most efficient way of doing what I want but I thought I would ask just in case there is a better way :)
===== TIMINGS =====
# test elapsed relative
# 7 francois.fx_wb(x, replacement) 0.94 1.000000
# 1 f(x) 1.56 1.659574
# 6 francois.fx(x, replacement) 2.23 2.372340
# 5 Sobala(x) 3.89 4.138298
# 2 Hong.Ooi(x) 4.41 4.691489
# 3 DWin(x) 5.57 5.925532
# 4 hadley(x) 9.47 10.074468
The above timings were generated from the code below:
library(rbenchmark)
library(stringr)
library(Rcpp)
library(inline)
f <- function(x, replacement = "1234") sub("abc", replacement, x, fixed = TRUE)
Hong.Ooi <- function(x, replacement = "1234") paste(substr(x, 1, 1), replacement, substr(x, 5, nchar(x)), sep = "")
DWin <- function(x, replacement = paste("\\1", "1234", sep = "")) sub("^(.)abc", replacement, x)
Sobala <- function(x, replacement = paste("\\1", "1234", sep = "")) sub("^(.).{3}", replacement, x, perl=TRUE)
hadley <- function(x, replacement = "1234") {
str_sub(x, 2, 4) <- replacement
return(x)
}
francois.fx <- cxxfunction( signature( x_ = "character", rep_ = "character" ), '
const char* rep =as<const char*>(rep_) ;
CharacterVector x(x_) ;
int nrep = strlen( rep ) ;
int n = x.size() ;
CharacterVector res(n) ;
char buffer[1024] ;
for(int i=0; i<n; i++) {
const char* xi = x[i] ;
if( strncmp( xi+1, "abc", 3 ) ) {
res[i] = x[i] ;
} else{
buffer[0] = xi[0] ;
strncpy( buffer + 1, rep, nrep ) ;
strcpy( buffer + 1 + nrep, xi + 4 ) ;
res[i] = buffer ;
}
}
return res ;
', plugin = "Rcpp" )
francois.fx_wb <- cxxfunction( signature( x_ = "character", rep_ = "character" ), '
const char* rep =as<const char*>(rep_) ;
int nrep = strlen( rep ) ;
int n=Rf_length(x_) ;
SEXP res = PROTECT( Rf_allocVector( STRSXP, n ) ) ;
char buffer[1024] ;
for(int i=0; i<n; i++) {
const char* xi = char_get_string_elt(x_, i) ;
if( strncmp( xi+1, "abc", 3 ) ) {
set_string_elt( res, i, get_string_elt(x_,i)) ;
} else{
buffer[0] = xi[0] ;
strncpy( buffer + 1, rep, nrep ) ;
strcpy( buffer + 1 + nrep, xi + 4 ) ;
char_set_string_elt(res, i, buffer ) ;
}
}
UNPROTECT(1) ;
return res ;
', plugin = "Rcpp" )
x <- rep("iabc.def", 1e6)
replacement <- "1234"
benchmark(f(x), Hong.Ooi(x), DWin(x), hadley(x), Sobala(x), francois.fx(x, replacement), francois.fx_wb(x, replacement),
columns = c("test", "elapsed", "relative"),
order = "relative",
replications = 10)
Here is one solution based on Rcpp.
fx <- cxxfunction( signature( x_ = "character", rep_ = "character" ), '
const char* rep =as<const char*>(rep_) ;
CharacterVector x(x_) ;
int nrep = strlen( rep ) ;
int n = x.size() ;
CharacterVector res(n) ;
char buffer[1024] ;
for(int i=0; i<n; i++) {
const char* xi = x[i] ;
if( strncmp( xi+1, "abc", 3 ) ) {
res[i] = x[i] ;
} else{
buffer[0] = xi[0] ;
strncpy( buffer + 1, rep, nrep ) ;
strcpy( buffer + 1 + nrep, xi + 4 ) ;
res[i] = buffer ;
}
}
return res ;
', plugin = "Rcpp" )
It does not improve much on the simple sub solution because write access to strings in R is protected by the write barrier. I get better results if I cheat on the write barrier, but I'm not fully aware of the consequences, so I should probably advise against it :/
fx_wb <- cxxfunction( signature( x_ = "character", rep_ = "character" ), '
const char* rep =as<const char*>(rep_) ;
int nrep = strlen( rep ) ;
int n=Rf_length(x_) ;
SEXP res = PROTECT( Rf_allocVector( STRSXP, n ) ) ;
char buffer[1024] ;
for(int i=0; i<n; i++) {
const char* xi = char_get_string_elt(x_, i) ;
if( strncmp( xi+1, "abc", 3 ) ) {
set_string_elt( res, i, get_string_elt(x_,i)) ;
} else{
buffer[0] = xi[0] ;
strncpy( buffer + 1, rep, nrep ) ;
strcpy( buffer + 1 + nrep, xi + 4 ) ;
char_set_string_elt(res, i, buffer ) ;
}
}
UNPROTECT(1) ;
return res ;
', plugin = "Rcpp" )
Write barrier
The R Internals manual describes the write barrier:
A generational collector needs to efficiently ‘age’ the objects,
especially list-like objects (including STRSXPs). This is done by
ensuring that the elements of a list are regarded as at least as old
as the list when they are assigned. This is handled by the functions
SET_VECTOR_ELT and SET_STRING_ELT, which is why they are functions and
not macros. Ensuring the integrity of such operations is termed the
write barrier and is done by making the SEXP opaque and only providing
access via functions (which cannot be used as lvalues in assignments
in C).
All code in R extensions is by default behind the write barrier.
And Luke Tierney's document describes the logic behind why:
The generational collector divides allocated nodes into generations
based on some notion of age. Younger generations are collected more
frequently than older ones. For this to work correctly, any younger
nodes that are reachable only from older nodes must be handled
properly. This is accomplished by a write barrier that monitors each
assignment and takes appropriate action when a reference to a new node
is placed in an older one.
You can still use regex with a placeholder like this:
> sub("^(.)abc", "\\1xyz", c("aabcdef", "xxxxxxx"))
[1] "axyzdef" "xxxxxxx"
Most straightforward way I can think of:
x <- paste(substr(x, 1, 1), "1234", substr(x, 5, nchar(x)), sep="")
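The same approach can be wrapped in a small helper so the positions are not hard-coded. This is only a sketch: replace_at is a hypothetical name, and it relies on substring() defaulting to the end of the string, which avoids the nchar() call.

```r
# Hypothetical helper: replace characters start..stop of each string in x
# with `replacement`, which may be shorter or longer than the span it replaces.
replace_at <- function(x, start, stop, replacement) {
  paste0(
    substr(x, 1L, start - 1L),   # everything before the span
    replacement,
    substring(x, stop + 1L)      # everything after the span
  )
}

replace_at("iabc.def", 2, 4, "123")    # "i123.def"
replace_at("iabc.def", 2, 4, "1234")   # "i1234.def"
replace_at(c("iabc.def", "jabc.xyz"), 2, 4, "1234")  # vectorised over x
```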
Some improvement of DWin function.
function(x, replacement = paste("\\1", "1234", sep = ""))
sub("^(.).{3}", replacement, x,perl=TRUE)
