Efficient way to get location of match between vectors - r

I need an efficient way to find the indexes (not the logical vector) of matches between two vectors. I can do this with:
which(c("a", "q", "f", "c", "z") %in% letters[1:10])
In the same way, which.max is a better way to find the position of the maximum value than using which with %in%:
which(c(1:8, 10, 9) %in% max(c(1:8, 10, 9)))
which.max(c(1:8, 10, 9))
I am wondering whether I have the most efficient way of finding the positions of matching terms in the two vectors.
EDIT:
Per the questions/comments below: I am operating on a list of vectors. The problem involves sentences that have been broken into bags of words, as seen below. The list may contain 10,000-20,000 or more character vectors. Based on each matched index I will then grab the 4 words before and the 4 words after it and calculate a score (a sketch of that windowing step follows the example below).
x <- list(c('I', 'like', 'chocolate', 'cake'), c('chocolate', 'cake', 'is', 'good'))
y <- rep(x, 5000)
lapply(y, function(x) {
which(x %in% c("chocolate", "good"))
})
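For context, a minimal sketch of that windowing step (window_words is a hypothetical helper name; the scoring itself is omitted):
window_words <- function(words, targets = c("chocolate", "good"), k = 4) {
  idx <- which(words %in% targets)
  # for each match, keep up to k words on either side (clipped at the ends)
  lapply(idx, function(i) words[max(1, i - k):min(length(words), i + k)])
}
window_words(x[[1]])
# [[1]]
# [1] "I"         "like"      "chocolate" "cake"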

Here's a relatively faster way using data.table:
require(data.table)
vv <- vapply(y, length, 0L)
DT <- data.table(y = unlist(y), id = rep(seq_along(y), vv), pos = sequence(vv))
setkey(DT, y)
# OLD CODE which will not take care of no-match entries (commented)
# DT[J(c("chocolate", "good")), list(list(pos)), by=id]$V1
setkey(DT[J(c("chocolate", "good"))], id)[J(seq_along(vv)), list(list(pos))]$V1
The idea:
First we unlist your list into a column of DT named y. In addition, we create two other columns, id and pos: id gives the index in the original list and pos gives the position within that id. Then, by setting a key on the y column, we can do fast subsetting. With this subsetting we get the corresponding pos values for each id. Before we collect all pos for each id in a list and output the list column (V1), we take care of the entries where there was no match for our query: after the first subsetting we set the key to id and subset on all possible values of id, as this results in NA for the non-matching entries.
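For reference, with the y defined above, the first few rows of DT (before the key is set) look like this:
head(DT)
#            y id pos
# 1:         I  1   1
# 2:      like  1   2
# 3: chocolate  1   3
# 4:      cake  1   4
# 5: chocolate  2   1
# 6:      cake  2   2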
Benchmarking with the lapply code on your post:
x <- list(c('I', 'like', 'chocolate', 'cake'), c('chocolate', 'cake', 'is', 'good'))
y <- rep(x, 5000)
require(data.table)
arun <- function() {
vv <- vapply(y, length, 0L)
DT <- data.table(y = unlist(y), id = rep(seq_along(y), vv), pos = sequence(vv))
setkey(DT, y)
setkey(DT[J(c("chocolate", "good"))], id)[J(seq_along(vv)), list(list(pos))]$V1
}
tyler <- function() {
lapply(y, function(x) {
which(x %in% c("chocolate", "good"))
})
}
require(microbenchmark)
microbenchmark(a1 <- arun(), a2 <- tyler(), times=50)
Unit: milliseconds
expr min lq median uq max neval
a1 <- arun() 30.71514 31.92836 33.19569 39.31539 88.56282 50
a2 <- tyler() 626.67841 669.71151 726.78236 785.86444 955.55803 50
> identical(a1, a2)
# [1] TRUE

The C++ answer was faster when comparing single characters, but I think using a vector of strings introduces enough overhead that it is now slower:
char1 <- c("a", "q", "f", "c", "z")
char2 <- letters[1:10]
library(inline)
cpp_whichin_src <- '
Rcpp::CharacterVector xa(a);
Rcpp::CharacterVector xb(b);
int n_xa = xa.size();
int n_xb = xb.size();
NumericVector res(n_xa);
std::vector<std::string> sa = Rcpp::as< std::vector<std::string> >(xa);
std::vector<std::string> sb = Rcpp::as< std::vector<std::string> >(xb);
for(int i=0; i < n_xa; i++) {
for(int j=0; j<n_xb; j++) {
if( sa[i] == sb[j] ) res[i] = i+1;
}
}
return res;
'
cpp_whichin <- cxxfunction(signature(a="character",b="character"), cpp_whichin_src, plugin="Rcpp")
which.in_cpp <- function(char1, char2) {
idx <- cpp_whichin(char1,char2)
idx[idx!=0]
}
which.in_naive <- function(char1, char2) {
which(char1 %in% char2)
}
which.in_CW <- function(char1, char2) {
unlist(sapply(char2,function(x) which(x==char1)))
}
which.in_cpp(char1,char2)
which.in_naive(char1,char2)
which.in_CW(char1,char2)
Benchmarks:
library(microbenchmark)
microbenchmark(
which.in_cpp(char1,char2),
which.in_naive(char1,char2),
which.in_CW(char1,char2)
)
set.seed(1)
cmb <- apply(combn(letters,2), 2, paste,collapse="")
char1 <- sample( cmb, 100 )
char2 <- sample( cmb, 100 )
Unit: microseconds
expr min lq median uq max
1 which.in_cpp(char1, char2) 114.890 120.023 126.6930 135.5630 537.011
2 which.in_CW(char1, char2) 697.505 725.826 766.4385 813.8615 8032.168
3 which.in_naive(char1, char2) 17.391 20.289 22.4545 25.4230 76.826
# Same as above, but with 3 letter combos and 1000 sampled
Unit: microseconds
expr min lq median uq max
1 which.in_cpp(char1, char2) 8505.830 8715.598 8863.3130 8997.478 9796.288
2 which.in_CW(char1, char2) 23430.493 27987.393 28871.2340 30032.450 31926.546
3 which.in_naive(char1, char2) 129.904 135.736 158.1905 180.260 3821.785

Related

Random rearrangement of a vector in R [duplicate]

I want to permute a vector so that an element can't be in the same place after permutation, as it was in the original. Let's say I have a list of elements like this: AABBCCADEF
A valid shuffle would be: BBAADEFCCA
But these would be invalid: BAACFEDCAB or BCABFEDCAB
The closest answer I could find was this: python shuffle such that position will never repeat. But that's not quite what I want, because there are no repeated elements in that example.
I want a fast algorithm that generalizes that answer in the case of repetitions.
MWE:
library(microbenchmark)
set.seed(1)
x <- sample(letters, size=295, replace=T)
terrible_implementation <- function(x) {
xnew <- sample(x)
while(any(x == xnew)) {
xnew <- sample(x)
}
return(xnew)
}
microbenchmark(terrible_implementation(x), times=10)
Unit: milliseconds
expr min lq mean median uq max neval
terrible_implementation(x) 479.5338 2346.002 4738.49 2993.29 4858.254 17005.05 10
Also, how do I determine if a sequence can be permuted in such a way?
EDIT: To make it perfectly clear what I want, the new vector should satisfy the following conditions:
1) all(table(newx) == table(x))
2) all(x != newx)
E.g.:
newx <- terrible_implementation(x)
all(table(newx) == table(x))
[1] TRUE
all(x != newx)
[1] TRUE
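As for whether a sequence can be permuted this way at all: it is impossible whenever the most frequent element fills more than half the positions (pigeonhole), and the foo() answer below performs exactly this check, so a simple test is (can_derange is just an illustrative name):
can_derange <- function(x) max(table(x)) <= length(x) / 2
can_derange(c("A", "A", "B", "C"))  # TRUE
can_derange(c("A", "A", "A", "B"))  # FALSE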
#DATA
set.seed(1)
x <- sample(letters, size=295, replace=T)
foo = function(S){
if(max(table(S)) > length(S)/2){
stop("NOT POSSIBLE")
}
U = unique(S)
done_chrs = character(0)
inds = integer(0)
ans = character(0)
while(!identical(sort(done_chrs), sort(U))){
my_chrs = U[!U %in% done_chrs]
next_chr = my_chrs[which.min(sapply(my_chrs, function(x) length(setdiff(which(!S %in% x), inds))))]
x_inds = which(S %in% next_chr)
candidates = setdiff(seq_along(S), union(x_inds, inds))
if (length(candidates) == 1){
new_inds = candidates
}else{
new_inds = sample(candidates, length(x_inds))
}
inds = c(inds, new_inds)
ans[new_inds] = next_chr
done_chrs = c(done_chrs, next_chr)
}
return(ans)
}
ans_foo = foo(x)
identical(sort(ans_foo), sort(x)) & !any(ans_foo == x)
#[1] TRUE
library(microbenchmark)
microbenchmark(foo(x))
#Unit: milliseconds
# expr min lq mean median uq max neval
# foo(x) 19.49833 22.32517 25.65675 24.85059 27.96838 48.61194 100
I think this satisfies all your conditions. The idea is to order by frequency, start with the most common element, and shift the values along the frequency table by the number of times the most common element appears. This guarantees that no element ends up in its original position.
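For illustration, here is a minimal, deterministic base-R sketch of that rotation idea (assuming max(table(x)) <= length(x)/2 holds; the data.table implementation below is the actual answer):
rotate_derange <- function(x) {
  ord <- order(x)                        # positions grouped by value
  grouped <- x[ord]
  k <- max(table(x))                     # size of the largest group
  rotated <- c(tail(grouped, k), head(grouped, length(x) - k))
  out <- x
  out[ord] <- rotated                    # write the rotated values back to the original positions
  out
}
set.seed(1)
x <- sample(letters, size = 295, replace = TRUE)
newx <- rotate_derange(x)
all(table(newx) == table(x)) && all(x != newx)
# [1] TRUE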
I've written it in data.table, as that helped me during debugging, without losing too much performance. It's a modest improvement performance-wise.
library(data.table)
library(magrittr)
library(microbenchmark)
permute_avoid_same_position <- function(y) {
DT <- data.table(orig = y)
DT[, orig_order := .I]
count_by_letter <-
DT[, .N, keyby = orig] %>%
.[order(N)] %>%
.[, stable_order := .I] %>%
.[order(-stable_order)] %>%
.[]
out <- copy(DT)[count_by_letter, .(orig, orig_order, N), on = "orig"]
# Dummy element
out[, new := first(y)]
origs <- out[["orig"]]
nrow_out <- nrow(out)
maxN <- count_by_letter[["N"]][1]
out[seq_len(nrow_out) > maxN, new := head(origs, nrow_out - maxN)]
out[seq_len(nrow_out) <= maxN, new := tail(origs, maxN)]
DT[out, j = .(orig_order, orig, new), on = "orig_order"] %>%
.[order(orig_order)] %>%
.[["new"]]
}
set.seed(1)
x <- sample(letters, size=295, replace=T)
testthat::expect_true(all(table(permute_avoid_same_position(x)) == table(x)))
testthat::expect_true(all(x != permute_avoid_same_position(x)))
microbenchmark(permute_avoid_same_position(x), times = 5)
# Unit: milliseconds
# expr min lq mean median uq max
# permute_avoid_same_position(x) 5.650378 5.771753 5.875116 5.788618 5.938604 6.226228
x <- sample(1:1000, replace = TRUE, size = 1e6)
testthat::expect_true(all(table(permute_avoid_same_position(x)) == table(x)))
testthat::expect_true(all(x != permute_avoid_same_position(x)))
microbenchmark(permute_avoid_same_position(x), times = 5)
# Unit: milliseconds
# expr min lq mean median uq max
# permute_avoid_same_position(x) 239.7744 385.4686 401.521 438.2999 440.9746 503.0875
We could extract substrings at the boundaries of the repeating elements, then sample and replicate:
library(stringr)
sapply(replicate(10, sample(str_extract_all(str1, "([[:alpha:]])\\1*")[[1]]),
simplify = FALSE), paste, collapse="")
#[1] "BBAAEFDCCA" "AAAFBBEDCC" "BBAAAEFCCD" "DFACCBBAAE" "AAFCCBBEAD"
#[6] "DAAAECCBBF" "AAFCCDBBEA" "CCEFADBBAA" "BBAAEADCCF" "AACCBBDFAE"
data
str1 <- "AABBCCADEF"

Which lines of a matrix are equal to a certain vector

I have a piece of code that searches for which rows of a matrix boxes are equal to a given vector x. This code uses the apply function, and I wonder if it can be optimized further.
x = floor(runif(4)*10)/10
boxes = as.matrix(do.call(expand.grid, lapply(1:4, function(x) {
seq(0, 1 - 1/10, length = 10)
})))
# can the following line be more optimised ? :
result <- which(sapply(1:nrow(boxes),function(i){all(boxes[i,] == x)}))
I did not manage to get rid of the apply function myself but maybe you'll have better ideas than me :)
One option is which(colSums(t(boxes) == x) == ncol(boxes)).
Vectors are recycled column-wise, so we need to transpose boxes before comparing to x with ==. Then we can pick which column (transposed row) has a sum of ncol(boxes), i.e. all TRUE values.
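To see why the transpose is needed, here is a quick illustration of that column-wise recycling on a small, made-up matrix (after transposing, each column corresponds to a row of boxes):
m <- matrix(1:6, nrow = 2)    # columns: (1,2), (3,4), (5,6)
v <- c(1, 4)
m == v                        # v is recycled down each column
#       [,1]  [,2]  [,3]
# [1,]  TRUE FALSE FALSE
# [2,] FALSE  TRUE FALSE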
Here's a benchmark for this (possibly not representative) example
Irnv <- function() which(sapply(1:nrow(boxes),function(i){all(boxes[i,] == x)}))
ICT <- function() which(colSums(t(boxes) == x) == ncol(boxes))
RS <- function() which(rowSums(mapply(function(i, j) boxes[, i] == j, seq_len(ncol(boxes)), x)) == length(x))
RS2 <- function(){
boxes <- data.frame(boxes)
which(rowSums(mapply(`==`, boxes, x)) == length(x))
}
akrun <- function() which(rowSums((boxes == x[col(boxes)])) == ncol(boxes))
microbenchmark(Irnv(), ICT(), RS(), RS2(), akrun())
# Unit: microseconds
# expr min lq mean median uq max neval
# Irnv() 19218.470 20122.2645 24182.2337 21882.8815 24949.1385 66387.719 100
# ICT() 300.308 323.2830 466.0395 342.3595 430.1545 7878.978 100
# RS() 566.564 586.2565 742.4252 617.2315 688.2060 8420.927 100
# RS2() 698.257 772.3090 1017.0427 842.2570 988.9240 9015.799 100
# akrun() 442.667 453.9490 579.9102 473.6415 534.5645 6870.156 100
We can also use rowSums on a replicated 'x' to make the lengths the same:
which(rowSums((boxes == x[col(boxes)])) == ncol(boxes))
Or use rep:
which(rowSums(boxes == rep(x, each = nrow(boxes))) == ncol(boxes))
Or with sweep and rowSums
which(rowSums(sweep(boxes, 2, x, `==`)) == ncol(boxes))
which(sapply(1:nrow(boxes),function(i){all(boxes[i,] == x)}))
#[1] 5805
A variation to your answer using mapply.
which(rowSums(mapply(function(i, j) boxes[, i] == j, seq_len(ncol(boxes)), x)) == length(x))
#[1] 5805
We can simplify the above version (only reducing keystrokes; see ICT's benchmarks) if boxes is allowed to be a data frame.
boxes <- data.frame(boxes)
which(rowSums(mapply(`==`, boxes, x)) == length(x))
#[1] 5805
Benchmarks on my system for various answers on a fresh R session
Irnv <- function() which(sapply(1:nrow(boxes),function(i){all(boxes[i,] == x)}))
ICT <- function() which(colSums(t(boxes) == x) == ncol(boxes))
RS <- function() which(rowSums(mapply(function(i, j) boxes[, i] == j, seq_len(ncol(boxes)), x)) == length(x))
RS2 <- function(){
boxes <- data.frame(boxes)
which(rowSums(mapply(`==`, boxes, x)) == length(x))
}
akrun <- function() which(rowSums((boxes == x[col(boxes)])) == ncol(boxes))
akrun2 <- function() which(rowSums(boxes == rep(x, each = nrow(boxes))) == ncol(boxes))
akrun3 <- function() which(rowSums(sweep(boxes, 2, x, `==`)) == ncol(boxes))
library(microbenchmark)
microbenchmark(Irnv(), ICT(), RS(), RS2(), akrun(), akrun2(), akrun3())
#Unit: microseconds
# expr min lq mean median uq max neval
#Irnv() 16335.205 16720.8905 18545.0979 17640.7665 18691.234 49036.793 100
#ICT() 195.068 215.4225 444.9047 233.8600 329.288 4635.817 100
#RS() 527.587 577.1160 1344.3033 639.7180 1373.426 36581.216 100
#RS2() 648.996 737.6870 1810.3805 847.9865 1580.952 35263.632 100
#akrun() 384.498 402.1985 761.0542 421.5025 1176.129 4102.214 100
#akrun2() 840.324 853.9825 1415.9330 883.3730 1017.014 34662.084 100
#akrun3() 399.645 459.7685 1186.7605 488.3345 1215.601 38098.927 100
data
set.seed(3251)
x = floor(runif(4)*10)/10
boxes = as.matrix(do.call(expand.grid, lapply(1:4, function(x) {
seq(0, 1 - 1/10, length = 10)
})))

Memory efficient creation of sparse matrix

I have a list of 50000 string vectors, consisting of various combinations of 6000 unique strings.
Goal: I want to transform them into "relative frequencies" (table(x)/length(x)) and store them in a sparse matrix. Low memory consumption is more important than speed; currently memory is the bottleneck.
(Even though the source data is only about 50 MB and the data in the target format about 10 MB, the transformation seems to be inefficient.)
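For a single vector, the transformation I am after looks like this (a small illustrative example):
v <- c("A1", "A3", "A1", "A7")
table(v) / length(v)
# v
#   A1   A3   A7
# 0.50 0.25 0.25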
Generate sample data
dims <- c(50000, 6000)
nms <- paste0("A", 1:dims[2])
lengths <- sample(5:30, dims[1], replace = T)
data <- lapply(lengths, sample, x = nms, replace = T)
Possible attempts:
1) sapply() with simplify to sparse matrix?
library(Matrix)
sparseRow <- function(stringVec){
relFreq <- c(table(factor(stringVec, levels = nms)) / length(stringVec))
Matrix(relFreq, 1, dims[2], sparse = TRUE)
}
sparseRows <- sapply(data[1:5], sparseRow)
sparseMat <- do.call(rbind, sparseRows)
Problem: My bottleneck seems to be sparseRows, as the rows are not directly combined into a sparse matrix.
(If I run the code above on the full sample, I get Error: cannot allocate vector of size 194 Kb
Error during wrapup: memory exhausted (limit reached?) - my hardware has 8 GB of RAM.)
Obviously, creating a list of rows before combining them consumes more memory than filling
the sparse matrix directly.
--> So using (s/l)apply is not memory friendly in my case?
object.size(sparseRows)
object.size(sparseMat)
2) Dirty workaround(?)
My goal seems to be to create an empty sparse matrix and fill it row wise. Below is a dirty way to do it (which works
on my hardware).
indxs <- lapply(data, function(data) sapply(data, function(x) which(x == nms),
USE.NAMES = FALSE))
relFreq <- lapply(indxs, function(idx) table(idx)/length(idx))
mm <- Matrix(0, nrow = dims[1], ncol = dims[2])
for(idx in 1:dims[1]){
mm[idx, as.numeric(names(relFreq[[idx]]))] <- as.numeric(relFreq[[idx]])
}
#sapply(1:dims[1], function(idx) mm[idx,
# as.numeric(names(relFreq[[idx]]))] <<- as.numeric(relFreq[[idx]]))
I would like to ask if there is a more elegant/efficient way to achieve that with lowest amount of RAM possible.
I would convert to data.table and then do the necessary calculations:
library(data.table)
ld <- lengths(data)
D <- data.table(val = unlist(data),
id = rep(1:length(data), times = ld),
Ntotal = rep(ld, times = ld))
D <- D[, .N, keyby = .(id, val, Ntotal)]
D[, freq := N/Ntotal]
ii <- data.table(val = nms, ind = seq_along(nms))
D <- ii[D, on = 'val']
sp <- with(D, sparseMatrix(i = id, j = ind, x = freq,
dims = c(max(id), length(nms))))
Benchmarks for n = 100
data2 <- data[1:100]
Unit: milliseconds
expr min lq mean median uq max neval cld
OP 102.150200 106.235148 113.117848 109.98310 116.79734 142.859832 10 b
F. Privé 122.314496 123.804442 149.999595 126.76936 164.97166 233.034447 10 c
minem 5.617658 5.827209 6.307891 6.10946 6.15137 9.199257 10 a
user20650 11.012509 11.752350 13.580099 12.59034 14.31870 21.961725 10 a
Benchmarks on all data
Let's benchmark the 3 fastest functions, because the rest of them (the OP's, user20650_v1, and F.Privé's) would be too slow on all of the data.
user20650_v2 <- function(x) {
dt2 = data.table(lst = rep(1:length(x), lengths(x)),
V1 = unlist(x))
dt2[, V1 := factor(V1, levels = nms)]
x3 = xtabs(~ lst + V1, data = dt2, sparse = TRUE)
x3/rowSums(x3)
}
user20650_v3 <- function(x) {
x3 = xtabs(~ rep(1:length(x), lengths(x)) + factor(unlist(x), levels = nms),
sparse = TRUE)
x3/rowSums(x3)
}
minem <- function(x) {
ld <- lengths(x)
D <- data.table(val = unlist(x), id = rep(1:length(x), times = ld),
Ntotal = rep(ld, times = ld))
D <- D[, .N, keyby = .(id, val, Ntotal)]
D[, freq := N/Ntotal]
ii <- data.table(val = nms, ind = seq_along(nms))
D <- ii[D, on = 'val']
sparseMatrix(i = D$id, j = D$ind, x = D$freq,
dims = c(max(D$id), length(nms)))
}
Compare the results of minem and user20650_v3:
x1 <- minem(data)
x2 <- user20650_v3(data)
all.equal(x1, x2)
# [1] "Component “Dimnames”: names for current but not for target"
# [2] "Component “Dimnames”: Component 1: target is NULL, current is character"
# [3] "Component “Dimnames”: Component 2: target is NULL, current is character"
# [4] "names for target but not for current"
x2 has additional names. Remove them:
dimnames(x2) <- names(x2@x) <- NULL
all.equal(x1, x2)
# [1] TRUE # all equal
Timings:
x <- bench::mark(minem(data),
user20650_v2(data),
user20650_v3(data),
iterations = 5, check = F)
as.data.table(x)[, 1:10]
# expression min mean median max itr/sec mem_alloc n_gc n_itr total_time
# 1: minem(data) 324ms 345ms 352ms 371ms 2.896187 141MB 7 5 1.73s
# 2: user20650_v2(data) 604ms 648ms 624ms 759ms 1.544380 222MB 10 5 3.24s
# 3: user20650_v3(data) 587ms 607ms 605ms 633ms 1.646977 209MB 10 5 3.04s
Regarding memory:
OPdirty <- function(x) {
indxs <- lapply(x, function(x) sapply(x, function(x) which(x == nms),
USE.NAMES = FALSE))
relFreq <- lapply(indxs, function(idx) table(idx)/length(idx))
dims <- c(length(indxs), length(nms))
mm <- Matrix(0, nrow = dims[1], ncol = dims[2])
for (idx in 1:dims[1]) {
mm[idx, as.numeric(names(relFreq[[idx]]))] <- as.numeric(relFreq[[idx]])
}
mm
}
xx <- data[1:1000]
all.equal(OPdirty(xx), minem(xx))
# true
x <- bench::mark(minem(xx),
FPrive(xx),
OPdirty(xx),
iterations = 3, check = T)
as.data.table(x)[, 1:10]
expression min mean median max itr/sec mem_alloc n_gc n_itr total_time
1: minem(xx) 12.69ms 14.11ms 12.71ms 16.93ms 70.8788647 3.04MB 0 3 42.33ms
2: FPrive(xx) 1.46s 1.48s 1.47s 1.52s 0.6740317 214.95MB 4 3 4.45s
3: OPdirty(xx) 2.12s 2.14s 2.15s 2.16s 0.4666106 914.91MB 9 3 6.43s
See column mem_alloc...
Use a loop to fill a pre-allocated sparse matrix column-wise (and then transpose it):
res <- Matrix(0, dims[2], length(data), sparse = TRUE)
for (i in seq_along(data)) {
ind.match <- match(data[[i]], nms)
tab.match <- table(ind.match)
res[as.integer(names(tab.match)), i] <- as.vector(tab.match) / length(data[[i]])
}
# Verif
stopifnot(identical(t(res), sparseMat))
Benchmark:
data2 <- data[1:50]
microbenchmark::microbenchmark(
OP = {
sparseMat <- do.call(rbind, sapply(data2, sparseRow))
},
ME = {
res <- Matrix(0, dims[2], length(data2), sparse = TRUE)
for (i in seq_along(data2)) {
ind.match <- match(data2[[i]], nms)
tab.match <- table(ind.match)
res[as.integer(names(tab.match)), i] <- as.vector(tab.match) / length(data2[[i]])
}
res2 <- t(res)
}
)
stopifnot(identical(res2, sparseMat))
Unit: milliseconds
expr min lq mean median uq max neval cld
OP 56.28020 59.61689 63.24816 61.16986 62.80294 206.18689 100 b
ME 46.60318 48.27268 49.77190 49.50714 50.92287 55.23727 100 a
So, it's memory-efficient and not that slow.

Optimizing speed of nearest search function in R [closed]

I'm trying to make this function faster (ideally with RcppArmadillo or some other alternative). myfun takes a matrix, mat, that can get quite large but always has two columns. For each row of the matrix, myfun finds the other rows whose absolute (Manhattan) distance from it is exactly 1.
As an example, the first row of mat below is (3, 3). Therefore, myfun will output a list in which rows 2 and 3 are listed as closest to row 1, but not row 5, which is 2 away.
library(microbenchmark)
dim(mat)
[1] 1000 2
head(mat)
x y
[1,] 3 3
[2,] 3 4
[3,] 3 2
[4,] 7 3
[5,] 4 4
[6,] 10 1
output
[[1]]
[1] 2 3
[[2]]
[1] 1
[[3]]
[1] 1
[[4]]
integer(0)
[[5]]
integer(0)
[[6]]
integer(0)
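mat is not constructed in the post; a matrix of the same shape can be generated along these lines (the values here are arbitrary):
set.seed(1)
mat <- cbind(x = sample(1:25, 1000, replace = TRUE),
             y = sample(1:25, 1000, replace = TRUE))
dim(mat)
# [1] 1000    2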
microbenchmark( myfun(mat), times = 100) #mat of 1000 rows
# Unit: milliseconds
# expr min lq mean median uq max neval
# myfun(mat) 89.30126 90.28618 95.50418 90.91281 91.50875 180.1505 100
microbenchmark( myfun(mat), times = 100) #mat of 10,000 rows
# Unit: seconds
# expr min lq mean median uq max neval
# myfun(layout.old) 5.912633 5.912633 5.912633 5.912633 5.912633 5.912633 1
This is what myfun looks like
myfun = function(x){
doo <- function(j) {
j.mat <- matrix(rep(j, length = length(x)), ncol = ncol(x), byrow = TRUE)
j.abs <- abs(j.mat - x)
return(which(rowSums(j.abs) == 1))
}
return(apply(x, 1, doo))
}
Below, I have a base R solution that is much faster than myfun provided by the OP.
myDistOne <- function(m) {
v1 <- m[,1L]; v2 <- m[,2L]
rs <- rowSums(m)
lapply(seq_along(rs), function(x) {
t1 <- which(abs(rs[x] - rs) == 1)
t2 <- t1[which(abs(v1[x] - v1[t1]) <= 1)]
t2[which(abs(v2[x] - v2[t2]) <= 1)]
})
}
Here are some benchmarks:
library(microbenchmark)
set.seed(9711)
m1 <- matrix(sample(50, 2000, replace = TRUE), ncol = 2) ## 1,000 rows
microbenchmark(myfun(m1), myDistOne(m1))
Unit: milliseconds
expr min lq mean median uq max neval cld
myfun(m1) 78.61637 78.61637 80.47931 80.47931 82.34225 82.34225 2 b
myDistOne(m1) 27.34810 27.34810 28.18758 28.18758 29.02707 29.02707 2 a
identical(myfun(m1), myDistOne(m1))
[1] TRUE
m2 <- matrix(sample(200, 20000, replace = TRUE), ncol = 2) ## 10,000 rows
microbenchmark(myfun(m2), myDistOne(m2))
Unit: seconds
expr min lq mean median uq max neval cld
myfun(m2) 5.219318 5.533835 5.758671 5.714263 5.914672 7.290701 100 b
myDistOne(m2) 1.230721 1.366208 1.433403 1.419413 1.473783 1.879530 100 a
identical(myfun(m2), myDistOne(m2))
[1] TRUE
Here is a very large example:
m3 <- matrix(sample(1000, 100000, replace = TRUE), ncol = 2) ## 50,000 rows
system.time(testJoe <- myDistOne(m3))
user system elapsed
26.963 10.988 37.973
system.time(testUser <- myfun(m3))
user system elapsed
148.444 33.297 182.639
identical(testJoe, testUser)
[1] TRUE
I'm sure there is a faster solution. Sorting the rowSums upfront and working from there might yield an improvement (it could also get very messy).
Update
As I predicted, working from a sorted rowSums is much faster (and uglier!)
myDistOneFast <- function(m) {
v1 <- m[,1L]; v2 <- m[,2L]
origrs <- rowSums(m)
mySort <- order(origrs)
rs <- origrs[mySort]
myDiff <- c(0L, diff(rs))
brks <- which(myDiff > 0L)
lenB <- length(brks)
n <- nrow(m)
myL <- vector("list", length = n)
findRows <- function(v, s, r, u1, u2) {
lapply(v, function(x) {
sx <- s[x]
tv1 <- s[r]
tv2 <- tv1[which(abs(u1[sx] - u1[tv1]) <= 1)]
tv2[which(abs(u2[sx] - u2[tv2]) <= 1)]
})
}
t1 <- brks[1L]; t2 <- brks[2L]
## setting first index in myL
myL[mySort[1L:(t1-1L)]] <- findRows(1L:(t1-1L), mySort, t1:(t2-1L), v1, v2)
k <- t0 <- 1L
while (k < (lenB-1L)) {
t1 <- brks[k]; t2 <- brks[k+1L]; t3 <- brks[k+2L]
vec <- t1:(t2-1L)
if (myDiff[t1] == 1L) {
if (myDiff[t2] == 1L) {
myL[mySort[vec]] <- findRows(vec, mySort, c(t0:(t1-1L), t2:(t3-1L)), v1, v2)
} else {
myL[mySort[vec]] <- findRows(vec, mySort, t0:(t1-1L), v1, v2)
}
} else if (myDiff[t2] == 1L) {
myL[mySort[vec]] <- findRows(vec, mySort, t2:(t3-1L), v1, v2)
}
if (myDiff[t2] > 1L) {
if (myDiff[t3] > 1L) {
k <- k+2L; t0 <- t2
} else {
k <- k+1L; t0 <- t1
}
} else {k <- k+1L; t0 <- t1}
}
## setting second to last index in myL
if (k == lenB-1L) {
t1 <- brks[k]; t2 <- brks[k+1L]; t3 <- n+1L; vec <- t1:(t2-1L)
if (myDiff[t1] == 1L) {
if (myDiff[t2] == 1L) {
myL[mySort[vec]] <- findRows(vec, mySort, c(t0:(t1-1L), t2:(t3-1L)), v1, v2)
} else {
myL[mySort[vec]] <- findRows(vec, mySort, t0:(t1-1L), v1, v2)
}
} else if (myDiff[t2] == 1L) {
myL[mySort[vec]] <- findRows(vec, mySort, t2:(t3-1L), v1, v2)
}
k <- k+1L; t0 <- t1
}
t1 <- brks[k]; vec <- t1:n
if (myDiff[t1] == 1L) {
myL[mySort[vec]] <- findRows(vec, mySort, t0:(t1-1L), v1, v2)
}
myL
}
The results are not even close. myDistOneFast is over 100x faster than the OP's original myfun on very large matrices and also scales well. Below are some benchmarks:
microbenchmark(OP = myfun(m1), Joe = myDistOne(m1), JoeFast = myDistOneFast(m1))
Unit: milliseconds
expr min lq mean median uq max neval
OP 57.60683 59.51508 62.91059 60.63064 61.87141 109.39386 100
Joe 22.00127 23.11457 24.35363 23.87073 24.87484 58.98532 100
JoeFast 11.27834 11.99201 12.59896 12.43352 13.08253 15.35676 100
microbenchmark(OP = myfun(m2), Joe = myDistOne(m2), JoeFast = myDistOneFast(m2))
Unit: milliseconds
expr min lq mean median uq max neval
OP 4461.8201 4527.5780 4592.0409 4573.8673 4633.9278 4867.5244 100
Joe 1287.0222 1316.5586 1339.3653 1331.2534 1352.3134 1524.2521 100
JoeFast 128.4243 134.0409 138.7518 136.3929 141.3046 172.2499 100
system.time(testJoeFast <- myDistOneFast(m3))
user system elapsed
0.68 0.00 0.69 ### myfun took over 100s!!!
To test equality, we have to sort each vector of indices. We also can't use identical for comparison as myL is initialized as an empty list, thus some of the indices contain NULL values (these correspond to integer(0) in the result from myfun and myDistOne).
testJoeFast <- lapply(testJoeFast, sort)
all(sapply(1:50000, function(x) all(testJoe[[x]]==testJoeFast[[x]])))
[1] TRUE
unlist(testJoe[which(sapply(testJoeFast, is.null))])
integer(0)
Here is an example with 500,000 rows:
set.seed(42)
m4 <- matrix(sample(2000, 1000000, replace = TRUE), ncol = 2)
system.time(myDistOneFast(m4))
user system elapsed
10.84 0.06 10.94
Here is an overview of how the algorithm works:
1) Calculate rowSums
2) Order the rowSums (i.e. return the indices from the original vector of the sorted vector)
3) Call diff
4) Mark each non-zero instance
5) Determine which indices in the small range satisfy the OP's request
6) Use the ordered vector calculated in 2) to determine the original indices
This is much faster than comparing one rowSum to all of the rowSum every time.
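A tiny illustration of the first four steps, with made-up row sums:
rs <- c(6, 7, 5, 10, 8)   # rowSums of a hypothetical 5-row matrix
mySort <- order(rs)       # 3 1 2 5 4
rs[mySort]                # 5 6 7 8 10  (sorted sums)
diff(rs[mySort])          # 1 1 1 2
# only rows in adjacent groups whose sorted sums differ by exactly 1 can be at
# Manhattan distance 1, so all other comparisons are skipped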

Split string fixed width [duplicate]

I have an object containing a text string:
x <- "xxyyxyxy"
and I want to split that into a vector with each element containing two letters:
[1] "xx" "yy" "xy" "xy"
It seems like the strsplit should be my ticket, but since I have no regular expression foo, I can't figure out how to make this function chop the string up into chunks the way I want it. How should I do this?
Using substring is the best approach:
substring(x, seq(1, nchar(x), 2), seq(2, nchar(x), 2))
But here's a solution with plyr:
library("plyr")
laply(seq(1, nchar(x), 2), function(i) substr(x, i, i+1))
Here is a fast solution that splits the string into characters, then pastes together the even elements and the odd elements.
x <- "xxyyxyxy"
sst <- strsplit(x, "")[[1]]
paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])
Benchmark Setup:
library(microbenchmark)
GSee <- function(x) {
sst <- strsplit(x, "")[[1]]
paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])
}
Shane1 <- function(x) {
substring(x, seq(1,nchar(x),2), seq(2,nchar(x),2))
}
library("plyr")
Shane2 <- function(x) {
laply(seq(1,nchar(x),2), function(i) substr(x, i, i+1))
}
seth <- function(x) {
strsplit(gsub("([[:alnum:]]{2})", "\\1 ", x), " ")[[1]]
}
geoffjentry <- function(x) {
idx <- 1:nchar(x)
odds <- idx[(idx %% 2) == 1]
evens <- idx[(idx %% 2) == 0]
substring(x, odds, evens)
}
drewconway <- function(x) {
c<-strsplit(x,"")[[1]]
sapply(seq(2,nchar(x),by=2),function(y) paste(c[y-1],c[y],sep=""))
}
KenWilliams <- function(x) {
n <- 2
sapply(seq(1,nchar(x),by=n), function(xx) substr(x, xx, xx+n-1))
}
RichardScriven <- function(x) {
regmatches(x, gregexpr("(.{2})", x))[[1]]
}
Benchmark 1:
x <- "xxyyxyxy"
microbenchmark(
GSee(x),
Shane1(x),
Shane2(x),
seth(x),
geoffjentry(x),
drewconway(x),
KenWilliams(x),
RichardScriven(x)
)
# Unit: microseconds
# expr min lq median uq max neval
# GSee(x) 8.032 12.7460 13.4800 14.1430 17.600 100
# Shane1(x) 74.520 80.0025 84.8210 88.1385 102.246 100
# Shane2(x) 1271.156 1288.7185 1316.6205 1358.5220 3839.300 100
# seth(x) 36.318 43.3710 45.3270 47.5960 67.536 100
# geoffjentry(x) 9.150 13.5500 15.3655 16.3080 41.066 100
# drewconway(x) 92.329 98.1255 102.2115 105.6335 115.027 100
# KenWilliams(x) 77.802 83.0395 87.4400 92.1540 163.705 100
# RichardScriven(x) 55.034 63.1360 65.7545 68.4785 108.043 100
Benchmark 2:
Now, with bigger data.
x <- paste(sample(c("xx", "yy", "xy"), 1e5, replace=TRUE), collapse="")
microbenchmark(
GSee(x),
Shane1(x),
Shane2(x),
seth(x),
geoffjentry(x),
drewconway(x),
KenWilliams(x),
RichardScriven(x),
times=3
)
# Unit: milliseconds
# expr min lq median uq max neval
# GSee(x) 29.029226 31.3162690 33.603312 35.7046155 37.805919 3
# Shane1(x) 11754.522290 11866.0042600 11977.486230 12065.3277955 12153.169361 3
# Shane2(x) 13246.723591 13279.2927180 13311.861845 13371.2202695 13430.578694 3
# seth(x) 86.668439 89.6322615 92.596084 92.8162885 93.036493 3
# geoffjentry(x) 11670.845728 11681.3830375 11691.920347 11965.3890110 12238.857675 3
# drewconway(x) 384.863713 438.7293075 492.594902 515.5538020 538.512702 3
# KenWilliams(x) 12213.514508 12277.5285215 12341.542535 12403.2315015 12464.920468 3
# RichardScriven(x) 11549.934241 11730.5723030 11911.210365 11989.4930080 12067.775651 3
How about
strsplit(gsub("([[:alnum:]]{2})", "\\1 ", x), " ")[[1]]
Basically, add a separator (here " ") and then use strsplit
strsplit is going to be problematic; look at a regexp like this:
strsplit(z, '[[:alnum:]]{2}')
it will split at the right points but nothing is left.
You could use substring & friends
z <- 'xxyyxyxy'
idx <- 1:nchar(z)
odds <- idx[(idx %% 2) == 1]
evens <- idx[(idx %% 2) == 0]
substring(z, odds, evens)
Here's one way, but not using regexen:
a <- "xxyyxyxy"
n <- 2
sapply(seq(1,nchar(a),by=n), function(x) substr(a, x, x+n-1))
Attention: with substring, if the string length is not a multiple of your requested chunk length, then you will need a +(n-1) in the second sequence:
substring(x,seq(1,nchar(x),n),seq(n,nchar(x)+n-1,n))
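A small illustration of the problem and the fix, using a 5-character string and n = 2:
s <- "xxyyx"
n <- 2
substring(s, seq(1, nchar(s), n), seq(n, nchar(s), n))
# the last chunk comes back empty because the end sequence is recycled short
substring(s, seq(1, nchar(s), n), seq(n, nchar(s) + n - 1, n))
# [1] "xx" "yy" "x"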
Total hack, JD, but it gets it done
x <- "xxyyxyxy"
c<-strsplit(x,"")[[1]]
sapply(seq(2,nchar(x),by=2),function(y) paste(c[y-1],c[y],sep=""))
[1] "xx" "yy" "xy" "xy"
A helper function:
fixed_split <- function(text, n) {
strsplit(text, paste0("(?<=.{",n,"})"), perl=TRUE)
}
fixed_split(x, 2)
[[1]]
[1] "xx" "yy" "xy" "xy"
Using C++ one can be even faster. Comparing with GSee's version:
GSee <- function(x) {
sst <- strsplit(x, "")[[1]]
paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])
}
rstub <- Rcpp::cppFunction( code = '
CharacterVector strsplit2(const std::string& hex) {
unsigned int length = hex.length()/2;
CharacterVector res(length);
for (unsigned int i = 0; i < length; ++i) {
res(i) = hex.substr(2*i, 2);
}
return res;
}')
x <- "xxyyxyxy"
all.equal(GSee(x), rstub(x))
#> [1] TRUE
microbenchmark::microbenchmark(GSee(x), rstub(x))
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> GSee(x) 4.272 4.4575 41.74284 4.5855 4.7105 3702.289 100
#> rstub(x) 1.710 1.8990 139.40519 2.0665 2.1250 13722.075 100
set.seed(42)
x <- paste(sample(c("xx", "yy", "xy"), 1e5, replace = TRUE), collapse = "")
all.equal(GSee(x), rstub(x))
#> [1] TRUE
microbenchmark::microbenchmark(GSee(x), rstub(x))
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> GSee(x) 17.931801 18.431504 19.282877 18.738836 19.47943 27.191390 100
#> rstub(x) 3.197587 3.261109 3.404973 3.341099 3.45852 4.872195 100
Well, I used the following pseudo-code to fulfill this task:
1) Insert a special separator sequence after each chunk of length n.
2) Split the string by said sequence.
In code, I did
chopS <- function( text, chunk_len = 2, seqn)
{
# Specify select and replace patterns
insert <- paste("(.{",chunk_len,"})", sep = "")
replace <- paste("\\1", seqn, sep = "")
# Insert sequence with replaced pattern, then split by the sequence
interp_text <- gsub( insert, replace, text)
strsplit( interp_text, seqn)
}
This returns a list with the split vector inside, though, not a vector.
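For example, with the corrected function above and an arbitrary separator that does not occur in the input:
chopS("xxyyxyxy", 2, "#")[[1]]
# [1] "xx" "yy" "xy" "xy"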
From my testing, the code below is faster than the previous methods that were benchmarked. stri_sub is pretty fast, and seq.int is better than seq. It's also easy to change the size of the chunks by changing all the 2Ls to something else.
library(stringi)
split_line <- function(x) {
row_length <- stri_length(x)
stri_sub(x, seq.int(1L, row_length, 2L), seq.int(2L, row_length, 2L))
}
I didn't notice a difference when string chunks were 2 characters long, but for bigger chunks this is slightly better.
split_line <- function(x) {
stri_sub(x, seq.int(1L, stri_length(x), 109L), length = 109L)
}
I set out looking for a vectorised solution to this, in order to avoid
lapply()ing one of the single string solutions across long vectors. Failing
to find an existing solution, I somehow fell down a rabbit hole of
painstakingly writing one in C. It ended up hilariously complicated compared
to the many one-line R solutions shown here (no thanks to me deciding to also
want to handle Unicode strings to match the R versions), but I thought I’d
share the result, in case it somehow someday helps somebody. Here’s what
eventually became of that:
#define R_NO_REMAP
#include <R.h>
#include <Rinternals.h>
// Find the width (in bytes) of a UTF-8 character, given its first byte
size_t utf8charw(char b) {
if (b == 0x00) return 0;
if ((b & 0x80) == 0x00) return 1;
if ((b & 0xe0) == 0xc0) return 2;
if ((b & 0xf0) == 0xe0) return 3;
if ((b & 0xf8) == 0xf0) return 4;
return 1; // Really an invalid character, but move on
}
// Find the number of UTF-8 characters in a string
size_t utf8nchar(const char* str) {
size_t nchar = 0;
while (*str != '\0') {
str += utf8charw(*str); nchar++;
}
return nchar;
}
SEXP C_str_chunk(SEXP x, SEXP size_) {
// Allocate a list to store the result
R_xlen_t n = Rf_xlength(x);
SEXP result = PROTECT(Rf_allocVector(VECSXP, n));
int size = Rf_asInteger(size_);
for (R_xlen_t i = 0; i < n; i++) {
const char* str = Rf_translateCharUTF8(STRING_ELT(x, i));
// Figure out number of chunks
size_t nchar = utf8nchar(str);
size_t nchnk = nchar / size + (nchar % size != 0);
SEXP chunks = PROTECT(Rf_allocVector(STRSXP, nchnk));
for (size_t j = 0, nbytes = 0; j < nchnk; j++, str += nbytes) {
// Find size of next chunk in bytes
nbytes = 0;
for (int cp = 0; cp < size; cp++) {
nbytes += utf8charw(str[nbytes]);
}
// Assign to chunks vector as an R string
SET_STRING_ELT(chunks, j, Rf_mkCharLenCE(str, nbytes, CE_UTF8));
}
SET_VECTOR_ELT(result, i, chunks);
}
// Clean up
UNPROTECT(n);
UNPROTECT(1);
return result;
}
I then put this monstrosity into a file called str_chunk.c, and compiled with R CMD SHLIB str_chunk.c.
To try it out, we need some set-up on the R side:
str_chunk <- function(x, n) {
.Call("C_str_chunk", x, as.integer(n))
}
# The (currently) accepted answer
str_chunk_one <- function(x, n) {
substring(x, seq(1, nchar(x), n), seq(n, nchar(x), n))
}
dyn.load("str_chunk.dll")
So what we’ve achieved with the C version is to take a vector of inputs and return a list:
str_chunk(rep("0123456789AB", 2), 2)
#> [[1]]
#> [1] "01" "23" "45" "67" "89" "AB"
#>
#> [[2]]
#> [1] "01" "23" "45" "67" "89" "AB"
Now off we go with benchmarking.
We start off strong with a 200x improvement for a long(ish) vector of
short strings:
x <- rep("0123456789AB", 1000)
microbenchmark::microbenchmark(
accepted = lapply(x, str_chunk_one, 2),
str_chunk(x, 2)
) |> print(unit = "relative")
#> Unit: relative
#> expr min lq mean median uq max neval
#> accepted 229.5826 216.8246 182.5449 203.785 182.3662 25.88823 100
#> str_chunk(x, 2) 1.0000 1.0000 1.0000 1.000 1.0000 1.00000 100
… which then shrinks to a distinctly less impressive 3x improvement for
large strings.
x <- rep(strrep("0123456789AB", 1000), 10)
microbenchmark::microbenchmark(
accepted = lapply(x, str_chunk_one, 2),
str_chunk(x, 2)
) |> print(unit = "relative")
#> Unit: relative
#> expr min lq mean median uq max neval
#> accepted 2.77981 2.802641 3.304573 2.787173 2.846268 13.62319 100
#> str_chunk(x, 2) 1.00000 1.000000 1.000000 1.000000 1.000000 1.00000 100
dyn.unload("str_chunk.dll")
So, was it worth it? Absolutely not, considering how long it took to actually get working
properly. But if this were in a package, it would have saved quite a lot of time in my
use case (short strings, long vectors).
Here is one option using stringi::stri_sub(). Try:
x <- "xxyyxyxy"
stringi::stri_sub(x, seq(1, stringi::stri_length(x), by = 2), length = 2)
# [1] "xx" "yy" "xy" "xy"
