I have the same question as Resample with replacement by cluster, i.e. I want to do cluster bootstrapping. The best answer's approach to that question, rbindlist(lapply(resampled_ids, function(resampled_id) df[df$id == resampled_id,])), works, but because I have a big dataset this resampling step is rather slow. Is it possible to speed this up?
Use sequence to index. Demonstrated with a larger data.frame:
df <- data.frame(id = rep.int(1:1e2, sample(100:200, 1e2, replace = TRUE))[1:1e4], X = rnorm(1e4))
resampled_ids <- sample(unique(df$id), replace = TRUE)
idx <- sequence(tabulate(df$id)[resampled_ids], match(unique(df$id), df$id)[resampled_ids])
s <- data.frame(id = df$id[idx], X = df$X[idx])
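For intuition, sequence(lengths, starts) builds runs of consecutive indices: each resampled id contributes a run starting at that id's first row and running for its cluster size (this relies on the rows being grouped by id, as in df, and on the from argument of sequence, available since R 4.0.0). A tiny illustration:
sequence(c(2, 3), c(1, 5))
#> [1] 1 2 5 6 7
That is, a run of 2 indices starting at 1 followed by a run of 3 indices starting at 5.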
Benchmarking against the rbindlist solution:
library(data.table)
library(microbenchmark)
microbenchmark(rbindlist = rbindlist(lapply(resampled_ids, function(x) df[df$id %in% x,])),
sequence = {idx <- sequence(tabulate(df$id)[resampled_ids], match(unique(df$id), df$id)[resampled_ids])
data.frame(id = df$id[idx], X = df$X[idx])})
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> rbindlist 9480.4 9921.95 11470.567 10431.05 12555.35 31178.2 100
#> sequence 406.7 444.55 564.873 498.10 545.70 2818.4 100
Note that creating a new data.frame from indexed vectors is much faster than row-indexing the original data.frame. The difference is much less pronounced if a data.table is used, but, surprisingly, the rbindlist solution becomes even slower:
microbenchmark(rbindlist = rbindlist(lapply(resampled_ids, function(x) df[df$id %in% x,])),
sequence1 = df[sequence(tabulate(df$id)[resampled_ids], match(unique(df$id), df$id)[resampled_ids]),],
sequence2 = {idx <- sequence(tabulate(df$id)[resampled_ids], match(unique(df$id), df$id)[resampled_ids])
data.frame(id = df$id[idx], X = df$X[idx])})
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> rbindlist 9431.9 9957.7 11101.545 10508.15 12395.25 15363.3 100
#> sequence1 4284.5 4550.3 4866.891 4674.80 5009.90 8350.1 100
#> sequence2 414.1 455.6 541.590 508.40 551.40 2881.1 100
setDT(df)
microbenchmark(rbindlist = rbindlist(lapply(resampled_ids, function(x) df[df$id %in% x,])),
sequence1 = df[sequence(tabulate(df$id)[resampled_ids], match(unique(df$id), df$id)[resampled_ids]),],
sequence2 = {idx <- sequence(tabulate(df$id)[resampled_ids], match(unique(df$id), df$id)[resampled_ids])
data.table(id = df$id[idx], X = df$X[idx])})
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> rbindlist 14877.4 15878.30 17181.572 16348.50 18527.6 22520.9 100
#> sequence1 795.0 1016.80 1187.266 1101.95 1326.7 2566.5 100
#> sequence2 386.4 441.75 556.226 473.70 500.9 3373.6 100
Update
To address the comment from jay.sf:
lens <- tabulate(df$id)[resampled_ids]
idx <- sequence(lens, match(unique(df$id), df$id)[resampled_ids])
s <- data.frame(cluster = rep.int(seq_along(resampled_ids), lens), id = df$id[idx], X = df$X[idx])
The cluster column corresponds to the index into resampled_ids, i.e. it identifies which resampled cluster each row belongs to.
An alternative approach uses split. With small example data:
f <- data.frame(id = c(1, 1, 2, 2, 2, 3, 3), X = rnorm(7))
Try this:
ind_id <- split(seq_along(f$id), f$id)
samp_id <- sample(names(ind_id), replace = TRUE)
f[unlist(ind_id[samp_id]), ]
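For intuition, with the small f above, ind_id maps each id to its row positions, so subsetting it with a resampled vector of ids pulls whole clusters (the samp_id draw shown below is just one hypothetical outcome):
ind_id
#> $`1`
#> [1] 1 2
#>
#> $`2`
#> [1] 3 4 5
#>
#> $`3`
#> [1] 6 7
unlist(ind_id[c("3", "1", "3")], use.names = FALSE)  # e.g. if samp_id were c("3", "1", "3")
#> [1] 6 7 1 2 6 7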
I want to permute a vector so that no element ends up in the same position it occupied in the original vector. Let's say I have a sequence of elements like this: AABBCCADEF
A valid shuffle would be: BBAADEFCCA
But these would be invalid: BAACFEDCAB or BCABFEDCAB
The closest answer I could find was this: python shuffle such that position will never repeat. But that's not quite what I want, because there are no repeated elements in that example.
I want a fast algorithm that generalizes that answer to the case of repeated elements.
MWE:
library(microbenchmark)
set.seed(1)
x <- sample(letters, size=295, replace=T)
terrible_implementation <- function(x) {
xnew <- sample(x)
while(any(x == xnew)) {
xnew <- sample(x)
}
return(xnew)
}
microbenchmark(terrible_implementation(x), times=10)
Unit: milliseconds
expr min lq mean median uq max neval
terrible_implementation(x) 479.5338 2346.002 4738.49 2993.29 4858.254 17005.05 10
Also, how do I determine if a sequence can be permuted in such a way?
EDIT: To make it perfectly clear what I want, the new vector should satisfy the following conditions:
1) all(table(newx) == table(x))
2) all(x != newx)
E.g.:
newx <- terrible_implementation(x)
all(table(newx) == table(x))
[1] TRUE
all(x != newx)
[1] TRUE
#DATA
set.seed(1)
x <- sample(letters, size=295, replace=T)
foo = function(S){
  # impossible if the most frequent element fills more than half of the positions
  if(max(table(S)) > length(S)/2){
    stop("NOT POSSIBLE")
  }
  U = unique(S)
  done_chrs = character(0)  # elements already placed
  inds = integer(0)         # positions already filled
  ans = character(0)
  while(!identical(sort(done_chrs), sort(U))){
    # among the remaining elements, pick the one with the fewest allowed positions left
    my_chrs = U[!U %in% done_chrs]
    next_chr = my_chrs[which.min(sapply(my_chrs, function(x) length(setdiff(which(!S %in% x), inds))))]
    # it may go anywhere except its own original positions and positions already filled
    x_inds = which(S %in% next_chr)
    candidates = setdiff(seq_along(S), union(x_inds, inds))
    if (length(candidates) == 1){
      new_inds = candidates
    }else{
      new_inds = sample(candidates, length(x_inds))
    }
    inds = c(inds, new_inds)
    ans[new_inds] = next_chr
    done_chrs = c(done_chrs, next_chr)
  }
  return(ans)
}
ans_foo = foo(x)
identical(sort(ans_foo), sort(x)) & !any(ans_foo == x)
#[1] TRUE
library(microbenchmark)
microbenchmark(foo(x))
#Unit: milliseconds
# expr min lq mean median uq max neval
# foo(x) 19.49833 22.32517 25.65675 24.85059 27.96838 48.61194 100
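The stop() condition at the top of foo() also answers the feasibility part of the question: the permutation is impossible when the most frequent element fills more than half of the positions (pigeonhole), and at most half turns out to be sufficient (see the rotation idea in the next answer). A standalone check could be (a sketch; the helper name is made up):
can_permute_avoiding_positions <- function(S) max(table(S)) <= length(S) / 2
can_permute_avoiding_positions(x)
#[1] TRUE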
I think this satisfies all your conditions. The idea is to sort the values by the frequency of their element, start with the most common element, and shift each value along the frequency-sorted order by the number of times the most common element appears. This guarantees that every value misses its original position.
I've written it in data.table, which helped me during debugging without costing too much performance; overall it's a modest performance improvement.
library(data.table)
library(magrittr)
library(microbenchmark)
permute_avoid_same_position <- function(y) {
DT <- data.table(orig = y)
DT[, orig_order := .I]
count_by_letter <-
DT[, .N, keyby = orig] %>%
.[order(N)] %>%
.[, stable_order := .I] %>%
.[order(-stable_order)] %>%
.[]
out <- copy(DT)[count_by_letter, .(orig, orig_order, N), on = "orig"]
# Dummy element
out[, new := first(y)]
origs <- out[["orig"]]
nrow_out <- nrow(out)
maxN <- count_by_letter[["N"]][1]
out[seq_len(nrow_out) > maxN, new := head(origs, nrow_out - maxN)]
out[seq_len(nrow_out) <= maxN, new := tail(origs, maxN)]
DT[out, j = .(orig_order, orig, new), on = "orig_order"] %>%
.[order(orig_order)] %>%
.[["new"]]
}
set.seed(1)
x <- sample(letters, size=295, replace=T)
testthat::expect_true(all(table(permute_avoid_same_position(x)) == table(x)))
testthat::expect_true(all(x != permute_avoid_same_position(x)))
microbenchmark(permute_avoid_same_position(x), times = 5)
# Unit: milliseconds
# expr min lq mean median uq max
# permute_avoid_same_position(x) 5.650378 5.771753 5.875116 5.788618 5.938604 6.226228
x <- sample(1:1000, replace = TRUE, size = 1e6)
testthat::expect_true(all(table(permute_avoid_same_position(x)) == table(x)))
testthat::expect_true(all(x != permute_avoid_same_position(x)))
microbenchmark(permute_avoid_same_position(x), times = 5)
# Unit: milliseconds
# expr min lq mean median uq max
# permute_avoid_same_position(x) 239.7744 385.4686 401.521 438.2999 440.9746 503.0875
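For reference, the same rotation idea can be sketched directly in base R (a minimal sketch with a made-up helper name, not the data.table implementation above; it assumes the permutation is feasible, i.e. max(table(y)) <= length(y) / 2):
rotate_derange <- function(y) {
  n <- length(y)
  shift <- max(table(y))  # size of the largest group of identical values
  ord <- order(y)         # group identical values into consecutive blocks
  out <- y
  # each sorted position receives the value 'shift' places further along (cyclically),
  # which is guaranteed to come from a different block of values
  out[ord] <- y[ord][(seq_len(n) + shift - 1L) %% n + 1L]
  out
}
set.seed(1)
x <- sample(letters, size = 295, replace = TRUE)
newx <- rotate_derange(x)
all(table(newx) == table(x)) && all(newx != x)
# [1] TRUE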
We could extract substrings at the boundaries of the repeating elements, then sample them and use replicate to generate several shuffles:
library(stringr)
sapply(replicate(10, sample(str_extract_all(str1, "([[:alpha:]])\\1*")[[1]]),
simplify = FALSE), paste, collapse="")
#[1] "BBAAEFDCCA" "AAAFBBEDCC" "BBAAAEFCCD" "DFACCBBAAE" "AAFCCBBEAD"
#[6] "DAAAECCBBF" "AAFCCDBBEA" "CCEFADBBAA" "BBAAEADCCF" "AACCBBDFAE"
data
str1 <- "AABBCCADEF"
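For reference, the boundary extraction on str1 yields the runs of repeated letters that then get shuffled:
str_extract_all(str1, "([[:alpha:]])\\1*")[[1]]
#[1] "AA" "BB" "CC" "A"  "D"  "E"  "F"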
What is the idiomatic way to do the action below in data.table?
library(dplyr)
df %>%
group_by(b) %>%
slice(1:10)
I can do
library(data.table)
df[, .SD[1:10]
, by = b]
but that appears much slower. Is there a better way?
set.seed(0)
df <- rep(1:500, sample(500:1000, 500, T)) %>%
data.table(a = runif(length(.))
,b = .)
f1 <- function(df){
df %>%
group_by(b) %>%
slice(1:10)
}
f2 <- function(df){
df[, .SD[1:10]
, by = b]
}
library(microbenchmark)
microbenchmark(f1(df), f2(df))
#Unit: milliseconds
# expr min lq mean median uq max neval
# f1(df) 17.67435 19.50381 22.06026 20.50166 21.42668 78.3318 100
# f2(df) 69.69554 79.43387 119.67845 88.25585 106.38661 581.3067 100
========== Benchmarks with suggested methods ==========
set.seed(0)
df <- rep(1:500, sample(500:1000, 500, T)) %>%
data.table(a = runif(length(.))
,b = .)
use.slice <- function(df){
df %>%
group_by(b) %>%
slice(1:10)
}
IndexSD <- function(df){
df[, .SD[1:10]
, by = b]
}
Index.I <- function(df) {
df[df[, .I[seq_len(10)], by = b]$V1]
}
use.head <- function(df){
df[, head(.SD, 10)
, by = b]
}
library(microbenchmark)
microbenchmark(use.slice(df)
, IndexSD(df)
, Index.I(df)
, use.head(df)
, unit = "relative"
, times = 100L)
#Unit: relative
# expr min lq mean median uq max neval
# use.slice(df) 9.804549 10.269234 9.167413 8.900060 8.782862 6.520270 100
# IndexSD(df) 38.881793 42.548555 39.044095 38.636523 39.942621 18.981748 100
# Index.I(df) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 100
# use.head(df) 3.666898 4.033038 3.728299 3.408249 3.545258 3.951565 100
We can use .I to extract the row index, which should be faster:
out <- df[df[, .I[seq_len(10)], by = b]$V1]
dim(out)
#[1] 5000 2
Checking if there are NAs (as the OP commented)
any(out[, Reduce(`|`, lapply(.SD, is.na))])
#[1] FALSE
dim(df)
#[1] 374337 2
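One caveat: if some groups had fewer than 10 rows, .I[seq_len(10)] would produce NA indices for the missing positions (hence the NA check above); a safer variant (a sketch) caps the take at the group size:
out <- df[df[, .I[seq_len(min(.N, 10L))], by = b]$V1]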
Benchmarks
f3 <- function(df) {
df[df[, .I[seq_len(10)], by = b]$V1]
}
microbenchmark(f1(df), f2(df), f3(df), unit = "relative", times = 10L)
#Unit: relative
# expr min lq mean median uq max neval cld
# f1(df) 5.727822 5.480741 4.945486 5.672206 4.317531 5.10003 10 b
# f2(df) 24.572633 23.774534 17.842622 23.070634 16.099822 11.58287 10 c
# f3(df) 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000 10 a
I have a string matrix (my_data) of dimensions 9000000x10, with each entry being a single-character string. I want to transform it into a numeric matrix using the function utf8ToInt, but it takes a long time and crashes my session.
new_matrix <- apply(my_data, 1:2, "utf8ToInt")
The result is what I expect, but I need a more efficient way of doing that.
Any help is deeply appreciated.
Imagine my data is:
my_data <- matrix(c("a","b","c","d"), ncol = 2)
but it is actually 9000000x10 instead of 2x2.
stringi::stri_enc_toutf32 may be an alternative.
From ?stri_enc_toutf32:
This function is roughly equivalent to a vectorized call to utf8ToInt(enc2utf8(str))
On a 1e3 * 2 matrix, stri_enc_toutf32 is about 10 and 20 times faster than vapply / apply + utf8ToInt respectively:
library(stringi)
library(microbenchmark)
nr = 1e3
nc = 2
m = matrix(sample(letters, nr*nc, replace = TRUE), nrow = nr, ncol = nc)
microbenchmark(
f_apply = apply(m, 1:2, utf8ToInt),
f_vapply = structure(vapply(m, utf8ToInt, numeric(1)), dim=dim(m)),
f = matrix(unlist(stri_enc_toutf32(m), use.names = FALSE), nrow = nrow(m)),
times = 10L, check = "equal")
# Unit: microseconds
# expr min lq mean median uq max neval
# f_apply 2283.4 2297.2 2351.17 2325.40 2354.5 2583.6 10
# f_vapply 1276.1 1298.0 1348.88 1322.00 1353.4 1611.3 10
# f 87.6 92.3 108.53 105.15 111.0 163.8 10
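Applied directly to the question's small my_data example, the conversion itself is just (a sketch assuming every entry is a single character, as in the question):
library(stringi)
my_data <- matrix(c("a", "b", "c", "d"), ncol = 2)
new_matrix <- matrix(unlist(stri_enc_toutf32(my_data), use.names = FALSE),
                     nrow = nrow(my_data), ncol = ncol(my_data))
new_matrix
#      [,1] [,2]
# [1,]   97   99
# [2,]   98  100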
Using vapply would be almost twice as fast. Since vapply returns a vector, it is necessary to re-establish the matrix format (here with structure).
library(microbenchmark)
my_data <- matrix(sample(letters, 2*100, replace = TRUE), ncol = 2)
microbenchmark(
apply = apply(my_data, 1:2, utf8ToInt),
vapply = structure(vapply(my_data, utf8ToInt, numeric(1)), dim=dim(my_data)),
times = 500L, check = 'equal'
)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> apply 199.201 208.001 224.811 213.801 220.1515 1560.400 500
#> vapply 111.000 115.501 136.343 120.401 124.9505 1525.901 500
I am trying to match a vector of months to the appropriate quarter in R. Unfortunately, the code I inherited stores the quarters in a list, with the corresponding months as a vector in each list element (this list is meant to be adaptable, so that you could do quarters, trimesters, or semesters if desired). Currently, I am using sapply to loop through the vector and match the appropriate quarter to each month as follows:
month.vec <- sample(1:12, 100, replace=T)
quarters.list <- list(`1` = 1:3, `2` = 4:6, `3` = 7:9, `4` = 10:12)
month.to.quarter <- function(months, quarters) {
sapply(months, FUN=function(x) {
as.numeric(substr(names(which(x == unlist(quarters))),0,1))
})
}
month.to.quarter(month.vec, quarters.list)
This works great for vectors of about length(month.vec) < 1e5 or so, but is a bit time consuming after that (see code below). Does anyone have an elegant solution for this sort of matching at vectors longer than this?
Script to show how processing time increases with vector length. NOTE: this takes a few seconds (<10)
times <- NULL
for (i in c(10 %o% 10^(2:5))) {
month.vec <- sample(1:12, i, replace=T)
quarters.list <- list(`1` = 1:3, `2` = 4:6, `3` = 7:9, `4` = 10:12)
t <- system.time(a <- month.to.quarter(month.vec, quarters.list))[3]
time <- data.frame(n = i, time = t)
times <- rbind(times, time)
}
plot(time ~ n, times)
I wonder whether it would be faster to invert the quarters list, so that the quarter can just be looked up using month as the index. Something like the following...
quarters <- as.numeric(substr(names(sort(unlist(quarters.list))),1,1))
This only needs to be done once, and then you can just do
quarters.vec <- quarters[month.vec]
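With the question's quarters.list, this inverted lookup vector works out to:
quarters
# [1] 1 1 1 2 2 2 3 3 3 4 4 4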
It is about 2000 times faster...
microbenchmark::microbenchmark(quarters[month.vec],month.to.quarter(month.vec, quarters.list))
Unit: microseconds
expr min lq mean median uq max neval
quarters[month.vec] 199.836 202.629 235.3968 227.763 233.9695 554.823 100
month.to.quarter(month.vec, quarters.list) 439466.006 456649.059 495957.5722 469543.098 499346.5020 935046.664 100
Try this:
(month.vec - 1) %/% 3 + 1
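This works because subtracting 1 makes the months zero-based, integer division by 3 collapses them into blocks of three, and adding 1 restores one-based quarter numbers. For example:
(c(1, 3, 4, 7, 12) - 1) %/% 3 + 1
# [1] 1 1 2 3 4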
This is the first method I came up with. I think I saw it in Hadley's book. It uses the names attribute of a vector as a lookup table.
month.vec <- sample(1:12, 10000, replace=T)
quarters.list <- list(`1` = 1:3, `2` = 4:6, `3` = 7:9, `4` = 10:12)
# your method
month.to.quarter <- function(months, quarters) {
sapply(months, FUN=function(x) {
as.numeric(substr(names(which(x == unlist(quarters))),0,1))
})
}
out1 <-month.to.quarter(month.vec, quarters.list)
# my method
vec <- rep(1:4, each = 3)
names(vec) <- 1:12
out2 <- vec[month.vec]
names(out2) <- NULL
all.equal(out1, out2) # this will return TRUE
The benchmark shows a dramatic difference.
month.vec <- sample(1:12, 10000, replace=T)
microbenchmark::microbenchmark(vec[month.vec],
month.to.quarter(month.vec, quarters.list))
## Unit: microseconds
## expr min lq mean median uq max neval
## vec[month.vec] 108.503 112.433 119.3982 116.916 119.983 183.467 100
## month.to.quarter(month.vec, quarters.list) 78859.160 84036.995 87956.6532 86960.269 89975.668 140797.487 100
The new method is about 800 times faster.
If you want to wrap it in a function, it looks like this and is still quite fast:
month.to.quarter2 <- function(months) {
vec <- rep(1:4, each = 3)
names(vec) <- 1:12
out <- vec[months]
names(out) <- NULL
return(out)
}
microbenchmark::microbenchmark(vec[month.vec],
month.to.quarter(month.vec, quarters.list),
month.to.quarter2(month.vec))
## Unit: microseconds
## expr min lq mean median uq max neval
## vec[month.vec] 109.222 111.6345 121.3035 115.604 117.916 706.034 100
## month.to.quarter(month.vec, quarters.list) 77292.742 83032.7425 85770.6963 84690.500 87243.327 138531.309 100
## month.to.quarter2(month.vec) 117.264 120.3555 127.6535 127.021 133.474 153.556 100
I have a data.table with variables as columns and one column that stores, as a character string, a different expression for each row. I would like to apply each expression to its own row and store the output in a new column, without using a for loop, because I need it to be fast and I have 25000 rows.
If I take a simplified example, let's say I have a data.table dt:
dt <- data.table(a=c(1,2,3),b=c(4,5,6),c=c(7,8,9),d=c("a+b+c","a*b*c","c/a*b"))
dt
a b c d
1: 1 4 7 a+b+c
2: 2 5 8 a*b*c
3: 3 6 9 c/a*b
I would like to have this as a result:
a b c d e
1: 1 4 7 a+b+c 12
2: 2 5 8 a*b*c 80
3: 3 6 9 c/a*b 18
So far the only solution I have found is a for loop, but it's slow for my 25000 rows and 32 variables:
for (i in 1:nrow(dt)){
dt[i,e:=eval(parse(text=dt[i,d]))]
}
I have been searching for quite a while (I tried eval, sapply, ...) but so far without success; I would really appreciate any suggestions.
interpret <- function(expr, .SD) eval(parse(text = expr[1]), envir = .SD)
dt[, e := interpret(d,.SD), by = d, .SDcols = c("a", "b", "c")]
dt
returns:
> dt
a b c d e
1: 1 4 7 a+b+c 12
2: 2 5 8 a*b*c 80
3: 3 6 9 c/a*b 18
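For context, the by = d grouping means each distinct expression is parsed and evaluated only once, over all rows that share it, with .SD (restricted to a, b and c via .SDcols) serving as the evaluation environment. Roughly the same thing can be written inline without the helper (a sketch using .BY to access the current value of d):
dt[, e := eval(parse(text = .BY[[1]]), envir = .SD), by = d, .SDcols = c("a", "b", "c")]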
A quick benchmark on a larger table:
l <- lapply(1:1e5, function(i) dt)
bigdt <- rbindlist(l)
bigdt[, e:=interpret(d,.SD), by=d, .SDcols=c("a", "b", "c")]
bigdt
microbenchmark(for (i in 1:nrow(dt)){
bigdt[i,e:=eval(parse(text=bigdt[i,d]))]
}, bigdt[, e:=interpret(d,.SD), by=d, .SDcols=c("a", "b", "c")])
gave me (note that the for-loop arm only iterates over 1:nrow(dt), i.e. 3 rows of bigdt, so it understates the cost of looping over the full table):
Unit: milliseconds
expr
for (i in 1:nrow(dt)) { bigdt[i, `:=`(e, eval(parse(text = bigdt[i, d])))] }
bigdt[, `:=`(e, interpret(d, .SD)), by = d, .SDcols = c("a", "b", "c")]
min lq mean median uq max neval cld
2.693427 2.833544 3.240561 3.043713 3.150880 6.212202 100 a
6.891739 7.280915 9.988198 8.496646 8.721075 69.666926 100 b
invoke_map() from package purrr is designed to iterate over a list of functions and a list of parameter sets, applying each function to its corresponding parameters.
Here is an alternative, slightly long-winded idea for how to solve this problem.
dt <- data.frame(a=rep(c(1,2,3, 5), 10),b= rep(c(4,5,6, 5),10),c=rep(c(7,8,9, 5), 10),d=rep(c("a+b+c","a*b*c","c/a*b", "a+b+c"), 10), stringsAsFactors = FALSE)
Create functions in the environment based on column d
funs_map <- data.frame()
for(i in 1:length(unique(dt$d))){
eval(parse(text = paste('f', i, '<- function(', 'a, b, c', ') { return(' , unique(dt$d)[i] , ')}', sep='')))
funs_map[i,1] <- unique(dt$d)[i]
funs_map[i,2] <- paste('f', i, sep="")
}
Create a list of functions to iterate over - this will be the .f argument to invoke_map
funs_list <- as.list(funs_map$V2[match(dt$d , funs_map$V1)])
The last column is no longer needed:
dt <- dt[-4]
Create a list of parameters for each function - this appears to be the most time consuming step
params <-vector(mode = "list", length = nrow(dt))
for(i in 1:nrow(dt)){
params[[i]] <- as.list(dt[i,])
}
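As an aside, the same row-wise parameter list can probably be built in one step with purrr::transpose(), which turns a data frame (a list of columns) into a list of per-row named lists:
# each element is a named list of one row's values, e.g. list(a = 1, b = 4, c = 7)
params <- purrr::transpose(dt)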
Iterate over functions
library(purrr)
result <- invoke_map(funs_list, params)
Putting this code into a function and benchmarking:
microbenchmark(apply_funs(dt))
Unit: milliseconds
expr min lq mean median uq max neval
apply_funs(dt) 19.27345 20.34213 21.09592 20.66714 21.63639 26.83376 100
Original code:
Unit: milliseconds
expr min
for (i in 1:nrow(dt)) { dt[i, `:=`(e, eval(parse(text = dt[i, d])))] } 353.7435
lq mean median uq max neval
358.0244 362.6764 360.3644 362.9175 439.9213 100
And tokiloutok's solution (fastest):
Unit: milliseconds
expr min
dt[, `:=`(e, interpret(d, .SD)), by = d, .SDcols = c("a", "b", "c")] 0.780877
lq mean median uq max neval
0.8148745 0.8432403 0.822787 0.8480175 1.203817 100