I have a vector, say x, which contains only the integers 0, 1, and 2. For example:
x <- c(0,1,0,2,0,0,1,0,0,1,0,0,0,1,0)
From this I would like to extract how many times zero occurs in each "pattern". In this simple example it occurs three times on its own, twice as 00 and exactly once as 000, so I would like to output something like:
0 3
00 2
000 1
My actual dataset is quite large (1000-2000 elements in the vector) and, at least in theory, the maximum number of consecutive zeros could be length(x).
1) rle: Use rle and table like this. No packages are needed.
tab <- with(rle(x), table(lengths[values == 0]))
giving:
> tab
1 2 3
3 2 1
or
> as.data.frame(tab)
Var1 Freq
1 1 3
2 2 2
3 3 1
That is, there are 3 runs of one zero, 2 runs of two zeros and 1 run of three zeros.
The output format shown in the question is not really feasible if there are very long runs, but just for fun here it is:
data.frame(Sequence = strrep(0, names(tab)), Freq = as.numeric(tab))
giving:
Sequence Freq
1 0 3
2 00 2
3 000 1
2) gregexpr: Another possibility is to use a regular expression:
tab2 <- table(attr(gregexpr("0+", paste(x, collapse = ""))[[1]], "match.length"))
giving:
> tab2
1 2 3
3 2 1
Other output formats could be derived as in (1).
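For example, a data frame in the question's format could be derived the same way as shown in (1):
data.frame(Sequence = strrep(0, names(tab2)), Freq = as.numeric(tab2))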
Note
I checked the speed with a length(x) of 2000 and (1) took about 1.6 ms on my laptop and (2) took about 9 ms.
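A sketch of how such a timing could be reproduced (illustrative only; the test vector below is an assumption, since the question only says 1000-2000 elements drawn from 0, 1, and 2):
library(microbenchmark)
set.seed(1)
x <- sample(0:2, 2000, replace = TRUE)  # hypothetical vector of the size mentioned above
microbenchmark(
  rle      = with(rle(x), table(lengths[values == 0])),
  gregexpr = table(attr(gregexpr("0+", paste(x, collapse = ""))[[1]], "match.length"))
)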
1) We can use rleid from data.table:
library(data.table)
data.table(x)[, strrep(0, sum(x == 0)), rleid(x == 0)][V1 != "", .N, V1]
# V1 N
#1: 0 3
#2: 00 2
#3: 000 1
2) Or we can use the tidyverse:
library(tidyverse)
tibble(x) %>%
  group_by(grp = cumsum(x != 0)) %>%
  filter(x == 0) %>%
  count(grp) %>%
  ungroup %>%
  count(n)
# A tibble: 3 x 2
# n nn
# <int> <int>
#1 1 3
#2 2 2
#3 3 1
3) Or we can use tabulate with rleid:
tabulate(tabulate(rleid(x)[x==0]))
#[1] 3 2 1
Benchmarks
Checking with system.time on @SymbolixAU's dataset:
system.time({
tabulate(tabulate(rleid(x2)[x2==0]))
})
# user system elapsed
# 0.03 0.00 0.03
Compared with the Rcpp function, the above is not bad at all:
system.time({
m <- zeroPattern(x2)
m[m[,2] > 0, ]
})
# user system elapsed
# 0.01 0.01 0.03
With microbenchmark, I removed the methods that consume the most time (based on @SymbolixAU's comparisons) and ran a new comparison. Note that this is still not exactly apples to apples, but it is much closer than before: in the previous comparison there was the overhead of data.table along with some formatting to replicate the OP's expected output.
microbenchmark(
akrun = {
tabulate(tabulate(rleid(x2)[x2==0]))
},
G = {
with(rle(x2), table(lengths[values == 0]))
},
sym = {
m <- zeroPattern(x2)
m[m[,2] > 0, ]
},
times = 5, unit = "relative"
)
#Unit: relative
# expr min lq mean median uq max neval cld
# akrun 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 5 a
# G 6.049181 8.272782 5.353175 8.106543 7.527412 2.905924 5 b
# sym 1.385976 1.338845 1.661294 1.399635 3.845435 1.211131 5 a
You mention a 'quite large' data set, so you can make use of C++ through Rcpp to speed this up (although the benchmarking shows the base rle solution is fairly quick anyway).
A function could be
library(Rcpp)
cppFunction('Rcpp::NumericMatrix zeroPattern(Rcpp::NumericVector x) {
  int consecutive_counter = 0;
  Rcpp::IntegerVector iv = seq(1, x.length());
  Rcpp::NumericMatrix m(x.length(), 2);
  m(_, 0) = iv;
  for (int i = 0; i < x.length(); i++) {
    if (x[i] == 0) {
      consecutive_counter++;
    } else if (consecutive_counter > 0) {
      m(consecutive_counter-1, 1)++;
      consecutive_counter = 0;
    }
  }
  if (consecutive_counter > 0) {
    m(consecutive_counter-1, 1)++;
  }
  return m;
}')
This gives you a matrix of the counts of consecutive zeros:
x <- c(0,1,0,2,0,0,1,0,0,1,0,0,0,1,0)
zeroPattern(x)
m <- zeroPattern(x)
m[m[,2] > 0, ]
# [,1] [,2]
# [1,] 1 3
# [2,] 2 2
# [3,] 3 1
On a larger data set we notice the speed improvements
set.seed(20180411)
x2 <- sample(x, 1e6, replace = T)
m <- zeroPattern(x2)
m[m[,2] > 0, ]
library(microbenchmark)
library(data.table)
microbenchmark(
akrun = {
data.table(x2)[, strrep(0, sum(x2==0)) ,rleid(x2 == 0)][V1 != "",.N , V1]
},
G = {
with(rle(x2), table(lengths[values == 0]))
},
sym = {
m <- zeroPattern(x2)
m[m[,2] > 0, ]
},
times = 5
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# akrun 3727.66899 3782.19933 3920.9151 3887.6663 4048.2275 4158.8132 5
# G 236.69043 237.32251 258.4320 246.1470 252.1043 319.8956 5
# sym 97.54988 98.76986 190.3309 225.2611 237.5781 292.4955 5
Note:
G's functions and mine return a 'table'-style answer. akrun has formatted his to output the runs as strings of zeros, so it will incur a slight cost.
Related
I have the following dataframe:
a <- seq(0, 5, by = 0.25)
b <-seq(0, 20, by = 1)
df <- data.frame(a, b)
and I'd like to create a new column "value", based on columns a and b, and the conversion table below:
a_min <- c(0,2, 0,2)
a_max <- c(2,5,2,5)
b_min <- c(0,0,10,10)
b_max <- c(10,10,30,30)
output <-c(1,2,3,4)
conv <- data.frame(a_min, a_max, b_min, b_max, output)
I've tried to do it using dplyr::mutate without much success...
require(dplyr)
mutate(df, value = calcula(conv, a, b))
longer object length is not a multiple of shorter object length
My expectation would be to obtain a dataframe like the 'df' above with the additional column value as per below:
df$value <- c(rep(1,8), rep(2,2), rep(4,11))
A relatively simple and very efficient data.table solution using binary non-equi joins:
library(data.table) # v1.10.0
setDT(conv)[setDT(df), output, on = .(a_min <= a, a_max >= a, b_min <= b, b_max >= b)]
## [1] 1 1 1 1 1 1 1 1 1 2 2 2 4 4 4 4 4 4 4 4 4 4 4
As a side note, if the output column is just the row index within conv, you could make this join even more efficient by asking only for the row indices, i.e. by specifying which = TRUE:
setDT(conv)[setDT(df), on = .(a_min <= a, a_max >= a, b_min <= b, b_max >= b), which = TRUE]
## [1] 1 1 1 1 1 1 1 1 1 2 2 2 4 4 4 4 4 4 4 4 4 4 4
One more option, this time with matrices.
with(df, with(conv, output[max.col(
outer(a, a_min, `>=`) + outer(a, a_max, `<=`) +
outer(b, b_min, `>=`) + outer(b, b_max, `<=`))]))
## [1] 1 1 1 1 1 1 1 1 1 2 2 4 4 4 4 4 4 4 4 4 4
outer compares each element of the vector from df with each element of the one from conv, producing a matrix of Booleans for each call. Since TRUE is 1, if you add all four matrices, the index you want is the column with the most TRUEs, which you can get with max.col. Subset output, and you've got your result.
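For instance, here are the summed indicator scores for the first row of df (a = 0, b = 0) against the four rows of conv (a quick illustrative check using the toy data from the question):
with(df, with(conv,
  outer(a, a_min, `>=`) + outer(a, a_max, `<=`) +
  outer(b, b_min, `>=`) + outer(b, b_max, `<=`)))[1, ]
# [1] 4 3 3 2   -> max.col picks column 1, so the first element gets output[1] = 1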
The benefit of working with matrices is that they're fast. Using @Phann's benchmarks on 1,000 rows:
Unit: microseconds
expr min lq mean median uq max neval cld
alistaire 276.099 320.4565 349.1045 339.8375 357.2705 941.551 100 a
akr1 830.934 966.6705 1064.8433 1057.6610 1152.3565 1507.180 100 ab
akr2 11431.246 11731.3125 12835.5229 11947.5775 12408.4715 36767.488 100 d
Pha 11985.129 12403.1095 13330.1465 12660.4050 13044.9330 29653.842 100 d
Ron 71132.626 74300.3540 81136.9408 78034.2275 88952.8765 98950.061 100 e
Dav1 2506.205 2765.4095 2971.6738 2948.6025 3082.4025 4065.368 100 c
Dav2 2104.481 2272.9180 2480.9570 2478.8775 2575.8740 3683.896 100 bc
and on 100,000 rows:
Unit: milliseconds
expr min lq mean median uq max neval cld
alistaire 30.00677 36.49348 44.28828 39.43293 54.28207 64.36581 100 a
akr1 36.24467 40.04644 48.46986 41.59644 60.15175 77.34415 100 a
Dav1 51.74218 57.23488 67.70289 64.11002 68.86208 382.25182 100 c
Dav2 48.48227 54.82818 60.25256 59.81041 64.92611 91.20212 100 b
We can try Map with na.locf:
library(zoo)
f1 <- function(u, v, x, y, z) z * NA^!((with(df, a >= u & a < v) & (b >= x & b < y)))
na.locf(do.call(pmax, c(do.call(Map, c(f = f1, unname(conv))), na.rm = TRUE)))
#[1] 1 1 1 1 1 1 1 1 2 2 4 4 4 4 4 4 4 4 4 4 4
Another way to write the Map solution is to pass the 'a' and 'b' columns as arguments, do the logical comparison against the columns of 'conv' to extract the 'output' value, and unlist the list output:
unlist(Map(function(x, y)
with(conv, output[x >= a_min & a_max > x & y >= b_min & b_max > y]),
df$a, df$b))
#[1] 1 1 1 1 1 1 1 1 2 2 4 4 4 4 4 4 4 4 4 4
NOTE: The second solution should be slower as we are looping through the rows of the dataset while the first solution loops through the 'conv' rows (which we assume should not be many rows)
Another approach using apply:
df$value <- unlist(apply(df, 1, function(x){
ifelse(length(OUT <- output[which(x[1] >= a_min & x[1] <= a_max & x[2] >= b_min & x[2] <= b_max)]) > 0, OUT, 0)
}))
EDIT:
Because there are several answers so far, I checked the time needed to process the data. I created a slightly bigger example (similar to the given one, with random numbers):
set.seed(23563)
a <- runif(1000, 0, 5)
b <- runif(1000, 0, 20)
df <- data.frame(a, b)
require(microbenchmark)
library(zoo)
require(data.table)
microbenchmark(
akr1 = { #akrun 1
f1 <- function(u, v, x, y, z) z * NA^!((with(df, a >= u & a <v) & (b >=x & b <y)))
na.locf(do.call(pmax, c(do.call(Map, c(f=f1, unname(conv))), na.rm = TRUE)))
},
akr2 = { #akrun 2
unlist(Map(function(x, y)
with(conv, output[x >= a_min & a_max > x & y >= b_min & b_max > y]),
df$a, df$b))
},
Pha = { #Phann
df$value <- unlist(apply(df, 1, function(x){
ifelse(length(OUT <- output[which(x[1] >= a_min & x[1] <= a_max & x[2] >= b_min & x[2] <= b_max)]) > 0, OUT, 0)
}))
},
Ron = { #Ronak Shah
unlist(mapply(function(x, y)
conv$output[x >= conv$a_min & conv$a_max > x & y >= conv$b_min & conv$b_max > y],
df$a, df$b))
},
Dav1 ={ #David Arenburg 1
setDT(conv)[setDT(df), on = .(a_min <= a, a_max >= a, b_min <= b, b_max >= b)]$output
},
Dav2 = { #David Arenburg 2
setDT(conv)[setDT(df), on = .(a_min <= a, a_max >= a, b_min <= b, b_max >= b), which = TRUE]
},
times = 100L
)
With 1000 random numbers:
# Unit: milliseconds
# expr min lq mean median uq max neval
# akr1 4.267206 4.749576 6.259695 5.351494 6.843077 54.39187 100
# akr2 33.437853 39.912785 49.932875 47.416888 57.070369 91.55602 100
# Pha 30.433779 36.939692 48.205592 46.393800 55.800204 83.91640 100
# Ron 174.765021 199.648315 227.493117 223.314661 240.579057 370.26929 100
# Dav1 6.944759 7.814469 10.685460 8.536694 11.974102 44.47915 100
# Dav2 6.106978 6.706424 8.961821 8.161707 10.376085 28.91255 100
With 10000 random numbers (same seed), I get:
# Unit: milliseconds
# expr min lq mean median uq max neval
# akr1 23.48180 24.03962 26.16747 24.46897 26.19565 41.83238 100
# akr2 357.38290 398.69965 434.92052 409.15385 440.98210 829.85113 100
# Pha 320.39285 347.66632 376.98118 361.76852 383.08231 681.28500 100
# Ron 1661.50669 1788.06228 1873.70929 1837.28187 1912.04123 2499.23235 100
# Dav1 20.91486 21.60953 23.12278 21.94707 22.42773 44.71900 100
# Dav2 19.69506 20.22077 21.63715 20.55793 21.27578 38.96819 100
Here is another attempt, utilizing findInterval's efficiency in both memory and speed. A more convenient format of the conv "data.frame" could be
(i) a "list" of the non-overlapping intervals for each variable:
vecs = list(a = unique(c(conv$a_min, conv$a_max)),
b = unique(c(conv$b_min, conv$b_max)))
vecs
#$a
#[1] 0 2 5
#
#$b
#[1] 0 10 30
and, (ii) a lookup structure that contains the group of each paired interval between the two variables:
maps = xtabs(output ~ a_min + b_min)
maps
# b_min
#a_min 0 10
# 0 1 3
# 2 2 4
where, for example, we note that the first interval of "a" and the second of "b" are assigned a "3", etc.
Then we can use:
maps[mapply(findInterval, df, vecs, all.inside = TRUE)]
# [1] 1 1 1 1 1 1 1 1 2 2 4 4 4 4 4 4 4 4 4 4 4
And extending the benchmarks of Phann and alistaire (re-written, partly, for convenience):
n = 1e6
set.seed(23563); a = runif(n, 0, 5); b = runif(n, 0, 20); df = data.frame(a, b)
library(microbenchmark); library(zoo); library(data.table)
alistaire = function() {
with(df, with(conv, output[max.col(
outer(a, a_min, `>=`) + outer(a, a_max, `<=`) +
outer(b, b_min, `>=`) + outer(b, b_max, `<=`))]))
}
david = function() {
as.data.table(conv)[setDT(df), output, on = .(a_min <= a, a_max >= a, b_min <= b, b_max >= b)]
}
akrun = function() {
f1 = function(u, v, x, y, z) z * NA^!((with(df, a >= u & a <v) & (b >=x & b <y)))
na.locf(do.call(pmax, c(do.call(Map, c(f=f1, unname(conv))), na.rm = TRUE)))
}
alex = function() {
vecs = list(a = unique(c(conv$a_min, conv$a_max)), b = unique(c(conv$b_min, conv$b_max)))
maps = xtabs(output ~ a_min + b_min)
maps[mapply(findInterval, df, vecs, all.inside = TRUE)]
}
identical(alistaire(), david())
#[1] TRUE
identical(david(), akrun())
#[1] TRUE
identical(akrun(), alex())
#[1] TRUE
microbenchmark(alistaire(), david(), akrun(), alex(), times = 20)
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# alistaire() 592.46700 718.07148 799.28933 792.98107 860.16414 1136.4489 20 b
# david() 1363.76196 1375.43935 1398.53515 1385.11747 1425.69837 1457.1693 20 d
# akrun() 824.11962 850.88831 903.58723 906.21007 958.04310 995.2129 20 c
# alex() 70.82439 72.65993 82.87961 76.77627 81.20356 179.7669 20 a
We can use mapply on the two variables a and b and find the correct output value based on the ranges:
unlist(mapply(function(x, y)
conv$output[x >= conv$a_min & conv$a_max > x & y >= conv$b_min & conv$b_max > y],
df$a, df$b))
#[1] 1 1 1 1 1 1 1 1 2 2 4 4 4 4 4 4 4 4 4 4
Let's say I have something like this:
set.seed(0)
the.df <- data.frame( x=rep(letters[1:3], each=4),
n=rep(0:3, 3),
val=round(runif(12)))
the.df
x n val
1 a 0 1
2 a 1 0
3 a 2 0
4 a 3 1
5 b 0 1
6 b 1 0
7 b 2 1
8 b 3 1
9 c 0 1
10 c 1 1
11 c 2 0
12 c 3 0
Within each x, starting from n==2 (going from small to large), I want to set val to 0 if the previous val (in terms of n) is 0; otherwise, leave it as is.
For example, in the subset x=="b", I first ignore the two rows where n < 2. Now, in Row 7, because the previous val is 0 (the.df$val[the.df$x=="b" & the.df$n==1]), I set val to 0 (the.df$val[the.df$x=="b" & the.df$n==2] <- 0). Then on Row 8, now that val for the previous n is 0 (we just set it), I also want to set val here to 0 (the.df$val[the.df$x=="b" & the.df$n==3] <- 0).
Imagine that the data.frame is not sorted. Therefore procedures that depend on the order would require a sort. I also can't assume that adjacent rows exist (e.g., the row the.df[the.df$x=="a" & the.df$n==1, ] might be missing).
The trickiest part seems to be evaluating val in sequence. I can do this using a loop but I imagine that it would be inefficient (I have millions of rows). Is there a way I can do this more efficiently?
EDIT: wanted output
the.df
x n val wanted
1 a 0 1 1
2 a 1 0 0
3 a 2 0 0
4 a 3 1 0
5 b 0 1 1
6 b 1 0 0
7 b 2 1 0
8 b 3 1 0
9 c 0 1 1
10 c 1 1 1
11 c 2 0 0
12 c 3 0 0
Also, I don't mind making new columns (e.g., putting the wanted values there).
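For reference, a naive row-by-row loop implementing the rule above might look roughly like this (a sketch only; it sorts first, writes into a new wanted column, and looks up the largest smaller n within each group so that missing adjacent rows are handled):
the.df <- the.df[order(the.df$x, the.df$n), ]
the.df$wanted <- the.df$val
for (i in seq_len(nrow(the.df))) {
  if (the.df$n[i] < 2) next                              # the rule only applies from n == 2
  prev <- which(the.df$x == the.df$x[i] & the.df$n < the.df$n[i])
  if (length(prev) == 0) next                            # no earlier row in this group
  prev <- prev[which.max(the.df$n[prev])]                # row with the immediately preceding n
  if (the.df$wanted[prev] == 0) the.df$wanted[i] <- 0
}
The which() scan makes this roughly quadratic, which is exactly why a vectorized or grouped approach is wanted.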
Using data.table I would try the following
library(data.table)
setDT(the.df)[order(n),
val := if(length(indx <- which(val[2:.N] == 0L)))
c(val[1:(indx[1L] + 1L)], rep(0L, .N - (indx[1L] + 1L))),
by = x]
the.df
# x n val
# 1: a 0 1
# 2: a 1 0
# 3: a 2 0
# 4: a 3 0
# 5: b 0 1
# 6: b 1 0
# 7: b 2 0
# 8: b 3 0
# 9: c 0 1
# 10: c 1 1
# 11: c 2 0
# 12: c 3 0
This will simultaneously order the data by n (as you said it's not ordered in real life) and recreate val by condition (meaning that if the condition is not satisfied, val will be left untouched).
Hopefully in the near future this will be implemented and then the code could potentially be
setDT(the.df)[order(n), val[n > 2] := if(val[2L] == 0) 0L, by = x]
This would be a great improvement both performance- and syntax-wise.
A base R approach might be
df <- the.df[order(the.df$x, the.df$n),]
df$val <- ave(df$val, df$x, FUN=fun)
As for fun, @DavidArenburg's answer in plain R and written a bit more poetically might be
fun0 <- function(v) {
idx <- which.max(v[2:length(v)] == 0L) + 1L
if (length(idx))
v[idx:length(v)] <- 0L
v
}
It seems like a good idea to formulate the solution as an independent function first, because then it is easy to test. fun0 fails for some edge cases, e.g.,
> fun0(0)
[1] 0 0 0
> fun0(1)
[1] 0 0 0
> fun0(c(1, 1))
[1] 1 0
A better version is
fun1 <- function(v) {
tst <- tail(v, -1) == 0L
if (any(tst)) {
idx <- which.max(tst) + 1L
v[idx:length(v)] <- 0L
}
v
}
And even better, following @Arun:
fun <- function(v)
if (length(v) > 2) c(v[1], cummin(v[-1])) else v
This is competitive (same order of magnitude) with the data.table solution, with ordering and return occurring in less than 1 s for the ~10M-row data.frame of @m-dz's timings. At a second for millions of rows, it doesn't seem worthwhile to pursue further optimization.
Nonetheless, when there are a very large number of small groups (e.g., 2M groups each of size 5), an improvement is to avoid the tapply() function call by using group identity to offset the minimum. For instance,
df <- df[order(df$x, df$n),]
grp <- match(df$x, unique(df$x)) # strictly sequential groups
keep <- duplicated(grp) # ignore the first of each group
df$val[keep] <- cummin(df$val[keep] - grp[keep]) + grp[keep]
Hmmm, should be pretty efficient if you switch to data.table...
library(data.table)
# Define the.df as a data.table (or use data.table::setDT() function)
set.seed(0)
the.df <- data.table(
x = rep(letters[1:3], each = 4),
n = rep(0:3, 3),
val = round(runif(12))
)
m_dz <- function() {
setorder(the.df, x, n)
repeat{
# Get IDs of rows to change
# ids <- which(the.df[, (n > 1) & (val == 1) & (shift(val, 1L, type = "lag") == 0)])
ids <- the.df[(n > 1) & (val == 1) & (shift(val, 1L, type = "lag") == 0), , which = TRUE]
# If no IDs break
if(length(ids) == 0){
break
}
# Set val to 0
# for (i in ids) set(the.df, i = i, j = "val", value = 0)
set(the.df, i = ids, j = "val", value = 0)
}
return(the.df)
}
Edit: The function above is slightly modified thanks to @jangorecki, i.e. it uses which = TRUE and set(the.df, i = ids, j = "val", value = 0), which made the timings much more stable (no very high max timings).
Edit: timing comparison with @David Arenburg's answer on a slightly bigger table, m_dz() updated (@FoldedChromatin's answer skipped because of different results).
My function is slightly faster in terms of median and upper quartile, but there is quite a big spread in timings (see max...); I cannot figure out why. Hopefully the timing methodology is correct (returning the results to different objects etc.).
Anything bigger will kill my PC :(
set.seed(0)
groups_ids <- replicate(300, paste(sample(LETTERS, 5, replace=TRUE), collapse = ""))
size1 <- length(unique(groups_ids))
size2 <- round(1e7/size1)
the.df1 <- data.table(
x = rep(groups_ids, each = size2), # 52 * 500 = 26000
n = rep(0:(size2-1), size1),
val = round(runif(size1*size2))
)
the.df2 <- copy(the.df1)
# m-dz
m_dz <- function() {
setorder(df1, x, n)
repeat{
ids <- df1[(n > 1) & (val == 1) & (shift(val, 1L, type = "lag") == 0), , which = TRUE]
if(length(ids) == 0){
break
}
set(df1, i = ids, j = "val", value = 0)
}
return(df1)
}
# David Arenburg
DavidArenburg <- function() {
setorder(df2, x, n)
df2[, val := if(length(indx <- which.max(val[2:.N] == 0) + 1L)) c(val[1:indx], rep(0L, .N - indx)), by = x]
return(df2)
}
library(microbenchmark)
microbenchmark(
res1 <- m_dz(),
res2 <- DavidArenburg(),
times = 100
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# res1 <- m_dz() 247.4136 268.5005 363.0117 288.4216 312.7307 7071.0960 100 a
# res2 <- DavidArenburg() 270.6074 281.3935 314.7864 303.5229 328.1210 525.8095 100 a
identical(res1, res2)
# [1] TRUE
Edit: (Old) results for even bigger table:
set.seed(0)
groups_ids <- replicate(300, paste(sample(LETTERS, 5, replace=TRUE), collapse = ""))
size1 <- length(unique(groups_ids))
size2 <- round(1e8/size1)
# Unit: seconds
# expr min lq mean median uq max neval cld
# res1 <- m_dz() 5.599855 5.800264 8.773817 5.923721 6.021132 289.85107 100 a
# res2 <- m_dz2() 5.571911 5.836191 9.047958 5.970952 6.123419 310.65280 100 a
# res3 <- DavidArenburg() 9.183145 9.519756 9.714105 9.723325 9.918377 10.28965 100 a
Why not just use by?
> set.seed(0)
> the.df <- data.frame( x=rep(letters[1:3], each=4),
n=rep(0:3, 3),
val=round(runif(12)))
> the.df
x n val
1 a 0 1
2 a 1 0
3 a 2 0
4 a 3 1
5 b 0 1
6 b 1 0
7 b 2 1
8 b 3 1
9 c 0 1
10 c 1 1
11 c 2 0
12 c 3 0
> Mod.df<-by(the.df,INDICES=the.df$x,function(x){
x$val[x$n==2]=0
Which=which(x$n==2 & x$val==0)+1
x$val[Which]=0
x})
> do.call(rbind,Mod.df)
x n val
a.1 a 0 1
a.2 a 1 0
a.3 a 2 0
a.4 a 3 0
b.5 b 0 1
b.6 b 1 0
b.7 b 2 0
b.8 b 3 0
c.9 c 0 1
c.10 c 1 1
c.11 c 2 0
c.12 c 3 0
I am trying to determine which vector in a list I have created each element comes from. I'll give a reproducible example here:
set.seed(101)
a <- runif(10, min=0, max=100)
b <- runif(10, min=0, max=100)
c <- runif(10, min=0, max=100)
d <- runif(10, min=0, max=100)
information <- list(a, b, c, d)
information.wanted <- mean(do.call(pmax, information))
The code to get information.wanted works just fine. What I am now trying to find is the individual vector in the list that each of the maximum values comes from. For example, value 1 in information.wanted (87.97...) comes from vector b in the information list. I would like to create another piece of code that gives, for each element of information.wanted, the vector it comes from.
> information.wanted
[1] 87.97957 95.68375 73.19726 93.16344 92.33189 91.34787 82.04361 81.42830 62.20120
[10] 92.48044
I have no idea how to do this though. None of the code that I've tried has gotten me anywhere close.
position.of.information.wanted <- ??
I'm looking to get something like this. A numeric vector is fine. I can supplement the values in later.
> position.of.information.wanted
[1] 2 3 ...
Any help would be greatly appreciated. Thanks.
You need to apply which.max to each "i" index of each element in "information":
f1 = function(x)
sapply(seq_along(x[[1]]), function(i) which.max(sapply(x, "[[", i)))
f1(information)
# [1] 2 3 2 2 3 4 2 4 1 4
mapply already provides that kind of "parallel" functionality:
f2 = function(x)
unlist(.mapply(function(...) which.max(c(...)), x, NULL))
f2(information)
# [1] 2 3 2 2 3 4 2 4 1 4
Or, instead of concatenating "information" in chunks, convert it to a "matrix" at the start, as David Arenburg notes in the comments, and apply which.max to its rows:
f3a = function(x)
apply(do.call(cbind, x), 1, which.max)
f3a(information)
# [1] 2 3 2 2 3 4 2 4 1 4
or its columns:
f3b = function(x)
apply(do.call(rbind, x), 2, which.max)
f3b(information)
# [1] 2 3 2 2 3 4 2 4 1 4
also, max.col is convenient for a "matrix":
f4 = function(x)
max.col(do.call(cbind, x), "first")
f4(information)
# [1] 2 3 2 2 3 4 2 4 1 4
If it weren't R, a simple loop over the elements would provide both which.max and max... but R also handles vectors:
f5 = function(x)
{
ans = rep_len(1L, length(x[[1]]))
maxs = x[[1]]
for(i in 2:length(x)) {
wh = x[[i]] > maxs
maxs[wh] = x[[i]][wh]
ans[wh] = i
}
ans #or '(data.frame(i = ans, val = maxs)' for both
}
f5(information)
# [1] 2 3 2 2 3 4 2 4 1 4
It had to end with a benchmark:
set.seed(007)
dat = replicate(13, runif(1e4), FALSE)
identical(f1(dat), f2(dat))
#[1] TRUE
identical(f2(dat), f3a(dat))
#[1] TRUE
identical(f3a(dat), f3b(dat))
#[1] TRUE
identical(f3b(dat), f4(dat))
#[1] TRUE
identical(f4(dat), f5(dat))
#[1] TRUE
microbenchmark::microbenchmark(f1(dat), f2(dat), f3a(dat), f3b(dat), f4(dat), f5(dat), do.call(pmax, dat), times = 50)
#Unit: microseconds
# expr min lq mean median uq max neval cld
# f1(dat) 274995.963 298662.210 339279.948 318937.172 350822.539 723673.972 50 d
# f2(dat) 94619.397 100079.205 114664.776 107479.127 114619.439 226733.260 50 c
# f3a(dat) 19767.925 23423.688 26382.919 25795.499 29215.839 40100.656 50 b
# f3b(dat) 20351.872 22829.997 28889.845 25090.446 30503.100 140311.058 50 b
# f4(dat) 975.102 1109.431 1546.571 1169.462 1361.733 8954.100 50 a
# f5(dat) 2427.665 2470.816 5299.386 2520.755 3197.793 112986.612 50 a
# do.call(pmax, dat) 1477.618 1530.166 1627.934 1551.046 1602.898 2814.295 50 a
I have grouped data collected at different time steps. Within each time step, there are several registrations of values. Each value may occur one or more times within and among time steps.
Some toy data:
df <- data.frame(grp = rep(1:2, each = 8),
time = c(rep(1, 3), rep(2, 2), rep(3, 3)),
val = c(1, 2, 1, 2, 3, 2, 3, 4, 1, 2, 3, 1, 1, 1, 2, 3))
df
# grp time val
# 1 1 1 1
# 2 1 1 2
# 3 1 1 1
# 4 1 2 2
# 5 1 2 3
# 6 1 3 2
# 7 1 3 3
# 8 1 3 4
# 9 2 1 1
# 10 2 1 2
# 11 2 1 3
# 12 2 2 1
# 13 2 2 1
# 14 2 3 1
# 15 2 3 2
# 16 2 3 3
Objectives
I wish to do some calculations within an expanding time window, i.e. within time step 1, within time 1 and 2 together, within 1, 2, and 3 together, and so on. Within each window, I wish to calculate the number of unique values, the number of values which have occurred more than once, and the proportion of values which have occurred more than once.
For example, in my toy data, in group (grp) 1, in the second time window (time = 1 & 2 together) three unique values (val 1, 2, 3) have been registered (n_val = 3). Two of them (1, 2) occur more than once (n_re = 2), resulting in a "re_rate" of 0.67 (see below).
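A quick base R check of that example window (a sketch only, using the toy df above):
v <- with(df, val[grp == 1 & time <= 2])   # values in grp 1, time steps 1 and 2 together
tt <- table(v)
length(tt)                           # n_val = 3 unique values
sum(tt > 1)                          # n_re  = 2 values occur more than once
round(sum(tt > 1) / length(tt), 2)   # re_rate = 0.67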
My data.table code produces the desired result. On a small data set it is slower than my base attempt, which I believe is fair enough, given some possible overhead in the data.table code. With a larger data set, the data.table code catches up, but is still slower. I expected (hoped) that the benefits would show up earlier.
What made me post this question is that I believe the relative performance of my code is a strong indicator of me abusing data.table (I am sure the reason is not data.table performance itself). The main objective of my question is therefore to get some advice on how to code this in a more data.table-esque way. For example, is it possible to avoid the loop over time windows altogether by vectorizing the calculations, as shown e.g. in the nice answer by @Khashaa here? If not, are there ways to make the loop and assignment more efficient?
My data.table code:
library(data.table)
f_dt <- function(df){
setDT(df, key = c("grp", "time", "val"))[ , {
# key or not only affects speed marginally
# unique time steps
times <- .SD[ , unique(time)]
# index vector to loop over
idx <- seq_along(times)
# pre-allocate data table
d2 <- data.table(time = times,
n_val = integer(1),
n_re = integer(1),
re_rate = numeric(1))
# loop to generate expanding window
for(i in idx){
# number of registrations per val
n <- .SD[time %in% times[seq_len(i)], .(n = .N), by = val][ , n]
# number of unique val
set(x = d2, i = i, j = 2L, length(n))
# number of val registered more than once
set(x = d2, i = i, j = 3L, sum(n > 1))
}
# proportion values registered more than once
d2[ , re_rate := round(n_re / n_val, 2)]
d2
}
, by = grp]
}
...which gives the desired result:
f_dt(df)
# grp time n_val n_re re_rate
# 1: 1 1 2 1 0.50
# 2: 1 2 3 2 0.67
# 3: 1 3 4 3 0.75
# 4: 2 1 3 0 0.00
# 5: 2 2 3 1 0.33
# 6: 2 3 3 3 1.00
Corresponding base code:
f_by <- function(df){
do.call(rbind,
by(data = df, df$grp, function(d){
times <- unique(d$time)
idx <- seq_along(times)
d2 <- data.frame(grp = d$grp[1],
time = times,
n_val = integer(1),
n_re = integer(1),
re_rate = numeric(1))
for(i in idx){
dat <- d[d$time %in% times[seq_len(i)], ]
tt <- table(dat$val)
n_re <- sum(tt > 1)
n_val <- length(tt)
re_rate <- round(n_re / n_val, 2)
d2[i, ] <- data.frame(d2$grp[1], time = times[i], n_val, n_re, re_rate)
}
d2
})
)
}
Timings:
Tiny toy data from above:
library(microbenchmark)
microbenchmark(f_by(df),
f_dt(df),
times = 10,
unit = "relative")
# Unit: relative
# expr min lq mean median uq max neval
# f_by(df) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10
# f_dt(df) 1.481724 1.450203 1.474037 1.452887 1.521378 1.502686 10
Some larger data:
set.seed(123)
df <- data.frame(grp = sample(1:100, 100000, replace = TRUE),
time = sample(1:100, 100000, replace = TRUE),
val = sample(1:100, 100000, replace = TRUE))
microbenchmark(f_by(df),
f_dt(df),
times = 10,
unit = "relative")
# Unit: relative
# expr min lq mean median uq max neval
# f_by(df) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10
# f_dt(df) 1.094424 1.099642 1.107821 1.096997 1.097693 1.194983 10
No, the data is still not large, but I would expect data.table to catch up by now, if coded properly... I believe this suggests that there is a large potential for improvement of my code. Any advice is highly appreciated.
f <- function(df){
  setDT(df)[, n_val := cumsum(!duplicated(val)), grp
    ][, occ := 1:.N, .(grp, val)
    ][, occ1 := cumsum(occ == 1) - cumsum(occ == 2), grp
    ][, n_re := n_val - occ1
    ][, re_rate := round(n_re/n_val, 2)
    ][, .(n_val = n_val[.N], n_re = n_re[.N], re_rate = re_rate[.N]), .(grp, time)]
}
where
cumsum(!duplicated(val)) counts the (cumulative) number of distinct values seen so far, n_val,
occ counts the cumulative occurrences of each value (note that it is grouped by val),
occ1 then counts the number of values in val that have occurred only once so far.
The count of values occurring only once increases by 1 when occ == 1 and decreases by 1 when occ == 2; hence cumsum(occ == 1) - cumsum(occ == 2).
The number of values which have occurred more than once is n_val - occ1 (see the small illustration below).
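A small illustration of these intermediate columns on grp 1 of the toy data (a sketch only; the toy data is re-created as a data.table so nothing is modified by reference):
library(data.table)
d <- data.table(grp = rep(1:2, each = 8),
                time = c(rep(1, 3), rep(2, 2), rep(3, 3)),
                val = c(1, 2, 1, 2, 3, 2, 3, 4, 1, 2, 3, 1, 1, 1, 2, 3))
d[, n_val := cumsum(!duplicated(val)), grp
  ][, occ := 1:.N, .(grp, val)
  ][, occ1 := cumsum(occ == 1) - cumsum(occ == 2), grp
  ][, n_re := n_val - occ1]
d[grp == 1]
#    grp time val n_val occ occ1 n_re
# 1:   1    1   1     1   1    1    0
# 2:   1    1   2     2   1    2    0
# 3:   1    1   1     2   2    1    1
# 4:   1    2   2     2   2    0    2
# 5:   1    2   3     3   1    1    2
# 6:   1    3   2     3   3    1    2
# 7:   1    3   3     3   2    0    3
# 8:   1    3   4     4   1    1    3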
Speed Comparison
set.seed(123)
df <- data.frame(grp = sample(1:100, 100000, replace = TRUE),
time = sample(1:100, 100000, replace = TRUE),
val = sample(1:100, 100000, replace = TRUE))
system.time(f(df))
# user system elapsed
# 0.038 0.000 0.038
system.time(f_dt(df))
# user system elapsed
# 16.617 0.013 16.727
system.time(f_by(df))
# user system elapsed
# 16.077 0.040 16.122
Hope this helps.
I was looking for a better way to code an expanding window of non-duplicated groups and came across this question.
This question seems to be more about an expanding window where the group (i.e. time in the question) is duplicated. Below is a solution making use of between.
#expanding group by where groups are duplicated
library(data.table)
setDT(df)
df[ , {
#get list of unique time groups to be used in the expanding group
uniqt <- unique(time)
c(list(time=uniqt), #output time as well
#expanding window of each unique time group
do.call(rbind, lapply(uniqt, function(n) {
#tabulate the occurrences
x <- table(val[between(time, uniqt[1L], n)])
#calculate desired values
n_val <- length(x)
n_re <- sum(x > 1)
data.frame(n_val=n_val, n_re=n_re, re_rate=n_re/n_val)
})))
}, by=grp]
result:
# grp time n_val n_re re_rate
# 1: 1 1 2 1 0.5000000
# 2: 1 2 3 2 0.6666667
# 3: 1 3 4 3 0.7500000
# 4: 2 1 3 0 0.0000000
# 5: 2 2 3 1 0.3333333
# 6: 2 3 3 3 1.0000000
I was unable to find in which version of data.table between was first released, so it might have been released after this question was posted.
I have a vector like the following:
xx <- c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1)
I want to find the indices of the runs of ones and combine them into start/end pairs. In this case, I want the output to be 1 6 and 11 14 in a 2x2 matrix. My vector is actually very long, so I can't do this by hand. Can anyone help me with this? Thanks.
Since the question originally had the tag 'bioinformatics', I'll mention the Bioconductor package IRanges (and its companion for ranges on genomes, GenomicRanges).
> library(IRanges)
> xx <- c(1,1,1,1,1,1,0,0,0,0,1,1,1,1)
> sl = slice(Rle(xx), 1)
> sl
Views on a 14-length Rle subject
views:
start end width
[1] 1 6 6 [1 1 1 1 1 1]
[2] 11 14 4 [1 1 1 1]
which could be coerced to a matrix, but that would often not be convenient for whatever the next step is
> matrix(c(start(sl), end(sl)), ncol=2)
[,1] [,2]
[1,] 1 6
[2,] 11 14
Other operations might start on the Rle, e.g.,
> xx = c(2,2,2,3,3,3,0,0,0,0,4,4,1,1)
> r = Rle(xx)
> m = cbind(start(r), end(r))[runValue(r) != 0,,drop=FALSE]
> m
[,1] [,2]
[1,] 1 3
[2,] 4 6
[3,] 11 12
[4,] 13 14
See the help page ?Rle for the full flexibility of the Rle class; to go from a matrix like that above to a new Rle as asked in the comment below, one might create a new Rle of appropriate length and then subset-assign using an IRanges as index
> r = Rle(0L, max(m))
> r[IRanges(m[,1], m[,2])] = 1L
> r
integer-Rle of length 14 with 3 runs
Lengths: 6 4 4
Values : 1 0 1
One could expand this to a full vector
> as(r, "integer")
[1] 1 1 1 1 1 1 0 0 0 0 1 1 1 1
but often it's better to continue the analysis on the Rle. The class is very flexible, so one way of going from xx to an integer vector of 1's and 0's is
> as(Rle(xx) > 0, "integer")
[1] 1 1 1 1 1 1 0 0 0 0 1 1 1 1
Again, though, it often makes sense to stay in Rle space. And Arun's answer to your separate question is probably best of all.
Performance (speed) is important, although in this case I think the Rle class provides a lot of flexibility that would outweigh somewhat poorer performance, and ending up at a matrix is an unlikely end point for a typical analysis. Nonetheless, the IRanges infrastructure is performant:
eddi <- function(xx)
matrix(which(diff(c(0,xx,0)) != 0) - c(0,1),
ncol = 2, byrow = TRUE)
iranges = function(xx) {
sl = slice(Rle(xx), 1)
matrix(c(start(sl), end(sl)), ncol=2)
}
iranges.1 = function(xx) {
r = Rle(xx)
cbind(start(r), end(r))[runValue(r) != 0, , drop=FALSE]
}
with
> xx = sample(c(0, 1), 1e5, TRUE)
> microbenchmark(eddi(xx), iranges(xx), iranges.1(xx), times=10)
Unit: milliseconds
expr min lq median uq max neval
eddi(xx) 45.88009 46.69360 47.67374 226.15084 234.8138 10
iranges(xx) 112.09530 114.36889 229.90911 292.84153 294.7348 10
iranges.1(xx) 31.64954 31.72658 33.26242 35.52092 226.7817 10
Something like this, maybe?
if (xx[1] == 1) {
rr <- cumsum(c(0, rle(xx)$lengths))
} else {
rr <- cumsum(rle(xx)$lengths)
}
if (length(rr) %% 2 == 1) {
rr <- head(rr, -1)
}
oo <- matrix(rr, ncol=2, byrow=TRUE)
oo[, 1] <- oo[, 1] + 1
[,1] [,2]
[1,] 1 6
[2,] 11 14
This edit takes care of cases where 1) the vector starts with a "0" rather than a "1", and 2) the number of run boundaries in rr is odd or even (i.e. whether the vector ends in a run of zeros). For example, xx <- c(1,1,1,1,1,1,0,0,0,0).
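A quick trace of that example vector (illustration only, following the branch the snippet takes):
xx <- c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0)
rr <- cumsum(c(0, rle(xx)$lengths))   # xx starts with 1, so a 0 boundary is prepended: 0 6 10
rr <- head(rr, -1)                    # odd number of boundaries, drop the trailing one: 0 6
oo <- matrix(rr, ncol = 2, byrow = TRUE)
oo[, 1] <- oo[, 1] + 1
oo
#      [,1] [,2]
# [1,]    1    6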
Another, short one:
cbind(start = which(diff(c(0, xx)) == +1),
end = which(diff(c(xx, 0)) == -1))
# start end
# [1,] 1 6
# [2,] 11 14
I tested on a very long vector and it is marginally slower than using rle, but it is more readable IMHO. If speed were really a concern, you could also do:
xx.diff <- diff(c(0, xx, 0))
cbind(start = which(head(xx.diff, -1) == +1),
end = which(tail(xx.diff, -1) == -1))
# start end
# [1,] 1 6
# [2,] 11 14
Here's another solution that's built upon the others' ideas, and is a bit shorter and faster:
matrix(which(diff(c(0,xx,0)) != 0) - c(0,1), ncol = 2, byrow = T)
# [,1] [,2]
#[1,] 1 6
#[2,] 11 14
I didn't test the non-base solution, but here's a comparison of base ones:
xx = sample(c(0,1), 1e5, T)
microbenchmark(arun(xx), flodel(xx), flodel.fast(xx), eddi(xx))
#Unit: milliseconds
# expr min lq median uq max neval
# arun(xx) 14.021134 14.181134 14.246415 14.332655 15.220496 100
# flodel(xx) 12.885134 13.186254 13.248334 13.432974 14.367695 100
# flodel.fast(xx) 9.704010 9.952810 10.063691 10.211371 11.108171 100
# eddi(xx) 7.029448 7.276008 7.328968 7.439528 8.361609 100