I have a list of vectors of integers, for example:
vec_list <- replicate(100, sample(1:10000000, size=sample(1:10000, 1)), simplify=FALSE)
And a vector of integers, for example:
vec <- sample(1:10000000, size=10000)
How can I count the number of integers in each vector in vec_list that appear in the vector vec? I can do this using a for loop. For example:
total_match <- rep(NA, length(vec_list))
for (i in seq_along(vec_list)){
  total_match[i] <- length(which(vec_list[[i]] %in% vec))
}
However, the list and vector I am trying to apply this to are very large, and this is slow. Please help with suggestions on how to improve performance.
Using data.table is much faster, but does not return 0's when there are no matches. For example:
DT <- data.table(repid=rep(1:length(vec_list), sapply(vec_list, length)), val=unlist(vec_list))
total_match2 <- DT[.(vec), on=.(val), nomatch=0L, .N, keyby=.(repid)]$N
What about:
sapply(vec_list, function(x) sum(x %in% vec))
Maybe try:
DT <- setDT(stack(setNames(vec_list, 1:length(vec_list))))
DT[, x := +(values %in% vec)][, sum(x), keyby=.(ind)]$V1
Another option, a variant of @chinsoon's:
nvec = 5000
max_size = 10000
nv = 10000000
vec_list <- replicate(nvec, sample(nv, size=sample(max_size, 1)), simplify=FALSE)
vec <- sample(nv, size=max_size)
res <- rbindlist(lapply(vec_list, list), id=TRUE)[.(vec), on=.(V1), nomatch=0, .N, keyby=.id]
# user system elapsed
# 0.86 0.20 0.47
DT <- setDT(stack(setNames(vec_list, 1:length(vec_list))))
res2 <- DT[, x := +(values %in% vec)][, sum(x), keyby=.(ind)]$V1
# user system elapsed
# 1.03 0.45 1.00
identical(res2[res2 != 0], res$N) # TRUE
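The join-based result drops vectors with no matches, which is why the comparison above has to filter out the zeros. As a rough sketch of one way to keep them (base R only, with smaller toy data than in the question; the names repid and hit are illustrative), the list can be flattened once, matched once, and counted with tabulate(), whose nbins argument reserves a slot for every vector:

```r
set.seed(42)
vec_list <- replicate(50, sample(1000, size = sample(20, 1)), simplify = FALSE)
vec <- sample(1000, size = 100)

# Group id for every element of the flattened list
repid <- rep(seq_along(vec_list), lengths(vec_list))

# One vectorised membership test over all elements
hit <- unlist(vec_list) %in% vec

# tabulate() with nbins keeps a 0 for vectors that have no matches
total_match <- tabulate(repid[hit], nbins = length(vec_list))

# Cross-check against the straightforward per-vector count
reference <- vapply(vec_list, function(x) sum(x %in% vec), integer(1))
stopifnot(identical(total_match, reference))
```

Whether this beats the data.table join on very large inputs would need to be benchmarked on the real data.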
I have a list of data frames with different sets of columns, and I would like to combine them by rows into one data frame. I currently use plyr::rbind.fill to do that. I am looking for something that does this more efficiently, but is similar to the answer given here.
sample.fun <- function() {
    # pick between 5 and 15 column names at random
    nam <- sample(LETTERS, sample(5:15, 1))
    val <- data.frame(matrix(sample(letters, length(nam)*10, replace=TRUE), nrow=10))
    setNames(val, nam)
}
ll <- replicate(1e4, sample.fun(), simplify=FALSE)
UPDATE: See this updated answer instead.
UPDATE (eddi): This has now been implemented in version 1.8.11 as a fill argument to rbind. For example:
DT1 = data.table(a = 1:2, b = 1:2)
DT2 = data.table(a = 3:4, c = 1:2)
rbind(DT1, DT2, fill = TRUE)
# a b c
#1: 1 1 NA
#2: 2 2 NA
#3: 3 NA 1
#4: 4 NA 2
FR #4790 added now - rbind.fill (from plyr) like functionality to merge list of data.frames/data.tables
Note 1:
This solution uses data.table's rbindlist function to "rbind" list of data.tables and for this, be sure to use version 1.8.9 because of this bug in versions < 1.8.9.
Note 2:
When binding lists of data.frames/data.tables, rbindlist, as of now, retains the data type that each column has in the first data.frame. That is, if a column in the first data.frame is character and the same column in the second data.frame is a factor, then rbindlist will return this column as character. So, if your data.frames consist entirely of character columns, the result of this method will be identical to that of the plyr method. If not, the values will still be the same, but some columns will be character instead of factor, and you'll have to convert them to factor yourself afterwards. Hopefully this behaviour will change in the future.
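As a minimal sketch of the manual conversion mentioned in Note 2 (the toy data and the name char_cols are illustrative, and both inputs are deliberately character so the result does not depend on the data.table version), the character columns of the bound result can be turned into factors by reference:

```r
library(data.table)

d1 <- data.frame(g = c("a", "b"), v = 1:2, stringsAsFactors = FALSE)
d2 <- data.frame(g = c("b", "c"), v = 3:4, stringsAsFactors = FALSE)

bound <- rbindlist(list(d1, d2))

# Find the character columns and convert them to factor in place
char_cols <- names(bound)[vapply(bound, is.character, logical(1))]
if (length(char_cols) > 0L) {
  bound[, (char_cols) := lapply(.SD, factor), .SDcols = char_cols]
}
str(bound)
```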
And now here's using data.table (and benchmarking comparison with rbind.fill from plyr):
library(data.table)
library(plyr)
library(microbenchmark)

rbind.fill.DT <- function(ll) {
    # changed sapply to lapply to return a list always
    all.names <- lapply(ll, names)
    unq.names <- unique(unlist(all.names))
    ll.m <- rbindlist(lapply(seq_along(ll), function(x) {
        tt <- ll[[x]]
        setattr(tt, 'class', c('data.table', 'data.frame'))
        data.table:::settruelength(tt, 0L)
        tt[, c(unq.names[!unq.names %chin% all.names[[x]]]) := NA_character_]
        setcolorder(tt, unq.names)
    }))
}

rbind.fill.PLYR <- function(ll) {
    rbind.fill(ll)
}

microbenchmark(t1 <- rbind.fill.DT(ll), t2 <- rbind.fill.PLYR(ll), times=10)
# Unit: seconds
# expr min lq median uq max neval
# t1 <- rbind.fill.DT(ll) 10.8943 11.02312 11.26374 11.34757 11.51488 10
# t2 <- rbind.fill.PLYR(ll) 121.9868 134.52107 136.41375 184.18071 347.74724 10
# for comparison change t2 to data.table
setattr(t2, 'class', c('data.table', 'data.frame'))
data.table:::settruelength(t2, 0L)
setcolorder(t2, unique(unlist(sapply(ll, names))))
identical(t1, t2) # [1] TRUE
It should be noted that plyr's rbind.fill edges past this particular data.table solution until list size of about 500.
Benchmarking plot:
Here's the plot on runs with list length of data.frames with seq(1000, 10000, by=1000). I've used microbenchmark with 10 reps on each of these different list lengths.
Benchmarking gist:
Here's the gist for benchmarking, in case anyone wants to replicate the results.
Now that rbindlist (and rbind) for data.table has improved functionality and speed with the recent changes/commits in v1.9.3 (development version), and dplyr has a faster version of plyr's rbind.fill, named rbind_all, this answer of mine seems a bit too outdated.
Here's the relevant NEWS entry for rbindlist:
o 'rbindlist' gains 'use.names' and 'fill' arguments and is now implemented entirely in C. Closes #5249
-> use.names by default is FALSE for backwards compatibility (doesn't bind by
names by default)
-> rbind(...) now just calls rbindlist() internally, except that 'use.names'
is TRUE by default, for compatibility with base (and backwards compatibility).
-> fill by default is FALSE. If fill is TRUE, use.names has to be TRUE.
-> At least one item of the input list has to have non-null column names.
-> Duplicate columns are bound in the order of occurrence, like base.
-> Attributes that might exist in individual items would be lost in the bound result.
-> Columns are coerced to the highest SEXPTYPE, if they are different, if/when possible.
-> And incredibly fast ;).
-> Documentation updated in much detail. Closes DR #5158.
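To make the use.names and fill arguments from the NEWS entry concrete, here is a small sketch on toy tables (the table contents are illustrative only). It contrasts binding by position with binding by name, and shows fill = TRUE padding a missing column with NA:

```r
library(data.table)

DT1 <- data.table(a = 1:2, b = 3:4)
DT2 <- data.table(b = 5:6, a = 7:8)   # same columns, different order

# Bind by position: DT2's first column lands under 'a'
by_pos <- rbindlist(list(DT1, DT2), use.names = FALSE)

# Bind by name: DT2's 'a' column lands under 'a'
by_name <- rbindlist(list(DT1, DT2), use.names = TRUE)

# fill = TRUE pads columns that are absent from an item with NA
filled <- rbindlist(list(DT1, data.table(a = 9L, c = "z")), fill = TRUE)

print(by_pos)
print(by_name)
print(filled)
```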
So, I've benchmarked the newer (and faster versions) on relatively bigger data below.
New Benchmark:
We'll create a total of 10,000 data.tables with columns ranging from 200-300 with the total number of columns after binding to be 500.
Functions to create data:
require(data.table) ## 1.9.3 commit 1267
require(dplyr)      ## commit 1504 devel
names = paste0("V", 1:500)
foo <- function() {
    cols = sample(200:300, 1)
    data = setDT(lapply(1:cols, function(x) sample(10)))
    setnames(data, sample(names)[1:cols])
}
n = 10e3L
ll = vector("list", n)
for (i in 1:n) {
    .Call("Csetlistelt", ll, i, foo())
}
And here are the timings:
## Updated timings on data.table v1.9.5 - three consecutive runs:
system.time(ans1 <- rbindlist(ll, fill=TRUE))
# user system elapsed
# 1.993 0.106 2.107
system.time(ans1 <- rbindlist(ll, fill=TRUE))
# user system elapsed
# 1.644 0.092 1.744
system.time(ans1 <- rbindlist(ll, fill=TRUE))
# user system elapsed
# 1.297 0.088 1.389
## dplyr's rbind_all - Timings for three consecutive runs
system.time(ans2 <- rbind_all(ll))
# user system elapsed
# 9.525 0.121 9.761
system.time(ans2 <- rbind_all(ll))
# user system elapsed
# 9.194 0.112 9.370
system.time(ans2 <- rbind_all(ll))
# user system elapsed
# 8.665 0.081 8.780
identical(ans1, setDT(ans2)) # [1] TRUE
There is still something to be gained if you parallelize both rbind.fill and rbindlist.
The results were produced with data.table version 1.8.8, as version 1.8.9 got bricked when I tried it with the parallelized function. So the results aren't identical between data.table and plyr, but they are identical within each of the data.table and plyr solutions. That is, parallel plyr matches non-parallel plyr, and parallel data.table matches non-parallel data.table.
Here's the benchmark/scripts. The parallel.rbind.fill.DT looks horrible, but that's the fastest one I could pull.
require(plyr)
require(data.table)
require(parallel)
require(rbenchmark)

# data.table::rbindlist solutions
rbind.fill.DT <- function(ll) {
  all.names <- lapply(ll, names)
  unq.names <- unique(unlist(all.names))
  rbindlist(lapply(seq_along(ll), function(x) {
    tt <- ll[[x]]
    setattr(tt, 'class', c('data.table', 'data.frame'))
    data.table:::settruelength(tt, 0L)
    tt[, c(unq.names[!unq.names %chin% all.names[[x]]]) := NA_character_]
    setcolorder(tt, unq.names)
  }))
}

parallel.rbind.fill.DT <- function(ll, cluster=NULL){
  all.names <- lapply(ll, names)
  unq.names <- unique(unlist(all.names))
  if(is.null(cluster)){
    # no cluster given: fall back to the serial version
    ll.m <- rbindlist(lapply(seq_along(ll), function(x) {
      tt <- ll[[x]]
      setattr(tt, 'class', c('data.table', 'data.frame'))
      data.table:::settruelength(tt, 0L)
      tt[, c(unq.names[!unq.names %chin% all.names[[x]]]) := NA_character_]
      setcolorder(tt, unq.names)
    }))
  }else{
    # split ll into one chunk per worker
    cores <- length(cluster)
    sequ <- as.integer(seq(1, length(ll), length.out = cores+1))
    Call <- paste(paste("list", seq(cores), sep=""), " = ll[", c(1, sequ[2:cores]+1), ":", sequ[2:(cores+1)], "]", sep="", collapse=", ")
    ll <- eval(parse(text=paste("list(", Call, ")")))
    rbindlist(clusterApply(cluster, ll, function(ll, unq.names){
      rbindlist(lapply(seq_along(ll), function(x, ll, unq.names) {
        tt <- ll[[x]]
        setattr(tt, 'class', c('data.table', 'data.frame'))
        data.table:::settruelength(tt, 0L)
        tt[, c(unq.names[!unq.names %chin% colnames(tt)]) := NA_character_]
        setcolorder(tt, unq.names)
      }, ll=ll, unq.names=unq.names))
    }, unq.names=unq.names))
  }
}

# plyr::rbind.fill solutions
rbind.fill.PLYR <- function(ll) {
  rbind.fill(ll)
}

parallel.rbind.fill.PLYR <- function(ll, cluster=NULL, magicConst=400){
  if(is.null(cluster) || ceiling(length(ll)/magicConst) < length(cluster)){
    # too little work to be worth distributing
    rbind.fill(ll)
  }else{
    cores <- length(cluster)
    sequ <- as.integer(seq(1, length(ll), length.out = ceiling(length(ll)/magicConst)))
    Call <- paste(paste("list", seq(cores), sep=""), " = ll[", c(1, sequ[2:(length(sequ)-1)]+1), ":", sequ[2:length(sequ)], "]", sep="", collapse=", ")
    ll <- eval(parse(text=paste("list(", Call, ")")))
    rbind.fill(parLapply(cluster, ll, rbind.fill))
  }
}
# Function to generate sample data of varying list length
sample.fun <- function() {
  nam <- sample(LETTERS, sample(5:15, 1))
  val <- data.frame(matrix(sample(letters, length(nam)*10, replace=TRUE), nrow=10))
  setNames(val, nam)
}

ll <- replicate(10000, sample.fun(), simplify=FALSE)
cl <- makeCluster(4, type="SOCK")
clusterEvalQ(cl, library(data.table))
clusterEvalQ(cl, library(plyr))
benchmark(t1 <- rbind.fill.PLYR(ll),
          t2 <- rbind.fill.DT(ll),
          t3 <- parallel.rbind.fill.PLYR(ll, cluster=cl, 400),
          t4 <- parallel.rbind.fill.DT(ll, cluster=cl),
          replications=5)
# Results for rbinding 10000 dataframes
# done with 4 cores, i5 3570k and 16gb memory
# test reps elapsed relative
# rbind.fill.PLYR 5 321.80 16.682
# rbind.fill.DT 5 26.10 1.353
# parallel.rbind.fill.PLYR 5 28.00 1.452
# parallel.rbind.fill.DT 5 19.29 1.000
# checking are results equal
t1 <- as.matrix(t1)
t2 <- as.matrix(t2)
t3 <- as.matrix(t3)
t4 <- as.matrix(t4)
t1 <- t1[order(t1[, 1], t1[, 2]), ]
t2 <- t2[order(t2[, 1], t2[, 2]), ]
t3 <- t3[order(t3[, 1], t3[, 2]), ]
t4 <- t4[order(t4[, 1], t4[, 2]), ]
identical(t2, t4) # TRUE
identical(t1, t3) # TRUE
identical(t1, t2) # FALSE, mismatch between plyr and data.table
As you can see, parallelizing rbind.fill made it comparable to data.table, and you could get a marginal speed increase by parallelizing data.table even with a data.frame count this low.
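The chunk-then-bind idea behind the parallel functions can also be sketched without eval(parse()). The version below is only an illustration under some assumptions: it uses rbindlist(fill = TRUE) rather than the manual column filling in this answer, parallel::splitIndices to form the chunks, and a toy list built by a hypothetical make_dt helper.

```r
library(data.table)
library(parallel)

# Toy input: data.tables with differing column sets (illustrative only)
make_dt <- function(i) {
  cols <- sample(LETTERS[1:6], 3)
  setnames(as.data.table(matrix(i, nrow = 2, ncol = 3)), cols)
}
set.seed(1)
ll <- lapply(1:40, make_dt)

n_workers <- 2
cl <- makeCluster(n_workers)
invisible(clusterEvalQ(cl, library(data.table)))

# Split the list into one contiguous chunk per worker
chunks <- lapply(splitIndices(length(ll), n_workers), function(idx) ll[idx])

# Bind each chunk on a worker, then bind the partial results locally
parts <- parLapply(cl, chunks, function(chunk) rbindlist(chunk, fill = TRUE))
stopCluster(cl)

res <- rbindlist(parts, fill = TRUE)
dim(res)
```

For a list this small the cluster overhead will dominate; the pattern only has a chance of paying off with many or large tables.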
Simply using dplyr::bind_rows will do the job:
merged_list <- bind_rows(ll)
#check it
> nrow(merged_list)
[1] 100000
> ncol(merged_list)
[1] 26
Time taken
> system.time(merged_list <- bind_rows(ll))
user system elapsed
0.29 0.00 0.28
I benchmarked a few solutions for replacing missing values per column.
df <- data.frame(replicate(3, sample(c(1:5, -99), 6, rep = TRUE)))
names(df) <- letters[1:3]
library(microbenchmark)
fix_na <- function(x) {
  x[x == -99] <- NA
  x
}
microbenchmark(
  for(i in seq_along(df)) df[, i] <- fix_na(df[, i]),
  for(i in seq_along(df)) df[[i]] <- fix_na(df[[i]]),
  df[] <- lapply(df, fix_na)
)
Unit: microseconds
expr min lq mean median uq max neval
for (i in seq_along(df)) df[, i] <- fix_na(df[, i]) 179.167 191.9060 206.1650 204.2335 211.630 364.497 100
for (i in seq_along(df)) df[[i]] <- fix_na(df[[i]]) 83.420 92.8715 104.5787 98.0080 109.309 204.645 100
df[] <- lapply(df, fix_na) 105.199 113.4175 128.0265 117.9385 126.979 305.734 100
Why is the [[]] operator subsetting the dataframe 2x faster than the [,] operator?
I included the two recommended calls from docendo discimus and increased the amount of data.
df1 <- data.frame(replicate(2000, sample(c(1:5, -99), 500, rep = TRUE)))
df2 <- df1
df3 <- df1
df4 <- df1
df5 <- df1
The results do change, but my question still stands: [[]] performs faster than [,].
Unit: milliseconds
expr min lq mean median uq max
for (i in seq_along(df1)) df1[, i] <- fix_na(df1[, i]) 301.06608 356.48011 377.31592 372.05625 392.73450 472.3330
for (i in seq_along(df2)) df2[[i]] <- fix_na(df2[[i]]) 238.72005 287.55364 301.35651 298.05950 314.04369 386.4288
df3[] <- lapply(df3, fix_na) 170.53264 189.83858 198.32358 193.43300 202.43855 284.1164
df4[df4 == -99] <- NA 75.05571 77.64787 85.59757 80.72697 85.16831 363.2223
is.na(df5) <- df5 == -99 74.44877 77.81799 84.22055 80.06496 83.01401 347.5798
A faster approach would be using set from data.table
for(j in seq_along(df)){
  set(df, i = which(df[[j]] == -99), j = j, value = NA)
}
Regarding the OP's question about benchmarking [ against [[: [[ extracts a single column with less method overhead than [.data.frame. But I would benchmark on a bigger dataset to find any real difference. Also, since NA is assigned on the same data, repeating the operation changes nothing after the first run.
library(data.table)
df1 <- data.frame(replicate(2000, sample(c(1:5, -99), 500, rep = TRUE)))
df2 <- copy(df1)
df3 <- copy(df1)
df4 <- copy(df1)
df5 <- copy(df1)
df6 <- copy(df1)
f1 <- function() for (i in seq_along(df1)) df1[, i] <- fix_na(df1[, i])
f2 <- function() for (i in seq_along(df2)) df2[[i]] <- fix_na(df2[[i]])
f3 <- function() df3[] <- lapply(df3, fix_na)
f4 <- function() df4[df4 == -99] <- NA
f5 <- function() is.na(df5) <- df5 == -99
f6 <- function() {
  for(j in seq_along(df6)){
    set(df6, i = which(df6[[j]] == -99), j = j, value = NA)
  }
}
t(sapply(paste0("f", 1:6), function(f) system.time(get(f)())))[,1:3]
# user.self sys.self elapsed
#f1 0.29 0 0.30
#f2 0.22 0 0.22
#f3 0.11 0 0.11
#f4 0.31 0 0.31
#f5 0.31 0 0.32
#f6 0.00 0 0.00
Here, I am using the system.time as the functions in the OP's post already replace the value of NA in the first run, so there is no point in running it again and again.
Got an answer for a very similar problem on the site suggested from Arun: adv-r.had.co.nz/Performance.html
At the section Extracting a single value from a data frame it says:
"The following microbenchmark shows seven ways to access a single value (the number in the bottom-right corner) from the built-in mtcars dataset. The variation in performance is startling: the slowest method takes 30x longer than the fastest. There's no reason that there has to be such a huge difference in performance. It's simply that no one has had the time to fix it."
Among the different selection methods also the two operators [[ and [ are compared with the same results as observed by me. [[ outperforms [
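As a small sketch for checking this on your own machine (the data size and the helper time_it are illustrative, and no particular timings are claimed), the three ways of pulling out one column below return identical vectors but go through different amounts of dispatch code:

```r
set.seed(1)
df <- data.frame(replicate(200, sample(c(1:5, -99), 500, replace = TRUE)))

by_bracket <- df[, 1]           # goes through `[.data.frame`
by_double  <- df[[1]]           # goes through `[[.data.frame`
by_subset2 <- .subset2(df, 1)   # plain list extraction, no S3 dispatch

# All three return the same column
stopifnot(identical(by_bracket, by_double),
          identical(by_double, by_subset2))

# Rough timing helper: repeat an extraction n times and report elapsed seconds
time_it <- function(f, n = 2000) {
  system.time(for (k in seq_len(n)) f())[["elapsed"]]
}
timings <- c(
  bracket = time_it(function() df[, 1]),
  double  = time_it(function() df[[1]]),
  subset2 = time_it(function() .subset2(df, 1))
)
print(timings)
```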
I have a dataset of 8,000,000 rows with 100 columns in a data.table where each column is a count. I need to find the maximum count in each row and which column this maximum is in.
I can quickly get which column has the maximum value for each row using
dt <- dt[, maxCol := which.max(.SD), by=pmxid]
but trying to get the actual maximum value using
dt <- dt[, nmax := max(.SD), by=pmxid]
is incredibly slow. I ran it for nearly 20 mins and only 200,000 row maximums had been calculated. Finding the max column took approx. 2 mins for all 8,000,000 rows.
How come finding the maximum takes so long? Shouldn't it take the same time as which.max() or less?
Though, you are seeking a data.table solution, here is a base R solution which would be fast enough for your dataset.
indx <- max.col(df, ties.method='first')
df[cbind(1:nrow(df), indx)]
On a slightly bigger dataset, the system.time comparisons were:
indx <- max.col(df1, ties.method='first')
res <- df1[cbind(1:nrow(df1), indx)]
# user system elapsed
# 2.180 0.163 2.345
df1$pmxid <- 1:nrow(df1)
dt <- as.data.table(df1)
system.time(dt[, nmax:= max(.SD), by= pmxid])
# user system elapsed
#1265.792 2.305 1267.836
So the base R method is far faster than the data.table method in the post.
df <- as.data.frame(matrix(sample(c(NA,0:20), 20*10,
replace=TRUE), ncol=10))
#if there are NAs, change it to lowest number
df[is.na(df)] <- -999
df1 <- as.data.frame(matrix(sample(c(NA,0:20), 100*1e6,
replace=TRUE), ncol=100))
df1[is.na(df1)] <- -999
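As an alternative to recoding NA to a sentinel such as -999, the row maximum itself can be computed with pmax and na.rm = TRUE. This is only a sketch on a toy data frame (the name row_max is illustrative), and it covers the maximum value only, not the column index from max.col:

```r
df <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 2))

# pmax() takes the columns as separate arguments; na.rm = TRUE ignores NAs
# and yields NA only for rows where every value is missing
row_max <- do.call(pmax, c(as.list(df), list(na.rm = TRUE)))
print(row_max)
```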
For the maximum over columns in a data.table,
dt[, max:= do.call(pmax, .SD)]
is much faster than dt[, nmax:= max(.SD), by= 1:nrow(dt)], and faster than the above base R solution:
dfi <- as.data.frame(matrix(runif(ncols*nrows), ncol = ncols, nrow = nrows))
indx <- max.col(dfi, ties.method='first')
dfi$max <- dfi[cbind(1:nrow(dfi), indx)]
# user system elapsed
# 8.89 1.37 10.45
dt <- as.data.table(dfi)
dt[, max:= do.call(pmax, .SD)]
# user system elapsed
# 3.31 0.01 3.33
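Putting the two pieces together, here is a minimal sketch that computes both the row maximum and the column holding it without grouping by row. The table contents are toy data and count_cols is an illustrative name; .SDcols is used so that the id column pmxid is left out of the comparison:

```r
library(data.table)

dt <- data.table(pmxid = 1:4,
                 c1 = c(1L, 5L, 2L, 0L),
                 c2 = c(3L, 4L, 2L, 9L),
                 c3 = c(2L, 6L, 1L, 9L))
count_cols <- setdiff(names(dt), "pmxid")

# Row-wise maximum over the count columns, vectorised across all rows
dt[, nmax := do.call(pmax, .SD), .SDcols = count_cols]

# Position (within count_cols) of the first column holding that maximum
dt[, maxCol := max.col(as.matrix(.SD), ties.method = "first"), .SDcols = count_cols]

print(dt)
```

Note that as.matrix(.SD) makes a full copy of the count columns, which may matter at 8,000,000 x 100.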
Once you have calculated the Colmax index, use the index to retrieve the maximum in each row
dt[Colmax == <value>]
dt[J(<values>), on = 'Colmax']
Also, there is a problem with
dt[, nmax := max(.SD), by = pmxid]
This first collates everything in .SD into a single vector of length nrow(.SD) * length(.SD) (see the Note in the Description of max()).
Instead try:
dt[, nmax := apply(.SD, 1, max), by = pmxid]
or, the parallel max:
dt[, nmax := do.call(pmax, .SD), by = pmxid]
I have a list x with millions of entries in it. And I want to put all the entries with a length larger than one into a new list z. How can I do this efficiently in R?
I tried this code, and R just keeps running for a long time.
for(i in 1:length(x)) {
  if(length(x[[i]])!=1) z=list(z,x[[i]])
}
This is one case where you want to use vapply:
z <- x[vapply(x, length, integer(1)) > 1L]
Here are benchmarks comparing sapply and vapply:
A <- list( x = c(), y = c(1), z = c(1, 2))
B <- A[sample(1:3, 1e7, replace = TRUE)]
system.time(sapply(B, length))
# user system elapsed
# 55.95 0.54 56.50
system.time(vapply(B, length, integer(1)))
# user system elapsed
# 6.78 0.00 6.78
Just do:
z = x[sapply(x, length) > 1]
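In R 3.2.0 and later, base R also provides lengths(), which returns the length of every list element in one call. A short sketch on a toy list (contents are illustrative):

```r
x <- list(1, 1:3, character(0), c("a", "b"), NULL)

# Keep only the entries with more than one element
z <- x[lengths(x) > 1L]

# Same selection as the vapply version
z_vapply <- x[vapply(x, length, integer(1)) > 1L]
stopifnot(identical(z, z_vapply))
str(z)
```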
I have a data.frame A
and a data.frame B which contains a subset of A
How can I create a data.frame C which is data.frame A with data.frame B excluded?
Thanks for your help.
Get the rows in A that aren't in B:
C = A[! data.frame(t(A)) %in% data.frame(t(B)), ]
If this B data set is truly a nested version of the first data set there has to be indexing that created this data set to begin with. IMHO we shouldn't be discussing the differences between the data sets but negating the original indexing that created the B data set to begin with. Here's an example of what I mean:
A <- mtcars
B <- mtcars[mtcars$cyl==6, ]
C <- mtcars[mtcars$cyl!=6, ]
A <- data.frame(x = 1:10, y = 1:10)
#Random subset of A in B
B <- A[sample(nrow(A),3),]
#get A that is not in B
C <- A[-as.integer(rownames(B)),]
Performance test vis-a-vis mplourde's answer:
f1 <- function() A[- as.integer(rownames(B)),]
f2 <- function() A[! data.frame(t(A)) %in% data.frame(t(B)), ]
benchmark(f1(), f2(), replications = 10000,
          columns = c("test", "elapsed", "relative"),
          order = "elapsed")
test elapsed relative
1 f1() 1.531 1.0000
2 f2() 8.846 5.7779
Looking at the rownames is approximately 6x faster. Two calls to transpose can get expensive computationally.
If B is truly a subset of A, which you can check with:
if(!identical(A[rownames(B), , drop = FALSE], B)) stop("B is not a subset of A!")
then you can filter by rownames:
C <- A[!rownames(A) %in% rownames(B), , drop = FALSE]
C <- A[setdiff(rownames(A), rownames(B)), , drop = FALSE]
Here are two data.table solutions that will be memory and time efficient
# some biggish data
ADT <- data.table(x = seq.int(1e+07), y = seq.int(1e+07))
.rows <- sample(nrow(ADT), 30000)
# Random subset of A in B
BDT <- ADT[.rows, ]
# set keys for fast merge
setkey(ADT, x)
setkey(BDT, x)
## the data.table way: CDT <- ADT[-ADT[BDT, which = TRUE]]
## the same data as data.frames, for the fastest base R alternative
A <- copy(ADT)
setattr(A, "class", "data.frame")
B <- copy(BDT)
setattr(B, "class", "data.frame")
f2 <- function() noBDT <- ADT[-ADT[BDT, which = T]]
f3 <- function() noBDT2 <- ADT[-BDT[, x]]
f1 <- function() noB <- A[-as.integer(rownames(B)), ]
benchmark(base = f1(),DT = f2(), DT2 = f3(), replications = 3)
## test replications elapsed relative user.self sys.self
## 2 DT 3 0.92 1.108 0.77 0.15
## 1 base 3 3.72 4.482 3.19 0.52
## 3 DT2 3 0.83 1.000 0.72 0.11
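Both data.table variants above work through row positions. As a sketch of a related option, data.table also supports a not-join, X[!Y, on = ...], which drops the rows of X that match Y on the given columns without computing positions first. The toy tables below are illustrative, and this form was not part of the benchmark:

```r
library(data.table)

ADT <- data.table(x = 1:10, y = 1:10)
BDT <- ADT[c(2, 5, 9)]   # a subset of ADT

# Not-join: rows of ADT with no match in BDT on both columns
CDT <- ADT[!BDT, on = c("x", "y")]
print(CDT)
```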
This is not the fastest and is likely to be very slow, but it is an alternative to mplourde's that takes the row data into account and should work on mixed data, which flodel critiqued. It relies on the paste2 function from the qdap package, which isn't available yet, as I plan to release it within the next month or two:
The paste2 function:
paste2 <- function(multi.columns, sep=".", handle.na=TRUE, trim=TRUE){
    # strip leading/trailing whitespace from every column
    if (trim) multi.columns <- lapply(multi.columns, function(x) {
        gsub("^\\s+|\\s+$", "", x)
    })
    if (!is.data.frame(multi.columns) && is.list(multi.columns)) {
        multi.columns <- do.call('cbind', multi.columns)
    }
    m <- if(handle.na){
        # a row containing any NA yields NA instead of a pasted string
        apply(multi.columns, 1, function(x){if(any(is.na(x))){
                NA
            } else {
                paste(x, collapse = sep)
            }
        })
    } else {
        apply(multi.columns, 1, paste, collapse = sep)
    }
    names(m) <- NULL
    return(m)
}
# Flodel's mixed data set:
A <- data.frame(x = 1:4, y = as.character(1:4)); B <- A[1:2, ]
# My approach:
A[!paste2(A)%in%paste2(B), ]