Speeding up searching for indices within a Large R Data Frame - r

This may look like an innocuously simple problem but it takes a very long time to execute. Any ideas for speeding it up or vectorization etc. would be greatly appreciated.
I have a R data frame with 5 million rows and 50 columns : OriginalDataFrame
A list of Indices from that Frame : IndexList (55000 [numIndex] unique indices)
Its a time series so there are ~ 5 Million rows for 55K unique indices.
The OriginalDataFrame has been ordered by dataIndex. All the indices in IndexList are not present in OriginalDataFrame. The task is to find the indices that are present, and construct a new data frame : FinalDataFrame
Currently I am running this code using library(foreach):
FinalDataFrame <- foreach (i=1:numIndex, .combine="rbind") %dopar% {
OriginalDataFrame[(OriginalDataFrame$dataIndex == IndexList[i]),]
}
I run this on a machine with 24 cores and 128GB RAM and yet this takes around 6 hours to complete.
Am I doing something exceedingly silly or are there better ways in R to do this?

Here's a little benchmark comparing data.table to data.frame. If you know the special data table invocation for this case, it's about 7x faster, ignoring the cost of setting up the index (which is relatively small, and would typically be amortised across multiple calls). If you don't know the special syntax, it's only a little faster. (Note the problem size is a little smaller than the original to make it easier to explore)
library(data.table)
library(microbenchmark)
options(digits = 3)
# Regular data frame
df <- data.frame(id = 1:1e5, x = runif(1e5), y = runif(1e5))
# Data table, with index
dt <- data.table(df)
setkey(dt, "id")
ids <- sample(1e5, 1e4)
microbenchmark(
df[df$id %in% ids , ], # won't preserve order
df[match(ids, df$id), ],
dt[id %in% ids, ],
dt[match(ids, id), ],
dt[.(ids)]
)
# Unit: milliseconds
# expr min lq median uq max neval
# df[df$id %in% ids, ] 13.61 13.99 14.69 17.26 53.81 100
# df[match(ids, df$id), ] 16.62 17.03 17.36 18.10 21.22 100
# dt[id %in% ids, ] 7.72 7.99 8.35 9.23 12.18 100
# dt[match(ids, id), ] 16.44 17.03 17.36 17.77 61.57 100
# dt[.(ids)] 1.93 2.16 2.27 2.43 5.77 100
I had originally thought you might also be able to do this with
rownames, which I thought built up a hash table and did the indexing
efficiently. But that's obviously not the case:
df2 <- df
rownames(df2) <- as.character(df$id)
df2[as.character(ids), ],
microbenchmark(
df[df$id %in% ids , ], # won't preserve order
df2[as.character(ids), ],
times = 1
)
# Unit: milliseconds
# expr min lq median uq max neval
# df[df$id %in% ids, ] 15.3 15.3 15.3 15.3 15.3 1
# df2[as.character(ids), ] 3609.8 3609.8 3609.8 3609.8 3609.8 1

If you have 5M rows and you use == to identify rows to subset, then for each pass of your loop, you are performing 5M comparisons. If you instead key your data (as it inherently is) then you can increase efficiency significantly:
library(data.table)
OriginalDT <- as.data.table(OriginalDataFrame)
setkey(OriginalDT, dataIndex)
# Now inside your foreach:
OriginalDT[ .( IndexList[[i]] ) ]
Note that the setkey function uses a very fast implementation of radix sort. However if your data is already guaranteed to be sorted, #eddi or #arun had posted a nice hack to simply set the attribute to the DT. (I can't find it right now, but perhaps someone can edit this answer and link to it).
You might try just collecting all the results into a list of data.tables then using rbindlist and compare the speed against using .combine=rbind (if you do, please feel free to post benchmark results). I've never tested .combine=rbindlist but that might work as well and would be interesting to try.
edit:
If the sole task is to index the data.frame, then simply use:
dataIndex[ .( IndexList ) ]
No foreach necessary and you still leverage the key's DT

Check data.table package. It works just like data.frame but faster.
Like this (where df is your data frame):
table <- data.table(df)
and use table

Related

equivalent of melt+reshape that splits on column names

Point: if you are going to vote to close, it is poor form not to give a reason why. If it can be improved without requiring a close, take the 10 seconds it takes to write a brief comment.
Question:
How do I do the following "partial melt" in a way that memory can support?
Details:
I have a few million rows and around 1000 columns. The names of the columns have 2 pieces of information in them.
Normally I would melt to a data frame (or table) comprised of a pair of columns, then I would split on the variable name to create two new columns, then I would cast using one of the new splits for new column names, and one for row names.
This isn't working. My billion or so rows of data are making the additional columns overwhelm my memory.
Outside the "iterative force" (as opposed to brute force) of a for-loop, is there a clean and effective way to do this?
Thoughts:
this is a little like melt-colsplit-cast
libraries common for this seem to be "dplyr", "tidyr", "reshape2", and "data.table".
tidyr's gather+separate+spread looks good, but doesn't like not having a unique row identifier
reshape2's dcast (I'm looking for 2d output) wants to aggregate
brute force loses the labels. By brute force I mean df <- rbind(df[,block1],...) where block is the first 200 column indices, block2 is the second, etcetera.
Update (dummy code):
#libraries
library(stringr)
#reproducibility
set.seed(56873504)
#geometry
Ncol <- 2e3
Nrow <- 1e6
#column names
namelist <- numeric(length=Ncol)
for(i in 1:(Ncol/200)){
col_idx <- 1:200+200*(i-1)
if(i<26){
namelist[col_idx] <- paste0(intToUtf8(64+i),str_pad(string=1:200,width=3,pad="0"))
} else {
namelist[col_idx] <- paste0(intToUtf8(96+i),str_pad(string=1:200,width=3,pad="0"))
}
}
#random data
df <- as.data.frame(matrix(runif(n=Nrow*Ncol,min=0, max=16384),nrow=Nrow,ncol=Ncol))
names(df) <- namelist
The output that I would be looking for would have a column with the first character of the current name (single alphabet character) and colnames would be 1 to 200. It would be much less wide than "df" but not fully melted. It would also not kill my cpu or memory.
(Ugly/Manual) Brute force version:
(working on it... )
Here are two options both using data.table.
If you know that each column string always has 200 (or n) fields associated with it (i.e., A001 - A200), you can use melt() and make a list of measurement variables.
melt(dt
, measure.vars = lapply(seq_len(Ncol_p_grp), seq.int, to = Ncol_p_grp * n_grp, by = Ncol_p_grp)
, value.name = as.character(seq_len(Ncol_p_grp))
)[, variable := rep(namelist_letters, each = Nrow)][]
#this data set used Ncol_p_grp <- 5 to help condense the data.
variable 1 2 3 4 5
1: A 0.2655087 0.06471249 0.2106027 0.41530902 0.59303088
2: A 0.3721239 0.67661240 0.1147864 0.14097138 0.55288322
3: A 0.5728534 0.73537169 0.1453641 0.45750426 0.59670404
4: A 0.9082078 0.11129967 0.3099322 0.80301300 0.39263068
5: A 0.2016819 0.04665462 0.1502421 0.32111280 0.26037592
---
259996: Z 0.5215874 0.78318812 0.7857528 0.61409610 0.67813484
259997: Z 0.6841282 0.99271480 0.7106837 0.82174887 0.92676493
259998: Z 0.1698301 0.70759513 0.5345685 0.09007727 0.77255570
259999: Z 0.2190295 0.14661878 0.1041779 0.96782695 0.99447460
260000: Z 0.4364768 0.06679642 0.6148842 0.91976255 0.08949571
Alternatively, we can use rbindlist(lapply(...)) to go through the data set and subset it based on the letter within the columns.
rbindlist(
lapply(namelist_letters,
function(x) setnames(
dt[, grep(x, names(dt), value = T), with = F]
, as.character(seq_len(Ncol_p_grp)))
)
, idcol = 'ID'
, use.names = F)[, ID := rep(namelist_letters, each = Nrow)][]
With 78 million elements in this dataset, it takes around a quarter of a second. I tried to up it to 780 million, but I just don't really have the RAM to generate the data that quickly in the first place.
#78 million elements - 10,000 rows * 26 grps * 200 cols_per_group
Unit: milliseconds
expr min lq mean median uq max neval
melt_option 134.0395 135.5959 137.3480 137.1523 139.0022 140.8521 3
rbindlist_option 290.2455 323.4414 350.1658 356.6373 380.1260 403.6147 3
Data: Run this before everything above:
#packages ----
library(data.table)
library(stringr)
#data info
Nrow <- 10000
Ncol_p_grp <- 200
n_grp <- 26
#generate data
set.seed(1)
dt <- data.table(replicate(Ncol_p_grp * n_grp, runif(n = Nrow)))
names(dt) <- paste0(rep(LETTERS[1:n_grp], each = Ncol_p_grp)
, str_pad(rep(seq_len(Ncol_p_grp), n_grp), width = 3, pad = '0'))
#first letter
namelist_letters <- unique(substr(names(dt), 1, 1))

data.table. Fast way to count number of changes within every column

I would like to know how many times each variable changes within each group and later add the result for all groups.
I've found this way:
mi[,lapply(.SD, function(x) sum(x != shift(x),
na.rm=T) ), by = ID][,-1][,lapply(.SD,sum, na.rm=T)]
It works, it produces the proper result but it's really slow in my large datatable.
I would like to do both operations inside the same lapply (or something faster and more compact), but the first one is done by group, the second isn't.
It could be written in an easier way (maybe not always)
mi[,lapply(.SD, function(x) sum(x != shift(x),
na.rm=T) )] [,-1]-mi[,length(unique(ID))]+1
But it's still slow and needs a lot of memory.
Any other idea?
I've also tried with diffs instead of shift, but it becomes more difficult.
Here you have a dummy example:
mi <- data.table(ID=rep(1:3,each=4) , year=rep(1:4, times=3),
VREP=rep(1:3,each=4) , VDI=rep(1:4, times=3), RAN=sample(12))
mi <- rbind(mi, data.table(4,1,1,1,0), use.names=F)
Big example for benchmark
mi <- as.data.table(matrix(sample(0:100,10000000,
replace=T), nrow=100000, ncol=100))
mi[,ID := rep(1:1000,each=100)]
My problem is that the true dataset is much bigger, it's in the limit of memory size, then I've configured R to be able to use more memory using the pagefile, and it makes many operations slow.
I know I could do it splitting the file and joining it again, but sometimes that makes things more difficult or some operations are not splittable.
Your second method produces incorrect results, so is not a fair comparison point. Here's an optimized version of alexis_laz's suggestion instead:
setorder(mi, ID)
setDT(Map(`!=`, mi, shift(mi)))[,
lapply(lapply(.SD, `&`, !ID), sum, na.rm = T), .SDcols = -"ID"]
# year VREP VDI RAN
#1: 9 0 9 9
On your bigger sample:
setorder(mi, ID)
microbenchmark(method1(), alexis_laz(), eddi(), times = 5)
#Unit: milliseconds
# expr min lq mean median uq max neval
# method1() 7336.1830 7510.9543 7932.0476 8150.3197 8207.2181 8455.563 5
# alexis_laz() 1350.0338 1492.3793 1509.0790 1492.5426 1577.3318 1633.107 5
# eddi() 400.3999 475.6908 494.5805 504.6163 524.2077 567.988 5

How can I add a column with the names of the nth list element to each nth element of the list?

Say I have
library(dplyr)
a <- list(a=tbl_df(cars), b=tbl_df(iris))
How can I add to each element of this list a column name whose values are the name of the named element of the list?
For instance, this how the output should look like for the first element
Source: local data frame [50 x 3]
speed dist name
(dbl) (dbl) (chr)
1 4 2 a
2 4 10 a
3 7 4 a
4 7 22 a
5 8 16 a
6 9 10 a
7 10 18 a
8 10 26 a
9 10 34 a
10 11 17 a
After all this commenting, guess I'll write an answer.
You should use a for loop for this: it's quick to code, quick to execute, readable and straightforward:
for (i in seq_along(a)) a[[i]]$name = names(a)[i]
You could use map or mapply or lapply instead of the for loop. In this case, I would think it will be less readable.
You could also use mutate instead of [ for adding the column. This will be slower:
library(microbenchmark)
library(dplyr)
cars_tbl = tbl_df(cars)
mbm = microbenchmark
mbm(
mutate = {cars_tbl = mutate(cars_tbl, name = 'a')},
base = {cars_tbl['name'] = 'a'}
)
# Unit: microseconds
# expr min lq mean median uq max neval cld
# mutate 240.617 262.4730 293.29001 276.158 299.7255 813.078 100 b
# base 34.971 42.1935 55.46356 53.407 57.3980 226.932 100 a
For such a simple operation, [<- is going to be hard to beat. data.table will probably be faster, but only if the object is already a data.table. If the object is a data.frame rather than a tbl_df, then the mutate is about twice as slow. But these differences are in microseconds. Unless you are repeatedly doing this operation to lists of at least hundreds of thousands of data frames it won't matter.
This is not to say dplyr has poor performance - when you are using the grouping operations, relying on the NSE built in to dplyr, it's excellent. This is just a simple case where the simple base solution is easiest and also quickest.
If we increase the size of the data enough so that it takes a noticeable amount of time to do these operations (10 million rows, here), the differences essentially go away:
df = tbl_df(data.frame(x = rep(1, 1e7)))
mbm(
mutate = {df = mutate(df, name = 'a')},
base = {df['name'] = 'a'}
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# mutate 58.08095 59.87531 132.3180 105.22507 207.6439 261.8121 100 a
# base 52.09899 53.96386 129.9304 99.96153 203.8581 237.0084 100 a
Implementing with for loops and with map, comparing [<- and mutate
# base for loop
for (i in seq_along(a)) {
a[[i]]$name = names(a)[i]
}
# dplyr in for loop
for (i in seq_along(a)) {
a[[i]] = mutate(a[[i]], name = names(a)[i])
}
# dplyr hiding the loop in Map()
a = Map(function(x, y) mutate(x, name = y), x = a, y = names(a))
We could benchmark these (I did -- see the edit history if you want the results), but the differences are less than 1 millisecond so it shouldn't matter. Go with whatever is easiest for you to read, write, and understand.
All this comes with the caveat that if your eventual goal is to bind these data frames together and you want the name column to see what list element the data came from, then that is implemented directly in dplyr::bind_rows.

Match rows on several variables efficiently - exclusion of columns

I am working on a financial problem of deleting messages from a financial center. I am using data.table and I am very satisfied with its performance and easy handling.
Though, I ask myself always how to improve and use the whole power of data.table.
Here is an example of my task:
set.seed(1)
DT <- data.table(SYM = c(rep("A", 10), rep("B", 12)), PRC = format(rlnorm(22, 2), digits = 2), VOL = rpois(22, 312), ID = c(seq(1000, 1009), seq(1004, 1015)), FLAG = c(rep("", 8), "R", "A", rep("", 4), "R", rep("", 7)))
DT$PRC[9] <- DT$PRC[6]
DT$PRC[7] <- DT$PRC[6]
DT$VOL[9] <- DT$VOL[6]
DT$VOL[7] <- DT$VOL[6]
DT$PRC[15] <- DT$PRC[13]
DT$VOL[15] <- DT$VOL[13]
## See the original dataset
DT
## Set the key
setkey(DT, "SYM", "PRC", "VOL", "FLAG")
## Get all rows, that match a row with FLAG == "R" on the given variables in the list
DT[DT[FLAG == "R"][,list(SYM, PRC, VOL)]]
## Remove these rows from the dataset
DT <- DT[!DT[FLAG == "R"][,list(SYM, PRC, VOL)]]
## See the modified data.table
DT
My questions are now:
Is this an efficient way to perform my task or does there exist something more 'data.table' style? Is the key set efficiently?
How can I perform my task if I do not only have three variables to match on (here: SYM, PRC, VOL) but a lot more, does there exist something like exclusion (I do know I can use it data.frame style but I want to know if there is a more elegant way for a data.table)?
What is with the copying in the last command? Following the thread on remove row by reference, I think copying is the only way to do it. What if I have several tasks, can I compound them in a way and avoid copying for each task?
I'm confused why you're setting the key to FLAG, isn't what you want simply
setkey(DT, SYM, PRC, VOL)
DT[!DT[FLAG == "R"]]
If you are only setting the key to perform this operation, #eddi's answer is the best and easiest to read.
setkey(DT, SYM, PRC, VOL)
# ^ as in #eddi's answer, since you are not using the rest of the key
microbenchmark(
notjoin=DT[!DT[FLAG == "R"][,list(SYM, PRC, VOL)]],
logi_not=DT[!DT[,rep(any(FLAG=='R'),.N),by='SYM,PRC,VOL']$V1],
idx_not=DT[!DT[,if(any(FLAG=='R')){.I}else{NULL},by='SYM,PRC,VOL']$V1],
SD=DT[,if(!any(FLAG=='R')){.SD}else{NULL},by='SYM,PRC,VOL'],
eddi=DT[!DT[FLAG == "R"]],
times=1000L
)
results:
Unit: milliseconds
expr min lq median uq max neval
notjoin 4.983404 5.577309 5.715527 5.903417 66.468771 1000
logi_not 4.393278 4.960187 5.097595 5.273607 66.429358 1000
idx_not 4.523397 5.139439 5.287645 5.453129 15.068991 1000
SD 3.670874 4.180012 4.308781 4.463737 9.429053 1000
eddi 2.767599 3.047273 3.137979 3.255680 11.970966 1000
On the other hand, several of options above do not require that your operation involve grouping by the key. Suppose you either...
are doing this once using groups other than the key (which you don't want to change) or
want to perform several operations like this using different groupings before doing the copy operation to drop rows, newDT <- DT[...] (as mentioned in the OP's point 3).
.
setkey(DT,NULL)
shuffDT <- DT[sample(1:nrow(DT))] # not realistic, of course
# same benchmark with shuffDT, but without methods that require a key
# Unit: milliseconds
# expr min lq median uq max neval
# logi_not 4.466166 5.120273 5.298174 5.562732 64.30966 1000
# idx_not 4.623821 5.319501 5.517378 5.799484 15.57165 1000
# SD 4.053672 4.448080 4.612213 4.849505 66.76140 1000
In these cases, the OP's and eddi's methods are not available (since joining requires a key). For a one-off operation, using .SD seems faster. For subsetting by multiple criteria, you'll want to keep track of the rows you want to keep/drop before making the copy newDT <- DT[!union(badrows1,badrows2,...)].
DT[,rn:=1:.N] # same as .I
badflagrows <- DT[,if(any(FLAG=='R')){rn}else{NULL},by='SYM,PRC,VOL']$V1
# fill in next_cond, next_grp
badnextrows <- DT[!badflagrows][,
if(any(next_cond)){rn}else{NULL},by='next_grp']$V1
Perhaps something similar can be done with the logical subsetting ("logi_not" in the benchmarks), which is a little faster.

R diff function on large dataset

I'd like to use the diff function on a really big data.frame : 140 Millions rows and two columns.
The goal is to compute the gap between two consecutive date activity, for each user_id.
For each user, the first activity doesn't have previous one, so I need a NA value.
I used this function, and it works for small dataset, but with the big one, it's really slow. I'm waiting since yesterday, and it's still running.
df2 <- as.vector(unlist(tapply(df$DATE,df$user_id, FUN=function(x){ return (c(NA,diff(x)))})))
I have a lot of memory (24GO) and a 4 cores CPU, but only one is working.
How can we do to manage this problem ? Is it better if I convert the dataframe to a matrix ?
Here is an example using some example data on a dataset that is at first 10 million rows, with 100 users, diffing 100,000 time points each, then 140 million rows, with 1,400 users so same number of timepoints. This transposes the time points to the columns. I should imagine if you were transposing users to columns it would be even faster. I used #Arun 's answer here as a template. Basically it shows that on a really big table you can do it on a single core (i7 2.6 GhZ) in < 90 seconds (and that is using code which is probably not fully optimsied):
require(data.table)
## Smaller sample dataset - 10 million row, 100 users, 100,000 time points each
DT <- data.table( Date = sample(100,1e7,repl=TRUE) , User = rep(1:100,each=1e5) )
## Size of table in memory
tables()
# NAME NROW MB COLS KEY
#[1,] DT 10,000,000 77 Date,User
#Total: 77MB
## Diff by user
dt.test <- quote({
DT2 <- DT[ , list(Diff=diff(c(0,Date))) , by=list(User) ]
DT2 <- DT2[, as.list(setattr(Diff, 'names', 1:length(Diff))) , by = list(User)]
})
## Benchmark it
require(microbenchmark)
microbenchmark( eval(dt.test) , times = 5L )
#Unit: seconds
# expr min lq median uq max neval
# eval(dt.test) 5.788364 5.825788 5.9295 5.942959 6.109157 5
## And with 140 million rows...
DT <- data.table( Date = sample(100,1.4e8,repl=TRUE) , User = rep(1:1400,each=1e5) )
#tables()
# NAME NROW MB
#[1,] DT 140,000,000 1069
microbenchmark( eval(dt.test) , times = 1L )
#Unit: seconds
# expr min lq median uq max neval
# eval(dt.test) 84.3689 84.3689 84.3689 84.3689 84.3689 1
This is a lot faster if you avoid tapply all together, which is fairly easy because your tapply call assumes the data are already sorted by user_id and DATE.
set.seed(21)
N <- 1e6
Data <- data.frame(DATE=Sys.Date()-sample(365,N,TRUE),
USER=sample(1e3,N,TRUE))
Data <- Data[order(Data$USER,Data$DATE),]
system.time({
Data$DIFF <- unlist(tapply(Data$DATE,Data$USER, function(x) c(NA,diff(x))))
})
# user system elapsed
# 1.58 0.00 1.59
Data2 <- Data
system.time({
Data2$DIFF <- c(NA,diff(Data2$DATE))
is.na(Data2$DIFF) <- which(c(NA,diff(Data2$USER))==1)
})
# user system elapsed
# 0.12 0.00 0.12
identical(Data,Data2)
# [1] TRUE

Resources