Memory efficient alternative to rbind - in-place rbind? - r

I need to rbind two large data frames. Right now I use
df <- rbind(df, df.extension)
but I (almost) instantly run out of memory. I guess it's because df is held in memory twice. I might see even bigger data frames in the future, so I need some kind of in-place rbind.
So my question is: Is there a way to avoid data duplication in memory when using rbind?
I found this question, which uses SQLite, but I really want to avoid using the hard drive as a cache.

data.table is your friend!
Cf. http://www.mail-archive.com/r-help@r-project.org/msg175877.html
Following up on nikola's comment, here is the description from ?rbindlist (new in v1.8.2):
Same as do.call("rbind",l), but much faster.
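For illustration, a minimal sketch of how rbindlist could be applied to the two data frames from the question (assuming data.table >= 1.8.2 is installed; df and df.extension are the objects from the question):
library(data.table)
combined <- rbindlist(list(df, df.extension))  # much faster than rbind and avoids rbind's per-call overhead
# combined is a data.table; it can be used much like a data.frame,
# or converted back with as.data.frame(combined) at the cost of a copy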

First of all: use the solution from the other question you link to if you want to be safe. As R is call-by-value, forget about an "in-place" method that doesn't copy your data frames in memory.
One not-advisable method that saves quite a bit of memory is to treat your data frames as lists: build a plain list column by column with a for loop (apply will eat memory like hell) and then make R believe it actually is a data frame.
I'll warn you again: using this on more complex data frames is asking for trouble and hard-to-find bugs. So be sure you test well enough and, if possible, avoid this altogether.
You could try the following approach:
n1 <- 1000000
n2 <- 1000000
ncols <- 20
dtf1 <- as.data.frame(matrix(sample(n1*ncols), n1, ncols))
dtf2 <- as.data.frame(matrix(sample(n2*ncols), n2, ncols))
dtf <- list()
for(i in names(dtf1)){
  dtf[[i]] <- c(dtf1[[i]], dtf2[[i]])
}
attr(dtf, "row.names") <- 1:(n1+n2)
attr(dtf, "class") <- "data.frame"
It erases the row names you actually had (you can reconstruct them, but check for duplicate row names!). It also skips all the other checks built into rbind.
This saves you about half of the memory in my tests, and dtfcomb and dtf come out identical.
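If you need the original row names back, one hedged way to rebuild them (untested, and assuming dtf1 and dtf2 carry meaningful, possibly overlapping row names) is:
attr(dtf, "row.names") <- make.unique(c(row.names(dtf1), row.names(dtf2)))  # make.unique resolves duplicates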
Test script:
n1 <- 3000000
n2 <- 3000000
ncols <- 20
dtf1 <- as.data.frame(matrix(sample(n1*ncols), n1, ncols))
dtf2 <- as.data.frame(matrix(sample(n2*ncols), n2, ncols))
gc()
Sys.sleep(10)
dtfcomb <- rbind(dtf1, dtf2)
Sys.sleep(10)
gc()
Sys.sleep(10)
rm(dtfcomb)
gc()
Sys.sleep(10)
dtf <- list()
for(i in names(dtf1)){
  dtf[[i]] <- c(dtf1[[i]], dtf2[[i]])
}
attr(dtf, "row.names") <- 1:(n1+n2)
attr(dtf, "class") <- "data.frame"
Sys.sleep(10)
gc()
Sys.sleep(10)
rm(dtf)
gc()

For now I have worked out the following solution:
nextrow = nrow(df)+1
df[nextrow:(nextrow+nrow(df.extension)-1),] = df.extension
# we need to ensure unique row names
row.names(df) = 1:nrow(df)
Now I don't run out of memory. I think it's because I store
object.size(df) + 2 * object.size(df.extension)
while with rbind R would need
object.size(rbind(df,df.extension)) + object.size(df) + object.size(df.extension).
After that I use
rm(df.extension)
gc(reset=TRUE)
to free the memory I don't need anymore.
This solves my problem for now, but I feel there is a more advanced way to do a memory-efficient rbind. I'd appreciate any comments on this solution.

This is a perfect candidate for bigmemory. See the site for more information. Here are three usage aspects to consider:
It's OK to use the hard drive: memory-mapped access is much faster than practically any other kind of disk access, so you may not see any slowdown. At times I rely upon > 1 TB of memory-mapped matrices, though most are between 6 and 50 GB. Moreover, because the object is a matrix, using it requires little or no rewriting of code.
Whether you use a file-backed matrix or not, you can use separated = TRUE to store the columns separately. I haven't used this much, because of my 3rd tip:
You can over-allocate the space on disk to allow for a larger potential matrix size, but load only the submatrix of interest. This way there is no need to rbind at all (see the sketch after the note below).
Note: Although the original question addressed data frames and bigmemory is suitable for matrices, one can easily create different matrices for different types of data and then combine the objects in RAM to create a dataframe, if it's really necessary.
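A minimal, untested sketch of the over-allocation idea (the sizes, column count and file names here are made up; it assumes the bigmemory package is installed):
library(bigmemory)
# over-allocate a file-backed matrix large enough for all rows you expect to see
bm <- filebacked.big.matrix(nrow = 5e6, ncol = 20, type = "double",
                            backingfile = "combined.bin",
                            descriptorfile = "combined.desc")
chunk1 <- matrix(rnorm(1e6 * 20), nrow = 1e6)        # first batch of rows
chunk2 <- matrix(rnorm(1e6 * 20), nrow = 1e6)        # second batch of rows
bm[1:nrow(chunk1), ] <- chunk1                       # write the first block
bm[nrow(chunk1) + (1:nrow(chunk2)), ] <- chunk2      # "append" the second block, no rbind needed
sub <- bm[1:1000, ]                                  # later, pull only the submatrix of interest into RAM
In later sessions, something like attach.big.matrix("combined.desc") should let you re-attach the same file-backed matrix without reloading the data.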

Related

alternative to subsetting in R

I have a df, YearHT, 6.5M rows x 55 columns. There is specific information I want to extract and add back, but only based on aggregate values. I am using a for loop to subset the large df and then performing the computations.
I have heard that for loops should be avoided, and I wonder if there is a way to avoid the one I have used, as this query currently takes ~3 hrs to run.
Here is my code:
srt = NULL
for(i in doubletCounts$Var1){
  s = subset(YearHT, YearHT$berthlet == i)
  e = unlist(c(strsplit(i, '\\|'), median(s$berthtime)))
  srt = rbind(srt, e)
}
srt = data.frame(srt)
s2 = data.frame(srt$X2, srt$X1, srt$X3)
colnames(s2) = colnames(srt)
s = rbind(srt, s2)
doubletCounts is a 700 x 3 df, and each of its values is found within the large df.
I would be glad to hear any ideas to optimize/speed up this process.
Here is a fast solution using data.table, although it is not completely clear from your question what output you want to get.
# load library
library(data.table)
# convert your dataset into a data.table (by reference)
setDT(YearHT)
# subset YearHT, keeping rows whose berthlet is present in doubletCounts$Var1
YearHT_df <- YearHT[berthlet %in% doubletCounts$Var1]
# aggregate values
output <- YearHT_df[, .(median = median(berthtime))]
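If the desired output is one median per berth (a guess at the intended result, reusing the column names from the question), a grouped variant would be:
output_by_berth <- YearHT_df[, .(median_berthtime = median(berthtime)), by = berthlet]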
for loops aren't necessarily something to avoid, but there are certain ways of using for loops that should be avoided. You've committed the classic for loop blunder here.
srt = NULL
for (i in index)
{
  [stuff]
  srt = rbind(srt, [stuff])
}
is bound to be slower than you would like because each time you hit srt = rbind(...), you're asking R to do all sorts of things to figure out what kind of object srt needs to be and how much memory to allocate to it. When you know what the length of your output needs to be up front, it's better to do
srt <- vector("list", length = length(doubletCounts$Var1))
for(i in doubletCounts$Var1){
  s = subset(YearHT, YearHT$berthlet == i)
  srt[[i]] = unlist(c(strsplit(i, '\\|'), median(s$berthtime)))
}
srt = data.frame(srt)
Or the apply alternative of
srt = lapply(doubletCounts$Var1,
             function(i) {
               s = subset(YearHT, YearHT$berthlet == i)
               unlist(c(strsplit(i, '\\|'), median(s$berthtime)))
             })
Both of those should run at about the same speed
(Note: both are untested, for lack of data, so they might be a little buggy)
Something else you can try, which might have a smaller effect, would be dropping the subset call and using indexing. The content of your for loop could be boiled down to
unlist(c(strsplit(i, '\\|'),
         median(YearHT[YearHT$berthlet == i, "berthtime"])))
But I'm not sure how much time that would save.

How to do arithmetic operations on an ffdf object of the ff package

I have a script that makes an ffdf object:
library(ff)
library(ffbase)
setwd("D:/My_package/Personal/R/reading")
x<-cbind(rnorm(1:100000000),rnorm(1:100000000),1:100000000)
system.time(write.csv2(x,"test.csv",row.names=FALSE))
system.time(x <- read.csv2.ffdf(file="test.csv", header=TRUE, first.rows=1000, next.rows=10000,levels=NULL))
Now I want to increase column #1 of x by 5.
To perform this operation I use the add() method of the ff package:
add(x[,1],5)
The output is OK (column #1 is increased by 5), but the extra RAM allocation is disastrous - it looks as if I am operating on the entire data frame in RAM rather than on an ffdf object.
So my question is about the correct way to operate on elements of an ffdf object without drastic extra RAM allocation.
You can just do the following:
require(ffbase)
x <- ff(1:10)
y <- x + 5
x
y
ffbase has implemented all the arithmetic operations; see help("+.ff_vector").
I have used a chunk approach to do the arithmetic without extra RAM overhead (see the initial script in the question section):
chunk_size <- 100
chunks <- chunk(x, length.out = chunk_size)
system.time(
  for(i in seq_along(chunks)){
    x[chunks[[i]], ][[1]] <- x[chunks[[i]], ][[1]] + 5
  }
)
x
x
Now I have increased each element of column #1 of x by 5 without significant RAM allocation.
chunk_size regulates the number of chunks: the more chunks are used, the smaller the RAM overhead, but processing time can grow.
A brief example and explanation of chunks in ffdf can be found here:
https://github.com/demydd/R-for-Big-Data/blob/master/09-ff.Rmd
Anyway, it would be nice to hear about alternative approaches.

What's the fastest way for counting words in a large dataset using R?

I have an SQL database with 7 million+ records, each record containing some text. Within each record I want to perform text analysis, say count the occurrences of specific words. I've tried R's tokenize function from the openNLP package, which works great for small files, but 7 million records * between 1-100 words per record gets too large for R to hold in a data.frame. I thought about using R's bigmemory or ff packages, or even the mapReduce package. Do you have a preferred approach or package for this type of analysis?
Maybe approach it in parallel. I used parLapply because I believe it works on all three OSes.
wc <- function(x) length(unlist(strsplit(x, "\\s+")))
wordcols <- rep("I like icecream alot.", 100000)
library(parallel)
cl <- makeCluster(mc <- getOption("cl.cores", detectCores()))
clusterExport(cl=cl, varlist=c("wc", "wordcols"), envir=environment())
output <- parLapply(cl, wordcols, function(x) {
  wc(x)
})
stopCluster(cl)
sum(unlist(output))
On the SQL side you could also compute, for each entry, the string length, then apply a replace(" yourWord ", "") (with flanking spaces...), compute the string length again and take the difference between the two; that should do the trick. My SQL is not good enough to easily present a running example here, sorry about that...
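The same length-difference trick can be sketched in R as an illustration (this is not the SQL itself; the example texts and target word are made up):
texts <- c("I like icecream alot.", "icecream icecream yum")
word  <- "icecream"   # add flanking spaces to match whole words only, as in the SQL suggestion
# occurrences = (original length - length after removing the word) / word length
counts <- (nchar(texts) - nchar(gsub(word, "", texts, fixed = TRUE))) / nchar(word)
counts                # 1 2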

In R, when is it smart to formally declare a data.frame?

In other R code, it is common to see a data.frame declared before a loop is started.
Suppose I have a data frame data1 with 2000 rows.
In a loop, I call a web service for each item of data1 to create a new data.frame data2 (please don't recommend not using a loop).
In data2$result and data2$pubcount I need to store different values for each of the 2000 data1 items.
Do I HAVE to declare before the loop
data2=data.frame()
and do I have to tell R how many rows and which columns I will later use? I know that columns can be added without declaring. What about rows? Is there an advantage in doing:
data2<-data.frame(id=data1$id)
I would like to declare and do only what I absolutely HAVE to.
Why does the empty declaration give an error once inside the loop?
Later edit: speed and memory are not an issue; 10 s vs. 30 s makes no difference, my data is under 100 MB and I have a big PC (8 GB). A matrix is not an option since the data is mixed numbers and text, so I have to use something other than a matrix.
Something like this:
df <- data.frame(a = numeric(n), b = character(n))
for (i in 1:n) {
  # <do stuff>
  df[i, 1] <- ...
  df[i, 2] <- ...
}
You should avoid manipulation of data.frames in a loop, since subsetting of data.frames is a slow operation:
a <- numeric(n)
b <- character(n)
for (i in 1:n) {
  # <do stuff>
  a[i] <- ...
  b[i] <- ...
}
df <- data.frame(a, b)
Of course, there are often better ways than a for loop. But it is strongly recommended to avoid growing objects (so I won't show you how to do that); pre-allocate as shown here.
Why should you pre-allocate? Because growing objects in a loop is painfully slow, and that's one of the main reasons people think loops in R are slow (see the small timing sketch below).
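A small, hedged timing sketch of that claim (numbers will vary by machine; n and the column values are arbitrary):
n <- 5000
# growing the data.frame inside the loop
system.time({
  grown <- data.frame()
  for (i in 1:n) grown <- rbind(grown, data.frame(a = i, b = "x", stringsAsFactors = FALSE))
})
# pre-allocated, filled by index
system.time({
  prealloc <- data.frame(a = numeric(n), b = character(n), stringsAsFactors = FALSE)
  for (i in 1:n) {
    prealloc[i, 1] <- i
    prealloc[i, 2] <- "x"
  }
})
The pre-allocated version should be noticeably faster, and the gap widens as n grows.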

What's the higher-performance alternative to for-loops for subsetting data by group-id?

A recurring analysis paradigm I encounter in my research is the need to subset by each distinct group id value, perform a statistical analysis on each group in turn, and put the results into an output matrix for further processing/summarizing.
How I typically do this in R is something like the following:
data.mat <- read.csv("...")
groupids <- unique(data.mat$ID) #Assume there are then 100 unique groups
results <- matrix(rep("NA",300),ncol=3,nrow=100)
for(i in 1:100) {
  tempmat <- subset(data.mat, ID == groupids[i])
  # Run various stats on tempmat (correlations, regressions, etc), checking to
  # make sure this specific group doesn't have NAs in the variables I'm using
  # and assign results to x, y, and z, for example.
  results[i,1] <- x
  results[i,2] <- y
  results[i,3] <- z
}
This ends up working for me, but depending on the size of the data and the number of groups I'm working with, this can take up to three days.
Besides branching out into parallel processing, is there any "trick" for making something like this run faster? For instance, converting the loops into something else (something like an apply with a function containing the stats I want to run inside the loop), or eliminating the need to actually assign the subset of data to a variable?
Edit:
Maybe this is just common knowledge (or sampling error), but I tried subsetting with brackets in some of my code rather than using the subset command, and it seemed to provide a slight performance gain which surprised me. I have some code I used and output below using the same object names as above:
system.time(for(i in 1:1000){data.mat[data.mat$ID==groupids[i],]})
user system elapsed
361.41 92.62 458.32
system.time(for(i in 1:1000){subset(data.mat,ID==groupids[i])})
user system elapsed
378.44 102.03 485.94
Update:
In one of the answers, jorgusch suggested that I use the data.table package to speed up my subsetting. So, I applied it to a problem I ran earlier this week. In a dataset with a little over 1,500,000 rows, and 4 columns (ID,Var1,Var2,Var3), I wanted to calculate two correlations in each group (indexed by the "ID" variable). There are slightly more than 50,000 groups. Below is my initial code (which is very similar to the above):
data.mat <- read.csv("//home....")
groupids <- unique(data.mat$ID)
results <- matrix(rep("NA",(length(groupids) * 3)),ncol=3,nrow=length(groupids))
for(i in 1:length(groupids)) {
  tempmat <- data.mat[data.mat$ID == groupids[i], ]
  results[i,1] <- groupids[i]
  results[i,2] <- cor(tempmat$Var1, tempmat$Var2, use = "pairwise.complete.obs")
  results[i,3] <- cor(tempmat$Var1, tempmat$Var3, use = "pairwise.complete.obs")
}
I'm re-running that right now for an exact measure of how long that took, but from what I remember, I started it running when I got into the office in the morning and it finished sometime in the mid-afternoon. Figure 5-7 hours.
Restructuring my code to use data.table....
data.mat <- read.csv("//home....")
data.mat <- data.table(data.mat)
testfunc <- function(x, y, z) {
  temp1 <- cor(x, y, use = "pairwise.complete.obs")
  temp2 <- cor(x, z, use = "pairwise.complete.obs")
  res <- list(temp1, temp2)
  res
}
system.time(test <- data.mat[,testfunc(Var1,Var2,Var3),by="ID"])
user system elapsed
16.41 0.05 17.44
Comparing the results from data.table to the ones I got from using a for loop to subset all IDs and record results manually, they seem to give the same answers (though I'll have to check that a bit more thoroughly). That looks to be a pretty big speed increase.
Update 2:
Running the code using subsets finally finished up again:
user system elapsed
17575.79 4247.41 23477.00
Update 3:
I wanted to see if anything worked out differently using the plyr package that was also recommended. This is my first time using it, so I may have done things somewhat inefficiently, but it still helped substantially compared to the for loop with subsetting.
Using the same variables and setup as before...
data.mat <- read.csv("//home....")
system.time(hmm <- ddply(data.mat,"ID",function(df)c(cor(df$Var1,df$Var2, use="pairwise.complete.obs"),cor(df$Var1,df$Var3,use="pairwise.complete.obs"))))
user system elapsed
250.25 7.35 272.09
This is pretty much exactly what the plyr package is designed to make easier. However, it's unlikely to make things much faster - most of the time is probably spent doing the statistics.
Besides plyr, you can try the foreach package to get rid of the explicit loop counter, but I don't know whether it will give you any performance benefit.
foreach nevertheless gives you a quite simple interface to parallel chunk processing if you have a multicore workstation (with the doMC/multicore packages; see "Getting Started with doMC and foreach" for details), in case you have ruled out parallel processing only because it is not very easy for students to understand. If that is not the only reason, plyr is a very good solution IMHO. A sketch of the foreach route follows below.
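A minimal, untested sketch of the foreach + doMC route, reusing the variable names from the question (data.mat, groupids, Var1..Var3); doMC works on Linux/macOS, while on Windows doParallel would be the usual substitute:
library(foreach)
library(doMC)
registerDoMC(cores = 4)                     # pick a core count that suits your machine
results <- foreach(g = groupids, .combine = rbind) %dopar% {
  tempmat <- data.mat[data.mat$ID == g, ]
  c(ID    = g,
    cor12 = cor(tempmat$Var1, tempmat$Var2, use = "pairwise.complete.obs"),
    cor13 = cor(tempmat$Var1, tempmat$Var3, use = "pairwise.complete.obs"))
}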
Personally, I find plyr not very easy to understand. I prefer data.table, which is also faster. For instance, say you want the standard deviation of column my_column for each ID:
dt <- data.table(df) # one-time operation: convert the data frame df to a data.table
result.sd <- dt[, sd(my_column), by = "ID"] # one row per ID, with the SD in the second column
Three statements of this kind and a cbind at the end - that is all you need.
You can also use dt to do something for only one ID, without a subset command, using the new syntax:
result.sd.oneID <- dt[ID == "oneID", sd(my_column)]
The first argument refers to rows (i), the second to columns (j).
I find it easier to read than plyr, and it is more flexible, as you can also work on sub-domains within a "subset"...
The documentation describes it as using SQL-like methods; for instance, by is pretty much GROUP BY in SQL. If you know SQL you can probably do much more, but it is not necessary in order to use the package.
Finally, it is extremely fast, as data.table grabs only the data needed for each calculation, whereas subset keeps hold of the whole object and drags it through memory.
You have already suggested vectorizing and avoiding unnecessary copies of intermediate results, so you are certainly on the right track. Let me caution you not to do what I did and just assume that vectorizing will always give you a performance boost (as it does in other languages, e.g., Python + NumPy, MATLAB).
An example:
# small function to time the results:
time_this = function(...) {
  start.time = Sys.time()
  eval(..., sys.frame(sys.parent(sys.parent())))
  end.time = Sys.time()
  print(end.time - start.time)
}
# data for testing: a 10000 x 1000 matrix of random doubles
a = matrix(rnorm(1e7, mean=5, sd=2), nrow=10000)
# two versions doing the same thing: calculating the mean for each row
# in the matrix
x = time_this( for (i in 1:nrow(a)){ mean( a[i,] ) } )
y = time_this( apply(X=a, MARGIN=1, FUN=mean) )
print(x) # returns => 0.5312099
print(y) # returns => 0.661242
The 'apply' version is actually slower than the 'for' version. (According to the author of The R Inferno, if you are doing this you are not vectorizing, you are 'loop hiding'.)
But where you can get a performance boost is by using built-ins. Below, I've timed the same operation as the two above, just using the built-in function rowMeans:
z = time_this(rowMeans(a))
print(z) # returns => 0.03679609
An order of magnitude improvement versus the 'for' loop (and the vectorized version).
The other members of the apply family are not just wrappers over a native 'for' loop.
a = abs(floor(10*rnorm(1e6)))
time_this(sapply(a, sqrt))
# returns => 6.64 secs
time_this(for (i in 1:length(a)){ sqrt(a[i])})
# returns => 1.33 secs
'sapply' is about 5x slower compared with a 'for' loop.
Finally, w/r/t vectorized versus 'for' loops, I don't think I ever use a loop if I can use a vectorized function--the latter usually means fewer keystrokes and is a more natural way (for me) to code, which is a different kind of performance boost, I suppose.
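For completeness, a quick sketch of the truly vectorized version of the sqrt example above (timings will vary by machine):
time_this(sqrt(a))
# this should come in far below both the sapply and the explicit for-loop timings above,
# since sqrt() is applied to the whole vector in one call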
