Essentially I've got a large data frame, 10,000,000 x 900 (rows x columns), and I'm trying to convert the class of each column in parallel. The end result needs to be a data.frame.
Here's what I've got so far:
Pretend df is the data frame, already defined; its columns are a mixture of numeric and character classes.
library(snow)
cl=makeCluster(50,type="SOCK")
cl.out=clusterApplyLB(cl,df,function(x)factor(x,exclude=NULL))
cl.out is a list of what I want, except that I need it to be a data.frame.
So this is where I get stuck... do I try to combine all of the elements of cl.out into a data.frame, a step which isn't going to run in parallel? (SLOW - time is an issue)
Can I implement something else with a different package? (foreach?)
Do I have to hand-code some C to get this done efficiently?
Any help would be appreciated.
Thanks,
One useful paradigm is to subset and replace all columns, treating df as list-like
df[] <- lapply(df, factor, exclude=NULL)
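A toy illustration (the two-column data frame here is made up) of why the empty-bracket assignment matters: assigning into df[] replaces the columns but keeps the data.frame class, whereas df <- lapply(...) would leave you with a plain list.
d <- data.frame(a = c("x", "y", NA), b = c("1", "2", "3"), stringsAsFactors = FALSE)
d[] <- lapply(d, factor, exclude = NULL)   # replace every column in place
class(d)                                   # still "data.frame"
sapply(d, class)                           # both columns are now factors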
Do you really have 50 cores on a single machine, as implied by your call to makeCluster? If you're not on a Windows machine, use the parallel package and mclapply instead
library(parallel)
options(mc.cores=50)
df[] <- mclapply(df, factor, exclude=NULL)
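On Windows, where mclapply cannot fork, a socket cluster with parLapply from the same parallel package is the usual substitute; a minimal sketch, with the worker count as an assumption:
library(parallel)
cl <- makeCluster(4)                              # assumed worker count; match it to your hardware
df[] <- parLapply(cl, df, factor, exclude = NULL)
stopCluster(cl)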
Is this really going to help you in your overall evaluation? It seems to cost as much to distribute and retrieve the data as to do the calculation.
> f = factor(rep("M", 10000000), levels=LETTERS)
> df = data.frame(f, f, f, f, f, f, f, f)
> system.time(lapply(df, factor, exclude=NULL))
user system elapsed
2.676 0.564 3.250
> system.time(clusterApply(cl, df, factor, exclude=NULL))
user system elapsed
1.488 0.752 2.476
> system.time(mclapply(df, factor, exclude=NULL))
user system elapsed
1.876 1.832 1.814
(the multi-core and multi-process timings are probably highly variable).
If you have a data.frame of that size, I think you will run into memory issues very quickly.
You could use data.table's set, which I think will be much faster and more memory-efficient.
library(data.table)
# to set as a data.table without having to copy
setattr(df, 'class', c('data.table', 'data.frame'))
alloc.col(df)
for (nn in names(df)) {
  set(df, j = nn, value = factor(df[[nn]]))
}
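On more recent data.table versions, setDT() does the convert-by-reference step in one call, replacing the manual setattr/alloc.col pair above; a minimal sketch under that assumption:
library(data.table)
setDT(df)   # convert to data.table by reference, no copy
for (nn in names(df)) {
  set(df, j = nn, value = factor(df[[nn]], exclude = NULL))   # exclude = NULL keeps NAs as a level, as in the question
}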
It is worth reading about data.table and parallel computing.
Since a data.frame is just a list with a class attribute, you can convert the list into a data.frame with as.data.frame.
cl.out <- as.data.frame(cl.out)
I notice that the column names are lost: if you are sure that they are
in the same order, you can set them back with:
names(cl.out) <- names(df)
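If you would rather do it in one step, base R's setNames() can attach the names during the conversion; a small sketch under the same ordering assumption:
# assumes cl.out still has one element per column of df, in the original order
cl.out <- as.data.frame(setNames(cl.out, names(df)))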
I have a dataset with 4811616 rows, consisting of variables A, B, and C. Variable A has NAs, and I want to assign zeroes to the cases that are NA. I proceed as follows:
df$A <- ifelse(is.na(df$A), 0, df$A)
And I get an error saying that R has run out of memory. That shouldn't be possible: I am running a 64-bit version of R on Windows 7 with 36 GB of memory, using memory.limit(size=34000) to assign memory to R, and the only object in the environment is my data frame of 128.5 MB. Moreover, print(object.size(ifelse(is.na(df$A), 0, df$A)), units="Mb") returns 36.7 MB, so it cannot be that the vector resulting from the ifelse statement is too big.
In fact, assigning the vector to a variable x doesn't cause R to run out of memory. It is when I try to assign it to my tbl_df that the problem happens. It also happens if I assign it to a data.frame(tbl_df).
Can anyone help me to uncover what is going on and to find a way around it?
You could use
df$A[is.na(df$A)] <- 0
You could try data.table
library(data.table)
setDT(df)[is.na(A), A:=0][]
If you need to replace the "NAs" in all the columns, you can use set which would be very efficient.
for(j in seq_len(ncol(df))){
set(df, i=which(is.na(df[[j]])), j=j, value=0)
}
Using a bigger dataset
set.seed(495)
df1 <- as.data.frame(matrix(sample(c(NA,1:5),3*4811616,
replace=TRUE), ncol=3, dimnames=list(NULL, LETTERS[1:3])))
system.time(setDT(df1)[is.na(A), A:=0])
# user system elapsed
# 0.026 0.002 0.027
Just to compare with @lukeA's method
system.time(df1$A[is.na(df1$A)] <- 0)
# user system elapsed
# 0.140 0.004 0.144
data
set.seed(25)
df <- as.data.frame(matrix(sample(c(NA,1:5), 3*20,
replace=TRUE), ncol=3, dimnames=list(NULL, LETTERS[1:3])))
I come from a Java/Python comp sci theory background so I am still getting used to the various R packages and how they can save run time in functions.
Basically, I am working on a few projects, and all of them involve taking individual factors from a long data set (15,000 to 200,000 factors), performing calculations against individual factors in an equally large data set, and storing the results of those calculations in a much larger output data frame.
So far I have been using nested while loops and concatenating into a growing list, but that is taking days. I've recently learned about 'lapply' and the 'data.table' options in R, and I would love to see an example of how to apply (no pun intended) them to the following basic correlation function:
Corr <- function(miRdf, mRNAdf)
{
  j = 1
  k = 1
  m = 1
  n = 1
  c = 0
  corrList = NULL
  while (n <= 71521)
  {
    while (m <= 1477)
    {
      corr = cor(as.numeric(miRdf[k, 2:13]), as.numeric(mRNAdf[j, 2:13]), use = "complete.obs")
      corrList <- c(corrList, corr)
      j = j + 1
      c = c + 1
      print(c)  # just a counter to see how far the function has run
      m = m + 1
    }
    k = k + 1
    n = n + 1
    j = 1
    m = 1  # to reset the inner while loop
  }
  corrList <- matrix(unlist(corrList), ncol = 1477, byrow = FALSE)
  colnames(corrList) <- miRdf[, 1]
  rownames(corrList) <- mRNAdf[, 1]
  write.csv(corrList, "testCorrWhole.csv")
}
As you can see, the nested while loops produce 105,636,517 (71521 x 1477) miRNA-vs-mRNA expression-value correlations that need to be computed and stored in a data frame of 1477 columns x 71521 rows in order to generate a scoring matrix.
My question is, can anyone shed light on how to turn the above monstrosity into an efficient function that utilizes 'lapply' instead of the while loops, and uses the 'data.table' set() function to do away with the inefficiency of concatenating a list during every pass through the loop?
Thank you in advance!
Your names end with 'df', which makes it seem like your data are a data.frame. But @Troy's answer uses a matrix. A matrix is appropriate when the data are homogeneous, and generally matrix operations are much faster than data.frame operations. So you can see already that if you'd provided a small example of your data set (e.g., dput(mRNAdf[1:10,])), people might be in a better position to help you; this is what they're asking for.
In large numerical calculations it makes sense to 'hoist' any repeated calculations outside the loop, so they are performed only once. Repeated calculations in your case include sub-setting to columns 2:13, and coercion to numeric. With this idea, and guessing that you actually have a data.frame where each column is already a numeric vector, I'd start with
mRNAmatrix <- as.matrix(mRNAdf[,2:13])
miRmatrix <- as.matrix(miRdf[,2:13])
From the help page ?cor we see that the arguments can be a matrix, and if so the correlation is calculated between columns. You're interested in the result when the arguments are transposed relative to your current representation. So
result <- cor(t(mRNAmatrix), t(miRmatrix), use="complete.obs")
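To see why the transposes are needed: cor() on two matrices correlates their columns, so transposing turns each original row into a column. A tiny sketch (the dimensions are made up) checking this against an explicit pairwise loop:
set.seed(1)
A <- matrix(rnorm(5 * 12), nrow = 5)    # 5 "mRNA" profiles, 12 samples each
B <- matrix(rnorm(3 * 12), nrow = 3)    # 3 "miR" profiles, 12 samples each
one.call <- cor(t(A), t(B))             # 5 x 3: correlation of every row of A with every row of B
by.pair  <- outer(1:5, 1:3, Vectorize(function(i, j) cor(A[i, ], B[j, ])))
all.equal(one.call, by.pair)            # should be TRUE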
This is fast enough for your purposes
> m1 = matrix(rnorm(71521 * 12), 71521)
> m2 = matrix(rnorm(1477 * 12), 1477)
> system.time(ans <- cor(t(m1), t(m2)))
user system elapsed
9.124 0.200 9.340
> dim(ans)
[1] 71521 1477
result is the same as your corrList -- it's not a list, but a matrix; probably the row and column names have been carried forward. You'd write this to a file as you do above, write.csv(result, "testCorrWhole.csv")
UPDATED BELOW TO SHOW PARALLEL PROCESSING - ABOUT A 60% SAVING
Using apply() might not be quick enough for you. Here's how to do it, though. I'll have a think about performance, since this example (1M output correlations on a 1000 x 1000 grid) takes over a minute on a laptop.
miRdf=matrix(rnorm(13000,10,1),ncol=13)
mRNAdf=matrix(rnorm(13000,10,1),ncol=13)
miRdf[,1]<-1:nrow(miRdf) # using column 1 as indices since they're not in the calc.
mRNAdf[,1]<-1:nrow(mRNAdf)
corRow<-function(y){
apply(miRdf,1,function(x)cor(as.numeric(x[2:13]), as.numeric(mRNAdf[y,2:13]), use ="complete.obs"))
}
system.time(apply(mRNAdf,1,function(x)corRow(x[1])))
# user system elapsed
# 72.94 0.00 73.39
And with parallel::parApply on a 4 core Win64 laptop
require(parallel) ## Library to allow parallel processing
miRdf=matrix(rnorm(13000,10,1),ncol=13)
mRNAdf=matrix(rnorm(13000,10,1),ncol=13)
miRdf[,1]<-1:nrow(miRdf) # using column 1 as indices since they're not in the calc.
mRNAdf[,1]<-1:nrow(mRNAdf)
corRow<-function(y){
apply(miRdf,1,function(x)cor(as.numeric(x[2:13]), as.numeric(mRNAdf[y,2:13]), use ="complete.obs"))
}
# Make a cluster from all available cores
cl=makeCluster(detectCores())
# Use clusterExport() to distribute the function and data.frames needed in the apply() call
clusterExport(cl,c("corRow","miRdf","mRNAdf"))
# time the call
system.time(parApply(cl,mRNAdf,1,function(x)corRow(x[[1]])))
# Stop the cluster
stopCluster(cl)
# time the call without clustering
system.time(apply(mRNAdf,1,function(x)corRow(x[[1]])))
## WITH CLUSTER (4)
user system elapsed
0.04 0.03 29.94
## WITHOUT CLUSTER
user system elapsed
73.96 0.00 74.46
I am selecting a subset of a data.frame g.raw, like this:
g.raw <- read.table(gfile,sep=',', header=F, row.names=1)
snps = intersect(row.names(na.omit(csnp.raw)),row.names(na.omit(esnp.raw)))
g = g.raw[snps,]
It works. However, that last line is EXTREMELY slow.
g.raw is about 18M rows and snps is about 1M. I realize these are pretty large numbers, but this seems like a simple operation, and reading g.raw into a matrix/data.frame held in memory wasn't a problem (it took a few minutes), whereas the operation I described above is taking hours.
How do I speed this up? All I want is to shrink g.raw a lot.
Thanks!
This seems to be a case where data.table can shine.
Reproducing data.frame:
set.seed(1)
N <- 1e6 # total number of rows
M <- 1e5 # number of rows to subset
g.raw <- data.frame(sample(1:N, N), sample(1:N, N), sample(1:N, N))
rownames(g.raw) <- sapply(1:N, function(x) paste(sample(letters, 50, replace=T), collapse=""))
snps <- sample(rownames(g.raw), M)
head(g.raw) # looking into newly created data.frame
head(snps) # and rows for subsetting
data.frame approach:
system.time(g <- g.raw[snps,])
# > user system elapsed
# > 881.039 0.388 884.821
data.table approach:
require(data.table)
dt.raw <- as.data.table(g.raw, keep.rownames=T)
# rn is a column with rownames(g.raw)
system.time(setkey(dt.raw, rn))
# > user system elapsed
# > 8.029 0.004 8.046
system.time(dt <- dt.raw[snps,])
# > user system elapsed
# > 0.428 0.000 0.429
Well, about 100x faster with these N and M (and an even better speed-up with larger N).
You can compare results:
head(g)
head(dt)
Pre-allocate, and use a matrix for building if the data is of uniform type. See iteratively constructed dataframe in R for a far more beautiful answer.
UPDATE
You were right - the bottleneck was in selection. The solution is to look up the numeric indexes of snps, once, and then just select those rows, like so:
g <- g.raw[match(snps, rownames(g.raw)),]
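For a rough comparison, here is a hedged sketch reusing the g.raw and snps objects built in the data.table answer above (timings will vary by machine):
system.time(g1 <- g.raw[snps, ])                          # character rowname lookup
system.time(g2 <- g.raw[match(snps, rownames(g.raw)), ])  # positions looked up once, then integer subsetting
identical(unname(as.matrix(g1)), unname(as.matrix(g2)))   # same rows in the same order; expected TRUE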
I'm an R newbie - thanks, this was an informative exercise. FWIW, I've seen comments by others that they never use rownames - probably because of things like this.
UPDATE 2
See also fast subsetting in R, which is more-or-less a duplicate. Most significantly, note the first answer, and the reference to Extract.data.frame, where we find out that rowname matching is partial, that there's a hashtable on rownames, and that the solution I suggested here turns out to be the canonical one. However, given all that, and experiments, I now don't see why it's so slow. The partial match algorithm should first look in the hash-table for an exact match, which in our case should always succeed.
I need to rbind two large data frames. Right now I use
df <- rbind(df, df.extension)
but I (almost) instantly run out of memory. I guess it's because df is held in memory twice. I might see even bigger data frames in the future, so I need some kind of in-place rbind.
So my question is: Is there a way to avoid data duplication in memory when using rbind?
I found this question, which uses SQLite, but I really want to avoid using the hard drive as a cache.
data.table is your friend!
C.f. http://www.mail-archive.com/r-help@r-project.org/msg175877.html
Following up on nikola's comment, here is ?rbindlist's description (new in v1.8.2) :
Same as do.call("rbind",l), but much faster.
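A minimal sketch of how that looks for the two data frames in the question (assuming they have the same columns):
library(data.table)
df <- rbindlist(list(df, df.extension))   # much faster than rbind, though it still allocates the combined result
# df is now a data.table; use setDF(df) (newer data.table versions) or as.data.frame(df) if you need a plain data.frame back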
First of all: use the solution from the other question you link to if you want to be safe. As R is call-by-value, forget about an "in-place" method that doesn't copy your data frames in memory.
One method (not advisable) that saves quite a bit of memory is to pretend your data frames are lists, combine them column by column using a for loop (apply would eat memory like hell), and then make R believe the result actually is a data frame.
I'll warn you again: using this on more complex data frames is asking for trouble and hard-to-find bugs. So be sure you test well enough, and if possible, avoid this as much as possible.
You could try following approach :
n1 <- 1000000
n2 <- 1000000
ncols <- 20
dtf1 <- as.data.frame(matrix(sample(n1*ncols), n1, ncols))
dtf2 <- as.data.frame(matrix(sample(n2*ncols), n2, ncols))
dtf <- list()
for(i in names(dtf1)){
dtf[[i]] <- c(dtf1[[i]],dtf2[[i]])
}
attr(dtf,"row.names") <- 1:(n1+n2)
attr(dtf,"class") <- "data.frame"
It erases rownames you actually had (you can reconstruct them, but check for duplicate rownames!). It also doesn't carry out all the other tests included in rbind.
This saves you about half of the memory in my tests, and in those tests dtfcomb and dtf come out equal. (In the memory-usage plot from my tests, not included here, the red box is rbind and the yellow one is the list-based approach.)
Test script :
n1 <- 3000000
n2 <- 3000000
ncols <- 20
dtf1 <- as.data.frame(matrix(sample(n1*ncols), n1, ncols))
dtf2 <- as.data.frame(matrix(sample(n2*ncols), n2, ncols))
gc()
Sys.sleep(10)
dtfcomb <- rbind(dtf1,dtf2)
Sys.sleep(10)
gc()
Sys.sleep(10)
rm(dtfcomb)
gc()
Sys.sleep(10)
dtf <- list()
for(i in names(dtf1)){
dtf[[i]] <- c(dtf1[[i]],dtf2[[i]])
}
attr(dtf,"row.names") <- 1:(n1+n2)
attr(dtf,"class") <- "data.frame"
Sys.sleep(10)
gc()
Sys.sleep(10)
rm(dtf)
gc()
For now I have worked out the following solution:
nextrow = nrow(df)+1
df[nextrow:(nextrow+nrow(df.extension)-1),] = df.extension
# we need to assure unique row names
row.names(df) = 1:nrow(df)
Now I don't run out of memory. I think it's because I only store
object.size(df) + 2 * object.size(df.extension)
while with rbind R would need
object.size(rbind(df,df.extension)) + object.size(df) + object.size(df.extension).
After that I use
rm(df.extension)
gc(reset=TRUE)
to free the memory I don't need anymore.
This solved my problem for now, but I feel that there is a more advanced way to do a memory efficient rbind. I appreciate any comments on this solution.
This is a perfect candidate for bigmemory. See the site for more information. Here are three usage aspects to consider:
It's OK to use the HD: memory-mapping to the HD is much faster than practically any other access method, so you may not see any slowdowns. At times I rely upon > 1 TB of memory-mapped matrices, though most are between 6 and 50 GB. Moreover, as the object is a matrix, this requires no real overhead of rewriting code in order to use the object.
Whether you use a file-backed matrix or not, you can use separated = TRUE to make the columns separate. I haven't used this much, because of my 3rd tip:
You can over-allocate the HD space to allow for a larger potential matrix size, but only load the submatrix of interest. This way there is no need to do rbind.
Note: Although the original question addressed data frames and bigmemory is suitable for matrices, one can easily create different matrices for different types of data and then combine the objects in RAM to create a dataframe, if it's really necessary.
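A rough sketch of the over-allocation idea with a file-backed matrix (the file names, sizes, and the m1/m2 stand-in matrices are all made up; numeric data only):
library(bigmemory)
# reserve more rows than currently needed, backed by files on disk
big <- filebacked.big.matrix(nrow = 10e6, ncol = 20, type = "double",
                             backingfile = "big.bin", descriptorfile = "big.desc")
# write each block into its slice instead of rbind-ing copies in RAM
big[seq_len(nrow(m1)), ] <- m1
big[nrow(m1) + seq_len(nrow(m2)), ] <- m2
# pull back only the rows filled so far as an ordinary matrix
filled <- big[seq_len(nrow(m1) + nrow(m2)), ]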
A recurring analysis paradigm I encounter in my research is the need to subset a data set by each distinct group id value, perform a statistical analysis on each group in turn, and put the results in an output matrix for further processing/summarizing.
How I typically do this in R is something like the following:
data.mat <- read.csv("...")
groupids <- unique(data.mat$ID) #Assume there are then 100 unique groups
results <- matrix(rep("NA",300),ncol=3,nrow=100)
for(i in 1:100) {
tempmat <- subset(data.mat,ID==groupids[i])
# Run various stats on tempmat (correlations, regressions, etc), checking to
# make sure this specific group doesn't have NAs in the variables I'm using
# and assign results to x, y, and z, for example.
results[i,1] <- x
results[i,2] <- y
results[i,3] <- z
}
This ends up working for me, but depending on the size of the data and the number of groups I'm working with, this can take up to three days.
Besides branching out into parallel processing, is there any "trick" for making something like this run faster? For instance, converting the loops into something else (something like an apply with a function containing the stats I want to run inside the loop), or eliminating the need to actually assign the subset of data to a variable?
Edit:
Maybe this is just common knowledge (or sampling error), but I tried subsetting with brackets in some of my code rather than using the subset command, and it seemed to provide a slight performance gain which surprised me. I have some code I used and output below using the same object names as above:
system.time(for(i in 1:1000){data.mat[data.mat$ID==groupids[i],]})
user system elapsed
361.41 92.62 458.32
system.time(for(i in 1:1000){subset(data.mat,ID==groupids[i])})
user system elapsed
378.44 102.03 485.94
Update:
In one of the answers, jorgusch suggested that I use the data.table package to speed up my subsetting. So, I applied it to a problem I ran earlier this week. In a dataset with a little over 1,500,000 rows, and 4 columns (ID,Var1,Var2,Var3), I wanted to calculate two correlations in each group (indexed by the "ID" variable). There are slightly more than 50,000 groups. Below is my initial code (which is very similar to the above):
data.mat <- read.csv("//home....")
groupids <- unique(data.mat$ID)
results <- matrix(rep("NA",(length(groupids) * 3)),ncol=3,nrow=length(groupids))
for(i in 1:length(groupids)) {
tempmat <- data.mat[data.mat$ID==groupids[i],]
results[i,1] <- groupids[i]
results[i,2] <- cor(tempmat$Var1,tempmat$Var2,use="pairwise.complete.obs")
results[i,3] <- cor(tempmat$Var1,tempmat$Var3,use="pairwise.complete.obs")
}
I'm re-running that right now for an exact measure of how long that took, but from what I remember, I started it running when I got into the office in the morning and it finished sometime in the mid-afternoon. Figure 5-7 hours.
Restructuring my code to use data.table....
data.mat <- read.csv("//home....")
data.mat <- data.table(data.mat)
testfunc <- function(x,y,z) {
temp1 <- cor(x,y,use="pairwise.complete.obs")
temp2 <- cor(x,z,use="pairwise.complete.obs")
res <- list(temp1,temp2)
res
}
system.time(test <- data.mat[,testfunc(Var1,Var2,Var3),by="ID"])
user system elapsed
16.41 0.05 17.44
Comparing the results using data.table to the ones I got from using a for loop to subset all IDs and record the results manually, they seem to have given me the same answers (though I'll have to check that a bit more thoroughly). That looks to be a pretty big speed increase.
Update 2:
Running the code using subsets finally finished up again:
user system elapsed
17575.79 4247.41 23477.00
Update 3:
I wanted to see if anything worked out differently using the plyr package that was also recommended. This is my first time using it, so I may have done things somewhat inefficiently, but it still helped substantially compared to the for loop with subsetting.
Using the same variables and setup as before...
data.mat <- read.csv("//home....")
system.time(hmm <- ddply(data.mat,"ID",function(df)c(cor(df$Var1,df$Var2, use="pairwise.complete.obs"),cor(df$Var1,df$Var3,use="pairwise.complete.obs"))))
user system elapsed
250.25 7.35 272.09
This is pretty much exactly what the plyr package is designed to make easier. However, it's unlikely that it will make things much faster - most of the time is probably spent doing the statistics.
Besides plyr, you can try the foreach package to get rid of the explicit loop counter, but I don't know if it will give you any performance benefit.
foreach nevertheless gives you a quite simple interface to parallel chunk processing if you have a multicore workstation (with the doMC/multicore packages; check Getting Started with doMC and foreach for details), in case you ruled out parallel processing only because it is not very easy for students to understand. If that is not the only reason, plyr is a very good solution IMHO.
Personally, I find plyr not very easy to understand. I prefer data.table, which is also faster. For instance, say you want the standard deviation of the column my_column for each ID:
dt <- data.table(df) # one-time operation: convert df to a data.table
result.sd <- dt[, sd(my_column), by="ID"] # one row per ID, with the SD in the second column
Three statements of this kind and a cbind at the end - that is all you need.
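As a hedged sketch of that consolidation (the column names col1/col2/col3 are placeholders), the cbind can be skipped by returning several statistics from a single j expression:
library(data.table)
# one pass over dt, grouped by ID, returning a named column per statistic
result <- dt[, list(sd1 = sd(col1), sd2 = sd(col2), sd3 = sd(col3)), by = "ID"]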
You can also use dt to do something for only one ID, without a subset command, in the new syntax:
result.sd.oneID <- dt[ID == "oneID", sd(my_column)]
The first argument refers to rows (i), the second to columns (j).
I find it easier to read than plyr, and it is more flexible, as you can also work on sub-domains within a "subset"...
The documentation describes it as using SQL-like methods. For instance, by is pretty much "group by" in SQL. If you know SQL, you can probably do much more, but knowing SQL is not necessary to make use of the package.
Finally, it is extremely fast, as data.table grabs only the data needed for each calculation, whereas subset keeps the whole data set and drags it through memory.
You have already suggested vectorizing and avoiding making unnecessary copies of intermediate results, so you are certainly on the right track. Let me caution you not to do what I did and just assume that vectorizing will always give you a performance boost (like it does in other languages, e.g., Python + NumPy, MATLAB).
An example:
# small function to time the results:
time_this = function(...) {
  start.time = Sys.time()
  eval(..., sys.frame(sys.parent(sys.parent())))
  end.time = Sys.time()
  print(end.time - start.time)
}
# data for testing: a 10000 x 1000 matrix of random doubles
a = matrix(rnorm(1e7, mean=5, sd=2), nrow=10000)
# two versions doing the same thing: calculating the mean for each row
# in the matrix
x = time_this( for (i in 1:nrow(a)){ mean( a[i,] ) } )
y = time_this( apply(X=a, MARGIN=1, FUN=mean) )
print(x) # returns => 0.5312099
print(y) # returns => 0.661242
The 'apply' version is actually slower than the 'for' version. (According to the author of The R Inferno, if you are doing this you are not vectorizing, you are 'loop hiding'.)
But where you can get a performance boost is by using built-ins. Below, I've timed the same operation as the two above, just using the built-in function 'rowMeans':
z = time_this(rowMeans(a))
print(z) # returns => 0.03679609
An order of magnitude improvement versus the 'for' loop (and the vectorized version).
Nor are the other members of the apply family merely thin wrappers over a native 'for' loop - they can add overhead of their own.
a = abs(floor(10*rnorm(1e6)))
time_this(sapply(a, sqrt))
# returns => 6.64 secs
time_this(for (i in 1:length(a)){ sqrt(a[i])})
# returns => 1.33 secs
'sapply' is about 5x slower compared with a 'for' loop.
Finally, w/r/t vectorized functions versus 'for' loops, I don't think I ever use a loop if I can use a vectorized function - the latter usually means fewer keystrokes, and it's a more natural way (for me) to code, which is a different kind of performance boost, I suppose.