R really slow matrix / data.frame index selection

I am selecting a subset of a data.frame g.raw, like this:
g.raw <- read.table(gfile,sep=',', header=F, row.names=1)
snps = intersect(row.names(na.omit(csnp.raw)),row.names(na.omit(esnp.raw)))
g = g.raw[snps,]
It works. However, that last line is EXTREMELY slow.
g.raw has about 18M rows and snps has about 1M entries. I realize these are pretty large numbers, but this seems like a simple operation, and reading g.raw into an in-memory data.frame wasn't a problem (it took a few minutes), whereas the selection above takes hours.
How do I speed this up? All I want is to shrink g.raw a lot.
Thanks!

This seems to be a case where data.table can shine.
First, reproduce a data.frame:
set.seed(1)
N <- 1e6 # total number of rows
M <- 1e5 # number of rows to subset
g.raw <- data.frame(sample(1:N, N), sample(1:N, N), sample(1:N, N))
rownames(g.raw) <- sapply(1:N, function(x) paste(sample(letters, 50, replace=T), collapse=""))
snps <- sample(rownames(g.raw), M)
head(g.raw) # looking into newly created data.frame
head(snps) # and rows for subsetting
data.frame approach:
system.time(g <- g.raw[snps,])
# > user system elapsed
# > 881.039 0.388 884.821
data.table approach:
require(data.table)
dt.raw <- as.data.table(g.raw, keep.rownames=T)
# rn is a column with rownames(g.raw)
system.time(setkey(dt.raw, rn))
# > user system elapsed
# > 8.029 0.004 8.046
system.time(dt <- dt.raw[snps,])
# > user system elapsed
# > 0.428 0.000 0.429
Well, about 100 times faster with these N and M (and an even better speed-up with larger N).
You can compare results:
head(g)
head(dt)
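For a programmatic check rather than eyeballing head(), something like this sketch should work (it assumes both subsets come back in the order of snps, which they do here, and that the rowname column is called rn as above):
g.chk <- as.data.frame(dt)  # back to a plain data.frame
all.equal(unname(as.matrix(g)), unname(as.matrix(g.chk[, -1])), check.attributes=FALSE)  # drop the "rn" column, compare values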

Pre-allocate, and use a matrix for building if the data is of uniform type. See iteratively constructed dataframe in R for a far more beautiful answer.
UPDATE
You were right - the bottleneck was in selection. The solution is to look up the numeric indexes of snps, once, and then just select those rows, like so:
g <- g.raw[match(snps, rownames(g.raw)),]
I'm an R newbie - thanks, this was an informative exercise. FWIW, I've seen comments by others that they never use rownames - probably because of things like this.
UPDATE 2
See also fast subsetting in R, which is more-or-less a duplicate. Most significantly, note the first answer and the reference to Extract.data.frame, where we find out that rowname matching is partial, that there is a hash table on row names, and that the solution I suggested here turns out to be the canonical one. However, given all that, and some experiments, I still don't see why it's so slow. The partial match algorithm should first look in the hash table for an exact match, which in our case should always succeed.
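For anyone who wants to reproduce the comparison on synthetic data, here is a sketch (the sizes and names are made up for illustration, and actual timings will vary by R version and machine):
set.seed(1)
N <- 1e6; M <- 1e5
big <- data.frame(a=rnorm(N), b=rnorm(N))
rownames(big) <- sprintf("row%07d", sample(N))
keep <- sample(rownames(big), M)
system.time(s1 <- big[keep,])                        # subset by rowname
system.time(s2 <- big[match(keep, rownames(big)),])  # subset by numeric index
identical(unname(as.matrix(s1)), unname(as.matrix(s2)))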

Related

R: Efficiently Calculate Deviations from the Mean Using Row Operations on a DF (Without Using a For Loop)

I am generating a very large data frame consisting of a large number of combinations of values. As such, my coding has to be as efficient as possible or else 1) I get errors like - R cannot allocate vector of size XX or 2) the calculations take forever.
I am at the point where I need to calculate r deviations from the mean for each sample, one sample per row of the df (in the example below r = 3; they are labeled dev1 through dev3 in the screenshot in the original post).
My data in R are in a data.frame X1, built by the code at the end of this question.
I tried this (r is the number of values in each sample, here set to 3):
X2<-apply(X1[,1:r],1,function(x) x-X1$x.bar)
When I try this, the result is not what I expect. I am guessing that this code is attempting to calculate the difference between each row of X1 (x) and the entire vector X1$x.bar, instead of using 81 for the 1st row, 81.25 for the 2nd row, etc.
Once again, I can easily do this using for loops, but I'm assuming that is not the most efficient way.
Can someone please steer me in the right direction? Any assistance is appreciated.
Here is the whole code for the small-sample version with r <- 3. WARNING: this computes all possible combinations, so the data frames get very large very quickly.
options(scipen = 999)
dp <- function(x) {
  dp1 <- nchar(sapply(strsplit(sub('0+$', '', as.character(format(x, scientific = FALSE))),
                               ".", fixed = TRUE),
                      function(x) x[2]))
  ifelse(is.na(dp1), 0, dp1)
}
retain1<-function(x,minuni) length(unique(floor(x)))>=minuni
# =======================================================
r<-3
x0<-seq(80,120,.25)
X0<-data.frame(t(combn(x0,r)))
names(X0)<-paste("x",1:r,sep="")
X<-X0[apply(X0,1,retain1,minuni=r),]
rm(X0)
gc()
X$x.bar<-rowMeans(X)
dp1<-dp(X$x.bar)
X1<-X[dp1<=2,]
rm(X)
gc()
X2<-apply(X1[,1:r],1,function(x) x-X1$x.bar)
Because R is vectorized, you only need to subtract x.bar from x1, x2, x3 collectively:
devs <- X1[ , 1:3] - X1[ , 4]
X1devs <- cbind(X1, devs)
That's it...
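If you prefer being explicit about the margin, sweep() spells out the same row-wise subtraction (same result, just a different idiom):
devs <- sweep(X1[, 1:r], 1, X1$x.bar, "-")  # subtract each row's x.bar from that row
X1devs <- cbind(X1, devs)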
I think you just got the margin wrong: in apply you're using 1, i.e. row-wise, but you want to go column-wise, so use 2:
X2<-apply(X1[,1:r], 2, function(x) x-X1$x.bar)
But from what I quickly searched, the apply family isn't better than loops in performance, only in clarity. Check this post: Is R's apply family more than syntactic sugar?
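To see how the margin-fixed apply compares with the plain vectorized version on whatever X1 you have in memory, a quick timing sketch:
system.time(dev.vec <- X1[, 1:r] - X1$x.bar)                           # vectorized
system.time(dev.app <- apply(X1[, 1:r], 2, function(x) x - X1$x.bar))  # column-wise apply
all.equal(unname(as.matrix(dev.vec)), unname(dev.app))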

Can I vectorize code when data is in a list?

I am in the process of optimizing my code, and I am running into some problems. I know that the greatest speed ups in R come from vectorizing code instead of using loops. However, I have my data in lists, and I am not sure if I can vectorize my code or not. I have tried using the apply functions (like lapply, vapply), but I read that these functions are just for writing cleaner code and are actually using loops under the hood!
Here are my three biggest bottlenecks in my code, though I do not think anything can be done for the first part.
1) Reading data
I work with batches of 1000 matrices of dimensions 277x349. This is the biggest bottleneck in my script, but I alleviated the problem a little bit by using the doMC package to take advantage of multiple cores with the foreach function. This results in a list containing 1000 277x349 matrices.
For the purposes of the question, say we have a list of 1000 matrices of dimensions 277 x 349
# Fake data
l <- list()
for(i in 1:1000) {
l[[i]] <- matrix(rnorm(277*349), nrow=277, ncol=349)
}
2) Bottleneck #1
I need to make comparisons to some reference matrix (of same dimensions). This leads to comparing the 1000 matrices in my list to my reference matrix to get a vector of 1000 distances. If I know that the matrices are of the same dimensions, can I vectorize this step?
Here is some code:
# The reference matrix
r <- matrix(rnorm(277*349), nrow=277, ncol=349)
# The number of non NA values in matrix. Do not need to worry about this...
K <- 277*349
# Make a function to calculate distances
distance <- function(xi, xj, K, na.rm=TRUE) {
sqrt(sum((xi - xj)^2, na.rm=na.rm)/K)
}
# Get a vector containing all the distances
d <- vapply(l, distance, c(0), xj=r, K=K)
This step is bearably fast using vapply, but it is the third slowest part of the code.
3) Bottleneck #2
I now want to make a weighted average matrix of the J "closest" matrices to my reference matrix. (There is a sorting step, but assume that d[1] < d[2] < ... < d[1000] for simplicity). I want to get the weighted average matrix for when J=1,2,...,1000
# Get the weighted matrix
weightedMatrix <- function(listOfData, distances, J) {
  # Calculate weights (note: use the 'distances' argument, not the global 'd'):
  w <- distances[1:J]^{-2} / sum(distances[1:J]^{-2})
  # Get the weighted average matrix
  # *** I use a loop here ***
  x_bar <- matrix(0, nrow=nrow(listOfData[[1]]), ncol=ncol(listOfData[[1]]))
  for(i in 1:J) {
    x_bar <- x_bar + {listOfData[[i]] * w[i]}
  }
  return(x_bar)
}
# Oh no! Another loop...
res <- list()
for(i in 1:length(l) ) {
res[[i]] <- weightedMatrix(l, d, J=i)
}
I am a little stumped. I do not see an intuitive way to vectorize operations on a list of matrices.
The script that I am writing will be called fairly often, so even a little improvement can add up!
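For what it's worth, the closest I have come to 'vectorizing' the weighted sum for a single J is the sketch below, but Reduce and Map still iterate under the hood, so I doubt it changes much:
J <- 10
w <- d[1:J]^(-2) / sum(d[1:J]^(-2))
x_bar <- Reduce(`+`, Map(`*`, l[1:J], w))  # weighted sum of the first J matrices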
EDIT:
RE: 1) Reading data
I forgot to mention that my data is in a special format, so I have to use a special data reading function to read the data in R. The files are in NetCDF-4 format, and I am using the nc_open function from the package ncdf4 to access the files, and then I have to use the ncvar_get function to read the variable of interest. The nice thing is that the files stay on disk and I only read the variable of interest into memory with ncvar_get to operate on it in R.
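For reference, the reading step for one file looks roughly like this (the file and variable names here are placeholders, not my real ones):
library(ncdf4)
nc <- nc_open("batch_0001.nc")      # placeholder file name
m <- ncvar_get(nc, "my_variable")   # placeholder variable name; a 277x349 matrix in my case
nc_close(nc)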
That being said, although I know the size of my matrices and how many of them I will have, I asked my question with a list of data because the foreach function that enables me to do parallel computing outputs the results from the parallelized loop in a list. I found that with the foreach function, the data reading step was about 3x faster.
I imagine that I can arrange the data as a 3D array afterwards, but maybe allocating the 3D array would take more time than it saves? I will have to try it tomorrow.
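For example, stacking the list into a 3D array and doing the distance calculation in one shot might look like this (untested on the real data, and it ignores the na.rm handling):
a <- simplify2array(l)                 # 277 x 349 x 1000 array
d2 <- colSums((a - c(r))^2, dims=2)    # sum of squared differences for each slice
d.arr <- sqrt(d2 / K)
all.equal(d, d.arr, check.attributes=FALSE)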
EDIT 2:
Here are some of the timings I took of my script.
Original Script:
[1] "Reading data to memory"
user system elapsed
176.063 44.070 26.611
[1] "Calculating Distances"
user system elapsed
2.312 0.000 2.308
[1] "Calculating the best 333 weighted matrices"
user system elapsed
63.697 28.495 9.092
I made the following improvements thus far: (1) Pre-allocate the list before reading data, (2) Improved the weighted matrix calculations, as per Martin Morgan's suggestion.
[1] "Reading data to memory"
user system elapsed
192.448 38.578 27.872
[1] "Calculating Distances"
user system elapsed
2.324 0.000 2.326
[1] "Calculating all 1000 weighted matrices"
user system elapsed
1.376 0.000 1.374
Some notes:
I use 12 cores in my foreach loop to read in the data (registerDoMC(12)). The whole script takes approximately 40s / 36s to run before / after the improvements.
The timing for my Bottleneck #2 has improved by quite a bit. Previously, I had been computing only the top third (i.e. 333) of the weighted matrices, but now the script can just calculate all the weighted matrices in a fraction of the original time.
Thanks for the help, I will try tweaking my code later to see if I can change my script to work with 3D arrays instead of lists. I am going to take some time now to verify the calculations just to be sure they work!
My 'low-hanging fruit' (scan; pre-allocate and fill) seem not to be relevant here, so...
The operations in the distance calculation sort of look vectorized enough to me. Probably you can squeeze some extra speed out of doing a single distance calculation over all your matrices, but this probably makes the code less understandable.
The weightedMatrix calculation looks like there is room for improvement. Let's calculate
w <- d^(-2) / cumsum(d^(-2))
For a weighted matrix m I think the relationship between successive matrices is just m' = m * (1 - w[i]) + l[[i]] * w[i], so
res <- vector("list", length(l))
for (i in seq_along(l)) {
  if (i == 1L) {
    res[[i]] = l[[i]] * w[[i]]
  } else {
    res[[i]] = res[[i - 1]] * (1 - w[[i]]) + l[[i]] * w[[i]]
  }
}
This changes the calculation of res from quadratic to linear. My thoughts about better than linear performance were just a (probably also misguided) hunch; I haven't pursued that.
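As a quick sanity check that the running update reproduces the original calculation (using the question's weightedMatrix() and the same l and d):
J <- 5
all.equal(weightedMatrix(l, d, J), res[[J]])  # should be TRUE up to floating point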
Returning to pre-allocate and fill and #flodel's comment, we have
f0 <- function(n) {
  ## good: pre-allocate and fill
  l = vector("list", n)
  for (i in seq_along(l))
    l[[i]] = 1
  l
}
f1 <- function(n) {
  ## bad: copy and append
  l = list()
  for (i in seq_len(n))
    l[[i]] = 1
  l
}
which produce the same result
> identical(f0(100), f1(100))
[1] TRUE
but with different performance
> sapply(10^(1:5), function(i) system.time(f0(i))[3])
elapsed elapsed elapsed elapsed elapsed
0.000 0.000 0.002 0.014 0.134
> sapply(10^(1:5), function(i) system.time(f1(i))[3])
elapsed elapsed elapsed elapsed elapsed
0.000 0.001 0.005 0.253 24.520
Even though this does not matter at the scale of the current problem, it seems like one should adopt the better pre-allocate-and-fill strategy to avoid having to guess whether it's relevant or not. Better still, use the *apply family, or in this case replicate, to avoid having to think about it:
l <- replicate(1000, matrix(rnorm(277*349), nrow=277, ncol=349), simplify=FALSE)

R large dataframe conversions in parallel

Essentially I've got a large data frame, 10,000,000 x 900 (rows x columns), and I'm trying to convert the class of each column in parallel. The end result needs to be a data.frame.
Here's what I've got so far:
Pretend df is the data frame, already defined; the columns are a mix of numeric and character classes.
library(snow)
cl=makeCluster(50,type="SOCK")
cl.out=clusterApplyLB(cl,df,function(x)factor(x,exclude=NULL))
cl.out is a list of what I want, except that I need it to be of class data.frame.
So this is where I get stuck... do I try and combine all of the elements of cl.out into a data.frame which isn't going to be in parallel? (SLOW, time is an issue)
Can I implement something else with a different package? (foreach?)
Do I have to hand-code some C to get this done efficiently?
Any help would be appreciated.
Thanks,
One useful paradigm is to subset and replace all columns, treating df as list-like
df[] <- lapply(df, factor, exclude=NULL)
Do you really have 50 cores on a single machine, as implied by your call to makeCluster? If you're not on a Windows machine, use the parallel package and mclapply instead
library(parallel)
options(mc.cores=50)
df[] <- mclapply(df, factor, exclude=NULL)
Is this really going to help you in your overall evaluation? It seems to cost as much to distribute and retrieve the data as to do the calculation.
> f = factor(rep("M", 10000000), levels=LETTERS)
> df = data.frame(f, f, f, f, f, f, f, f)
> system.time(lapply(df, factor, exclude=NULL))
user system elapsed
2.676 0.564 3.250
> system.time(clusterApply(cl, df, factor, exclude=NULL))
user system elapsed
1.488 0.752 2.476
> system.time(mclapply(df, factor, exclude=NULL))
user system elapsed
1.876 1.832 1.814
(the multi-core and multi-process timings are probably highly variable).
If you have a data.frame of that size, I think you will run into memory issues very quickly.
You could use set from data.table; I think it will be much faster and more memory-efficient.
library(data.table)
# to set as a data.table without having to copy
setattr(df, 'class', c('data.table', 'data.frame'))
alloc.col(df)
for(nn in names(df)){
  set(df, j = nn, value = factor(df[[nn]]))
}
It is worth reading data.table and parallel computing
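If your version of data.table has setDT(), it does the same convert-by-reference in a single call; a sketch (adding exclude=NULL to match the factor() call in the question):
library(data.table)
setDT(df)  # convert to data.table by reference, no copy
for(nn in names(df)){
  set(df, j = nn, value = factor(df[[nn]], exclude=NULL))
}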
Since a data.frame is a list with a class attribute, you can just convert the list into a data.frame with as.data.frame:
cl.out <- as.data.frame(cl.out)
I notice that the column names are lost: if you are sure that they are
in the same order, you can set them back with:
names(cl.out) <- names(df)

R data.table efficient replication by group

I am running into some memory allocation problems trying to replicate some data by groups using data.table and rep.
Here is some sample data:
ob1 <- as.data.frame(cbind(c(1999),c("THE","BLACK","DOG","JUMPED","OVER","RED","FENCE"),c(4)),stringsAsFactors=FALSE)
ob2 <- as.data.frame(cbind(c(2000),c("I","WALKED","THE","BLACK","DOG"),c(3)),stringsAsFactors=FALSE)
ob3 <- as.data.frame(cbind(c(2001),c("SHE","PAINTED","THE","RED","FENCE"),c(1)),stringsAsFactors=FALSE)
ob4 <- as.data.frame(cbind(c(2002),c("THE","YELLOW","HOUSE","HAS","BLACK","DOG","AND","RED","FENCE"),c(2)),stringsAsFactors=FALSE)
sample_data <- rbind(ob1,ob2,ob3,ob4)
colnames(sample_data) <- c("yr","token","multiple")
What I am trying to do is replicate the tokens (in the present order) by the multiple for each year.
The following code works and gives me the answer I want:
good_solution1 <- ddply(sample_data, "yr", function(x) data.frame(rep(x[,2],x[1,3])))
good_solution2 <- data.table(sample_data)[, rep(token,unique(multiple)),by = "yr"]
The issue is that when I scale this up to 40mm+ rows, I get into memory issues for both possible solutions.
If my understanding is correct, these solutions are essentially doing an rbind, which allocates every time.
Does anyone have a better solution?
I looked at set() for data.table but was running into issues because I wanted to keep the tokens in the same order for each replication.
One way is:
require(data.table)
dt <- data.table(sample_data)
# multiple seems to be a character, convert to numeric
dt[, multiple := as.numeric(multiple)]
setkey(dt, "multiple")
dt[J(rep(unique(multiple), unique(multiple))), allow.cartesian=TRUE]
Everything except the last line should be straightforward. The last line subsets using the key column with the help of J(.). For each value in J(.), the corresponding value is matched against the key column and the matching subset is returned.
That is, if you do dt[J(1)] you'll get the subset where multiple = 1. And if you look carefully, dt[J(rep(1,2))] gives you the same subset, but twice. Note that there's a difference between passing dt[J(1,1)] and dt[J(rep(1,2))]. The former matches the values (1,1) against the first two key columns of the data.table respectively, whereas the latter matches the two values 1 and 1, one after the other, against the first key column of the data.table.
So, if we pass the same value of the key column 2 times in J(.), the matching rows are returned twice. We use this trick to pass 1 one time, 2 two times, etc., and that's what the rep(.) part does: rep(.) gives 1,2,2,3,3,3,4,4,4,4.
And if the join results in more rows than max(nrow(dt), nrow(i)) (i is the rep vector that's inside J(.)), you have to explicitly use allow.cartesian = TRUE to perform this join (I believe this is a new feature from data.table 1.8.8).
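A tiny toy example makes the duplication trick easier to see (made-up data, just for illustration):
library(data.table)
toy <- data.table(multiple=1:3, token=c("a","b","c"))
setkey(toy, multiple)
toy[J(rep(1:3, 1:3))]  # multiple=1 appears once, 2 twice, 3 three times, each carrying its token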
Edit: Here's some benchmarking I did on "relatively" big data. I don't see any spike in memory allocations with either method. But I've yet to find a way to monitor peak memory usage within a function in R. I am sure I've seen such a post here on SO, but it escapes me at the moment. I'll write back again. For now, here's a test data set and some preliminary results in case anyone is interested or wants to run it themselves.
# dummy data
set.seed(45)
yr <- 1900:2013
sz <- sample(10:50, length(yr), replace = TRUE)
token <- unlist(sapply(sz, function(x) do.call(paste0, data.frame(matrix(sample(letters, x*4, replace=T), ncol=4)))))
multiple <- rep(sample(500:5000, length(yr), replace=TRUE), sz)
DF <- data.frame(yr = rep(yr, sz),
token = token,
multiple = multiple, stringsAsFactors=FALSE)
# Arun's solution
ARUN.DT <- function(dt) {
  setkey(dt, "multiple")
  idx <- unique(dt$multiple)
  dt[J(rep(idx, idx)), allow.cartesian=TRUE]
}
# Ricardo's solution
RICARDO.DT <- function(dt) {
  setkey(dt, key="yr")
  newDT <- setkey(dt[, rep(NA, list(rows=length(token) * unique(multiple))), by=yr][, list(yr)], 'yr')
  newDT[, tokenReps := as.character(NA)]
  # Add the rep'd tokens into newDT, using recycling
  newDT[, tokenReps := dt[.(y)][, token], by=list(y=yr)]
  newDT
}
# create data.table
require(data.table)
DT <- data.table(DF)
# benchmark both versions
require(rbenchmark)
benchmark(res1 <- ARUN.DT(DT), res2 <- RICARDO.DT(DT), replications=10, order="elapsed")
# test replications elapsed relative user.self sys.self
# 1 res1 <- ARUN.DT(DT) 10 9.542 1.000 7.218 1.394
# 2 res2 <- RICARDO.DT(DT) 10 17.484 1.832 14.270 2.888
But as Ricardo says, it may not matter if you run out of memory. So, in that case, there has to be a trade-off between speed and memory. What I'd like to verify is the peak memory used in both methods here to say definitively if using Join is better.
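One base-R way to approximate peak memory inside a call is gc(reset=TRUE): reset the garbage collector's high-water mark before the call and read the "max used" columns afterwards. It only measures R's heap, not OS-level usage, but it's a start:
gc(reset=TRUE)                # reset the "max used" statistics
invisible(ARUN.DT(copy(DT)))  # copy() because ARUN.DT sets the key by reference
gc()                          # the "max used" row now reflects the peak since the reset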
You can try allocating the memory for all the rows first, and then populating them iteratively.
eg:
# make sure `sample_data$multiple` is an integer
sample_data$multiple <- as.integer(sample_data$multiple)
# create data.table
S <- data.table(sample_data, key='yr')
# optionally, drop original data.frame if not needed
rm(sample_data)
## Allocate the memory first
newDT <- data.table(yr = rep(S$yr, S$multiple), key="yr")  # use S here, since sample_data may already be removed
newDT[, tokenReps := as.character(NA)]
# Add the rep'd tokens into newDT, using recycling
newDT[, tokenReps := S[.(y)][, token], by=list(y=yr)]
Two notes:
(1) sample_data$multiple is currently a character and thus getting coerced when passed to rep (in your original example). It might be worth double-checking your real data if that is also the case.
(2) I used the following to determine the number of rows needed per year
S[, list(rows=length(token) * unique(multiple)), by=yr]

How can I prevent rbind() from getting really slow as the dataframe grows larger?

I have a dataframe with only 1 row. To this I start to add rows by using rbind:
df #mydataframe with only one row
for (i in 1:20000)
{
df<- rbind(df, newrow)
}
This gets very slow as i grows. Why is that, and how can I make this type of code faster?
You are in the 2nd circle of hell, namely failing to pre-allocate data structures.
Growing objects in this fashion is a Very Very Bad Thing in R. Either pre-allocate and insert:
df <- data.frame(x = rep(NA,20000),y = rep(NA,20000))
or restructure your code to avoid this sort of incremental addition of rows. As discussed at the link I cite, the reason for the slowness is that each time you add a row, R needs to find a new contiguous block of memory to fit the data frame in. Lots 'o copying.
I tried an example. For what it's worth, it agrees with the user's assertion that inserting rows into the data frame is also really slow. I don't quite understand what's going on, as I would have expected the allocation problem to trump the speed of copying. Can anyone either replicate this, or explain why the results below (rbind < appending < insertion) would be true in general, or explain why this is not a representative example (e.g. data frame too small)?
edit: the first time around I forgot to initialize the object in hell2fun to a data frame, so the code was doing matrix operations rather than data frame operations, which are much faster. If I get a chance I'll extend the comparison to data frame vs. matrix. The qualitative assertions in the first paragraph hold, though.
N <- 1000
set.seed(101)
r <- matrix(runif(2*N),ncol=2)
## second circle of hell
hell2fun <- function() {
  df <- as.data.frame(rbind(r[1,])) ## initialize
  for (i in 2:N) {
    df <- rbind(df, r[i,])
  }
}
insertfun <- function() {
  df <- data.frame(x=rep(NA,N), y=rep(NA,N))
  for (i in 1:N) {
    df[i,] <- r[i,]
  }
}
rsplit <- as.list(as.data.frame(t(r)))
rbindfun <- function() {
  do.call(rbind, rsplit)
}
library(rbenchmark)
benchmark(hell2fun(),insertfun(),rbindfun())
## test replications elapsed relative user.self
## 1 hell2fun() 100 32.439 484.164 31.778
## 2 insertfun() 100 45.486 678.896 42.978
## 3 rbindfun() 100 0.067 1.000 0.076
