iteratively constructed dataframe in R - r

I'm relatively new to R, and was wondering the most efficient way to iteratively construct a dataframe (one row at a time, the number of iterations "n" and the length of each row "l" are known beforehand).
Create empty dataframe, add a row each iteration
Preallocate n x l dataframe, modify a row each iteration
Preallocate n x l matrix, modify a row each iteration, make dataframe from matrix
Something else

Pre-allocate!!!
And use a matrix if the data are all the same type. It will be much faster than a data.frame.
For example:
> n <- 1000 # Number of rows
> row <- 1:20*1 # one row
>
> # Adding row, one-by-one
> Data <- data.frame()
> system.time(for(i in 1:n) Data <- rbind(Data,row))
user system elapsed
2.18 0.00 2.18
>
> # Pre-allocated data.frame
> Data <- as.data.frame(Data)
> system.time(for(i in 1:n) Data[i,] <- row)
user system elapsed
0.94 0.00 0.93
>
> # Pre-allocated matrix (fast!)
> Data <- as.matrix(Data)
> system.time({ for(i in 1:n) Data[i,] <- row; Data <- as.data.frame(Data) })
user system elapsed
0 0 0

How about pre-allocating with whatever column types you need from a list first?
as.data.frame(list(a1 = vector("numeric", n), a2 = vector("character", n)))

Related

Optimization of apply

I have existing code that calculates concordance value for a dataframe/matrix. It's basically the number of rows where all the values are the same over the total number of rows.
...
concordance<-new[complete.cases(new),] #removes rows with NAs
TF<-apply(concordance, 1, function(x) if(length(unique(x))>1) F else T)
#outputs vector of T/F if it is concordant
numF<-table(TF)["TRUE"]#gets number of trues
concValue<-numF/NROW(TF) #true/total
...
Above is what I have now. It runs ok but I was wondering if there was any way to make it faster.
Edit: Dimensions of object is variable, but # of cols are typically 2-6 and there are typically 1,000,000+ rows. This is part of a package i'm developing so input data is variable.
Because the number of rows is much larger than the number of columns it makes sense to loop on columns instead, dropping rows, where there is more than different one value in the process:
propIdentical <- function(Mat){
nrowInit <- nrow(Mat)
for(i in 1:(ncol(Mat) - 1)){
if(!nrow(Mat)) break #stop if the matrix has no rows
else{
#check which elements of column i and column i+1 are equal:
equals <- Mat[,i] == Mat[, i+1]
# remove all other rows from the matrix
Mat <- Mat[equals,,drop = F]
}
}
return(nrow(Mat)/nrowInit)
}
some tests:
set.seed(1)
# normal case
dat <- matrix(sample(1:10, rep = T, size = 3*10^6), nrow = 10^6)
system.time(prI <- propIdentical(dat)) ; prI
user system elapsed
0.053 0.017 0.070
[1] 0.009898
# normal case on my pc for comparison:
system.time(app <- mean(apply(dat, 1, function(x) length(unique(x))) == 1L)); app
user system elapsed
12.176 0.036 12.231
[1] 0.009898
# worst case
dat <- matrix(1L, nrow = 10^6, ncol = 6)
> system.time(prI <- propIdentical(dat)) ; prI
user system elapsed
0.302 0.044 0.348
[1] 1
# worst case on my pc for comparison
system.time(mean(apply(dat, 1, function(x) length(unique(x))) == 1L))
user system elapsed
12.562 0.001 12.578
# testing drop = F and if(!nrow(Mat)) break
dat <- matrix(1:2, ncol = 2)
> system.time(prI <- propIdentical(dat)) ; prI
user system elapsed
0 0 0
[1] 0
Note: if you run this on a data.frame make sure to turn it into a matrix first.

Is there a fast way to extract elements in a list of data frames?

I'm trying to find a fast way to extract elements in a list of data frames.
To do this, I've tested the function lapply. Here is a reproducible example:
i <- 2
dat <- replicate(100000, data.frame(x=1:5000, y = 1:5000, z = 1:5000), simplify=FALSE)
system.time(test <- lapply(dat, function(y) y[i, c("x", "y")]))
user system elapsed
7.69 0.00 7.73
Ideally, the elapsed time should be <= 1 second.

set missing values to constant in R, computational speed

In R, I have a reasonably large data frame (d) which is 10500 by 6000. All values are numeric.
It has many na value elements in both its rows and columns, and I am looking to replace these values with a zero. I have used:
d[is.na(d)] <- 0
but this is rather slow. Is there a better way to do this in R?
I am open to using other R packages.
I would prefer it if the discussion focused on computational speed rather than, "why would you replace na's with zeros", for example. And, while I realize a similar Q has been asked (How do I replace NA values with zeros in an R dataframe?) the focus has not been towards computational speed on a large data frame with many missing values.
Thanks!
Edited Solution:
As helpfully suggested, changing d to a matrix before applying is.na sped up the computation by an order of magnitude
You can get a considerable performance increase using the data.table package.
It is much faster, in general, with a lot of manipulations and transformations.
The downside is the learning curve of the syntax.
However, if you are looking for a speed performance boost, the investment could be worth it.
Generate fake data
r <- 10500
c <- 6000
x <- sample(c(NA, 1:5), r * c, replace = TRUE)
df <- data.frame(matrix(x, nrow = r, ncol = c))
Base R
df1 <- df
system.time(df1[is.na(df1)] <- 0)
user system elapsed
4.74 0.00 4.78
tidyr - replace_na()
dfReplaceNA <- function (df) {
require(tidyr)
l <- setNames(lapply(vector("list", ncol(df)), function(x) x <- 0), names(df))
replace_na(df, l)
}
system.time(df2 <- dfReplaceNA(df))
user system elapsed
4.27 0.00 4.28
data.table - set()
dtReplaceNA <- function (df) {
require(data.table)
dt <- data.table(df)
for (j in 1:ncol(dt)) {set(dt, which(is.na(dt[[j]])), j, 0)}
setDF(dt) # Return back a data.frame object
}
system.time(df3 <- dtReplaceNA(df))
user system elapsed
0.80 0.31 1.11
Compare data frames
all.equal(df1, df2)
[1] TRUE
all.equal(df1, df3)
[1] TRUE
I guess that all columns must be numeric or assigning 0s to NAs wouldn't be sensible.
I get the following timings, with approximately 10,000 NAs:
> M <- matrix(0, 10500, 6000)
> set.seed(54321)
> r <- sample(1:10500, 10000, replace=TRUE)
> c <- sample(1:6000, 10000, replace=TRUE)
> M[cbind(r, c)] <- NA
> D <- data.frame(M)
> sum(is.na(M)) # check
[1] 9999
> sum(is.na(D)) # check
[1] 9999
> system.time(M[is.na(M)] <- 0)
user system elapsed
0.19 0.12 0.31
> system.time(D[is.na(D)] <- 0)
user system elapsed
3.87 0.06 3.95
So, with this number of NAs, I get about an order of magnitude speedup by using a matrix. (With fewer NAs, the difference is smaller.) But the time using a data frame is just 4 seconds on my modest laptop -- much less time than it took to answer the question. If the problem really is of this magnitude, why is that slow?
I hope this helps.

How to vectorize or otherwise speed-up this looping logic in R?

Long time lurker, first time asker.
I'm trying to calculate 'items in common between 2 sets of items' for a 20M+ items dataset. Sample data looks like this.
#serially numbered items
parents <- rep(1:10000)
#generate rnorm # of children items
numchild <- round(rnorm(10000, mean=30, sd=10))
#fill the parent-child list
parent_child <- list()
for (x in 1:length(parents)){
if (numchild[x]>0){
f1 <- sample(1:length(parents), size=numchild[x])
f2 <- list(parents[f1])
parent_child <- c(parent_child, f2)
}
else {
parent_child <- c(parent_child, list(x+1)) #if numchild=0, make up something
}
}
Here is what I want to do: say parent item #1 has 5 children items-- 1,2,3,4,5 and parent item #2 has 3 children item-- 4,10,22.
I want to compute the length(intersection) of every (parent_i, parent_j) combination. In the above case, it would be 1 common item-- 4.
I am doing this for 10M+ parent items that on average have 15-20 children items with a (0,100) range. So that's a 10M x 10M item-item matrix.
I have a foreach loop that I am testing out on a smaller subset that works but doesn't quite scale for the full dataset (64 core machine with 256GB RAM). With the loop below I am already computing only half of the user-user matrix--> (parent_i, parent_j) same as (parent_j, parent_i) for this purpose.
#small subset
a <- parent_child[1:1000]
outerresults <- foreach (i = 1:(length(a)), .combine=rbind, .packages=c('foreach','doParallel')) %dopar% {
b <- a[[i]]
rest <- a[i+1:length(a)]
foreach (j = 1:(length(rest)), .combine=rbind) %dopar% {
common <- length(intersect(b, rest[[j]]))
if (common > 0) {g <- data.frame(u1=i, u2=j+1, common)}
}
}
I've been experimenting variations on this (using Reduce, storing parent-children in a daataframe etc.) but haven't had much luck.
Is there a way to make this scale?
I reversed the split, so that we have a child-parent relationship
len <- sapply(parent_child, length)
child_parent <- split(rep(seq_along(parent_child), len),
unlist(parent_child, use.names=FALSE))
Something like the following constructs a string with pairs of parents sharing a child, across all children
keep <- sapply(child_parent, length) > 1
int <- lapply(child_parent[keep], function(x) {
x <- combn(sort(x), 2)
paste(x[1,], x[2,], sep=".")
})
and tallying
table(unlist(int, use.names=FALSE))
or a little more quickly
xx <- unlist(int, use.names=FALSE)
nms <- unique(xx)
cnt <- match(xx, nms)
setNames(tabulate(cnt, length(nms), nms)
for
f1 <- function(parent_child) {
len <- sapply(parent_child, length)
child_parent <- split(rep(seq_along(parent_child), len),
unlist(parent_child, use.names=FALSE))
keep <- sapply(child_parent, length) > 1
int <- lapply(child_parent[keep], function(x) {
x <- combn(sort(x), 2)
paste(x[1,], x[2,], sep=".")
})
xx <- unlist(int, use.names=FALSE)
nms <- unique(xx)
cnt <- match(xx, nms)
setNames(tabulate(cnt, length(nms)), nms)
}
with (this is for all 10000 parent-child elements)
> system.time(ans1 <- f1(parent_child))
user system elapsed
14.625 0.012 14.668
> head(ans1)
542.1611 542.1832 542.2135 542.2435 542.2527 542.2806
1 1 1 1 1 1
I'm not sure that this would really scale to the size of problem you're talking about, though -- it's polynomial in the number of parents per child.
One possibility for speed-up is to 'memoize' the combinatorial calculation, using the length of the argument as a 'key' and storing the combination as 'value'. This reduces the number of times combn is called to the number of unique lengths of elements of child_parent.
combn1 <- local({
memo <- new.env(parent=emptyenv())
function(x) {
key <- as.character(length(x))
if (!exists(key, memo))
memo[[key]] <- t(combn(length(x), 2))
paste(x[memo[[key]][,1]], x[memo[[key]][,2]], sep=".")
}
})
f2 <- function(parent_child) {
len <- sapply(parent_child, length)
child_parent <- split(rep(seq_along(parent_child), len),
unlist(parent_child, use.names=FALSE))
keep <- sapply(child_parent, length) > 1
int <- lapply(child_parent[keep], combn1)
xx <- unlist(int, use.names=FALSE)
nms <- unique(xx)
cnt <- match(xx, nms)
setNames(tabulate(cnt, length(nms)), nms)
}
which helps somewhat
> system.time(ans2 <- f2(parent_child))
user system elapsed
5.337 0.000 5.347
> identical(ans1, ans2)
[1] TRUE
The slow part is now paste
> Rprof(); ans2 <- f2(parent_child); Rprof(NULL); summaryRprof()
$by.self
self.time self.pct total.time total.pct
"paste" 3.92 73.41 3.92 73.41
"match" 0.74 13.86 0.74 13.86
"unique.default" 0.40 7.49 0.40 7.49
"as.character" 0.08 1.50 0.08 1.50
"unlist" 0.08 1.50 0.08 1.50
"combn" 0.06 1.12 0.06 1.12
"lapply" 0.02 0.37 4.00 74.91
"any" 0.02 0.37 0.02 0.37
"setNames" 0.02 0.37 0.02 0.37
$by.total
...
We can avoid this by encoding the parents with shared child id into a single integer; because of the way floating point numbers are represented in R, this will be exact until about 2^21
encode <- function(x, y, n)
(x - 1) * (n + 1) + y
decode <- function(z, n)
list(x=ceiling(z / (n + 1)), y = z %% (n + 1))
and adjusting our combn1 and f2 functions as
combn2 <- local({
memo <- new.env(parent=emptyenv())
function(x, encode_n) {
key <- as.character(length(x))
if (!exists(key, memo))
memo[[key]] <- t(combn(length(x), 2))
encode(x[memo[[key]][,1]], x[memo[[key]][,2]], encode_n)
}
})
f3 <- function(parent_child) {
encode_n <- length(parent_child)
len <- sapply(parent_child, length)
child_parent <-
unname(split(rep(seq_along(parent_child), len),
unlist(parent_child, use.names=FALSE)))
keep <- sapply(child_parent, length) > 1
int <- lapply(child_parent[keep], combn2, encode_n)
id <- unlist(int, use.names=FALSE)
uid <- unique(xx)
n <- tabulate(match(xx, uid), length(uid))
do.call(data.frame, c(decode(uid, encode_n), list(n=n)))
}
leading to
> system.time(f3(parent_child))
user system elapsed
2.140 0.000 2.146
This compares very favorably (note that the timing in the previous line is for 10,000 parent-child relations) with jlhoward's revised answer
> system.time(result.3 <- do.call("rbind",lapply(1:99,gg)))
user system elapsed
2.465 0.000 2.468
> system.time(f3(parent_child[1:99]))
user system elapsed
0.016 0.000 0.014
and scales in a much more reasonable way.
For what it's worth, the data generation routine is in the second circle of Patrick Burn's R Inferno, using the 'copy-and-append' algorithm rather than pre-allocating the space and filling it in. Avoid this by writing the for loop body as a function, and using lapply. Avoid the need for the complicated conditional in the for loop by fixing the issue before-hand
numchild <- round(rnorm(10000, mean=30, sd=10))
numchild[numchild < 0] <- sample(numchild[numchild > 0], sum(numchild < 0))
or by sampling from a distribution (rpois, rbinom) that generates positive integer values. Data generation is then
n_parents <- 10000
numchild <- round(rnorm(n_parents, mean=30, sd=10))
numchild[numchild < 0] <- sample(numchild[numchild > 0], sum(numchild < 0))
parent_child <- lapply(numchild, sample, x=n_parents)
Here is another approach that is about 10X faster than my previous answer, and 17X faster than the original code (also simpler):
ff <- function(u2, u1, a) {
common <- length(intersect(a,parent_child[[u2]]))
if (common>0) {return(data.frame(u1,u2,common))}
}
gg <- function(u1) {
a <- parent_child[[u1]]
do.call("rbind",lapply((u1+1):100,ff,u1,a))
}
system.time(result.3 <- do.call("rbind",lapply(1:99,gg)))
user system elapsed
1.04 0.00 1.03
result.3 is identical to result.2 from previous answer:
max(abs(result.3-result.2))
[1] 0
Well, a small improvement (I think):
Original code (wrapped in function call):
f = function(n) {
#small subset
a <- parent_child[1:n]
outerresults <- foreach (i = 1:(length(a)),
.combine=rbind,
.packages=c('foreach','doParallel')) %dopar% {
b <- a[[i]]
rest <- a[i+1:length(a)]
foreach (j = 1:(length(rest)), .combine=rbind) %dopar% {
common <- length(intersect(b, rest[[j]]))
if (common > 0) {g <- data.frame(u1=i, u2=j+1, common)}
}
}
return(outerresults)
}
Modified code:
g <- function(n) {
a <- parent_child[1:n]
outerresults <- foreach (i = 1:n,
.combine=rbind,
.packages=c('foreach','doParallel')) %dopar% {
b <- a[[i]]
foreach (j = (i):n, .combine=rbind) %dopar% {
if (i!=j) {
c <- a[[j]]
common <- length(intersect(b, c))
if (common > 0) {g <- data.frame(u1=i, u2=j, common)}
}
}
}
return(outerresults)
}
Benchmarks:
system.time(result.old<-f(100))
user system elapsed
17.21 0.00 17.33
system.time(result.new<-g(100))
user system elapsed
10.42 0.00 10.47
The numbering for u2 is a little different becasue of the different approaches, but both produce the same vector of matches:
max(abs(result.old$common-result.new$common))
[1] 0
I tried this with data table joins replacing intersect(...) and it was actually much slower(!!)

How to subset from a long list?

I have a list x with millions of entries in it. And I want to put all the entries with a length larger than one into a new list z. How can I do this efficiently in R?
I tried this code, and R just keeps running for a long time.
z=NULL
for(i in 1:length(x)) {
if(length(x[[i]])!=1) z=list(z,x[[i]])
}
This is one case where you want to use vapply:
z <- x[vapply(x, length, integer(1)) > 1L]
Here are benchmarks comparing sapply and vapply:
A <- list( x = c(), y = c(1), z = c(1, 2))
B <- A[sample(1:3, 1e7, replace = TRUE)]
system.time(sapply(B, length))
# user system elapsed
# 55.95 0.54 56.50
system.time(vapply(B, length, integer(1)))
# user system elapsed
# 6.78 0.00 6.78
Just do:
z = x[sapply(x, length) > 1]

Resources