Problems with speeding up a loop in R

I have a particularly big dataset consisting of 3.7 million rows and 76 string columns.
I want to compare each row with the row below it in terms of whether they match, and have written the code below. The number of matching values between adjacent rows should be reported.
a <- c("a","a","a","a","a","a","a","a","a")
b <- c("b","b","b","b","a","b","b","b","b")
c <- c("c","c","c","c","a","a","a","b","b")
d <- c("d","d","d","d","d","d","d","d","d")
features_split <- data.frame(a,b,c,d); features_split
library(data.table)  # needed for as.data.table()
ncol <- max(sapply(features_split, length))
safe <- as.data.table(lapply(1:ncol, function(i) sapply(features_split, "[", i)))
nrow(safe)
df <- safe
LIST  <- list()
LIST2 <- list()
for (i in 1:(nrow(df) - 1)) {
  LIST[[i]]  <- df[i + 1, ] %in% df[i, ]
  LIST2[[i]] <- length(LIST[[i]][LIST[[i]] == TRUE])
}
safe2 <- unlist(LIST2)
not_available <- rowSums(!is.na(safe))
That loop takes forever to run (about an hour for 100,000 rows, and I have more than 3.7 million). How can I improve it?
Grateful for anything,
Tobi

Using a data.frame
Proof of concept, using data.frame:
set.seed(4)
nr <- 1000
mydf <- data.frame(a = sample(letters[1:3], nr, replace = TRUE),
                   b = sample(letters[1:3], nr, replace = TRUE),
                   c = sample(letters[1:3], nr, replace = TRUE),
                   d = sample(letters[1:3], nr, replace = TRUE),
                   stringsAsFactors = FALSE)
matches <- vapply(seq.int(nrow(mydf) - 1),
                  function(ii) sum(mydf[ii, ] == mydf[ii + 1, ]),
                  integer(1))
head(matches)
## [1] 0 3 4 2 1 0
sum(matches == 4) # total number of perfect row-matches
## 16
In matches, the integer in position i indicates how many strings in row i exactly match the corresponding strings in row i+1. A value of 0 means no matches at all, and (in this case) 4 means the rows match perfectly.
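If you also need to know which rows those are, the indices follow directly from matches (a small add-on, not part of the original answer):
idx <- which(matches == ncol(mydf))  # rows i that match row i + 1 in every column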
Taking it a bit larger for a demonstration of time:
nr <- 100000
nc <- 76
mydf2 <- as.data.frame(matrix(sample(letters[1:4], nr*nc, replace = TRUE), ncol = nc),
                       stringsAsFactors = FALSE)
dim(mydf2)
## [1] 100000 76
system.time(
  matches2 <- vapply(seq.int(nrow(mydf2) - 1),
                     function(ii) sum(mydf2[ii, ] == mydf2[ii + 1, ]),
                     integer(1))
)
## user system elapsed
## 370.63 12.14 385.36
Using a matrix instead
If you can afford to do it as a matrix (since you have a homogeneous data type, "character") instead of a data.frame, you'll get considerably better performance:
nr <- 100000
nc <- 76
mymtx2 <- matrix(sample(letters[1:4], nr*nc, replace = TRUE), ncol = nc)
dim(mymtx2)
## [1] 100000 76
system.time(
  matches2 <- vapply(seq.int(nrow(mymtx2) - 1),
                     function(ii) sum(mymtx2[ii, ] == mymtx2[ii + 1, ]),
                     integer(1))
)
## user system elapsed
## 0.81 0.00 0.81
(Compare with 370.63 seconds of user time in the previous run.) Scaling it up to the full size:
nr <- 3.7e6
nc <- 76
mymtx3 <- matrix(sample(letters[1:4], nr*nc, replace = TRUE), ncol = nc)
dim(mymtx3)
## [1] 3700000 76
system.time(
  matches3 <- vapply(seq.int(nrow(mymtx3) - 1),
                     function(ii) sum(mymtx3[ii, ] == mymtx3[ii + 1, ]),
                     integer(1))
)
## user system elapsed
## 35.32 0.05 35.81
length(matches3)
## [1] 3699999
sum(matches3 == nc)
## [1] 0
Unfortunately, still no matches, but I think 36 seconds is considerably better for 3.7M than an hour for 100K. (Please correct me if I've made an incorrect assumption.)
(Ref: win7 x64, R-3.0.3-64bit, intel i7-2640M 2.8GHz, 8GB RAM)
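As a further alternative (a sketch that is not from the original answer and not benchmarked here), the whole comparison can be vectorised by comparing the matrix with a one-row-shifted copy of itself; the trade-off is that it materialises an (nr-1) x nc logical matrix, roughly 1 GB at 3.7M x 76:
# compare every row with the next one in a single call
matches_vec <- rowSums(mymtx3[-1, , drop = FALSE] == mymtx3[-nrow(mymtx3), , drop = FALSE])
all(matches_vec == matches3)  # should agree with the vapply result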

Related

How to create a unique identifier for 100,000 individuals with 5 characters?

I have 100,000 individuals
Using a combination of upper case letters, lower case letters and numbers, I want to create
a five-character ID for each individual. I should not have any duplicates.
How can I do this? I have tried the code below but I have 4 duplicates.
What is the number of possible unique combinations to create a 5 character ID with "letters", "LETTERS" and "0:9"?
set.seed(0)
mydata <- data.frame(
  ID = rep(NA, 10^5),
  Poids = rnorm(n = 10^5, mean = 65, sd = 5)
)
for (i in 1:nrow(mydata)) {
  mydata$ID[i] <- c(
    paste(sample(c(0:9, LETTERS, letters), replace = F, size = 1),
          sample(c(0:9, LETTERS, letters), replace = F, size = 1),
          sample(c(0:9, LETTERS, letters), replace = F, size = 1),
          sample(c(0:9, LETTERS, letters), replace = F, size = 1),
          sample(c(0:9, LETTERS, letters), replace = F, size = 1), sep = "")
  )
}
table(duplicated(mydata$ID))
FALSE TRUE
99996 4
(length(letters) + length(LETTERS) + length(0:9))^5 is 916,132,832, so there is plenty of space to avoid clashes.
In fact, we can use this number to help generate our sample. We draw 100,000 integers out of 916,132,832 without replacement and interpret each number as its unique string of characters using a bit of modular math and indexing. This can all be done in a single pass:
space <- c(LETTERS, letters, 0:9)
set.seed(0)
samps <- sample(length(space)^5, 10^5)
m <- matrix("", nrow = 10^5, ncol = 5)
for (i in seq(ncol(m))) {
  m[, i] <- space[(samps %% length(space)) + 1]
  samps <- samps %/% length(space)
}
ID <- apply(m, 1, paste, collapse = "")
We can see this fulfils our requirements:
head(ID)
#> [1] "vpdnq" "rK0ej" "ofE9t" "PqLIr" "6G6tu" "Vhc7R"
length(ID)
#> [1] 100000
length(unique(ID))
#> [1] 100000
The whole thing takes less than a second on my modest machine:
user system elapsed
0.72 0.00 0.74
Update
It occurs to me that it is possible to give 100,000 people a unique ID using an alphabet of only 16 characters, i.e. 0-9 and a-f, with code that is much quicker and simpler than the above:
set.seed(0)
ID <- as.hexmode(sample(16^5, 10^5))
head(ID)
#> [1] "d43f9" "392a7" "033a2" "cf1d7" "aa10e" "134bb"
length(unique(ID))
#> [1] 100000
This takes less than 10 milliseconds.
Created on 2022-05-15 by the reprex package (v2.0.1)
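If you need plain character strings rather than a hexmode vector (for example, to store them in a data.frame column), format() should give zero-padded 5-character strings; a minimal sketch, reusing mydata from the question:
mydata$ID <- format(ID)  # e.g. "033a2"; zero-padded so every ID has 5 characters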
You can try the code below (given N <- 1e5 and k <- 5):
n <- ceiling(N^(1 / k))
S <- sample(c(LETTERS, letters, 0:9), n)
ID <- head(do.call(paste0, expand.grid(rep(list(S), k))),N)
where
n is the smallest number of distinct characters such that n^k covers at least N unique combinations, e.g. N <- 100000
S is a random subset of n characters drawn from the full alphabet of digits and letters
expand.grid enumerates all k-character combinations of S (and head(..., N) keeps the first N)
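A quick sanity check of the result (a small sketch, reusing N, k and ID from above):
length(ID) == N         # exactly N IDs
anyDuplicated(ID) == 0  # all IDs are distinct
all(nchar(ID) == k)     # every ID has k characters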
If you don't need randomness, the highly performant arrangements package can help by iterating over the permutations in order, not generating any more than are needed:
library(arrangements)
x = c(letters, LETTERS, 0:9)
ix = ipermutations(x = x, k = 5)
ind = ix$getnext(d = nrow(mydata))
mydata$ID = apply(ind, MAR = 1, FUN = \(i) paste(x[i], collapse = ""))
rbind(head(mydata), tail(mydata))
# ID Poids
# 1 abcde 64.46278
# 2 abcdf 62.00053
# 3 abcdg 75.71787
# 4 abcdh 67.73765
# 5 abcdi 66.45402
# 6 abcdj 66.85561
# 99995 abFpe 56.20545
# 99996 abFpf 64.14443
# 99997 abFpg 70.70191
# 99998 abFph 66.83226
# 99999 abFpi 65.22835
# 100000 abFpj 56.28880
This is quite fast:
user system elapsed
0.194 0.001 0.203

R: Find set of columns which contain most 1s in matrix of 0 and 1

I have a matrix of 1s and 0s where the rows are individuals and the columns are events. A 1 indicates that an event happened to an individual and a 0 that it did not.
I want to find which set of (in the example) 5 columns/events that cover the most rows/individuals.
Test Data
#Make test data
set.seed(123)
d <- sapply(1:300, function(x) sample(c(0,1), 30, T, c(0.9,0.1)))
colnames(d) <- 1:300
rownames(d) <- 1:30
My attempt
My initial attempt was just based on combining the set of 5 columns with the highest colMeans:
#Get top 5 columns with highest row coverage
col_set <- head(sort(colMeans(d), decreasing = T), 5)
#Have a look the set
col_set
>
197 199 59 80 76
0.2666667 0.2666667 0.2333333 0.2333333 0.2000000
#Check row coverage of the column set
sum(apply(d[,colnames(d) %in% names(col_set)], 1, sum) > 0) / 30 #top 5
>
[1] 0.7
However this set does not cover the most rows. I tested this by pseudo-randomly sampling 10,000 different sets of 5 columns, and then finding the set with the highest coverage:
#Get 5 random columns using colMeans as prob in sample
##Random sample 10,000 times
set.seed(123)
result <- lapply(1:10000, function(x){
  col_set2 <- sample(colMeans(d), 5, F, colMeans(d))
  cover <- sum(apply(d[, colnames(d) %in% names(col_set2)], 1, sum) > 0) / 30 #random 5
  list(set = col_set2, cover = cover)
})
##Have a look at the best set
result[which.max(sapply(result, function(x) x[["cover"]]))]
>
[[1]]
[[1]]$set
59 169 262 68 197
0.23333333 0.10000000 0.06666667 0.16666667 0.26666667
[[1]]$cover
[1] 0.7666667
The reason for supplying the colMeans to sample is that the columns with the highest coverages are the ones I am most interested in.
So, using pseudo-random sampling I can collect a set of columns with higher coverage than when just using the top 5 columns. However, since my actual data sets are larger than the example I am looking for a more efficient and rational way of finding the set of columns with the highest coverage.
EDIT
For the interested, I decided to microbenchmark the 3 solutions provided:
#Defining G. Grothendieck's coverage function outside his solutions
coverage <- function(ix) sum(rowSums(d[, ix]) > 0) / 30

#G. Grothendieck's top solution
solution1 <- function(d){
  cols <- tail(as.numeric(names(sort(colSums(d)))), 20)
  co <- combn(cols, 5)
  itop <- which.max(apply(co, 2, coverage))
  co[, itop]
}

#G. Grothendieck's "Older solution"
solution2 <- function(d){
  require(lpSolve)
  ones <- rep(1, 300)
  res <- lp("max", colSums(d), t(ones), "<=", 5, all.bin = TRUE, num.bin.solns = 10)
  m <- matrix(res$solution[1:3000] == 1, 300)
  cols <- which(rowSums(m) > 0)
  co <- combn(cols, 5)
  itop <- which.max(apply(co, 2, coverage))
  co[, itop]
}

#user2554330's solution
bestCols <- function(d, n = 5) {
  result <- numeric(n)
  for (i in seq_len(n)) {
    result[i] <- which.max(colMeans(d))
    d <- d[d[, result[i]] != 1, , drop = FALSE]
  }
  result
}

#Benchmarking...
microbenchmark::microbenchmark(solution1 = solution1(d),
                               solution2 = solution2(d),
                               solution3 = bestCols(d), times = 10)
>
Unit: microseconds
      expr        min         lq       mean      median         uq       max neval
 solution1 390811.850 497155.887 549314.385 578686.3475 607291.286 651093.16    10
 solution2  55252.890  71492.781  84613.301  84811.7210  93916.544 117451.35    10
 solution3    425.922    517.843   3087.758    589.3145    641.551  25742.11    10
This looks like a relatively hard optimization problem, because of the ways columns interact. An approximate strategy would be to pick the column with the highest mean; then delete the rows with ones in that column, and repeat. You won't necessarily find the best solution this way, but you should get a fairly good one.
For example,
set.seed(123)
d <- sapply(1:300, function(x) sample(c(0,1), 30, T, c(0.9,0.1)))
colnames(d) <- 1:300
rownames(d) <- 1:30
bestCols <- function(d, n = 5) {
  result <- numeric(n)
  for (i in seq_len(n)) {
    result[i] <- which.max(colMeans(d))
    d <- d[d[, result[i]] != 1, , drop = FALSE]
  }
  cat("final dim is ", dim(d))
  result
}
col_set <- bestCols(d)
sum(apply(d[,colnames(d) %in% col_set], 1, sum) > 0) / 30 #top 5
This gives 90% coverage.
The following provides a heuristic to find an approximate solution. Find the N=20 columns, say, with the most ones, cols, and then use brute force to find every subset of 5 columns out of those 20. The subset having the highest coverage is shown below and its coverage is 93.3%.
coverage <- function(ix) sum(rowSums(d[, ix]) > 0) / 30
N <- 20
cols <- tail(as.numeric(names(sort(colSums(d)))), N)
co <- combn(cols, 5)
itop <- which.max(apply(co, 2, coverage))
co[, itop]
## [1] 90 123 197 199 286
coverage(co[, itop])
## [1] 0.9333333
Repeating this for N = 5, 10, 15 and 20 we get coverages of 83.3%, 86.7%, 90% and 93.3%. The larger N is, the better the coverage, but also the longer the run time.
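To reproduce that comparison in one go, something like the following sketch (reusing d and coverage from above) should work:
# coverage of the best 5-column subset drawn from the top-N columns, for several N
sapply(c(5, 10, 15, 20), function(N) {
  cols <- tail(as.numeric(names(sort(colSums(d)))), N)
  co <- combn(cols, 5)
  max(apply(co, 2, coverage))
})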
Older solution
We can approximate the problem with a knapsack problem that chooses the 5 columns with largest numbers of ones using integer linear programming.
We get the 10 best solutions to this approximate problem and collect all columns that appear in at least one of them. There are 14 such columns, and we then use brute force to find which subset of 5 of those 14 columns has the highest coverage.
library(lpSolve)
ones <- rep(1, 300)
res <- lp("max", colSums(d), t(ones), "<=", 5, all.bin = TRUE, num.bin.solns = 10)
coverage <- function(ix) sum(rowSums(d[, ix]) > 0) / 30
# each column of m is logical 300-vector defining possible soln
m <- matrix(res$solution[1:3000] == 1, 300)
# cols is the set of columns which are in any of the 10 solutions
cols <- which(rowSums(m) > 0)
length(cols)
## [1] 14
# use brute force to find the 5 best columns among cols
co <- combn(cols, 5)
itop <- which.max(apply(co, 2, coverage))
co[, itop]
## [1] 90 123 197 199 286
coverage(co[, itop])
## [1] 0.9333333
You can iteratively test whether a column outside the current selection would be better and, if so, exchange it with one currently in the selection.
n <- 5 #Number of columns / events
i <- rep(1, n)
for (k in 1:10) { #How many times to iterate
  tt <- i
  for (j in seq_along(i)) {
    x <- +(rowSums(d[, i[-j]]) > 0)
    i[j] <- which.max(colSums(x == 0 & d == 1))
  }
  if (identical(tt, i)) break
}
sort(i)
#[1] 90 123 197 199 286
mean(rowSums(d[,i]) > 0)
#[1] 0.9333333
Since the initial condition influences the result, you can also try random starts.
n <- 5 #Number of columns / events
x <- apply(d, 2, function(x) colSums(x == 0 & d == 1))
diag(x) <- -1
idx <- which(!apply(x == 0, 1, any))

x <- apply(d, 2, function(x) colSums(x != d))
diag(x) <- -1
x[upper.tri(x)] <- -1
idx <- unname(c(idx, which(apply(x == 0, 1, any))))

res <- sample(idx, n)
for (l in 1:100) {
  i <- sample(idx, n)
  for (k in 1:10) { #How many times to iterate
    tt <- i
    for (j in seq_along(i)) {
      x <- +(rowSums(d[, i[-j]]) > 0)
      i[j] <- which.max(colSums(x == 0 & d == 1))
    }
    if (identical(tt, i)) break
  }
  if (sum(rowSums(d[, i]) > 0) > sum(rowSums(d[, res]) > 0)) res <- i
}
sort(res)
#[1] 90 123 197 199 286
mean(rowSums(d[,res]) > 0)
#[1] 0.9333333

Optimization of apply

I have existing code that calculates concordance value for a dataframe/matrix. It's basically the number of rows where all the values are the same over the total number of rows.
...
concordance<-new[complete.cases(new),] #removes rows with NAs
TF<-apply(concordance, 1, function(x) if(length(unique(x))>1) F else T)
#outputs vector of T/F if it is concordant
numF<-table(TF)["TRUE"]#gets number of trues
concValue<-numF/NROW(TF) #true/total
...
Above is what I have now. It runs ok but I was wondering if there was any way to make it faster.
Edit: The dimensions of the object vary, but the number of columns is typically 2-6 and there are typically 1,000,000+ rows. This is part of a package I'm developing, so the input data is variable.
Because the number of rows is much larger than the number of columns, it makes sense to loop over the columns instead, dropping the rows in which the compared columns differ as we go:
propIdentical <- function(Mat){
  nrowInit <- nrow(Mat)
  for (i in 1:(ncol(Mat) - 1)) {
    if (!nrow(Mat)) { #stop if the matrix has no rows
      break
    } else {
      #check which elements of column i and column i+1 are equal:
      equals <- Mat[, i] == Mat[, i + 1]
      #remove all other rows from the matrix
      Mat <- Mat[equals, , drop = FALSE]
    }
  }
  return(nrow(Mat) / nrowInit)
}
some tests:
set.seed(1)
# normal case
dat <- matrix(sample(1:10, rep = T, size = 3*10^6), nrow = 10^6)
system.time(prI <- propIdentical(dat)) ; prI
user system elapsed
0.053 0.017 0.070
[1] 0.009898
# normal case on my pc for comparison:
system.time(app <- mean(apply(dat, 1, function(x) length(unique(x))) == 1L)); app
user system elapsed
12.176 0.036 12.231
[1] 0.009898
# worst case
dat <- matrix(1L, nrow = 10^6, ncol = 6)
> system.time(prI <- propIdentical(dat)) ; prI
user system elapsed
0.302 0.044 0.348
[1] 1
# worst case on my pc for comparison
system.time(mean(apply(dat, 1, function(x) length(unique(x))) == 1L))
user system elapsed
12.562 0.001 12.578
# testing drop = F and if(!nrow(Mat)) break
dat <- matrix(1:2, ncol = 2)
> system.time(prI <- propIdentical(dat)) ; prI
user system elapsed
0 0 0
[1] 0
Note: if you run this on a data.frame make sure to turn it into a matrix first.
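For example, applied to the question's data it might look like this (a sketch, reusing the concordance object from the question; as.matrix on an all-character data.frame gives a character matrix, which the == comparison above handles fine):
concValue <- propIdentical(as.matrix(concordance))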

Aggregate rows in a large matrix by rowname

I would like to aggregate the rows of a matrix by adding the values in rows that have the same rowname. My current approach is as follows:
> M
a b c d
1 1 1 2 0
1 2 3 4 2
2 3 0 1 2
3 4 2 5 2
> index <- as.numeric(rownames(M))
> M <- cbind(M,index)
> Dfmat <- data.frame(M)
> Dfmat <- aggregate(. ~ index, data = Dfmat, sum)
> M <- as.matrix(Dfmat)
> rownames(M) <- M[,"index"]
> M <- subset(M, select= -index)
> M
a b c d
1 3 4 6 2
2 3 0 1 2
3 4 2 5 2
The problem with this approach is that I need to apply it to a number of very large matrices (up to 1,000 rows and 30,000 columns). In these cases the computation time is very high (the same problem occurs when using ddply). Is there a more efficient way to come up with the solution? Does it help that the original input matrices are DocumentTermMatrix objects from the tm package? As far as I know they are stored in a sparse matrix format.
Here's a solution using by and colSums, but it requires some fiddling due to the default output of by.
M <- matrix(1:9,3)
rownames(M) <- c(1,1,2)
t(sapply(by(M,rownames(M),colSums),identity))
V1 V2 V3
1 3 9 15
2 3 6 9
There is now an aggregate function in the Matrix.utils package. It can accomplish what you want with a single line of code and is about 10x faster than the combineByRow solution and 100x faster than the by solution:
N <- 10000
m <- matrix( runif(N*100), nrow=N)
rownames(m) <- sample(1:(N/2),N,replace=T)
> microbenchmark(a<-t(sapply(by(m,rownames(m),colSums),identity)),b<-combineByRow(m),c<-aggregate.Matrix(m,row.names(m)),times = 10)
Unit: milliseconds
expr min lq mean median uq max neval
a <- t(sapply(by(m, rownames(m), colSums), identity)) 6000.26552 6173.70391 6660.19820 6419.07778 7093.25002 7723.61642 10
b <- combineByRow(m) 634.96542 689.54724 759.87833 732.37424 866.22673 923.15491 10
c <- aggregate.Matrix(m, row.names(m)) 42.26674 44.60195 53.62292 48.59943 67.40071 70.40842 10
> identical(as.vector(a),as.vector(c))
[1] TRUE
EDIT: Frank is right, rowsum is somewhat faster than any of these solutions. You would only want to consider one of the other functions if you were using a Matrix, especially a sparse one, or if you were performing an aggregation other than sum.
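For reference, the rowsum call for this task is a one-liner (a sketch using the benchmark data above; rowsum sums the rows of m within each distinct rowname):
r <- rowsum(m, group = rownames(m))
dim(r)  # one row per distinct rowname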
The answer by James works as expected, but is quite slow for large matrices. Here is a version that avoids creating new objects:
combineByRow <- function(m) {
  m <- m[order(rownames(m)), ]

  ## keep track of previous row name
  prev <- rownames(m)[1]
  i.start <- 1
  i.end <- 1

  ## cache the rownames -- profiling shows that it takes
  ## forever to look at them
  m.rownames <- rownames(m)
  stopifnot(all(!is.na(m.rownames)))

  ## go through matrix in a loop, as we need to combine some unknown
  ## set of rows
  for (i in 2:(1 + nrow(m))) {
    curr <- m.rownames[i]

    ## if we found a new row name (or are at the end of the matrix),
    ## combine all rows and mark invalid rows
    if (prev != curr || is.na(curr)) {
      if (i.start < i.end) {
        m[i.start, ] <- apply(m[i.start:i.end, ], 2, max)
        m.rownames[(1 + i.start):i.end] <- NA
      }
      prev <- curr
      i.start <- i
    } else {
      i.end <- i
    }
  }
  m[which(!is.na(m.rownames)), ]
}
Testing it shows that it is about 10x faster than the answer using by (2 vs. 20 seconds in this example):
N <- 10000
m <- matrix( runif(N*100), nrow=N)
rownames(m) <- sample(1:(N/2),N,replace=T)
start <- proc.time()
m1 <- combineByRow(m)
print(proc.time()-start)
start <- proc.time()
m2 <- t(sapply(by(m,rownames(m),function(x) apply(x, 2, max)),identity))
print(proc.time()-start)
all(m1 == m2)

splitting up ranges

Say I have some ranges represented by start coordinates start <- c(1,2,3) and end coordinates end <- c(4,5,4); ranges <- data.frame(start, end). How can I split this up into one-length intervals?
i.e. I want
this
starts ends
1 1 4
2 2 5
3 3 4
to be transformed into this:
starts ends
1 1 2 |
2 3 4 <-end of original first interval
3 2 3 |
4 4 5 <-end of original second interval
5 3 4 <-end of original third interval
Right now I have a for loop iterating through the list and creating a sequence that goes from start to end, but this loop takes a very long time to execute for long lists of ranges.
Here's one way. It's a "glorified for-loop" in the disguise of lapply on a sequence.
# Your sample data
ranges<-data.frame(start=c(1,2,3),end=c(4,5,4))
# Extract the start/end columns
start <- ranges$start
end <- ranges$end
# Calculate result data
res <- lapply(seq_along(start), function(i) start[i]+seq(0, end[i]-start[i]))
# Make it into a data.frame by way of a matrix (which has a byrow argument)
newRanges <- as.data.frame( matrix(unlist(res), ncol=2, byrow=TRUE, dimnames=list(NULL, names(ranges))) )
Which gives the correct result:
> newRanges
start end
1 1 2
2 3 4
3 2 3
4 4 5
5 3 4
And then time it on a bigger problem:
n <- 1e5
start <- sample(10, n, replace=TRUE)
end <- start + sample( 3, n, replace=TRUE)*2-1
system.time( newRanges <- as.data.frame( matrix(unlist(lapply(seq_along(start), function(i) start[i]+seq(0, end[i]-start[i]))), ncol=2, byrow=TRUE) ) )
This takes about 1.6 seconds on my machine. Good enough?
...The trick is to work on the vectors directly instead of on the data.frame. And then build the data.frame at the end.
Update: @Ellipsis... commented that lapply is no better than a for-loop. Let's see:
system.time( a <- unlist(lapply(seq_along(start), function(i) start[i]+seq(0, end[i]-start[i]))) ) # 1.6 secs
system.time( b <- {
  res <- vector('list', length(start))
  for (i in seq_along(start)) {
    res[[i]] <- start[i] + seq(0, end[i] - start[i])
  }
  unlist(res)
}) # 1.8 secs
So, not only is the for-loop about 12% slower in this case, it is also much more verbose...
UPDATE AGAIN!
@Martin Morgan suggested using Map, and it is indeed the fastest solution yet, faster than the do.call in my other answer. Also, by using seq.int my first solution is also much faster:
# do.call solution: 0.46 secs
system.time( matrix(do.call('c', lapply(seq_along(start), function(i) call(':', start[i], end[i]))), ncol=2, byrow=TRUE) )
# lapply solution: 0.42 secs
system.time( matrix(unlist(lapply(seq_along(start), function(i) start[[i]]+seq.int(0L, end[[i]]-start[[i]]))), ncol=2, byrow=TRUE) )
# Map solution: 0.26 secs
system.time( matrix(unlist(Map(seq.int, start, end)), ncol=2, byrow=TRUE) )
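To get back a data.frame with the original column names, the Map result can be wrapped the same way as the earlier solutions (a small sketch):
newRanges <- as.data.frame(
  matrix(unlist(Map(seq.int, start, end)), ncol = 2, byrow = TRUE,
         dimnames = list(NULL, c("start", "end")))
)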
You could try creating text for the vectors, parse-ing and eval-uating and then using a matrix to create the data.frame:
txt <- paste("c(",paste(ranges$start,ranges$end,sep=":",collapse=","),")",sep="")
> txt
[1] "c(1:4,2:5,3:4)"
vec <- eval(parse(text=txt))
> vec
[1] 1 2 3 4 2 3 4 5 3 4
mat <- matrix(vec,ncol=2,byrow=T)
> data.frame(mat)
X1 X2
1 1 2
2 3 4
3 2 3
4 4 5
5 3 4
Here's another answer based on @James's great solution. It avoids paste and parse and is a little bit faster:
vec <- do.call('c', lapply(seq_along(start), function(i) call(':', start[i], end[i])))
mat <- matrix(vec,ncol=2,byrow=T)
Timing it:
set.seed(42)
n <- 1e5
start <- sample(10, n, replace=TRUE)
end <- start + sample( 3, n, replace=TRUE)*2-1
# @James's code: 6.64 secs
system.time({
  for (i in 1:10) {
    txt <- paste("c(", paste(start, end, sep = ":", collapse = ","), ")", sep = "")
    vec <- eval(parse(text = txt))
    mat <- matrix(vec, ncol = 2, byrow = T)
  }
})
# My variant: 5.17 secs
system.time({
  for (i in 1:10) {
    vec <- do.call('c', lapply(seq_along(start), function(i) call(':', start[i], end[i])))
    mat <- matrix(vec, ncol = 2, byrow = T)
  }
})
