Speedup search of Elements - r

I have two data.frames: m (23 columns, 135,973 rows), with two important columns
head(m[,2])
# [1] "chr1" "chr1" "chr1" "chr1" "chr1" "chr1"
head(m[,7])
# [1] 3661216 3661217 3661223 3661224 3661564 3661567
and search (4 columns, 1,019,423 rows), with three important columns
head(search[,1])
# [1] "chr1" "chr1" "chr1" "chr1" "chr1" "chr1"
head(search[,3])
# [1] 3000009 3003160 3003187 3007262 3028947 3050944
head(search[,4])
# [1] 3000031 3003182 3003209 3007287 3028970 3050995
For each row in m, I would like to know whether the position m[XX,7] lies between search[,3] and search[,4] in any row of search. So search[,3] can be considered the "start" and search[,4] the "end". In addition, search[,1] and m[,2] have to be identical.
Example:
m at row 215
"chr1" 10,984,038
hits in search at line 2898
"chr1" 10,984,024 10,984,046
In general, I'm not interested in which line or how many lines of search match. I just want to know, for each line of m, whether there is a matching line in search: yes or no.
I ended up with this function:
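f_4 <- function(x, y, z) {
  # For each row of x, count the rows of y with the same chromosome
  # whose [start, end] range contains the position in x[out,7]
  for (out in seq_len(nrow(x))) {
    z[out] <- length(which((y[,1] == x[out,2]) & (x[out,7] >= y[,3]) & (x[out,7] <= y[,4])))
  }
  return(z)
}
found4 <- vector(length = nrow(m), mode = "numeric")
found4 <- f_4(m, search, found4)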
It took 3 hours to run this code.
I have already tried some speedup approaches, but I didn't manage to get any of them running properly or faster.
I even tried some lapply/apply approaches, which worked but weren't faster. However, they failed when I tried to speed them up using parLapply/parRapply.
Does anybody have a considerably faster approach and can give some advice?
EDIT 2015/09/18
Found another way to speed up, using foreach %dopar%.
f5<-function(x,y,z){
foreach(out=1:length(x[,1]), .combine="c") %dopar% {
takt<-1000
z=length(which((y[,1]==x[out,2]) &(x[out,7]>=y[,3]) &(x[out,7]<=y[,4]) ))
}
return(z)
}
found5<-vector(length=length(m[,1]), mode="numeric")
found5<-f5(m,search,found5)
This needs only 45 min. However, I always get 0 only. I think I need to read some more of the foreach %dopar% tutorials.

You can try merging with subsequent logical subsetting. First let's create some mock data:
set.seed(123) # used for reproducibility
m <- as.data.frame(matrix(sample(50, 7000, replace=TRUE), ncol=7, nrow=1000))
search <- as.data.frame(matrix(sample(50, 1200, replace=TRUE), ncol=4, nrow=300))
Since we want to compare different rows of the two sets, we can use the criterion that m[,2] should be equal to search[,1]. For convenience we can name these columns "ID" in both sets:
m <- cbind(m, seq_len(nrow(m)))
search <- cbind(search, seq_len(nrow(search)))
colnames(m) <- c("a","ID","c","d","e","f","val","rownum.m")
colnames(search) <- c("ID","nothing","start","end", "rownum.s")
We have added a column to m named 'rownum.m' and a similar column to search which in the end will help identifying the resulting entries in the initial dataset.
Now we can merge the data sets, such that the ID is the same:
m2 <- merge(m,search)
In a final step, we can perform a logical subset of the merged data set and assign the output to a new data frame m3:
m3 <- m2[(m2[,"val"] >= m2[,"start"]) & (m2[,"val"] <= m2[,"end"]),]
#> head(m3)
# ID a c d e f val rownum.m nothing start end rownum.s
#5 1 14 36 36 31 30 25 846 10 20 36 291
#13 1 34 49 24 8 44 21 526 10 20 36 291
#17 1 19 32 29 44 24 35 522 6 33 48 265
#20 1 19 32 29 44 24 35 522 32 31 50 51
#21 1 19 32 29 44 24 35 522 10 20 36 291
#29 1 6 50 10 13 43 22 15 10 20 36 291
If we are only interested in a TRUE/FALSE statement on whether a specific row of m matches the criteria, we can define a vector match_s:
match_s <- m$rownum.m %in% m3$rownum.m
which can be stored as an additional column in the original data set m:
m <- cbind(m,match_s)
Finally, we can remove the auxiliary column 'rownum.m' from the data set m which is no longer needed, with m <- m[,-8].
The result is:
> head(m)
# a ID c d e f val match_s
#1 15 14 8 11 16 13 23 FALSE
#2 40 30 8 48 42 50 20 FALSE
#3 21 9 8 19 30 36 19 TRUE
#4 45 43 26 32 41 33 27 FALSE
#5 48 43 25 10 15 13 4 FALSE
#6 3 24 31 33 8 5 36 FALSE
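With the question's real sizes, merging on chromosome alone can produce a very large intermediate table, because every position is paired with every interval on the same chromosome. As an alternative, here is a minimal base-R sketch (not part of the approach above) that sorts the intervals once per chromosome and uses findInterval() with a running maximum of the interval ends. The helper name in_any_interval and the toy data frames are hypothetical; intervals are assumed to be inclusive at both ends, as in the question.

```r
# Hypothetical helper: TRUE where a position lies inside at least one
# [start, end] interval on the same chromosome.
in_any_interval <- function(chr, pos, iv_chr, iv_start, iv_end) {
  hit <- logical(length(pos))
  for (ch in intersect(unique(chr), unique(iv_chr))) {
    p <- which(chr == ch)       # positions on this chromosome
    s <- which(iv_chr == ch)    # intervals on this chromosome
    ord <- order(iv_start[s])
    starts <- iv_start[s][ord]
    # Largest end among all intervals starting at or before each sorted start
    max_end <- cummax(iv_end[s][ord])
    # Number of intervals whose start is <= the position (0 if none)
    k <- findInterval(pos[p], starts)
    ok <- k > 0
    hit[p[ok]] <- max_end[k[ok]] >= pos[p][ok]
  }
  hit
}

# Toy data (assumed for illustration only)
m_toy <- data.frame(chr = c("chr1", "chr1", "chr2"),
                    pos = c(10984038, 3661216, 3000010),
                    stringsAsFactors = FALSE)
search_toy <- data.frame(chr = c("chr1", "chr1"),
                         start = c(3000009, 10984024),
                         end = c(3000031, 10984046),
                         stringsAsFactors = FALSE)
found <- in_any_interval(m_toy$chr, m_toy$pos,
                         search_toy$chr, search_toy$start, search_toy$end)
```

On the question's data the call would presumably be in_any_interval(m[,2], m[,7], search[,1], search[,3], search[,4]). The work per chromosome is one sort plus one binary search per position, so no pairwise table is built.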

If you're trying to find SNPs (say) inside a set of genomic regions, don't use R. Use BEDOPS.
Convert your SNP or single-base positions to a three-column BED file. In R, make a three-column data table with m[,2], m[,7] and m[,7] + 1, which represent the chromosome, start and stop position of the SNP. Use write.table() to write out this data table to a tab-delimited text file.
Do the same with your genomic regions: Write search[,1], search[,3], and search[,4] to a three-column data table representing the chromosome, start and stop position of the region. Use write.table() to write this out to a tab-delimited text file.
Use sort-bed to sort both BED files. This step might be optional, but it doesn't take long to do and it guarantees that the files are prepped for use with BEDOPS tools.
Finally, use bedmap on the two BED files to map SNPs to genomic regions. Mapping associates SNPs with regions. The bedmap tool can report which SNPs map to a region, or report the number of SNPs, or one or more of many other operations. The documentation for bedmap goes into more detail on the list of operations, but the provided example should get you started quickly.
If your data are in BED format, or can be quickly coerced into BED format, don't use R for genomic operations, as it is slow and memory-intensive. The BEDOPS toolkit introduced the use of sorting to make genomic operations fast, with low memory overhead.
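The R side of the export described above could look roughly like the sketch below. The toy data frames, the helper write_bed, and the file names are assumptions for illustration; positions are written as integers so that write.table() does not switch to scientific notation. Whether your coordinates need shifting to match BED's conventions depends on how your positions are defined, which the question does not state. The shell commands in the trailing comments reflect my reading of the BEDOPS documentation and should be checked against your installed version.

```r
# Toy stand-ins for the question's data (the real m and search have more columns)
snps <- data.frame(chr = c("chr1", "chr1"),
                   pos = c(3661216L, 10984038L),
                   stringsAsFactors = FALSE)
regions <- data.frame(chr = c("chr1", "chr1"),
                      start = c(3000009L, 10984024L),
                      end = c(3000031L, 10984046L),
                      stringsAsFactors = FALSE)

# Three-column table: chromosome, start, stop (stop = position + 1, as described above)
snp_bed <- data.frame(chr = snps$chr, start = snps$pos, end = snps$pos + 1L)

# Hypothetical helper: tab-delimited, no header, no quotes, no row names
write_bed <- function(df, path) {
  write.table(df, file = path, sep = "\t", quote = FALSE,
              row.names = FALSE, col.names = FALSE)
}

snp_file <- file.path(tempdir(), "snps.bed")
region_file <- file.path(tempdir(), "regions.bed")
write_bed(snp_bed, snp_file)
write_bed(regions, region_file)

# Outside R, something along these lines (verify flags in the bedmap docs):
#   sort-bed snps.bed > snps.sorted.bed
#   sort-bed regions.bed > regions.sorted.bed
#   bedmap --echo --indicator snps.sorted.bed regions.sorted.bed
```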

Related

How to extract rows with similar names into a submatrix?

I am building an asymmetrical matrix of values with the rows being coefficient names and the column the value of each coefficient:
# Set up Row and Column Names.
rows = c("Intercept", "actsBreaks0", "actsBreaks1","actsBreaks2","actsBreaks3","actsBreaks4","actsBreaks5","actsBreaks6",
"actsBreaks7","actsBreaks8","actsBreaks9","tBreaks0","tBreaks1","tBreaks2","tBreaks3", "unitBreaks0", "unitBreaks1",
"unitBreaks2","unitBreaks3", "covgBreaks0","covgBreaks1","covgBreaks2","covgBreaks3","covgBreaks4","covgBreaks5",
"covgBreaks6","yearBreaks2016","yearBreaks2015","yearBreaks2014","yearBreaks2013","yearBreaks2011",
"yearBreaks2010","yearBreaks2009","yearBreaks2008","yearBreaks2007","yearBreaks2006","yearBreaks2005",
"yearBreaks2004","yearBreaks2003","yearBreaks2002","yearBreaks2001","yearBreaks2000","yearBreaks1999",
"yearBreaks1998","plugBump0","plugBump1","plugBump2","plugBump3")
cols = c("Value")
# Build Matrix
matrix1 <- matrix(c(1:48), nrow = 48, ncol = 1, byrow = TRUE, dimnames = list(rows,cols))
output:
> matrix1
Value
Intercept 1
actsBreaks0 2
actsBreaks1 3
actsBreaks2 4
actsBreaks3 5
actsBreaks4 6
actsBreaks5 7
actsBreaks6 8
actsBreaks7 9
actsBreaks8 10
actsBreaks9 11
tBreaks0 12
tBreaks1 13
tBreaks2 14
tBreaks3 15
unitBreaks0 16
unitBreaks1 17
unitBreaks2 18
unitBreaks3 19
covgBreaks0 20
covgBreaks1 21
covgBreaks2 22
covgBreaks3 23
covgBreaks4 24
covgBreaks5 25
covgBreaks6 26
yearBreaks2016 27
yearBreaks2015 28
yearBreaks2014 29
yearBreaks2013 30
yearBreaks2011 31
yearBreaks2010 32
yearBreaks2009 33
yearBreaks2008 34
yearBreaks2007 35
yearBreaks2006 36
yearBreaks2005 37
yearBreaks2004 38
yearBreaks2003 39
yearBreaks2002 40
yearBreaks2001 41
yearBreaks2000 42
yearBreaks1999 43
yearBreaks1998 44
plugBump0 45
plugBump1 46
plugBump2 47
plugBump3 48
and I wish to extract certain rows that share row names (i.e. all rows with "unitBreaks'x'") into a submatrix.
I tried this
est_actsBreaks <- est_coef_mtrx[c("actsBreaks0","actsBreaks1","actsBreaks2","actsBreaks3",
"actsBreaks4","actsBreaks5","actsBreaks6","actsBreaks7",
"actsBreaks8","actsBreaks9"),c("Value")]
but it returns a vector and I need a matrix. I have seen other questions concerning similar procedures but their columns and rows all had identical names and/or values. Is there a way to do the operation I have in mind, such as grep()?
Welcome to StackOverflow.
As usual in R, there would probably be many ways to do what you request.
EDIT: I realized that my solution was going a little bit too far, sorry about that.
To extract only the rows whose names consist of the pattern "unitBreaks" followed by one or more digits, and still keep a matrix structure, you can run the following code. In a nutshell, grep is going to look for the pattern that you need and the argument drop = FALSE is going to make sure that you get a matrix as a result and not a vector.
uniBreakLines <- grep("^unitBreaks[0-9]+$", rows)
matrix1[uniBreakLines, , drop = FALSE]
Below is the first version of my answer.
First, I create a vector that describes the groups of rows. For this, I remove the numbers at the end of the row names.
grp <- gsub("[0-9]+$", "", rows)
Then, I transform the data matrix into a data-frame (why I do that is explained a little bit later).
dat1 <- data.frame(matrix1)
Finally, I use "split" on the data-frame, with the groups defined earlier. Using split on the data-frame will keep the structure: the result will be a list of data-frames, even though there is only one column.
dat1.split <- split(dat1, grp)
The result is a list of data-frames.
lapply(dat1.split, head)
$actsBreaks
Value
actsBreaks0 2
actsBreaks1 3
actsBreaks2 4
actsBreaks3 5
actsBreaks4 6
actsBreaks5 7
$covgBreaks
Value
covgBreaks0 20
covgBreaks1 21
covgBreaks2 22
covgBreaks3 23
covgBreaks4 24
covgBreaks5 25
$Intercept
Value
Intercept 1
$plugBump
Value
plugBump0 45
plugBump1 46
plugBump2 47
plugBump3 48
$tBreaks
Value
tBreaks0 12
tBreaks1 13
tBreaks2 14
tBreaks3 15
$unitBreaks
Value
unitBreaks0 16
unitBreaks1 17
unitBreaks2 18
unitBreaks3 19
$yearBreaks
Value
yearBreaks2016 27
yearBreaks2015 28
yearBreaks2014 29
yearBreaks2013 30
yearBreaks2011 31
yearBreaks2010 32
After that, if you still need matrices, you can convert them with the function as.matrix in an "lapply":
matrix1.split <- lapply(dat1.split, as.matrix)
You might want to consider combining your data in a "tibble" with the "grouping" column. You will then be able to use these groups with the group_by function or other functions from the dplyr package (or other packages from the tidyverse).
For example:
library(dplyr)
tib1 <- tibble(rows, simpler_rows = grp, value = 1:48)
And an example on how to use the grouping variable:
tib1 %>%
group_by(simpler_rows) %>%
summarize(sum(value))
EDIT bis: what if I don't know the pattern?
I played around a little bit with your example to answer the question (that nobody asked, but still, it's fun!): "what if I don't know the pattern?"
In this case, I would use a distance between the row names. This distance would look like this:
... and would be the output of the following lines of code
library(stringdist)
library(pheatmap)
strdist <- stringdistmatrix(rows)
pheatmap(strdist, border_color = "white", cluster_rows = FALSE, cluster_cols = FALSE, cellwidth = 10, cellheight = 10, labels_row = rows, fontsize_row = 7)
After that, I only need to get the number of cluster, which can be done with a silhouette plot (similar to this one), that tells me that there are 8 clusters of words, which seems about right:
The cluster can be extracted then with the function used to create the silhouette plot (I used hclust and cutree).
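The clustering step can be sketched as follows. This is a toy illustration, not the code used for the plots above: it uses base R's adist() (edit distance) in place of the stringdist package, only six of the row names, and a cluster count k picked by hand rather than from a silhouette plot.

```r
# A small subset of the row names, for illustration
nm <- c("actsBreaks0", "actsBreaks1", "tBreaks0", "tBreaks1",
        "plugBump0", "plugBump1")

d <- as.dist(utils::adist(nm))  # pairwise edit distances between names
hc <- hclust(d)                 # hierarchical clustering (complete linkage by default)
cl <- cutree(hc, k = 3)         # k = 3 is an assumption for this toy set
names(cl) <- nm

groups <- split(nm, cl)         # names grouped by cluster
```

Names that differ only in their trailing digit are one edit apart, so they are merged first and end up in the same cluster.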
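Here is a solution with dplyr and stringr that filters rows by whether their row names contain a certain string (df is assumed to be the matrix converted to a data frame). With the leading !, the matching rows are dropped, as in the output below; remove the ! to keep only the matching rows.
At the end, convert back to a matrix: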
library(dplyr)
library(stringr)
df1 <- df %>%
filter(!str_detect(rownames(df), "unitBreaks"))
df1 <- as.matrix(df1)
Value
Intercept 1
actsBreaks0 2
actsBreaks1 3
actsBreaks2 4
actsBreaks3 5
actsBreaks4 6
actsBreaks5 7
actsBreaks6 8
actsBreaks7 9
actsBreaks8 10
actsBreaks9 11
tBreaks0 12
tBreaks1 13
tBreaks2 14
tBreaks3 15
covgBreaks0 20
covgBreaks1 21
covgBreaks2 22
covgBreaks3 23
covgBreaks4 24
covgBreaks5 25
covgBreaks6 26
yearBreaks2016 27
yearBreaks2015 28
yearBreaks2014 29
yearBreaks2013 30
yearBreaks2011 31
yearBreaks2010 32
yearBreaks2009 33
yearBreaks2008 34
yearBreaks2007 35
yearBreaks2006 36
yearBreaks2005 37
yearBreaks2004 38
yearBreaks2003 39
yearBreaks2002 40
yearBreaks2001 41
yearBreaks2000 42
yearBreaks1999 43
yearBreaks1998 44
plugBump0 45
plugBump1 46
plugBump2 47
plugBump3 48

Replace every word by an index in 15 million strings

I have a list of 15 million strings and I have a dictionary of 8 million words. I want to replace every string in database by the index of the string in the dictionary.
I tried using the hash package for faster indexing, but it is still taking hours for replacing in all 15 million strings.
What is the efficient way of implementing this?
Example[EDITED]:
# Database
[[1]]
[1]"a admit been c case"
[[2]]
[1]"co confirm d ebola ha hospit howard http lik"
# dictionary
"t" 1
"ker" 2
"be" 3
.
.
.
.
# Output:
[[1]]123 3453 3453 567
[[2]]6786 3423 234123 1234 23423 6767 3423 124431 787889 111
Where the index of admit in the dictionary is 3453.
Any kind of help is appreciated.
Updated Example with Code:
This is what I am currently doing.
Example: data =
[1] "a co crimea divid doe east hasten http polit secess split t threaten ukrain via w west xtcnwl youtub"
[2] "billion by cia fund group nazy spent the tweethead ukrain"
[3] "all back energy grandpar home miss my posit radiat the"
[4] "ao bv chega co de ebola http kkmnxv pacy rio suspeito t"
[5] "android androidgam co coin collect gameinsight gold http i jzdydkylwd t ve"
words.list = strsplit(data, "\\W+", perl=TRUE)
words.vector = unlist(words.list)
sorted.words = sort(table(words.vector),decreasing=TRUE)
h = hash(names(sorted.words),1:length(names(sorted.words)))
index = lapply(data, function(row) {
  temp = trim.leading(row)
  word_list = unlist(strsplit(temp, "\\W+", perl=TRUE))
  index_list = lapply(word_list, function(x) {
    return(h[[x]])
  })
  #print(index_list)
  return(unlist(index_list))
})
Output:
index_list
[[1]]
[1] 6 1 19 21 22 23 31 2 40 44 46 3 48 5 51 52 53 54 55
[[2]]
[1] 12 14 16 26 30 38 45 4 49 5
[[3]]
[1] 7 11 25 29 32 36 37 41 42 4
[[4]]
[1] 10 13 15 1 20 24 2 35 39 43 47 3
[[5]]
[1] 8 9 1 17 18 27 28 2 33 34 3 50
The output is index. This runs fast if the length of data is small but execution is really slow if the length is 15 million.
My task is the nearest neighbor search. I want to search for 1000 queries which are of same format as the database.
I have tried many things like parallel computations as well, but had issues with memory.
[EDIT] How can I implement this using RCpp?
I think you'd like to avoid the lapply() by splitting the data, unlisting, then processing the vector of words
data.list = strsplit(data, "\\W+", perl=TRUE)
words = unlist(data.list)
## ... additional processing, e.g., strip white space, on the vector 'words'
perform the match, then re-list to original
relist(match(words, word.vector), data.list)
For downstream applications it might actually pay to retain the vector + 'partitioning' information, partition = sapply(data.list, length) rather than re-listing, since it'll continue to be efficient to operate on the unlisted vector. The Bioconductor S4Vectors package provides a CharacterList class that takes this approach, where one mostly works on something that is list-like, but where the data are stored and most operations are on an underlying character vector.
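A small worked example of the split, unlist, match, relist sequence may help. The two strings and the six-word dictionary are made up for illustration; dictionary plays the role of the lookup vector (word.vector above), and a word's index is simply its position in that vector.

```r
# Toy inputs (assumed)
data <- c("a admit been", "be admit t")
dictionary <- c("t", "ker", "be", "a", "admit", "been")

data.list <- strsplit(data, "\\W+", perl = TRUE)  # one character vector per string
words <- unlist(data.list)                        # all words in one flat vector

idx <- match(words, dictionary)   # one vectorised lookup; NA for unknown words
index <- relist(idx, data.list)   # restore the per-string grouping

partition <- lengths(data.list)   # alternative: keep idx flat plus the group sizes
```

The single match() call replaces millions of individual hash lookups, which is where the time goes in the nested lapply() version.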
Sounds like you're doing NLP.
A fast non-R solution (which you could wrap in R) is word2vec
The word2vec tool takes a text corpus as input and produces the word vectors as output. It first constructs a vocabulary from the training text data and then learns vector representation of words. The resulting word vector file can be used as features in many natural language processing and machine learning applications.

Avoid using a loop to get sum of rows in R, where I want to start and stop the sum on different columns for each row

I am relatively new to R from Stata. I have a data frame that has 100+ columns and thousands of rows. Each row has a start value, stop value, and 100+ columns of numerical values. The goal is to get the sum of each row from the column that corresponds to the start value to the column that corresponds to the stop value. This is direct enough to do in a loop, that looks like this (data.frame is df, start is the start column, stop is the stop column):
for(i in 1:nrow(df)) {
df$out[i] <- rowSums(df[i,df$start[i]:df$stop[i]])
}
This works great, but it is taking 15 minutes or so. Does anyone have any suggestions on a faster way to do this?
You can do this using some algebra (if you have a sufficient amount of memory):
DF <- data.frame(start=3:7, end=4:8)
DF <- cbind(DF, matrix(1:50, nrow=5, ncol=10))
# start end 1 2 3 4 5 6 7 8 9 10
#1 3 4 1 6 11 16 21 26 31 36 41 46
#2 4 5 2 7 12 17 22 27 32 37 42 47
#3 5 6 3 8 13 18 23 28 33 38 43 48
#4 6 7 4 9 14 19 24 29 34 39 44 49
#5 7 8 5 10 15 20 25 30 35 40 45 50
take <- outer(seq_len(ncol(DF)-2)+2, DF$start-1, ">") &
outer(seq_len(ncol(DF)-2)+2, DF$end+1, "<")
diag(as.matrix(DF[,-(1:2)]) %*% take)
#[1] 7 19 31 43 55
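The matrix product above builds an n-by-n result (n being the number of rows) only to read its diagonal, which is where the memory requirement comes from. A sketch that keeps the same convention (start and end are column positions in DF) but multiplies element-wise by a logical mask and then uses rowSums():

```r
DF <- data.frame(start = 3:7, end = 4:8)
DF <- cbind(DF, matrix(1:50, nrow = 5, ncol = 10))

vals <- as.matrix(DF[, -(1:2)])   # the value columns only
col_idx <- col(vals) + 2L         # position of each value's column within DF

# start/end are recycled down the rows, so entry [i, j] is compared
# with DF$start[i] and DF$end[i]
mask <- col_idx >= DF$start & col_idx <= DF$end

out <- rowSums(vals * mask)       # values outside the range are multiplied by 0
```

This needs only matrices of the same shape as the data. Note that multiplying by 0 does not neutralise NA values, so missing data outside the range would still propagate.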
If you are dealing with values of all the same types, you typically want to do things in matrices. Here is a solution in matrix form:
rows <- 10^3
cols <- 10^2
start <- sample(1:cols, rows, replace=TRUE)
end <- pmin(cols, start + sample(1:(cols/2), rows, replace=TRUE))
# first 2 cols of matrix are start and end, the rest are
# random data
mx <- matrix(c(start, end, runif(rows * cols)), nrow=rows)
# use `apply` to apply a function to each row, here the
# function sums each row excluding the first two values
# from the value in the start column to the value in the
# end column
apply(mx, 1, function(x) sum(x[-(1:2)][x[[1]]:x[[2]]]))
# df version
df <- as.data.frame(mx)
df$out <- apply(df, 1, function(x) sum(x[-(1:2)][x[[1]]:x[[2]]]))
You can convert your data.frame to a matrix with as.matrix. You can also run the apply directly on your data.frame as shown, which should still be reasonably fast. The real problem with your code is that you are modifying a data frame nrow times, and modifying data frames is very slow. By using apply you get around that by generating your answer (the $out column) in one go, which you can then cbind back to your data frame (and that means you modify your data frame just once).

How to reorder a column in a data frame to be the last column

I have a data frame where columns are constantly being added to it. I also have a total column that I would like to stay at the end. I think I must have skipped over some really basic command somewhere but cannot seem to find the answer anywhere. Anyway, here is some sample data:
x=1:10
y=21:30
z=data.frame(x,y)
z$total=z$x+z$y
z$w=11:20
z$total=z$x+z$y+z$w
When I type z I get this:
x y total w
1 1 21 33 11
2 2 22 36 12
3 3 23 39 13
4 4 24 42 14
5 5 25 45 15
6 6 26 48 16
7 7 27 51 17
8 8 28 54 18
9 9 29 57 19
10 10 30 60 20
Note how the total column comes before the w, and obviously any subsequent columns. Is there a way I can force it to be the last column? I am guessing that I would have to use ncol(z) somehow. Or maybe not.
You can reorder your columns as follows:
z <- z[,c('x','y','w','total')]
To do this programmatically, after you're done adding your columns, you can retrieve their names like so:
nms <- colnames(z)
Then you can grab the ones that aren't 'total' like so:
nms[nms!='total']
Combined with the above:
z <- z[, c(nms[nms!='total'],'total')]
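If this has to be done repeatedly, the same idea can be wrapped in a small helper. The name move_last is made up for this sketch; setdiff() is used in place of the logical subsetting above and keeps the remaining columns in their existing order.

```r
# Hypothetical helper: move one named column to the end of a data frame
move_last <- function(df, col = "total") {
  df[, c(setdiff(names(df), col), col), drop = FALSE]
}

# The question's example
z <- data.frame(x = 1:10, y = 21:30)
z$total <- z$x + z$y
z$w <- 11:20
z$total <- z$x + z$y + z$w

z <- move_last(z)
names(z)
```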
You have a logic issue here. Whenever you add to a data.frame, it grows to the right.
Easiest fix: keep total a vector until you are done, and only then append it. It will then be the rightmost column.
(For critical applications, you would of course determine your width k beforehand, allocate k+1 columns and just index the last one for totals.)
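A minimal sketch of the "append total last" idea with the question's columns. It assumes every column of z is numeric and belongs in the total, which is why rowSums() can be applied to the whole data frame.

```r
x <- 1:10
y <- 21:30
z <- data.frame(x, y)
z$w <- 11:20
# ... add any further columns here ...

# Compute the total once, after all other columns exist,
# so it is appended as the rightmost column
z$total <- rowSums(z)
```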

Combining vectors of unequal length into a data frame

I have a list of vectors which are time series of unequal length. My ultimate goal is to plot the time series in a ggplot2 graph. I guess I am better off first merging the vectors in a dataframe (where the shorter vectors will be expanded with NAs), also because I want to export the data in a tabular format such as .csv to be perused by other people.
I have a list that contains the names of all the vectors. It is fine that the column titles be set by the first vector, which is the longest. E.g.:
> mylist
[[1]]
[1] "vector1"
[[2]]
[1] "vector2"
[[3]]
[1] "vector3"
etc.
I know the way to go is to use Hadley's plyr package but I guess the problem is that my list contains the names of the vectors, not the vectors themselves, so if I type:
do.call(rbind, mylist)
I get a one-column df containing the names of the dfs I wanted to merge.
> do.call(rbind, actives)
[,1]
[1,] "vector1"
[2,] "vector2"
[3,] "vector3"
[4,] "vector4"
[5,] "vector5"
[6,] "vector6"
[7,] "vector7"
[8,] "vector8"
[9,] "vector9"
[10,] "vector10"
etc.
Even if I create a list with the objects themselves, I get an empty dataframe:
mylist <- list(vector1, vector2)
mylist
[[1]]
1 2 3 4 5 6 7 8 9 10 11 12
0.1875000 0.2954545 0.3295455 0.2840909 0.3011364 0.3863636 0.3863636 0.3295455 0.2954545 0.3295455 0.3238636 0.2443182
13 14 15 16 17 18 19 20 21 22 23 24
0.2386364 0.2386364 0.3238636 0.2784091 0.3181818 0.3238636 0.3693182 0.3579545 0.2954545 0.3125000 0.3068182 0.3125000
25 26 27 28 29 30 31 32 33 34 35 36
0.2727273 0.2897727 0.2897727 0.2727273 0.2840909 0.3352273 0.3181818 0.3181818 0.3409091 0.3465909 0.3238636 0.3125000
37 38 39 40 41 42 43 44 45 46 47 48
0.3125000 0.3068182 0.2897727 0.2727273 0.2840909 0.3011364 0.3181818 0.2329545 0.3068182 0.2386364 0.2556818 0.2215909
49 50 51 52 53 54 55 56 57 58 59 60
0.2784091 0.2784091 0.2613636 0.2329545 0.2443182 0.2727273 0.2784091 0.2727273 0.2556818 0.2500000 0.2159091 0.2329545
61
0.2556818
[[2]]
1 2 3 4 5 6 7 8 9 10 11 12
0.2824427 0.3664122 0.3053435 0.3091603 0.3435115 0.3244275 0.3320611 0.3129771 0.3091603 0.3129771 0.2519084 0.2557252
13 14 15 16 17 18 19 20 21 22 23 24
0.2595420 0.2671756 0.2748092 0.2633588 0.2862595 0.3549618 0.2786260 0.2633588 0.2938931 0.2900763 0.2480916 0.2748092
25 26 27 28 29 30 31 32 33 34 35 36
0.2786260 0.2862595 0.2862595 0.2709924 0.2748092 0.3396947 0.2977099 0.2977099 0.2824427 0.3053435 0.3129771 0.2977099
37 38 39 40 41 42 43 44 45 46 47 48
0.3320611 0.3053435 0.2709924 0.2671756 0.2786260 0.3015267 0.2824427 0.2786260 0.2595420 0.2595420 0.2442748 0.2099237
49 50 51 52 53 54 55 56 57 58 59 60
0.2022901 0.2251908 0.2099237 0.2213740 0.2213740 0.2480916 0.2366412 0.2251908 0.2442748 0.2022901 0.1793893 0.2022901
but
do.call(rbind.fill, mylist)
data frame with 0 columns and 0 rows
I have tried converting the vectors to dataframes, but there is no cbind.fill function, so plyr complains that the dataframes are of different length.
So my questions are:
Is this the best approach? Keep in mind that the goals are a) a ggplot2 graph and b) a table with the time series, to be viewed outside of R
What is the best way to get a list of objects starting with a list of the names of those objects?
What is the best type of graph to highlight the patterns of 60 time series? The scale is the same, but I predict there'll be a lot of overplotting. Since this is a cohort analysis, it might be useful to use color to highlight the different cohorts in terms of recency (as a continuous variable). But how to avoid overplotting? The differences will be minimal, so faceting might leave the viewer unable to grasp the difference.
I think that you may be approaching this the wrong way:
If you have time series of unequal length then the absolute best thing to do is to keep them as time series and merge them. Most time series packages allow this. So you will end up with a multi-variate time series and each value will be properly associated with the same date.
So put your time series into zoo objects, merge them, then use my qplot.zoo function to plot them. That will deal with switching from zoo into a long data frame.
Here's an example:
> z1 <- zoo(1:8, 1:8)
> z2 <- zoo(2:8, 2:8)
> z3 <- zoo(4:8, 4:8)
> nm <- list("z1", "z2", "z3")
> z <- zoo()
> for(i in 1:length(nm)) z <- merge(z, get(nm[[i]]))
> names(z) <- unlist(nm)
> z
z1 z2 z3
1 1 NA NA
2 2 2 NA
3 3 3 NA
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
>
> z.df <- data.frame(dates=index(z), coredata(z))
> z.df <- melt(z.df, id="dates", variable="val")
> ggplot(na.omit(z.df), aes(x=dates, y=value, group=val, colour=val)) + geom_line() + theme(legend.position = "none")
If you're doing it just because ggplot2 (as well as many other things) like data frames then what you're missing is that you need the data in long format data frames. Yes, you just put all of your response variables in one column concatenated together. Then you would have 1 or more other columns that identify what makes those responses different. That's the best way to have it set up for things like ggplot.
You can't. A data.frame() has to be rectangular, and recycling rules expand shorter vectors only when their lengths divide the longest one evenly.
So you may have a different error here (maybe the data that you want to rbind is not suitable?), but it is hard to tell as you did not supply a reproducible example.
Edit: Given your update, you get precisely what you asked for: a list of names gets combined by rbind. If you want the underlying data to appear, you need to involve get() or another data accessor.
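To make the last two points concrete, here is a rough sketch that starts from a list of object names, fetches the vectors with mget(), pads them with NA to a common length, and builds both a wide table (for export) and a long table (for ggplot2). The three short vectors and the names wide and long are made up for illustration.

```r
# Toy vectors standing in for the real time series (assumed values)
vector1 <- c(0.19, 0.30, 0.33, 0.28)
vector2 <- c(0.28, 0.37, 0.31)
vector3 <- c(0.25, 0.26)
mylist <- list("vector1", "vector2", "vector3")   # names, not the objects

vecs <- mget(unlist(mylist))     # named list of the actual vectors
n <- max(lengths(vecs))

# Extending a vector's length pads it with NA
padded <- lapply(vecs, function(v) { length(v) <- n; v })

wide <- as.data.frame(padded)    # one column per series, suitable for write.csv()

# Long format: one row per (series, time) pair
long <- data.frame(time = rep(seq_len(n), times = length(padded)),
                   series = rep(names(padded), each = n),
                   value = unlist(padded, use.names = FALSE))
```

The long table can be passed to ggplot2 after na.omit(), with series mapped to colour or group.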
