Intersecting multiple datasets in R

Intersecting multiple datasets in R - r

How Do I intersect between multiple samples?
I have 29 lists of concatenates I build according to gene name, cc change, coordinate. Each list is 400-800 long. I need to build a table showing how many variants shared among two lists for all 812 combinations. Is there a way I can do this in R?
For example: If I have 4 lists.
A<-c("TSC22112517","SLC141T43309911","RAD51D33446609","WRN31024638")
B<-c("TSC22112517","SLC14A143309911","RHBDF274474996","WRN31024638")
C<-c("TSC22112517","SLC14A143309911","RAD51D33446609","MEN164575556")
D<-c("FANCM45665468","SLC14A143309911","RAD51D33446609","MEN164575556")
I just need to find how many variants are shard among each other.
AB<-length(intersect(A,B))
give me the # of variants shared by A and B which is 3.
Then I can get a table like below showing # of shared variants:
A B C D
A 4 3 2 2
B 3 4 3 2
C 2 3 4 2
D 2 2 2 4
How to do it for large # of lists?
I have 29 lists and each has 600 variants.

You could try something like this: I do a lot of things in lists...
#x is your data in list() format
shared<-list()
for (i in 1:29){
shared[[i]]<-list()
for (j in 1:29){
if (i != j){
shared[[i]][[j]]<-x[[i]][x[[i]][,2]==x[[j]][,2]]
}
}
}

So happy to figure it out
x<- list()
shared<-matrix(1:841,ncol=29)
temp<-NULL
for (i in 1:29){
for (j in 1:29){
temp[j] <- length(intersect(x[[i]][[1]],x[[j]][[1]]))
}
shared[,i] <- matrix(temp)
}
shared

Related

retrieving a list element in O(1) in R

suppose I have the following:
a <- vector('list',50)
for(i in 1:50)
{
a[[i]] <- list(path=paste0("file",sample(0:600,1)),contents=sample(1:5,10*i,replace=TRUE))
}
Now, for example; I want to retrieve the contents of file45(assuming it exists in this randomly generated data) as fast as possible.
I have tried the following:
contents <- unlist(Filter(function(x) x$path=="file45",a),recursive=FALSE)$contents
However, the list searching overhead makes reading from memory even slower than reading directly from disk (to some extent).
Is there any other way of retrieving the contents in something reasonably faster than reading from disk ideally O(1) ?
edit: assume that there are no duplicate filepaths in my sublists and that there are largely more than 50 sublists

Use the names attribute to track the items instead:
a <- vector('list',50)
for(i in 1:50)
{
a[[i]] <- list(contents=sample(1:5,10*i,replace=TRUE))
}
names(a) <- paste0("file",sample(1:600,50))
a[["file45"]]
NULL
a[["file25"]]
$contents
[1] 3 1 3 1 2 5 1 5 1 2 3 1 4 1 1 4 1 5 1 5 1 4 5 2 5 2 2 5 1 1

Try the following:
a[sapply(a, function(x) x$path == "file45")][[1]]$contents

Parallel processing for multiple nested for loops

I am trying to run simulation scenarios which in turn should provide me with the best scenario for a given date, back tested a couple of months. The input for a specific scenario has 4 input variables with each of the variables being able to be in 5 states (625 permutations). The flow of the model is as follows:
Simulate 625 scenarios to get each of their profit
Rank each of the scenarios according to their profit
Repeat the process through a 1-day expanding window for the last 2 months starting on the 1st Dec 2015 - creating a time series of ranks for each of the 625 scenarios
The unfortunate result for this is 5 nested for loops which can take extremely long to run. I had a look at the foreach package, but I am concerned around how the combining of the outputs will work in my scenario.
The current code that I am using works as follows, first I create the possible states of each of the inputs along with the window
a<-seq(as.Date("2015-12-01", "%Y-%m-%d"),as.Date(Sys.Date()-1, "%Y-%m-%d"),by="day")
#input variables
b<-seq(1,5,1)
c<-seq(1,5,1)
d<-seq(1,5,1)
e<-seq(1,5,1)
set.seed(3142)
tot_results<-NULL
Next the nested for loops proceed to run through the simulations for me.
for(i in 1:length(a))
{
cat(paste0("\n","Current estimation date: ", a[i]),";itteration:",i," \n")
#subset data for backtesting
dataset_calc<-dataset[which(dataset$Date<=a[i]),]
p=1
results<-data.frame(rep(NA,625))
for(j in 1:length(b))
{
for(k in 1:length(c))
{
for(l in 1:length(d))
{
for(m in 1:length(e))
{
if(i==1)
{
#create a unique ID to merge onto later
unique_ID<-paste0(replicate(1, paste(sample(LETTERS, 5, replace=TRUE), collapse="")),round(runif(n=1,min=1,max=1000000)))
}
#Run profit calculation
post_sim_results<-profit_calc(dataset_calc, param1=e[m],param2=d[l],param3=c[k],param4=b[j])
#Exctract the final profit amount
profit<-round(post_sim_results[nrow(post_sim_results),],2)
results[p,]<-data.frame(unique_ID,profit)
p=p+1
}
}
}
}
#extract the ranks for all scenarios
rank<-rank(results$profit)
#bind the ranks for the expanding window
if(i==1)
{
tot_results<-data.frame(ID=results[,1],rank)
}else{
tot_results<-cbind(tot_results,rank)
}
suppressMessages(gc())
}
My biggest concern is the binding of the results given that the outer loop's actions are dependent on the output of the inner loops.
Any advice on how proceed would greatly be appreciated.

So I think that you can vectorize most of this, which should give a big reduction in run time.
Currently, you use for-loops (5, to be exact) to create every combination of values, and then run the values one by one through profit_calc (a function that is not specified). Ideally, you'd just take all possible combinations in one go and push them through profit_calc in one single operation.
-- Rationale --
a <- 1:10
b <- 1:10
d <- rep(NA,10)
for (i in seq(a)) d[i] <- a[i] * b[i]
d
# [1] 1 4 9 16 25 36 49 64 81 100
Since * also works on vectors, we can rewrite this to:
a <- 1:10
b <- 1:10
d <- a*b
d
# [1] 1 4 9 16 25 36 49 64 81 100
While it may save us only one line of code, it actually reduces the problem from 10 steps to 1 step.
-- Application --
So how does that apply to your code? Well, given that we can vectorize profit_calc, you can basically generate a data frame where each row is every possible combination of your parameters. We can do this with expand.grid:
foo <- expand.grid(b,c,d,e)
head(foo)
# Var1 Var2 Var3 Var4
# 1 1 1 1 1
# 2 2 1 1 1
# 3 3 1 1 1
# 4 4 1 1 1
# 5 5 1 1 1
# 6 1 2 1 1
Lets say we have a formula... (a - b) / (c + d)... Then it would work like:
bar <- (foo[,1] - foo[,2]) * (foo[,3] + foo[,4])
head(bar)
# [1] 0 2 4 6 8 -2
So basically, try to find a way to replace for-loops with vectorized options. If you cannot vectorize something, try looking into apply instead, as that can also save you some time in most cases. If your code is running too slow, you'd ideally first see if you can write a more efficient script. Also, you may be interested in the microbenchmark library, or ?system.time.

Creating combination of sequences

I am trying to solve following problem:
Consider 5 simple sequences: 0:100, 100:0, rep(0,101), rep(50,101), rep(100,101)
I need sets of 3 numeric variables, which have above sequences in all combinations. Since there are 5 sequences and 3 variables, there can be 5*5*5 combinations, hence total of 12625 (5*5*5*101) numbers in each variable (101 for each sequence).
These can be grouped in a data.frame of 12625 rows and 4 columns. First column (V) will simply have seq(1:12625) (rownumbers can be used in its place). Other 3 columns (A,B,C) will have above 5 sequences in different combinations. For example, the first 101 rows will have 0:100 in all 3 A,B and C. Next 101 rows will have 0:100 in A and B, and 100:0 in C. And so on...
I can create sequences as:
s = list()
s[[1]] = 0:100
s[[2]] = 100:0
s[[3]] = rep(0,101)
s[[4]] = rep(50,101)
s[[5]] = rep(100,101)
But how to proceed further? I do not really need the data frame but I need a function that returns a list containing the values of c(A,B,C) for the number (first or V column) sent to it. The number can obviously vary from 1 to 12625.
How can I create such a function. I will prefer a vector solution or one using apply family functions to optimize the speed.

You asked for a vectorized solution, so here's one using only data.table (similar to #SimonGs methodology)
library(data.table)
grd <- CJ(A = seq_len(5), B = seq_len(5), C = seq_len(5))
res <- grd[, lapply(.SD, function(x) unlist(s[x]))]
res
# A B C
# 1: 0 0 0
# 2: 1 1 1
# 3: 2 2 2
# 4: 3 3 3
# 5: 4 4 4
# ---
# 12621: 100 100 100
# 12622: 100 100 100
# 12623: 100 100 100
# 12624: 100 100 100
# 12625: 100 100 100

I came up with two solutions. I find this hard to do with apply and the likes since they tend to give an output that is not so nice to handle (maybe someone can "tame" them better than I can :D)
First solution uses seperate calls to lapply, second one uses a for loop and some programming No-No's. Personally I prefer the second one, first one is faster though...
grd <- expand.grid(a=1:5,b=1:5,c=1:5)
# apply-ish
A <- lapply(grd[,1], function(z){ s[[z]] })
B <- lapply(grd[,2], function(z){ s[[z]] })
C <- lapply(grd[,3], function(z){ s[[z]] })
dfr <- data.frame(A=do.call(c,A), B=do.call(c,B), C=do.call(c,C))
# for-ish
mat <- NULL
for(i in 1:nrow(grd)){
cur <- grd[i,]
tmp <- cbind(s[[cur[,1]]],s[[cur[,2]]],s[[cur[,3]]])
mat <- rbind(mat,tmp)
}
The output of both dfr and mat seem to be what you describe.
Cheers!

Eliminating rows from a data.frame

I have this example data.frame:
df <- data.frame(id=c("a","a,b,c","d,e","d","h","e","i","b","c"), start=c(100,100,400,400,800,500,900,200,300), end=c(150,350,550,450,850,550,950,250,350), level = c(1,5,2,3,6,4,2,1,1))
> df
id start end level
1 a 100 150 1
2 a,b,c 100 350 5
3 d,e 400 550 2
4 d 400 450 3
5 h 800 850 6
6 e 500 550 4
7 i 900 950 2
8 b 200 250 1
9 c 300 350 1
where each row is a linear interval.
As this example shows some rows are merged intervals (rows 2 and 3).
What I'd like to do is for each merged interval either eliminate all its individual parts from df if the df$level of the merged interval is greater than that of all its parts, or if the df$level of the merged interval is smaller than at least one of its parts eliminate the merged interval.
So for this example, the output should be:
> res.df
id start end level
1 a,b,c 100 350 5
2 d 400 450 3
3 h 800 850 6
4 e 500 550 4
5 i 900 950 2

Method 1 (ID values)
So If we can assume that all the "merged" group have ID names that are a comma separated list of the individual groups, then we can tackle this problem just looking at the IDs and ignore the start/end information. Here is one such method
First, find all the "merged" groups by finding the IDs with commas
groups<-Filter(function(x) length(x)>1,
setNames(strsplit(as.character(df$id),","),df$id))
Now, for each of those groups, determine who has the larger level, either the merged group or one of the individual groups. Then return the index of the rows to drop as a negative number
drops<-unlist(lapply(names(groups), function(g) {
mi<-which(df$id==g)
ii<-which(df$id %in% groups[[g]])
if(df[mi, "level"] > max(df[ii, "level"])) {
return(-ii)
} else {
return(-mi)
}
}))
And finally, drop those from the data.frame
df[drops,]
# id start end level
# 2 a,b,c 100 350 5
# 4 d 400 450 3
# 5 h 800 850 6
# 6 e 500 550 4
# 7 i 900 950 2
Method 2 (Start/End Graph)
I wanted to also try a method that ignored the (very useful) merged ID names and just looked at the start/end positions. I may have gone off in a bad direction but this lead me to think of it as a network/graph type problem so I used the igraph library.
I created a graph where each vertex represented a start/end position. Each edge therefore represented a range. I used all the ranges from the sample data set and filled in any missing ranges to make the graph connected. I merged that data together to create an edge list. For each edge, I remember the "level" and "id" values from the original data set. Here's the code to do that
library(igraph)
poslist<-sort(unique(c(df$start, df$end)))
seq.el<-embed(rev(poslist),2)
class(seq.el)<-"character"
colnames(seq.el)<-c("start","end")
el<-rbind(df[,c("start","end","level", "id")],data.frame(seq.el, level=0, id=""))
el<-el[!duplicated(el[,1:2]),]
gg<-graph.data.frame(el)
And that creates a graph that looks like
So basically we want to eliminate cycles in the graph by taking the path with the edge that has the maximum "level" value. Unfortunately since this isn't a normal path-weighting scheme, I didn't find an easy way to do this with a default algorithm (maybe I missed it). So I had to write my own graph transversal function. It's not as pretty as I would have liked, but here it is.
findPath <- function(gg, fromv, tov) {
if ((missing(tov) && length(incident(gg, fromv, "in"))>1) ||
(!missing(tov) && V(gg)[fromv]==V(gg)[tov])) {
return (list(level=0, path=numeric()))
}
es <- E(gg)[from(fromv)]
if (length(es)>1) {
pp <- lapply(get.edges(gg, es)[,2], function(v) {
edg <- E(gg)[fromv %--% v]
lvl <- edg$level
nxt <- findPaths(gg,v)
return (list(level=max(lvl, nxt$level), path=c(edg,nxt$path)))
})
lvl <- sapply(pp, `[[`, "level")
take <- pp[[which.max(lvl)]]
nxt <- findPaths(gg, get.edges(gg, tail(take$path,1))[,2], tov)
return (list(level=max(take$level, nxt$level), path=c(take$path, nxt$path)))
} else {
lvl <- E(gg)[es]$level
nv <- get.edges(gg,es)[,2]
nxt <- findPaths(gg, nv, tov)
return (list(level=max(lvl, nxt$level), path=c(es, nxt$path)))
}
}
This will find a path between two nodes that satisfies the property of having a maximal level when presented with a branch. We call that with this data set with
rr <- findPaths(gg, "100","950")$path
This will find the final path. Since each row in the original df data.frame is represented by an edge, we just need to extract the edges from the path that correspond to the final path. This actually gives us a path that looks like
where the red path is the chosen one. I can then subset df with
df[df$id %in% na.omit(E(gg)[rr]$id), ]
# id start end level
# 2 a,b,c 100 350 5
# 4 d 400 450 3
# 5 h 800 850 6
# 6 e 500 550 4
# 7 i 900 950 2
Method 3 (Overlap Matrix)
He's another way to look at the start/stop positions. I create a matix where columns correspond to ranges in the rows of the data.frame and the rows of the matrix correspond to positions. Each value in the matrix is true if a range overlaps a position. Here I use the between.R helper function
#find unique positions and create overlap matrix
un<-sort(unique(unlist(df[,2:3])))
cc<-sapply(1:nrow(df), function(i) between(un, df$start[i], df$end[i]))
#partition into non-overlapping sections
groups<-cumsum(c(F,rowSums(cc[-1,]& cc[-nrow(cc),])==0))
#find the IDs to keep from each section
keeps<-lapply(split.data.frame(cc, groups), function(m) {
lengths <- colSums(m)
mx <- which.max(lengths)
gx <- setdiff(which(lengths>0), mx)
if(length(gx)>0) {
if(df$level[mx] > max(df$level[gx])) {
mx
} else {
gx
}
} else {
mx
}
})
This will give a list of the IDs to keep from each group, and we can get the final data.set with
df[unlist(keeps),]
Method 4 (Open/Close Listing)
I have one last method. This one might be the most scalable. We basically melt the positions and keep track of opening and closing events to identify the groups. Then we split and see if the longest in each group has the max level or not. Ultimately we return the IDs. This method uses all standard base functions.
#create open/close listing
dd<-rbind(
cbind(df[,c(1,4)],pos=df[,2], evt=1),
cbind(df[,c(1,4)],pos=df[,3], evt=-1)
)
#annotate with useful info
dd<-dd[order(dd$pos, -dd$evt),]
dd$open <- cumsum(dd$evt)
dd$group <- cumsum(c(0,head(dd$open,-1)==0))
dd$width <- ave(dd$pos, dd$id, FUN=function(x) diff(range(x)))
#slim down
dd <- subset(dd, evt==1,select=c("id","level","width","group"))
#process each group
ids<-unlist(lapply(split(dd, dd$group), function(x) {
if(nrow(x)==1) return(x$id)
mw<-which.max(x$width)
ml<-which.max(x$level)
if(mw==ml) {
return(x$id[mw])
} else {
return(x$id[-mw])
}
}))
and finally subset
df[df$id %in% ids, ]
by now I think you know what this returns
Summary
So if your real data has the same type of IDs as the sample data, obviously method 1 is a better, more direct choice. I'm still hoping there is a way to simplify method 2 that i'm just missing. I've not done any testing on efficiency or performance of these methods. I'm guessing method 4 might be be the most efficient since it should scale linearly.

I'll take a procedural approach; basically, sort descending by level,
and for each record, remove later records that have a matching id.
df <- data.frame(id=c("a","a,b,c","d,e","d","h","e","i","b","c"), start=c(100,100,400,400,800,500,900,200,300), end=c(150,350,550,450,850,550,950,250,350),
level = c(1,5,2,3,6,4,2,1,1), stringsAsFactors=FALSE)
#sort
ids <- df[order(df$level, decreasing=TRUE), "id"]
#split
ids <- sapply(df$id, strsplit, ",")
i <- 1
while( i < length(ids)) {
current <- ids[[i]]
j <- i + 1
while(j <= length(ids)) {
if(any(ids[[j]] %in% current))
ids[[j]] <- NULL
else
j <- j + 1
}
i <- i + 1
}
And finally, only keep the ids that are left:
R> ids <- data.frame(id=names(ids), stringsAsFactors=FALSE)
R> merge(ids, df, sort=FALSE)
id start end level
1 h 800 850 6
2 a,b,c 100 350 5
3 e 500 550 4
4 d 400 450 3
5 i 900 950 2
This has ugly while loops because R only has for-each loops, and also note the stringsAsFactors=FALSE is necessary for splitting the ids. Deleting middle elements
could be bad for performance, but that will depend on the underlying implementation
R uses for lists (linked vs arrays).

Elements within lists.

I'm relatively new in R (~3 months), and so I'm just getting the hang of all the different data types. While lists are a super useful way of holding dissimilar data all in one place, they are also extremely inflexible for function calls, and riddle me with angst.
For the work I'm doing, I often uses lists because I need to hold a bunch of vectors of different lengths. For example, I'm tracking performance statistics of about 10,000 different vehicles, and there are certain vehicles which are so similar they can essentially be treated as the same vehicles for certain analyses.
So let's say we have this list of vehicle ID's:
List <- list(a=1, b=c(2,3,4), c=5)
For simplicity's sake.
I want to do two things:
Tell me which element of a list a particular vehicle is in. So when I tell R I'm working with vehicle 2, it should tell me b or [2]. I feel like it should be something simple like how you can do
match(3,b)
> 2
Convert it into a data frame or something similar so that it can be saved as a CSV. Unused rows could be blank or NA. What I've had to do so far is:
for(i in length(List)) {
length(List[[i]]) <- max(as.numeric(as.matrix(summary(List)[,1])))
}
DF <- as.data.frame(List)
Which seems dumb.

For your first question:
which(sapply(List, `%in%`, x = 3))
# b
# 2
For your second question, you could use a function like this one:
list.to.df <- function(arg.list) {
max.len <- max(sapply(arg.list, length))
arg.list <- lapply(arg.list, `length<-`, max.len)
as.data.frame(arg.list)
}
list.to.df(List)
# a b c
# 1 1 2 5
# 2 NA 3 NA
# 3 NA 4 NA

Both of those tasks (and many others) would become much easier if you were to "flatten" your data into a data.frame. Here's one way to do that:
fun <- function(X)
data.frame(element = X, vehicle = List[[X]], stringsAsFactors = FALSE)
df <- do.call(rbind, lapply(names(List), fun))
# element vehicle
# 1 a 1
# 2 b 2
# 3 b 3
# 4 b 4
# 5 c 5
With a data.frame in hand, here's how you could perform your two tasks:
## Task #1
with(df, element[match(3, vehicle)])
# [1] "b"
## Task #2
write.csv(df, file = "outfile.csv")

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Intersecting multiple datasets in R - r

You could try something like this: I do a lot of things in lists... #x is your data in list() format shared<-list() for (i in 1:29){ shared[[i]]<-list() for (j in 1:29){ if (i != j){ shared[[i]][[j]]<-x[[i]][x[[i]][,2]==x[[j]][,2]] } } }

So happy to figure it out x<- list() shared<-matrix(1:841,ncol=29) temp<-NULL for (i in 1:29){ for (j in 1:29){ temp[j] <- length(intersect(x[[i]][[1]],x[[j]][[1]])) } shared[,i] <- matrix(temp) } shared

Related

retrieving a list element in O(1) in R

Parallel processing for multiple nested for loops

Creating combination of sequences

Eliminating rows from a data.frame

Elements within lists.

Categories

Resources