Plot the intersection in every two list elements - r

Given a list of 16 elements, where each element is a named numeric vector, I want to plot the length of the intersection of names between every 2 elements. That is, the intersection of element 1 with element 2, that of element 3 with element 4, etc.
Although I can do this in a very tedious, low-throughput manner, I'll have to repeat this sort of analysis, so I'd like a more programmatic way of doing it.
As an example, the first 5 entries of the first 2 list elements are:
topGenes[[1]][1:5]
3398 284353 219293 7450 54658
2.856363 2.654106 2.653845 2.635599 2.626518
topGenes[[2]][1:5]
1300 64581 2566 5026 146433
2.932803 2.807381 2.790484 2.739735 2.705030
Here, the first row of numbers contains the gene IDs, and I want to know how many IDs each pair of vectors (treatment replicates) has in common among, say, the top 100.
I've tried using lapply() in the following manner:
vectorOfIntersectLengths <- lapply(topGenes, function(x) lapply(topGenes, function(y) length(intersect(names(x)[1:100],names(y)[1:100]))))
This only seems to operate on the first two elements; topGenes[[1]] & topGenes[[2]].
I've also been trying to do this with a for() loop, but I'm unsure how to write this. Something along the lines of this:
lens <- c()
for(i in 1:length(topGenes)){
lens[i] <- length(intersect(names(topGenes[[i]][1:200]),
names(topGenes[[i+1]][1:200])))
}
This returns a 'subscript out of bounds' error, which I don't really understand.
Thanks a lot for any help!

Is this what you're looking for?
# make some fake data
set.seed(123)
some_list <- lapply(1:16, function(x) {
y <- rexp(100)
names(y) <- sample.int(1000,100)
y
})
# identify all possible pairs
pairs <- t( combn(length(some_list), 2) )
# note: you could also use: pairs <- expand.grid(1:length(some_list),1:length(some_list))
# but in addition to a-to-b, you'd get b-to-a, a-to-a, and b-to-b
# get the intersection of names of a pair of elements with given indices kept for bookkeeping
get_intersection <- function(a,b) {
list(a = a, b = b,
intersection = intersect( names(some_list[[a]]), names(some_list[[b]]) )
)
}
# get intersection for each pair
intersections <- mapply(get_intersection, a = pairs[,1], b = pairs[,2], SIMPLIFY=FALSE)
# print the intersections
for(indx in 1:length(intersections)){
writeLines(paste('Intersection of', intersections[[indx]]$a, 'and',
intersections[[indx]]$b, 'contains:',
paste( sort(intersections[[indx]]$intersection), collapse=', ') ) )
}
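The question asks only for consecutive pairs (1 with 2, 3 with 4, and so on), and its loop fails with 'subscript out of bounds' because on the last iteration i + 1 points one past the end of the list. Below is a minimal sketch of the consecutive-pair version. It assumes each vector is already sorted so that its first 100 names are the top 100; the fake data mirrors the answer above, and top_n, first and pair_overlap are illustrative names only.

```r
# Fake data in the same shape as above: 16 named numeric vectors
set.seed(123)
some_list <- lapply(1:16, function(x) {
  y <- rexp(100)
  names(y) <- sample.int(1000, 100)
  y
})

top_n <- 100  # assumed cutoff, taken from the question's "top 100"

# Index of the first member of each pair: 1, 3, 5, ...
first <- seq(1, length(some_list) - 1, by = 2)

# Number of shared names for the pairs (1,2), (3,4), ...
pair_overlap <- vapply(first, function(i) {
  a <- head(names(some_list[[i]]), top_n)
  b <- head(names(some_list[[i + 1]]), top_n)
  length(intersect(a, b))
}, integer(1))
names(pair_overlap) <- paste(first, first + 1, sep = "-")

# One bar per pair
barplot(pair_overlap, xlab = "Pair of list elements", ylab = "Shared names")
```

Stepping by 2 and stopping at length(some_list) - 1 means i + 1 never runs past the end of the list.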

Related

Submit every similarly named elements of a list of vectors to a function in R

How can I use the base R function quantile() separately across the elements in L that are named EFL and ESL?
Note: this is a toy example, L could contain any number of similarly named elements.
foo <- function(X) {
X <- as.matrix(X)
tab <- table(row(X), factor(X, levels = sort(unique(as.vector(X)))))
w <- diag(ncol(tab))
rosum <- rowSums(tab)
obs_oc <- tab * (t(w %*% t(tab)) - 1)
obs_c <- colSums(obs_oc)
max_oc <- tab * (rosum - 1)
max_c <- colSums(max_oc)
SA <- obs_c / max_c
h <- names(SA)
h[is.na(h)] <- "NA"
setNames(SA, h)
}
DAT <- read.csv("https://raw.githubusercontent.com/rnorouzian/m/master/X.csv", row.names = 1)
L <- replicate(50, foo(DAT[sample(1:nrow(DAT), replace = TRUE),]), simplify = FALSE)
# How to use `quantile()` separately across all similarly named elements (e.g., EFL, ESL) in `L[[i]]`, i = 1, ..., 50
# quantile(all EFL elements across `L`)
# quantile(all ESL elements across `L`)
The previous solution used do.call to rbind the list elements into a matrix, converted it to a data.frame, and then calculated the quantiles of each column:
sapply(as.data.frame(do.call(rbind, L)), quantile)
However, that does not account for elements that are missing from some of the vectors. To line the values up correctly, the missing entries need to be filled in. I used data.table's rbindlist (you could also use plyr::rbind.fill) with fill=TRUE to fill the missing values with NA. It requires each item to be a data.frame, data.table, or list, so I converted each vector to a data.frame; before doing so you need to transpose it (t()) so that each named element becomes a column. It could be written in a single line, but it is easier to read what is happening in multiple lines.
L2 = lapply(L, function(x){as.data.frame(t(x))})
df = data.table::rbindlist(L2, fill=TRUE) # or plyr::rbind.fill(L2)
sapply(df, quantile, na.rm = TRUE)
You can also use purrr::transpose:
Lt <- purrr::transpose(L)
quantile(unlist(Lt$EFL),.8)
quantile(unlist(Lt$ESL),.8)
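A base R alternative that does not require every element of L to carry the same names is to flatten the list and regroup the values by name. This is only a sketch on a small made-up list: L_demo is a hypothetical stand-in for L, and its values are invented for illustration.

```r
# Hypothetical stand-in for L; the third vector has no ESL entry
L_demo <- list(
  c(EFL = 0.10, ESL = 0.50),
  c(EFL = 0.20, ESL = 0.60),
  c(EFL = 0.30)
)

# unlist() keeps the element names because the outer list itself is unnamed
vals <- unlist(L_demo)

# Group the values by name, then compute quantiles within each group
by_name <- split(unname(vals), names(vals))
q <- lapply(by_name, quantile)
q
```

Because the grouping is done on names, a vector that lacks one of the names simply contributes nothing to that group, so no NA filling is needed.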

Transform a function into something that lapply works on properly (lists)

I'm trying to make a list with 10 elements, each element consisting of 5 * i items drawn from a uniform distribution, i being the ith entry, and I want to use lapply.
Currently I made this function:
z_list <- list()
z_list_generator <- function(n) {
for(i in 1:n){
a <- runif(5 * i)
tmp <- list(a)
mybiglist[[i]] <- tmp
}
mybiglist
}
This function does give the correct outcome when I just put z_list_generator(2), it prints a list with the first element consisting of 5 elements, the second of 10 elements.
What I want to achieve is that I do lapply(some number, z_list_generator) such that it generates this same list, and such that when I do length(lapply(some number, z_list_generator)), the outcome is 'some number'.
Do you mean something like this?
z_list_generator <- function(k) lapply(1:k, function(i) runif(5 * i))
set.seed(2018) # Fixed random seed for reproducibility
z_list_generator(2)
#[[1]]
#[1] 0.33615347 0.46372327 0.06058539 0.19743361 0.47431419
#
#[[2]]
# [1] 0.3010486 0.6067589 0.1300121 0.9586547 0.5468495 0.3956160 0.6645386
# [8] 0.9821123 0.6782154 0.8060278
length(z_list_generator(2))
#[1] 2
Your z_list_generator is strange.
1) You do not initialise mybiglist in your function code. It probably modifies some global variable.
2) You assign each mybiglist element another list (of length 1), whose first element contains a sample from a uniform distribution. It is better to assign a, not tmp, there.
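For comparison, here is a sketch of the loop version with both points addressed. The name z_list_generator_loop is made up for illustration; it is not part of the original code.

```r
z_list_generator_loop <- function(n) {
  # Initialise the result inside the function instead of relying on a global
  mybiglist <- vector("list", n)
  for (i in seq_len(n)) {
    # Store the numeric vector itself, not a one-element list wrapping it
    mybiglist[[i]] <- runif(5 * i)
  }
  mybiglist
}

res <- z_list_generator_loop(3)
length(res)   # number of list elements
lengths(res)  # size of each element
```

Using seq_len(n) rather than 1:n also keeps the loop from running when n is 0.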

How to find out the best combination of a given vector whose sum is closest to a given number

My question is quite similar to this one: Find a subset from a set of integer whose sum is closest to a value
It discussed the algorithm only, but I want to solve it with R. I'm quite new to R and tried to work out a solution, but I wonder whether there is a more efficient way.
Here is my example:
# Define a vector, to find a subset whose sum is closest to the reference number 20.
A <- c(2,5,6,3,7)
# display all the possible combinations
y1 <- combn(A,1)
y2 <- combn(A,2)
y3 <- combn(A,3)
y4 <- combn(A,4)
y5 <- combn(A,5)
Y <- list(y1,y2,y3,y4,y5)
# calculate the distance to the reference number of each combination
s1 <- abs(apply(y1,2,sum)-20)
s2 <- abs(apply(y2,2,sum)-20)
s3 <- abs(apply(y3,2,sum)-20)
s4 <- abs(apply(y4,2,sum)-20)
s5 <- abs(apply(y5,2,sum)-20)
S <- list(s1,s2,s3,s4,s5)
# find the minimum difference
M <- sapply(S,FUN=function(x) list(which.min(x),min(x)))
Mm <- which.min(as.numeric(M[2,]))
# return the right combination
data.frame(Y[Mm])[as.numeric(M[,Mm[1]])]
so the answer is 2,5,6,7.
How can I refine this program? In particular, is there a way to run the five combn() calls and the five apply() calls at once? Ideally, when A has more items in it, I could use length(A) to cover them all.
Here is another way to do it,
l1 <- sapply(seq_along(A), function(i) combn(A, i))
l2 <- sapply(l1, function(i) abs(colSums(i) - 20))
Filter(length, Map(function(x, y)x[,y], l1, sapply(l2, function(i) i == Reduce(min, l2))))
#[[1]]
#[1] 2 5 6 7
The last line uses Map to index l1 based on a logical list created by finding the minimum value from list l2.
The combiter library has an isubsetv iterator, which goes through all subsets of a vector. Combining it with foreach simplifies the code.
library(combiter)
library(foreach)
A <- c(2,5,6,3,7)
res <- foreach(x = isubsetv(A), .combine = c) %do% sum(x)
absdif <- abs(res-20)
ind <- which(absdif==min(absdif))
as.list(isubsetv(A))[ind]
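The same brute-force search can also be written in base R alone. The sketch below enumerates every non-empty subset as a flat list and keeps those whose sum is closest to the target; target, subsets, gap and best are illustrative names.

```r
A <- c(2, 5, 6, 3, 7)
target <- 20

# Every non-empty subset of A, one list element per subset
subsets <- unlist(
  lapply(seq_along(A), function(k) combn(A, k, simplify = FALSE)),
  recursive = FALSE
)

# Distance of each subset sum from the target
gap <- abs(vapply(subsets, sum, numeric(1)) - target)

# Keep all subsets that achieve the minimum distance (ties are retained)
best <- subsets[gap == min(gap)]
best
```

This examines 2^length(A) - 1 subsets, so it is only practical for short vectors.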

Store values in For Loop

I have a for loop in R in which I want to store the result of each calculation (for all the values looped through). In the for loop a function is called and the output is currently stored in a variable r. However, this is overwritten in each successive iteration. How could I store the result of each pass through the function and access it afterwards?
Thanks,
example
for (par1 in 1:n) {
var <- my_function(par1, par2) # placeholder for the actual function call
c(var, par1) -> var2
print(var2)
}
So print returns every instance of var2, but var2 itself only holds the value for the last n. Is there any way to get an array of the data or something?
Initialise an object before the loop and then assign each value by indexing:
a <- 0
for (i in 1:10) {
a[i] <- mean(rnorm(50))
}
print(a)
EDIT:
To include an example with two output variables, in the most basic case, create an empty matrix with the number of columns corresponding to your output parameters and the number of rows matching the number of iterations. Then save the output in the matrix, by indexing the row position in your for loop:
n <- 10
mat <- matrix(ncol=2, nrow=n)
for (i in 1:n) {
var1 <- function_one(i,par1)
var2 <- function_two(i,par2)
mat[i,] <- c(var1,var2)
}
print(mat)
The iteration number i corresponds to the row number in the mat object. So there is no need to explicitly keep track of it.
However, this is just to illustrate the basics. Once you understand the above, it is more efficient to use the elegant solution given by @eddi, especially if you are handling many output variables.
To get a list of results:
n = 3
lapply(1:n, function(par1) {
# your function and whatnot, e.g.
par1*par1
})
Or sapply if you want a vector instead.
A bit more complicated example:
n = 3
some_fn = function(x, y) { x + y }
par2 = 4
lapply(1:n, function(par1) {
var = some_fn(par1, par2)
return(c(var, par1)) # don't have to type return, but I chose to make it explicit here
})
#[[1]]
#[1] 5 1
#
#[[2]]
#[1] 6 2
#
#[[3]]
#[1] 7 3
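If a rectangular result is more convenient than a list, sapply will simplify equal-length results into a matrix. A small sketch reusing some_fn and par2 from the example above; res and res_df are illustrative names.

```r
n <- 3
par2 <- 4
some_fn <- function(x, y) x + y

# Each iteration returns a named vector of length 2;
# sapply binds them into a matrix with one column per iteration
res <- sapply(1:n, function(par1) c(var = some_fn(par1, par2), par1 = par1))

# Transpose so that each iteration becomes a row
res_df <- as.data.frame(t(res))
res_df
```

Naming the elements of the returned vector gives the columns of res_df readable names.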

Splitting a data set using two parameters and saving the sub-data sets in a list

I am trying to split my data set using two parameters, the fraction of missing values and "maf", and store the sub-data sets in a list. Here is what I have done (it's not working). Any help will be appreciated,
Thanks.
library(BLR)
library(missForest)
data(wheat)
X2<- prodNA(X, 0.4) ### creating missing values
dim(X2)
fd<-t(X2)
MAF<-function(geno){ ## markers are in the rows
geno[(geno!=0) & (geno!=1) & (geno!=-1)] <- NA
geno <- as.matrix(geno)
## calc_Freq for alleles
n0 <- apply(geno==0,1,sum,na.rm=T)
n1 <- apply(geno==1,1,sum,na.rm=T)
n2 <- apply(geno==-1,1,sum,na.rm=T)
n <- n0 + n1 + n2
## calculate allele frequencies
p <- ((2*n0)+n1)/(2*n)
q <- 1 - p
maf <- pmin(p, q)
maf}
frac.missing <- apply(fd,1,function(z){length(which(is.na(z)))/length(z)})
maf<-MAF(fd)
lst<-matrix()
for (i in seq(0.2,0.7,by =0.2)){
for (j in seq(0,0.2,by =0.005)){
lst=fd[(maf>j)|(frac.missing < i),]
}}
It sounds like you want the results that the split function provides.
If "frac.missing" and "maf" are vectors computed from the values in "fd" (each with the same length as the number of rows in "fd"), then this would provide the split you are looking for:
spl.fd <- split(fd, list(maf, frac.missing) )
If you want to group the fd values based on maf and frac.missing within the bands specified by your for loop, then the same split construct may do what your current code is failing to accomplish:
lst <- split( fd, list(cut(maf, breaks = seq(0,0.2,by =0.005) ,
include.lowest=TRUE),
cut(frac.missing, breaks = seq(0.2,0.7,by =0.2),
right=TRUE,include.lowest=TRUE)
)
)
The right argument controls which end of each interval is closed: right = TRUE (the default for cut) gives intervals of the form (a, b], while right = FALSE gives [a, b), which is the choice that corresponds to a strict "<" comparison against the upper break. The other function that provides a similar facility is by.
The code below gives me exactly what I need:
Y<-t(GBS.binary)
nn<-colnames(Y)
fd<-Y
maf<-as.matrix(MAF(Y))
dff<-cbind(frac.missing,maf,Y)
colnames(dff)<-c("fm","maf",nn)
dff<-as.data.frame(dff)
for (i in seq(0.1,0.6,by=0.1)) {
for (j in seq(0,0.2,by=0.005)){
assign(paste("fm_",i,"maf_",j,sep=""),
(subset(dff, maf>j & fm <i))[,-c(1,2)])
} }
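Rather than creating one global variable per threshold pair with assign(), the subsets can be kept in a single named list, which is what the question originally asked for. This is a sketch on made-up data: Y, fm and maf here are random stand-ins rather than the real marker data, and grid and subsets are illustrative names.

```r
set.seed(1)
# Hypothetical stand-in data: 20 markers (rows) x 5 individuals (columns)
Y   <- matrix(sample(c(-1, 0, 1), 100, replace = TRUE), nrow = 20)
fm  <- runif(20, 0, 0.7)  # made-up fraction of missing values per marker
maf <- runif(20, 0, 0.5)  # made-up minor allele frequency per marker

# All combinations of the two thresholds used in the loops above
grid <- expand.grid(i = seq(0.1, 0.6, by = 0.1), j = seq(0, 0.2, by = 0.005))

# One sub-matrix per threshold pair; drop = FALSE keeps single-row results as matrices
subsets <- Map(function(i, j) Y[maf > j & fm < i, , drop = FALSE], grid$i, grid$j)
names(subsets) <- paste0("fm_", grid$i, "maf_", grid$j)

length(subsets)
```

Individual subsets can then be retrieved by name, for example subsets[["fm_0.1maf_0"]], or processed together with lapply.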
