Multivariate apply function to compare, pair-wise, a set of files - r

I have a vector of file names, c("tom.txt", "tim.txt" , "Amy.txt"), each of which refers to a data vector. My task is to build a symmetric matrix that ends up looking like this:
> m
tom.txt tim.txt amy.txt
tom.txt 0 10 5
tim.txt 10 0 7
amy.txt 5 7 0
The entries are obtained by a function called get.result(vec1, vec2), which finds the corresponding data vectors of the 2 text files in the directory and does some operations to them and then returns a value for each position in the matrix. For instance, if I want to get the entry m["tom.txt", "tim.txt"], I need to pass "tom.txt" and "tim.txt" into get.result("tom.txt", "tim.txt"). The content of this function is not important.
However, if I want to compute the value for all entries, it will be tedious to keep typing get.result("tom.txt", "amy.txt"), get.result("tim.txt", "amy.txt"), especially when I am actually working with 100 different text files.
My question: Is there an efficient way to program this so that I always compare one text file against the rest (never against itself) and, at the same time, keep track of the positions in the matrix?
Should I initialise the matrix with all zeros right at the beginning and set the column and row names to the text file names? In that case, I am not sure how to obtain the column names and row names so that I can pass them into get.result(vec1, vec2).

Try this solution
fn <- c("tom.txt", "tim.txt" , "Amy.txt")
n <- length(fn)
m <- matrix(0, n, n)
rownames(m) <- fn
colnames(m) <- fn
for (i in 1:n) for (j in i:n) if (i!=j) {
v <- get.result(fn[i], fn[j])
m[i,j] <- v
m[j,i] <- v
}
m

fn <- dir(pattern = "\\.txt$") (change the pattern if needed) will give you the text files in your target folder. You could then loop over that list like in the previous answer.
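The pattern argument of dir() / list.files() is a regular expression, so an unescaped ".txt" matches any character followed by "txt". A minimal sketch of the difference, using a throwaway directory and made-up file names (both are assumptions for illustration only):

```r
# Throwaway directory with illustrative files (hypothetical names)
d <- file.path(tempdir(), "pairwise_demo")
dir.create(d, showWarnings = FALSE)
invisible(file.create(file.path(d, c("tom.txt", "tim.txt", "notes_txt.csv"))))

# Unescaped pattern: "." matches any character, so "notes_txt.csv" is included
loose <- list.files(d, pattern = ".txt")

# Anchored pattern: a literal dot, then "txt", then end of the file name
fn <- list.files(d, pattern = "\\.txt$")
fn
```

Here loose contains all three files, while fn holds only the two that really end in ".txt".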

The combn function gives you distinct combinations of vector elements:
combs <-combn( c("tom.txt", "tim.txt" , "Amy.txt") , 2)
#----------------
[,1] [,2] [,3]
[1,] "tom.txt" "tom.txt" "tim.txt"
[2,] "tim.txt" "Amy.txt" "Amy.txt"
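You can then call get.result on each pair. apply passes each column as a single vector, so split it into the two arguments: apply(combs, 2, function(p) get.result(p[1], p[2]))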
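A minimal sketch of how the pairs from combn could be written back into the symmetric matrix the question asks for. It works on index pairs rather than names, so the same two-column matrix can address the cells. get.result below is a hypothetical stand-in (it counts differing characters in the two names); the real function reads the files.

```r
# Hypothetical stand-in for get.result(): counts differing character positions
get.result <- function(f1, f2) sum(utf8ToInt(f1) != utf8ToInt(f2))

fn <- c("tom.txt", "tim.txt", "amy.txt")
n <- length(fn)

idx <- combn(n, 2)                       # one column per unordered pair (i < j)

# One call per pair; each column p holds the two indices
vals <- apply(idx, 2, function(p) get.result(fn[p[1]], fn[p[2]]))

m <- matrix(0, n, n, dimnames = list(fn, fn))
upper <- t(idx)                          # two-column matrix of (row, col) positions
m[upper] <- vals                         # fill the upper triangle
m[upper[, 2:1, drop = FALSE]] <- vals    # mirror into the lower triangle
m
```

The diagonal is never touched, so it keeps the zeros from the initialisation.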

Given that your file names are in a vector, say
vec <- c("tom.txt", "tim.txt" , "Amy.txt")
you can use
temp <- outer(seq(vec), seq(vec), Vectorize(function(x,y) if(x>y) get.result(vec[x],vec[y]) else 0 ))
result <- temp + t(temp)
Note that this makes sure get.result() is called only once for every relevant comparison, i.e., it's not called for equal files, nor is it called for pairs that differ only by order.
The last line creates a symmetric matrix.
EDIT: to get the names, use this:
rownames(result) <- colnames(result) <- vec

Related

How can I write a For Loop in R that executes multiple operations and generates multiple outputs? [duplicate]

I am trying to save the data from a loop of logical tests.
So I have the following data:
T1 <- matrix(seq(from=100000, to=6600000,length.out=676),26,26) # a 26x26 matrix, here with illustrative values
minmax <- seq(from=1,to=49,by=1) # creates a sequence
Fstep <- 6569141.82/minmax # define a vector from 0 to 6569141.82 with 49 divisions
F <- rev(round(Fstep,0)) # round the vector values and re order them
F
I ran the following loop:
for (i in 1:49) {
print(T1 > F[i]) # I used print to see the results in the screen
}
This loop returns 49 matrices filled with logical values (TRUE or FALSE). Each matrix is the comparison of T1 against one of the 49 values F[i] (F[1], ..., F[49]).
I need the values in those matrices for later use as adjacency matrices for network plots. However, I can neither assign those logical values to a matrix nor save them as csv files using "write.matrix".
So, I need 49 matrices "W" filled with logical values (T or F). I already get those values from the loop above, but I can't get them as an object or as a collection of csv files.
I tried
W<-matrix(0,26,26) #create an empty matrix to assign the logical arguments
for (i in 1:49){
W[i] <- T1>F[i] # attempt to store the i-th comparison in W
}
which returns the following warning
Warning messages:
1: In W[i] <- (T1 > F[i]) :
number of items to replace is not a multiple of replacement length
I also tried a different setting in which all the matrices I compare have the same dimensions.
create.M <- function(F){ # a function to transform each position F[i] into a 26X26 matrix
for (i in 1:49) {
matrix(F[i],26,26)
}
}
Loop.T1 <- function(T1){ # a function to replicate T1(49 times)
for ( i in 1:49) {
T1
}
}
and compared the two outputs
Loop.T1(T1)>create.M(F)
which returns
logical(0)
Store each boolean matrix as an item in a list:
result <- vector("list",49)
for (i in 1:49)
{
result[[i]] <- T1>F[i] # store the i-th logical matrix as list element i
}
#Print the first matrix on screen
result[[1]]
Another way to do what joran suggests is to use the apply family of functions.
result2 <- lapply(F, function(f) {T1 > f})
This gives the same thing as joran's result, a list where each element corresponds to one of the values of F and that element is a 26x26 logical matrix.
Another alternative is to store the results as a three-dimensional logical array (26 x 26 x 49) where each slice along the third dimension corresponds to one of the values of F.
result3 <- sapply(F, function(f) {T1 > f}, simplify="array")
the structure of which is
> str(result3)
logi [1:26, 1:26, 1:49] FALSE FALSE FALSE FALSE TRUE TRUE ...
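Since the question also asks for a collection of csv files, here is a minimal sketch of writing each list element to its own file. The output folder, the file-name scheme "W_01.csv", ... and the name thresholds (used instead of F, which masks the shorthand for FALSE) are assumptions made for illustration.

```r
T1 <- matrix(seq(from = 100000, to = 6600000, length.out = 676), 26, 26)
thresholds <- rev(round(6569141.82 / 1:49))

# One 26x26 logical matrix per threshold
result <- lapply(thresholds, function(f) T1 > f)

# Hypothetical output folder inside the session's temporary directory
out_dir <- file.path(tempdir(), "adjacency")
dir.create(out_dir, showWarnings = FALSE)

for (i in seq_along(result)) {
  write.csv(result[[i]],
            file.path(out_dir, sprintf("W_%02d.csv", i)),
            row.names = FALSE)
}

# Reading one file back gives a logical matrix again
W1 <- as.matrix(read.csv(file.path(out_dir, "W_01.csv")))
```

Keeping the list in memory and writing files only at the end means the matrices stay available as R objects for the network plots.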

Finding all combinations using recursion in R

I'm having issue with returning values from recursive functions, hoping you could help me out. I have a list with a bunch of matrices, each matrix representing a set of possible combinations and generated using combn(). As an example, this could be 3 matrices inside the list:
# set 1 has 4 elements, do nCk = 4C1:
set1 <- c("b2","b3","b4","b5")
set1 <- combn(set1,1,simplify = T)
# set 2 has 3 elements, choose 2:
set2 <- c("c1","c2","b2")
set2 <- combn(set2,2,simplify = T)
# set 3 has 10 elements, choose 1:
set3 <- combn(c(1:10),1, simplify = T)
If we were to print set2, for instance, it would have 2 rows (choose 2), and 3 columns (3C2 = 3):
> set2
[,1] [,2] [,3]
[1,] "c1" "c1" "c2"
[2,] "c2" "b2" "b2"
I need to get all possible 4-element combinations (one combination, i.e. one column, from each set above). I can do this using a while loop and simulating a state machine, but that solution is clunky and makes for long code. I know this can be done using recursion, as I was able to print the 120 combinations correctly (code below), but when trying to return them or save them in a variable, I either get a "subscript out of bounds" error or the results repeat thousands of times. I want to avoid global variables too: this will be embedded in a rather large project, so I'd prefer to avoid bloating my workspace with more variables than needed.
Of course, when deployed the number of sets will be dynamic, and the elements per set will change too. The sets aren't too big either, so I would love to implement a recursive approach!
Working code to print:
combb <- function(allsets, number, carry){
if(number>length(allsets)){
print(carry)
return()
} else{
for(j in 1:length(allsets[[number]][1,])){
newcarry <- c(carry, allsets[[number]][,j])
number2 <- number + 1
combb(allsets, number2, newcarry)
}
}
}
Thank you!
I found that it was very hard to carry the results back and forth, as it required flags and lists or a different solution. What I did instead was create a wrapper function where I created a local variable. The recursive function is defined inside, and accesses ("globally") the variable mentioned above. Then, that variable is returned by the wrapper:
combb <- function(allsets){
carry <- integer(0)
height <- 0L
for (j in 1:length(allsets)) {
height <- height + length(allsets[[j]][, 1])
}
output <- matrix(allsets[[1]][0, 1], nrow = height, ncol = 0)
combb1 <- function(allsets, number, carry) {
if(number > length(allsets)){
output <<- cbind(output, carry, deparse.level = 0)
return()
} else{
for (j in 1:length(allsets[[number]][1,])) {
# Only add unique combinations (some combinations are vectors)
if((TRUE %in% (allsets[[number]][, j] %in% carry)) == FALSE) {
newcarry <- c(carry, allsets[[number]][, j], use.names = FALSE)
number2 <- number + 1
combb1(allsets, number2, newcarry)
} else{
next()
}
}
}
}
combb1(allsets, 1, carry)
return(output)
}
As you can see from that solution, recursion is neat (combb1 function) and doesn't clutter any of the global/workspace variables.
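For comparison, a sketch of a different technique: a recursive function that returns its results directly instead of writing to an enclosing variable with <<-. Each call returns a matrix whose columns are the combinations of the remaining sets. The name comb_rec is hypothetical, all values are coerced to character, and the duplicate filter from the answer above is deliberately left out.

```r
comb_rec <- function(allsets, number = 1) {
  if (number > length(allsets)) {
    # Base case: exactly one combination, and it is empty (0 rows, 1 column)
    return(matrix(character(0), nrow = 0, ncol = 1))
  }
  rest <- comb_rec(allsets, number + 1)    # combinations of the later sets
  current <- allsets[[number]]
  blocks <- lapply(seq_len(ncol(current)), function(j) {
    head_part <- as.character(current[, j])
    # Put column j of the current set on top of every later combination
    rbind(matrix(head_part, nrow = length(head_part), ncol = ncol(rest)), rest)
  })
  do.call(cbind, blocks)
}

allsets <- list(combn(c("b2", "b3", "b4", "b5"), 1),
                combn(c("c1", "c2", "b2"), 2),
                combn(1:10, 1))
res <- comb_rec(allsets)
dim(res)   # 4 rows, 120 columns
```

Because every level only combines its own columns with what the deeper call returned, no flags or shared state are needed.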

Plot the intersection in every two list elements

Given a list of 16 elements, where each element is a named numeric vector, I want to plot the length of the intersection of names between every 2 elements. That is; the intersection of element 1 with element 2, that of element 3 with element 4, etc.
Although I can do this in a very tedious, low-throughput manner, I'll have to repeat this sort of analysis, so I'd like a more programmatic way of doing it.
As an example, the first 5 entries of the first 2 list elements are:
topGenes[[1]][1:5]
3398 284353 219293 7450 54658
2.856363 2.654106 2.653845 2.635599 2.626518
topGenes[[2]][1:5]
1300 64581 2566 5026 146433
2.932803 2.807381 2.790484 2.739735 2.705030
Here, the first row of numbers are gene IDs, and I want to know how many IDs each pair of vectors (a treatment replicate) has in common among, say, the top 100.
I've tried using lapply() in the following manner:
vectorOfIntersectLengths <- lapply(topGenes, function(x) lapply(topGenes, function(y) length(intersect(names(x)[1:100],names(y)[1:100]))))
This only seems to operate on the first two elements; topGenes[[1]] & topGenes[[2]].
I've also been trying to do this with a for() loop, but I'm unsure how to write this. Something along the lines of this:
lens <- c()
for(i in 1:length(topGenes)){
lens[i] <- length(intersect(names(topGenes[[i]][1:200]),
names(topGenes[[i+1]][1:200])))
}
This returns a 'subscript out of bounds' error, which I don't really understand.
Thanks a lot for any help!
Is this what you're looking for?
# make some fake data
set.seed(123)
some_list <- lapply(1:16, function(x) {
y <- rexp(100)
names(y) <- sample.int(1000,100)
y
})
# identify all possible pairs
pairs <- t( combn(length(some_list), 2) )
# note: you could also use: pairs <- expand.grid(1:length(some_list),1:length(some_list))
# but in addition to a-to-b, you'd get b-to-a, a-to-a, and b-to-b
# get the intersection of names of a pair of elements with given indices kept for bookkeeping
get_intersection <- function(a,b) {
list(a = a, b = b,
intersection = intersect( names(some_list[[a]]), names(some_list[[b]]) )
)
}
# get intersection for each pair
intersections <- mapply(get_intersection, a = pairs[,1], b = pairs[,2], SIMPLIFY=FALSE)
# print the intersections
for(indx in 1:length(intersections)){
writeLines(paste('Intersection of', intersections[[indx]]$a, 'and',
intersections[[indx]]$b, 'contains:',
paste( sort(intersections[[indx]]$intersection), collapse=', ') ) )
}
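If only consecutive pairs are wanted (element 1 with 2, 3 with 4, and so on, as the question describes), a sketch along the following lines may be enough. It steps through the odd indices only, which also avoids the 'subscript out of bounds' error: the loop in the question reaches i + 1 = 17 on its final pass. The fake data, the cut-off top_n and the variable names are assumptions for illustration.

```r
# Fake data: 16 named numeric vectors, sorted so the "top" entries come first
set.seed(123)
some_list <- lapply(1:16, function(x) {
  y <- sort(rexp(100), decreasing = TRUE)
  names(y) <- sample.int(1000, 100)
  y
})

top_n <- 50                                     # illustrative cut-off
first <- seq(1, length(some_list) - 1, by = 2)  # 1, 3, 5, ..., 15

# Length of the name intersection for each consecutive pair
overlap <- vapply(first, function(i) {
  length(intersect(head(names(some_list[[i]]), top_n),
                   head(names(some_list[[i + 1]]), top_n)))
}, integer(1))
names(overlap) <- paste(first, first + 1, sep = "-")
overlap
```

Passing overlap to barplot() would then draw one bar per pair.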

Store values in For Loop

I have a for loop in R in which I want to store the result of each calculation (for all the values looped through). In the for loop a function is called and the output is stored in a variable r in the moment. However, this is overwritten in each successive loop. How could I store the result of each loop through the function and access it afterwards?
Thanks,
example
for (par1 in 1:n) {
var <- my_function(par1, par2) # placeholder for the actual function call
var2 <- c(var, par1)
print(var2)
}
So print shows every instance of var2, but var2 itself only holds the value from the last iteration. Is there any way to get an array of the data or something?
Initialise an empty object and then assign the value by indexing:
a <- 0
for (i in 1:10) {
a[i] <- mean(rnorm(50))
}
print(a)
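A variant of the same idea, sketched with the result vector created at its final length up front, so the loop fills existing slots rather than extending the vector on each pass. The seed is only there to make the illustration repeatable.

```r
n <- 10
set.seed(1)            # illustrative seed
a <- numeric(n)        # preallocated: n zeros
for (i in seq_len(n)) {
  a[i] <- mean(rnorm(50))
}
a
```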
EDIT:
To include an example with two output variables, in the most basic case, create an empty matrix with the number of columns corresponding to your output parameters and the number of rows matching the number of iterations. Then save the output in the matrix, by indexing the row position in your for loop:
n <- 10
mat <- matrix(ncol=2, nrow=n)
for (i in 1:n) {
var1 <- function_one(i,par1)
var2 <- function_two(i,par2)
mat[i,] <- c(var1,var2)
}
print(mat)
The iteration number i corresponds to the row number in the mat object. So there is no need to explicitly keep track of it.
However, this is just to illustrate the basics. Once you understand the above, it is more efficient to use the elegant solution given by @eddi, especially if you are handling many output variables.
To get a list of results:
n = 3
lapply(1:n, function(par1) {
# your function and whatnot, e.g.
par1*par1
})
Or sapply if you want a vector instead.
A bit more complicated example:
n = 3
some_fn = function(x, y) { x + y }
par2 = 4
lapply(1:n, function(par1) {
var = some_fn(par1, par2)
return(c(var, par1)) # don't have to type return, but I chose to make it explicit here
})
#[[1]]
#[1] 5 1
#
#[[2]]
#[1] 6 2
#
#[[3]]
#[1] 7 3
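If a matrix is more convenient than a list, a sketch using vapply could look like this. vapply checks that every iteration returns the declared shape (here two numbers); the element names value and par1 are chosen for illustration.

```r
n <- 3
par2 <- 4
some_fn <- function(x, y) x + y

# Each iteration returns a named pair; vapply binds the pairs as columns
res <- vapply(1:n, function(par1) {
  c(value = some_fn(par1, par2), par1 = par1)
}, numeric(2))

res <- t(res)   # one row per iteration, one column per output
res
```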

How to use apply instead of for loop for stringMatch function?

I'm trying to calculate the number of pairwise differences between a long list of sequences, and put it back into a matrix form. This is a toy example of what I want to do.
library(MiscPsycho)
b <- c("-BC", "ACB", "---") # Toy example of sequences
workb <- expand.grid(b,b)
new <- c(1:9)
# Need to get rid of this for loop somehow
for (i in 1:9) {
new[i] <- stringMatch(workb[i,1], workb[i,2], normalize="NO")
}
workb <- cbind(workb, new)
newmat <- reShape(workb$new, id=workb$Var1, colvar=workb$Var2)
a <- c("Subject1", "Subject2", "Subject3") #Relating it back to the subject ID
colnames(newmat) <- a
rownames(newmat) <- a
newmat
I'm not very familiar with using the apply functions, but I'd like to use it to be able to replace the for loop, which will probably get slow considering I have a large number of sequences. (The stringMatch function is from MiscPsycho). Please let me know how to make it more efficient!
Thank you very much!
To get those "pairwise distances" I would have done something like:
Vm <- Vectorize(stringMatch)
nex <- outer(b,b, FUN=Vm, normalize = "NO")
nex
[,1] [,2] [,3]
[1,] 0 3 2
[2,] 3 0 3
[3,] 2 3 0
To replace the loop
new <- apply(workb, 1, function(x) stringMatch(x[[1]],x[[2]], normalize="NO"))
I would make a function that takes your index, i, and returns new[i].
myfun <- function(i) {
stringMatch(workb[i, 1], workb[i, 2], normalize='NO')
}
Then you can apply along your new vector:
workb$new <- unlist(lapply(new, myfun))
In general, you are using a for loop correctly in R. You have allocated the vector new beforehand and are filling it rather than growing it.
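Because the distance matrix is symmetric with a zero diagonal, roughly half of the calls can be skipped by looping over unordered pairs only. A minimal sketch with mapply follows; count_diff is a hypothetical stand-in that counts differing positions in equal-length strings, so it is not a replacement for stringMatch in general, and it is used here only to keep the example free of package dependencies.

```r
# Hypothetical stand-in for stringMatch(): counts differing character positions
count_diff <- function(s1, s2) {
  sum(strsplit(s1, "")[[1]] != strsplit(s2, "")[[1]])
}

b <- c("-BC", "ACB", "---")
a <- c("Subject1", "Subject2", "Subject3")

idx <- combn(length(b), 2)            # pairs (i, j) with i < j

# One call per unordered pair
d <- mapply(function(i, j) count_diff(b[i], b[j]), idx[1, ], idx[2, ])

newmat <- matrix(0, length(b), length(b), dimnames = list(a, a))
newmat[t(idx)] <- d                             # upper triangle
newmat[t(idx)[, 2:1, drop = FALSE]] <- d        # mirrored lower triangle
newmat
```

For this toy input the stand-in gives the same matrix as the outer() result shown above.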
