I have a triangular matrix and would like to loop through all the elements efficiently. Is there a smart way that I simply don't see?
So here's a small example of my matrix
[,1] [,2] [,3] [,4]
[1,] 1 0 0 0
[2,] 2 1 0 0
[3,] 3 7 1 0
[4,] 4 4 10 1
What I want to do is perform some function f() on the elements in this matrix that are under (over) the main diagonal. (background: I know that I have a symmetric matrix and would like to perform some time consuming data manipulations and I'd like to 'mirror' my matrix)
As shown below, lower.tri() and upper.tri() provide an expressive (and fast) means of extracting and replacing elements in the relevant sectors of a matrix. Since the function you're applying to the elements is presumably slow compared to indexing operations, there's probably no point in searching for faster indexing options (or in trying to avoid the single call to t()).
## Example data
m <- matrix(c(1,2,3,4,0,1,7,4,0,0,1,10,0,0,0,1), ncol=4)
## Example of a slow function
slowFun <- function(x) sapply(x, function(x) {Sys.sleep(0.1); x^2})
## Proposed strategy
m[lower.tri(m)] <- slowFun(m[lower.tri(m)])
m[upper.tri(m)] <- t(m)[upper.tri(m)]
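A minimal check of the strategy above, as a sketch only. fastFun is a hypothetical stand-in for the slow function so the example runs instantly, and the logical masks are cached so that lower.tri() and upper.tri() are each called once:

```r
m <- matrix(c(1,2,3,4,0,1,7,4,0,0,1,10,0,0,0,1), ncol = 4)

## Hypothetical fast stand-in for the slow function
fastFun <- function(x) x^2

## Cache the logical masks once
lo <- lower.tri(m)
up <- upper.tri(m)

m[lo] <- fastFun(m[lo])   # transform the lower triangle only
m[up] <- t(m)[up]         # mirror it into the upper triangle

isSymmetric(m)            # sanity check on the mirrored result
```

The diagonal is never touched, because both masks exclude it by default (diag = FALSE).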
I have a numeric matrix of considerable size (say gigabytes). I need to do some computations on this matrix by treating it as-is (a matrix), and some other computations where I need to treat slices of it as arrays of, say, two or three dimensions (the new dimensionality can be different from slice to slice).
My question is how I can do these two kinds of operation using as little memory as possible.
Here is an example with my tentative solution. I'm just using a 2×9 matrix as a manageable example:
m <- matrix(sample(18,2*(1+2*2+2*2)),nrow=2,ncol=1+2*2+2*2)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## [1,] 3 10 12 2 7 15 6 1 9
## [2,] 8 4 18 17 13 14 11 16 5
#### I need to use m as-is in some function from some package
## some_result <- function_from_some_package(m)
#### Now I need to treat row 1, columns 2--5 as a 2*2 matrix,
#### let's say this represents some means.
#### And the same for row 1, columns 6--9,
#### let's say this represents some sds.
#### From these I want to calculate a matrix of normal density values:
somedata <- matrix(1:4, ncol=2) # these come from somewhere else
values <- dnorm(somedata, mean=matrix(m[1,2:5],nrow=2,ncol=2), sd=matrix(m[1,6:9],nrow=2,ncol=2))
## [,1] [,2]
## [1,] 0.01978261 0.0242596
## [2,] 0.00951157 0.0303845
The idea of using matrix (or array for more than two dimensions) to change the dimensionality of the needed slices of m comes from this question. My worry, however, is whether R is then internally creating two new objects, one for matrix(m[1,2:5],nrow=2,ncol=2) and one for matrix(m[1,6:9],nrow=2,ncol=2). This would represent a huge waste of memory in my concrete case.
I suppose at some point I must read more in detail how R does this kind of memory management, but I'd be grateful for answers and explanations for this specific case for the moment.
Edit: from experimenting with tracemem, as suggested by jay.sf in the comments, it looks like a slice might actually be copied twice! Once to create the input to dnorm, and once because dnorm seems to internally copy the mean argument...
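One way to keep the number of intermediate objects down, sketched under assumptions: extract each slice once and set its dimensions with dim<- instead of wrapping it in matrix(). The deterministic m below is a stand-in for the random example, and whether dim<- really avoids a further copy depends on the R version, so treat this as something to verify with tracemem rather than a guarantee. Either way, a copied slice is the size of the slice, not the size of m.

```r
## Deterministic stand-in for the 2 x 9 example matrix
m <- matrix(as.numeric(1:18), nrow = 2)
somedata <- matrix(1:4, ncol = 2)

## Extract each slice once, then reshape it by assigning dim
mu <- m[1, 2:5]
dim(mu) <- c(2, 2)

sigma <- m[1, 6:9]
dim(sigma) <- c(2, 2)

values <- dnorm(somedata, mean = mu, sd = sigma)

## The slices are small objects compared with the full matrix
object.size(mu)
object.size(m)
```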
I have a big list (10,000 items) of string vectors (50 elements each).
Here is a small sample from a vector:
c("0101000000000000000000000000000000000000000000001000000000000000000000000000000100000000000100000000000000100000000000000000110000000000000000000000",
"0000000000000000000000000000000010000000000000000000000000110000000000000000000000000001000000000000001000000010000000000000000000000000000000000001",
"0000000000000000000100010000000010000000000000010000000000100000000000000100000001000001100000000000001000001000000100000000000000000000000000000100")
In every vector, I would like to compute all pairwise differences (i.e. the number of differing characters in each pair) and then take the mean.
Given the big size of the list, what would be a cost effective way to do so?
thank you!!
EDIT- UPDATE
I am adding an example with shorter vectors and expected output (same as my comment below):
a=c("0010100101",
"1001011101",
"1111111010")
expected result (mean of pairwise differences):
6.666667
Many thanks to @Roland for providing the solution, based on Levenshtein distance!
Your question is not terribly clear but you appear to want a Levenshtein distance:
x = c("0010100101",
"1001011101",
"1111111010")
#switch off deletions and insertions:
d <- adist(x, costs = list(ins=Inf, del=Inf, sub=1))
# [,1] [,2] [,3]
#[1,] 0 6 8
#[2,] 6 0 6
#[3,] 8 6 0
mean(d[upper.tri(d)])
#[1] 6.666667
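With insertions and deletions switched off, this distance is simply the count of differing positions (the Hamming distance). For strings made only of "0" and "1" and of equal length, the same counts can be sketched in base R with two matrix products. The helper name mean_pairwise_diff is hypothetical, and the approach assumes strictly binary characters:

```r
x <- c("0010100101",
       "1001011101",
       "1111111010")

## Hypothetical helper: mean number of differing positions over all pairs
mean_pairwise_diff <- function(s) {
  ## One row per string, one column per character, as 0/1 integers
  B <- do.call(rbind, lapply(strsplit(s, ""), as.integer))
  ## d[i, j] counts positions where string i has 1 and string j has 0, plus the reverse
  d <- B %*% t(1 - B) + (1 - B) %*% t(B)
  mean(d[upper.tri(d)])
}

mean_pairwise_diff(x)
#[1] 6.666667
```

For a list of such vectors, something like vapply(my_list, mean_pairwise_diff, numeric(1)) would return one mean per vector (my_list being whatever the list is called).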
Well, I realised the small data sample was not that short, so I repeat the example with shorter strings:
a=c("0010100101",
"1001011101",
"1111111010")
I came up with this solution:
library(stringdist)
# matrix with no. of pairwise different characters
xx=(1-outer(a,a,"stringsim",method="hamming"))*nchar(a[1])
> xx
[,1] [,2] [,3]
[1,] 0 6 8
[2,] 6 0 6
[3,] 8 6 0
# mean no. of pairwise differences
xx1=mean(xx[lower.tri(xx,diag = FALSE)])
> xx1
[1] 6.666667
The result is fine, but I am not sure if this is an ideal solution, though..
(and if the "hamming" method is the correct one for any case).
I have a time series of n columns, where n >=2. For each row I am trying to generate a matrix by applying a function to each permutation of the n elements. for example:
perm <- function(x, y) x-y
row = c(1, 2, 3)
sapply(row,function(x) sapply(row, function(y) perm(x,y)))
[,1] [,2] [,3]
[1,] 0 1 2
[2,] -1 0 1
[3,] -2 -1 0
This approach works, but I have a feeling that there is a better way to do this using the matrix math functions, and I am not quite figuring it out. The apply approach is fine, but I think I can do better. Reproducing the output for a simple subtraction like the one illustrated would be a fine answer, but ideally it would come with some pointers to where to look in the docs for other types of operations. For what it's worth, the actual data is a list of interest rates of various tenors and the final output is going to be a matrix of forward rates, so only the upper triangle matters; the diagonal and lower triangle won't hold any information.
There is no way to eliminate the loops entirely in this case (that I know of). In all cases you will have to calculate the function for each element pair. R does, however, have a function built specifically for this: outer.
outer(row, row, perm)
[,1] [,2] [,3]
[1,] 0 -1 -2
[2,] 1 0 -1
[3,] 2 1 0
It is not necessarily faster to use outer compared to a normal loop, but it is a builtin method designed for the purpose.
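One detail worth knowing here: outer() does not call the function once per pair. It calls it a single time with two expanded vectors, so the function has to be vectorized. The sketch below illustrates this with a call counter (counted and scalar_only are hypothetical helpers), shows Vectorize() as a fallback for a function that only handles scalars, and checks that the result is the transpose of the nested sapply() from the question:

```r
row <- c(1, 2, 3)
perm <- function(x, y) x - y

## Count how many times outer() invokes the function
calls <- 0
counted <- function(x, y) {
  calls <<- calls + 1
  perm(x, y)
}
res <- outer(row, row, counted)
calls   # a single call, made with two vectors of length 9

## A scalar-only function must be wrapped before it is passed to outer()
scalar_only <- function(x, y) if (x > y) x - y else 0
res2 <- outer(row, row, Vectorize(scalar_only))

## outer() puts the first argument on the rows, so it is the transpose
## of the nested sapply() result
nested <- sapply(row, function(x) sapply(row, function(y) perm(x, y)))
all.equal(t(res), nested)
```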
Edit: Vectorized functions
There is, however, another approach if the function is vectorized (like perm): generate both arguments for every pair up front, call the function once, and add the dimensions afterwards.
n <- length(row)
perm1 <- row[rep(1:n, n)]
perm2 <- row[rep(1:n, each = n)]
res <- perm(perm1, perm2)
dim(res) <- c(n, n) #change to a 2 dimensional array (matrix)
res
[,1] [,2] [,3]
[1,] 0 -1 -2
[2,] 1 0 -1
[3,] 2 1 0
This will be much faster (but more memory intensive) if the function is optimized for vectorization.
Will try not to complicate things too much with my explanations, but I'm confused how to best go about filling a triangulated correlation matrix with no repeat values with existing correlation values derived from another package. This involves extracting specific values from a list of text files. This is what I have done so far:
# read in list of file names (they are named '1_1', '1_2' .. so on until '47_48' with no repeat values generated)
filenames <- read_table('/home/filenames.txt', col_names = 'file_id')
# create symmetrical matrix
M <- diag(48)
ct <- 1
for (sub in (filenames$file_id)) {
subj <- read.table(paste0(dat_dir, '/ht_', sub, '.HEreg'), sep="", fill=TRUE)
ht <- as.character(subj$V2[grep("rG",subj$V1)]) # wanting to extract the specific value in that column for each text file
M[ct,] <- as.numeric(ht) #input this value into the appropriate location
ct <- ct + 1
}
This obviously does not give me the triangulated output I would envision - I know there is an error with inputting the variable 'ht' into the matrix, but am not sure how to solve this moving forward. Ideally, the correlation value of file 1_1 should be inserted in row 1, col 1, file 1_2 should be inserted in row 2, col 1, so on and so forth, and avoiding repeats (should be 0's)
Should I turn to nested loops?
Much help would be appreciated from this R newbie here, I hope I didn't complicate things unnecessarily!
I think the easiest way would be to read in all your values into a vector. You can do this using a variation of your existing loop.
Let us assume that your desired size correlation matrix is 5x5 (I know you have 48x48 judging by your code, but to keep the example simple I will work with a smaller matrix).
Let us assume that you have read all of your correlation values into the vector x in column major order (same as R uses), i.e. the first element of x is row 2 column 1, second element is row 3 column 1 etc. I am further assuming that you are creating a symmetric correlation matrix, i.e. you have ones on the diagonal, which is why the indexing starts the way it does, because of your use of the diag() function. Let's assume your vector x contains the following values:
x <- 1:10
I know that these are not correlations, but they will make it easy to see how we fill the matrix, i.e. which vector element goes into which position in the resulting matrix.
Now, let us create the identity matrix and zero matrices for the upper and lower triangular correlations (off diagonal).
# Assuming 5x5 matrix
n_elements <- 5
m <- diag(n_elements)
m_upper <- m_lower <- matrix(0, n_elements, n_elements)
To quickly fill the lower triangular matrix, we can use lower.tri().
m_lower[lower.tri(m_lower, diag = FALSE)] <- x
This will yield the following output:
[,1] [,2] [,3] [,4] [,5]
[1,] 0 0 0 0 0
[2,] 1 0 0 0 0
[3,] 2 5 0 0 0
[4,] 3 6 8 0 0
[5,] 4 7 9 10 0
As you can see, we have successfully filled the lower triangular. Also note the order in which the elements of the vector are filled into the matrix: column by column. This is crucial for your results to be correct. The upper triangular is simply the transpose of the lower triangular, and then we can add our three matrices together to form your symmetric correlation matrix.
m_upper <- t(m_lower)
M <- m_lower + m + m_upper
Which yields the desired output:
[,1] [,2] [,3] [,4] [,5]
[1,] 1 1 2 3 4
[2,] 1 1 5 6 7
[3,] 2 5 1 8 9
[4,] 3 6 8 1 10
[5,] 4 7 9 10 1
As you see, there is no need to work with nested loops to fill these matrices. The only loop you need is the one that reads the results from the files (which it appears you have a handle on). If you only want the triangulated output, you can simply stop at the lower triangular matrix above. If your vector of estimated correlations (in my example x) includes the diagonal elements, simply set diag = TRUE in the lower.tri() function and you are good to go.
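If the values do not arrive in column-major order, an alternative sketch is to place each one by the row and column encoded in its file id, using a two-column index matrix. The ids and vals below are made-up illustrations of the 'i_j' naming from the question, not real data:

```r
## Hypothetical file ids and the value extracted from each file
ids  <- c("1_2", "1_3", "2_3", "1_4", "2_4", "3_4")
vals <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6)

## Split "i_j" into an integer matrix with one (i, j) pair per row
idx <- do.call(rbind, lapply(strsplit(ids, "_"), as.integer))

n <- max(idx)
M <- diag(n)              # ones on the diagonal

M[idx[, 2:1]] <- vals     # file i_j goes to row j, column i (lower triangle)
M[idx]        <- vals     # optional: mirror into the upper triangle

M
```

Indexing a matrix with a two-column matrix assigns one element per row of the index, so the order in which the files were read no longer matters. Skip the mirroring line to keep only the triangular output.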
I have a very complicated function. I need to repeat this function several times and sum the results. This is easy. However, I need to sum them all in one go. Since my function is difficult to show here, I provide a very simple example just to explain my idea. Please note (based on the questions from the comments) that my function needs to be applied pairwise. Also, my matrices all have the same dimensions. Finally, the result should not be a list. I need to assign the result to a new variable. That is,
Res <- myfunc(x[i,j],y[i,j])+myfunc(z[i,j],t[i,j])+..+..
Also, my function must loop over the elements of the matrices. x[i,j].
My matrices are stored in a list.
Mymatrices <- list(x,y,z,t)
For example,
x <- matrix(5,5,5)
x[upper.tri(x,diag=T)] <- 0
y <- matrix(4,5,5)
y[upper.tri(y,diag=T)] <- 0
z <- matrix(3,5,5)
z[upper.tri(z,diag=T)] <- 0
t <- matrix(2,5,5)
t[upper.tri(t,diag=T)] <- 0
myfunc <- function(x,y){
sum(x,y)
}
I would like it like this:
Res <- myfunc(x[i,j],y[i,j])+myfunc(z[i,j],t[i,j])+..+..
Suppose I have 10 matrices and would like to have the sum as shown above. It is hard to do manually, so I would like to do it automatically. The lapply function returns a list, and I do not want the result as a list.
Any help, please?
I can't tell either whether you need a matrix in the end or a single value. But since you used i,j I presume you need a matrix:
Reduce("+",list(x,y,z,t))
[,1] [,2] [,3] [,4] [,5]
[1,] 0 0 0 0 0
[2,] 14 0 0 0 0
[3,] 14 14 0 0 0
[4,] 14 14 14 0 0
[5,] 14 14 14 14 0
or do you need:
Reduce(sum,list(x,y,z,t))
[1] 140
Let's say your matrices are in a list, paired in the way you want:
input = list(list(x, y), list(z, t))
For convenience, we'll make a version of your function that takes a list as input (we could use do.call or an anonymous function instead, but this is very clear):
myfunc_list = function(x) {
myfunc(x[[1]], x[[2]])
}
We can then sapply the list function to your input list, and sum:
sum(sapply(input, myfunc_list))
# [1] 140
Glad to have helped. To be honest, I'm still not completely sure what you are asking for though - no one thinks your final answer will be a list, just an intermediate step in order to do the summation effectively. Looking at the answers, I think the Reduce function suggested by Onyambu is what you need - where x, y, z, and t are the results from your function (called pairwise on different matrices).
Is the summation really where you need help, or is it efficiently calling your function pairwise on all those matrices? That is a very different question. If that's the case, check out the map2 function in the purrr package. It takes two lists (of the same length) as inputs, computes a function on each element, and returns a list (which can be fed into Reduce).
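A base R sketch of that two-step idea, using Map() in place of purrr's map2 so that no extra package is needed. The names lower_const, first, second and pair_fun are hypothetical, and pair_fun is only a stand-in for a function that works elementwise on two matrices of the same size:

```r
## Build the four example matrices (stand-ins for x, y, z and t)
lower_const <- function(v) {
  m <- matrix(v, 5, 5)
  m[upper.tri(m, diag = TRUE)] <- 0
  m
}
mats <- lapply(c(5, 4, 3, 2), lower_const)

## Pair them up: (x, y) and (z, t)
first  <- mats[c(1, 3)]
second <- mats[c(2, 4)]

## Hypothetical elementwise function applied to each pair
pair_fun <- function(a, b) a + b

## Map() applies pair_fun pairwise; Reduce() sums the resulting matrices
Res <- Reduce(`+`, Map(pair_fun, first, second))
Res
```

Res is an ordinary matrix, not a list; the list only exists as an intermediate step inside the call.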