Is there a specification of R language that covers it?
For example I wanted to know what happens if you multiply a matrix and a vector (possibly transposed), the other way around. Using the standard * operator, not the %*% matrix multiplication. (Is it commutative?) I know I could find the specification of the behaviour for python or c++, but R?
One thing I found:
https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Operators
[...] adding two vectors of the same length will create a vector containing the element-wise sums, implicitly looping over the vector index. This applies also to other operators like -, *, and / as well as to higher dimensional structures. Notice in particular that multiplying two matrices does not produce the usual matrix product.
Then:
[...] (matrix+matrix, dimensions must match. vector+matrix: first recycle, then check if dims fit, error if not)
I'm trying to decode what first recycle, then check if dims fit means.
Should the vector multiply rows of the matrix, or columns? Why should it check if some dims fit when it is recycled?
Like first check if #rows of the matrix is a multiple of the length of the vector? Or #rows times #cols is a multiple of length(vector)?
It does not seem as the specification though as it does not (?) answer such a basic question.
Related
I've been staring at the following underlined statement from this book for hours, and I cannot for the life of me figure out how it can be right:
For some definitions:
is an r x r matrix (we may ignore its contents for this purpose).
A is an N x r matrix defined as the following matrix of column vectors, where each vector is N elements long:
First of all I'm convinced that when they write:
they really mean:
otherwise it simply would not make sense from the start. My confusion is when they say is a linear combination of the column vectors of A.
At first I thought maybe it just wasn't obvious to me, so I started doing the calculations as an exercise.
My calculation (please don't make me type all this into a text equation editor):
I THINK my calculation is correct, but there was a lot to keep track of, so...
I did not do the multiplication with because it's trivial, and it doesn't solve the problem.
How can the products of different elements (complex conjugated no less) of the vectors in A end up as a linear combination of the columns of A?
Am I forgetting something fundamental here? Maybe something to do with the fact that is an eigenvector of ...?
I am working with a big (500000*2000) matrix containing data that can be one of 4 values. Keeping it in the standard R data type is pushing the capabilities of my workstation.
Is there a data type in R that allows for more efficient memory usage by allocating only 2 bits to each one of these values? This would increase the efficiency of my code by a lot.
Thanks
Depends on what kind of analysis you are doing. Using the sparse matrix functions from package Matrix (as Shinobi_Atobe suggested above) might be helpful if your matrix is sparse, that is, contains "lots" of zero values, whereas the simplest operational definition of "lots of zero values" is: try it out (i.e., convert your data to a sparse matrix class) and see if it helps.
You can also make sure that your data is stored as (a) integer [check out 1L vs 1] or (b) factor [which is, technically, integer] but not character or "long" (i.e., non-integer but numeric). Integer seems to be R's least memory-hungry tata type, even truth values (TRUE vs FALSE) do not seem to occupy less memory than integers. (I'm not completely sure about that, have tried only a very simple comparison: object.size(rep(T, 100)) == object.size(rep(1L, 100)) but see ?storage.mode).
So converting your data to integer (using as.integer will disentangle your matrix so it's a little bit trickier than that) might help. At least a little.
Beyond that, the possibilities include increasing your memory allowance to R[*], dividing your matrix into sub-parts (if that does not ruin your analytic strategy; even a list of smaller matrices can be more efficient than a big matrix for some purposes; so instead of a single 500000*2000 mtx you could have, say, a list of 100 5000*2000 matrices), and doing some parts of analysis in another language within R (e.g., Rcpp) or completely without it (e.g., an external python script).
[*] Increasing (or decreasing) the memory available to R processes
I'm needing to subset a list which contains an array as well as a factor variable. Essentially if you imagine each component of the array is relative to a single individual which is then associated to a two factor variable (treatment).
list(array=array(rnorm(2,4,1),c(5,5,10)), treatment= rep(c(1,2),5))
Typically when sub-setting multiple components of the array from the first component of the list I would use something like
list$array[,,c(2,4,6)]
this would return the array components in location 2,4 and 6. However, for the factor component of the list this wouldn't work as subsetting is different, what you would need is this:
list$treatment[c(2,4,6)]
Need to subset a list with containing different classes (array and vector) by the same relative number.
You're treating your list of matrices as some kind of 3-dimensional object, but it's not.
Your list$matrices is of itself a list as well, which means you can index at as a list as well, it doesn't matter if it is a list of matrices, numerics, plot-objects, or whatever.
The data you provided as an example can just be indexed at one level, so list$matrices[c(2,4,6)] works fine.
And I don't really get your question about saving the indices in a numeric vector, what's to stop you from this code?
indices <- c(2,4,6)
mysubset <- list(list$matrices[indices], list$treatment[indices])
EDIT, adding new info for edited question:
I see you actually have an 3-D array now. Which is kind of weird, as there is no clear convention of what can be seen as "components". I mean, from your question I understand that list$array[,,n] refers to the n-th individual, but from a pure code-point of view there is no reason why something like list$array[n,,] couldn't refer to that.
Maybe you got the idea from other languages, but this is not really R-ish, your earlier example with a list of matrices made more sense to me. And I think the most logical would have been a data.frame with columns matrix and treatment (which is conceptually close to a list with a vector and a list of matrices, but it's clearer to others what you have).
But anyway, what is your desired output?
If it's just subsetting: with this structure, as there are no constraints on what could have been the content, you just have to tell R exactly what you want. There is no one operator that takes a subset of a vector and the 3rd index of an array at the same time. You're going to have to tell R that you want 3rd index to use for subsetting, and that you want to use the same index for subsetting a vector. Which is basically just the code you already have:
idx <- c(2,4,6)
output <- list(list$array[,,idx], list$treatment[idx])
The way that you use for subsetting multiple matrices actually gives an error since you are giving extra dimension although you already specify which sublist you are in. Hence in order to subset matrices for the given indices you can usemy_list[[1]][indices] or directly my_list$matrices[indices]. It is the same for the case treatement my_list[[2]][indices] or my_list$treatement[indices]
I am trying to find an efficient way to create a new array by repeating each element of an old array a different, specified number of times. I have come up with something that works, using array comprehensions, but it is not very efficient, either in memory or in computation:
LENGTH = 1e6
A = collect(1:LENGTH) ## arbitrary values that will be repeated specified numbers of times
NumRepeats = [rand(20:100) for idx = 1:LENGTH] ## arbitrary numbers of times to repeat each value in A
B = vcat([ [A[idx] for n = 1:NumRepeats[idx]] for idx = 1:length(A) ]...)
Ideally, what I would like would be a structure akin to the sparse matrix apparatus that Julia has but that would instead store data efficiently based on the indices where repeated values occur. Barring that, I would at least like an efficient way to create a vector such as B in the example above. I looked into the repeat() function, but as far as I can tell from the documentation and my experimentation with the function, it is just for repeating slices of an array the same number of times for each slice. What is the best way to approach this?
Sounds like you're looking for run-length encoding. There's an RLEVectors.jl package here: https://github.com/phaverty/RLEVectors.jl. Not sure how usable it is. You could also make your own data type fairly easily.
Thanks for trying RLEVectors.jl. Some features and optimizations had been languishing on master without a version bump. It can definitely be mixed with other vectors for element-wise arithmetic. I'll put the linear algebra operations on the feature request list. Any additional feature suggestions would be most welcome.
RLEVectors.jl has a rep function that works like R's and RLEVectors.inverse_ree is like StatsBase.inverse_rle, but it works on run ends rather than lengths.
I want to test some of the newer sparse linear solvers and I want to know if there is a fast way of filling in the matrix. The format I'm interested is CSR (http://goo.gl/hLXYd). Let's say the matrix, in CSR format, is given by:
values(num non-zero elements)
columns(num non-zero elements)
rowIndex(num rows + 1)
The sparse matrix under consideration derives from networks. So, I have thousands of nodes and some of them are connected between them by lines. So, the matrix is structurally symmetric. Each connection (i,j) adds something to the diagonal terms (i,i) and (j,j) and to the off-diagonal (i,j) and (j,i). I could have several connections between the same nodes (i,j,1), (i,j,2)... So, I might need to revisit the 2 diagonal and 2 off-diagonal elements more than once.
I know I can get the beginning of the row by doing rowIndex(i). Then, I would have to run through the elements columns(rowIndex(i):rowIndex(i+1)-1) to find where is j situated.
The question:
Is there a way of accessing the elements faster, while in CSR format, without having to do a search every time I want to update an element?
Some clarifications:
I just need to fill in the matrix from scratch. The matrix is structurally symmetric and not really symmetric. The values saved have to do with network data (impedances, resistances etc), they have real values. In general Value(i,j)<>Value(j,i). I have tuples of the form (name1,i1,j1,value1), (name2,i2,j2,value2) etc. These tuples are not sorted, and 2 tuples can refer to the same i,j values, meaning they need to be added
Thanks in advance!
What you have is so called triplet sparse format. Creation of CRS, including removing duplicate entries and summing the values, can be implemented very efficiently. Before programing it yourself, have a look at the SuiteSparse library. It is written in C, but I'm sure you will understand the principle. What interests you is the cholmod_triplet.c file, which implements the functionality you need.
Essentially, the conversion is performed using two phase bucket sort on your row and column indices. This algorithm has linear complexity, which is important if you are interested in processing large data sets.
Edit If you want to skip explicit creation of the triplet format all together, you can do that by generating the (row, col) connectivities on the fly and adding them to a dynamic sparse structure. I usually do it using insertion sort and sorted lists, which is in practice the fastest. It is also faster than triplet to CRS conversion, and uses much less memory. The method goes as follows:
if you know approximately, how many non-zero entries there are in every row, for every row you pre-allocate an array of (empty) column indices, and a separate array for the values (not linked list, but a simple array) of that size. Something like
static_lists_cols[row] = malloc(sizeof(int)*expected_number_of_non_zeros)
static_lists_vals[row] = malloc(sizeof(double)*expected_number_of_non_zeros)
If you do not know that, you choose an initial size and reallocate as needed (using some block size large enough to avoid reallocation overhead) when the row lists are full.
for every (row, col) pair you insert the col into the sorted list corresponding to row using insertion sort. For small (up to a few hundred) non-zeros per row linear search is the fastest. For larger number of non-zeros per row you can use bisection to locate the correct place to insert the col index.
col is inserted into rowth sorted list by moving the non-zero entries with higher column index in the sorted list. This is cache-friendly, since the rows are in practice small enough to fit into any cache nowadays.
After you finish you need to assemble the individual sorted lists into a valid CRS structure by copying the individual row lists into the final columns. The same with values.
You could actually avoid the last step by pre-allocating a static 'array of lists' if you are ok that some of the rows can have zero entries. You will hence have a constant number of entries per row, some of which might be zero. Sometimes that is ok.
This method is faster than using triplet to sparse conversion, at least for FEM models, for which I use it. The general reason is that memory bandwidth is the bottleneck here, and the above scheme uses much less memory:
creating the triplet format takes time, and you need to write the triplets to memory
conversion to CRS requires reading and writing the triplets at least once to sort them (actually a bit more than once, if you look at the algorithm. You sort twice, and you need auxiliary data structures.)
depending on the connectivity structure, you may end up having a large number of (row, col) duplicates in the triplet format, which are removed during the assembly by adding the corresponding values. This overhead does not exist in the method above - if the col already exists in the row list, you simply update the corresponding value.
updating the sorted lists can be done in parallel if you assign row ranges to individual workers. No communication, nor synchronization is needed. Assuring load balancing is another story...
Have a look at a performance comparison of using those two methods (Figure 1) for triangular elements in 2D. Note that the performance difference depends on the ratio of the number of entries in the triplet to assembled sparse matrix format (Table 2). But in general, the method is never worse than triplet to crs conversion, and triplets need to be created in the first place. You can also download a MATLAB MEX function sparse_create, which is a part of mutils package (see the downloads section).
Your question seems to confuse 2 rather different questions:
What is a fast way of creating a matrix in CSR form ?
Is there a faster way of reading values from a matrix already stored in CSR form ? (Faster, that is, than the straightforward approach you describe)
So here are 2 answers:
In general, read the network data from whatever form it is in into something like a dictionary of keys (other intermediate forms are available and may be more appealing to you for speed or other reasons); then turn that intermediate structure into the CSR form of the matrix. More on this below.
I don't believe so, not with a matrix stored in CSR form. This relative slowness of access is part of the price you pay for saving space. You've traded time for space, or space for time, depending on your point of view.
Your description of your input data suggests that you should consider devising your own intermediate form into which to marshal the raw data. Since your adjacency matrix is symmetric you only need to store, in any form, half of it. Further, you probably don't need to store the elements along the main diagonal -- I'm guessing either that node i is always connected to node i or never so that the nature of the network determines the value stored at (i,i). I'm a little uncertain of the information you want to store at each node of the matrix, is it the number of connections between i and j or something else ?