Fast accessing elements of Compressed Sparse Row (CSR) sparse matrix - graph

I want to test some of the newer sparse linear solvers and I want to know if there is a fast way of filling in the matrix. The format I'm interested is CSR (http://goo.gl/hLXYd). Let's say the matrix, in CSR format, is given by:
values(num non-zero elements)
columns(num non-zero elements)
rowIndex(num rows + 1)
The sparse matrix under consideration derives from networks. So, I have thousands of nodes and some of them are connected between them by lines. So, the matrix is structurally symmetric. Each connection (i,j) adds something to the diagonal terms (i,i) and (j,j) and to the off-diagonal (i,j) and (j,i). I could have several connections between the same nodes (i,j,1), (i,j,2)... So, I might need to revisit the 2 diagonal and 2 off-diagonal elements more than once.
I know I can get the beginning of the row by doing rowIndex(i). Then, I would have to run through the elements columns(rowIndex(i):rowIndex(i+1)-1) to find where is j situated.
The question:
Is there a way of accessing the elements faster, while in CSR format, without having to do a search every time I want to update an element?
Some clarifications:
I just need to fill in the matrix from scratch. The matrix is structurally symmetric and not really symmetric. The values saved have to do with network data (impedances, resistances etc), they have real values. In general Value(i,j)<>Value(j,i). I have tuples of the form (name1,i1,j1,value1), (name2,i2,j2,value2) etc. These tuples are not sorted, and 2 tuples can refer to the same i,j values, meaning they need to be added
Thanks in advance!

What you have is so called triplet sparse format. Creation of CRS, including removing duplicate entries and summing the values, can be implemented very efficiently. Before programing it yourself, have a look at the SuiteSparse library. It is written in C, but I'm sure you will understand the principle. What interests you is the cholmod_triplet.c file, which implements the functionality you need.
Essentially, the conversion is performed using two phase bucket sort on your row and column indices. This algorithm has linear complexity, which is important if you are interested in processing large data sets.
Edit If you want to skip explicit creation of the triplet format all together, you can do that by generating the (row, col) connectivities on the fly and adding them to a dynamic sparse structure. I usually do it using insertion sort and sorted lists, which is in practice the fastest. It is also faster than triplet to CRS conversion, and uses much less memory. The method goes as follows:
if you know approximately, how many non-zero entries there are in every row, for every row you pre-allocate an array of (empty) column indices, and a separate array for the values (not linked list, but a simple array) of that size. Something like
static_lists_cols[row] = malloc(sizeof(int)*expected_number_of_non_zeros)
static_lists_vals[row] = malloc(sizeof(double)*expected_number_of_non_zeros)
If you do not know that, you choose an initial size and reallocate as needed (using some block size large enough to avoid reallocation overhead) when the row lists are full.
for every (row, col) pair you insert the col into the sorted list corresponding to row using insertion sort. For small (up to a few hundred) non-zeros per row linear search is the fastest. For larger number of non-zeros per row you can use bisection to locate the correct place to insert the col index.
col is inserted into rowth sorted list by moving the non-zero entries with higher column index in the sorted list. This is cache-friendly, since the rows are in practice small enough to fit into any cache nowadays.
After you finish you need to assemble the individual sorted lists into a valid CRS structure by copying the individual row lists into the final columns. The same with values.
You could actually avoid the last step by pre-allocating a static 'array of lists' if you are ok that some of the rows can have zero entries. You will hence have a constant number of entries per row, some of which might be zero. Sometimes that is ok.
This method is faster than using triplet to sparse conversion, at least for FEM models, for which I use it. The general reason is that memory bandwidth is the bottleneck here, and the above scheme uses much less memory:
creating the triplet format takes time, and you need to write the triplets to memory
conversion to CRS requires reading and writing the triplets at least once to sort them (actually a bit more than once, if you look at the algorithm. You sort twice, and you need auxiliary data structures.)
depending on the connectivity structure, you may end up having a large number of (row, col) duplicates in the triplet format, which are removed during the assembly by adding the corresponding values. This overhead does not exist in the method above - if the col already exists in the row list, you simply update the corresponding value.
updating the sorted lists can be done in parallel if you assign row ranges to individual workers. No communication, nor synchronization is needed. Assuring load balancing is another story...
Have a look at a performance comparison of using those two methods (Figure 1) for triangular elements in 2D. Note that the performance difference depends on the ratio of the number of entries in the triplet to assembled sparse matrix format (Table 2). But in general, the method is never worse than triplet to crs conversion, and triplets need to be created in the first place. You can also download a MATLAB MEX function sparse_create, which is a part of mutils package (see the downloads section).

Your question seems to confuse 2 rather different questions:
What is a fast way of creating a matrix in CSR form ?
Is there a faster way of reading values from a matrix already stored in CSR form ? (Faster, that is, than the straightforward approach you describe)
So here are 2 answers:
In general, read the network data from whatever form it is in into something like a dictionary of keys (other intermediate forms are available and may be more appealing to you for speed or other reasons); then turn that intermediate structure into the CSR form of the matrix. More on this below.
I don't believe so, not with a matrix stored in CSR form. This relative slowness of access is part of the price you pay for saving space. You've traded time for space, or space for time, depending on your point of view.
Your description of your input data suggests that you should consider devising your own intermediate form into which to marshal the raw data. Since your adjacency matrix is symmetric you only need to store, in any form, half of it. Further, you probably don't need to store the elements along the main diagonal -- I'm guessing either that node i is always connected to node i or never so that the nature of the network determines the value stored at (i,i). I'm a little uncertain of the information you want to store at each node of the matrix, is it the number of connections between i and j or something else ?

Related

Handle a string return from R to Tableau and SPLIT it

I connect Tableau to R and execute an R function for recommending products. When R ends, the return value is a string which will have all products details, like below:
ID|Existing_Prod|Recommended_Prod\nC001|NA|PROD008\nC002|PROD003|NA\nF003|NA|PROD_ABC\nF004|NA|PROD_ABC1\nC005|PROD_ABC2|NA\nC005|PRODABC3|PRODABC4
(Each line separated by \n indicating end of line)
On Tableau, I display the calculated field which is as below:
ID|Existing_Prod|Recommended_Prod
C001|NA|PROD008
C002|PROD003|NA
F003|NA|PROD_ABC
F004|NA|PROD_ABC1
C005|PROD_ABC2|NA
C005|PRODABC3|PRODABC4
Above data reaches Tableau through a calculated field as a single string which I want to split based on pipeline ('|'). Now, I need to split this into three columns, separated by the pipeline.
I used Split function on the calculated field :
SPLIT([R_Calculated_Field],'|',1)
SPLIT([R_Calculated_Field],'|',2)
SPLIT([R_Calculated_Field],'|',3)
But the error says "SPLIT function cannot be applied on Table calculations", which is self explanatory. Are there any alternatives to solve this ?? I googled to check for best practices to handle integration between R and Tableau and all I could find was simple kmeans clustering codes.
Make sure you understand how partitioning and addressing work for table calcs. Table calcs pass vectors of arguments to the R script, and receive a single vector in response. The cardinality of those vectors depends on the partitioning of the table calc. You can view that by editing the table calc, clicking specific dimensions. The fields that are not checked determine the partitioning - and thus the cardinality of the arguments you send and receive from R
This means it might be tricky to map your problem onto this infrastructure. Not necessarily impossible. It was designed to send a series of vector arguments with one cell per partitioning dimension, say, Manufacturer and get back one vector with one result per Manufacturer (or whatever combination of fields partition your data for the table calc). Sounds like you are expecting an arbitrary length list of recommendations. It shouldn’t be too hard to have your R script turn the string into a vector before returning, but the size of the vector has to make sense.
As an example of an approach that fits this model more easily, say you had a Tableau view that had one row per Product (and you had N products) - and some other aggregated measure fields in the view per Product. (In Tableau speak, the view’s level of detail is at the Product level.)
It would be straightforward to pass those measures as a series of argument vectors to R - each vector having N values, and then have R return a vector of reals of length N where the value returned at each location was a recommender score for the product at that position. (Which is why the ordering aka addressing of the vectors also matters)
Then you could filter out low scoring products from the view and visually distinguish highly recommended products.
So the first step to understanding R integration is to understand how table calcs operate with partitioning and addressing and to think in terms of vectors of fixed lengths passed in both directions.
If this model doesn’t support your use case well, you might be able to do something useful with URL actions or the JavaScript API.

For memory, what should be done when you need to constantly grow a vector to an unknown upper limit?

Suppose that you are dealing with a potentially infinite amount of data. Suppose further that you do not have this data stored in memory, but can generate individual terms at will. Finally, suppose that you want to do some experiment on this data that will involve checking a large but unknown amount of terms in a way that necessitates keeping a great many of them in memory. Toy problems with Recamán's sequence, like "find the minimum number terms needed in that sequence for the first 25 even numbers to have appeared", are what I have in mind as typical examples.
The obvious solution to this sort of problem would be to write some code like:
list<-c(first term)
while([not found enough terms yet])
{
nextTerm<-Whatever
if(this term worked){list<-c(list,nextTerm)}
}
However, building a big vector like this by adding one new term at a time is your memory's worst nightmare. The alternative that I often see suggested is to pre-allocate a big vector in memory by making the first line of your code something like list<-numeric(10^6), but those solutions suppose that we have some rough idea of how many terms we need to check, which isn't always the case. So what can we do when we are dealing with an ever-growing list of unknown required length?
This is very popular subject in R check this answer: https://stackoverflow.com/a/45195098/5442527
Summing up:
Do not use c() to bind as providing value by index [ is much faster. I know that it might seem surprising that you could grow pre-allocated vector. Make an iter variable before while loop and increase the index inside the if statement.
Normally like in Python you do not have to care about it when using append. Even starting with empty list is not an problem as the list (reserved memory) grows expotentialy (x2x2x1.5x1.2...) when you pass some perimeter number of elements. Link Over-allocating

How to change to a less memory-hungry data type in R?

I am working with a big (500000*2000) matrix containing data that can be one of 4 values. Keeping it in the standard R data type is pushing the capabilities of my workstation.
Is there a data type in R that allows for more efficient memory usage by allocating only 2 bits to each one of these values? This would increase the efficiency of my code by a lot.
Thanks
Depends on what kind of analysis you are doing. Using the sparse matrix functions from package Matrix (as Shinobi_Atobe suggested above) might be helpful if your matrix is sparse, that is, contains "lots" of zero values, whereas the simplest operational definition of "lots of zero values" is: try it out (i.e., convert your data to a sparse matrix class) and see if it helps.
You can also make sure that your data is stored as (a) integer [check out 1L vs 1] or (b) factor [which is, technically, integer] but not character or "long" (i.e., non-integer but numeric). Integer seems to be R's least memory-hungry tata type, even truth values (TRUE vs FALSE) do not seem to occupy less memory than integers. (I'm not completely sure about that, have tried only a very simple comparison: object.size(rep(T, 100)) == object.size(rep(1L, 100)) but see ?storage.mode).
So converting your data to integer (using as.integer will disentangle your matrix so it's a little bit trickier than that) might help. At least a little.
Beyond that, the possibilities include increasing your memory allowance to R[*], dividing your matrix into sub-parts (if that does not ruin your analytic strategy; even a list of smaller matrices can be more efficient than a big matrix for some purposes; so instead of a single 500000*2000 mtx you could have, say, a list of 100 5000*2000 matrices), and doing some parts of analysis in another language within R (e.g., Rcpp) or completely without it (e.g., an external python script).
[*] Increasing (or decreasing) the memory available to R processes

Is there a way to store data-points instead of creating multi-dimensional array?

I am trying to read data available and write it to a NetCDF file.
Say, I am reading temperature along different time, depth, latitude and longitude values, I will have to create a whole 4D grid of time, depth, latitude and longitude as dimensions.
However, the data I am recording has values at very few points. For example, in one of the cases, I had data at 155 points, while the grid was of 50x16x16x18 along time, depth latitude and longitude respectively. Thus I had data in only 155 points out of a grid having 230400 cells. Rest all points had fill values.
It seemed quite useless to have so many fill values. Is it possible to write a legitimate netCDF file with only the points that had data or maybe fewer use of fill values?
I am using NetCDF Java library for the process.
Thank you so much in advance.
It should be possible to represent the data at each grid point using one of the discrete sampling geometries (DSG) outlined by the CF Conventions (here are some examples). Perhaps one of these representations would work for your case (maybe timeSeries or timeSeriesProfile)? The DSGs are often talked about in the context of observational data, but they should apply to sub-sampled model output as well.
Any N-dimensional sparse array can be represented as a list (or 1-D array) of tuples, where each tuple has N coordinate value, and one data value.
If the array is sufficiently sparse, the list-based representation occupies less space ... on disk, and in memory.
Now the simple list-based representation is NOT good for random access because you need to scan through list to access the value at any point in the original array. You can improve on this (in the in-memory version):
If you order the list based on the coordinates and use an ArrayList, you can perform a binary search to find the value for a set of coordinates. This gives O(log N) indexing, with no additional memory overhead.
If you use a HashMap<Coords, Value>, you can get O(1) lookup. However, this comes at a significant addition memory cost. Probably around 50 to 80 additional bytes per entry compared to using an ArrayList representation.

hash table suffix tree explanation

I am asking this here because I couldn't find the answer I am looking for elsewhere and I don't know where else I could ask this. I hope someone can reply without saying that the question is irrelevant to the forum. I have a biology background and I am currently using bioinformatics. I need to understand in lay language hash tables and suffix trees. Something simple, I don't get the O(n) concepts and all that stuff, I think they are both kind of the same: a way to store string data? But I would like to understand better the differences. This will help enormously to other people like me. We are a lot in this field now!
Thanks in advance.
OK, lets use bioinformatics to help illustrate the differences.
Let's say you have several DNA sequences that are pretty long. If we want to store these sequences in a datastructure.
If we want to use a hashtable
A Hashtable is a useful way to store a bunch of objects but very quickly search the datastructure to see if we already contain a particular object.
One bioinformatics usecase that we can solve with a hashtable is de-duping a large sequence set. Let's say we have a huge dataset of next-gen sequenced data and we want to de-duplicate it before we assemble. We can use a hashtable to store the unique sequences. Before inserting any sequences into the hashtable, we can first check to see if it already exists in the hashtable and if it does we skip that read. Only if it is not yet in the hashtable do we add it. Then when we are done the elements in the hash will be the unique sequences.
Hashtables are basically an array of LinkedLists. Each cell in the array we will call a "bin". When we insert or search for something in the hashtable, we have to first know what bin it is in. The way we determine which bin to use is by a hash algorithm.
We have to come up with a hash algorithm. Something that will convert our sequence into a number. A requirement of this equation is the same sequence must always evaluate to the same number. It's OK if different sequences evaluate to the same number (which is called as hash collision) since there are an infinite number of possible sequences and we will only have a limited range of possible number values in our hash.
A simple hash algorithm is to assign a value to each base A =1 G =2 C = 3 T =4 (assume no ambiguities) then we can just sum up the bases in our sequence. This would mean that any sequences with the same number of As, Cs Gs and Ts will have the same hash value. If we wanted, we could also have a more complicated algorithm that also takes position into account so to get the same number we would have to also have the same sequence in the same order.
Once we have our hash algorithm. We can make a hash table by binning the sequences by their hash values. The more bins we have in our table, the fewer hash values per bin. Hashtables are often implemented by an array of LinkedLists. This is a very fast lookup because to see if a sequence is in our hashtable or to add a new sequence to our hash table, we just compute the hash value for the sequence to see what bin it is in, then we only have to look at the values inside that bin. We can ignore the rest of the bins.
suffix tree
A Suffix Tree is a different datastructure which is a graph where each node is (in this case) a residue in our sequence. Edges in the graph will point to the next node etc. So for example if our sequence was ACGT the path in the graph will be A->C->G->T->$. If we had another sequence ACTT the path will be A->C->T->T->$.
We can combine consecutive nodes if there is only 1 path so in the previous example since both sequence start with AC then the paths will be AC->G->T->$and AC->T->T->$.
In bioinformatics this is really useful for substring matching (like finding repetitive regions or primer binding sites etc) since we can easily see where there are subpaths in our graph that match our motif.
Hope that helps

Resources