I have problems converting an existing mesh to the VTK legacy format.
VTK unstructured grids have the following structure:
DATASET UNSTRUCTURED_GRID
POINTS n dataType
p0x p0y p0z
p1x p1y p1z
...
p(n-1)x p(n-1)y p(n-1)z
CELLS n size
numPoints0, i, j, k, l, ...
numPoints1, i, j, k, l, ...
numPoints2, i, j, k, l, ...
...
numPoints(n-1), i, j, k, l, ...
CELL_TYPES n
...
The cells are specified by the IDs of their connectivity points and by their type.
It seems that a point's ID always corresponds to its position in the preceding POINTS block.
Is there a way to label the points in arbitrary order?
For instance, the first entry in the points block should correspond to index 3 instead of index 0.
I would really appreciate your help! TIA.
Legacy VTK format does not support that.
You will have to write a script that will reorder your points in the POINTS section accordingly.
Moreover, I don't see such a capability even in the more flexible XML-based VTU formats. So, I would suggest:
Write a script that is going to reorder the points in [0, N-1] in the way they are going to be referenced for connectivity in the CELLS section.
Or vice versa, write a script that will renumber the connectivity information in the CELLS section accordingly.
(optional) Consider switching to XML-based formats anyway, as they are more flexible, have better-supported libraries (IMHO) for IO, and allow for proper parallel IO, random access, and portable data compression.
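A minimal sketch of the renumbering approach in Python; the `desired_label` permutation is a hypothetical input describing which ID each current point should carry in the output:

```python
# points: list of (x, y, z) tuples, in their current order.
# cells: list of cells, each a list of point IDs referring to positions in `points`.
# desired_label[i]: the ID that point i should carry in the output file.
def relabel(points, cells, desired_label):
    n = len(points)
    # place each point at its desired position in the new POINTS block
    new_points = [None] * n
    for old_id, new_id in enumerate(desired_label):
        new_points[new_id] = points[old_id]
    # rewrite the connectivity so cells reference the new IDs
    new_cells = [[desired_label[pid] for pid in cell] for cell in cells]
    return new_points, new_cells
```

For the example above (the first point getting index 3 instead of 0), desired_label would start with 3.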
I need an efficient data structure to store a multidimensional sparse array.
There are only 2 operations over the array:
batch insert of values, usually a larger number of new values than already exist in the array. A key collision on insert is very unlikely; if it happens, the value is not updated.
query values in certain range (e.g. read range from index [2, 3, 10, 2] to [2, 3, 17, 6] in order)
From the start I know the number of dimensions (usually between 3 and 10), their sizes (each index fits in an Int64, and the product of all sizes doesn't exceed 2^256), and the upper limit on the possible number of array cells (usually 2^26 to 2^32).
Currently I use a balanced binary tree for storing the sparse array; the UInt256 key is formed in the usual mixed-radix (row-major) way:
key = (...(index_0 * dim_size_1 + index_1) * dim_size_2 + ...) * dim_size_n + index_n
with operation time complexities (and I understand it can't be any better):
insert in O(log N)
search in O(log N)
Current implementation has problems:
expensive encoding of an index tuple into the key and a key back into the indexes
lack of locality of reference which would be beneficial during range queries
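The per-key encoding and decoding cost can be seen in a minimal Python sketch of the mixed-radix key scheme described above (names are illustrative):

```python
def encode(indices, dim_sizes):
    # Horner-style mixed-radix encoding of an index tuple into one key
    key = 0
    for i, s in zip(indices, dim_sizes):
        key = key * s + i   # one multiply and one add per dimension
    return key

def decode(key, dim_sizes):
    # invert with one divmod per dimension, from the last dimension back
    indices = []
    for s in reversed(dim_sizes):
        key, i = divmod(key, s)
        indices.append(i)
    return indices[::-1]
```

With arbitrary-precision (e.g. 256-bit) keys, each of those multiplies and divmods is itself non-trivial, which is where the encoding cost comes from.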
Is it a good idea to replace my tree with a skip list for the locality of reference?
When is it better to have a recursive (nested) structure of sparse arrays for each dimension instead of a single array with the composite key if the array sparseness is given?
I'm interested in any examples of efficient in-memory multidimensional array implementations and in specialized literature on the topic.
It depends on how sparse your matrix is. It's hard to give numbers, but if it is "very" sparse then you may want to try using a PH-Tree (disclaimer: self-advertisement). It is essentially a multidimensional radix tree.
It natively supports 64-bit integers (Java and C++). It is not balanced, but its depth is inherently limited to the number of bits per dimension (usually 64). It is natively a "map", i.e. it allows only one value per coordinate (there is also a multimap version that allows multiple values). The C++ version is limited to 62 dimensions.
Operations are on the order of O(log N) but should be significantly faster than in a (balanced) binary tree.
Please note that the C++ version doesn't compile with MSVC at the moment but there is a patch coming. Let me know if you run into problems.
Or the question can be paraphrased like this:
Why may one need a datatype with a non-zero lower bound?
Consider the following example:
struct S {
int a;
int b;
float c;
float d;
} array[N];
If I had an array of type S[] and I wanted to send only values of fields b and
d, I would create a datatype with the type map { (4, MPI_INT), (12, MPI_FLOAT) }.
At first, it seems that such a type could be used to correctly send an array of
struct S:
MPI_Send(array, N, datatype, ...);
But this doesn't work if N > 1.
Such a type would have lb = 4, ub = 16 and extent = ub - lb = 12. That is,
MPI would consider that the second element of the array starts 12 bytes from the
first one, which is not true.
Well, that may not be a big deal. After all, for such partially sent structures we generally have to specify the exact size of the structure anyway:
MPI_Type_create_resized(datatype, 0, sizeof(struct S), &resized);
But I wonder why we always need to specify a zero lower bound. Why would someone need a non-zero lower bound? Datatypes with non-zero lower bounds look extremely confusing to me, and I cannot make any sense of them.
If I were to design a type system for MPI, I would describe a type with a single
parameter - its size (extent), which is the stride between two adjacent elements of an array. In terms of MPI, I would always set lb = 0 and extent = ub. Such a system looks much clearer to me, and it would work correctly in the example described above.
But MPI has chosen a different way. We have two independent parameters instead: the lower and the upper bound. Why is it so? What's the use of this additional flexibility? When should one use datatypes with a non-zero lower bound?
You have no idea what kind of weird and complex structures one finds in scientific and engineering codes. The standard is designed to be as general as possible and to provide maximum flexibility. Section 4.1.6 Lower-Bound and Upper-Bound Markers begins like this:
It is often convenient to define explicitly the lower bound and upper bound of a type map, and override the definition given on page 105. This allows one to define a datatype that has "holes" at its beginning or its end, or a datatype with entries that extend above the upper bound or below the lower bound. Examples of such usage are provided in Section 4.1.14.
Also, the user may want to overide [sic] the alignment rules that are used to compute upper bounds and extents. E.g., a C compiler may allow the user to overide [sic] default alignment rules for some of the structures within a program. The user has to specify explicitly the bounds of the datatypes that match these structures.
The simplest example of a datatype with a non-zero lower bound is a structure with absolute addresses used as offsets, useful for, e.g., sending structures with pointers to data scattered in memory. Such a datatype is used with MPI_BOTTOM specified as the buffer address, which corresponds to the bottom of the memory space (0 on most systems). If the lower bound were fixed to 0, you would have to find the data item with the lowest address first and compute all offsets relative to it.
Another example is the use of MPI_Type_create_subarray to create a datatype that describes a subarray of an n-dimensional array. With a zero lower bound you would have to provide a pointer to the beginning of the subarray. With a non-zero lower bound you just give a pointer to the beginning of the whole array instead. And you can also create a contiguous datatype of such subarray datatypes in order to send such n-dimensional "slices" of an (n+1)-dimensional array.
IMO the only reason to have both LB and UB markers is that it simplifies the description of datatype construction. MPI datatypes are described by a type map (a list of offsets and types, including possible LB/UB markers), and all the datatype construction calls define the new type map in terms of the old one.
When you have LB/UB markers in the old typemap and you follow the rules of construction of the new typemap from the old, you get a natural definition of the LB/UB marker in the new type which defines the extent of the new type. If extent were a separate property on the side of the typemap, you'd have to define what the new extent is for every datatype construction call.
Other than that, I fundamentally agree with you on the meaninglessness of having LB/UB as two separate pieces of data when the only thing they're used for is to define the extent. Once you add LB/UB markers, their meaning is completely disconnected from any notion of actual data offsets.
If you wanted to put an int at displacement 4 and have its extent be 8, it would be fine to construct
[(LB,4), (int,4), (UB,12)]
but it would be equally fine to construct any of
[(LB,0),(int,4),(UB,8)]
[(LB,1000000),(int,4),(UB,1000008)]
[(LB,-1000),(int,4),(UB,-992)]
The above are all completely equivalent in behavior because they have the same extent.
When explanations of LB/UB markers talk about how you need to have datatypes where the first data displacement is non-0, I think that's misleading. It's true you need to be able to make types like that, but the LB/UB markers aren't fundamentally connected to the data displacements. I'm concerned that suggesting they are connected will lead an MPI user to write invalid code if they think the LB is intrinsically related to the data offsets.
I am using Images.jl in Julia. I am trying to convert an image into a graph-like data structure (v,w,c) where
v is a node
w is a neighbor and
c is a cost function
I want to give an expensive cost to those neighbors that do not have the same color. However, when I load an image, each pixel is of type RGBA{U8}, e.g. RGBA{U8}(1.0,1.0,1.0,1.0). Is there any way to convert this into a number like an Int64 or a Float?
If all you want to do is penalize adjacent pairs that have different color values (no matter how small the difference), I think img[i,j] != img[i+1,j] should be sufficient, and infinitely more performant than calling colordiff.
Images.jl also contains methods, raw and separate, that allow you to "convert" that image into a higher-dimensional array of UInt8. However, for your apparent application this will likely be more of a pain, because you'll have to choose between using a syntax like A[:, i, j] != A[:, i+1, j] (which will allocate memory and have much worse performance) or write out loops and check each color channel manually. Then there's always the slight annoyance of having to special case your code for grayscale and color, wondering what a 3d array really means (is it 3d grayscale or 2d with a color channel?), and wondering whether the color channel is stored as the first or last dimension.
None of these annoyances arise if you just work with the data directly in RGBA format. For a little more background, these are examples of Julia's "immutable" objects, which have at least two advantages. First, they allow you to clearly specify the "meaning" of a certain collection of numbers (in this case, that these 4 numbers represent a color in a particular colorspace, rather than, say, pressure readings from a sensor); that means you can write code that isn't forced to make assumptions it can't enforce. Second, once you learn how to use them, they make your code much prettier, all while providing fantastic performance.
The color types are documented here.
Might I recommend converting each pixel to grayscale, if all you want is a magnitude difference.
See this answer for a how-to:
Converting RGB to grayscale/intensity
This will give you a single value for intensity that you can then use to compare.
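As a sketch of what such a conversion does, here is the common ITU-R BT.601 luma weighting in Python, assuming RGB channel values in [0, 1]:

```python
def intensity(r, g, b):
    # BT.601 luma weights; other weightings (e.g. BT.709) differ slightly
    return 0.299 * r + 0.587 * g + 0.114 * b
```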
Following @daycaster's suggestion, colordiff from Colors.jl can be used.
colordiff takes two colors as arguments. To use it, you should extract the color part of the pixel with color, i.e. colordiff(color(v), color(w)), where v would be an RGBA{U8}(0.384,0.0,0.0,1.0) value.
I have been reading through graph algorithms recently and saw that the notation for various upper bounds is of the form O(|V| + |E|), especially in DFS/BFS search algorithms, where linear time is achieved with the above upper bound.
I have seen both notations used interchangeably, i.e. O(V+E) as well. As far as I understand, the "|" bar notation is used for absolute values in the math world. If V = # of vertices and E = # of edges, how can they be negative numbers, such that we need to take absolute values before computing the linear function? Please help.
|X| refers to the cardinality (size) of X when X is a set.
O(V+E) is technically incorrect, assuming that V and E refer to sets of vertices and edges. This is because the value inside O( ) should be quantitative, rather than abstract sets of objects that have an ambiguous operator applied to them. |V| + |E| is well-defined to be one number plus another, whereas V + E could mean a lot of things.
However, in informal scenarios (e.g. conversing over the internet and in person), many people (including me) still say O(V+E), because the cardinality of the sets is implied. I like to type fast and adding in 4 pipe characters just to be technically correct is unnecessary.
But if you need to be technically correct, i.e. you're in a formal environment, or e.g. you're writing your computer science dissertation, it's best to go with O(|V|+|E|).
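To see where O(|V| + |E|) comes from, a minimal BFS sketch: each vertex is enqueued and dequeued at most once, and each adjacency list is scanned once:

```python
from collections import deque

def bfs_order(adj, start):
    # adj: dict mapping each vertex to a list of its neighbors
    visited = {start}
    order = []
    queue = deque([start])
    while queue:
        v = queue.popleft()        # every vertex dequeued once: O(|V|)
        order.append(v)
        for w in adj[v]:           # every edge examined O(1) times: O(|E|)
            if w not in visited:
                visited.add(w)
                queue.append(w)
    return order
```

The two counts add up, hence the sum of the two cardinalities in the bound.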
In this case, the vertical bars || denote the cardinality or number of elements of a set (i.e. |E| represents the count of elements in the set E).
http://en.wikipedia.org/wiki/Cardinality
If I think of the x,y coordinate plane, x,y is the common notation for an ordered pair, but if I use a two-dimensional array I have myArray[row][col], and row is the y and col is the x. Is that backwards, or am I just thinking about it wrong? I was thinking it would look like myArray[x][y], but that's wrong if I want real rows and columns (like on a gameboard). Wouldn't it be myArray[y][x] to truly mimic a row/column board?
You have it right, and it does feel a bit backwards. The row number is a y coordinate, and the column number is an x coordinate, and yet we usually write row,col but we also usually write x,y.
Whether you want to write your array as [y][x] or [x][y] depends mostly on how much you actually care about the layout of your array in memory (and if you do, what language you use). And whether you want to write functions/methods that can operate on rows or columns in isolation.
If you are writing C/C++ code, arrays are stored in row-major order, which means that a single row of data can be treated as a one-dimensional array, but a single column of data cannot. If I remember correctly, VB uses column-major order, so languages vary. I'd be surprised if C# isn't also row-major order, but I don't know.
This is what I do for my own sanity:
int x = array[0].length;
int y = array.length;
And then for every single array call I make, I write:
array[y][x]
This is particularly useful for graphing algorithms and horizontal/vertical matrix flipping.
It doesn't matter how you store your data in the array ([x][y] or [y][x]). What does matter is that you always loop over the array in a contiguous way. A Java two-dimensional array is essentially a one-dimensional array of arrays (e.g., in the case of [y][x], you have an array indexed by y in which each element holds the corresponding [x] array for that line of y).
To run through the whole array efficiently, it's important to access the data so that you don't continuously jump from one y-array-of-x-arrays to another. What you want to do is access one y element and visit all the x's in it before moving to the next y element.
So in an Array[y][x] situation. always have the first variable in the outer loop and the second in the inner loop:
for (int ys = 0; ys < Array.length; ys++)
{
    for (int xs = 0; xs < Array[ys].length; xs++)
    {
        // do your stuff here
    }
}
And of course, hoist both length values out of the loops beforehand so you don't have to fetch them every iteration.
I love the question. You’re absolutely right. Most of the time we are either thinking (x, y) or (row, col). It was years before I questioned it. Then one day I realized that I always processed for loops as if x was a row and y was a column, though in plane geometry it’s actually the opposite. As mentioned by many, it really doesn’t matter in most cases, but consistency is a beautiful thing.
Actually, it's up to you; there is no single right way of thinking about it. For example, I usually think of a one-dimensional array as a row of cells, so in my mind it is array[col][row]. But it is really up to you...
I bet there are a lot of differing opinions on this one. Bottom line is, it doesn't really matter as long as you are consistent. If you have other libraries or similar that is going to use the same data it might make sense to do whatever they do for easier integration.
If this is strictly in your own code, do whatever you feel comfortable with. My personal preference would be to use myArray[y][x]. If they are large, there might be performance benefits of keeping the items that you are going to access a lot at the same time together. But I wouldn't worry about that until at a very late stage if at all.
Well, not really. If you think of a row as elements along the x axis, then a 2D array is a bunch of rows stacked along the y axis. It's then natural to use y to pick a row (within that particular row, y is always the same; it's x that changes with the indices) and then use x to operate on the individual row elements (the rows are stacked vertically, each at a particular y value).
For better or for worse, the inconsistent notation was inherited from math.
Multidimensional arrays follow matrix notation, where M(i,j) represents the matrix element on row i and column j.
Multidimensional arrays therefore are not backward if used to represent a matrix, but they will seem backward if used to represent a 2D Cartesian plane where (x, y) is the typical ordering for a coordinate.
Also note that 2D Cartesian planes typically are oriented with the y-axis growing upward. However, that also is backward from how 2D arrays/matrices are typically visualized (and with the coordinate systems for most raster images).
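As a tiny illustration of the mismatch, with a row-major array the Cartesian point (x, y) lives at grid[y][x] (illustrative Python):

```python
# 3 rows (y direction), 4 columns (x direction)
rows, cols = 3, 4
grid = [[10 * r + c for c in range(cols)] for r in range(rows)]

def at_cartesian(grid, x, y):
    # Cartesian (x, y) maps to row index y, column index x;
    # note: the row index grows downward, unlike the usual Cartesian y axis
    return grid[y][x]
```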