Distributing blocks with validation and non-dependent list generation - graph

Problem
Suppose I have a system of nodes that can communicate with a parent node, but not among each other. Suppose then a file on the parent node is split up into blocks and divided among the children. The file is then deleted from the parent node.
If the parent were to then request the blocks back from the children, how can the original order be reconstructed without retaining a list of all the files on the parent? Additionally, to prevent one of the nodes from maliciously modifying a block, the parent would also have to validate the blocks coming back.
Optimal Solution
A system of naming the blocks of a file, where the list of block names can be generated on any node given a seed. Given that list, the parent should be able to use it somehow to validate the blocks coming back from the children.
Attempt #1
So what I have so far is the ability to minimally store a list of the blocks. I do so by naming the blocks as follows:
block_0 = hash(file_contents)
block_n = hash(block_n-1) [hashing the name of the previous file]
This enables the order of the files to be retained by just keeping the seed (the name of block_0) and the number of blocks (e.g. 5d41402abc4b2a76b9719d911017c592,5 --> seed,files). However, this will not allow the files to be validated independently.
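To make Attempt #1 concrete, here is a minimal Python sketch of that naming scheme (I'm assuming MD5 and hex-string names only because the example seed above looks like an MD5 digest; block_names is my own helper name, not part of the question):

import hashlib

def block_names(seed: str, count: int) -> list[str]:
    # Regenerate the ordered block names from just (seed, count):
    # block_0's name is the seed (hash of the file contents), and each
    # following name is the hash of the previous block's name.
    names = [seed]
    for _ in range(count - 1):
        names.append(hashlib.md5(names[-1].encode()).hexdigest())
    return names

# "5d41402abc4b2a76b9719d911017c592,5"  -->  seed, number of blocks
print(block_names("5d41402abc4b2a76b9719d911017c592", 5))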
Attempt #2
Simply take the hash of each block and store that in a list. However, this is not efficient and will result in a large amount of memory being allocated to this task alone if a large number of blocks need to be tracked. This will not do.

I'm not sure I've fully understood the problem, but I think this is a possible solution:
| Distribution:
parent | buffer = [hash(key, data[id]), data[id]]; send(buffer);
nodes  | recv(buffer); h_id, data = buffer;
The parent node uses some local key to generate a hashed value (h_id) for the part of the data (data[id]) it is sending, and the local nodes receive both the resulting h_id and the data itself.
| Reduction:
nodes  | buffer = [h_id, data]; send(buffer);
parent | recv(buffer); h_id, data = buffer;
On the return flow, the nodes must send back both the original h_id and the data they previously received, otherwise the following verification will fail:
hash(key, data) == h_id
Since key is only known to the parent node, it would be hard for the local nodes to alter data and h_id in such a way that hash(key, data), recomputed on the parent node, would still be valid.
Concerning the ordering, you could simply assume that the first four bytes of data store the partition number -- for later reconstruction.
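If I understand the proposal, it is essentially a keyed hash (a MAC) over each shard. A minimal Python sketch under that assumption, using HMAC-SHA256 as hash(key, data) and my own helper names:

import hmac, hashlib

def make_shard(key: bytes, index: int, payload: bytes) -> tuple[bytes, bytes]:
    # Parent side: prefix the partition number (first four bytes of data),
    # then compute h_id = hash(key, data).
    data = index.to_bytes(4, "big") + payload
    h_id = hmac.new(key, data, hashlib.sha256).digest()
    return h_id, data

def verify_shard(key: bytes, h_id: bytes, data: bytes) -> bool:
    # Parent side on the way back: recompute hash(key, data) and compare.
    return hmac.compare_digest(hmac.new(key, data, hashlib.sha256).digest(), h_id)

key = b"parent-only-secret"
h_id, data = make_shard(key, 0, b"first block contents")
assert verify_shard(key, h_id, data)              # untampered shard passes
assert not verify_shard(key, h_id, data + b"!")   # modified shard fails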
Edit:
I may have missed the extra storage requirement you pointed out, but here is what I've tried to propose. Consider four machines, A, B, C, and P, with the initial data:
          P{key, data[3]}
          _______|_______
         /       |       \
       A{}      B{}      C{}
Then, P distributes the data among the machines, sending both the data shard itself, and the generated hash:
          P{key, data[3]}
          _______|_______
         /       |       \
        A        B        C

A: {data[0], hash(key, data[0])}
B: {data[1], hash(key, data[1])}
C: {data[2], hash(key, data[2])}
If you assume the first bytes of data[i] store a global index, you're able to rebuild the initial data[3] in the original order. Also, if you allow each machine to store/receive key, you'll later be able to un-hash data[i] and rebuild data[3] on every local node.
Notice that errors can only be introduced in the data shards data[i] and in the received hashes hash(key, data[i]), as you must assume key to be globally valid. The main point here is that the hash(key, data[i]) values are also distributed among the machines, not only the data partitions themselves, i.e., no single machine needs to store a list of all the files.
Considering you can afford to maintain key on every node, or at least to send key to the one node trying to rebuild the original data, here is an example of a reduction step, say, for node B. A and C send their local {data[i], hash(key, data[i])} to node B, and P sends key to B, so this node can un-hash the received data:
              P{key, data[3]}
                     |   (key sent to B)
     A                                     C
{data[0], hash(key, data[0])}     {data[2], hash(key, data[2])}
      \                                   /
       \                                 /
                      B
         {data[1], hash(key, data[1])}
Then, B computes:
        / {data[1], hash(key, data[1])} \       {data[1]}
unhash (  {data[0], hash(key, data[0])}  )  =>  {data[0]}  =>  {data[3]}
        \ {data[2], hash(key, data[2])} /       {data[2]}
Which restores the original data with the correct ordering.
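Continuing in the same sketch style (hypothetical names, and assuming the rebuilding node has been given key), the reduction on node B could verify each received pair, strip the four-byte index prefix, and reassemble in order:

import hmac, hashlib

def rebuild(key: bytes, shards: list) -> bytes:
    # shards is a list of (h_id, data) pairs; data carries a 4-byte index prefix.
    verified = []
    for h_id, data in shards:
        expected = hmac.new(key, data, hashlib.sha256).digest()
        if not hmac.compare_digest(expected, h_id):
            raise ValueError("shard failed validation")
        verified.append((int.from_bytes(data[:4], "big"), data[4:]))
    # Sort by the recovered partition index and concatenate the payloads.
    return b"".join(payload for _, payload in sorted(verified))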

Related

DynamoDB Global Secondary Index "Batch" Retrieval

I've seen older posts around this, but I'm hoping to bring this topic up again. I have a table in DynamoDB that has a UUID for the primary key, and I created a global secondary index (GSI) for a more business-friendly key. For example:
| account_id  | email           | first_name | last_name |
|-------------|-----------------|------------|-----------|
| 4f9cb231... | linda@gmail.com | Linda      | James     |
| a0302e59... | bruce@gmail.com | Bruce      | Thomas    |
| 3e0c1dde... | harry@gmail.com | Harry      | Styles    |
If account_id is my primary key and email is my GSI, how do I query the table to get accounts with email in ('linda@gmail.com', 'harry@gmail.com')? I looked at the IN conditional expression, but it doesn't appear to work with a GSI. I'm using the Go SDK v2 library but will take any guidance. Thanks.
Short answer, you can't.
DDB is designed to return a single item, via GetItem(), or a set of related items, via Query(). Related meaning that you're using a composite primary key (hash key & sort key) and the related items all have the same hash key (aka partition key).
Another way to think of it, you can't Query() a DDB Table/index. You can only Query() a specific partition in a table or index.
Scan() is the only operation that works across partitions in one shot. But scanning is very inefficient and costly since it reads the entire table every time.
You'll need to issue a GetItem() for every email you want returned.
Luckily, DDB now offers BatchGetItem(), which allows you to send multiple GetItem() requests (up to 100) in a single call. It saves a little bit of network time and automatically runs the requests in parallel, but is otherwise little different from what your application could do itself directly with GetItem(). Make no mistake, BatchGetItem() is making individual GetItem() requests behind the scenes. In fact, the requests in a BatchGetItem() don't even have to be against the same tables/indexes. The cost for each request in a batch will be the same as if you'd used GetItem() directly.
One difference to make note of: BatchGetItem() can only return 16 MB of data. So if your DDB items are large, you may not get as many returned as you requested.
For example, if you ask to retrieve 100 items, but each individual item is 300 KB in size, the system returns 52 items (so as not to exceed the 16 MB limit). It also returns an appropriate UnprocessedKeys value so you can get the next page of results. If desired, your application can include its own logic to assemble the pages of results into one dataset.
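If you go the BatchGetItem() route, note that the request is keyed by the table's primary key (account_id here), not by the GSI attribute. A rough sketch in Python/boto3 (the question uses the Go SDK v2, but the request shape is the same; the table name and keys here are made up):

import boto3

ddb = boto3.client("dynamodb")

resp = ddb.batch_get_item(
    RequestItems={
        "accounts": {  # hypothetical table name
            "Keys": [
                {"account_id": {"S": "4f9cb231..."}},  # full UUIDs from your table
                {"account_id": {"S": "3e0c1dde..."}},
            ]
        }
    }
)
items = resp["Responses"]["accounts"]
unprocessed = resp.get("UnprocessedKeys")  # re-issue these if the 16 MB limit was hit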
Because you have a GSI with a PK of email (from what I understand), you can use a PartiQL statement to get your batch of emails back. The API is called ExecuteStatement and you use a SQL-like syntax:
SELECT * FROM mytable.myindex WHERE email IN ['email@email.com','email1@email.com']
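For reference, here is roughly what that ExecuteStatement call looks like from Python/boto3 (the question uses the Go SDK v2, but the statement string is the same; the table and index names are the placeholder ones from above):

import boto3

ddb = boto3.client("dynamodb")

resp = ddb.execute_statement(
    Statement='SELECT * FROM "mytable"."myindex" '
              "WHERE email IN ['linda@gmail.com', 'harry@gmail.com']"
)
for item in resp["Items"]:  # items come back in DynamoDB JSON, e.g. {"email": {"S": "..."}}
    print(item)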

Referencing graph nodes by integer ID

As a bit of a learning project, I am working to replace a somewhat slow program in perl with a Chapel implementation. I've got the algorithms down, but I'm struggling with the best way to reference the data in Chapel. I can do a direct translation, but it seems likely I'm missing a better way.
Details of existing program:
I have a graph with ~32000 nodes and ~2.1M edges. State is saved in data files, but it's run as a daemon that keeps data in memory.
Each node has a numeric ID (assigned by another system) and a variety of other attributes defined by string, integer, and boolean values.
The edges are directional and have a couple of boolean values attributed to them.
I have an external system that interacts with this daemon that I cannot change. It makes requests, such as "Add node (int) with these attributes", "find shortest path from node (int) to node (int)", or "add edges from node (int) to node(s) (int, int, int)"
In Perl, the program uses hashes with common integer IDs for node and edge attributes. I can certainly replicate this in Chapel with associative arrays.
Is there a better way to bundle this all together? I've been trying to wrap my head around ways to have opaque node and edge types with each item's attributes defined, but I'm struggling with how to reference them by their integer IDs in an easy fashion.
If somebody can provide an ideal way to do the following, it would get me the push I need.
Create two nodes with xx attributes identified by integer ID.
Create an edge between the two with xx attributes
Respond to request "show me the xx attribute of node (int)"
Cheers, and thanks.
As you might expect, there are a number of ways to approach this in Chapel, though I think given your historical approach and your external system's interface, associative domains and arrays are definitely an appropriate way to go. Specifically, your desire to refer to nodes by integer IDs makes associative domains/arrays a natural match.
For Chapel newbies: associative domains are essentially sets of arbitrary values, like the set of integer node IDs in this case. Associative arrays are mappings from the indices of an associative domain to elements (variables) of a given type. Essentially, the domain represents the keys and the array the values in a key-value store or hash table.
To represent the nodes and edges themselves, I'm going to take the approach of using Chapel records. Here's my record for a node:
record node {
  var id: int;
  var str: string,
      i: int,
      flag: bool;
  var edges: [1..0] edge;
}
As you can see, it stores its id as an integer, arbitrary attribute fields of various types (a string str, an integer i, and a boolean flag — you can probably come up with better names for your program), and an array of edges which I'll return to in a second. Note that it may or may not be necessary for each node to store its ID... perhaps in any context where you'd have the node, you would already know its ID, in which case storing it could be redundant. Here I stored it just to show you could, not because you must.
Returning to the edges: In your question, it sounded as though edges might have their own integer IDs and get stored in the same pool as the nodes, but here I've taken a different approach: In my experience, given a node, I typically want the set of edges leading out of it, so I have each node store an array of its outgoing edges. Here, I'm using a dense 1D array of edges which is initially empty (1..0 is an empty range in Chapel since 1 > 0). You could also use an associative array of edges if you wanted to give them each a unique ID. Or you could remove the edges from the node data structure altogether and store them globally. Feel free to ask follow-up questions if you'd prefer a different approach.
Here's my record for representing an edge:
record edge {
  var from, to: int,
      flag1, flag2: bool;
}
The first two fields (from and to) indicate the nodes that the edge connects. As with the node ID above, it may be that the from field is redundant / unnecessary, but I've included it here for completeness. The two flag fields are intended to represent the data attributes you'd associate with an edge.
Next, I'll create my associative domain and array to represent the set of node IDs and the nodes themselves:
var NodeIDs: domain(int),
    Nodes: [NodeIDs] node;
Here, NodeIDs is an associative domain (set) of integer IDs representing the nodes. Nodes is an associative array that maps from those integers to values of type node (the record we defined above).
Now, turning to your three operations:
Create two nodes with xx attributes identified by integer ID.
The following declaration creates a node variable named n1 with some arbitrary attributes using the default record constructor/initializer that Chapel provides for records that don't define their own:
var n1 = new node(id=1, "node 1", 42, flag=true);
I can then insert it into the array of nodes as follows:
Nodes[n1.id] = n1;
This assignment effectively adds n1.id to the NodeIDs domain and copies n1 into the corresponding array element in Nodes. Here's an assignment that creates a second anonymous node and adds it to the set:
Nodes[2] = new node(id=2, "node 2", i=133);
Note that in the code above, I've assumed that you want to choose the IDs for each node explicitly (e.g., perhaps your data file establishes the node IDs?). Another approach (not shown here) might be to have them be automatically determined as the nodes are created using a global counter (maybe an atomic counter if you're creating them in parallel).
Having populated our Nodes, we can then iterate over them serially or in parallel (here I'm doing it in parallel; replacing forall with for will make them serial):
writeln("Printing all node IDs (in an arbitrary order):");
forall nid in NodeIDs do
writeln("I have a node with ID ", nid);
writeln("Printing all nodes (in an arbitrary order):");
forall n in Nodes do
writeln(n);
The order in which these loops print the IDs and nodes is arbitrary for two reasons: (1) they're parallel loops; (2) associative domains and arrays store their elements in an arbitrary order.
Create an edge between the two with xx attribues
Since I associated the edges with nodes, I took the approach of creating a method on the node type that will add an edge to it:
proc node.addEdge(to: int, flag1: bool, flag2: bool) {
  edges.push_back(new edge(id, to, flag1, flag2));
}
This procedure takes the destination node ID, and the attributes as its arguments, creates an edge using that information (and supplying the originating node's ID as the from field), and uses the push_back() method on rectangular arrays to add it to the list of edges.
I then call this routine three times to create some edges for node 2 (including redundant and self-edges since I only have two nodes so far):
Nodes[2].addEdge(n1.id, true, false);
Nodes[2].addEdge(n1.id, false, true);
Nodes[2].addEdge(2, false, false);
And at this point, I can loop over all of the edges for a given node as follows:
writeln("Printing all edges for node 2: (in an arbitrary order):");
forall e in Nodes[2].edges do
writeln(e);
Here, the arbitrary printing order is only due to the use of the parallel loop. If I'd used a serial for loop, I'd traverse the edges in the order they were added due to the use of a 1D array to represent them.
Respond to request "show me the xx attribute of node (int)"
You've probably got this by now, but I can get at arbitrary attributes of a node simply by indexing into the Nodes array. For example, the expression:
...Nodes[2].str...
would give me the string attribute of node 2. Here's a little helper routine I wrote to get at (and print) various attributes:
proc showAttributes(id: int) {
  if (!NodeIDs.member(id)) {
    writeln("No such node ID: ", id);
    return;
  }

  writeln("Printing the complete attributes for node ", id);
  writeln(Nodes[id]);
  writeln("Printing its string field only:");
  writeln(Nodes[id].str);
}
And here are some calls to it:
showAttributes(n1.id);
showAttributes(2);
showAttributes(3);
I am working to replace a somewhat slow program in perl with a Chapel implementation
Given that speed is one of your reasons for looking at Chapel, once your program is correct, re-compile it with the --fast flag to get it running quickly.

Neo4j Cypher query to find nodes that are not connected too slow

Given we have the following Neo4j schema (simplified, but it shows the important point). There are two types of nodes, NODE and VERSION. VERSIONs are connected to NODEs via a VERSION_OF relationship. VERSION nodes have two properties, from and until, that denote the validity timespan; either or both can be NULL (nonexistent in Neo4j terms) to denote unlimited. NODEs can be connected via a HAS_CHILD relationship. Again, these relationships have two properties, from and until, that denote the validity timespan; either or both can be NULL (nonexistent in Neo4j terms) to denote unlimited.
EDIT: The validity dates on VERSION nodes and HAS_CHILD relations are independent (even though the example coincidentally shows them being aligned).
The example shows two NODEs A and B. A has two VERSIONs AV1 until 6/30/17 and AV2 starting from 7/1/17 while B only has one version BV1 that is unlimited. B is connected to A via a HAS_CHILD relationship until 6/30/17.
The challenge now is to query the graph for all nodes that aren't a child (that are root nodes) at one specific moment in time. Given the example above, the query should return just B if the query date is e.g. 6/1/17, but it should return B and A if the query date is e.g. 8/1/17 (because A isn't a child of B as of 7/1/17 any more).
The current query today is roughly similar to this one:
MATCH (n1:NODE)
OPTIONAL MATCH (n1)<-[c]-(n2:NODE), (n2)<-[:VERSION_OF]-(nv2:ITEM_VERSION)
WHERE (c.from <= {date} <= c.until)
AND (nv2.from <= {date} <= nv2.until)
WITH n1 WHERE c IS NULL
MATCH (n1)<-[:VERSION_OF]-(nv1:ITEM_VERSION)
WHERE nv1.from <= {date} <= nv1.until
RETURN n1, nv1
ORDER BY toLower(nv1.title) ASC
SKIP 0 LIMIT 15
This query works relatively fine in general, but it starts getting slow as hell when used on large datasets (comparable to real production datasets). With 20-30k NODEs (and about twice the number of VERSIONs), the (real) query takes roughly 500-700 ms on a small Docker container running on Mac OS X, which is acceptable. But with 1.5M NODEs (and about twice the number of VERSIONs), the (real) query takes a little more than 1 minute on a bare-metal server (running nothing else than Neo4j). This is not really acceptable.
Do we have any option to tune this query? Are there better ways to handle the versioning of NODEs (which I doubt is the performance problem here) or the validity of relationships? I know that relationship properties cannot be indexed, so there might be a better schema for handling the validity of these relationships.
Any help or even the slightest hint is greatly appreciated.
EDIT after answer from Michael Hunger:
Percentage of root nodes:
With the current example data set (1.5M nodes) the result set contains about 2k rows. That's less than 1%.
ITEM_VERSION node in first MATCH:
We're using the ITEM_VERSION nv2 to filter the result set to ITEM nodes that have no connection to other ITEM nodes at the given date. That means that either no relationship may exist that is valid for the given date, or the connected item must not have an ITEM_VERSION that is valid for the given date. I'm trying to illustrate this:
// date 6/1/17
// n1 returned because relationship not valid
(nv1 ...)->(n1)-[X_HAS_CHILD ...6/30/17]->(n2)<-(nv2 ...)
// n1 not returned because relationship and connected item n2 valid
(nv1 ...)->(n1)-[X_HAS_CHILD ...]->(n2)<-(nv2 ...)
// n1 returned because connected item n2 not valid even though relationship is valid
(nv1 ...)->(n1)-[X_HAS_CHILD ...]->(n2)<-(nv2 ...6/30/17)
No use of relationship-types:
The problem here is that the software features a user-defined schema and ITEM nodes are connected by custom relationship types. As we can't have multiple types/labels on a relationship, the only common characteristic of these kinds of relationships is that they all start with X_. That's been left out of the simplified example here. Would searching with the predicate type(r) STARTS WITH 'X_' help here?
What Neo4j version are you using?
What percentage of your 1.5M nodes will be found as roots at your example date, and if you don't have the limit, how much data comes back? Perhaps the issue is not in the match so much as in the sorting at the end?
I'm not sure why you had the VERSION nodes in your first part, at least you don't describe them as relevant for determining a root node.
You didn't use relationship-types.
MATCH (n1:NODE) // matches 1.5M nodes
// has to do 1.5M * degree optional matches
OPTIONAL MATCH (n1)<-[c:HAS_CHILD]-(n2) WHERE (c.from <= {date} <= c.until)
WITH n1 WHERE c IS NULL
// how many root nodes are left?
// # root nodes * version degree (1..2)
MATCH (n1)<-[:VERSION_OF]-(nv1:ITEM_VERSION)
WHERE nv1.from <= {date} <= nv1.until
// has to sort all those
WITH n1, nv1, toLower(nv1.title) as title
RETURN n1, nv1
ORDER BY title ASC
SKIP 0 LIMIT 15
I think a good start for improvement would be to match on nodes using an index so you can quickly get a smaller relevant subset of nodes to search. Your approach right now must inspect all your :NODEs and all their relationships and patterns off of them every single time, which, as you've found, won't scale with your data.
Right now the only nodes in your graph with date/time properties are your :ITEM_VERSION nodes, so let's start with those. You'll need an index on :ITEM_VERSION's from and until properties for fast lookup.
The nulls are going to be problematic for your lookups, as any inequality against a null value returns null, and most workarounds to working with nulls (using COALESCE() or several ANDs/ORs for null cases) seem to prevent usage of index lookups, which is the point of my particular suggestion.
I would encourage you to replace your nulls in from and until with min and max values, which should let you take advantage of finding nodes by index lookup:
MATCH (version:ITEM_VERSION)
WHERE version.from <= {date} <= version.until
MATCH (version)<-[:VERSION_OF]-(node:NODE)
...
That should at least provide quick access to a smaller subset of nodes at the start for continuing your query.

What does "bucket entries" mean in the context of a hashtable?

What does "bucket entries" mean in the context of a hashtable?
A bucket is simply a fast-access location (like an array index) that is the result of the hash function.
The idea with hashing is to turn a complex input value into a different value which can be used to rapidly extract or store data.
Consider the following hash function for mapping people's names into street addresses.
First take the initials from the first and last name and turn them both into numeric values (0 through 25, from A through Z). Multiply the first by 26 and add the second, and this gives you a value from 0 to 675 (26 * 26 distinct values, or bucket IDs). This bucket ID is then to be used to store or retrieve the information.
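A quick Python sketch of that toy initials-based hash, just to make the bucket ID computation concrete (names are my own):

def bucket_id(first_name: str, last_name: str) -> int:
    # Map each initial to 0..25, then combine into one of 26 * 26 = 676 buckets.
    first = ord(first_name[0].upper()) - ord("A")
    last = ord(last_name[0].upper()) - ord("A")
    return first * 26 + last

print(bucket_id("George", "Washington"))  # 6 * 26 + 22 = 178
print(bucket_id("Abraham", "Lincoln"))    # 0 * 26 + 11 = 11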
Now you can have a perfect hash (where each allowable input value maps to a distinct bucket ID) so that a simple array will suffice for the buckets. In that case, you can just maintain an array of 676 street addresses and use the bucket ID to find the one you want:
+-------------------+
| George Washington | -> hash(GW)
+-------------------+       |
                            +-> GwBucket[George's address]

+-------------------+
| Abraham Lincoln   | -> hash(AL)
+-------------------+       |
                            +-> AlBucket[Abe's address]
However, this means that George Wendt and Allan Langer are going to cause problems in the future.
Or you can have an imperfect hash (such as one where John Smith and Jane Seymour would end up with the same bucket ID).
In that case, you need a more complex backing data structure than a simple array, to maintain a collection of addresses. This could be as simple as a linked list, or as complex as yet another hash:
+------------+          +--------------+
| John Smith |          | Jane Seymour |
+------------+          +--------------+
      |                        |
      V                        V
   hash(JS)                 hash(JS)
      |                        |
      +------> JsBucket <------+
                  |
                  V
+-----------------------------------+
| John Smith   -> [John's address]  |
| Jane Seymour -> [Jane's address]  |
+-----------------------------------+
Then, as well as the initial hash lookup, an extra level of searching needs to be carried out within the bucket itself, to find the specific information.
From Wikipedia:
A hash table or hash map is a data structure that uses a hash function to map identifying values, known as keys (e.g., a person's name), to their associated values (e.g., their telephone number). Thus, a hash table implements an associative array. The hash function is used to transform the key into the index (the hash) of an array element (the slot or bucket) where the corresponding value is to be sought.
Each entry in the array/vector is called a bucket.
I think a bucket is a structure that contains at least the hash value, which works as an index (hash values are generated by hash functions), but the structure itself may or may not contain the entries (data).
illustration:
[hash value][points to actual data] ---> [actual data]
|<--------- bucket structure ------>|

[hash value][actual data]
|<----- bucket structure ----->|

It is the [hash value] part that works as the index.
The diagrams in the hash table Wikipedia article are pretty straightforward. They show that entries (data) can be stored within the buckets themselves, or in their own data structure, with the bucket simply pointing to the data.
Both rehashing and coalesced hashing assume fixed table sizes determined in advance. If the number of records grows beyond the number of table positions, it is impossible to insert them without allocating a larger table and recomputing the hashes.
Another method of resolving hash clashes is separate chaining. The term bucket is generally used with separate chaining. Separate chaining involves keeping a distinct linked list for all records whose keys hash into a particular value.
Suppose that the hash function produces values between 0 and tablesize - 1. Then an array bucket of header nodes, of size tablesize, is declared. This array is called the hash table.
bucket[i], a bucket entry, points to the list of all records whose keys hash to i. To insert a record, the list head bucket[i] is accessed and the record is inserted at the tail end.
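A minimal Python sketch of separate chaining along those lines (using a Python list per bucket in place of a hand-rolled linked list; class and method names are illustrative):

class ChainedHashTable:
    def __init__(self, tablesize: int = 101):
        # buckets[i] holds the chain of all (key, value) records whose keys hash to i.
        self.buckets = [[] for _ in range(tablesize)]

    def _index(self, key) -> int:
        return hash(key) % len(self.buckets)

    def insert(self, key, value):
        # Insert the record at the tail end of the chain for its bucket.
        self.buckets[self._index(key)].append((key, value))

    def lookup(self, key):
        # After the initial hash lookup, an extra search within the bucket finds the record.
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

table = ChainedHashTable()
table.insert("John Smith", "John's address")
table.insert("Jane Seymour", "Jane's address")
print(table.lookup("Jane Seymour"))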

ASP.NET and a One-to-Many-to-Many Scenario

I'm new to ASP.NET but not to programming. I am migrating our current site from PHP/MySQL to ASP.NET(3.5)/SqlServer. I've been lurking here since the site's launch, so I'm confident that one (or more) of you can help me out. Here's the scenario:
This is a training department site, and the dept. has a course catalog stored in the table course. Each course may have many prerequisite courses. For example, A and B are prerequisites for C. I would normally store this either as a comma-delimited column in course or in a separate table course_prereq or course_course as a recursive relationship. This part I can do.
However, the business rules require that each course can have multiple sets of prerequisites. For example, N requires A, B, and C, or N requires X and Y. This is where I'm stuck.
Previously, I stored this information in a column for row N as A,B,C|X,Y, parsed the ids into a PHP 2D-array, submitted a second query for all the rows whose id was in that array, then used PHP to separate those rows into their respective groups. Once all this processing is done, the groups of prerequisites are displayed as separate tables on the web page, like so:
| A | A's information goes here |
| B | B's information goes here |
| C | C's information goes here |
- - - - - - - OR - - - - - - - -
| X | X's information goes here |
| Y | Y's information goes here |
How would I accomplish this using ASP.NET?
Add a table to hold prerequisite sets. This table holds a set ID and a key back to the courses table for each course in the set. The table may have several rows for a given set ID, so your primary key will be the set ID plus the course ID. Then in your course_prereq table you relate courses to the different prerequisite sets. An OR relationship can be assumed there because any ANDs are enforced in the sets themselves.
Have a table called PrerequisiteSet that FKs to each prereq. Then have a Course_PrerequisiteSet many-to-many table that FKs to Course and PrerequisiteSet. Most of the time there will only be one entry in Course_PrerequisiteSet, but if there are more than one, then it will be an OR relationship.
Both the answers above were very helpful. I ended up using just one database table instead of the suggested two. The table contains a course_id, prereq_id, and set_id, which all together form the primary key.
In the ASP.NET page, I use a repeater to loop over the sqldatasource stored procedure that returns a course's prerequisite sets, and a gridview inside that repeater that reads the individual prerequisite information from a second sqldatasource stored procedure. Like this:
RepeaterSqlDataSource (returns set ids)
Repeater
. . . GridViewSqlDataSource (returns course info for each prereq_id in the set)
. . . GridView
Hope this is helpful to anyone else looking at a similar scenario.
