Performance of Grakn match query is extremely slow on nearly empty graph - vaticle-typedb

(Using the latest version of Grakn community)
I have a graph with some rules and entities, one entity type of which is mostly isolated (it is not part of any relation instances yet, though it is declared to play a role).
I’ve defined this entity type, user, below:
containment sub relation,
    relates container,
    relates containee;

uuid sub attribute, value string;
last-seen sub attribute, value datetime;

user sub entity,
    has uuid,
    has last-seen,
    plays containee;
The attributes are only owned by user. I currently have 5 user instances in my graph that I put in for testing, and when I run match queries I get severely degraded performance.
For example, consider the following query:
match $u isa user, has uuid "test"; get; offset 0; limit 1;
This query completes only about 100 calls in a 30 second period (roughly 300 ms per lookup), which IMO is extremely slow considering how often we will want to fetch a specific user from the database. More complex queries that also touch the user's relations are far slower still.
Timing it, even with a nearly empty graph it is still faster to run a find + populate in Mongo. With a larger graph of, say, 100 relations, it's not even a contest: the Grakn query can take several minutes to return.
However, it takes less than a second to complete 2000 insert queries (without match clauses).
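For concreteness, a minimal timing loop of the sort described above might look like this (a sketch assuming the Grakn 1.x Python client, grakn-client; the keyspace name is a placeholder):

    import time
    from grakn.client import GraknClient  # assumption: Grakn 1.x Python client (grakn-client)

    QUERY = 'match $u isa user, has uuid "test"; get; offset 0; limit 1;'

    client = GraknClient(uri="localhost:48555")
    session = client.session(keyspace="my_keyspace")  # placeholder keyspace name

    n = 100
    start = time.time()
    for _ in range(n):
        tx = session.transaction().read()
        list(tx.query(QUERY))  # drain the answer iterator
        tx.close()
    elapsed = time.time() - start
    print(f"{n} queries in {elapsed:.1f}s ({1000 * elapsed / n:.0f} ms per query)")

    session.close()
    client.close()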
This doesn't make sense to me: Grakn is built for graph-shaped data and should be fast at things like resolving transitive closures, yet even on a shallow, nearly empty graph it struggles with simple lookups. That leads me to believe something in my entity definition or query syntax is causing the slow match queries. Where should I even begin looking (or is this actually the expected speed)?

I'm not sure what caused that slowness, but feel free to test out Grakn 2.0 (completely rewritten from scratch with performance in mind) - the alpha version is out now (still missing some features, such as rules), and the full release will be out soon!

Related

Does DynamoDB GSI overloading give performance benefits or just flexibility

Does GSI Overloading provide any performance benefits, e.g. by allowing cached partition keys to be more efficiently routed? Or is it mostly about preventing you from running out of GSIs? Or maybe opening up other query patterns that might not be so immediately obvious.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-gsi-overloading.html
E.g. if you have a base table and you want to partition it so you can query a specific attribute (which becomes the PK of the GSI) over two dimensions, does it make any difference whether you create 1 overloaded GSI or 2 non-overloaded GSIs?
For an example of what I'm referring to see the attached image:
https://drive.google.com/file/d/1fsI50oUOFIx-CFp7zcYMij7KQc5hJGIa/view?usp=sharing
The base table has documents which can be in a published or draft state. Each document is owned by a single user. I want to be able to query by user to find:
Published documents by date
Draft documents by date
I'm asking in relation to the more recent DynamoDB best practice that implies that all applications only require one table. Some of the techniques being shown in this documentation show how a reasonably complex relational model can be squashed into 1 DynamoDB table and 2 GSIs and yet still support 10-15 query patterns.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-relational-modeling.html
I'm trying to understand why someone would go down this route as it seems incredibly complicated.
The idea – in a nutshell – is to avoid the overhead of doing joins on the database layer, or of going back to the database to effectively do the join on the application layer. With the data already sliced into the shape your application requires, all you really need is a single "select * from table where x = y"-style call that returns multiple entity types in one go (in your example, that could be Users and Documents). This makes it extremely efficient and scalable at the database level, but it also makes you less flexible: you need to know your access patterns in advance and model your data accordingly.
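For illustration, a single query against an overloaded GSI might look like this with boto3 (a sketch; the table, index, key attribute names, and key values are all hypothetical):

    import boto3
    from boto3.dynamodb.conditions import Key

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("AppTable")  # hypothetical single-table name

    # One query against an overloaded GSI returns a user's published documents,
    # newest first. "GSI1", "GSI1PK", "GSI1SK" and the key values are hypothetical.
    resp = table.query(
        IndexName="GSI1",
        KeyConditionExpression=(
            Key("GSI1PK").eq("USER#123") & Key("GSI1SK").begins_with("PUB#")
        ),
        ScanIndexForward=False,  # sort key descending, i.e. most recent first
    )
    items = resp["Items"]  # may contain several entity types in a single call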
See Rick Houlihan's excellent talk on this https://www.youtube.com/watch?v=HaEPXoXVf2k for why you'd want to do this.
I don't think it has any performance benefits beyond what's already called out – which makes sense, since it's the same query and storage engine either way.
That being said, I think there are some practical reasons for why you'd want to go with a single table as it allows you to keep your infrastructure somewhat simple: you don't have to keep track of metrics and/or provisioning settings for separate tables.
My guess would be cost of storage and provisioned throughput.
Apart from that, I'm not sure it matters as much now that the limit has been raised to 20 GSIs per table.

CosmosDB/DocumentDB partitioning with multiple types in same collection

The official recommendation from the team is, to my knowledge, to put all data types into a single collection and distinguish them with something like a type=someType field on each document.
Now, assume a large database with partitioning, where the different object types can:
have completely different fields (so there is no common field to partition on)
be related to each other (through references)
How do you organize things so that documents that belong together end up in the same partition?
For example, lets say we have:
User
BlogPost
BlogPostComment
If we store them as separate types with type=user|blogPost|blogPostComment in the same collection, how do we ensure that a user, their blog posts, and all the corresponding comments end up in the same partition?
Is there some best practice for this?
[UPDATE]
Can you ever avoid cross-partition queries completely? Should that be a goal, or do you just try to minimize them?
For example, you can partition your data perfectly for 99% of cases/queries but then you need some dashboard to show aggregates from all-the-data. Is that something you just accept as inevitable and try to minimize or is it possible to avoid it completely?
I've written about this somewhat extensively in other similar questions regarding Cosmos.
Basically, when dealing with many different logical entity types in a single Cosmos collection the easiest option is to put a generic (or abstract, as you refer to it) partition key on all your documents. At this point it's the concern of the application to make sure that at runtime the appropriate value is chosen. I usually name this document property either partitionKey, routingKey or something similar.
This is extremely important when designing for optimal query efficiency as your choice of partition keys can have a huge impact on query and throughput performance. A generic key like this lets you design the optimal storage of your data as it benefits whatever application you're building.
Even something like tenant does not make sense as different tenants might have wildly different data size and access patterns. Instead you could include the tenantId at runtime as part of your partition key as a kind of composite.
UPDATE:
For certain query patterns it might be possible to serve them entirely out of a single partition. It's definitely not the end of the world if things end up going cross partition though. The system is still quick. If possible, limiting the amount of partitions that need to be touched for a given query is ideal but you're never going to get away from it 100% of the time.
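To make the generic partition key idea concrete, here is a sketch using the azure-cosmos Python SDK (the account endpoint, key, and all names are hypothetical); it writes a document with an application-chosen partitionKey, then shows a partition-scoped query next to a cross-partition aggregate:

    from azure.cosmos import CosmosClient  # assumption: azure-cosmos Python SDK (v4)

    # Hypothetical endpoint, key, database, and container names. Every document
    # carries a generic "partitionKey" property chosen by the application at runtime.
    client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<key>")
    container = client.get_database_client("appdb").get_container_client("items")

    container.upsert_item({
        "id": "comment-42",
        "type": "blogPostComment",
        "partitionKey": "tenant-7_user-123",  # composite: tenantId + userId
        "postId": "post-9",
        "text": "Nice post!",
    })

    # Scoped to a single partition: no cross-partition fan-out.
    comments = list(container.query_items(
        query="SELECT * FROM c WHERE c.type = 'blogPostComment' AND c.postId = 'post-9'",
        partition_key="tenant-7_user-123",
    ))

    # The occasional dashboard-style aggregate can still fan out across partitions.
    total = list(container.query_items(
        query="SELECT VALUE COUNT(1) FROM c WHERE c.type = 'blogPostComment'",
        enable_cross_partition_query=True,
    ))[0]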
A partition should hold data related to a group that is expected to grow, for instance a Tenant, which will group many documents (which can be of different types, as you mentioned), so the partition key in that case should be the TenantId. Partitioning is more about the data relating to a group than about the type of the data. If the data is related to a User then you could use the UserId; however, many users may comment on the same posts, so that doesn't seem like a good candidate for a partition key unless there is some de-normalization of the user info so it doesn't have to relate back to the other users directly... if that makes sense?

How to implement gapless, user-friendly IDs in NHibernate?

I'm designing an application where my Order objects need a sequential, user-friendly Id field. I'm avoiding the HiLo algorithm because of the rather large gaps it produces (see here). Naturally, Guid values would make my corporate users go bananas. I'm also avoiding Oracle sequences because of their major disadvantages:
(From: NHibernate POID Generators revealed)
Post insert generators, as the name suggest, assigns the id's after the entity is stored in the database. A select statement is executed against database. They have many drawbacks, and in my opinion they must be used only on brownfield projects. Those generators are what WE DO NOT SUGGEST as NH Team.

Some of the drawbacks are the following:

Unit Of Work is broken with the use of those strategies. It doesn't matter if you're using FlushMode.Commit, each Save results in an insert statement against DB. As a best practice, we should defer insertions to the commit, but using a post insert generator makes it commit on save (which is what UoW doesn't do).

Those strategies nullify batcher, you can't take the advantage of sending multiple queries at once (as it must go to database at the time of Save).
Any ideas/experience on implementing user-friendly IDs without major gaps between them?
Edit:
User-friendly Id fields are ones my corporate users can memorize and even discuss or refer to over the phone when talking about a particular Order, e.g. "I'm calling to know why order #1625 was denied."
The Id doesn't need to be strictly gapless, but I am worried my users would get confused when they see gaps like 100, 201, 305. In my older projects I implement NHibernate with Oracle sequences, which occasionally lose a few values when exceptions are thrown yet still keep a rather tidy order. Their downside is that they break the Unit of Work, which results in an additional hit to the database for every Save command, with or without Session.Flush.
One option would be to keep a key-table that simply stores an incrementing value. This can introduce a few problems, namely possible locking issues as well as additional hits to the database.
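Not NHibernate-specific, but to make the key-table idea concrete, here is a minimal sketch (Python/SQLite, hypothetical table and column names) where a single-row counter is read and bumped inside the same transaction that inserts the order:

    import sqlite3

    # Hypothetical table and column names. A single-row counter is read and bumped
    # in the same transaction that inserts the order, so committed orders get
    # consecutive ids.
    conn = sqlite3.connect("orders.db", isolation_level=None)  # manage transactions manually
    conn.execute("CREATE TABLE IF NOT EXISTS order_key (next_id INTEGER NOT NULL)")
    conn.execute("INSERT INTO order_key (next_id) "
                 "SELECT 1 WHERE NOT EXISTS (SELECT 1 FROM order_key)")
    conn.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, customer TEXT)")

    def create_order(customer):
        conn.execute("BEGIN IMMEDIATE")  # write lock: concurrent callers queue here
        (order_id,) = conn.execute("SELECT next_id FROM order_key").fetchone()
        conn.execute("UPDATE order_key SET next_id = next_id + 1")
        conn.execute("INSERT INTO orders (id, customer) VALUES (?, ?)", (order_id, customer))
        conn.execute("COMMIT")  # the id is only consumed if the order commits, so no gaps
        return order_id

    print(create_order("ACME"))     # 1
    print(create_order("Initech"))  # 2

The serialization on that counter row is exactly the locking concern mentioned above: every order creation queues behind it.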
Another option might be to refine what you mean by "User-friendly Id". This could consist of a combination of a Date/Time and a customer-specific sequence (or including the customer id as well). Also, your order id does not necessarily have to be the actual key on the table. There is nothing to say that you can't use a surrogate key with a separate "calculated" column which represents the order id.
The bottom-line is that it sounds like you want to use a surrogate key, but have the benefits of a natural key. It can be very difficult to have it both ways and a lot comes down to how you actually plan on using the data, how users interpret the data, and personal preference.
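To illustrate the "calculated column" idea, a tiny sketch (purely hypothetical format) of deriving a displayable order code from a date and a per-customer sequence, independent of the surrogate key:

    from datetime import date

    def order_code(customer_code: str, order_date: date, customer_seq: int) -> str:
        # Hypothetical format: readable, roughly sequential per customer, and
        # independent of the surrogate primary key stored on the row.
        return f"{customer_code}-{order_date:%Y%m%d}-{customer_seq:03d}"

    print(order_code("ACME", date(2010, 3, 14), 7))  # ACME-20100314-007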

Riak link-walking like a join?

I am looking to store pictures (<5MB each) in a NoSQL database and link them to articles in a different bucket. What kind of speed does Riak's link-walking feature offer? Is it at all like an RDBMS join?
Links are not at all similar to JOINs (which involve a Cartesian product), but they can be used for similar purposes in some senses. They are very similar to links in an HTML document.
With link-walking you either start with a single key, or you create a map-reduce job that starts with multiple keys. (Link-walking/traversal is actually a special case of map-reduce.) Those values are fetched, their links filtered against your specification (bucket, tag) and then the matched links are passed along to the next phase (or back to the client). Of course, all of this is done in parallel (unlike a JOIN) with high data-locality.
Also, map-reduce isn't slow by itself, you just don't have a sophisticated query planner to do the hard work for you; you have to think about how you will query and organize your data around that as necessary.
Think one-way relationships that are about as fast as a normal query - not as slow as a full MapReduce job.
From:
http://seancribbs.com/tech/2010/02/06/why-riak-should-power-your-next-rails-app/
The first way that Riak deals with this is with link-walking. Every datum stored in Riak can have one-way relationships to other data via the Link HTTP header. In the canonical example, you know the key of a band that you have stored in the "artists" bucket (Riak buckets are like database tables or S3 buckets). If that artist is linked to its albums, which are in turn linked to the tracks on the albums, you can find all of the tracks produced in a single request. As I'll describe in the next section, this is much less painful than a JOIN in SQL because each item is operated on independently, rather than a table at a time. Here's what that query would look like:

GET /raw/artists/TheBeatles/albums,_,_/tracks,_,1

"/raw" is the top of the URL namespace, "artists" is the bucket, "TheBeatles" is the source object key. What follows are match specifications for which links to follow, in the form of bucket,tag,keep triples, where underscores match anything. The third parameter, "keep", says to return results from that step, meaning that you can retrieve results from any step you want, in any combination. I don't know about you, but to me that feels more natural than this:

SELECT tracks.*
FROM tracks
INNER JOIN albums ON tracks.album_id = albums.id
INNER JOIN artists ON albums.artist_id = artists.id
WHERE artists.name = "The Beatles"

The caveat of links is that they are inherently unidirectional, but this can be overcome with little difficulty in your application. Without referential integrity constraints in your SQL database (which ActiveRecord has made painful in the past), you have no solid guarantee that your DELETE or UPDATE won't cause a row to become orphaned, anyway. We're kind of spoiled because ActiveRecord handles the linkage of associations automatically.

The place where the link-walking feature really shines is in self-referential and deep transitive relationships (think has_many :through writ large). Since you don't have to create a virtual table via a JOIN and alias different versions of the same table, you can easily do things like social network graphs (friends-of-friends-of-friends), and data structures like trees and lists.
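For concreteness, a sketch of issuing that link-walking request over Riak's HTTP interface from Python (host and port are assumptions; the URL follows the format quoted above):

    import requests

    # Host, port, and the legacy /raw prefix are assumptions taken from the quoted post.
    resp = requests.get(
        "http://localhost:8098/raw/artists/TheBeatles/albums,_,_/tracks,_,1"
    )
    resp.raise_for_status()
    # Link-walking results come back as a nested multipart/mixed body,
    # one part per matched object (here: every track on every linked album).
    print(resp.headers.get("Content-Type"))
    print(len(resp.content), "bytes of multipart body")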

Any SQL Server multiple-recordset stored procedure gotchas?

Context
My current project is a large-ish public site (2 million pageviews per day) running a mixture of classic ASP and ASP.NET with a SQL Server 2005 back end. We're heavy on reads, with occasional writes and virtually no updates/deletes. Our pages typically concern a single 'master' object with a stack of dependent (detail) objects.
I like the idea of returning all the data required for a page in a single proc (and absolutely no unnecessary data). True, this requires a dedicated proc for such pages, but some pages receive double-digit percentages of our overall site traffic, so it's worth the time/maintenance hit. We typically consume the multiple recordsets from our .NET code using System.Data.SqlClient.SqlDataReader and its NextResult method. Oh, and I'm not doing any updates/inserts in these procs either (except to table variables).
The question
SQL Server (2005) procs which return multiple recordsets are working well for us in production so far, but I'm a little worried that multi-recordset procs have become my new favourite hammer and I'm hitting every problem (nail) with them. Are there any multi-recordset SQL Server proc gotchas I should know about? Anything that's going to make me wish I hadn't used them? Specifically anything about how they affect connection pooling, memory utilization, etc.
Here are a few gotchas for multiple-recordset stored procs:
They make it more difficult to reuse code. If you're doing several queries, odds are you'd be able to reuse one of those queries on another page.
They make it more difficult to unit test. Every time you make a change to one of the queries, you have to test all of the results. If something changed, you have to dig through to see which query failed the unit test.
They make it more difficult to tune performance later. If another DBA comes in behind you to help improve performance, they have to do more slicing and dicing to figure out where the problems are coming from. Then combine this with the code-reuse problem: if they optimize one query, that query might be used in several different stored procs, and they have to go fix all of them - which means more unit testing again.
They make error handling much more difficult. Four of the queries in the stored proc might succeed, and the fifth fails. You have to plan for that.
They can increase locking problems and incur load in TempDB. If your stored procs are designed in a way that need repeatable reads, then the more queries you stuff into a stored proc, the longer it's going to take to run, and the longer it's going to take to return those results back to your app server. That increased time means higher contention for locks, and the more SQL Server has to store in TempDB for row versioning. You mentioned that you're heavy on reads, so this particular issue shouldn't be too bad for you, but you want to be aware of it before you reuse this hammer on a write-intensive app.
I think multi-recordset stored procedures are great in some cases, and it sounds like yours may be one of them.
The bigger (more traffic) your site gets, the more that 'extra' bit of performance matters. If you can combine 2, 3, or 4 calls to the database (and possibly new connections) into one, you could be cutting your database hits from 4-8 million per day down to 2 million, which is substantial.
I use them sparingly, but when I have, I have never had a problem.
I would recommend having one outer stored procedure that invokes several inner stored procedures, each returning a single result set.
create proc foo
as
begin
    execute foobar;  -- returns one result set
    execute barfoo;  -- returns one result set
    execute bar;     -- returns one result set
end
That way, when requirements change and you only need, say, the 3rd and 5th result sets, you have an easy way to invoke just those without adding new stored procedures and regenerating your data access layer. My current app returns all reference tables (e.g. the US states table) whether I want them or not. Worst is when you need one reference table and the only access is via a stored procedure that also runs an expensive query as one of its six result sets.
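On the consuming side, the .NET approach is SqlDataReader.NextResult as mentioned in the question; here is a sketch of the same pattern in Python with pyodbc (the connection string and database name are hypothetical, and "foo" is the wrapper proc sketched above):

    import pyodbc

    # Hypothetical connection string; "foo" is the wrapper proc defined above.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=localhost;DATABASE=mydb;Trusted_Connection=yes"
    )
    cursor = conn.cursor()
    cursor.execute("EXEC foo")

    result_sets = []
    while True:
        result_sets.append(cursor.fetchall())  # one list of rows per inner proc
        if not cursor.nextset():               # False once no more result sets remain
            break

    print(f"{len(result_sets)} result sets fetched in a single round trip")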
