What are some of the factors that impact query performance in Grakn?

I'm querying for concepts in Grakn. At what point does query performance on match start to degrade, i.e. how many concepts/variables can I put in one query without impacting performance?

There are 3 factors involved in how long a query takes:
1. How long it takes to make the query plan(s)
2. How long a plan takes to execute
3. How much reasoning is needed and how expensive that is
Keeping the number of variables low helps with (1) and (2).
The time taken for (1) depends on the number of variables in the query; it shouldn't be a problem under 50 variables.
(2) depends entirely on the nature of your query and the shape and quantity of your data.
(3) depends on the number of rules triggered, and the number of instances that the reasoner needs to check against the rule(s).
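As a rough illustration, a match query like the one below (against a hypothetical schema; the type and attribute names are made up) uses only three variables, comfortably under the ~50-variable point mentioned above:
match
  $p isa person, has name $n;
  ($p, $org) isa employment;
get $p, $n, $org;
Each additional variable enlarges the space of candidate plans the planner must consider, so dropping variables you don't actually need in the answer helps both (1) and (2).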

Related

The fastest way to SELECT MAX() from a table

What would be the most efficient and elegant way to select the MAX value from a transparent table?
For example, let's take a simple table such as T100 (Messages), and imagine we need to select the maximum message number within some application area.
We can do this in multiple ways. The most obvious is using the MAX aggregate function
select MAX(msgnr)
from t100
where arbgb = '/ASU/GENERAL'
And the other is using an UP TO 1 ROWS clause
select msgnr
from t100
where arbgb = '/ASU/GENERAL'
and ROWNUM = 1
order by msgnr DESC
All SQL statements above are given in native Oracle SQL, as I was doing tests in DBACOCKPIT, where this is mandatory.
The built-in cockpit tool showed almost identical results for both requests:
the estimated cost is 3 for both of them; however, the second query has only 2 steps in its plan whereas the first has 3.
the estimated CPU cost for the first query is 21.564 per step, whereas the second has a total cost of 21.964.
So we can conclude that the second variant is more efficient here. However, would this hold in the general case? Would this result be different on different DBs, or can we treat it as a rule of thumb?
Performance measurements always vary depending on a rather large set of parameters, the DBMS being not the least of them. Some DBMS might, for example, be able to use an index to efficiently determine a MAX(...) statement while others might have to resort to some more inefficient form. Also, don't trust the optimizer's predictions - measure the real performance. The optimizer might be wrong for a number of reasons (outdated statistics and broken indexes being the most common ones as far as I have seen), and it's of little help if the optimizer estimates a total cost of 12.345 if the actual cost turns out to be 987.654. Hence, my recommendations would be
select only what you need (avoid SELECT *, use WHERE sensibly)
measure, measure, measure
be wary of most of the unproven and alleged performance rules-of-thumb out there.
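One caveat about the second query as written: in Oracle, ROWNUM is assigned to rows before ORDER BY is applied, so "WHERE ROWNUM = 1 ... ORDER BY msgnr DESC" picks an arbitrary row first and then sorts that single row; it is not guaranteed to return the maximum at all. The usual correct form sorts in a subquery first (a sketch in plain Oracle SQL):
select msgnr
from (
    select msgnr
    from t100
    where arbgb = '/ASU/GENERAL'
    order by msgnr desc
)
where ROWNUM = 1
This also brings the comparison back to measuring: with an index on (arbgb, msgnr), both this form and the MAX() variant can typically be answered by a cheap index scan.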

Spark tasks with Cassandra

I am new to Spark and Cassandra.
We are using Spark on top of Cassandra to read data, since we have a requirement to read data using non-primary-key columns.
One observation is that the number of tasks for a Spark job increases as the data grows. Because of this, we are seeing a lot of latency when fetching data.
What would be the reasons for the Spark job task count increase?
What should be considered to increase performance in Spark with Cassandra?
The input split size is controlled by the configuration spark.cassandra.input.split.size_in_mb. Each split generates a task in Spark; therefore, the more data in Cassandra, the more tasks there are and the longer it takes to process (which is what you would expect).
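As a minimal Java sketch (the app name, host, and value are placeholders), the split size is just a SparkConf setting, so you can tune it per job:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf()
        .setAppName("cassandra-read")                         // hypothetical app name
        .set("spark.cassandra.connection.host", "127.0.0.1")  // placeholder host
        .set("spark.cassandra.input.split.size_in_mb", "64"); // larger splits => fewer tasks
JavaSparkContext sc = new JavaSparkContext(conf);
A larger split size means fewer tasks per scan; too large and you lose parallelism, so it is worth measuring in both directions.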
To improve performance, make sure you are aligning the partitions by using joinWithCassandraTable. Don't use context.cassandraTable(...) unless you absolutely need all the data in the table, and optimize the retrieved data by using select to project only the columns that you need, as sketched below.
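A minimal sketch of that projection advice using the connector's Java API (the keyspace, table, and column names are hypothetical; where only pushes the filter down for columns Cassandra can filter server-side, such as clustering or indexed columns):
import com.datastax.spark.connector.japi.CassandraRow;
import com.datastax.spark.connector.japi.rdd.CassandraJavaRDD;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

CassandraJavaRDD<CassandraRow> rows = javaFunctions(sc)
        .cassandraTable("my_keyspace", "events")  // hypothetical keyspace/table
        .select("user_id", "event_ts")            // project only the columns you need
        .where("bucket = ?", "2015-01-01");       // filter server-side where possible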
If you need data from some rows, it would make sense to build a secondary table where the id of those rows is stored.
Secondary indexes could also help to select subsets of the data, but I've seen reports of them not being highly performant.
What would be the reasons for the Spark job task count increase?
Following on from maasg's answer, rather than setting spark.cassandra.input.split.size_in_mb on the SparkConf, it can be useful to use the ReadConf config when reading from different keyspaces/datacentres in a single job:
import com.datastax.driver.core.ConsistencyLevel
import com.datastax.spark.connector._
import com.datastax.spark.connector.rdd.ReadConf

val readConf = ReadConf(
  splitCount = Option(500),                      // fixed number of splits for this scan
  splitSizeInMB = 64,                            // target size of each input split
  fetchSizeInRows = 1000,                        // rows fetched per round trip
  consistencyLevel = ConsistencyLevel.LOCAL_ONE, // read consistency for this table only
  taskMetricsEnabled = true
)
val rows = sc.cassandraTable(cassandraKeyspace, cassandraTable).withReadConf(readConf)
What should be considered to increase performance in Spark with Cassandra?
As far as increasing performance is concerned, this will depend on the jobs you are running and the types of transformations required. Some general advice to maximise Spark-Cassandra performance (as can be found here) is outlined below.
Your choice of operations and the order in which they are applied is critical to performance.
You must organize your processes with task distribution and memory in mind.
The first thing is to determine if your data is partitioned appropriately. A partition in this context is merely a block of data. If possible, partition your data before Spark even ingests it. If this is not practical or possible, you may choose to repartition the data immediately after the load. You can repartition to increase the number of partitions or coalesce to reduce the number of partitions.
The number of partitions should, as a lower bound, be at least 2x the number of cores that are going to operate on the data. Having said that, you will also want to ensure any task you perform takes at least 100ms to justify the distribution across the network. Note that a repartition will always cause a shuffle, whereas coalesce typically won't. If you've worked with MapReduce, you know shuffling is what takes most of the time in a real job.
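A quick Java sketch of the two operations (the input path and core count are made up for illustration):
import org.apache.spark.api.java.JavaRDD;

int numCores = 16;                                        // assumed total executor cores
JavaRDD<String> data = sc.textFile("hdfs:///data/input"); // hypothetical source
JavaRDD<String> spread = data.repartition(4 * numCores);  // more partitions; always shuffles
JavaRDD<String> packed = spread.coalesce(2 * numCores);   // fewer partitions; avoids a full shuffle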
Filter early and often. Assuming the data source is not preprocessed for reduction, your earliest and best place to reduce the amount of data Spark will need to process is on the initial data query. This is often achieved by adding a where clause. Do not bring in any data not necessary to obtain your target result. Bringing in any extra data will affect how much data may be shuffled across the network and written to disk. Moving data around unnecessarily is a real killer and should be avoided at all costs.
At each step you should look for opportunities to filter, distinct, reduce, or aggregate the data as much as possible before proceeding to the next operation.
Use pipelines as much as possible. Pipelines are a series of transformations that represent independent operations on a piece of data and do not require a reorganization of the data as a whole (shuffle). For example: a map from a string -> string length is independent, whereas a sort by value requires a comparison against other data elements and a reorganization of data across the network (shuffle).
In jobs which require a shuffle see if you can employ partial aggregation or reduction before the shuffle step (similar to a combiner in MapReduce). This will reduce data movement during the shuffle phase.
Some common tasks that are costly and require a shuffle are sorts, group by key, and reduce by key. These operations require the data to be compared against other data elements which is expensive. It is important to learn the Spark API well to choose the best combination of transformations and where to position them in your job. Create the simplest and most efficient algorithm necessary to answer the question.
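To make the partial-aggregation point concrete, here is a small Java sketch (toy data) contrasting reduceByKey, which combines values within each partition before the shuffle, with groupByKey, which ships every pair across the network:
import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>("a", 1), new Tuple2<>("a", 2), new Tuple2<>("b", 3)));

// map-side combine: only one partial sum per key per partition is shuffled
JavaPairRDD<String, Integer> sums = pairs.reduceByKey(Integer::sum);

// no combine: every (key, value) pair crosses the network - avoid for plain aggregations
JavaPairRDD<String, Iterable<Integer>> grouped = pairs.groupByKey();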

Allocating datastore long IDs - but segmented so different Kinds have different ranges

My program has 3 Kinds that are closely related, and I want to be able to store and manipulate their long ids interchangeably; e.g. I might have an array of long ids that can be for any of the 3 Kinds.
Using the allocateIds API I can allocate the ids for the 3 Kinds in the same namespace, but I also sometimes need to be able to tell which Kind one of these ids refers to (e.g. in order to do a datastore operation on the right Kind).
I understand that the 'normal' way to do this is to store the whole Key type, rather than just the long id, but there will be a huge number of these - it will be more efficient if I can just use long values rather than Key values.
So, I'd like to be able to segment the ID ranges, so I can call a simple function with an ID and it will tell me which of the 3 Kinds the ID is for.
(I'm using Java, but I don't think that matters.)
Allocate my own IDs
I guess the most straightforward way to do this is to simply allocate my own IDs. I believe that, in order to allocate sequential IDs, I would need to do an extra datastore write for every allocation (to track the allocations), or get into some complicated system of pre-allocating ranges of IDs to each live instance. This sounds like a bad idea.
So I could generate random 54-bit IDs - reserving 2 bits to use as flags to indicate the type. But it is my understanding that random or hash allocation dramatically reduces the number of allocations that can be made safely. The Internet tells me that the chance of a collision is approximately k^2 / (2N), where k is the number of allocations and N is the size of the allocation space. So, if I'm willing to accept a 0.1% chance of collision, then k = sqrt(2 * 2^54 / 1000), which is roughly 6 million. Since I really have no idea how many entities I will need to store, this is unacceptable.
Reserve some bits in the Long ID to indicate the Kind
Another solution would be to use 2 bits of the long value as flags to indicate the type. The easiest way to do this would be to take advantage of the fact that the allocator now only uses the low 56 bits of a long, so I could use the high bits as flags to indicate the Kind. The problem with that solution is that I lose the ability to manipulate these numbers in JavaScript - the reason for the 56-bit limit in the first place.
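Whichever bits are chosen, the tagging itself is plain bit twiddling. A hypothetical Java helper (nothing here is a datastore API; the shift of 54 matches the bits-54/55 variant described next):
public final class TaggedIds {
    private static final int KIND_SHIFT = 54;             // flag bits live at 54 and 55
    private static final long ID_MASK = (1L << KIND_SHIFT) - 1;

    // pack a 2-bit kind tag (0..3) into a raw id assumed to fit in 54 bits
    static long tag(long rawId, int kind) {
        return (rawId & ID_MASK) | ((long) kind << KIND_SHIFT);
    }

    // recover which of the Kinds this id belongs to
    static int kindOf(long taggedId) {
        return (int) ((taggedId >>> KIND_SHIFT) & 0x3);
    }

    // strip the tag to get back the datastore id
    static long rawId(long taggedId) {
        return taggedId & ID_MASK;
    }
}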
An alternative to this - to maintain the option of manipulating these numbers in js - is to use allocateIdRange and pre-allocate (and throw away) the ID ranges corresponding to bits 54 and 55. Actually, I could use any bits, but specifying the ID ranges is much easier if I use the high bits.
But I know little of how the datastore and how the allocator actually work, so I don't know if this 'pre-allocate and discard' technique is a good idea.

Time complexity of Hash table

I am confused about the time complexity of hash tables. Many articles state that they are "amortized O(1)", not true O(1). What does this mean in real applications? What is the average time complexity of the operations in a hash table in an actual implementation, not in theory, and why are the operations not true O(1)?
It's impossible to know in advance how many collisions you will get with your hash function, or when things like a resize will be needed. This can add an element of unpredictability to the performance of a hash table, making it not true O(1). However, virtually all hash table implementations offer O(1) on the vast, vast, vast majority of inserts. This is the same as inserting into an array - it's O(1) unless you need to resize, in which case it's O(n); on top of that comes the collision uncertainty.
In reality, hash collisions are very rare and the only condition in which you'd need to worry about these details is when your specific code has a very tight time window in which it must run. For virtually every use case, hash tables are O(1). More impressive than O(1) insertion is O(1) lookup.
For some uses of hash tables, it's impossible to create them of the "right" size in advance, because it is not known how many elements will need to be held simultaneously during the lifetime of the table. If you want to keep access fast, you need to resize the table from time to time as the number of elements grows. This resizing takes linear time with respect to the number of elements already in the table, and is usually done on an insertion, when the number of elements passes a threshold.
These resizing operations can be made infrequent enough that the amortized cost of insertion is still constant (by following a geometric progression for the size of the table, for instance doubling the size each time it is resized). But an occasional insertion takes O(n) time because it triggers a resize.
In practice, this is not a problem unless you are building hard real-time applications.
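To make the doubling strategy concrete, here is a minimal Java sketch (illustrative only, not production code; all names are made up) of chained buckets where an insert occasionally triggers an O(n) rehash:
import java.util.LinkedList;

class SketchHashTable<K, V> {
    private static final double MAX_LOAD = 0.7;

    private static class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    private LinkedList<Entry<K, V>>[] buckets = newBuckets(16);
    private int size = 0;

    @SuppressWarnings("unchecked")
    private static <K, V> LinkedList<Entry<K, V>>[] newBuckets(int n) {
        return (LinkedList<Entry<K, V>>[]) new LinkedList[n];
    }

    public void put(K key, V value) {
        if (size + 1 > MAX_LOAD * buckets.length) resize(); // the rare O(n) step
        int i = index(key, buckets.length);
        if (buckets[i] == null) buckets[i] = new LinkedList<>();
        for (Entry<K, V> e : buckets[i]) {
            if (e.key.equals(key)) { e.value = value; return; } // overwrite existing key
        }
        buckets[i].add(new Entry<>(key, value)); // O(1) on average
        size++;
    }

    // doubling keeps resizes geometric, so the amortized cost per insert stays constant
    private void resize() {
        LinkedList<Entry<K, V>>[] old = buckets;
        buckets = newBuckets(old.length * 2);
        for (LinkedList<Entry<K, V>> bucket : old) {
            if (bucket == null) continue;
            for (Entry<K, V> e : bucket) {
                int i = index(e.key, buckets.length);
                if (buckets[i] == null) buckets[i] = new LinkedList<>();
                buckets[i].add(e);
            }
        }
    }

    private int index(Object key, int capacity) {
        return (key.hashCode() & 0x7fffffff) % capacity; // non-negative bucket index
    }
}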
Inserting a value into a hash table takes, in the average case, O(1) time. The hash function is computed, the bucket is chosen from the hash table, and then the item is inserted. In the worst-case scenario, all of the elements will have hashed to the same value, which means either the entire bucket list must be traversed or, in the case of open addressing, the entire table must be probed until an empty spot is found. Therefore, in the worst case, insertion takes O(n) time.
refer: http://www.cs.unc.edu/~plaisted/comp550/Neyer%20paper.pdf (Hash Table Section)

A couple of questions about Hash Tables

I've been reading a lot about hash tables and how to implement one in C, and I think I have almost all the concepts in my head so I can start to code my own. I just have a couple of questions that I have yet to properly understand.
As a reference, I've been reading this:
http://eternallyconfuzzled.com/jsw_home.aspx
1) As I've read on the site above, a power of two or a prime number is recommended for the hash table size. This is basically an array, and an array has a fixed size, so I can quickly look up the value I'm looking for. I can't declare a small array if I have a large input, as it won't fit, and I can't declare a very large array if my input data is not that large, because that's wasted memory.
What is the optimum size for the Hash Table? What should I base my decision on?
2) Also, on that site, there are a couple of hashing functions which I have yet to read through. It also states that it's always best to use a well-known algorithm rather than to roll my own. And I might do just that: I'll pick one from that site, test it out on my code, and see if it minimizes collisions based on my input data.
What's bugging me is how I control the hash range. The hash can't return an integer larger than the hash table size or we'll have a serious problem. How do I deal with this?
1) What you are referring to is the load factor of the hash table - the ratio of stored elements to buckets. Wikipedia has this to say:
With a good hash function, the average lookup cost is nearly constant as the load factor increases from 0 up to 0.7 or so. Beyond that point, the probability of collisions and the cost of handling them increases.
I believe the Java implementation (and probably others) resizes periodically to keep the load factor within an acceptable range.
2) Just use the modulo operator (%) to keep the bucket index legal. The second operand should be the size of your bucket array.
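A short Java illustration (the table size is an assumed value; in C the idea is identical, minus the sign fix, since you would typically hash into an unsigned type):
int tableSize = 1024;                          // bucket array length (assumed power of two)
int hash = "example key".hashCode();           // may be negative in Java
int index = (hash & 0x7fffffff) % tableSize;   // clear the sign bit, then wrap into range
int fastIndex = hash & (tableSize - 1);        // masking also yields a legal index when the size is a power of two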
Pick a small size for your hash table. As you add stuff to your table, check to see what percentage of the table is being used; when it is greater than 70% full, make the table bigger. This also holds true as you remove elements: make the table smaller when it is less than 60% full, for instance. Wikipedia has a good description of some strategies for dynamic resizing, but that's the general idea.
I only say this because you seem to have known input data:
If you know the rough order of magnitude of the amount of data you will be storing in the hash table, it's generally good enough to just create a table about that big. (You shouldn't worry about whether everything will fit. Instead, the right thing to think about is how many collisions you will have and how you will handle them.)
As for the right hash function, it's possible that the structure of your input will suggest which one will be correct. For instance, what aspects of your input are likely to be evenly distributed?
