I cannot find the answer in the documentation.
If I run ingestion-from-query commands using
.set-or-append async
is the result guaranteed?
Currently, we are running all of those data grooming operations without the async keyword. Sometimes they jam our cluster, but we know when they fail and can recover from a queue.
If we fire them async, we have no way to know whether they fail, or to recover.
What are the recommended ways to handle this?
For more context: we have to groom data from 2 very large tables into 4 other tables of aggregated data. UpdatePolicy is not really applicable to this use case; we only need to run this once a week, since it's a weekly or monthly aggregation. Unfortunately, when these commands run, we easily get throttled.
Currently, we are running all of those data grooming operations without the async keyword. Sometimes they jam our cluster, but we know when they fail and can recover from a queue.
That isn't accurate: these commands don't [currently] get queued. If a command fails, for whatever reason (a transient failure, throttling due to hitting the ingestion capacity, etc.), it simply fails, and it is the caller's responsibility to retry it (if it makes sense to do so, based on the kind of failure).
For more context: we have to groom data from 2 very large tables into 4 other tables of aggregated data. UpdatePolicy is not really applicable to this use case; we only need to run this once a week, since it's a weekly or monthly aggregation. Unfortunately, when these commands run, we easily get throttled.
Frequent throttling of these commands indicates that your cluster is regularly fully utilizing its ingestion capacity. As ingestion capacity scales linearly with the number of cores in the cluster, you could choose to scale your cluster out or up to get additional ingestion capacity (you could also do that temporarily, on a schedule, only when your additional ingestion load is expected to run).
Note that you can (and probably should) monitor the state/status of commands you run with the async keyword; that can be done using the .show operations command (see the documentation).
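As a rough sketch of that fire-and-poll pattern, using the Python azure-kusto-data client (the cluster, database, table and query below are placeholders rather than your actual names):

import time
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

# Placeholder cluster and database names - substitute your own.
kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
    "https://mycluster.westeurope.kusto.windows.net")
client = KustoClient(kcsb)
database = "MyDatabase"

# Fire the ingest-from-query command asynchronously; it returns an operation id immediately.
command = (".set-or-append async WeeklyAggregates <| "
           "LargeSourceTable | summarize Rows = count() by CustomerId, bin(Timestamp, 7d)")
result = client.execute_mgmt(database, command)
operation_id = next(iter(result.primary_results[0]))["OperationId"]

# Poll .show operations until the command leaves the InProgress state,
# then decide whether to retry based on the final State/Status.
while True:
    op = next(iter(client.execute_mgmt(
        database, ".show operations {}".format(operation_id)).primary_results[0]))
    if op["State"] != "InProgress":
        break
    time.sleep(30)

if op["State"] != "Completed":
    print("command did not complete:", op["State"], op["Status"])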
Issue
I'm seeing extremely high CPU utilization when making [what seems like] a fairly common request of a graph database. In fact, the utilization is so high that Amazon Neptune seems to "tap out" when I execute several of these queries in rapid succession / simultaneously.
I've experimented with even the largest instance (db.r5d.24xlarge), which costs ~$14k/month, and I still see the same general behavior -- memory is fine, CPU utilization goes wild.
Further, the query isn't exactly quick (~5 seconds), so I'm kind of getting the worst of both worlds...
Ask
Can someone help me understand what I can do to address this / review my query? My hope was that, just as a relational database can handle thousands of concurrent requests, Amazon Neptune could do something similar.
I suspect I might be using .limit() in the wrong spot. That said, the results produced are correct. If the limit() were moved up before the .and(), I'm concerned I'd end up with suboptimal results, since -- due to the until() clauses -- it could just be paths that stopped without making it to the toPortId.
Update: I'm wondering if it might make sense to add a timeLimit() clause to the repeat() clause. Anecdotally, after some testing, it seems like I'm paying a significant time penalty to rigorously verify that there are, in fact, no routes to certain places.
Detail
I'm new to Gremlin / Neptune / graph databases in general, so it's possible/likely that I'm either doing something wrong (or have something incorrectly configured) -or- I overestimated the ability of these tools to handle queries.
My use-case requires that I "qualify" (or disqualify) a large number of candidate destinations -- that is:
Starting at ORIGIN (fromPortId), are there paths (and what are they) to CANDIDATE (toPortId) for which the duration properties sum to less than LIMIT (travelTimeBudget)?
Offending Query
const result = await g.withSack(INITIAL_CHECKIN_PENALTY)
  .V(fromPortId)
  .repeat(
    (__.outE().hasLabel('HAS_VOYAGE_TO').sack(operator.sum).by('duration'))
      .sack(operator.sum).by(__.constant(LAYEROVER_PENALTY))
      .inV()
      .simplePath()
  )
  .until(
    __.or(
      __.has('code', toPortId),
      __.sack().is(gte(travelTimeBudget)),
      __.loops().is(maxHops))
  )
  .and(
    __.has('code', toPortId),
    __.sack().is(lte(travelTimeBudget))
  )
  .limit(4)
  .order().by(__.sack(), order.asc)
  .local(
    __.union(
      __.path().by('code').by('duration'),
      __.sack()
    ).fold()
  )
  .local(
    __.unfold().unfold().fold()
  )
  .toList()
  .then(data => {
    return data
  })
  .catch(error => {
    console.log('ERROR', error);
    return false
  });
Execution
Database: Amazon Neptune (db.r5.large)
Querying from / runtime: AWS Lambda running Node.js
Graph
5,300 vertices (Port)
42,000 edges (HAS_VOYAGE_TO)
Data Model
Note: For this issue, I'm only executing the "upper plane" / top half.
To address the CPU discussion, perhaps a small explanation of the Neptune instance architecture will be helpful. Each Neptune instance has a worker thread pool, and the number of workers in the pool is twice the number of vCPUs the instance has. That controls the maximum number of queries that can be running concurrently on a given instance. Each instance also has a query queue that holds queries sent to the instance until a worker becomes available.

For a somewhat complex query against a small instance type (like the db.r5.large), seeing 20 to 30% CPU utilization is not unexpected. A significant portion of the instance memory is used to cache the most recently used graph data locally; the remaining memory is allocated to the worker threads for query execution. So larger instances have (a) more workers, (b) more memory for caching graph data locally, and (c) more memory for use during query execution.
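For example, the db.r5.large used here has 2 vCPUs, so it gets 4 worker threads: at most 4 queries execute concurrently, and anything beyond that waits in the queue, which is one reason firing several of these heavy queries at once can overwhelm a small instance.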
Without having the data it's hard to attempt to modify your query and test it, but it may well be possible to optimize various parts of it to improve its overall efficiency.
As there are a lot of topics that we could discuss here, and I am more than happy to do that, it might be easiest, if you are able, to open an AWS support case and ask that the support engineer create a ticket and have it assigned to me. I will be happy to then get on the phone with you and we can spend some time discussing your use case if that would help.
If you are not able to open a support case, we can connect other ways. I'm easy to find on LinkedIn if you prefer to reach out that way.
I very much want to help you with this and your other questions, but I feel the most expeditious way might be for us to get on the phone.
I'm also happy to keep discussing here of course.
I tried to run a Gremlin query that adds a property to vertices through the Gremlin console.
g.V().hasLabel("user").has("status", "valid").property(single, "type", "valid")
I constantly get this error:
org.apache.tinkerpop.gremlin.jsr223.console.RemoteException: Connection to server is no longer active
This error happens after the query has been running for one or two minutes.
I tried some simple queries like g.V().limit(10) and they work fine.
Since the affected vertex count is more than 4 million, I'm not sure if it is failing due to a timeout.
I also tried to split it into small batches:
g.V().hasLabel("user").has("status", "valid").hasNot("type").limit(200000).property(single, "type", "valid")
It succeeded for the first few batches and then started failing again.
Are there any recommendations for updating millions of vertices?
The precise approach you take may vary depending on the backend graph database and storage you are using as well as the capacity of the hardware being used.
The capacity of the hardware where Gremlin Server is running, in terms of the number of CPUs and, most importantly, memory, will also be a factor, as will the query timeout setting.
To do this in Gremlin, if you had a way to easily identify distinct ranges of vertices, you could split this up into multiple threads, each doing batches of updates. If the example you show is representative of your actual need, then that is likely not possible in this case.
Likewise, some graph databases provide a bulk load capability that is often a good way to do large batch updates, but that is probably not an option here, as you essentially need to do a conditional update based on the current presence (or absence) of a property.
Without more information about your data model, hardware, etc., the best answer is probably to do two things:
Use smaller limits. Maybe try 5K or even just 1K at first and work up from there until you find a reliable sweet spot.
Increase the query timeout settings.
You may need to experiment to find the sweet spot for your environment, as the capacity of the hardware will definitely play a role in situations like this, as will how you write your query.
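As a rough illustration of that batching loop, here is a sketch using gremlinpython (the endpoint is a placeholder, the label and property names follow your query, and the batch size is just a starting point):

from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.traversal import Cardinality

# Placeholder endpoint - point this at your own Gremlin Server or cluster.
connection = DriverRemoteConnection('wss://your-gremlin-endpoint:8182/gremlin', 'g')
g = traversal().withRemote(connection)

BATCH_SIZE = 1000  # start small and work upward until you find a reliable sweet spot

while True:
    # Each request updates at most BATCH_SIZE vertices that still lack the property,
    # so the work is spread over many short traversals instead of one huge one.
    updated = (g.V().hasLabel('user').has('status', 'valid').hasNot('type')
                .limit(BATCH_SIZE)
                .property(Cardinality.single, 'type', 'valid')
                .count().next())
    print('updated', updated, 'vertices')
    if updated == 0:
        break

connection.close()

With most remote providers each request is committed automatically; if your graph requires explicit transactions, you would need to add that around each batch.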
I am trying to compare the response times of my queries in the Gremlin console (the graph database is JanusGraph, and the backend database is HBase). To do that, there is the clock() step, which can run the query multiple times and return the average response time.
But as stated in the documentation, there is a "warm up" phase:
The warm up simply consists of running the query one time before timing starts. This means that for a single timing iteration, the human perceived time will be roughly double the time returned by the clock analysis.
Because of that warm-up phase, all the graph data needed for the traversal is already in the cache, which will not be true in the real world.
For example, the query I am working on takes 6 minutes to complete because there is a lot of data to fetch from the HBase backend, but the clock() step displays an execution time of 10 s, which could only be true in the best-case scenario.
Is there another, better way to get a correct execution time for my queries using the Gremlin console?
I think you can still use clock(). Just roll back the transaction between executions:
clock { g.V().iterate(); g.tx().rollback() }
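Keep in mind that rolling back only clears the transaction-level cache. JanusGraph's database-level cache (if cache.db-cache is enabled), HBase's block cache, and the OS page cache may still be warm, so the numbers can remain somewhat optimistic compared with a genuinely cold run.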
Assume that I have a function called PlaceOrder which, when called, inserts the order details into a local DB and puts a message (the order details) onto a TIBCO EMS queue.
Once the message is received, a TIBCO BW process will then invoke some other system (say, ExternalSystem) to pass on the order details.
Now, the way I wrote my integration tests is:
Call PlaceOrder.
Sleep, and check that the details exist in the local DB.
Sleep, and check that the details exist in ExternalSystem.
Is the above approach correct? This test gives me confidence that the end-to-end integration is working, but is there a better way to test this scenario?
The problem you describe is quite common, and your approach is a very typical solution.
The problem with this solution is that if the delay is too short, your tests may sometimes pass and sometimes fail, but if the delay is very long, you're just wasting time waiting, and with many tests that can add up to a lot of delay. Unless you can get some signal telling you the order has arrived in the database, you just have to wait.
You can reduce the delay by polling: do many checks at short intervals, and if your order is still not there after a timeout, fail the test.
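As a sketch of that polling pattern (order_exists_in_db and order_exists_in_external_system are hypothetical checks you would implement against your own systems):

import time

def wait_until(condition, timeout=30.0, interval=0.5):
    # Poll `condition` every `interval` seconds until it returns True,
    # or until `timeout` seconds have elapsed; report whether it succeeded.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return False

# In the test, after calling PlaceOrder:
assert wait_until(lambda: order_exists_in_db(order_id), timeout=30)
assert wait_until(lambda: order_exists_in_external_system(order_id), timeout=60)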
In "Growing Object-Oriented Software, Guided by Tests"*, there is a chapter on this very subject, so you might want to get a copy if you will be doing a lot of this sort of testing.
"There are two ways a test can observe the system: by sampling its observable
state or by listening for events that it sends out. Of these, sampling is
often the only option because many systems donāt send any monitoring
events. Itās quite common for a test to include both techniques to interact
with different āendsā of its system"
(*) http://my.safaribooksonline.com/book/software-engineering-and-development/software-testing/9780321574442
Good day. I receive data from a communication channel and display it. In parallel, I serialize it into an SQLite database (using normal SQL INSERT statements). When my application exits, I do a .commit on the sqlite object.
What happens if my application is terminated abruptly in the middle? Will the latest data (within reason: not from 100 microseconds ago, but from at least a second ago) be safely in the database even without a .commit? Or should I commit periodically? What are the best patterns for doing these things?
I tried turning autocommit on (SQLite's option) and it slows the code down a lot, by a factor of ~55 (autocommit vs. just one commit at the end). Committing every 100 inserts brings performance to within 20% of the optimal mode. So autocommit is very slow for me.
My application pumps lots of data into the DB; what can I do to make it work well?
You should be performing this within a transaction, and consequently performing a commit at appropriate points in the process. A transaction will guarantee that this operation is atomic - that is, it either works or doesn't work.
Atomicity states that database modifications must follow an "all or nothing" rule. Each transaction is said to be "atomic": if one part of the transaction fails, the entire transaction fails. It is critical that the database management system maintain the atomic nature of transactions in spite of any DBMS, operating system or hardware failure.
If you have not committed, the inserts won't be visible (and will be rolled back) when your process is terminated.
When do you perform these commits? When your inserts represent something consistent and complete. For example, if you have to insert two pieces of information for each message, commit after you've inserted both pieces. Don't commit after just one, since your data won't be consistent or complete.
The data is not permanent in the database without a commit. Use an occasional commit to balance the speed of performing many inserts in a single transaction (the more frequent the commits, the slower it gets) against the safety of committing more often.
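As a rough illustration of that pattern (the file, table and column names here are made up), committing every N inserts looks roughly like this:

import sqlite3

connection = sqlite3.connect("capture.db")  # hypothetical database file
connection.execute("CREATE TABLE IF NOT EXISTS readings (ts REAL, value TEXT)")
connection.commit()

BATCH_SIZE = 100  # the question found ~100 inserts per commit close to optimal
pending = 0

def store(ts, value):
    # Each commit is a durable point: everything inserted before it survives a crash;
    # anything after the last commit is rolled back.
    global pending
    connection.execute("INSERT INTO readings (ts, value) VALUES (?, ?)", (ts, value))
    pending += 1
    if pending >= BATCH_SIZE:
        connection.commit()
        pending = 0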
You should do a COMMIT every time you complete a logical change.
One reason for transactions is to prevent uncommitted data from being visible outside the transaction. That is important because sometimes a single logical change translates into multiple INSERT or UPDATE statements. If one of the later statements in the transaction fails, the transaction can be cancelled with ROLLBACK and no change at all is recorded.
Generally speaking, no change performed in a transaction is recorded in the database until COMMIT succeeds.
Doesn't this slow down my code considerably? – zaharpopov
Frequent commits might slow down your code, and as an optimization you could try grouping several logical changes into a single transaction. But this is a departure from the correct use of transactions, and you should only do it after measuring that it significantly improves performance.