I am trying to compare the response times of my queries in the Gremlin Console (the graph database is JanusGraph, and the backend database is HBase). To do that, there is the "clock()" step, which can run the query multiple times and return the average response time.
But as stated in the documentation, there is a "warm up" phase:
The warm up simply consists of running the query one time before
timing starts. This means that for a single timing iteration, the
human perceived time will be roughly double the time returned by the
clock analysis.
Because of that warm-up phase, all the graph data needed for the traversal is already in the cache when timing starts, which will not be true in the real world.
For example, the query I am working on takes 6 minutes to complete because there is a lot of data to fetch from the HBase backend, but the clock() step displays an execution time of 10s, which could only be true in the best-case scenario.
Is there another, better way to get a correct execution time for my queries using the Gremlin Console?
I think you can still use clock(). Just roll back the transaction between executions:
clock { g.V().iterate();g.tx().rollback() }
...aside from the benefit in separate performance monitoring and logging.
For logging, I am confident I can get granularity through manually adding the name of the "routine" to each call. This is how it is now with several discrete Functions for different parts of the system:
There are multiple automatic logs: start and finish of the routine, for example. It would be more challenging to find out how expensive certain routines are, but it would not be impossible.
The reason I want the entire logic of the application handled by a single handler function is to reduce cold starts: one function means only one container that can be persistently kept alive when there are very few users of the app.
If a month is ~2.6m seconds and we assume the system uses 1 GB RAM and 1 GHz CPU frequency at all times, that's:
2600000 * 0.0000025 + 2600000 * 0.000001042 = USD$9.21 a month
...for one minimum instance.
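The estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the two per-second rates are copied from the question as-is, and the names FIRST_RATE and SECOND_RATE are placeholders, since the question does not say which term is memory and which is CPU.

```python
# Figures copied from the question; not checked against any current price list.
SECONDS_PER_MONTH = 2_600_000   # "~2.6m seconds"
FIRST_RATE = 0.0000025          # USD per second, first term in the question
SECOND_RATE = 0.000001042       # USD per second, second term in the question


def monthly_cost(seconds=SECONDS_PER_MONTH):
    """Cost in USD of keeping one instance alive for the given number of seconds."""
    return seconds * FIRST_RATE + seconds * SECOND_RATE


print(f"USD${monthly_cost():.2f} a month")
```

Changing SECONDS_PER_MONTH or the rates makes it easy to see how the estimate moves with a different instance size.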
I should also state that all of my functions have the bare minimum amount of global scope code; it just sets up Firebase assets (RTDB and Firestore).
From a billing, performance (based on user wait time), and user/developer experience perspective, is there any reason why it would be smart to keep all my functions discrete?
I'd also accept an answer saying "one single function for all logic is reasonable" as long as there's a reason for it.
Thanks!
If you have a very small app with ~5 endpoints and very low traffic, sure, you could do something like this. But here is why you might not want to:
billing and performance
The important thing to realize is that with every request a new instance of your function can be created, which means there could be tens of them running at the same time.
If you would like to have just 1 instance handling all the traffic, you should explore GCP Cloud Run, where 1 container handles multiple requests and scales out only when that is not sufficient.
Imagine you have several endpoints and every one of them has different performance requirements.
1 can need only 128MB of RAM
1 can need 1GB RAM
(FYI: You can control the CPU MHz of the function via the RAM settings too - which can speed up execution in some cases)
If you had only 1 function with 1GB of RAM, every request would allocate such a function, and in some cases most of the memory could go to waste.
But if you split it into multiple functions, some requests will require far fewer resources, which can save you $ once we talk about a bigger number of executions / month (tens of thousands+).
Let's imagine a function with 3 second execution and 10k executions/month:
128MB would cost you $0.0693
1024MB would cost you $0.495
As you can see, with a small app the difference could be next to nothing. But if you scale, it matters. (*The cost can vary based on datacenter)
As for the logging, I don't think it matters. Usually in bigger systems there could be messages traveling through several functions, so you have to deal with that anyway.
As for the cold start, you just need a good UI to accommodate it. At first I was worried about it in our apps, but later on you just get used to the fact that some actions can take ~2s to execute (cold start). And you should have the UI "loading" state regardless, because you don't know if the function will take ~100ms or 3s due to a bad connection.
Recently I got the task of optimizing a quite large PL/SQL script, which prior to my changes took about 1 hour +/- 10 minutes.
So I moved some methods around and generally replaced big views with simpler subqueries or WITH clauses. I noticed that if I ran the scheduled job manually, by right-clicking it and executing it, I would in most cases see the run duration change (in a positive way). But if I enable the job and let it run on its schedule, it takes the original hour no matter what changes I make to it.
Now my question here is: Is there any way to monitor the RAM or CPU usage of the session/job or is there a difference in general how many resources are allocated to background processes? Because my suspicion here is the "manual" run job somehow gets some priorities the scheduler doesn't get or doesn't take.
Either way for troubleshooting purposes you can't take a few hours a work day just to wait for results.
I cannot find the answer in the documentation.
If I run query ingestion using
.set-or-append async
will the result be guaranteed?
Currently, we are running all those data grooming operations without the async keyword. Sometimes they jam our cluster, but we know they fail and can recover from a queue.
If we fire them async, then we don't have any control to know if they fail and recover.
What are the recommended ways to handle this?
For more context detail, we have to groom the data from 2 very large tables to the other 4 tables with aggregated data. UpdatePolicy is not really applied in this use case. We only need to run it once a week, since it's more like a weekly or monthly based aggregation. Unfortunately, when they run, we seem to easily get throttled.
Currently we are running all those data grooming operations without the async keyword. Sometimes they jam our cluster, but we know they fail and can recover from a queue.
This isn't accurate: these commands don't [currently] get queued. If a command fails for whatever reason (transient failure, throttling due to hitting ingestion capacity, etc.), it fails, and it's the responsibility of the caller to retry it (if it makes sense to do so, based on the kind of failure).
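As a rough illustration of what "the caller retries" could look like, here is a minimal sketch in Python. Everything in it is hypothetical: run_command stands for whatever function submits your ingestion command, and TransientCommandError is a made-up exception type marking failures that are worth retrying (such as throttling). It is not part of any Kusto client library.

```python
import time


class TransientCommandError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. throttling)."""


def run_with_retry(run_command, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call run_command(), retrying transient failures with exponential backoff.

    Any other exception is treated as permanent and propagates immediately.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return run_command()
        except TransientCommandError:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            # Wait 1x, 2x, 4x, ... base_delay before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

The backoff matters when the failure is throttling: retrying immediately would only add load to a cluster that is already at its ingestion capacity.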
For more context detail, we have to groom the data from 2 very large tables to other 4 tables with aggregated data. UpdatePolicy is not really applied in this use case. We only need to run it once a week, since it's more like a weekly or monthly based aggregation. Unfortunately, when they run, we seem to easily get throttled.
The commands being frequently throttled is an indication that your cluster is frequently fully utilizing its ingestion capacity. As ingestion capacity scales linearly with the number of cores in the cluster, you could choose to scale out/up your cluster to get additional ingestion capacity (you could also do that on a schedule, temporarily, only when your additional ingestion load is expected to run).
Note that you can (and probably should) monitor the state/status of commands you run using the async keyword. That can be done using the .show operations command (doc).
I need to find out the peak read capacity units consumed in the last 20 seconds in one of my DynamoDB tables. I need to find this programmatically in Java and set an auto-scaling action based on the usage.
Could you please share a sample Java program to find the peak read capacity units consumed in the last 20 seconds for a particular DynamoDB table?
Note: there are unusual spikes in the DynamoDB requests on the database, and hence it needs dynamic auto-scaling.
I've tried this:
result = DYNAMODB_CLIENT.describeTable(recomtableName);
readCapacityUnits = result.getTable()
.getProvisionedThroughput().getReadCapacityUnits();
This gives the provisioned capacity, but I need the consumed capacity in the last 20 seconds.
You could use the CloudWatch API getMetricStatistics method to get a reading for the capacity metric you require. A hint for the kinds of parameters you need to set can be found here.
For that you have to use CloudWatch.
GetMetricStatisticsRequest metricStatisticsRequest = new GetMetricStatisticsRequest()
metricStatisticsRequest.setStartTime(startDate)
metricStatisticsRequest.setEndTime(endDate)
metricStatisticsRequest.setNamespace("AWS/DynamoDB")
metricStatisticsRequest.setMetricName('ConsumedWriteCapacityUnits') // use 'ConsumedReadCapacityUnits' for reads
metricStatisticsRequest.setPeriod(60)
metricStatisticsRequest.setStatistics([
'SampleCount',
'Average',
'Sum',
'Minimum',
'Maximum'
])
List<Dimension> dimensions = []
Dimension dimension = new Dimension()
dimension.setName('TableName')
dimension.setValue(dynamoTableHelperService.campaignPkToTableName(campaignPk))
dimensions << dimension
metricStatisticsRequest.setDimensions(dimensions)
client.getMetricStatistics(metricStatisticsRequest)
But I bet you'd get results older than 5 minutes.
Actually, the current off-the-shelf autoscaling uses CloudWatch. This does have a drawback, and for some applications it is unacceptable.
When a load spike hits your table, the table does not have enough capacity to respond. Provisioning some headroom is not enough, and the table starts throttling. If records are kept in memory while waiting for the table to respond, this can simply exhaust the memory. CloudWatch, on the other hand, reacts only after some time, often when the spike is already gone. Based on our tests, the delay was at least 5 minutes. Autoscaling also raises capacity gradually, when what was needed was to go straight up to the max.
Long story short: we created a custom solution with our own speedometers. It counts whatever it has to count and changes the table's capacity accordingly. There is still a delay because:
App itself takes a bit of time to understand what to do
Dynamo table takes ~30 sec to get updated with new capacity details.
On top of that, we also have a throttling detector. If a write/read request gets throttled, we immediately raise the capacity accordingly. Sometimes the capacity level looks all right, but requests are still throttled because of a hot key issue.
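To give an idea of what such a "speedometer" might look like, here is a minimal sketch in Python. It is not the solution described above, only an illustration under assumptions: the class name PeakTracker, the per-second bucketing, and the 20-second window (taken from the question) are all my own choices, and the code that feeds in the consumed units and changes the table's capacity is left out.

```python
import time
from collections import deque


class PeakTracker:
    """Tracks units consumed per second and reports the peak in a sliding window."""

    def __init__(self, window_seconds=20, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.clock = clock
        self.buckets = deque()  # (second, units) pairs, oldest first

    def record(self, units):
        """Add consumed units to the bucket for the current second."""
        second = int(self.clock())
        if self.buckets and self.buckets[-1][0] == second:
            self.buckets[-1] = (second, self.buckets[-1][1] + units)
        else:
            self.buckets.append((second, units))
        self._evict(second)

    def _evict(self, now_second):
        # Drop buckets that have fallen out of the window.
        while self.buckets and self.buckets[0][0] <= now_second - self.window_seconds:
            self.buckets.popleft()

    def peak(self):
        """Highest per-second total seen within the window (0 if nothing was recorded)."""
        self._evict(int(self.clock()))
        return max((units for _, units in self.buckets), default=0)
```

The application would call record() after each read or write with the units that request consumed, and a separate loop would compare peak() with the provisioned capacity to decide whether to scale.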
Good day, I receive data from a communication channel and display it. In parallel, I serialize it into a SQLite database (using normal SQL INSERT statements). When my application exits, I do a .commit on the sqlite object.
What happens if my application is terminated brutally in the middle? Will the latest data (within reason: not from 100 microseconds ago, but at least from a second ago) be safely in the database even if no .commit has been made? Or should I commit periodically? What are the best patterns for doing these things?
I tried turning autocommit on (sqlite's option), and this slows the code a lot, by a factor of ~55 (autocommit vs. just one commit at the end). Doing a commit every 100 inserts brings performance within 20% of the optimal mode. So autocommit is very slow for me.
My application pumps lots of data into the DB. What can I do to make it work well?
You should be performing this within a transaction, and consequently performing a commit at appropriate points in the process. A transaction will guarantee that this operation is atomic - that is, it either works or doesn't work.
Atomicity states that database
modifications must follow an “all or
nothing” rule. Each transaction is
said to be “atomic” if when one part
of the transaction fails, the entire
transaction fails. It is critical that
the database management system
maintain the atomic nature of
transactions in spite of any DBMS,
operating system or hardware failure.
If you've not committed, then the inserts won't be visible (they will be rolled back) when your process is terminated.
When do you perform these commits? When your inserts represent something consistent and complete. E.g., if you have to insert 2 pieces of information for each message, then commit after you've inserted both pieces of info. Don't commit after each one, since your info won't be consistent or complete.
The data is not permanent in the database without a commit. Use an occasional commit to balance the speed of performing many inserts in a transaction (the more frequent the commit, the slower) with the safety of having more frequent commits.
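Here is a minimal sketch of that trade-off using Python's built-in sqlite3 module, assuming the default transaction handling of that module. The table name, the function names, and the commit_tail flag are made up for the illustration; the batch size of 100 is the one mentioned in the question. Passing commit_tail=False skips the final commit, which roughly imitates the process dying before it gets there.

```python
import sqlite3

BATCH_SIZE = 100  # commit every 100 inserts, as tried in the question


def store_messages(db_path, messages, batch_size=BATCH_SIZE, commit_tail=True):
    """Insert messages, committing once per batch instead of once per insert."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS messages (payload TEXT)")
    conn.commit()
    pending = 0
    for payload in messages:
        # sqlite3 opens a transaction implicitly before the first INSERT.
        conn.execute("INSERT INTO messages (payload) VALUES (?)", (payload,))
        pending += 1
        if pending >= batch_size:
            conn.commit()  # everything up to here is now durable
            pending = 0
    if commit_tail:
        conn.commit()
    # Closing without a commit discards the inserts still pending.
    conn.close()


def count_messages(db_path):
    """Number of rows that were actually committed to the database."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
    finally:
        conn.close()
```

With 250 messages and commit_tail=False, only the first 200 rows survive, so at most one batch of data is lost on an abrupt exit. A time-based commit (say, once per second) would work the same way if you care more about age than about row count.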
You should do a COMMIT every time you complete a logical change.
One reason for transactions is to prevent uncommitted data from being visible outside the transaction. That is important because sometimes a single logical change can translate into multiple INSERT or UPDATE statements. If one of the later statements of the transaction fails, the transaction can be cancelled with ROLLBACK and no change at all is recorded.
Generally speaking, no change performed in a transaction is recorded in the database until COMMIT succeeds.
Doesn't this slow down my code considerably? – zaharpopov
Frequent commits might slow down your code, and as an optimization you could try grouping several logical changes into a single transaction. But this is a departure from the correct use of transactions, and you should only do it after measuring that it significantly improves performance.