When continuous export job run time overlaps with intervalBetweenRuns in Kusto - azure-data-explorer

I have created a continuous export job with intervalBetweenRuns set to 5m. What happens if the export itself takes more than 5 minutes? Is there any real problem other than the export being slightly delayed? Let's say an export runs for 10 minutes: will it still take care of the run that it missed, or will it wait for the next window before running? The following diagram illustrates my question:
Now consider a scenario: the interval is set to 5 minutes and a single run of the export job takes 10-15 minutes. It's all right if the exports keep running back to back, but is there any real issue other than data delay? A delay in data arriving at the destination is fine with me.

If some runs take longer than the interval between runs, that's fine. The continuous export will catch up. If this constantly happens, then the continuous export will start developing a lag, and will never be able to catch up, of course. In any case, no data is ever skipped.
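To make the catch-up versus permanent-lag distinction concrete, here is a minimal sketch of a toy scheduling model. It is not Kusto's actual scheduler: it simply assumes that each run starts when it is due or when the previous run finishes, whichever is later, and that no run is dropped. The function name and the run durations are made up for illustration.

```python
def simulate_export(interval_min, run_durations_min):
    """Return how many minutes behind schedule each run starts.

    Toy model (an assumption, not Kusto's real scheduler): run i is due at
    i * interval_min and starts at the later of its due time and the moment
    the previous run finished. No run is ever skipped.
    """
    lags = []
    free_at = 0  # time at which the exporter becomes idle again
    for i, duration in enumerate(run_durations_min):
        scheduled = i * interval_min
        start = max(scheduled, free_at)
        lags.append(start - scheduled)
        free_at = start + duration
    return lags


# One slow run (12 min) among fast ones (2 min): lag appears, then drains away.
occasional = simulate_export(5, [2, 12, 2, 2, 2, 2])
# Every run takes 10 min against a 5 min interval: lag keeps growing.
constant = simulate_export(5, [10] * 6)

print(occasional)  # [0, 0, 7, 4, 1, 0]
print(constant)    # [0, 5, 10, 15, 20, 25]
```

In the first case the lag returns to zero after a few runs; in the second it grows by 5 minutes per run, which is the "never able to catch up" situation described above.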

Related

I get "Buffered data was truncated after reaching the output size limit.", my session still running

I'm running code to predict and show the result. After a number of iterations and some time, I got the message "Buffered data was truncated after reaching the output size limit." while my session was still running the code cell.
This is a screenshot of what I got:
I have read that the machine keeps running the program in the background and processes the output without displaying it on the Colab page. This is the first time I have faced this issue, and I didn't write any code to save the output to a file. Is there any way to save the output from the background run after the program has finished?

How to run Airflow DAG for specific number of times?

How can I run an Airflow DAG a specified number of times?
I tried using TriggerDagRunOperator, and this operator works for me.
In the callable function we can check states and decide whether to continue or not.
However, the current count and states need to be maintained.
Using the above approach I am able to repeat a DAG run.
I need an expert opinion: is there any other, better way to run an Airflow DAG X number of times?
Thanks.
I'm afraid that Airflow is ENTIRELY about time based scheduling.
You can set a schedule to None and then use the API to trigger runs, but you'd be doing that externally, and thus maintaining the counts and states that determine when and why to trigger externally.
When you say that your DAG may have 5 tasks which you want to run 10 times and a run takes 2 hours and you cannot schedule it based on time, this is confusing. We have no idea what the significance of 2 hours is to you, or why it must be 10 runs, nor why you cannot schedule it to run those 5 tasks once a day. With a simple daily schedule it would run once a day at approximately the same time, and it won't matter that it takes a little longer than 2 hours on any given day. Right?
You could set the start_date to 11 days ago (a fixed date though, don't set it dynamically), and the end_date to today (also fixed) and then add a daily schedule_interval and a max_active_runs of 1 and you'll get exactly 10 runs and it'll run them back to back without overlapping while changing the execution_date accordingly, then stop. Or you could just use airflow backfill with a None scheduled DAG and a range of execution datetimes.
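The bounded-window idea above could be sketched roughly as follows. The dag_id and the two fixed dates are hypothetical; the argument names (start_date, end_date, schedule_interval, max_active_runs) are the ones mentioned in this answer and match Airflow 1.10/2.x, while newer releases use schedule in place of schedule_interval. Airflow is imported lazily inside build_dag() so that the snippet loads even where Airflow is not installed.

```python
from datetime import datetime

# Hypothetical settings for a DAG limited to a fixed window of daily runs.
DAG_KWARGS = {
    "dag_id": "fixed_window_example",      # hypothetical name
    "start_date": datetime(2019, 1, 1),    # fixed date, not computed dynamically
    "end_date": datetime(2019, 1, 12),     # fixed date, 11 days after start_date
    "schedule_interval": "@daily",         # one run per day inside the window
    "max_active_runs": 1,                  # runs go back to back, never overlapping
}


def build_dag():
    """Create the DAG object; requires Apache Airflow (1.10/2.x argument names)."""
    from airflow import DAG  # imported here so the module loads without Airflow
    return DAG(**DAG_KWARGS)
```

In a real DAG file you would call build_dag() at module level (for example dag = build_dag()) so the scheduler can discover it, and then attach your tasks to that DAG.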
Do you mean that you want it to run every 2 hours continuously, but sometimes it will be running longer and you don't want it to overlap runs? Well, you definitely can schedule it to run every 2 hours (0 0/2 * * *) and set the max_active_runs to 1, so that if the prior run hasn't finished the next run will wait then kick off when the prior one has completed. See the last bullet in https://airflow.apache.org/faq.html#why-isn-t-my-task-getting-scheduled.
If you want your DAG to run exactly every 2 hours on the dot [give or take some scheduler lag, yes that's a thing] and to leave the prior run going, that's mostly the default behavior, but you could add depends_on_past to some of the important tasks that themselves shouldn't be run concurrently (like creating, inserting to, or dropping a temp table), or use a pool with a single slot.
There isn't any feature to kill the prior run if your next schedule is ready to start. It might be possible to skip the current run if the prior one hasn't completed yet, but I forget how that's done exactly.
That's basically most of your options there. Also you could create manual dag_runs for an unscheduled DAG; creating 10 at a time when you feel like (using the UI or CLI instead of the API, but the API might be easier).
Do any of these suggestions address your concerns? Because it's not clear why you want a fixed number of runs, how frequently, or with what schedule and conditions, it's difficult to provide specific recommendations.
This functionality isn't natively supported by Airflow.
But by exploiting the meta-db, we can cook up this functionality ourselves:
We can write a custom operator / PythonOperator.
Before running the actual computation, check whether 'n' runs of the task (TaskInstance table) already exist in the meta-db. (Refer to task_command.py for help.)
If they do, just skip the task (raise AirflowSkipException).
This excellent article can be used for inspiration: Use apache airflow to run task exactly once
Note
The downside of this approach is that it assumes historical runs of the task (TaskInstances) will be preserved forever (and correctly).
In practice, though, I've often found task_instances to be missing (we have catchup set to False).
Furthermore, on large Airflow deployments one might need to set up routine cleanup of the meta-db, which would make this approach impossible.
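The check-then-skip logic described above might look something like the sketch below. It deliberately avoids any real Airflow calls: SkipTask is a local stand-in for airflow.exceptions.AirflowSkipException, and count_prior_runs is a hypothetical callback standing in for a query against the TaskInstance table, replaced here by an in-memory list.

```python
class SkipTask(Exception):
    """Stand-in for airflow.exceptions.AirflowSkipException."""


def run_at_most(n, count_prior_runs, work):
    """Run work() unless n earlier runs are already recorded."""
    if count_prior_runs() >= n:
        raise SkipTask("already ran %d times" % n)
    return work()


# In-memory stand-in for the metadata DB (hypothetical; a real operator
# would count rows in the TaskInstance table instead).
history = []


def count_prior_runs():
    return len(history)


def work():
    history.append("success")
    return len(history)


results = []
for _ in range(5):
    try:
        results.append(run_at_most(3, count_prior_runs, work))
    except SkipTask:
        results.append("skipped")

print(results)  # [1, 2, 3, 'skipped', 'skipped']
```

As the note says, this only works while the run history is intact: if the stored history is cleaned up, the count drops and the task starts running again.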

timing queries in gremlin-console

I am trying to compare the response times of my queries in gremlin-console (the graph database is JanusGraph, and the backend database is HBase). To do that, there is the "clock()" step, which can run the query multiple times and return the average response time.
But as stated in the documentation, there is a "warm up" phase:
The warm up simply consists of running the query one time before
timing starts. This means that for a single timing iteration, the
human perceived time will be roughly double the time returned by the
clock analysis.
Because of that warm up phase, all the graph needed for the traversal is always in the cache, which will not be true in the real world.
For example, the query I am working on takes 6 minutes to complete because there is a lot of data to fetch from the HBase backend, but the clock() step displays an execution time of 10s, which could only be true in the best-case scenario.
Is there another, better way to get a correct execution time for my queries using gremlin-console?
I think you can still use clock(). Just roll back the transaction between executions:
clock { g.V().iterate();g.tx().rollback() }

How to use VSTS Loadtest Goal based load pattern to achieve a constant test per second

I am using Visual Studio TS Load Test for running WebTest (one client/controls hitting one server). How can I configure goal based load pattern to achieve a constant test / second?
I tried to use the counter 'Tests/Sec' under 'LoadTest:Test' but it does not seem to do anything.
I've recently tested against the Tests / Sec, and I've confirmed it working.
For the settings on the Goal Based Load Pattern, I used:
Category: LoadTest:Test
Counter: Tests/Sec
Instance: _Total
When the load test starts, verify it doesn't show an error re: not being able to access that Performance Counter.
Tests I ran for my own needs:
1. Set the Initial User Load quite low (10), and gave it 5 minutes to see if it would reach the Tests / Sec target and stabilise. In my case, it stabilised after about 1 minute 40.
2. Set the Maximum User Count [Increment|Decrement] to 50. It turned out the user load would yo-yo up and down quite heavily, as it kept trying to play catch-up (the tests took 10-20 seconds each).
3. Set the Initial User Load quite close to the 'answer' from test 1, and watched it make small but constant adjustments to the user volume.
Note: When watching the stats, watch the value under "Last". I believe the "Average" is averaged over a relatively long period, and may appear out of step with the target.

QTimer firing issue in QGIS(Quantum GIS)

I have been involved in building a custom QGIS application in which live data is to be shown in the application's viewer.
The IPC being used is Unix message queues.
The data is to be refreshed at a specified interval, say 3 seconds.
The problem I am facing is that processing the data to be shown takes more than 3 seconds. So before the app starts to process data for the next update, I stop the refresh QTimer, and after the data is processed I restart it. The app should work in such a way that after an update/refresh (during which the app goes unresponsive) the user gets ample time to continue working in the app, apart from seeing the data being updated. I am able to get acceptable pauses for the user to work in one scenario.
But on a different OS (RHEL 5.0 to RHEL 5.2) the situation is different. The timer goes wild and continues to fire without any pause between successive updates, thus going into an infinite loop. Handling the update data definitely takes longer than 3 seconds, but for that very reason I stop and restart the timer around the processing, and the same logic works in one scenario but not in the other. I have also observed that when this rapid firing happens, the time taken by the refresh function to exit is very small, about 300 ms, so the stop and start of the timer that I have placed at the start and end of this function happen within that small time. So before the actual processing of the data finishes, there are 3-4 timer starts queued up waiting to be executed, and the infinite looping problem gets worse from that point with every successive update.
The important thing to note here is that for the same code, on one OS the refresh time is reported as around 4000 ms (the actual processing time for the same amount of data), while on the other OS it is 300 ms.
Maybe this has something to do with newer libraries on the updated OS, but I don't know how to debug it because I have no clues as to why it is happening. Maybe something related to pthreads has changed between the OS versions?
So my question is: is there any way to ensure that some processing in my app runs on a timer (independently of the OS) without using QTimer, as I think QTimer is not a good option for what I want?
What options are there? pthreads or Boost threads? Which would be better if I am to use threads as an alternative? And how can I ensure a gap of at least 3 seconds between successive updates, no matter how long the update processing takes?
Kindly help.
Thanks.
If I was trying to get an acceptable, longer-term solution, I would investigate updating your display in a separate thread. In that thread, you could paint the display to an image, updating as often as you desire... although you might want to throttle the thread so it doesn't take all of the processing time available. Then in the UI thread, you could read that image and draw it to screen. That could improve your responsiveness to panning, since you could be displaying different parts of the image. You could update the image every 3 seconds based on a timer (just redraw from the source), or you could have the other thread emit a signal whenever the new data is completely refreshed.
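As a rough illustration of the separate-thread idea, here is a minimal sketch in plain Python rather than Qt, so it is an analogue and not the application's actual code. A background worker recomputes the data, publishes the result, and then rests for a minimum gap measured from the end of processing, so the gap holds however long processing takes. The class name Refresher, the fake_process function and the short timings are all made up for illustration.

```python
import threading
import time


class Refresher:
    """Background worker: process, publish the result, then rest for min_gap seconds."""

    def __init__(self, process, min_gap):
        self._process = process
        self._min_gap = min_gap
        self._lock = threading.Lock()
        self._latest = None
        self._stop = threading.Event()
        self.start_times = []   # when each processing pass began
        self.finish_times = []  # when each processing pass ended
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def latest(self):
        """Called from the UI side: return the most recent completed result."""
        with self._lock:
            return self._latest

    def _run(self):
        while not self._stop.is_set():
            self.start_times.append(time.monotonic())
            result = self._process()          # may take longer than min_gap
            with self._lock:
                self._latest = result
            self.finish_times.append(time.monotonic())
            self._stop.wait(self._min_gap)    # gap counted from the end of processing


def fake_process():
    time.sleep(0.02)  # stands in for the slow data processing
    return time.monotonic()


refresher = Refresher(fake_process, min_gap=0.05)
refresher.start()
time.sleep(0.3)       # the "UI" stays free to do other work meanwhile
refresher.stop()

gaps = [s - f for f, s in zip(refresher.finish_times, refresher.start_times[1:])]
print(len(refresher.finish_times), min(gaps))
```

In a Qt application the same shape would presumably be a worker thread that emits a signal when a refresh completes, with the UI thread only reading the finished result.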
