Is there a way to define human in the loop tasks

Is there a way to define human in the loop tasks - airflow

In apache airflow: What I want is to show a small web ui / form to a human to let him make a decision that we cannot implmement in code (yet). I haven't found anything in the docs or with google for that yet - so is there any architectural problem that would prevent that?

At this point in time, there is no such thing in Airflow.
(thinking out loud) You could set up a form yourself, which sets a True/False somewhere, such as a file or database. In Airflow, you'd then use a sensor to check whether True or False is set. A few examples; you could use the SqlSensor for a database query, the S3KeySensor for a file to be present on AWS S3, or the generic PythonSensor to provide your own condition.

Related

In Airflow, what do they want you to do instead of pass data between tasks

In the docs, they say that you should avoid passing data between tasks:
This is a subtle but very important point: in general, if two operators need to share information, like a filename or small amount of data, you should consider combining them into a single operator. If it absolutely can’t be avoided, Airflow does have a feature for operator cross-communication called XCom that is described in the section XComs.
I fundamentally don't understand what they mean. If there's no data to pass between two tasks, why are they part of the same DAG?
I've got half a dozen different tasks that take turns editing one file in place, and each send an XML report to a final task that compiles a report of what was done. Airflow wants me to put all of that in one Operator? Then what am I gaining by doing it in Airflow? Or how can I restructure it in an Airflowy way?

fundamentally, each instance of an operator in a DAG is mapped to a different task.
This is a subtle but very important point: in general if two operators need to share
information, like a filename or small amount of data, you should consider combining them
into a single operator
the above sentence means that if you want any information that needs to be shared between two different tasks then it is best you could combine them into one task instead of using two different tasks, on the other hand, if you must use two different tasks and you need to pass some information from one task to another then you can do it using
Airflow's XCOM, which is similar to a key-value store.
In a Data Engineering use case, file schema before processing is important. imagine two tasks as follows :
Files_Exist_Check : the purpose of this task is to check whether particular files exist in a directory or not
before continuing.
Check_Files_Schema: the purpose of this task is to check whether the file schema matches the expected schema or not.
It would only make sense to start your processing if Files_Exist_Check task succeeds. i.e. you have some files to process.
In this case, you can "push" some key to xcom like "file_exists" with the value being the count of files present in that particular directory in Task Files_Exist_Check.
Now, you "pull" this value using the same key in Check_Files_Schema Task, if it returns 0 then there are no files for you to process hence you can raise exception and fail the task or handle gracefully.
hence sharing information across tasks using xcom does come in handy in this case.
you can refer following link for more info :
https://www.astronomer.io/guides/airflow-datastores/
Airflow - How to pass xcom variable into Python function

What you have to do for avoiding having everything in one operator is saving the data somewhere. I don't quite understand your flow, but if for instance, you want to extract data from an API and insert that in a database, you would need to have:
PythonOperator(or BashOperator, whatever) that takes the data from the API and saves it to S3/local file/Google Drive/Azure Storage...
SqlRelated operator that takes the data from the storage and insert it into the database
Anyway, if you know which files are you going to edit, you may also use jinja templates or reading info from a text file and make a loop or something in the DAG. I could help you more if you clarify a little bit your actual flow

I've decided that, as mentioned by #Anand Vidvat, they are making a distinction between Operators and Tasks here. What I think is that they don't want you to write two Operators that inherently need to be paired together and pass data to each other. On the other hand, it's fine to have one task use data from another, you just have to provide filenames etc in the DAG definition.
For example, many of the builtin Operators have constructor parameters for files, like the S3FileTransformOperator. Confusing documentation, but oh well!

Firebase/Firestore Testing Pattern: mock fake data without code duplication (DRY)

This is a question about programming style and patterns when it comes to writing tests for complex systems written in Firebase/Firestore. I'm writing a web app using Firebase, Typescript, Angular, Firestore, etc...
Objective:
I have written basic security rules tests to test my users collection. I'll also be testing the function that writes the document, and e2e testing that the document is written when a new user signs up. So far so good.
The tests are clean so far - I manually define a few user objects, write it to the database before the tests with beforeEach() and run the tests. The tests depend on me knowing what data I wrote in the first place - what were the document ids, what were the field values etc... and then I check for certain operations to pass, certain operations to fail, depending on the custom claims provided. Similarly, with functions tests and e2e tests, I'll be checking if correct data was written, generated, etc...
The next step would be to test the chat functionality. Here's where I run into code duplication issues.
Issue:
So let's say I want to test the chat functionality or the transaction functionality. For these modules to be tested, the database needs to have fake user data, transaction data, test data etc... and, furthermore, within the tests, I need to have access to the fields, document Ids, etc.. that were written to the database, so I can access the documents for tests, etc...
For example, whether or not a chat message can be written to firestore depends on whether certain fields exist on the user document.
This would require me to manually define all the objects I plan to write to the database in the same file as the tests themselves, so I have access to what I wrote. As I test more complex and dependent modules of the system, since each test for each module is in a separate file, I either have to manually write out each object, or require it from another test file and write it. Each document has its payload and its Id that I need to keep track of, and even the fake user token objects I have to pass to firestore (or actual auth user records I have to import for online testing). This would mean a whole bunch of boilerplate code and duplication simply writing objects as such in all of my files:
const fakeUserPayload: User = {
handle: 'username',
email: faker.internet.email(),
...etc
}
// And so on and on for all test users, chat docs, transaction docs, etc...
Potential Solution: so there are a few potential solutions I came up with, but none seem to solve the problem.
For example, I thought I would write a module that simply populates the firestore database at the start of each test. The module would have a userPayloadFactory and other loops (using faker module) to automatically populate the Db with fake data.
Problem: If I did this, I wouldn't have access to the document Ids and field data in my tests, since it's automatically generated. I thought, maybe I could populate firestore with fake data, and then use an administrative db connection to read the fake user documents and their Ids back into the tests, and then use this data to conduct the tests. For example, I would find the user id and then generate a chat document and test for correct data / success. Except it seems incredibly messy to write data in one module and then read it back in another, especially since most tests require a specific document to be written to test for certain cases/scenarios. Which makes auto-populated mock data useless - so we're back to square one, where I have to manually define and write out a large number of fake objects in order to test rules, functions and functionality.
Potential Solution: I could maybe keep auth and firestore data in a JSON backup file, (so they remain static) and import them into the database with a shell command before each test suite. However, this has its own issues as it's not easy to dynamically generate new test cases or edge cases, and also difficult to continually re-export and update the JSON backup files as the project grows.
What is a better way to structure and write my tests so that I can automatically generate the documents and payloads I need, while having control and access over what gets written?
I'm hoping there's some kind of factory or pattern that can make this easier, scalable and more consistent and robust.

You're asking a really huge question, writing tests for big environments is a complex task and even more when it's coupled to database state. I will try to answer to the best of my knowledge so take my words with a grain of salt.
I believe you're dealing with two similar yet different concerns, automatic creation of mock data and edge-cases that require very specific document setup. These two are tightly coupled within the tests itself as you need both kinds of data to run them, however the requirements differ one another and therefore their creation should take that in account. Let's talk about your potential solutions from that perspective.
The JSON backup provides a static and consistent dataset that allows to repeat the tests over time being sure the environment hasn't changed and it's a good candidate to address the edge-case problem. It's downsides are that is hardly maintainable because any object modification to accommodate changes in TestA may break the expectations of TestB that also relies on it, it's almost assured you will loss the track of these nuances at some point; You can add new objects to accommodate code and test changes but this could lead to a combinatoric explosion of the objects you need to take care of as your project grows. Finally JSON files are not the way to go if you are going to require dynamically generated data.
The factory method is a great option to deal with the creation of arbitrary mock data since there are less restrictions placed on it, so writing a generator seems a good idea. You disliked this based on the fact that you need to know your data while running the tests but I think that's solvable. Your test might load the Factory module, then create the data and store it in-memory/HDD in addition to commit this changes to Firestore, there's no need to read the data from the database.
Your other other concern were the corner case documents which is trickier because you need very specifically shaped data. You might handcraft the documents yourself but then you got a poorly scalable solution. The alternative is trying to look for constraints/invariants in the shape of edge-case documents that you can abstract into factory methods. The worst scenario here is that when some edge-cases do not share any similarity with the rest you will need to write a whole method for each of these. I won't consider this a downside as it improves the modularity and maintainability of the Factory .
Overall, I would stick with the Factory pattern because it already offers techniques to follow the DRY principle by the means of isolating the creation of distinct objects, decoupling data creation from test execution and facilities to avoid disruptive breaks as the tests evolve with the project.
With that being said a little research got me to this page about the Builder Pattern that you may find interesting. Also this thread about code duplication in tests might be of interest. And finally just to comment out that Firebase has some testing functionality that can be found here.
Hope this helps.

How can one set a variable for use only during a certain dag_run

How do I set a variable for use during a particular dag_run. I'm aware of setting values in xcom, but not all the operators that I use has xcom support. I also would not like to store the value into the Variables datastore, in case another dag run begins while the current one is running, that need to store different values.

The question is not clear, but from whatever I can infer, I'll try to clear your doubts
not all the operators that I use has xcom support
Apparently you've mistaken xcom with some other construct because xcom feature is part of TaskInstance and the functions xcom_push() and xcom_pull() are defined in BaseOperator itself (which is the parent of all Airflow operators)
I also would not like to store the value into the Variables datastore,
in case another dag run begins while the current one is running, that
need to store different values.
It is straightforward (and no-brainer) to separate-out Variables on per-DAG basis (see point (6)); but yes for different DagRuns of a single DAG, this kind of isolation would be a challenge. I can think of xcom to be the easiest workaround for this. Have a look at this for some insights on usage of Xcoms.
Additionally, if you want to manipulate Variables (or any other Airflow model) at runtime (though I would recommend you to avoid it particularly for Variables), Airflow also gives completely liberty to exploit the underlying SQLAlchemy ORM framework for that. You can take inspiration from this.

Can you get a log file of 'reads' on specific RECID(Tablename) in Progress-4GL/Openedge at RunTime without access to Source Code?

I want to know which tables are being read by a query.
for each Customer where CustomerID = 12345.
Eventually this customer will be found in the following example, but progress must 'read' many tables before getting to customer 12345.
How do I know exactly which tables are read (By CustomerID), prior to getting to customer 12345?
*NOTE: I do not have access to modify the code being run for this selection. Ideally I would run a separate set of code that is executed at the same time as the customer query above to track the reads.
EDIT: More clearly - Can you track reads from a given program (.p) OR ProcessID and output either a RECID or the PrimaryKey to a file?
I understand the information is being read off the Disk and probably stored in a database buffer. So how would I get at the information in the database buffer?

You seem to be mixing up a few different things.
In a situation like your example where you FIND a specific record in one, and only one table then there is just a single record read. Progress will find that record by first scanning a relevant index. That might be 2 or 3 "logical reads" of the b-tree to get to the proper node. The record block and index blocks may, or may not be read from disk - that depends on what has happened previously.
There are "Virtual System Tables" available that can tell you how many READ operations take place against a particular table or index. But they do not trace the specific ROWID or other identifying data. _TableStat and _IndexStat are aggregates for all users on the system, _UserTableStat and _UserIndexStat are specific to a particular user's activity. You do need to set the -tablerangesize and -indexrangesize parameters adequately to take advantage of these.
If you have enabled the table and index statistics then you can use a tool like ProTop - http://protop.wss.com to get insight into this activity. Or you can write your own code.
OpenEdge Auditing does not track reads. That would be prohibitively expensive.
It's probably not really a good idea but, in theory, you could write FIND triggers for the tables you are interested in. That doesn't require access to the application source but you would need a development license. It will probably kill performance to do this though - so unless this is a non-production test environment that you just want to fiddle with I wouldn't really do that.
You mention wanting to know how you got to that point. That sounds more like you might need to have a "4gl trace". One easy way to get the stack trace of a running process is to execute:
$DLC/bin/proGetStack PID (UNIX)
or
%DLC%\bin\proGetStack PID (Windows)
This command will generate a "protrace.pid" file containing a 4gl stack trace and other interesting information.
There are also more complicated ways to get that info like using PROMON and the "client statement cache" or setting various log entry types at session startup. But proGetStack is pretty convenient and requires no code or scripting changes.

Some great options from Tom above. And all of them may be relevant to you. The option he only skirts around is the logging options. I feel obliged to expand on this because I'm giving a talk on it in a couple of weeks!
Assuming you are running a modern version of Progress, or even 10.2B08, then you have client logging available to you. Start your session with these additional options:
-clientlog "\somefolder\somefile.txt"
-logentrytypes "QryInfo:3"
This will log all the info of all the queries in your session to the file you specified above. If you navigate to the point in the system where you want to analyse your query and empty the logfile and save it, you can then run the offending query and see all the detail you need.
The output tells you all sorts of useful info, including the number of reads on each table, compared with the number returned to the user. You also get the index selected.
Using Tom's advice and/or this will get you what you need.

How to use multiple namespaces in DoctrineCacheBundle cacheDriver?

I know I can setup multiple namespaces for DoctrineCacheBundle in config.yml file. But Can I use one driver but with multiple namespaces?
The case is that in my app I want to cache all queries for all of my entities. The problem is with flushing cache while making create/update actions. I want to flush only part of my cached queries. My app is used by multiple clients. So when a client updates sth in his data for instance in Article entity, I want to clear cache only for this client only for Article. I could add proper IDs for each query and remove them manually but the queries are dynamically used. In my API mobile app send version number for which DB should return data so I don't know what kind of IDs will be used in the end.

Unfortunately I don't think what you want to do can be solved with some configuration magic. What you want it some sort of indexed cache, and for that you have to find a more powerful tool.
You can take a look at doctrines second level cache. Don't know how good it is now (tried it once when it was in beta and did not make the cut for me).
Or you can build your own cache manager. If you do i recommend using redis. The data structures will help you keep you indexes (Can be simulated with memcached, but it requires more work). What I meen by indexes.
You will have a key like client_1_articles where 1 is the client id. In that key you will store all the ids of the articles of client 1. For every article id you will have a key like article_x where x is the id the of article. In this example client_1_articles is a rudimentary index that will help you, if you want at some point, to invalidated all the caches of articles coming from client 1.
The abstract implementation for the above example will end up being a graph like structure over your cache, with possibly
-composed indexes 'client_1:category_1' => {article_1, article_2}
-multiple indexes for one item eg: 'category_1'=>{article_1, article_2, article_3}, 'client_1' => {article_1, article_3}
-etc.
Hope this help you in some way. At least that was my solution for a similar problem.
Good luck with your project,
Alexandru Cosoi

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex