Is it possible to add additional input to a later step of an mrjob? - emr

I have an mrjob that consists of 3 steps.
The second step expects as input the results of the first step plus some more content from S3.
I understand that I can always "stream" it through the first step, meaning emit is as is, and only use it in the second step, but I would like to avoid this.
Is there a way to define additional input to later steps in mrjob?

Instead of grouping the steps into a single job, you might consider using a persistent job flow to separate your task into the parts before and after the secondary input:
Re-use Amazon Elastic MapReduce instance
http://pythonhosted.org/mrjob/guides/emr-advanced.html

Related

In Airflow, what do they want you to do instead of pass data between tasks

In the docs, they say that you should avoid passing data between tasks:
This is a subtle but very important point: in general, if two operators need to share information, like a filename or small amount of data, you should consider combining them into a single operator. If it absolutely can’t be avoided, Airflow does have a feature for operator cross-communication called XCom that is described in the section XComs.
I fundamentally don't understand what they mean. If there's no data to pass between two tasks, why are they part of the same DAG?
I've got half a dozen different tasks that take turns editing one file in place, and each send an XML report to a final task that compiles a report of what was done. Airflow wants me to put all of that in one Operator? Then what am I gaining by doing it in Airflow? Or how can I restructure it in an Airflowy way?
fundamentally, each instance of an operator in a DAG is mapped to a different task.
This is a subtle but very important point: in general if two operators need to share
information, like a filename or small amount of data, you should consider combining them
into a single operator
the above sentence means that if you want any information that needs to be shared between two different tasks then it is best you could combine them into one task instead of using two different tasks, on the other hand, if you must use two different tasks and you need to pass some information from one task to another then you can do it using
Airflow's XCOM, which is similar to a key-value store.
In a Data Engineering use case, file schema before processing is important. imagine two tasks as follows :
Files_Exist_Check : the purpose of this task is to check whether particular files exist in a directory or not
before continuing.
Check_Files_Schema: the purpose of this task is to check whether the file schema matches the expected schema or not.
It would only make sense to start your processing if Files_Exist_Check task succeeds. i.e. you have some files to process.
In this case, you can "push" some key to xcom like "file_exists" with the value being the count of files present in that particular directory in Task Files_Exist_Check.
Now, you "pull" this value using the same key in Check_Files_Schema Task, if it returns 0 then there are no files for you to process hence you can raise exception and fail the task or handle gracefully.
hence sharing information across tasks using xcom does come in handy in this case.
you can refer following link for more info :
https://www.astronomer.io/guides/airflow-datastores/
Airflow - How to pass xcom variable into Python function
What you have to do for avoiding having everything in one operator is saving the data somewhere. I don't quite understand your flow, but if for instance, you want to extract data from an API and insert that in a database, you would need to have:
PythonOperator(or BashOperator, whatever) that takes the data from the API and saves it to S3/local file/Google Drive/Azure Storage...
SqlRelated operator that takes the data from the storage and insert it into the database
Anyway, if you know which files are you going to edit, you may also use jinja templates or reading info from a text file and make a loop or something in the DAG. I could help you more if you clarify a little bit your actual flow
I've decided that, as mentioned by #Anand Vidvat, they are making a distinction between Operators and Tasks here. What I think is that they don't want you to write two Operators that inherently need to be paired together and pass data to each other. On the other hand, it's fine to have one task use data from another, you just have to provide filenames etc in the DAG definition.
For example, many of the builtin Operators have constructor parameters for files, like the S3FileTransformOperator. Confusing documentation, but oh well!

Using response data from one scenario to another

With Karate, I'm looking to simulate an end-to-end test structure where I do the following:
Make a GET request to specific data
Store a value as a def variable
Use that information for a separate scenario
This is what I have so far:
Scenario: Search for asset
Given url "https://foo.bar.buzz"
When method get
Then status 200
* def responseItem = $.items[0].id // variable initialized from the response
Scenario: Modify asset found
Given url "https://foo.bar.buzz/" + responseItem
// making request payload
When method put.....
I tried reading the documentation for reusing information, but that seemed to be for more in-depth testing.
Thoughts?
It is highly recommended to model flows like this as one scenario. Please refer to the documentation: https://github.com/intuit/karate#script-structure
Variables set using def in the Background will be re-set before every
Scenario. If you are looking for a way to do something only once per
Feature, take a look at callonce. On the other hand, if you are
expecting a variable in the Background to be modified by one Scenario
so that later ones can see the updated value - that is not how you
should think of them, and you should combine your 'flow' into one
scenario. Keep in mind that you should be able to comment-out a
Scenario or skip some via tags without impacting any others. Note that
the parallel runner will run Scenario-s in parallel, which means they
can run in any order.
That said, maybe the Background or hooks is what you are looking for: https://github.com/intuit/karate#hooks

Do I need a separate command when using an oracle?

I want to issue something e.g. a new option. Inside the flow where I'm issuing this new option, I need to get info from two separate oracles that need to provide data for the output state.
How should I do this... should I have one output and 3 commands? command with data from Oracle 1, command with data from Oracle 2 and then the issue command? or can this be done with one command?
It's entirely up to you - your command can contain whatever you data you want, so in theory, you could do the whole with one command.
Having said that, I would probably split it out into at least two commands for clarity and privacy. The privacy element is you can build a filtered transaction for the oracle to sign that only contains the oracle command.
If you don't mind the two oracles seeing the data sent to each for signing, you could encapsulate the data in one command e.g.
class OracleCommand(val spotPrice: SpotPrice, val volatility: Volatility) : CommandData
Where one oracle attests to spotPrice and the other to Volatility.
However, you would find it hard to determine what part of the data they attested to since they will both sign over the entire filtered transaction.
Unless you knew the design of the oracle could specifically pick out the correct data, you're probably better off going with three separate commands.

Is there a way to define a trigger that runs reliably at a datetime specified as a field in the updated/created object?

The question
Is it possible (and if so, how) to make it so when an object's field x (that contains a timestamp) is created/updated a specific trigger will be called at the time specified in x (probably calling a serverless function)?
My Specific context
In my specific instance the object can be seen as a task. I want to make it so when the task is created a serverless function tries to complete the task and if it doesn't succeed it updates the record with the partial results and specifies in a field x when the next attempt should happen.
The attempts should not span at a fixed interval. For example, a task may require 10 successive attempts at approximately every 30 seconds, but then it may need to wait 8 hours.
There currently is no way to (re)trigger a Cloud Function on a node after a certain timespan.
The closest you can get is by regularly scheduling a cron job to run on the list of tasks. For more on that, see this sample in the function-samples repo, this blog post by Abe, and this video where Jen explains them.
I admit I never like using this cron-job approach, since you have to query the list to find the items to process. A while ago, I wrote a more efficient solution that runs a priority queue in a node process. My code was a bit messy, so I'm not quite ready to share it, but it wasn't a lot (<100 lines). So if the cron-trigger approach doesn't work for you, I recommend investigating that direction.

How to keep the sequence file created by map in hadoop

I'm using Hadoop and working with a map task that creates files that I want to keep, currently I am passing these files through the collector to the reduce task. The reduce task then passes these files on to its collector, this allows me to retain the files.
My question is how do I reliably and efficiently keep the files created by map?
I know I can turn off the automatic deletion of map's output, but that is frowned upon are they any better approaches?
You could split it up into two jobs.
First create a map only job outputting the sequence files you want.
Then, taking your existing job (doing really nothing in the map anymore but you could do some crunching depending on your implementation & use cases) and reducing as you do now inputting the previous map only job through as your input to the second job.
You can wrap this all up in one jar running the 2 jars as such passing the output path as an argument to the second jobs input path.

Resources