Submit a new script after all parallel jobs in R have completed

I have an R script that creates multiple scripts and submits them simultaneously to a computer cluster. After all of these scripts have completed and written their output to the respective folders, I would like to automatically launch another R script that works on those outputs.
I haven't been able to figure out whether there is a way to do this in R: the function 'wait' is not what I want, since the scripts are submitted as separate jobs and each completes and writes its output file at a different time, whereas I want to run the subsequent script only after all of the outputs have appeared.
One way I thought of is to count the files that have been created and, once the correct number of output files is there, submit the next script. However, to do this I would have to keep a script running that checks for the presence of the files every now and then, and I am not sure whether that is a good idea, since it probably takes a day or more for the first scripts to complete.
Can you please help me find a solution?
Thank you very much for your help
-fra

I think you are looking at this the wrong way:
This is not an R problem at all; R just happens to be the client of your batch job.
It is an issue that the queue / batch processor on your cluster can address.
Worst case, you could just wait/sleep in a shell (or R) script until a 'final condition reached' file has been touched.
Inter-dependencies can also be expressed with make.
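A minimal sketch of the polling idea in R, assuming the jobs drop their results into a single output directory; the directory name, file pattern, expected count, and follow-up script name are illustrative placeholders, and checking for a single sentinel file would simply replace the count with file.exists():

# Sketch: poll for the expected output files, then launch the follow-up script.
# "outputs", "\\.out$", n_expected and postprocess.R are placeholders -- adjust to your setup.
n_expected <- 100
repeat {
  n_done <- length(list.files("outputs", pattern = "\\.out$"))
  if (n_done >= n_expected) break
  Sys.sleep(600)   # check every 10 minutes; negligible load even if the jobs run for a day or more
}
system("Rscript postprocess.R")   # the script that works on the outputs

If your scheduler supports job dependencies (for example, qsub -W depend=afterok:... on PBS/Torque or sbatch --dependency=afterok:... on Slurm), that is the cleaner route the answer above is pointing at.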

Related

How to keep track of the number of runs of an R script per day

I am currently looking to write a function in R that can keep track of the number of completed runs of an .R file within any particular day. Note that the runs might be conducted at different times of the day. I did some research on this problem and came across this post (To show how many times user has run the script). So far I am unable to build upon the first commenter's code when converting it into R (the main obstacle is replicating the try...except block). However, I need to add the restriction that the count is measured only within a day (exactly from 00:00:00 EST to 24:00:00 EST).
Can someone please offer some help on how to accomplish this goal?
Either I didn't understand the problem, or it is a rather easy one: create a temporary file (use Sys.Date() to name it) and store the current run number there; at the beginning of your .R file, read the temporary file, increment the number, and write the file back.
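A minimal sketch of that idea in R (the counter file name is just for illustration):

# Sketch: per-day run counter stored in a file named after today's date.
counter_file <- paste0("run_count_", Sys.Date(), ".txt")
runs <- if (file.exists(counter_file)) as.integer(readLines(counter_file)[1]) else 0L
runs <- runs + 1L
writeLines(as.character(runs), counter_file)
message("This is run number ", runs, " today")

Because the file name carries the date, the count resets automatically each day. Note that Sys.Date() uses the machine's local time zone, so pinning the day boundary exactly to EST would need something like as.Date(Sys.time(), tz = "America/New_York") instead.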

Airflow - Questions on batch jobs and running a task in a DagRun multiple times

I am trying to solve the following problem with airflow:
I have a data pipeline where I want to run several processes on a number of Excel documents (e.g. 5,000 Excel files a day). My idea for a DAG is below:
Task 1 = Take an Excel file and add a new sheet to it.
Task 2 = Convert the returned Excel file to a PDF.
Tasks 1 and 2 in the DAG would call a processing tool running outside Airflow via an API call (so the actual data processing isn't happening inside Airflow).
I seem to be going around in circles with figuring out the best approach to this workflow. Some questions I keep having are:
Should each DagRun process one Excel file, or should it take in a batch of Excel files?
If taking in a batch (which I presume is the correct approach), what is the recommended batch size?
How would I pass the returned values from Task 1 to Task 2? Would it be an XCom dictionary with a reference to each newly saved Excel file? I read somewhere that the maximum size of an XCom should be 48 KB, so an XCom of 5,000 Excel file paths would probably be larger than that.
The last, trickiest question I have is: I would obviously want to start processing Task 2 as soon as even one Excel file from Task 1 has completed, because I wouldn't want to wait for the entire batch of Task 1 to finish before starting Task 2. How can I run Task 2 multiple times within the same DagRun, once for each new result that Task 1 produces? Or should Task 2 be its own DAG?
Am I approaching this the right way? How should I be tackling this problem?
Assumptions
I made some assumptions since I don't know all the details of the Excel file processing:
You cannot merge the Excel files since you need them separate.
Excel files are accessible from Airflow DAG (same filesystem or similar).
If any of that is not true, please clarify accordingly.
Answers
That being said, I'll first answer your questions and then comment on some thoughts:
I think you can work in batches, since using one run per file would be very slow (mostly because of scheduler overhead, which adds time between the processing of Excel files). You would also not be using all the available resources, so it is better to keep Airflow busier.
The batch size will depend on the processing load and the task design. From your question I assume you're thinking about handling the batch inside the task, but if the service that processes the Excel files can handle good parallelism, I'd rather recommend one task per Excel file. Having 5,000 tasks (one for each file) would be a bad idea (it would be difficult to see in the UI), so the exact number of files per batch depends mostly on your resources and the service SLA.
From my experience I recommend using one task for both steps, since you can call the service in parallel and, right after the service completes, directly transform the Excel file into a PDF; nothing then needs to be passed between tasks.
This is solved by the answer to question #3: because the PDF conversion happens in the same task, each file is converted as soon as its service call completes.
Solution overview
The solution I imagine is something like:
A first task that checks for the existence of pending files. You can fork using a BranchPythonOperator (example here).
Then you have X parallel tasks to process the Excel files (call the service) and transform them to PDF. Each could be a PythonOperator task. If you use Airflow 2, you can simply use the @task() decorator to simplify the code. X could be anywhere from 10 to 100, for example, depending on the resources and the service throughput.
A final task that triggers the DAG again to process more files. This could be implemented using a TriggerDagRunOperator (example here).

Finding total edit time for R files

I am trying to determine how much time I have spent on a project, which has mainly been done in .R files. I know the file.info function will extract metadata on a file for me, but since I have opened it several times over several days, I don't know how to use that information to determine the total editing time. Is there a function that provides this information, or a way to go through the file system to find it?
Just a thought: you could maintain a log file to which you write the following from your R script: start time, stop time, and the R script file name.
You can add simple code to your script that does this. You would then need a separate script that analyses the logs and tells you how much time was spent running the script.
For a single user this would work.
Note: this captures script execution time, not the time spent editing the files. The log would still have merit: you would have a record of when you were working on the script, under the assumption that you run your scripts frequently while developing code.
How about using an old-fashioned time sheet to record development time? Tools such as JIRA are very suitable for that purpose.
For example at the start of the script:
# open in append mode so later entries do not overwrite earlier ones
logFile <- file("log.txt", open = "a")
writeLines(paste0("Scriptname start: ", Sys.time()), logFile)
close(logFile)
And at the end of the script:
# append again so the start entry is preserved
logFile <- file("log.txt", open = "a")
writeLines(paste0("Scriptname stop: ", Sys.time()), logFile)
close(logFile)
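And a sketch of the separate analysis script mentioned above; it assumes the log format written by the snippets (appended start/stop lines in log.txt) and that each start has a matching stop:

# Sketch: pair up the start/stop lines in log.txt and sum the elapsed time.
log_lines <- readLines("log.txt")
starts <- as.POSIXct(sub(".*start: ", "", grep("start: ", log_lines, value = TRUE)))
stops  <- as.POSIXct(sub(".*stop: ",  "", grep("stop: ",  log_lines, value = TRUE)))
n <- min(length(starts), length(stops))   # ignore a run that is still in progress
total <- sum(difftime(stops[seq_len(n)], starts[seq_len(n)], units = "mins"))
cat("Total recorded run time:", round(as.numeric(total), 1), "minutes\n")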

synchronize multiple map reduce jobs in hadoop

I have a use case where multiple jobs can run at the same time. The output of all the jobs has to be merged into a common master file in HDFS (containing key-value pairs) that has no duplicates. I'm not sure how to avoid the race condition that could crop up in this case. For example, Job 1 and Job 2 might simultaneously write the same value to the master file, resulting in duplicates. I'd appreciate your help on this.
Apache Hadoop doesn't support parallel writing to the same file. Here is the reference.
Files in HDFS are write-once and have strictly one writer at any time.
So multiple maps/jobs can't write to the same file simultaneously. A separate job, shell script, or other program has to be written to merge the outputs of the multiple jobs.

How to keep the sequence file created by map in hadoop

I'm using Hadoop and working with a map task that creates files that I want to keep. Currently I am passing these files through the collector to the reduce task, and the reduce task then passes them on to its own collector; this allows me to retain the files.
My question is: how do I reliably and efficiently keep the files created by map?
I know I can turn off the automatic deletion of the map output, but that is frowned upon. Are there any better approaches?
You could split it up into two jobs.
First, create a map-only job that outputs the sequence files you want.
Then take your existing job (its map now does essentially nothing, though you could still do some crunching depending on your implementation and use cases) and reduce as you do now, feeding the output of the previous map-only job in as the input to this second job.
You can wrap this all up in one jar that runs the two jobs in sequence, passing the first job's output path as an argument for the second job's input path.
