Using an already set-up web scraper with Airflow?

The company I'm currently interning for wants me to schedule their existing web scraper on Airflow. I have zero experience with web scrapers and Airflow, so I'm writing to ask for some help.
First of all, the web scraper uses Celery, Selenium, and RabbitMQ, and it works perfectly fine without Airflow. My question is: to move this process onto Airflow, do I just import the functions into the Airflow DAG I want to create and call them in the order we want them to run? Or is that a very simplistic view of things? Is there anything I need to keep in mind? I have been trying to read up on Airflow for the past week, but I can't seem to make the leap to scaling it to fit the company's code.
Apologies for the complete noob question.

If you already have it working as a script, all you have to do is import the function that runs the entire pipeline into a DAG and schedule it at whatever interval you want.
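For a concrete picture, a minimal DAG along those lines might look like the sketch below. The scraper module and run_scraper entry point are stand-ins for whatever the existing code actually exposes, so treat this as an assumption-laden outline rather than a drop-in file:

# dags/scraper_dag.py - a minimal sketch; "scraper" and "run_scraper" are hypothetical
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from scraper import run_scraper  # assumed entry point of the existing scraper

with DAG(
    dag_id="web_scraper",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",  # run once a day
    catchup=False,               # don't backfill missed intervals
) as dag:
    PythonOperator(
        task_id="run_scraper",
        python_callable=run_scraper,  # Airflow calls this on each scheduled run
    )

One thing to keep in mind: the scraper's own Celery/RabbitMQ machinery is separate from Airflow's optional CeleryExecutor, and the callable should block until the scrape actually finishes; otherwise Airflow will mark the task successful as soon as the work has merely been queued.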

Related

Creating a multi-threaded API in R

I'm thinking about creating an API with an R script doing all my work.
But we know that R is single-threaded, and so is plumber.
Does anyone know a way to create a multi-threaded API in R? I don't think a single-threaded API will work in my case: I may have many users hitting my model in production, so I'm afraid to use plumber.
I've seen people suggest using Rserve and Java to create a multi-threaded API (but I don't know how to do that).
Any suggestions or links about this topic are welcome.
Thank you all!
Just to document what's possible for other people who may have the same question and don't know how to do this, I'll put here some links I found that can be helpful.
I haven't tested any of these ideas yet, so I can't say which one is faster, cheaper, or more optimized.
But here's what I found:
You can use https://restrserve.org/, an alternative to plumber; even when plumber was single-threaded only, RestRserve could already be used to create multi-threaded APIs.
8 days ago, plumber released version 1.0.0, which can support multi-threaded APIs. Link to the release: https://www.rplumber.io/news/index.html#plumber-router
Even in the past, without plumber v1.0.0, or if you don't want to use RestRserve, you could create a single-threaded API with plumber, use Docker to build an image of your API, and then use Kubernetes to manage the requests: Kubernetes creates "copies" of your API and routes each incoming request to one of the copies.
A talk at rstudio::conf 2020 about a model that gets 1,000,000 hits per day through an R API: https://rstudio.com/resources/rstudioconf-2020/we-re-hitting-r-a-million-times-a-day-so-we-made-a-talk-about-it/

Reusing Tasks (operators) across DAGs

I'm exploring Airflow as a workflow execution engine.
I like the fact that it's very flexible and supports multiple operator types, such as running Python functions. However, I'm afraid I may be missing something fundamental: task reuse. I want to run existing operators in multiple DAGs without having to redefine them.
As far as I can tell, this is not supported. Am I wrong? If so, I'd be happy if someone could point me to a solution.
The only (awkward) solution that comes to mind is to have a dummy DAG for every operator, and then build my DAG on top of these dummy DAGs with a TriggerDagRunOperator.
Many thanks!
The recommended way to achieve this would be to create your own Airflow plugin.
(From the Airflow Documentation on Plugins)
Airflow has a simple plugin manager built-in that can integrate external features to its core by simply dropping files in your $AIRFLOW_HOME/plugins folder.
The python modules in the plugins folder get imported, and hooks, operators, sensors, macros, executors and web views get integrated to Airflow’s main collections and become available for use.
So if you were to create a custom operator in your plugin, you would be able to re-use that same operator across multiple DAGs.
This repo may also be helpful as it has quite a few examples of Airflow plugins: https://github.com/airflow-plugins
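As an illustration, a reusable operator dropped into that plugins folder could be as small as the sketch below. All names are hypothetical, and the exact import path you use in your DAG files depends on your Airflow version:

# $AIRFLOW_HOME/plugins/greet_operator.py - a minimal reusable operator
from airflow.models import BaseOperator

class GreetOperator(BaseOperator):
    """Logs a greeting; stands in for whatever shared logic your DAGs need."""

    def __init__(self, name, **kwargs):
        super().__init__(**kwargs)
        self.name = name

    def execute(self, context):
        # Runs once per task instance; the return value is pushed to XCom.
        self.log.info("Hello, %s!", self.name)
        return self.name

Each DAG then just imports GreetOperator and instantiates it with its own task_id, so the operator's logic is defined exactly once.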

Django Web application using R analysis

I have done some data analysis in R, and now I would like to display the results and visualizations in a Django web application. How should I do it?
1) Save the results in a database and build the Django app independently, displaying the results by fetching them from the database.
2) I am not sure, but what purpose does rpy2 serve here? Should I call my R function from Python and build the Django app around that? (Pardon me if this point doesn't make sense.)
From experience, using rpy2 is a bit of a mess. I would go with option 1 if you have time. If you're in a rush, you can use rpy2; both techniques will work, but the first one is probably less dirty.
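For illustration, option 1 could be wired up with a Django management command that runs the existing R script and loads its output into the database. The results app, AnalysisResult model, analysis.R path, and the JSON shape are all assumptions in this sketch:

# results/management/commands/run_r_analysis.py - hypothetical glue for option 1
import json
import subprocess

from django.core.management.base import BaseCommand

from results.models import AnalysisResult  # assumed fields: metric (CharField), value (FloatField)

class Command(BaseCommand):
    help = "Run the R analysis and store its output in the database."

    def handle(self, *args, **options):
        # The R script is assumed to print a JSON array such as
        # [{"metric": "rmse", "value": 0.42}, ...] to stdout.
        completed = subprocess.run(
            ["Rscript", "analysis.R"],
            capture_output=True, text=True, check=True,
        )
        for row in json.loads(completed.stdout):
            AnalysisResult.objects.create(metric=row["metric"], value=row["value"])

The command can then be run from cron or a task scheduler, and the Django views simply query AnalysisResult, so no R code ever runs in the request path.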
You might also want to take a look at some R packages for building web applications, like this one?

Is It Possible To Use Celery With Another Programming Language?

I heard about Celery and I really like it. But now I'm writing an application with node.js in which I have to manage (asynchronous) tasks, and I want to use Celery for this. I've installed it in my development environment and played around with some Python scripts. It all works well, but I want to "call" the tasks from node.js. Has anyone tried to do something like this (with any programming language)?
I saw this example, but the base of that HTTP gateway idea is a Django application, and I don't want to create a Django app only to handle these calls.
I thought about creating a SimpleXMLRPCServer and using the node-xmlrpc module to connect to it. What do you think? Is there a better way to do this? Is there another app or service that works natively with node.js?
Thanks in advance.
Celery will force you to inherit a whole Python stack for a simple message queue - seems like a messy pain to me. Check out coffee-resque for a simple and native solution.
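For completeness, the SimpleXMLRPCServer bridge floated in the question could be sketched as below. The tasks module, the add task, and the port are all made up, and this only enqueues work; fetching results would need a second method or a result-backend check:

# rpc_bridge.py - a sketch of dispatching Celery tasks over XML-RPC (Python 3)
from xmlrpc.server import SimpleXMLRPCServer

from tasks import add  # hypothetical Celery task: @app.task def add(x, y): ...

def dispatch_add(x, y):
    # Enqueue the task and return its id so the caller can poll for the result.
    return add.delay(x, y).id

server = SimpleXMLRPCServer(("localhost", 8000), allow_none=True)
server.register_function(dispatch_add, "add")
server.serve_forever()

The Node side would then call the add method through node-xmlrpc.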

How do you handle scheduled tasks for your websites running on IIS?

I have a website that's running on a Windows server and I'd like to add some scheduled background tasks that perform various duties. For example, the client would like users to receive emails that summarize recent activity on the site.
If sending out emails was the only task that needed to be performed, I would probably just set up a scheduled task that ran a script to send out those emails. However, for this particular site, the client would like a variety of different scheduled tasks to take place, some of them always running and some of them only running if certain conditions are met. Right now, they've given me an initial set of things they'd like to see implemented, but I know that in the future there will be more.
What I am wondering is if there's a simple solution for Windows that would allow me to define the tasks that needed to be run and then have one scheduled task that ran daily and executed each of the scheduled tasks that had been defined. Is a batch file the easiest way to do this, or is there some other solution that I could use?
To keep life simple, I would avoid building one big monolithic exe and instead break the work into individual tasks, with a Windows scheduled task for each one. That way you can maintain the codebase more easily and change functionality at a more granular level.
You could, later down the line, build a Windows service that dynamically loads plugins for each different task based on a schedule. That may be more reusable for future projects.
But to be honest, if you're on a deadline, I'd apply the KISS principle and go with one scheduled task per task.
I would go with a Windows service right out of the gate. This is going to be the most extensible approach for your requirements, creating the service isn't going to add much to your development time, and it will probably save you time not too far down the road.
We use the Windows Task Scheduler, which launches a small console application that just passes parameters to the web service.
For example, if a user has scheduled reports #388 and #88, a scheduled task is created with a command line like this:
c:\launcher\app.exe report:388 report:88
When the scheduler fires, this app just executes a web method on the web service, for example InternalService.SendReport(int id).
Usually you already have all the required business logic available in your web application. This approach lets you reuse it with minimal effort, so there is no need to create a complex .exe or a Windows service with pluggable modules, etc.
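Translated into a Python sketch (the original launcher would likely be a small .NET console app, and the endpoint URL here is made up):

# launcher.py - parses "report:388 report:88" style arguments and forwards
# each id to the web service, which holds the real business logic.
import sys
import urllib.request

SERVICE_URL = "http://localhost/internal-service/send-report?id={}"  # hypothetical

for arg in sys.argv[1:]:
    kind, _, raw_id = arg.partition(":")
    if kind == "report" and raw_id.isdigit():
        # Equivalent to calling InternalService.SendReport(int id)
        urllib.request.urlopen(SERVICE_URL.format(raw_id)).read()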
The problem with doing the operations from a scheduled EXE, rather than from inside a web page, is that the operations may benefit from, or even outright require, resources that the web page would have: the IIS cache and an ORM cache are two things that come to mind. In the case of an ORM, making database changes outside the web-app context may even be fatal. My preference is to schedule curl.exe to request the web page from localhost.
Use Windows Scheduled Tasks, or create a Windows service that does the scheduling itself.
