How to start Scrapyd and set the number of parallel processes via shell

I would like to set up Scrapyd with a maximum number of parallel processes without editing the config file.
I know there is a config file where max_proc and max_proc_per_cpu can be set. I am wondering if it's possible to start Scrapyd from the shell with something like:
scrapyd --max_proc=32
I haven't found such a parameter when listing the available options with scrapyd -h.
Does anyone know if there is a solution? Maybe by editing the config file from Python?
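Scrapyd doesn't appear to expose this on the command line, but it does read a scrapyd.conf from the directory it is started in, so one workaround is to generate that file from Python and then launch Scrapyd. A minimal sketch (the values are placeholders; pick limits that suit your machine):
import configparser
import subprocess

# Write a scrapyd.conf into the current working directory;
# Scrapyd checks this location when it starts up.
config = configparser.ConfigParser()
config["scrapyd"] = {
    "max_proc": "32",         # hard cap on parallel processes (0 = derive from CPUs)
    "max_proc_per_cpu": "8",  # per-CPU limit used when max_proc is 0
}
with open("scrapyd.conf", "w") as f:
    config.write(f)

# Start Scrapyd; it picks up the settings written above.
subprocess.run(["scrapyd"])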

Related

How to control the execution of tasks that depend on the download of a file

I have a DAG that runs every two minutes. The first task tries to download a file, and the subsequent tasks manipulate this downloaded file.
I'm using a control file that is set to True when the download finishes successfully, and my other scripts first check whether this flag is True.
I was just wondering if there is a better way to trigger my other scripts than running all of them every two minutes.
Could you give more details about your problem?
If I understood it correctly, here are some suggestions:
Instead of using a control file, use XCom to pass parameters between tasks. This alone doesn't solve your problem, but don't use files to pass parameters: you could end up with concurrency issues.
To verify the download, you could use a FileSensor instead, and then define the dependencies as follows: download_task >> file_sensor >> script_to_exec_task (see the sketch below). Don't forget to configure the sensor's timeout according to your constraints and needs.
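A minimal sketch of that layout, assuming Airflow 2.x (the file path and the two callables are placeholders):
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.filesystem import FileSensor

def download():
    ...  # placeholder: fetch the file

def process():
    ...  # placeholder: manipulate the downloaded file

with DAG(
    dag_id="download_then_process",
    start_date=datetime(2023, 1, 1),
    schedule_interval="*/2 * * * *",  # every two minutes
    catchup=False,
) as dag:
    download_task = PythonOperator(task_id="download", python_callable=download)
    # Assumes the default fs_default connection points at the local filesystem.
    file_sensor = FileSensor(
        task_id="wait_for_file",
        filepath="/data/inbox/downloaded_file",  # placeholder path
        poke_interval=10,  # re-check every 10 seconds
        timeout=60,        # fail the sensor after 60 seconds; tune to your needs
    )
    script_to_exec_task = PythonOperator(task_id="process", python_callable=process)

    download_task >> file_sensor >> script_to_exec_task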

Identifying the jar file association in Task Manager

I have a server which currently runs multiple jar files at the same time.
Right now we just create a bat file that calls java -jar xxxx.jar, and a console window pops up on the screen, so we know which one to terminate when we'd like to turn one of them off.
But as we progress we'd prefer those programs to run in the background, hence javaw -jar xxxx.jar instead.
However, Task Manager then shows only a list of javaw.exe processes, without telling us which jar file each one is associated with.
Is there any parameter we can specify when starting javaw so that there is some indication in Task Manager's process list?
There is a Microsoft Sysinternals tool named Process Explorer that can do what you want: hovering over a process (or opening its properties) shows the full command line, including the jar name.
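If you would rather not install anything, you can also pull the command lines from the shell; a quick sketch (wmic is deprecated on recent Windows versions but still widely available):
wmic process where "name='javaw.exe'" get ProcessId,CommandLine
Recent versions of Task Manager can show the same information directly: on the Details tab, right-click a column header and add the "Command line" column.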

Drupal Scheduler module cron

I'm using the Scheduler module to publish and unpublish my content at certain times; however, I would like publishing to run more frequently than the main Drupal cron.
There is an option within Scheduler to use a lightweight cron specifically for the module, but I have never written a cron task before and simply do not know what I am doing. The module gives an example of how to write one:
/usr/bin/wget -O - -q "http://example.com/scheduler/cron"
To make sure I am understanding this correctly, would this line (modified to point to my address) go into a file called cron.php?
I have tried doing the above, but it doesn't appear to be publishing my content.
No, you'll need to add this line to your crontab on the server. Talk to your hosting provider; they should be able to help you.
If you're running your own server, run this command from the shell:
crontab -e
And add your line at the end of that file.
See the crontab man page (man 5 crontab) for more info.
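For example, an entry like this (a sketch; adjust the schedule and the URL to your site) triggers Scheduler's lightweight cron every five minutes:
*/5 * * * * /usr/bin/wget -O - -q "http://example.com/scheduler/cron"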

Auto-triggering a UNIX shell script

I have a main script called main.ksh (in /home/pkawar/folder) and its input file inputfile.xls (in /home/pkawar/folder/ipfile).
When I run main.ksh, it uses inputfile.xls and delivers the output to a mail address.
The inputfile.xls is uploaded to /home/pkawar/folder/ipfile via ftp.
Is it possible to run main.ksh automatically, so that the output is mailed as soon as inputfile.xls has been uploaded successfully?
The first option would be to use cron, but from your question it doesn't seem that you want to go that route.
My question would be: what is creating the *.xls file? Could whatever creates that file detect when it's finished and then call the shell script, or better yet, stream the file to the shell script on the fly?
The first thing you should do is write a script that does whatever it is you want done. If your script performs correctly, you can use cron via a crontab file to have the script executed on whatever schedule you desire.
See man crontab for details.
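For instance, a crontab entry along these lines (a sketch; it assumes main.ksh moves or deletes the file after processing so the same upload isn't handled twice) polls for the upload every five minutes:
*/5 * * * * [ -f /home/pkawar/folder/ipfile/inputfile.xls ] && /home/pkawar/folder/main.ksh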

Pass commands to a running R-Runtime

Is there a way to pass commands (from a shell) to an already running R runtime/R GUI, without copy and paste?
So far I only know how to call R via the shell with the -f or -e options, but in both cases a new R runtime will process the script or command I pass to it.
I would rather have an open R runtime waiting for commands passed to it via whatever connection is possible.
What you ask for cannot be done. R is single-threaded and has a single REPL (read-eval-print loop), which is attached to a single input: the console in the GUI, or stdin if you pipe into R, but never two.
Unless you use something else, e.g. the most excellent Rserve, which (when hosted on an OS other than Windows) can handle multiple concurrent requests over TCP/IP. You may, however, have to write your own client connection; examples for Java, C++ and R exist in the Rserve documentation.
You can use Rterm (under C:\Program Files\R\R-2.10.1\bin on Windows with R version 2.10.1). Or you can start R from the shell by typing "R" (if the shell does not recognize the command, you need to add R's bin directory to your path).
You could try simply saving the workspace from one session and manually loading it into the other (or any variation on this theme, such as saving only the objects you share between the two sessions with saveRDS or similar). That requires some extra save and load commands, but you could automate it further by adding a few lines to your .Rprofile file, which is executed at the beginning of every R session; see ?Startup in R for more detailed information about what runs at startup. But I guess it all depends on what you are doing inside the R sessions. hth
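Concretely, that hand-off could look like this in R (a sketch; the object name and path are placeholders):
# in the session that produces the object
saveRDS(results, "/tmp/shared_results.rds")
# in the other session, pick it up
results <- readRDS("/tmp/shared_results.rds")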
