how to use the example of scrapy-redis - web-scraping

I have read the scrapy-redis example but still don't quite understand how to use it.
I ran the spider named dmoz and it works well. But when I start another spider named mycrawler_redis, it gets nothing.
Besides I'm quite confused about how the request queue is set. I didn't find any piece of code in the example-project which illustrate the request queue setting.
And if spiders on different machines want to share the same request queue, how can I get that done? It seems that I should first make the slave machine connect to the master machine's Redis, but I'm not sure where to put the relevant code: in spider.py, or do I just type it on the command line?
I'm quite new to scrapy-redis and any help would be appreciated!

If the example spider is working and your custom one isn't, there must be something that you have done wrong. Update your question with the code, including all relevant parts, so we can see what went wrong.
Besides I'm quite confused about how the request queue is set. I
didn't find any piece of code in the example-project which illustrate
the request queue setting.
As far as your spider is concerned, this is done by appropriate project settings, for example if you want FIFO:
# Enables scheduling storing requests queue in redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Don't cleanup redis queues, allows to pause/resume crawls.
SCHEDULER_PERSIST = True
# Schedule requests using a queue (FIFO).
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderQueue'
As far as the implementation goes, queuing is done via RedisSpider, which your spider must inherit from. You can find the code for enqueuing requests here: https://github.com/darkrho/scrapy-redis/blob/a295b1854e3c3d1fddcd02ffd89ff30a6bea776f/scrapy_redis/scheduler.py#L73
As for the connection, you don't need to manually connect to the redis machine, you just specify the host and port information in the settings:
REDIS_HOST = 'localhost'
REDIS_PORT = 6379
And the connection is configured in connection.py: https://github.com/darkrho/scrapy-redis/blob/a295b1854e3c3d1fddcd02ffd89ff30a6bea776f/scrapy_redis/connection.py
The example of usage can be found in several places: https://github.com/darkrho/scrapy-redis/blob/a295b1854e3c3d1fddcd02ffd89ff30a6bea776f/scrapy_redis/pipelines.py#L17
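To make the shared-queue point concrete, here is a minimal sketch of the settings each worker machine might use, written as a small Python helper. The setting names are the ones quoted above; the helper names and the master address 192.168.1.10 are hypothetical. The idea is that every machine runs the same spider with the same settings, and only REDIS_HOST points at the machine that runs Redis.

```python
def shared_queue_settings(redis_host, redis_port=6379):
    """Return scrapy-redis settings that point a worker at a shared Redis.

    Hypothetical helper for illustration: the setting names come from the
    answer above, the values are assumptions.
    """
    return {
        "SCHEDULER": "scrapy_redis.scheduler.Scheduler",
        "SCHEDULER_PERSIST": True,
        "SCHEDULER_QUEUE_CLASS": "scrapy_redis.queue.SpiderQueue",
        "REDIS_HOST": redis_host,
        "REDIS_PORT": redis_port,
    }


def crawl_command(spider_name, settings):
    """Build a `scrapy crawl` command that overrides the Redis location with -s."""
    overrides = " ".join(
        "-s {}={}".format(key, settings[key]) for key in ("REDIS_HOST", "REDIS_PORT")
    )
    return "scrapy crawl {} {}".format(spider_name, overrides)


# Assumed address of the machine running Redis (the "master").
settings = shared_queue_settings("192.168.1.10")
print(crawl_command("mycrawler_redis", settings))
```

The same values can live in settings.py instead; the -s form is just one way to override them per machine from the command line, so nothing needs to go into the spider file itself.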


How to resolve celery.backends.rpc.BacklogLimitExceeded error

I am using Celery with Flask. After working fine for a good long while, Celery is now raising a celery.backends.rpc.BacklogLimitExceeded error.
My config values are below:
CELERY_BROKER_URL = 'amqp://'
CELERY_TRACK_STARTED = True
CELERY_RESULT_BACKEND = 'rpc'
CELERY_RESULT_PERSISTENT = False
Can anyone explain why the error is appearing and how to resolve it?
I have checked the docs here, which don't provide any resolution for the issue.
Possibly because your process consuming the results is not keeping up with the process that is producing the results? This can result in a large number of unprocessed results building up - this is the "backlog". When the size of the backlog exceeds an arbitrary limit, BacklogLimitExceeded is raised by celery.
You could try adding more consumers to process the results? Or set a shorter value for the result_expires setting?
The discussion on this closed celery issue may help:
Seems like the database backends would be a much better fit for this purpose.
The amqp/RPC result backends needs to send one message per state update, while for the database based backends (redis, sqla, django, mongodb, cache, etc) every new state update will overwrite the old one.
The "amqp" result backend is not recommended at all since it creates one queue per task, which is required to mimic the database based backends where multiple processes can retrieve the result.
The RPC result backend is preferred for RPC-style calls where only the process that initiated the task can retrieve the result.
But if you want persistent multi-consumer result you should store them in a database.
Using rabbitmq as a broker and redis for results is a great combination, but using an SQL database for results works well too.
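As a rough sketch of the combination suggested above, the configuration from the question could be changed so that RabbitMQ stays the broker while results go to Redis. The host names, port, and database number are assumptions, the helper `uses_message_based_results` is hypothetical, and the Redis backend needs the redis Python package installed alongside Celery.

```python
# Assumed local services: RabbitMQ on localhost and Redis database 0 on the
# default port. Adjust the URLs for your own hosts and credentials.
CELERY_CONFIG = {
    "CELERY_BROKER_URL": "amqp://localhost",              # RabbitMQ remains the broker
    "CELERY_RESULT_BACKEND": "redis://localhost:6379/0",  # results no longer go through rpc
    "CELERY_TRACK_STARTED": True,
}


def uses_message_based_results(config):
    """Return True if results are sent as messages (rpc or amqp backends).

    Hypothetical helper: those are the backends that send one message per
    state update, which is where a backlog can build up.
    """
    backend = config.get("CELERY_RESULT_BACKEND", "")
    return backend.split(":", 1)[0] in ("rpc", "amqp")


print(uses_message_based_results(CELERY_CONFIG))
```

With a database-style backend each new state overwrites the previous one, so there is no per-task stream of result messages left to accumulate.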

How to prevent "Execution failed:[Errno 32] Broken pipe" in Airflow

I just started using Airflow to coordinate our ETL pipeline.
I encountered the pipe error when I run a dag.
I've seen a general stackoverflow discussion here.
My case is more on the Airflow side. According to the discussion in that post, the possible root cause is:
The broken pipe error usually occurs if your request is blocked or
takes too long and after request-side timeout, it'll close the
connection and then, when the respond-side (server) tries to write to
the socket, it will throw a pipe broken error.
This might be the real cause in my case. I have a PythonOperator that starts another job outside of Airflow, and that job can be very lengthy (e.g. 10+ hours). Is there a mechanism in Airflow that I can leverage to prevent this error?
Can anyone help?
UPDATE1 20190303-1:
Thanks to @y2k-shubham for suggesting the SSHOperator. I was able to use it to set up an SSH connection and run some simple commands on the remote side (the default SSH connection has to be set to localhost because the job runs on localhost), and I see the correct results for hostname and pwd.
However, when I attempted to run the actual job, I received the same error. Again, the error comes from the pipeline job rather than from the Airflow DAG/task.
UPDATE2: 20190303-2
I had a successful run (airflow test) with no error, followed by another failed run (scheduler) with the same error from the pipeline.
While I'd suggest you keep looking for a more graceful way of achieving what you want, I'm putting up example usage as requested.
First you've got to create an SSHHook. This can be done in two ways:
The conventional way, where you supply all requisite settings like host, user, password (if needed) etc. from the client code where you are instantiating the hook. I'm citing an example from test_ssh_hook.py here, but you should go through SSHHook as well as its tests thoroughly to understand all possible usages
ssh_hook = SSHHook(remote_host="remote_host",
                   port="port",
                   username="username",
                   timeout=10,
                   key_file="fake.file")
The Airflow way, where you put all connection details inside a Connection object that can be managed from the UI and only pass its conn_id to instantiate your hook
ssh_hook = SSHHook(ssh_conn_id="my_ssh_conn_id")
Of course, if you're relying on SSHOperator, then you can pass the ssh_conn_id directly to the operator.
ssh_operator = SSHOperator(ssh_conn_id="my_ssh_conn_id")
Now if you're planning to have a dedicated task for running a command over SSH, you can use SSHOperator. Again I'm citing an example from test_ssh_operator.py, but go through the sources for a better picture.
task = SSHOperator(task_id="test",
                   command="echo -n airflow",
                   dag=self.dag,
                   timeout=10,
                   ssh_conn_id="ssh_default")
But then you might want to run a command over SSH as part of a bigger task. In that case you don't need an SSHOperator; you can use just the SSHHook. The get_conn() method of SSHHook provides you with an instance of paramiko's SSHClient. With this you can run a command using the exec_command() call
# obtain the paramiko SSHClient from the hook created earlier
ssh_client = ssh_hook.get_conn()
my_command = "echo airflow"
stdin, stdout, stderr = ssh_client.exec_command(
    command=my_command,
    get_pty=my_command.startswith("sudo"),
    timeout=10)
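To act on the result of exec_command(), you typically read the output streams and check the exit status. The following is a minimal sketch, not the SSHOperator implementation: the helper name `run_over_ssh` is hypothetical, and it assumes `ssh_client` is an already connected paramiko SSHClient such as the one returned by get_conn().

```python
def run_over_ssh(ssh_client, command, timeout=10):
    """Run one command on an already connected paramiko-style SSHClient.

    Returns a tuple (exit_status, stdout_text, stderr_text).
    Hypothetical helper for illustration only.
    """
    stdin, stdout, stderr = ssh_client.exec_command(
        command,
        get_pty=command.startswith("sudo"),
        timeout=timeout,
    )
    # Read both streams first, then ask the channel for the exit status.
    out = stdout.read().decode("utf-8", errors="replace")
    err = stderr.read().decode("utf-8", errors="replace")
    exit_status = stdout.channel.recv_exit_status()
    return exit_status, out, err
```

A caller inside a PythonOperator could then raise an exception when the exit status is non-zero, so that the Airflow task fails visibly instead of passing silently.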
If you look at SSHOperator's execute() method, it is a rather complicated (but robust) piece of code trying to achieve a very simple thing. For my own usage, I had created some snippets that you might want to look at
For using SSHHook independently of SSHOperator, have a look at ssh_utils.py
For an operator that runs multiple commands over SSH (you can achieve the same thing by using bash's && operator), see MultiCmdSSHOperator

How to execute a query in eval.xqy file with app-server id

I need to run a query that imports modules from pod.
If I run a simple query without module imports, using the database ID as below, it works.
let $queryParam := fn:concat("?query=",xdmp:url-encode($query),"&eval=",$dataBaseId,":123")
let $url := fn:concat($hostcqport,"/eval.xqy",$queryParam)
let $response := xdmp:http-post($url, $options)[2]
If the query has import module statements, it throws an error (File Not Found).
So I tried getting the app-server ID and passing that instead of the database ID, as below:
let $queryParam := fn:concat("?query=",xdmp:url-encode($query),"&eval=",$serverId,":123")
let $url := fn:concat($hostcqport,"/eval.xqy",$queryParam)
let $response := xdmp:http-post($url, $options)[2]
How do I pass the server ID so that the query executes against a particular app server?
Is this MarkLogic 8 or earlier? I ask because rewrite options on 8 allow for dynamic switching of module databases before execution (among lots of other amazing goodies). This may be what you want, because you can look at the query parameters at this point and build logic into the rewrite rules.
Otherwise, can you explain in more detail what you are ultimately trying to accomplish? By the time your code runs, it is already executing in the context of a particular app server, so asking to execute against another app server by analysing the query parameters is a bit too late (because you are already using the app server).
[edit] The following is in response to the comments since provided. This is a messy response because the actual ticket and comments are still not a completely clear picture. But if you stitch them together, then a problem statement does now exist for which I can respond.
The original author of the question confirmed via comments that they are "trying to hit an app server on a different node than the one that you actually posted to"
OK.. This is the response to that clarification:
That is not possible. Your request is already being processed by a thread on the node that you hit with your HTTP request. MarkLogic is a cluster, but it does not share threads (or anything else for that matter). Your choices are:
a redirect to the proper node
possibly use the current node to make the request on your behalf.
But that ties up the first thread and the thread on the other node and has the HTTP communication overhead - and you need to have an app server listening for this purpose.
If this is a fire-and-forget type of situation, then you can hit any node and save the data/request in a document in the DB using a URI naming convention that indicates what app server it is for, and by way of insert triggers (with a URI-prefix for their server id), pick up the request from the DB and process it.

Monitoring SaltStack

Is there anything out there to monitor SaltStack installations besides halite? I have it installed but it's not really what we are looking for.
It would be nice if we could have a web gui or even a daily email that showed the status of all the minions. I'm pretty handy with scripting but I don't know what to script.
Anybody have any ideas?
In case by monitoring you mean operating salt, you can try one of the following:
SaltStack Enterprise GUI
Foreman
SaltPad
Molten
Halite (DEPRECATED by SaltStack)
These GUIs allow you to do more than just know whether or not minions are alive. They let you operate on them in the same manner you could with the salt client.
And if by monitoring you just mean checking whether the salt master and salt minions are up and running, you can use a general-purpose monitoring solution like:
Icinga
Naemon
Nagios
Shinken
Sensu
In fact, these tools can monitor different services on the hosts they know about. A host can be any machine that has an IP address, and a service can be any resource that can be queried via the underlying OS. Examples of hosts are a server, a router, or a printer; examples of services are memory, disk, or a process.
Not an absolute answer, but we're developing SaltPad, which is a replacement for and improvement on halite. One of its features is displaying the status of all your minions. You can give it a try: SaltPad project page on GitHub
You might look into Consul. While it isn't specifically for SaltStack, I use it to check that salt-master and salt-minion are running on the hosts they should be.
Another simple test would be to run something like:
salt --output=json '*' test.ping
And compare between different runs. It's not amazing monitoring, but at least shows your minions are up and communicating with your master.
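Comparing runs can be scripted. Below is a minimal sketch in Python, assuming you have saved the text printed by the command above from two runs. The function names are hypothetical, and the parser accepts either one JSON object or several objects printed back to back, since the exact layout may differ between Salt versions.

```python
import json


def parse_ping_output(text):
    """Parse saved `salt --output=json '*' test.ping` output into {minion: bool}.

    Accepts a single JSON object or several objects printed one after another.
    """
    decoder = json.JSONDecoder()
    results = {}
    text = text.strip()
    pos = 0
    while pos < len(text):
        obj, pos = decoder.raw_decode(text, pos)
        for minion, value in obj.items():
            # Only a literal true counts as "responded".
            results[minion] = value is True
        # Skip whitespace between consecutive JSON objects.
        while pos < len(text) and text[pos].isspace():
            pos += 1
    return results


def compare_runs(previous, current):
    """Return (went_down, came_up) as sorted lists of minion names."""
    went_down = sorted(m for m, up in previous.items() if up and not current.get(m, False))
    came_up = sorted(m for m, up in current.items() if up and not previous.get(m, False))
    return went_down, came_up
```

A minion that is missing from the newer run is treated as down, which matches the case where it simply did not answer.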
Another option might be to use the salt.runners.manage module, which comes with a status function.
In order to print the status of all known salt minions you can run this on your salt master:
salt-run manage.status
salt-run manage.status tgt="webservers" expr_form="nodegroup"
I had to write my own. To my knowledge, there is nothing out there which will do this, and halite didn't work for what I needed.
If you know Python, it's fairly easy to write an application to monitor salt. For example, my app had a thread which refreshed the list of hosts from the salt keys from time to time, and a few threads that ran various commands against that list to verify they were up. The monitor threads updated a dictionary with a timestamp and success/fail for each host after they ran. It had a hacked-together HTML display, color-coded to reflect the status of each node. It took me about half a day to write.
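The design described above can be sketched roughly as follows. This is not the original application: the class and function names are hypothetical, and `ping` is an assumed callable you would supply yourself (for example a wrapper that runs test.ping against one host and returns True or False).

```python
import threading
import time


class StatusBoard:
    """Thread-safe dictionary of host -> last check result and timestamp."""

    def __init__(self):
        self._lock = threading.Lock()
        self._status = {}

    def record(self, host, ok):
        with self._lock:
            self._status[host] = {"ok": bool(ok), "checked_at": time.time()}

    def snapshot(self):
        """Return a copy that is safe to render without holding the lock."""
        with self._lock:
            return {host: dict(entry) for host, entry in self._status.items()}


def check_hosts(board, hosts, ping):
    """Run `ping(host)` for every host and record success or failure."""
    for host in hosts:
        try:
            ok = ping(host)
        except Exception:
            # Treat any error from the check as a failed check.
            ok = False
        board.record(host, ok)


def start_monitor(board, get_hosts, ping, interval=60.0):
    """Start a daemon thread that re-checks all hosts every `interval` seconds."""
    def loop():
        while True:
            check_hosts(board, get_hosts(), ping)
            time.sleep(interval)

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread
```

An HTML page or a daily email can then be generated from `board.snapshot()` without touching the monitor threads.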
If you don't want to use Python, you could, painfully, do something similar to this inefficient, quick, untested hack using command line tools in bash.
minion_list=$(salt-key --out=txt|grep '^minions_pre:.*'|tr ',' ' ') # You'
for minion in ${minion_list}; do
    salt "${minion}" test.ping
    if [ $? -ne 0 ]; then
        echo "${minion} is down."
    fi
done
It would be easy enough to modify it to write to a file or send an alert.
Halite was deprecated in favour of the paid UI version. Sad but true, but SaltStack itself still does the job. My guess is that the best monitoring will be the one you write yourself. Happily, there's a salt-api project (which I believe was part of halite, though I'm not sure about this). I'd recommend using it with Tornado, as it's better than the CherryPy version.
So if you want a nice interface, you might want to work with the API once you have set it up. When setting up Tornado, make sure authentication works (I had some trouble here). Here's how you can check it:
Using Postman/Curl/whatever:
check if api is alive:
- no post data (just see if api is alive)
- get request http://masterip:8000/
login (you'll need to take token returned from here to do most operations):
- post to http://masterip:8000/login
- (x-www-form-urlencoded data in postman), raw:
username:yourUsername
password:yourPassword
eauth:pam
I'm using PAM, so I have a user with yourUsername and yourPassword added on my master server (as a regular user, that's how PAM works)
get minions, http://masterip:8000/minions (you'll need to post the token from the login operation),
get all jobs, http://masterip:8000/jobs (you'll need to post the token from the login operation),
So basically, if you want to do anything with SaltStack monitoring, just play with salt-api and get what you want. SaltStack has output formatters, so you can get all the data as JSON (handy if your frontend is JavaScript). It lets you run commands or whatever you want, and the monitoring is left to you, unless you switch from the community to the pro version or want to use the SaltPad mentioned above (which, sorry guys, was last updated a year ago according to the repo).
btw. you might need to change that 8000 port to something else depending on version of saltstack/tornado/config.
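The same checks can be scripted instead of clicked through in Postman. Here is a minimal sketch using only the Python standard library. The function names are hypothetical, the URLs and form fields are the ones listed above, and sending the token in an X-Auth-Token header is an assumption about how your salt-api expects it. The functions only build the requests; pass them to urllib.request.urlopen() to actually send them.

```python
import urllib.parse
import urllib.request


def build_login_request(base_url, username, password, eauth="pam"):
    """Build the POST /login request with form-encoded credentials."""
    data = urllib.parse.urlencode(
        {"username": username, "password": password, "eauth": eauth}
    ).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/login",
        data=data,  # supplying data makes this a POST
        headers={"Accept": "application/json"},
    )


def build_minions_request(base_url, token):
    """Build the GET /minions request carrying the token from the login step."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/minions",
        headers={"Accept": "application/json", "X-Auth-Token": token},
    )
```

The token itself comes out of the JSON body returned by the login call, so a real script would parse that response before building the second request.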
Basically if you want to have an output where you can check the status of all the minions then you can run a command like
salt '*' test.ping
salt --output=json '*' test.ping # To get the output in JSON format
salt-run manage.up # Lists the minions that are up
Or else if you want to visualize the same with a Dashboard then you can see some of the available options like Foreman, SaltPad etc.

Process scheduler runtime parameter

Can anyone recommend a fairly clean method for determining, at run time, which process scheduler an App Engine program is running on (NT or UNIX)? I need to set a file path that is obviously dependent upon the server the process is being executed on. I understand the GetEnv command can be used, but I don't want to set an environment variable for this particular instance (the path does not reside under PS_FILES). I've searched PeopleBooks for any kind of built-in function or system variable, but was not successful (obviously).
Any suggestions would be appreciated.
Thanks
Okay, I may have asked this question a little too early. I apologize.
It looks like I'll just be able to query the process request table to pull back the server name:
SQLExec("SELECT SERVERNAMERUN FROM PSPRCSRQST WHERE PRCSINSTANCE = :1", &thisProcess, &server);
Evaluate &server
When.......
End-Evaluate;
Exactly :-)
There are a host of Process Request records that can give you the information you need. Glad that you found it.
John
