list spiders from scrapy shell and run individual spider - web-scraping

I want to open the Scrapy shell from within my Scrapy project folder in a terminal and list all the spiders available in the project. I would also like to run an individual spider and play with the response.
Once I enter the Scrapy shell, I get the following objects:
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x10b75cbd0>
[s] item {}
[s] settings <scrapy.settings.Settings object at 0x10cba1b90>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
My best guess is that I should use methods on the "crawler" object to list the available spiders, but I have had no luck. Please also let me know how to run a spider once I have listed it.

list all the available spiders in my project
Use crawler.spiders.list():
>>> for spider_name in crawler.spiders.list():
...     print(spider_name)


With MQTTLibrary for Robot Framework, is it possible to determine the exact topic of a received message?

I'm a new Robot Framework user, and I've added the MQTTLibrary.
I can set up a subscription as per the documentation, and successfully receive messages. It's also possible to subscribe to wildcards, e.g.
${message}=    Subscribe    topic=test/mqtt_test/+    qos=1    timeout=2
The above will successfully pick up messages published to test/mqtt_test/apples, test/mqtt_test/oranges, test/mqtt_test/pears etc.
However, ${message} appears to only contain the content of the message payload, and I've been unable to work out if it's possible to determine the exact topic of the received message.
Can this be done with MQTTLibrary?
=============
Additional details (to provide an answer to ILostMySpoons's comment):
Sure - it's basically just the message content. So if I use...
mosquitto_pub -h 127.0.0.1 -t test/mqtt_test/apples -m "Hello to you"
...and my robot framework script does...
Log to console    ${message}
...I see...
['Hello to you']
The debug output from the mosquitto broker (mosquitto -v) doesn't show message payloads but it does show the full topic path of test/mqtt_test/apples.
I've taken a deeper look into the MQTTLibrary and have come up with a solution. I'm both a Robot Framework and Python noob, so this may not be the best/most appropriate implementation, but it seems to work.
On my installation, the MQTTLibrary source is contained in C:\Python27\Lib\site-packages\MQTTLibrary. Everything of interest is in the MQTTKeywords.py file.
In the _on_message_list() function, change...
self._messages.append(message.payload)
...to...
self._messages.append([message.topic, message.payload])
Use the Subscribe keyword in your Robot Framework script as before, but you'll now have a list of lists; specifically each entry in the list will be a list of [topic, payload]. E.g.
${messages}=    Subscribe    topic=test/mqtt_test/+    qos=1    timeout=20    limit=0
${third_message}=    Get From List    ${messages}    2
${topic}=    Get From List    ${third_message}    0
${payload}=    Get From List    ${third_message}    1
Log to console    \nTopic:\n${topic}
Log to console    \nPayload:\n${payload}
The above example assumes that at least 3 messages were received during the 20 second timeout window.
Note that this change would break existing scripts, so a more complete solution would perhaps need to add new keywords (e.g. Subscribe And Get Topics), with additional work to ensure Subscribe still returns just the payloads.
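The backwards-compatible variant suggested above could look roughly like the following. This is only a sketch of the idea, not MQTTLibrary's actual code: the class name MessageCollector and the methods payloads() and topics_and_payloads() are hypothetical, and a SimpleNamespace stands in for the message object that the MQTT client passes to the callback (only its topic and payload attributes are used).

```python
from types import SimpleNamespace


class MessageCollector:
    """Hypothetical stand-in for the library's internal message store."""

    def __init__(self):
        self._messages = []

    def on_message(self, client, userdata, message):
        # Store the topic alongside the payload instead of the payload alone.
        self._messages.append((message.topic, message.payload))

    def payloads(self):
        # What an unchanged "Subscribe" keyword would keep returning.
        return [payload for _topic, payload in self._messages]

    def topics_and_payloads(self):
        # What a new keyword such as "Subscribe And Get Topics" could return.
        return [[topic, payload] for topic, payload in self._messages]


collector = MessageCollector()
# Simulate one received message (stand-in object, not a real MQTT message).
collector.on_message(None, None, SimpleNamespace(
    topic="test/mqtt_test/apples", payload="Hello to you"))
print(collector.payloads())
print(collector.topics_and_payloads())
```

Keeping the topic in the internal list and exposing it only through a separate accessor means existing scripts that expect a flat list of payloads would not need to change.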

Collect Asterisk cdr records by http post

I'm using Asterisk and want to collect CDR records. I searched for a while and found that there are already modules for writing CDR records to CSV files or MySQL.
I'm wondering whether there is also a module that lets me collect CDR records by HTTP POST, so that when a call finishes, Asterisk posts the CDR record to a predefined URL.
Thanks in advance.
I don't know of such a module, but you could execute an application on hangup.
You could use the function ${CURL(url[,post-data])}:
exten => h,n,Set(result=${CURL(http://SERVER/cdr.php?\
cdranswer=${CDR(answer)}&exten=${EXTEN}&cidnum=${CALLERID(num)})})
exten => h,n,Noop(${result})
Another approach is to execute a script on hangup:
exten => h,n,System('php -f /opt/scripts/cdr.php \
${CallerID(num)} ${EXTEN} "${CDR(answer)}" ${EPOCH}');
Maybe you could also use ${CDR(billsec)}.
Another option would be to use a CRON job to run a Ruby or PHP script every few minutes to extract and HTTP POST all of the CDRs since the last time the script was run. I've done that for a customer and it works reasonably well.
That said, what I've found is that if I need to have CDRs accessible "off machine", the easiest way to do it over the long haul is set up MySQL replication; write the CDRs to the PBX machine, read them from the replicated copy on the reports machine. It's a bit more set-up intensive at the beginning, but makes everything else a heck of a lot easier later on.
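For the cron-job approach mentioned above, a minimal sketch in Python might look like this. Several things here are assumptions rather than facts from the answers: that the CDRs are written by the CSV backend to a file such as /var/log/asterisk/cdr-csv/Master.csv, that the columns follow the order listed in CDR_FIELDS, and that the receiving URL accepts form-encoded POST data. The state file that remembers how far the previous run got is likewise hypothetical.

```python
import csv
import io
import urllib.parse
import urllib.request

# Assumed column order of the CSV backend; verify against your own file.
CDR_FIELDS = ["accountcode", "src", "dst", "dcontext", "clid", "channel",
              "dstchannel", "lastapp", "lastdata", "start", "answer", "end",
              "duration", "billsec", "disposition", "amaflags"]


def read_new_rows(csv_path, offset):
    """Return (rows, new_offset) for complete lines written after `offset`."""
    with open(csv_path, "rb") as f:
        f.seek(offset)
        data = f.read()
    # Skip a trailing partial line that may still be in the middle of a write.
    end = data.rfind(b"\n") + 1
    text = data[:end].decode("utf-8", errors="replace")
    # zip() drops any extra trailing columns not named in CDR_FIELDS.
    rows = [dict(zip(CDR_FIELDS, r)) for r in csv.reader(io.StringIO(text)) if r]
    return rows, offset + end


def post_row(url, row):
    """HTTP POST one CDR as form-encoded data and return the status code."""
    body = urllib.parse.urlencode(row).encode("ascii")
    with urllib.request.urlopen(url, data=body, timeout=10) as resp:
        return resp.status


def load_offset(state_path):
    """Read the byte offset saved by the previous run (0 on the first run)."""
    try:
        with open(state_path) as f:
            return int(f.read().strip() or 0)
    except FileNotFoundError:
        return 0


def run_once(csv_path, state_path, url):
    """Post every CDR added since the last run, then remember the position."""
    rows, new_offset = read_new_rows(csv_path, load_offset(state_path))
    for row in rows:
        post_row(url, row)
    with open(state_path, "w") as f:
        f.write(str(new_offset))
    return len(rows)
```

A cron entry would then call run_once() every few minutes with the CSV path, a state file path, and the target URL. Note that in this sketch the offset is only saved after all rows were posted, so a failed POST causes the whole batch to be sent again on the next run; the receiver would need to tolerate duplicates. Log rotation of the CSV file is not handled.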

Get File Creation Date Over HTTP

Given a file on a webserver (e.g., http://foo.com/bar.zip -> only accessible through HTTP), is there any way to get the date attributes (e.g., date [created, modified]) without downloading the entire archive in the first place?
Right now, I download the archive and read the attributes programmatically. Trouble is that the archive is dozens of MiB so it seems like a waste of resources to download the entire thing and end up reading off just a couple of bytes of information.
I realize that bandwidth is practically free, but I don't like to be wasteful in any case.
Try reading the Last-Modified header of the response.
Be sure to use an HTTP HEAD request instead of an HTTP GET request so that only the HTTP headers are transferred. If you do an HTTP GET, you will download the whole file anyway, even if you only intend to inspect the headers.
Just for the sake of simplicity, here's a compilation of the existing (perfect) answers from @ihorko and @JanThomä that uses curl. Other options are available too, of course, but here's a fully functional answer.
Use curl with the -I option:
-I, --head
(HTTP/FTP/FILE) Fetch the HTTP-header only! HTTP-servers feature the command HEAD which this uses to get nothing but the header of a document. When used on an FTP or FILE file, curl displays the file size and last modification time only.
Also, the -s option is nice here:
-s, --silent
Silent or quiet mode. Don't show progress meter or error messages. Makes Curl mute. It will still output the data you ask for, potentially even to the terminal/stdout unless you redirect it.
Hence, something like this would do the trick:
curl -sI http://foo.com/bar.zip | grep 'Last-Modified' | cut -d' ' -f 2-
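If the check has to happen inside a program rather than a shell, the same HEAD request can be sketched with Python's standard library. The function names last_modified() and parse_http_date() are made up for this example. Keep in mind that HTTP only exposes a modification date, not a creation date, and that a server is free to omit the Last-Modified header, in which case this sketch returns None.

```python
import urllib.request
from email.utils import parsedate_to_datetime


def parse_http_date(value):
    """Turn an HTTP date such as 'Wed, 21 Oct 2015 07:28:00 GMT' into a datetime."""
    return parsedate_to_datetime(value)


def last_modified(url, timeout=10):
    """Send a HEAD request and return the Last-Modified date, or None if absent."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        value = response.headers.get("Last-Modified")
    return parse_http_date(value) if value else None


# Example call (performs a network request, so it is left commented out):
# print(last_modified("http://foo.com/bar.zip"))
```

Because the request method is HEAD, the server sends the headers without the archive body, which is the same thing curl -I does.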

Check if file is finished copying

I'm writing an ASP.NET webapp that will copy the contents of a CD to a network share. I need to check periodically if the copy job is finished.
One way of doing this is checking the network share folder to see if the file size has changed since the last check, but that seems kind of dodgy. Does anyone have a better idea how to do this?
Thanks in advance,
Stijn
EDIT
Some more explanation:
Basically I'm calling a JsonResult action method every 5 seconds, called getStatus(source,destination). This method needs to check the following:
- if the source dir is still empty, copy cannot start --> return status "waiting"
- if the source dir contains files, copy can start --> call copy method + return status "copying"
- if the destination dir contains files, and file size stays the same, copy is finished --> return status "finished"
Thanks!
In your webapp, use a blocking file copy operation, such as File.Copy, but run the procedure that does the copying in a background thread. In your background thread, write status information (e.g. "3 of 9 files finished" or "I'm done!" or "Error occurred: ...") into some shared object (static variable, Session object, database, ...). Then write some Status.aspx page which shows the content of that shared object.
Create a web service, callable from client-side JavaScript, with two methods: StartCopying and CheckStatus.
StartCopying can either start a background thread to do the copy, or be marked [SoapDocumentMethod(OneWay = true)], which means the method returns immediately without waiting for the work to complete.
CheckStatus just performs the checks you described above and returns the status of the task to the client.
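The pattern both answers describe (a blocking copy running in a background thread, plus a shared status object that the polling call reads) is language-neutral. Here is a small sketch of it in Python rather than ASP.NET; the class CopyJob and its method names are invented for illustration, and in a real web app the job object would live somewhere the status endpoint can reach, such as a static variable or the session.

```python
import shutil
import threading
from pathlib import Path


class CopyJob:
    """Copies a directory tree in a background thread and exposes its status."""

    def __init__(self, source, destination):
        self.source = Path(source)
        self.destination = Path(destination)
        self._lock = threading.Lock()
        self._status = "waiting"
        self._thread = None

    def start_copying(self):
        """Start the copy and return immediately (the StartCopying role)."""
        with self._lock:
            if self._thread is not None:
                return  # already started
            self._status = "copying"
            self._thread = threading.Thread(target=self._run, daemon=True)
            self._thread.start()

    def check_status(self):
        """Return the latest status string (the CheckStatus role)."""
        with self._lock:
            return self._status

    def _set_status(self, status):
        with self._lock:
            self._status = status

    def _run(self):
        try:
            files = sorted(p for p in self.source.rglob("*") if p.is_file())
            for done, src in enumerate(files, start=1):
                dst = self.destination / src.relative_to(self.source)
                dst.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(src, dst)  # blocks until this file is written
                self._set_status(f"copying: {done} of {len(files)} files finished")
            self._set_status("finished")
        except OSError as exc:
            self._set_status(f"error: {exc}")
```

Since the status only becomes "finished" after the last blocking copy call has returned, the polling side never has to guess from file sizes whether the copy is complete.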

ActiveMQ 5.2.0 + REST + HTTP POST = java.lang.OutOfMemoryError

First off, I am a newbie when it comes to JMS & ActiveMQ.
I have been looking into a messaging solution to serve as middleware for a message producer that will insert XML messages into a queue via HTTP POST. The producer is an existing system written in C++ that cannot be modified (so Java and the C++ API are out).
Using the "demo" examples and some trial and error, I have cobbled together a working example of what I want to do (on a Windows box).
The web.xml I configured in a test directory under "webapps" specifies that the HTTP POST messages received from the producer are to be handled by the MessageServlet.
I added a line for the test app in "activemq.xml" ('ow' is the test app dir):
I created a test script to "insert" messages into the queue, which works well.
The problem I am running into is that as I continue to insert messages via REST/HTTP POST, the memory consumption and thread count of ActiveMQ continue to rise (this happens with timely consumers as well as with slow or non-existent consumers).
When memory consumption reaches around 250 MB and the thread count exceeds 5000 (as shown in Windows Task Manager), ActiveMQ crashes and I see this in the log:
Exception in thread "ActiveMQ Transport Initiator: vm://localhost#3564" java.lang.OutOfMemoryError: unable to create new native thread
It is as if Jetty is spawning a new thread to handle each HTTP POST and the thread never dies.
I did look at this page:
http://activemq.apache.org/javalangoutofmemory.html
and tried but that didn't fix the problem (although I didn't fully understand the implications of the change either).
Does anyone have any ideas?
Thanks!
Bruce Loth
PS - I included the "test message producer" python script below for what it is worth. I created batches of 100 messages and continued to run the script manually from the command line while watching the memory consumption and thread count of ActiveMQ in task manager.
import http.client

def foo():
    body = ("<?xml version='1.0' encoding='UTF-8'?>\n"
            "<ROOT>\n"
            # [snip: xml deleted to save space]
            "</ROOT>")
    headers = {"content-type": "text/xml",
               "content-length": str(len(body))}
    conn = http.client.HTTPConnection("127.0.0.1:8161")
    conn.request("POST", "/ow/message/RDRCP_Inbox?type=queue", body, headers)
    response = conn.getresponse()
    print(response.status, response.reason)
    data = response.read()
    conn.close()
## end method definition

## Begin test code
# Test with batches of 100 msgs
for _ in range(100):
    foo()
The error is not caused directly by ActiveMQ but by the Java runtime. Take a look here:
http://activemq.apache.org/javalangoutofmemory.html
to see how you can increase the memory available for the Java heap. There is also interesting material there about why this happens and what you might do to prevent it. ActiveMQ is pretty good, but it needs some customizing here and there in its config files.
You may want to add the following to the URL's query string:
JMSDeliveryMode=persistent
Otherwise, by definition (read "by default"), the messages would be kept in AMQ's memory.
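Applied to the test script from the question, the suggestion amounts to one more query parameter on the POST path. A small sketch of building that path follows; the helper name message_path() is hypothetical, while 'ow' and RDRCP_Inbox are the app directory and queue name taken from the question.

```python
from urllib.parse import urlencode


def message_path(destination, app="ow", persistent=True):
    """Build the REST path for posting to a queue, optionally persistent."""
    params = {"type": "queue"}
    if persistent:
        params["JMSDeliveryMode"] = "persistent"
    return f"/{app}/message/{destination}?{urlencode(params)}"


print(message_path("RDRCP_Inbox"))
```

The resulting string would replace the hard-coded path passed to conn.request() in the test script.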
