Scrapy managing dynamic spiders - web-scraping

I am building a project where I need a web crawler which crawls a list of different webpages. This list can change at any time. How is this best implemented with scrapy? Should I create one spider for all websites or dynamically create spiders?
I have read about scrapyd, and I guess that dynamically creating spiders is the best approach. I would need a hint about how to implement it though.

If the parsing logic is the same, there are two methods.
For a large number of webpages, you can keep the list in a file, read it at startup (in the start_requests method or in the constructor) and assign it to start_urls.
Alternatively, you can pass the webpage link to the spider as a command-line argument; again, in start_requests or in the constructor you can access this parameter and assign it to start_urls.
Passing parameters in Scrapy:
scrapy crawl spider_name -a start_url=your_url
In Scrapyd, replace -a with -d.
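A minimal sketch combining both methods in one spider; the urls.txt filename and the yielded fields are assumptions for illustration, not from the question:

# sites_spider.py -- a sketch, not the asker's actual project
import scrapy


class SitesSpider(scrapy.Spider):
    name = "sites"

    def __init__(self, start_url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        if start_url:
            # Method 2: a single URL passed on the command line with -a start_url=...
            self.start_urls = [start_url]
        else:
            # Method 1: read the (changeable) list of webpages from a file
            with open("urls.txt") as f:
                self.start_urls = [line.strip() for line in f if line.strip()]

    def parse(self, response):
        # Shared parsing logic for every website goes here
        yield {"url": response.url, "title": response.css("title::text").get()}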

Related

How to make async http request from jmeter while changing path dynamically before each call

Following are the steps that I need to perform:
Make an HTTP request to a service which returns a JSON response containing many URLs.
Extract all the URLs using a Regular Expression Extractor.
Make HTTP requests to all the extracted URLs asynchronously.
Is there a way to achieve this? I tried the Parallel Controller but, if I am not wrong, it requires all the requests to be added as its child samplers. I don't want to write each and every request manually. Is there a way to change the URLs dynamically after starting the test plan?
It's better to use a JSON Extractor if the server returns the URLs in JSON format.
Once you have the URLs in the form of JMeter Variables like:
url_1=http://example.com
url_2=http://example.org
........
........
url_matchNr=X
Add a Parallel Sampler to your Test Plan.
Add a JSR223 PreProcessor as a child of the Parallel Sampler.
Put the following code into the "Script" area:
// Add each extracted URL to the Parallel Sampler
1.upto(vars.get('url_matchNr') as int, { index ->
    sampler.addURL(vars.get('url_' + index))
})

How to pass any URL to an APIFY task?

There is a box to configure the "Start URL" in APIFY, but what happens if I don't know the start URL and it depends on my user's input? I would like to be able to pass a variable URL to "Start URL".
Configuration of the Start URL in APIFY:
I want to pass any URL automatically through an APIFY task and then scrape it.
I tried to automate it through Zapier; in the configuration it is possible to select the URL input and pass it to APIFY, but it ends up stopping the task because it is not able to read the format passed. Data out log from Zapier:
I think that APIFY probably lets you configure dynamic input URLs but, given my beginner level, there is probably something that escapes my knowledge.
I want to be able to pass variable URLs to be scraped by APIFY.
You can check what the input looks like in JSON format using the Editor/JSON switcher at the top of the input configuration.
After you switch to JSON you can easily check the structure of startUrls.
If you want to override startUrls, for example in the Zapier integration, you can do it using the Input JSON overrides field in the Run Task Apify<>Zapier action.
You can override the input the same way using the API to run the task, where you pass the JSON as the POST payload of the API request.
If you want to read more about the Apify<>Zapier integration, you can check the article Scrape single URL using Zapier.
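For the API route, a minimal sketch of overriding startUrls when starting a task run; the task ID and token are placeholders, and the endpoint shown is Apify's v2 run-task endpoint as I understand it:

import requests

TASK_ID = "your_task_id"      # placeholder
API_TOKEN = "your_api_token"  # placeholder

# The POST payload overrides the task's stored input; here only startUrls is replaced.
payload = {"startUrls": [{"url": "https://www.example.com/page-from-user-input"}]}

resp = requests.post(
    f"https://api.apify.com/v2/actor-tasks/{TASK_ID}/runs",
    params={"token": API_TOKEN},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["data"]["id"])  # ID of the run that was started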

Crawling case-sensitive URLs but Scraping case-insensitive with Scrapy

I am using Scrapy to crawl and scrape numerous websites. Scrapy needs to crawl the URLs in a case-sensitive mode as this is important information when requesting a web page. Many websites link to some of their webpages using different casings of the same URLs, which fools Scrapy into creating duplicate scrapes.
For example, the page https://www.example.com/index.html links to https://www.example.com/User1.php and https://www.example.com/user1.php
We need Scrapy to collect both pages as when we see the page User1.php, we do not know yet that we will collect a clone of it later through user1.php. We cannot lowercase https://www.example.com/User1.php either during the crawl as the server may return a 404 error when the page https://www.example.com/user1.php is not available.
So what I am looking for is a solution to tell Scrapy to crawl URLs in a case-sensitive mode, but to duplicate filter the pages, once collected, in a case-insensitive mode before they are scraped to eliminate the risks of duplicates.
Does anyone know how to operate Scrapy under both modes at the same time?
You will likely want to create a custom DupeFilter that extends BaseDupeFilter, then set DUPEFILTER_CLASS = "my_package.MyDupeFilter" in your settings.py
You may have plenty of luck just subclassing the existing RFPDupeFilter and inserting a line into def request_seen(self, request) that case-folds the URL before fingerprinting it:
from scrapy.dupefilters import RFPDupeFilter

class MyDupeFilter(RFPDupeFilter):
    def request_seen(self, request):
        # Case-fold the URL before fingerprinting so differently-cased duplicates match
        lc_req = request.replace(url=request.url.lower())
        return super(MyDupeFilter, self).request_seen(lc_req)
In fact, that sounds like such a common feature that, if you find the change works for you, you could submit a PR to Scrapy to add case_fold = settings.getbool("DUPEFILTER_CASE_INSENSITIVE") so others can benefit from that change.
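For completeness, enabling the custom filter is just the single settings line mentioned above, assuming MyDupeFilter is importable as my_package.MyDupeFilter:

# settings.py
DUPEFILTER_CLASS = "my_package.MyDupeFilter"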

How to show different content based on the path in Racket web servlets?

I'm trying to follow the tutorial in the Racket guide on simple web apps, but can't get one basic thing.
How can you have a servlet serve different content based on the request URL? Despite my scouring, even the huge blog example was one big file, with everything handled through huge GET query strings behind my back. How can I do anything based on URLs? Clojure's Noir framework puts this basic feature front and center on its home page (defpage), but how do I do this with Racket?
The URL is part of the request structure that the servlet receives as an argument. You can get the URL by calling request-uri, then you can look at it to do whatever you want. The request also includes the HTTP method, headers, and so on.
But that's pretty low-level. A better solution is to use dispatch-rules to define a mapping from URL patterns to handler functions. Here's an example from the docs:
(define-values (blog-dispatch blog-url)
  (dispatch-rules
   [("") list-posts]
   [("posts" (string-arg)) review-post]
   [("archive" (integer-arg) (integer-arg)) review-archive]
   [else list-posts]))
Make your main servlet handler blog-dispatch. The URL http://yoursite.com/ will be handled by calling (list-posts req), where req is the request structure. The URL http://yoursite.com/posts/a-funny-story will be handled by calling (review-post req "a-funny-story"). And so on.

Designing proper REST URIs

I have a Java component which scans through a set of folders (input/processing/output) and returns the list of files in JSON format.
The REST URL for the same is:
GET http://<baseurl>/files/<foldername>
Now, I need to perform certain actions on each of the files, like validate, process, delete, etc. I'm not sure of the best way to design the REST URLs for these actions.
Since it's direct file manipulation, I don't have any unique identifier for the files, except their paths. So I'm not sure if the following is a good URL:
POST http://<baseurl>/file/validate?path=<filepath>
Edit: I would have ideally liked to use something like /file/fileId/validate. But the only unique id for files is its path, and I don't think I can use that as part of the URL itself.
And finally, I'm not sure which HTTP verb to use for such custom actions like validate.
When you implement a route like http://<baseurl>/file/validate?path=<filepath>, you encode the action in your resource; that's not a desired effect when modelling a resource service.
You could do the following for read operations:
GET http://api.example.com/files will return all files as URL references, such as:
http://api.example.com/files/path/to/first
http://api.example.com/files/path/to/second
...
GET http://api.example.com/files/path/to/first will return validation results for the file (I'm using JSON for readability)
{
  "name": "first",
  "valid": true
}
That was the simple read only part. Now to the write operations:
DELETE http://api.example.com/files/path/to/first will of course delete the file
Modelling the file processing is the hard part, but you could model it as a top-level resource, so that:
POST http://api.example.com/FileOperation?operation=somethingweird will create a virtual file-processing resource and execute the operation given by the URL parameter 'operation'. Modelling these file operations as resources gives you the possibility to perform the operations asynchronously and to return a result that gives additional information about the progress of the operation, and so on.
You can take a look at the Amazon S3 REST API for additional examples and inspiration on how to model resources. I can highly recommend reading RESTful Web Services.
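A minimal client-side sketch of that pattern; the 'path' parameter and the Location/polling behaviour are assumptions for illustration, not something the answer above specifies:

import requests

BASE = "http://api.example.com"  # the example host used above

# Create a virtual file-operation resource; which operation to run is given
# by the 'operation' URL parameter (the 'path' parameter is an assumption).
resp = requests.post(
    f"{BASE}/FileOperation",
    params={"operation": "validate", "path": "/path/to/first"},
    timeout=10,
)
resp.raise_for_status()

# If the service runs the operation asynchronously, it could answer
# 202 Accepted with a Location header pointing at the new resource,
# which the client can then poll for progress.
operation_url = resp.headers.get("Location")
if operation_url:
    print(requests.get(operation_url, timeout=10).json())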
Now, I need to perform certain actions on each of the files, like validate, process, delete, etc. I'm not sure of the best way to design the REST URLs for these actions. Since it's direct file manipulation, I don't have any unique identifier for the files, except their paths. So I'm not sure if the following is a good URL: POST http://<baseurl>/file/validate?path=<filepath>
It's not. /file/validate doesn't describe a resource, it describes an action. That means it is functional, not RESTful.
Edit: I would have ideally liked to use something like /file/fileId/validate. But the only unique id for files is its path, and I don't think I can use that as part of the URL itself.
Oh yes you can! And you should do exactly that. Except for that final validate part; that is not a resource in any way, and so should not be part of the path. Instead, clients should POST a message to the file resource asking it to validate itself. Luckily, POST allows you to send a message to the file as well as receive one back; it's ideal for this sort of thing (unless there's an existing verb to use instead, whether in standard HTTP or one of the extensions such as WebDAV).
And finally, I'm not sure which HTTP verb to use for such custom actions like validate.
POST, with the action to perform determined by the content of the message that was POSTed to the resource. Custom “do something non-standard” actions are always mapped to POST when they can't be mapped to GET, PUT or DELETE. (Alas, a clever POST is not hugely discoverable and so causes problems for the HATEOAS principle, but that's still better than violating basic REST principles.)
REST requires a uniform interface, which in HTTP means limiting yourself to GET, PUT, POST, DELETE, HEAD, etc.
One way you can check on each file's validity in a RESTful way is to think of the validity check not as an action to perform on the file, but as a resource in its own right:
GET /file/{file-id}/validity
This could return a simple True/False, or perhaps a list of the specific constraint violations. The file-id could be a file name, an integer file number, a URL-encoded path, or perhaps an unencoded path like:
GET /file/bob/dir1/dir2/somefile/validity
Another approach would be to ask for a list of the invalid files:
GET /file/invalid
And still another would be to prevent invalid files from being added to your service in the first place, i.e., when your service processes a PUT request with bad data:
PUT /file/{file-id}
it rejects it with an HTTP 400 (Bad Request). The body of the 400 response could contain information on the specific error.
Update: To delete a file you would of course use the standard HTTP REST verb:
DELETE /file/{file-id}
To 'process' a file, does this create a new file (resource) from one that was uploaded? For example Flickr creates several different image files from each one you upload, each with a different size. In this case you could PUT an input file and then trigger the processing by GET-ing the corresponding output file:
PUT /file/input/{file-id}
GET /file/output/{file-id}
If the processing isn't near-instantaneous, you could generate the output files asynchronously: every time a new input file is PUT into the web service, the web service starts up an asynchronous activity that eventually results in the output file being created.
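A short client-side sketch of the validity-as-resource and input/output patterns described above; the host and file IDs are placeholders:

import requests

BASE = "http://api.example.com"  # placeholder host; the answer uses relative paths

# Validity modelled as a resource in its own right.
validity = requests.get(f"{BASE}/file/bob/dir1/dir2/somefile/validity", timeout=10)
print(validity.json())  # e.g. a True/False flag or a list of constraint violations

# PUT an input file, then GET the corresponding output file once processing is done.
with open("somefile", "rb") as fh:
    requests.put(f"{BASE}/file/input/somefile", data=fh, timeout=10).raise_for_status()

output = requests.get(f"{BASE}/file/output/somefile", timeout=10)
if output.status_code == 200:
    print(len(output.content), "bytes of processed output")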
