Crawling case-sensitive URLs but Scraping case-insensitive with Scrapy

Crawling case-sensitive URLs but Scraping case-insensitive with Scrapy - web-scraping

I am using Scrapy to crawl and scrape numerous websites. Scrapy needs to crawl the URLs in a case-sensitive mode as this is an important information when requesting a web page. Many websites link to some of the webpages using different casings of the same URLs which fools Scrapy into creating duplicates scrapes.
For example, the page https://www.example.com/index.html links to https://www.example.com/User1.php and https://www.example.com/user1.php
We need Scrapy to collect both pages as when we see the page User1.php, we do not know yet that we will collect a clone of it later through user1.php. We cannot lowercase https://www.example.com/User1.php either during the crawl as the server may return a 404 error when the page https://www.example.com/user1.php is not available.
So what I am looking for is a solution to tell Scrapy to crawl URLs in a case-sensitive mode, but to duplicate filter the pages, once collected, in a case-insensitive mode before they are scraped to eliminate the risks of duplicates.
Does anyone know how to operate Scrapy under both modes at the same time.

You will likely want to create a custom DupeFilter that extends BaseDupeFilter, then set DUPEFILTER_CLASS = "my_package.MyDupeFilter" in your settings.py
You may have plenty of luck just subclassing the existing RFPDupeFilter and inserting a line into def request_seen(self, request) that case-folds the URL before fingerprinting it:
class MyDupeFilter(RFPDupeFilter):
def request_seen(self, request):
lc_req = request.replace(url=request.url.lower())
return super(MyDupeFilter, self).request_seen(lc_req)
In fact, that sounds like such a common feature, if you find that change works for you, then submit a PR to Scrapy to add case_fold = settings.getbool("DUPEFILTER_CASE_INSENSITIVE") so others can benefit from that change

Related

What is in reality the order of evaluation of URL patterns in urls.py in Django?

Although in the documentation it says that "Django runs through each URL pattern, in order, and stops at the first one that matches the requested URL", all the examples are like this:
from django.urls import path
from . import views
urlpatterns = [
path('articles/2003/', views.special_case_2003),
path('articles/<int:year>/', views.year_archive),
path('articles/<int:year>/<int:month>/', views.month_archive),
path('articles/<int:year>/<int:month>/<slug:slug>/', views.article_detail),
]
with more broad URL patterns at the end. And everything seems to work properly.
Moreover, some articles in Stackoverflow suggest that this may be a bug (What is the urls.py regex evaluation order in django?).
What is the real order of evaluation of URL patterns and why?

The real order of evaluation of URL patterns is different for path and re_path (url intern calls re_path, so all statements for re_path apply to url too).
When we use path the order of URLs doesn't matter. However, when using re_path, the order of URLs is specific URLs to broad URLs from top to bottom.

Does Kentico allow query strings with question mark?

I'm trying to migrate my ASPX site to Kentico, and as part of my task I'm migrating URLs. I need to preserve my URL structure, so I need to keep URLs which look like : "foo.com/bar.aspx?pageid=1".
I checked page's "URLs" property tried to use wildcards, some patterns like /bar/{pageid}- /bar/{?pageid?}-, etc but Kentico always replaces question marks.
Is there a way to achieve that via the admin interface?

You don't need to do anything in order to use "foo.com/bar.aspx?pageid=1" url.
Create a page under the root and call it bar, so you'll get a page # foo.com/bar.aspx. Kentico and/or .net does not care what you add to a url after question mark, so foo.com/bar.aspx?pageid=1 will work as well as foo.com/bar.aspx?someparam=sdf, or foo.com/bar.aspx?id=1&p=3&t=3.
You may (or may not) implement some functionality based on query string (e.g. paging), so it will parse query string and act in appropriate way.

By default Kentico UI does not handle adding URL aliases with URL parameters like you show. There is an article on the DevNet for a URL Redirection module which has code you can import into your site to allow you to perform these redirects within the Kentico UI. I'd suggest using this approach.
Unfortunately, I can't share a code sample since it's an article but it also has a link to download the code too. This appears to only be coded for Kentico 8.2 right now but I'm guessing you could do some work to make it work for other versions if you needed.

I think there are few concepts that you are clubbing here. I will start with your line code here
/bar/{pageid} - {pageid} is a positional parameter in Kentico's language if you choose to use dynamic URLS based on patterns. SO if you have a code that relies on pageid parameter to fetch some data then Kentico will pass that value. E.g in case of /bar/420, it will pass pageid as 420 different web parts on your template
/bar/{?pageid?} - This will search for query string parameter "pageid" on the request URL and replace its value here. So if you passed foo.com/bar.aspx?pageid=366, the resulting URL will be /bar/366
The #1 is positional parameter and #2 is the way in which Kentico resolves query string macros.
I hope this clarifies.

Scrapy managing dynamic spiders

I am building a project where I need a web crawler which crawls a list of different webpages. This list can change at any time. How is this best implemented with scrapy? Should I create one spider for all websites or dynamically create spiders?
I have read about scrapyd, and I guess that dynamically creating spiders is the best approach. I would need a hint about how to implement it though.

If parsing logic is same then there are two methods,
For large number of webpages, you can create a list and read that list at the start may b in start_requests method or in constructor and assign that list to start_urls
You can pass you webpage link as a parameter to the spider from command line arguments, ans same in requests_method or in constructor you can access this parameter and assign it to start_urls
Passing parameters in scrapy
scrapy crawl spider_name -a start_url=your_url
In scrapyd replace -a with -d

301 Redirect with Regular Expressions

Couldn't find an answer to this and thought it might be a quick answer.
My company, a local news site, is working on migrating to WordPress from a proprietary CMS. Part of the challenge is we are restructuring URLs. I will be utilizing 301 redirects but my issue is as follows:
Example Page name: Story Name: is "this"
Example Old CMS Page URL: /story-name--is--this-/
New CMS Page URL: /news/2012/09/12/story-name-is-this/
The old CMS turned special characters and spaces into hyphens. WordPress will be configured to instead ignore special characters and simply turn spaces into hyphens. Additionally, the old CMS did not include the date in the URL, and I'm not sure the best route to take regarding adding the date.
Thanks!

You're either going to have to write a script that takes all of your old links, does a lookup in your database to transform it into the new link, and redirect the browser to the new link. Or you'll have to enumerate the entire mapping of old links -> new links and create a 301 redirect for each of them (in either your vhost/server config or in an htaccess file):
Redirect 301 /story-name--is--this-/ /news/2012/09/12/story-name-is-this/

It's not clear what is your real question? I am also not sure what Regular expressions have to do with the problem.
There is no information about what your old CMS is capable of, assuming that you can intercept the calls to old articles when they are accessed via the browser, but before they are rendered you can form and send the redirect back to the browser dynamically generating the url using the programming mechanisms available in your proprietary CMS.
Again, assuming you have access to Java:
A. When generating the redirect URL you can access the article's date and form the
2012/09/12 from the date, you can use SimpleDateFormatter to format Dates into a string representation like YYYY/MM/DD.
B. You can use similar approach with the titles and replace the list of special characters in the title string with empty spaces. For example Apache StringUtils library can let you specify a set of characters to look for and if any are found they will be replaced with the target character.
C. You concatenate the output of A and B to create the target redirect URL and send it back to the browser instead of the article itself.

how to submit query to .aspx page in python

I need to scrape query results from an .aspx web page.
http://legistar.council.nyc.gov/Legislation.aspx
The url is static, so how do I submit a query to this page and get the results? Assume we need to select "all years" and "all types" from the respective dropdown menus.
Somebody out there must know how to do this.

As an overview, you will need to perform four main tasks:
to submit request(s) to the web site,
to retrieve the response(s) from the site
to parse these responses
to have some logic to iterate in the tasks above, with parameters associated with the navigation (to "next" pages in the results list)
The http request and response handling is done with methods and classes from Python's standard library's urllib and urllib2. The parsing of the html pages can be done with Python's standard library's HTMLParser or with other modules such as Beautiful Soup
The following snippet demonstrates the requesting and receiving of a search at the site indicated in the question. This site is ASP-driven and as a result we need to ensure that we send several form fields, some of them with 'horrible' values as these are used by the ASP logic to maintain state and to authenticate the request to some extent. Indeed submitting. The requests have to be sent with the http POST method as this is what is expected from this ASP application. The main difficulty is with identifying the form field and associated values which ASP expects (getting pages with Python is the easy part).
This code is functional, or more precisely, was functional, until I removed most of the VSTATE value, and possibly introduced a typo or two by adding comments.
import urllib
import urllib2
uri = 'http://legistar.council.nyc.gov/Legislation.aspx'
#the http headers are useful to simulate a particular browser (some sites deny
#access to non-browsers (bots, etc.)
#also needed to pass the content type.
headers = {
'HTTP_USER_AGENT': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.13) Gecko/2009073022 Firefox/3.0.13',
'HTTP_ACCEPT': 'text/html,application/xhtml+xml,application/xml; q=0.9,*/*; q=0.8',
'Content-Type': 'application/x-www-form-urlencoded'
}
# we group the form fields and their values in a list (any
# iterable, actually) of name-value tuples. This helps
# with clarity and also makes it easy to later encoding of them.
formFields = (
# the viewstate is actualy 800+ characters in length! I truncated it
# for this sample code. It can be lifted from the first page
# obtained from the site. It may be ok to hardcode this value, or
# it may have to be refreshed each time / each day, by essentially
# running an extra page request and parse, for this specific value.
(r'__VSTATE', r'7TzretNIlrZiKb7EOB3AQE ... ...2qd6g5xD8CGXm5EftXtNPt+H8B'),
# following are more of these ASP form fields
(r'__VIEWSTATE', r''),
(r'__EVENTVALIDATION', r'/wEWDwL+raDpAgKnpt8nAs3q+pQOAs3q/pQOAs3qgpUOAs3qhpUOAoPE36ANAve684YCAoOs79EIAoOs89EIAoOs99EIAoOs39EIAoOs49EIAoOs09EIAoSs99EI6IQ74SEV9n4XbtWm1rEbB6Ic3/M='),
(r'ctl00_RadScriptManager1_HiddenField', ''),
(r'ctl00_tabTop_ClientState', ''),
(r'ctl00_ContentPlaceHolder1_menuMain_ClientState', ''),
(r'ctl00_ContentPlaceHolder1_gridMain_ClientState', ''),
#but then we come to fields of interest: the search
#criteria the collections to search from etc.
# Check boxes
(r'ctl00$ContentPlaceHolder1$chkOptions$0', 'on'), # file number
(r'ctl00$ContentPlaceHolder1$chkOptions$1', 'on'), # Legislative text
(r'ctl00$ContentPlaceHolder1$chkOptions$2', 'on'), # attachement
# etc. (not all listed)
(r'ctl00$ContentPlaceHolder1$txtSearch', 'york'), # Search text
(r'ctl00$ContentPlaceHolder1$lstYears', 'All Years'), # Years to include
(r'ctl00$ContentPlaceHolder1$lstTypeBasic', 'All Types'), #types to include
(r'ctl00$ContentPlaceHolder1$btnSearch', 'Search Legislation') # Search button itself
)
# these have to be encoded
encodedFields = urllib.urlencode(formFields)
req = urllib2.Request(uri, encodedFields, headers)
f= urllib2.urlopen(req) #that's the actual call to the http site.
# *** here would normally be the in-memory parsing of f
# contents, but instead I store this to file
# this is useful during design, allowing to have a
# sample of what is to be parsed in a text editor, for analysis.
try:
fout = open('tmp.htm', 'w')
except:
print('Could not open output file\n')
fout.writelines(f.readlines())
fout.close()
That's about it for the getting of the initial page. As said above, then one would need to parse the page, i.e. find the parts of interest and gather them as appropriate, and store them to file/database/whereever. This job can be done in very many ways: using html parsers, or XSLT type of technogies (indeed after parsing the html to xml), or even for crude jobs, simple regular-expression. Also, one of the items one typically extracts is the "next info", i.e. a link of sorts, that can be used in a new request to the server to get subsequent pages.
This should give you a rough flavor of what "long hand" html scraping is about. There are many other approaches to this, such as dedicated utilties, scripts in Mozilla's (FireFox) GreaseMonkey plug-in, XSLT...

Most ASP.NET sites (the one you referenced included) will actually post their queries back to themselves using the HTTP POST verb, not the GET verb. That is why the URL is not changing as you noted.
What you will need to do is look at the generated HTML and capture all their form values. Be sure to capture all the form values, as some of them are used to page validation and without them your POST request will be denied.
Other than the validation, an ASPX page in regards to scraping and posting is no different than other web technologies.

Selenium is a great tool to use for this kind of task. You can specify the form values that you want to enter and retrieve the html of the response page as a string in a couple of lines of python code.
Using Selenium you might not have to do the manual work of simulating a valid post request and all of its hidden variables, as I found out after much trial and error.

The code in the other answers was useful; I never would have been able to write my crawler without it.
One problem I did come across was cookies. The site I was crawling was using cookies to log session id/security stuff, so I had to add code to get my crawler to work:
Add this import:
import cookielib
Init the cookie stuff:
COOKIEFILE = 'cookies.lwp' # the path and filename that you want to use to save your cookies in
cj = cookielib.LWPCookieJar() # This is a subclass of FileCookieJar that has useful load and save methods
Install CookieJar so that it is used as the default CookieProcessor in the default opener handler:
cj.load(COOKIEFILE)
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
To see what cookies the site is using:
print 'These are the cookies we have received so far :'
for index, cookie in enumerate(cj):
print index, ' : ', cookie
This saves the cookies:
cj.save(COOKIEFILE) # save the cookies

"Assume we need to select "all years" and "all types" from the respective dropdown menus."
What do these options do to the URL that is ultimately submitted.
After all, it amounts to an HTTP request sent via urllib2.
Do know how to do '"all years" and "all types" from the respective dropdown menus' you do the following.
Select '"all years" and "all types" from the respective dropdown menus'
Note the URL which is actually submitted.
Use this URL in urllib2.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Crawling case-sensitive URLs but Scraping case-insensitive with Scrapy - web-scraping

Related

What is in reality the order of evaluation of URL patterns in urls.py in Django?

Does Kentico allow query strings with question mark?

Scrapy managing dynamic spiders

301 Redirect with Regular Expressions

how to submit query to .aspx page in python

Categories

Resources