I saw this post about making Scrapy crawl any site without the allowed domains restriction.
Is there a better way of doing it, such as using a regular expression in the allowed_domains variable, like:
allowed_domains = ["*"]
I hope there is some other way than hacking into the Scrapy framework to do this.
Don't set allowed_domains at all.
Look at the get_host_regex() function in this scrapy file:
https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/spidermiddleware/offsite.py
You should deactivate the offsite middleware, which is a built-in spider middleware in Scrapy.
For more information, see http://doc.scrapy.org/en/latest/topics/spider-middleware.html
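As a minimal sketch, disabling the offsite middleware (and simply omitting allowed_domains on the spider) could look like this in settings.py. The middleware path shown is an assumption for newer Scrapy versions; older releases use the scrapy.contrib.spidermiddleware.offsite path linked above, so check your installed version.

```python
# settings.py -- a minimal sketch. With the offsite middleware disabled
# and no allowed_domains on the spider, the crawl will follow links to
# ANY domain, so combine this with depth or link-count limits.
SPIDER_MIDDLEWARES = {
    # Setting a middleware to None disables it. The path below is for
    # newer Scrapy versions; older ones use
    # 'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware'.
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
}
```

Note that this disables the safety net for every spider in the project; scoping it per spider (e.g. via custom_settings) keeps the restriction for spiders that still want it.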
Related
In Django there is the URL dispatcher (URLconf). Does Plone have similar URL traversal rules? Obviously the answer is yes, but what are those rules? Can anyone help me through this?
Most of the time, Plone URLs are mapped to physical objects in its database.
A URL like http://yourhost/Plone/folder/folder/document is automatically available when the folder structure exists.
It just works.
See http://docs.plone.org/develop/plone/serving/traversing.html
If you need a custom URL to be mapped to something that is not a content object, you need to develop a browser view: http://docs.plone.org/develop/plone/views/browserviews.html
In this case you can have something like http://yourhost/Plone/@@your-view-name
If you want to map a view with subpaths you need to define a browser view and manage traversal: http://docs.plone.org/develop/plone/serving/traversing.html#custom-traversal
In this case there are some gotchas, and you don't have a powerful URL mapping facility like in Django or Flask (commonly in Plone it's better to stay simple and use only simple views).
I'm trying to scrape a ratings website with Ruby's mechanize, and am having a world of trouble. My code is pretty simple:
require "mechanize"
client = Mechanize.new
client.get("http://cape.ucsd.edu/responses/Results.aspx")
At that point, you'll see a 404 error.
I've tried a few things, including using HTTParty to check for redirects, disabling SSL verification, and even saving the HTML file locally (to get the proper query form) and then trying to submit it directly from an agent connected to the main site. All of these lead to the same error.
I'm fairly new to scraping, and I'm hoping I'm doing something silly. Any help would be appreciated.
Yes, it's the user agent. To set the user agent, do:
client = Mechanize.new
client.user_agent = 'Mozilla'
I'd like to handle the case properly where a URL might be just slightly off (e.g. mistyped, wrong case) in my Meteor app and I'm using iron:router for routing.
How can I define my regular routes and then define some kind of "catch all" route or "no routes found" callback? Does iron:router provide such capabilities or are there easy workarounds or community packages?
I can sort of work around this by doing something like
Router.route('/:slug', ...)
last. But as soon as routes are defined not just in the main app but also in packages, I get into trouble, because there's no way to say "and run this particular route last".
Thanks everybody!
Would something like this work?
this.route('notFound', {
  path: '*'
});
http://www.manuel-schoebel.com/blog/iron-router-tutorial
First of all, what you are using is the old iron:router API; check its documentation for the newest route handling.
The router supports regular expressions, so you can create a LAST route that catches any string.
For the syntax, check my earlier answer on this: Meteor infinite redirect instead of render 404
I have enabled friendly URLs in my application by having the following line:
routes.EnableFriendlyUrls();
in my App_Start/RouteConfig.cs file.
I would like to get the friendly URL of the currently executing page. I know that I can always get the currently executing file path from the request and strip the ".aspx" extension off it.
this.Request.CurrentExecutionFilePath;
However, I have a feeling that the Friendly URL framework component should be able to directly provide me the information I am looking for rather than me having to do string manipulation.
Any pointers on this will be appreciated. Thanks for looking at my question.
What are you trying to achieve exactly?
If you're trying to extract a value from the URL like you would normally do with Request.QueryString["someValue"], you should read up on how routing works. Here's a good write-up:
http://www.codeproject.com/Tips/698666/USE-OF-MapPageRoute-Method-IN-ASP-NET-WEBFORM-ROUT
If you are only interested in getting the url itself, you can use
Page.Request.RawUrl
Cheers
I am developing a profile-based web application where each user is assigned their own URL through their username and the IIS rewrite module's magic. A typical user's profile URL would be http://www.mymark.com/mike
Each user is also created a blog in a multi-user wordpress installation. The wordpress url would look like this: http://www.mymark.com/blog/mike
I am trying to use the rewrite module to create a more canonical URL for the user (http://www.mymark.com/mike/blog). I have tried several regex variations created with RegExr (a regex generation tool) and have come up with (www.|)mymark.com/([^/]+)/blog as the pattern to match, but haven't had any success so far. What am I doing wrong here?
Here is a screenshot of my rewrite rule:
The entire host name (mymark.com) is not part of the URL that the pattern is tested against, so you should not enter it as part of the pattern. To figure out the exact pattern to enter, use Failed Request Tracing:
http://learn.iis.net/page.aspx/467/using-failed-request-tracing-to-trace-rewrite-rules/
but it's probably going to be something like:
([^/]+)/blog
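To illustrate the point about the host name, here is a small sketch in Python's re module (an assumption for demonstration only; IIS uses .NET regexes, but the behavior shown is the same): the rewrite module tests the pattern against the URL path, such as mike/blog, never against www.mymark.com, so a pattern containing the host can never match.

```python
import re

# Pattern from the question: it embeds the host name, which IIS never
# includes in the string the rule is matched against.
old_pattern = re.compile(r'(www.|)mymark.com/([^/]+)/blog')

# Corrected pattern: match only the path, capturing the username.
new_pattern = re.compile(r'^([^/]+)/blog$')

path = "mike/blog"  # what the rewrite module actually tests

print(old_pattern.search(path))  # -> None: host is not in the path
match = new_pattern.match(path)
print(match.group(1))            # -> mike
```

The captured group ({R:1} in an IIS rewrite rule) would then hold the username for building the rewritten blog URL.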