Do import.io crawlers obey robots.txt? - web-scraping

When you run an import.io crawler, does it obey the robots.txt file?

According to the FAQ, it does:
Do you adhere to robots.txt?
Yes.

Yes, we do (this also applies to Extractors and Connectors).

Related

Sulu CMS: Is it possible to set custom http-header information for pdf files?

We would like to prevent PDF files from being indexed on our Sulu 1.6 website. Apparently this works best if PDF files also have an X-Robots-Tag: noindex header attached to them.
Is there a way to configure or easily add additional HTTP headers in Sulu?
Thanks a lot!
We found that, architecture-wise (separation of concerns), it is better to solve this at the web server level.
In our Caddy config:
header *.pdf {
    X-Robots-Tag "noindex, nofollow"
}
I also much prefer not hijacking Sulu/Symfony for this.
Update (Feb 2020):
It turned out that our web server Caddy is not as straightforward about this as assumed above. Caddy 2 can probably do it, but it is not out yet.
We will also investigate whether we could upload PDF files to a specific folder instead of the current dynamic delivery.
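For reference, once Caddy 2 is available, a Caddyfile block along these lines should be able to do it (a sketch only; example.com and the @pdf matcher name are placeholders):
example.com {
    @pdf path *.pdf
    header @pdf X-Robots-Tag "noindex, nofollow"
}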

Can I change all my http:// links to just //?

Dave Ward says,
It’s not exactly light reading, but section 4.2 of RFC 3986 provides for fully qualified URLs that omit protocol (the HTTP or HTTPS) altogether. When a URL’s protocol is omitted, the browser uses the underlying document’s protocol instead.
Put simply, these “protocol-less” URLs allow a reference like this to work in every browser you’ll try it in:
//ajax.googleapis.com/ajax/libs/jquery/1.4.4/jquery.min.js
It looks strange at first, but this “protocol-less” URL is the best way to reference third party content that’s available via both HTTP and HTTPS.
This would certainly solve a bunch of mixed-content errors we're seeing on HTTP pages -- assuming that our assets are available via both HTTP and HTTPS.
Is this completely cross-browser compatible? Are there any other caveats?
I tested it thoroughly before publishing. Of all the browsers available to test against on Browsershots, I could only find one that did not handle the protocol relative URL correctly: an obscure *nix browser called Dillo.
There are two drawbacks I've received feedback about:
Protocol-less URLs may not work as expected when you "open" a local file in your browser, because the page's base protocol will be file://, especially when you're using the protocol-less URL for an external resource like a CDN-hosted asset. Testing against http://localhost addresses with a local web server like Apache or IIS works fine, though.
Apparently there's at least one iPhone feed reader app that does not handle the protocol-less URLs correctly. I'm not aware of which one has the problem or how popular it is. For hosting a JavaScript file, that's not a big problem since RSS readers typically ignore JavaScript content anyway. However, it could be an issue if you're using these URLs for media like images inside content that needs to be syndicated via RSS (though, this single reader app on a single platform probably accounts for a very marginal number of readers).
The question of whether one could change all their links to be protocol-relative may be moot, considering the question of whether one should do so. According to Paul Irish:
2014.12.17: Now that SSL is encouraged for everyone and doesn’t have performance concerns, this technique is now an anti-pattern. If the asset you need is available on SSL, then always use the https:// asset.
If you use protocol-less URLs to load stylesheets, IE 7 & 8 will download them twice:
http://www.stevesouders.com/blog/2010/02/10/5a-missing-schema-double-download/
So, this is to be avoided for CSS if you like good performance.
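To make the difference concrete, here is a small illustration (cdn.example.com is a placeholder host):
<!-- Protocol-relative stylesheet reference: IE 7/8 download it twice -->
<link rel="stylesheet" href="//cdn.example.com/styles.css">
<!-- An explicit scheme avoids the double download -->
<link rel="stylesheet" href="https://cdn.example.com/styles.css">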
Yes, network-path references were already specified in RFC 1808 and should work with all browsers.
Is this completely cross-browser compatible? Are there any other caveats?
Just to throw this in the mix, if you are developing on a local server, it might not work. You need to specify a scheme, otherwise the browser may assume that src="//cdn.example.com/js_file.js" is src="file://cdn.example.com/js_file.js", which will break since you're not hosting this resource locally.
Microsoft Internet Explorer seems to be particularly sensitive to this; see this question: Not able to load jQuery in Internet Explorer on localhost (WAMP)
You will probably always want a solution that works in all your environments with the fewest modifications needed.
The solution used by HTML5Boilerplate is to have a fallback when the resource is not loaded correctly, but that only works if you incorporate a check:
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.10.2/jquery.min.js"></script>
<!-- If jQuery is not defined, something went wrong and we'll load the local file -->
<script>window.jQuery || document.write('<script src="js/vendor/jquery-1.10.2.min.js"><\/script>')</script>
I posted this answer here as well.
UPDATE: HTML5Boilerplate now uses <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.10.2/jquery.min.js"> after deciding to deprecate protocol relative URLs, see here.
If you would like to make sure all requests are upgraded to the secure protocol, there is a simple option: the Content Security Policy directive upgrade-insecure-requests
Content-Security-Policy: upgrade-insecure-requests;
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Security-Policy/upgrade-insecure-requests
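The same policy can also be delivered in the page itself via a meta tag, which may be handy if you cannot change response headers (a sketch; it goes in the document <head>):
<meta http-equiv="Content-Security-Policy" content="upgrade-insecure-requests">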
I have not had these issues when using ://example.com - but you do need to add the colon at the beginning. Yoast had a good write up about this a while back. But it's lost in his pile of blog posts.

How and where to add a robots.txt file to an ASP.net web application?

I am using ASP.net with C#.
To increase the searchability of my site in Google, I have searched and found out that I can do it by using a robots.txt file, but I really don't have any idea how to create one or where I can place tags like ASP.net, C# in the txt file.
Also, please let me know the necessary steps to include it in my application.
robots.txt is a text file in the root folder that sets certain rules for search robots, mainly which folders they may access and which not. You can read more about it here: http://www.robotstxt.org/robotstxt.html
The robots.txt file is placed at the root of your website and is used to control where search spiders are allowed to go, e.g., you may not want them in your /js folder. As usual, Wikipedia has a great write-up.
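For instance, a minimal robots.txt that keeps all crawlers out of a /js folder (the folder name is just an example) would look like this:
User-agent: *
Disallow: /js/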
I think you may find Sitemaps more useful, though. This is an XML file you produce that represents the content of your site, and you then push it to the main search engines. Although started by Google, all the main search engines have now agreed to follow a standard schema.
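A minimal sitemap following that schema looks roughly like this (www.example.com is a placeholder):
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
  </url>
</urlset>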
Increasing your Google score, and SEO in general, isn't something I know much about. It sounds like a black art to me :) Check out the IIS SEO Toolkit though, it may offer some pointers.
Most search engines will index your site unless a robots.txt tells it not to. In other words, robots.txt is generally used to exclude robots from your site.

Changing response type in aspx page breaks in IIS7

I have a custom implementation of Application_PreRequestHandlerExecute which applies a deflate/gzip filter to the response. However, on IIS7, this fails on my "script generator" pages. These aspx pages take in query string values and return a custom bit of script, changing the response type to text/javascript. I think it is failing because of the way IIS7 uses MIME types, but I'm unsure how to fix it short of turning all compression off.
Anyone faced this problem?
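For context, a filter of this kind is usually wired up in Global.asax roughly as below (a sketch only, not the asker's actual code; the Accept-Encoding handling and stream choices are assumptions):
using System;
using System.IO.Compression;
using System.Web;

public class Global : HttpApplication
{
    protected void Application_PreRequestHandlerExecute(object sender, EventArgs e)
    {
        HttpContext context = HttpContext.Current;
        string acceptEncoding = context.Request.Headers["Accept-Encoding"] ?? "";

        if (acceptEncoding.Contains("gzip"))
        {
            // Wrap the response stream so the output is gzip-compressed
            context.Response.Filter = new GZipStream(context.Response.Filter, CompressionMode.Compress);
            context.Response.AppendHeader("Content-Encoding", "gzip");
        }
        else if (acceptEncoding.Contains("deflate"))
        {
            // Fall back to deflate if the client does not accept gzip
            context.Response.Filter = new DeflateStream(context.Response.Filter, CompressionMode.Compress);
            context.Response.AppendHeader("Content-Encoding", "deflate");
        }
    }
}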
I understand you’re trying to implement your own gzip filter, but why don’t you consider 3rd-party software?
For example, there is a mod_gzip module in Helicon Ape http://www.helicontech.com/ape/doc/mod_gzip.htm. It’s a very powerful solution and you may enable text/* compression in just one line as follows:
SetEnvIf mime text/.* gzip=9
If you need to exclude javascript, you may try this:
SetEnvIf mime text/(?!javascript).* gzip=9
Helicon Ape is totally free for 3 websites, so you may be interested in it.
But if you don’t prefer 3rd-party software, please make sure that native IIS compression is switched off. You can do this through IIS Manager; see the Compression icon.
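If you prefer configuration files over the IIS Manager UI, native compression can also be switched off in web.config on IIS7+, roughly like this:
<configuration>
  <system.webServer>
    <urlCompression doStaticCompression="false" doDynamicCompression="false" />
  </system.webServer>
</configuration>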
WFetch is also handy in such situations (http://www.microsoft.com/downloads/details.aspx?FamilyID=b134a806-d50e-4664-8348-da5c17129210). The latest version understands GZIP.
If you provide a few examples and the WFetch output, the situation will become clearer.
Thank you.

Tools to convert asp.net dynamic site into static site

Are there any tools that will spider an asp.net website and create a static site?
http://www.httrack.com/
I have used it for this purpose a few times; you may need to do a little tidying up of URLs, and some CSS-linked images might not make it, depending on how good a job you want to do.
If you have Dreamweaver, you can use that to manage the links if you need to clean up the file names afterwards.
Optionally use the link checker extension for Firefox to check it all afterwards.
You could use OfflineExplorer: http://www.metaproducts.com/mp/Offline_Explorer.htm
This works well as long as you only have GET requests (links); postbacks will not be executed.
Be aware that crawling your site might actually change the underlying database, so I would strongly recommend you back up the database and web application before using a crawler.
Another solution is wget.
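A typical wget invocation for mirroring a site into static files might look like this (www.example.com is a placeholder; exact flags depend on your wget version):
wget --mirror --convert-links --adjust-extension --page-requisites https://www.example.com/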
I've had good luck with WebZip.
