Simulating a crawler on my website - asp.net

I need to debug my web app, which is written in ASP.NET, to find out how it behaves when rendering content for crawlers like Googlebot. The first thing I tried was some online/offline tools, but none of them trips the Request.Browser.IsCrawler flag.
Then I tried to simulate a hand-crafted request with the Googlebot user agent added, but still no luck.

I ended up using Telerik Fiddler and Chrome, setting the User-Agent to Googlebot/2.1 (+http://www.googlebot.com/bot.html) and including _escaped_fragment_ in the URI, and I successfully saw the page from the crawler's perspective.
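For anyone who prefers to script that check instead of going through Fiddler, here is a minimal sketch using Python's requests library (the URL is a placeholder; the User-Agent string is the one from the experiment above):

import requests

# User-Agent string taken from the Fiddler/Chrome experiment above
GOOGLEBOT_UA = "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

response = requests.get(
    "https://example.com/some-page",          # placeholder URL for the page under test
    params={"_escaped_fragment_": ""},        # ask for the crawler (snapshot) version
    headers={"User-Agent": GOOGLEBOT_UA},
)
print(response.status_code)
print(response.text[:500])                    # inspect the markup served to the crawler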

Related

IFrame Detection for Cross-Origin Pages

I'm working on a web app that contains iframes whose sources are from different domains. I want to be able to detect whether or not a website will load successfully in an iframe.
I've tried a bunch of solutions I've found here, none of which seem to work. I've tried setting a timer and then checking the content of the iframe (which won't work because of the same-origin policy). I've tried performing a GET request to the source of the iframe, but a lot of domains won't let you do this even if the iframe loads up.
From my understanding, whether or not it loads is determined by the X-Frame-Options header, but I don't think I have access to that either, because I cannot read the HTTP response headers from the browser.
Is this just impossible? Any help would be appreciated.
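As an aside (not a full answer): the X-Frame-Options header can be read from a server-side request even though the browser won't expose it to your page. A rough sketch, assuming Python's requests library and treating the function name as hypothetical; note it is only a heuristic, since the header can vary per request and a complete check would also need to parse Content-Security-Policy frame-ancestors properly:

import requests

def likely_frameable(url):
    # Hypothetical helper: fetch the page server-side and inspect framing headers.
    resp = requests.get(url, timeout=10)
    xfo = resp.headers.get("X-Frame-Options", "").upper()
    csp = resp.headers.get("Content-Security-Policy", "").lower()
    if "DENY" in xfo or "SAMEORIGIN" in xfo:
        return False
    if "frame-ancestors" in csp:
        return False  # a precise answer would require parsing the CSP directive
    return True

print(likely_frameable("https://example.com"))  # placeholder URL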

IE9 redirect caching, fonts, and cross domain resource sharing(CORS) CDN HTTP headers

I thought I had somehow found a solution to the very vexing problem of accessing CDN-hosted fonts in Firefox, but here comes IE9.
I recently ran into a very frustrating caching issue in IE9, and chanced upon this blog post (IE9 Redirect Caching Nightmare), which enlightened me about the actual issue.
I have to admit I'm not sure whether the above is actually the issue, but it seems close enough.
Problem:
I have a website set up with 2 domains (a base domain and a subdomain) pointing to the same server, serving the exact same website, which uses the same set of resources from a CDN hosted on Amazon S3 and served by CloudFront.
https://example.com
https://www.example.com
I get this kind of error message in my IE9 developer tools console when loading fonts from my CSS file using @font-face:
CSS3117: @font-face failed cross-origin request. Resource access is restricted.
This happens when I load either of the URLs first and then visit the other one. IE9 is not running in Compatibility Mode; it is running in Document Mode: IE9 Standards.
From my limited understanding of CORS and the need to set the Access-Control-Allow-Origin HTTP header, I have dutifully set it up in the S3 CORS policy, and it works perfectly fine with Firefox.
Requests from both domains will get their respective header when requesting the CDN resource.
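One quick way to verify that outside the browser is to request the font with each Origin and compare the returned header. A diagnostic sketch assuming Python's requests library (the CDN URL is a placeholder):

import requests

FONT_URL = "https://d1234.cloudfront.net/fonts/myfont.woff"  # placeholder CDN URL

for origin in ("https://example.com", "https://www.example.com"):
    resp = requests.get(FONT_URL, headers={"Origin": origin})
    print(origin, "->", resp.headers.get("Access-Control-Allow-Origin"))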
It seems that IE9 tried to do some optimization with caching, and cached the redirect too.
This causes a problem as the Access-Control-Allow-Origin header is cached as well. Without sending a request to the CDN server, the Access-Control-Allow-Origin header cannot change for different domains.
So I'm left with a situation where the request is from https://www.example.com and yet the Access-Control-Allow-Origin is https://example.com. This leads to the restricted resource problem with the error message above.
Further look: I did a check with Firefox 19; the above situation actually occurs, but it does not hit the same strict restriction as IE9. The subdomain (https://www.example.com) requesting the resource will accept an Access-Control-Allow-Origin of the main domain (https://example.com). Chrome (WebKit) doesn't seem to care. I'm at a loss about which browser's implementation is correct.
With my current CDN settings, Chrome and Firefox automatically reroute all www subdomain requests to the main domain. Only after multiple attempts at typing the www subdomain into the address bar will Chrome and Firefox obey. IE9, on the other hand, just goes to whichever address is typed in the address bar. IE9 seems to be the odd one out here, but I'm not sure which browser's behaviour is actually correct.
From a usability standpoint, Chrome and Firefox seem to show the "correct" behaviour.
Known Possible Solutions:
Set Access-Control-Allow-Origin header to allow all, i.e. *
Turn off caching in the browser
Redirect one domain to the other
Use a query string to differentiate resource requests from the different domains
Embed the font into the CSS as a data URI
For solution 1, let's just say I'm paranoid enough that I only want to allow specific domains.
Solution 2 is not optimal if I have to turn off caching for all browsers; also, my site has to run on mobile devices with usually less-than-desirable download speeds.
Solution 3 is possible, but I'm still curious about a solution that deals directly with the IE9 caching issue.
Solution 4 is very hard to implement, especially when the resource is requested from @font-face. Does it mean I'll have to dynamically regenerate the CSS for each domain, changing a single line just to load a font and bypass the issue? That seems to defeat the purpose of CSS itself, and of caching resources for that matter.
Edit: The stylesheet works; font loading doesn't.
Solution 5 is tedious to maintain and update, especially when the font files change periodically.
Question: Are there any known ways to deal specifically with IE9's redirect caching behaviour in this particular case?
Answers and comments are very much appreciated. Thanks in advance!
Edit: More browser test information.
Solution 1:
Check this question.
Solution 4: rename your CSS file to style.php and use whatever code you need to call the appropriate resource.
Set the content type at the top of the page.
<?php
// Serve this file as CSS so the @font-face rules still apply
header("Content-Type: text/css; charset=UTF-8");
?>
More info about style.php from Chris Coyier.
We discovered the same weird behavior in IE10 and IE11 as well.
Resetting the browser cache makes the fonts load without any problem, and so does toggling compatibility mode on and off.
But when switching to another subdomain, IE does not render the font, because the request's origin does not match the cached response header, which still contains the URL of the previous request. And IE always compares against the full URL, even if the definition on the bucket is *.ourdomain.com.
So the general issue of allowing cross-origin requests to assets like webfonts was solved by adding CORS permissions to the S3 bucket; that made the webfonts work perfectly in Firefox.
But we still have no idea how to avoid * and how to tell IE not to cache the response headers.
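For reference, here is a sketch of what those CORS permissions on the bucket can look like when set via boto3 rather than the S3 console (the bucket name and origins are placeholders; this only restricts the allowed origins and does not, by itself, stop IE from caching the response headers):

import boto3

s3 = boto3.client("s3")
s3.put_bucket_cors(
    Bucket="my-font-bucket",  # placeholder bucket name
    CORSConfiguration={
        "CORSRules": [
            {
                "AllowedOrigins": [
                    "https://example.com",
                    "https://www.example.com",
                ],
                "AllowedMethods": ["GET"],
                "AllowedHeaders": ["*"],
                "MaxAgeSeconds": 3000,
            }
        ]
    },
)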

HTML5 audio with a HTTP 302 redirect in Chrome

I am trying to write an HTML 5 based last.fm player using the popular jPlayer jQuery plugin (http://jplayer.org).
The player works fine in Firefox. However I ran into a problem:
From the last.fm API (http://last.fm/api) I get a playlist with URLs to the files. When requesting one of these, last.fm does an HTTP 302 redirect from play.last.fm to something like "http://s03.last.fm/someurl/128.mp3".
It looks like there is some same-origin policy for HTML5 audio tags, because jPlayer is unable to play the file in Chrome and Chromium. If jPlayer uses the Flash solution (specifying "flash, html" instead of "html, flash"), everything works fine.
I installed the extra codecs on my Ubuntu machine, and MP3 playback works nicely for the jPlayer demos.
The streaming servers do not support HEAD requests. I already tried doing a normal GET request and then reading the "Location" header of the XMLHttpRequest, but it fails with a security error.
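One possible workaround (a sketch only, not something from the original project): resolve the 302 on your own server and hand the final stream URL to the player, since the Location header cannot be read from the XMLHttpRequest. Assuming Python's requests library; the URL and helper name are placeholders:

import requests

def resolve_stream_url(track_url):
    # Capture the redirect target instead of following it.
    resp = requests.get(track_url, allow_redirects=False)
    return resp.headers.get("Location", track_url)

print(resolve_stream_url("http://play.last.fm/someurl.mp3"))  # placeholder URL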
You can find the sources of my (proof of concept) project at https://github.com/tburny/html5-lastfm-player
Is there any hint/solution to this problem?
I had a similar problem, but only in the Android browser. There are lots of gotchas. The key question is whether either the original URL (the one that returns the 302) or the final one is HTTPS; if so, it will fail.
Check out this test suite: http://areweplayingyet.org/

Emulating user browsing session for unit test

I'm searching for a framework that would allow me to emulate a user browsing session.
A typical session looks like:
Browse to the home page, get a session
Be redirected to the current page
Click on some link
Get connected
Submit a form
and so on...
I would like to be able to define this session using API calls.
What frameworks would you recommend for running this setup? It should run headless (not inside a browser) so that it can be executed via Hudson.
Language does not matter; Python or Java would be great.
Thank you,
Maxim.
There are multiple frameworks which can do this. Check out:
https://github.com/axefrog/XBrowser
http://htmlunit.sourceforge.net/
and the answer to this question:
Alternative to HtmlUnit
Have a look at HtmlUnit.
It's even got decent JavaScript support, and it's Java-based. From its feature list:
Support for the HTTP and HTTPS protocols
Support for cookies
Ability to specify whether failing responses from the server should throw exceptions or should be returned as pages of the appropriate type (based on content type)
Support for submit methods POST and GET (as well as HEAD, DELETE, ...)
Ability to customize the request headers being sent to the server
Support for HTML responses
Wrapper for HTML pages that provides easy access to all information contained inside them
Support for submitting forms
Support for clicking links
Support for walking the DOM model of the HTML document
Proxy server support
Support for basic and NTLM authentication
Excellent JavaScript support
Take a look at Selenium WebDriver with Xvfb.
This post shows an example in Python:
'Python - Headless Selenium WebDriver Tests using PyVirtualDisplay'
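A minimal sketch of what that looks like, assuming current Selenium and the pyvirtualdisplay package (the URL, link text, and form field names are placeholders):

from pyvirtualdisplay import Display
from selenium import webdriver
from selenium.webdriver.common.by import By

display = Display(visible=0, size=(1024, 768))  # Xvfb-backed virtual display
display.start()

driver = webdriver.Firefox()
try:
    driver.get("https://example.com")                       # browse to the home page, get a session
    driver.find_element(By.LINK_TEXT, "Login").click()      # click on some link
    driver.find_element(By.NAME, "username").send_keys("user")
    driver.find_element(By.NAME, "password").send_keys("secret")
    driver.find_element(By.NAME, "login").submit()          # submit the form
finally:
    driver.quit()
    display.stop()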

HTTPS does not work - Secure and Non secure data on web page?

I have a browser compatibility problem with HTTPS. I have SSL installed and in use. Until this morning, the HTTPS part of my site was working well. Since then, HTTPS is shown struck through in red, with a warning that the page has some insecure content.
I have not changed any code, and suddenly I see this problem in Chrome. In IE 8 I see the same problem, but on every page it shows me a popup asking whether to allow both secure and non-secure content or just secure. Firefox has no issues; it shows HTTPS correctly without any problem. I am fed up with searching all over for this. Why is this happening to me in Chrome and IE 8?
Could someone tell me what the problem is and what can be done to solve it?
PS: I have also checked whether the page source is any different when IE8 shows the page with and without the secure content. Everything is the same, but the ViewState ID was different. Is that something that is creating this problem?
Thanks a lot in advance.
This is usually caused by having the absolute path to a resource specified somewhere on the page without https, e.g.:
<img src="http://someurl.com/image.png">
If it's a link to something on your site, use https: or a relative path.
Do you have any 3rd-party JavaScript included, like Google Analytics or anything else that might have changed?
If you try with Firefox, you can add Firebug as an add-on. It has a network (Net) tab that lists everything the page loads, and in that list you should be able to find anything that gets loaded without HTTPS.
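If you'd rather script that check than click through Firebug, a quick sketch along these lines (assuming Python's requests library; the URL is a placeholder) lists every http:// resource referenced on the page:

import re
import requests

page = requests.get("https://example.com/somepage").text  # placeholder URL
insecure = re.findall(r'(?:src|href)\s*=\s*["\'](http://[^"\']+)', page, re.IGNORECASE)
for url in sorted(set(insecure)):
    print(url)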
IE (correctly) complains with a security warning when there is mixed http/https content. Most other browsers do not typically complain about mixed content, so your source is very likely the same in both cases.
I would second David Mårtensson's answer and say the issue is likely a third-party library (Google- or MS-hosted jQuery, for example) or a static asset server.
