I am using JSoup to crawl a site, but it redirects to a new page using JavaScript. I am sure it is not a 302 redirect, because the redirect stops when I turn off JavaScript in my browser. Is there a way to make JSoup follow JavaScript redirects automatically? If not, what alternatives can follow JavaScript redirects?
Jsoup is a parser. It doesn't include a JavaScript execution engine, so it cannot execute JavaScript.
In order to execute JavaScript you will have to drive a real or headless browser, for example through Selenium WebDriver.
One other alternative is to parse the JavaScript (as text) that is responsible for the redirect and extract the URL. After that you just do what you normally do in order to scrape a site. But this is a "hack": it's not automatic, and I don't know if it's generic enough for your needs.
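The "parse the JavaScript as text" idea can be sketched roughly as follows. This is only an illustration in Python using the standard library; the function name `extract_js_redirect` and the two patterns are assumptions made for this example, and they only cover redirects written as a literal string assignment or a literal `location.replace(...)` / `location.assign(...)` call.

```python
import re
from urllib.parse import urljoin

# Common client-side redirect idioms; real pages may use others
# (for example, URLs assembled at runtime), which this will miss.
_REDIRECT_PATTERNS = [
    # window.location = "...", location.href = '...', document.location = "..."
    re.compile(r"""(?:window\.|document\.)?location(?:\.href)?\s*=\s*["']([^"']+)["']"""),
    # location.replace("...") / location.assign("...")
    re.compile(r"""location\.(?:replace|assign)\(\s*["']([^"']+)["']\s*\)"""),
]


def extract_js_redirect(html, base_url):
    """Return the absolute redirect target found in html, or None."""
    for pattern in _REDIRECT_PATTERNS:
        match = pattern.search(html)
        if match:
            # Resolve relative targets against the page that was fetched.
            return urljoin(base_url, match.group(1))
    return None
```

The returned URL can then be fetched and parsed the usual way. If the page builds the URL dynamically, this approach will not find it and a browser-based tool is the safer choice.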
Related
I need to debug my web app, which is written in ASP.NET, to find out how it behaves when rendering content for crawlers like Googlebot. The first thing I found was some online/offline tools, but none of them causes the Request.Browser.IsCrawler flag to be set.
Then I tried to simulate a handmade request with the Googlebot User-Agent added, but still no luck.
I used Telerik Fiddler and Chrome, setting the User-Agent to Googlebot/2.1 (+http://www.googlebot.com/bot.html) and including _escaped_fragment_ in the URI, and successfully saw the page from the crawler's perspective.
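The same kind of request can also be built from a script. The following is a minimal Python sketch using only the standard library; `build_crawler_request` and its `fragment_state` parameter are names made up for this example, and the request is only constructed here, not sent.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit
from urllib.request import Request

# User-Agent string quoted in the text above.
GOOGLEBOT_UA = "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"


def build_crawler_request(url, fragment_state=""):
    """Build (but do not send) a request that looks like Googlebot's."""
    scheme, netloc, path, query, _ = urlsplit(url)
    # Append _escaped_fragment_ to whatever query string is already there.
    extra = urlencode({"_escaped_fragment_": fragment_state})
    query = f"{query}&{extra}" if query else extra
    return Request(urlunsplit((scheme, netloc, path, query, "")),
                   headers={"User-Agent": GOOGLEBOT_UA})
```

Passing the result to `urllib.request.urlopen` would send it; whether the server then treats the request as a crawler depends on its own browser detection.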
How can I achieve bi-directional rewriting, so that the following occurs inbound to the web server:
www.mysite.com/region/program/cat1/cat2/cat3
sends the request to the server
www.mysite.com/program.ASPX?idregion=(regionlookupnumber)&idcategory=3276
and when the server is about to write the URL above into a response, it converts it back the other way.
Bi-directional URL rewriting on IIS7 using the URL rewriting package. We don't want to modify the source code if possible.
Any advice, resources, or sample links, please?
I did this using the automatic friendly URL template of the URL rewriting package 2.0.
It works really well. You just need to make sure to rewrite only the URLs you really need to. If you rewrite img, link, etc. tags, for instance, it can break your page.
I'm searching for a framework that would allow me to emulate a user's browsing session.
A typical session looks like:
Browse to home page, get session
Be redirected to current page
Click on some link
Get connected
Submit a form
and so on...
I would like to be able to define this session using API calls.
What frameworks would you recommend for running this setup? It should run headless (not inside a browser), so that it can be executed via Hudson.
Language does not matter; Python or Java would be great.
Thank you,
Maxim.
There are multiple frameworks which can do this. Check out:
https://github.com/axefrog/XBrowser
http://htmlunit.sourceforge.net/
and the answer to this question:
Alternative to HtmlUnit
Have a look at HtmlUnit.
It even has decent JavaScript support, and it's Java-based.
Support for the HTTP and HTTPS protocols
Support for cookies
Ability to specify whether failing responses from the server should throw exceptions or should be returned as pages of the appropriate type (based on content type)
Support for submit methods POST and GET (as well as HEAD, DELETE, ...)
Ability to customize the request headers being sent to the server
Support for HTML responses
Wrapper for HTML pages that provides easy access to all information contained inside them
Support for submitting forms
Support for clicking links
Support for walking the DOM model of the HTML document
Proxy server support
Support for basic and NTLM authentication
Excellent JavaScript support
Take a look at Selenium WebDriver with Xvfb.
This post shows an example in Python:
'Python - Headless Selenium WebDriver Tests using PyVirtualDisplay'
Hi,
Does anyone know how to upload files to a physical location on the server? I know it is possible using the file upload control, but I want to avoid a full postback of the page, exactly like Yahoo Mail does.
In the latest version of Yahoo Mail, attaching a file uploads it to the server without posting the page back. What is the technology behind that?
Normally when you submit a form, it does a POST request to the server, causing a page refresh. Ajax requests get around this by using JavaScript to send the POST data to the server, which doesn't need a page refresh.
Older browsers can't send file data through Ajax requests, though (newer ones can, via XMLHttpRequest Level 2 and FormData), so the most widely compatible way to do it is with an iframe hack: you use JavaScript to dynamically build a form within an iframe, submit that form via JavaScript, and listen for the iframe's onload event, so that you know when the form has been submitted. A version of this approach is detailed here:
http://www.webtoolkit.info/ajax-file-upload.html
Other methods include using a Flash-based solution like http://www.swfupload.org/ or a wrapper like http://www.plupload.com/. These save you from having to roll your own solution and also provide some extra functionality, such as upload progress feedback.
I am trying to wrap my head around a URL rewrite / redirect project I need to work on. We currently have this URL: http://www.example.com/Details/Detail.aspx?param1=8&param2=12345
Here is what the rewritten URL will look like: http://www.example.com/Param1/8/Param2/12345
I am using the ISAPI_Rewrite filter to allow for the "nice" url and make the page think it is still using the old url. That works fine.
Now, I need to redirect users, if they use the old URL, to the new URL. I figure I would need to use a combination of the filter and an HTTPModule / Handler to perform the redirect.
Any ideas?
Have you tried IIS URL Rewrite?
If you are not going to go down the System.Web.Routing (or use ASP.NET MVC) path then I would have a look at this link.
Using an HttpHandler would be your best bet. That way, you will be able to track all incoming requests, filter out the old-format URLs, and redirect them to the correct pages.
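The filtering step such a handler performs can be sketched as below. This is an illustrative Python version of the mapping only, not ASP.NET code; the function name `legacy_to_friendly` is made up for this example, and the path and parameter names are taken from the two URLs in the question.

```python
from urllib.parse import parse_qs, urlsplit


def legacy_to_friendly(url):
    """Map /Details/Detail.aspx?param1=8&param2=12345 to /Param1/8/Param2/12345.

    Returns None when the URL is not in the old format, so the caller
    can let the request through untouched.
    """
    parts = urlsplit(url)
    if parts.path.lower() != "/details/detail.aspx":
        return None
    # Lower-case the keys so Param1 and param1 are treated the same.
    params = {k.lower(): v[0] for k, v in parse_qs(parts.query).items()}
    if "param1" not in params or "param2" not in params:
        return None
    return f"/Param1/{params['param1']}/Param2/{params['param2']}"
```

A handler would issue a permanent redirect to the returned path whenever it is not None. One thing to watch for: because the rewrite filter maps the friendly URL back onto the old one internally, the redirect check presumably has to look at the URL the client originally requested, otherwise the two rules could loop.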