I'm using HtmlUnit for web testing and it is great. But I also want to test the download time for the whole page, including linked images, JavaScript and CSS files, as well as ensure the links are valid. Is there a way to tell HtmlUnit to grab all the dependencies for any given HTML page?
I don't know of anything as part of HtmlUnit itself, but it's not hard to do yourself.
Pull out all the dependent links yourself — e.g. use getByXPath() to retrieve all the <a> tags — and iterate through them and have HtmlUnit pull each down individually.
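For example, here's a minimal sketch of that approach (the URL and the XPath list are placeholders; adjust them to the resource types you care about, and note it fetches everything sequentially):

import java.net.URL;
import java.util.List;

import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class DependencyFetcher {
    public static void main(String[] args) throws Exception {
        WebClient webClient = new WebClient();
        HtmlPage page = webClient.getPage("http://example.com/"); // page under test (placeholder URL)

        // Collect elements that reference dependent resources:
        // images, scripts, stylesheets and plain anchors.
        List<?> elements = page.getByXPath(
                "//img[@src] | //script[@src] | //link[@rel='stylesheet'][@href] | //a[@href]");

        long start = System.currentTimeMillis();
        for (Object o : elements) {
            HtmlElement e = (HtmlElement) o;
            String ref = e.hasAttribute("src") ? e.getAttribute("src") : e.getAttribute("href");
            URL absolute = page.getFullyQualifiedUrl(ref);  // resolve relative paths
            Page resource = webClient.getPage(absolute);    // fetch it; a broken link throws by default
            System.out.println(absolute + " -> " + resource.getWebResponse().getStatusCode());
        }
        System.out.println("Sequential download time: " + (System.currentTimeMillis() - start) + " ms");
    }
}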
Just be careful using this to determine download times: unless you emulate how a "real" browser would retrieve things in parallel, you will not get an accurate measure. A tool like Xenu would be a better tool for this job.
I am searching for a solution for crawling and parsing a whole website (an online shop) automatically and saving every product with its name and price in a CSV file.
Gaining data from a website can be extremely simple or the complete opposite. It depends on how the website is made. A shop tends to be a complex website, so the DOM (the HTML structure) is mostly unique to that site. It is very unlikely that someone else has tried the exact same thing you want for that page, so you have to write code and extract the necessary pieces yourself.
This will be our example product: http://www.thomann.de/gb/focusrite_scarlett_2i2.htm
HTML uses classes to tell the CSS (for styling) how to design or render a certain element. You can use this to your advantage and find the element containing the price by its class. In this example it is .tr-prod-price.
Every major browser has an inspect-element function that can be used to find the class of an element that appears on screen. Right-click your text (price or title) and choose Inspect Element (or press Q in Firefox).
Now you're a lot closer to parsing your data, and it is time to write code. You could use Python, Java or even JavaScript, to name a few options. JavaScript in conjunction with Node.js could be very easy, because JS has built-in methods for working with the DOM.
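Since HtmlUnit (Java) already came up in this thread, here is a minimal sketch of that idea using it; the product URL and the .tr-prod-price class come from the example above, while the //h1 lookup for the title is just an assumption you would verify in the browser's inspector:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class PriceScraper {
    public static void main(String[] args) throws Exception {
        WebClient webClient = new WebClient();
        webClient.getOptions().setJavaScriptEnabled(false); // the static HTML is enough here

        HtmlPage page = webClient.getPage("http://www.thomann.de/gb/focusrite_scarlett_2i2.htm");

        // Look up elements by the classes found with the browser's inspector.
        // ".tr-prod-price" is from the example above; the "//h1" title lookup is a guess.
        HtmlElement price = page.getFirstByXPath("//*[contains(@class,'tr-prod-price')]");
        HtmlElement title = page.getFirstByXPath("//h1");

        if (price != null && title != null) {
            // One CSV line: "product name","product price"
            System.out.println('"' + title.getTextContent().trim() + "\",\""
                    + price.getTextContent().trim() + '"');
        }
    }
}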
You may need a search engine to find the detail pages of the products. Google can list all results for a query like site:thomann.de/gb. But of course Google does not provide an easy way (API) to get this information, and if you start writing your own parser for that I am not sure about the legal consequences. The legal side also needs to be addressed for your main intention.
CDN integration seems to be a hot topic among the Tridion crowd. But, somehow, the available discussions mainly revolve around pushing content to/from the CDN. What I'm specifically interested in is:
What would be the proper way of modifying/prefixing the outbound links of inline images so that they point to the CDN?
The simplest way to go would be to create a post-processing TBB, operating on the Output item, and place it inside 'Default Finish Actions'. Though doing this on the CD side would seem to be more correct, wouldn't it?
EDIT
Consider a fancier case: what if I not only want to modify image paths, but also wrap the whole image links in ASP.NET controls. Where would I do this?
EDIT 2
So far, I have implemented the tag-to-ASP.NET-control replacement via a TBB. It went smoothly; I only needed to keep an eye on the following subtle matters:
Consider CSS inline styles (e.g. background-image: url(..))
The new TBB needs to be placed after any link-manipulating logic (e.g. Extract Binaries from Html, Publish Binaries in Package, Link Resolver)
The quickest and most robust implementation is probably simple string replacement (as opposed to regexes or XML parsing); see the sketch after this list
To keep the standard "Preview" logic intact, some condition is needed to decide when to trigger the replacement
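To illustrate the string-replacement point above: a real TBB runs in Tridion's .NET templating framework, so treat this Java sketch as pseudo-code for the idea only; the CDN host and the /images/ prefix are hypothetical:

public class CdnPrefixer {

    private static final String CDN_HOST = "http://cdn.example.com"; // hypothetical CDN host

    // Prefix image references both in HTML attributes and in inline CSS url(..) values.
    static String prefixImagePaths(String output) {
        return output
                .replace("src=\"/images/", "src=\"" + CDN_HOST + "/images/")
                .replace("url(/images/", "url(" + CDN_HOST + "/images/");
    }

    public static void main(String[] args) {
        String html = "<img src=\"/images/logo.png\"/>"
                + "<div style=\"background-image: url(/images/bg.png)\"></div>";
        System.out.println(prefixImagePaths(html));
    }
}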
If you decide to go with ASP.NET controls for your CDN-hosted images, you may consider these phases/steps:
write a TCDL tag (e.g. <tcdl:image id="..." path="..." />) on the CM side during rendering
write a TCDL TagHandler implementation that transforms the TCDL into an ASP.NET include during deployment
write the ASCX control to do the CDN lookup proper when the visitor requests the page
I'm not sure if both step 2 and 3 are needed. You might also simply write the CDN path during the deployment phase (step 2 above).
At the same time I'd expect you to upload (updated) images to the CDN using a deployer extension, so that it also happens during phase 2.
I just worked out, by trial and error, that IE 7 has an upper limit of 32 stylesheet includes (i.e. <link> tags).
I'm working on the front end of a very large website, where we want to break our CSS into as many separate files as we like, since this makes developing and debugging much easier.
Performance isn't a concern, as we do compress all these files into a single package prior to deployment.
The problem is on the development side. How can we work with more than 32 stylesheets if IE 7 has an upper limit of 32?
Is there any means of hacking around this?
I'm trying to come up with solutions, but it seems that even if I loaded the stylesheets via Ajax, I'd still be writing out <link> tags, which would still count towards the 32-stylesheet limit.
Is this the case? Am I stuck with the 32-file limit or is there a way around it?
NOTE: I'm asking for a client-side solution to this. Obviously a server-side solution isn't necessary, as we already have a compression system in place. I just don't want to have to re-compress every time I make one little CSS change that I want to test.
Don't support IE7.
To avoid confusion: I'm not seriously suggesting this as a real solution.
Create the CSS file on the server side and merge all the files that are needed for the given page.
If you are using Apache or Lighttpd, consider using mod_concat.
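If mod_concat isn't an option, the merge is easy to hand-roll. Here is a minimal sketch as a Java servlet (file names and paths are placeholders); during development the page then needs only a single <link> pointing at this servlet:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Serves e.g. /all.css by concatenating the individual development stylesheets,
// so the browser only ever sees one stylesheet include.
public class MergedCssServlet extends HttpServlet {

    // Placeholder list; in practice you might scan a directory instead.
    private static final String[] FILES = { "reset.css", "layout.css", "typography.css" };

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        resp.setContentType("text/css");
        for (String name : FILES) {
            Path file = Paths.get(getServletContext().getRealPath("/css/" + name));
            resp.getWriter().write(new String(Files.readAllBytes(file), StandardCharsets.UTF_8) + "\n");
        }
    }
}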
Write your stylesheet into an existing style block with JavaScript using the cssText property, like this:
document.styleSheets[0].cssText += ourCss;
More info here:
https://bushrobot.blogspot.com/2012/06/getting-around-31-stylesheet-limit-in.html
At my last company we solved this by mashing all the CSS into one big document and inserting a URL in the web page that referenced that one-shot document. This was all done on-the-fly, just before returning the page to the client (we had a bunch of stuff going on behind the scenes that generated dynamic CSS).
You might be able to get your web server to do something similar, depending on your setup, otherwise it sounds like you're stuck with only 32 files.
Or you could just not support IE7 ;)
I want to give end users the ability to save HTML to my backend store. Since this feature could easily lead to SQL injection and loads of other issues, does anyone know of a server-side library that will clean the input so only the "safe" parts of HTML can be used?
Some things I'd like to avoid:
Object Tag use
JavaScript use
Windows "style" pop-up boxes (such as your PC is infected with a virus)
CSS with a Javascript action
inline data from external sites
Since there is a 100% guarantee that I didn't come up with all the ways a user could be malicious with this feature, I'd like to learn what options I have to clean the data while preserving basic formatting.
Consider sanitizing user input with the Microsoft AntiXSS library.
http://wpl.codeplex.com/
http://msdn.microsoft.com/en-us/security/aa973814.aspx
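AntiXSS is a .NET library; if your backend happens to run on the JVM instead, the same whitelist idea is available in, for example, the OWASP Java HTML Sanitizer. A rough sketch (the library and policy choice are mine, not part of the original recommendation):

import org.owasp.html.PolicyFactory;
import org.owasp.html.Sanitizers;

public class HtmlCleaner {
    public static void main(String[] args) {
        // Whitelist policy: allow basic formatting, block-level tags and links;
        // everything else (scripts, objects, event handlers, style) gets stripped.
        PolicyFactory policy = Sanitizers.FORMATTING
                .and(Sanitizers.BLOCKS)
                .and(Sanitizers.LINKS);

        String dirty = "<p onclick=\"steal()\">Hello <b>world</b>"
                + "<script>alert('your PC is infected')</script></p>";
        System.out.println(policy.sanitize(dirty)); // roughly: <p>Hello <b>world</b></p>
    }
}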
For example, I want to check things like:
On every page, an <h3> tag must only appear after an <h2>; otherwise the page should be flagged.
If a page links to a PDF, then some particular text such as <p>Download Adobe reader from here</p> should appear at the bottom of that page; if this condition is not met, the page should be flagged.
I want to define different kinds of conditions to check, run them against the whole site, and generate a report of anything that doesn't match.
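To make the kind of check concrete, here is a rough sketch of how two such conditions could be tested for a single page with HtmlUnit (which appears earlier in this thread); the URL is a placeholder and you would run this over every page of the site:

import java.util.List;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HeadingOrderCheck {
    public static void main(String[] args) throws Exception {
        WebClient webClient = new WebClient();
        HtmlPage page = webClient.getPage("http://example.com/some-page.html"); // placeholder URL

        // Any <h3> with no <h2> anywhere before it in document order is a violation.
        List<?> badHeadings = page.getByXPath("//h3[not(preceding::h2)]");

        // The "PDF page" rule: if the page links to a PDF, the notice text must be present.
        boolean linksToPdf = !page.getByXPath("//a[contains(@href, '.pdf')]").isEmpty();
        boolean hasNotice = page.asXml().contains("Download Adobe reader from here");

        if (!badHeadings.isEmpty() || (linksToPdf && !hasNotice)) {
            System.out.println("MARK: " + page.getUrl());
        }
    }
}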
Do you necessarily have to use XHTML? I'd use Python and BeautifulSoup, myself.
(Edit: I was confused - I was thinking of XSLT, not XHTML, and I thought "why would you use XSLT for something like this?". XHTML is fine, and my recommendation of Python and BeautifulSoup still stands.)
This ruby gem looks like it could be useful to you:
http://code.google.com/p/opticon/
I haven't personally used it, but it claims to basically do what you're asking for.
I've had, and still have, the same need on many of my projects. In my case I'm looking for anything with the class 'error'. This is supported by the TestPlan product in its verification engine.
In my case, as a quick example, I have several "Web" states and my generic verify script is:
CheckNot //div[#class='error']
Now the way TestPlan works is that every state within "Web" will first run this generic verify script.
If you're interested I could help you come up with the exact syntax needed to do your check.