I am trying to print the href of a link in an HTML document, but I can't get it to work.
require 'nokogiri'
require 'open-uri'

newurl = 'http://www.heroesfire.com/hots/guide/the-many-ways-of-abathur-1194'
# URI.open comes from open-uri and fetches the page over HTTP
buildpage = Nokogiri::HTML(URI.open(newurl))
#puts buildpage
thistext = buildpage.css("div#wrap div#site-content.self-clear div#guide.view-guide div.col-l div.tab-contents.box div.guide-tab div.chapter-text div.text table.bbcode_columns tbody tr td.bbcode_column a").each do |href|
  puts href['href']
end
I am expecting to see '/hots/wiki/talents/pressurized-glands'
I was able to get something similar to work earlier in my script, but I am having zero luck with this.
Invariably, the longer the Node selector, the less likely it will work correctly, especially if you're dealing with HTML you don't control.
Reduce it to find way-points, places that help you drill down instead of trying to define each step.
You're also relying on tbody in the selector. When we see that, the odds are good that it's not in the original HTML source but instead was injected by your browser. Selectors like that smell of using a browser and an inspector to locate a particular item in the page, but the resulting path won't work if the HTML doesn't actually contain tbody. Browsers do a lot of fix-up in an attempt to present something useful, including adding tags. So be careful when you see tbody and confirm it actually exists. In your case, it does, but the concern still exists when navigating through a document.
A simple example of simplifying the path is:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<div id="foo">
<div id="bar">
<p>text1</p>
</div>
<div id="baz">
<p>text2</p>
</div>
</div>
</body>
</html>
EOT
doc.at('body div#foo div#bar p').text # => "text1"
Can be written more easily, while still accomplishing the same thing, using:
doc.at('#bar p').text # => "text1"
or perhaps one of these:
doc.at('#foo div p').text # => "text1"
doc.search('#foo div p').first.text # => "text1"
All scraping requires at least some advance knowledge of the target page's structure, so, while you're nosing around, take note of the important layout tags. id attributes are especially useful, followed by class attributes and/or unique patterns of tags not replicated elsewhere in the document. Those make it easy to reduce the selector. Sometimes we have to step into the document incrementally, like I did using first, or with one of the "sibling" methods after locating a particular node, but a long selector is rarely needed.
In the following example, the word Test is not clickable in Internet Explorer, even though the link URL appears at the bottom of the page when it's hovered over, and the link's area is represented accurately in the horrible IE debugging tool (F12). This works fine in all other browsers (of course).
<a href="/"><table><tr><td>Test</td></tr></table></a>
I know it's not technically valid to nest a table inside a hyperlink tag, but it's really the only practical way to do what I want to accomplish, and seeing how it works fine in all browsers, is there a way to get it to work in IE?
So far, I've tried giving both the table and link a height, width, and also a display property of inline-block. None have worked. Thanks.
You say "seeing how it works fine in all browsers" -- but that's really not true. What's actually happening in some browsers is they're doing work to make it work.
Do something like this instead:
<table onclick="location.href='/'" style="cursor: pointer;">
<tr><td>Test</td></tr>
</table>
Also a hack, but a more valid one.
UPDATE
If you have concerns about crawlers, there are two possible approaches. One is to add a link after, something like:
<table onclick="location.href='/'" style="cursor: pointer;">
<tr><td>Test</td></tr>
</table>
<a href="/">Test</a>
You can also use a <link> tag in the <head> of the document, something like:
<link href="/" rel="section" />
Or whatever link rel type makes sense.
Additionally, HTML structured as you have it in your question is invalid according to the spec. In terms of what works reliably and into the future, your code does not qualify. Code written with an eye towards standards will work more reliably.
ANOTHER UPDATE
Given your comment, here's how I would structure this, assuming markup like this:
<table class="dataTable">
<tr>
<td><img></td>
<td>Description</td>
<td><a class="details" href="...">Details</a></td>
</tr>
</table>
Your details link represents the link you're using, so what I would do is add this bit of JavaScript (it uses jQuery, but could be rewritten for whatever libraries you're currently using):
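<script>
jQuery(function($){
  // Clicking any cell follows the details link of that cell's row.
  $('table.dataTable').on('click', 'td', function(event){
    // Let clicks on a real link behave normally.
    if ($(event.target).closest('a').length) { return; }
    // jQuery's trigger('click') does not follow an anchor's href, so navigate explicitly.
    var href = $(this).closest('tr').find('a.details').attr('href');
    if (href) { window.location.href = href; }
  });
});
</script>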
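If jQuery isn't on the page, the same idea can be sketched with plain DOM methods. This is only a sketch under a few assumptions: `detailsHrefFor` is a made-up helper name, each row is assumed to contain a link with class `details`, and the code relies on `Element.closest`, which old versions of IE lack, so treat it as the modern-browser variant.

```javascript
// Given the element that was clicked, return the URL the row should go to,
// or null when there is nothing to do.
function detailsHrefFor(target) {
  // A click on a real link is left to the browser.
  if (target.closest('a')) return null;
  var row = target.closest('tr');
  var link = row ? row.querySelector('a.details') : null;
  return link ? link.getAttribute('href') : null;
}

// Browser wiring (skipped when there is no DOM, e.g. under Node).
if (typeof document !== 'undefined') {
  document.addEventListener('DOMContentLoaded', function () {
    var table = document.querySelector('table.dataTable');
    if (!table) return;
    // One listener on the table handles clicks in every cell.
    table.addEventListener('click', function (event) {
      var href = detailsHrefFor(event.target);
      if (href) window.location.href = href;
    });
  });
}
```

Keeping the lookup in its own function means the listener stays tiny, and a single listener on the table covers rows added later.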
This code does not validate, which is counter-intuitive to me. Inputs belong in a form, correct? Yet I get a validation error for each element within the form (input, input, a, a) to this effect:
Line 37, Column 91: document type does not allow element "input" here; missing one of "p", "h1", "h2", "h3", "h4", "h5", "h6", "div", "pre", "address", "fieldset", "ins", "del" start-tag
It seems absurd to wrap it in a fieldset and then add CSS to take out the fieldset box.
<form id="f3" method="post" action="interface_add.php">
<input onkeydown="this.value=''" class="te3" type="text" name="f3a" value="title"/>
<input onkeydown="this.value=''" class="te3" type="text" name="f3b" value="url"/>
<a id="f3c" class='but' href="javascript:void(0)" onclick="interface_add()">Add</a>
<a id="f3d" class='but' href="javascript:void(0)" onclick="interface_delete()">Delete</a>
</form>
It is highly recommended (it may even be required in strict) to use a <fieldset> when creating a form. You should include a <legend> as well.
But why should you really do it? Accessibility. Screen readers read this information so a user knows what information they should be providing in the grand scheme. I'm pretty sure Section 508 requires its use as well. It's been argued that federal laws don't apply to private websites, but tell that to Target, who was sued over accessibility issues.
In reality, it's not hurting anything to include it. And you can style it as you will.
Well, it might seem absurd, but that's how it is. You could always file a complaint with the W3C. So, wrap those inputs in divs, fieldsets, paragraphs or something similar to make the validator happy, and put that nice XHTML strict logo on your site :-)
I believe it's stating that those controls should not be directly in a form, but should be in one of those elements in the form, i.e.:
<form ...>
<p>
<input ... />
</p>
</form>
I have one error when validating my Joomla built site against the W3C validator. It's a closed tag but without an opening tag. Problem I'm having is the tag appears to be completely unrelated to any other elements and I can't see it via Firebug. Anyone know how I can track down what's causing this? It's a right pain. Details from the validator:
Line 403, Column 5: end tag for element "h2" which is not open
The Validator found an end tag for the above element, but that element is not currently open. This is often caused by a leftover end tag from an element that was removed during editing, or by an implicitly closed element (if you have an error related to an element being used where it is not allowed, this is almost certainly the case). In the latter case this error will disappear as soon as you fix the original problem.
If this error occurred in a script section of your document, you should probably read this FAQ entry.
and a snippet of how it appears in the HTML from view source:
<div id="ja-current-content" class="column" style="width:100%">
<div class="ja-content-main clearfix">
</h2>
<div class="article-content">
<h1>Welcome to SWAYsearch web design, Cambridge</h1>
The site is: http://www.swaysearch.com
Any help much appreciated.
Cheers
John
That's a known bug when you have titles disabled. You'll have to open that file manually and comment out the if statement that leaves the stray closing h2 tag.
I believe the opening h2 tag is this:
<h2 class="contentheading clearfix">
Search through your code for that string and you should find your issue.
I started using a diagnostic css stylesheet, e.g.
http://snipplr.com/view/6770/css-diagnostics--highlight-deprecated-html-with-css--more/
One of the suggested rules highlights input tags with the type submit, with the recommendation to use <button> as a more semantic solution. What are the advantages or disadvantages of <button> with type submit (such as with browser compatibility) that you have run across?
Just to be clear, I understand the spec of <button>: it has a defined start and end tag and can contain various elements, whereas input is a singlet and can't contain anything. What I essentially want to know is whether it's broken or not, and how usable button is at the current time. The first answer below does seem to imply that it is broken for every use except those outside of forms, unfortunately.
Edit for 2015
The landscape has changed! I have 6 more years experience of dealing with button now, and browsers have somewhat moved on from IE6 and IE7. So I'll add an answer that details what I found out and what I suggest.
When using <button> always specify the type, since browsers default to different types.
This will work consistently across all browsers:
<button type="submit">...</button>
<button type="button">...</button>
This way you gain all of <button>'s goodness, no downsides.
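One way to catch buttons that slipped through without a type is a small check run from the browser console during development. This is a rough sketch, not an established tool: `buttonsMissingType` is a made-up name, and it uses `hasAttribute`, so it is meant for a modern browser rather than the old IE versions discussed here.

```javascript
// Return the buttons in an array-like list that have no explicit type attribute.
function buttonsMissingType(buttons) {
  var missing = [];
  for (var i = 0; i < buttons.length; i++) {
    if (!buttons[i].hasAttribute('type')) missing.push(buttons[i]);
  }
  return missing;
}

// In a browser, warn about every <button> on the page that lacks a type.
if (typeof document !== 'undefined') {
  var untyped = buttonsMissingType(document.getElementsByTagName('button'));
  for (var j = 0; j < untyped.length; j++) {
    console.warn('button without an explicit type:', untyped[j]);
  }
}
```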
Answering from an ASP.NET perspective.
I was excited when I found this question and some code for a ModernButton control, which, in the end, is a <button> control.
So I started adding all sorts of these buttons, decorated with <img /> tags inside of them to make them stand out. And it all worked great... in Firefox, and Chrome.
Then I tried IE6 and got the "a potentially dangerous Request.Form value was detected", because IE6 submits the html inside of the button, which, in my case, has html tags in it. I don't want to disable the validateRequest flag, because I like this added bit of data validation.
So then I wrote some javascript to disable that button before the submit occurred. Worked great in a test page, with one button, but when I tried it out on a real page, that had other <button> tags, it blew up again. Because IE6 submits ALL of the buttons' html. So now I have all sorts of code to disable buttons before submit.
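A rough sketch of the "disable the buttons before submit" idea described above might look like the following. It is not the code from that project: `disableButtons` is a made-up name, and the wiring assumes the page's first form is the one being posted. It sticks to old-style syntax since the point is to run in IE6 and IE7.

```javascript
// Disable every button in an array-like list; return how many were changed.
function disableButtons(buttons) {
  var count = 0;
  for (var i = 0; i < buttons.length; i++) {
    if (!buttons[i].disabled) {
      buttons[i].disabled = true;
      count++;
    }
  }
  return count;
}

// Assumed wiring: disable all <button> elements just before the first form submits.
if (typeof document !== 'undefined') {
  var form = document.forms[0];
  if (form) {
    form.onsubmit = function () {
      disableButtons(form.getElementsByTagName('button'));
      return true; // let the submit continue
    };
  }
}
```

Disabled controls are not submitted at all, so with this in place the server no longer sees which button was pressed; that information has to travel some other way.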
Same problems with IE7. IE8 thankfully has this fixed.
Yikes. I'd recommend not going down this road IF you are using ASP.NET.
Update:
I found a library out there that looks promising to fix this.
If you use the ie8.js script from this library: http://code.google.com/p/ie7-js/
It might work out just fine. The IE8.js brings IE5-7 up to speed with IE8 with the button tag. It makes the submitted value the real value and only one button gets submitted.
Everything you need to know: W3Schools <button> Tag
The tag is supported in all major browsers.
Important: If you use the button element in an HTML form, different browsers will submit different values. Internet Explorer will submit the text between the <button> and </button> tags, while other browsers will submit the content of the value attribute. Use the input element to create buttons in an HTML form.
Pros:
The display label does not have to be the same as the submitted value. Great for i18n and "Delete this row"
You can include markup such as <em> and <img>
Cons:
Some versions of MSIE default to type="button" instead of type="submit" so you have to be explicit
Some versions of MSIE will treat all <button>s as successful so you can't tell which one was clicked in a multi-submit button form
Some versions of MSIE will submit the display text instead of the real value
From https://developer.mozilla.org/en-US/docs/Web/HTML/Element/button:
IE7 has a bug where when submitting a form with <button type="submit" name="myButton" value="foo">Click me</button>, the POST data sent will result in myButton=Click me instead of myButton=foo.
IE6 has an even worse bug where submitting a form through a button will submit ALL buttons of the form, with the same bug as IE7.
This bug has been fixed in IE8.
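The two bugs quoted above can be restated as a toy model. This is not browser code, just a sketch of the behaviour the quote describes: `submittedPairs` and the `'ie6'`, `'ie7'` and `'standard'` labels are made up for illustration, and URL encoding is ignored.

```javascript
// Toy model of which name=value pairs a form sends for its <button>s.
// buttons: array of { name, value, text }; clickedIndex: the button pressed;
// browser: 'ie6', 'ie7', or 'standard'.
function submittedPairs(buttons, clickedIndex, browser) {
  var pairs = [];
  for (var i = 0; i < buttons.length; i++) {
    var b = buttons[i];
    if (browser === 'ie6') {
      // IE6: every button is sent, each with its inner text.
      pairs.push(b.name + '=' + b.text);
    } else if (i === clickedIndex) {
      // IE7 sends the inner text; other browsers send the value attribute.
      pairs.push(b.name + '=' + (browser === 'ie7' ? b.text : b.value));
    }
  }
  return pairs;
}

var buttons = [
  { name: 'myButton', value: 'foo', text: 'Click me' },
  { name: 'other', value: 'bar', text: 'Cancel' }
];
console.log(submittedPairs(buttons, 0, 'standard')); // [ 'myButton=foo' ]
console.log(submittedPairs(buttons, 0, 'ie7'));      // [ 'myButton=Click me' ]
console.log(submittedPairs(buttons, 0, 'ie6'));      // [ 'myButton=Click me', 'other=Cancel' ]
```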
An important quirk to be aware of: In a form that contains a <button/> element, IE6 and IE7 will not submit the form when the <button/> element is clicked. Other browsers, on the other hand, will submit the form.
In contrast, no browsers will submit the form when <input type="button"/> or <button type="button"/> elements are clicked. And naturally, all browsers will submit the form when <input type="submit"/> or <button type="submit"/> elements are clicked.
As @orip's answer says, to get consistent submit behavior across browsers, always use <button type="button" /> or <button type="submit" /> inside a <form/> element. Never leave out the type attribute.
I've had some experience with the quirks of <button> now, 6 years later, so here are my suggestions:
If you're still supporting IE6 or IE7, be very careful with button. The behavior is very buggy in those browsers: in some cases they submit the innerHTML instead of value='whatever', or all button values instead of just one, and other wonky behavior like that. So test thoroughly, or avoid it for those browsers' sake.
Otherwise: If you're still supporting IE8, <a href='http://example.com'><button></button></a> doesn't work well, and probably anything else where you nest a button inside a clickable element. So watch out for that.
Otherwise: If you're using a <button> mainly as an element to click for your javascript, and it's outside of a form, make it <button type='button'> and you'll probably be just fine!
Otherwise: If you're using <button> in a form, be wary that the default type of <button> is actually <button type='submit'> in (most) cases, so be explicit with your type and your value, like: <button type='submit' value='1'>Search</button>.
Note that: Using a button-mimic class, like Bootstrap's .btn allows you to just make things like <div> or <a> or even <button> look exactly the way you want it to, and in the case of <a> have a more useful fallback behavior. Not a bad option.
TL;DR: OK to use if you don't care about ancient browsers, but Bootstrap's button classes offer a robust, visually similar CSS alternative worth looking into.
Is it broken or not:
As usual, the answer is "it works fine in all major browsers, but has the following quirks in IE." I don't think it will be a problem for you though.
The <button> tag is supported by all the major browsers. The only support problem lies in what Internet Explorer will submit upon pressing a button.
The major browsers will submit the content of the value attribute. Internet Explorer will submit the text between the <button> and </button> tags, and will also submit the value of every other button in the form instead of just the one you clicked.
For your purposes, just cleaning up old HTML, this shouldn't be a problem.
Sources:
http://www.peterbe.com/plog/button-tag-in-IE
http://www.w3schools.com/tags/default.asp
Here's a site that explains the differences:
http://www.javascriptkit.com/howto/button.shtml
Basically, the input tag allows just text (although you can use a background image), while the button tag allows you to add images, tables, divs and whatever else. Also, a button doesn't have to be nested within a form tag.
You might also run into these problems:
jQuery cannot target the button (not jQuery's fault, though): <button> in IE7
Multiple request variables if there are >1 <button>s: http://www.peterbe.com/plog/button-tag-in-IE
Another thing is related to styling it using the sliding-door technique: you need to insert another tag e.g. <span> to make it work.
As far as I am concerned, the difference between submit and button tags is this:
<button> gives you the option to have different text displayed than the element's value.
Let's say you have a list of products then next to each product you want a button to add it to the customer's cart:
product1 : <add to cart>
product2 : <add to cart>
product3 : <add to cart>
then you could do this:
<button name="buy" type="submit" value="product2"> add to cart </button>
Now the problem is that IE will send the form with value="add to cart" instead of value="product2"
The easiest way to work around this issue is by adding onclick="this.value='product2'"
So this:
<button name="buy" type="submit" value="product2" onclick="this.value='product2'"> add to cart </button>
will do the trick in all major browsers. I have actually used this on a form with multiple buttons, and it works with Chrome, Firefox and IE.
Looks like the main reason to use <button> is to allow for CSS markup of that button and the ability to style the button with images: (see here: http://www.javascriptkit.com/howto/button.shtml)
However, I think the more widely adopted approach I've seen in (X)HTML + CSS is to use a div and style it completely with images and :hover pseudo-classes to simulate the button press (I can't add more than one link per answer, so just google "div button" and you'll see lots of examples of this), and to use JavaScript to do the form submission or AJAX call. This makes even more sense if you don't use HTML forms and do all submissions with AJAX.