Eliminating CSS selectors when parsing with Nokogiri?

I am retrieving the latest news articles from the cnn.com website, and wrote a simple Nokogiri script to do this:
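require 'nokogiri'
require 'open-uri'
url = "http://edition.cnn.com/?refresh=1"
# URI.open comes from open-uri; plain open(url) no longer fetches URLs on Ruby 3+
doc = Nokogiri::HTML(URI.open(url))
puts doc.at_css("title").text
doc.css("#cnn_maintt2bul div+ div a").each do |headline|
  article = headline.text
  puts "#{article}"
end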
The problem is that CNN posts a mixture of articles and links to videos, and I am only interested in articles, not videos. When I run this script it retrieves all the articles but leaves an empty line wherever a headline links to a video, for example:
Pakistan airstrikes kill dozens
Could U.S. leave Afghanistan?
Editor's stabbing draws outrage
Ukrainian city fears uprising
U.S. hate groups in decline
In this example, the headline Ukrainian city fears uprising actually links to a video. The same thing happens for every video headline until the last article is retrieved.
I discovered that the video headlines include an element with the class cnnVideoIcon. Any ideas about how I could use this to remove the headlines that link to videos from my results?
How would I eliminate such links while I am parsing? They could appear anywhere.

I looked at the HTML source of the CNN site and found that the li element of a video headline has four child elements, while the li element of a text headline has only three:
<li class="c_hpbullet3" data-vr-contentbox="">
  <span class="cnnPreWOOL"></span>
  <a href="...">Ukrainian politics remain in flux</a>
  <span class="cnnPostWOOL"></span>
  <img class="cnnVideoIcon" width="16" height="10" border="0" alt="Ukrainian politics remain in flux" src="http://i.cdn.turner.com/cnn/.e/img/3.0/global/icons/video_icon.gif">
</li>
So we can use the following XPath, which keeps only the li elements with exactly three children:
doc.xpath("//div[@id='cnn_maintt2bul']/div/div/ul/li[count(*)=3]/a").each do |headline|
  article = headline.text
  puts "#{article}"
end
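For anyone who wants to try the idea without hitting the live site, here is a minimal sketch of the same "exactly three child elements" test. It uses REXML from Ruby's standard distribution instead of Nokogiri so that it runs without extra gems, and it does the counting in plain Ruby rather than inside the XPath. The sample markup, the hrefs, and the names SAMPLE_LIST and text_headlines are made up for illustration.

```ruby
require "rexml/document"

# Hypothetical, well-formed stand-in for the CNN headline list.
SAMPLE_LIST = <<~XML
  <div id="cnn_maintt2bul">
    <ul>
      <li>
        <span class="cnnPreWOOL"></span>
        <a href="/article-1">Pakistan airstrikes kill dozens</a>
        <span class="cnnPostWOOL"></span>
      </li>
      <li>
        <span class="cnnPreWOOL"></span>
        <a href="/video-1">Ukrainian politics remain in flux</a>
        <span class="cnnPostWOOL"></span>
        <img class="cnnVideoIcon" alt="Ukrainian politics remain in flux" src="video_icon.gif"/>
      </li>
    </ul>
  </div>
XML

# Return the link text of every <li> that has exactly three child elements,
# which is the condition the predicate li[count(*)=3] expresses.
def text_headlines(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, "//li")
              .select { |li| li.elements.size == 3 } # child elements only, text nodes are ignored
              .map { |li| li.elements["a"].text.to_s.strip }
end

puts text_headlines(SAMPLE_LIST)
```

Counting children is fragile if CNN ever adds another span, so treat the number 3 as an observation about the current markup rather than a rule.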

If you look at the source code of the blocks you're scraping from http://edition.cnn.com/?refresh=1, you will notice that videos are a link with a video icon (and no text), like so:
<a href="/video/data/...">
<img class="cnnVideoIcon" alt="Ukrainian city fears uprising" ...
height="10" width="16">
</a>
This explains why you get some empty lines.
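You could skip those links using a more refined selector. Note that a:empty will not do it: :empty matches only elements with no child nodes at all, text included, so it matches neither the text links nor the image links. Excluding the links that contain an image does work, for example:
#cnn_maintt2bul div + div a:not(:has(img))
With this selector you only retrieve links without an image inside, or, in other words, links with a description text only. If your Nokogiri version does not accept :has inside :not, the XPath predicate a[not(img)] expresses the same filter.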
Another (less optimized) approach is to simply skip the empty lines with an if statement:
doc.css("#cnn_maintt2bul div + div a").each do |headline|
  article = headline.text
  if article != ""
    puts "#{article}"
...
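Both filters can be tried offline with a small sketch. It uses REXML from Ruby's standard distribution as a stand-in for Nokogiri so that it runs without extra gems. The sample markup, the hrefs, and the names SAMPLE_LINKS and article_titles are assumptions for illustration, modelled on the video link shown above.

```ruby
require "rexml/document"

# Hypothetical block of links: two articles and one video link that wraps an icon.
SAMPLE_LINKS = <<~XML
  <div id="cnn_maintt2bul">
    <div>
      <a href="/article-1">Editor's stabbing draws outrage</a>
      <a href="/video-1"><img class="cnnVideoIcon" alt="Ukrainian city fears uprising" height="10" width="16"/></a>
      <a href="/article-2">U.S. hate groups in decline</a>
    </div>
  </div>
XML

# Return the titles of links that do not wrap an image and are not blank.
def article_titles(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, "//a")
              .reject { |a| a.elements["img"] }  # same idea as the predicate a[not(img)]
              .map { |a| a.text.to_s.strip }
              .reject(&:empty?)                  # same idea as the if statement above
end

puts article_titles(SAMPLE_LINKS)
```

Either filter alone removes the video link in this sample; keeping both just guards against links that are empty for some other reason.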

You should use something other than CSS selectors to find the desired tags. Use search (or xpath) instead of css and give it an XPath that selects only the elements that do not have the video link as a child.
I will update the answer with a designated XPath when you provide a real URL to the site you want to fetch information from.

Related

How to best format thumbnails for accessibility?

I am frequently tasked with displaying a grid of thumbnails for work, such as on a posts/articles page, with each thumbnail linking to a separate post/article, but I have never really been sure of the best way to format these for screen readers/accessibility. More specifically, I have never been sure whether to use the <article> or <figure> tag for this purpose, or neither, or something else entirely. Does anyone know? These are the three methods I am debating between:
<a>
<article>
<img />
<div></div>
</article>
</a>
<a>
<figure>
<img />
<figcaption></figcaption>
</figure>
</a>
<a>
<img />
<div></div>
</a>
The documentation for the article tag says that it "represents a self-contained composition in a document, page, application, or site, which is intended to be independently distributable or reusable". I don't know what that means in this context, but it seems like it could be intended for this purpose, or it could be meant to be used once on the actual article pages and not the overall "articles" list page.
The documentation for the figure tag says that it "represents self-contained content, potentially with an optional caption". It seems like it would work quite well here, except my intuition says that it might be intended more for figures that are inline with the text of articles, so I have my doubts.
The third option is to use neither the article nor the figure tag, simplifying the HTML as much as possible so that screen readers do not have to interpret as many nested tags.
References:
https://developer.mozilla.org/en-US/docs/Web/HTML/Element/article
https://developer.mozilla.org/en-US/docs/Web/HTML/Element/figure
From an accessibility perspective, I have not found much benefit to using an <article>. On iOS and Mac, Safari incorrectly treats an <article> as a landmark even though the definition of an article role specifically says it isn't.
An article is not a navigational landmark, but may be nested to form a discussion where assistive technologies could pay attention to article nesting to assist the user in following the discussion.
Notice that it says AT could pay attention to the article element, but other than the aforementioned treatment as a landmark in Safari, I have not found NVDA, JAWS, or VoiceOver to do anything special with an <article>.
If you plan on having a caption below the image, then you could use <figcaption>. It's just a handy way to visually display text below an image. But if the thumbnail doesn't have text below it but rather has a heading or link to the article, then <figcaption> isn't needed.
Your last example, the simplest, is the most common way to code what you want and works just fine for accessibility. I know your code snippets were just minimal code but make sure your <img> uses the alt attribute.
If your image is inside your link (as in your example) and there's other text containing the title of the article within the link, then the image can have an empty alt="" (or even just alt with no value). But if there isn't any visible text in the link, then make sure the image has an appropriate alt attribute value.
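If you want to check a page of thumbnails against that rule, a rough sketch like the following can flag links that end up with no text at all. It approximates a link's accessible name as its text plus the alt text of any images inside it, which is a simplification of what screen readers actually compute. It uses REXML from Ruby's standard distribution, so the markup has to be well-formed; the sample markup and the names SAMPLE_THUMBNAILS, accessible_text and unnamed_links are made up for illustration.

```ruby
require "rexml/document"

# Hypothetical thumbnail grid: only the second link has no text and no alt text.
SAMPLE_THUMBNAILS = <<~XML
  <main>
    <a href="/post-1"><img src="one.jpg" alt=""/><div>First post title</div></a>
    <a href="/post-2"><img src="two.jpg" alt=""/></a>
    <a href="/post-3"><img src="three.jpg" alt="Third post title"/></a>
  </main>
XML

# Rough approximation of an element's accessible name: its text nodes plus
# the alt text of any images nested inside it, with whitespace collapsed.
def accessible_text(element)
  parts = element.children.map do |child|
    case child
    when REXML::Text then child.value
    when REXML::Element
      child.name == "img" ? child.attributes["alt"].to_s : accessible_text(child)
    else ""
    end
  end
  parts.join(" ").split.join(" ")
end

# Return the href of every link that has neither text nor image alt text.
def unnamed_links(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, "//a")
              .select { |a| accessible_text(a).empty? }
              .map { |a| a.attributes["href"] }
end

puts unnamed_links(SAMPLE_THUMBNAILS)
```

A link flagged here needs either visible text or a meaningful alt value on its image.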

How do I structure my HTML semantically correct for screen readers when the visual order of elements is different?

I’m trying to make a search result list more accessible.
Let's say I have a list of search results that are structured in the following way:
<article>
<h2>Name of the author</h2>
<h1><a>Name of the book</a></h1>
<div class="seperator">
<div class="availability-status status1" title="available"></div>
<div class="icon icon-book" title="Book"></div>
<div class="result-button-group">
Sharing
…
</div>
</div>
<p class="imprint">Publishing house (Year)</p>
<p class="series">Part of: name of the series</p>
</article>
The name of the book is a link to another page, while the other elements around it are additional information for the corresponding item.
Visually it looks like this:
How do I structure the markup semantically correct so that users with screen readers can make sense of the result item?
When they navigate on a link to link basis they land on the name of the book, but might miss the author field that is above the title, right? Can I achieve this with aria-attributes? Or is this structured enough to make sense of regardless?
I played around with VoiceOver myself to try to make sense of it but I’m far from an expert. So any input is appreciated.
Outline
You should not use an h2 for the author name. This heading would become the heading for the article element (as it's the first one), and the heading for the book title would create another section on the same level.
Instead, use only one heading (the book title would make the most sense) and group it with the author name (for which you could use a cite element) in a header element.
<article>
<header>
<cite>Name of the author</cite>
<h1><cite>Name of the book</cite></h1>
</header>
<!-- … -->
</article>
Link
When they navigate on a link to link basis they land on the name of the book, but might miss the author field that is above the title, right?
Yes. But that’s not a problem, it’s exactly what the screen reader user expects/wants to do (finding links, not anything else).
You could, however, consider adding the author name to the link/heading, too:
<h1><cite>Name of the author</cite>: <cite>Name of the book</cite></h1>
Font icons
Note that this is likely inaccessible, because the element has no content (the generated image is useless for user agents without CSS, blind users, etc., and the meaning it conveys is not represented in any alternative way):
<div class="availability-status status1" title="available"></div>
The title attribute is not sufficient. Either use an img (with alt), or add alternative text (and visually hide it).
And this seems to be pure decoration, so there’s no need for a title attribute (and it would be inaccessible to many users anyway, because the element has no content):
<div class="icon icon-book" title="Book"></div>
(But if the information that it's a book is important, e.g. because there are magazines etc. too, then you should provide an alternative, just like in the case above.)

xquery- how to get content of a node which is immediately after a node with known text

I am trying to extract content from an XHTML document. In this document, within a div, there are a number of b elements, each followed by a link.
For example:
<div id="main">
  <b> Bold text 1</b>
  <a href="...">some link 1</a>
  <b> Bold text 2</b>
  <a href="...">some link 2</a>
  <b> ABRACADABRA</b>
  <a href="...">abracadbralink</a>
</div>
Now, I want to extract the link 'abracadabralink'. The problem is that I don't know how many b and a elements come before this specific link: different documents contain different numbers of them, and sometimes several links follow a single b element. All I do know is that the text of the b element just before the link I want is always the same.
So the only fixed information is that I want the link immediately after the b element with the known text. How do I get this link using XQuery?
If I get it right, you are interested in the value of the @href attribute? This can be done with standard XPath syntax:
doc('yourdoc.xml')//*[. = ' abracadbralink']/@href/string()
For more information on XPath, I’d advise you to check out some online tutorials, such as http://www.w3schools.com/xpath/default.asp
I guess the following should work for you:
$yournode/b[. = ' ABRACADABRA']/following-sibling::a/@href/string()
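One thing to watch: following-sibling::a returns every a element after the matching b, not just the first, so when other links follow you may want following-sibling::a[1], or following-sibling::*[1][self::a] to insist that the very next element is a link. Here is a minimal sketch of that stricter variant in plain Ruby, using REXML from the standard distribution rather than an XQuery processor. The hrefs, the extra trailing link, and the names SAMPLE_XHTML and href_after_label are assumptions for illustration.

```ruby
require "rexml/document"

# Hypothetical document: the wanted link is followed by another, unwanted link.
SAMPLE_XHTML = <<~XML
  <div id="main">
    <b> Bold text 1</b>
    <a href="/link-1">some link 1</a>
    <b> ABRACADABRA</b>
    <a href="/wanted">abracadbralink</a>
    <a href="/other">another link</a>
  </div>
XML

# Return the href of the <a> that is the very next element after the <b>
# whose (whitespace-trimmed) text equals label, or nil if there is none.
def href_after_label(xml, label)
  doc = REXML::Document.new(xml)
  bold = REXML::XPath.match(doc, "//b").find { |b| b.text.to_s.strip == label }
  link = bold&.next_element # next sibling element, skipping whitespace text nodes
  link.attributes["href"] if link && link.name == "a"
end

puts href_after_label(SAMPLE_XHTML, "ABRACADABRA")
```

Trimming the text before comparing avoids depending on the leading space in ' ABRACADABRA'; drop the strip call if the whitespace is significant in your documents.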

what is the correct way to code incoming links for SEO?

Our site is giving out 'badges' to our authors. They can post these on their personal blogs, where they will serve as incoming links to our site.
We want to give out the best possible code for SEO without doing anything that would get us flagged.
I would like to know what your thoughts are on the following snippet of code, and whether anyone has any DEFINITE advice on dos and don'ts with it. Also, let me know if any of it is redundant or not worth it for SEO purposes.
I've kept the CSS inline since some of the writers would not have access to add a link to an external stylesheet.
I've changed the real values, but title, alt etc. would be descriptive keywords similar to our page titles (no overloading keywords or any of that).
<div id="writer" style="width:100px;height:50px;">
  <h1>
    <strong style="float:left;text-indent:-9999px;overflow:hidden;margin:0;padding:0;">articles on x,y,z</strong>
    <a href="http://www.site.com/link-to-author" title="site description">
      <img style="border:none" src="http://www.site.com/images/badge.png" alt="description of articles" title="View my published work on site.com"/>
    </a>
  </h1>
</div>
thanks
Using H1 to enclose your "badge" is a really bad idea. It is not so much that it will hurt SEO for your site, but it will very likely ruin the accessibility (and thus SEO) of the author's site. H1-H6 are used to provide document structure by semantically delimiting document headings. Random use of heading tags can confuse screen readers and web crawlers. There's not much you can do in terms of legitimate SEO aside from making correct use of semantic HTML markup.
Edit:
Something like this would be the safest bet:
<div id="writer-badge" style="width: 100px; height: 50px;">
<strong>
Articles on x,y,z
</strong>
<br />
<a href="..." title="site description" rel="profile">
<img style="border: none" src="..." alt="..."
longdesc="http://site.com/badges-explained"
/>
</a>
</div>
I put a line-break between the text and image to treat the text as sort of a badge title. If it's not meant to be displayed that way, then I would omit the <strong> tags altogether (there's no semantic value in encapsulating the text that way, and any styling could be done using the DIV or a weight-neutral SPAN element).
IMO there's really no reason for an achievement badge to have a heading of its own (it's really not even part of the document, just a flourish in the layout), but if you absolutely must, then H6 would be more appropriate and safer to use than H1.
As far as keyword proximity, that is sorta venturing into the grey-hat area of SEO (similar to keyword stuffing), and I wouldn't know anything about that. I've yet to come across any reliable info on how Google or other search engines treat keyword placement. I think if you properly use tag attributes like alt, title, longdesc, rel, rev, etc. in images and links, you'll be alright.
I don't think there is any issue with this code except your <h1> tag. I would probably change it to <h2> simply because pages are supposed to have only 1 <h1> tag per page.
You could also use an iFrame instead if you wanted. That is what SO does but I know you will not get as much linky goodness.

Apply a href-like attribute to non-<a> elements

I've been working on a page where there are several entries contained in different <div>s. Each is only a title linked to a page, an image and a short description. However, the description may contain arbitrary tags, including <a> tags.
Since these are pretty straightforward and the actual link isn't that big, I've made it so a click on the <div> will call location.href = (link URL). However, that's a pretty sad thing, because it's browser-unfriendly: for instance, under Google Chrome, a middle-click on one of said <div>s won't open the link in a new tab.
Considering you shouldn't nest <a> tags, is it possible to make any element in XHTML behave like a link without resorting to Javascript?
I'm using XHTML 1.1, sent with the proper MIME type, and that's the only restriction I'm bound to.
Not really, no. Though it's worth reading Eric Meyer's thoughts on this. Also, it appears that HTML5 [1] includes the capacity for any element to become a link, so it might be worth using that doctype instead of XHTML, if possible.
It's also worth adding that HTML5 does allow an <a> element to enclose block-level elements, see: http://www.brucelawson.co.uk/2008/any-element-linking-in-html-5/. The example below is taken from the linked page:
Instead of:
<h3><a href="story.htm">Bruce Lawson as Obama's running mate!</a></h3>
<a href="story.htm"><img src="bruce.jpg" alt="lovegod" /></a>
<p><a href="story.htm">In answer to McCain's appointment of MILF, Sarah Palin, Obama hires DILF, Bruce Lawson, as his running mate. Read more!</a></p>
you can say:
<a href="story.htm">
<h3>Bruce Lawson as Obama's running mate!</h3>
<img src="bruce.jpg" alt="lovegod" />
<p>In answer to McCain's appointment of MILF, Sarah Palin, Obama hires DILF, Bruce Lawson, as his running mate. Read more!</p>
</a>
Updated to mention possible inaccuracy
1: I may have misinterpreted part of the document I linked to. When I tried to find support for my claim that it 'appears that HTML5 includes the capacity for any element to become a link' in the W3C's HTML5 overview, it did not seem to be there. I think I was over-encouraged when I saw Meyer's proposal to include that possibility.
I'm too gullible, and naive... =/
If you want a link to cover an entire div, an idea would be to create an empty <a> tag as the first child:
<div class="covered-div">
  <a class="cover-link" href="/my-link"></a>
  <!-- other content as usual -->
</div>
div.covered-div {
  position: relative;
}
a.cover-link {
  position: absolute;
  top: 0;
  bottom: 0;
  left: 0;
  right: 0;
}
This works especially great when using <ul> to create block sections or slideshows and you want the whole slide to be a link (instead of simply the text on the slide). In the case of an <li> it's not valid to wrap it with an <a> so you'd have to put the cover link inside the item and use CSS to expand it over the entire <li> block.
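Do note that the absolutely positioned cover link is painted above the non-positioned content of the div, so other links or buttons inside the text become unreachable by clicks. If you want them to stay clickable, give them position: relative (plus a z-index if needed) so that they are painted above the cover link.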

Resources