How to convert set of images to proper RSS description? - rss

I am trying to build an RSS feed. First I get the page with XPath Fetch Page (the XPath is //div[@class='topic']/h2/a), then I Loop over each item and run XPath Fetch Page again with the URL set to item.href, extracting the images there with the XPath //*[@id="topic"]/div[4]/div/div/a/img.
As a result I get a description that contains a set of images like:
0
alt Sample title
height 925
src http://cdn.example.com/f2dd/3702212_6cddf28e.jpg
style width:640px; height:925px;
width 640
1
alt Sample title
height 920
src http://cdn.example.com/1cc7/3702213_00acefab.jpg
style width:640px; height:920px;
width 640
...
How should I convert it to one text string with a number of <img src="..."> elements? If I Loop through each item, I get just one element (the first).

So you want to join / concatenate a list of images... I don't think there's an operator for that.
The next best thing is a dirty hack like this:
In your loops, assign to x instead of description
Add a Regex operator with params:
in: item.description
replace: .*
with: ${x.0}${x.1}${x.2}${x.3}${x.4} (and so on)
UPDATE
As @LA_ commented, the replacement should be like this instead: <img src="${x.0.src}"><br><img src="${x.1.src}">
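For comparison, the same concatenation outside of Pipes can be sketched in plain JavaScript. This is only an illustration under stated assumptions: the helper names buildDescription and escapeAttr are hypothetical, and the input is assumed to be an array of objects with src and alt fields, mirroring the fields listed in the question.

```javascript
// Hypothetical helper: escapes characters that would break an HTML attribute.
function escapeAttr(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Hypothetical helper: turns a list of image records into one string of
// <img> elements separated by <br>, with no fixed limit on the list length.
function buildDescription(images) {
  return images
    .map((img) => `<img src="${escapeAttr(img.src)}" alt="${escapeAttr(img.alt || "")}">`)
    .join("<br>");
}

// Sample data copied from the question; the array shape is an assumption.
const images = [
  { src: "http://cdn.example.com/f2dd/3702212_6cddf28e.jpg", alt: "Sample title" },
  { src: "http://cdn.example.com/1cc7/3702213_00acefab.jpg", alt: "Sample title" },
];
console.log(buildDescription(images));
```

Unlike the ${x.0}${x.1}... replacement string, map and join handle any number of images without listing each index by hand.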

Related

XPATH how to get the attr by selecting the class

I want to get the value of an attribute of an image, such as 'src'. The image also has a class name, and I select the image by that class name, but how can I get the attribute?
<img src="http://somelink.jpg" class="img-fluid">
Here's how I use the XPath selector to select an image by its class name:
//img[@class="img-fluid"]
How can I get the value of the image's "src" attribute, so that I end up with the link "http://somelink.jpg"?
You can use the XPath below to get the src value.
//img[@class="img-fluid"]/@src
For example, you can get the src attribute of all avatar images on this page using the XPath below.
//*[contains(@class, "user-gravatar")]//img/@src
You may also take a look at this xpath cheatsheet.

Saving an image using Nokogiri

I'm using the following piece of code to scrape specific images from a webpage. There are multiple images on this page with the image tag, so how does this code interpret that? I've noticed that it saves only the first image with the image tag. Is this true in general?
Am I correct in reasoning that this code starts reading the css from top to bottom and once it finds the first image with the image tag it saves it and stops looking further? Because I need it to do just that.
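PAGE = "http://example.com/page.html"
require 'nokogiri'
require 'open-uri'
# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs.
html = Nokogiri.HTML(URI.open(PAGE))
# at returns only the first element matching the selector.
src = html.at('.image')['src']
File.open("foo.png", "wb") do |f|
  f.write(URI.open(src).read)
end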
Yes,
html.at finds the first matching element only
html.search finds all matching elements
Does that answer your question?
On a related note,
html.at(".image") finds the first element with class="image", even if it is, for example, a <div> tag
html.at("img.image") finds the first <img> element with class="image"
html.at("img") finds the first <img> element

How do I select only a certain character of text and turn it into a break?

I will have users input text in a textbox to set as their identifier, however, they can only enter 1 line of text. I have no way of changing that.
I would like to add CSS that takes the string of text and changes the | character into a <br>.
The string of text they will type will be something like this: 1234-5678-1234 | Jim
I want it to show up like this:
1234-5678-1234
Jim
I'm guessing the code might look like this:
p:contains('|') {code for an enter and float right}
I would be posting this as comment but I need 50 rep :)
In short: what you are trying to do needs JavaScript, and a RegExp is worth a try. There is no way to do it using pure CSS.
It is not possible to select an element on the basis of its textual content, except for the special case of empty content. There was once (in 2001) a draft suggesting a :contains(...) selector, but this feature was removed as the draft progressed (to eventually become Selectors Level 3 recommendation).
Still less is there a way to select something inside an element based on its content.
Besides, adding <br> would not be possible. You cannot add tags or elements with CSS, only textual content via pseudo-elements.
Moreover, if the input is read in an input element, you cannot make its content display on two lines. If the user input is actually echoed in a different element, such as a p element, then it is programmatically copied there, so the question is why the change is made there. You can modify the content with JavaScript, and it would be rather simple to replace any | with <br> in the content of a p element.
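A minimal sketch of that replacement, assuming the raw text is already available as a string. The helper names escapeHtml and pipesToBreaks are hypothetical, and the escaping step is an addition so that user input is not interpreted as markup.

```javascript
// Hypothetical helper: escapes characters that have a special meaning in HTML.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Hypothetical helper: splits on every | character, trims the surrounding
// whitespace of each part, and joins the parts with <br>.
function pipesToBreaks(text) {
  return text
    .split("|")
    .map((part) => escapeHtml(part.trim()))
    .join("<br>");
}

console.log(pipesToBreaks("1234-5678-1234 | Jim"));
// prints "1234-5678-1234<br>Jim"

// In a browser, the result could then be written into the p element, e.g.
// document.querySelector("p").innerHTML = pipesToBreaks(rawText);
```

Splitting and joining, rather than replacing a single occurrence, also covers input that contains more than one | character.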
You could try using <div contenteditable="true"></div>. You would then be able to solve your issue with JS by wrapping the text and adding tags as necessary. Also, a text box won't support multiple lines or formatting.
A good read for contenteditable attribute on MDN: https://developer.mozilla.org/en-US/docs/Web/Guide/HTML/Content_Editable
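I did some testing and came up with the following, which splits the string with indexOf and slice rather than a RegExp:
<input type="text" id="demo" value="123-456-7890 | John Doe" size="35">
<button onclick="myFunction()">Try it</button>
<p style="text-align:left;" id="number"></p>
<p style="text-align:right;" id="name"></p>
<script>
function myFunction() {
  var str = document.getElementById("demo").value;
  // Position of the separator; -1 if the user did not type a | character.
  var sep = str.indexOf('|');
  if (sep === -1) { sep = str.length; }
  // Text before the separator is the number, text after it is the name.
  var number = str.slice(0, sep).trim();
  var name = str.slice(sep + 1).trim();
  // textContent avoids interpreting the user's input as HTML.
  document.getElementById("number").textContent = number;
  document.getElementById("name").textContent = name;
}
</script>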
Is this what you want?

Eliminating CSS selectors when parsing with Nokogiri?

I am retrieving the latest news articles from the cnn.com website, and wrote a simple Nokogiri script to do this:
require 'nokogiri'
require 'open-uri'
url = "http://edition.cnn.com/?refresh=1"
# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs.
doc = Nokogiri::HTML(URI.open(url))
puts doc.at_css("title").text
doc.css("#cnn_maintt2bul div+ div a").each do |headline|
  article = headline.text
  puts article
end
The problem is that CNN posts a mixture of articles and links to videos, and I am only interested in articles, not videos. When I run this script it retrieves all the articles but leaves a blank line wherever a headline links to a video, for example:
Pakistan airstrikes kill dozens
Could U.S. leave Afghanistan?
Editor's stabbing draws outrage
Ukrainian city fears uprising
U.S. hate groups in decline
This would mean that Ukrainian city fears uprising would actually link to a video. It would do this until it retrieves the last article.
I discovered that the video links carry an image with the class .cnnVideoIcon. Any ideas about how I could use this so that headlines linking to videos are removed from my results?
How would I eliminate such links while I am parsing? They could appear anywhere.
I looked at the HTML source code of the CNN site and found that the "li" tag of a video headline has four child elements, while the "li" tag of a text headline has only three.
<li class="c_hpbullet3" data-vr-contentbox="">
<span class="cnnPreWOOL"></span>
Ukrainian politics remain in flux
<span class="cnnPostWOOL"></span>
<img class="cnnVideoIcon" width="16" height="10" border="0" alt="Ukrainian politics remain in flux" src="http://i.cdn.turner.com/cnn/.e/img/3.0/global/icons/video_icon.gif">
</li>
So, we can use the XPath syntax below:
doc.xpath("//div[@id='cnn_maintt2bul']/div/div/ul/li[count(*)=3]/a").each do |headline|
  article = headline.text
  puts article
end
If you look at the source code of the blocks you're scraping from http://edition.cnn.com/?refresh=1, you will notice that videos are a link with a video icon (and no text), like so:
<a href="/video/data/...">
<img class="cnnVideoIcon" alt="Ukrainian city fears uprising" ...
height="10" width="16">
</a>
This explains why you get some empty lines.
You could skip those links using a more refined selector like:
#cnn_maintt2bul div + div a:empty
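Note, however, that :empty matches only elements with no child nodes at all, text included, so a:empty will not return the text headlines. The intent is to retrieve only links without an image or other element inside, in other words links that contain description text only; an XPath predicate such as a[not(img)] expresses that condition directly.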
Another (less optimized) approach is to simply skip the empty lines with an if statement:
doc.css("#cnn_maintt2bul div + div a").each do |headline|
  article = headline.text
  if (article != "")
    puts "#{article}"
...
You should use something other than CSS selectors to find the desired tags. Use search instead of css and give it an XPath that selects only the elements that don't have the video link as a child.
I will update the answer with a designated XPath when you provide a real URL to the site you want to fetch information from.

xquery- how to get content of a node which is immediately after a node with known text

I am trying to extract content from an XHTML document. In this document, within a div, there are a number of 'b' elements, each followed by a link.
For example:
<div id="main">
<b> Bold text 1</b>
some link 1
<b> Bold text 2</b>
some link 2
<b> ABRACADABRA</b>
abracadbralink
</div>
Now, I want to extract the link 'abracadabralink'. The problem is that I don't know how many <b> and <a> elements there are before this specific link: different documents contain different numbers of such elements, and sometimes there are many links immediately after a single <b> element. All I do know is that the text of the <b> element that occurs just before the link I want is always fixed.
So the only fixed information is that I want the link immediately after the element with known text. How do I get this link using XQuery?
If I get it right, you are interested in the value of the @href attribute? This can be done with standard XPath syntax:
doc('yourdoc.xml')//*[. = ' abracadbralink']/@href/string()
For more information on XPath, I’d advise you to check out some online tutorials, such as http://www.w3schools.com/xpath/default.asp
I guess the following should work for you:
$yournode/b[. = ' ABRACADABRA']/following-sibling::a/@href/string()

Resources