I'm using the following piece of code to scrape specific images from a webpage. There are multiple images on this page with the image tag, so how does this code interpret that? I've noticed that it saves only the first image with the image tag. Is this true in general?
Am I correct in reasoning that this code starts reading the css from top to bottom and once it finds the first image with the image tag it saves it and stops looking further? Because I need it to do just that.
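require 'nokogiri'
require 'open-uri'

PAGE = "http://example.com/page.html"

# Kernel#open no longer opens URLs in Ruby 3.0+; open-uri provides URI.open.
html = Nokogiri::HTML(URI.open(PAGE))
# `at` returns only the first element that matches the selector.
src = html.at('.image')['src']
File.open("foo.png", "wb") do |f|
  f.write(URI.open(src).read)
end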
Yes,
html.at finds the first matching element only
html.search finds all matching elements
Does that answer your question?
On a related note,
html.at(".image") finds the first element with class="image", even if it is, e.g., a <div> tag
html.at("img.image") finds the first <img> element with class="image"
html.at("img") finds the first <img> element
Related
Please, I use a small, fixed pair of buttons (an up and a down) at the bottom right of the page for in-page navigation. I also have a larger return arrow, with the code snippet below, meant to go back to the previous HTML menu page:
<div id="Div-icon-returnback-indiv" class="IconOpacityControl-p5">
<a href="#Previous" onclick="history.back();">
<img id="icon-returnback-indiv" src="./Button-PREVIOUS-AD151D-E.png" alt="Previous"/></a>
</div>
The visitor arrives from the parent page, which is also the page I intend to return to, e.g.
http://gladheart.royalwebhosting.net/Menu%20-%20Rivers%20of%20Mind%20and%20Heart.html
The problem develops on the child page when its up or down buttons are used. Using them appends my associated ids, #Top and #Bottom, to the initial child page URL, i.e.
http://gladheart.royalwebhosting.net/Poetry%20-%20Alive%20and%20Living%20in%20this%20Body.html
which is then added as a new 'history' entry, i.e.
http://gladheart.royalwebhosting.net/Poetry%20-%20Alive%20and%20Living%20in%20this%20Body.html#Bottom
If the child page's up and down arrows are used more than once, more than one unwanted '#'(id) entry can end up stacked in the history. Is it possible to have JavaScript walk back through the stack to the first entry that has no '#'(id) string at the end of its path (in my case specifically eliminating '#Top' and/or '#Bottom', if that makes it easier, and dealing with the original unappended URL of the child page being navigated up and down on), and then use that result in the 'onclick="history.back();"' snippet above, so that it goes back to the parent HTML page it really came from rather than to the top or bottom of the current child page?
As I commented below, perhaps @Will Peavy's suggestion (in another thread) of 'document.referrer' instead of 'history.back' might be the shortcut silver bullet, if I used the correct syntax. In the code snippet above, I have tried, to no avail, substituting:
<!-- failure -->
<a href="#Previous" onclick="document.referrer();">
<!-- failure -->
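A note on why the attempt above fails: document.referrer is a string property, not a function, so calling it throws an error. Scripts also cannot read the entries of the history stack; only its length and navigation methods are exposed. The following is a minimal sketch, not a definitive fix, assuming the browser reports a referrer for the parent page. The helper names withoutFragment and pickReturnUrl and the fallback 'Menu.html' are hypothetical and do not come from the original post.

```javascript
// Strip the "#fragment" part so "page.html#Top" and "page.html" compare equal.
function withoutFragment(url) {
  const hashIndex = url.indexOf('#');
  return hashIndex === -1 ? url : url.slice(0, hashIndex);
}

// Hypothetical helper: decide where the "Previous" button should go.
// Returns the referrer when it is a different page than the current one,
// otherwise the fallback URL supplied by the caller.
function pickReturnUrl(referrer, currentUrl, fallbackUrl) {
  if (referrer && withoutFragment(referrer) !== withoutFragment(currentUrl)) {
    return withoutFragment(referrer);
  }
  return fallbackUrl;
}

// Assumed browser wiring for the <a> element ('Menu.html' is a placeholder):
// onclick="location.href = pickReturnUrl(document.referrer, location.href, 'Menu.html'); return false;"
```

The referrer is recorded when the child page is loaded, so in-page jumps to #Top or #Bottom should not change it. It can be empty (for example, when the page is opened from a bookmark), which is why the sketch takes a fallback.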
I am trying to print the href of a link in an HTML document, but I am not able to do so.
require 'nokogiri'
require 'open-uri'

newurl = 'http://www.heroesfire.com/hots/guide/the-many-ways-of-abathur-1194'
# Kernel#open no longer opens URLs in Ruby 3.0+; open-uri provides URI.open.
buildpage = Nokogiri::HTML(URI.open(newurl))
#puts buildpage
thistext = buildpage.css("div#wrap div#site-content.self-clear div#guide.view-guide div.col-l div.tab-contents.box div.guide-tab div.chapter-text div.text table.bbcode_columns tbody tr td.bbcode_column a").each do |href|
  puts href['href']
end
I am expecting to see '/hots/wiki/talents/pressurized-glands'
I was able to get something similar to work earlier in my script, but I am having zero luck with this.
Invariably, the longer the Node selector, the less likely it will work correctly, especially if you're dealing with HTML you don't control.
Reduce it to find way-points, places that help you drill down instead of trying to define each step.
You're also relying on tbody in the selector. When we see that, the odds are good that it's not in the original HTML source but instead was injected by your browser. Selectors like that smell of using a browser and an inspector to locate a particular item in the page, but the resulting path won't work if the HTML doesn't actually contain tbody. Browsers do a lot of fix-up in an attempt to present something useful, including adding tags. So be careful when you see tbody and confirm it actually exists. In your case, it does, but the concern still exists when navigating through a document.
A simple example of simplifying the path is:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <div id="foo">
      <div id="bar">
        <p>text1</p>
      </div>
      <div id="baz">
        <p>text2</p>
      </div>
    </div>
  </body>
</html>
EOT
doc.at('body div#foo div#bar p').text # => "text1"
Can be written more easily, while still accomplishing the same thing, using:
doc.at('#bar p').text # => "text1"
or perhaps one of these:
doc.at('#foo div p').text # => "text1"
doc.search('#foo div p').first.text # => "text1"
All scraping requires at least some advance knowledge of the target page's structure, so, while you're nosing around, take note of the important layout tags. id attributes are especially useful, followed by class attributes and/or unique patterns of tags not replicated elsewhere in the document. Those make it easy to reduce the selector. Sometimes we have to step into the document incrementally, like I did using first, or with one of the "sibling" methods after locating a particular node, but a long selector is rarely needed.
I am trying to extract content from an XHTML document. In this document, within a div, there are a number of 'b' elements, each followed by a link.
For example:
<div id="main">
<b> Bold text 1</b>
some link 1
<b> Bold text 2</b>
some link 2
<b> ABRACADABRA</b>
abracadbralink
</div>
Now, I want to extract the link 'abracadabralink'. The problem is that I don't know how many <b> and <a> elements come before this specific link. Different documents have different numbers of such elements, and sometimes there are several links immediately after a single <b> element. All I do know is that the text of the <b> element that occurs just before the link I want is always fixed.
So the only fixed information is that I want the link immediately after the <b> element with known text. How do I get this link using XQuery?
If I get it right, you are interested in the value of the @href attribute? This can be done with standard XPath syntax:
doc('yourdoc.xml')//*[. = ' abracadbralink']/@href/string()
For more information on XPath, I’d advise you to check out some online tutorials, such as http://www.w3schools.com/xpath/default.asp
I guess the following should work for you:
$yournode/b[. = ' ABRACADABRA']/following-sibling::a/@href/string()
I have some HTML stored in a database.
I don't know whether the stored HTML contains extra closing tags such as </div>.
I want to find the extra closing div tags in the HTML string.
I have tried to do this with the HTML Agility Pack but could not find a way to achieve it.
Example:
<div class="readers">
A total of 218 users are reading this article.
</div>
</div>
</div>
How can I find these two extra closing div tags and extract fully valid HTML?
Use this pure javascript parser before rendering the html: http://ejohn.org/blog/pure-javascript-html-parser/
You can check it out by pasting your code here:
http://ejohn.org/apps/htmlparser/
it removes the extra </div>s.
You just need to pass your html to the HTMLtoXML function as:
HTMLtoXML(your_html);
and it will remove the extra closing tags. In fact, what it does is convert the input into XML format, but since you are dealing with HTML strings and all tags are expected to be valid HTML, it is safe to use.
EDIT: You can easily call javascript functions from a C# file. See this question for more details.
Click here to find both unclosed (hanging) as well as extra div tags: tormus
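For the narrow case in the question (only stray </div> tags), a depth counter is one way to locate them without a full parser. This is a rough sketch in JavaScript, not the method used by any of the tools mentioned above; the function name removeExtraClosingDivs is made up for illustration, and it assumes no div tags appear inside comments, scripts, or attribute values.

```javascript
// Remove closing </div> tags that have no matching opening <div>.
// Assumption: tags are not hidden inside comments, scripts, or attribute values.
function removeExtraClosingDivs(html) {
  let depth = 0;   // number of <div> elements currently open
  let removed = 0; // number of stray </div> tags dropped
  const cleaned = html.replace(/<\/?div\b[^>]*>/gi, (tag) => {
    if (tag[1] !== '/') { // opening tag: one level deeper
      depth += 1;
      return tag;
    }
    if (depth > 0) {      // closing tag that matches an open <div>
      depth -= 1;
      return tag;
    }
    removed += 1;         // closing tag with nothing left to close
    return '';
  });
  // `unclosed` reports <div> tags that were opened but never closed.
  return { cleaned, removed, unclosed: depth };
}
```

Run on the example from the question, the sketch reports two removed tags and zero unclosed ones. A real HTML parser remains the more robust choice for arbitrary input.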
I have HTML content that is entered by users via a rich-text editor, so it can be almost anything (minus anything that belongs outside the body tag, so no worries about "head", doctype, etc.).
An example of this content:
<h1>Header 1</h1>
<p>Some text here</p><p>Some more text here</p>
<div align=right>A link here</div><hr />
<h1>Header 2</h1>
<p>Some text here</p><p>Some more text here</p>
<div align=right>A link here</div><hr />
The trick is, I need to extract only the first 100 characters of the text (HTML tags stripped). I also need to retain the line breaks and not break any word.
So the output for the above will be something like:
Header 1
Some text here
Some more text here
A link here
Header 2
Some text here
Some
It has 98 characters and the line breaks are retained. What I have achieved so far is stripping all the HTML tags using Regex:
Regex.Replace(htmlStr, "<[^>]*>", "")
Then I trim the length, also using Regex:
Regex.Match(textStr, @"^.{1,100}\b").Value
My problem is how to retain the line breaks. I get output like:
Header 1
Some text hereSome more text here
A link here
Header 2
Some text hereSome more text
Notice the joined sentences? Perhaps someone can show me some other way of solving this problem. Thanks!
Additional Info: My purpose is to generate a plain-text synopsis from a bunch of HTML content. Guess this will help clarify the problem.
I think how I would solve this is to look at it as though it were a simple browser. Create a base Tag class, make it abstract with maybe an InnerHTML property and a virtual method PrintElement.
Next, create classes for each HTML tag that you care about and inherit from your base class. Judging from your example, the tags you care most about are h1, p, a, and hr. Implement the PrintElement method such that it returns a string that prints out the element properly based on the InnerHTML (such as the p class' PrintElement would return "\n[InnerHTML]\n").
Next, build a parser that will parse through your HTML and determine which object to create and then add those objects to a queue (a tree would be better, but doesn't look like it's necessary for your purposes).
Finally, go through your queue calling the PrintElement method for each element.
It may be more work than you had planned, but it's a far more robust solution than simply using regex, and should you change your mind in the future and want to show simple styling, it's just a matter of going back and modifying your PrintElement methods.
For info, stripping html with a regex is... full of subtle problems. The HTML Agility Pack may be more robust, but still suffers from the words bleeding together:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
string text = doc.DocumentNode.InnerText;
One way could be to strip html in three steps:
htmlStr = Regex.Replace(htmlStr, "<[^/>]*>", "");  // strip opening tags; leave closing tags (</...>) in place
htmlStr = Regex.Replace(htmlStr, "</p>", "\r\n");  // replace every paragraph end with a new line
htmlStr = Regex.Replace(htmlStr, "<[^>]*>", "");   // strip the remaining tags
Well, I need to close this even though I don't have the ideal solution. Since the HTML tags used in my app are very common ones (no tables, lists, etc.) with little or no nesting, what I did was preformat the HTML fragments before saving them after user input:
Remove all line breaks
Add a line break prefix to all block tags (e.g. div, p, hr, h1/2/3/4 etc)
Before I extract them out to be displayed as plain-text, use regex to remove the html tag and retain the line-break. Hardly any rocket science but works for me.
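The preformat-then-strip idea above can be sketched roughly as follows. This is an illustration in JavaScript rather than the C# actually used in the app; the function names htmlToText and synopsis are invented here, the list of block tags is an assumption about what the editor emits, and HTML entities are not decoded.

```javascript
// Convert simple, shallowly nested HTML to plain text, one line per block element.
// Assumption: only common block tags (p, div, h1-h6, li, hr, br) need a line break.
function htmlToText(html) {
  return html
    .replace(/\r?\n/g, '')                                         // drop line breaks from the source
    .replace(/<\/(?:p|div|h[1-6]|li)\s*>|<(?:hr|br)\b[^>]*>/gi, '\n') // block ends become line breaks
    .replace(/<[^>]*>/g, '')                                       // strip every remaining tag
    .replace(/\n{2,}/g, '\n')                                      // collapse empty lines
    .trim();
}

// Return at most maxLength characters of text without splitting a word.
function synopsis(html, maxLength) {
  const text = htmlToText(html);
  if (text.length <= maxLength) return text;
  // Take one extra character so a cut that lands exactly on a word end is kept.
  const cut = text.slice(0, maxLength + 1);
  const lastBreak = Math.max(cut.lastIndexOf(' '), cut.lastIndexOf('\n'));
  // Fall back to a hard cut only when the text contains no whitespace at all.
  return (lastBreak > 0 ? cut.slice(0, lastBreak) : cut.slice(0, maxLength)).trimEnd();
}
```

Calling synopsis(html, 100) on the example content keeps each block on its own line and stops at a word boundary. The exact cut point can differ from a count that treats line breaks as two characters (\r\n), since this sketch uses \n.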