I am trying to print the href of each link in an HTML document, but I am not able to do so.
require 'nokogiri'
require 'open-uri'

newurl = 'http://www.heroesfire.com/hots/guide/the-many-ways-of-abathur-1194'
# URI.open comes from open-uri; a bare open(url) no longer fetches URLs on Ruby 3+
buildpage = Nokogiri::HTML(URI.open(newurl))
#puts buildpage
thistext = buildpage.css("div#wrap div#site-content.self-clear div#guide.view-guide div.col-l div.tab-contents.box div.guide-tab div.chapter-text div.text table.bbcode_columns tbody tr td.bbcode_column a").each do |href|
  puts href['href']
end
I am expecting to see '/hots/wiki/talents/pressurized-glands'
I was able to get something similar to work earlier in my script, but I am having zero luck with this.
As a rule, the longer the node selector, the less likely it is to work correctly, especially if you're dealing with HTML you don't control.
Reduce it to way-points, places that help you drill down, instead of trying to define each step.
You're also relying on tbody in the selector. When we see that, the odds are good that it's not in the original HTML source but instead was injected by your browser. Selectors like that smell of using a browser and an inspector to locate a particular item in the page, but the resulting path won't work if the HTML doesn't actually contain tbody. Browsers do a lot of fix-up in an attempt to present something useful, including adding tags. So be careful when you see tbody and confirm it actually exists. In your case, it does, but the concern still exists when navigating through a document.
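A quick way to confirm whether tbody is really in the markup is to test the raw, unparsed source rather than what the browser's inspector shows. Below is a minimal, language-agnostic sketch, written in JavaScript. The helper sourceHasTbody and both sample strings are hypothetical. In Ruby the equivalent would be a regex match against the string you fetched.

```javascript
// Hypothetical helper: true if the raw (unparsed) markup literally contains a <tbody> tag.
function sourceHasTbody(rawHtml) {
  return /<tbody[\s>]/i.test(rawHtml);
}

// Made-up samples: the markup as served, and the same table after a browser's fix-up.
const asServed = "<table><tr><td><a href='/x'>x</a></td></tr></table>";
const asInspected = "<table><tbody><tr><td><a href='/x'>x</a></td></tr></tbody></table>";

console.log(sourceHasTbody(asServed));    // false
console.log(sourceHasTbody(asInspected)); // true
```

If the check comes back false for the source you downloaded, drop tbody from the selector.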
A simple example of simplifying the path is:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<div id="foo">
<div id="bar">
<p>text1</p>
</div>
<div id="baz">
<p>text2</p>
</div>
</div>
</body>
</html>
EOT
doc.at('body div#foo div#bar p').text # => "text1"
That can be written more simply, while still accomplishing the same thing, using:
doc.at('#bar p').text # => "text1"
or perhaps one of these:
doc.at('#foo div p').text # => "text1"
doc.search('#foo div p').first.text # => "text1"
All scraping requires at least some advance knowledge of the target page's structure, so, while you're nosing around, take note of the important layout tags. id attributes are especially useful, followed by class attributes and/or unique patterns of tags not replicated elsewhere in the document. Those make it easy to reduce the selector. Sometimes we have to step into the document incrementally, like I did using first, or with one of the "sibling" methods after locating a particular node, but a long selector is rarely needed.
I'm having an issue where an unordered list created by data-sly-list is adding whitespace that isn't represented in the DOM or by any class. If I manually code the list rather than letting data-sly-list handle it, the whitespace isn't added.
<div class="bullets">
<ul class="columns unordered-list" id="stateList">
<div data-sly-unwrap data-sly-list.slidesNode="${resource.listChildren}">
<div data-sly-unwrap data-sly-list.states="${slidesNode.listChildren}">
<li data-sly-test="${states.valueMap.flag}">
<sly data-sly-use.htmlpaths="${'htmlpaths.js' @ thePath=states.valueMap.path}" data-sly-unwrap>
${states.valueMap.name}
</sly>
</li>
</div>
</div>
</ul>
</div>
If I hardcode the list like the following, there's no whitespace
<div class="bullets">
<ul class="columns unordered-list" id="stateList">
<li>Accessibility
</li>
<li>Accessibility
</li>
<li>Accessibility
</li>
<li>Accessibility
</li>
</ul>
</div>
There's also a htmlpaths.js involved:
"use strict";
use(function() {
    var path = this.thePath;
    var httpRegex = /http/;
    var hashRegex = /#/;
    if (path !== undefined && (httpRegex.test(path) === false && hashRegex.test(path) === false)) {
        path = path + '.html';
    }
    return {
        href: path
    };
});
The only difference I see is that it's run through Sightly iteration. Is there any fix for this? In addition to listing the items, I'm trying to break them into columns with the following CSS:
li {
width:25%;
float:left;
display:inline;
}
This works perfectly fine on the hardcoded list, but on the Sightly-iterated one it creates all kinds of weird spacing issues that change based on screen width.
This whitespace isn't accounted for at all in the DOM. I'm not sure what to do.
More weirdness:
If the margin top is set to -9 or higher, it looks like the above screenshot. But if it's set to -10 or lower, it looks like this
It's like it's a breakpoint: it goes from one extreme to the other on that one pixel change. No change otherwise. It's bizarre.
This is slightly odd behavior in Sightly: when you have extra whitespace in your template markup, it is carried through into the generated HTML.
Try removing all the whitespace from the HTML, as shown below.
<div class="bullets"><ul class="columns unordered-list" id="stateList"><sly data-sly-list.slidesNode="${resource.listChildren}"><sly data-sly-list.states="${slidesNode.listChildren}"><li>${states.valueMap.name}</li></sly></sly></ul></div>
You can use the HTML formatter in your IDE, or an online tool such as the one below, to make the HTML readable again:
https://www.freeformatter.com/html-formatter.html
<div class="bullets">
<ul class="columns unordered-list" id="stateList">
<sly data-sly-list.slidesNode="${resource.listChildren}">
<sly data-sly-list.states="${slidesNode.listChildren}">
<li>${states.valueMap.name}</li>
</sly>
</sly>
</ul>
</div>
This should get rid of the extra spaces in your HTML.
Also, it is best to put Sightly conditions on sly tags, or directly on the actual div or other HTML tags, instead of using data-sly-unwrap.
You can also use Sling Models to fetch the required data and evaluate all the conditions (including appending html) in the back end, so that Sightly only has to display the result.
Using data-sly-unwrap or a sly tag still adds an empty line in the generated HTML. Even though most browsers ignore those spaces, they might cause issues in some cases. If you want the HTL output to look similar to your hardcoded HTML, try placing the use statement and anchor tag in a single line as shown below.
<div class="bullets">
<ul class="columns unordered-list" id="stateList" data-sly-list.slidesNode="${resource.listChildren}">
<li data-sly-repeat.states="${slidesNode.listChildren}" data-sly-test="${states.valueMap.flag}"><sly data-sly-use.htmlpaths="${'htmlpaths.js' @ thePath=states.valueMap.path}">${states.valueMap.name} </sly></li>
</ul>
</div>
Also, a few tips:
The sly tag doesn't need a data-sly-unwrap. It is automatically removed in the generated HTML.
data-sly-list can be added to the parent ul tag itself instead of introducing an extra div tag and then unwrapping it.
Use data-sly-repeat instead of data-sly-list wherever possible. I was able to bring down the generated HTML of one of our complex pages from 20k lines to 12k lines, as data-sly-repeat doesn't introduce additional white spaces.
Solution
The issue is on line 7 of your HTL template:
${states.valueMap.name}
You have a space at the end of the inner HTML of your tag ;)
Unrelated
Regarding your htmlpaths.js script, are you aware of Transformers in AEM? You can use them to implement a global Link Rewriter which will fix links when a page is rendered, much like your script does. You can see an example here: https://helpx.adobe.com/experience-manager/using/aem63_link_rewriter.html
If you decide to keep htmlpaths.js, you may want to review it because I'm afraid there might be some problems with it. Of course, I don't know your requirement so it's just a suggestion :)
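The answer above does not say which problems it has in mind. One possibility (an assumption, not something stated in the thread) is that /http/ matches anywhere in the path, so an internal path such as /content/http-guide never gets its .html suffix. Below is a sketch of a stricter check, with the logic pulled into a plain function so it can be tried outside AEM. The name toHref is hypothetical.

```javascript
// Hypothetical stricter variant of the path logic in htmlpaths.js.
// Assumption: only a leading http:// or https:// marks an external link.
function toHref(path) {
  if (path === undefined || path === null) {
    return path;
  }
  const isExternal = /^https?:\/\//i.test(path); // scheme must be at the start
  const hasFragment = path.indexOf("#") !== -1;  // same rule as the original script
  const hasExtension = /\.html$/i.test(path);    // avoid producing ".html.html"
  if (!isExternal && !hasFragment && !hasExtension) {
    return path + ".html";
  }
  return path;
}

console.log(toHref("/content/http-guide"));      // /content/http-guide.html
console.log(toHref("http://example.com/page"));  // http://example.com/page
```

Inside the use() callback the returned object would then be { href: toHref(this.thePath) }.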
I understand from reading similar posts that the <section> tag in html is meant for semantic and organizational purposes. I was wondering, however, why using the <div> tag with a class attribute wouldn't have a similar effect.
(e.g. <div class = "SectionOne">)
Given these two methods, I could refer to each of them in CSS by using their respective names:
Section
{
color: white;
}
or
.SectionOne
{
color: white;
}
Personally, I think the second method allows for greater versatility in webpage design and I don't see many advantages to the new HTML5 feature. Would anyone care to clear this up for me?
section is usually used for article-like content, whereas div is meant to group various block elements so they can be styled together. The main difference is just semantics.
See https://www.thoughtco.com/difference-between-div-and-section-3468001 for details.
Let me know if you require any further help
The <section> tag defines sections in a document, such as chapters, headers, footers, or any other sections of the document.
Whereas: The <div> tag defines a division or a section in an HTML document. The tag is used to group block-elements to format them with CSS.
Maybe you mean section and not Section. Anyway, semantics is one thing and selectors are another. In CSS it is better to select by class than by tag, because you gain a lot in versatility, so you are right from that point of view. Semantics is a different matter: it is not conferred by a class. Even if you give a div a "section" class, you are not giving it any semantic meaning.
<div> is simply a generic block-level element which predates the later, semantically-named, document-related elements which arrived with HTML5, such as:
<header>
<nav>
<main>
<section>
<aside>
<footer>
When dividing up a document into its anatomical parts, you could still use:
<div class="header">
<div class="section">
etc.
But... you don't need to anymore.
Of course, even if you still use all of the above in your document you might still want to add other block-level elements and when you do... <div> is general purpose.
This is a question regarding Angular 2 selectors, Custom tags vs. Custom attributes, SEO and browser rendering.
When I first started to look over Angular 2, the very first thing I did when following their quickstart, right off the bat, was to change my selector to '[my-component]' (attribute selector) instead of 'my-component' (tag selector), so I could have <div my-component></div> in my html instead of <my-component></my-component>, which isn't valid html. So I would write html according to standards. Well, at least pretty close to standards (because my-component isn't a valid html attribute, but I could live with only that html validation error).
Then, at some point in a video on youtube, someone from the angular team mentioned that we should use the tag selector, performance wise at least.
Alright I said, screw html validation... or shouldn't I?
So:
Say I ignore the W3C screaming about my html being completely invalid because of the <custom-tags>. I actually have another bigger and more real concern: how does this impact SEO?
I mean don't just think client-side app, because in the real world (and for my angular 2 project as well) I also have server-side rendering, for 2 very important reasons: SEO and Fast initial rendering of the site to the user for that initial view, before the app bootstraps. You can not have a very high traffic SPA otherwise.
Sure, google will crawl my site, regardless of the tags I use, but will it rank it the same in both scenarios: one with <custom-make-believe-tags> and the other with only standard html tags?
Let's talk browsers and css:
As I started to build my first SPA site in Angular 2, I was immediately faced with another concern:
Say (in a non SPA site) I have the following html markup:
<header>
<a class="logo">
...
</a>
<div class="widgets">
<form class="frm-quicksearch"> ... </form>
<div class="dropdown">
<!-- a user dropdown menu here -->
</div>
</div>
</header>
<div class="video-listing">
<div class="video-item"> ... </div>
<div class="video-item"> ... </div>
...
</div>
Angular 2 wise I would have the following component tree:
<header-component>
<logo-component></logo-component>
<widgets-component>
<quicksearch-component></quicksearch-component>
<dropdown-component></dropdown-component>
</widgets-component>
</header-component>
<video-listing-component>
<video-item-component></video-item-component>
...
</video-listing-component>
Now, I have 2 options. Let's just take the <video-listing-component> for example, to keep this simple... I either
A) place the entire standard html tags which I already have (<div class="video-item"></div>) within the <video-item-component> tag, and once rendered will result in this:
<video-listing-component>
<div class="video-listing">
<video-item-component>
<div class="video-item">...</div>
</video-item-component>
...
...
</div>
</video-listing-component>
OR:
B) Only put the content of <div class="video-item"> directly into my <video-item-component> component and adding the required class (class="video-item") for styling on the component tag, resulting in something like this:
<video-listing-component class="video-listing">
<video-item-component class="video-item"></video-item-component>
<video-item-component class="video-item"></video-item-component>
...
</video-listing-component>
Either way (A or B), the browser renders everything just fine.
BUT if you take a closer look (after everything is rendered in the dom, of course), by default the custom tags don't occupy any space in the dom. They're 0px by 0px. Only their content occupies space. I don't get it how come the browser still renders everything as you would want to see it, I mean in the first case (A):
While having float: left; width: 25%; on the div class="video-item", but each of these divs being within a <video-item-component> tag, which doesn't have any styling... Isn't it just a fortunate side-effect that the browser renders everything as you'd expect? With all the <div class="video-item"> floating next to each other, even though each of them is within another tag, the <video-item-component>, which does NOT have float: left? I've tested on IE10+, Firefox, Chrome, all fine. Is it just fortunate, or is there a solid explanation for this, so that we can safely rely on this kind of markup being rendered as we'd expect by all (or at least most) browsers?
Second case (B):
If we use classes and styling directly on the custom tags (<video-item-component>)... again, everything shows up fine. But as far as I know, we shouldn't style custom components, right? Isn't this also just a fortunate expected outcome? Or is this fine also? I don't know, maybe I'm still living in 2009... am I?
Which of these 2 approaches (A or B) would be the recommended one? Or are both just fine?
I have no idea!!
EDIT:
D'oh, thanks Günter Zöchbauer. Yeah, since my divs have float: left, that's why the (custom or not) tag they're wrapped in doesn't expand its height. Seems I've forgotten how css works since I started to look over Angular 2 :)
But one thing still remains:
If I set a percentage width on a block element (call it E), I would assume it takes x% of its immediate parent. If I set float: left, I would expect floating within the immediate parent. In my A case, since the immediate parent is a custom tag with no display type and no width, I would expect things to break somehow, but still... my E elements behave as if their parent isn't the custom tag they're each wrapped in, but the next one up in the dom (which is <div class="video-listing"> in my case). And they occupy x% of that and they float within that. I don't expect this to be normal, I would think this is just a fortunate effect, and I'm afraid that one day, after some browser update... I'll wake up to find all my Angular 2 sites looking completely broken.
So... are both A and B an equally proper approach? Or am I doing it wrong in case A?
EDIT2:
Let's simplify things a bit. As I got part of my question answered, let's take another example of generated html (simplified a bit, with inlined css):
<footer>
<angular-component-left>
<div style="float: left; width: 50%;">
DIV CONTENT
</div>
</angular-component-left>
<angular-component-right>
<div style="float: left; width: 50%;">
DIV CONTENT
</div>
</angular-component-right>
</footer>
In the original, not yet implemented html (without <angular-component-...>), those divs should float left and each occupy 50% of the <footer>. Surprisingly, once they're wrapped in the <angular-component-...> custom tags, they do the same: occupy 50% of the footer. But this just seems like good fortune to me, dumb luck... Unintended effect.
So, is it or isn't it "dumb luck"?
Should I leave it like that, or rewrite so instead of the above code, I would have something like this:
<footer>
<angular-component-left style="display: block; float: left; width: 50%;">
DIV CONTENT
</angular-component-left>
<angular-component-right style="display: block; float: left; width: 50%;">
DIV CONTENT
</angular-component-right>
</footer>
Note that the inline styling is introduced here for simplicity, I would actually have a class instead which would be in an external css file included in the <head> of my document, not through style or styleUrls from my angular components.
The issue is your HTML validator. The - in the element name is required for elements to be treated as custom elements and it is valid HTML5. Angular doesn't require - in element names but it's good practice.
Check for example https://www.w3.org/TR/custom-elements/#registering-custom-elements (search for x-foo) or https://w3c.github.io/webcomponents/spec/custom/#custom-elements-custom-tag-example. I'm sure this dash rule is specified somewhere but wasn't able to find the spec. It is for example required in Polymer that depends on elements being proper custom elements while this doesn't matter much in Angular. The only difference as far as I know is that when you query the element, you get a HTMLUnknownElement when the - is missing in the name and a HTMLElement when it contains a -.
See also this question I asked a few years ago: "Why does Angular not need a dash in component name".
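The dash rule can be approximated in a few lines. This is a simplified sketch, not the full grammar from the Custom Elements specification: it accepts ASCII names only, whereas the specification also allows a range of non-ASCII characters. The helper looksLikeCustomElementName is hypothetical.

```javascript
// Names that contain a hyphen but are reserved by the specification.
const RESERVED_NAMES = [
  "annotation-xml", "color-profile", "font-face", "font-face-src",
  "font-face-uri", "font-face-format", "font-face-name", "missing-glyph"
];

// Simplified check: starts with a lowercase ASCII letter, contains at least
// one hyphen, has no uppercase letters, and is not a reserved name.
function looksLikeCustomElementName(name) {
  const shape = /^[a-z][a-z0-9._-]*-[a-z0-9._-]*$/;
  return shape.test(name) && RESERVED_NAMES.indexOf(name) === -1;
}

console.log(looksLikeCustomElementName("my-component")); // true
console.log(looksLikeCustomElementName("mycomponent"));  // false
```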
"BUT if you take a closer look, by default the custom tags don't occupy any space in the dom. They're 0px by 0px. Only their content occupies space. I just don't get it how come the browser still renders everything as you would want to see it"
I'm not sure I understand this question. When Angular processes the template it adds the content dynamically. When you see the content in the browser, then it's also available in the DOM and has actual dimensions.
Search engine crawlers are able to process pages that are generated by JavaScript. If this isn't enough, server-side rendered pages can provide static HTML to crawlers that contain the whole view.
I have some HTML stored in a database.
I don't know whether that stored HTML contains extra closing div tags (</div>) or not.
I want to find the extra closing div tags in the HTML string.
I have tried to do this with the HTML Agility Pack but could not find a way to achieve it.
Example:
<div class="readers">
A total of 218 users are reading this article.
</div>
</div>
</div>
How can I find these two extra closing div tags and extract fully valid HTML?
Use this pure javascript parser before rendering the html: http://ejohn.org/blog/pure-javascript-html-parser/
You can check out by pasting your code here,
http://ejohn.org/apps/htmlparser/
it removes the extra </div>s.
You just need to pass your html to the HTMLtoXML function as:
HTMLtoXML(your_html);
and it will remove the extra closing tags. In fact, what it does is convert the input into XML format, but since you are dealing with HTML strings and all the tags are expected to be valid HTML, it is safe to use here.
EDIT: You can easily call javascript functions from a C# file. See this question for more details.
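If you only need to detect and drop unmatched </div> tags, without pulling in a full parser, a depth counter may be enough for simple fragments. Below is a minimal sketch, assuming the markup has no div-like text inside comments, scripts, or attribute values. stripExtraClosingDivs is a hypothetical name and is not part of the parser linked above.

```javascript
// Walk every <div ...> and </div> in order, tracking how many divs are open.
// A closing tag seen while nothing is open is "extra" and is removed.
function stripExtraClosingDivs(html) {
  let depth = 0;
  let extraClosing = 0;
  const cleaned = html.replace(/<\/?div\b[^>]*>/gi, function (tag) {
    if (tag.charAt(1) !== "/") { // opening tag
      depth += 1;
      return tag;
    }
    if (depth > 0) {             // closing tag that matches an open div
      depth -= 1;
      return tag;
    }
    extraClosing += 1;           // closing tag with nothing left to close
    return "";
  });
  // depth > 0 at the end means some divs were never closed.
  return { html: cleaned, extraClosing: extraClosing, unclosed: depth };
}

const fragment = '<div class="readers">\nA total of 218 users are reading this article.\n</div>\n</div>\n</div>';
const report = stripExtraClosingDivs(fragment);
console.log(report.extraClosing); // 2
console.log(report.unclosed);     // 0
```

The same loop should translate to C# with Regex.Replace and a MatchEvaluator, though that port is untested here.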
Click here to find both unclosed (hanging) as well as extra div tags: tormus
I have HTML content entered by the user via a rich-text editor, so it can be almost anything (except things that don't belong inside the body tag; no worries about "head", doctype, etc.).
An example of this content:
<h1>Header 1</h1>
<p>Some text here</p><p>Some more text here</p>
<div align=right>A link here</div><hr />
<h1>Header 2</h1>
<p>Some text here</p><p>Some more text here</p>
<div align=right>A link here</div><hr />
The trick is, I need to extract first 100 characters of the text only (HTML tags stripped). I also need to retain the line breaks and not break any word.
So the output for the above will be something like:
Header 1
Some text here
Some more text here
A link here
Header 2
Some text here
Some
It has 98 characters and line breaks are retained. What I can achieve so far is to strip all the HTML tags using Regex:
Regex.Replace(htmlStr, "<[^>]*>", "")
Then trim the length using Regex as well with:
Regex.Match(textStr, @"^.{1,100}\b").Value
My problem is how to retain the line breaks. I get output like:
Header 1
Some text hereSome more text here
A link here
Header 2
Some text hereSome more text
Notice the joining sentences? Perhaps someone can show me some other ways of solving this problem. Thanks!
Additional Info: My purpose is to generate a plain-text synopsis from a bunch of HTML content. I guess this will help clarify the problem.
I think how I would solve this is to look at it as though it were a simple browser. Create a base Tag class, make it abstract with maybe an InnerHTML property and a virtual method PrintElement.
Next, create classes for each HTML tag that you care about and inherit from your base class. Judging from your example, the tags you care most about are h1, p, a, and hr. Implement the PrintElement method such that it returns a string that prints out the element properly based on the InnerHTML (such as the p class' PrintElement would return "\n[InnerHTML]\n").
Next, build a parser that will parse through your HTML and determine which object to create and then add those objects to a queue (a tree would be better, but doesn't look like it's necessary for your purposes).
Finally, go through your queue calling the PrintElement method for each element.
It may be more work than you had planned, but it's a far more robust solution than simply using regex, and should you decide to change your mind in the future and want to show simple styling, it's just a matter of going back and modifying your PrintElement methods.
For info, stripping html with a regex is... full of subtle problems. The HTML Agility Pack may be more robust, but still suffers from the words bleeding together:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
string text = doc.DocumentNode.InnerText;
One way could be to strip html in three steps:
htmlStr = Regex.Replace(htmlStr, "<[^/>]*>", ""); // strip opening tags, but not closing tags </...>
htmlStr = Regex.Replace(htmlStr, "</p>", "\r\n"); // replace every paragraph end with a new line
htmlStr = Regex.Replace(htmlStr, "<[^>]*>", ""); // strip the remaining tags
Well, I need to close this, though without an ideal solution. Since the HTML tags used in my app are very common ones (no tables, lists, etc.) with little or no nesting, what I did is preformat the HTML fragments before I save them after user input:
Remove all line breaks
Add a line break prefix to all block tags (e.g. div, p, hr, h1/2/3/4 etc)
Before I extract them to be displayed as plain text, I use a regex to remove the HTML tags and retain the line breaks. Hardly rocket science, but it works for me.
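The steps above can be sketched roughly as follows, done in one pass at display time rather than at save time. This is an illustration in JavaScript, not the code actually used in the app. The block-tag list, the function name htmlToSynopsis, and the choice to collapse blank lines are all assumptions. HTML entities are not decoded. The patterns should carry over to .NET's Regex.Replace largely unchanged.

```javascript
// Tags whose boundaries become line breaks (assumed list; extend as needed).
const BLOCK_TAGS = "div|p|hr|br|h[1-6]";

function htmlToSynopsis(html, maxLength) {
  const blockTag = new RegExp("<\\/?(?:" + BLOCK_TAGS + ")\\b[^>]*>", "gi");
  const text = html
    .replace(/\r?\n/g, " ")    // 1. drop source line breaks (a space keeps words apart)
    .replace(blockTag, "\n")   // 2. turn every block tag into a line break
    .replace(/<[^>]*>/g, "")   // 3. strip the remaining (inline) tags
    .replace(/[ \t]+/g, " ")   // collapse runs of spaces
    .replace(/ ?\n ?/g, "\n")  // trim spaces around line breaks
    .replace(/\n{2,}/g, "\n")  // collapse blank lines
    .trim();
  if (text.length <= maxLength) {
    return text;
  }
  // Cut at the limit; if that lands inside a word, back up to the word's start.
  let cut = text.slice(0, maxLength);
  if (/\S/.test(text.charAt(maxLength))) {
    cut = cut.replace(/\S+$/, "");
  }
  return cut.trim();
}

const richText = [
  "<h1>Header 1</h1>",
  "<p>Some text here</p><p>Some more text here</p>",
  "<div align=right>A link here</div><hr />"
].join("\n");
console.log(htmlToSynopsis(richText, 100));
```

Because the cut backs up to the previous whitespace, the result can be shorter than the limit, and a single word longer than the limit yields an empty string.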