I have some html stored in database.
I dont know that html stored in databse has extra closing div like </div> or not.
I want to find extra closing div in html string.
I have tried to find using HTML Agility pack but not find the way to achieve this.
Example:
<div class="readers">
A total of 218 users are reading this article.
</div>
</div>
</div>
How can i find these two extra closing div and extract fully valid html.
Use this pure javascript parser before rendering the html: http://ejohn.org/blog/pure-javascript-html-parser/
You can check out by pasting your code here,
http://ejohn.org/apps/htmlparser/
it removes the extra </div>s.
You just need to pass your html to the HTMLtoXML function as:
HTMLtoXML(your_html);
and it would remove the extra closing tags. Infact what it does is that it converts it into xml format, but since you are dealing with html strigs & all tags are expected to be valid in html, you can be safe to use this.
EDIT: You can easily call javascript functions from a C# file. See this question for more details.
Click here to find both unclosed (hanging) as well as extra div tags: tormus
Related
I'm having an issue where an unordered list created by data-sly-list is adding whitespace that isn't represented in the DOM or by any class. If I manually code the list rather than letting data-sly-list handle it, the whitespace isn't added.
<div class="bullets">
<ul class="columns unordered-list" id="stateList">
<div data-sly-unwrap data-sly-list.slidesNode="${resource.listChildren}">
<div data-sly-unwrap data-sly-list.states="${slidesNode.listChildren}">
<li data-sly-test="${states.valueMap.flag}">
<sly data-sly-use.htmlpaths="${'htmlpaths.js' # thePath=states.valueMap.path}" data-sly-unwrap>
${states.valueMap.name}
</sly>
</li>
</div>
</div>
</ul>
</div>
If I hardcode the list like the following, there's no whitespace
<div class="bullets">
<ul class="columns unordered-list" id="stateList">
<li>Accessibility
</li>
<li>Accessibility
</li>
<li>Accessibility
</li>
<li>Accessibility
</li>
</ul>
</div>
There's also a htmlpaths.js involved:
"use strict";
use(function() {
var path = this.thePath;
var httpRegex = /http/;
var hashRegex = /#/;
if (path !== undefined && (httpRegex.test(path) === false && hashRegex.test(path) === false)){
path = path + '.html';
}
return {
href: path
}
});
The only difference I see is that its run through Sightly iterating. Is there any fix to this? In addition to listing I'm trying to break them into columns with the following CSS
li {
width:25%;
float:left;
display:inline;
}
This works perfectly fine on the hardcoded list, but on the Sightly iterated one it creates all kind of weird spacing issues that change based on screen width
This whitespace isn't accounted for at all in the DOM. I'm not sure what to do.
More weirdness:
If the margin top is set to -9 or higher, it looks like the above screenshot. But if its set to -10 or lower, it looks like this
It's like its a breakpoint, it goes from one extreme to the other on that one pixel change. No change otherwise. It's bizarre.
It's a little weird behavior in sightly, when you have some extra spaces in your HTML code, it will display with extra spaces in the HTML.
Try to remove all the spaces in the HTML as shown below and try it.
<div class="bullets"><ul class="columns unordered-list" id="stateList"><sly data-sly-list.slidesNode="${resource.listChildren}"><sly data-sly-list.states="${slidesNode.listChildren}"><li>${states.valueMap.name}</li></sly></sly></ul></div>
You can use HTML formatter in your IDE or online tools like below to format the HTML for a readable format
https://www.freeformatter.com/html-formatter.html.
<div class="bullets">
<ul class="columns unordered-list" id="stateList">
<sly data-sly-list.slidesNode="${resource.listChildren}">
<sly data-sly-list.states="${slidesNode.listChildren}">
<li>${states.valueMap.name}</li>
</sly>
</sly>
</ul>
</div>
This should get rid of the extra spaces in your HTML.
Also, it is best to use sightly tags wherever we need some conditions to check or embed them directly in the actual div tag or html tags instead of using data-sly-unwrap.
You can also use sling models to get the required data and check all the conditions(including appending html) in the backend and send the data just to display and avoid all the conditions in sightly.
Using data-sly-unwrap or a sly tag still adds an empty line in the generated HTML. Even though most browsers ignore those spaces, they might cause issues in some cases. If you want the HTL output to look similar to your hardcoded HTML, try placing the use statement and anchor tag in a single line as shown below.
<div class="bullets">
<ul class="columns unordered-list" id="stateList" data-sly-list.slidesNode="${resource.listChildren}">
<li data-sly-repeat.states="${slidesNode.listChildren}" data-sly-test="${states.valueMap.flag}"><sly data-sly-use.htmlpaths="${'htmlpaths.js' # thePath=states.valueMap.path}">${states.valueMap.name} </sly></li>
</ul>
</div>
Also, a few tips
The sly tag doesn't need a data-sly-unwrap. It is automatically
removed in the generated HTML.
data-sly-list can be added to the parent ul tag itself instead of introducing an extra div tag and then unwrapping it.
Use data-sly-repeat instead of data-sly-list wherever possible. I was able to bring down the generated HTML of one of our complex pages from 20k lines to 12k lines, as data-sly-repeat doesn't introduce additional white spaces.
Solution
The issue is on line 7 of your HTL template:
${states.valueMap.name}
You have a space at the end of the inner HTML of your tag ;)
Unrelated
Regarding your htmlpaths.js script, are you aware of Transformers in AEM? You can use them to implement a global Link Rewriter which will fix links when a page is rendered, much like your script does. You can see an example here: https://helpx.adobe.com/experience-manager/using/aem63_link_rewriter.html
If you decide to keep htmlpaths.js, you may want to review it because I'm afraid there might be some problems with it. Of course, I don't know your requirement so it's just a suggestion :)
I am trying to print the href of a html doc, however I am not able to do so.
newurl = 'http://www.heroesfire.com/hots/guide/the-many-ways-of-abathur-1194'
buildpage = Nokogiri::HTML(open(newurl))
#puts buildpage
thistext = buildpage.css("div#wrap div#site-content.self-clear div#guide.view-guide div.col-l div.tab-contents.box div.guide-tab div.chapter-text div.text table.bbcode_columns tbody tr td.bbcode_column a").each do |href|
puts href['href']
end
I am expecting to see '/hots/wiki/talents/pressurized-glands'
I was able to get something similar to work earlier in my script, but I am having zero luck with this.
Invariably, the longer the Node selector, the less likely it will work correctly, especially if you're dealing with HTML you don't control.
Reduce it to find way-points, places that help you drill down instead of trying to define each step.
You're also relying on tbody in the selector. When we see that, the odds are good that it's not in the original HTML source but instead was injected by your browser. Selectors like that smell of using a browser and an inspector to locate a particular item in the page, but the resulting path won't work if the HTML doesn't actually contain tbody. Browsers do a lot of fix-up in an attempt to present something useful, including adding tags. So be careful when you see tbody and confirm it actually exists. In your case, it does, but the concern still exists when navigating through a document.
A simple example of simplifying the path is:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<div id="foo">
<div id="bar">
<p>text1</p>
</div>
<div id="baz">
<p>text2</p>
</div>
</div>
</body>
</html>
EOT
doc.at('body div#foo div#bar p').text # => "text1"
Can be written more easily, while still accomplishing the same thing, using:
doc.at('#bar p').text # => "text1"
or perhaps one of these:
doc.at('#foo div p').text # => "text1"
doc.search('#foo div p').first.text # => "text1"
All scraping requires at least some advance knowledge of the target page's structure, so, while you're nosing around, take note of the important layout tags. id parameters are especially useful, followed by class and/or unique patterns of tags not replicated elsewhere in the document. Those make it easy to reduce the selector. Sometimes we have to step into the document incrementally like I did using first or one of the "sibling" methods after locating a particular node, but using a long selector rarely is needed.
Instead of
<p>Hello <img src="helloworld.jpg"> World</p>
I would like to have:
<p>Hello</p><img src="helloworld.jpg"><p>World</p>
<p> has a padding of 40px and I would like the images to use all space available.
You can turn off paragraphs for all content (link), but as far as I know you can't turn it off for certain elements only.
What you can do is modify the HTML after retrieving it from the database but before outputting it. You haven't specified your server-side language, for C# I've found that CsQuery is great.
Let's say I typed in a <textarea>:
I
Love
You
Then I save it to phpMyAdmin database. Then I use MySQL to retrieve it from database and display it onto a <div>. Now the output that show in the <div> is :
I Love You
How do I make the <div> show exactly like the database field and textarea which has multiple rows?
If the text is stored exactly as it was entered into the <textarea> (i.e. it still contains the newline; check this by viewing the source of your page), you can use the CSS property white-space: pre on your <div>.
See this jsFiddle example; note how the content of both <div> tags are the same, but produce different output, due to the use of the white-space property in the second <div>.
Replace line returns with <br> tags.
If the textarea supports use of HTML tags (similar to the Stack Overflow answer box that I'm typing in now), then you could type:
I <br>
love <br>
you
and it would render in the way you want it to.
I have an HTML content which is entered by user via a richtext editor so it can be almost anything (less those not supposed to be outside the body tag, no worries about "head" or doctype etc).
An example of this content:
<h1>Header 1</h1>
<p>Some text here</p><p>Some more text here</p>
<div align=right>A link here</div><hr />
<h1>Header 2</h1>
<p>Some text here</p><p>Some more text here</p>
<div align=right>A link here</div><hr />
The trick is, I need to extract first 100 characters of the text only (HTML tags stripped). I also need to retain the line breaks and not break any word.
So the output for the above will be something like:
Header 1
Some text here
Some more text here
A link here
Header 2
Some text here
Some
It has 98 characters and line breaks are retained. What I can achieve so far is to strip the all HTML tags using Regex:
Regex.Replace(htmlStr, "<[^>]*>", "")
Then trim the length using Regex as well with:
Regex.Match(textStr, #"^.{1,100}\b").Value
My problem is, how to retaining the line break?. I get an output like:
Header 1
Some text hereSome more text here
A link here
Header 2
Some text hereSome more text
Notice the joining sentences? Perhaps someone can show me some other ways of solving this problem. Thanks!
Additional Info: My purpose is to generate plain text synopsis from a bunch of HTML content. Guess this will help clarify the this problem.
I think how I would solve this is to look at it as though it were a simple browser. Create a base Tag class, make it abstract with maybe an InnerHTML property and a virtual method PrintElement.
Next, create classes for each HTML tag that you care about and inherit from your base class. Judging from your example, the tags you care most about are h1, p, a, and hr. Implement the PrintElement method such that it returns a string that prints out the element properly based on the InnerHTML (such as the p class' PrintElement would return "\n[InnerHTML]\n").
Next, build a parser that will parse through your HTML and determine which object to create and then add those objects to a queue (a tree would be better, but doesn't look like it's necessary for your purposes).
Finally, go through your queue calling the PrintElement method for each element.
May be more work than you had planned, but it's a far more robust solution than simply using regex and should you decided to change your mind in the future and want to show simple styling it's just a matter of going back and modifying your PrintElement methods.
For info, stripping html with a regex is... full of subtle problems. The HTML Agility Pack may be more robust, but still suffers from the words bleeding together:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
string text = doc.DocumentNode.InnerText;
One way could be to strip html in three steps:
Regex.Replace(htmlStr, "<[^/>]*>", "") // don't strip </.*>
Regex.Replace(htmlStr, "</p>", "\r\n") // all paragraph ends are replaced w/ new line
Regex.Replace(htmlStr, "<[^>]*>", "") // replace remaining </.*>
Well, I need to close this though not having the ideal solution. Since the HTML tags used in my app are very common ones (no tables, list etc) with little or no nesting, what I did is to preformat the HTML fragments before I save them after user input.
Remove all line breaks
Add a line break prefix to all block tags (e.g. div, p, hr, h1/2/3/4 etc)
Before I extract them out to be displayed as plain-text, use regex to remove the html tag and retain the line-break. Hardly any rocket science but works for me.