When using apoc.load.html, Is it possible to return the full HTML rather than only text? - css

Lets say I want to scrape the Neo4j RefCard found at: https://neo4j.com/docs/cypher-refcard/current/
And I would like to fetch a 'code' example along with its styling. Here's my target. Notice that it has CSS treatment (font, color...):
...so in Neo4j I call the apoc.load.html procedure as shown here, and you can see it's no problem finding the content:
It returns a map with three keys: tagName, attributes, and text.
The text is the issue for me. It's stripped of all styling. I would like for it to let me know more about the styling of the different parts of this text.
The actual HTML in the webpage looks like following image with all of these span class tags: cm-string, cm-node, cm-atom, etc. Note that this was not generated by Neo4j's apoc.load.html procedure. It came straight from my Chrome browser's inspect console.
I don't need the actual fonts and colors, just the tag names.
I can seen in the documentation that there is an optional config map you can supply, but there's no explanation for what can be configured there. It would be lovely if I could configure it to return, say, HTML rather than text.
The library that Neo4j uses for CSS selection here is jsoup.
So I am hoping to not strip the <span> tags, or otherwise, extract their class names for each segment of text.

Could you not generate the HTML yourself from the properties in your object? It looks they are all span tags with 3 different classes depending on whether your using the property name, property value, or property delimiter?
That is probably how they are generating the HTML themselves.

Okay, two years later I revisited this question I posted, and did find a solution. I'll keep it short.
The APOC procedure CALL apoc.load.html is using the scraping library Jsoup, which is not a full-fledged browser. When it visits a page it reads the html sent by the server but ignores any javascript. As a result, if a page uses javascript for inserting content or even just formatting the content, then Jsoup will miss the html that the javascript would have generated had it run.
So I have just tried out the service at prerender.com. It's simple to use. You send it a URL, it takes your url as an argument and fetches that page itself and executes the page's javascript as it does. It returns the final result as static HTML.
So if I just call prerender.com with apoc.load.html then the Jsoup library will simply ask for the html and this time it will get the fully rendered html. :)
You can try the following two queries and see the difference pre-rendering makes. The span tags in this page are rendered only by javascript. So if we call it asking for its span tags without pre-rendering we get nothing returned.
CALL apoc.load.html("https://neo4j.com/docs/cypher-refcard/current/", {target:".listingblock pre:contains(age: 38) span"}) YIELD value
UNWIND value.target AS spantags
RETURN spantags
...but if we call it via the prender.com website, you will get a bunch of span tags and their content.
CALL apoc.load.html("https://service.prerender.cloud/https://neo4j.com/docs/cypher-refcard/current/", {target:".listingblock pre:contains(age: 38) span"}) YIELD value
UNWIND value.target AS spantags
RETURN spantags

Related

Web Component that can request another HTML URL and inject it into it's shadow DOM

I spent some time today with Lit trying to make a simple WebComponent that makes a HTTP GET to a URI, which returns a fully formed HTML document, and I want to inject said HTML document into the WebComponent's shadow DOM; basically this WebComponent acts as a simple proxy for embedding an externally hosted (but trusted) web snippet on my web page. I ran into a few problems:
Lit considers all HTML unsafe, so i had to mark it with Lit's unsafeHTML directive.
Then, I noticed none of the script or link tags in the injected HTML were being followed, so I parsed the incoming HTML as a HtmlDocument, located all the script/link tags, and "re-created" them using document.createElement(...) and returned them in my render(). I'm now noticing that images arent showing up either.
I don't like scraping scripts/links and re-creating them and jamming them into my web component anyhow, but I'm curious - what's the right way to approach this syndicating/consuming syndicated HTML pages/fragments?
Is this a solved problem w/ oEmbed already?
Is this simpler to do with a different WebComponent library?
This seems way harder than it should be at this point.
I think that it has little to do with WebComponents but rather with the HTML5 specs. lit-html uses innerHTML to create elements.
Script elements inserted using innerHTML do not execute when they are inserted
There are still ways to execute JS but this has little to do with your question.
unsafeHTML(`<img src="triggerError" onerror="alert('BOOM')">`)
Regarding the images, it may be a path issue?
This should work:
unsafeHTML(`<img src='http://placehold.it/350x350'>`)

Scrapy: Unable to access class despite of it's there

I am trying to scrape this page, I am trying to fetch Color Name, LT. BLUE. From Chrome I see HTML:
<div id="desc-options"><div class="option"><span class="label">Color:</span> LT. BLUE</div><div class="option"><span class="label">Size:</span> 6.5</div></div>
I tried response.css("#desc-options") to access everything inside but returns []. Even BeautifulSoup is failing.
The element you're looking for is dynamically created via JavaScript. You cannot parse it from the plain HTML.
The good news is: the data you're looking for is probably still in the page. Check out the <script> tag defining the spConfig variable. Looks like there's some JSON there you can parse ...

How to elegantly modify html to inject html element after x-th paragraph on the server side?

I need to modify html coming from external file (server side) before I render it and inject a quote 'component' like this:
This component needs to be injected after 2nd paragraph and I'm planning to use htmlagillity pack. Any examples? Is HtmlNode.InsertAfter() method good choice once I found third paragraph which should be trivial.
Another question is would it be possible to inject sitecore placeholder or even usercontrol that is going to render my quote instead of pure html? I feel it should be but not sure what would be good approach.
Thanks
I can suggest two possible approaches here:
1) Use snippets with some customisation. Snippets allow users to insert pre-defined chunks of HTML into a RTE field. You could have a pre-defined piece of HTML which might have some identifier to indicate it should use custom processing (I would suggest some data-xxx style attribute which would not conflict with any CSS or JavaScript). Then you could create a new renderField pipeline processor which would detect the data-xxx attribute within the content of a rich text field - you would use HtmlAgilityPack for this and then replace that snippet with the contents of your server-side file.
-or-
2) Split your text content into two separate chunks and have two instances of a "HtmlText" rendering within the placeholder, with a rendering for your quote text between them in the same placeholder.
I would advise that having a rule to insert text after the second paragraph would be quite 'brittle' as this would be very reliant on content editors setting the rich text field contents in quite a precise way e.g. to always ensure two or more paragraphs and to always break text with paragraphs - they might decide to use a load of line breaks instead to split their text. That said if you did do this, you would create a new renderField pipeline processor.

How can I strip html and word formatted text from a text box

I have a multiline textbox, need to restrict users from adding html input, any other scripts, copy pasting from word or any other word processer.
But I need to allow bullets for the input.
I thought it would be a simple thing to do since it looks like a common problem.
But I could not find a good solution in the web, please help.
I am using telerik tool kit as well.
If you need to strip out HTML then HTML Agility Pack is your friend. It will deal with all manner of malformed html. As a bonus it is included in Sitecore already.
If you want to use something with a friendlier syntax then consider CSQuery or Fizzler both of which provide you with a jQuery type syntax from within C#.
If you need to build a whitelist then take a look at this post on how to add whitelist:
public void RemoveNotInWhiteList(HtmlNode pNode, IEnumerable<string> pWhiteList)
{
if (!pWhiteList.Contains(pNode.Name))
{
pNode.Remove();
return;
}
pNode.Attributes
.Where(att => !pWhiteList.Contains(att.Name))
.ToList()
.ForEach(att => att.Remove());
pNode.ChildNodes
.ToList()
.ForEach(att => RemoveNotInWhiteList(att, pWhiteList));
}
You could create a Validation rule, I reckon (in /sitecore/System/Settings/Validation Rules). Put the allowed HTML in a whitelist somewhere (possibly a Sitecore item), when validating run through that whitelist. If any other HTML tags appear in it, make it invalid.
This doesn't stop them from putting it in, but it will stop the item from being published.
You could even create a custom item:saved event handler which strips out all HTML tags apart from the whitelisted stuff. Again, it doesn't stop them from putting the HTML tags in, but as soon as the item is saved it will be removed. Going even a step further than this, I think it also would be possible to use the Rules Engine for this - this article by John West shows how to use the Rules engine to modify item names, but you could modify it to read out specific text boxes.
Neither option here will stop users from inputting HTML, but the HTML tags will automatically be removed when the item is saved.

Parsing page data into sidebar - wordpress

What would be the proper procedure for accessing the current page html data and picking up all of a certain tag and throwing them into the sidebar as links?
I'm not sure your proficiency with php, but I'll give you and overview of what you'd probably want to do.
First, you need the HTML. I'm assuming you're running this on a page (in a page.php file or single.php file, or similar), this means that you have access to the global variable $post, which contains the html of the page in it. To access it you can use the helper function get_the_content(), this returns the html being displayed.
Next you need to parse through this to get the h2 tags. A simple regex can handle this, something like <h2[^>]*>(.*)</h2>. It's important to remember that this regex is very picky, so format your html correctly, no multiline h2s.
So now you have the html, and have parsed it with a regex to get the h2s. Now you need to generate the list from the results, and prepend it to the top of the content of the page. There are a ton of ways to do this, the easiest being just running the code in the right spot in the template file.
Of course there are probably better ways of doing this, I'd recommend you look at say a FAQ plugin (if that's what this is for), or do the lists manually (as this system can be broken), or possibly use a custom post type; but for your question, that's how I'd do it.

Resources