I'm trying to build a hacker news scraper using Symfony 2's Dom Crawler [1]
When I try out the xpath with a chrome plugin [2], it works. But when I try it in my scraper I keep getting The current node list is empty.
Here's my scraper code:
$crawler1 = $client1->request('GET','https://news.ycombinator.com/item?id=8296437');
$hnpost->selftext = $crawler1->filterXPath('/html/body/center/table/tbody/tr[3]/td/table[1]/tbody/tr[4]/td[2]')->text();
[1] http://api.symfony.com/2.0/Symfony/Component/DomCrawler/Crawler.html#method_filter
[2] https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl?hl=en-US
If the problem is what I think it is, I've been battered by this one a couple of times. Chrome implicitly adds any missing <tbody> tags to the DOM, so if you then copy the XPath or CSS path, you may also have copied tags that don't necessarily exist in the source document. Try viewing the page's source and see if the DOM reported by your browser's console corresponds to the original source HTML. If the <tbody> tags are absent, be sure to exclude them in your filterXPath() call.
Related
I hope you are all doing well.
I am facing an error during web scraping in R using the Selector Gadget Tool where when I am selecting the data using the tool on the Coursera website, the no. of values it shows is correct (10). But when I copy that particular CSS code in R and run it, it's showing 18 names in the list. Please if anyone can help me with this. Here is a screenshot of the selector gadget output:
And here is what gets returned in R when I scrape that css selector:
The rendered content seen via a browser is not exactly the same as that returned by an XHR request (rvest). This is because a browser can run JavaScript to update content.
Inspect the page source by pressing Ctrl+U in browser on that webpage.
You can re-write your css selector list to match the actual html returned. One example would be as follows, which also removes the reliance on dynamic classes which change more frequently and would break your program more quickly.
library(rvest)
read_html("https://in.coursera.org/degrees/bachelors") |>
html_elements('[data-e2e="degree-list"] div[class] > p:first-child') |>
html_text2()
Learn about CSS selectors and operators here: https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Selectors
Lets say I want to scrape the Neo4j RefCard found at: https://neo4j.com/docs/cypher-refcard/current/
And I would like to fetch a 'code' example along with its styling. Here's my target. Notice that it has CSS treatment (font, color...):
...so in Neo4j I call the apoc.load.html procedure as shown here, and you can see it's no problem finding the content:
It returns a map with three keys: tagName, attributes, and text.
The text is the issue for me. It's stripped of all styling. I would like for it to let me know more about the styling of the different parts of this text.
The actual HTML in the webpage looks like following image with all of these span class tags: cm-string, cm-node, cm-atom, etc. Note that this was not generated by Neo4j's apoc.load.html procedure. It came straight from my Chrome browser's inspect console.
I don't need the actual fonts and colors, just the tag names.
I can seen in the documentation that there is an optional config map you can supply, but there's no explanation for what can be configured there. It would be lovely if I could configure it to return, say, HTML rather than text.
The library that Neo4j uses for CSS selection here is jsoup.
So I am hoping to not strip the <span> tags, or otherwise, extract their class names for each segment of text.
Could you not generate the HTML yourself from the properties in your object? It looks they are all span tags with 3 different classes depending on whether your using the property name, property value, or property delimiter?
That is probably how they are generating the HTML themselves.
Okay, two years later I revisited this question I posted, and did find a solution. I'll keep it short.
The APOC procedure CALL apoc.load.html is using the scraping library Jsoup, which is not a full-fledged browser. When it visits a page it reads the html sent by the server but ignores any javascript. As a result, if a page uses javascript for inserting content or even just formatting the content, then Jsoup will miss the html that the javascript would have generated had it run.
So I have just tried out the service at prerender.com. It's simple to use. You send it a URL, it takes your url as an argument and fetches that page itself and executes the page's javascript as it does. It returns the final result as static HTML.
So if I just call prerender.com with apoc.load.html then the Jsoup library will simply ask for the html and this time it will get the fully rendered html. :)
You can try the following two queries and see the difference pre-rendering makes. The span tags in this page are rendered only by javascript. So if we call it asking for its span tags without pre-rendering we get nothing returned.
CALL apoc.load.html("https://neo4j.com/docs/cypher-refcard/current/", {target:".listingblock pre:contains(age: 38) span"}) YIELD value
UNWIND value.target AS spantags
RETURN spantags
...but if we call it via the prender.com website, you will get a bunch of span tags and their content.
CALL apoc.load.html("https://service.prerender.cloud/https://neo4j.com/docs/cypher-refcard/current/", {target:".listingblock pre:contains(age: 38) span"}) YIELD value
UNWIND value.target AS spantags
RETURN spantags
I've made an Add-on which is a custom field.
The style of the text in the field changes depending on the properties of an issue.
I check which style should the text have in the .java file and I pass the html class in a variable called $indicator to the velocity template:
#if( ${value} )
<span class="$indicator">${value}</span>
#end
It works perfect everywhere but in gadgets. When I add this field to a table showing issues in a dashboard, the html code is correct, but it doesn't find the css resource. This is because gadgets are inside an iframe.
How can I make the iframe have a reference to the stylesheet?
You did not say exactly which gadget you were using, but try adding the following context within your <web-resource> module:
<context>jira.webresources:issue-table</context>
The above should work for at least Assigned to Me, Filter Results, In Progress, Voted, and Watched in JIRA 6.1+.
If that does not work, you might also try:
<context>com.atlassian.jira.gadgets:common-lite</context>
If that general context doesn't work, you can look for which exact contexts are #requireContext'ed by the specific gadget you are trying to use, and then make sure that your web-resource is listed in that context. You can figure this out by looking at the gadget's XML and then searching for the #requireContext. (You can find the gadget XMLs inside $JIRA_DATA/plugins/.osgi-plugins/transformed-plugins/jira-gadgets-plugin-*.jar)
Starting with JIRA 7 the Answer of Scott Dudley is no longer working. #requireContext was replaced with a #requireResource within the Atlassian sources of this gadget.
As it affects our plugin, I created a Improvement Request to make that possible again
I have added the Facebook comments plugin to dynamic pages on my site, and I am attempting to set up administration of those plugins by adding the associated meta tags to their pages. My site uses the CMS dotnetnuke.
I went to the page's settings - under advanced - and added the appropriate meta tag information. However, upon saving, administration is not enabled on that page.
I ran the page through the wc3 validator, and the following error related to that meta tag was produced:
Error Line 11, Column 1926: there is no attribute "property"
…type="text/javascript"></script><meta property="fb:admins" content="76804243"/>
✉
You have used the attribute named above in your document, but the
document type you are using does not support that attribute for this
element. This error is often caused by incorrect use of the "Strict"
document type with a document that uses frames (e.g. you must use the
"Transitional" document type to get the "target" attribute), or by
using vendor proprietary extensions such as "marginheight" (this is
usually fixed by using CSS to achieve the desired effect instead).
This error may also result if the element itself is not supported in
the document type you are using, as an undefined element will have no
supported attributes; in this case, see the element-undefined error
message for further information.
How to fix: check the spelling and case of the element and attribute,
(Remember XHTML is all lower-case) and/or check that they are both
allowed in the chosen document type, and/or use CSS instead of this
attribute. If you received this error when using the element
to incorporate flash media in a Web page, see the FAQ item on valid
flash.
Do I need to specify this attribute somehow in my DNN skin? Any ideas on a possible fix?
Thanks!
Alex
There are several ways you can change the DocType for your site. Here's the entry from the DotNetNuke wiki that describes the options:
http://www.dotnetnuke.com/Resources/Wiki/Page/Set-the-doctype-of-your-skin.aspx
I receive HTML pages from our creative team, and then use those to build aspx pages. One challenge I frequently face is getting the HTML I spit out to match theirs exactly. I almost always end up screwing up the nesting of <div>s between my page and the master pages.
Does anyone know of a tool that will help in this situation -- something that will compare 2 pages and output the structural differences? I can't use a standard diff tool, because IDs change from what I receive from creative, text replaces lorem ipsum, etc..
You can use HTMLTidy to convert the HTML to well-formed XML so you can use XML Diff, as Gulzar suggested.
tidy -asxml index.html
If out output XML compliant HTML. Or at least translate your HTML product into XML compliancy, you at least could then XSL your output to remove the content and id tags. Apply the same transformation to their html, and then compare.
I was thinking on lines of XML Diff since HTML can be represented as an XML Document.
The challenge with HTML is that it might not be always well formed. Found one more here showing how to use XMLDiff class.
A copy of my own answer from here.
What about DaisyDiff (Java and PHP vesions available).
Following features are really nice:
Works with badly formed HTML that can be found "in the wild".
The diffing is more specialized in HTML than XML tree differs. Changing part of a text node will not cause the entire node to be changed.
In addition to the default visual diff, HTML source can be diffed coherently.
Provides easy to understand descriptions of the changes.
The default GUI allows easy browsing of the modifications through keyboard shortcuts and links.
winmerge is a good visual diff program