scrape XML from an HTML page with getNodeSet - r

Hi, I am using R to do some basic web scraping. I am comfortable parsing XML files and querying them with XPath, but I am having difficulty parsing a full HTML page and extracting the embedded XML to get back into my comfort zone. For example:
parsedhtml <- htmlParse("http://www.w3schools.com/XPath/xpath_examples.asp")
parses the HTML. I am using this because xmlParse only works on XML files. I know that by using getNodeSet I can isolate specific nodes within the parsed HTML. So I am attempting to extract the embedded XML document under the "The Example XML Document" section by trying:
getNodeSet(parsedhtml, "//div[@class = 'code notranslate']")
where I get the data in the correct node; however, it is not standard XML and I am unable to parse it using xmlParse. My question is: how do I use the result of getNodeSet to extract the XML?
Thanks very much
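One approach (a minimal sketch, assuming the example XML on that page is rendered as escaped text inside the div, so the node's text content is the raw XML): pull the text out with xmlValue and re-parse it with xmlParse(asText = TRUE). Note the @ rather than # in the attribute test:
library(XML)
parsedhtml <- htmlParse("http://www.w3schools.com/XPath/xpath_examples.asp")
# note the @ (not #) in the attribute test
nodes <- getNodeSet(parsedhtml, "//div[@class = 'code notranslate']")
# xmlValue() returns the node's text content; if the page displays the
# example XML as escaped text, this string is the raw XML document
xmltext <- xmlValue(nodes[[1]])
# re-parse the string as a standalone XML document
doc <- xmlParse(xmltext, asText = TRUE)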

Related

How to get reviews with xpath in R

I'm trying to scrape reviews from this webpage https://www.leroymerlin.es/fp/82142706/armario-serie-one-blanco-abatible-2-puertas-200x100x50cm. I'm running into some issues getting the XPath to work; when I run the code, the output is always NULL.
Code:
library(XML)
url <- "https://www.leroymerlin.es/fp/82142706/armario-serie-one-blanco-abatible-2-puertas-200x100x50cm"
source <- readLines(url, encoding = "UTF-8")
parsed_doc <- htmlParse(source, encoding = "UTF-8")
xpathSApply(parsed_doc, path = '//*[@id="reviewsContent"]/div[1]/div[2]/div[3]/h3', xmlValue)
I must be doing something wrong! I'm trying everything. Many thanks for your help.
This webpage is dynamically created on load, with the data stored in a secondary file, so typical scraping and XPath methods will not work.
Open your browser's developer tools and go to the Network tab. Reload the webpage and filter for XHR requests. Reviewing each file, you should see one named "reviews"; this is the file where the reviews are stored in JSON format. Right-click the file and copy the link address.
One can access this file directly:
library(jsonlite)
fromJSON("https://www.leroymerlin.es/bin/leroymerlin/reviews?product=82142706&page=1&sort=best&reviewsPerPage=5")
Here is a good reference: How to Find The Link for JSON Data of a Certain Website

Retrieve data from a website via Visual Basic

There is this website that we purchase widgets from that provides details for each of their parts on its own webpage. Example: http://www.digikey.ca/product-search/en?lang=en&site=ca&KeyWords=AE9912-ND. I have to find all of their parts that are in our database, and add Manufacturer and Manufacturer Part Number values to their fields.
I was told that there is a way for Visual Basic to access a webpage and extract information. If someone could point me in the right direction on where to start, I'm sure I can figure this out.
Thanks.
How to scrape a website using HTMLAgilityPack (VB.Net)
I agree that HtmlAgilityPack is the easiest way to accomplish this. It is less error-prone than using regex alone. The following is how I deal with scraping.
Create a new application, add HtmlAgilityPack via NuGet, and add a reference to it. If you can use Chrome, it will let you inspect the page to find where your information is located: right-click a value you wish to capture and look for the table it is found in (follow the HTML up a bit).
The following example will extract all the values from that page within the "pricing" table. We need to know the XPath value for the table (this value instructs HtmlAgilityPack what to look for) so that the document we create looks for our specific values. This can be achieved by finding whatever structure your values are in, then right-click > Copy XPath. From this we get...
//*[#id="pricing"]
Please note that sometimes the XPath you get from Chrome may be rather large. You can often simplify it by finding something unique about the table your values are in. In this example it is "id", but in other situations, it could easily be headings or class or whatever.
This XPath value looks for something with the id equal to pricing; that is our table. When we look further in, we see that our values are within tbody, tr, and td tags. The tbody tag is typically inserted by the browser and may not exist in the raw HTML, so HtmlAgilityPack will not find it; leave it out. Our new XPath is...
//*[@id='pricing']/tr/td
This XPath says look for the pricing id within the page, then look for text within its tr and td tags. Now we add the code...
Dim Web As New HtmlAgilityPack.HtmlWeb
Dim Doc As New HtmlAgilityPack.HtmlDocument
Doc = Web.Load("http://www.digikey.ca/product-search/en?lang=en&site=ca&KeyWords=AE9912-ND")
For Each table As HtmlAgilityPack.HtmlNode In Doc.DocumentNode.SelectNodes("//*[@id='pricing']/tr/td")
Next
To extract the values we simply reference our table variable that was created in our loop and its InnerText member.
Dim Web As New HtmlAgilityPack.HtmlWeb
Dim Doc As New HtmlAgilityPack.HtmlDocument
Doc = Web.Load("http://www.digikey.ca/product-search/en?lang=en&site=ca&KeyWords=AE9912-ND")
For Each table As HtmlAgilityPack.HtmlNode In Doc.DocumentNode.SelectNodes("//*[@id='pricing']/tr/td")
MsgBox(table.InnerText)
Next
Now we have message boxes popping up the values... you can swap the message box for a list to fill, or whatever other way you wish to store the values. Now simply do the same for whatever other tables you wish to get.
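For instance, a sketch of that list variant (using a List(Of String) in place of the message boxes):
Dim Web As New HtmlAgilityPack.HtmlWeb
Dim Doc As HtmlAgilityPack.HtmlDocument = Web.Load("http://www.digikey.ca/product-search/en?lang=en&site=ca&KeyWords=AE9912-ND")
' collect the cell values in a list instead of showing message boxes
' (SelectNodes returns Nothing if the XPath matches no nodes)
Dim Values As New List(Of String)
For Each cell As HtmlAgilityPack.HtmlNode In Doc.DocumentNode.SelectNodes("//*[@id='pricing']/tr/td")
    Values.Add(cell.InnerText.Trim())
Next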
Please note that the Doc variable we created is reusable, so if you want to cycle through a different table on the same page, you do not have to reload the page. This is a good idea especially if you are making many requests: you don't want to slam the website, and if you are automating a large number of scrapes, it also puts some time between requests.
Scraping is really that easy. That's the basic idea. Have fun!
Html Agility Pack is going to be your friend!
What is exactly the Html Agility Pack (HAP)?
This is an agile HTML parser that builds a read/write DOM and supports
plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor
XSLT to use it, don't worry...). It is a .NET code library that allows
you to parse "out of the web" HTML files. The parser is very tolerant
with "real world" malformed HTML. The object model is very similar to
what proposes System.Xml, but for HTML documents (or streams).
Looking at the source of the example page you provided, they are using HTML5 Microdata in their markup. I searched some more on CodePlex and found a microdata parser which may help too: MicroData Parser

How to extract html table into XML file?

I have an HTML table stored in a string
string tbl = "<table class='report'><tr><th>head</th><th>name</th></tr><tr><td>Department name</td><td>Mike</td></tr></table>";
How can I loop through this string and then write it to an XML file?
I think I will be able to write the file to XML, but the question is how to loop through the string, identify what's in it, and parse it.
Thanks
Since well-formed HTML like this is already valid XML, you could leave it as it is and meet your objective. But I assume you want semantically meaningful tag names.
You might try the HTML Agility Pack. This allows you to write queries against an object model, similar to the way you can do it with XDocument and Linq-to-XML. I quote:
This is an agile HTML parser that builds a read/write DOM and supports
plain XPATH or XSLT (you actually don't HAVE to understand XPATH
nor XSLT to use it, don't worry...). It is a .NET code library that
allows you to parse "out of the web" HTML files. The parser is very
tolerant with "real world" malformed HTML. The object model is very
similar to what proposes System.Xml, but for HTML documents (or
streams).
It also supports Linq, if you aren't familiar with XPATH et al.
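As a minimal sketch of that approach for the table string above (assuming the HtmlAgilityPack NuGet package is referenced; the output element names employees, employee, department, and name are invented for illustration):
using System.Xml.Linq;
using HtmlAgilityPack;

class TableToXml
{
    static void Main()
    {
        string tbl = "<table class='report'><tr><th>head</th><th>name</th></tr><tr><td>Department name</td><td>Mike</td></tr></table>";

        // parse the HTML fragment
        var doc = new HtmlDocument();
        doc.LoadHtml(tbl);

        // loop over the data rows (rows containing td cells) and re-emit
        // each one under the invented tag names
        var root = new XElement("employees");
        foreach (var row in doc.DocumentNode.SelectNodes("//tr[td]"))
        {
            var cells = row.SelectNodes("./td");
            root.Add(new XElement("employee",
                new XElement("department", cells[0].InnerText),
                new XElement("name", cells[1].InnerText)));
        }
        root.Save("employees.xml");
    }
}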

How to Show & in xml constructed from string

I am creating a web API using ASP.NET MVC 4 and the response output is XML. Before outputting to the browser, I modify the XML response so that one of the values between the start and closing tags contains a URL string, which may contain '&'.
When output in the browser, this generates an error that the XML is not well formed.
I have read from "How to show & in an XML attribute that would be produced by XSLT" that one can use disable-output-escaping (d-o-e) to generate unescaped content using XSLT,
but I don't know how this could apply to XML generated from a string and displayed in the browser.
You should encode the & as
&amp;
which is understood by XML (see http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Predefined%5Fentities%5Fin%5FXML)
Another alternative would be to wrap the output in a CDATA section (http://stackoverflow.com/questions/2784183/what-does-cdata-in-xml-mean).
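In the resulting XML, the two options look like this (the URL value is only an illustration):
<!-- escape the ampersand with the predefined entity... -->
<url>http://example.com/page?a=1&amp;b=2</url>

<!-- ...or wrap the raw value in a CDATA section -->
<url><![CDATA[http://example.com/page?a=1&b=2]]></url>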

Need a way to grab an XML file from a URL, transform it with XSL, and present it back as an XML file that prompts a save?

Requirements:
Raw XML comes from an external website over which I have little control, accessed via URL (e.g. http://example.com/raw.xml)
I need to transform it via XSL into another XML file (I already have this XSL file written and it works)
I need to write an ASP.NET or ASP page that takes the URL, applies the XSL transform, and outputs the resulting XML in a way that prompts the client to save it to local disk
The end result is an XML file that has been XSL-transformed, based on the XSL and the XML from the external website
This should not be difficult, but I do not see any examples that allow me to do what is stated above. Please help! Thanks in advance!
You can get the external XML using the WebRequest class (for example).
The result can be loaded into an XML document and transformed; the transformed document can then be written to HttpResponse.OutputStream with the correct headers for an XML document (the Content-Type will be either text/xml or application/xml, and a Content-Disposition header of attachment is what prompts the save dialog).
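A minimal sketch of that as an ASP.NET IHttpHandler (the source URL and transform.xsl path are placeholders, and error handling is omitted):
using System.Web;
using System.Xml;
using System.Xml.Xsl;

public class TransformHandler : IHttpHandler
{
    public void ProcessRequest(HttpContext context)
    {
        // load the remote XML (XmlReader fetches the URL itself;
        // WebRequest or HttpClient would work just as well)
        using (var reader = XmlReader.Create("http://example.com/raw.xml"))
        {
            var xslt = new XslCompiledTransform();
            xslt.Load(context.Server.MapPath("~/transform.xsl"));

            context.Response.ContentType = "application/xml";
            // Content-Disposition is what makes the browser prompt a save
            context.Response.AddHeader("Content-Disposition",
                                       "attachment; filename=result.xml");
            xslt.Transform(reader, null, context.Response.OutputStream);
        }
    }

    public bool IsReusable { get { return true; } }
}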
