Debugging htmlParse in R's XML library - r

This is not the first time I've encountered a problem while using htmlParse in the XML library, but in the past I've just given up and used a regex to parse what I needed instead. I'd rather do it by parsing the XML/XHTML, since, as we all know, regexes aren't parsers.
That said, I find the error messages from the parse commands unhelpful at best, and I have no idea how to proceed. For instance:
> htmlParse(getForm("http://www.takecarehealth.com/LocationSearchResults.aspx", location_query="Deer Park",location_distance=50))
Error in htmlParse(getForm("http://www.takecarehealth.com/LocationSearchResults.aspx", :
File
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head id="ctl00_Head1">
<title></title>
<script language="JavaScript" type="text/javascript">
var s_pageName = document.title;
var s_channel = "Take Care";
var s_campaign = "";
var s_eVar1 = ""
var s_eVar2 = ""
var s_eVar22 = ""
var s_eVar23 = ""
</script>
<meta name="keywords" content="take care clinic, walgreens clinic, walgreens take care clinic, take care health, urgent care clinic, walk in clinic" />
<meta name="description" content="Information about simple, quality healthcare for the whole family from Take Care Clinics at select Walgreens, including Take Care Clinic hours, providers, offers, insurance and quality of care." />
<link rel="shortcut icon" hre
I'm glad it sees something in there, but where do I drill down past "Error: File"?
Note that this is, as far as I can tell, well-formed XHTML. When I visit the link manually I can run XPath queries on it and Firebug does not complain.
How do I debug errors from htmlParse like this?

Downloading the page first and then passing the text to the XML package seems to work:
test <- getForm("http://www.takecarehealth.com/LocationSearchResults.aspx",
                location_query = "Deer Park", location_distance = 50)
htmlParse(test, asText = TRUE)
or, directly:
htmlParse(getForm("http://www.takecarehealth.com/LocationSearchResults.aspx",
                  location_query = "Deer Park", location_distance = 50),
          asText = TRUE)
also seems fine.
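For reference, here is a minimal end-to-end sketch of that approach (assuming the RCurl and XML packages, and that getForm returns this text response as a character string; the XPath query at the end is just an illustrative example, not part of the original question):
library(RCurl)
library(XML)

page <- getForm("http://www.takecarehealth.com/LocationSearchResults.aspx",
                location_query = "Deer Park", location_distance = 50)
class(page)                            # a character string, not a file name or URL

doc <- htmlParse(page, asText = TRUE)  # tell htmlParse to treat the input as text
xpathSApply(doc, "//title", xmlValue)  # example query against the parsed tree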

Related

Using multiple DTDs (from mozilla.org)

I need to use XHTML.
At the URL https://developer.mozilla.org/en-US/docs/Archive/Mozilla/XUL/Using_multiple_DTDs (updated Sept 9, 2019), Mozilla says:
""" If you want to use multiple DTDs with your XUL file, you can simply list all of the DTDs inside your DTD declaration: """
<!DOCTYPE window [
<!ENTITY % commonDTD SYSTEM "chrome://myextensions/locale/common.dtd">
%commonDTD;
<!ENTITY % mainwindowDTD SYSTEM "chrome://myextension/locale/mainwindow.dtd">
%mainwindowDTD;
]>
But how can I load the "official" www.w3.org/TR/xhtml11/DTD/xhtml11.dtd plus MyOwn.dtd?
Logically the following structure is wrong, but how can I fix it?
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html[
<!ENTITY % commonDTD PUBLIC "chrome://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
%commonDTD;
<!ENTITY % qfc SYSTEM "chrome://myweb.com/DTD/my.dtd">
%qfc;
]>
I need to load two DTDs, one remote:
http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd
and the other local:
mydtd.dtd
the first with "PUBLIC" and the second with "SYSTEM".
However, mozilla.org doesn't show an example that loads one PUBLIC and one SYSTEM DTD together.
Thanks for your help.
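For what it's worth, the internal subset does allow mixing the two: a parameter entity declared with PUBLIC must also carry a system identifier after the public one. A minimal sketch, using arbitrary entity names and the local mydtd.dtd mentioned above, might look like:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html [
  <!ENTITY % xhtml11DTD PUBLIC "-//W3C//DTD XHTML 1.1//EN"
      "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
  %xhtml11DTD;
  <!ENTITY % myDTD SYSTEM "mydtd.dtd">
  %myDTD;
]>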

Web scraping suddenly stopped working

I am going to start off by saying that my knowledge of XML is pretty minimal.
I promise you that until 2 or 3 days ago the following code worked perfectly:
library("rvest")
url<-"https://en.wikipedia.org/wiki/Opinion_polling_for_the_next_United_Kingdom_general_election"
H<-read_html(url)
table<-html_table(H, fill=TRUE)
Z<-table[1]; Z1<-Z[[1]]
Which then allowed me to get on and do what I wanted, extracting the first table from that web page and putting it in data frame Z1. However, this has suddenly stopped working and I keep getting the error message:
Error in if (length(p) > 1 & maxp * n != sum(unlist(nrows)) & maxp * n != :
missing value where TRUE/FALSE needed
When I look at H it seems no longer to be a list and now looks like this:
{xml_document}
<html class="client-nojs" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject ...
It is clearly failing at html_table.
I really don't know where to start with this at all.
I believe you missed a step: you need to parse out the table nodes before calling html_table.
library("rvest")
url <- "https://en.wikipedia.org/wiki/Opinion_polling_for_the_next_United_Kingdom_general_election"
H <- read_html(url)
tables <- html_nodes(H, "table")
Z1 <- html_table(tables[1], fill = TRUE)[[1]]
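The same fix can also be written as a single pipeline (a sketch using the %>% operator that rvest re-exports; html_table on a single node returns a data frame directly):
library(rvest)
Z1 <- read_html(url) %>%
  html_nodes("table") %>%
  .[[1]] %>%                  # take the first table node
  html_table(fill = TRUE)     # a data frame, no [[1]] needed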

plotting data on to a country map

Hey guys, I'm having trouble with vincent; I'm not sure exactly how to use it.
I've parsed some data from the UK House of Commons petition site, and now have a list of countries and their corresponding number of votes on a certain petition. I've converted the data from JSON into ('Austria', 40) format.
I'm using vincent to plot them onto a map, with colour scaled to represent the number of votes, but I don't really know how to use vincent.
For example, to render a basic map of the world the code is:
from vincent import Map

world_topo = r'world-countries.topo.json'
geo_data = [{'name': 'countries',
             'url': world_topo,
             'feature': 'world-countries'}]
vis = Map(geo_data=geo_data, scale=200)
vis.to_json('vega.json')
but that just outputs JSON, not a picture of a map, even though that is what two tutorial examples say should happen (for example here: http://wrobstory.github.io/2013/10/mapping-data-python.html, and another place I forgot to save the link).
Could someone help me out? Thanks in advance, guys.
First you should change the last line of your code. Try this simplified example:
import vincent

list_data = [10, 20, 30, 20, 15, 30, 45]
vega = vincent.Bar(list_data)
# html_out=True also writes a small HTML scaffold that loads the JSON spec
vega.to_json('vega.json', html_out=True, html_path='vega.html')
Then, using the terminal, cd to the location of your project, i.e. where vega.html is saved.
After that, run a local server using python -m SimpleHTTPServer 8000. You can then open any browser and go to http://localhost:8000/vega.html.
Note that, depending on the version of vincent, the arguments to .to_json may differ.
Hope that helps :)
P.S. I think you should add the Python tag as well so people can find this more easily.
If you want to see a picture of a map, you will have to create an HTML file that reads the JSON file and renders the map.
<html>
<head>
  <script src="http://d3js.org/d3.v3.min.js" charset="utf-8"></script>
  <script src="http://d3js.org/topojson.v1.min.js"></script>
  <script src="http://d3js.org/d3.geo.projection.v0.min.js" charset="utf-8"></script>
  <script src="http://trifacta.github.com/vega/vega.js"></script>
</head>
<body>
  <div id="vis"></div>
  <!-- This script reads your JSON file and renders it as the map -->
  <script type="text/javascript">
    function parse(spec) {
      vg.parse.spec(spec, function(chart) { chart({el:"#vis"}).update(); });
    }
    // Put the name of your JSON file where vega.json is
    parse("vega.json");
  </script>
</body>
</html>
Then open your command line and run:
python -m SimpleHTTPServer 8000   # Python 2
Next, open your browser at http://localhost:8000/ followed by the path to the HTML file above.
You should now see a map on the page. Note this will be a basic map, depending on what data you passed in the JSON file.
I hope this helps and good luck!
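Finally, as a small preparatory step for the ('Austria', 40) pairs mentioned in the question (a hypothetical sketch with placeholder values and column names), you can load them into a pandas DataFrame, which is the form the tutorial linked in the question uses when binding data to a map's colour scale:
import pandas as pd

# Placeholder values in the ('Country', votes) format described in the question
votes = [('Austria', 40), ('France', 120), ('Germany', 95)]
df = pd.DataFrame(votes, columns=['country', 'signatures'])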

google prettify: ada syntax

I am currently trying to highlight Ada code on my website using google prettify and a file I have found here.
However, I am not able to use the latter file with prettify, and the automatic language detection messes up attributes containing the ' character (such as Array'first or integer'image), highlighting them as string delimiters.
For instance, I have the following sample code, and I would like to have it formatted correctly on my page:
procedure mergesort (V: in out TV_integer; iterations: in out integer) is
   -- {} => {V is sorted}
   m : integer := (V'first + V'last) / 2;
begin -- mergesort
   if V'length > 1 then
      mergesort(V(V'first..m), iterations);
      mergesort(V(m+1..V'last), iterations);
      merge(V(V'first..m), V(m+1..V'last), V, iterations);
   end if;
end mergesort;
Any help would be appreciated.
EDIT: I tried using a pre class="prettyprint lang-ada" tag so that it would use the lang-ada custom script, but without success.
I'm the author of the Ada lexer for google code prettify. To use it, add this to your page:
<head>
<!-- ... -->
<link href="css/prettify.css" media="screen" rel="stylesheet" type="text/css" />
<script type="text/javascript" src="js/prettify.js"></script>
<script type="text/javascript" src="js/lang-ada.js"></script>
</head>
<body onload="prettyPrint()">
Do not use the auto-loader; it won't use custom lexers (change the paths to wherever you put the google code prettify files). After you have done that, you can highlight code on your website like this:
<pre class="prettyprint lang-ada"><code>
-- Ada code
</code></pre>
or if you're using markdown or something else that prevents you from adding classes to your tags:
<?prettify lang=ada?>
<pre><code>
-- here goes your Ada code
</code></pre>
By the way, the Ada lexer will mark Ada attributes with the class atn (which is colored violet by default). If you want them to have the same color as other code, just edit prettify.css.
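For example, a one-line override appended to prettify.css (a sketch; substitute whatever colour your plain code uses, #000 being the default) would do it:
.atn { color: #000; }  /* render Ada attributes in the plain-text colour instead of violet */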
Ada is not supported. A lexer has been submitted by fordprefect86, but has not (yet) been included.
See Issue 312 for more information.

Modifying a Blogger template: Is there some way to access data:post.labels from within the header?

I would like to be able to access data:post.labels from within the header at Blogger. I only plan to make use of it when data:blog.pageType == "item", so there won't be any confusion with regard to multiple posts on a page. However, nothing I have tried has yielded any results. Here's what I plan to do with the data if I discover a way to get access to it:
<b:if cond='data:blog.pageType == "item"'>
  <b:loop values='data:post.labels' var='label'>
    <b:if cond='data:label.name == "poetry"'>
      <meta expr:content='"Poem “" + data:blog.pageName + "”" + " at Form and Formlessness"' property='og:title'/>
    </b:if>
    <b:if cond='data:label.name == "article"'>
      <meta expr:content='"Article “" + data:blog.pageName + "”" + " at Form and Formlessness"' property='og:title'/>
    </b:if>
    <b:if cond='data:label.name == "lists"'>
      <meta expr:content='"Poem list “" + data:blog.pageName + "”" + " at Form and Formlessness"' property='og:title'/>
    </b:if>
  </b:loop>
</b:if>
All of my posts are either poems, articles on poetry, or poem lists, and labeled appropriately. So, if I can figure out some way to access the labels used by the post, this should work.
Any assistance would be appreciated.
P.S. Don't worry about the quotation marks in the code; they're proper open and close quotation marks, and they work just fine without having to use the Unicode value.
I spent quite a while looking for a solution some time ago. I finally came to the conclusion (well, others advising me did) that data:post.labels is only available within the blog-posts widget: if you're outside it (e.g. in the header), it's not available.
Someone did suggest some code to populate an array with the labels while inside the blog-posts widget, and read it afterwards. But I never took that any further, because I really wanted access to the labels before reaching the post widget.
My original discussion was at: http://www.google.com/support/forum/p/blogger/thread?tid=188cd44d0908f736&hl=en
