I want to parse a court document that I downloaded in XML format, but the response type is application/xhtml+xml, and I'm getting an error when turning this XHTML document into XML in R so that I can extract the information I need. See below. Can anyone help? Thank you.
resp_xml <- readRDS("had_NH_xml.rds")
# Load httr (for http_type() and content()) and xml2
library(httr)
library(xml2)
# Check response is XML
http_type(resp_xml)
[1] "application/xhtml+xml"
# Examine returned text with content()
NH_text <- content(resp_xml, as = "text")
NH_text
[1] "<!DOCTYPE html>\n<html xmlns=\"http://www.w3.org/1999/xhtml\"><head>\n \t<meta http-equiv=\"X-UA-Compatible\" content=\"IE=edge\" /><link type=\"text/css\" rel=\"stylesheet\" href=\"/csologin/javax.faces.resource/theme.css.jsf?ln=primefaces-redmond\" /><link type=\"text/css\" rel=\"stylesheet\" href=\"/csologin/javax.faces.resource/primefaces.css.jsf?ln=primefaces&v=5.3\" /><script type=\"text/javascript\" src=\"/csologin/javax.faces.resource/jquery/jquery.js.jsf?ln=primefaces&v=5.3\"></script><script type=\"text/javascript\" src=\"/csologin/javax.faces.resource/jquery/jquery-plugins.js.jsf?ln=primefaces&v=5.3\"></script><script type=\"text/javascript\" src=\"/csologin/javax.faces.resource/primefaces.js.jsf?ln=primefaces&v=5.3\"></script><script type=\"text/javascript\" src=\"/csologin/javax.faces.resource/primefaces-extensions.js.jsf?ln=primefaces-extensions&v=4.0.0\"></script><link type=\"text/css\" rel=\"stylesheet\" href=\"/csologin/javax.faces.resou... <truncated>
# See also the htmltidy package: https://cran.r-project.org/web/packages/htmltidy/htmltidy.pdf
# Turn NH_text into an XML document
NH_xml <- read_xml(NH_text)
Error in doc_parse_raw(x, encoding = encoding, base_url = base_url,
as_html = as_html, :
Entity 'nbsp' not defined [26]
Named HTML entities are not defined in XML: only the five predefined entities (&amp;, &lt;, &gt;, &quot; and &apos;) are valid, so anything like &nbsp; must be replaced before parsing. I do not know R programming, though what I can tell you is that you need to do a string replacement, swapping each undefined entity for its numeric character reference (or the literal character), e.g. &nbsp; becomes &#160; or a plain space.
In PHP this would simply be:
$f = array('&nbsp;');
$r = array('&#160;');
$a = str_ireplace($f, $r, $a);
...and each relative key/value would be replaced. I'm not sure enough of R to try to post R code from looking at basic tutorials, though.
What I can tell you is that if you clean out those entities (and any doctype) then, as long as the rest of the markup is well-formed, it should parse just fine as application/xml.
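Since the answer above stops short of R, here is a minimal sketch of the same fix using base R and xml2, assuming NH_text holds the response text from content() as shown earlier:
# Swap the undefined entity for its numeric reference, then parse as XML
NH_clean <- gsub("&nbsp;", "&#160;", NH_text, fixed = TRUE)
NH_xml <- read_xml(NH_clean)
# Alternatively, xml2::read_html() tolerates named HTML entities directly
# and returns a document you can query with the same xml_find_* functions
NH_xml <- read_html(NH_text)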
I have an SVG file that I get as a data URI from plotly.js and send to the server (a Shiny app):
exportImage(plot, settings.config).then(function(dataURI) {
    var payload;
    if (!settings.dataURI) {
        var data = dataURI.replace(/data:image\/svg\+xml,/, '');
        // I'm using decodeURIComponent in the browser because it's much faster.
        payload = decodeURIComponent(data);
        $('<div>' + payload + '</div>').appendTo('body');
    } else {
        payload = dataURI;
    }
    Shiny.onInputChange(settings.messageId, payload);
});
The SVG contains Unicode characters in the unit mm³, and in observeEvent the SVG still contains the proper characters; when I pause in RStudio with browser(), I get this:
> substring(input$svg, 198036, 198061)
[1] "Volume (mm³) on log2 scale"
But when I save it into a file I get mm3. I'm using this:
writeLines(
  paste('<?xml version="1.0" encoding="utf-8"?>', input$svg),
  svg.file
)
I've tried using the enc2utf8 function and setting useBytes to TRUE; I've also tried adding <?xml in JavaScript and using cat(svg, svg.file), and it produces either characters with invalid encoding or 3 instead of ³.
I've got this:
> Encoding(input$svg)
[1] "UTF-8"
> Sys.getlocale()
[1] "LC_COLLATE=Polish_Poland.1250;LC_CTYPE=Polish_Poland.1250;LC_MONETARY=Polish_Poland.1250;LC_NUMERIC=C;LC_TIME=Polish_Poland.1250"
Should this be UTF-8 for it to work? How can I save UTF-8 characters to a file in R?
I'm testing this on Windows, but it will be deployed to a Linux machine.
So, it seems the problem is with the encoding. If I change the encoding to UTF-16, the value is printed correctly. So in this case:
Encoding(input$svg) <- "UTF-16"
The above works well and prints the correct output
#[1] "Volume (mm³) on log2 scale"
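If the goal is simply to get the UTF-8 bytes into the file regardless of the Windows locale, another commonly recommended pattern is to open the connection yourself and write with useBytes = TRUE; a sketch, not tested against this exact app:
# Open the file without locale re-encoding and write raw UTF-8 bytes
con <- file(svg.file, open = "w", encoding = "native.enc")
writeLines(
  enc2utf8(paste('<?xml version="1.0" encoding="utf-8"?>', input$svg)),
  con,
  useBytes = TRUE
)
close(con)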
This is what I need to decode:
\xc3\x99\xc3\x99\xc3\xa9\xc2\x87-B[x\xc2\x99\xc2\xbe\xc3\xa6\x14Ez\xc2\xab
It is generated by String.fromCharCode(arrayPw[i]), but I don't understand how to decode it. Please help.
Python (note: str.decode only exists in Python 2; in Python 3 the equivalent repair is data.encode("latin-1").decode("utf-8")):
data = "\xc3\x99\xc3\x99\xc3\xa9\xc2\x87-B[x\xc2\x99\xc2\xbe\xc3\xa6\x14Ez\xc2\xab"
udata = data.decode("utf-8")                 # Python 2: bytes -> unicode
asciidata = udata.encode("ascii", "ignore")  # drop everything non-ASCII
JavaScript:
function decode_utf8(s) {
    // escape() percent-encodes each byte and decodeURIComponent()
    // reassembles them as UTF-8 (escape() is deprecated, but this
    // classic round-trip still works).
    return decodeURIComponent(escape(s));
}
Otherwise do more research about decoding UTF-8.
https://gist.github.com/chrisveness/bcb00eb717e6382c5608
There's also an online UTF-8 decoder/encoder:
https://mothereff.in/utf-8
HINT: ÙÙé-B[x¾æEz«
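For what it's worth, the same repair in R: the escaped bytes are already valid UTF-8, so it is enough to declare the encoding rather than convert anything (a sketch using just the first few bytes of the string above):
# Declare the bytes as UTF-8 instead of re-encoding them
data <- "\xc3\x99\xc3\x99\xc3\xa9"
Encoding(data) <- "UTF-8"
data
[1] "ÙÙé"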
Duplicate of this: https://stackoverflow.com/a/70815136/5902698
You load a dataset and you have some strange characters.
Example:
'戴森美å�‘é€\xa0型器完整版套装Dyson Airwrap
HS01(铜金色礼盒版)'
In my case, I know that the strange characters are Chinese, so I can work out that whoever sent me the data encoded it as UTF-8 but it was then read back as 'ISO-8859-1'.
So, first I encode the string back to its raw bytes, then I decode those bytes as UTF-8.
So my lines are:
_encoding = 'ISO-8859-1'
_my_str.encode(_encoding, 'ignore').decode("utf-8", 'ignore')
Then my output is :
"'森Dyson Airwrap HS01礼'"
This works for me, but I suspect I don't fully understand what happens under the hood, so feel free to tell me if you have further information.
Bonus: I'll try to detect when the string is in the strange format, because some of my entries are in Chinese but others are in English.
EDIT: The bonus turned out to be unnecessary. I just apply a lambda to my column to encode and decode without caring about the format, so I change the encoding after loading the dataframe:
_encoding = 'ISO-8859-1'
_decoding = "utf-8"
df[col] = df[col].apply(lambda x : x.encode(_encoding, 'ignore').decode(_decoding , 'ignore'))
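The same two-step repair can be sketched in R, assuming a UTF-8 session and that the mojibake characters all fit in Latin-1 (hypothetical one-character example):
x <- "Ã©"                                     # UTF-8 bytes mis-read as Latin-1
y <- iconv(x, from = "UTF-8", to = "latin1")  # recover the raw bytes
Encoding(y) <- "UTF-8"                        # reinterpret them as UTF-8
y
[1] "é"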
First of all, I am sorry if this is a repeated question. I have tried for several hours already, and I see different solutions for PHP and other languages, but not for R.
I am retrieving data from the last.fm website using their API.
You do need an API key to retrieve the data I am trying to get, but I will make it simpler here and hopefully you can answer my question.
Here is my problem:
At a certain point when retrieving the data, I encounter an error which stops my request. I skipped it once, but it comes back again and again. I always get the same thing: PCDATA invalid Char value #.
Here is an example:
string = "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<lfm status=\"ok\">\n<results for=\"a\" xmlns:opensearch=\"http://a9.com/-/spec/opensearch/1.1/\">\n<opensearch:Query role=\"request\" searchTerms=\"a\" startPage=\"1382\" />\n<opensearch:totalResults>212588</opensearch:totalResults>\n<opensearch:startIndex>1381</opensearch:startIndex>\n<opensearch:itemsPerPage>1</opensearch:itemsPerPage><artistmatches>\n<artist>\n <name>!B0A \0348E09;>2</name>\n <listeners>1672</listeners>\n <mbid></mbid>\n <url>http://www.last.fm/music/!B0A+%1C8E09;%3E2</url>\n <streamable>0</streamable>\n <image size=\"small\">http://userserve-ak.last.fm/serve/34/88015017.png</image>\n <image size=\"medium\">http://userserve-ak.last.fm/serve/64/88015017.png</image>\n <image size=\"large\">http://userserve-ak.last.fm/serve/126/88015017.png</image>\n <image size=\"extralarge\">http://userserve-ak.last.fm/serve/252/88015017.png</image>\n <image size=\"mega\">http://userserve-ak.last.fm/serve/_/88015017/B0A+8E092+15286997.png</image>\n </artist></artistmatches>\n</results></lfm>\n"
When I try to parse this text I get the error:
doc = xmlParse(string, asText = TRUE)
PCDATA invalid Char value 28
Error: 1: PCDATA invalid Char value 28
I believe the part of the string that triggers this is:
<name>!B0A \0348E09;>2</name>\n
But I can't be sure.
What I am looking for is one of these solutions, the first being the ideal one, but any of the others will make me happy:
1 - Allow R to receive these invalid characters
2 - Eliminate the invalid characters and continue with the parse without stopping.
3 - Skip the string with the invalid characters and continue with the parse
4 - Create a function to find the invalid characters, so I can include it when retrieving the data from last.fm
I hope you can understand the question and help me with it.
Thanks in advance.
You are right: the artist name contains a character that is illegal in XML.
Try this out:
# Anything outside the XML 1.0 character ranges is illegal
illegal <- "[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]"
utf8_for_xml <- function(x) {
  # Strip every character that XML 1.0 does not allow
  return(gsub(illegal, "", x))
}
string_formatted <- utf8_for_xml(string)
xmlParse(string_formatted)
<?xml version="1.0" encoding="utf-8"?>
<lfm status="ok">
<results xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/" for="a">
<opensearch:Query role="request" searchTerms="a" startPage="1382"/>
<opensearch:totalResults>212588</opensearch:totalResults>
<opensearch:startIndex>1381</opensearch:startIndex>
<opensearch:itemsPerPage>1</opensearch:itemsPerPage>
<artistmatches>
<artist>
<name>!B0A 8E09;>2</name>
<listeners>1672</listeners>
<mbid/>
<url>http://www.last.fm/music/!B0A+%1C8E09;%3E2</url>
<streamable>0</streamable>
<image size="small">http://userserve-ak.last.fm/serve/34/88015017.png</image>
<image size="medium">http://userserve-ak.last.fm/serve/64/88015017.png</image>
<image size="large">http://userserve-ak.last.fm/serve/126/88015017.png</image>
<image size="extralarge">http://userserve-ak.last.fm/serve/252/88015017.png</image>
<image size="mega">http://userserve-ak.last.fm/serve/_/88015017/B0A+8E092+15286997.png</image>
</artist>
</artistmatches>
</results>
</lfm>
Extra:
Let's find out which character in your string object is illegal for XML.
The function gregexpr finds its position:
gregexpr(illegal, string)
[1] 403
attr(,"match.length")
[1] 1
using "Unicode" package:
require(Unicode)
unicode_string <- as.u_char(utf8ToInt(string))
unicode_string[403]
[1] U+001C
The Unicode character U+001C is "Information Separator Four", and it is illegal in XML.
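A base R alternative, without the Unicode package (a sketch reusing the illegal pattern and string objects from above):
# Pull out the offending character and print its code point
bad <- regmatches(string, gregexpr(illegal, string))[[1]]
sprintf("U+%04X", utf8ToInt(bad))
[1] "U+001C"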
I'm parsing HTML by inheriting from HTMLParser, a class from the html.parser library; I'm making a web scraper. I have set convert_charrefs to True. The program downloads a page with downloadPage(url) and passes it to myParser (I think it will be better for you if I don't paste all my code here). When the parser finds a link I'm interested in (e.g. "Attività e procedimenti") on a web site, the program gets the value of the href attribute, tries to download the linked page with downloadPage(href), passes it to myParser, and so on...
The code for downloadPage is the following:
import re
import urllib.request

def getCharset(response):
    # Read the charset from the Content-Type header, defaulting to ASCII
    content_type = response.info()["Content-type"]
    if content_type:
        match = re.search("charset=", content_type)
        if match:
            return content_type[match.span()[1]:]
    return "ascii"

def downloadPage(url):
    response = urllib.request.urlopen(url)
    charset = getCharset(response)
    return response.read().decode(charset)
Now, the problem is that certain links contain accented vowels, such as "http://città.it/" (that URL is made up). Not every link found in a web page is plain ASCII, so the following code sometimes raises UnicodeEncodeError:
urllib.request.urlopen(url)
I should point out that I can't know in advance how each link is composed.
I have solved this problem in this way:
import urllib.parse
from urllib.parse import urlsplit

def fromIriToUri(iri):
    myUri = []
    for part in list(urlsplit(iri)):
        try:
            # ASCII components pass through unchanged
            part.encode("ascii")
            myUri.append(part)
        except UnicodeEncodeError:
            # Percent-encode any component with non-ASCII characters
            myUri.append(urllib.parse.quote(part))
    uri = urllib.parse.urlunsplit(myUri)
    return uri
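As an aside for R users, utils::URLencode gives a rough one-call equivalent; a sketch, with the caveat that a non-ASCII hostname really needs punycode rather than percent-encoding:
# Percent-encode non-ASCII characters, leaving reserved URL syntax alone
URLencode(enc2utf8("http://città.it/"))
[1] "http://citt%C3%A0.it/"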
Is there a nice way to extract the R help page for a topic in an installed package in the form of an R object (e.g. a list)? I would like to expose help pages as standardized JSON or XML schemas; however, getting the help info out of the Rd database is harder than I thought.
I hacked together a function a while ago to get the HTML of an R help page, but I would rather have a general R object containing this information that I can render to JSON/XML/HTML, etc. I looked into the helpr package from Hadley, but it seems to be overkill for my purpose.
Edit: updated with Hadley's suggestion.
You can do this a bit more easily with:
getHTMLhelp <- function(...){
  # Locate the Rd data for the topic and render it to HTML
  thefile <- help(...)
  capture.output(
    tools:::Rd2HTML(utils:::.getHelpFile(thefile))
  )
}
Using tools:::Rd2txt instead of tools:::Rd2HTML will give you plain text. Just getting the file (without any parsing) gives you the original Rd format, so you can write a custom parsing function to turn it into an object (see the solution of @Jeroen below, which does a good job of extracting all the info into a list).
This function takes exactly the same arguments as help() and returns a character vector with one element per line of the file, e.g.:
> HelpAnova <- getHTMLhelp(anova)
> head(HelpAnova)
[1] "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\">"
[2] "<html><head><title>R: Anova Tables</title>"
[3] "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">"
[4] "<link rel=\"stylesheet\" type=\"text/css\" href=\"R.css\">"
[5] "</head><body>"
[6] ""
Or :
> HelpGam <- getHTMLhelp(gamm,package=mgcv)
> head(HelpGam)
[1] "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\">"
[2] "<html><head><title>R: Generalized Additive Mixed Models</title>"
[3] "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">"
[4] "<link rel=\"stylesheet\" type=\"text/css\" href=\"R.css\">"
[5] "</head><body>"
[6] ""
So below is what I hacked together. However, I still have to test it on many help files to see whether it works in general.
Rd2list <- function(Rd){
  # Name each node after its Rd tag, dropping the leading backslash
  names(Rd) <- substring(sapply(Rd, attr, "Rd_tag"), 2)
  temp_args <- Rd$arguments
  Rd$arguments <- NULL
  # Flatten every other section into a single string
  myrd <- lapply(Rd, unlist)
  myrd <- lapply(myrd, paste, collapse = "")
  # Rebuild the arguments as a list of arg/description pairs
  temp_args <- temp_args[sapply(temp_args, attr, "Rd_tag") == "\\item"]
  temp_args <- lapply(temp_args, lapply, paste, collapse = "")
  temp_args <- lapply(temp_args, "names<-", c("arg", "description"))
  myrd$arguments <- temp_args
  return(myrd)
}
getHelpList <- function(...){
  # Find the Rd object for the topic and convert it to a list
  thefile <- help(...)
  myrd <- utils:::.getHelpFile(thefile)
  Rd2list(myrd)
}
And then you would do something like:
myhelp <- getHelpList("qplot", package = "ggplot2")
cat(jsonlite::toJSON(myhelp))
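A quick note on what comes back (the component names follow the Rd tags, so treat the exact fields as illustrative):
myhelp$title             # the \title section, flattened to a single string
myhelp$arguments[[1]]    # a list with components "arg" and "description"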