rvest | read_xml claims error where read_html does not

Taking the following URL:
URL <- "http://www.google.de/complete/search?output=toolbar&q=TDS38311DE"
doc <- read_xml(URL)
I get the following error:
Error: Input is not proper UTF-8, indicate encoding !
Bytes: 0xDF 0x20 0x2F 0x20 [9]
Using read_html instead, everything is fine.
Am I doing something wrong? Why does this error occur?

First: rvest uses xml2 for content acquisition, so any issue relating to this belongs on that package's GitHub repository rather than rvest's.
Second, read_xml takes an encoding parameter for a reason and says so: "Unless otherwise specified XML documents are assumed to be in UTF-8 or UTF-16. If the document is not UTF-8/16, and lacks an explicit encoding directive, this allows you to supply a default."
XML documents can declare their own encoding, but this "AJAX-y" response from Google clearly doesn't (Google isn't expecting you to be pilfering it, and it assumes the consumer is usually an HTML parsing engine, a.k.a. a browser, not an XML parsing engine).
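In practice, then, one workaround is simply to tell read_xml the encoding yourself; a one-line sketch, assuming the response really is latin1 (the offending 0xDF byte is "ß" in ISO-8859-1, which fits German text):
doc <- xml2::read_xml(URL, encoding = "ISO-8859-1")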
rvest used to do this:
encoding <- encoding %||% default_encoding(x)
xml2::read_xml(httr::content(x, "raw"), encoding = encoding, base_url = x$url,
               as_html = as_html)
And default_encoding does this:
default_encoding <- function(x) {
  type <- httr::headers(x)$`Content-Type`
  if (is.null(type)) return(NULL)
  media <- httr::parse_media(type)
  media$params$charset
}
but rvest now only exposes read_xml methods for session and response objects (where it does the encoding guessing).
So, you can either:
do some manual introspection prior to scraping (after reading a site's ToS),
use httr to grab a page and pass that to read_xml (see the sketch after this list), or
hook up your own reader function into your script with the same idiom
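A minimal sketch of the httr route, assuming the endpoint keeps serving latin1 without declaring it (the ISO-8859-1 fallback is a guess, not something the endpoint guarantees):
library(httr)
library(xml2)

URL <- "http://www.google.de/complete/search?output=toolbar&q=TDS38311DE"
res <- GET(URL)

# Prefer the charset declared in the Content-Type header, if any;
# otherwise fall back to latin1 (an assumption for this endpoint).
ctype <- headers(res)$`Content-Type`
enc <- if (!is.null(ctype)) parse_media(ctype)$params$charset else NULL
if (is.null(enc)) enc <- "ISO-8859-1"

doc <- read_xml(content(res, "raw"), encoding = enc)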

Related

Using readtext to extract text from XML

I am not used to working with XML files but need to extract text from various fields in XML files. Specifically, I've downloaded and saved XML files like the following: https://www.federalregister.gov/documents/full_text/xml/2007/09/18/07-4595.xml. I'm interested in the text within the tag "regtext" in this and other similar XML files.
I've downloaded the XML files and stored them on my computer, but when I set the directory and attempt to use the readtext package to read from the XML files, I get the following error:
regtext <- readtext("/regdata/RegDataValidation", text_field = "regtext")
Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) :
Start tag expected, '<' not found [4]
I've tried to search the error, but nothing I've come across has helped me figure out what might be going on. This basic command works like a charm on any number of other document types, including .csv or .docx, but for some reason it just doesn't seem to recognize the files I'm trying to work with here. Any pointers would be much appreciated--I'm too much of a novice and all of the documentation on readtext does not give examples of how to work with XML.
Pursuant to comments below, I've also tried to specify a single saved XML file, as follows:
> regtext <- readtext("/regdata/RegDataValidation/0579- AC01.xml", text_field = "regtext")
Error in xml2_to_dataframe(xml) :
The xml format does not fit for the extraction without xPath
Use xPath method instead
In addition: There were 50 or more warnings (use warnings() to see the first 50)
I tried to specify an xPath expression on a single file, and this did not return any errors, but it didn't actually extract any text (even though there should be plenty of text within the "regtext" node):
> regtext <- readtext("/regdata/RegDataValidation/0579- AC01.xml", text_field = "/regtext/*")
I end up with a dataframe with the correct doc_id, but no text.
From the error messages, the readtext function appears to be converting the xml file into a plain text document and the XML package is not accepting it as a valid document.
It is also likely that the XML parser is differentiating between "regtext" and "REGTEXT".
Here is a solution using the xml2 package. (I find this package provides a simpler interface and is easier to use)
library(xml2)
url <- "https://www.federalregister.gov/documents/full_text/xml/2007/09/18/07-4595.xml"
page <- read_xml(url)
#parse out the nodes within the "REGTEXT" sections
regtext <- xml_find_all(page, ".//REGTEXT")
#convert the regtext nodes into vector of strings
xml_text(regtext)
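If you want something shaped like readtext's output (one row per document with doc_id and text columns), a rough follow-up sketch; the doc_id value here is just the file name, chosen for illustration:
data.frame(
  doc_id = "07-4595.xml",
  text = paste(xml_text(regtext), collapse = " "),
  stringsAsFactors = FALSE
)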

How to use UTF-8 folder names in download_html and download.file

I would like to scrape country-specific information from a Chinese webpage into folders that are named according to these countries. Since I am extracting the list of countries from the Chinese page as well, the folder names would contain Chinese characters - which seems to be problematic.
My code is
url <- "https://www.baidu.com/"
path <- file.path("中国", "江苏")
dir.create(path, recursive = TRUE)
download_html(url, file = file.path(path, "baidu.html"))
download.file(url, destfile = file.path(path, "baidu.html"))
The error message of the last line reads
Error in download.file(url, destfile = file.path(path, "baidu.html")) :
cannot open destfile '<U+4E2D><U+56FD>/<U+6C5F><U+82CF>/baidu.html', reason 'Invalid argument'
so it seems that download.file converts Chinese characters internally. Interestingly, file.path has no issues creating folders containing Chinese characters. I am running Windows 10 64 bit and R version 4.0.2.
Is there a way (or alternative function) that accepts Chinese characters or coerces download.file to use the correct encoding? If not, what alternatives do I have? I could think of:
navigating into the folder using setwd (which does work but forces me to use a loop)
converting the Chinese names, for example by using its romanization (which is ambiguous and probably does not exist as an R function)
EDIT:
Perhaps this is part of a bigger issue on my machine. The first line of the following code works (i.e. shows "two" as a result), whereas the second line does not:
stringr::str_replace_all("one", c("one" = "two"))
stringr::str_replace_all("阿富汗", c("阿富汗" = "Afghanistan"))
Instead, the second line produces an error similar to the one above:
Warning message:
unable to translate '<U+963F><U+5BCC><U+6C57>' to native encoding
However, when I create a string containing Chinese characters, the result seems to be in UTF-8:
string <- "阿富汗"
stringi::stri_enc_isutf8(string)
shows TRUE.
EDIT 2:
On my old laptop running Ubuntu, stringr::str_replace_all() works just fine with Chinese characters.
I think the leading cause of your error with download.file is that the default encoding for Chinese on your Windows system is UTF-16. I did some trials and ran into similar errors with download_html, though I couldn't reproduce your exact error; I think its essence is the same as with download.file. In my case, download_html only recognizes a file argument containing Chinese characters when it is encoded in UTF-8. The code is below:
library(xml2)
url <- 'https://www.baidu.com/'
path <- file.path('中国', '江苏')
dir.create(path, recursive = TRUE)
download_html(url, file = file.path(path, 'baidu.html'))
download.file(url, destfile = file.path(path, "baidu.html"))
The error occurred:
Error in curl::curl_download(url, file, quiet = quiet, mode = mode, handle = handle) :
Failed to open file C:\Users\16071098\Documents\涓浗\鍖椾含\baidu.html.
But when I change the download_html command as below:
download_html(url, file = enc2utf8(file.path(path, 'baidu.html')))
download.file(url, destfile = enc2utf8(file.path(path, "baidu.html")))
Then it works.
Your error shows that the Chinese part of your path is encoded as UTF-16. Whatever the exact reason, I think it comes down to your system's default encoding being different from the encoding the function expects.

R character Encoding goes wrong (English - Spanish)

I'm trying to load a dataset into R using an API that lets me run a query and returns the data I need (I can't configure anything on the server side).
I know it has something to do with encoding. When I check the string from my dataframe in R, it gives me Encoding: UTF-8 "Cosmética".
When I copy the source string "Cosmética", it gives me latin1.
How can I get the UTF-8 string properly formatted like the latin1 one?
I've tried this:
Sys.setlocale("LC_ALL","Spanish")
and tried directly on the string:
Enconding(Description) <- "latin1"
Unfortunately I can't get it to work. Any ideas are welcome! Thanks.
You can use iconv to change to encoding of the string:
iconv(mystring, to = "ISO-8859-1")
# [1] "Cosmética"
ISO 8859-1 is the common character encoding in Western Europe.
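If the problem affects a whole data frame column rather than a single string, the same idea applies column-wise; a sketch assuming the data frame and column are called df and Description (both names are placeholders, not taken from the question):
df$Description <- iconv(df$Description, from = "UTF-8", to = "ISO-8859-1")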

Error fetching the json from forbes website in R using jsonlite

Here is my code:
forbesList<-fromJSON('https://www.forbes.com/ajax/list/data?year=2018&uri=powerful-brands&type=organization')
Error Details:
Error in parse_con(txt, bigint_as_char) :
lexical error: invalid char in json text.
<!doctype html> <html lang="en">
(right here) ------^
Please help me out; I tried many ways to resolve this issue but failed. Any help is much appreciated.
The first parameter of fromJSON needs to be actual JSON text, not a URL pointing to some JSON text. Try downloading the JSON content first, then make your call to fromJSON:
library(httr)
library(jsonlite)
url <- "https://www.forbes.com/ajax/list/data?year=2018&uri=powerful-brands&type=organization"
req <- GET(url)
stop_for_status(req)
json <- content(req, "text")
forbesList <- fromJSON(json)
I have verified that the JSON content from your URL parses correctly, so I don't think that should be a problem.
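If the lexical error comes back anyway, it can help to check what the endpoint actually returned before parsing; a small sketch using standard httr helpers:
http_type(req)                        # "application/json" is good; "text/html" explains the error
substr(content(req, "text"), 1, 80)   # peek at the start of the response body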

Line by line reading from HTTPS connection in R

When a connection is created with open="r" it allows for line-by-line reading, which is useful for batch processing large data streams. For example, this script parses a sizable gzipped JSON HTTP stream by reading 100 lines at a time. Unfortunately, however, R's url() connections do not support SSL:
> readLines(url("https://api.github.com/repos/jeroenooms/opencpu"))
Error in readLines(url("https://api.github.com/repos/jeroenooms/opencpu")) :
cannot open the connection: unsupported URL scheme
The RCurl and httr packages do support HTTPS, but I don't think they are capable of creating a connection object similar to url(). Is there some other way to do line-by-line reading of an HTTPS connection similar to the example in the script above?
Yes, RCurl can "do line-by-line reading". In fact, it always does; the higher-level functions just hide this from you for convenience. You use writefunction (and headerfunction for the header) to specify a function that is called each time libcurl has received enough bytes from the body of the result. That function can do anything it wants. There are several examples of this in the RCurl package itself, but here is a simple one:
curlPerform(url = "http://www.omegahat.org/index.html",
            writefunction = function(txt, ...) {
              cat("*", txt, "\n")
              TRUE
            })
One solution is to manually call the curl executable via pipe. The following seems to work.
library(jsonlite)
gzstream <- gzcon(pipe("curl https://jeroenooms.github.io/files/hourly_14.json.gz", open = "r"))
batches <- list(); i <- 1
while (length(records <- readLines(gzstream, n = 100))) {
  message("Batch ", i, ": found ", length(records), " lines of json...")
  json <- paste0("[", paste0(records, collapse = ","), "]")
  batches[[i]] <- fromJSON(json, validate = TRUE)
  i <- i + 1
}
weather <- rbind.pages(batches)
rm(batches); close(gzstream)
However this is suboptimal because the curl executable might not be available for various reasons. It would be much nicer to invoke this pipe directly via RCurl/libcurl.
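These days the curl package provides exactly such a connection object: curl::curl() can be read line by line with readLines() over HTTPS. A short sketch, assuming the curl package is installed (it postdates this question):
library(curl)
con <- curl("https://api.github.com/repos/jeroenooms/opencpu")
readLines(con, warn = FALSE)

# The gzipped stream above can be read the same way by swapping
# pipe(...) for curl(...) inside gzcon().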
