Line-by-line reading from HTTPS connection in R

When a connection is created with open = "r", it allows line-by-line reading, which is useful for batch processing of large data streams. For example, this script parses a sizable gzipped JSON HTTP stream by reading 100 lines at a time. Unfortunately, however, R's url() connections do not support SSL:
> readLines(url("https://api.github.com/repos/jeroenooms/opencpu"))
Error in readLines(url("https://api.github.com/repos/jeroenooms/opencpu")) :
cannot open the connection: unsupported URL scheme
The RCurl and httr packages do support HTTPS, but I don't think they are capable of creating a connection object similar to url(). Is there some other way to do line-by-line reading of an HTTPS connection similar to the example in the script above?

Yes, RCurl can "do line-by-line reading". In fact, it always does: the higher-level functions just hide this from you for convenience. You use the writefunction option (and headerfunction for the header) to specify a function that is called each time libcurl has received enough bytes from the body of the result. That function can do anything it wants. There are several examples of this in the RCurl package itself, but here is a simple one:
library(RCurl)

curlPerform(url = "http://www.omegahat.org/index.html",
            writefunction = function(txt, ...) {
              # called with each chunk of the response body as it arrives
              cat("*", txt, "\n")
              TRUE
            })
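Building on that callback, here is a hedged sketch of batch line processing, assuming the endpoint streams newline-delimited text; the buffering logic and the process_lines() helper are my own illustration, not part of the RCurl API:
library(RCurl)

buffer <- ""
process_lines <- function(lines) message("got ", length(lines), " lines")

curlPerform(
  url = "https://api.github.com/repos/jeroenooms/opencpu",
  writefunction = function(txt, ...) {
    buffer <<- paste0(buffer, txt)                      # append the new chunk
    parts <- strsplit(buffer, "\n", fixed = TRUE)[[1]]
    if (!grepl("\n$", buffer) && length(parts) > 0) {
      buffer <<- parts[length(parts)]                   # keep the trailing partial line
      parts <- parts[-length(parts)]
    } else {
      buffer <<- ""
    }
    if (length(parts) > 0) process_lines(parts)
    TRUE
  })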

One workaround is to call the curl executable manually via pipe(). The following seems to work:
library(jsonlite)
gzstream <- gzcon(pipe("curl https://jeroenooms.github.io/files/hourly_14.json.gz", open = "r"))
batches <- list(); i <- 1
while (length(records <- readLines(gzstream, n = 100))) {
  message("Batch ", i, ": found ", length(records), " lines of json...")
  json <- paste0("[", paste0(records, collapse = ","), "]")
  batches[[i]] <- fromJSON(json, validate = TRUE)
  i <- i + 1
}
weather <- rbind.pages(batches)
rm(batches); close(gzstream)
However, this is suboptimal because the curl executable might not be available for various reasons. It would be much nicer to invoke this pipe directly via RCurl/libcurl.
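Side note of my own (not part of the original question or answer): the curl package exposes libcurl as a regular R connection, which removes the dependency on an external curl binary. A minimal sketch of the same batch loop under that assumption:
library(curl)
library(jsonlite)

con <- gzcon(curl("https://jeroenooms.github.io/files/hourly_14.json.gz"))
batches <- list(); i <- 1
while (length(records <- readLines(con, n = 100))) {
  batches[[i]] <- fromJSON(paste0("[", paste0(records, collapse = ","), "]"))
  i <- i + 1
}
close(con)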

Related

Trouble downloading and accessing zip files from API in R

I am trying to automate a process in R which involves downloading a zipped folder from an API* which contains a few .csv/.xml files, accessing its contents, and then extracting the .csv/.xml that I actually care about into a dataframe (or something else that is workable). However, I am having some problems accessing the contents of the API pull. From what I gather, the proper process for pulling from an API is to use GET() from the httr package to access the API's files, then the jsonlite package to process it. The second step in this process is failing me. The code I have been trying to use is roughly as follows:
library(httr)
library(jsonlite)
req <- "http://request.path.com/thisisanapi/SingleZip?option1=yes&option2=no"
res <- GET(url = req)
#this works as expected, with res$status_code == 200
#OPTION 1:
api_char <- rawToChar(res$content)
api_call <- fromJSON(api_char, flatten=T)
#OPTION 2:
api_char2 <- content(res, "text")
api_call2 <- fromJSON(api_char2, flatten=T)
In option 1, the first line fails with an "embedded nul in string" error. In option 2, the second line fails with a "lexical error: invalid char in json text" error.
I did some reading and found a few related threads. First, this person looks to be doing a very similar thing to me, but did not experience this error (which suggests that maybe the files are zipped/stored differently between the APIs the two of us are using, and that I have set up the GET() incorrectly?). Second, this person seems to be experiencing a similar problem converting the raw data from the API. I attempted the fix from that thread, but it did not work. In option 1, the first line ran but the second line gave a similar "lexical error: invalid char in json text" as before, and in option 2, the second line gave a "if (is.character(txt) && length(txt) == 1 && nchar(txt, type = "bytes") < : missing value where TRUE/FALSE needed" error, which I am not quite sure how to interpret. This may be because the content_type differs between our API pulls: mine is application/x-zip-compressed and theirs is text/tab-separated-values; charset=utf-16le, so maybe removing the null characters is altogether inappropriate here.
There is some documentation on usage of the API I am using*, but a lot of it is a few years old now and seems to focus more on manual usage rather than integration with large automated downloads like I am working on (my end goal is a loop which executes the process described many times over slightly varying urls). I am most certainly a beginner to using APIs like this, and would really appreciate some insight!
* = specifically, I am pulling from CAISO's OASIS API. If you want to follow along with some real files, replace "http://request.path.com/thisisanapi/SingleZip?option1=yes&option2=no" with "http://oasis.caiso.com/oasisapi/SingleZip?resultformat=6&queryname=PRC_INTVL_LMP&version=3&startdatetime=20201225T09:00-0000&enddatetime=20201226T9:10-0000&market_run_id=RTM&grp_type=ALL"
I think the main issue here is that you don't have a JSON return from the API. You have a .zip file being returned, as binary (I think?) data. Your challenge is to process that data. I don't think fromJSON() will help you, as the data from the API isn't in JSON format.
Here's how I would do it. I prefer to use the httr2 package. The process below makes it nice and clear what the parameters of the query are.
library(httr2)

req <- httr2::request("http://oasis.caiso.com/oasisapi")

query <- req %>%
  httr2::req_url_path_append("SingleZip") %>%
  httr2::req_url_query(resultformat = 6) %>%
  httr2::req_url_query(queryname = "PRC_INTVL_LMP") %>%
  httr2::req_url_query(version = 3) %>%
  httr2::req_url_query(startdatetime = "20201225T09:00-0000") %>%
  httr2::req_url_query(enddatetime = "20201226T9:10-0000") %>%
  httr2::req_url_query(market_run_id = "RTM") %>%
  httr2::req_url_query(grp_type = "ALL")
# Check what our query looks like
query
#> <httr2_request>
#> GET
#> http://oasis.caiso.com/oasisapi/SingleZip?resultformat=6&queryname=PRC_INTVL_LMP&version=3&startdatetime=20201225T09%3A00-0000&enddatetime=20201226T9%3A10-0000&market_run_id=RTM&grp_type=ALL
#> Body: empty
resp <- query %>%
  httr2::req_perform()

# Check what content type and encoding we have
# All looks good
resp %>%
  httr2::resp_content_type()
#> [1] "application/x-zip-compressed"
resp %>%
  httr2::resp_encoding()
#> [1] "UTF-8"
Created on 2022-08-30 with reprex v2.0.2
Then you have a choice about what to do if you want to write the data to a zip file.
I found that the brio package will write raw data to a file nicely. Alternatively, you can just use download.file() to download the .zip from the URL (you can do that without all the httr2 code above); you need to use mode = "wb".
resp %>%
  httr2::resp_body_raw() %>%
  brio::write_file_raw(path = "out.zip")

# alternative using your original URL or query$url
download.file(query$url, "out.zip", mode = "wb")
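As a follow-up sketch of my own (not part of the answer above): once out.zip is on disk, you can list the archive's contents and read the file you care about directly. The assumption that the archive contains a CSV matches the question, but the pattern below is illustrative.
files <- unzip("out.zip", list = TRUE)              # list contents without extracting
files$Name
csv_name <- files$Name[grepl("\\.csv$", files$Name)][1]
dat <- read.csv(unz("out.zip", csv_name))           # read directly from inside the zip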

Cannot access EIA API in R

I'm having trouble accessing the Energy Information Administration's API through R (https://www.eia.gov/opendata/).
On my office computer, if I try the link in a browser it works, and the data shows up (the full url: https://api.eia.gov/series/?series_id=PET.MCREXUS1.M&api_key=e122a1411ca0ac941eb192ede51feebe&out=json).
I am also successfully connected to Bloomberg's API through R, so R is able to access the network.
Since the API is working and not blocked by my company's firewall, and R is in fact able to connect to the Internet, I have no clue what's going wrong.
The script works fine on my home computer, but at my office computer it is unsuccessful. So I gather it is a network issue, but if somebody could point me in any direction as to what the problem might be I would be grateful (my IT department couldn't help).
library(XML)
api.key = "e122a1411ca0ac941eb192ede51feebe"
series.id = "PET.MCREXUS1.M"
my.url = paste("http://api.eia.gov/series?series_id=", series.id, "&api_key=", api.key, "&out=json", sep = "")
doc = xmlParse(file = my.url, isURL = TRUE) # yields error
Error msg:
No such file or directoryfailed to load external entity "http://api.eia.gov/series?series_id=PET.MCREXUS1.M&api_key=e122a1411ca0ac941eb192ede51feebe&out=json"
Error: 1: No such file or directory2: failed to load external entity "http://api.eia.gov/series?series_id=PET.MCREXUS1.M&api_key=e122a1411ca0ac941eb192ede51feebe&out=json"
I tried some other methods like read_xml() from the xml2 package, but this gives a "could not resolve host" error.
To get XML, you need to change your url to request XML output (&out=xml):
my.url = paste("http://api.eia.gov/series?series_id=", series.id, "&api_key=",
               api.key, "&out=xml", sep = "")
res <- httr::GET(my.url)
xml2::read_xml(res)
Or :
res <- httr::GET(my.url)
XML::xmlParse(res)
Otherwise, with the request as is (i.e. &out=json):
res <- httr::GET(my.url)
jsonlite::fromJSON(httr::content(res,"text"))
or this:
xml2::read_xml(httr::content(res,"text"))
Please note that this answer simply provides a way to get the data; whether it is in the desired form is up to whoever is processing it.
If it does not have to be XML output, you can also use the new eia package. (Disclaimer: I'm the author.)
Using your example:
remotes::install_github("leonawicz/eia")
library(eia)
x <- eia_series("PET.MCREXUS1.M")
This assumes your key is set globally (e.g., in .Renviron or previously in your R session with eia_set_key). But you can also pass it directly to the function call above by adding key = "yourkeyhere".
The result returned is a tidyverse-style data frame, one row per series ID and including a data list column that contains the data frame for each time series (can be unnested with tidyr::unnest if desired).
Alternatively, if you set the argument tidy = FALSE, it will return the list result of jsonlite::fromJSON without the "tidy" processing.
Finally, if you set tidy = NA, no processing is done at all and you get the original JSON string output for those who intend to pass the raw output to other canned code or software. The package does not provide XML output, however.
There are more comprehensive examples and vignettes at the eia package website I created.
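For completeness, a short sketch of that workflow (my own illustration based on the description above; the placeholder key and the tidyr call are assumptions):
library(eia)
library(tidyr)
eia_set_key("yourkeyhere")          # or set the key in .Renviron
x <- eia_series("PET.MCREXUS1.M")
unnest(x, cols = data)              # expand the nested time series data frame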

What is the method to save objects in a compressed form using multiple threads/cores in R?

There are a number of compression protocols that support multi-core/multi-threaded compression and decompression. However, it appears that the base R compression/decompression methods all use a single core (even though at least one of the algorithms supports multiple cores).
If there isn't an existing tool to accomplish this, is there a way to manage it without dropping out to Java or C?
I thought that I could use the ability to pipe or serialize to get what I wanted by passing the R object out to the shell/command line somehow, but I can't seem to get a usable form of the object out of R that way (maybe I'm missing something?). The 'obvious' solution would be to use dput, but the note in the help for that function makes it pretty clear that using dput to turn an R object into ASCII for the sake of saving isn't a suitable (or safe) use of dput. The alternative it mentions, dump, acts like save (I'd rather have something like saveRDS), and still routes the file to a .Internal that is impenetrable without digging into R's C code.
What other approaches to solve this problem should I consider?
With the resource provided by @Martin, I wrote the following code.
xz
For compression, I wanted to try xz first. However, the version of xz packaged on Ubuntu 14.04 LTS at the moment (5.1.0alpha) does not support parallel compression, so I use pxz instead.
saveRDS.xz <- function(object, file, threads = parallel::detectCores()) {
  # stream the serialized object through pxz for parallel xz compression
  con <- pipe(paste0("pxz -T", threads, " > ", file), "wb")
  saveRDS(object, file = con, compress = FALSE)
  close(con)
}
The companion readRDS.xz function below also works for any regular RDS file saved with xz compression. Also note that, although pxz does not decompress in parallel, there are still time savings (about a third on my system) from using this code: decompression moves to another process, and the R process is left free to handle the incoming stream. On my system, R's single-threaded CPU usage maxes out well before pxz's does.
readRDS.xz <- function(file, threads = parallel::detectCores()) {
  # decompress with pxz and stream the result into readRDS
  con <- pipe(paste0("pxz -d -k -c -T", threads, " ", file))
  object <- readRDS(file = con)
  close(con)
  return(object)
}
pigz
Alternatively we can use pigz (gz) to compress and decompress.
saveRDS.gz <- function(object, file, threads = parallel::detectCores()) {
  # stream the serialized object through pigz for parallel gzip compression
  con <- pipe(paste0("pigz -p", threads, " > ", file), "wb")
  saveRDS(object, file = con)
  close(con)
}
Decompression in pigz is not entirely multi-threaded, but as with xz there is a speed benefit to taking the decompression off R's process. On my system, pxz and pigz give similar decompression times (perhaps due to the bottleneck in R mentioned above).
readRDS.gz <- function(file, threads = parallel::detectCores()) {
  # decompress with pigz and stream the result into readRDS
  con <- pipe(paste0("pigz -d -c -p", threads, " ", file))
  object <- readRDS(file = con)
  close(con)
  return(object)
}
bzip2
bzip2 gave me performance somewhere between the other two, so I skip it here.
Together
With those functions established, we can write a quick wrapper that does the right thing.
readRDS.p <- function(file, threads = parallel::detectCores()) {
  # Hypothetically we could use the initial bytes to determine the file format,
  # but here we use the Linux `file` command because a readBin implementation
  # was not immediately obvious.
  fileDetails <- system2("file", args = file, stdout = TRUE)
  selector <- sapply(c("gzip", "XZ"), function(x) grepl(x, fileDetails))
  format <- names(selector)[selector]
  if (length(format) == 0) {
    object <- readRDS(file)
  } else if (format == "gzip") {
    object <- readRDS.gz(file, threads = threads)
  } else if (format == "XZ") {
    object <- readRDS.xz(file, threads = threads)
  } else {
    object <- readRDS(file)
  }
  return(object)
}
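A quick usage sketch (my own addition, assuming pxz and/or pigz are installed and on the PATH):
big <- data.frame(x = rnorm(1e6), y = rnorm(1e6))
saveRDS.xz(big, "big.rds.xz")             # compressed in parallel via pxz
restored <- readRDS.p("big.rds.xz")       # format detected via `file`, then decompressed
identical(big, restored)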

How to download an .xlsx file from a dropbox (https:) location

I'm trying to adopt the Reproducible Research paradigm while also meeting people who prefer looking at Excel rather than text data files half way, by using Dropbox to host Excel files which I can then access using the xlsx package.
Rather like downloading and unpacking a zipped file I assumed something like the following would work:
# Prerequisites
require("xlsx")
require("ggplot2")
require("repmis")
require("devtools")
require("RCurl")
# Downloading data from Dropbox location
link <- paste0(
  "https://www.dropbox.com/s/",
  "{THE SHA-1 KEY}",
  "{THE FILE NAME}"
)
url <- getURL(link)
temp <- tempfile()
download.file(url, temp)
However, I get Error in download.file(url, temp) : unsupported URL scheme
Is there an alternative to download.file that will accept this URL scheme?
Thanks,
Jon
You have the wrong URL - the one you are using just goes to the landing page. I think the actual download URL is different; I managed to get it sort of working using the code below.
I actually don't think you need RCurl or the getURL() function, and I think you were leaving out some relatively important /'s in your previous formulation.
Try the following:
link <- paste("https://dl.dropboxusercontent.com/s",
              "{THE SHA-1 KEY}",
              "{THE FILE NAME}",
              sep = "/")
download.file(url = link, destfile = "your.destination.xlsx")
closeAllConnections()
UPDATE:
I just realised there is a source_XlsxData function in the repmis package, which in theory should do the job perfectly.
Also the function below works some of the time but not others, and appears to get stuck at the GET line. So, a better solution would be very welcome.
I decided to take a step back and figure out how to download a raw file from a secure (https) URL. I adapted (butchered?) the source_url function from devtools to produce the following:
download_file_url <- function(url, outfile, ..., sha1 = NULL) {
  # only httr is actually used below; the other require() calls and the sha1
  # argument are leftovers from devtools::source_url
  require(RCurl)
  require(devtools)
  require(repmis)
  require(httr)
  require(digest)
  stopifnot(is.character(url), length(url) == 1)
  filetag <- file(outfile, "wb")
  request <- GET(url)
  stop_for_status(request)
  writeBin(content(request, type = "raw"), filetag)
  close(filetag)
}
This seems to work for producing local versions of binary files - Excel included. Nicer, neater, smarter improvements in this gratefully received.
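A short usage sketch of my own (the local file name and sheet index are assumptions), closing the loop back to reading the workbook with the xlsx package:
download_file_url(
  paste("https://dl.dropboxusercontent.com/s", "{THE SHA-1 KEY}", "{THE FILE NAME}", sep = "/"),
  outfile = "local_copy.xlsx"
)
dat <- xlsx::read.xlsx("local_copy.xlsx", sheetIndex = 1)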

How to optimise scraping with getURL() in R

I am trying to scrape all bills from two pages on the website of the French lower chamber of parliament. The pages cover 2002-2012 and represent less than 1,000 bills each.
For this, I scrape with getURL through this loop:
library(RCurl)    # getURL
library(stringr)  # str_extract_all

b <- "http://www.assemblee-nationale.fr" # base
l <- c("12", "13") # legislature id

lapply(l, FUN = function(x) {
  print(data <- paste(b, x, "documents/index-dossier.asp", sep = "/"))
  # scrape
  data <- getURL(data); data <- readLines(tc <- textConnection(data)); close(tc)
  data <- unlist(str_extract_all(data, "dossiers/[[:alnum:]_-]+.asp"))
  data <- paste(b, x, data, sep = "/")
  data <- getURL(data)
  write.table(data, file = n <- paste("raw_an", x, ".txt", sep = "")); str(n)
})
Is there any way to optimise the getURL() function here? I cannot seem to use concurrent downloading by passing the async=TRUE option, which gives me the same error every time:
Error in function (type, msg, asError = TRUE) :
Failed to connect to 0.0.0.12: No route to host
Any ideas? Thanks!
Try mclapply {multicore} instead of lapply.
"mclapply is a parallelized version of lapply, it returns a list of
the same length as X, each element of which is the result of applying
FUN to the corresponding element of X."
(http://www.rforge.net/doc/packages/multicore/mclapply.html)
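For example, here is a sketch of my own of that swap (assuming a Unix-alike system, since forked workers are not available on Windows; mclapply now lives in the base parallel package), with the body of the question's loop lightly simplified:
library(parallel)
library(RCurl)
library(stringr)

b <- "http://www.assemblee-nationale.fr"
l <- c("12", "13")

mclapply(l, mc.cores = 2, FUN = function(x) {
  index <- getURL(paste(b, x, "documents/index-dossier.asp", sep = "/"))
  links <- unlist(str_extract_all(index, "dossiers/[[:alnum:]_-]+.asp"))
  pages <- getURL(paste(b, x, links, sep = "/"))   # one request per matched dossier
  write.table(pages, file = paste0("raw_an", x, ".txt"))
})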
If that doesn't work, you may get better performance using the XML package. Functions like xmlTreeParse use asynchronous calling.
"Note that xmlTreeParse does allow a hybrid style of processing that
allows us to apply handlers to nodes in the tree as they are being
converted to R objects. This is a style of event-driven or
asynchronous calling."
(http://www.inside-r.org/packages/cran/XML/docs/xmlEventParse)
Why use R? For big scraping jobs you are better off using something already developed for the task. I've had good results with Down Them All, a browser add-on. Just tell it where to start, how deep to go, what patterns to follow, and where to dump the HTML.
Then use R to read the data from the HTML files.
Advantages are massive - these add-ons are developed especially for the task so they will do multiple downloads (controllable by you), they will send the right headers so your next question won't be 'how do I set the user agent string with RCurl?', and they can cope with retrying when some of the downloads fail, which they inevitably do.
Of course the disadvantage is that you can't easily start this process automatically, in which case maybe you'd be better off with 'curl' on the command line, or some other command-line mirroring utility.
Honestly, you've got better things to do with your time than write website code in R...
