JSON character size limit using curlPerform or getURL

I am running into what appears to be a character size limit in a JSON string when trying to retrieve data with either curlPerform() or getURL(). Here is non-reproducible code [1], but it should shed some light on the problem.
# Note that .base.url is the basic URL for the API, q is a query, user
# is specified, etc.
library(RCurl)
session <- getCurlHandle()
curl.opts <- list(userpwd = paste(user, ":", key, sep = ""),
                  httpheader = "Content-Type: application/json")
request <- paste(.base.url, q, sep = "")
txt <- getURL(url = request, curl = session, .opts = curl.opts,
              write = basicTextGatherer())
or
r = dynCurlReader()
curlPerform(url = request, writefunction = r$update, curl = session,
.opts = curl.opts)
My guess is that the update or value functions in the basicTextGatherer or dynCurlReader text handler objects are having trouble with the large strings. In this example, r$value() returns a truncated string of approximately 2 MB. The code given above works fine for queries < 2 MB.
Note that I can easily do the following from the command line (or using system() in R), but writing to disk seems like a waste if I am doing the subsequent analysis in R.
curl -v --header "Content-Type: application/json" --user username:register:passwd https://base.url.for.api/getdata/select+*+from+sometable > stream.json
where stream.json is a roughly 14 MB JSON string. I can read the string into R using either
con <- file(paste(.project.path, "data/stream.json", sep = ""), "r")
string <- readLines(con)
or directly into a list as
tmp <- fromJSON(file = paste(.project.path, "data/stream.json", sep = ""))
Any thoughts are very much appreciated.
Ryan
[1] - Sorry for not providing reproducible code, but I'm dealing with a govt firewall.
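A possible workaround, sketched under the assumption that the truncation happens in the R-level text gatherer rather than in libcurl: accumulate the chunks in a list with a custom writefunction and only paste them together once the transfer finishes (collect and chunks are illustrative names, not RCurl API):
library(RCurl)
chunks <- list()
collect <- function(x) {
  # each call receives one chunk of the response body
  chunks[[length(chunks) + 1]] <<- x
  nchar(x, type = "bytes")  # report the number of bytes consumed
}
curlPerform(url = request, writefunction = collect,
            curl = session, .opts = curl.opts)
txt <- paste(unlist(chunks), collapse = "")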

Related

Getting data from FTP file directly into environment or as file

Trying to get a file from an FTP server with httr and RCurl; neither method works.
Real case; the user and password credentials are real.
First, httr:
library(httr)
GET(url = "ftp://77.72.135.237/2993309138_Tigres.xls", authenticate("xxxxxxx", "xxxxxx"),
write_disk("~/Downloads/2993309138_Tigres.xls", overwrite = T))
#> Error: $ operator is invalid for atomic vectors
Second, RCurl:
library(RCurl)
my_data <- getURL(url = "ftp://77.72.135.237/2993309138_Tigres.xls", userpwd = "xxxxxx")
#> Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding): embedded nul in string: 'ÐÏ\021ࡱ\032á'
Is it a server-side bug or mine? :)
It seems to have something to do with the encoding. Try it like this (you'll have to fill in the authentication):
library(httr)
content(
  GET(url = "ftp://77.72.135.237/2993309138_Tigres.xls",
      authenticate("...", "..."),
      write_disk("~/Downloads/2993309138_Tigres.xls", overwrite = T)),
  type = "text/csv",
  as = "text",
  encoding = "WINDOWS-1251"
)
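If the goal is just to get the file onto disk with RCurl, a hedged alternative is to fetch it as raw bytes, which sidesteps the embedded-nul error (the .xls is an OLE2 binary, so it can never live in an R character string):
library(RCurl)
# fetch the body as a raw vector instead of a character string
bin <- getBinaryURL("ftp://77.72.135.237/2993309138_Tigres.xls",
                    userpwd = "user:password")
writeBin(bin, "~/Downloads/2993309138_Tigres.xls")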

Send an R (heterogeneous) list object as binary as a reply to a REST request, using plumber

I would like to send a complex heterogeneous list object in binary form as a reply to a REST request to the R service (using plumber), so that when decoded on the other side it converts back to its original list format.
So far, I have managed to do it using intermediate RDS files, by reading through and slightly modifying the approach in Base64 encoding a .Rda file.
An MWE explaining my efforts and issues follows:
foodata <- list('a' = 1,
                'b' = 'hello world',
                'pi' = 3.14,
                'strvec' = letters[1:5],
                'intvec' = 1:5,
                'df' = data.frame('df1' = letters[1:5], 'df2' = 1:5),
                'z' = 26)
fn <- "test.rds"
fnb4 <- "test.rdsb64"
decoded <- "decoded.rds"
saveRDS(foodata, file = fn, compress = F)
#write base64 encoded version
library(base64enc)
txt <- base64enc::base64encode(fn)
#decode base64 encoded version
rdsbin <- base64enc::base64decode(txt)
# how to convert rdsbin back to the foodata list without using the intermediate step of saving to a file as follows?
ff <- file(decoded, "wb")
writeBin(rdsbin, ff)
close(ff)
bardata <- readRDS(decoded)
print(identical(foodata, bardata))
# [1] TRUE
Is there any way to avoid the reads/writes of the intermediate files? Or a completely different approach altogether?
You don't need intermediate files; you can use a rawConnection to do these types of operations in memory:
# encode to a base64 text string
encode_stream <- rawConnection(raw(), "r+")
saveRDS(foodata, file = encode_stream)
seek(encode_stream, 0) #reset to beginning of file
txt <- base64enc::base64encode(encode_stream)
close(encode_stream)
# txt is now a string that contains the data
# encoded in base64
# decode string to R object
rdsbin <- base64enc::base64decode(txt)
decode_stream <- rawConnection(rdsbin, "r")
bardata <- readRDS(decode_stream)
close(decode_stream)
# verify result
print(identical(foodata, bardata))
# [1] TRUE
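To wire this into plumber, here is a minimal sketch of an endpoint that returns the serialized list as a binary body; the path and serializer settings are illustrative, not from the question:
library(plumber)
#* @get /foodata
#* @serializer contentType list(type = "application/octet-stream")
function() {
  con <- rawConnection(raw(), "r+")
  on.exit(close(con))
  saveRDS(foodata, file = con)
  rawConnectionValue(con)  # the accumulated raw bytes become the response body
}
As an aside, base R can skip the connection entirely: serialize(foodata, NULL) returns a raw vector directly, and unserialize() restores the list on the receiving side.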

Batch POST to an API in R

I'm trying to POST multiple VIN numbers to the NHTSA API.
My working solution looks like this:
vins <- c('4JGCB5HE1CA138466','4JGCB5HE1CA138466','4JGCB5HE1CA138466',
          '4JGCB5HE1CA138466','4JGCB5HE1CA138466','4JGCB5HE1CA138466',
          '4JGCB5HE1CA138466','4JGCB5HE1CA138466','4JGCB5HE1CA138466',
          '4JGCB5HE1CA138466','4JGCB5HE1CA138466')
library(jsonlite)
for (i in vins) {
  json <- fromJSON(paste0('https://vpic.nhtsa.dot.gov/api/vehicles/DecodeVinValues/', i, '?format=json'))
  print(json)
}
This solution is very slow. I tried pbapply; it is the same thing, because it requests one VIN at a time.
There is a batch POST option that I just can't figure out. Can someone please assist?
Here is my code so far:
data <- list(data='4JGCB5HE1CA138466;4JGCB5HE1CA138466;4JGCB5HE1CA138466;4JGCB5HE1CA138466')
json <- toJSON(list(data=data), auto_unbox = TRUE)
result <- POST('https://vpic.nhtsa.dot.gov/api/vehicles/DecodeVINValuesBatch/', body = data)
Output <- content(result)
The VIN string has to be in the following format: vin;vin;vin;vin;
Here is the link: https://vpic.nhtsa.dot.gov/api/ (the last endpoint listed).
Thanks in advance.
UPDATE:
I also tried this from some other threads but no luck:
headers = c(
  `Content-Type` = 'application/json'
)
data = '[{"data":"4JGCB5HE1CA138466;4JGCB5HE1CA138466;4JGCB5HE1CA138466;4JGCB5HE1CA138466"}]'
r <- httr::POST(url = 'https://vpic.nhtsa.dot.gov/api/vehicles/DecodeVINValuesBatch/', httr::add_headers(.headers = headers), body = data)
print(r$status_code)
I am getting status code 200, but the response body reports server code 500 with no data.
I am not sure if this is possible. The batch endpoint is specifically looking for a dictionary to be passed (ruling out string representations). httr states:
body: must be NULL, FALSE, character, raw or list
I tried using the collections library to generate a dict:
library(collections)
data <- Dict$new(list(format = 'json', data = "4JGCB5HE1CA138466;4JGCB5HE1CA138466;4JGCB5HE1CA138466"))
httr unsurprisingly rejected it as the wrong body type.
I tried using jsonlite to convert with:
data <- jsonlite::toJSON(data)
Yielding:
Error: No method asJSON S3 class: R6
I think due to data being an environment.
Trying to read a string dictionary into JSON returns no data:
library(httr)
library(jsonlite)
headers = c(
'Accept' = '*/*',
'Accept-Encoding' = 'gzip, deflate',
'Content-Type' = 'application/x-www-form-urlencoded',
'User-Agent' = 'Mozilla/5.0'
)
data = jsonlite::toJSON('{"format":"json","data":"4JGCB5HE1CA138466;4JGCB5HE1CA138466;4JGCB5HE1CA138466"}')
r <- httr::POST(url = 'https://vpic.nhtsa.dot.gov/api/vehicles/DecodeVINValuesBatch/', httr::add_headers(.headers = headers), body = data, encode = 'json')
print(content(r))
If we examine the converted data:
> data
["{\"format\":\"json\",\"data\":\"4JGCB5HE1CA138466;4JGCB5HE1CA138466;4JGCB5HE1CA138466\"}"]
This is no longer the dictionary structure the server expects.
So, I am new to R, but it seems like it might be easier to just go with Python, which has a dictionary object and a json library that handles the string-to-JSON conversion comfortably:
import requests,json
url = 'https://vpic.nhtsa.dot.gov/api/vehicles/DecodeVINValuesBatch/'
data = json.loads('{"format": "json", "data":"4JGCB5HE1CA138466;4JGCB5HE1CA138466;4JGCB5HE1CA138466"}')
r = requests.post(url, data=data)
print(r.json())
Or with a dict directly:
import requests
url = 'https://vpic.nhtsa.dot.gov/api/vehicles/DecodeVINValuesBatch/'
data = {'format': 'json', 'data':'4JGCB5HE1CA138466;4JGCB5HE1CA138466;4JGCB5HE1CA138466'}
r = requests.post(url, data=data).json()
print(r)
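For completeness, a hedged R equivalent of the working Python calls above: httr's encode = "form" sends a named list as application/x-www-form-urlencoded, which is what requests.post(url, data=dict) does, so this should hit the batch endpoint the same way:
library(httr)
library(jsonlite)
r <- POST('https://vpic.nhtsa.dot.gov/api/vehicles/DecodeVINValuesBatch/',
          body = list(format = 'json',
                      data = '4JGCB5HE1CA138466;4JGCB5HE1CA138466;4JGCB5HE1CA138466'),
          encode = 'form')
out <- fromJSON(content(r, as = 'text', encoding = 'UTF-8'))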

Request to improve code to download sequence of URLs

In a file I have a table of 23,772 URLs that I need to download; in the code below, that is represented by dwsites. Due to server restrictions, I am only able to download a block of 300 sites at a time. I have accomplished that task with the code below (an excerpt of the actual code), but I want to know a better way.
Can you offer any suggestions?
Thank you.
dwsites <- data.frame(sites = c(1:23772), url = rep("url", 23772))
dwsitessub <- dwsites[1:300,] # this is the part that I would like to change
curl = getCurlHandle()
pagesnew = list()
for(u in strpatnew) {pagesnew[[u]] = getURLContent(u, curl = curl)}
lapply(seq_along(strpatnew), function(u) cat(pagesnew[[u]], file = file.path("filepath", strpatnew[[u]])))
dwsitessub <- dwsites[301:459,]
curl = getCurlHandle()
pagesnew = list()
for(u in strpatnew) {pagesnew[[u]] = getURLContent(u, curl = curl)}
lapply(seq_along(strpatnew), function(u) cat(pagesnew[[u]], file = file.path("filepath", strpatnew[[u]])))
...
dwsitessub <- dwsites[23501:nrow(dwsites),]
curl = getCurlHandle()
pagesnew = list()
for(u in strpatnew) {pagesnew[[u]] = getURLContent(u, curl = curl)}
lapply(seq_along(strpatnew), function(u) cat(pagesnew[[u]], file = file.path("filepath", strpatnew[[u]])))
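One way to avoid the copy-paste blocks entirely, sketched under the assumption that strpatnew is derived from the url column of each 300-row slice: compute the slices with split() and loop over them once.
library(RCurl)
# split the row indices into consecutive blocks of at most 300
blocks <- split(seq_len(nrow(dwsites)), ceiling(seq_len(nrow(dwsites)) / 300))
curl <- getCurlHandle()
for (block in blocks) {
  dwsitessub <- dwsites[block, ]
  for (u in dwsitessub$url) {
    page <- getURLContent(u, curl = curl)
    # basename(u) as the output file name is an assumption; adapt as needed
    cat(page, file = file.path("filepath", basename(u)))
  }
}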

how to download a large binary file with RCurl *after* server authentication

I originally asked this question about performing this task with the httr package, but I don't think it's possible using httr. So I've re-written my code to use RCurl instead -- but I'm still tripping up on something, probably related to the writefunction, and I really don't understand why.
You should be able to reproduce my work by using the 32-bit version of R, so you hit memory limits if you read anything into RAM. I need a solution that downloads directly to the hard disk.
To start, this code works -- the zipped file is appropriately saved to disk.
library(RCurl)
filename <- tempfile()
f <- CFILE(filename, "wb")
url <- "http://www2.census.gov/acs2011_5yr/pums/csv_pus.zip"
curlPerform(url = url, writedata = f@ref)
close(f)
# 2.1 GB file successfully written to disk
Now here's some RCurl code that does not work. As stated in the previous question, reproducing this exactly will require creating an extract on IPUMS.
your.email <- "email@address.com"
your.password <- "password"
extract.path <- "https://usa.ipums.org/usa-action/downloads/extract_files/some_file.csv.gz"
library(RCurl)
values <-
list(
"login[email]" = your.email ,
"login[password]" = your.password ,
"login[is_for_login]" = 1
)
curl = getCurlHandle()
curlSetOpt(
cookiejar = 'cookies.txt',
followlocation = TRUE,
autoreferer = TRUE,
ssl.verifypeer = FALSE,
curl = curl
)
params <-
list(
"login[email]" = your.email ,
"login[password]" = your.password ,
"login[is_for_login]" = 1
)
html <- postForm("https://usa.ipums.org/usa-action/users/validate_login", .params = params, curl = curl)
dl <- getURL( "https://usa.ipums.org/usa-action/extract_requests/download" , curl = curl)
And now that I'm logged in, try the same commands as above, but with the curl object to keep the cookies.
filename <- tempfile()
f <- CFILE(filename, mode = "wb")
This line breaks:
curlPerform(url = extract.path, writedata = f@ref, curl = curl)
close(f)
# the error is:
Error in curlPerform(url = extract.path, writedata = f@ref, curl = curl) :
embedded nul in string: [[binary jibberish here]]
The answer to my previous post referred me to this C-level writefunction answer, but I'm clueless about how to re-create that curl_writer C program (on Windows?)...
dyn.load("curl_writer.so")
writer <- getNativeSymbolInfo("writer", PACKAGE="curl_writer")$address
curlPerform(URL=url, writefunction=writer)
...or why it's even necessary, given that the five lines of code at the top of this question work without anything crazy like getNativeSymbolInfo. I just don't understand why passing in that extra curl object -- which stores the authentication/cookies and tells it not to verify SSL -- would cause code that otherwise works to break.
From this link, create a file named curl_writer.c and save it to C:\<folder where you save your R files>
#include <stdio.h>
/**
* Original code just sent some message to stderr
*/
size_t writer(void *buffer, size_t size, size_t nmemb, void *stream) {
fwrite(buffer,size,nmemb,(FILE *)stream);
return size * nmemb;
}
Open a command window, go to the folder where you saved curl_writer.c and run the R compiler
c:> cd "C:\<folder where you save your R files>"
c:> R CMD SHLIB -o curl_writer.dll curl_writer.c
Open R and run your script
C:> R
your.email <- "email@address.com"
your.password <- "password"
extract.path <- "https://usa.ipums.org/usa-action/downloads/extract_files/some_file.csv.gz"
library(RCurl)
values <-
list(
"login[email]" = your.email ,
"login[password]" = your.password ,
"login[is_for_login]" = 1
)
curl = getCurlHandle()
curlSetOpt(
cookiejar = 'cookies.txt',
followlocation = TRUE,
autoreferer = TRUE,
ssl.verifypeer = FALSE,
curl = curl
)
params <-
list(
"login[email]" = your.email ,
"login[password]" = your.password ,
"login[is_for_login]" = 1
)
html <- postForm("https://usa.ipums.org/usa-action/users/validate_login", .params = params, curl = curl)
dl <- getURL( "https://usa.ipums.org/usa-action/extract_requests/download" , curl = curl)
# Load the DLL you created
# "writer" is the name of the function
# "curl_writer" is the name of the dll
dyn.load("curl_writer.dll")
writer <- getNativeSymbolInfo("writer", PACKAGE="curl_writer")$address
# Note that "URL" parameter is upper case, in your code it is lowercase
# I'm not sure if that has something to do
# "writer" is the symbol defined above
f <- CFILE(filename <- tempfile(), "wb")
curlPerform(URL=url, writedata=f@ref, writefunction=writer, curl=curl)
close(f)
This is now possible with the httr package. Thanks Hadley!
https://github.com/hadley/httr/issues/44
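For reference, a minimal sketch of that httr route, assuming the login cookies have already been established on a shared handle (the URL and credentials are the placeholders from the question):
library(httr)
h <- handle("https://usa.ipums.org")
# ... POST the login form with this handle first, so it holds the session cookies ...
GET(extract.path,
    handle = h,
    write_disk(tempfile(fileext = ".csv.gz")),  # stream straight to disk
    progress())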
