Get non html content from rvest_session object - r

I am trying to get a text file from a URL. From the browser, its fairly simple. I just have to "save as" the from the URL and I get the file I want. At first, i had some trouble logging in using rvest (see [https://stackoverflow.com/questions/66352322/how-to-get-txt-file-from-password-protected-website-jsp-in-r][1])(I uploaded a couple of probably useful picture there). When I use the following code:
fileurl <- "http://www1.bolsadecaracas.com/esp/productos/dinamica/downloadDetail.jsp?symbol=BNC&dateFrom=20190101&dateTo=20210101&typePlazo=nada&typeReg=T"
session(fileurl)
I get the following (note how I am redirected to a different URL, as happens in the browser when you try to get to the fileurl without first logging in):
<session> http://www1.bolsadecaracas.com/esp/productos/dinamica/downloadDetail.jsp?symbol=BNC&dateFrom=20190101&dateTo=20210101&typePlazo=nada&typeReg=T
Status: 200
Type: text/html; charset=ISO-8859-1
Size: 84
I managed to log in using the following code:
#Define URLs
loginurl <- "http://www1.bolsadecaracas.com/esp/usuarios/customize.jsp"
fileurl <- "http://www1.bolsadecaracas.com/esp/productos/dinamica/downloadDetail.jsp?symbol=BNC&dateFrom=20190101&dateTo=20210101&typePlazo=nada&typeReg=T"
#Create session
pgsession <- session(loginurl)
pgform<-html_form(pgsession)[[1]] #Get form
#Create a fake submit button as form does not have one
fake_submit_button <- list(name = NULL,
type = "submit",
value = NULL,
checked = NULL,
disabled = NULL,
readonly = NULL,
required = FALSE)
attr(fake_submit_button, "class") <- "input"
pgform[["fields"]][["submit"]] <- fake_submit_button
#Create and submit filled form
filled_form<-html_form_set(pgform, login="******", passwd="******")
session_submit(pgsession, filled_form)
#Jump to new url
loggedsession <- session_jump_to(pgsession, url = fileurl)
#Output
loggedsession
It seems to me that the login was succesful, as the session output is the exact same size than the .txt file when I download it and I am no longer redirected. See the output.
<session> http://www1.bolsadecaracas.com/esp/productos/dinamica/downloadDetail.jsp?symbol=BNC&dateFrom=20190101&dateTo=20210101&typePlazo=nada&typeReg=T
Status: 200
Type: text/plain; charset=ISO-8859-1
Size: 32193
However, whenever I try to extract the content of the session with read_html() or the like, i get the following error: "Error: Page doesn't appear to be html.". I dont know if it has anything to do with the "Type: text/plain" of the session.
When I run
loggedsession[["response"]][["content"]]
I get
[1] 0d 0a 0d 0a 0d 0a 0d 0a 0d 0a 7c 30 32 2f 30 31 2f 32 30 31 39 7c 52 7c 31 34 2c 39 30 7c 31 35 2c
[34] 30 30 7c 31 37 2c 38 33 7c 31 33 2c 35 30 7c 39 7c 31 33 2e 35 33 33 7c 32 30 33 2e 30 36 30 2c 31
[67] 39 7c 0a 7c 30 33 2f 30 31 2f 32 30 31 39 7c 52 7c 31 35 2c 30 30 7c 31 37 2c 39 38 7c 31 37 2c 39
Any help on how to extract the text file??? Would be greatly appreciated.
PS:
At one point, just playing with functions I managed to get something that would have worked with httr::: GET(fileurl). That was after playing with rvest functions and managing to log in. However, after closing and opening RStudio I was not able to get the same output with that function.

Because rvest uses httr package internally, you can use the httr and base to save your file. The key to the solution is that your response (in terms of the httr package) is in the session object:
library(rvest)
library(httr)
httr::content(loggedsession$response, as = "text") %>%
cat(file = "your_file.txt")
More importantly, if your file were binary (e.g. a zip archive), you would have to do:
library(rvest)
library(httr)
httr::content(loggedsession$response, as = "raw") %>%
writeBin(con = 'your_file.zip')

Related

RIS JSON API content compressed

I am trying to use the RIS API from the german railway. I managed to use the one on timetables that seems to be json.
https://developers.deutschebahn.com/db-api-marketplace/apis/product/fahrplan/api/9213#/Fahrplan_101/operation/%2Flocation%2F{name}/get
GET https://apis.deutschebahn.com/db-api-marketplace/apis/fahrplan/v1/location/Berlin
Header:
Accept: application/json
DB-Client-Id: 36fd91806420e2e937a599478a557e06
DB-Api-Key: d8a52f0f66184d80bdbd3a4a30c0cc33
Which using httr I can replicate:
url_location<-"https://apis.deutschebahn.com/db-api-marketplace/apis/fahrplan/v1/location/Bonn"
r<-GET(url_location,
add_headers(Accept="application/json",
`DB-Client-Id` = client_id,
`DB-Api-Key` = api_key))
content(r)
I know want to use another API about stations.
https://developers.deutschebahn.com/db-api-marketplace/apis/product/ris-stations/api/ris-stations#/RISStations_160/operation/%2Fstations/get
GET https://apis.deutschebahn.com/db-api-marketplace/apis/ris-stations/v1/stations?onlyActive=true&limit=100
Header:
Accept: application/vnd.de.db.ris+json
DB-Client-Id: 36fd91806420e2e937a599478a557e06
DB-Api-Key: d8a52f0f66184d80bdbd3a4a30c0cc33
I was hoping that it would work as well just adjusting the Accept:
url_station<-"https://apis.deutschebahn.com/db-api-marketplace/apis/ris-stations/v1/stations?onlyActive=true&limit=100"
r_stations<-GET(url_station,
add_headers(Accept="application/vnd.de.db.ris+json",
`DB-Client-Id` = client_id,
`DB-Api-Key` = api_key))
I recieve some data and the status code is 200. It was 415 before adjusting the Accept
When I am looking at the content using the content function or without it I get the following
> head(content(r_stations), 30)
[1] 7b 22 6f 66 66 73 65 74 22 3a 30 2c 22 6c 69 6d 69 74 22 3a 31 30 30 2c 22 74 6f 74 61 6c
r_stations$status_code
[1] 200
I should get something more like this:
{
"offset": 0,
"limit": 100,
"total": 5691,
"stations": [
{
"stationID": "1",
"names": {
"DE": {
"name": "Aachen Hbf"
}
},
I just need to add type='application/json'
content(r_stations, type='application/json')

Converting Data Stored in AWS back into original format in R

I've been trying to use the AWS S3 storage option via R. I've been using the aws.s3 package to help do that.
Everything seems to work on until I try to recall and use an rds file I had saved on AWS.
By way of example:
library("aws.s3")
Sys.setenv("AWS_ACCESS_KEY_ID" = "mykey",
"AWS_SECRET_ACCESS_KEY" = "mysecretkey",
"AWS_DEFAULT_REGION" = "us-east-1",
"AWS_SESSION_TOKEN" = "mytoken")
#Create Dummy Data
testdata <- rep(1:3, 10)
#Save to AWS
s3saveRDS(testdata, object = "testdata.rds", bucket = "mybucket")
#Recall from AWS
newtestdata <- get_object("testdata.rds", bucket = "mybucket")
newtestdata comes back in a raw format but I can't find how to convert it into its original format. I've tried things such as rawToChar() but I get errors.
For info this is what the newtestdata file looks like in its raw form:
1f 8b 08 00 00 00 00 00 00 06 8b e0 62 60 60 60 62 60 66 61 64 60 62 06 32 19 78 81 58 0e 88 19 c1 e2 0c 0c cc f4 64 03 00 62 4b 7d f5 8e 00 00 00
What should I do to convert this file back to its original form?
You can try below snippet to read the data as it is as mentioned in [1] and see if it they are matching with identical() .
s3readRDS(object = "mtcars.rds", bucket = "myexamplebucket")
identical(mtcars, mtcars2)

looking to understand meaning of two bytes in HTTP request made with curl --trace

tl;dr "What would the bytes 0x33 0x39 0x0d 0x0a between the end of HTTP headers and the start of HTTP response body refer to?"
I'm using the thoroughly excellent libcurl to make HTTP requests to various 3rd party endpoints. These endpoints are not under my control and are required to implement a specification. To help debug and develop these endpoints I have implemented the text output functionality you might see if you make a curl request from the command line with the -v flag using curl.setopt(pycurl.VERBOSE, 1) and curl.setopt(pycurl.DEBUGFUNCTION, debug_function)
This has been working great but recently I've come across a request which my debug function does not handle in the same way as curl's debug output. I'm sure is due to me not understanding the HTTP spec.
If making a curl request from the command line with --verbose I get the following returned.
# redacted headers
< Via: 1.1 vegur
<
{"code":"InvalidCredentials","message":"Bad credentials"}*
Connection #0 to host redacted left intact
If making the same request with --trace the following is returned
0000: 56 69 61 3a 20 31 2e 31 20 76 65 67 75 72 0d 0a Via: 1.1 vegur..
<= Recv header, 2 bytes (0x2)
0000: 0d 0a ..
<= Recv data, 1 bytes (0x1)
0000: 33 3
<= Recv data, 62 bytes (0x3e)
0000: 39 0d 0a 7b 22 63 6f 64 65 22 3a 22 49 6e 76 61 9..{"code":"Inva
0010: 6c 69 64 43 72 65 64 65 6e 74 69 61 6c 73 22 2c lidCredentials",
0020: 22 6d 65 73 73 61 67 65 22 3a 22 42 61 64 20 63 "message":"Bad c
0030: 72 65 64 65 6e 74 69 61 6c 73 22 7d 0d 0a redentials"}..
<= Recv data, 1 bytes (0x1)
0000: 30 0
<= Recv data, 4 bytes (0x4)
0000: 0d 0a 0d 0a ....
== Info: Connection #0 to host redacted left intact
All HTTP client libs I've tested don't include these parts of the bytes in the response body so I'm guessing these are part of the HTTP spec I don't know about but I can't find a reference to them and I don't know how to handle them.
If it's helpful I think curl is using this https://github.com/curl/curl/blob/master/src/tool_cb_dbg.c for building the output in the first example bit I'm not really a c/c++ programmer and I haven't been able to reverse engineer the logic.
Does anyone know what these bytes are?
0d 0a are ASCII control characters representing carriage return and line feed, respectively. CRLF is used in HTTP to mark the end of a header field (there are some historic exceptions you should not worry about at this point). A double CRLF is supposed to mark the end of the fields section of a message.
The 33 39 you observe there is "39" in ascii. This is the chunk size indicator - treated as a hexdecimal number. The presence of Transfer-Encoding: chunked in the response headers may support this.

Errors while opening dicom files in R

I am trying to open dicom files in R using following code:
library(oro.dicom)
dcmobject <- readDICOMFile(filename)
Some files open properly and I can display them. However, some files give errors of different types:
First error: For some, I get the error:
Error in file(con, "rb") : cannot open the connection
Second error: In others, I get following error with dicom file: http://www.barre.nom.fr/medical/samples/files/OT-MONO2-8-hip.gz :
Error in readDICOMFile(filename) : DICM != DICM
Third error: This file gives following error: http://www.barre.nom.fr/medical/samples/files/CT-MONO2-16-chest.gz
Error in parsePixelData(fraw[(132 + dcm$data.seek + 1):fsize], hdr, endian, :
Number of bytes in PixelData not specified
Fourth error: One dicom file gives following error:
Error in rawToChar(fraw[129:132]) : embedded nul in string: '\0\0\b'
How can I get rid of these errors and display these images in R?
EDIT:
This sample file gives the error 'embed nul in string...':
http://www.barre.nom.fr/medical/samples/files/CT-MONO2-12-lomb-an2.gz
> jj = readDICOMFile( "CT-MONO2-12-lomb-an2.dcm" )
Error in rawToChar(fraw[129:132]) : embedded nul in string: '3\0\020'
There are four different errors highlighted in this ticket:
Error in file(con, "rb") : cannot open the connection
This is not a problem with oro.dicom, it is simply the fact that the file path and/or name has been mis-specified.
Error in readDICOMFile(filename) : DICM != DICM
The file is not a valid DICOM file. That is, section 7.1 in Part 10 of the DICOM Standard (available at http://dicom.nema.org) specifies that there should be (a) the File Preample of length 128 bytes and (b) the four-byte DICOM Prefix "DICM" at the beginning of a DICOM file. The file OT-MONO2-8-hip does not follow this standard. One can investigate this problem further using the debug=TRUE input parameter
> dcm <- readDICOMFile("OT-MONO2-8-hip.dcm", debug=TRUE)
# First 128 bytes of DICOM header =
[1] 08 00 00 00 04 00 00 00 b0 00 00 00 08 00 08 00 2e 00 00 00 4f 52 49 47 49 4e 41 4c 5c 53 45
[32] 43 4f 4e 44 41 52 59 5c 4f 54 48 45 52 5c 41 52 43 5c 44 49 43 4f 4d 5c 56 41 4c 49 44 41 54
[63] 49 4f 4e 20 08 00 16 00 1a 00 00 00 31 2e 32 2e 38 34 30 2e 31 30 30 30 38 2e 35 2e 31 2e 34
[94] 2e 31 2e 31 2e 37 00 08 00 18 00 1a 00 00 00 31 2e 33 2e 34 36 2e 36 37 30 35 38 39 2e 31 37
[125] 2e 31 2e 37
Error in readDICOMFile("OT-MONO2-8-hip.dcm", debug = TRUE) : DICM != DICM
It is apparent that the first 128 bytes contain information. One can now use the parameters skipFirst128=FALSE and DICM=FALSE to start reading information from the beginning of the file
dcm <- readDICOMFile("OT-MONO2-8-hip.dcm", skipFirst128=FALSE, DICM=FALSE)
image(t(dcm$img), col=grey(0:64/64), axes=FALSE, xlab="", ylab="")
3.
Error in parsePixelData(fraw[(132 + dcm$data.seek + 1):fsize], hdr, endian, :
Number of bytes in PixelData not specified
The file CT-MONO2-16-chest.dcm is encoded using JPEG compression. The R package oro.dicom does not support compression.
Error in rawToChar(fraw[129:132]) : embedded nul in string: '\0\0\b'
I have to speculate, since the file is not available for direct interrogation. This problem is related to the check for "DICM" characters as part of the DICOM standard. If it failed, then one can assume the file is not a valid DICM file. I will look into making this error more informative in future versions of oro.dicom.
EDIT: Thank-you for providing a link to the appropriate file. The file is in "ARC-NEMA 2" format. The R package oro.dicom has not been designed to read such a file. I have modified the code to improve the error tracking.

Can I get the byte representation of an R float?

I'm trying to read in a complicated data file that has floating point values. Some C code has been supplied that handles this format (Met Office PP file) and it does a lot of bit twiddling and swapping. And it doesn't work. It gets a lot right, like the size of the data, but the numerical values in the returned matrix are nonsensical, have NaNs and values like 1e38 and -1e38 liberally sprinkled.
However, I have a binary exe ("convsh") that can convert these to netCDF, and the netCDFs look fine - nice swirly maps of wind speed.
What I'm thinking is that the bytes of the PP file are being read in in the wrong order. If I could compare the bytes of the floats returned correctly in the netCDF data with the bytes in the floats returned wrongly from the C code, then I might figure out the correct swappage.
So is there a plain R function to dump the four (or eight?) bytes of a floating point number? Something like:
> as.bytes(pi)
[1] 23 54 163 73 99 00 12 45 # made up values
searches for "bytes" and "float" and "binary" haven't helped.
Its trivial in C, I could probably have written it in the time it took me to write this...
rdyncall might give you what you're looking for:
library(rdyncall)
as.floatraw(pi)
# [1] db 0f 49 40
# attr(,"class")
# [1] "floatraw"
Or maybe writeBin(pi, raw(8))?
Yes, that must exist in the serialization code because R merrily sends stuff across the wire, taking care of endianness too. Did you look at eg Rserve using it, or how digest passes the char representation to chosen hash functions?
After a quick glance at digest.R:
R> serialize(pi, connection=NULL, ascii=TRUE)
[1] 41 0a 32 0a 31 33 34 39 31 34 0a 31 33 31 38 34 30 0a
[19] 31 34 0a 31 0a 33 2e 31 34 31 35 39 32 36 35 33 35 38
[37] 39 37 39 33 0a
and
R> serialize(pi, connection=NULL, ascii=FALSE)
[1] 58 0a 00 00 00 02 00 02 0f 02 00 02 03 00 00 00 00 0e
[19] 00 00 00 01 40 09 21 fb 54 44 2d 18
R>
That might get you going.
Come to think about it, this includes header meta-data.
The package mcga (machine-coded genetic algorithms) includes some functions for bytes-to-double and doubles-to-byte conversions. For handling the bytes of pi, you can use DoubleToBytes like:
> DoubleToBytes(pi)
1 24 45 68 84 251 33 9 64
For converting bytes to double again, BytesToDouble() can be used instead:
> BytesToDouble(c(24,45,68,84,251,33,9,64))
1 3.141593
Links:
CRAN page of mcga

Resources