RIS JSON API content compressed - r

I am trying to use the RIS API from the German railway. I managed to use the one on timetables, which seems to return JSON.
https://developers.deutschebahn.com/db-api-marketplace/apis/product/fahrplan/api/9213#/Fahrplan_101/operation/%2Flocation%2F{name}/get
GET https://apis.deutschebahn.com/db-api-marketplace/apis/fahrplan/v1/location/Berlin
Header:
Accept: application/json
DB-Client-Id: 36fd91806420e2e937a599478a557e06
DB-Api-Key: d8a52f0f66184d80bdbd3a4a30c0cc33
Which I can replicate using httr:
library(httr)

url_location <- "https://apis.deutschebahn.com/db-api-marketplace/apis/fahrplan/v1/location/Bonn"
r <- GET(url_location,
         add_headers(Accept = "application/json",
                     `DB-Client-Id` = client_id,
                     `DB-Api-Key` = api_key))
content(r)
I now want to use another API, about stations.
https://developers.deutschebahn.com/db-api-marketplace/apis/product/ris-stations/api/ris-stations#/RISStations_160/operation/%2Fstations/get
GET https://apis.deutschebahn.com/db-api-marketplace/apis/ris-stations/v1/stations?onlyActive=true&limit=100
Header:
Accept: application/vnd.de.db.ris+json
DB-Client-Id: 36fd91806420e2e937a599478a557e06
DB-Api-Key: d8a52f0f66184d80bdbd3a4a30c0cc33
I was hoping that it would work as well after just adjusting the Accept header:
url_station <- "https://apis.deutschebahn.com/db-api-marketplace/apis/ris-stations/v1/stations?onlyActive=true&limit=100"
r_stations <- GET(url_station,
                  add_headers(Accept = "application/vnd.de.db.ris+json",
                              `DB-Client-Id` = client_id,
                              `DB-Api-Key` = api_key))
I receive some data and the status code is 200 (it was 415 before adjusting the Accept header). But when I look at the content, with or without the content() function, I get the following:
> head(content(r_stations), 30)
[1] 7b 22 6f 66 66 73 65 74 22 3a 30 2c 22 6c 69 6d 69 74 22 3a 31 30 30 2c 22 74 6f 74 61 6c
r_stations$status_code
[1] 200
I should get something more like this:
{
  "offset": 0,
  "limit": 100,
  "total": 5691,
  "stations": [
    {
      "stationID": "1",
      "names": {
        "DE": {
          "name": "Aachen Hbf"
        }
      },

It turns out I just need to pass type = 'application/json' to content(), which tells httr to parse the body as JSON even though the response declares the vendor media type:
content(r_stations, type='application/json')
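Alternatively, since the body is just raw bytes of UTF-8 JSON, you can decode and parse it yourself with jsonlite (a minimal sketch, assuming the full response fits in memory):
library(jsonlite)

# The body is raw bytes of JSON; decode to a string and parse it
stations <- fromJSON(rawToChar(content(r_stations, as = "raw")))
stations$total       # e.g. 5691, as in the example above
head(stations$stations)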

Related

R - httr POST request to website investing.com to get a JSON response

I am trying to scrape information from investing.com, based on the ISIN code of a stock.
When I fill in the website's top search form with the ISIN code, an XHR POST request is sent. Here is the JSON content I get:
{"total":{"articles":10,"allResults":16,"quotes":6},"score":{"articles":25.00122},"articles":[...],
"quotes":[
{"pairId":386,"name":"Accor SA","flag":"France","link":"\/equities\/accor","symbol":"ACCP","type":"Action - Paris","pair_type_raw":"Equities","pair_type":"equities","countryID":22,"sector":2,"region":6,"industry":55,"isCrypto":false,"exchange":"Paris","exchangeID":9},
{"pairId":948559,"name":"Accor SA","flag":"UK","link":"\/equities\/accor?cid=948559","symbol":"0H59","type":"Action - Londres","pair_type_raw":"Equities","pair_type":"equities","countryID":4,"sector":16,"region":6,"industry":129,"isCrypto":false,"exchange":"Londres","exchangeID":3},
{"pairId":33386,"name":"Accor SA","flag":"France","link":"\/equities\/accor?cid=33386","symbol":"ACp","type":"Action - BATS Europe","pair_type_raw":"Equities","pair_type":"equities","countryID":22,"sector":16,"region":6,"industry":129,"isCrypto":false,"exchange":"BATS Europe","exchangeID":121},
{"pairId":963294,"name":"Accor SA","flag":"Germany","link":"\/equities\/accor?cid=963294","symbol":"ACCP","type":"Action - Francfort","pair_type_raw":"Equities","pair_type":"equities","countryID":17,"sector":16,"region":6,"industry":129,"isCrypto":false,"exchange":"Francfort","exchangeID":104},
{"pairId":963914,"name":"Accor SA","flag":"Germany","link":"\/equities\/accor?cid=963914","symbol":"ACCP","type":"Action - TradeGate","pair_type_raw":"Equities","pair_type":"equities","countryID":17,"sector":0,"region":6,"industry":0,"isCrypto":false,"exchange":"TradeGate","exchangeID":105},
{"pairId":993697,"name":"Accor SA","flag":"Mexico","link":"\/equities\/accor?cid=993697","symbol":"ACCN","type":"Action - Mexico","pair_type_raw":"Equities","pair_type":"equities","countryID":7,"sector":16,"region":2,"industry":129,"isCrypto":false,"exchange":"Mexico","exchangeID":53}]}
I derived a POST request from the browser's inspection tools, to retrieve just the JSON piece of information I need, not the whole page:
library(httr)

codeIsin <- 'FR0000120404'

investing_url <- list(scheme = "https",
                      host = "fr.investing.com",
                      filename = "/search/service/searchTopBar")
investing_url <- modify_url(url = "",
                            scheme = investing_url$scheme,
                            hostname = investing_url$host,
                            path = investing_url$filename)

investing_query <- paste0("search_text=", codeIsin)

investing_headers <- list("Host" = "fr.investing.com",
                          "User-Agent" = "Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0",
                          "Accept" = "application/json, text/javascript, */*; q=0.01",
                          "Accept-Language" = "fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3",
                          "Accept-Encoding" = "gzip, deflate, br",
                          "Content-Type" = "application/x-www-form-urlencoded",
                          "X-Requested-With" = "XMLHttpRequest",
                          "Content-Length" = "23",
                          "Origin" = "https://fr.investing.com",
                          "Connection" = "keep-alive",
                          "Pragma" = "no-cache",
                          "Cache-Control" = "no-cache",
                          "TE" = "Trailers")

response <- POST(url = investing_url,
                 query = investing_query,
                 header = investing_headers)
I get back raw content:
typeof(response$content)
[1] "raw"
response$content
[1] 20 3c 21 44 4f 43 54 59 50 45 20 48 54 4d 4c 3e 0a 3c 68 74 6d 6c 20 64 69 72 3d 22 6c 74 72 22 20
[34] 78 6d 6c 6e 73 3d 22 68 74 74 70 3a 2f 2f 77 77 77 2e 77 33 2e 6f 72 67 2f 31 39 39 39 2f 78 68 74
...
[958] 65 35 2a 64 29 3b 65 2b 3d 27 3b 65 78 70 69 72 65 73 3d 22 27 3b 65 2b 3d 6e 2e 74 6f 47 4d 54 53
[991] 74 72 69 6e 67 28 29 3b 65 2b
[ reached getOption("max.print") -- omitted 688441 entries ]
Once decoded with content(response, "text"), it appears to be the main page of the website.
response$request shows that not all headers were sent, in particular "Content-Type" = "application/x-www-form-urlencoded":
> response$request
<request>
POST https://fr.investing.com/search/service/searchTopBar?search_text=FR0000120404
Output: write_memory
Options:
* useragent: libcurl/7.74.0 r-curl/4.3 httr/1.4.2
* post: TRUE
* postfieldsize: 0
Headers:
* Accept: application/json, text/xml, application/xml, */*
* Content-Type:
Where does my request go wrong?
If you are not too tied to the syntax used, you can switch to the following. Note that httr::POST() has no header argument, which is why your custom headers, including Content-Type, were silently dropped: headers must go through add_headers(), and the form field belongs in body with encode = 'form', which sets Content-Type: application/x-www-form-urlencoded for you. I have also added a cookie header to allow for onward redirect within httr:
library(httr)
library(jsonlite)
library(rvest)  # for html_element() and html_text()

headers <- c(
  'user-agent' = 'Safari/537.36',
  'x-requested-with' = 'XMLHttpRequest',
  'cookie' = 'adBlockerNewUserDomains=on'
)

data <- list(
  'search_text' = 'FR0000120404'
)

r <- httr::POST(url = 'https://fr.investing.com/search/service/searchTopBar',
                httr::add_headers(.headers = headers),
                body = data, encode = 'form') |>
  content() |>
  html_element('p') |>
  html_text() |>
  jsonlite::parse_json()
r
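If you want to verify what actually goes over the wire, keep the bare response object and inspect the request httr built; with encode = 'form' the Content-Type header is set for you (a quick sanity check, assuming the headers and data objects from above):
resp <- httr::POST('https://fr.investing.com/search/service/searchTopBar',
                   httr::add_headers(.headers = headers),
                   body = data, encode = 'form')
resp$request$headers   # the custom headers are present this time
resp$status_code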

Get non html content from rvest_session object

I am trying to get a text file from a URL. From the browser, it's fairly simple: I just "save as" from the URL and I get the file I want. At first, I had some trouble logging in using rvest (see https://stackoverflow.com/questions/66352322/how-to-get-txt-file-from-password-protected-website-jsp-in-r, where I uploaded a couple of probably useful pictures). When I use the following code:
library(rvest)

fileurl <- "http://www1.bolsadecaracas.com/esp/productos/dinamica/downloadDetail.jsp?symbol=BNC&dateFrom=20190101&dateTo=20210101&typePlazo=nada&typeReg=T"
session(fileurl)
I get the following (note how I am redirected to a different URL, as happens in the browser when you try to get to the fileurl without first logging in):
<session> http://www1.bolsadecaracas.com/esp/productos/dinamica/downloadDetail.jsp?symbol=BNC&dateFrom=20190101&dateTo=20210101&typePlazo=nada&typeReg=T
Status: 200
Type: text/html; charset=ISO-8859-1
Size: 84
I managed to log in using the following code:
#Define URLs
loginurl <- "http://www1.bolsadecaracas.com/esp/usuarios/customize.jsp"
fileurl <- "http://www1.bolsadecaracas.com/esp/productos/dinamica/downloadDetail.jsp?symbol=BNC&dateFrom=20190101&dateTo=20210101&typePlazo=nada&typeReg=T"

#Create session
pgsession <- session(loginurl)
pgform <- html_form(pgsession)[[1]] #Get form

#Create a fake submit button as form does not have one
fake_submit_button <- list(name = NULL,
                           type = "submit",
                           value = NULL,
                           checked = NULL,
                           disabled = NULL,
                           readonly = NULL,
                           required = FALSE)
attr(fake_submit_button, "class") <- "input"
pgform[["fields"]][["submit"]] <- fake_submit_button

#Create and submit filled form
filled_form <- html_form_set(pgform, login = "******", passwd = "******")
session_submit(pgsession, filled_form)

#Jump to new url
loggedsession <- session_jump_to(pgsession, url = fileurl)

#Output
loggedsession
It seems to me that the login was successful, as the session output is the exact same size as the .txt file when I download it, and I am no longer redirected. See the output:
<session> http://www1.bolsadecaracas.com/esp/productos/dinamica/downloadDetail.jsp?symbol=BNC&dateFrom=20190101&dateTo=20210101&typePlazo=nada&typeReg=T
Status: 200
Type: text/plain; charset=ISO-8859-1
Size: 32193
However, whenever I try to extract the content of the session with read_html() or the like, I get the following error: "Error: Page doesn't appear to be html.". I don't know if it has anything to do with the "Type: text/plain" of the session.
When I run
loggedsession[["response"]][["content"]]
I get
[1] 0d 0a 0d 0a 0d 0a 0d 0a 0d 0a 7c 30 32 2f 30 31 2f 32 30 31 39 7c 52 7c 31 34 2c 39 30 7c 31 35 2c
[34] 30 30 7c 31 37 2c 38 33 7c 31 33 2c 35 30 7c 39 7c 31 33 2e 35 33 33 7c 32 30 33 2e 30 36 30 2c 31
[67] 39 7c 0a 7c 30 33 2f 30 31 2f 32 30 31 39 7c 52 7c 31 35 2c 30 30 7c 31 37 2c 39 38 7c 31 37 2c 39
Any help on how to extract the text file would be greatly appreciated.
PS:
At one point, just playing around with functions, I managed to get something that would have worked with httr::GET(fileurl). That was after playing with rvest functions and managing to log in. However, after closing and reopening RStudio I was not able to get the same output from that function.
Because rvest uses the httr package internally, you can use httr and base R to save your file. The key to the solution is that your response (in httr terms) is in the session object:
library(rvest)
library(httr)

httr::content(loggedsession$response, as = "text") %>%
  cat(file = "your_file.txt")
More importantly, if your file were binary (e.g. a zip archive), you would have to do:
library(rvest)
library(httr)

httr::content(loggedsession$response, as = "raw") %>%
  writeBin(con = 'your_file.zip')
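Going one step further: the recovered file appears to be pipe-delimited with decimal commas (e.g. |02/01/2019|R|14,90|...), so you could read it straight into a data frame instead of saving it first. A sketch, assuming that layout holds for every row; the mixed European number formats may still need cleanup:
library(httr)

# Decode the body as text (the session reports ISO-8859-1)
txt <- httr::content(loggedsession$response, as = "text", encoding = "ISO-8859-1")

# sep = "|" splits the fields and dec = "," handles the decimal commas;
# the leading/trailing pipes yield empty first/last columns you can drop
df <- read.table(text = txt, sep = "|", dec = ",", header = FALSE,
                 strip.white = TRUE, fill = TRUE)
head(df)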

Why is the hex value of a period in a DNS request not 0x2E, and why does it change?

Looking at a DNS request in Wireshark for www.google.com, the hex for the queried name is 03 77 77 77 06 67 6f 6f 67 6c 65 03 63 6f 6d 00.
I'm a little confused why the first period is 03 (and why it's there), the second is 06, and the last is 03.
The DNS protocol layer is defined in RFC 1035. To cite from "3.1. Name space definitions":
Domain names in messages are expressed in terms of a sequence of labels.
Each label is represented as a one octet length field followed by that
number of octets. Since every domain name ends with the null label of
the root, a domain name is terminated by a length byte of zero.
Thus there are no literal periods in the packet at all; the dots are implied by the label boundaries. www.google.com is encoded in the DNS packet as:
03 77 77 77            length 3, "www"
06 67 6f 6f 67 6c 65   length 6, "google"
03 63 6f 6d            length 3, "com"
00                     length 0 (the null root label, terminating the name)
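To make the rule concrete, here is a small R sketch (a hypothetical helper, not from any package) that builds this length-prefixed wire format:
encode_dns_name <- function(name) {
  labels <- strsplit(name, ".", fixed = TRUE)[[1]]
  bytes <- unlist(lapply(labels, function(l) {
    raw_l <- charToRaw(l)
    c(length(raw_l), as.integer(raw_l))   # length prefix, then label bytes
  }))
  as.raw(c(bytes, 0L))                    # terminating null root label
}
encode_dns_name("www.google.com")
# [1] 03 77 77 77 06 67 6f 6f 67 6c 65 03 63 6f 6d 00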

looking to understand meaning of two bytes in HTTP request made with curl --trace

tl;dr "What would the bytes 0x33 0x39 0x0d 0x0a between the end of HTTP headers and the start of HTTP response body refer to?"
I'm using the thoroughly excellent libcurl to make HTTP requests to various third-party endpoints. These endpoints are not under my control and are required to implement a specification. To help debug and develop these endpoints, I have implemented the text output functionality you might see when making a curl request from the command line with the -v flag, using curl.setopt(pycurl.VERBOSE, 1) and curl.setopt(pycurl.DEBUGFUNCTION, debug_function).
This has been working great, but recently I've come across a request which my debug function does not handle in the same way as curl's debug output. I'm sure this is due to me not understanding the HTTP spec.
If I make a curl request from the command line with --verbose, I get the following:
# redacted headers
< Via: 1.1 vegur
<
{"code":"InvalidCredentials","message":"Bad credentials"}*
Connection #0 to host redacted left intact
If I make the same request with --trace, the following is returned:
0000: 56 69 61 3a 20 31 2e 31 20 76 65 67 75 72 0d 0a Via: 1.1 vegur..
<= Recv header, 2 bytes (0x2)
0000: 0d 0a ..
<= Recv data, 1 bytes (0x1)
0000: 33 3
<= Recv data, 62 bytes (0x3e)
0000: 39 0d 0a 7b 22 63 6f 64 65 22 3a 22 49 6e 76 61 9..{"code":"Inva
0010: 6c 69 64 43 72 65 64 65 6e 74 69 61 6c 73 22 2c lidCredentials",
0020: 22 6d 65 73 73 61 67 65 22 3a 22 42 61 64 20 63 "message":"Bad c
0030: 72 65 64 65 6e 74 69 61 6c 73 22 7d 0d 0a redentials"}..
<= Recv data, 1 bytes (0x1)
0000: 30 0
<= Recv data, 4 bytes (0x4)
0000: 0d 0a 0d 0a ....
== Info: Connection #0 to host redacted left intact
None of the HTTP client libraries I've tested include these bytes in the response body, so I'm guessing they are governed by some part of the HTTP spec I don't know about, but I can't find a reference to them and don't know how to handle them.
If it's helpful, I think curl uses https://github.com/curl/curl/blob/master/src/tool_cb_dbg.c to build the output in the first example, but I'm not really a C/C++ programmer and I haven't been able to reverse-engineer the logic.
Does anyone know what these bytes are?
0d 0a are the ASCII control characters carriage return and line feed, respectively. CRLF is used in HTTP to mark the end of a header field (there are some historic exceptions you should not worry about at this point). A double CRLF marks the end of the header section of a message.
The 33 39 you observe is "39" in ASCII. This is a chunk size indicator, treated as a hexadecimal number: 0x39 = 57 bytes, which is exactly the length of the JSON body that follows. Likewise, the trailing 30 0d 0a 0d 0a is the terminating zero-length chunk. The presence of Transfer-Encoding: chunked in the response headers should confirm this.
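For illustration, here is a rough R sketch that decodes a chunked body captured as text (a hypothetical helper; it ignores chunk extensions and trailers):
decode_chunked <- function(body) {
  out <- character(0)
  repeat {
    size <- strtoi(sub("\r\n.*$", "", body), base = 16L)  # size line is hex
    body <- sub("^[^\r]*\r\n", "", body)                  # drop the size line
    if (is.na(size) || size == 0) break
    out <- c(out, substr(body, 1, size))                  # take the chunk payload
    body <- substring(body, size + 3)                     # skip payload + CRLF
  }
  paste0(out, collapse = "")
}

decode_chunked("39\r\n{\"code\":\"InvalidCredentials\",\"message\":\"Bad credentials\"}\r\n0\r\n\r\n")
# [1] "{\"code\":\"InvalidCredentials\",\"message\":\"Bad credentials\"}"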

nginx returning netstring with wrong length?

I installed nginx (nginx version: nginx/1.7.9) via MacPorts on my Mac running the latest OS X.
I configured a URI to use SCGI:
location /server {
    include /Users/ruipacheco/Projects/Assorted/nginx/conf/scgi_params;
    scgi_pass unix:/var/tmp/rpc.sock;
    #scgi_pass 127.0.0.1:9000;
}
And when I do a GET request on 127.0.0.1/server, I see the following on my SCGI server:
633:CONTENT_LENGTH0REQUEST_METHODGETREQUEST_URI/serverQUERY_STRINGCONTENT_TYPEDOCUMENT_URI/serverDOCUMENT_ROOT/opt/local/htmlSCGI1SERVER_PROTOCOLHTTP/1.1REMOTE_ADDR127.0.0.1REMOTE_PORT62088SERVER_PORT80SERVER_NAMElocalhostHTTP_HOST127.0.0.1HTTP_CONNECTIONkeep-aliveHTTP_CACHE_CONTROLmax-age=0HTTP_ACCEPTtext/html,application/xhtml+xml,application/xml;q=0.9,image/webp,/;q=0.8HTTP_USER_AGENTMozilla/5.0
(Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like
Gecko) Chrome/40.0.2214.115
Safari/537.36HTTP_DNT1HTTP_ACCEPT_ENCODINGgzip, deflate,
sdchHTTP_ACCEPT_LANGUAGEen-US,en;q=0.8,End of file
The problem is that the stated length of the netstring, 633, does not match its contents. If I understand the netstring spec correctly, 633 should be the number of characters between the first : and the last ,:
Any string of 8-bit bytes may be encoded as [len]":"[string]",". Here [string] is the string and [len] is a nonempty sequence of ASCII digits giving the length of [string] in decimal. The ASCII digits are <30> for 0, <31> for 1, and so on up through <39> for 9. Extra zeros at the front of [len] are prohibited: [len] begins with <30> exactly when [string] is empty.
For example, the string hello world! is encoded as 31 32 3a 68 65 6c 6c 6f 20 77 6f 72 6c 64 21 2c, i.e., 12:hello world!,.
So, I'm getting the wrong length. How can this be explained?
As far as I can tell, your example response has the correct length.
According to the example here:
http://en.wikipedia.org/wiki/Simple_Common_Gateway_Interface
field values are preceded and followed by a <00> byte (the ASCII character with hex code 00), e.g.:
REQUEST_METHOD <00>GET<00>
Once I added the missing <00> bytes back into your response snippet, it quickly got back to 633 bytes, as advertised.
I suppose that somewhere in the process of pasting that response here, some piece of software stripped the <00>'s, which is totally normal behaviour.
Anyway, the answer seems to be: your nginx is returning a correct length, and the <00>'s were stripped from your snippet somewhere along the way.
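For reference, here is a minimal R sketch (a hypothetical helper) that encodes a netstring exactly as the spec quoted above defines it, with [len] counting only the bytes of [string]:
encode_netstring <- function(s) {
  paste0(length(charToRaw(s)), ":", s, ",")   # [len]":"[string]","
}
encode_netstring("hello world!")
# [1] "12:hello world!,"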
Well,
the hexadecimal <31 32 3a 68 65 6c 6c 6f 20 77 6f 72 6c 64 21>
in ASCII is "12:hello world!" (no quotes), and the length is 12 ("hello world!").
This one from the example, <31 32 3a 68 65 6c 6c 6f 20 77 6f 72 6c 64 21 2c>, is wrong (at least it doesn't match the nginx norm), since the internal length is 13 while the length specified in hex is 12:
the ASCII "12:hello world!," should be "13:hello world!,", in hex <31 33 3a 68 65 6c 6c 6f 20 77 6f 72 6c 64 21 2c>.
This line is the mess:
For example, the string "hello world!" is encoded as <31 32 3a 68 65
6c 6c 6f 20 77 6f 72 6c 64 21 2c>, i.e., "12:hello world!,".
OK) 12:hello world!  ---> <31 *32* 3a 68 65 6c 6c 6f 20 77 6f 72 6c 64 21>
KO) 12:hello world!, ---> <31 *32* 3a 68 65 6c 6c 6f 20 77 6f 72 6c 64 21 2c>
OK) 13:hello world!, ---> <31 *33* 3a 68 65 6c 6c 6f 20 77 6f 72 6c 64 21 2c>
The hex between the asterisks is the second digit of the length.
So your concept is fine; the example itself is bad.
