I have a .json file (more than 100,000 lines) containing the following information:
POST /log?lat=36.804121354&lon=-1.270256482&time=2016-05-18T17:39:59.004Z
{ 'content-type': 'application/x-www-form-urlencoded',
'content-length': '29',
host: 'ip_address:port',
connection: 'Keep-Alive',
'accept-encoding': 'gzip',
'user-agent': 'okhttp/3.7.0' }
BODY: lat=36.804121354&lon=-1.270256482
POST /log?lat=36.804123256&lon=-1.270254711&time=2016-05-18T17:40:13.004Z
{ 'content-type': 'application/x-www-form-urlencoded',
'content-length': '29',
host: 'ip_address:port',
connection: 'Keep-Alive',
'accept-encoding': 'gzip',
'user-agent': 'okhttp/3.7.0' }
BODY: lat=36.804123256&lon=-1.270254711
POST /log?lat=36.804124589&lon=-1.270255641&time=2016-05-18T17:41:05.004Z
{ 'content-type': 'application/x-www-form-urlencoded',
'content-length': '29',
host: 'ip_address:port',
connection: 'Keep-Alive',
'accept-encoding': 'gzip',
'user-agent': 'okhttp/3.7.0' }
BODY: lat=36.804124589&lon=-1.270255641
.......
The above information repeats with updated latitude, longitude and time. Using R, how can I extract the latitude, longitude and time from this file and store them in a data frame like this:
id lat lon time
1 36.804121354 -1.270256482 2016-05-18 17:39:59
2 36.804123256 -1.270254711 2016-05-18 17:40:13
3 36.804124589 -1.270255641 2016-05-18 17:41:05
Your data doesn't appear to be strictly JSON. Since all of the requested data is contained in the "POST" lines, one solution is to keep only those lines and then parse them.
#Read lines
x<-readLines("test.txt")
#Find lines beginning with "POST"
posts<-x[grep("^POST", x)]
#Remove the prefix: "POST /log?"
posts<-sub("^POST /log\\?", "", posts)
#split remaining fields on the &
fields<-unlist(strsplit(posts, "\\&"))
#remove the prefixes ("lat=", "lon=", "time=")
fields<-sub("^.*=", "", fields)
#make a dataframe (assume the fields are always in the same order)
df<-as.data.frame(matrix(fields, ncol=3, byrow=TRUE), stringsAsFactors = FALSE)
names(df)<-c("lat", "lon", "time")
#convert the columns to the proper type.
df$lat<-as.numeric(df$lat)
df$lon<-as.numeric(df$lon)
df$time<-as.POSIXct(df$time, format="%Y-%m-%dT%H:%M:%S", tz="UTC")
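The approach above assumes the three fields always appear in the same order. As a variation, here is a minimal sketch that looks each parameter up by name and also adds the id column from the desired output. The helper get_param is hypothetical (not part of any package), and the inline sample lines stand in for readLines("test.txt").

```r
# Sample lines standing in for readLines("test.txt"); values copied from the question.
x <- c(
  "POST /log?lat=36.804121354&lon=-1.270256482&time=2016-05-18T17:39:59.004Z",
  "{ 'content-type': 'application/x-www-form-urlencoded',",
  "BODY: lat=36.804121354&lon=-1.270256482",
  "POST /log?lat=36.804123256&lon=-1.270254711&time=2016-05-18T17:40:13.004Z"
)

# Hypothetical helper: extract the value of one query parameter from each line.
# Returns NA for lines where the parameter is missing.
get_param <- function(lines, name) {
  pattern <- paste0("[?&]", name, "=([^&]*)")
  hits <- regmatches(lines, regexec(pattern, lines))
  vapply(hits, function(h) if (length(h) == 2) h[2] else NA_character_, character(1))
}

# Keep only the request lines, then build the data frame column by column.
posts <- grep("^POST ", x, value = TRUE)
df <- data.frame(
  id   = seq_along(posts),
  lat  = as.numeric(get_param(posts, "lat")),
  lon  = as.numeric(get_param(posts, "lon")),
  time = as.POSIXct(get_param(posts, "time"),
                    format = "%Y-%m-%dT%H:%M:%S", tz = "UTC"),
  stringsAsFactors = FALSE
)
print(df)
```

Looking parameters up by name costs a little more code, but a line with reordered or missing fields produces an NA instead of silently shifting values into the wrong column.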
I've been working on this project that scrapes data from the nba.com stats website using R. A couple of months ago, I was able to use it easily, but now the url does not seem to work and I can't figure out why. Looking at the website, it doesn't seem like the url changed at all, but I can't access it via my browser.
library(rjson)
url <- "https://stats.nba.com/stats/scoreboardV2?DayOffset=0&LeagueID=00&gameDate=02%2F07%2F2020"
data_json <- fromJSON(file = url)
Is anyone else experiencing this problem?
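It was a header-related issue. The following fixed it:
library(httr)      # GET(), add_headers()
library(magrittr)  # %>%
library(rjson)     # fromJSON(), as in the question
url <- "https://stats.nba.com/stats/scoreboardV2?DayOffset=0&LeagueID=00&gameDate=02%2F07%2F2020"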
headers = c(
`Connection` = 'keep-alive',
`Accept` = 'application/json, text/plain, */*',
`x-nba-stats-token` = 'true',
`User-Agent` = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
`x-nba-stats-origin` = 'stats',
`Sec-Fetch-Site` = 'same-origin',
`Sec-Fetch-Mode` = 'cors',
`Referer` = 'http://stats.nba.com/%referer%/',
`Accept-Encoding` = 'gzip, deflate, br',
`Accept-Language` = 'en-US,en;q=0.9'
)
res <- GET(url, add_headers(.headers = headers))
data_json <- res$content %>%
rawToChar() %>%
fromJSON()
httr 1.4.1
R version 3.6.1 (also tried with 3.5.3)
Edit: added verbose() output.
I've got a request as follows:
r <- GET("https://my.cool.domain",add_headers(.headers = c('x-api-key' = 'abcdefg', 'Accept' = "text/csv")), verbose())
On my machine it responds with:
-> GET / HTTP/1.1
-> Host: https://my.cool.domain
-> User-Agent: libcurl/7.54.0 r-curl/4.2 httr/1.4.1
-> Accept-Encoding: deflate, gzip
-> x-api-key: abcdefg
-> Accept: text/csv
->
<- HTTP/1.1 200 OK
<- Date: Tue, 26 Nov 2019 17:50:15 GMT
<- Content-Type: text/csv
<- Content-Length: 24902
<- Connection: keep-alive
<- x-amzn-RequestId: ...
<- Content-Encoding: deflate
<- x-amz-apigw-id: ...
<- X-Amzn-Trace-Id: ...
Response [https://my.cool.domain]
Date: 2019-11-26 17:20
Status: 200
Content-Type: text/csv
Size: 209 kB
cats,dogs...
yes,no...
yes,yes...
no,no...
However on my colleague's machine (same version of httr and R, and also with an updated version of R) I get the following:
-> GET / HTTP/2
-> Host: https://my.cool.domain
-> User-Agent: libcurl/7.64.1 r-curl/4.2 httr/1.4.1
-> Accept-Encoding: deflate, gzip
-> x-api-key: abcdefg
-> Accept: text/csv
->
<- HTTP/2 200
<- date: Tue, 26 Nov 2019 17:46:17 GMT
<- content-type: application/json
<- content-length: 21501
<- x-amzn-requestid: ...
<- content-encoding: deflate
<- x-amz-apigw-id: ...
<- x-amzn-trace-id: ...
Response [https://my.cool.domain]
Date: 2019-11-26 17:30
Status: 200
Content-Type: application/json
Size: 377 kB
I'm working with the developer of the https://my.cool.domain domain and I can confirm that the request header params (x-api-key and 'Accept' = "text/csv") are perfect. And the request works on my machine, and several others, but not this one colleague's.
What's going wrong here and how can I debug this?
Thanks
This was fixed by doing httr::set_config(httr::config(http_version = 1.1)) to force 1.1.
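For reference, httr passes http_version through to libcurl's CURLOPT_HTTP_VERSION, which is an enumeration and not a literal version number: in libcurl, 2 selects HTTP/1.1 and 3 selects HTTP/2. A fractional value such as 1.1 is presumably truncated to 1 (HTTP/1.0), which would also avoid HTTP/2. A minimal sketch, assuming you want to request HTTP/1.1 explicitly, either for the whole session or for a single request (the URL and key are the placeholders from the question):

```r
library(httr)

# 2 corresponds to libcurl's CURL_HTTP_VERSION_1_1.
# The numeric mapping comes from libcurl, not from httr itself.
http11 <- config(http_version = 2)

# Option A: apply to every subsequent httr request in this session.
set_config(http11)

# Option B: apply to one request only (not run here; needs the real endpoint).
# r <- GET("https://my.cool.domain",
#          add_headers("x-api-key" = "abcdefg", Accept = "text/csv"),
#          http11, verbose())
```

With verbose() enabled, the first request line shows which protocol version was actually negotiated.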
I'm aware of the existence of libraries that return HTTP responses as structs:
> HTTPoison.get!("http://httpbin.org/get")
> %HTTPoison.Response{body: "{\n \"args\": {}, \n \"headers\": {\n \"Host\": \"httpbin.org\", \n \"User-Agent\": \"hackney/1.6.6\"\n }, \n \"origin\": \"86.30.176.31\", \n \"url\": \"http://httpbin.org/get\"\n}\n", headers: [{"Server", "nginx"}, {"Date", "Sun, 12 Mar 2017 06:05:29 GMT"},{"Content-Type", "application/json"}, {"Content-Length", "165"}, {"Connection", "keep-alive"}, {"Access-Control-Allow-Origin", "*"}, {"Access-Control-Allow-Credentials", "true"}], status_code: 200}
However, how do I get the raw binary form of an IPv4 HTTP response packet in Elixir?
As per Dogbert's suggestion, I tried using gen_tcp, but got the following:
iex(1)> {:ok, port} = :gen_tcp.connect('httpbin.org',80,[:binary, active: false, packet: :http])
{:ok, #Port<0.6531>}
iex(2)> :gen_tcp.send(port, "GET /get HTTP/1.1\r\nHost: httpbin.org\r\n")
:ok
iex(3)> :gen_tcp.recv(port,0)
{:error, :closed}
What am I doing wrong here?
Getting rid of the packet: :http option in gen_tcp.connect and adding another \r\n at the end of the HTTP text solved it:
iex(3)> {:ok, packet_binary} = :gen_tcp.recv(port,0)
{:ok, {:http_response, {1, 1}, 200, 'OK'}}
iex(4)> {:ok, port} = :gen_tcp.connect('httpbin.org',80,[:binary, active: false])
{:ok, #Port<0.6706>}
iex(5)> :gen_tcp.send(port, "GET /get HTTP/1.1\r\nHost: httpbin.org\r\n\r\n")
:ok
iex(6)> {:ok, packet_binary} = :gen_tcp.recv(port,0)
{:ok,
"HTTP/1.1 200 OK\r\nServer: nginx\r\nDate: Sun, 12 Mar 2017 14:01:05 GMT\r\nContent-Type: application/json\r\nContent-Length: 129\r\nConnection: keep-alive\r\nAccess-Control-Allow-Origin: *\r\nAccess-Control-Allow-Credentials: true\r\n\r\n{\n \"args\": {}, \n \"headers\": {\n \"Host\": \"httpbin.org\"\n }, \n \"origin\": \"86.30.176.31\", \n \"url\": \"http://httpbin.org/get\"\n}\n"}
iex(7)> IO.puts(packet_binary)
HTTP/1.1 200 OK
Server: nginx
Date: Sun, 12 Mar 2017 14:01:05 GMT
Content-Type: application/json
Content-Length: 129
Connection: keep-alive
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true
{
"args": {},
"headers": {
"Host": "httpbin.org"
},
"origin": "86.30.176.31",
"url": "http://httpbin.org/get"
}
:ok
iex(8)> is_binary(packet_binary)
true
How do I convert this command:
curl -v -u abcdefghij1234567890:X -H "Content-Type: application/json" -X GET 'https://domain.freshdesk.com/api/v2/tickets'
to an equivalent request in RCurl?
The dev version of curlconverter (devtools::install_github("hrbrmstr/curlconverter")) can now convert curl command-line strings that include authentication and verbose parameters.
Copy your curl command to the clipboard:
curl -v -u abcdefghij1234567890:X -H "Content-Type: application/json" -X GET 'https://domain.freshdesk.com/api/v2/tickets'
Then run:
library(curlconverter)
req <- make_req(straighten())[[1]]
The following will now be in your clipboard:
httr::VERB(verb = "GET", url = "https://domain.freshdesk.com/api/v2/tickets",
httr::authenticate(user = "abcdefghij1234567890",
password = "X"), httr::verbose(),
httr::add_headers(), encode = "json")
but req is now also a callable function. You can see that by doing:
req
## function ()
## httr::VERB(verb = "GET", url = "https://domain.freshdesk.com/api/v2/tickets",
## httr::authenticate(user = "abcdefghij1234567890", password = "X"),
## httr::verbose(), httr::add_headers(), encode = "json")
or by actually calling it:
req()
I usually reformat the function source to make it more readable:
httr::VERB(verb = "GET",
url = "https://domain.freshdesk.com/api/v2/tickets",
httr::authenticate(user = "abcdefghij1234567890", password = "X"),
httr::verbose(),
httr::add_headers(),
encode = "json")
and you can easily translate that to a plain GET call without namespacing:
GET(url = "https://domain.freshdesk.com/api/v2/tickets",
authenticate(user = "abcdefghij1234567890", password = "X"),
verbose(),
add_headers(),
encode = "json")
We can validate it working with authenticated curl command-lines by a small substitution in your example:
curl_string <- 'curl -v -u abcdefghij1234567890:X -H "Content-Type: application/json" -X GET "https://httpbin.org/basic-auth/abcdefghij1234567890/X"'
make_req(straighten(curl_string))[[1]]()
## -> GET /basic-auth/abcdefghij1234567890/X HTTP/1.1
## -> Host: httpbin.org
## -> Authorization: Basic YWJjZGVmZ2hpajEyMzQ1Njc4OTA6WA==
## -> User-Agent: libcurl/7.43.0 r-curl/1.2 httr/1.2.1
## -> Accept-Encoding: gzip, deflate
## -> Accept: application/json, text/xml, application/xml, */*
## ->
## <- HTTP/1.1 200 OK
## <- Server: nginx
## <- Date: Tue, 30 Aug 2016 14:13:12 GMT
## <- Content-Type: application/json
## <- Content-Length: 63
## <- Connection: keep-alive
## <- Access-Control-Allow-Origin: *
## <- Access-Control-Allow-Credentials: true
## <-
## Response [https://httpbin.org/basic-auth/abcdefghij1234567890/X]
## Date: 2016-08-30 14:13
## Status: 200
## Content-Type: application/json
## Size: 63 B
## {
## "authenticated": true,
## "user": "abcdefghij1234567890"
## }
You can use httr to do this as follows:
require(httr)
GET('https://domain.freshdesk.com/api/v2/tickets',
verbose(),
authenticate("user", "passwd"),
content_type("application/json"))
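In both translations, curl's -u user:password flag maps to authenticate(), which uses HTTP basic authentication by default. As a small offline check, the sketch below rebuilds the Authorization header value from the credentials in the question. It assumes the jsonlite package is installed for base64 encoding.

```r
library(jsonlite)

user <- "abcdefghij1234567890"
password <- "X"

# Basic auth is the word "Basic" followed by base64("user:password").
auth_header <- paste("Basic", base64_enc(paste0(user, ":", password)))
print(auth_header)
```

The printed value can be compared with the Authorization line in the verbose() output to confirm that the credentials were sent as intended.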
I'm following the official manual of the opencpu package in R. Chapter 4.3, "Calling a function", uses curl to test the API:
curl http://your.server.com/ocpu/library/stats/R/rnorm -d "n=10&mean=100"
and the sample output is:
/ocpu/tmp/x032a8fee/R/.val
/ocpu/tmp/x032a8fee/stdout
/ocpu/tmp/x032a8fee/source
/ocpu/tmp/x032a8fee/console
/ocpu/tmp/x032a8fee/info
I can use curl to get a similar result, but when I try to send this HTTP request using the httr package in R, I don't know how to replicate it. Here is what I tried:
resp <- POST(
url = "localhost/ocpu/library/stats/R/rnorm",
body= "n=10&mean=100"
)
resp
the output is:
Response [HTTP://localhost/ocpu/library/stats/R/rnorm]
Date: 2015-10-16 00:51
Status: 400
Content-Type: text/plain; charset=utf-8
Size: 30 B
No Content-Type header found.
I guess I don't understand what the equivalent of curl's -d parameter is in httr. How can I get it right?
Try this :)
library(httr)
library(jsonlite)
getFunctionEndPoint <- function(url, format) {
return(paste(url, format, sep = '/'))
}
resp <- POST(
url = getFunctionEndPoint(
url = "https://public.opencpu.org/ocpu/library/stats/R/rnorm",
format = "json"),
body = list(n = 10, mean = 100),
encode = 'json')
fromJSON(rawToChar(resp$content))
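The answer above sends the arguments as JSON. curl -d sends them as application/x-www-form-urlencoded instead, and the closest httr equivalent is a named list body with encode = "form". A plain character body carries no Content-Type in httr, which is consistent with the 400 error in the question; adding content_type("application/x-www-form-urlencoded") to the original call is another option. A minimal sketch, assuming an OpenCPU server at the localhost address from the question (the wrapper call_rnorm is hypothetical):

```r
library(httr)

# Hypothetical wrapper; `server` defaults to the localhost address from the question.
call_rnorm <- function(server = "http://localhost/ocpu") {
  POST(
    url = paste0(server, "/library/stats/R/rnorm"),
    body = list(n = 10, mean = 100),
    encode = "form"  # sends n=10&mean=100 as application/x-www-form-urlencoded
  )
}

# Not run here: requires a running OpenCPU server.
# resp <- call_rnorm()
# cat(content(resp, as = "text"))
```

If the server behaves as in the manual, the response text is expected to list the /ocpu/tmp/... session paths rather than the JSON values.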