I am getting a time-out with the GET function from the httr package in R with these settings:
GET("https://isir.justice.cz/isir/common/index.do", add_headers(.headers = c('"authority"="isir.justice.cz",
"scheme"="https",
"path"="/isir/common/index.do",
"cache-control"="max-age=0",
"sec-ch-ua-mobile"="?0",
"sec-ch-ua-platform"= "Windows",
"upgrade-insecure-requests"="1",
"accept"="text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"sec-fetch-site"="none",
"sec-fetch-mode"="navigate",
"sec-fetch-user"="?1",
"sec-fetch-dest"="document",
"accept-encoding"="gzip, deflate, br",
"accept-language"="cs-CZ,cs;q=0.9"'
)))
But the seemingly identical query via PowerShell returns a webpage:
Invoke-WebRequest -UseBasicParsing -Uri "https://isir.justice.cz/isir/common/index.do" `
-WebSession $session `
-Headers @{
"method"="GET"
"authority"="isir.justice.cz"
"scheme"="https"
"path"="/isir/common/index.do"
"cache-control"="max-age=0"
"sec-ch-ua-mobile"="?0"
"sec-ch-ua-platform"="`"Windows`""
"upgrade-insecure-requests"="1"
"accept"="text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9"
"sec-fetch-site"="none"
"sec-fetch-mode"="navigate"
"sec-fetch-user"="?1"
"sec-fetch-dest"="document"
"accept-encoding"="gzip, deflate, br"
"accept-language"="cs-CZ,cs;q=0.9"
}
Do I have a problem with my R code, or is it simply a matter of differences between R and PowerShell?
Your code didn't run for me, as it had an extra ' somewhere; after correcting that, it ran fine. If you keep getting timeout messages, you can increase the maximum request time using timeout():
library(httr)
x <- GET("https://isir.justice.cz/isir/common/index.do", timeout(10), add_headers(
  .headers = c(
    "authority" = "isir.justice.cz",
    "scheme" = "https",
    "path" = "/isir/common/index.do",
    "cache-control" = "max-age=0",
    "sec-ch-ua-mobile" = "?0",
    "sec-ch-ua-platform" = "Windows",
    "upgrade-insecure-requests" = "1",
    "accept" = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "sec-fetch-site" = "none",
    "sec-fetch-mode" = "navigate",
    "sec-fetch-user" = "?1",
    "sec-fetch-dest" = "document",
    "accept-encoding" = "gzip, deflate, br",
    "accept-language" = "cs-CZ,cs;q=0.9"
  )
))
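You can then check whether the request succeeded and pull out the body. A minimal sketch using httr's accessors:
status_code(x)  # 200 if the request went through
html <- content(x, as = "text", encoding = "UTF-8")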
As a sidenote: there is a successor package by the same people called httr2. I'm also still using httr, but it's probably a good idea to learn the new package. Here is how that would look:
library(httr2)
req <- request("https://isir.justice.cz/isir/common/index.do") %>%
  req_headers(
    "authority" = "isir.justice.cz",
    "scheme" = "https",
    "path" = "/isir/common/index.do",
    "cache-control" = "max-age=0",
    "sec-ch-ua-mobile" = "?0",
    "sec-ch-ua-platform" = "Windows",
    "upgrade-insecure-requests" = "1",
    "accept" = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "sec-fetch-site" = "none",
    "sec-fetch-mode" = "navigate",
    "sec-fetch-user" = "?1",
    "sec-fetch-dest" = "document",
    "accept-encoding" = "gzip, deflate, br",
    "accept-language" = "cs-CZ,cs;q=0.9"
  ) %>%
  req_timeout(seconds = 10)
# check your request in a dry run
req %>%
req_dry_run()
#> GET /isir/common/index.do HTTP/1.1
#> Host: isir.justice.cz
#> User-Agent: httr2/0.1.1 r-curl/4.3.2 libcurl/7.80.0
#> authority: isir.justice.cz
#> scheme: https
#> path: /isir/common/index.do
#> cache-control: max-age=0
#> sec-ch-ua-mobile: ?0
#> sec-ch-ua-platform: Windows
#> upgrade-insecure-requests: 1
#> accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
#> sec-fetch-site: none
#> sec-fetch-mode: navigate
#> sec-fetch-user: ?1
#> sec-fetch-dest: document
#> accept-encoding: gzip, deflate, br
#> accept-language: cs-CZ,cs;q=0.9
resp <- req_perform(req)
resp
#> <httr2_response>
#> GET https://isir.justice.cz/isir/common/index.do
#> Status: 200 OK
#> Content-Type: text/html
#> Body: In memory (116916 bytes)
Created on 2022-01-03 by the reprex package (v2.0.1)
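Once req_perform() has succeeded, httr2's resp_body_*() accessors extract the content. A short sketch continuing from the resp object above:
html <- resp_body_string(resp)  # the raw HTML as a single string
page <- resp_body_html(resp)    # or parsed directly into an xml2 document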
Related
I am trying to take the following Postman GET request to the Microsoft Graph API and convert it into a Karate test:
https://graph.microsoft.com/v1.0/users/moo@moo.com/messages?$search="body:'979f13ea-5c87-45e3-98e2-7243d321b238'"
The issue I am having is how to handle the query parameters with the single quote inside the double quotes.
Try this:
* url 'https://httpbin.org/anything'
* param $search = `"body:'979f13ea-5c87-45e3-98e2-7243d321b238'"`
* method get
Actual request:
1 > GET https://httpbin.org/anything?%24search=%22body%3A%27979f13ea-5c87-45e3-98e2-7243d321b238%27%22
1 > Host: httpbin.org
1 > Connection: Keep-Alive
1 > User-Agent: Apache-HttpClient/4.5.14 (Java/17.0.5)
1 > Accept-Encoding: gzip,deflate
But, you can see from the server response that the data was encoded correctly:
1 < 200
1 < Date: Mon, 09 Jan 2023 18:52:15 GMT
1 < Content-Type: application/json
1 < Content-Length: 516
1 < Connection: keep-alive
1 < Server: gunicorn/19.9.0
1 < Access-Control-Allow-Origin: *
1 < Access-Control-Allow-Credentials: true
{
"args": {
"$search": "\"body:'979f13ea-5c87-45e3-98e2-7243d321b238'\""
},
"data": "",
"files": {},
"form": {},
"headers": {
"Accept-Encoding": "gzip,deflate",
"Host": "httpbin.org",
"User-Agent": "Apache-HttpClient/4.5.14 (Java/17.0.5)",
"X-Amzn-Trace-Id": "Root=1-63bc625f-36a4b2e92b1976b303454a8a"
},
"json": null,
"method": "GET",
"origin": "49.205.149.94",
"url": "https://httpbin.org/anything?%24search=\"body%3A'979f13ea-5c87-45e3-98e2-7243d321b238'\""
}
Using back-ticks gives you a nice option to dynamically change data:
* def id = '979f13ea-5c87-45e3-98e2-7243d321b238'
* param $search = `"body:'${id}'"`
Escaping the single-quote would also work:
* param $search = '"body:\'979f13ea-5c87-45e3-98e2-7243d321b238\'"'
Also see: https://stackoverflow.com/a/59977660/143475
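For comparison with the R answers elsewhere on this page: httr2 percent-encodes query parameters automatically when they are added through req_url_query(). A sketch against the same httpbin endpoint (not a Karate feature, just the equivalent request in R):
library(httr2)
id <- "979f13ea-5c87-45e3-98e2-7243d321b238"
request("https://httpbin.org/anything") %>%
  req_url_query(`$search` = sprintf("\"body:'%s'\"", id)) %>%
  req_dry_run()  # shows the encoded query string without sending the request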
Good evening everybody,
I am in a fight with Power Query. I am trying to receive an answer from a website in Power BI (Power Query). For obvious reasons I renamed passwords and such. When I run the code below, Power Query answers that:
Expression.Error: The content-length heading must be changed using the
appropriate property or method. Parameter name: name Details: 103
It does not matter if I use another number. I also tried to remove the header entirely, but that results in:
DataSource.Error: Web.Contents could not retrieve the content from
'https://xxxx.nl/api/token/' (500): Internal Server Error Details:
DataSourceKind=Web DataSourcePath=https://xxxx.nl/api/token
Url=https://xxxx.nl/api/token/
I must be missing something, but I cannot figure out what it is. Could you help me find it? Thanks in advance!
let
url = "https://xxxxx.nl/api/token/",
body = Text.ToBinary("{""""username"""":""""xxxx"""",""""password"""":""""xxxx"""",""""group"""":""""xxxx"""",""""deleteOtherSessions"""":false"),
Data = Web.Contents(
url,
[
Headers = [
#"authority" = "xxx.nl",
#"method" = "POST",
#"path" = "/api/Token",
#"scheme" = "https",
#"accept" = "application/json",
#"accept-encoding" = "gzip, deflate",
#"transfer-encoding" = "deflate",
#"accept-language" = "nl-NL,nl;q=0.7",
#"cache-control" = "no-cache",
#"content-length" = "103",
#"content-type" = "application/json",
#"expires" = "Sat, 01 Jan 2000 00:00:00 GMT",
#"origin" = "https://xxxx.nl",
#"pragma" = "no-cache",
#"referer" = "https://xxxxx.nl/",
#"sec-fetch-dest" = "empty",
#"sec-fetch-mode" = "cors",
#"sec-fetch-site" = "same-origin",
#"sec-gpc" = "1",
#"user-agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
],
Content = body
]
)
in
Data
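For comparison with the R examples above, a hedged sketch of the same token request in httr2: the JSON body is built from a list rather than hand-escaped quotes, and Content-Length is left to the HTTP stack, which computes it from the body (setting that header by hand appears to be what the first error message objects to). The host, credentials, and field names are the placeholders from the question:
library(httr2)
resp <- request("https://xxxx.nl/api/token/") %>%
  req_headers("accept" = "application/json") %>%
  req_body_json(list(
    username = "xxxx",
    password = "xxxx",
    group = "xxxx",
    deleteOtherSessions = FALSE
  )) %>%  # req_body_json() sets content-type and content-length itself
  req_perform()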
I'm attempting to scrape a JSON object. Referencing "Extract data from an unreachable JsonObject()", I'm able to get status code 200, but I'm not getting any results. I assume there is something wrong with how I'm constructing the query.
Code I have so far:
library(httr) # web scraping
library(jsonlite) #parsing json data
library(rvest) # web scraping
library(polite) # check robot.txt files
library(tidyverse) # data wrangling
library(curlconverter) # decode curl commands
library(urltools) # URL encoding
r <-
POST(
url = "https://vfm4x0n23a-dsn.algolia.net/1/indexes/*/queries" ,
add_headers(
#.headers=c(
#'Accept' = "*/*",
#'Accept-Encoding' = 'gzip, deflate, br',
#'Accept-Language' = "en-US,en;q=0.9",
#'Cache-Control' = "no-cache",
#'Connection' = "keep-alive",
#'Content-Length' = '450',
#'Content-Type' = 'application/x-www-form-urlencoded',
#'Host' = 'vfm4x0n23a-dsn.algolia.net',
#'Origin' = "https://www.iheartjane.com",
#'Pragma' = "no-cache",
#'Referer' = "https://www.iheartjane.com/",
#'sec-ch-ua' = '".Not/A)Brand"";v=99""Google Chrome";v="103",Chromium";v="103"',
#'sec-ch-ua-mobile' = "?0",
#'sec-ch-ua-platform' = "Windows",
#'Sec-Fetch-Dest' = "empty",
#'Sec-Fetch-Mode' = "cors",
#'Sec-Fetch-Site' = "cross-site",
'User-Agent' = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"#,
#'x-algolia-api-key' = "b499e29eb7542dc373ec0254e007205d",
#'x-algolia-application-id' = "VFM4X0N23A"
#)
),
config = list(
'x-algolia-agent' = 'Algolia for JavaScript (4.13.0); Browser; JS Helper (3.7.4); react (16.14.0); react-instantsearch (6.23.1)',
'x-algolia-application-id' = 'VFM4X0N23A',
'x-algolia-api-key' = 'b499e29eb7542dc373ec0254e007205d'
),
body = FALSE,
encode = 'form',
query = '{"requests":[{"indexName":"menu-products-production","params":"query=highlightPreTag=%3Cais-highlight-0000000000%3E&highlightPostTag=%3C%2Fais-highlight-0000000000%3E&page=0&hitsPerPage=48&filters=store_id%20%3D%201641%20AND%20kind%3A%22sale%22%20OR%20root_types%3A%22sale%22&optionalFilters=brand%3AVerano%2Cbrand%3AAvexia%2Cbrand%3AEncore%20Edibles%2Croot_types%3AFeatured%2Croot_types%3ASale&userToken=Zu0iU4Uo2whpmqNBjUGOJ&facets=%5B%5D&tagFilters="}]}',
verbose() )
json <- content(r, type = "application/json")
(Screenshots of the request payload and query parameters, and of the website itself, are not reproduced here.)
How can I restructure my code to send the query correctly?
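One plausible restructuring, sketched below but not verified against the live API: in httr, query must be a named list that is appended to the URL (the x-algolia-* values belong there), while the JSON payload goes in body as a raw string. The application ID, API key, and payload are copied from the question, with the long params string trimmed for readability:
library(httr)

payload <- '{"requests":[{"indexName":"menu-products-production","params":"query=&page=0&hitsPerPage=48"}]}'

r <- POST(
  url = "https://vfm4x0n23a-dsn.algolia.net/1/indexes/*/queries",
  query = list(
    `x-algolia-agent` = "Algolia for JavaScript (4.13.0); Browser",
    `x-algolia-application-id` = "VFM4X0N23A",
    `x-algolia-api-key` = "b499e29eb7542dc373ec0254e007205d"
  ),
  body = payload,
  encode = "raw",
  content_type("application/x-www-form-urlencoded"),
  user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
)
json <- content(r, type = "application/json")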
I am trying to scrape a table of recent events from the following link: https://www.tapology.com/fightcenter
When visiting the link, the table shows upcoming events, so you have to click under schedule and change the option to "result".
I have scraped what appears to be the raw data below in the variable resp, but I don't know what language that code is written in and don't know how to parse it.
library(httr)
url <- paste0("https://www.tapology.com/fightcenter_events")
fd <- list(
group = "all",
region = "",
schedule = "results",
sport = "all"
)
postdata <- POST(url = url, query = fd, encode = "form",
add_headers(
"Accept" = "text/javascript, application/javascript, application/ecmascript, application/x-ecmascript, */*; q=0.01",
"Content-Type" = "application/x-www-form-urlencoded; charset=UTF-8",
"Cookie" = "_ga=GA1.2.1873043703.1537368153; __utmz=88071069.1563301531.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); remember_id=149246; remember_token=315e68b7a95fa6cda391fc3e2ae0e1fb1466335ed9a15480558bd4ef8d52d832; __utmc=88071069; __utma=88071069.1873043703.1537368153.1563983348.1563985208.3; _tapology_mma_session=Z2RWaU1XZ0hOQmIwcUhjN1Bac0twN0JZQktnVUlLUjVsVkdMMDR4bTBITGdnSDFlRW9WeHprQ2lRaWdJM0lRbW5PNTFYSG9kbVlaMWFlR3liZmEyZWhnRWVVNm03UVIwRUJLWHl1MmJXRlQ1dEFJTGJsTnVLQWx4MWpUMTJOYlBxQ1N1Y0pQREZlZTNzMDA0NTJINEpLS2FMNXZvaXZjQ3g2dFMzM1dJeTRmekc4TG5JTk9YZDlZdWx5WnpZd3luZlY1ZXliQ0RWS1B1aXJYQnpqVVp4UT09LS10am5XNVI0c0pXa2p1dHJ5OW9PME5nPT0%3D--7488fef85f733279f15da594ea47f0345aa16938",
"Host" = "www.tapology.com",
"Origin" = "https://www.tapology.com",
"User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36",
"Referer" = "https://www.tapology.com/fightcenter",
"X-CSRF-Token" = "NS9M1Y5RMShdIfFaIKpYiqr+JuOZ8kwZvn9KSW7daZmgT9eJ4Q0ZyGLZSUHR4wjCdiE840HcQzLHHZSe0WgVJw==",
"X-Requested-With" = "XMLHttpRequest"
)
)
resp <- content(postdata, "text")
substr(resp, 1, 200)
[1] "$(\".fightcenterEvents\").html(\"<h3>\\n<span>Event Results<\\/span>\\n<span class=\\'moreLink\\'> <nav class=\\\"pagination\\\" role=\\\"navigation\\\" aria-label=\\\"pager\\\">\\n \\n \\n <span class=\\\"page "
I'm trying to query QPX Express (Google) from R (httr), but for whatever reason I'm getting 0 results. This is my query:
x <- list(
request = list(
slice = list(c(origin = "BOS", destination = "LAX", date = "2014-07-29")),
passengers = c(adultCount = 1, infantInLapCount = 0, infantInSeatCount = 0,
childCount = 0, seniorCount = 0),
solutions = 10,
refundable = "false")
)
And this is the command:
POST("https://www.googleapis.com/qpxExpress/v1/trips/search?key=MY_KEY",
body = toJSON(x), add_headers("Content-Type" = "application/json"), verbose(),
add_headers(Expect = ""))
Finally, the response from Google:
* About to connect() to www.googleapis.com port 443 (#0)
* Trying 173.194.66.95... * connected
* Connected to www.googleapis.com (173.194.66.95) port 443 (#0)
* successfully set certificate verify locations:
* CAfile: C:/Users/XXXX/Documents/R/win-library/3.1/httr/cacert.pem
CApath: none
* SSL re-using session ID
* SSL connection using ECDHE-RSA-RC4-SHA
* Server certificate:
* subject: C=US; ST=California; L=Mountain View; O=Google Inc; CN=*.googleapis.com
* start date: 2014-07-02 13:35:47 GMT
* expire date: 2014-09-30 00:00:00 GMT
* subjectAltName: www.googleapis.com matched
* issuer: C=US; O=Google Inc; CN=Google Internet Authority G2
* SSL certificate verify ok.
> POST /qpxExpress/v1/trips/search?key=MY_KEY HTTP/1.1
Host: www.googleapis.com
Accept: */*
Accept-Encoding: gzip
user-agent: curl/7.19.6 Rcurl/1.95.4.1 httr/0.3
Content-Type: application/json
Content-Length: 220
< HTTP/1.1 200 OK
< Cache-Control: no-cache, no-store, max-age=0, must-revalidate
< Pragma: no-cache
< Expires: Fri, 01 Jan 1990 00:00:00 GMT
< Date: Mon, 21 Jul 2014 10:39:46 GMT
< ETag: "FHaaT3rgbj6tTc1zJmPkVQ6bD-8/wa9h__cUdEwRE2bp0yW5NTA6fec"
< Content-Type: application/json; charset=UTF-8
< Content-Encoding: gzip
< X-Content-Type-Options: nosniff
< X-Frame-Options: SAMEORIGIN
< X-XSS-Protection: 1; mode=block
< Server: GSE
< Alternate-Protocol: 443:quic
< Transfer-Encoding: chunked
<
* Connection #0 to host www.googleapis.com left intact
Response [https://www.googleapis.com/qpxExpress/v1/trips/search?key=MY_KEY]
Status: 200
Content-type: application/json; charset=UTF-8
{
"kind": "qpxExpress#tripsSearch",
"trips": {
"kind": "qpxexpress#tripOptions",
"requestId": "UTlu4NDcLz3Ypcicp0KKI3",
"data": {
"kind": "qpxexpress#data",
"airport": [
{
"kind": "qpxexpress#airportData", ...
Has anybody had any luck with this?
Thanks a bunch!
Carlos
In case anybody is interested, these are the answers from Duncan and Hadley:
The postForm command, run from Windows, would be as follows:
library(RCurl)
library(RJSONIO)
x <- list(
request = list(
passengers = list(
adultCount = 1
),
slice = list(
list(
origin = "BOS",
destination = "LAX",
date = "2014-07-29"
)
),
refundable = "false",
solutions = 10
)
)
postForm("https://www.googleapis.com/qpxExpress/v1/trips/search?key=my_KEY",
.opts = list(postfields = toJSON(x),
cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"),
httpheader = c('Content-Type' = 'application/json',
Accept = 'application/json'),
verbose = TRUE
))
You can omit
cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")
if you are running on a Linux machine.
If you would rather use httr, here is the command:
library(httr)
url <- "https://www.googleapis.com/qpxExpress/v1/trips/search"
json <- jsonlite::toJSON(x, auto_unbox = TRUE)
POST(url, query = list(key = my_Key), body = json, content_type_json())
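To check the result and parse the JSON body, a short sketch continuing from that call:
r <- POST(url, query = list(key = my_Key), body = json, content_type_json())
stop_for_status(r)  # raise an R error on any non-2xx response
trips <- content(r, type = "application/json")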
Cheers.
Carlos