This is my code; the last request gets a 500 error most of the time. Can somebody explain to me why this is happening? I'm a newbie here!
headers = {"Accept-Encoding": "gzip, deflate, sdch, br",
"Accept-Language": "en-GB,en-US;q=0.8,en;q=0.6",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
"Connection": "keep-alive --compressed"}
url_ = "https://www.apct.gov.in/apportal/index.aspx"
r = requests.get(url_,headers=headers)
#open("/tmp/1.html","w").write(r.content)
c1 = r.cookies
headers = {
'Accept-Language': 'en-GB,en-US;q=0.8,en;q=0.6',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Referer': 'https://www.apct.gov.in/apportal/index.html',
'Connection': 'keep-alive'}  # "--compressed" is a curl flag, not part of the header value
url = "https://www.apct.gov.in/apportal/index.aspx"
r1 = requests.get(url,headers=headers,cookies=c1)
c2 = r1.cookies
# f = open("/tmp/1.html","w+")
# f.write(r.content)
url = "https://www.apct.gov.in/apportal/Search/ViewAPVATdealers.aspx"
time.sleep(2.5)
headers = { 'Accept-Encoding': 'gzip, deflate, sdch, br',
'Accept-Language': 'en-GB,en-US;q=0.8,en;q=0.6',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Referer': 'https://www.apct.gov.in/apportal/index.html',
'Connection': 'keep-alive'}
#r = requests.get(url,headers=headers,cookies=c2,timeout=60,verify=True)
# certifi.old_where() is deprecated and removed in recent certifi releases;
# fall back to the default certificate bundle instead.
r = requests.get(url, headers=headers, cookies=c2, timeout=60, verify=True)
soup = BeautifulSoup(r.content, "html.parser")
try:
    view_state = soup.find('input', attrs={'id': '__VIEWSTATE'}).get("value")
except AttributeError:
    # __VIEWSTATE was missing from the response; wait and retry once
    time.sleep(2.5)
    r = requests.get('https://www.apct.gov.in/apportal/Search/ViewAPVATdealers.aspx', headers=headers, cookies=c2, timeout=60, verify=True)
    #pdb.set_trace()
    # r = requests.get(url,headers=headers,cookies=c2,timeout=60, verify=False)
c3 = r.cookies
soup = BeautifulSoup(r.content, "html.parser")
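For reference, here is a minimal sketch of the same flow using requests.Session, which carries cookies between requests automatically. It assumes the intermittent 500 comes from missing session cookies / ASP.NET session state rather than from the server itself, so treat it as something to try, not a confirmed fix.
import time
import requests
from bs4 import BeautifulSoup

session = requests.Session()  # keeps cookies across all requests below
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
})

session.get("https://www.apct.gov.in/apportal/index.aspx", timeout=60)  # picks up the session cookies
time.sleep(2.5)
resp = session.get(
    "https://www.apct.gov.in/apportal/Search/ViewAPVATdealers.aspx",
    headers={"Referer": "https://www.apct.gov.in/apportal/index.html"},
    timeout=60,
)
soup = BeautifulSoup(resp.content, "html.parser")
view_state_tag = soup.find("input", attrs={"id": "__VIEWSTATE"})
view_state = view_state_tag.get("value") if view_state_tag else None
print(resp.status_code, bool(view_state))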
Good evening everybody,
I am in a fight with Power Query. I am trying to receive an answer from a website in Power BI (Power Query). For obvious reasons I renamed passwords and such. When I run the code below, Power Query answers with:
Expression.Error: The content-length heading must be changed using the
appropriate property or method. Parameter name: name Details: 103
It does not matter if I use another number. I also tried removing the header entirely, but that results in:
DataSource.Error: Web.Contents could not retrieve the content from
'https://xxxx.nl/api/token/' (500): Internal Server Error Details:
DataSourceKind=Web DataSourcePath=https://xxxx.nl/api/token
Url=https://xxxx.nl/api/token/
I am clearly missing something, but I cannot figure out what it is. Can you spot it? Thanks in advance!
let
url = "https://xxxxx.nl/api/token/",
body = Text.ToBinary("{""username"":""xxxx"",""password"":""xxxx"",""group"":""xxxx"",""deleteOtherSessions"":false}"),
Data = Web.Contents(
url,
[
Headers = [
#"authority" = "xxx.nl",
#"method" = "POST",
#"path" = "/api/Token",
#"scheme" = "https",
#"accept" = "application/json",
#"accept-encoding" = "gzip, deflate",
#"transfer-encoding" = "deflate",
#"accept-language" = "nl-NL,nl;q=0.7",
#"cache-control" = "no-cache",
#"content-length" = "103",
#"content-type" = "application/json",
#"expires" = "Sat, 01 Jan 2000 00:00:00 GMT",
#"origin" = "https://xxxx.nl",
#"pragma" = "no-cache",
#"referer" = "https://xxxxx.nl/",
#"sec-fetch-dest" = "empty",
#"sec-fetch-mode" = "cors",
#"sec-fetch-site" = "same-origin",
#"sec-gpc" = "1",
#"user-agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
],
Content = body
]
)
in
Data
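Web.Contents calculates Content-Length from the binary body itself, which is why that header is rejected when set by hand; headers such as method, path, scheme and authority are HTTP/2 pseudo-headers copied from the browser's DevTools and normally should not be sent manually either. For comparison, here is a hedged sketch of the same token request in Python with the requests library, keeping the placeholder URL and credentials from the question; no Content-Length is set because the client derives it from the body.
import requests

payload = {
    "username": "xxxx",            # placeholders, as in the question
    "password": "xxxx",
    "group": "xxxx",
    "deleteOtherSessions": False,
}
resp = requests.post(
    "https://xxxx.nl/api/token/",  # placeholder URL from the question
    json=payload,                  # serialised to JSON; Content-Length is computed automatically
    headers={"Accept": "application/json"},
    timeout=30,
)
print(resp.status_code, resp.text)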
I'm attempting to scrape a JSON object. Referencing "Extract data from an unreachable JsonObject()", I'm able to get status code 200, but I'm not getting any results. I assume there is something wrong with how I'm constructing the query.
Code I have so far:
library(httr) # web scraping
library(jsonlite) #parsing json data
library(rvest) # web scraping
library(polite) # check robot.txt files
library(tidyverse) # data wrangling
library(curlconverter) # decode curl commands
library(urltools) # URL encoding
r <-
POST(
url = "https://vfm4x0n23a-dsn.algolia.net/1/indexes/*/queries" ,
add_headers(
#.headers=c(
#'Accept' = "*/*",
#'Accept-Encoding' = 'gzip, deflate, br',
#'Accept-Language' = "en-US,en;q=0.9",
#'Cache-Control' = "no-cache",
#'Connection' = "keep-alive",
#'Content-Length' = '450',
#'Content-Type' = 'application/x-www-form-urlencoded',
#'Host' = 'vfm4x0n23a-dsn.algolia.net',
#'Origin' = "https://www.iheartjane.com",
#'Pragma' = "no-cache",
#'Referer' = "https://www.iheartjane.com/",
#'sec-ch-ua' = '".Not/A)Brand"";v=99""Google Chrome";v="103",Chromium";v="103"',
#'sec-ch-ua-mobile' = "?0",
#'sec-ch-ua-platform' = "Windows",
#'Sec-Fetch-Dest' = "empty",
#'Sec-Fetch-Mode' = "cors",
#'Sec-Fetch-Site' = "cross-site",
'User-Agent' = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"#,
#'x-algolia-api-key' = "b499e29eb7542dc373ec0254e007205d",
#'x-algolia-application-id' = "VFM4X0N23A"
#)
),
config = list(
'x-algolia-agent' = 'Algolia for JavaScript (4.13.0); Browser; JS Helper (3.7.4); react (16.14.0); react-instantsearch (6.23.1)',
'x-algolia-application-id' = 'VFM4X0N23A',
'x-algolia-api-key' = 'b499e29eb7542dc373ec0254e007205d'
),
body = FALSE,
encode = 'form',
query = '{"requests":[{"indexName":"menu-products-production","params":"query=highlightPreTag=%3Cais-highlight-0000000000%3E&highlightPostTag=%3C%2Fais-highlight-0000000000%3E&page=0&hitsPerPage=48&filters=store_id%20%3D%201641%20AND%20kind%3A%22sale%22%20OR%20root_types%3A%22sale%22&optionalFilters=brand%3AVerano%2Cbrand%3AAvexia%2Cbrand%3AEncore%20Edibles%2Croot_types%3AFeatured%2Croot_types%3ASale&userToken=Zu0iU4Uo2whpmqNBjUGOJ&facets=%5B%5D&tagFilters="}]}',
verbose() )
json <- content(r, type = "application/json")
The payload & query parameters:
Payload
The website
How can I restructure my code to send the query correctly?
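For comparison, here is a hedged sketch of the same request in Python with the requests library: the JSON document goes in the POST body rather than in the URL query string, and the Algolia application id and API key travel as headers. The endpoint and credentials are taken from the question, while the trimmed-down params string is an assumption for illustration; the analogous change in httr would be to pass the JSON string as body = ... instead of query = ..., but treat that as an assumption, not a verified answer.
import requests

url = "https://vfm4x0n23a-dsn.algolia.net/1/indexes/*/queries"
headers = {
    "x-algolia-application-id": "VFM4X0N23A",
    "x-algolia-api-key": "b499e29eb7542dc373ec0254e007205d",
    "Content-Type": "application/json",
}
# The JSON document from the question is the request body, not a URL parameter.
payload = {
    "requests": [{
        "indexName": "menu-products-production",
        "params": "query=&page=0&hitsPerPage=48&filters=store_id%20%3D%201641",
    }]
}
resp = requests.post(url, headers=headers, json=payload, timeout=30)
print(resp.status_code)
print(list(resp.json().keys()))   # expect a "results" key when the query is accepted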
I am getting a time-out with the GET function from the httr package in R with these settings:
GET("https://isir.justice.cz/isir/common/index.do", add_headers(.headers = c('"authority"="isir.justice.cz",
"scheme"="https",
"path"="/isir/common/index.do",
"cache-control"="max-age=0",
"sec-ch-ua-mobile"="?0",
"sec-ch-ua-platform"= "Windows",
"upgrade-insecure-requests"="1",
"accept"="text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"sec-fetch-site"="none",
"sec-fetch-mode"="navigate",
"sec-fetch-user"="?1",
"sec-fetch-dest"="document",
"accept-encoding"="gzip, deflate, br",
"accept-language"="cs-CZ,cs;q=0.9"'
)))
But the seemingly identical query via PowerShell returns a webpage.
Invoke-WebRequest -UseBasicParsing -Uri "https://isir.justice.cz/isir/common/index.do" `
-WebSession $session `
-Headers @{
"method"="GET"
"authority"="isir.justice.cz"
"scheme"="https"
"path"="/isir/common/index.do"
"cache-control"="max-age=0"
"sec-ch-ua-mobile"="?0"
"sec-ch-ua-platform"="`"Windows`""
"upgrade-insecure-requests"="1"
"accept"="text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9"
"sec-fetch-site"="none"
"sec-fetch-mode"="navigate"
"sec-fetch-user"="?1"
"sec-fetch-dest"="document"
"accept-encoding"="gzip, deflate, br"
"accept-language"="cs-CZ,cs;q=0.9"
}
Do I have a problem with my R code, or is it simply a matter of the difference between using R and PowerShell?
Your code didn't run for me because it had an extra ' wrapping the header vector. After correcting this, it ran fine. If you keep getting timeout messages, you can increase the maximum request time using timeout():
library(httr)
x <- GET("https://isir.justice.cz/isir/common/index.do", timeout(10), add_headers(
.headers = c("authority" = "isir.justice.cz",
"scheme" = "https",
"path" = "/isir/common/index.do",
"cache-control" = "max-age=0",
"sec-ch-ua-mobile" = "?0",
"sec-ch-ua-platform" = "Windows",
"upgrade-insecure-requests" = "1",
"accept" = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"sec-fetch-site" = "none",
"sec-fetch-mode" = "navigate",
"sec-fetch-user" = "?1",
"sec-fetch-dest" = "document",
"accept-encoding" = "gzip, deflate, br",
"accept-language" = "cs-CZ,cs;q=0.9")
))
As a side note: there is a successor package by the same people called httr2. I'm still using httr myself, but it's probably a good idea to learn the new package. Here is how that would look:
library(httr2)
req <- request("https://isir.justice.cz/isir/common/index.do") %>%
req_headers("authority" = "isir.justice.cz",
"scheme" = "https",
"path" = "/isir/common/index.do",
"cache-control" = "max-age=0",
"sec-ch-ua-mobile" = "?0",
"sec-ch-ua-platform" = "Windows",
"upgrade-insecure-requests" = "1",
"accept" = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"sec-fetch-site" = "none",
"sec-fetch-mode" = "navigate",
"sec-fetch-user" = "?1",
"sec-fetch-dest" = "document",
"accept-encoding" = "gzip, deflate, br",
"accept-language" = "cs-CZ,cs;q=0.9") %>%
req_timeout(seconds = 10)
# check your request in a dry run
req %>%
req_dry_run()
#> GET /isir/common/index.do HTTP/1.1
#> Host: isir.justice.cz
#> User-Agent: httr2/0.1.1 r-curl/4.3.2 libcurl/7.80.0
#> authority: isir.justice.cz
#> scheme: https
#> path: /isir/common/index.do
#> cache-control: max-age=0
#> sec-ch-ua-mobile: ?0
#> sec-ch-ua-platform: Windows
#> upgrade-insecure-requests: 1
#> accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
#> sec-fetch-site: none
#> sec-fetch-mode: navigate
#> sec-fetch-user: ?1
#> sec-fetch-dest: document
#> accept-encoding: gzip, deflate, br
#> accept-language: cs-CZ,cs;q=0.9
resp <- req_perform(req)
resp
#> <httr2_response>
#> GET https://isir.justice.cz/isir/common/index.do
#> Status: 200 OK
#> Content-Type: text/html
#> Body: In memory (116916 bytes)
Created on 2022-01-03 by the reprex package (v2.0.1)
In this program I'm trying to get all the prices of rentals in Ottawa, but it returns only one price, which is random each time. Why?
import scrapy

class RentalPricesSpider(scrapy.Spider):
    name = 'rental_prices'
    allowed_domains = ['www.kijiji.ca']
    start_urls = ['https://www.kijiji.ca/b-real-estate/ottawa/c34l1700185']

    def parse(self, response):
        rental_price = response.xpath('normalize-space(//div[@class="price"]/text())').getall()
        yield {
            'rent': rental_price,
        }
The XPath you selected is the problem, which is why you are not getting the expected output: wrapping the expression in normalize-space() collapses the node-set to a single string, so only one price comes back. Use the CSS selector div.price::text instead of that XPath.
import scrapy
from scrapy.crawler import CrawlerProcess

class RentalPricesSpider(scrapy.Spider):
    name = 'rental_prices'
    allowed_domains = ['www.kijiji.ca']
    start_urls = ['https://www.kijiji.ca/b-real-estate/ottawa/c34l1700185']

    def parse(self, response):
        rental_price = response.css('div.price::text').getall()
        rental_price = [x.strip() for x in rental_price if x.strip()]
        yield {
            'rent': rental_price,
        }

process = CrawlerProcess(settings={
    "USER_AGENT": "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36",
    "FEEDS": {
        "items.json": {"format": "json"},
    },
})
process.crawl(RentalPricesSpider)
process.start()
I am trying to scrape a table of recent events from the following link: https://www.tapology.com/fightcenter
When visiting the link, the table shows upcoming events, so you have to click under schedule and change the option to "result".
I have scraped what appears to be the raw data below in the variable resp, but I don't know what language that code is written in and don't know how to parse it.
library(httr)
url <- paste0("https://www.tapology.com/fightcenter_events")
fd <- list(
group = "all",
region = "",
schedule = "results",
sport = "all"
)
postdata <- POST(url = url, query = fd, encode = "form",
add_headers(
"Accept" = "text/javascript, application/javascript, application/ecmascript, application/x-ecmascript, */*; q=0.01",
"Content-Type" = "application/x-www-form-urlencoded; charset=UTF-8",
"Cookie" = "_ga=GA1.2.1873043703.1537368153; __utmz=88071069.1563301531.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); remember_id=149246; remember_token=315e68b7a95fa6cda391fc3e2ae0e1fb1466335ed9a15480558bd4ef8d52d832; __utmc=88071069; __utma=88071069.1873043703.1537368153.1563983348.1563985208.3; _tapology_mma_session=Z2RWaU1XZ0hOQmIwcUhjN1Bac0twN0JZQktnVUlLUjVsVkdMMDR4bTBITGdnSDFlRW9WeHprQ2lRaWdJM0lRbW5PNTFYSG9kbVlaMWFlR3liZmEyZWhnRWVVNm03UVIwRUJLWHl1MmJXRlQ1dEFJTGJsTnVLQWx4MWpUMTJOYlBxQ1N1Y0pQREZlZTNzMDA0NTJINEpLS2FMNXZvaXZjQ3g2dFMzM1dJeTRmekc4TG5JTk9YZDlZdWx5WnpZd3luZlY1ZXliQ0RWS1B1aXJYQnpqVVp4UT09LS10am5XNVI0c0pXa2p1dHJ5OW9PME5nPT0%3D--7488fef85f733279f15da594ea47f0345aa16938",
"Host" = "www.tapology.com",
"Origin" = "https://www.tapology.com",
"User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36",
"Referer" = "https://www.tapology.com/fightcenter",
"X-CSRF-Token" = "NS9M1Y5RMShdIfFaIKpYiqr+JuOZ8kwZvn9KSW7daZmgT9eJ4Q0ZyGLZSUHR4wjCdiE840HcQzLHHZSe0WgVJw==",
"X-Requested-With" = "XMLHttpRequest"
)
)
resp <- content(postdata, "text")
substr(resp, 1, 200)
[1] "$(\".fightcenterEvents\").html(\"<h3>\\n<span>Event Results<\\/span>\\n<span class=\\'moreLink\\'> <nav class=\\\"pagination\\\" role=\\\"navigation\\\" aria-label=\\\"pager\\\">\\n \\n \\n <span class=\\\"page "