I'm trying to do a simple GET request, but no matter how I configure the headers I keep getting a 403 response. The page loads fine in a browser. No login is required, and there are no tracked cookies either. The link I'm trying to get a response from is below, followed by my simple code.
https://i7.sportsdatabase.com/nba/query.json?sdql=50+%3C+Kobe+Bryant%3Apoints+and+site%3Daway&sport=nba
import requests

url = 'https://i7.sportsdatabase.com/nba/query.json?sdql=50+%3C+Kobe+Bryant%3Apoints+and+site%3Daway&sport=nba'
headers = {
'Host': 'i7.sportsdatabase.com',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
}
r = requests.get(url, headers)
I'm not seeing any other headers that need adding to the request. The full in-browser request headers are below:
Host: i7.sportsdatabase.com
Connection: keep-alive
Cache-Control: max-age=0
sec-ch-ua: "Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"
sec-ch-ua-mobile: ?0
DNT: 1
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Sec-Fetch-Dest: document
Accept-Encoding: gzip, deflate, br
Accept-Language: en,en-US;q=0.9,it;q=0.8,es;q=0.7
If-None-Match: "be833f0fb26eb81487fc09e05c85ac8c8646fc7b"
Try:
Passing the headers with the headers= keyword argument: in requests.get(url, headers), the second positional argument is params, so the dictionary is sent as query parameters instead of request headers.
Adding the Accept headers.
This works:
import requests

url = 'https://i7.sportsdatabase.com/nba/query.json?sdql=50+%3C+Kobe+Bryant%3Apoints+and+site%3Daway&sport=nba'
headers = {
'Host': 'i7.sportsdatabase.com',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Accept-Encoding': 'gzip, deflate, br',
}
r = requests.get(url, headers=headers)
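For reference, a minimal sketch of what goes wrong with the positional call, using requests' own Request/prepare machinery (example.com is a placeholder):
import requests

# the second positional parameter of requests.get() is params, so a dict
# passed positionally is appended to the query string, not sent as headers
req = requests.Request('GET', 'https://example.com', params={'User-Agent': 'test'})
print(req.prepare().url)  # https://example.com/?User-Agent=test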
Try using requests.Session():
import requests
s = requests.Session()
headers = {
'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Mobile Safari/537.36',
}
s.get('https://i7.sportsdatabase.com/nba/trends', headers=headers)  # warm-up request: any cookies set here stay on the session
url = 'https://i7.sportsdatabase.com/nba/query.json?sdql=50+%3C+Kobe+Bryant%3Apoints+and+site%3Daway&sport=nba'
r = s.get(url, headers=headers)
print(r)
Output:
<Response [200]>
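Why the warm-up request to /nba/trends helps isn't documented by the site; a plausible (assumed) explanation is that the HTML page sets cookies which the session then replays on the JSON call. A quick way to inspect that, reusing s and r from above (and assuming query.json really returns a JSON body):
print(s.cookies.get_dict())  # cookies picked up by the warm-up request, sent automatically afterwards
data = r.json()              # parse the JSON body instead of scraping text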
Related
I am trying to scrape the following page:
https://apps.fcc.gov/oetcf/tcb/reports/Tcb731GrantForm.cfm?mode=COPY&RequestTimeout=500&tcb_code=&application_id=ll686%2BwlPnFzHQb6tru2vw%3D%3D&fcc_id=QDS-BRCM1095
import requests

url = 'https://apps.fcc.gov/oetcf/tcb/reports/Tcb731GrantForm.cfm?mode=COPY&RequestTimeout=500&tcb_code=&application_id=ll686%2BwlPnFzHQb6tru2vw%3D%3D&fcc_id=QDS-BRCM1095'
headers_initial = {
'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Mobile Safari/537.36',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'accept-language': 'en-US,en;q=0.9,de;q=0.8',
'cache-control': 'max-age=0',
'upgrade-insecure-requests': '1',
}
r = requests.get(url, timeout=100, headers=headers_initial)
print(r.status_code)
print(r.headers)
print(r.text)
The status code I get is 400, and the requests.get call hangs. I would be very appreciative of any help someone can provide.
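One note on the hang, separate from the 400: requests accepts a (connect, read) timeout tuple, so a stalled response fails fast instead of blocking. A minimal sketch with the same url and headers_initial as above:
r = requests.get(url, timeout=(5, 30), headers=headers_initial)  # give up after 5 s connecting or 30 s waiting for data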
I am trying to scrape a website using a package in R.
When I run the following:
library(idealisto) #https://github.com/hmeleiro/idealisto
get_city("https://www.idealista.com/alquiler-viviendas/madrid-madrid/", "sale")
I get:
Error in read_html.response(.) : Forbidden (HTTP 403).
Looking into the details of the get_city() function, I find that the problem is with the following part of the code:
desktop_agents <- c("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0.1 Safari/602.2.14",
"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0")
url = "https://www.idealista.com/en/venta-viviendas/madrid-provincia/"
x <- GET(url, add_headers(`user-agent` = desktop_agents[sample(1:10, 1)]))
Which returns the following output:
Response [https://www.idealista.com/en/venta-viviendas/madrid-provincia/]
  Date: 2022-04-04 18:52
  Status: 403
  Content-Type: application/json;charset=utf-8
  Size: 360 B
However, I should be getting Status: 200. I tried passing some headers manually, but I still get the same status error:
headers = c(
'accept' = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'accept-encoding' = 'gzip, deflate, br',
'accept-language' = 'es-ES,es;q=0.9,en;q=0.8',
'cache-control' = 'max-age=0',
'referer' = 'https://www.idealista.com/en/',
'sec-fetch-mode' = 'navigate',
'sec-fetch-site' = 'same-origin',
'sec-fetch-user' = '?1',
'upgrade-insecure-requests' = '1',
'user-agent' = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36'
)
url = "https://www.idealista.com/en/venta-viviendas/madrid-provincia/"
x <- GET(url, add_headers(headers))
Any idea how I can get around this Status error?
Your syntax for add_headers() is wrong. You can't pass a named vector as a single argument; you have to pass the headers directly as arguments to the function (or via its .headers parameter):
library(httr)
headers <- add_headers(
'accept' = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'accept-encoding' = 'gzip, deflate, br',
'accept-language' = 'es-ES,es;q=0.9,en;q=0.8',
'cache-control' = 'max-age=0',
'referer' = 'https://www.idealista.com/en/',
'sec-fetch-mode' = 'navigate',
'sec-fetch-site' = 'same-origin',
'sec-fetch-user' = '?1',
'upgrade-insecure-requests' = '1',
'user-agent' = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36'
)
url = "https://www.idealista.com/en/venta-viviendas/madrid-provincia/"
GET(url, headers)
#> Response [https://www.idealista.com/en/venta-viviendas/madrid-provincia/]
#> Date: 2022-04-04 19:10
#> Status: 200
#> Content-Type: text/html; charset=UTF-8
#> Size: 263 kB
#> <!DOCTYPE html>
#> <html lang="en" env="es" username="" data-userauth="false" class="">
#> <head>
#> <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
#> <title>Property for sale in Madrid province, Spain: houses and flats — ...
#> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
#> <meta name="description" content="37,980 houses and flats for sale in Madrid,...
#> <meta name="author" content="idealista.com">
#> <meta http-equiv="cleartype" content="on">
#> <meta name="pragma" content="no-cache"/>
#> ...
Created on 2022-04-04 by the reprex package (v2.0.1)
I am trying to log in to this website using the following snippet of code:
from bs4 import BeautifulSoup as bs
import requests
URL='https://app.acvauctions.com/login/'
HEADERS = { 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36',
'origin': 'https://app.acvauctions.com',
'referer': 'https://app.acvauctions.com/'
}
s = requests.session()
login_payload = { 'email': "****", 'password': "****", 'web': 'true' }
login_req = s.post(URL, headers=HEADERS, data=login_payload, allow_redirects=True)
The request headers that I see when I log in with the browser are the following:
accept: application/json, text/plain, */*
accept-encoding: gzip, deflate, br
accept-language: en-GB,en;q=0.9,fa-IR;q=0.8,fa;q=0.7,en-US;q=0.6
content-length: 67
content-type: application/json
newrelic: eyJ2IjpbMCwxXSwiZCI6eyJ0eSI6IkJyb3dzZXIiLCJhYyI6IjE2NTA3NDUiLCJhcCI6IjE3ODI2MTk5MiIsImlkIjoiMGIzZjc3MmQxNGI0MWI5YSIsInRyIjoiMTc0MGVlNDA4NTE0MzA1YTBkNWU4NTJkODRlZTMxNzAiLCJ0aSI6MTYyNzUxODY1MDA0N319
origin: https://app.acvauctions.com
referer: https://app.acvauctions.com/
sec-ch-ua: "Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"
sec-ch-ua-mobile: ?0
sec-fetch-dest: empty
sec-fetch-mode: cors
sec-fetch-site: same-site
traceparent: 00-592fab3e40957cc0ffa37c47c8914000-c656f42dbf8336a1-01
tracestate: 1650745#nr=0-1-1650745-178261992-0b3f772d14b41b9a----1627518650047
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36
newrelic, traceparent, and tracestate change every time I log in, and I do not know how to handle them.
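newrelic, traceparent, and tracestate are distributed-tracing telemetry headers, and a login endpoint typically doesn't validate them. The more likely mismatch is the body format: the browser sends content-type: application/json, while data=login_payload sends a form-encoded body. A sketch of the JSON variant (whether the endpoint accepts this is an assumption):
import requests

URL = 'https://app.acvauctions.com/login/'
HEADERS = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36',
'origin': 'https://app.acvauctions.com',
'referer': 'https://app.acvauctions.com/',
'accept': 'application/json, text/plain, */*',
}
s = requests.Session()
login_payload = {'email': '****', 'password': '****', 'web': 'true'}
# json= serializes the payload and sets Content-Type: application/json,
# matching the browser's request; data= would send form encoding instead
login_req = s.post(URL, headers=HEADERS, json=login_payload)
print(login_req.status_code)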
I am trying to scrape the table under the Market Segment tab. The code logic below used to work for similar tasks, but it is not working here and returns:
Error Send failure: Connection was reset
library(rvest)
library(httr)

link <- 'https://www.egx.com.eg/en/prices.aspx'
headers.id <-c('User-Agent'= 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
'Referer'= 'https://www.egx.com.eg/en/prices.aspx',
'Content-Type'= 'application/x-www-form-urlencoded',
'Origin'='https://www.egx.com.eg',
'Host'= 'www.egx.com.eg',
'sec-ch-ua'='" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"'
)
pgsession<-session(link,httr::add_headers(.headers=headers.id), verbose())
pgform<-html_form(pgsession)[[1]]
page<-POST(link, body=list(
'__EVENTTARGET'= pgform$fields$`__EVENTTARGET`$value,
'__EVENTARGUMENT'=pgform$fields$`__EVENTARGUMENT`$value,
'__VIEWSTATE'=pgform$fields$`__VIEWSTATE`$value,
'ctl00$H$txtSearchAll'=pgform$fields$`ctl00$H$txtSearchAll`$value,
'ctl00$H$rblSearchType'=pgform$fields$`ctl00$H$rblSearchType`$value,
'ctl00$H$rblSearchType'=pgform$fields$`ctl00$H$rblSearchType`$value,
'ctl00$H$imgBtnSearch'=pgform$fields$`ctl00$H$imgBtnSearch`$value,
'ctl00$C$S$TextBox1'=pgform$fields$`ctl00$C$S$TextBox1`$value
),
encode="form", verbose()
)
I kept searching till I found the solution using rvest, as follows:
link<-'https://www.egx.com.eg/en/prices.aspx'
headers.id <-c('Accept'='*/*',
'Accept-Encoding'='gzip, deflate, br',
'Accept-Language'='en-US,en;q=0.9',
'Cache-Control'='no-cache',
'Connection'='keep-alive',
'Content-Type'='application/x-www-form-urlencoded',
'Host'='www.egx.com.eg',
'Origin'='https://www.egx.com.eg',
'Referer'='https://www.egx.com.eg/en/prices.aspx',
'sec-ch-ua'='" Not;A Brand";v="99", "Microsoft Edge";v="91", "Chromium";v="91"',
'sec-ch-ua-mobile'='?0',
'Sec-Fetch-Dest'='empty',
'Sec-Fetch-Mode'='cors',
'Sec-Fetch-Site'='same-origin',
'User-Agent'='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36 Edg/91.0.864.59',
'X-MicrosoftAjax'='Delta=true'
)
pgsession<-session(link,httr::add_headers(.headers=headers.id))
pgform<-html_form(pgsession)[[1]]
filled_form<-html_form_set(pgform,
'__EVENTTARGET'= 'ctl00$C$S$lkMarket',
'__EVENTARGUMENT'=pgform$fields$`__EVENTARGUMENT`$value,
'__VIEWSTATE'=pgform$fields$`__VIEWSTATE`$value,
'ctl00$H$txtSearchAll'=pgform$fields$`ctl00$H$txtSearchAll`$value,
'ctl00$H$rblSearchType'=pgform$fields$`ctl00$H$rblSearchType`$value,
'ctl00$H$rblSearchType'="1",
'ctl00$H$imgBtnSearch'=pgform$fields$`ctl00$H$imgBtnSearch`$value,
'ctl00$C$S$TextBox1'=pgform$fields$`ctl00$C$S$TextBox1`$value
)
page<-session_submit(pgsession,filled_form)
# in the above example, change __EVENTTARGET to "ctl00$ContentPlaceHolder1$DataList2$ctl02$lnk_blok" to get a different table
page.html <- read_html(page$response) %>% html_table() %>% .[[7]]
The link is: https://angel.co/medical-marijuana-dispensaries-1
Every time I use requests.get(url), it gives me a 403 response, so I cannot parse the page.
I tried changing the headers (User-Agent and Referer), but it did not work:
import requests
page=requests.get('https://angel.co/medical-marijuana-dispensaries-1')
page
<Response [403]>
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'})
session.headers
{'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
page=requests.get('https://angel.co/medical-marijuana-dispensaries-1')
page
<Response [403]>
page=session.get('https://angel.co/medical-marijuana-dispensaries-1')
page
<Response [403]>
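With all three attempts returning 403, the block is probably not about individual headers. A hedged diagnostic sketch: inspect the response to see whether an anti-bot layer (e.g. a CDN challenge page) is answering instead of the site itself; whether angel.co uses one is an assumption here:
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'}
page = requests.get('https://angel.co/medical-marijuana-dispensaries-1', headers=headers)
print(page.status_code)            # 403
print(page.headers.get('Server'))  # e.g. 'cloudflare' points to an anti-bot layer
print(page.text[:300])             # a challenge page usually names the product blocking you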