I am trying to login on this website using the following snippet of code:
from bs4 import BeautifulSoup as bs
import requests
URL='https://app.acvauctions.com/login/'
HEADERS = { 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36' ,
'origin': 'https://app.acvauctions.com'
'referer': 'https://app.acvauctions.com/'
}
s = requests.session()
login_payload = { 'email': "****", 'password': "****", 'web': 'true' }
login_req = s.post(URL, headers=HEADERS, data=login_payload, allow_redirects=True)
The header requests that I get when I login with browser is the following:
accept: application/json, text/plain, */*
accept-encoding: gzip, deflate, br
accept-language: en-GB,en;q=0.9,fa-IR;q=0.8,fa;q=0.7,en-US;q=0.6
content-length: 67
content-type: application/json
newrelic: eyJ2IjpbMCwxXSwiZCI6eyJ0eSI6IkJyb3dzZXIiLCJhYyI6IjE2NTA3NDUiLCJhcCI6IjE3ODI2MTk5MiIsImlkIjoiMGIzZjc3MmQxNGI0MWI5YSIsInRyIjoiMTc0MGVlNDA4NTE0MzA1YTBkNWU4NTJkODRlZTMxNzAiLCJ0aSI6MTYyNzUxODY1MDA0N319
origin: https://app.acvauctions.com
referer: https://app.acvauctions.com/
sec-ch-ua: "Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"
sec-ch-ua-mobile: ?0
sec-fetch-dest: empty
sec-fetch-mode: cors
sec-fetch-site: same-site
traceparent: 00-592fab3e40957cc0ffa37c47c8914000-c656f42dbf8336a1-01
tracestate: 1650745#nr=0-1-1650745-178261992-0b3f772d14b41b9a----1627518650047
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36
newrelic, traceparent, and tracestate change every time I login but I do not know how to handle them?
Related
I am trying to scrape a website using a package in R.
When I run the following:
library(idealisto) #https://github.com/hmeleiro/idealisto
get_city("https://www.idealista.com/alquiler-viviendas/madrid-madrid/", "sale")
I get:
Error in read_html.response(.) : Forbidden (HTTP 403).
Looking into more details of the function get_city() I find that the problem is with the following part of the code:
desktop_agents <- c("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0.1 Safari/602.2.14",
"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0")
url = "https://www.idealista.com/en/venta-viviendas/madrid-provincia/"
x <- GET(url, add_headers(`user-agent` = desktop_agents[sample(1:10, 1)]))
Which returns the following output:
Response
[https://www.idealista.com/en/venta-viviendas/madrid-provincia/]
Date: 2022-04-04 18:52 Status: 403 Content-Type:
application/json;charset=utf-8 Size: 360 B
However, I should be getting a Status: 200. I try to pass some headers manually but I still get the same Status error:
headers = c(
'accept' = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'accept-encoding' = 'gzip, deflate, br',
'accept-language' = 'es-ES,es;q=0.9,en;q=0.8',
'cache-control' = 'max-age=0',
'referer' = 'https://www.idealista.com/en/',
'sec-fetch-mode' = 'navigate',
'sec-fetch-site' = 'same-origin',
'sec-fetch-user' = '?1',
'upgrade-insecure-requests' = '1',
'user-agent' = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36'
)
url = "https://www.idealista.com/en/venta-viviendas/madrid-provincia/"
x <- GET(url, add_headers(headers))
Any idea how I can get around this Status error?
Your syntax for add_headers is wrong. You can't pass a named vector - you have to pass the arguments directly to the function:
library(httr)
headers <- add_headers(
'accept' = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'accept-encoding' = 'gzip, deflate, br',
'accept-language' = 'es-ES,es;q=0.9,en;q=0.8',
'cache-control' = 'max-age=0',
'referer' = 'https://www.idealista.com/en/',
'sec-fetch-mode' = 'navigate',
'sec-fetch-site' = 'same-origin',
'sec-fetch-user' = '?1',
'upgrade-insecure-requests' = '1',
'user-agent' = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36'
)
url = "https://www.idealista.com/en/venta-viviendas/madrid-provincia/"
GET(url, headers)
#> Response [https://www.idealista.com/en/venta-viviendas/madrid-provincia/]
#> Date: 2022-04-04 19:10
#> Status: 200
#> Content-Type: text/html; charset=UTF-8
#> Size: 263 kB
#> <!DOCTYPE html>
#> <html lang="en" env="es" username="" data-userauth="false" class="">
#> <head>
#> <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
#> <title>Property for sale in Madrid province, Spain: houses and flats — ...
#> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
#> <meta name="description" content="37,980 houses and flats for sale in Madrid,...
#> <meta name="author" content="idealista.com">
#> <meta http-equiv="cleartype" content="on">
#> <meta name="pragma" content="no-cache"/>
#> ...
Created on 2022-04-04 by the reprex package (v2.0.1)
I'm trying to do a simple get request but no matter how I'm configuring the headers I keep getting a 403 response. The page loads fine in a browser. No login is required and there are no tracked cookies either. The link I'm trying to get a response from is below, followed by my simple code.
https://i7.sportsdatabase.com/nba/query.json?sdql=50+%3C+Kobe+Bryant%3Apoints+and+site%3Daway&sport=nba
url = 'https://i7.sportsdatabase.com/nba/query.json?sdql=50+%3C+Kobe+Bryant%3Apoints+and+site%3Daway&sport=nba'
headers = {
'Host': 'i7.sportsdatabase.com',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
}
r = requests.get(url, headers)
I'm not seeing any other headers that need adding to the request. The full, in browser, request headers are below:
Host: i7.sportsdatabase.com
Connection: keep-alive
Cache-Control: max-age=0
sec-ch-ua: "Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"
sec-ch-ua-mobile: ?0
DNT: 1
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Sec-Fetch-Dest: document
Accept-Encoding: gzip, deflate, br
Accept-Language: en,en-US;q=0.9,it;q=0.8,es;q=0.7
If-None-Match: "be833f0fb26eb81487fc09e05c85ac8c8646fc7b"
Try:
Make your URL a string
Add the accepts
This works:
url = 'https://i7.sportsdatabase.com/nba/query.json?sdql=50+%3C+Kobe+Bryant%3Apoints+and+site%3Daway&sport=nba'
headers = {
'Host': 'i7.sportsdatabase.com',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Accept-Encoding': 'gzip, deflate, br',
}
r = requests.get(url, headers=headers)
Try using .Session()
import requests
s = requests.Session()
headers = {
'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Mobile Safari/537.36',
}
s.get('https://i7.sportsdatabase.com/nba/trends', headers=headers)
url = 'https://i7.sportsdatabase.com/nba/query.json?sdql=50+%3C+Kobe+Bryant%3Apoints+and+site%3Daway&sport=nba'
r = s.get(url, headers=headers)
print(r)
Output:
print(r)
<Response [200]>
I managed to get the soup and the html of the webpage, but for some reason can't find the robots tag even though I can find it when scraping in other languages.
Example:
headers = {
'Accept-Encoding': 'gzip, deflate, sdch',
'Accept-Language': 'en-US,en;q=0.8',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,/;q=0.8',
'Cache-Control': 'max-age=0', 'Connection': 'keep-alive',
}
res=requests.get('http://{}'.format("radverdirect.com"), headers=headers, allow_redirects = True)
number= str(res.status_code)
soup = BeautifulSoup(res.text, 'html.parser')
x=soup.find('meta', attrs={'name':'robots'})
out=x.get("content", None)
out
This site returns to me noodp in other languages but here I can't find this tag. Why and how do I fix it?
I have defined a custom nginx log format using below template :
log_format main escape=none '"$time_local" client=$remote_addr '
'request="$request" '
'status=$status'
'Req_Header=$req_headers '
'request_body=$req_body '
'response_body=$resp_body '
'referer=$http_referer '
'user_agent="$http_user_agent" '
'upstream_addr=$upstream_addr '
'upstream_status=$upstream_status '
'request_time=$request_time '
'upstream_response_time=$upstream_response_time '
'upstream_connect_time=$upstream_connect_time ';
In return i get the request logged like below
"09/Sep/2019:13:28:39 +0530" client=59.152.52.190 request="POST /api/onboard/checkExistence HTTP/1.1"status=200Req_Header=Headers: accept: application/json
host: uat-pwa.abc.com
from: https://uat-pwa.abc.com/onboard/mf/onboard-info_v1.2.15.3
sec-fetch-site: same-origin
accept-language: en-GB,en-US;q=0.9,en;q=0.8
content-type: application/json
connection: keep-alive
content-length: 46
cookie: _ga=GA1.2.51303468.1558948708; _gid=GA1.2.1607663960.1568015582; _gat_UA-144276655-2=1
referer: https://uat-pwa.abc.com/onboard/mf/onboard-info
accept-encoding: gzip, deflate, br
ticket: aW52ZXN0aWNh
businessunit: MF
sec-fetch-mode: cors
userid: Onboarding
origin: https://uat-pwa.abc.com
investorid:
user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36
request_body={"PAN":"ABCDEGH","mobile":null,"both":"no"} response_body={"timestamp":"2019-09-09T13:28:39.132+0530","message":"Client Already Exist. ","details":"Details are in Logger database","payLoad":null,"errorCode":"0050","userId":"Onboarding","investorId":"","sessionUUID":"a2161b89-d2d7-11e9-aa73-3dba15bc0e1c"} referer=https://uat-pwa.abc.com/onboard/mf/onboard-info user_agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36" upstream_addr=[::1]:8080 upstream_status=200 request_time=0.069 upstream_response_time=0.068 upstream_connect_time=0.000
I am facing issues in writing a parser rule in telegraf logparser section. Once this data is parsed properly, Telegraf can write it into influx DB.
I have tried various solutions online to find a parsing rule but not able to do so as I am new to it. Any assistance will be appreciated.
Thanks and do let me know if any further information is required.
the link is: https://angel.co/medical-marijuana-dispensaries-1
every time I use requests.get(url), it keeps giving me 403 response, so I cannot parse it
I tried changing the headers: user-agent and the referer but it did not work
import requests
page=requests.get('https://angel.co/medical-marijuana-dispensaries-1')
page
<Response [403]>
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'})
session.headers
{'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
page=requests.get('https://angel.co/medical-marijuana-dispensaries-1')
page
<Response [403]>
page=session.get('https://angel.co/medical-marijuana-dispensaries-1')
page
<Response [403]>