import requests
from bs4 import BeautifulSoup
import re

R = []
url = "https://ascscotties.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; '
           'Intel Mac OS X 10.6; rv:16.0) Gecko/20100101 Firefox/16.0'}

# Fetch the homepage and collect every link whose href mentions "roster"
reqs = requests.get(url, headers=headers)
soup = BeautifulSoup(reqs.text, 'html.parser')
links = soup.find_all('a', href=re.compile("roster"))
s = [url + link.get("href") for link in links]

# Follow each link and keep the final URL if the page resolves
for i in s:
    r = requests.get(i, allow_redirects=True, headers=headers)
    if r.status_code < 400:
        R.append(r.url)
Output:
['https://ascscotties.com/sports/womens-basketball/roster',
'https://ascscotties.com/sports/womens-cross-country/roster',
'https://ascscotties.com/sports/womens-soccer/roster',
'https://ascscotties.com/sports/softball/roster',
'https://ascscotties.com/sports/womens-tennis/roster',
'https://ascscotties.com/sports/womens-volleyball/roster']
The code looks for roster links on a site's homepage and gives the output above, but for a site like "https://auyellowjackets.com/" it fails because the URL takes us to a splash screen. What can be done?
The site uses a cookie to indicate that it has already shown the splash screen, so set it to get straight to the main page:
import re
import requests
from bs4 import BeautifulSoup
R = []
url = "https://auyellowjackets.com"
cookies = {"splash_2": "splash_2"} # <--- set cookie
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; "
    "Intel Mac OS X 10.6; rv:16.0) Gecko/20100101 Firefox/16.0"
}
reqs = requests.get(url, headers=headers, cookies=cookies)
soup = BeautifulSoup(reqs.text, "html.parser")
links = soup.find_all("a", href=re.compile("roster"))
s = [url + link.get("href") for link in links]
for i in s:
    r = requests.get(i, allow_redirects=True, headers=headers)
    if r.status_code < 400:
        R.append(r.url)
print(*R, sep="\n")
Prints:
https://auyellowjackets.com/sports/mens-basketball/roster
https://auyellowjackets.com/sports/mens-cross-country/roster
https://auyellowjackets.com/sports/football/roster
https://auyellowjackets.com/sports/mens-track-and-field/roster
https://auyellowjackets.com/sports/mwrest/roster
https://auyellowjackets.com/sports/womens-basketball/roster
https://auyellowjackets.com/sports/womens-cross-country/roster
https://auyellowjackets.com/sports/womens-soccer/roster
https://auyellowjackets.com/sports/softball/roster
https://auyellowjackets.com/sports/womens-track-and-field/roster
https://auyellowjackets.com/sports/volleyball/roster
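Since the only difference between the two snippets is the extra cookie, they can be folded into one small helper. This is just a sketch; the get_roster_links name and its optional cookies parameter are mine:

import re
import requests
from bs4 import BeautifulSoup

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; "
    "Intel Mac OS X 10.6; rv:16.0) Gecko/20100101 Firefox/16.0"
}

def get_roster_links(base_url, cookies=None):
    """Return the resolved roster URLs found on a site's homepage."""
    resp = requests.get(base_url, headers=HEADERS, cookies=cookies)
    soup = BeautifulSoup(resp.text, "html.parser")
    candidates = [base_url + a.get("href")
                  for a in soup.find_all("a", href=re.compile("roster"))]
    resolved = []
    for link in candidates:
        r = requests.get(link, allow_redirects=True, headers=HEADERS)
        if r.status_code < 400:
            resolved.append(r.url)
    return resolved

# The splash-screen cookie is only needed on sites that show one.
print(*get_roster_links("https://auyellowjackets.com",
                        cookies={"splash_2": "splash_2"}), sep="\n")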
I am trying to scrape the current AQI for my location with BeautifulSoup 4.
url = "https://www.airnow.gov/?city=Burlingame&state=CA&country=USA"
header = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36",
"Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8"
}
response = requests.get(url, headers=header)
soup = BeautifulSoup(response.content, "lxml")
aqi = soup.find("div", class_="aqi")
When I print aqi, it is just an empty div. However, on the website there should be an element inside this div containing the AQI number that I want.
I am trying to access the Amadeus travel API
To obtain a token, the given curl command is:
curl "https://test.api.amadeus.com/v1/security/oauth2/token" \
-H "Content-Type: application/x-www-form-urlencoded" \
-d "grant_type=client_credentials&client_id={client_id}&client_secret={client_secret}"
My R script attempt is:
library("httr")
# Get Token
response <- POST("https://test.api.amadeus.com/v1/security/oauth2/token",
                 add_headers("Content-Type" = "application/x-www-form-urlencoded"),
                 body = list(
                   "grant_type" = "client_credentials",
                   "client_id" = API_KEY,
                   "client_secret" = API_SECRET),
                 encode = "json")
response
rsp_content <- content(response, as = "parsed", type = "application/json")
rsp_content
Resulting in the error:
Response [https://test.api.amadeus.com/v1/security/oauth2/token]
Date: 2021-07-23 00:59
Status: 400
Content-Type: application/json
Size: 217 B
{
"error":"invalid_request",
"error_description": "Mandatory grant_type form parameter missing",
"code": 38187,
"title": "Invalid parameters"
}
What is the correct way to call this API to obtain a token using R?
The curl -d option is used to send data in the same way an HTML form would. To match that format, use encode="form" rather than encode="json" in the call to POST().
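A minimal sketch of the corrected call, assuming API_KEY and API_SECRET already hold the Amadeus credentials; with encode = "form", httr sets the Content-Type header itself, so the add_headers() line can be dropped:

library("httr")

# encode = "form" sends the body as application/x-www-form-urlencoded,
# which is what curl -d does.
response <- POST("https://test.api.amadeus.com/v1/security/oauth2/token",
                 body = list(
                   "grant_type" = "client_credentials",
                   "client_id" = API_KEY,
                   "client_secret" = API_SECRET),
                 encode = "form")

rsp_content <- content(response, as = "parsed", type = "application/json")
rsp_content$access_token  # the standard OAuth2 access_token field, present on success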
I run the script below, but I got None, even though the data is there on the URL.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector

class GetSpider(scrapy.Spider):
    name = 'gets'
    start_urls = ['https://www.retailmenot.com/coupons/insurance?u=ZTF65B5PJZEU3JDF326WY2SXOQ']

    def parse(self, response):
        s = Selector(response)
        code = s.xpath("//button[contains(@class,'CopyCode')][1]/text()").get()
        yield {'code': code}
I expect 52YR, but I got None.
The easiest way to go about this is probably to load the JSON embedded in the page's script tag as a Python dictionary and navigate through it to get to the codes.
The code below should get you started:
import scrapy
import json
import logging

class GetSpider(scrapy.Spider):
    name = 'gets'
    start_urls = ['https://www.retailmenot.com/coupons/insurance?u=ZTF65B5PJZEU3JDF326WY2SXOQ']
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
    }
    custom_settings = {'ROBOTSTXT_OBEY': False}

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url,
                                 callback=self.parse,
                                 headers=self.headers,
                                 dont_filter=True)

    def parse(self, response):
        # Grab the inline script that assigns __NEXT_DATA__ and slice out the JSON object
        script = response.xpath(
            '//script[contains(text(), "__NEXT_DATA__")]/text()'
        ).extract_first()
        dict_start_index = script.index('{')
        dict_end_index = script.index('};') + 1
        data = json.loads(script[dict_start_index:dict_end_index])
        coupon_data = data['props']['pageProps']['serverState']['apollo']['data']
        # Any entry that carries a 'code' key is a coupon code
        for key, value in coupon_data.items():
            try:
                code = value['code']
            except KeyError:
                logging.debug("no code found")
            else:
                yield {'code': code}
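If you save the spider as, say, gets.py (the file name here is just for illustration), you can run it standalone with scrapy runspider gets.py -o codes.json and the yielded items will be written to that file.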
I have a small python app running via uwsgi with requests served by nginx.
I'm printing the environment variables... and it looks like after a couple of ok requests, nginx is sending the same HTTP_COOKIE param for unrelated requests:
For example:
{'UWSGI_CHDIR': '/ebs/py', 'HTTP_COOKIE':
'ge_t_c=4fcee8450c3bee709800920c', 'UWSGI_SCRIPT': 'server',
'uwsgi.version': '1.1.2', 'REQUEST_METHOD': 'GET', 'PATH_INFO':
'/redirect/ebebaf3b-475a-4010-9a72-96eeff797f1e', 'SERVER_PROTOCOL':
'HTTP/1.1', 'QUERY_STRING': '', 'x-wsgiorg.fdevent.readable':
, 'CONTENT_LENGTH': '',
'uwsgi.ready_fd': None, 'HTTP_USER_AGENT': 'Mozilla/5.0 (compatible;
MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)', 'HTTP_CONNECTION':
'close', 'HTTP_REFERER': 'http://www.facebook.com/', 'SERVER_NAME':
'pixel.domain.com', 'REMOTE_ADDR': '10.load.bal.ip',
'wsgi.url_scheme': 'http', 'SERVER_PORT': '80', 'wsgi.multiprocess':
True, 'uwsgi.node': 'py.domain.com', 'DOCUMENT_ROOT':
'/etc/nginx/html', 'UWSGI_PYHOME': '/ebs/py', 'uwsgi.core': 127,
'HTTP_X_FORWARDED_PROTO': 'http', 'x-wsgiorg.fdevent.writable':
, 'wsgi.input':
,
'HTTP_HOST': 'track.domain.com', 'wsgi.multithread': False,
'REQUEST_URI': '/redirect/ebebaf3b-475a-4010-9a72-96eeff797f1e',
'HTTP_ACCEPT': 'text/html, application/xhtml+xml, /',
'wsgi.version': (1, 0), 'x-wsgiorg.fdevent.timeout': None,
'HTTP_X_FORWARDED_FOR': '10.load.bal.ip', 'wsgi.errors': , 'REMOTE_PORT': '36462',
'HTTP_ACCEPT_LANGUAGE': 'en-US', 'wsgi.run_once': False,
'HTTP_X_FORWARDED_PORT': '80', 'CONTENT_TYPE': '',
'wsgi.file_wrapper': ,
'HTTP_ACCEPT_ENCODING': 'gzip, deflate'}
and
{'UWSGI_CHDIR': '/ebs/py', 'HTTP_COOKIE':
'ge_t_c=4fcee8450c3bee709800920c', 'UWSGI_SCRIPT': 'server',
'uwsgi.version': '1.1.2', 'REQUEST_METHOD': 'GET', 'PATH_INFO':
'/redirect/2391e658-95ef-4300-80f5-83dbb1a0e526', 'SERVER_PROTOCOL':
'HTTP/1.1', 'QUERY_STRING': '', 'x-wsgiorg.fdevent.readable':
, 'CONTENT_LENGTH': '',
'uwsgi.ready_fd': None, 'HTTP_USER_AGENT': 'Mozilla/5.0 (iPad; CPU OS
5_1_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko)
Version/5.1 Mobile/9B206 Safari/7534.48.3', 'HTTP_CONNECTION':
'close', 'HTTP_REFERER': 'http://www.facebook.com/', 'SERVER_NAME':
'pixel.domain.com', 'REMOTE_ADDR': '10.load.balancer.ip',
'wsgi.url_scheme': 'http', 'SERVER_PORT': '80', 'wsgi.multiprocess':
True, 'uwsgi.node': 'py.domain.com', 'DOCUMENT_ROOT':
'/etc/nginx/html', 'UWSGI_PYHOME': '/ebs/py', 'uwsgi.core': 127,
'HTTP_X_FORWARDED_PROTO': 'http', 'x-wsgiorg.fdevent.writable':
, 'wsgi.input':
,
'HTTP_HOST': 'fire.domain.com', 'wsgi.multithread': False,
'REQUEST_URI': '/redirect/2391e658-95ef-4300-80f5-83dbb1a0e526',
'HTTP_ACCEPT':
'text/html,application/xhtml+xml,application/xml;q=0.9,/;q=0.8',
'wsgi.version': (1, 0), 'x-wsgiorg.fdevent.timeout': None,
'HTTP_X_FORWARDED_FOR': '10.load.bal.ip', 'wsgi.errors': , 'REMOTE_PORT': '39498',
'HTTP_ACCEPT_LANGUAGE': 'en-us', 'wsgi.run_once': False,
'HTTP_X_FORWARDED_PORT': '80', 'CONTENT_TYPE': '',
'wsgi.file_wrapper': ,
'HTTP_ACCEPT_ENCODING': 'gzip, deflate'}
These are 2 distinct clients. I opened an incognito session, confirmed that no cookie was sent in the headers, and the uwsgi log shows that it received the same HTTP_COOKIE.
How can I make sure that nginx only passes the proper information for the current request, without regard to other requests?
Figured it out...
I had to add this line to uwsgi_params in /etc/nginx/:
uwsgi_param HTTP_COOKIE $http_cookie;
Without it, the HTTP_COOKIE variable could not be trusted in the uwsgi/Python app.
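For context, here is roughly what the relevant part of /etc/nginx/uwsgi_params can look like with the extra line in place. The other entries are the stock nginx defaults (which vary slightly by version); only the last line is the addition:

uwsgi_param  QUERY_STRING       $query_string;
uwsgi_param  REQUEST_METHOD     $request_method;
uwsgi_param  CONTENT_TYPE       $content_type;
uwsgi_param  CONTENT_LENGTH     $content_length;

uwsgi_param  REQUEST_URI        $request_uri;
uwsgi_param  PATH_INFO          $document_uri;
uwsgi_param  DOCUMENT_ROOT      $document_root;
uwsgi_param  SERVER_PROTOCOL    $server_protocol;

uwsgi_param  REMOTE_ADDR        $remote_addr;
uwsgi_param  REMOTE_PORT        $remote_port;
uwsgi_param  SERVER_PORT        $server_port;
uwsgi_param  SERVER_NAME        $server_name;

# Added: pass the cookie header of the current request explicitly
uwsgi_param  HTTP_COOKIE        $http_cookie;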