Python - Change requests params so the URL doesn't start with "?=" and "&" - python-requests

So I have been trying to figure out how to work this out with requests.
Right now I have done something like:
import requests

url = 'www.helloworld.com'
params = {
    "": page_num,
    "orderBy": 'Published'
}
headers = {
    'User-Agent': ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
                   ' (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36')
}
resp = requests.get(url, headers=headers, params=params, timeout=12)
resp.raise_for_status()
print(resp.url)
and basically how it prints out now is:
www.helloworld.com/?=2&orderBy=Published
and what I wish to have is:
www.helloworld.com/2?orderBy=Published
How can I change the request params so the URL ends up like the one above?

Your issue is that you are trying to modify the target URL path, not the query parameters, so you can't use the params argument of requests to do that.
I suggest two options to do what you want:
construct the URL by hand (see the sketch below). You can do it with string concatenation for simple cases, but there are modules that do it properly, such as https://pypi.org/project/furl/ and https://hyperlink.readthedocs.io/en/latest/ , which are easier to use and more powerful than urllib.parse.urljoin
use apirequests, which is a simple wrapper around requests: https://pypi.org/project/apirequests
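For the first option, here is a minimal sketch using only the standard library, assuming page_num holds the page number from the question and reusing its orderBy parameter:
import requests

page_num = 2                              # for example
base_url = 'https://www.helloworld.com'
url = f'{base_url}/{page_num}'            # the page number goes into the path ...
params = {'orderBy': 'Published'}         # ... and the real query parameters stay in params

resp = requests.get(url, params=params, timeout=12)
resp.raise_for_status()
print(resp.url)  # e.g. https://www.helloworld.com/2?orderBy=Published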
Sample using apirequests:
import apirequests
client = apirequests.Client('www.helloworld.com')
resp = client.get('/2', headers=headers, params=params, timeout=12)
# note that apirequests calls resp.raise_for_status() automatically

Related

Access Token Meta Data via Solscan API with Python

I am trying to access the metadata of a Solana token via the Solscan API.
The following code works in principle, but the API doesn't return the expected data.
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
}
params = {
    'token': '24jvtWN7qCf5GQ5MaE7V2R4SUgtRxND1w7hyvYa2PXG6',
}
response = requests.get('https://api.solscan.io/token/meta', headers=headers, params=params)
print(response.content.decode())
It returns:
{"succcess":true,"data":{"holder":1}}
However, I expected the following according to the docs https://public-api.solscan.io/docs/#/Token/get_token_meta:
Any help? Thx!
I tried this with another token and got the full response. It seems the example SPL token simply lacks metadata to display.
import requests
from requests.structures import CaseInsensitiveDict
url = "https://public-api.solscan.io/token/meta?tokenAddress=4k3Dyjzvzp8eMZWUXbBCjEvwSkkk59S5iCNLY3QrkX6R"
headers = CaseInsensitiveDict()
headers["accept"] = "application/json"
resp = requests.get(url, headers=headers)
print(resp.status_code)
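The same request can also be written with the token address passed through params, so requests handles the query-string encoding, and with the JSON body printed instead of just the status code. A minimal sketch using the same endpoint and token address as above:
import requests

url = "https://public-api.solscan.io/token/meta"
params = {"tokenAddress": "4k3Dyjzvzp8eMZWUXbBCjEvwSkkk59S5iCNLY3QrkX6R"}
headers = {"accept": "application/json"}

resp = requests.get(url, headers=headers, params=params)
resp.raise_for_status()
print(resp.json())  # decoded JSON body rather than the raw status code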

Is there a way to scrape the underlying data of a particular button?

I'm trying to scrape a webpage. For a few elements I got the data using the class attribute, but the problem is that when my loop goes to each URL to extract the information, it should also extract the contact number.
The contact number is not directly available: only when we click the "CALL NOW" button does a pop-up card open that shows the number.
I tried using the class of that phone-number element, but I'm still not getting the phone number.
try:
    contact = soup.find('div', class_='c-vn-full__number u-bold').text.strip()
except:
    contact = "N/A"
Is there any way to achieve this?
Also, I have one more element left to extract, the "consulting fees" (price) as text, but it has no class attribute.
Try this:
import requests
from bs4 import BeautifulSoup

url = "https://www.practo.com/Bangalore/doctor/dr-venkata-krishna-rao-diabetologist-1?practice_id=776084&specialization=general%20physician"
soup = BeautifulSoup(requests.get(url).text, "html.parser")
# the consulting fee is the last element carrying the "u-no-margin--top" class
fee = soup.select(".u-no-margin--top")[-1]
print(fee.getText())
Output:
₹400
EDIT:
To get the contact details, you need to pull practice_id, doctor_id, and query_string out of the source HTML. There's a huge JSON blob embedded there, but I thought it's less hassle to scoop out the necessary parts than to parse that monster.
Once you have all the parts, you can use an endpoint to get the contact details.
Here's how to get this done:
import json
import re
import requests

url = (
    "https://www.practo.com/Bangalore/doctor/"
    "dr-venkata-krishna-rao-diabetologist-1?"
    "practice_id=776084&specialization=general%20physician"
)
page = requests.get(url).text

# The page embeds a large JSON blob; pull out only the needed pieces with regexes.
query_string_pattern = re.compile(r"query_string\":\"(.*?)\"")
practice_doctor_uuid = re.compile(
    r"(practice|doctor)_id\":"
    r"\"([a-f0-9]{8}-[a-f0-9]{4}-4[a-f0-9]{3}-[89aAbB][a-f0-9]{3}-[a-f0-9]{12})"
)

# The first two matches are the practice and doctor UUIDs, in that order.
practice_id, doctor_id = [i[1] for i in re.findall(practice_doctor_uuid, page)[:2]]
query_string = re.search(query_string_pattern, page).group(1)

# Build the endpoint URL that returns the contact details as JSON.
practice_url = "https://www.practo.com/health/api/vn/vnpractice"
query = f"{query_string}&practice_uuid={practice_id}&doctor_uuid={doctor_id}"
endpoint_url = f"{practice_url}{query}"

headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36"
}

contact_info = requests.get(endpoint_url, headers=headers).json()
print(json.dumps(contact_info["vn_phone_number"], indent=2))
Output:
{
  "number": "+918046801985",
  "operator": "VOICE",
  "vn_zone_id": 1,
  "country_code": "IN",
  "extension": true,
  "id": 49090
}

How to crawl the result of another crawl in the same parse function?

Hi, so I am crawling a website with articles, and within each article there is a link to a file. I managed to crawl all the article links; now I want to access each one and collect the link inside it, instead of having to save the result of the first crawl to a JSON file and then write another script.
The thing is, I am new to Scrapy, so I don't really know how to do that. Thanks in advance!
import scrapy


class SgbdSpider(scrapy.Spider):
    name = "sgbd"
    start_urls = [
        "http://www.sante.gouv.sn/actualites/"
    ]

    def parse(self, response):
        base = "http://www.sante.gouv.sn/actualites/"
        for link in response.css(".card-title a"):
            title = link.css("a::text").get()
            href = link.css("a::attr(href)").extract()
            # here instead of yield, i want to parse the href and then maybe yield the result of THAT parse.
            yield {
                "title": title,
                "href": href
            }
            # next step: for each href, parse again and get the link in that page for the pdf file
            # pdf link can easily be collected with response.css(".file a::attr(href)").get()
            # then write that link in a json file

        next_page = response.css("li.pager-next a::attr(href)").get()
        if next_page is not None and next_page.split("?page=")[-1] != "35":
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
You can yield a request to those PDF links with a new callback where you put the extraction logic.
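For example, a minimal sketch of that idea inside the spider above; the parse_article callback name and the item fields are illustrative, and the PDF selector is the one mentioned in the question's comments:
    def parse(self, response):
        for link in response.css(".card-title a"):
            title = link.css("::text").get()
            href = link.css("::attr(href)").get()
            # follow each article and pass the title along via cb_kwargs
            yield response.follow(href, callback=self.parse_article, cb_kwargs={"title": title})

        next_page = response.css("li.pager-next a::attr(href)").get()
        if next_page is not None and next_page.split("?page=")[-1] != "35":
            yield response.follow(next_page, callback=self.parse)

    def parse_article(self, response, title):
        # selector taken from the question's comment; adjust it if the page structure differs
        yield {
            "title": title,
            "pdf": response.css(".file a::attr(href)").get(),
        }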
A CrawlSpider, rather than the simple basic spider, is better suited to handle this. The basic spider template is generated by default, so you have to specify the template to use when generating the spider.
Assuming you've created the project & are in the root folder:
$ scrapy genspider -t crawl sgbd sante.sec.gouv.sn
Opening up the sgbd.py file, you'll notice the difference between it and the basic spider template.
If you're unfamiliar with XPath, here's a run-through
LinkExtractor & Rule will define your spider's behavior as per the documentation
Edit the file:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SgbdSpider(CrawlSpider):
    name = 'sgbd'
    allowed_domains = ['sante.sec.gouv.sn']
    start_urls = ['https://sante.sec.gouv.sn/actualites']
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'

    def set_user_agent(self, request, spider):
        request.headers['User-Agent'] = self.user_agent
        return request

    # First rule gets the links to the articles; callback is the function executed
    # after following the link to each article.
    # Second rule handles pagination.
    # Couldn't get it to work when passing CSS selectors to LinkExtractor as
    # restrict_css, so XPaths are used instead.
    rules = (
        Rule(
            LinkExtractor(restrict_xpaths='//*[@id="main-content"]/div[1]/div/div[2]/ul/li[11]/a'),
            callback='parse_item',
            follow=True,
            process_request='set_user_agent',
        ),
        Rule(
            LinkExtractor(restrict_xpaths='//*[@id="main-content"]/div[1]/div/div[1]/div/div/div/div[3]/span/div/h4/a'),
            process_request='set_user_agent',
        )
    )

    # Extract title & link to pdf
    def parse_item(self, response):
        yield {
            'title': response.xpath('//*[@id="main-content"]/section/div[1]/div[1]/article/div[2]/div[2]/div/span/a/font/font/text()').get(),
            'href': response.xpath('//*[@id="main-content"]/section/div[1]/div[1]/article/div[2]/div[2]/div/span/a/@href').get()
        }
Unfortunately this is as far as I could go, as the site was inaccessible even with different proxies; it was taking too long to respond. You might have to tweak those XPaths a little further. Better luck on your side.
Run the spider & save output to json
$ scrapy crawl sgbd -o results.json
Parse links in another function. Then parse again in yet another function. You can yield whatever results you want in any of those functions.
I agree with what @bens_ak47 and @user9958765 said: use a separate function.
For example, change this:
yield scrapy.Request(next_page, callback=self.parse)
to this:
yield scrapy.Request(next_page, callback=self.parse_pdffile)
then add the new method:
def parse_pdffile(self, response):
    print(response.url)

Losing information when using BeautifulSoup

I am following the guide 'Automate the Boring Stuff with Python',
practicing a project called 'Project: "I'm Feeling Lucky" Google Search',
but the CSS selector returns nothing.
import requests, sys, webbrowser, bs4, pyperclip

if len(sys.argv) > 1:
    address = ' '.join(sys.argv[1:])
else:
    address = pyperclip.paste()

res = requests.get('http://google.com/search?q=' + str(address))
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, "html.parser")
linkElems = soup.select('.r a')
for i in range(5):
    webbrowser.open('http://google.com' + linkElems[i].get('href'))
I already tested the same code in the IDLE shell.
It seems that
linkElems = soup.select('.r')
returns nothing,
and after I checked the value returned by BeautifulSoup in
soup = bs4.BeautifulSoup(res.text, "html.parser")
I found that every class='r' and class='rc' is gone for no apparent reason,
even though they were there in the raw HTML file.
Please tell me why this happens and how to avoid such problems.
To get the version of the HTML in which class r is defined, it's necessary to set a User-Agent in the headers:
import requests
from bs4 import BeautifulSoup

address = 'linux'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0'}
res = requests.get('http://google.com/search?q=' + str(address), headers=headers)
res.raise_for_status()

soup = BeautifulSoup(res.text, "html.parser")
linkElems = soup.select('.r a')
for a in linkElems:
    if a.text.strip() == '':
        continue
    print(a.text)
Prints:
Linux.orghttps://www.linux.org/
Puhverdatud
Tõlgi see leht
Linux – Vikipeediahttps://et.wikipedia.org/wiki/Linux
Puhverdatud
Sarnased
Linux - Wikipediahttps://en.wikipedia.org/wiki/Linux
...and so on.
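If you need the result URLs rather than the anchor text (as in the original script), the same linkElems loop can read the href attribute instead. A minimal sketch, reusing linkElems from the code above; note that, depending on the HTML variant Google serves, the href may be an absolute URL or a relative "/url?q=..." redirect:
for a in linkElems:
    href = a.get('href')
    if not href:
        continue
    # strip Google's redirect wrapper if present
    if href.startswith('/url?q='):
        href = href[len('/url?q='):].split('&')[0]
    print(href)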
The reason Google blocks your request is that the default requests user-agent is python-requests. Google checks the user-agent, detects a script, blocks the request, and returns completely different HTML with different elements and selectors. Sometimes you can still receive a different HTML variant, with different selectors, even when a user-agent is set.
Learn more about the user-agent and HTTP request headers.
Pass a user-agent in the request headers:
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

requests.get('YOUR_URL', headers=headers)
Try to use the lxml parser instead; it's faster.
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
    "q": "My query goes here"
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
    print(link)
-----
'''
https://dev.mysql.com/doc/refman/8.0/en/entering-queries.html
https://www.benlcollins.com/spreadsheets/google-sheets-query-sql/
https://www.exoscale.com/syslog/explaining-mysql-queries/
https://blog.hubspot.com/marketing/sql-tutorial-introduction
https://mode.com/sql-tutorial/sql-sub-queries/
https://www.mssqltips.com/sqlservertip/1255/getting-io-and-time-statistics-for-sql-server-queries/
https://stackoverflow.com/questions/2698401/how-to-store-mysql-query-results-in-another-table
https://www.khanacademy.org/computing/computer-programming/sql/relational-queries-in-sql/a/more-efficient-sql-with-query-planning-and-optimization
http://cidrdb.org/cidr2011/Papers/CIDR11_Paper7.pdf
https://www.sommarskog.se/query-plan-mysteries.html
'''
Alternatively, you can do the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only need to extract the data you want from a JSON string, rather than figuring out how to extract the data, maintain the parser, or bypass blocks from Google.
Code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "My query goes here",
    "hl": "en",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(result['link'])
-------
'''
https://dev.mysql.com/doc/refman/8.0/en/entering-queries.html
https://www.benlcollins.com/spreadsheets/google-sheets-query-sql/
https://www.exoscale.com/syslog/explaining-mysql-queries/
https://blog.hubspot.com/marketing/sql-tutorial-introduction
https://mode.com/sql-tutorial/sql-sub-queries/
https://www.mssqltips.com/sqlservertip/1255/getting-io-and-time-statistics-for-sql-server-queries/
https://stackoverflow.com/questions/2698401/how-to-store-mysql-query-results-in-another-table
https://www.khanacademy.org/computing/computer-programming/sql/relational-queries-in-sql/a/more-efficient-sql-with-query-planning-and-optimization
http://cidrdb.org/cidr2011/Papers/CIDR11_Paper7.pdf
https://www.sommarskog.se/query-plan-mysteries.html
'''
Disclaimer, I work for SerpApi.

Scrapy splash download file from js click event

I'm using Scrapy with the Splash plugin. I have a button which triggers a download event via Ajax, and I need to get the downloaded file, but I don't know how.
My Lua script is something like this:
function main(splash)
    splash:init_cookies(splash.args.cookies)
    assert(splash:go{
        splash.args.url,
        headers=splash.args.headers,
        http_method=splash.args.http_method,
        body=splash.args.body,
    })
    assert(splash:wait(0.5))

    local get_dimensions = splash:jsfunc([[
        function () {
            var rect = document.querySelector('a[aria-label="Download XML"]').getClientRects()[0];
            return {"x": rect.left, "y": rect.top}
        }
    ]])

    splash:set_viewport_full()
    splash:wait(0.1)

    local dimensions = get_dimensions()
    -- FIXME: button must be inside a viewport
    splash:mouse_click(dimensions.x, dimensions.y)
    splash:wait(0.1)

    return splash:html()
end
My request object from my spider:
yield SplashFormRequest(self.urls['url'],
                        formdata=FormBuilder.build_form(response, some_object[0]),
                        callback=self.parse_cuenta,
                        cache_args=['lua_source'],
                        endpoint='execute',
                        args={'lua_source': self.script_click_xml})
Thanks in advance
I just tried this with SplashFormRequest and it looks like Splash won't work for you. Instead, you can send the same Ajax request using the Python requests library.
Here is an example:
import requests

# viewstate, viewstategen and eventvalid must be scraped from the page first
# (see the sketch below); item['first_name'] comes from the spider's item.
data = {'__EVENTTARGET': 'main_0$body_0$lnkDownloadBio',
        '__EVENTARGUMENT': '',
        '__VIEWSTATE': viewstate,
        '__VIEWSTATEGENERATOR': viewstategen,
        '__EVENTVALIDATION': eventvalid,
        'search': '',
        'filters': '',
        'score': ''}

HEADERS = {
    'Content-Type': 'application/x-www-form-urlencoded',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36',
    'Accept': 'text/html, application/xhtml+xml, application/xml;q=0.9, image/webp, image/apng, */*;q=0.8'
}

# requests form-encodes the dict itself, so no separate urlencode step is needed
r = requests.post(submit_url, data=data, allow_redirects=False, headers=HEADERS)

filename = 'name-%s.pdf' % item['first_name']
with open(filename, 'wb') as f:
    f.write(r.content)
Please make sure the data and headers you're sending are correct.
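The viewstate, viewstategen and eventvalid values above are not shown being collected. A minimal sketch of how they could be pulled from the page, assuming it is a standard ASP.NET WebForms page that exposes them as hidden inputs; page_url is a placeholder for the page containing the download button, and HEADERS is the dict from the example above:
import requests
from bs4 import BeautifulSoup

page_url = "https://example.com/page-with-download-button"  # placeholder
page = requests.get(page_url, headers=HEADERS)
soup = BeautifulSoup(page.text, "html.parser")

# ASP.NET WebForms pages carry these values in hidden <input> fields
viewstate = soup.find("input", {"name": "__VIEWSTATE"})["value"]
viewstategen = soup.find("input", {"name": "__VIEWSTATEGENERATOR"})["value"]
eventvalid = soup.find("input", {"name": "__EVENTVALIDATION"})["value"]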
