Web scraping Gearbest with Python - web-scraping

EDITED:
I've been trying to pull some data from Gearbest.com about several products, and I'm having real trouble pulling the shipping price.
I'm working with requests and BeautifulSoup, and so far I've managed to get the name + link + price.
How can I get its shipping price?
The URLs are:
https://www.gearbest.com/gaming-laptops/pp_009863068586.html?wid=1433363
https://www.gearbest.com/smart-watch-phone/pp_009309925869.html?wid=1433363
I've tried:
shipping = soup.find("span", class_="goodsIntro_attrText").get("strong)
shipping = soup.find("span", class_="goodsIntro_attrText").get("strong).text
shipping = soup.find("strong", class_="goodsIntro_shippingCost")
shipping = soup.find("strong", class_="goodsIntro_shippingCost").text
soup is the return value from here (the URL is each product link):
def get_page(url):
    client = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"})
    try:
        client.raise_for_status()
    except requests.exceptions.HTTPError as e:
        print("Error in gearbest with the url:", url)
        exit(0)
    soup = BeautifulSoup(client.content, 'lxml')
    return soup
Any ideas what I can do?

You want to use soup, not souo (a typo in the original code). Also, there seems to be a difference between what the request returns and what is on the rendered page for me.
from bs4 import BeautifulSoup as bs
import requests

urls = ['https://www.gearbest.com/gaming-laptops/pp_009863068586.html?wid=1433363',
        'https://www.gearbest.com/smart-watch-phone/pp_009309925869.html?wid=1433363']

with requests.Session() as s:
    for url in urls:
        r = s.get(url, headers={'User-Agent': 'Mozilla/5.0'})
        soup = bs(r.content, 'lxml')
        print(soup.select_one('.goodsIntro_price').text)
        print(soup.select_one('.goodsIntro_shippingCost').text)  # soup.find("strong", class_="goodsIntro_shippingCost").text
For the actual shipping price, there seem to be dynamic feeds visible in the network tab, where the value is stored under actualFee. So perhaps there is dynamic, location-based updating of shipping prices.
from bs4 import BeautifulSoup as bs
import requests

urls = ['https://www.gearbest.com/goods/goods-shipping?goodSn=455718101&countryCode=GB&realWhCodeList=1433363&realWhCode=1433363&priceMd5=540DCE6E4F455639641E0BB2B6356F15&goodPrice=1729.99&num=1&categoryId=13300&saleSizeLong=50&saleSizeWide=40&saleSizeHigh=10&saleWeight=4.5&volumeWeight=4.5&properties=8&shipTemplateId=&isPlatform=0&virWhCode=1433363&deliveryType=0&platformCategoryId=&recommendedLevel=2&backRuleId=',
        'https://www.gearbest.com/goods/goods-shipping?goodSn=459768501&countryCode=GB&realWhCodeList=1433363&realWhCode=1433363&priceMd5=91D909FDFFE8F8F1F9D1EC1D5D1B7C2C&goodPrice=159.99&num=1&categoryId=12004&saleSizeLong=12&saleSizeWide=10.5&saleSizeHigh=6.5&saleWeight=0.266&volumeWeight=0.266&properties=8&shipTemplateId=&isPlatform=0&virWhCode=1433363&deliveryType=0&platformCategoryId=&recommendedLevel=1&backRuleId=']

with requests.Session() as s:
    for url in urls:
        r = s.get(url, headers={'User-Agent': 'Mozilla/5.0'}).json()
        print(r['data']['shippingMethodList'][0]['actualFee'])
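If you want every shipping option rather than only the first, you could walk the whole list instead of indexing [0]. A minimal sketch, reusing the urls list above and assuming each entry in shippingMethodList carries an actualFee field the way the first one does:
import requests

with requests.Session() as s:
    for url in urls:  # the two goods-shipping URLs defined above
        data = s.get(url, headers={'User-Agent': 'Mozilla/5.0'}).json()
        # each product can expose several shipping methods; print the fee of each
        for method in data.get('data', {}).get('shippingMethodList', []):
            print(method.get('actualFee'))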

Related

Beautiful soup not identifying children of an element

I am trying to scrape this webpage. I am interested in scraping the text under DIV CLASS="example".
This is the snippet of the script I am interested in (Stack Overflow automatically banned my post when I tried to post the code, lol):
[snapshot of the source code]
I tried using the find function from beautifulsoup. The code I used was:
import urllib.request
from bs4 import BeautifulSoup as soup

testurl = "https://www.snopes.com/fact-check/dark-profits/"
user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20100101 Firefox/12.0'
HEADERS = {'User-agent': user_agent}

req = urllib.request.Request(testurl, headers=HEADERS)  # visit disguised as browser
pagehtml = urllib.request.urlopen(req).read()  # read the website
pagesoup = soup(pagehtml, 'html.parser')

potentials = pagesoup.findAll("div", {"class": "example"})
potentials[0]
potentials[0].find_children
potentials[0].find_children was not able to find anything. I have also tried potentials[0].findChildren() and it was not able to find anything either. Why is find_children not picking up the children of the div tag?
Try to change the parser from html.parser to html5lib:
import requests
from bs4 import BeautifulSoup
url = "https://www.snopes.com/fact-check/dark-profits/"
soup = BeautifulSoup(requests.get(url).content, "html5lib")
print(soup.select_one(".example").get_text(strip=True, separator="\n"))
Prints:
Welcome to the site www.darkprofits.com, it's us again, now we extended our offerings, here is a list:
...and so on.
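If you want to keep working with the element tree rather than dump the text, the same parser swap should also let the child lookups from the question work. A minimal sketch, assuming the .example div is present in the HTML the server returns:
import requests
from bs4 import BeautifulSoup

url = "https://www.snopes.com/fact-check/dark-profits/"
soup = BeautifulSoup(requests.get(url).content, "html5lib")

example = soup.find("div", class_="example")
if example is not None:
    # iterate over the div's direct child tags only
    for child in example.find_all(recursive=False):
        print(child.name, child.get_text(strip=True))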

Scroll to next page and extract data

I'm trying to extract all the body texts in the Latest updates section of 'https://www.bbc.com/news/coronavirus'.
I have successfully extracted the body texts from the first page (1 out of 50).
I would like to scroll to the next page and repeat the process.
This is the code that I have written:
from bs4 import BeautifulSoup as soup
import requests

links = []
header = []
body_text = []

r = requests.get('https://www.bbc.com/news/coronavirus')
b = soup(r.content, 'lxml')

# Selecting Latest update selection
latest = b.find(class_="gel-layout__item gel-3/5#l")

# Getting title
for news in latest.findAll('h3'):
    header.append(news.text)
    #print(news.text)

# Getting sub-links
for news in latest.findAll('h3', {'class': 'lx-stream-post__header-title gel-great-primer-bold qa-post-title gs-u-mt0 gs-u-mb-'}):
    links.append('https://www.bbc.com' + news.a['href'])

# Entering sub-links and extracting texts
for link in links:
    page = requests.get(link)
    bsobj = soup(page.content, 'lxml')
    for news in bsobj.findAll('div', {'class': 'ssrcss-18snukc-RichTextContainer e5tfeyi1'}):
        body_text.append(news.text.strip())
        #print(news.text.strip())
How should I scroll to the next page?
Not sure exactly what text you are after, but you can go through the API.
import requests

url = 'https://push.api.bbci.co.uk/batch'
headers = {'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Mobile Safari/537.36'}

for page in range(1, 51):
    payload = '?t=%2Fdata%2Fbbc-morph-lx-commentary-data-paged%2Fabout%2F63b2bbc8-6bea-4a82-9f6b-6ecc470d0c45%2FisUk%2Ffalse%2Flimit%2F20%2FnitroKey%2Flx-nitro%2FpageNumber%2F{page}%2Fversion%2F1.5.4?timeout=5'.format(page=page)
    jsonData = requests.get(url + payload, headers=headers).json()
    results = jsonData['payload'][0]['body']['results']
    for result in results:
        print(result['title'])
        print('\t', result['summary'], '\n')
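If you want to collect the pieces into lists, the way the original script does, rather than print them, you can accumulate the fields while paging. A minimal sketch using only the title and summary keys shown above:
import requests

url = 'https://push.api.bbci.co.uk/batch'
headers = {'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Mobile Safari/537.36'}

header = []
body_text = []
for page in range(1, 51):
    payload = ('?t=%2Fdata%2Fbbc-morph-lx-commentary-data-paged%2Fabout%2F63b2bbc8-6bea-4a82-9f6b-6ecc470d0c45'
               '%2FisUk%2Ffalse%2Flimit%2F20%2FnitroKey%2Flx-nitro%2FpageNumber%2F{page}%2Fversion%2F1.5.4?timeout=5').format(page=page)
    results = requests.get(url + payload, headers=headers).json()['payload'][0]['body']['results']
    for result in results:
        header.append(result['title'])       # headline, like the original header list
        body_text.append(result['summary'])  # summary text standing in for the scraped body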

Why is this CSS selector returning no results?

I am following along with a web scraping example in Automate the Boring Stuff with Python, but my CSS selector is returning no results.
import bs4
import requests
import sys
import webbrowser

print("Googling ...")
res = requests.get('https://www.google.com/search?q=' + ' '.join(sys.argv[1:]))
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, "html.parser")
linkelems = soup.find_all(".r a")

numopen = min(5, len(linkelems))
for i in range(numopen):
    webbrowser.open('https://google.com' + linkelems[i].get('href'))
Has Google since modified how they store search links?
From inspecting the search page elements, I see no reason this selector would not work.
There are two problems:
1.) Instead of soup.find_all(".r a"), use soup.select(".r a"); only the .select() method accepts CSS selectors.
2.) Google needs you to specify a User-Agent header to return the correct page.
import bs4
import sys
import requests

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}

print("Googling ...")
res = requests.get('https://www.google.com/search?q=' + ' '.join(sys.argv[1:]), headers=headers)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, "html.parser")
linkelems = soup.select(".r a")

for a in linkelems:
    print(a.text)
Prints (for example):
Googling ...
Tree - Wikipediaen.wikipedia.org › wiki › Tree
... and so on.
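To get back to the book's behaviour of opening the top results in a browser, the same two fixes can be folded into the original script. A minimal sketch that keeps the question's own href handling, which assumes Google still serves the .r layout for this User-Agent and still returns relative hrefs under .r a:
import bs4
import sys
import webbrowser
import requests

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}

print("Googling ...")
res = requests.get('https://www.google.com/search?q=' + ' '.join(sys.argv[1:]), headers=headers)
res.raise_for_status()

soup = bs4.BeautifulSoup(res.text, "html.parser")
linkelems = soup.select(".r a")  # CSS selector needs .select(), not .find_all()

numopen = min(5, len(linkelems))
for i in range(numopen):
    # prefix the relative href with the Google domain, as in the original script
    webbrowser.open('https://google.com' + linkelems[i].get('href'))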
A complementary answer to Andrej Kesely's answer.
If you don't want to figure out which selectors to use or how to bypass blocks from Google, you can try the Google Search Engine Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that bypassing blocks, data extraction, and more are already done for the end user. All that needs to be done is to iterate over the structured JSON and pick out the data you want.
Example code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",               # search engine
    "q": "fus ro dah",                # query
    "api_key": os.getenv("API_KEY"),  # environment variable with your API-KEY
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    link = result['link']
    print(link)
------------
'''
https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)
https://knowyourmeme.com/memes/fus-ro-dah
https://en.uesp.net/wiki/Skyrim:Unrelenting_Force
https://www.urbandictionary.com/define.php?term=Fus%20ro%20dah
https://www.nexusmods.com/skyrimspecialedition/mods/4889/
https://www.nexusmods.com/skyrimspecialedition/mods/14094/
https://tenor.com/search/fus-ro-dah-gifs
'''
Disclaimer, I work for SerpApi.

Losing information when using BeautifulSoup

I am following the guide 'Automate the Boring Stuff with Python', practicing a project called 'Project: "I'm Feeling Lucky" Google Search', but the CSS selector returns nothing.
import requests, sys, webbrowser, bs4, pyperclip

if len(sys.argv) > 1:
    address = ' '.join(sys.argv[1:])
else:
    address = pyperclip.paste()

res = requests.get('http://google.com/search?q=' + str(address))
res.raise_for_status()

soup = bs4.BeautifulSoup(res.text, "html.parser")
linkElems = soup.select('.r a')
for i in range(5):
    webbrowser.open('http://google.com' + linkElems[i].get('href'))
I already tested the same code in the IDLE shell.
It seems that
linkElems = soup.select('.r')
returns nothing, and after I checked the value returned by BeautifulSoup
soup = bs4.BeautifulSoup(res.text, "html.parser")
I found that all the class='r' and class='rc' elements are gone for no reason, even though they were there in the raw HTML file.
Please tell me why, and how to avoid such problems.
To get the version of the HTML where class r is defined, it's necessary to set a User-Agent in the headers:
import requests
from bs4 import BeautifulSoup

address = 'linux'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0'}

res = requests.get('http://google.com/search?q=' + str(address), headers=headers)
res.raise_for_status()

soup = BeautifulSoup(res.text, "html.parser")
linkElems = soup.select('.r a')
for a in linkElems:
    if a.text.strip() == '':
        continue
    print(a.text)
Prints:
Linux.orghttps://www.linux.org/
Puhverdatud
Tõlgi see leht
Linux – Vikipeediahttps://et.wikipedia.org/wiki/Linux
Puhverdatud
Sarnased
Linux - Wikipediahttps://en.wikipedia.org/wiki/Linux
...and so on.
The reason Google blocks your request is that the default requests user-agent is python-requests. Check what your user-agent is; the default one gets the request blocked and results in completely different HTML with different elements and selectors. Even with a user-agent set, you can sometimes receive different HTML with different selectors.
Learn more about user-agent and HTTP request headers.
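A quick way to see the user-agent that requests sends by default is to hit an echo service such as httpbin (a minimal sketch; httpbin.org is just one convenient choice):
import requests

# httpbin echoes back the request headers it received
print(requests.get('https://httpbin.org/headers').json()['headers']['User-Agent'])
# prints something like: python-requests/2.x.x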
Pass user-agent into request headers:
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get('YOUR_URL', headers=headers)
Try to use the lxml parser instead; it's faster.
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "My query goes here"
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
    print(link)
-----
'''
https://dev.mysql.com/doc/refman/8.0/en/entering-queries.html
https://www.benlcollins.com/spreadsheets/google-sheets-query-sql/
https://www.exoscale.com/syslog/explaining-mysql-queries/
https://blog.hubspot.com/marketing/sql-tutorial-introduction
https://mode.com/sql-tutorial/sql-sub-queries/
https://www.mssqltips.com/sqlservertip/1255/getting-io-and-time-statistics-for-sql-server-queries/
https://stackoverflow.com/questions/2698401/how-to-store-mysql-query-results-in-another-table
https://www.khanacademy.org/computing/computer-programming/sql/relational-queries-in-sql/a/more-efficient-sql-with-query-planning-and-optimization
http://cidrdb.org/cidr2011/Papers/CIDR11_Paper7.pdf
https://www.sommarskog.se/query-plan-mysteries.html
'''
Alternatively, you can do the same thing by using the Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only need to extract the data you want from the JSON string rather than figuring out how to extract the data, maintain the parser, or bypass blocks from Google.
Code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "My query goes here",
    "hl": "en",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(result['link'])
-------
'''
https://dev.mysql.com/doc/refman/8.0/en/entering-queries.html
https://www.benlcollins.com/spreadsheets/google-sheets-query-sql/
https://www.exoscale.com/syslog/explaining-mysql-queries/
https://blog.hubspot.com/marketing/sql-tutorial-introduction
https://mode.com/sql-tutorial/sql-sub-queries/
https://www.mssqltips.com/sqlservertip/1255/getting-io-and-time-statistics-for-sql-server-queries/
https://stackoverflow.com/questions/2698401/how-to-store-mysql-query-results-in-another-table
https://www.khanacademy.org/computing/computer-programming/sql/relational-queries-in-sql/a/more-efficient-sql-with-query-planning-and-optimization
http://cidrdb.org/cidr2011/Papers/CIDR11_Paper7.pdf
https://www.sommarskog.se/query-plan-mysteries.html
'''
Disclaimer, I work for SerpApi.

Web Scraping Javascript Table using BeautifulSoup

I am relatively new to web scraping and am prototyping against various websites. I am having difficulties scraping what seem to be JavaScript-loaded tables. Any help would be much appreciated. The following is my code:
import requests
from bs4 import BeautifulSoup

url = 'https://onlineservice.cvo.org/webs/cvo/register/#/search/toronto/0/1/0/10'

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
tables = soup.find_all(class_='table')
print(tables)
Try the below URL to get all the information in the blink of an eye. You can find that URL with Chrome dev tools, under the XHR requests in the Network tab. Give it a shot:
import requests

URL = 'https://onlineservice.cvo.org/rest/public/registrant/search/?query=%20toronto&status=0&type=1&skip=0&take=427'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36',
}

response = requests.get(URL, headers=headers, verify=False)
for items in response.json()['result']:
    lastname = items['lastName']
    firstname = items['firstName']
    commonname = items['commonName']
    status = items['registrationStatus']['name']
    print(lastname, firstname, commonname, status)
Partial results:
Aadoson Andres Andres Active
Aarabi Alireza Allen Active
Aarnes Turi Turi Expired
Abbasi Tashfeen Tashfeen Active
Abbott Jonathan Jonathan Resigned
Abd El Nour Emad Emad Active
Abdel Hady Medhat Hady Active
Abdelhalim Khaled Khaled Active
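The take=427 in the URL simply asks for everything in one response. If the registry grows, you could presumably page through it with the skip and take parameters the site's own XHR call uses. A minimal sketch under that assumption; the paging behaviour isn't documented, so treat it as a guess:
import requests

URL = ('https://onlineservice.cvo.org/rest/public/registrant/search/'
       '?query=%20toronto&status=0&type=1&skip={skip}&take={take}')
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36',
}

skip, take = 0, 100
while True:
    # request one page of results at a time instead of all 427 at once
    batch = requests.get(URL.format(skip=skip, take=take), headers=headers, verify=False).json()['result']
    if not batch:  # an empty page means we've run out of rows
        break
    for items in batch:
        print(items['lastName'], items['firstName'], items['commonName'], items['registrationStatus']['name'])
    skip += take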
