I'm trying to extract all the body text from the Latest updates section of 'https://www.bbc.com/news/coronavirus'.
I have successfully extracted the body text from the first page (1 out of 50).
I would like to scroll to the next page and do this process again.
This is the code that I have written:
from bs4 import BeautifulSoup as soup
import requests
links = []
header = []
body_text = []
r = requests.get('https://www.bbc.com/news/coronavirus')
b = soup(r.content,'lxml')
# Selecting the Latest updates section
latest = b.find(class_="gel-layout__item gel-3/5@l")
# Getting titles
for news in latest.findAll('h3'):
    header.append(news.text)
    #print(news.text)

# Getting sub-links
for news in latest.findAll('h3', {'class': 'lx-stream-post__header-title gel-great-primer-bold qa-post-title gs-u-mt0 gs-u-mb-'}):
    links.append('https://www.bbc.com' + news.a['href'])

# Entering sub-links and extracting texts
for link in links:
    page = requests.get(link)
    bsobj = soup(page.content, 'lxml')
    for news in bsobj.findAll('div', {'class': 'ssrcss-18snukc-RichTextContainer e5tfeyi1'}):
        body_text.append(news.text.strip())
        #print(news.text.strip())
How should I scroll to the next page?
Not sure exactly what text you are after, but you can go through the API.
import requests
url = 'https://push.api.bbci.co.uk/batch'
headers = {'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Mobile Safari/537.36'}
for page in range(1, 51):
    payload = '?t=%2Fdata%2Fbbc-morph-lx-commentary-data-paged%2Fabout%2F63b2bbc8-6bea-4a82-9f6b-6ecc470d0c45%2FisUk%2Ffalse%2Flimit%2F20%2FnitroKey%2Flx-nitro%2FpageNumber%2F{page}%2Fversion%2F1.5.4?timeout=5'.format(page=page)
    jsonData = requests.get(url + payload, headers=headers).json()
    results = jsonData['payload'][0]['body']['results']

    for result in results:
        print(result['title'])
        print('\t', result['summary'], '\n')
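If you also want the article bodies (as in your original loop), you can feed the links from the API back into your own extraction. Here's a rough sketch, assuming each result also exposes a url field (some live-stream posts may not link to a full article, hence the guard) and that the RichTextContainer class from your code is still current:

from bs4 import BeautifulSoup

body_text = []
for result in results:                    # results from the loop above
    link = result.get('url') or ''        # assumption: article path, e.g. /news/...
    if not link:
        continue                          # skip posts without a linked article
    if link.startswith('/'):
        link = 'https://www.bbc.com' + link
    article = requests.get(link, headers=headers)
    bsobj = BeautifulSoup(article.content, 'lxml')
    for div in bsobj.find_all('div', class_='ssrcss-18snukc-RichTextContainer e5tfeyi1'):
        body_text.append(div.get_text(strip=True))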
I am trying to scrape this webpage. I am interested in scraping the text under DIV CLASS="example".
This is the snippet of the script I am interested in (Stack Overflow automatically banned my post when I tried to post the code, lol):
snapshot of the sourcecode
I tried using the find function from BeautifulSoup. The code I used was:
import urllib.request
from bs4 import BeautifulSoup as soup

testurl = "https://www.snopes.com/fact-check/dark-profits/"
user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20100101 Firefox/12.0'
HEADERS = {'User-agent': user_agent}

req = urllib.request.Request(testurl, headers=HEADERS)  # visit disguised as a browser
pagehtml = urllib.request.urlopen(req).read()            # read the website
pagesoup = soup(pagehtml, 'html.parser')

potentials = pagesoup.findAll("div", {"class": "example"})
potentials[0]
potentials[0].find_children
potentials[0].find_children was not able to find anything. I have also tried potentials[0].findChildren() and it was not able to find anything either. Why is find_children not picking up the children of the div tag?
Try to change the parser from html.parser to html5lib:
import requests
from bs4 import BeautifulSoup
url = "https://www.snopes.com/fact-check/dark-profits/"
soup = BeautifulSoup(requests.get(url).content, "html5lib")
print(soup.select_one(".example").get_text(strip=True, separator="\n"))
Prints:
Welcome to the site www.darkprofits.com, it's us again, now we extended our offerings, here is a list:
...and so on.
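With html5lib the <div class="example"> is parsed with its children intact, so your original approach should also work, as long as findChildren is actually called as a method. A quick sketch, reusing the soup object from the snippet above:

# reuse the html5lib-parsed soup from the snippet above
potentials = soup.find_all("div", {"class": "example"})
if potentials:
    # findChildren() (with parentheses) is the real method; find_children is not a BeautifulSoup attribute
    for child in potentials[0].findChildren():
        print(child.name, child.get_text(strip=True))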
I'm trying to scrape a webpage. For a few elements I got the data using the class attribute, but the problem is that when my loop goes to each URL to extract the information, it should also extract the contact number.
The contact number is not directly available; when we click the "CALL NOW" button, a pop-up card opens to show the contact number.
I tried using the class of that phone number element, but I'm still not getting the phone number.
try:
    contact = soup.find('div', class_='c-vn-full__number u-bold').text.strip()
except:
    contact = "N/A"
Is there any way to achieve the result?
Also, I'm left with one more element to extract, the "consulting fees" (price) as text, but it has no class attribute.
Try this:
import requests
from bs4 import BeautifulSoup
url = "https://www.practo.com/Bangalore/doctor/dr-venkata-krishna-rao-diabetologist-1?practice_id=776084&specialization=general%20physician"
soup = BeautifulSoup(requests.get(url).text, "html.parser").select(".u-no-margin--top")[-1]
print(soup.getText())
Output:
₹400
EDIT:
To get contact details, you need to get practice_id, doctor_id, and query_string from the source HTML. There's a huge JSON embedded there but I thought it's less hassle just scooping out the necessary parts rather than parsing this monster.
Once you have all the parts, you can use an endpoint to get the contact details.
Here's how to get this done:
import json
import re

import requests

url = "https://www.practo.com/Bangalore/doctor/" \
      "dr-venkata-krishna-rao-diabetologist-1?" \
      "practice_id=776084&specialization=general%20physician"

page = requests.get(url).text

query_string_pattern = re.compile(r"query_string\":\"(.*?)\"")
practice_doctor_uuid = re.compile(
    r"(practice|doctor)_id\":"
    r"\"([a-f0-9]{8}-[a-f0-9]{4}-4[a-f0-9]{3}-[89aAbB][a-f0-9]{3}-[a-f0-9]{12})"
)

practice_id, doctor_id = [i[1] for i in re.findall(practice_doctor_uuid, page)[:2]]
query_string = re.search(query_string_pattern, page).group(1)

practice_url = "https://www.practo.com/health/api/vn/vnpractice"
query = f"{query_string}&practice_uuid={practice_id}&doctor_uuid={doctor_id}"
endpoint_url = f"{practice_url}{query}"

headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36"
}

contact_info = requests.get(endpoint_url, headers=headers).json()
print(json.dumps(contact_info["vn_phone_number"], indent=2))
Output:
{
  "number": "+918046801985",
  "operator": "VOICE",
  "vn_zone_id": 1,
  "country_code": "IN",
  "extension": true,
  "id": 49090
}
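If you want to keep the N/A fallback from your original snippet, here is a small sketch that combines both lookups, reusing page and contact_info from the code above (the fee selector is the same .u-no-margin--top used in the first snippet):

from bs4 import BeautifulSoup

# consulting fee: last element carrying the .u-no-margin--top class
try:
    fee = BeautifulSoup(page, "html.parser").select(".u-no-margin--top")[-1].get_text()
except IndexError:
    fee = "N/A"

# contact number: taken from the endpoint response above
try:
    contact = contact_info["vn_phone_number"]["number"]
except (KeyError, TypeError):
    contact = "N/A"

print(fee, contact)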
I am following the guide 'Automate the Boring Stuff with Python',
practicing a project called 'Project: "I'm Feeling Lucky" Google Search',
but the CSS selector returns nothing.
import requests, sys, webbrowser, bs4, pyperclip

if len(sys.argv) > 1:
    address = ' '.join(sys.argv[1:])
else:
    address = pyperclip.paste()

res = requests.get('http://google.com/search?q=' + str(address))
res.raise_for_status()

soup = bs4.BeautifulSoup(res.text, "html.parser")
linkElems = soup.select('.r a')

for i in range(5):
    webbrowser.open('http://google.com' + linkElems[i].get('href'))
I already tested the same code in the IDLE shell.
It seems that
linkElems = soup.select('.r')
returns nothing.
After I checked the value returned by BeautifulSoup in
soup = bs4.BeautifulSoup(res.text, "html.parser")
I found that all class='r' and class='rc' elements are gone for no reason,
but they were there in the raw HTML file.
Please tell me why this happens and how to avoid such problems.
To get the version of the HTML where class r is defined, it's necessary to set a User-Agent in the headers:
import requests
from bs4 import BeautifulSoup
address = 'linux'
headers={'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0'}
res = requests.get('http://google.com/search?q=' + str(address), headers=headers)
res.raise_for_status()
soup = BeautifulSoup(res.text,"html.parser")
linkElems = soup.select('.r a')
for a in linkElems:
    if a.text.strip() == '':
        continue
    print(a.text)
Prints:
Linux.orghttps://www.linux.org/
Puhverdatud
Tõlgi see leht
Linux – Vikipeediahttps://et.wikipedia.org/wiki/Linux
Puhverdatud
Sarnased
Linux - Wikipediahttps://en.wikipedia.org/wiki/Linux
...and so on.
The reason why Google blocks your request is that the default requests user-agent is python-requests. Check what your user-agent is; that is what triggers the block and results in completely different HTML with different elements and selectors. Sometimes you can still receive different HTML, with different selectors, even when a user-agent is set.
Learn more about user-agent and HTTP request headers.
Pass user-agent into request headers:
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get('YOUR_URL', headers=headers)
Try to use the lxml parser instead; it's faster.
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "My query goes here"
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
    print(link)
-----
'''
https://dev.mysql.com/doc/refman/8.0/en/entering-queries.html
https://www.benlcollins.com/spreadsheets/google-sheets-query-sql/
https://www.exoscale.com/syslog/explaining-mysql-queries/
https://blog.hubspot.com/marketing/sql-tutorial-introduction
https://mode.com/sql-tutorial/sql-sub-queries/
https://www.mssqltips.com/sqlservertip/1255/getting-io-and-time-statistics-for-sql-server-queries/
https://stackoverflow.com/questions/2698401/how-to-store-mysql-query-results-in-another-table
https://www.khanacademy.org/computing/computer-programming/sql/relational-queries-in-sql/a/more-efficient-sql-with-query-planning-and-optimization
http://cidrdb.org/cidr2011/Papers/CIDR11_Paper7.pdf
https://www.sommarskog.se/query-plan-mysteries.html
'''
Alternatively, you can do the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only need to extract the data you want from the JSON string, rather than figuring out how to extract things, maintain the parser, or bypass blocks from Google.
Code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "My query goes here",
    "hl": "en",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(result['link'])
-------
'''
https://dev.mysql.com/doc/refman/8.0/en/entering-queries.html
https://www.benlcollins.com/spreadsheets/google-sheets-query-sql/
https://www.exoscale.com/syslog/explaining-mysql-queries/
https://blog.hubspot.com/marketing/sql-tutorial-introduction
https://mode.com/sql-tutorial/sql-sub-queries/
https://www.mssqltips.com/sqlservertip/1255/getting-io-and-time-statistics-for-sql-server-queries/
https://stackoverflow.com/questions/2698401/how-to-store-mysql-query-results-in-another-table
https://www.khanacademy.org/computing/computer-programming/sql/relational-queries-in-sql/a/more-efficient-sql-with-query-planning-and-optimization
http://cidrdb.org/cidr2011/Papers/CIDR11_Paper7.pdf
https://www.sommarskog.se/query-plan-mysteries.html
'''
Disclaimer, I work for SerpApi.
I've been trying to pull some data from Gearbest.com about several products, and I'm having real trouble pulling the shipping price.
I'm working with requests and BeautifulSoup, and so far I've managed to get the name + link + price.
How can I get the shipping price?
The URLs are:
https://www.gearbest.com/gaming-laptops/pp_009863068586.html?wid=1433363
https://www.gearbest.com/smart-watch-phone/pp_009309925869.html?wid=1433363
I've tried:
shipping = soup.find("span", class_="goodsIntro_attrText").get("strong)
shipping = soup.find("span", class_="goodsIntro_attrText").get("strong).text
shipping = soup.find("strong", class_="goodsIntro_shippingCost")
shipping = soup.find("strong", class_="goodsIntro_shippingCost").text
soup is the return value from here (the url is each product link):
def get_page(url):
    client = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"})
    try:
        client.raise_for_status()
    except requests.exceptions.HTTPError as e:
        print("Error in gearbest with the url:", url)
        exit(0)
    soup = BeautifulSoup(client.content, 'lxml')
    return soup
Any ideas what I can do?
You want to use soup, not souo. Also, there seems to be a difference between what is returned from requests versus what is on the page for me.
from bs4 import BeautifulSoup as bs
import requests

urls = ['https://www.gearbest.com/gaming-laptops/pp_009863068586.html?wid=1433363',
        'https://www.gearbest.com/smart-watch-phone/pp_009309925869.html?wid=1433363']

with requests.Session() as s:
    for url in urls:
        r = s.get(url, headers={'User-Agent': 'Mozilla/5.0'})
        soup = bs(r.content, 'lxml')
        print(soup.select_one('.goodsIntro_price').text)
        print(soup.select_one('.goodsIntro_shippingCost').text)  # soup.find("strong", class_="goodsIntro_shippingCost").text
For the actual shipping price, it seems there are dynamic feeds in the network tab, where the value is stored under actualFee. So perhaps there is dynamic, location-based updating of shipping prices.
from bs4 import BeautifulSoup as bs
import requests

urls = ['https://www.gearbest.com/goods/goods-shipping?goodSn=455718101&countryCode=GB&realWhCodeList=1433363&realWhCode=1433363&priceMd5=540DCE6E4F455639641E0BB2B6356F15&goodPrice=1729.99&num=1&categoryId=13300&saleSizeLong=50&saleSizeWide=40&saleSizeHigh=10&saleWeight=4.5&volumeWeight=4.5&properties=8&shipTemplateId=&isPlatform=0&virWhCode=1433363&deliveryType=0&platformCategoryId=&recommendedLevel=2&backRuleId=',
        'https://www.gearbest.com/goods/goods-shipping?goodSn=459768501&countryCode=GB&realWhCodeList=1433363&realWhCode=1433363&priceMd5=91D909FDFFE8F8F1F9D1EC1D5D1B7C2C&goodPrice=159.99&num=1&categoryId=12004&saleSizeLong=12&saleSizeWide=10.5&saleSizeHigh=6.5&saleWeight=0.266&volumeWeight=0.266&properties=8&shipTemplateId=&isPlatform=0&virWhCode=1433363&deliveryType=0&platformCategoryId=&recommendedLevel=1&backRuleId=']

with requests.Session() as s:
    for url in urls:
        r = s.get(url, headers={'User-Agent': 'Mozilla/5.0'}).json()
        print(r['data']['shippingMethodList'][0]['actualFee'])
I am relatively new to web scraping and prototyping using various websites. I am having difficulties with scraping what seems to be JavaScript-loaded tables. Any help would be much appreciated. The following is my code:
import requests
from bs4 import BeautifulSoup
url = 'https://onlineservice.cvo.org/webs/cvo/register/#/search/ toronto/0/1/0/10'

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
tables = soup.find_all(class_='table')
print(tables)
Try the below URL to get all the information in the blink of an eye. You can retrieve that URL by using Chrome dev tools and looking at the XHR requests under the Network tab. Give it a shot:
import requests
URL = 'https://onlineservice.cvo.org/rest/public/registrant/search/?query=%20toronto&status=0&type=1&skip=0&take=427'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36',
}

response = requests.get(URL, headers=headers, verify=False)

for items in response.json()['result']:
    lastname = items['lastName']
    firstname = items['firstName']
    commonname = items['commonName']
    status = items['registrationStatus']['name']
    print(lastname, firstname, commonname, status)
Partial results:
Aadoson Andres Andres Active
Aarabi Alireza Allen Active
Aarnes Turi Turi Expired
Abbasi Tashfeen Tashfeen Active
Abbott Jonathan Jonathan Resigned
Abd El Nour Emad Emad Active
Abdel Hady Medhat Hady Active
Abdelhalim Khaled Khaled Active
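If you'd rather page through the endpoint instead of pulling everything in one request, the skip and take parameters in that URL look like an offset and a page size. A sketch under that assumption:

import requests

BASE = 'https://onlineservice.cvo.org/rest/public/registrant/search/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36',
}

take, skip = 50, 0
while True:
    params = {'query': ' toronto', 'status': 0, 'type': 1, 'skip': skip, 'take': take}
    result = requests.get(BASE, params=params, headers=headers, verify=False).json()['result']
    if not result:
        break                                 # no more rows to fetch
    for items in result:
        print(items['lastName'], items['firstName'], items['registrationStatus']['name'])
    skip += take                              # assumption: skip acts as a row offset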