Web Scraping a JavaScript Table Using BeautifulSoup

I am relatively new to web scraping and am prototyping with various websites. I am having difficulty scraping what seem to be JavaScript-loaded tables. Any help would be much appreciated. The following is my code:
import requests
from bs4 import BeautifulSoup

url = 'https://onlineservice.cvo.org/webs/cvo/register/#/search/toronto/0/1/0/10'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

tables = soup.find_all(class_='table')
print(tables)

Try the URL below to get all the information in the blink of an eye. You can find that URL with Chrome DevTools, under the Network tab, filtered to XHR requests. Give it a shot:
import requests

URL = 'https://onlineservice.cvo.org/rest/public/registrant/search/?query=%20toronto&status=0&type=1&skip=0&take=427'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36',
}

response = requests.get(URL, headers=headers, verify=False)

for items in response.json()['result']:
    lastname = items['lastName']
    firstname = items['firstName']
    commonname = items['commonName']
    status = items['registrationStatus']['name']
    print(lastname, firstname, commonname, status)
Partial results:
Aadoson Andres Andres Active
Aarabi Alireza Allen Active
Aarnes Turi Turi Expired
Abbasi Tashfeen Tashfeen Active
Abbott Jonathan Jonathan Resigned
Abd El Nour Emad Emad Active
Abdel Hady Medhat Hady Active
Abdelhalim Khaled Khaled Active
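If you would rather not hard-code the total (take=427) up front, you can page through the same endpoint using the skip/take parameters visible in the URL. The following is a minimal, untested sketch of that idea; it assumes the API keeps honouring skip/take and returns an empty result list once you have read past the end:
import requests

# Hypothetical paging template built from the URL shown above
BASE = ('https://onlineservice.cvo.org/rest/public/registrant/search/'
        '?query=%20toronto&status=0&type=1&skip={skip}&take={take}')
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'}

page_size = 50
skip = 0
rows = []
while True:
    resp = requests.get(BASE.format(skip=skip, take=page_size), headers=headers, verify=False)
    batch = resp.json().get('result', [])
    if not batch:
        break  # no more registrants to fetch
    rows.extend(batch)
    skip += page_size

print(len(rows), 'registrants fetched')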

Related

Beautiful soup not identifying children of an element

I am trying to scrape this webpage. I am interested in scraping the text under the div with class="example".
This is the snippet of the script I am interested in (Stack Overflow automatically banned my post when I tried to post the code, lol):
[snapshot of the source code]
I tried using the find function from BeautifulSoup. The code I used was:
import urllib.request
from bs4 import BeautifulSoup as soup

testurl = "https://www.snopes.com/fact-check/dark-profits/"
user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20100101 Firefox/12.0'
HEADERS = {'User-agent': user_agent}

req = urllib.request.Request(testurl, headers=HEADERS)  # visit disguised as a browser
pagehtml = urllib.request.urlopen(req).read()            # read the website
pagesoup = soup(pagehtml, 'html.parser')

potentials = pagesoup.findAll("div", {"class": "example"})
potentials[0]
potentials[0].find_children
potentials[0].find_children was not able to find anything. I have also tried potentials[0].findChildren() and it was not able to find anything either. Why is find_children not picking up the children of the div tag?
Try changing the parser from html.parser to html5lib:
import requests
from bs4 import BeautifulSoup
url = "https://www.snopes.com/fact-check/dark-profits/"
soup = BeautifulSoup(requests.get(url).content, "html5lib")
print(soup.select_one(".example").get_text(strip=True, separator="\n"))
Prints:
Welcome to the site www.darkprofits.com, it's us again, now we extended our offerings, here is a list:
...and so on.
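To tie this back to the original question: with html5lib the example div is parsed with its children intact, and findChildren() then works on it (find_children is not a BeautifulSoup method, which is why it returned nothing). Here is a minimal sketch, assuming the page still contains a div with class example:
import requests
from bs4 import BeautifulSoup

url = "https://www.snopes.com/fact-check/dark-profits/"
soup = BeautifulSoup(requests.get(url).content, "html5lib")

example = soup.find("div", {"class": "example"})
if example is not None:
    # findChildren() is the supported spelling; recursive=False keeps only direct children
    for child in example.findChildren(recursive=False):
        print(child.name)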

Scroll to next page and extract data

I'm trying to extract all the body texts in Latest updates from 'https://www.bbc.com/news/coronavirus'.
I have successfully extracted the body texts from the first page (1 out of 50).
I would like to scroll to the next page and repeat this process.
This is the code that I have written:
from bs4 import BeautifulSoup as soup
import requests

links = []
header = []
body_text = []

r = requests.get('https://www.bbc.com/news/coronavirus')
b = soup(r.content, 'lxml')

# Selecting the Latest updates section
latest = b.find(class_="gel-layout__item gel-3/5#l")

# Getting titles
for news in latest.findAll('h3'):
    header.append(news.text)
    #print(news.text)

# Getting sub-links
for news in latest.findAll('h3', {'class': 'lx-stream-post__header-title gel-great-primer-bold qa-post-title gs-u-mt0 gs-u-mb-'}):
    links.append('https://www.bbc.com' + news.a['href'])

# Entering sub-links and extracting texts
for link in links:
    page = requests.get(link)
    bsobj = soup(page.content, 'lxml')
    for news in bsobj.findAll('div', {'class': 'ssrcss-18snukc-RichTextContainer e5tfeyi1'}):
        body_text.append(news.text.strip())
        #print(news.text.strip())
How should I scroll to the next page?
Not sure exactly what text you are after, but you can go through the API.
import requests

url = 'https://push.api.bbci.co.uk/batch'
headers = {'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Mobile Safari/537.36'}

for page in range(1, 51):
    payload = '?t=%2Fdata%2Fbbc-morph-lx-commentary-data-paged%2Fabout%2F63b2bbc8-6bea-4a82-9f6b-6ecc470d0c45%2FisUk%2Ffalse%2Flimit%2F20%2FnitroKey%2Flx-nitro%2FpageNumber%2F{page}%2Fversion%2F1.5.4?timeout=5'.format(page=page)
    jsonData = requests.get(url + payload, headers=headers).json()
    results = jsonData['payload'][0]['body']['results']
    for result in results:
        print(result['title'])
        print('\t', result['summary'], '\n')
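If you also want the full body text of each article, as in your original loop, you can feed the links from the JSON back into BeautifulSoup. The snippet below is a rough sketch meant to sit inside the page loop above (it reuses results, requests, and headers from there); it assumes each result exposes a url field pointing at the article (check one page of the JSON in your browser first) and reuses the ssrcss-18snukc-RichTextContainer e5tfeyi1 class from your question:
from bs4 import BeautifulSoup

body_text = []
for result in results:
    link = result.get('url')  # assumed field name; verify against the real JSON
    if not link:
        continue  # some entries are short updates with no separate article
    if link.startswith('/'):
        link = 'https://www.bbc.com' + link
    article = requests.get(link, headers=headers)
    bsobj = BeautifulSoup(article.content, 'lxml')
    for para in bsobj.findAll('div', {'class': 'ssrcss-18snukc-RichTextContainer e5tfeyi1'}):
        body_text.append(para.text.strip())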

Losing information when using BeautifulSoup

I am following the guide 'Automate the Boring Stuff with Python',
practicing the project 'Project: "I'm Feeling Lucky" Google Search',
but the CSS selector returns nothing.
import requests, sys, webbrowser, bs4, pyperclip

if len(sys.argv) > 1:
    address = ' '.join(sys.argv[1:])
else:
    address = pyperclip.paste()

res = requests.get('http://google.com/search?q=' + str(address))
res.raise_for_status()

soup = bs4.BeautifulSoup(res.text, "html.parser")
linkElems = soup.select('.r a')
for i in range(5):
    webbrowser.open('http://google.com' + linkElems[i].get('href'))
I already tested the same code in the IDLE shell.
It seems that
linkElems = soup.select('.r')
returns nothing, and after I checked the value returned by BeautifulSoup in
soup = bs4.BeautifulSoup(res.text, "html.parser")
I found that all elements with class='r' and class='rc' are gone for no reason,
but they were there in the raw HTML file.
Please tell me why, and how to avoid such problems.
To get the version of the HTML where class r is defined, it's necessary to set a User-Agent in the headers:
import requests
from bs4 import BeautifulSoup

address = 'linux'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0'}

res = requests.get('http://google.com/search?q=' + str(address), headers=headers)
res.raise_for_status()

soup = BeautifulSoup(res.text, "html.parser")
linkElems = soup.select('.r a')
for a in linkElems:
    if a.text.strip() == '':
        continue
    print(a.text)
Prints:
Linux.orghttps://www.linux.org/
Puhverdatud
Tõlgi see leht
Linux – Vikipeediahttps://et.wikipedia.org/wiki/Linux
Puhverdatud
Sarnased
Linux - Wikipediahttps://en.wikipedia.org/wiki/Linux
...and so on.
The reason Google blocks your request is that the default requests user-agent is python-requests. Google checks the user-agent, blocks the request, and returns completely different HTML with different elements and selectors. Even with a user-agent set you can occasionally receive different HTML with different selectors.
Learn more about user-agent and HTTP request headers.
Pass user-agent into request headers:
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get('YOUR_URL', headers=headers)
Also, try the lxml parser instead of html.parser; it's faster.
Code and full example:
from bs4 import BeautifulSoup
import requests

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "My query goes here"
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
    print(link)
-----
'''
https://dev.mysql.com/doc/refman/8.0/en/entering-queries.html
https://www.benlcollins.com/spreadsheets/google-sheets-query-sql/
https://www.exoscale.com/syslog/explaining-mysql-queries/
https://blog.hubspot.com/marketing/sql-tutorial-introduction
https://mode.com/sql-tutorial/sql-sub-queries/
https://www.mssqltips.com/sqlservertip/1255/getting-io-and-time-statistics-for-sql-server-queries/
https://stackoverflow.com/questions/2698401/how-to-store-mysql-query-results-in-another-table
https://www.khanacademy.org/computing/computer-programming/sql/relational-queries-in-sql/a/more-efficient-sql-with-query-planning-and-optimization
http://cidrdb.org/cidr2011/Papers/CIDR11_Paper7.pdf
https://www.sommarskog.se/query-plan-mysteries.html
'''
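To get back to the "I'm Feeling Lucky" behaviour from the book project, you can open the first few of those links in a browser instead of printing them. A small sketch to run after the scraping code above, assuming the .tF2Cxc / .yuRUbf selectors still match Google's markup:
import webbrowser

# collect the result links exactly as above, then open the first five in the default browser
links = [result.select_one('.yuRUbf a')['href'] for result in soup.select('.tF2Cxc')]
for href in links[:5]:
    webbrowser.open(href)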
Alternatively, you can do the same thing using the Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only need to extract the data you want from a JSON string, rather than figuring out how to extract data, maintain the parser, or bypass blocks from Google.
Code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "My query goes here",
    "hl": "en",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(result['link'])
-------
'''
https://dev.mysql.com/doc/refman/8.0/en/entering-queries.html
https://www.benlcollins.com/spreadsheets/google-sheets-query-sql/
https://www.exoscale.com/syslog/explaining-mysql-queries/
https://blog.hubspot.com/marketing/sql-tutorial-introduction
https://mode.com/sql-tutorial/sql-sub-queries/
https://www.mssqltips.com/sqlservertip/1255/getting-io-and-time-statistics-for-sql-server-queries/
https://stackoverflow.com/questions/2698401/how-to-store-mysql-query-results-in-another-table
https://www.khanacademy.org/computing/computer-programming/sql/relational-queries-in-sql/a/more-efficient-sql-with-query-planning-and-optimization
http://cidrdb.org/cidr2011/Papers/CIDR11_Paper7.pdf
https://www.sommarskog.se/query-plan-mysteries.html
'''
Disclaimer, I work for SerpApi.

Web scraping Gearbest with Python

EDITED:
I've been trying to pull some data about several products from Gearbest.com, and I'm having real trouble pulling the shipping price.
I'm working with requests and BeautifulSoup, and so far I've managed to get the name + link + price.
How can I get the shipping price?
The URLs are:
https://www.gearbest.com/gaming-laptops/pp_009863068586.html?wid=1433363
https://www.gearbest.com/smart-watch-phone/pp_009309925869.html?wid=1433363
I've tried:
shipping = soup.find("span", class_="goodsIntro_attrText").get("strong)
shipping = soup.find("span", class_="goodsIntro_attrText").get("strong).text
shipping = soup.find("strong", class_="goodsIntro_shippingCost")
shipping = soup.find("strong", class_="goodsIntro_shippingCost").text
soup is the return value from here (the url is each product link):
def get_page(url):
    client = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"})
    try:
        client.raise_for_status()
    except requests.exceptions.HTTPError as e:
        print("Error in gearbest with the url:", url)
        exit(0)
    soup = BeautifulSoup(client.content, 'lxml')
    return soup
Any ideas what I can do?
You want to use soup, not souo. Also, there seems to be a difference between what is returned from the request and what is shown on the page for me.
from bs4 import BeautifulSoup as bs
import requests

urls = ['https://www.gearbest.com/gaming-laptops/pp_009863068586.html?wid=1433363',
        'https://www.gearbest.com/smart-watch-phone/pp_009309925869.html?wid=1433363i']

with requests.Session() as s:
    for url in urls:
        r = s.get(url, headers={'User-Agent': 'Mozilla/5.0'})
        soup = bs(r.content, 'lxml')
        print(soup.select_one('.goodsIntro_price').text)
        print(soup.select_one('.goodsIntro_shippingCost').text)  # soup.find("strong", class_="goodsIntro_shippingCost").text
For the actual shipping price there seem to be dynamic feeds visible in the Network tab, where the value is stored under actualFee. So shipping prices are probably updated dynamically based on location.
from bs4 import BeautifulSoup as bs
import requests

urls = ['https://www.gearbest.com/goods/goods-shipping?goodSn=455718101&countryCode=GB&realWhCodeList=1433363&realWhCode=1433363&priceMd5=540DCE6E4F455639641E0BB2B6356F15&goodPrice=1729.99&num=1&categoryId=13300&saleSizeLong=50&saleSizeWide=40&saleSizeHigh=10&saleWeight=4.5&volumeWeight=4.5&properties=8&shipTemplateId=&isPlatform=0&virWhCode=1433363&deliveryType=0&platformCategoryId=&recommendedLevel=2&backRuleId=',
        'https://www.gearbest.com/goods/goods-shipping?goodSn=459768501&countryCode=GB&realWhCodeList=1433363&realWhCode=1433363&priceMd5=91D909FDFFE8F8F1F9D1EC1D5D1B7C2C&goodPrice=159.99&num=1&categoryId=12004&saleSizeLong=12&saleSizeWide=10.5&saleSizeHigh=6.5&saleWeight=0.266&volumeWeight=0.266&properties=8&shipTemplateId=&isPlatform=0&virWhCode=1433363&deliveryType=0&platformCategoryId=&recommendedLevel=1&backRuleId=']

with requests.Session() as s:
    for url in urls:
        r = s.get(url, headers={'User-Agent': 'Mozilla/5.0'}).json()
        print(r['data']['shippingMethodList'][0]['actualFee'])

POST data into ASPX with requests. How to fix the "remote machine error" message

I'm trying to scrape data from an ASPX page using requests with POST data.
In the parsed HTML I'm getting the error "An application error occurred on the server. The current custom error settings for this application prevent the details of the application error from being viewed remotely (for security reasons). It could, however, be viewed by browsers running on the local server machine."
I have been searching for solutions for a while, but frankly I'm new to Python and can't really figure out what's wrong.
The ASPX page has a JavaScript onclick function which opens a new window with the data in HTML.
The code I've created is below.
Any help or suggestions would be greatly welcomed. Thank you!
import requests
from bs4 import BeautifulSoup

session = requests.Session()
url = 'http://ws1.osfi-bsif.gc.ca/WebApps/FINDAT/Insurance.aspx?T=0&LANG=E'

# GET the form page first to pick up the ASP.NET state fields
r = session.get(url)
soup = BeautifulSoup(r.content, 'lxml')
viewstate = soup.select("#__VIEWSTATE")[0]['value']
eventvalidation = soup.select("#__EVENTVALIDATION")[0]['value']

payload = {
    r'__EVENTTARGET': r'',
    r'__EVENTARGUMENT': r'',
    r'__LASTFOCUS': r'',
    r'__VIEWSTATE': viewstate,
    r'__VIEWSTATEGENERATOR': r'B2E4460D',
    r'__EVENTVALIDATION': eventvalidation,
    r'InsuranceWebPartManager$gwpinsuranceControl$insuranceControl$institutionType': r'radioButton1',
    r'InsuranceWebPartManager$gwpinsuranceControl$insuranceControl$institutionDropDownList': r'F018',
    r'InsuranceWebPartManager$gwpinsuranceControl$insuranceControl$reportTemplateDropDownList': r'C_LIFE-1',
    r'InsuranceWebPartManager$gwpinsuranceControl$insuranceControl$reportDateDropDownList': r'3+-+2015',
    r'InsuranceWebPartManager$gwpinsuranceControl$insuranceControl$submitButton': r'Submit'
}

HEADER = {
    "Content-Type": "application/x-www-form-urlencoded",
    "Content-Length": "11759",
    "Host": "ws1.osfi-bsif.gc.ca",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Cache-Control": "max-age=0",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "keep-alive",
}

# POST the form back to the same URL
df = session.post(url, data=payload, headers=HEADER)
print(df.text)
