Beautiful soup not identifying children of an element - web-scraping

I am trying to scrape this webpage. I am interested in scraping the text under DIV CLASS="example".
This is the snippet of the script I am interested in (Stack Overflow automatically banned my post when I tried to post the code, lol):
[snapshot of the source code]
I tried using the find function from beautifulsoup. The code I used was:
import urllib.request
from bs4 import BeautifulSoup as soup

testurl = "https://www.snopes.com/fact-check/dark-profits/"
user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20100101 Firefox/12.0'
HEADERS = {'User-agent': user_agent}
req = urllib.request.Request(testurl, headers=HEADERS)  # visit disguised as browser
pagehtml = urllib.request.urlopen(req).read()  # read the website
pagesoup = soup(pagehtml, 'html.parser')
potentials = pagesoup.findAll("div", {"class": "example"})
potentials[0]
potentials[0].find_children
potentials[0].find_children was not able to find anything. I have also tried potentials[0].findChildren() and it was not able to find anything either. Why is find_children not picking up the children of the div tag?

Try to change the parser from html.parser to html5lib:
import requests
from bs4 import BeautifulSoup
url = "https://www.snopes.com/fact-check/dark-profits/"
soup = BeautifulSoup(requests.get(url).content, "html5lib")
print(soup.select_one(".example").get_text(strip=True, separator="\n"))
Prints:
Welcome to the site www.darkprofits.com, it's us again, now we extended our offerings, here is a list:
...and so on.
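As a quick sanity check that the parser switch also fixes the original findChildren() call, here is a minimal sketch; it assumes html5lib is installed (pip install html5lib) and reuses the URL and browser headers from the question:

import requests
from bs4 import BeautifulSoup

url = "https://www.snopes.com/fact-check/dark-profits/"
headers = {"User-agent": "Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20100101 Firefox/12.0"}

# html5lib is more forgiving of malformed markup than html.parser,
# so the div's nested tags actually end up in the parse tree
pagesoup = BeautifulSoup(requests.get(url, headers=headers).content, "html5lib")
potentials = pagesoup.find_all("div", {"class": "example"})
if potentials:
    print(potentials[0].findChildren())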

Related

Scroll to next page and extract data

I'm trying to extract all the body texts under Latest updates on 'https://www.bbc.com/news/coronavirus'.
I have successfully extracted the body texts from the first page (1 out of 50).
I would like to move on to the next page and repeat the process.
This is the code that I have written:
from bs4 import BeautifulSoup as soup
import requests

links = []
header = []
body_text = []

r = requests.get('https://www.bbc.com/news/coronavirus')
b = soup(r.content, 'lxml')

# Selecting Latest update selection
latest = b.find(class_="gel-layout__item gel-3/5@l")

# Getting title
for news in latest.findAll('h3'):
    header.append(news.text)
    #print(news.text)

# Getting sub-links
for news in latest.findAll('h3', {'class': 'lx-stream-post__header-title gel-great-primer-bold qa-post-title gs-u-mt0 gs-u-mb-'}):
    links.append('https://www.bbc.com' + news.a['href'])

# Entering sub-links and extracting texts
for link in links:
    page = requests.get(link)
    bsobj = soup(page.content, 'lxml')
    for news in bsobj.findAll('div', {'class': 'ssrcss-18snukc-RichTextContainer e5tfeyi1'}):
        body_text.append(news.text.strip())
        #print(news.text.strip())
How should I scroll to the next page?
Not sure exactly what text you are after, but you can go through the API.
import requests

url = 'https://push.api.bbci.co.uk/batch'
headers = {'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Mobile Safari/537.36'}

for page in range(1, 51):
    payload = '?t=%2Fdata%2Fbbc-morph-lx-commentary-data-paged%2Fabout%2F63b2bbc8-6bea-4a82-9f6b-6ecc470d0c45%2FisUk%2Ffalse%2Flimit%2F20%2FnitroKey%2Flx-nitro%2FpageNumber%2F{page}%2Fversion%2F1.5.4?timeout=5'.format(page=page)
    jsonData = requests.get(url + payload, headers=headers).json()
    results = jsonData['payload'][0]['body']['results']
    for result in results:
        print(result['title'])
        print('\t', result['summary'], '\n')
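If you want to keep the text rather than just print it, closer to the header and body_text lists in the original code, a minimal variation of the same loop can collect the fields into a list and dump them to a JSON file (latest_updates.json is just an example filename):

import json
import requests

url = 'https://push.api.bbci.co.uk/batch'
headers = {'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Mobile Safari/537.36'}

collected = []
for page in range(1, 51):
    payload = '?t=%2Fdata%2Fbbc-morph-lx-commentary-data-paged%2Fabout%2F63b2bbc8-6bea-4a82-9f6b-6ecc470d0c45%2FisUk%2Ffalse%2Flimit%2F20%2FnitroKey%2Flx-nitro%2FpageNumber%2F{page}%2Fversion%2F1.5.4?timeout=5'.format(page=page)
    data = requests.get(url + payload, headers=headers).json()
    for result in data['payload'][0]['body']['results']:
        # keep the same two fields the loop above prints
        collected.append({'title': result['title'], 'summary': result['summary']})

# write everything to disk once all 50 pages are done
with open('latest_updates.json', 'w', encoding='utf-8') as f:
    json.dump(collected, f, ensure_ascii=False, indent=2)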

Is there a way to scrape the underlying data of a particular button?

I'm trying to scrape a webpage. For a few elements I got the data using the class attribute, but the problem is that when my loop visits each URL to extract the information, it should also extract the contact number.
The contact number is not directly available: when you click the "CALL NOW" button, a pop-up card opens to show the contact number.
I tried using the class of that phone-number element, but I'm still not getting the number.
try:
    contact = soup.find('div', class_='c-vn-full__number u-bold').text.strip()
except:
    contact = "N/A"
Is there any way to achieve this?
I'm also left with one more element to extract, the "consulting fees" (price), as text, but it has no class attribute.
Try this:
import requests
from bs4 import BeautifulSoup
url = "https://www.practo.com/Bangalore/doctor/dr-venkata-krishna-rao-diabetologist-1?practice_id=776084&specialization=general%20physician"
soup = BeautifulSoup(requests.get(url).text, "html.parser").select(".u-no-margin--top")[-1]
print(soup.getText())
Output:
₹400
EDIT:
To get contact details, you need to get practice_id, doctor_id, and query_string from the source HTML. There's a huge JSON embedded there but I thought it's less hassle just scooping out the necessary parts rather than parsing this monster.
Once you have all the parts, you can use an endpoint to get the contact details.
Here's how to get this done:
import json
import re
import requests

url = "https://www.practo.com/Bangalore/doctor/" \
      "dr-venkata-krishna-rao-diabetologist-1?" \
      "practice_id=776084&specialization=general%20physician"

page = requests.get(url).text

query_string_pattern = re.compile(r"query_string\":\"(.*?)\"")
practice_doctor_uuid = re.compile(
    r"(practice|doctor)_id\":"
    r"\"([a-f0-9]{8}-[a-f0-9]{4}-4[a-f0-9]{3}-[89aAbB][a-f0-9]{3}-[a-f0-9]{12})"
)

practice_id, doctor_id = [i[1] for i in re.findall(practice_doctor_uuid, page)[:2]]
query_string = re.search(query_string_pattern, page).group(1)

practice_url = "https://www.practo.com/health/api/vn/vnpractice"
query = f"{query_string}&practice_uuid={practice_id}&doctor_uuid={doctor_id}"
endpoint_url = f"{practice_url}{query}"

headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36"
}

contact_info = requests.get(endpoint_url, headers=headers).json()
print(json.dumps(contact_info["vn_phone_number"], indent=2))
Output:
{
  "number": "+918046801985",
  "operator": "VOICE",
  "vn_zone_id": 1,
  "country_code": "IN",
  "extension": true,
  "id": 49090
}
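If all you need is the number itself as a string, for example to fill the contact variable from the question, you can index into the same response; the fallback mirrors the original try/except:

# contact_info is the JSON response fetched in the snippet above
try:
    contact = contact_info["vn_phone_number"]["number"]  # e.g. "+918046801985"
except (KeyError, TypeError):
    contact = "N/A"

print(contact)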

How to crawl the result of another crawl in the same parse function?

Hi, so I am crawling a website with articles, and within each article is a link to a file. I managed to crawl all the article links; now I want to access each one and collect the link within it, instead of having to save the result of the first crawl to a JSON file and then write another script.
The thing is, I am new to Scrapy, so I don't really know how to do that. Thanks in advance!
import scrapy


class SgbdSpider(scrapy.Spider):
    name = "sgbd"
    start_urls = [
        "http://www.sante.gouv.sn/actualites/"
    ]

    def parse(self, response):
        base = "http://www.sante.gouv.sn/actualites/"
        for link in response.css(".card-title a"):
            title = link.css("a::text").get()
            href = link.css("a::attr(href)").extract()
            # here instead of yield, i want to parse the href and then maybe yield the result of THAT parse
            yield {
                "title": title,
                "href": href
            }
            # next step for each href, parse again and get link in that page for pdf file
            # pdf link can easily be collected with response.css(".file a::attr(href)").get()
            # then write that link in a json file

        next_page = response.css("li.pager-next a::attr(href)").get()
        if next_page is not None and next_page.split("?page=")[-1] != "35":
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
You can yield a request to each of those PDF links with a new callback where you put the extraction logic.
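For example, a rough sketch based on the spider in the question: response.follow and cb_kwargs are standard Scrapy features (cb_kwargs just carries the title through to the second callback), and the .file a::attr(href) selector is the one already mentioned in the question's comments:

import scrapy


class SgbdSpider(scrapy.Spider):
    name = "sgbd"
    start_urls = ["http://www.sante.gouv.sn/actualites/"]

    def parse(self, response):
        for link in response.css(".card-title a"):
            title = link.css("a::text").get()
            # follow each article instead of yielding the bare href
            yield response.follow(
                link.css("a::attr(href)").get(),
                callback=self.parse_article,
                cb_kwargs={"title": title},
            )

        next_page = response.css("li.pager-next a::attr(href)").get()
        if next_page is not None and next_page.split("?page=")[-1] != "35":
            yield response.follow(next_page, callback=self.parse)

    def parse_article(self, response, title):
        # selector for the file link, taken from the question's comments
        pdf_link = response.css(".file a::attr(href)").get()
        yield {
            "title": title,
            "article": response.url,
            "pdf": response.urljoin(pdf_link) if pdf_link else None,
        }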
A crawl spider rather than the simple basic spider is better suited to handle this. The basic spider template is generated by default so you have to specify the template to use when generating the spider.
Assuming you've created the project & are in the root folder:
$ scrapy genspider -t crawl sgbd sante.sec.gouv.sn
Opening up the sgbd.py file, you'll notice the difference between it and the basic spider template.
If you're unfamiliar with XPath, here's a run-through
LinkExtractor & Rule will define your spider's behavior as per the documentation
Edit the file:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SgbdSpider(CrawlSpider):
    name = 'sgbd'
    allowed_domains = ['sante.sec.gouv.sn']
    start_urls = ['https://sante.sec.gouv.sn/actualites']
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'

    def set_user_agent(self, request, spider):
        request.headers['User-Agent'] = self.user_agent
        return request

    # First rule gets the links to the articles; callback is the function executed after following the link to each article
    # Second rule handles pagination
    # Couldn't get it to work when passing css selectors in LinkExtractor as restrict_css,
    # used XPaths instead
    rules = (
        Rule(
            LinkExtractor(restrict_xpaths='//*[@id="main-content"]/div[1]/div/div[2]/ul/li[11]/a'),
            callback='parse_item',
            follow=True,
            process_request='set_user_agent',
        ),
        Rule(
            LinkExtractor(restrict_xpaths='//*[@id="main-content"]/div[1]/div/div[1]/div/div/div/div[3]/span/div/h4/a'),
            process_request='set_user_agent',
        )
    )

    # Extract title & link to pdf
    def parse_item(self, response):
        yield {
            'title': response.xpath('//*[@id="main-content"]/section/div[1]/div[1]/article/div[2]/div[2]/div/span/a/font/font/text()').get(),
            'href': response.xpath('//*[@id="main-content"]/section/div[1]/div[1]/article/div[2]/div[2]/div/span/a/@href').get()
        }
Unfortunately this is as far as I could go, as the site was inaccessible even with different proxies; it was taking too long to respond. You might have to tweak those XPaths a little further. Better luck on your side.
Run the spider & save output to json
$ scrapy crawl sgbd -o results.json
Parse links in another function. Then parse again in yet another function. You can yield whatever results you want in any of those functions.
I agree with what @bens_ak47 and @user9958765 said: use a separate function.
For example, change this:
yield scrapy.Request(next_page, callback=self.parse)
to this:
yield scrapy.Request(next_page, callback=self.parse_pdffile)
then add the new method:
def parse_pdffile(self, response):
    print(response.url)

Why is this CSS selector returning no results?

I am following along with a web-scraping example in Automate the Boring Stuff with Python, but my CSS selector is returning no results.
import bs4
import requests
import sys
import webbrowser
print("Googling ...")
res = requests.get('https://www.google.com/search?q=' + ' '.join(sys.argv[1:]))
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, "html.parser")
linkelems = soup.find_all(".r a")
numopen = min(5, len(linkelems))
for i in range(numopen):
    webbrowser.open('https://google.com' + linkelems[i].get('href'))
Has Google since modified how they store search links?
From inspecting the search page elements, I see no reason this selector would not work.
There are two problems:
1.) Instead of soup.find_all(".r a"), use soup.select(".r a"). Only the .select() method accepts CSS selectors; find_all() treats ".r a" as a literal tag name.
2.) The Google page needs you to specify a User-Agent header to return the correct page.
import bs4
import sys
import requests
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
print("Googling ...")
res = requests.get('https://www.google.com/search?q=' + ' '.join(sys.argv[1:]), headers=headers)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, "html.parser")
linkelems = soup.select(".r a")
for a in linkelems:
    print(a.text)
Prints (for example):
Googling ...
Tree - Wikipediaen.wikipedia.org › wiki › Tree
... and so on.
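To finish the original Automate the Boring Stuff example, the same linkelems can feed the webbrowser loop from the question. A small sketch, assuming the hrefs are relative as in the original snippet:

import webbrowser

# `linkelems` comes from soup.select(".r a") in the snippet above
numopen = min(5, len(linkelems))
for i in range(numopen):
    href = linkelems[i].get('href')
    if href:  # skip anchors without an href attribute
        webbrowser.open('https://google.com' + href)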
A complementary answer to Andrej Kesely's answer.
If you don't want to deal with figuring out what selectors to use or how to bypass blocks from Google, then you can try to use Google Search Engine Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that bypassing blocks, data extraction, and more are already handled for the end user. All that needs to be done is to iterate over the structured JSON and grab the data you want.
Example code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",               # search engine
    "q": "fus ro dah",                # query
    "api_key": os.getenv("API_KEY"),  # environment variable with your API-KEY
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    link = result['link']
    print(link)
------------
'''
https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)
https://knowyourmeme.com/memes/fus-ro-dah
https://en.uesp.net/wiki/Skyrim:Unrelenting_Force
https://www.urbandictionary.com/define.php?term=Fus%20ro%20dah
https://www.nexusmods.com/skyrimspecialedition/mods/4889/
https://www.nexusmods.com/skyrimspecialedition/mods/14094/
https://tenor.com/search/fus-ro-dah-gifs
'''
Disclaimer, I work for SerpApi.

Javascript Rendering Issue in Scrapy-Splash

I was exploring Scrapy + Splash and ran into an issue: SplashRequest is not rendering the JavaScript and is giving the exact same response as scrapy.Request.
The webpage I want to scrape is this. I want some fields from the webpage for my course project.
I am unable to get the final HTML after the JS is rendered, even after waiting with 'wait': '30'. In fact, the result is the same as with scrapy.Request. The same code works perfectly for another website that I have tried (i.e. this one), so I believe the settings are fine.
This is the spider definition:
import scrapy
from .. import IndeedItem
from scrapy_splash import SplashRequest
from bs4 import BeautifulSoup


class IndeedSpider(scrapy.Spider):
    name = "indeed"

    def __init__(self):
        self.headers = {"Host": "www.naukri.com",
                        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:70.0) Gecko/20100101 Firefox/70.0"}

    def start_requests(self):
        yield SplashRequest(
            url="https://www.naukri.com/job-listings-Sr-Python-Developer-Rackspace-Gurgaon-4-to-9-years-270819005015",
            endpoint='render.html', headers=self.headers,
            args={
                'wait': 3,
            }
        )

    def parse(self, response):
        soup = BeautifulSoup(response.body)
        it = IndeedItem()
        it['job_title'] = soup
        yield it
The settings.py file (only the relevant part) is:
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPLASH_URL = 'http://localhost:8050/'
And the output file is here
I do not know what to make of the output; it has embedded JavaScript in it. Opening it in a browser shows that very little has been rendered (the title only). How would I get the rendered HTML for this website? Any help is much appreciated.
