Select class name in HTML Parser containing extra words - web-scraping

I am trying to scrape a web page. I want to get reviews. But the reviews are of three categories, some are positive, some are neutral and some are negative. I am using html parser and have accessed many tags. But for the class which can be in three categories, how can I get them:
<div class="review positive" title="" style="background-color: #00B551;">9.3</div>
<div class="review negative" title="" style="background-color: #FF0000;">4.8</div>
<div class="review neutral" title="" style="background-color: #FFFF00;">6</div>
I have a python container for each div containing each item:
# finds each product from the store page
containers = page_soup.findAll("div", {"class": "item-container"})`
for container in containers:
title = container.findAll(a).text #This gives me titles
##Similarly I need the reviews of each of them here
review = container.findAll("div", {"class": "review "}))#along with review there is positive, neutral and negative word also according to the type of review

using regex, you can get the classes that contain the substring "review".
import re
for container in containers:
title = container.findAll(a).text #This gives me titles
review = container.findAll("div", {"class": re.compile(r'review')})
See the difference:
html = '''<div class="review positive" title="" style="background-color: #00B551;">9.3</div>
<div class="review negative" title="" style="background-color: #FF0000;">4.8</div>
<div class="review neutral" title="" style="background-color: #FFFF00;">6</div>'''
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(html, 'html.parser')
review = soup.find_all('div', {'class':'review '})
print ('No regex: ',review)
print('\n')
review = soup.findAll("div", {"class": re.compile(r'review')})
print ('Regex: ',review)
Output:
No regex: []
Regex: [<div class="review positive" style="background-color: #00B551;" title="">9.3</div>, <div class="review negative" style="background-color: #FF0000;" title="">4.8</div>, <div class="review neutral" style="background-color: #FFFF00;" title="">6</div>]

Related

Verifying Bs4 Parsing Output from a Website

I was trying to scrape this site when I was running into errors due to tags that I thought existed, but did not exist in the scraped html from Bs4.
Site: https://en.thejypshop.com/category/cdlp/59/
I manually verified that the parsed output from Bs4 was giving me a completely different view of the html than when I inspected the site itself; here is a comparison of the two (copied relevant html in the two pastebin links). I also tried scraping with different parsing options such as 'lxml', 'html.parser', etc. but to no avail.
(Bs4 Output): https://pastebin.com/tg4P5DFh
<div class="thumbnail">
<div class="prdImg">
<a href="/product/stray-kids-mini-album-maxident-case-ver/842/category/59/display/2/" name="anchorBoxName_842">
<img alt="" id="eListPrdImage842_2" src="https://cafe24img.poxo.com/jyp3602022/web/product/medium/202210/ca01b08c39232296f482b657be53aa4b.jpg" />
</a>
<span class="wish">
<img alt="Before add to wish list" categoryno="59" class="icon_img ec-product-listwishicon" icon_status="off" individual-set="F" login_status="F" productno="842" src="/web/upload/icon_202204271744355800.png" />
</span>
</div>
<div class="icon">
<div class="promotion"></div>
<div class="button">
<div class="option"></div>
<img alt="Add to cart" class="ec-admin-icon cart" onclick="category_add_basket('842','59', '2', 'A0000', false, '1', 'P0000BGK', 'B', 'T', '20');" src="/web/upload/icon_202204271744303700.png" />
<img alt="View larger image" onclick="zoom('842', '59', '2','', '');" src="//img.echosting.cafe24.com/design/skin/admin/en_US/btn_prd_zoom.gif" style="cursor:pointer" />
</div>
</div>
</div>
(html from Site): https://pastebin.com/2xfi4XTA
<div class="thumbnail">
<div class="prdImg">
<a href="/product/stray-kids-mini-album-maxident-case-ver/842/category/59/display/1/">
<img src="https://cafe24img.poxo.com/jyp3602022/web/product/medium/202210/ca01b08c39232296f482b657be53aa4b.jpg" id="eListPrdImage842_1" alt="">
</a>
</div>
<span class="pro_icon">
<img src="/web/upload/icon_202204271744355800.png" class="icon_img ec-product-listwishicon" alt="Before add to wish list" productno="842" categoryno="59" icon_status="off" login_status="F" individual-set="F">
<img src="/web/upload/icon_202204271744303700.png" onclick="category_add_basket('842','59', '1', 'A0000', false, '1', 'P0000BGK', 'B', 'T', '20');" alt="Add to cart" class="ec-admin-icon cart">
</span>
<span class="soldout_icon"></span>
</div>
Note that the <span class="soldout_icon"></span> tag does not appear in what Bs4 sees, among other things.
My guess as to why this is the case;
I am not using a headless browser, so some websites such as this one might not display the same thing.
There is some JS running in the background that Bs4 does not pick up on
Please let me know if any of my guesses are incorrect and what is actually going on!
Yes, you are right as
the second page is beeing built dynamicaly so you can't get the real html with bs4. Try to use combination of selenium and bs4 to get what you need. Here is a small script that finds some hidden divs and print them out. You should get deeper insight and simulate web surfing to catch the html when the page is fully developed. This one below is still in the process of construction.
import time
from bs4 import BeautifulSoup
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')
driver = webdriver.Chrome(options = options)
urls = ['https://en.thejypshop.com/category/cdlp/59/', 'https://pastebin.com/2xfi4XTA']
for url in urls:
data = driver.get(url)
time.sleep(1)
pg_html = driver.page_source
pg_html = pg_html.replace('<', '<').replace('>', '>')
soup = BeautifulSoup(pg_html, 'html.parser')
dv = soup.find_all('div', attrs={'class': 'thumbnail'})
dv1 = soup.find_all('span', attrs={'class': 'soldout_icon'})
try:
print(60 * '-')
print(dv[0])
except:
pass
print(60 * '-')
try:
print(dv1[0])
print(60 * '-')
except:
pass
''' R e s u l t :
------------------------------------------------------------
<div class="thumbnail">
<div class="prdImg">
<img alt="" id="eListPrdImage842_2" src="https://cafe24img.poxo.com/jyp3602022/web/product/medium/202210/ca01b08c39232296f482b657be53aa4b.jpg"/>
<span class="wish"><img alt="Before add to wish list" categoryno="59" class="icon_img ec-product-listwishicon" icon_status="off" individual-set="F" login_status="F" productno="842" src="/web/upload/icon_202204271744355800.png"/></span>
</div>
<div class="icon">
<div class="promotion"> </div>
<div class="button">
<div class="option"></div> <img alt="Add to cart" class="ec-admin-icon cart" onclick="category_add_basket('842','59', '2', 'A0000', false, '1', 'P0000BGK', 'B', 'T', '20');" src="/web/upload/icon_202204271744303700.png"/> <img alt="View larger image" onclick="zoom('842', '59', '2','', '');" src="//img.echosting.cafe24.com/design/skin/admin/en_US/btn_prd_zoom.gif" style="cursor:pointer"/> </div>
</div>
</div>
------------------------------------------------------------
<span class="soldout_icon"></span>
------------------------------------------------------------
------------------------------------------------------------
<div class="thumbnail">
</div>
------------------------------------------------------------
<span class="soldout_icon"></span>
------------------------------------------------------------
'''
Regards...

Python-BS4 text inside <span> that has no class

I have those 2 span with text inside them.They have no class or id and i want to scrape that text with bs4 but i don't know how.Using the small tag don't help me becouse the html is full of those.
Can someone help me with an exemple?
enter image description here
<td valign="bottom" class="bottom-cell">
<div class="space rel">
<p class="lheight16">
<small class="breadcrumb x-normal">
<span><i data-icon="location-filled"></i>Iasi</span>
</small>
<small class="breadcrumb x-normal">
<span><i data-icon="clock"></i>Ieri 16:13</span>
</small>
</p>
try this, The :nth-of-type(1) selector matches every span element that is the 1th child, of a particular type, of its parent.
for i in data.select('.lheight16 small span:nth-of-type(1)'):
print(i.text)
There are multiple options to do this, but most will orientate on the parents of the spans - Cause there is no expected output (recommend you should improve that) in your question, check these two.
Option a:
for span in soup.select('td.bottom-cell span'):
print(span.get_text())
Option:b
print(soup.select_one('td.bottom-cell').get_text(' - ',strip=True))
Example
from bs4 import BeautifulSoup
html='''
<td valign="bottom" class="bottom-cell">
<div class="space rel">
<p class="lheight16">
<small class="breadcrumb x-normal">
<span><i data-icon="location-filled"></i>Iasi</span>
</small>
<small class="breadcrumb x-normal">
<span><i data-icon="clock"></i>Ieri 16:13</span>
</small>
</p>
</div>
</td>
'''
soup = BeautifulSoup(html, 'lxml')
#option a:
for span in soup.select('td.bottom-cell span'):
print(span.get_text())
#option:b
print(soup.select_one('td.bottom-cell').get_text(' - ',strip=True))
Output
a:
Iasi
Ieri 16:13
b:
Iasi - Ieri 16:13

Scraping - find the name of all sub-class

I'm trying to find a way to get number of sub-class and their name contained in a root class.
For example, I would like to have in return for the class 'o-container__left u-mt-lg':
class= "c-site__container"
class= "c-site__container"
class= "c-site__container c-site__container__last"
I'm working with BeautifulSoup. I found this, but it didn't really do what I'm expected:
soup.div["class"]
Thank you for your help!
from bs4 import BeautifulSoup
data = """
<main class="o-page-content" role="main">
<section class="o-container">
<div class="o-container__left u-mt-lg">
<div class="c-site__container "></div>
<div class="c-site__container "></div>
<div class="c-site__container c-site__container__last"></div>
</div>
</div>
"""
soup = BeautifulSoup(data, 'html.parser')
for item in soup.findChild('div', attrs={'class': 'o-container__left u-mt-lg'}):
print(item)
PLEASE NEXT TIME POST THE HTML AS TEXT, NOT AN IMG

Is it possible to to get an empty string in a list when there is no element, using CSS selector?

I want to scrape some items, which are on the same page, using Scrapy.
HTML looks like this:
<div class="container" id="1">
<span class="title">
product-title1
</span>
<div class="description">
product-desc
</div>
<div class="price">
1.0
</div>
</div>
I need to extract name, description and price.
Unfortunately, sometimes product doesn't have the description and HTML look like this:
<div class="container" id="2">
<span class="title">
product-title2
</span>
<div class="price">
2.0
</div>
</div>
Currently I am using CSS selectors which returns list of all elements existing on the website:
title = response.css('span[class="title"]').extract()
['product-title1', 'product-title2', 'product-title3']
description = response.css('div[class="description"]').extract()
['desc1','desc3']
price = response.css('div[class="price"]').extract()
['1.0','2.0','3.0']
Is it possible to get for example an empty string in place of missing 'desc2' when description object isn't there, using CSS selector?
I recommend you to rewrite you code:
for section in response.xpath('//div[#class="container"]'):
title = section.xpath('./span[#class="title"]/text()').get(default='not-found') # you can use any default value here or just empty string
desctiption = section.xpath('./div[#class="description"]').get()
price = section.xpath('./div[#class="price"]/text()').get()
Check this out..
for section in response.xpath('//div[#class="container"]'):
title = section.xpath('./span[#class="title"]/text()').get()
desctiption_tag = section.xpath("//div[contains(#class,'description')]")
if desctiption_tag:
desctiption = section.xpath('./div[#class="description"]').get()
else:
desctiption = "String"
price = section.xpath('./div[#class="price"]/text()').get()

How to create and reference custom heading ids with reStructuredText?

Currently, if I have:
My header
=========
`My header`_
rst2html Docutils 0.14 produces:
<div class="document" id="my-header">
<h1 class="title">My header</h1>
<p><a class="reference internal" href="#my-header">My header</a></p>
Is it possible to obtain the following ouptut instead:
<h1 class="title" id="my-custom-header">My header</h1>
<p><a class="reference internal" href="#my-custom-header">My header</a></p>
So note how I want two changes:
the id to be inside the heading, not on a separate div
control over the actual id
The closest I could get was:
<div class="document" id="my-header">
<span id="my-custom-header"></span>
<h1 class="title">My header</h1>
<p><a class="reference external" href="my-custom-header">My header</a></p>
but this is still not ideal, as I now have multiple ids floating around, and not inside the h1.
Asciidoc for example has that covered with:
[[my-custom-header]]
== My header
<<my-custom-header>>

Resources