beautifulsoup4 - grab Sibling element if Sibling present - web-scraping

The most common repetitive structure of the HTML is:
<p class="Standard">
<span class="T3">
it is possible for you
</span>
</p>
In such cases I grab the text: it is possible for you
Occasionally (i.e., not always), the <p> of class="Standard" has a sibling <p> of class="P3", like so:
<p class="P3">
(to ask a question in Spanish, you just use inflection)
</p>
When this <p> of class="P3" is present, I want to additionally grab the text inside it, e.g. here I would additionally grab: (to ask a question in Spanish, you just use inflection)
My question is, given this kind of structure:
<div>
...
<p class="Standard">
<span class="T3">
it is possible for you
</span>
</p>
<p class="Standard">
<span class="T3">
it is acceptable for me
</span>
</p>
<p class="P3">
(to ask a question in Spanish, you just use inflection)
</p>
...
</div>
How can I produce output like this:
it is possible for you
it is acceptable for me
(to ask a question in Spanish, you just use inflection)
Currently, I've managed to do this:
p_standards = soup.find_all("p", class_="Standard")
for p_standard in p_standards:
    p_english = p_standard.find("span", class_="T3")
    print(p_english.contents[0])
And the output I get is:
it is possible for you
it is acceptable for me

Use this Python code:
from bs4 import BeautifulSoup
import re
text = '''
<div>
<p class="Standard">
<span class="T3">
it is possible for you
</span>
</p>
<p class="Standard">
<span class="T3">
it is acceptable for me
</span>
</p>
<p class="P3">
(to ask a question in Spanish, you just use inflection)
</p>
</div>
'''
soup = BeautifulSoup(text, features='html.parser')
p_standards = soup.find_all("p", class_="Standard")
for p_standard in p_standards:
    p_english = p_standard.find('span', attrs={'class': 'T3'})
    print(p_english.text.strip())
    # find_next_sibling() returns None when no sibling tag follows
    next_sibling = p_standard.find_next_sibling()
    if (next_sibling is not None and next_sibling.name == 'p'
            and 'P3' in next_sibling.get('class', [])):
        print(next_sibling.text.strip())
Explanation:
find_next_sibling() returns a Tag, and a Tag keeps its attributes in the attrs dictionary
(I found this by printing next_sibling.__dict__.keys()).
The class value has to be checked as a list because class is a multi-valued attribute,
so BeautifulSoup stores it as a list of class names rather than a single string.
Using get('class', []) avoids a KeyError when the sibling has no class attribute at all.
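To illustrate how BeautifulSoup exposes the class attribute and what find_next_sibling() returns at the end of a parent, here is a minimal sketch. The two-paragraph HTML string is made up for the illustration and is not taken from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical two-paragraph document, used only to inspect the attributes.
soup = BeautifulSoup('<p class="P3 note">x</p><p>y</p>', "html.parser")
first, second = soup.find_all("p")

# class is multi-valued, so it comes back as a list of names.
print(first["class"])

# A tag without a class attribute: get() with a default avoids a KeyError.
print(second.get("class", []))

# The last element has no following sibling tag, so None is returned.
print(second.find_next_sibling())
```

Checking for None before touching the sibling's attributes keeps the loop from failing on the last paragraph of a block.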

I think it is more efficient to use the CSS "or" syntax (a comma-separated selector list) together with the adjacent sibling combinator (+) for this:
from bs4 import BeautifulSoup as bs
html = '''
<div>
...
<p class="Standard">
<span class="T3">
it is possible for you
</span>
</p>
<p class="Standard">
<span class="T3">
it is acceptable for me
</span>
</p>
<p class="P3">
(to ask a question in Spanish, you just use inflection)
</p>
...
</div>
'''
soup = bs(html, 'lxml')
items = [i.text.strip() for i in soup.select('.Standard, .Standard + .P3')]
print(items)
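One detail worth keeping in mind is that the adjacent sibling combinator only matches a P3 paragraph that directly follows a Standard paragraph. The sketch below assumes the built-in html.parser instead of lxml and adds two made-up paragraphs (class "Other" and a second P3) to show which elements are picked up; it prints one text per line, as in the requested output.

```python
from bs4 import BeautifulSoup

html = """
<div>
<p class="Standard"><span class="T3">it is possible for you</span></p>
<p class="Standard"><span class="T3">it is acceptable for me</span></p>
<p class="P3">(to ask a question in Spanish, you just use inflection)</p>
<p class="Other">unrelated</p>
<p class="P3">(not directly after a Standard paragraph)</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# The comma means "either selector"; + means "immediately following sibling".
lines = [p.get_text(strip=True)
         for p in soup.select("p.Standard, p.Standard + p.P3")]
print("\n".join(lines))
```

The last P3 paragraph is skipped because the element right before it is not a Standard paragraph.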

Related

Verifying Bs4 Parsing Output from a Website

I was trying to scrape this site when I was running into errors due to tags that I thought existed, but did not exist in the scraped html from Bs4.
Site: https://en.thejypshop.com/category/cdlp/59/
I manually verified that the parsed output from Bs4 was giving me a completely different view of the html than when I inspected the site itself; here is a comparison of the two (copied relevant html in the two pastebin links). I also tried scraping with different parsing options such as 'lxml', 'html.parser', etc. but to no avail.
(Bs4 Output): https://pastebin.com/tg4P5DFh
<div class="thumbnail">
<div class="prdImg">
<a href="/product/stray-kids-mini-album-maxident-case-ver/842/category/59/display/2/" name="anchorBoxName_842">
<img alt="" id="eListPrdImage842_2" src="https://cafe24img.poxo.com/jyp3602022/web/product/medium/202210/ca01b08c39232296f482b657be53aa4b.jpg" />
</a>
<span class="wish">
<img alt="Before add to wish list" categoryno="59" class="icon_img ec-product-listwishicon" icon_status="off" individual-set="F" login_status="F" productno="842" src="/web/upload/icon_202204271744355800.png" />
</span>
</div>
<div class="icon">
<div class="promotion"></div>
<div class="button">
<div class="option"></div>
<img alt="Add to cart" class="ec-admin-icon cart" onclick="category_add_basket('842','59', '2', 'A0000', false, '1', 'P0000BGK', 'B', 'T', '20');" src="/web/upload/icon_202204271744303700.png" />
<img alt="View larger image" onclick="zoom('842', '59', '2','', '');" src="//img.echosting.cafe24.com/design/skin/admin/en_US/btn_prd_zoom.gif" style="cursor:pointer" />
</div>
</div>
</div>
(html from Site): https://pastebin.com/2xfi4XTA
<div class="thumbnail">
<div class="prdImg">
<a href="/product/stray-kids-mini-album-maxident-case-ver/842/category/59/display/1/">
<img src="https://cafe24img.poxo.com/jyp3602022/web/product/medium/202210/ca01b08c39232296f482b657be53aa4b.jpg" id="eListPrdImage842_1" alt="">
</a>
</div>
<span class="pro_icon">
<img src="/web/upload/icon_202204271744355800.png" class="icon_img ec-product-listwishicon" alt="Before add to wish list" productno="842" categoryno="59" icon_status="off" login_status="F" individual-set="F">
<img src="/web/upload/icon_202204271744303700.png" onclick="category_add_basket('842','59', '1', 'A0000', false, '1', 'P0000BGK', 'B', 'T', '20');" alt="Add to cart" class="ec-admin-icon cart">
</span>
<span class="soldout_icon"></span>
</div>
Note that the <span class="soldout_icon"></span> tag does not appear in what Bs4 sees, among other things.
My guesses as to why this is the case:
I am not using a headless browser, so some websites such as this one might not display the same thing.
There is some JS running in the background that Bs4 does not pick up on.
Please let me know if any of my guesses are incorrect and what is actually going on!
Yes, you are right: the second page is being built dynamically, so you can't get the real HTML with bs4 alone. Try using a combination of Selenium and bs4 to get what you need. Here is a small script that finds some of the hidden elements and prints them out. To go further, you would need to simulate browsing and capture the HTML once the page has fully rendered. The script below is still a work in progress.
import time
from bs4 import BeautifulSoup
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')
driver = webdriver.Chrome(options = options)
urls = ['https://en.thejypshop.com/category/cdlp/59/', 'https://pastebin.com/2xfi4XTA']
for url in urls:
    driver.get(url)
    time.sleep(1)  # crude wait for the page scripts to run
    pg_html = driver.page_source
    # Unescape angle brackets so HTML shown as escaped text (e.g. on the pastebin page) can be parsed
    pg_html = pg_html.replace('&lt;', '<').replace('&gt;', '>')
    soup = BeautifulSoup(pg_html, 'html.parser')
    dv = soup.find_all('div', attrs={'class': 'thumbnail'})
    dv1 = soup.find_all('span', attrs={'class': 'soldout_icon'})
    try:
        print(60 * '-')
        print(dv[0])
    except IndexError:
        pass
    print(60 * '-')
    try:
        print(dv1[0])
        print(60 * '-')
    except IndexError:
        pass
driver.quit()
''' Result:
------------------------------------------------------------
<div class="thumbnail">
<div class="prdImg">
<img alt="" id="eListPrdImage842_2" src="https://cafe24img.poxo.com/jyp3602022/web/product/medium/202210/ca01b08c39232296f482b657be53aa4b.jpg"/>
<span class="wish"><img alt="Before add to wish list" categoryno="59" class="icon_img ec-product-listwishicon" icon_status="off" individual-set="F" login_status="F" productno="842" src="/web/upload/icon_202204271744355800.png"/></span>
</div>
<div class="icon">
<div class="promotion"> </div>
<div class="button">
<div class="option"></div> <img alt="Add to cart" class="ec-admin-icon cart" onclick="category_add_basket('842','59', '2', 'A0000', false, '1', 'P0000BGK', 'B', 'T', '20');" src="/web/upload/icon_202204271744303700.png"/> <img alt="View larger image" onclick="zoom('842', '59', '2','', '');" src="//img.echosting.cafe24.com/design/skin/admin/en_US/btn_prd_zoom.gif" style="cursor:pointer"/> </div>
</div>
</div>
------------------------------------------------------------
<span class="soldout_icon"></span>
------------------------------------------------------------
------------------------------------------------------------
<div class="thumbnail">
</div>
------------------------------------------------------------
<span class="soldout_icon"></span>
------------------------------------------------------------
'''
Regards...
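Before reaching for a browser, it can help to confirm which elements are really absent from the HTML that a plain request returns. The following is a minimal sketch of such a check; the helper name missing_selectors and the shortened static_html string are assumptions made for the illustration, not output from the site.

```python
from bs4 import BeautifulSoup


def missing_selectors(static_html, selectors):
    """Return the CSS selectors that match nothing in the given HTML."""
    soup = BeautifulSoup(static_html, "html.parser")
    return [sel for sel in selectors if soup.select_one(sel) is None]


# Stand-in for the HTML a plain HTTP request returns (no JavaScript executed).
static_html = """
<div class="thumbnail">
  <div class="prdImg"><span class="wish"></span></div>
  <div class="icon"></div>
</div>
"""

absent = missing_selectors(
    static_html, ["div.thumbnail", "span.pro_icon", "span.soldout_icon"]
)
print(absent)
```

Selectors reported as missing here but visible in the browser inspector are good candidates for content that is added by JavaScript after the page loads.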

Python-BS4 text inside <span> that has no class

I have these two spans with text inside them. They have no class or id, and I want to scrape that text with bs4, but I don't know how. Selecting by the small tag doesn't help me because the HTML is full of those.
Can someone help me with an example?
<td valign="bottom" class="bottom-cell">
<div class="space rel">
<p class="lheight16">
<small class="breadcrumb x-normal">
<span><i data-icon="location-filled"></i>Iasi</span>
</small>
<small class="breadcrumb x-normal">
<span><i data-icon="clock"></i>Ieri 16:13</span>
</small>
</p>
Try this. The :nth-of-type(1) selector matches every span element that is the first child of its type within its parent.
for i in data.select('.lheight16 small span:nth-of-type(1)'):
    print(i.text)
There are multiple ways to do this, but most of them rely on the parents of the spans. Since your question shows no expected output (you should add one), check these two options.
Option a:
for span in soup.select('td.bottom-cell span'):
    print(span.get_text())
Option b:
print(soup.select_one('td.bottom-cell').get_text(' - ', strip=True))
Example
from bs4 import BeautifulSoup
html='''
<td valign="bottom" class="bottom-cell">
<div class="space rel">
<p class="lheight16">
<small class="breadcrumb x-normal">
<span><i data-icon="location-filled"></i>Iasi</span>
</small>
<small class="breadcrumb x-normal">
<span><i data-icon="clock"></i>Ieri 16:13</span>
</small>
</p>
</div>
</td>
'''
soup = BeautifulSoup(html, 'lxml')
# option a:
for span in soup.select('td.bottom-cell span'):
    print(span.get_text())
# option b:
print(soup.select_one('td.bottom-cell').get_text(' - ', strip=True))
Output
a:
Iasi
Ieri 16:13
b:
Iasi - Ieri 16:13
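Although the spans themselves have no class or id, each one contains an i element with a data-icon attribute, which can serve as a label. The sketch below is one possible way to use it, assuming the markup always follows the pattern shown in the question; the dictionary name fields is made up for the example.

```python
from bs4 import BeautifulSoup

html = """
<p class="lheight16">
<small class="breadcrumb x-normal"><span><i data-icon="location-filled"></i>Iasi</span></small>
<small class="breadcrumb x-normal"><span><i data-icon="clock"></i>Ieri 16:13</span></small>
</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Map each icon name to the text of the span that contains it.
fields = {}
for icon in soup.select("span > i[data-icon]"):
    fields[icon["data-icon"]] = icon.parent.get_text(strip=True)

print(fields)
```

This keeps the location and the timestamp apart even when other small tags on the page use the same classes.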

Is it possible to get an empty string in a list when there is no element, using CSS selector?

I want to scrape some items, which are on the same page, using Scrapy.
HTML looks like this:
<div class="container" id="1">
<span class="title">
product-title1
</span>
<div class="description">
product-desc
</div>
<div class="price">
1.0
</div>
</div>
I need to extract name, description and price.
Unfortunately, sometimes a product doesn't have a description and the HTML looks like this:
<div class="container" id="2">
<span class="title">
product-title2
</span>
<div class="price">
2.0
</div>
</div>
Currently I am using CSS selectors, which return a list of all matching elements on the page:
title = response.css('span[class="title"]').extract()
['product-title1', 'product-title2', 'product-title3']
description = response.css('div[class="description"]').extract()
['desc1','desc3']
price = response.css('div[class="price"]').extract()
['1.0','2.0','3.0']
Is it possible to get, for example, an empty string in place of the missing 'desc2' when the description element isn't there, using CSS selectors?
I recommend rewriting your code so that it iterates over each container:
for section in response.xpath('//div[@class="container"]'):
    title = section.xpath('./span[@class="title"]/text()').get(default='not-found')  # any default value works here, including an empty string
    description = section.xpath('./div[@class="description"]/text()').get(default='')
    price = section.xpath('./div[@class="price"]/text()').get()
Check this out:
for section in response.xpath('//div[@class="container"]'):
    title = section.xpath('./span[@class="title"]/text()').get()
    # Use a relative path (./) so that only the current container is searched
    description_tag = section.xpath("./div[contains(@class,'description')]")
    if description_tag:
        description = section.xpath('./div[@class="description"]/text()').get()
    else:
        description = ""  # or any placeholder string
    price = section.xpath('./div[@class="price"]/text()').get()

Scrapy: checking if the tag has another tag inside it and scrape both elements

I am trying to scrape an html page that uses this structure:
<div class="article-body">
<div id="firstBodyDiv">
<p class="ng-scope">
This is a dummy text for explanation purposes
</p>
<p class="ng-scope">
This is a <a>dummy</a> text for explanation purposes
</p>
</div>
</div>
As you can see, some of the p elements contain a elements and some don't.
What I did so far is the following:
economics["article_content"] = response.css("div.article-body div#firstBodyDiv > p:nth-child(n+1)::text").extract()
But when there is an a element inside the p element, it returns only the text before and after the a element,
while this query returns the a elements:
response.css("div.article-body div#firstBodyDiv p:nth-child(n+1) a::text").extract()
I want to find a way to check whether there is an a element, so that I can run the other query (the one that scrapes the text inside the a element).
This is what I tried so far:
for i in response.css("div.article-body div#firstBodyDiv p:nth-child(n+1)"):
    if response.css("div.article-body div#firstBodyDiv p:nth-child(n+1) a") in i:
        # Of course this isn't working; I am getting this error:
        # 'in <string>' requires string as left operand, not SelectorList
        # I will probably need a separate list, list1, and list1.append() the p text
        # before the a, the a text, and the p text after the a element,
        # then assign that list to economics["article_content"]
Although I am using CSS selectors, you are welcome to use XPath selectors.
You can use the descendant-or-self axis from XPath, which returns all of the inner text nodes, including those inside nested elements:
for i in response.css('div.article-body div#firstBodyDiv > p:nth-child(n+1)'):
    print(''.join(i.xpath('descendant-or-self::text()').extract()))
You can also use scrapy shell in order to test your code with raw HTML like so:
$ scrapy shell
from scrapy.http import HtmlResponse
response = HtmlResponse(url='test', body='''<div class="article-body">
<div id="firstBodyDiv">
<p class="ng-scope">
This is a dummy text for explanation purposes
</p>
<p class="ng-scope">
This is a <a>dummy</a> text for explanation purposes
</p>
</div>
</div>
''', encoding='utf-8')
for i in response.css('div.article-body div#firstBodyDiv > p:nth-child(n+1)'):
    print(''.join(i.xpath('descendant-or-self::text()').extract()))

Rich Snippets Nesting Issue

I've never used Rich Snippets before, so this is a little bit of a learning curve for me. I believe my issue is a nesting problem but I can't find any documentation anywhere that explicitly states how to nest these properties correctly.
I want to mark up a single product with multiple reviews as Rich Snippets, with classic ASP pulling in the different data fields. Here is my code:
<div>
<div itemscope itemtype="http://data-vocabulary.org/Review">
<span itemprop="itemreviewed">Forma Stanzol</span><br />
By <span itemprop="reviewer"><%=formaStanzolReviewArray(0,i)%></span><br />
<time itemprop="dtreviewed" datetime="<%=FormatDateTime(formaStanzolReviewArray(1,i),2)%>"><%=FormatDateTime(formaStanzolReviewArray(1,i),2)%></time> <br />
<span itemprop="description"><%=formaStanzolComment%></span>
</div>
</div>
This returns the Error: No rich snippet will be generated for this data, because it appears to include multiple reviews of an item, but no aggregate review information.
So, I added a dummy Aggregate code with static values, here's what it looks like all together:
<div>
<div itemscope itemtype="http://data-vocabulary.org/Review">
<span itemprop="itemreviewed">Forma Stanzol</span><br />
By <span itemprop="reviewer"><%=formaStanzolReviewArray(0,i)%></span><br />
<time itemprop="dtreviewed" datetime="<%=FormatDateTime(formaStanzolReviewArray(1,i),2)%>"><%=FormatDateTime(formaStanzolReviewArray(1,i),2)%></time> <br />
<span itemprop="description"><%=formaStanzolComment%></span>
</div>
<div itemscope itemtype="http://data-vocabulary.org/Review-aggregate">
<span itemprop="itemreviewed">Forma Stanzol</span>
<span itemprop="rating" itemscope itemtype="http://data-vocabulary.org/Rating">
<span itemprop="average">9</span>
out of <span itemprop="best">10</span>
</span>
based on<span itemprop="count">5</span> user reviews.
</div>
</div>
This causes my "Reviews" to not error but then all of my "Aggregate Reviews" push out this Error: No rich snippet will be generated for this data, because it appears to include multiple aggregate reviews of many items, instead of a single aggregate review of one item.
Seems like it's working against itself no matter what I do, so that's why I believe this to be a nesting issue.
How can I fix this?
EDIT: Ideally, I don't even want the Aggregate view of this item. The reviewer, item name, review date, and review description are all I need.
EDIT EDIT: This code is also running in a For loop, where it's getting information from the database with each pass.
Ok, so the issue here was that a website with a single product but multiple reviews needs only one "Review-aggregate" and one "Rating" itemtype. However, multiple "Review" itemtypes must be used.
So, my For loop creates a "Review" for each row in the database, using the related data fields, and then, after the loop ends, the "Review-aggregate" and "Rating" markup is placed.
Code:
For i = 0 to uBound(formaStanzolReviewArray,2)
reviewCount = reviewCount + 1
formaStanzolComment = trim(formaStanzolReviewArray(2,i))
'Do not show reviews with empty comments
If Not (IsNull(formaStanzolComment) Or formaStanzolComment = "") Then
%>
<div>
<div itemscope itemtype="http://data-vocabulary.org/Review">
<span style="position: absolute; left: 9999px;" itemprop="itemreviewed">Forma Stanzol</span>
Rating: <span itemprop="rating"><%=formaStanzolReviewArray(3,i)%></span> -
By <span itemprop="reviewer"><%=formaStanzolReviewArray(0,i)%></span> -
<time itemprop="dtreviewed" datetime="<%=FormatDateTime(formaStanzolReviewArray(1,i),2)%>"><%=FormatDateTime(formaStanzolReviewArray(1,i),2)%></time> <br />
<span itemprop="description"><%=formaStanzolComment%></span>
</div>
</div>
<%
sumRating = sumRating + formaStanzolReviewArray(3,i)
End If
Next
ratingAvg = sumRating / reviewCount
%>
<div style="position: absolute; left: 9999px;">
<div itemscope itemtype="http://data-vocabulary.org/Review-aggregate">
<span itemprop="rating" itemscope itemtype="http://data-vocabulary.org/Rating">
<span itemprop="worst">1</span>
<span itemprop="average"><%=ratingAvg%></span>
out of <span itemprop="best">5</span>
</span>
based on <span itemprop="votes"><%=reviewCount%></span> ratings.
<span itemprop="count"><%=reviewCount%></span> user reviews.
</div>
</div>
<%
Think of it as multiple User reviews in the For Loop, but we collect all of those reviews once in the aggregate, and then give that aggregate a rating scale.
Hope this helps anyone having nesting issues.
Please Note: I am using classic ASP for this particular code.
