Python-BS4 text inside <span> that has no class - web-scraping

I have those 2 span with text inside them.They have no class or id and i want to scrape that text with bs4 but i don't know how.Using the small tag don't help me becouse the html is full of those.
Can someone help me with an exemple?
enter image description here
<td valign="bottom" class="bottom-cell">
<div class="space rel">
<p class="lheight16">
<small class="breadcrumb x-normal">
<span><i data-icon="location-filled"></i>Iasi</span>
</small>
<small class="breadcrumb x-normal">
<span><i data-icon="clock"></i>Ieri 16:13</span>
</small>
</p>

try this, The :nth-of-type(1) selector matches every span element that is the 1th child, of a particular type, of its parent.
for i in data.select('.lheight16 small span:nth-of-type(1)'):
print(i.text)

There are multiple options to do this, but most will orientate on the parents of the spans - Cause there is no expected output (recommend you should improve that) in your question, check these two.
Option a:
for span in soup.select('td.bottom-cell span'):
print(span.get_text())
Option:b
print(soup.select_one('td.bottom-cell').get_text(' - ',strip=True))
Example
from bs4 import BeautifulSoup
html='''
<td valign="bottom" class="bottom-cell">
<div class="space rel">
<p class="lheight16">
<small class="breadcrumb x-normal">
<span><i data-icon="location-filled"></i>Iasi</span>
</small>
<small class="breadcrumb x-normal">
<span><i data-icon="clock"></i>Ieri 16:13</span>
</small>
</p>
</div>
</td>
'''
soup = BeautifulSoup(html, 'lxml')
#option a:
for span in soup.select('td.bottom-cell span'):
print(span.get_text())
#option:b
print(soup.select_one('td.bottom-cell').get_text(' - ',strip=True))
Output
a:
Iasi
Ieri 16:13
b:
Iasi - Ieri 16:13

Related

Verifying Bs4 Parsing Output from a Website

I was trying to scrape this site when I was running into errors due to tags that I thought existed, but did not exist in the scraped html from Bs4.
Site: https://en.thejypshop.com/category/cdlp/59/
I manually verified that the parsed output from Bs4 was giving me a completely different view of the html than when I inspected the site itself; here is a comparison of the two (copied relevant html in the two pastebin links). I also tried scraping with different parsing options such as 'lxml', 'html.parser', etc. but to no avail.
(Bs4 Output): https://pastebin.com/tg4P5DFh
<div class="thumbnail">
<div class="prdImg">
<a href="/product/stray-kids-mini-album-maxident-case-ver/842/category/59/display/2/" name="anchorBoxName_842">
<img alt="" id="eListPrdImage842_2" src="https://cafe24img.poxo.com/jyp3602022/web/product/medium/202210/ca01b08c39232296f482b657be53aa4b.jpg" />
</a>
<span class="wish">
<img alt="Before add to wish list" categoryno="59" class="icon_img ec-product-listwishicon" icon_status="off" individual-set="F" login_status="F" productno="842" src="/web/upload/icon_202204271744355800.png" />
</span>
</div>
<div class="icon">
<div class="promotion"></div>
<div class="button">
<div class="option"></div>
<img alt="Add to cart" class="ec-admin-icon cart" onclick="category_add_basket('842','59', '2', 'A0000', false, '1', 'P0000BGK', 'B', 'T', '20');" src="/web/upload/icon_202204271744303700.png" />
<img alt="View larger image" onclick="zoom('842', '59', '2','', '');" src="//img.echosting.cafe24.com/design/skin/admin/en_US/btn_prd_zoom.gif" style="cursor:pointer" />
</div>
</div>
</div>
(html from Site): https://pastebin.com/2xfi4XTA
<div class="thumbnail">
<div class="prdImg">
<a href="/product/stray-kids-mini-album-maxident-case-ver/842/category/59/display/1/">
<img src="https://cafe24img.poxo.com/jyp3602022/web/product/medium/202210/ca01b08c39232296f482b657be53aa4b.jpg" id="eListPrdImage842_1" alt="">
</a>
</div>
<span class="pro_icon">
<img src="/web/upload/icon_202204271744355800.png" class="icon_img ec-product-listwishicon" alt="Before add to wish list" productno="842" categoryno="59" icon_status="off" login_status="F" individual-set="F">
<img src="/web/upload/icon_202204271744303700.png" onclick="category_add_basket('842','59', '1', 'A0000', false, '1', 'P0000BGK', 'B', 'T', '20');" alt="Add to cart" class="ec-admin-icon cart">
</span>
<span class="soldout_icon"></span>
</div>
Note that the <span class="soldout_icon"></span> tag does not appear in what Bs4 sees, among other things.
My guess as to why this is the case;
I am not using a headless browser, so some websites such as this one might not display the same thing.
There is some JS running in the background that Bs4 does not pick up on
Please let me know if any of my guesses are incorrect and what is actually going on!
Yes, you are right as
the second page is beeing built dynamicaly so you can't get the real html with bs4. Try to use combination of selenium and bs4 to get what you need. Here is a small script that finds some hidden divs and print them out. You should get deeper insight and simulate web surfing to catch the html when the page is fully developed. This one below is still in the process of construction.
import time
from bs4 import BeautifulSoup
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')
driver = webdriver.Chrome(options = options)
urls = ['https://en.thejypshop.com/category/cdlp/59/', 'https://pastebin.com/2xfi4XTA']
for url in urls:
data = driver.get(url)
time.sleep(1)
pg_html = driver.page_source
pg_html = pg_html.replace('<', '<').replace('>', '>')
soup = BeautifulSoup(pg_html, 'html.parser')
dv = soup.find_all('div', attrs={'class': 'thumbnail'})
dv1 = soup.find_all('span', attrs={'class': 'soldout_icon'})
try:
print(60 * '-')
print(dv[0])
except:
pass
print(60 * '-')
try:
print(dv1[0])
print(60 * '-')
except:
pass
''' R e s u l t :
------------------------------------------------------------
<div class="thumbnail">
<div class="prdImg">
<img alt="" id="eListPrdImage842_2" src="https://cafe24img.poxo.com/jyp3602022/web/product/medium/202210/ca01b08c39232296f482b657be53aa4b.jpg"/>
<span class="wish"><img alt="Before add to wish list" categoryno="59" class="icon_img ec-product-listwishicon" icon_status="off" individual-set="F" login_status="F" productno="842" src="/web/upload/icon_202204271744355800.png"/></span>
</div>
<div class="icon">
<div class="promotion"> </div>
<div class="button">
<div class="option"></div> <img alt="Add to cart" class="ec-admin-icon cart" onclick="category_add_basket('842','59', '2', 'A0000', false, '1', 'P0000BGK', 'B', 'T', '20');" src="/web/upload/icon_202204271744303700.png"/> <img alt="View larger image" onclick="zoom('842', '59', '2','', '');" src="//img.echosting.cafe24.com/design/skin/admin/en_US/btn_prd_zoom.gif" style="cursor:pointer"/> </div>
</div>
</div>
------------------------------------------------------------
<span class="soldout_icon"></span>
------------------------------------------------------------
------------------------------------------------------------
<div class="thumbnail">
</div>
------------------------------------------------------------
<span class="soldout_icon"></span>
------------------------------------------------------------
'''
Regards...

beautifulsoup4 - grab Sibling element if Sibling present

The most common repetitive structure of the HTML is:
<p class="Standard">
<span class="T3">
it is possible for you
</span>
</p>
in such situations I grab the text it is possible for you
Occasionally (i.e., not always), the <p> of class="Standard" has a sibling <p> of class="P3", like so:
<p class="P3">
(to ask a question in Spanish, you just use inflection)
</p>
When this <p> of class="P3" is present, I want to additionally grab the text inside it, e.g. here I would additionally grab: (to ask a question in Spanish, you just use inflection)
My question is, given this kind of structure:
<div>
...
<p class="Standard">
<span class="T3">
it is possible for you
</span>
</p>
<p class="Standard">
<span class="T3">
it is acceptable for me
</span>
</p>
<p class="P3">
(to ask a question in Spanish, you just use inflection)
</p>
...
</div>
How can I produce output like this:
it is possible for you
it is acceptable for me
(to ask a question in Spanish, you just use inflection)
Currently, I've managed to do this:
p_standards = soup.find_all("p", class_ = "Standard")
for p_standard in p_standards:
p_english = p_standard.find("span", class_="T3")
print(p_english.contents[0])
And the output I get is:
it is possible for you
it is acceptable for me
use this :
Python Code :
from bs4 import BeautifulSoup
import re
text = '''
<div>
<p class="Standard">
<span class="T3">
it is possible for you
</span>
</p>
<p class="Standard">
<span class="T3">
it is acceptable for me
</span>
</p>
<p class="P3">
(to ask a question in Spanish, you just use inflection)
</p>
</div>
'''
soup = BeautifulSoup(text,features='html.parser')
p_standards = soup.find_all("p", class_ = "Standard")
for p_standard in p_standards:
p_english = p_standard.find('span',attrs={'class':'T3'})
nextSibling = p_standard.find_next_sibling()
print(p_english.text)
if(nextSibling.attrs['class'][0] == 'P3' and nextSibling.name == 'p'):
print(nextSibling.text)
Demo : Here
Explanation :
In order to get the class value within the find_next_sibling's
returned element i had to search into the variables of the instance
its self as there is no doc that mentions it on the official website
so i printed nextSibling.__dict__.keys()
the 0 index is because the class attribute's type is an array
I think it is more efficient to use css Or syntax and an adjacent sibling combinator to perform this
from bs4 import BeautifulSoup as bs
html = '''
<div>
...
<p class="Standard">
<span class="T3">
it is possible for you
</span>
</p>
<p class="Standard">
<span class="T3">
it is acceptable for me
</span>
</p>
<p class="P3">
(to ask a question in Spanish, you just use inflection)
</p>
...
</div>
'''
soup = bs(html, 'lxml')
items = [i.text.strip() for i in soup.select('.Standard, .Standard + .P3')]
print(items)

How to create and reference custom heading ids with reStructuredText?

Currently, if I have:
My header
=========
`My header`_
rst2html Docutils 0.14 produces:
<div class="document" id="my-header">
<h1 class="title">My header</h1>
<p><a class="reference internal" href="#my-header">My header</a></p>
Is it possible to obtain the following ouptut instead:
<h1 class="title" id="my-custom-header">My header</h1>
<p><a class="reference internal" href="#my-custom-header">My header</a></p>
So note how I want two changes:
the id to be inside the heading, not on a separate div
control over the actual id
The closest I could get was:
<div class="document" id="my-header">
<span id="my-custom-header"></span>
<h1 class="title">My header</h1>
<p><a class="reference external" href="my-custom-header">My header</a></p>
but this is still not ideal, as I now have multiple ids floating around, and not inside the h1.
Asciidoc for example has that covered with:
[[my-custom-header]]
== My header
<<my-custom-header>>

webdriver select first sibling based on nth sibling condition

i'm writing a test case in a grid based software
I'm mostly using css selectors to select elements and perform clicking
based on the image - I'm selecting the right circled element (base don a css class that displays the blue dot ), now, based on this condition, I want to select the first sibling element - which is a "plus", basically that would open the sub grid further and allow me to run further testing
I can't seem to be able to do that -
assuming that I'm using the following sample html
<div class="td">
<a class="opener">
....
</a>
</div>
<div class="td">
...
</div>
<div class="td">
...
</div>
<div class="td">
...
</div>
<div class="td">
...
</div>
<div class="td">
<a class="round-solid">
...
</a>
</div>
I can select "round-solid" - based on this, how do I select "opener" element ?
I only want the opener element for which a specific column contains the "round-solid" class
That should do the trick:
driver.findElement(By.xpath("//a[#class='round-solid']/../preceding-sibling::div/a[#class='opener']"));

Stars and aggregated rating are not shown when using schema.org markup and and Review in xhtml page

I'm trying to implement schema.org's microData format in my xhtml template.
Since I'm using xhtml templates, I needed to add
<div itemprop="reviews" itemscope="itemscope" itemtype="http://schema.org/Review">
instead of:
<div itemprop="reviews" itemscope itemtype="http://schema.org/Review">
otherwise my template wouldn't be parsed. I found the solution here
My markup looks like this:
<div itemscope="itemscope" itemtype="http://schema.org/Place">
<div itemprop="aggregateRating" itemscope="itemscope"
itemtype="http://schema.org/AggregateRating">
<span itemprop="ratingValue">#{company.meanRating}</span> stars -
based on <span itemprop="reviewCount">#{company.confirmedReviewCount}</span> reviews
</div>
<ui:repeat var="review" value="#{company.reverseConfirmedReviews}">
<div itemprop="reviews" itemscope="itemscope" itemtype="http://schema.org/Review">
<span itemprop="name">Not a happy camper</span> -
by <span itemprop="author">#{review.reviewer.firstName}</span>,
<div itemprop="reviewRating" itemscope="itemscope" itemtype="http://schema.org/Rating">
<span itemprop="ratingValue">1</span>/
<span itemprop="bestRating">5</span>stars
</div>
<span itemprop="description">#{review.text} </span>
</div>
</ui:repeat>
</div>
When testing this in http://www.google.com/webmasters/tools/richsnippets I'm not getting any stars back or aggregated review count
What am I doing wrong here?
Yes!!
The problem actually consisted of two errors, first somebody had named the div class to
"hReview-aggregate" which is appropriate when you implement Microformats not
Microdata
The second error was that I misunderstood the specification of schema.org.
This is how I end up doing:
<div class="box bigBox" itemscope="itemscope" itemtype="http://schema.org/LocalBusiness">
<span itemprop="name">#{viewCompany.name}</span>
<div class="subLeftColumn" style="margin-top:10px;" itemprop="aggregateRating" itemscope="itemscope" itemtype="http://schema.org/AggregateRating">
<div class="num">
<span class="rating" id="companyRating" itemprop="ratingValue">#{rating}</span>
</div>
<div>Grade</div>
<div class="num">
<span class="count" id="companyCount" itemprop="reviewCount">
#{confirmedReviewCount}
</span>
</div>
</div>
</div>
Hope this helps!!!!!
hey checkout how holidayhq guys have done it for this url : www.holidayiq.com/destinations/Lonavala-Overview.html
you can check there snippet on this tool : http://www.google.com/webmasters/tools/richsnippets
and google out this keyword "lonavala attractions" and you will see the same snippet, they have used microdata to generate this reviews in snippet, they have used typeof="v:Review-aggregate" and much more tags, have a look at it, its nice implementation of the reviews in snippet kind of work.

Resources