Verifying Bs4 Parsing Output from a Website - web-scraping

I was trying to scrape this site and ran into errors because tags that I expected to exist were missing from the HTML that Bs4 parsed.
Site: https://en.thejypshop.com/category/cdlp/59/
I manually verified that the parsed output from Bs4 was giving me a completely different view of the html than when I inspected the site itself; here is a comparison of the two (copied relevant html in the two pastebin links). I also tried scraping with different parsing options such as 'lxml', 'html.parser', etc. but to no avail.
(Bs4 Output): https://pastebin.com/tg4P5DFh
<div class="thumbnail">
<div class="prdImg">
<a href="/product/stray-kids-mini-album-maxident-case-ver/842/category/59/display/2/" name="anchorBoxName_842">
<img alt="" id="eListPrdImage842_2" src="https://cafe24img.poxo.com/jyp3602022/web/product/medium/202210/ca01b08c39232296f482b657be53aa4b.jpg" />
</a>
<span class="wish">
<img alt="Before add to wish list" categoryno="59" class="icon_img ec-product-listwishicon" icon_status="off" individual-set="F" login_status="F" productno="842" src="/web/upload/icon_202204271744355800.png" />
</span>
</div>
<div class="icon">
<div class="promotion"></div>
<div class="button">
<div class="option"></div>
<img alt="Add to cart" class="ec-admin-icon cart" onclick="category_add_basket('842','59', '2', 'A0000', false, '1', 'P0000BGK', 'B', 'T', '20');" src="/web/upload/icon_202204271744303700.png" />
<img alt="View larger image" onclick="zoom('842', '59', '2','', '');" src="//img.echosting.cafe24.com/design/skin/admin/en_US/btn_prd_zoom.gif" style="cursor:pointer" />
</div>
</div>
</div>
(html from Site): https://pastebin.com/2xfi4XTA
<div class="thumbnail">
<div class="prdImg">
<a href="/product/stray-kids-mini-album-maxident-case-ver/842/category/59/display/1/">
<img src="https://cafe24img.poxo.com/jyp3602022/web/product/medium/202210/ca01b08c39232296f482b657be53aa4b.jpg" id="eListPrdImage842_1" alt="">
</a>
</div>
<span class="pro_icon">
<img src="/web/upload/icon_202204271744355800.png" class="icon_img ec-product-listwishicon" alt="Before add to wish list" productno="842" categoryno="59" icon_status="off" login_status="F" individual-set="F">
<img src="/web/upload/icon_202204271744303700.png" onclick="category_add_basket('842','59', '1', 'A0000', false, '1', 'P0000BGK', 'B', 'T', '20');" alt="Add to cart" class="ec-admin-icon cart">
</span>
<span class="soldout_icon"></span>
</div>
Note that the <span class="soldout_icon"></span> tag does not appear in what Bs4 sees, among other things.
My guesses as to why this is the case:
I am not using a headless browser, so some websites such as this one might not serve the same content.
There is some JS running in the background that Bs4 does not pick up on.
Please let me know if any of my guesses are incorrect and what is actually going on!

Yes, you are right: the second version (the one you see in the browser) is built dynamically, so you can't get the final HTML with bs4 alone. Try combining selenium and bs4 to get what you need. Here is a small script that finds some of the hidden elements and prints them out. To go deeper, you would need to simulate browsing and capture the HTML once the page is fully rendered. The script below is still a work in progress.
import time
from bs4 import BeautifulSoup
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
urls = ['https://en.thejypshop.com/category/cdlp/59/', 'https://pastebin.com/2xfi4XTA']
for url in urls:
    driver.get(url)  # get() returns None, so there is nothing to store
    time.sleep(1)  # crude wait so the page's JavaScript has time to run
    pg_html = driver.page_source
    # un-escape angle brackets (the pastebin page shows the HTML as escaped text)
    pg_html = pg_html.replace('&lt;', '<').replace('&gt;', '>')
    soup = BeautifulSoup(pg_html, 'html.parser')
    dv = soup.find_all('div', attrs={'class': 'thumbnail'})
    dv1 = soup.find_all('span', attrs={'class': 'soldout_icon'})
    try:
        print(60 * '-')
        print(dv[0])
    except IndexError:  # no matching div on this page
        pass
    print(60 * '-')
    try:
        print(dv1[0])
        print(60 * '-')
    except IndexError:  # no matching span on this page
        pass
driver.quit()
''' R e s u l t :
------------------------------------------------------------
<div class="thumbnail">
<div class="prdImg">
<img alt="" id="eListPrdImage842_2" src="https://cafe24img.poxo.com/jyp3602022/web/product/medium/202210/ca01b08c39232296f482b657be53aa4b.jpg"/>
<span class="wish"><img alt="Before add to wish list" categoryno="59" class="icon_img ec-product-listwishicon" icon_status="off" individual-set="F" login_status="F" productno="842" src="/web/upload/icon_202204271744355800.png"/></span>
</div>
<div class="icon">
<div class="promotion"> </div>
<div class="button">
<div class="option"></div> <img alt="Add to cart" class="ec-admin-icon cart" onclick="category_add_basket('842','59', '2', 'A0000', false, '1', 'P0000BGK', 'B', 'T', '20');" src="/web/upload/icon_202204271744303700.png"/> <img alt="View larger image" onclick="zoom('842', '59', '2','', '');" src="//img.echosting.cafe24.com/design/skin/admin/en_US/btn_prd_zoom.gif" style="cursor:pointer"/> </div>
</div>
</div>
------------------------------------------------------------
<span class="soldout_icon"></span>
------------------------------------------------------------
------------------------------------------------------------
<div class="thumbnail">
</div>
------------------------------------------------------------
<span class="soldout_icon"></span>
------------------------------------------------------------
'''
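To pin down exactly which elements exist only after the JavaScript has run, one option is to diff the class names found in the two versions of the HTML. The following is a minimal sketch, not part of the script above: it uses the standard library's html.parser instead of bs4 so it runs without extra installs, ClassCollector and class_names are hypothetical helper names, and the two HTML strings are abbreviated stand-ins for the snippets in the question.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every CSS class name seen in the parsed HTML."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def class_names(html):
    """Return the set of class names used anywhere in the given HTML."""
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return collector.classes


# Abbreviated stand-ins (assumption): what requests/bs4 sees vs. the browser DOM.
static_html = (
    '<div class="thumbnail"><div class="prdImg"></div>'
    '<div class="icon"></div></div>'
)
rendered_html = (
    '<div class="thumbnail"><div class="prdImg"></div>'
    '<span class="pro_icon"></span><span class="soldout_icon"></span></div>'
)

# Classes that only appear once the page has been rendered by a browser.
only_rendered = class_names(rendered_html) - class_names(static_html)
print(sorted(only_rendered))
```

Any class that shows up in only_rendered is a candidate for something the site adds with JavaScript, which is why a plain request followed by bs4 never sees it.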
Regards...

Related

Python-BS4 text inside <span> that has no class

I have these two spans with text inside them. They have no class or id, and I want to scrape that text with bs4, but I don't know how. Selecting by the small tag doesn't help me because the HTML is full of those.
Can someone help me with an example?
<td valign="bottom" class="bottom-cell">
<div class="space rel">
<p class="lheight16">
<small class="breadcrumb x-normal">
<span><i data-icon="location-filled"></i>Iasi</span>
</small>
<small class="breadcrumb x-normal">
<span><i data-icon="clock"></i>Ieri 16:13</span>
</small>
</p>
Try this. The :nth-of-type(1) selector matches every span element that is the first child of its type within its parent.
# data is assumed to be the parsed BeautifulSoup object
for i in data.select('.lheight16 small span:nth-of-type(1)'):
    print(i.text)
There are multiple options to do this, but most rely on the parents of the spans. Since your question shows no expected output (I recommend adding it), check these two.
Option a:
for span in soup.select('td.bottom-cell span'):
    print(span.get_text())
Option b:
print(soup.select_one('td.bottom-cell').get_text(' - ', strip=True))
Example
from bs4 import BeautifulSoup
html='''
<td valign="bottom" class="bottom-cell">
<div class="space rel">
<p class="lheight16">
<small class="breadcrumb x-normal">
<span><i data-icon="location-filled"></i>Iasi</span>
</small>
<small class="breadcrumb x-normal">
<span><i data-icon="clock"></i>Ieri 16:13</span>
</small>
</p>
</div>
</td>
'''
soup = BeautifulSoup(html, 'lxml')
# option a:
for span in soup.select('td.bottom-cell span'):
    print(span.get_text())
# option b:
print(soup.select_one('td.bottom-cell').get_text(' - ', strip=True))
Output
a:
Iasi
Ieri 16:13
b:
Iasi - Ieri 16:13
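Since neither span has a class or id, another way to tell them apart is the data-icon attribute of the <i> inside each one. Below is a minimal sketch of that idea, assuming every span of interest starts with such an <i>. It uses the standard library's html.parser rather than bs4 so it runs without extra installs, and IconTextParser is a hypothetical helper name. With bs4 the same idea would be roughly soup.select('span i[data-icon]') followed by reading each match's parent text.

```python
from html.parser import HTMLParser


class IconTextParser(HTMLParser):
    """Map each <i data-icon="..."> to the text of its enclosing <span>."""

    def __init__(self):
        super().__init__()
        self.in_span = False
        self.current_icon = None  # data-icon value of the span being read
        self.result = {}

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.in_span = True
            self.current_icon = None
        elif tag == "i" and self.in_span:
            self.current_icon = dict(attrs).get("data-icon")

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_span = False
            self.current_icon = None

    def handle_data(self, data):
        text = data.strip()
        if self.in_span and self.current_icon and text:
            self.result[self.current_icon] = text


html = '''
<p class="lheight16">
<small class="breadcrumb x-normal">
<span><i data-icon="location-filled"></i>Iasi</span>
</small>
<small class="breadcrumb x-normal">
<span><i data-icon="clock"></i>Ieri 16:13</span>
</small>
</p>
'''

parser = IconTextParser()
parser.feed(html)
parser.close()
print(parser.result)
```

Keying on the icon name keeps working if the order of the two spans changes, which a positional selector would not.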

How to create and reference custom heading ids with reStructuredText?

Currently, if I have:
My header
=========
`My header`_
rst2html Docutils 0.14 produces:
<div class="document" id="my-header">
<h1 class="title">My header</h1>
<p><a class="reference internal" href="#my-header">My header</a></p>
Is it possible to obtain the following output instead:
<h1 class="title" id="my-custom-header">My header</h1>
<p><a class="reference internal" href="#my-custom-header">My header</a></p>
So note how I want two changes:
the id to be inside the heading, not on a separate div
control over the actual id
The closest I could get was:
<div class="document" id="my-header">
<span id="my-custom-header"></span>
<h1 class="title">My header</h1>
<p><a class="reference external" href="my-custom-header">My header</a></p>
but this is still not ideal, as I now have multiple ids floating around, and not inside the h1.
Asciidoc for example has that covered with:
[[my-custom-header]]
== My header
<<my-custom-header>>

Create custom portlet - mixture of news and events

I'm trying to create a custom portlet which displays news or event items based on a keyword. I don't want to use the collection portlet facility. I've created a Python Script which fetches the necessary results like this:
from Products.CMFCore.utils import getToolByName
portal_catalog = getToolByName(context, 'portal_catalog')
return portal_catalog.searchResults(
    Subject='Startseite',
    end={'query': DateTime(),
         'range': 'min'},
    sort_on='start',
    sort_limit=3,
    review_state='external')[:5]
So this should fetch all objects with the review_state 'external' and the keyword (or subject) 'Startseite'. Now I've created a page template which is used to render a classic portlet:
<html xmlns:tal="http://xml.zope.org/namespaces/tal"
xmlns:metal="http://xml.zope.org/namespaces/metal"
i18n:domain="plone">
<body>
<div metal:define-macro="portlet"
tal:define="view context/##events_view;
results context/startpage_informations;
events_link view/all_events_link;
prev_events_link view/prev_events_link"
tal:condition="results">
<dl class="portlet" id="portlet-events">
<dt class="portletHeader">
<span class="portletTopLeft"></span>
<a href=""
tal:attributes="href events_link"
class="tile">
Wichtige Neuigkeiten:
</a>
<span class="portletTopRight"></span>
</dt>
<tal:events tal:repeat="obj results">
<dd class="portletItem"
tal:define="oddrow repeat/obj/odd"
tal:attributes="class python:test(oddrow, 'portletItem even', 'portletItem odd')">
<a href="#"
class="tile"
tal:attributes="href obj/getURL;
title obj/Description">
<img src="#" alt="" tal:replace="structure here/news_icon.gif" />
<span tal:replace="obj/pretty_title_or_id">
Some Event
</span>
<span class="portletItemDetails"
tal:define="starts python:toLocalizedTime(obj.start, long_format=1);
ends python:toLocalizedTime(obj.end, long_format=1);
startTime python:toLocalizedTime(obj.start,time_only=1);
endTime python:toLocalizedTime(obj.end,time_only=1);
startDay python:toLocalizedTime(obj.start, long_format=0);
endDay python:toLocalizedTime(obj.end, long_format=0);">
<span>
<tal:condition condition="obj/location">
<tal:location content="obj/location">Location</tal:location>,<br />
</tal:condition>
<tal:sameday tal:condition="python:startDay==endDay">
<span tal:condition="startDay" tal:replace="startDay">:[If this is an event, show its start time and date]</span><br />
Uhrzeit: <span tal:condition="startTime" tal:replace="startTime">[If this is an event, show its start time and date]</span> -
<span tal:condition="endTime" tal:replace="endTime">[If this is an event, show its start time and date]</span>
</tal:sameday>
<tal:multiday tal:condition="python:startDay!=endDay">
Vom <span tal:condition="starts" tal:replace="starts">[If this is a multi-day event, show its start date]</span><br />bis
<tal:hasendday tal:condition="ends">
<span tal:replace="ends">[If this is a multi-day event, show its end date]</span>
</tal:hasendday>
</tal:multiday>
</span>
</span>
</a>
</dd>
</tal:events>
</dl>
</div>
</body>
</html>

Stars and aggregated rating are not shown when using schema.org markup and Review in xhtml page

I'm trying to implement schema.org's microData format in my xhtml template.
Since I'm using xhtml templates, I needed to add
<div itemprop="reviews" itemscope="itemscope" itemtype="http://schema.org/Review">
instead of:
<div itemprop="reviews" itemscope itemtype="http://schema.org/Review">
otherwise my template wouldn't be parsed. I found the solution here
My markup looks like this:
<div itemscope="itemscope" itemtype="http://schema.org/Place">
<div itemprop="aggregateRating" itemscope="itemscope"
itemtype="http://schema.org/AggregateRating">
<span itemprop="ratingValue">#{company.meanRating}</span> stars -
based on <span itemprop="reviewCount">#{company.confirmedReviewCount}</span> reviews
</div>
<ui:repeat var="review" value="#{company.reverseConfirmedReviews}">
<div itemprop="reviews" itemscope="itemscope" itemtype="http://schema.org/Review">
<span itemprop="name">Not a happy camper</span> -
by <span itemprop="author">#{review.reviewer.firstName}</span>,
<div itemprop="reviewRating" itemscope="itemscope" itemtype="http://schema.org/Rating">
<span itemprop="ratingValue">1</span>/
<span itemprop="bestRating">5</span>stars
</div>
<span itemprop="description">#{review.text} </span>
</div>
</ui:repeat>
</div>
When testing this in http://www.google.com/webmasters/tools/richsnippets I'm not getting any stars back or aggregated review count
What am I doing wrong here?
Yes!!
The problem actually consisted of two errors. First, somebody had set the div class to "hReview-aggregate", which is appropriate when you implement Microformats, not Microdata.
The second error was that I had misunderstood the schema.org specification.
This is what I ended up doing:
<div class="box bigBox" itemscope="itemscope" itemtype="http://schema.org/LocalBusiness">
<span itemprop="name">#{viewCompany.name}</span>
<div class="subLeftColumn" style="margin-top:10px;" itemprop="aggregateRating" itemscope="itemscope" itemtype="http://schema.org/AggregateRating">
<div class="num">
<span class="rating" id="companyRating" itemprop="ratingValue">#{rating}</span>
</div>
<div>Grade</div>
<div class="num">
<span class="count" id="companyCount" itemprop="reviewCount">
#{confirmedReviewCount}
</span>
</div>
</div>
</div>
Hope this helps!!!!!
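Before pasting markup like this into the testing tool, a quick local check can confirm that the properties the aggregate rating relies on (ratingValue and reviewCount in the markup above) are actually present in the rendered output. This is only a rough sketch: it does not reproduce what Google's tool checks, ItempropParser is a hypothetical helper name, it only handles itemprop on elements that directly contain their text, and the sample values are placeholders standing in for the rendered #{...} expressions.

```python
from html.parser import HTMLParser


class ItempropParser(HTMLParser):
    """Record the text of elements whose itemprop wraps the text directly."""

    def __init__(self):
        super().__init__()
        self.pending = None  # itemprop of the most recently opened element
        self.props = {}

    def handle_starttag(self, tag, attrs):
        # Elements without an itemprop reset the pending name.
        self.pending = dict(attrs).get("itemprop")

    def handle_endtag(self, tag):
        self.pending = None

    def handle_data(self, data):
        text = data.strip()
        if text and self.pending:
            self.props[self.pending] = text
            self.pending = None


# Placeholder values (assumption) in place of the template expressions.
rendered = '''
<div itemscope="itemscope" itemtype="http://schema.org/LocalBusiness">
  <span itemprop="name">Example Co</span>
  <div itemprop="aggregateRating" itemscope="itemscope"
       itemtype="http://schema.org/AggregateRating">
    <span itemprop="ratingValue">4.5</span>
    <div>Grade</div>
    <span itemprop="reviewCount">12</span>
  </div>
</div>
'''

parser = ItempropParser()
parser.feed(rendered)
parser.close()

# Properties used by the aggregate rating in the markup above.
missing = {"ratingValue", "reviewCount"} - parser.props.keys()
print(parser.props)
print("missing:", sorted(missing))
```

An empty "missing" list only shows the properties made it into the output; whether stars are displayed is still up to the search engine.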
Hey, check out how the HolidayIQ guys have done it for this URL: www.holidayiq.com/destinations/Lonavala-Overview.html
You can check their snippet with this tool: http://www.google.com/webmasters/tools/richsnippets
If you search Google for the keyword "lonavala attractions" you will see the same snippet. They have used typeof="v:Review-aggregate" (which is RDFa-style markup rather than Microdata) and many more tags. Have a look at it; it's a nice implementation of reviews in the snippet.

Does IE7 have a problem applying CSS to dynamically created DOM Elements?

I'm building an HTML page that uses endless scrolling functionality to render new list elements as you scroll down, like on Facebook. I'm using the jquery.pageless plugin.
The thing is, now testing it on IE7, when I load the dynamic content, none of it is styled like it should be. The first set (20 rows), which the server generated in the html page, look fine. Then the next 20-per-page that are rendered with javascript, don't have any of the styles applied.
How do I fix this? Having a hard time testing it in IE7 from a mac. Is this a problem with IE7? Or could perhaps the elements not be being appended to the correct parent div in IE (using jQuery so I doubt this)? Or is there a common hack for reloading the stylesheets after every dynamically created html element is added?
The doctype is HTML5: <!DOCTYPE html>
Thanks for the advice!
Update:
Looking in the IE7 developer panel, the HTML is being spit out all malformed. The first time around it looks like this:
<article class='community-page page none vevent' data-status='available' data-type='community_page' itemscope='itemscope' itemtype='http://www.data-vocabulary.org/Event'>
<figure class='snapshot'><time class='availability dtstart' datetime='2010-12-16T00:00:00-08:00' itemprop='startDate' title='2010-12-16T00:00:00-08:00'><span class='value-title' title='2010-12-16T00:00:00-08:00'></span></time><span></span><img alt="Logo for Heavenly Cleaning" class="photo" src="/images/41/heavenly-cleaning-logo-small.JPG?1297971958" title="Logo for Heavenly Cleaning" />
Like
</figure>
<section class='details' itemprop='seller' itemtype='http://data-vocabulary.org/Organization'>
<header class='header'>
<hgroup>
<h3 class='user fn org' itemprop='name'>
Name<span class='hyphen'>-</span><span class='distance'>Wheaton, IL</span>
</h3>
<h2 class='title'><span class='quotation-mark'>"</span>Quote<span class='quotation-mark'>"</span><time class='expiration-date dtend'><span class='value-title' title='2011-12-16T00:00:00-08:00'></span></time></h2>
</hgroup>
</header>
<p class='highlights'></p>
<p class='description' itemprop='description'></p>
<footer class='footer'>
<address class='location adr' itemprop='address' itemscope='itemscope' itemtype='http://data-vocabulary.org/Address'>
<span class='locality' itemprop='locality'></span>
<span class='geo' itemprop='geo' itemtype='http://data-vocabulary.org/Geo'>
<span class='latitude' itemprop='latitude'>
<span class='value-title' title='41.850249'></span>
</span>
<span class='longitude' itemprop='longitude'>
<span class='value-title' title='-88.0855459'></span>
</span>
</span>
<span class='tel' itemprop='tel'></span>
</address>
Category: Home
</footer>
</section>
</article>
The second time around it looks more like this:
<article class='community-page page none vevent' data-status='available' data-type='community_page' itemscope='itemscope' itemtype='http://www.data-vocabulary.org/Event'/>
<figure class='snapshot'/>
<time class='availability dtstart' datetime='2010-12-16T00:00:00-08:00' itemprop='startDate' title='2010-12-16T00:00:00-08:00'/>
<span class='value-title' title='2010-12-16T00:00:00-08:00'/>
</time/>
<a href="/users/25?page_id=25" class="fancy-ajax logo">
<span/>
<img alt="Logo for Heavenly Cleaning" class="photo" src="/images/41/name-logo-small.JPG?1297971958" title="Logo for Heavenly Cleaning" />
</a>
Like
</figure/>
...
I am returning it as a json string and appending it like this:
$(window).load(function() {
  var params = paginator;
  params.dataType = "string";
  $('#content').pageless({
    url: window.location.pathname,
    params: params,
    distance: 500,
    totalPages: 10,
    loaderImage: "/images/loaders/load.gif",
    scrape: function(data) {
      var data = $.parseJSON(data);
      var paginator = data.paginator;
      var search = data.search;
      var html = data.pages; // html string
      if (data.more == false) {
        $.pageless.settings.totalPages = $.pageless.settings.currentPage;
        if ($.pageless.settings.totalPages <= $.pageless.settings.currentPage) {
          $.pageless.stopListener();
        }
      }
      $.pageless.settings.params = {dataType: "string", paginator: paginator, q: search.q, c: search.c, l: search.l, a: search.a};
      return html;
    }
  });
})
Since you're using HTML5 elements, I assume you're using HTML5Shiv or Modernizr to hack IE to support those elements?
If not then yes, you will definitely have issues, since IE6/7/8 will simply not recognise those tags as valid HTML.
