Scrapy splash can't find element - web-scraping

Problem:
I am using scrapy-splash to scrape a web page. The CSS path for imageURL does not return any element, although the ones for name and category work fine. (The XPath and selectors were all copied directly from Chrome.)
Things I've Tried:
At first I thought the page had not fully loaded when parse was called, so I changed the wait argument of SplashRequest to 5, but that did not help. I also downloaded a copy of the HTML response from the Splash GUI (http://localhost:8050) and verified that the XPath and selectors all work on the downloaded copy. I assumed this HTML would be exactly what Scrapy sees in parse, so I can't make sense of why the selector fails inside the Scrapy script.
Code:
Here is my code:
import scrapy
from scrapy_splash import SplashRequest

# ProductItem is the project's own item class; its import is not shown in the question.


class NikeSpider(scrapy.Spider):
    name = 'nike'
    allowed_domains = ['nike.com', 'store.nike.com']
    start_urls = ['https://www.nike.com/t/air-vapormax-flyknit-utility-running-shoe-XPTbVZzp/AH6834-400']

    def start_requests(self):
        for url in self.start_urls:
            # Render the page in Splash and wait 5 seconds before the HTML is returned.
            yield SplashRequest(
                url=url,
                callback=self.parse,
                args={'wait': 5},
            )

    def parse(self, response):
        name = response.xpath('//*[@id="RightRail"]/div/div[1]/div[1]/h1/text()').extract_first()
        imageURL = response.css('#PDP > div > div:nth-child(2) > div.css-1jldkv2 > div:nth-child(1) > div > div > div.d-lg-h.bg-white.react-carousel > div > div.slider-container.horizontal.react-carousel-slides > ul > li.slide.selected > div > picture:nth-child(3) > img::attr(src)').extract_first()
        category = response.css('#RightRail > div > div.d-lg-ib.mb0-sm.mb8-lg.u-full-width > div.ncss-base.pr12-sm > h2::text').extract_first()
        url = response.url
        # Only emit an item when all three fields were found.
        if name is not None and imageURL is not None and category is not None:
            item = ProductItem()
            item['name'] = name
            item['imageURL'] = imageURL
            item['category'] = category
            item['URL'] = url
            yield item

Maybe they use different markup, but for me it works with source::attr(srcset) at the end:
imageURL = response.css('#PDP > div > div:nth-child(2) > div.css-1jldkv2 > div:nth-child(1) > div > div > div.d-lg-h.bg-white.react-carousel > div > div.slider-container.horizontal.react-carousel-slides > ul > li.slide.selected > div > picture:nth-child(3) > source::attr(srcset)').extract_first()
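
A srcset value is a comma-separated list of candidates rather than a single URL, so the string returned by source::attr(srcset) may still need splitting before it is stored in imageURL. Here is a minimal sketch using only the standard library. The helper name parse_srcset and the example.com URLs are made up for illustration, and the split is a simplification of the HTML parsing rules: it assumes candidates are separated by a comma followed by whitespace.

```python
import re


def parse_srcset(srcset):
    """Split a srcset value into (url, descriptor) pairs.

    Simplification: candidates are assumed to be separated by a comma
    followed by whitespace, so commas inside a URL are left alone.
    """
    candidates = []
    for part in re.split(r",\s+", srcset.strip()):
        fields = part.split()
        if not fields:
            continue
        url = fields[0]
        descriptor = fields[1] if len(fields) > 1 else ""
        candidates.append((url, descriptor))
    return candidates


# Hypothetical srcset value; the real page's value is not shown in the answer.
example = "https://example.com/shoe_640.jpg 1x, https://example.com/f_auto,q_80/shoe_1280.jpg 2x"
candidates = parse_srcset(example)
first_url = candidates[0][0] if candidates else None
print(first_url)
```

In the spider, the same helper could be applied to the extracted string before assigning item['imageURL'], provided the value is not None.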

Related

Can't Access Static Drop-down Inside an Iframe

I want to click on a dropdown inside an iframe and select some options, but I keep getting this error in Cypress:
cypress-iframe commands can only be applied to exactly one iframe at a time. Instead found 2
Please see my code below
```
it('Publish and Lock Results', function () {
  setClasses.clickTools()
  setClasses.clickConfiguration()
  setClasses.publishAndLockResult()
  cy.frameLoaded();
  cy.wait(5000)
  cy.iframe().find("body > div:nth-child(6) > div:nth-child(1) > div:nth-child(1) > form:nth-child(2) > table:nth-child(5) > tbody:nth-child(1) > tr:nth-child(1) > td:nth-child(2) > div:nth-child(5) > p:nth-child(2)").click()
})
```
**Please see screenshot of the element**
[![enter image description here](https://i.stack.imgur.com/WpUfn.png)](https://i.stack.imgur.com/WpUfn.png)
There should be an iframe identifier that you can use (anything that singles out the iframe: a unique id, class, or attribute such as data-test):
let iframe = cy.get(iframeIdentifier).its('0.contentDocument.body').should('not.be.empty').then(cy.wrap)
iframe.find("body > div:nth-child(6) > div:nth-child(1) > div:nth-child(1) > form:nth-child(2) > table:nth-child(5) > tbody:nth-child(1) > tr:nth-child(1) > td:nth-child(2) > div:nth-child(5) > p:nth-child(2)").click()
As the error message suggests, there are two iframes on that page that the command could apply to, and you have to pick one.
Answer adapted from source: cypress.io blog post on working with iframes

Click SalesForce Report Refresh Button

I'm trying to find the simplest way to automatically have a SalesForce report refresh button clicked every 2 minutes. Below are the various "copy" details for the item inspection results (sorry, I'm not sure what information would be most helpful to complete this request).
Element:
Refresh
Outer HTML:
Refresh
Selector:
#brandBand_1 > div > div > div > div > div.slds-page-header--object-home.slds-page-header_joined.slds-page-header_bleed.slds-page-header.slds-shrink-none.test-headerRegion.forceListViewManagerHeader > div:nth-child(2) > div:nth-child(3) > force-list-view-manager-button-bar > div > div:nth-child(1) > lightning-button-icon
JS Path:
document.querySelector("#brandBand_1 > div > div > div > div > div.slds-page-header--object-home.slds-page-header_joined.slds-page-header_bleed.slds-page-header.slds-shrink-none.test-headerRegion.forceListViewManagerHeader > div:nth-child(2) > div:nth-child(3) > force-list-view-manager-button-bar > div > div:nth-child(1) > lightning-button-icon")
Xpath:
//*[@id="brandBand_1"]/div/div/div/div/div[1]/div[2]/div[3]/force-list-view-manager-button-bar/div/div[1]/lightning-button-icon
Full Xpath:
/html/body/div[4]/div[1]/section/div[1]/div/div[2]/div[1]/div/div/div/div/div/div/div/div[1]/div[2]/div[3]/force-list-view-manager-button-bar/div/div[1]/lightning-button-icon
I've tried some browser extensions, but so far have only found browser page refresh extensions.

Oh no - Scrapy CSS selector used several times on a product detail page?

I am trying to scrape products (nothing surprising there), but honestly, defining a CSS selector for the product description that works on every product page is giving me a headache.
I am looking for a selector that picks out the product description on the following page:
https://www.onlinebaufuchs.de/Werkzeug-Technik/Elektrowerkzeuge/Akku-Geraete/Akku-Schlagschrauber/Guede-Akku-Schlagschrauber-BSS-18-1-4-Zoll-0-Akkuschrauber-ohne-Akku-Ladegeraet::7886.html
The selector is:
#inner > div > div.col-lg-12-full.col-md-12-full > div:nth-child(1) > div:nth-child(12)
Alternatively, the selector can be:
div.pd_description:nth-of-type(6)
But sometimes the selector changes:
https://www.onlinebaufuchs.de/Werkzeug-Technik/Elektrowerkzeuge/Akku-Geraete/18-Volt-Lithium-Ionen-Akkusystem/Guede-Ladegeraet-LG-18-05-0-5-A-Aufladegeraet-fuer-diverse-Guede-Akku-Werkzeuge::7852.html
Here is the selector:
#inner > div > div.col-lg-12-full.col-md-12-full > div:nth-child(1) > div:nth-child(11)
Alternatively, the selector can be:
div.pd_description:nth-of-type(5)
When I look at the source code, the product description section is marked with the class
.pd_description
But that class is too general; it is used for other sections on the page as well.
I can't figure out how to solve this problem.
My spider runs correctly, but for some products I get empty descriptions (because of the issue described above).
def parse_product(self, response):
    for product in response.css("body"):
        yield {
            "brand": product.css('div.pd_inforow:nth-of-type(4) span::text').extract(),
            "item_name": product.css('h1::text').extract(),
            # Positional selector: matches only when the description is the 12th child.
            "description": product.css('#inner > div > div.col-lg-12-full.col-md-12-full > div:nth-child(1) > div:nth-child(12)').extract_first(),
        }
Why doesn't my CSS selector match the product description on every page?
Use an XPath selector instead (select the div whose class equals pd_description and that contains an h4 with the text Produktbeschreibung):
product.xpath('.//div[@class="pd_description"][h4[.="Produktbeschreibung"]]').get()
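
The idea of selecting the block by its heading text rather than by its position can be tried outside Scrapy. The sketch below uses xml.etree.ElementTree from the standard library as a stand-in for Scrapy's selector. ElementTree supports only a subset of XPath, so the predicate is written as [h4='Produktbeschreibung']. The markup and the helper name find_description are invented for illustration and are not taken from the shop's real pages.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: several blocks share the class,
# but only one of them carries the "Produktbeschreibung" heading.
PAGE = """
<html><body>
  <div class="pd_description"><h4>Lieferumfang</h4><p>1 x Schrauber</p></div>
  <div class="pd_description"><h4>Produktbeschreibung</h4><p>Akku-Schlagschrauber 18 V</p></div>
  <div class="pd_description"><h4>Technische Daten</h4><p>18 V</p></div>
</body></html>
"""


def find_description(markup):
    """Return the text of the description block, or None if it is absent."""
    root = ET.fromstring(markup)
    # Match on class and heading text, not on the block's position.
    node = root.find(".//div[@class='pd_description'][h4='Produktbeschreibung']")
    if node is None:
        return None
    return " ".join(p.text.strip() for p in node.findall("p") if p.text)


description = find_description(PAGE)
print(description)
```

Because the match depends on the heading text, the result stays the same when the description is the fifth, sixth, or any other pd_description block on the page.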

Symfony DOM Crawler: querying tag which matched current item

I'm using the Symfony DOM crawler to scrape some websites and one of the issues I'm having is that if I have a scrape target which contains multiple tags, such as:
$content['html'] = $crawler->filter('
    #content > div.container > div.row > div > p:nth-child(n+4),
    #content > div.container > div.row > div > h3,
    #content > div.container > div.row > div > blockquote')->each(function($node) {
        $data = strip_tags($node->html(), '<div>, <h1>, <h2>, <h3>, <h4>, <h5>, <h6>, <p>, <a>, <strong>, <em>, <img>');
        return $data;
    });
I'm not getting the [p], [h3] or [blockquote] tags in my results (which is correct). However, depending on which tag I've just scraped, I would like to process the result a bit further rather than just returning it.
Is there any way the crawler can be queried to return the tag which the current item was matched against? Basically, I'd like to know whether the current item/tag I've matched was a [p], [h3] or [blockquote] which in turn will enable me to further process the results.
Figured it out: there is a method
$node->nodeName();
which returns the tag name of the node the query matched.

Trouble rendering Css Data in Django?

So I'm trying to export a section of a website to PDF. I'm able to output the HTML data properly, but the CSS just appears as text in the PDF.
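import StringIO  # Python 2 module, as used in the question
from cgi import escape  # assumption: the question does not show where escape comes from
from django.http import HttpResponse
from xhtml2pdf import pisa  # assumption: the pisa import is not shown in the question


def exportPDf(results, css, html):
    result = StringIO.StringIO()
    results_2 = StringIO.StringIO(results.encode("UTF-8"))
    css_encode = StringIO.StringIO(css.encode("UTF-8"))  # built but never passed to pisa

    pdf = pisa.pisaDocument(results_2, result)  # ISO-8859-1
    if not pdf.err:
        return HttpResponse(result.getvalue(), mimetype='application/pdf')
    return HttpResponse('We had some errors<pre>%s</pre>' % escape(html))


def get_data(request):
    # The CSS is prepended to the HTML as one string before conversion.
    results = request.GET['css'] + request.GET['html']
    html = request.GET['html']
    css = request.GET['css']
    return exportPDf(results, css, html)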
Again, the HTML is fine. It's just the CSS part that doesn't render; the actual CSS code is output to the PDF as text.
If you've set up your CSS like this:
<style type="text/css">
body {
color:#fff;
}
</style>
Try wrapping your css with comments:
<style type="text/css">
<!--
body {
color:#fff;
}
-->
</style>
This marks the CSS as a comment, so it should no longer be printed as text. Since I can't see how your code is rendered this is just a guess, but let me know if it does indeed work :)
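
The wrapping suggested above could also be done in the view before the combined string reaches pisa. This is a minimal sketch, assuming the css parameter holds bare rules with no <style> tag around them (the question does not say either way). The helper names wrap_css and build_document and the sample inputs are hypothetical.

```python
def wrap_css(css):
    """Return the CSS inside a <style> element, hidden in an HTML comment."""
    return '<style type="text/css">\n<!--\n%s\n-->\n</style>\n' % css.strip()


def build_document(css, html):
    """Prepend the wrapped CSS to the HTML fragment."""
    return wrap_css(css) + html


# Hypothetical inputs standing in for request.GET['css'] and request.GET['html'].
css = "body {\n  color: #fff;\n}"
html = "<p>Hello</p>"
document = build_document(css, html)
print(document)
```

In get_data, results would then be build_document(css, html) instead of the plain concatenation. Whether pisa then applies the styles is untested here.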
