Splash is unable to extract elements - web-scraping

I am trying to scrape https://www.lithia.com/new-inventory/index.htm.
But it seems that the slash is unable to extract simple elements on the page.
I tried to extract an element from the page with the appropriate XPath using either Scrapy project (python) or slash site (http://0.0.0.0:8050/), but Splash is unable to extract the element.
Code (I have simply so it is easier to convey and debug) :
import scrapy
from scrapy_splash import SplashRequest
from time import sleep
class CarSpider(scrapy.Spider):
name = 'car1'
allowed_domains = ['lithia.com/']
start_urls = ['https://www.lithia.com/baierl-auto-group/new-inventory.htm']
def start_requests(self):
for url in self.start_urls:
yield SplashRequest(url = url,
callback = self.parse,
endpoint = 'render.html')
def parse(self, response):
sleep(5)
year = response.xpath('//span[contains(#class, "facet-list-group-label") and contains(text(), "Year")]')
sleep(5)
yield{
'year': year,
}
It returns:
{'year': []}
Meaning it is not extracted.
I check the Splash site (http://0.0.0.0:8050/) as well, and lots of element is not displayed in the HTML output. It seems like there is some rendering issue.
Following that, I came across this page (https://splash.readthedocs.io/en/stable/faq.html#website-is-not-rendered-correctly), informing possible debugs of the rendering issue by Splash:
I have tried:
Turning off private mode.
Tuning splash:wait()
setting splash:viewport_full()
adding splash:set_user_agent
enable it using splash.plugins_enabled
splash.html5_media_enabled property to enable HTML5 media
But so far, I am still unable to extract the element. In fact, lots of other elements cannot be extracted as well, just giving the element above as an example.
Please Help.

Related

CrawlSpider Rule not working for a link that actually exists in target site

I'm stuck trying to find a way to make my spider work. This is the scenario: I'm trying to find all the URLs of a specific domain that are contained in a particular target website. For this, I've defined a couple of rules so I can crawl the site and find out the links of my interest.
The thing is that it doesn't seem to work, even when I know that there are links with the proper format inside the website.
This is my spider:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
class sp(CrawlSpider):
name = 'sp'
start_urls = ['https://nationalpavementexpo.com/show/floor-plan-exhibitor-list/']
custom_settings = {
'LOG_LEVEL': 'INFO',
'DEPTH_LIMIT': 4
}
rules = (
Rule(LinkExtractor(unique=True, allow_domains='a2zinc.net'), callback='parse_item'),
Rule(LinkExtractor(unique=True, canonicalize=True, allow_domains = 'nationalpavementexpo.com'))
)
def parse_item(self, response):
print(response.request.url)
yield {'link':response.request.url}
So, in summary, I'm trying to find all the links from 'a2zinc.net' contained inside https://nationalpavementexpo.com/show/floor-plan-exhibitor-list/ and its subsections.
As you guys can see, there are at least 3 occurrences of the desired links inside the target website.
The funny thing is that when I test the spider using another target site (like this one) that also contains links of interest, it works as expected and I can't really see the difference.
Also, if I define a Link Extractor instance (as in the snippet below) inside a parsing method, it is also capable of finding the desired links, but I think this won't be the best way of using CrawlSpider + Rules.
def parse_item(self, response):
le = LinkExtractor(allow_domains='a2zinc.net')
links = le.extract_links(response)
for link in links:
yield {'link': link.url}
any idea what the cause of the problem could be?
Thanks a lot.
Your code works. The only issue is that you have set the logging level to INFO while the links that are being extracted are returning status code 403 which is only visible at the DEBUG level. Comment out your custom settings and you will see that the links are being extracted.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
class sp(CrawlSpider):
name = 'sp'
start_urls = ['https://nationalpavementexpo.com/show/floor-plan-exhibitor-list/']
custom_settings = {
# 'LOG_LEVEL': 'INFO',
# 'DEPTH_LIMIT': 4
}
rules = (
Rule(LinkExtractor(allow_domains='a2zinc.net'), callback='parse_item'),
Rule(LinkExtractor(unique=True, canonicalize=True, allow_domains = 'nationalpavementexpo.com'))
)
def parse_item(self, response):
print(response.request.url)
yield {'link':response.request.url}
OUTPUT:

How to extract data that loads differently with scrapy

I’m trying to extract product reviews on URLs like this one
https://www.namastevaporizers.com/products/mighty-vaporizer
the spider I have extracts anything on the page but nothing from the comments, I think it is because the comments load differently but unfortunately this is where my knowledge of scrappy ends. Can anyone help me with this?
here is my spider
import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose
from scrapy.spiders import Spider
from Namaste.items import NPPItem
class NPP(Spider):
name = 'Product_Pages'
start_urls = ['https://www.namastevaporizers.com/products/mighty-vaporizer'
def parse(self, response):
item_loader = ItemLoader(item=NPPItem(), response=response)
item_loader.add_css("Z_reviews", "div.yotpo-user-name") # gets nothing
item_loader.add_css("Z_reviews", "div.content-title") # gets nothing
item_loader.add_css("Z_reviews", "div.content-review") # gets nothing
item_loader.add_css("Z_reviews", "div.yotpo") # gets some data but missing most stuff, this is the entire yotpo content wrapper
item_loader.add_value("AAE_source_url", response.url) #works fine
return item_loader.load_item()
the reviews in this site are loaded by JS, so you need to forge the request as your chrome do
Follow these steps you will get the result
open your chrome dev tool, shift to network tab, search (note: it's search instead of filter) a review content, you will the
request (I got the request url:https://staticw2.yotpo.com/batch)
Copy the curl command in chrome
Execute the curl in shell, if it's success, the next step is parse the curl and forge it in python code (The curl actually works in this site, i tried)
You can parse the curl in https://curl.trillworks.com/#python

Scrapy does not find text in Xpath or Css

I've been at this one for a few days, and no matter how I try, I cannot get scrapy to abstract text that is in one element.
to spare you all the code, here are the important pieces. The setup does grab everything else off the page, just not this text.
from scrapy.selector import Selector
start_url = "https://www.tripadvisor.com/VacationRentalReview-g34416-d12428323-On_the_Beach_Wide_flat_beach_Sunsets_Gulf_view_Sharks_teeth_Shells_Fish-Manasota_Key_F.html"
#BASIC ITEM AND SPIDER YADA, SPARE YOU THE DETAILS
hxs = Selector(response)
response_css = response.css("body")
desc_data = hxs.xpath('//*[#id="DETAILS_TRUNC_TEXT"]//text()').extract()
desc_data2 = response_css.css('#DETAILS_TRUNC_TEXT::text').extract()
both return empty lists. Yes, I found the xpath and css selector via chrome, but the rest of them work just fine as I'm able to find other data on the site. Please help me find out why this isn't working.
To get the data you need to use any browser simulator like selenium so that It can catch the response of dynamically generated content. You need to put some delay to let the webpage load it's content fully. This is how you can go:
from selenium import webdriver
from scrapy import Selector
import time
driver = webdriver.Chrome()
URL = "https://www.tripadvisor.com/VacationRentalReview-g34416-d12428323-On_the_Beach_Wide_flat_beach_Sunsets_Gulf_view_Sharks_teeth_Shells_Fish-Manasota_Key_F.html"
driver.get(URL)
time.sleep(5) #If you take out this line you won't get anything because the content of that page take some time to get loaded.
sel = Selector(text=driver.page_source)
item = sel.css('#DETAILS_TRUNC_TEXT::text').extract() #It is working
item_ano = sel.xpath('//*[#id="DETAILS_TRUNC_TEXT"]//text()').extract() #It is also working
print(item, item_ano)
driver.quit()
I tried your xpath and css in scrapy shell, and got nothing also.
Then I used view(response) command and found out the site is dynamic.
Here is a screenshot:
You can see that the details under Overview doesn't show up, and that's why no matter how you try, you still got nothing.
Solutions: Try Selenium (check the solution that SIM provided in the last answer) or Splash.
Good Luck. :)

Scrapy not return results with xpath

i am trying make scrapping to get stats in this url
http://www.acb.com/redaccion.php?id=133495
I firstly try with player name:
import scrapy
import requests
from scrapy.item import Item, Field
from ligafemanager.items import LigafemanagerItem
class Lf1Spider(scrapy.Spider):
name = 'lf1'
allowed_domains = ['acb.com']
start_urls = ['http://www.acb.com/redaccion.php?id=133495']
def parse(self, response):
self.logger.info('A response from %s just arrived!', response.url)
i = LigafemanagerItem()
i['acb_player_name'] = response.xpath('//td/div/codigo/table[1]/tbody/tr/td[2]/font/text()').extract()
self.logger.info('------------ACB NAME is: %s ------',
i['acb_player_name'])
return i
never return results
Well thats a tricky one, because what you see is not the real truth. Consider the html from Firebug
Now see the View source of the same page
All the ones highlighted in read are tags with error in firefox view source windows. Also notice one key thing tbody is missing. This is what happens with many sites, there is not tbody used in the HTML but browser does its autocorrection and add tbody to display the table correctly in browser.
When you are working with script the tbody is not there in the source and since scrapy won't do any auto correction, your XPATH with tbody won't find the element your are interested in. So simplest solution? Remove tbody from your xpath
In [3]: response.xpath('//td/div/codigo/table[1]/tr/td[2]/font/text()').extract()
Out[3]: ['Nombre']

Scrapy spider will not crawl on start urls

I am brand new to scrappy and have worked my way through the tutorial and am trying to figure out how to implement what I have learned so far to complete a seemingly basic task. I know very little python so far and am using this as a learning experience, so if I ask a simple question, I apologize.
My goal for this program is to follow this link http://ucmwww.dnr.state.la.us/ucmsearch/FindDocuments.aspx?idx=xwellserialnumber&val=971683 and to extract the well serial number to a csv file. Eventually I want to run this spider on several thousand different well files and retrieve specific data. However, I am starting with the basics first.
Right now the spider doesnt crawl on any web page that I enter. There are no errors listed in the code when I run it, it just states that 0 pages were crawled. I cant quite figure out what I am doing wrong. I am positive the start url is ok as I have checked it out. Do I need a specific type of spider to accomplish what I am trying to do?
import scrapy
from scrapy import Spider
from scrapy.selector import Selector
class Sonrisdataaccess(Spider):
name = "serial"
allowed_domains = ["sonris.com"]
start_urls = [
"http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=972498"]
def parse(self, response):
questions = Selector(response).xpath('/html/body/table[1]/tbody/tr[2]/td[1]')
for question in questions:
item = SonrisdataaccessItem()
item['serial'] = question.xpath ('/html/body/table[1]/tbody/tr[2]/td[1]').extract()[0]
yield item
Thank you for any help, I greatly appreciate it!
First of all I do not understand what you are doing in your for loop because if you have a selector you do not get the whole HTML again to select it...
Nevertheless, the interesting part is that the browser represents the table way different than it is downloaded with Scrapy. If you look at the response in your parse method you will see that there is no tbody element in the first table. This is why your selection does not return anything.
So to get the first serial number (as it is in your XPath) change your parse function to this:
def parse(self, response):
item = SonrisdataaccessItem()
item['serial'] = response.xpath('/html/body/table[1]/tr[2]/td[1]/text()').extract()[0]
yield item
For later changes you may have to alter the XPath expression to get more data.

Resources