I am new to Scrapy and web scraping, so please bear with me. I am trying to scrape profilecanada.com. When I run the code below, no errors are raised, but I don't think it is actually scraping anything. My spider starts on a page that contains a list of links. Each link leads to a page with another list of links, and each of those leads to the page that holds the data I need to extract and save into a JSON file. In general, it is something like "nested link scraping"; I don't know what it is actually called. Please see the image below for the result of running the spider. Thank you in advance for your help.
import scrapy


class ProfilecanadaSpider(scrapy.Spider):
    name = 'profilecanada'
    allowed_domains = ['http://www.profilecanada.com']
    start_urls = ['http://www.profilecanada.com/browse_by_category.cfm/']

    def parse(self, response):
        # urls in from start_url
        category_list_urls = response.css('div.div_category_list > div.div_category_list_column > ul > li.li_category > a::attr(href)').extract()
        # start_u = 'http://www.profilecanada.com/browse_by_category.cfm/'

        # for each category of company
        for url in category_list_urls:
            url = url[3:]
            url = response.urljoin(url)
            return scrapy.Request(url=url, callback=self.profileCategoryPages)

    def profileCategoryPages(self, response):
        company_list_url = response.css('div.dv_en_block_name_frame > a::attr(href)').extract()

        # for each company in the list
        for url in company_list_url:
            url = response.urljoin(url)
            return scrapy.Request(url=url, callback=self.companyDetails)

    def companyDetails(self, response):
        return {
            'company_name': response.css('span#name_frame::text').extract_first(),
            'street_address': str(response.css('span#frame_addr::text').extract_first()),
            'city': str(response.css('span#frame_city::text').extract_first()),
            'region_or_province': str(response.css('span#frame_province::text').extract_first()),
            'postal_code': str(response.css('span#frame_postal::text').extract_first()),
            'country': str(response.css('div.type6_GM > div > div::text')[-1].extract())[2:],
            'phone_number': str(response.css('span#frame_phone::text').extract_first()),
            'fax_number': str(response.css('span#frame_fax::text').extract_first()),
            'email': str(response.css('span#frame_email::text').extract_first()),
            'website': str(response.css('span#frame_website > a::attr(href)').extract_first()),
        }
[Image: the command-prompt output from running the spider]
You should change allowed_domains to allowed_domains = ['profilecanada.com'] and change every return scrapy.Request(...) to yield scrapy.Request(...); then it will start working. Keep in mind that obeying robots.txt is not always enough: you should also throttle your requests if necessary.
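Put together, a minimal sketch of the spider with just those two changes applied (the selectors are kept exactly as in the question and are assumed to be correct):

import scrapy


class ProfilecanadaSpider(scrapy.Spider):
    name = 'profilecanada'
    allowed_domains = ['profilecanada.com']  # domain only, no scheme
    start_urls = ['http://www.profilecanada.com/browse_by_category.cfm/']

    def parse(self, response):
        category_list_urls = response.css(
            'div.div_category_list > div.div_category_list_column > ul > li.li_category > a::attr(href)').extract()
        for url in category_list_urls:
            # yield so the loop continues past the first category
            yield scrapy.Request(url=response.urljoin(url[3:]), callback=self.profileCategoryPages)

    def profileCategoryPages(self, response):
        for url in response.css('div.dv_en_block_name_frame > a::attr(href)').extract():
            yield scrapy.Request(url=response.urljoin(url), callback=self.companyDetails)

    def companyDetails(self, response):
        # yield the item dict; the remaining fields follow the same pattern as in the question
        yield {'company_name': response.css('span#name_frame::text').extract_first()}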
Related
I'm trying to scrape game reviews from Steam. When I run the spider below, I get the first page with 10 reviews, and then the second page with 10 reviews three times.
import logging

import scrapy
from bs4 import BeautifulSoup


class MySpider(scrapy.Spider):
    name = "MySpider"
    download_delay = 6
    page_number = 1
    start_urls = (
        'https://steamcommunity.com/app/1794680/reviews/',
    )
    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        'LOG_ENABLED': False,
        'LOG_FILE': 'logging.txt',
        'LOG_FILE_APPEND': False,
        'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
        'FEEDS': {"items.json": {"format": "json", 'overwrite': True}},
    }

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
        for review in soup.find_all('div', class_="apphub_UserReviewCardContent"):
            {...}  # review fields are extracted here (omitted in the question)

        if self.page_number < 4:
            self.page_number += 1
            yield scrapy.Request(
                'https://steamcommunity.com/app/1794680/homecontent/?userreviewscursor=AoIIPwYYanu12fcD&userreviewsoffset={offset}&p={p}&workshopitemspage={p}&readytouseitemspage={p}&mtxitemspage={p}&itemspage={p}&screenshotspage={p}&videospage={p}&artpage={p}&allguidepage={p}&webguidepage={p}&integratedguidepage={p}&discussionspage={p}&numperpage=10&browsefilter=trendweek&browsefilter=trendweek&l=english&appHubSubSection=10&filterLanguage=default&searchText=&maxInappropriateScore=100'.format(
                    offset=10 * (self.page_number - 1), p=self.page_number),
                method='GET', callback=self.parse)
[Image: the JSON output]
I captured a few requests while scrolling through the reviews. I changed every value that looked like a page number and replaced it with {p}, and I also tried changing 'userreviewsoffset' to fit the request format. I noticed that 'userreviewscursor' has a different value on every request, but I don't know where it comes from.
Your issue is with the userreviewscursor=AoIIPwYYanu12fcD part of the URL. That bit changes on every call, and you can find it in the HTML response under:
<input type="hidden" name="userreviewscursor" value="AoIIPwYYanLi8vYD">
Get that value and add it to the next call, and you're all right. (I didn't want to babysit you and write the full code, but if need be, let me know.)
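To be concrete about where that value plugs in, a rough sketch (untested; the URL is abbreviated here, so keep the full parameter list from your original request):

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
        for review in soup.find_all('div', class_="apphub_UserReviewCardContent"):
            {...}

        # read the cursor for the next page from the hidden form field
        cursor = response.css('input[name="userreviewscursor"]::attr(value)').get()
        if cursor and self.page_number < 4:
            self.page_number += 1
            next_url = ('https://steamcommunity.com/app/1794680/homecontent/'
                        '?userreviewscursor={cursor}&userreviewsoffset={offset}'
                        '&p={p}&numperpage=10&browsefilter=trendweek&l=english'
                        '&appHubSubSection=10&filterLanguage=default&searchText='
                        '&maxInappropriateScore=100').format(
                            cursor=cursor,
                            offset=10 * (self.page_number - 1),
                            p=self.page_number)
            yield scrapy.Request(next_url, callback=self.parse)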
Below is my code for a web scraper; the issue I'm having is that it's not returning anything. Does anyone know what's going on?
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ['https://www.homedepot.ca']
    start_urls = ['https://www.homedepot.ca/search?q=Malibu%20wide%20plank%20french%20montara#!q=Malibu%20wide%20plank%20french%20montara']

    def parse(self, response):
        for quote in response.cs('div.quote'):
            title = quote.css('acl-product-card__title::text').getall()
            price = quote.css('acl-product-card__price::text').getall()
            combined = (title.text + ' ' + price.text)
            print(combined)
Are you sure it's not raising an exception? This shouldn't work:
def parse(self, response):
    for quote in response.cs('div.quote'):
It should be response.css instead of response.cs
Also, it seems that your start URL is down. Maybe it doesn't accept requests from my country. In case it opens for you, check each selector to see what it returns; that will let you pinpoint where the issue is.
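One way to do that check is in the Scrapy shell, testing each selector on its own before putting it back in the spider (a quick sketch; the class names are copied from the question, and the leading dots are my assumption that they are CSS classes):

scrapy shell 'https://www.homedepot.ca/search?q=Malibu%20wide%20plank%20french%20montara'

# then, inside the shell:
>>> response.css('div.quote')                                  # container: does it match anything?
>>> response.css('.acl-product-card__title::text').getall()    # product titles
>>> response.css('.acl-product-card__price::text').getall()    # product prices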
At the moment I have a bit of Frankenstein code (consisting of BeautifulSoup and Scrapy parts) that seems to do the job of reading the info from the page-1 URLs. I shall redo everything in Scrapy as soon as the pagination issue is resolved.
What the code is meant to do:
1. Read all subcategories (BeautifulSoup part). The rest are Scrapy parts:
2. Using the above URLs, read the sub-subcategories.
3. Extract the last page number and loop over the above URLs.
4. Extract the necessary product info from the above URLs.
Everything except part 3 seems to work.
I have tried to use the code below to extract the last page number, but I am not sure how to integrate it into the main code.
def parse_paging(self, response):
    try:
        for next_page in ('?pn=1' + response.xpath('//ul[@class="pagination pull-left"]/noscript/a/text()').extract()[-1]):
            print(next_page)
            # yield scrapy.Request(url=response.urljoin(next_page))
    except:
        pass
Below is the main code.
import requests
from bs4 import BeautifulSoup
import pandas as pd
import scrapy
from scrapy.crawler import CrawlerProcess
from urllib.parse import urljoin  # needed for urljoin() used in parse()

category_list = []
sub_category_url = []

root_url = 'https://uk.rs-online.com/web'
page = requests.get(root_url)
soup = BeautifulSoup(page.content, 'html.parser')

cat_up = [a.find_all('a') for a in soup.find_all('div', class_='horizontalMenu sectionUp')]
category_up = [item for sublist in cat_up for item in sublist]
cat_down = [a.find_all('a') for a in soup.find_all('div', class_='horizontalMenu sectionDown')]
category_down = [item for sublist in cat_down for item in sublist]

for c_up in category_up:
    sub_category_url.append('https://uk.rs-online.com' + c_up['href'])
for c_down in category_down:
    sub_category_url.append('https://uk.rs-online.com' + c_down['href'])
# print(k)


class subcategories(scrapy.Spider):
    name = 'subcategories'

    def start_requests(self):
        urls = sub_category_url
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # ::attr(href) is the valid Scrapy form of the original a::href selector
        products = response.css('div.card.js-title a::attr(href)').extract()  # xpath("//div[contains(@class, 'js-tile')]/a/@href")
        for p in products:
            url = urljoin(response.url, p)
            yield scrapy.Request(url, callback=self.parse_product)

    def parse_product(self, response):
        for quote in response.css('tr.resultRow'):
            yield {
                'product': quote.css('div.row.margin-bottom a::text').getall(),
                'stock_no': quote.css('div.stock-no-label a::text').getall(),
                'brand': quote.css('div.row a::text').getall(),
                'price': quote.css('div.col-xs-12.price.text-left span::text').getall(),
                'uom': quote.css('div.col-xs-12.pack.text-left span::text').getall(),
            }


process = CrawlerProcess()
process.crawl(subcategories)
process.start()
I would be exceptionally grateful for any hints on how to deal with the above issue. Let me know if you have any questions.
I would suggest extracting the next page number with the selector below, and then constructing the next-page URL from that number.

next_page_number = response.css('.nextPage::attr(ng-click)').re_first('\d+')
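Integrated into the spider's parse method, that could look roughly like this (a sketch only; the '?pn=' query parameter is taken from your parse_paging attempt and is an assumption about the site's real pagination scheme):

    def parse(self, response):
        products = response.css('div.card.js-title a::attr(href)').extract()
        for p in products:
            yield scrapy.Request(urljoin(response.url, p), callback=self.parse_product)

        # pagination: read the next page number and build the next-page URL from it
        next_page_number = response.css('.nextPage::attr(ng-click)').re_first(r'\d+')
        if next_page_number:
            yield scrapy.Request(response.urljoin('?pn=' + next_page_number), callback=self.parse)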
I am trying to scrape a betting site. However, when I check for the retrieved data in scrapy shell, I receive nothing.
The XPath to what I need is //*[@id="yui_3_5_0_1_1562259076537_31330"], and when I run it in the shell this is what I get:
In [18]: response.xpath('//*[@id="yui_3_5_0_1_1562259076537_31330"]')
Out[18]: []
The output is [], but I expected something from which I could extract the href.
When I use the "inspect" tool in Chrome while the site is still loading, this id is outlined in purple. Does this mean that the site is using JavaScript? And if so, is that the reason why Scrapy does not find the item and returns []?
I tried scraping the site using just Scrapy, and this is my result.
This is the items.py file
import scrapy


class LifeMatchsItem(scrapy.Item):
    Event = scrapy.Field()  # Name of event
    Match = scrapy.Field()  # Team1 vs Team2
    Date = scrapy.Field()   # Date of Match
This is my Spider code
import scrapy
from LifeMatchesProject.items import LifeMatchsItem


class LifeMatchesSpider(scrapy.Spider):
    name = 'life_matches'
    start_urls = ['http://www.betfair.com/sport/home#sscpl=ro/']
    custom_settings = {'FEED_EXPORT_ENCODING': 'utf-8'}

    def parse(self, response):
        for event in response.xpath('//div[contains(@class,"events-title")]'):
            for element in event.xpath('./following-sibling::ul[1]/li'):
                item = LifeMatchsItem()
                item['Event'] = event.xpath('./a/@title').get()
                item['Match'] = element.xpath('.//div[contains(@class,"event-name-info")]/a/@data-event').get()
                item['Date'] = element.xpath('normalize-space(.//div[contains(@class,"event-name-info")]/a//span[@class="date"]/text())').get()
                yield item
And this is the result
I am just starting to use Scrapy, and I am learning to use it as I go along. Please can someone explain why there is an error in my code, and what this error is? Is this error related to an invalid URL I have provided, and/or is it connected with invalid xpaths?
Here is my code:
from scrapy.spider import Spider
from scrapy.selector import Selector


class CatswikiSpider(Spider):
    name = "catswiki"
    allowed_domains = ["http://en.wikipedia.org/wiki/Cat"]
    start_urls = [
        "http://en.wikipedia.org/wiki/Cat"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//body/div')
        for site in sites:
            title = ('//h1/span/text()').extract()
            subtitle = ('//h2/span/text()').extract()
            boldtext = ('//p/b').extract()
            links = ('//a/@href').extract()
            imagelinks = ('//img/@src').re(r'.*cat.*').extract()
            print title, subtitle, boldtext, links, imagelinks
            #filename = response.url.split("/")[-2]
            #open(filename, 'wb').write(response.body)
And here are some attachments, showing the errors in the command prompt:
You need a function call before all your extract lines. I'm not familiar with scrapy, but it's probably something like:
title = site.xpath('//h1/span/text()').extract()
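Applied to the whole loop, that would look roughly like this (just a sketch; the leading ./ prefixes are my addition so the XPaths are relative to each site node, and .re() already returns strings, so it does not need a trailing .extract()):

    def parse(self, response):
        sel = Selector(response)
        for site in sel.xpath('//body/div'):
            # call .xpath() on the selector first, then .extract() the matches
            title = site.xpath('.//h1/span/text()').extract()
            subtitle = site.xpath('.//h2/span/text()').extract()
            boldtext = site.xpath('.//p/b').extract()
            links = site.xpath('.//a/@href').extract()
            imagelinks = site.xpath('.//img/@src').re(r'.*cat.*')
            print(title, subtitle, boldtext, links, imagelinks)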