Struggling with Scrapy pagination - web-scraping

At the moment I have a bit of Frankenstein code (consisting of BeautifulSoup and Scrapy parts) that seems to do the job of reading the info from the page 1 URLs. I'll redo everything in Scrapy as soon as the pagination issue is resolved.
So what the code is meant to do:
1. Read all subcategories (BeautifulSoup part).
The rest are Scrapy parts:
2. Using the above URLs, read the sub-subcategories.
3. Extract the last page number and loop over the above URLs.
4. Extract the necessary product info from the above URLs.
Everything except part 3 seems to work.
I have tried to use the code below to extract the last page number, but I'm not sure how to integrate it into the main code:
def parse_paging(self, response):
    try:
        for next_page in ('?pn=1' + response.xpath('//ul[@class="pagination pull-left"]/noscript/a/text()').extract()[-1]):
            print(next_page)
            # yield scrapy.Request(url=response.urljoin(next_page))
    except:
        pass
Below is the main code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import scrapy
from scrapy.crawler import CrawlerProcess
from urllib.parse import urljoin  # needed for urljoin() in parse()

category_list = []
sub_category_url = []
root_url = 'https://uk.rs-online.com/web'
page = requests.get(root_url)
soup = BeautifulSoup(page.content, 'html.parser')
cat_up = [a.find_all('a') for a in soup.find_all('div', class_='horizontalMenu sectionUp')]
category_up = [item for sublist in cat_up for item in sublist]
cat_down = [a.find_all('a') for a in soup.find_all('div', class_='horizontalMenu sectionDown')]
category_down = [item for sublist in cat_down for item in sublist]
for c_up in category_up:
    sub_category_url.append('https://uk.rs-online.com' + c_up['href'])
for c_down in category_down:
    sub_category_url.append('https://uk.rs-online.com' + c_down['href'])

class subcategories(scrapy.Spider):
    name = 'subcategories'

    def start_requests(self):
        urls = sub_category_url
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        products = response.css('div.card.js-title a::attr(href)').extract()  # or response.xpath("//div[contains(@class, 'js-tile')]/a/@href")
        for p in products:
            url = urljoin(response.url, p)
            yield scrapy.Request(url, callback=self.parse_product)

    def parse_product(self, response):
        for quote in response.css('tr.resultRow'):
            yield {
                'product': quote.css('div.row.margin-bottom a::text').getall(),
                'stock_no': quote.css('div.stock-no-label a::text').getall(),
                'brand': quote.css('div.row a::text').getall(),
                'price': quote.css('div.col-xs-12.price.text-left span::text').getall(),
                'uom': quote.css('div.col-xs-12.pack.text-left span::text').getall(),
            }

process = CrawlerProcess()
process.crawl(subcategories)
process.start()
I would be exceptionally grateful if you could provide any hints on how to deal with the above issue.
Let me know if you have any questions.

I would suggest you extract the next page number with this:
next_page_number = response.css('.nextPage::attr(ng-click)').re_first(r'\d+')
and then construct the next page URL from that number.
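A minimal sketch of how that could be wired into the spider's parse method, assuming the site paginates with a ?pn= query parameter (as in your parse_paging attempt) and that the .nextPage element exists on the listing pages (both are assumptions, not verified against the live site):

    def parse(self, response):
        # ... extract and follow the product links here, as you already do ...

        # pull the next page number out of the ng-click attribute and
        # build the next page URL relative to the current one
        next_page_number = response.css('.nextPage::attr(ng-click)').re_first(r'\d+')
        if next_page_number:
            next_page_url = response.urljoin('?pn=' + next_page_number)
            yield scrapy.Request(next_page_url, callback=self.parse)

Because the request is yielded back to the same callback, the spider keeps following pages until no .nextPage element is found.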

Related

Beautiful Soup Pagination, find_all not finding text within next_page class. Need also to extract data from URLS

I've been working on this for a week and am determined to get this working!
My ultimate goal is to write a web scraper where you can insert the county name and the scraper will produce a CSV file of information from mugshots: Name, Location, Eye Color, Weight, Hair Color and Height (it's a genetics project I am working on).
The site organization is: primary site page --> state page --> county page --> 120 mugshots with names and URLs --> URL with the data I am ultimately after, plus "next" links to another set of 120.
I thought the best way to do this would be to write a scraper that grabs the URLs and names from the table of 120 mugshots and then uses pagination to grab all the URLs and names from the rest of the county (in some cases there are tens of thousands). I can get the first 120, but my pagination doesn't work, so I'm ending up with a CSV of 120 names and URLs.
I closely followed this article, which was very helpful.
from bs4 import BeautifulSoup
import requests
import lxml
import pandas as pd

county_name = input('Please, enter a county name: /Arizona/Maricopa-County-AZ \n')
print(f'Searching {county_name}. Wait, please...')

base_url = 'https://www.mugshots.com'
search_url = f'https://mugshots.com/US-Counties/{county_name}/'
data = {'Name': [], 'URL': []}

def export_table_and_print(data):
    table = pd.DataFrame(data, columns=['Name', 'URL'])
    table.index = table.index + 1
    table.to_csv('mugshots.csv', index=False)
    print('Scraping done. Here are the results:')
    print(table)

def get_mugshot_attributes(mugshot):
    name = mugshot.find('div', attrs={'class', 'label'})
    url = mugshot.find('a', attrs={'class', 'image-preview'})
    name = name.text
    url = mugshot.get('href')
    url = base_url + url
    data['Name'].append(name)
    data['URL'].append(url)

def parse_page(next_url):
    page = requests.get(next_url)
    if page.status_code == requests.codes.ok:
        bs = BeautifulSoup(page.text, 'lxml')
        list_all_mugshot = bs.find_all('a', attrs={'class', 'image-preview'})
        for mugshot in list_all_mugshot:
            get_mugshot_attributes(mugshot)
            next_page_text = mugshot.find('a class', attrs={'next page'})
            if next_page_text == 'Next':
                next_page_text = mugshot.get_text()
                next_page_url = mugshot.get('href')
                next_page_url = base_url + next_page_url
                print(next_page_url)
                parse_page(next_page_url)
            else:
                export_table_and_print(data)

parse_page(search_url)
Any ideas on how to get the pagination to work and also how to eventually get the data from the list of URLs I scrape?
I appreciate your help! I've been working in python for a few months now, but the BS4 and Scrapy stuff is so confusing for some reason.
Thank you so much community!
Anna
It seems you want to know the logic of how to get the content from the URLs collected on each page while traversing the next pages. This is how you can parse all the links from each page, including the next page, and then use those links to get the content from their inner pages.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://mugshots.com/"
base = "https://mugshots.com"

def get_next_pages(link):
    print("**" * 20, "current page:", link)
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select("[itemprop='name'] > a[href^='/Current-Events/']"):
        yield from get_main_content(urljoin(base, item.get("href")))
    next_page = soup.select_one(".pagination > a:contains('Next')")
    if next_page:
        next_page = urljoin(url, next_page.get("href"))
        yield from get_next_pages(next_page)

def get_main_content(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    item = soup.select_one("h1#item-title > span[itemprop='name']").text
    yield item

if __name__ == '__main__':
    for elem in get_next_pages(url):
        print(elem)

Get the last page number of a webpage - Beautiful Soup

I'm trying to get the page number of the last page of this website
http://digitalmoneytimes.com/category/crypto-news/
This link shows that the last page number is 335, but I can't extract the page number.
soup = BeautifulSoup(page.content, 'html.parser')
soup_output= soup.find_all("li",{"class":"active"})
soup_output=soup.select(tag)
print(soup_output)
I get an empty list as the output
In order to get the last page of the given website, I would strongly recommend using the following code:
import requests
from bs4 import BeautifulSoup

page = requests.get("http://digitalmoneytimes.com/category/crypto-news/")
soup = BeautifulSoup(page.content, 'html.parser')
soup = soup.find_all("a", href=True)

pages = []
for x in soup:
    if "http://digitalmoneytimes.com/category/crypto-news/page/" in str(x):
        pages.append(x)

last_page = pages[2].getText()
where last_page is the last page number. Since I don't have access to your tag and page variables, I can't really tell you where the problem in your code is.
Really hope this solves your problem.
If it is about getting the last page number, there is something you might try out as well:
import requests
from bs4 import BeautifulSoup

link = 'http://digitalmoneytimes.com/category/crypto-news/'
res = requests.get(link)
soup = BeautifulSoup(res.text, "lxml")
last_page_num = soup.find(class_="pagination-next").find_previous_sibling().text
print(last_page_num)
Output:
336

Scrapy: Scraping nested links

I am new to Scrapy and web scraping, so please bear with me. I am trying to scrape profilecanada.com. When I run the code below, no errors are given, but I think it is still not scraping. In my code I start on a page that contains a list of links. Each link leads to a page with another list of links, and from those links is another page holding the data I need to extract and save into a JSON file. In general, it is something like "nested link scraping"; I don't know what it is actually called. Please see the image below for the result of the spider when I ran it. Thank you in advance for your help.
import scrapy

class ProfilecanadaSpider(scrapy.Spider):
    name = 'profilecanada'
    allowed_domains = ['http://www.profilecanada.com']
    start_urls = ['http://www.profilecanada.com/browse_by_category.cfm/']

    def parse(self, response):
        # urls from start_url
        category_list_urls = response.css('div.div_category_list > div.div_category_list_column > ul > li.li_category > a::attr(href)').extract()
        # start_u = 'http://www.profilecanada.com/browse_by_category.cfm/'
        # for each category of company
        for url in category_list_urls:
            url = url[3:]
            url = response.urljoin(url)
            return scrapy.Request(url=url, callback=self.profileCategoryPages)

    def profileCategoryPages(self, response):
        company_list_url = response.css('div.dv_en_block_name_frame > a::attr(href)').extract()
        # for each company in the list
        for url in company_list_url:
            url = response.urljoin(url)
            return scrapy.Request(url=url, callback=self.companyDetails)

    def companyDetails(self, response):
        return {
            'company_name': response.css('span#name_frame::text').extract_first(),
            'street_address': str(response.css('span#frame_addr::text').extract_first()),
            'city': str(response.css('span#frame_city::text').extract_first()),
            'region_or_province': str(response.css('span#frame_province::text').extract_first()),
            'postal_code': str(response.css('span#frame_postal::text').extract_first()),
            'country': str(response.css('div.type6_GM > div > div::text')[-1].extract())[2:],
            'phone_number': str(response.css('span#frame_phone::text').extract_first()),
            'fax_number': str(response.css('span#frame_fax::text').extract_first()),
            'email': str(response.css('span#frame_email::text').extract_first()),
            'website': str(response.css('span#frame_website > a::attr(href)').extract_first()),
        }
IMAGE RESULT IN CMD: (screenshot of the result in cmd when I ran the spider, not reproduced here)
You should change allowed_domains to allowed_domains = ['profilecanada.com'] and all of the return scrapy.Request statements to yield scrapy.Request, and it'll start working. Keep in mind that obeying robots.txt is not always enough; you should throttle your requests if necessary.
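For reference, a condensed sketch of the spider after those two changes (selectors shortened here for brevity; the structure follows the question's code):

import scrapy

class ProfilecanadaSpider(scrapy.Spider):
    name = 'profilecanada'
    # domain only, without the scheme, so the off-site filter works as intended
    allowed_domains = ['profilecanada.com']
    start_urls = ['http://www.profilecanada.com/browse_by_category.cfm/']

    def parse(self, response):
        for url in response.css('li.li_category > a::attr(href)').extract():
            # yield keeps the loop going; return would stop after the first link
            yield scrapy.Request(response.urljoin(url[3:]), callback=self.profileCategoryPages)

    def profileCategoryPages(self, response):
        for url in response.css('div.dv_en_block_name_frame > a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(url), callback=self.companyDetails)

    def companyDetails(self, response):
        yield {'company_name': response.css('span#name_frame::text').extract_first()}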

Scrape the next pages in Python using BeautifulSoup

I want to scrape the links from each page, move on to the next pages, and do the same. Here is my code to scrape links from the first page:
import requests
from bs4 import BeautifulSoup

page = 'https://www.booli.se/slutpriser/goteborg/22/?objectType=L%C3%A4genhet'
request = requests.get(page)
soup = BeautifulSoup(request.text, 'lxml')
links = soup.findAll('a', class_='search-list__item')

url = []
prefix = "https://www.booli.se"
for link in links:
    url.append(prefix + link["href"])
I tried the following for the first three pages, but it didn't work.
import re
import requests
from bs4 import BeautifulSoup

url = []
prefix = "https://www.booli.se"
with requests.Session() as session:
    for page in range(4):
        response = session.get("https://www.booli.se/slutpriser/goteborg/22/?objectType=L%C3%A4genhet&page=%f" % page)
        soup = BeautifulSoup(response.content, "html.parser")
        links = soup.findAll('a', class_='search-list__item')
        for link in links:
            url.append(prefix + link["href"])
First you have to create code that works fine with one page. Then you have to put your scraping code in a loop:
url = "https://www.booli.se/slutpriser/goteborg/22/?objectType=L%C3%A4genhet&page=1"
while True:
    # code goes here
You will notice there is a page=number at the end of the link.
You have to figure out how to run the loop over these URLs, changing the page=number each time:
i = 1
url = "https://www.booli.se/slutpriser/goteborg/22/?objectType=L%C3%A4genhet&page=" + str(i)
while True:
    i = i + 1
    page = requests.get(url)
    if page.status_code != 200:
        break
    url = "https://www.booli.se/slutpriser/goteborg/22/?objectType=L%C3%A4genhet&page=" + str(i)
    # Your scraping code goes here
I have used the if statement so that the loop does not go on forever. It will go up to the last page.
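Putting that loop together with the link extraction from the question gives a rough sketch like the one below (the selector and URL pattern are taken from the question; the stopping condition is the status-code check described above, which assumes the site returns a non-200 status past the last page):

import requests
from bs4 import BeautifulSoup

prefix = "https://www.booli.se"
base = "https://www.booli.se/slutpriser/goteborg/22/?objectType=L%C3%A4genhet&page="
urls = []

i = 1
while True:
    page = requests.get(base + str(i))
    if page.status_code != 200:
        break  # assumed to mean we ran past the last page
    soup = BeautifulSoup(page.text, "lxml")
    for link in soup.find_all('a', class_='search-list__item'):
        urls.append(prefix + link["href"])
    i += 1

print(len(urls), "links collected")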
Yes, I did it. Thank you. Here is the code for the first two pages:
urls = []
for page in range(3):
    urls.append("https://www.booli.se/slutpriser/goteborg/22/?objectType=L%C3%A4genhet&page={}".format(page))
page = urls[1:]

import requests
from bs4 import BeautifulSoup

inturl = []
for page in page:
    request = requests.get(page)
    soup = BeautifulSoup(request.text, 'lxml')
    links = soup.findAll('a', class_='search-list__item')
    prefix = "https://www.booli.se"
    for link in links:
        inturl.append(prefix + link["href"])

How To Remove White Space in Scrapy Spider Data

I am writing my first spider in Scrapy and attempting to follow the documentation. I have implemented ItemLoaders. The spider extracts the data, but the data contains many line returns. I have tried many ways to remove them, but nothing seems to work. The replace_escape_chars utility is supposed to work, but I can't figure out how to use it with the ItemLoader. Also, some people use (unicode.strip), but again, I can't seem to get it to work. Some people try to use these in items.py and others in the spider. How can I clean the data of these line returns (\r\n)? My items.py file only contains the item names and Field(). The spider code is below:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.loader import XPathItemLoader
from scrapy.utils.markup import replace_escape_chars
from ccpstore.items import Greenhouse

class GreenhouseSpider(BaseSpider):
    name = "greenhouse"
    allowed_domains = ["domain.com"]
    start_urls = [
        "http://www.domain.com",
    ]

    def parse(self, response):
        items = []
        l = XPathItemLoader(item=Greenhouse(), response=response)
        l.add_xpath('name', '//div[@class="product_name"]')
        l.add_xpath('title', '//h1')
        l.add_xpath('usage', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl00_liItem"]')
        l.add_xpath('repeat', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl02_liItem"]')
        l.add_xpath('direction', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl03_liItem"]')
        items.append(l.load_item())
        return items
You can use the default_output_processor on the loader, and also other processors on individual fields; see the title field:
from scrapy.spider import BaseSpider
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import Compose, MapCompose
from w3lib.html import replace_escape_chars, remove_tags
from ccpstore.items import Greenhouse

class GreenhouseSpider(BaseSpider):
    name = "greenhouse"
    allowed_domains = ["domain.com"]
    start_urls = ["http://www.domain.com"]

    def parse(self, response):
        l = XPathItemLoader(Greenhouse(), response=response)
        l.default_output_processor = MapCompose(lambda v: v.strip(), replace_escape_chars)
        l.add_xpath('name', '//div[@class="product_name"]')
        l.add_xpath('title', '//h1', Compose(remove_tags))
        l.add_xpath('usage', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl00_liItem"]')
        l.add_xpath('repeat', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl02_liItem"]')
        l.add_xpath('direction', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl03_liItem"]')
        return l.load_item()
It turns out that there were also many blank spaces in the data, so combining Steven's answer with some more research allowed all tags, line returns and duplicate spaces to be removed from the data. The working code is below. Note the addition of text() on the loader lines, which removes the tags, and the split and join processors, which remove spaces and line returns.
def parse(self, response):
    items = []
    l = XPathItemLoader(item=Greenhouse(), response=response)
    # MapCompose and Join come from scrapy.contrib.loader.processor
    l.default_input_processor = MapCompose(lambda v: v.split(), replace_escape_chars)
    l.default_output_processor = Join()
    l.add_xpath('title', '//h1/text()')
    l.add_xpath('usage', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl00_liItem"]/text()')
    l.add_xpath('repeat', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl02_liItem"]/text()')
    l.add_xpath('direction', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl03_liItem"]/text()')
    items.append(l.load_item())
    return items
