How to get all the URLs from this page? - web-scraping

I am working on a Scrapy project to get all the URLs from this page and the following pages, but when I run the spider I only get one URL from each page. I wrote a for loop to get them all, but nothing changed.
I also need to write each ad's data into a row of a CSV file. How do I do that?
The spider code:
import datetime
import urlparse
import socket
import re

from scrapy.loader.processors import MapCompose, Join
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.loader import ItemLoader

from cars2buy.items import Cars2BuyItem


class Cars2buyCarleasingSpider(CrawlSpider):
    name = "cars2buy-carleasing"
    start_urls = ['http://www.cars2buy.co.uk/business-car-leasing/']

    rules = (
        Rule(LinkExtractor(allow=("Abarth"), restrict_xpaths='//*[@id="content"]/div[7]/div[2]/div/a')),
        Rule(LinkExtractor(allow=("695C"), restrict_xpaths='//*[@id="content"]/div/div/p/a'), callback='parse_item', follow=True),
        Rule(LinkExtractor(restrict_xpaths='//*[@class="next"]'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        for l in response.xpath('//*[@class="viewthiscar"]/@href'):
            item = Cars2BuyItem()
            item['Company'] = l.extract()
            item['url'] = response.url
            return item
The output is:
2017-04-27 20:22:39 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cars2buy.co.uk/business-car-leasing/Abarth/695C/>
{'Company': u'/clicks_cache_car_lease.php?url=http%3A%2F%2Fwww.fleetprices.co.uk%2Fbusiness-lease-cars%2Fabarth%2F695-cabriolet%2F14-t-jet-165-xsr-2dr-204097572&broker=178&veh_id=901651523&type=business&make=Abarth&model=695C&der=1.4 T-Jet 165 XSR 2dr', 'url': 'http://www.cars2buy.co.uk/business-car-leasing/Abarth/695C/'}
2017-04-27 20:22:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cars2buy.co.uk/business-car-leasing/Abarth/695C/?leaf=2> (referer: http://www.cars2buy.co.uk/business-car-leasing/Abarth/695C/)
2017-04-27 20:22:40 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cars2buy.co.uk/business-car-leasing/Abarth/695C/?leaf=2>
{'Company': u'/clicks_cache_car_lease.php?url=http%3A%2F%2Fwww.jgleasing.co.uk%2Fbusiness-lease-cars%2Fabarth%2F695-cabriolet%2F14-t-jet-165-xsr-2dr-207378762&broker=248&veh_id=902250527&type=business&make=Abarth&model=695C&der=1.4 T-Jet 165 XSR 2dr', 'url': 'http://www.cars2buy.co.uk/business-car-leasing/Abarth/695C/?leaf=2'}
2017-04-27 20:22:40 [scrapy.core.engine] INFO: Closing spider (finished)

The problem is that once your for loop has processed the first item, it hits the return, leaves the parse_item method, and no further items are processed.
I suggest you replace the return with yield:
def parse_item(self, response):
    for l in response.xpath('//*[@class="viewthiscar"]/@href'):
        item = Cars2BuyItem()
        item['Company'] = l.extract()
        item['url'] = response.url
        yield item
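As for getting each ad into a row of a CSV file: once the items are yielded (rather than returned), Scrapy's built-in feed export can write them out for you, e.g. scrapy crawl cars2buy-carleasing -o ads.csv. A minimal sketch of the equivalent settings (ads.csv is just an example file name):

    # settings.py -- write every yielded item as one CSV row
    FEED_FORMAT = 'csv'
    FEED_URI = 'ads.csv'

Each item field (Company, url) becomes a column in the output.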

Related

Scrapy fails to scrape all the items

There are more than 500 items, but scrapy shell only manages 5 items.
from urllib import response
import scrapy


class Elo1Spider(scrapy.Spider):
    name = 'elo1'
    allowed_domains = ['exportleftovers.com']
    start_urls = ['http://exportleftovers.com/']

    def parse(self, response):
        for products in response.css('div.product-wrap'):
            yield {
                'name': products.css('a.product-thumbnail__title::text').get(),
                'price': products.css('span.money::text').get().strip(),
            }
        next_page = response.css('a.pagination-next').attrib['href']
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
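One concrete problem in the pagination code above is worth pointing out: response.css('a.pagination-next').attrib['href'] raises a KeyError when there is no next-page link, so the if next_page is not None check never gets a chance to stop the crawl cleanly. A minimal sketch of a safer parse method, reusing the selectors from the question (which I'm assuming do match the site):

    def parse(self, response):
        for products in response.css('div.product-wrap'):
            yield {
                'name': products.css('a.product-thumbnail__title::text').get(),
                'price': products.css('span.money::text').get(default='').strip(),
            }
        # ::attr(href) with .get() returns None on the last page instead of raising
        next_page = response.css('a.pagination-next::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)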

Scrapy is not extracting data, css selectors are correct

This is my first scraper and I am having some trouble. To begin with, I created my CSS selectors and they work when using scrapy shell. When I run my spider it simply returns this:
2017-10-26 14:48:49 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: digikey)
2017-10-26 14:48:49 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'digikey', 'CONCURRENT_REQUESTS': 1, 'NEWSPIDER_MODULE': 'digikey.spiders', 'SPIDER_MODULES': ['digikey.spiders'], 'USER_AGENT': 'digikey ("Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.02")'}
2017-10-26 14:48:49 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2017-10-26 14:48:50 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-10-26 14:48:50 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-10-26 14:48:50 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-10-26 14:48:50 [scrapy.core.engine] INFO: Spider opened
2017-10-26 14:48:50 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-10-26 14:48:50 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-10-26 14:48:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.digikey.com/products/en/capacitors/aluminum-electrolytic-capacitors/58/page/3?stock=1> (referer: None)
2017-10-26 14:48:52 [scrapy.core.engine] INFO: Closing spider (finished)
2017-10-26 14:48:52 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 329,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 104631,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 10, 26, 21, 48, 52, 235020),
'log_count/DEBUG': 2,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 10, 26, 21, 48, 50, 249076)}
2017-10-26 14:48:52 [scrapy.core.engine] INFO: Spider closed (finished)
PS C:\Users\dalla_000\digikey>
My spider looks like this
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from digikey.items import DigikeyItem
from scrapy.selector import Selector


class DigikeySpider(CrawlSpider):
    name = 'digikey'
    allowed_domains = ['digikey.com']
    start_urls = ['https://www.digikey.com/products/en/capacitors/aluminum-electrolytic-capacitors/58/page/3?stock=1']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=('/products/en/capacitors/aluminum-electrolytic-capacitors/58/page/3?stock=1', ), deny=('subsection\.php', ))),
    )

    def parse_item(self, response):
        for row in response.css('table#productTable tbody tr'):
            item = DigikeyItem()
            item['partnumber'] = row.css('.tr-mfgPartNumber [itemprop="name"]::text').extract_first()
            item['manufacturer'] = row.css('[itemprop="manufacture"] [itemprop="name"]::text').extract_first()
            item['description'] = row.css('.tr-description::text').extract_first()
            item['quanity'] = row.css('.tr-qtyAvailable::text').extract_first()
            item['price'] = row.css('.tr-unitPrice::text').extract_first()
            item['minimumquanity'] = row.css('.tr-minQty::text').extract_first()
            yield item

    parse_start_url = parse_item
The items.py looks like:
import scrapy


class DigikeyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    partnumber = scrapy.Field()
    manufacturer = scrapy.Field()
    description = scrapy.Field()
    quanity = scrapy.Field()
    minimumquanity = scrapy.Field()
    price = scrapy.Field()
    pass
Settings:
BOT_NAME = 'digikey'
SPIDER_MODULES = ['digikey.spiders']
NEWSPIDER_MODULE = 'digikey.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'digikey ("Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.02")'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
I am struggling to understand why no data is being extracted even though the CSS selectors work. Furthermore, the spider completes the job and closes. I am restricting the spider to crawl only one page; once it is working properly I will open it up to the entire website.
This works for me (don't change the indentation):
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector


class DigikeyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    partnumber = scrapy.Field()
    manufacturer = scrapy.Field()
    description = scrapy.Field()
    quanity = scrapy.Field()
    minimumquanity = scrapy.Field()
    price = scrapy.Field()


class DigikeySpider(CrawlSpider):
    name = 'digikey'
    allowed_domains = ['digikey.com']
    start_urls = ['https://www.digikey.com/products/en/capacitors/aluminum-electrolytic-capacitors/58/page/3?stock=1']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=('/products/en/capacitors/aluminum-electrolytic-capacitors/58/page/3', )), callback='parse_item'),
    )

    def parse_item(self, response):
        for row in response.css('table#productTable tbody tr'):
            item = DigikeyItem()
            item['partnumber'] = row.css('.tr-mfgPartNumber [itemprop="name"]::text').extract_first()
            item['manufacturer'] = row.css('[itemprop="manufacture"] [itemprop="name"]::text').extract_first()
            item['description'] = row.css('.tr-description::text').extract_first()
            item['quanity'] = row.css('.tr-qtyAvailable::text').extract_first()
            item['price'] = row.css('.tr-unitPrice::text').extract_first()
            item['minimumquanity'] = row.css('.tr-minQty::text').extract_first()
            yield item

    parse_start_url = parse_item
But if you want to test the rule, change your start_urls to
start_urls = ['https://www.digikey.com/products/en/capacitors/aluminum-electrolytic-capacitors/58']
and remove parse_start_url = parse_item
I don't think you need CrawlSpider, because that spider was designed for navigating a site such as a forum or blog across different categories in order to reach the post or actual item links (that's what the rules are for).
Your rules try to discover URLs to follow, then visit the specific URLs that match your rule, and then call the method specified there with the response from those URLs.
But in your case you want to parse a specific URL that is already inside start_urls, so CrawlSpider doesn't work that way. You should use the plain Spider implementation to get the response from the start_urls URLs.
class DigikeySpider(scrapy.Spider):
    name = 'digikey'
    allowed_domains = ['digikey.com']
    start_urls = ['https://www.digikey.com/products/en/capacitors/aluminum-electrolytic-capacitors/58/page/3?stock=1']

    def parse(self, response):
        ...
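For completeness, a minimal sketch of what that plain Spider could look like with the row-parsing loop from the question moved into parse (the selectors are the ones from the question and are assumed to match the page):

    import scrapy

    from digikey.items import DigikeyItem


    class DigikeySpider(scrapy.Spider):
        name = 'digikey'
        allowed_domains = ['digikey.com']
        start_urls = ['https://www.digikey.com/products/en/capacitors/aluminum-electrolytic-capacitors/58/page/3?stock=1']

        def parse(self, response):
            # One item per table row; extract_first() returns None when a cell is missing.
            for row in response.css('table#productTable tbody tr'):
                item = DigikeyItem()
                item['partnumber'] = row.css('.tr-mfgPartNumber [itemprop="name"]::text').extract_first()
                item['manufacturer'] = row.css('[itemprop="manufacture"] [itemprop="name"]::text').extract_first()
                item['description'] = row.css('.tr-description::text').extract_first()
                item['quanity'] = row.css('.tr-qtyAvailable::text').extract_first()
                item['price'] = row.css('.tr-unitPrice::text').extract_first()
                item['minimumquanity'] = row.css('.tr-minQty::text').extract_first()
                yield item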

import start_urls from a csv file in Scrapy

I recently started web scraping with Scrapy. I generated a list of URLs that I want to scrape into a txt document, one URL per line. This is my crawler code:
import scrapy
import csv
import sys

from realtor.items import RealtorItem
from scrapy.spider import BaseSpider
#from scrapy.selector import HtmlXPathSelector
#from realtor.items import RealtorItem


class RealtorSpider(scrapy.Spider):
    name = "realtor"
    allowed_domains = ["realtor.com"]
    with open('realtor2.txt') as f:
        start_urls = [url.strip() for url in f.readlines()]

    def parse(self, response):
        #hxs = HtmlXPathSelector(response)
        #sites = hxs.select('//div/li/div/a/@href')
        sites = response.xpath('//a[contains(@href, "/realestateandhomes-detail/")]')
        items = []
        for site in sites:
            print(site.extract())
            item = RealtorItem()
            item['link'] = site.xpath('@href').extract()
            items.append(item)
        return items
Now my goal is to read the links from realtor2.txt and start parsing them; however, I get a ValueError: Missing scheme in request URL:
File "C:\Users\Ash\Anaconda2\lib\site-packages\scrapy\http\request\__init__.py", line 58, in _set_url
raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url:
%FF%FEw%00w%00w%00.%00r%00e%00a%00l%00t%00o%00r%00.%00c%00o%00m%00/%00r%00e%00a%00l%00e%00s%00t%00a%00t%00e%00a%00n%00d%00h%00o%00m%00e%00s%00-%00d%00e%00t%00a%00i%00l%00/%005%000%00-%00M%00e%00n%00o%00r%00e%00s%00-%00A%00v%00e%00-%00A%00p%00t%00-%006%001%000%00_%00C%00o%00r%00a%00l%00-%00G%00a%00b%00l%00e%00s%00_%00F%00L%00_%003%003%001%003%004%00_%00M%005%003%008%000%006%00-%005%008%006%007%007%00%0D%00
2017-06-25 22:28:35 [scrapy.core.engine] INFO: Closing spider (finished)
I think there may be an issue in how start_urls is defined, but I don't know how to proceed.
"ValueError: Missing scheme in request url" means that you are missing http.
You can use urljoin to avoid this problem.
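A minimal sketch of building start_urls so that every entry carries a scheme (realtor2.txt is the file from the question; the startswith check is a simple alternative to urljoin, and the utf-16 encoding is only a guess based on the %FF%FE / %00 bytes in the error, which look like a UTF-16 byte order mark; use utf-8 if the file is plain text):

    import io

    import scrapy


    class RealtorSpider(scrapy.Spider):
        name = "realtor"
        allowed_domains = ["realtor.com"]

        # Read one URL per line, skip blank lines, and prepend the scheme when it is missing.
        with io.open('realtor2.txt', encoding='utf-16') as f:
            start_urls = [
                url.strip() if url.strip().startswith('http') else 'http://' + url.strip()
                for url in f if url.strip()
            ]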

cannot proceed to scraping or crawling

I am trying to scrape data from https://careers-meridianhealth.icims.com/jobs/5196/licensed-practical-nurse/job (a sample page), but to no avail. I don't know why it keeps telling me "Filtered offsite request" to another website, with referer None. I just want to get the job name, location, and links. Anyway, this is my code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistSampleItem


class MySpider(CrawlSpider):
    name = "meridian"
    allowed_domains = ["careers-meridianhealth.icims.com"]
    start_urls = ["https://careers-meridianhealth.icims.com"]
    rules = (
        Rule(SgmlLinkExtractor(deny=path_deny_base, allow=('\d+'), restrict_xpaths=('*')),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//div[2]/h1')
        linker = hxs.select('//div[2]/div[8]/a[1]')
        loc_Con = hxs.select('//div[2]/span/span/span[1]')
        loc_Reg = hxs.select('//div[2]/span/span/span[2]')
        loc_Loc = hxs.select('//div[2]/span/span/span[3]')
        items = []
        for titles in titles:
            item = CraigslistSampleItem()
            #item["job_id"] = id.select('text()').extract()[0].strip()
            item["title"] = map(unicode.strip, titles.select('text()').extract())  #ok
            item["link"] = linker.select('@href').extract()  #ok
            item["info"] = (response.url)
            temp1 = loc_Con.select('text()').extract()
            temp2 = loc_Reg.select('text()').extract()
            temp3 = loc_Loc.select('text()').extract()
            temp1 = temp1[0] if temp1 else ""
            temp2 = temp2[0] if temp2 else ""
            temp3 = temp3[0] if temp3 else ""
            item["code"] = "{0}-{1}-{2}".format(temp1, temp2, temp3)
            items.append(item)
        return(items)
If you check your link extractor using scrapy shell, you'll see that your start URL only contains links to websites that are not under "careers-meridianhealth.icims.com":
paul@paul:~/tmp/stackoverflow$ scrapy shell https://careers-meridianhealth.icims.com
In [1]: from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
In [2]: lx = SgmlLinkExtractor(allow=('\d+'),restrict_xpaths=('*'))
In [3]: lx.extract_links(response)
Out[3]:
[Link(url='https://www.meridianhealth.com/MH/Careers/SearchJobs/index.cfm?JobID=26322652', text=u'NURSE MANAGER ASSISTANT [OPERATING ROOM]', fragment='', nofollow=False),
Link(url='https://www.meridianhealth.com/MH/Careers/SearchJobs/index.cfm?JobID=26119218', text=u'WEB DEVELOPER [CORP COMM & MARKETING]', fragment='', nofollow=False),
Link(url='https://www.meridianhealth.com/MH/Careers/SearchJobs/index.cfm?JobID=30441671', text=u'HR Generalist', fragment='', nofollow=False),
Link(url='https://www.meridianhealth.com/MH/Careers/SearchJobs/index.cfm?JobID=30435857', text=u'OCCUPATIONAL THERAPIST [BHCC REHABILITATION]', fragment='', nofollow=False),
Link(url='https://www.meridianhealth.com/MH/1800DOCTORS.cfm', text=u'1-800-DOCTORS', fragment='', nofollow=False),
Link(url='http://kidshealth.org/PageManager.jsp?lic=184&ps=101', text=u"Kids' Health", fragment='', nofollow=False),
Link(url='https://www.meridianhealth.com/MH/HealthInformation/MeridianTunedin2health.cfm', text=u'Meridian Tunedin2health', fragment='', nofollow=False),
Link(url='http://money.cnn.com/magazines/fortune/best-companies/2013/snapshots/39.html?iid=bc_fl_list', text=u'', fragment='', nofollow=False)]
In [4]:
You can either change your rule, add more domains to the allowed_domains attribute, or not define allowed_domains at all (so all domains will be crawled, which can mean crawling A LOT of pages).
But if you look closely at the page source, you'll notice that it includes an iframe, and if you follow the links, you'll find https://careers-meridianhealth.icims.com/jobs/search?hashed=0&in_iframe=1&searchCategory=&searchLocation=&ss=1 which contains the individual job postings:
paul@paul:~/tmp/stackoverflow$ scrapy shell https://careers-meridianhealth.icims.com
In [1]: sel.xpath('.//iframe/@src')
Out[1]: [<Selector xpath='.//iframe/@src' data=u'https://careers-meridianhealth.icims.com'>]
In [2]: sel.xpath('.//iframe/@src').extract()
Out[2]: [u'https://careers-meridianhealth.icims.com/?in_iframe=1']
In [3]: fetch('https://careers-meridianhealth.icims.com/?in_iframe=1')
2014-05-21 11:53:14+0200 [default] DEBUG: Redirecting (302) to <GET https://careers-meridianhealth.icims.com/jobs?in_iframe=1> from <GET https://careers-meridianhealth.icims.com/?in_iframe=1>
2014-05-21 11:53:14+0200 [default] DEBUG: Redirecting (302) to <GET https://careers-meridianhealth.icims.com/jobs/intro?in_iframe=1&hashed=0&in_iframe=1> from <GET https://careers-meridianhealth.icims.com/jobs?in_iframe=1>
2014-05-21 11:53:14+0200 [default] DEBUG: Crawled (200) <GET https://careers-meridianhealth.icims.com/jobs/intro?in_iframe=1&hashed=0&in_iframe=1> (referer: None)
In [4]: from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
In [5]: lx = SgmlLinkExtractor()
In [6]: lx.extract_links(response)
Out[6]:
[Link(url='https://careers-meridianhealth.icims.com/jobs/login?back=intro&hashed=0&in_iframe=1', text=u'submit your resume', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/search?hashed=0&in_iframe=1&searchCategory=&searchLocation=&ss=1', text=u'view all open job positions', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/reminder?hashed=0&in_iframe=1', text=u'Reset Password', fragment='', nofollow=False),
Link(url='https://media.icims.com/training/candidatefaq/faq.html', text=u'Need further assistance?', fragment='', nofollow=False),
Link(url='http://www.icims.com/platform_help?utm_campaign=platform+help&utm_content=page1&utm_medium=link&utm_source=platform', text=u'Applicant Tracking Software', fragment='', nofollow=False)]
In [7]: fetch('https://careers-meridianhealth.icims.com/jobs/search?hashed=0&in_iframe=1&searchCategory=&searchLocation=&ss=1')
2014-05-21 11:54:24+0200 [default] DEBUG: Crawled (200) <GET https://careers-meridianhealth.icims.com/jobs/search?hashed=0&in_iframe=1&searchCategory=&searchLocation=&ss=1> (referer: None)
In [8]: lx.extract_links(response)
Out[8]:
[Link(url='https://careers-meridianhealth.icims.com/jobs/search?in_iframe=1&pr=1', text=u'', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5196/licensed-practical-nurse/job?in_iframe=1', text=u'LICENSED PRACTICAL NURSE', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5192/certified-nursing-assistant/job?in_iframe=1', text=u'CERTIFIED NURSING ASSISTANT', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5191/receptionist/job?in_iframe=1', text=u'RECEPTIONIST', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5190/rehabilitation-aide/job?in_iframe=1', text=u'REHABILITATION AIDE', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5188/nurse-supervisor/job?in_iframe=1', text=u'NURSE SUPERVISOR', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5164/lpn/job?in_iframe=1', text=u'LPN', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5161/speech-pathologist-per-diem/job?in_iframe=1', text=u'SPEECH PATHOLOGIST PER DIEM', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5160/social-worker-part-time/job?in_iframe=1', text=u'SOCIAL WORKER PART TIME', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5154/client-care-coordinator-nights/job?in_iframe=1', text=u'CLIENT CARE COORDINATOR NIGHTS', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5153/greeter/job?in_iframe=1', text=u'GREETER', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5152/welcome-ambassador/job?in_iframe=1', text=u'WELCOME AMBASSADOR', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5146/certified-medical-assistant-i/job?in_iframe=1', text=u'CERTIFIED MEDICAL ASSISTANT I', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5142/registered-nurse-full-time/job?in_iframe=1', text=u'REGISTERED NURSE FULL TIME', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5139/part-time-home-health-aide/job?in_iframe=1', text=u'PART TIME HOME HEALTH AIDE', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5136/rehabilitation-tech/job?in_iframe=1', text=u'REHABILITATION TECH', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5127/registered-nurse/job?in_iframe=1', text=u'REGISTERED NURSE', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5123/dietary-aide/job?in_iframe=1', text=u'DIETARY AIDE', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5121/tcu-administrator-%5Btransitional-care-unit%5D/job?in_iframe=1', text=u'TCU ADMINISTRATOR [TRANSITIONAL CARE UNIT]', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5119/mds-coordinator/job?in_iframe=1', text=u'MDS Coordinator', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5108/per-diem-patient-service-tech/job?in_iframe=1', text=u'Per Diem PATIENT SERVICE TECH', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/intro?in_iframe=1', text=u'Go back to the welcome page', fragment='', nofollow=False),
Link(url='https://media.icims.com/training/candidatefaq/faq.html', text=u'Need further assistance?', fragment='', nofollow=False),
Link(url='http://www.icims.com/platform_help?utm_campaign=platform+help&utm_content=page1&utm_medium=link&utm_source=platform', text=u'Applicant Tracking Software', fragment='', nofollow=False)]
In [9]:
You'll have to follow the pagination links to get all the other job postings.
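Putting those findings together, here is a minimal sketch of a CrawlSpider that starts from the in_iframe search URL, follows the pagination links, and parses each job page (it uses the current LinkExtractor instead of the old SgmlLinkExtractor; the URL patterns come from the links listed above, and the XPaths in parse_job are adapted from the question, so treat them as assumptions about the page layout):

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    class MeridianJobsSpider(CrawlSpider):
        name = "meridian_jobs"
        allowed_domains = ["careers-meridianhealth.icims.com"]
        start_urls = [
            "https://careers-meridianhealth.icims.com/jobs/search?hashed=0&in_iframe=1&searchCategory=&searchLocation=&ss=1",
        ]

        rules = (
            # Follow paginated search-result pages (e.g. /jobs/search?in_iframe=1&pr=1).
            Rule(LinkExtractor(allow=(r'/jobs/search\?',))),
            # Parse the individual job postings (e.g. /jobs/5196/licensed-practical-nurse/job).
            Rule(LinkExtractor(allow=(r'/jobs/\d+/.+/job',)), callback='parse_job'),
        )

        def parse_job(self, response):
            # XPaths adapted from the question; adjust them to the actual page structure.
            yield {
                'title': response.xpath('//div[2]/h1//text()').extract(),
                'location': response.xpath('//div[2]/span/span/span//text()').extract(),
                'url': response.url,
            }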

Scrapy: Not able to scrape pages with login

I want to scrape the pages shown below, but they require authentication.
I tried the code below, but it says 0 pages were scraped.
I am not able to understand what the issue is.
Can someone please help?
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule
from kappaal.items import KappaalItem


class KappaalCrawler(InitSpider):
    name = "initkappaal"
    allowed_domains = ["http://www.kappaalphapsi1911.com/"]
    login_page = 'http://www.kappaalphapsi1911.com/login.aspx'
    #login_page = 'https://kap.site-ym.com/Login.aspx'
    start_urls = ["http://www.kappaalphapsi1911.com/search/newsearch.asp?cdlGroupID=102044"]

    rules = ( Rule(SgmlLinkExtractor(allow=r'-\w$'), callback='parseItems', follow=True), )
    #rules = ( Rule(SgmlLinkExtractor(allow=("*", ), restrict_xpaths=("//*[contains(@id, 'SearchResultsGrid')]",)), callback="parseItems", follow=True), )

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(response,
            formdata={'u': 'username', 'p': 'password'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if "Member Search Results" in response.body:
            self.log("Successfully logged in. Let's start crawling!")
            # Now the crawling can begin..
            self.initialized()
        else:
            self.log("Bad times :(")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parseItems(self, response):
        hxs = HtmlXPathSelector(response)
        members = hxs.select('/html/body/form/div[3]/div/table/tbody/tr/td/div/table[2]/tbody')
        print members
        items = []
        for member in members:
            item = KappaalItem()
            item['Name'] = member.select("//a/text()").extract()
            item['MemberLink'] = member.select("//a/@href").extract()
            #item['EmailID'] =
            #print item['Name'], item['MemberLink']
            items.append(item)
        return items
I got the response below after executing the scraper:
2013-01-23 07:08:23+0530 [scrapy] INFO: Scrapy 0.16.3 started (bot: kappaal)
2013-01-23 07:08:23+0530 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-01-23 07:08:23+0530 [scrapy] DEBUG: Enabled downloader middlewares:HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-01-23 07:08:23+0530 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-01-23 07:08:23+0530 [scrapy] DEBUG: Enabled item pipelines:
2013-01-23 07:08:23+0530 [initkappaal] INFO: Spider opened
2013-01-23 07:08:23+0530 [initkappaal] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-01-23 07:08:23+0530 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-01-23 07:08:23+0530 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-01-23 07:08:26+0530 [initkappaal] DEBUG: Crawled (200) <GET https://kap.site-ym.com/Login.aspx> (referer: None)
2013-01-23 07:08:26+0530 [initkappaal] DEBUG: Filtered offsite request to 'kap.site-ym.com': <GET https://kap.site-ym.com/search/all.asp?bst=Enter+search+criteria...&p=P%40ssw0rd&u=9900146>
2013-01-23 07:08:26+0530 [initkappaal] INFO: Closing spider (finished)
2013-01-23 07:08:26+0530 [initkappaal] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 231,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 23517,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2013, 1, 23, 1, 38, 26, 194000),
'log_count/DEBUG': 8,
'log_count/INFO': 4,
'request_depth_max': 1,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2013, 1, 23, 1, 38, 23, 542000)}
2013-01-23 07:08:26+0530 [initkappaal] INFO: Spider closed (finished)
I do not understand why it is not authenticating and parsing the start url as mentioned.
Also make sure you have cookies enabled, so that once you log in the session remains logged in:
COOKIES_ENABLED = True
COOKIES_DEBUG = True
in your settings.py file.
I fixed it like this:
def start_requests(self):
    return self.init_request()

def init_request(self):
    return [Request(url=self.login_page, callback=self.login)]

def login(self, response):
    return FormRequest.from_response(response,
        formdata={'username': 'username', 'password': 'password'},
        callback=self.check_login_response)

def check_login_response(self, response):
    if "Logout" in response.body:
        for url in self.start_urls:
            yield self.make_requests_from_url(url)
    else:
        self.log("Could not log in...")
By overriding start_requests, you ensure that the login process completes correctly, and only then do you start scraping.
I'm using a CrawlSpider with this and it works perfectly! Hope it helps.
Okay, so there are a couple of problems that I can see. However, I can't test the code because of the username and password. Is there a dummy account that can be used for testing purposes?
InitSpider doesn't implement rules, so whilst it won't cause a problem, that should be removed.
The check_login_response needs to return something.
To wit:
def check_login_response(self, response):
    """Check the response returned by a login request to see if we are
    successfully logged in.
    """
    if "Member Search Results" in response.body:
        self.log("Successfully logged in. Let's start crawling!")
        # Now the crawling can begin..
        return self.initialized()
    else:
        self.log("Bad times :(")
        # Something went wrong, we couldn't log in, so nothing happens.
        return
This may not be the response you're looking for, but I feel your pain...
I ran into this same issue; I felt the Scrapy documentation was not sufficient. I ended up using mechanize to log in. If you figure it out with Scrapy, great; if not, mechanize is pretty straightforward.
