Cannot proceed to scraping or crawling with Scrapy

I am trying to scrape data from https://careers-meridianhealth.icims.com/jobs/5196/licensed-practical-nurse/job (a sample page) but to no avail. I don't know why it keeps reporting "Filtered offsite request" to another website with referer None. I just want to get the job name, location, and link. Anyway, this is my code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistSampleItem

class MySpider(CrawlSpider):
    name = "meridian"
    allowed_domains = ["careers-meridianhealth.icims.com"]
    start_urls = ["https://careers-meridianhealth.icims.com"]

    rules = (
        Rule(SgmlLinkExtractor(deny=path_deny_base, allow=('\d+'), restrict_xpaths=('*')),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//div[2]/h1')
        linker = hxs.select('//div[2]/div[8]/a[1]')
        loc_Con = hxs.select('//div[2]/span/span/span[1]')
        loc_Reg = hxs.select('//div[2]/span/span/span[2]')
        loc_Loc = hxs.select('//div[2]/span/span/span[3]')
        items = []
        for titles in titles:
            item = CraigslistSampleItem()
            #item["job_id"] = id.select('text()').extract()[0].strip()
            item["title"] = map(unicode.strip, titles.select('text()').extract())  # ok
            item["link"] = linker.select('@href').extract()  # ok
            item["info"] = response.url
            temp1 = loc_Con.select('text()').extract()
            temp2 = loc_Reg.select('text()').extract()
            temp3 = loc_Loc.select('text()').extract()
            temp1 = temp1[0] if temp1 else ""
            temp2 = temp2[0] if temp2 else ""
            temp3 = temp3[0] if temp3 else ""
            item["code"] = "{0}-{1}-{2}".format(temp1, temp2, temp3)
            items.append(item)
        return items

If you check your link extractor using scrapy shell, you'll see that your start URL only has links to websites not under "careers-meridianhealth.icims.com":
paul@paul:~/tmp/stackoverflow$ scrapy shell https://careers-meridianhealth.icims.com
In [1]: from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
In [2]: lx = SgmlLinkExtractor(allow=('\d+'),restrict_xpaths=('*'))
In [3]: lx.extract_links(response)
Out[3]:
[Link(url='https://www.meridianhealth.com/MH/Careers/SearchJobs/index.cfm?JobID=26322652', text=u'NURSE MANAGER ASSISTANT [OPERATING ROOM]', fragment='', nofollow=False),
Link(url='https://www.meridianhealth.com/MH/Careers/SearchJobs/index.cfm?JobID=26119218', text=u'WEB DEVELOPER [CORP COMM & MARKETING]', fragment='', nofollow=False),
Link(url='https://www.meridianhealth.com/MH/Careers/SearchJobs/index.cfm?JobID=30441671', text=u'HR Generalist', fragment='', nofollow=False),
Link(url='https://www.meridianhealth.com/MH/Careers/SearchJobs/index.cfm?JobID=30435857', text=u'OCCUPATIONAL THERAPIST [BHCC REHABILITATION]', fragment='', nofollow=False),
Link(url='https://www.meridianhealth.com/MH/1800DOCTORS.cfm', text=u'1-800-DOCTORS', fragment='', nofollow=False),
Link(url='http://kidshealth.org/PageManager.jsp?lic=184&ps=101', text=u"Kids' Health", fragment='', nofollow=False),
Link(url='https://www.meridianhealth.com/MH/HealthInformation/MeridianTunedin2health.cfm', text=u'Meridian Tunedin2health', fragment='', nofollow=False),
Link(url='http://money.cnn.com/magazines/fortune/best-companies/2013/snapshots/39.html?iid=bc_fl_list', text=u'', fragment='', nofollow=False)]
In [4]:
You can either change your rule, add more domains to the allowed_domains attribute, or not define allowed_domains at all (so all domains will be crawled; this can mean crawling A LOT of pages).
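For example (just a sketch; the second domain is taken from the extracted links above):
allowed_domains = ["careers-meridianhealth.icims.com", "meridianhealth.com"]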
But if you look closely at the page source, you'll notice that it includes an iframe, and if you follow the links, you'll find https://careers-meridianhealth.icims.com/jobs/search?hashed=0&in_iframe=1&searchCategory=&searchLocation=&ss=1 which contains the individual job postings:
paul@paul:~/tmp/stackoverflow$ scrapy shell https://careers-meridianhealth.icims.com
In [1]: sel.xpath('.//iframe/@src')
Out[1]: [<Selector xpath='.//iframe/@src' data=u'https://careers-meridianhealth.icims.com'>]
In [2]: sel.xpath('.//iframe/@src').extract()
Out[2]: [u'https://careers-meridianhealth.icims.com/?in_iframe=1']
In [3]: fetch('https://careers-meridianhealth.icims.com/?in_iframe=1')
2014-05-21 11:53:14+0200 [default] DEBUG: Redirecting (302) to <GET https://careers-meridianhealth.icims.com/jobs?in_iframe=1> from <GET https://careers-meridianhealth.icims.com/?in_iframe=1>
2014-05-21 11:53:14+0200 [default] DEBUG: Redirecting (302) to <GET https://careers-meridianhealth.icims.com/jobs/intro?in_iframe=1&hashed=0&in_iframe=1> from <GET https://careers-meridianhealth.icims.com/jobs?in_iframe=1>
2014-05-21 11:53:14+0200 [default] DEBUG: Crawled (200) <GET https://careers-meridianhealth.icims.com/jobs/intro?in_iframe=1&hashed=0&in_iframe=1> (referer: None)
In [4]: from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
In [5]: lx = SgmlLinkExtractor()
In [6]: lx.extract_links(response)
Out[6]:
[Link(url='https://careers-meridianhealth.icims.com/jobs/login?back=intro&hashed=0&in_iframe=1', text=u'submit your resume', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/search?hashed=0&in_iframe=1&searchCategory=&searchLocation=&ss=1', text=u'view all open job positions', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/reminder?hashed=0&in_iframe=1', text=u'Reset Password', fragment='', nofollow=False),
Link(url='https://media.icims.com/training/candidatefaq/faq.html', text=u'Need further assistance?', fragment='', nofollow=False),
Link(url='http://www.icims.com/platform_help?utm_campaign=platform+help&utm_content=page1&utm_medium=link&utm_source=platform', text=u'Applicant Tracking Software', fragment='', nofollow=False)]
In [7]: fetch('https://careers-meridianhealth.icims.com/jobs/search?hashed=0&in_iframe=1&searchCategory=&searchLocation=&ss=1')
2014-05-21 11:54:24+0200 [default] DEBUG: Crawled (200) <GET https://careers-meridianhealth.icims.com/jobs/search?hashed=0&in_iframe=1&searchCategory=&searchLocation=&ss=1> (referer: None)
In [8]: lx.extract_links(response)
Out[8]:
[Link(url='https://careers-meridianhealth.icims.com/jobs/search?in_iframe=1&pr=1', text=u'', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5196/licensed-practical-nurse/job?in_iframe=1', text=u'LICENSED PRACTICAL NURSE', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5192/certified-nursing-assistant/job?in_iframe=1', text=u'CERTIFIED NURSING ASSISTANT', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5191/receptionist/job?in_iframe=1', text=u'RECEPTIONIST', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5190/rehabilitation-aide/job?in_iframe=1', text=u'REHABILITATION AIDE', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5188/nurse-supervisor/job?in_iframe=1', text=u'NURSE SUPERVISOR', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5164/lpn/job?in_iframe=1', text=u'LPN', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5161/speech-pathologist-per-diem/job?in_iframe=1', text=u'SPEECH PATHOLOGIST PER DIEM', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5160/social-worker-part-time/job?in_iframe=1', text=u'SOCIAL WORKER PART TIME', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5154/client-care-coordinator-nights/job?in_iframe=1', text=u'CLIENT CARE COORDINATOR NIGHTS', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5153/greeter/job?in_iframe=1', text=u'GREETER', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5152/welcome-ambassador/job?in_iframe=1', text=u'WELCOME AMBASSADOR', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5146/certified-medical-assistant-i/job?in_iframe=1', text=u'CERTIFIED MEDICAL ASSISTANT I', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5142/registered-nurse-full-time/job?in_iframe=1', text=u'REGISTERED NURSE FULL TIME', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5139/part-time-home-health-aide/job?in_iframe=1', text=u'PART TIME HOME HEALTH AIDE', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5136/rehabilitation-tech/job?in_iframe=1', text=u'REHABILITATION TECH', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5127/registered-nurse/job?in_iframe=1', text=u'REGISTERED NURSE', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5123/dietary-aide/job?in_iframe=1', text=u'DIETARY AIDE', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5121/tcu-administrator-%5Btransitional-care-unit%5D/job?in_iframe=1', text=u'TCU ADMINISTRATOR [TRANSITIONAL CARE UNIT]', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5119/mds-coordinator/job?in_iframe=1', text=u'MDS Coordinator', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/5108/per-diem-patient-service-tech/job?in_iframe=1', text=u'Per Diem PATIENT SERVICE TECH', fragment='', nofollow=False),
Link(url='https://careers-meridianhealth.icims.com/jobs/intro?in_iframe=1', text=u'Go back to the welcome page', fragment='', nofollow=False),
Link(url='https://media.icims.com/training/candidatefaq/faq.html', text=u'Need further assistance?', fragment='', nofollow=False),
Link(url='http://www.icims.com/platform_help?utm_campaign=platform+help&utm_content=page1&utm_medium=link&utm_source=platform', text=u'Applicant Tracking Software', fragment='', nofollow=False)]
In [9]:
You'll have to follow the pagination links to get all the other job postings.
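Putting the pieces together, here is a rough sketch of a revised spider (untested; the start URL is the iframe search page found above, and the XPath for the job title is a placeholder you would adapt to the job pages):
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistSampleItem

class MeridianSpider(CrawlSpider):
    name = "meridian"
    allowed_domains = ["careers-meridianhealth.icims.com"]
    # start inside the iframe, where the job list and pagination links live
    start_urls = [
        "https://careers-meridianhealth.icims.com/jobs/search?hashed=0&in_iframe=1&ss=1",
    ]

    rules = (
        # individual job postings, e.g. /jobs/5196/licensed-practical-nurse/job
        Rule(SgmlLinkExtractor(allow=(r'/jobs/\d+/.+/job', )),
             callback="parse_items", follow=True),
        # pagination links within the search results
        Rule(SgmlLinkExtractor(allow=(r'/jobs/search\?', )), follow=True),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        item = CraigslistSampleItem()
        # placeholder XPath -- adjust it to the actual job page markup
        item["title"] = hxs.select('//h1/text()').extract()
        item["link"] = response.url
        return item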

Related

Why is the YouTube API v3 inconsistent with the amount of comments it lets you download before an error 400?

I am downloading YouTube comments with a python script that uses API keys and the YouTube Data API V3, but sooner or later I run into the following error:
{'error': {'code': 400, 'message': "The API server failed to successfully process the request. While this can be a transient error, it usually indicates that the request's input is invalid. Check the structure of the commentThread resource in the request body to ensure that it is valid.", 'errors': [{'message': "The API server failed to successfully process the request. While this can be a transient error, it usually indicates that the request's input is invalid. Check the structure of the commentThread resource in the request body to ensure that it is valid.", 'domain': 'youtube.commentThread', 'reason': 'processingFailure', 'location': 'body', 'locationType': 'other'}]}}
I am using the following code:
import argparse
import requests
import json
import time

start_time = time.time()

class YouTubeApi():

    YOUTUBE_COMMENTS_URL = 'https://www.googleapis.com/youtube/v3/commentThreads'
    comment_counter = 0

    def is_error_response(self, response):
        error = response.get('error')
        if error is None:
            return False
        print("API Error: "
              f"code={error['code']} "
              f"domain={error['errors'][0]['domain']} "
              f"reason={error['errors'][0]['reason']} "
              f"message={error['errors'][0]['message']!r}")
        print(self.comment_counter)
        return True

    def format_comments(self, results, likes_required):
        comments_list = []
        try:
            for item in results["items"]:
                comment = item["snippet"]["topLevelComment"]
                likes = comment["snippet"]["likeCount"]
                if likes < likes_required:
                    continue
                author = comment["snippet"]["authorDisplayName"]
                text = comment["snippet"]["textDisplay"]
                str = "Comment by {}:\n \"{}\"\n\n".format(author, text)
                str = str.encode('ascii', 'replace').decode()
                comments_list.append(str)
                self.comment_counter += 1
                print("Comments downloaded:", self.comment_counter, end="\r")
        except KeyError:
            print(results)
        return comments_list

    def get_video_comments(self, video_id, likes_required):
        with open("API_keys.txt", "r") as f:
            key_list = f.readlines()
        comments_list = []
        key_list = [key.strip('/n') for key in key_list]

        params = {
            'part': 'snippet,replies',
            'maxResults': 100,
            'videoId': video_id,
            'textFormat': 'plainText',
            'key': key_list[0]
        }

        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
        }

        comments_data = requests.get(self.YOUTUBE_COMMENTS_URL, params=params, headers=headers)
        results = comments_data.json()

        if self.is_error_response(results):
            return []

        nextPageToken = results.get("nextPageToken")
        comments_list = []
        comments_list += self.format_comments(results, likes_required)

        while nextPageToken:
            params.update({'pageToken': nextPageToken})
            if self.comment_counter <= 900000:
                params.update({'key': key_list[0]})
            elif self.comment_counter <= 1800000:
                params.update({'key': key_list[1]})
            elif self.comment_counter <= 2700000:
                params.update({'key': key_list[2]})
            elif self.comment_counter <= 3600000:
                params.update({'key': key_list[3]})
            elif self.comment_counter <= 4500000:
                params.update({'key': key_list[4]})
            else:
                params.update({'key': key_list[5]})
            if self.comment_counter % 900001 == 0:
                print(params["key"])

            comments_data = requests.get(self.YOUTUBE_COMMENTS_URL, params=params, headers=headers)
            results = comments_data.json()

            if self.is_error_response(results):
                return comments_list

            nextPageToken = results.get("nextPageToken")
            comments_list += self.format_comments(results, likes_required)

        return comments_list

    def get_video_id_list(self, filename):
        try:
            with open(filename, 'r') as file:
                URL_list = file.readlines()
        except FileNotFoundError:
            exit("File \"" + filename + "\" not found")

        list = []
        for url in URL_list:
            if url == "\n":  # ignore empty lines
                continue
            if url[-1] == '\n':  # delete '\n' at the end of line
                url = url[:-1]
            if url.find('='):  # get id
                id = url[url.find('=') + 1:]
                list.append(id)
            else:
                print("Wrong URL")
        return list

def main():
    yt = YouTubeApi()
    parser = argparse.ArgumentParser(add_help=False, description=("Download youtube comments from many videos into txt file"))
    required = parser.add_argument_group("required arguments")
    optional = parser.add_argument_group("optional arguments")
    optional.add_argument("--likes", '-l', help="The amount of likes a comment needs to be saved", type=int)
    optional.add_argument("--input", '-i', help="URL list file name")
    optional.add_argument("--output", '-o', help="Output file name")
    optional.add_argument("--help", '-h', help="Help", action='help')
    args = parser.parse_args()

    # --------------------------------------------------------------------- #
    likes = 0
    if args.likes:
        likes = args.likes
    input_file = "URL_list.txt"
    if args.input:
        input_file = args.input
    output_file = "Comments.txt"
    if args.output:
        output_file = args.output

    list = yt.get_video_id_list(input_file)
    if not list:
        exit("No URLs in input file")

    try:
        vid_counter = 0
        with open(output_file, "a") as f:
            for video_id in list:
                vid_counter += 1
                print("Downloading comments for video ", vid_counter, ", id: ", video_id, sep='')
                comments = yt.get_video_comments(video_id, likes)
                if comments:
                    for comment in comments:
                        f.write(comment)
        print('\nDone!')
    except KeyboardInterrupt:
        exit("User Aborted the Operation")
    # --------------------------------------------------------------------- #

if __name__ == '__main__':
    main()
In another thread, it was discovered that Google does not currently permit downloading all the comments on a popular video; however, you would expect it to cut off at the same point each time. Instead, I have found that it can range anywhere between 200k and 1.5 million comments downloaded before it returns a code 400. Is this due to a bug in my code, or is the YouTube API rejecting my requests because it is clear they come from a script? Would adding a time.sleep call help with this?
(I bring this answer forward -- I prepared it for the question above at the time of its initial post -- because my assertions below seem to be confirmed once again by recent SO posts of this very kind.)
Your observations are correct. But, unfortunately, nobody but Google itself is able to provide a sound and complete answer to your question. We -- non-Googlers (like myself!), or even the Googlers themselves (since they all sign NDAs) -- can only guess about the things implied.
Here is my educated guess, based on the investigations I made recently when responding to a very much related question (which you quoted above, yourself!):
As you already know, the API uses pagination to return result sets whose cardinality exceeds the internal limit of 50 (or, in some cases, 100) items per API endpoint invocation.
If you log the nextPageToken property that you obtain from CommentThreads.list via your results object, you'll see that those page tokens get longer and longer. Each such page token has to be passed to the next CommentThreads.list call as the pageToken parameter.
The problem is that internally (not specified publicly, not documented) the API has a limit on the sheer length of the HTTP requests it accepts from its callers. (This happens for various reasons; e.g. security.) Therefore, when a given page token is sufficiently long, the HTTP request that the API user issues will exceed that internal limit, producing an internal error. That error surfaces to the API caller as the processingFailure error that you've encountered.
Many questions remain to be answered (e.g. why do the page tokens have unbounded length?), but, again, those questions belong very much to the internal realm of the back-end system behind the API we're using. And those questions cannot be answered publicly, since they are very much Google's internal business.
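If you want to see this for yourself, you can log how long the page token and the full encoded request URL get on every page. A small diagnostic sketch (the helper name is made up; requests exposes the final URL it actually sent as the url attribute of the response):
import requests

YOUTUBE_COMMENTS_URL = 'https://www.googleapis.com/youtube/v3/commentThreads'

def fetch_and_log_request_size(params, headers=None):
    # Diagnostic helper: watch the pageToken and the encoded request URL grow
    # from page to page until the request is rejected.
    resp = requests.get(YOUTUBE_COMMENTS_URL, params=params, headers=headers)
    token = params.get('pageToken', '')
    print("pageToken length:", len(token), "| request URL length:", len(resp.url))
    return resp.json()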

import start_urls from a csv file in Scrapy

I recently started web-scraping using Scrapy. I generated a list of URLs that I want to scrape into a txt document, separated by new lines. This is my crawler code:
import scrapy
import csv
import sys
from realtor.items import RealtorItem
from scrapy.spider import BaseSpider
#from scrapy.selector import HtmlXPathSelector
#from realtor.items import RealtorItem

class RealtorSpider(scrapy.Spider):
    name = "realtor"
    allowed_domains = ["realtor.com"]

    with open('realtor2.txt') as f:
        start_urls = [url.strip() for url in f.readlines()]

    def parse(self, response):
        #hxs = HtmlXPathSelector(response)
        #sites = hxs.select('//div/li/div/a/@href')
        sites = response.xpath('//a[contains(@href, "/realestateandhomes-detail/")]')
        items = []
        for site in sites:
            print(site.extract())
            item = RealtorItem()
            item['link'] = site.xpath('@href').extract()
            items.append(item)
        return items
Now my goal is to read the links from realtor2.txt and start parsing them; however, I get a ValueError: Missing scheme in request URL:
File "C:\Users\Ash\Anaconda2\lib\site-packages\scrapy\http\request\__init__.py", line 58, in _set_url
raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url:
%FF%FEw%00w%00w%00.%00r%00e%00a%00l%00t%00o%00r%00.%00c%00o%00m%00/%00r%00e%00a%00l%00e%00s%00t%00a%00t%00e%00a%00n%00d%00h%00o%00m%00e%00s%00-%00d%00e%00t%00a%00i%00l%00/%005%000%00-%00M%00e%00n%00o%00r%00e%00s%00-%00A%00v%00e%00-%00A%00p%00t%00-%006%001%000%00_%00C%00o%00r%00a%00l%00-%00G%00a%00b%00l%00e%00s%00_%00F%00L%00_%003%003%001%003%004%00_%00M%005%003%008%000%006%00-%005%008%006%007%007%00%0D%00
2017-06-25 22:28:35 [scrapy.core.engine] INFO: Closing spider (finished)
I think there may be an issue with how start_urls is defined, but I don't know how to proceed.
"ValueError: Missing scheme in request url" means that the URLs are missing the http:// prefix.
You can use urljoin to avoid this problem.
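As a sketch (assuming realtor2.txt holds one bare URL per line without the http:// prefix, as the error message suggests), you could prepend the scheme when building start_urls:
import scrapy
from realtor.items import RealtorItem

class RealtorSpider(scrapy.Spider):
    name = "realtor"
    allowed_domains = ["realtor.com"]

    with open('realtor2.txt') as f:
        # add the missing scheme so Scrapy gets fully qualified URLs
        start_urls = [url.strip() if url.strip().startswith('http')
                      else 'http://' + url.strip()
                      for url in f if url.strip()]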

How to get all the URLs from this page?

I am working on a Scrapy project to get all the URLs from this page and the following pages, but when I run the spider, I only get one URL from each page! I wrote a for loop to get them all, but nothing changed.
I also need to get each ad's data into a row in a CSV file; how do I do that?
The spider code:
import datetime
import urlparse
import socket
import re

from scrapy.loader.processors import MapCompose, Join
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.loader import ItemLoader

from cars2buy.items import Cars2BuyItem

class Cars2buyCarleasingSpider(CrawlSpider):
    name = "cars2buy-carleasing"
    start_urls = ['http://www.cars2buy.co.uk/business-car-leasing/']

    rules = (
        Rule(LinkExtractor(allow=("Abarth"), restrict_xpaths='//*[@id="content"]/div[7]/div[2]/div/a')),
        Rule(LinkExtractor(allow=("695C"), restrict_xpaths='//*[@id="content"]/div/div/p/a'), callback='parse_item', follow=True),
        Rule(LinkExtractor(restrict_xpaths='//*[@class="next"]'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        for l in response.xpath('//*[@class="viewthiscar"]/@href'):
            item = Cars2BuyItem()
            item['Company'] = l.extract()
            item['url'] = response.url
            return item
The output is:
> 2017-04-27 20:22:39 [scrapy.core.scraper] DEBUG: Scraped from <200
> http://www.cars2buy.co.uk/business-car-leasing/Abarth/695C/>
> {'Company':
> u'/clicks_cache_car_lease.php?url=http%3A%2F%2Fwww.fleetprices.co.uk%2Fbusiness-lease-cars%2Fabarth%2F695-cabriolet%2F14-t-jet-165-xsr-2dr-204097572&broker=178&veh_id=901651523&type=business&make=Abarth&model=695C&der=1.4
> T-Jet 165 XSR 2dr', 'url':
> 'http://www.cars2buy.co.uk/business-car-leasing/Abarth/695C/'}
> 2017-04-27 20:22:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET
> http://www.cars2buy.co.uk/business-car-leasing/Abarth/695C/?leaf=2>
> (referer: http://www.cars2buy.co.uk/business-car-leasing/Abarth/695C/)
> 2017-04-27 20:22:40 [scrapy.core.scraper] DEBUG: Scraped from <200
> http://www.cars2buy.co.uk/business-car-leasing/Abarth/695C/?leaf=2>
> {'Company':
> u'/clicks_cache_car_lease.php?url=http%3A%2F%2Fwww.jgleasing.co.uk%2Fbusiness-lease-cars%2Fabarth%2F695-cabriolet%2F14-t-jet-165-xsr-2dr-207378762&broker=248&veh_id=902250527&type=business&make=Abarth&model=695C&der=1.4
> T-Jet 165 XSR 2dr', 'url':
> 'http://www.cars2buy.co.uk/business-car-leasing/Abarth/695C/?leaf=2'}
> 2017-04-27 20:22:40 [scrapy.core.engine] INFO: Closing spider
> (finished)
The problem is that once your for loop has processed the first item, it hits the return, leaves the parse_item method, and so no other items are processed.
I suggest you replace the return with yield:
def parse_item(self, response):
    for l in response.xpath('//*[@class="viewthiscar"]/@href'):
        item = Cars2BuyItem()
        item['Company'] = l.extract()
        item['url'] = response.url
        yield item
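As a side note on the CSV part of the question: once the items are yielded, Scrapy's built-in feed exports can write them out without any extra code, e.g. scrapy crawl cars2buy-carleasing -o output.csv (the output filename here is just an example); each yielded item becomes one row.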

Scrapy: Not able to scrape pages with login

I want to scrape the pages shown below, but they need authentication.
I tried the code below, but it says 0 pages scraped.
I am not able to understand what the issue is.
Can someone help, please?
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule
from kappaal.items import KappaalItem

class KappaalCrawler(InitSpider):
    name = "initkappaal"
    allowed_domains = ["http://www.kappaalphapsi1911.com/"]
    login_page = 'http://www.kappaalphapsi1911.com/login.aspx'
    #login_page = 'https://kap.site-ym.com/Login.aspx'
    start_urls = ["http://www.kappaalphapsi1911.com/search/newsearch.asp?cdlGroupID=102044"]

    rules = ( Rule(SgmlLinkExtractor(allow=r'-\w$'), callback='parseItems', follow=True), )
    #rules = ( Rule(SgmlLinkExtractor(allow=("*", ), restrict_xpaths=("//*[contains(@id, 'SearchResultsGrid')]",)), callback="parseItems", follow=True), )

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(response,
            formdata={'u': 'username', 'p': 'password'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if "Member Search Results" in response.body:
            self.log("Successfully logged in. Let's start crawling!")
            # Now the crawling can begin..
            self.initialized()
        else:
            self.log("Bad times :(")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parseItems(self, response):
        hxs = HtmlXPathSelector(response)
        members = hxs.select('/html/body/form/div[3]/div/table/tbody/tr/td/div/table[2]/tbody')
        print members
        items = []
        for member in members:
            item = KappaalItem()
            item['Name'] = member.select("//a/text()").extract()
            item['MemberLink'] = member.select("//a/@href").extract()
            #item['EmailID'] =
            #print item['Name'], item['MemberLink']
            items.append(item)
        return items
I got the response below after executing the scraper:
2013-01-23 07:08:23+0530 [scrapy] INFO: Scrapy 0.16.3 started (bot: kappaal)
2013-01-23 07:08:23+0530 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-01-23 07:08:23+0530 [scrapy] DEBUG: Enabled downloader middlewares:HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-01-23 07:08:23+0530 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-01-23 07:08:23+0530 [scrapy] DEBUG: Enabled item pipelines:
2013-01-23 07:08:23+0530 [initkappaal] INFO: Spider opened
2013-01-23 07:08:23+0530 [initkappaal] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-01-23 07:08:23+0530 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-01-23 07:08:23+0530 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-01-23 07:08:26+0530 [initkappaal] DEBUG: Crawled (200) <GET https://kap.site-ym.com/Login.aspx> (referer: None)
2013-01-23 07:08:26+0530 [initkappaal] DEBUG: Filtered offsite request to 'kap.site-ym.com': <GET https://kap.site-ym.com/search/all.asp?bst=Enter+search+criteria...&p=P%40ssw0rd&u=9900146>
2013-01-23 07:08:26+0530 [initkappaal] INFO: Closing spider (finished)
2013-01-23 07:08:26+0530 [initkappaal] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 231,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 23517,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2013, 1, 23, 1, 38, 26, 194000),
'log_count/DEBUG': 8,
'log_count/INFO': 4,
'request_depth_max': 1,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2013, 1, 23, 1, 38, 23, 542000)}
2013-01-23 07:08:26+0530 [initkappaal] INFO: Spider closed (finished)
I do not understand why it is not authenticating and parsing the start url as mentioned.
Also make sure you have cookies enabled, so that the session stays logged in after you log in. Put
COOKIES_ENABLED = True
COOKIES_DEBUG = True
in your settings.py file.
I fixed it like this:
def start_requests(self):
    return self.init_request()

def init_request(self):
    return [Request(url=self.login_page, callback=self.login)]

def login(self, response):
    return FormRequest.from_response(response, formdata={'username': 'username', 'password': 'password'}, callback=self.check_login_response)

def check_login_response(self, response):
    if "Logout" in response.body:
        for url in self.start_urls:
            yield self.make_requests_from_url(url)
    else:
        self.log("Could not log in...")
By overriding start_requests, you ensure that the login process completes correctly and that scraping starts only afterwards.
I'm using a CrawlSpider with this and it works perfectly! Hope it helps.
Okay, so there are a couple of problems that I can see. However, I can't test the code without the username and password. Is there a dummy account that can be used for testing purposes?
InitSpider doesn't implement rules, so whilst it won't cause a problem, that should be removed.
The check_login_response method needs to return something.
To wit:
def check_login_response(self, response):
    """Check the response returned by a login request to see if we are
    successfully logged in.
    """
    if "Member Search Results" in response.body:
        self.log("Successfully logged in. Let's start crawling!")
        # Now the crawling can begin..
        return self.initialized()
    else:
        self.log("Bad times :(")
        # Something went wrong, we couldn't log in, so nothing happens.
        return
This may not be the response you're looking for, but I feel your pain...
I ran into this same issue and felt the Scrapy documentation was not sufficient. I ended up using mechanize to log in. If you figure it out with Scrapy, great; if not, mechanize is pretty straightforward.
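For completeness, a minimal mechanize login sketch might look like the following (the form index and the 'u'/'p' field names are taken from the question's formdata and are only assumptions; inspect the real login form to get the right ones):
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)   # ignore robots.txt if the site serves one
br.open('http://www.kappaalphapsi1911.com/login.aspx')
br.select_form(nr=0)          # assumption: the login form is the first form on the page
br['u'] = 'your_username'
br['p'] = 'your_password'
response = br.submit()        # cookies stay on the Browser instance
print response.geturl()       # fetch subsequent pages with br.open(...)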

dnspython how to lookup serial of subdomain?

I have a little issue I cannot seem to get my head around. I am trying to query the serial number of a subdomain. I keep getting a NoAnswer error, though it works fine on root domains. It's easier if I just show you:
import socket, dns.resolver

host = "google.com"
querytype = "SOA"
cachingserverslist = {'server1': '4.1.1.1', 'server2': '4.2.2.2'}

for cachingservername, cachingserver in sorted(cachingserverslist.iteritems()):
    query = dns.resolver.Resolver()
    query.nameservers = [socket.gethostbyname(cachingserver)]
    query.Timeout = 2.0

    for a in query.query(host, querytype):
        print a.serial
This gives me the expected result. What I don't understand is that when I change the host variable to any subdomain (or a www name), it errors out with NoAnswer. Here is an IPython session which shows what I mean:
In [1]: import socket, dns.resolver
In [2]: host = "google.com"
In [3]: querytype = "SOA"
In [4]: cachingserverslist = {'server1': '4.1.1.1', 'server2': '4.2.2.2'}
In [5]: for cachingservername, cachingserver in sorted(cachingserverslist.iteritems()) :
...: query = dns.resolver.Resolver()
...: query.nameservers=[socket.gethostbyname(cachingserver)]
...: query.Timeout = 2.0
...:
In [6]: for a in query.query( host , querytype ) :
...: print a.serial
...:
2011121901
In [7]:
In [8]: host = "www.google.com"
In [9]: for a in query.query( host , querytype ) :
print a.serial
....:
....:
---------------------------------------------------------------------------
NoAnswer Traceback (most recent call last)
/var/www/pydns/<ipython console> in <module>()
/usr/local/lib/python2.6/dist-packages/dns/resolver.pyc in query(self, qname, rdtype, rdclass, tcp, source, raise_on_no_answer)
707 raise NXDOMAIN
708 answer = Answer(qname, rdtype, rdclass, response,
--> 709 raise_on_no_answer)
710 if self.cache:
711 self.cache.put((qname, rdtype, rdclass), answer)
/usr/local/lib/python2.6/dist-packages/dns/resolver.pyc in __init__(self, qname, rdtype, rdclass, response, raise_on_no_answer)
127 except KeyError:
128 if raise_on_no_answer:
--> 129 raise NoAnswer
130 if raise_on_no_answer:
131 raise NoAnswer
NoAnswer:
Any insight would be most appreciated. Thanks.
The serial number is an attribute of the SOA 'start of authority' record. www.google.com is a CNAME, so it doesn't have a serial number associated with it.
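If you want the serial of the zone that actually covers a name like www.google.com, one approach (a minimal sketch using dnspython's zone_for_name; note that in dnspython 2.x query() is called resolve()) is to look up the enclosing zone first and then query its SOA record:
import dns.resolver

host = "www.google.com"
zone = dns.resolver.zone_for_name(host)   # the closest enclosing zone, e.g. google.com.
answer = dns.resolver.query(zone, "SOA")
for rdata in answer:
    print rdata.serial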
