How to stop the Crawler - web-scraping

I am trying to write a crawler that goes to a website and searches for a list of keywords, with max_Depth of 2. But the scraper is supposed to stop once any of the keyword's appears on any page, the problem i am facing right now is that the crawler does-not stop when it first see's any of the keywords.
Even after trying, early return command, break command and CloseSpider Commands and even python exit commands.
My class of the Crawler:
class WebsiteSpider(CrawlSpider):
name = "webcrawler"
allowed_domains = ["www.roomtoread.org"]
start_urls = ["https://"+"www.roomtoread.org"]
rules = [Rule(LinkExtractor(), follow=True, callback="check_buzzwords")]
crawl_count = 0
words_found = 0
def check_buzzwords(self, response):
self.__class__.crawl_count += 1
crawl_count = self.__class__.crawl_count
wordlist = [
"sfdc",
"pardot",
"Web-to-Lead",
"salesforce"
]
url = response.url
contenttype = response.headers.get("content-type", "").decode('utf-8').lower()
data = response.body.decode('utf-8')
for word in wordlist:
substrings = find_all_substrings(data, word)
for pos in substrings:
ok = False
if not ok:
if self.__class__.words_found==0:
self.__class__.words_found += 1
print(word + "," + url + ";")
STOP!
return Item()
def _requests_to_follow(self, response):
if getattr(response, "encoding", None) != None:
return CrawlSpider._requests_to_follow(self, response)
else:
return []
I want it to stop execution when if not ok: is True.

When I want to stop a spider, I usually use the exception exception scrapy.exceptions.CloseSpider(reason='cancelled') from Scrapy-Docs.
The example there shows how you can use it:
if 'Bandwidth exceeded' in response.body:
raise CloseSpider('bandwidth_exceeded')
In your case something like
if not ok:
raise CloseSpider('keyword_found')
Or is that what you meant with
CloseSpider Commands
and already tried it?

Related

Scrapy isnt scraping the next page

I am trying to scrape article news from skynewsarabia.com
class SkyNewsSportsSpider(scrapy.Spider):
name = 'sky_news_sports'
sport = "https://www.skynewsarabia.com/sport/"
custom_settings = {
'FEED_EXPORT_FIELDS': ["article_content", "tags"],
}
allowed_domains = ['www.skynewsarabia.com']
first_token = "1569266773000"
scrape_this_link = "https://api.skynewsarabia.com//rest/v2/latest.json?defaultSectionId=6&nextPageToken={}&pageSize=20&types=ARTICLE"
start_urls = [scrape_this_link.format(first_token)]
urls = []
def parse(self, response):
articles = json.loads(response.text)
# to get the link for each article we need to combine both the id and the urlFriendlySuffix in one link
for article in range(0, len(articles["contentItems"])):
article_id = articles["contentItems"][article]["id"]
article_url = articles["contentItems"][article]["urlFriendlySuffix"]
relative_link = article_id + "-" + article_url
full_link = self.sport + relative_link
self.urls.append(full_link)
for url in self.urls:
yield scrapy.Request(url=url, callback=self.parse_details)
self.urls = []
print("Before Check")
self.first_token = articles["nextPageToken"]
if self.first_token is not None:
next_page = self.scrape_this_link.format(self.first_token)
print("I am inside!")
print(next_page)
yield response.follow(url=next_page, callback=self.parse)
def parse_details(self, response):
pass
The basic idea here is that you first scrape a link which has 20 links. besides that, the first link has also a token for the next link which you need to add to the next URL so you can scrape the next 20 links. However, the problem I am facing is that when you first run the script, it is taking the next token and get all the links of that token and then it stops! so I am just scraping 20 links only! when I print the first_token it's giving me something different than 1569266773000 which is provided by default in the script.
You need to change allowed_domains = ['www.skynewsarabia.com'] to allowed_domains = ['skynewsarabia.com']. Alternatively remove the allowed_domains variable completely.
Since you have specified the hostname www Scrapy filters the requests to api.skynewsarabia.com as offsite and the calls are just being dropped.
Additional tip: Try to use self.logger.info and self.logger.debug instead of the print commands in your code.

How can I start a brand new request in Scrapy crawler?

I am scraping from a website that will give every request session a sid, after getting the sid, I perform further search query with this sid and scrape the results.
I want to change the sid every time I've finished scraping all results of a single query, I've tried clearing the cookies but it doesn't work.
However, if I restart my crawler, it wll get a different sid each time, I just don't know how to get a new sid without restart the crawler.
I am wondering if there're something else that let the server know two requests are from the same connection.
Thanks!
Here is my current code:
class MySpider(scrapy.Spider):
name = 'my_spider'
allowed_domains = ['xxx.com']
start_urls = ['http://xxx/']
sid_pattern = r'SID=(\w+)&'
SID = None
query_list = ['aaa', 'bbb', 'ccc']
i = 0
def parse(self, response):
if self.i >= len(self.query_list):
return
pattern = re.compile(self.sid_pattern)
result = re.search(pattern, response.url)
if result is not None:
self.SID = result.group(1)
else:
exit(-1)
search_url = 'http://xxxx/AdvancedSearch.do'
query = self.query_list[i]
self.i += 1
query_form = {
'aaa':'bbb'
}
yield FormRequest(adv_search_url, method='POST', formdata=query_form, dont_filter=True,
callback=self.parse_result_entry)
yield Request(self.start_urls[0], cookies={}, callback=self.parse,dont_filter=True)
def parse_result(self, response):
do something
Setting COOKIES_ENABLED = False can achieve this, but is there another way other than a global settings?

When to return an Item if I don't know when the spider will finish?

So my spider takes in a list of websites, and it crawls through each one via start_requests which yield request passing in item as meta.
Then, the spider explores all the internal links of a single website and collects all the external links into the item. The problem is that I don't know when the spider finishes crawling all the internal links, so I can't yield an item.
class WebsiteSpider(scrapy.Spider):
name = "web"
def start_requests(self):
filename = "websites.csv"
requests = []
try:
with open(filename, 'r') as csv_file:
reader = csv.reader(csv_file)
header = next(reader)
for row in reader:
seed_url = row[1].strip()
item = Links(base_url=seed_url, on_list=[])
request = Request(seed_url, callback=self.parse_seed)
request.meta['item'] = item
requests.append(request)
return requests
except IOError:
raise scrapy.exceptions.CloseSpider("A list of websites are needed")
def parse_seed(self, response):
item = response.meta['item']
netloc = urlparse(item['base_url']).netloc
external_le = LinkExtractor(deny_domains=netloc)
external_links = external_le.extract_links(response)
for external_link in external_links:
item['on_list'].append(external_link)
internal_le = LinkExtractor(allow_domains=netloc)
internal_links = internal_le.extract_links(response)
for internal_link in internal_links:
request = Request(internal_link, callback=self.parse_seed)
request.meta['item'] = item
yield request
the start_requests method needs to yield Request objects. You don't need to return a list of requests, but only yield a Request when it is ready, this works because scrapy requests are asynchronous.
The same happens with items, you just need to yield your items whenever you think the item is ready, I would recommend for your case to just check if there are no more internal_links to yield the item, or also you can as many items as you want, and then check which one was the last (or the one with more data):
class WebsiteSpider(scrapy.Spider):
name = "web"
def start_requests(self):
filename = "websites.csv"
requests = []
try:
with open(filename, 'r') as csv_file:
reader = csv.reader(csv_file)
header = next(reader)
for row in reader:
seed_url = row[1].strip()
item = Links(base_url=seed_url, on_list=[])
yield Request(seed_url, callback=self.parse_seed, meta = {'item'=item})
except IOError:
raise scrapy.exceptions.CloseSpider("A list of websites are needed")
def parse_seed(self, response):
item = response.meta['item']
netloc = urlparse(item['base_url']).netloc
external_le = LinkExtractor(deny_domains=netloc)
external_links = external_le.extract_links(response)
for external_link in external_links:
item['on_list'].append(external_link)
internal_le = LinkExtractor(allow_domains=netloc)
internal_links = internal_le.extract_links(response)
if internal_links:
for internal_link in internal_links:
request = Request(internal_link, callback=self.parse_seed)
request.meta['item'] = item
yield request
else:
yield item
another thing you could do is create an extension to do what you need on the spider_closed method and do whatever you want knowing when the spider ended.

Flask-WTF: Queries of FormFields in FieldList are none after validate_on_submit

I'm trying to generate dynamic forms using Flask-WTF to create a new product based on some templates. A product will have a list of required key-value pairs based on its type, as well as a list of parts required to build it. The current relevant code looks as follows:
forms.py:
class PartSelectionForm(Form):
selected_part = QuerySelectField('Part', get_label='serial', allow_blank=True)
part_type = StringField('Type')
slot = IntegerField('Slot')
required = BooleanField('Required')
def __init__(self, csrf_enabled=False, *args, **kwargs):
super(PartSelectionForm, self).__init__(csrf_enabled=False, *args, **kwargs)
class NewProductForm(Form):
serial = StringField('Serial', default='', validators=[DataRequired()])
notes = TextAreaField('Notes', default='')
parts = FieldList(FormField(PartSelectionForm))
views.py:
#app.route('/products/new/<prodmodel>', methods=['GET', 'POST'])
#login_required
def new_product(prodmodel):
try:
model = db.session.query(ProdModel).filter(ProdModel.id==prodmodel).one()
except NoResultFound, e:
flash('No products of model type -' + prodmodel + '- found.', 'error')
return redirect(url_for('index'))
keys = db.session.query(ProdTypeTemplate.prod_info_key).filter(ProdTypeTemplate.prod_type_id==model.prod_type_id)\
.order_by(ProdTypeTemplate.prod_info_key).all()
parts_needed = db.session.query(ProdModelTemplate).filter(ProdModelTemplate.prod_model_id==prodmodel)\
.order_by(ProdModelTemplate.part_type_id, ProdModelTemplate.slot).all()
class F(forms.NewProductForm):
pass
for key in keys:
if key.prod_info_key in ['shipped_os','factory_os']:
setattr(F, key.prod_info_key, forms.QuerySelectField(key.prod_info_key, get_label='version'))
else:
setattr(F, key.prod_info_key, forms.StringField(key.prod_info_key, validators=[forms.DataRequired()]))
form = F(request.form)
if request.method == 'GET':
for part in parts_needed:
entry = form.parts.append_entry(forms.PartSelectionForm())
entry.part_type.data=part.part_type_id
entry.slot.data=slot=part.slot
entry.required.data=part.required
entry.selected_part.query = db.session.query(Part).join(PartModel).filter(PartModel.part_type_id==part.part_type_id, Part.status=='inventory')
if form.__contains__('shipped_os'):
form.shipped_os.query = db.session.query(OSVersion).order_by(OSVersion.version)
if form.__contains__('factory_os'):
form.factory_os.query = db.session.query(OSVersion).order_by(OSVersion.version)
if form.validate_on_submit():
...
Everything works as expected on a GET request, but on the validate_on_submit I get errors. The error is that all of the queries and query_factories for the selected_part QuerySelectFields in the list of PartSelectionForms is none, causing either direct errors in WTForms validation code or when Jinja2 attempts to re-render the QuerySelectFields. I'm not sure why this happens on the POST when everything appears to be correct for the GET.
I realized that although I set the required queries on a GET I'm not doing it for any PartSelectionForm selected_part entries on the POST. Since I already intended part_type, slot, and required to be hidden form fields, I added the following immediately before the validate_on_submit and everything works correctly:
for entry in form.parts:
entry.selected_part.query = db.session.query(Part).join(PartModel).\
filter(PartModel.part_type_id==entry.part_type.data, Part.status=='inventory')

Get HTTP response body as string (BubbleWrap for RubyMotion)

Using RubyMotion (for the first time!), I want to use Twitter's search API to retrieve some recent tweets for some users so have put together the class below.
The value of tweets is always an empty array. I suspect that BW::HTTP.get(url) spawns its own thread which is causing the issue.
Really, I just want twitter_search_results to return response.body.to_str but I am not sure how to do this.
How do I use RubyMotion (or BubbleWrap) to put an array of Tweet objects into my UIViewController?
class TweetsController
def initialize
#twitter_accounts = %w(dhh google)
#tweets = []
end
def tweets
twitter_search_results
puts #tweets.count
#tweets
end
def create_tweets(response)
BW::JSON.parse(response)["results"].each do |result|
#tweets << Tweet.new(result)
end
end
def twitter_search_results
query = #twitter_accounts.map{ |account| "from:#{account}" }.join(" OR ")
url = "http://search.twitter.com/search.json?q=#{query}"
BW::HTTP.get(url) do |response|
create_tweets(response.body.to_str)
end
end
end
class TwitterViewController < UIViewController
def viewDidLoad
super
self.view.backgroundColor = UIColor.blueColor
#table = UITableView.alloc.initWithFrame(self.view.bounds)
self.view.addSubview #table
#table.dataSource = self
#tweets_controller = TweetsController.new
end
def initWithNibName(name, bundle: bundle)
super
self.tabBarItem = UITabBarItem.alloc.initWithTitle(
"Twitter",
image: UIImage.imageNamed('twitter.png'),
tag: 1)
self
end
def tableView(tableView, numberOfRowsInSection: section)
#tweets_controller.tweets.length
end
def tableView(tableView, cellForRowAtIndexPath: indexPath)
#reuse_id = "Tweet"
cell = UITableViewCell.alloc.initWithStyle(UITableViewCellStyleDefault, reuseIdentifier:#reuse_id)
cell.textLabel.text = #tweets_controller.tweets[indexPath.row].text
return cell
end
end
class Tweet
attr_reader :created_at, :from_user, :text
def initialize(tweet_result)
#created_at = tweet_result["created_at"]
#from_user = tweet_result["from_user"]
#text = tweet_result["text"]
end
end
Full controller code below. I've also put the project on GitHub
class TweetsController
def initialize
#twitter_accounts = %w(dhh google)
#tweets = []
create_tweets
end
def tweets
#tweets
end
def create_tweets
json_data = twitter_search_results.dataUsingEncoding(NSUTF8StringEncoding)
e = Pointer.new(:object)
dict = NSJSONSerialization.JSONObjectWithData(json_data, options:0, error: e)
dict["results"].each do |result|
p result.class
p result
#tweets << Tweet.new(result)
end
end
def twitter_search_results
query = #twitter_accounts.map{ |account| "from:#{account}" }.join(" OR ")
url_string = "http://search.twitter.com/search.json?q=#{query}"
url_string_escaped = url_string.stringByAddingPercentEscapesUsingEncoding(NSUTF8StringEncoding)
url = NSURL.URLWithString(url_string_escaped)
request = NSURLRequest.requestWithURL(url)
response = nil
error = nil
data = NSURLConnection.sendSynchronousRequest(request, returningResponse: response, error: error)
raise "BOOM!" unless (data.length > 0 && error.nil?)
json = NSString.alloc.initWithData(data, encoding: NSUTF8StringEncoding)
end
end
the issue here is asynchronicity. you're almost there, I think, but the create_tweets method is not called before puts #tweets. In this case, I would recommend using a notification, because I think they are good ;-)
TweetsReady = 'TweetsReady' # constants are nice
NSNotificationCenter.defaultCenter.postNotificationName(TweetsReady, object:#tweets)
In your controller, register for this notification in `viewWillAppear` and unregister in `viewWillDisappear`
NSNotificationCenter.defaultCenter.addObserver(self, selector: 'tweets_ready:', name: TweetsReady, object:nil) # object:nil means 'register for all events, not just ones associated with 'object'
# ...
NSNotificationCenter.defaultCenter.removeObserver(self, name:TweetsReady, object:nil)
and you tweets_ready method should implement your UI changes.
def tweets_ready(notification)
#table.reloadData
end

Resources