I am developing a spider with several fields using the Scrapy framework. When I export the scraped fields to a .csv file, the fields (columns) come out in the wrong order, not in the order I defined them in the items.py file.
Does anyone know how to solve this issue?
Thanks in advance.
class myspider(BaseSpider):
    filehandle1 = open('file.xls', 'w')
    # ...

    def parse(self, response):
        # ...
        self.filehandle2.write("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" % (
            item['a'], item['h'], item['g'], item['f'],
            item['e'], item['d'], item['c'], item['b']))
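If your Scrapy version supports it, the usual way to control the CSV column order is the FEED_EXPORT_FIELDS setting rather than writing the file by hand. A minimal sketch, assuming the item fields are named a through h as in the snippet above:

# settings.py -- list the fields in the order the CSV columns should appear
FEED_EXPORT_FIELDS = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']

The built-in CSV feed exporter then writes exactly those columns in that order when you run scrapy crawl myspider -o file.csv.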
I'm stuck trying to find a way to make my spider work. This is the scenario: I'm trying to find all the URLs of a specific domain that are contained in a particular target website. For this, I've defined a couple of rules so I can crawl the site and find out the links of my interest.
The thing is that it doesn't seem to work, even when I know that there are links with the proper format inside the website.
This is my spider:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class sp(CrawlSpider):
    name = 'sp'
    start_urls = ['https://nationalpavementexpo.com/show/floor-plan-exhibitor-list/']
    custom_settings = {
        'LOG_LEVEL': 'INFO',
        'DEPTH_LIMIT': 4
    }
    rules = (
        Rule(LinkExtractor(unique=True, allow_domains='a2zinc.net'), callback='parse_item'),
        Rule(LinkExtractor(unique=True, canonicalize=True, allow_domains='nationalpavementexpo.com'))
    )

    def parse_item(self, response):
        print(response.request.url)
        yield {'link': response.request.url}
So, in summary, I'm trying to find all the links from 'a2zinc.net' contained inside https://nationalpavementexpo.com/show/floor-plan-exhibitor-list/ and its subsections.
As you guys can see, there are at least 3 occurrences of the desired links inside the target website.
The funny thing is that when I test the spider using another target site (like this one) that also contains links of interest, it works as expected and I can't really see the difference.
Also, if I define a Link Extractor instance (as in the snippet below) inside a parsing method, it is also capable of finding the desired links, but I think this won't be the best way of using CrawlSpider + Rules.
def parse_item(self, response):
    le = LinkExtractor(allow_domains='a2zinc.net')
    links = le.extract_links(response)
    for link in links:
        yield {'link': link.url}
Any idea what the cause of the problem could be?
Thanks a lot.
Your code works. The only issue is that you have set the logging level to INFO, while the links that are being extracted return status code 403, which is only visible at the DEBUG level. Comment out your custom settings and you will see that the links are being extracted.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class sp(CrawlSpider):
    name = 'sp'
    start_urls = ['https://nationalpavementexpo.com/show/floor-plan-exhibitor-list/']
    custom_settings = {
        # 'LOG_LEVEL': 'INFO',
        # 'DEPTH_LIMIT': 4
    }
    rules = (
        Rule(LinkExtractor(allow_domains='a2zinc.net'), callback='parse_item'),
        Rule(LinkExtractor(unique=True, canonicalize=True, allow_domains='nationalpavementexpo.com'))
    )

    def parse_item(self, response):
        print(response.request.url)
        yield {'link': response.request.url}
OUTPUT: (the original answer shows the crawl log with the a2zinc.net links being extracted)
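As a side note, and this is my own assumption rather than part of the answer above: if you also need the 403 responses to reach parse_item instead of being filtered out, two common options are to whitelist the status on the spider or to send a browser-like User-Agent, for example:

class sp(CrawlSpider):
    name = 'sp'
    # pass 403 responses on to the callbacks instead of dropping them
    handle_httpstatus_list = [403]
    custom_settings = {
        # a browser-like User-Agent sometimes avoids the 403 in the first place
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    }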
I have Scrapy spider code which scrapes a webpage and pulls the YouTube video links into a file. I am trying to get Scrapy to output the URLs as plain strings rather than an array of quoted strings.
That way my output is one URL without quotes, and then I wish to append text after the URL: ",&source=Open YouTube Playlist".
That way I can load the full URL into a WordPress web player, natively or via a plugin, and it will auto-create a YouTube playlist out of my output.
Maybe I am not thinking clearly? Is there a better way to accomplish the same goal?
import scrapy

class PdgaSpider(scrapy.Spider):
    name = "pdgavideos"
    start_urls = ["http://www.pdga.com/videos/"]

    def parse(self, response):
        for link in response.xpath('//td[2]/a/@href').extract():
            yield scrapy.Request(response.urljoin(link),
                                 callback=self.parse_page)

        # If page contains link to next page extract link and parse
        next_page = response.xpath(
            '//a[contains(., "Go to page 2")]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page),
                                 callback=self.parse)

    # Youtube link 1st pass
    def parse_page(self, response):
        link = response.xpath('//iframe/@src').extract_first()
        linkprune = link.split('/embed/')[1]
        output = linkprune.split('?')[0]
        yield {
            'https://www.youtube.com/watch_videos?video_ids=': output + ','
        }
Current Output
https://www.youtube.com/watch_videos?video_ids=
"mueStjvHneI,"
"X7HfQL4fYgQ,"
"UtnR4gPMs_Q,"
"Kd9pbiKQqr4,"
"AokjaT-CnBk,"
"VdvhAsX6buo,"
"pF-XykcAqz8,"
"Fl0DDmx-jZw,"
"dpzLDiuQq9o,"
"J2_bl0zI504,"
...
Aiming to achieve
https://www.youtube.com/watch_videos?video_ids=mueStjvHneI,X7HfQL4fYgQ,UtnR4gPMs_Q,Kd9pbiKQqr4,VdvhAsX6buo,pF-XykcAqz8,dpzLDiuQq9o,&source=Open YouTube Playlist
If you load this URL, it will create a beautiful Youtube list.
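One possible approach (a sketch of my own, not taken from the post) is to collect the video IDs on the spider instead of yielding them one by one, and to build the full playlist URL once the crawl finishes; the closed() hook used below is the standard Scrapy shortcut that runs when the spider closes:

import scrapy

class PdgaSpider(scrapy.Spider):
    name = "pdgavideos"
    start_urls = ["http://www.pdga.com/videos/"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.video_ids = []  # video IDs collected across all pages

    def parse(self, response):
        for link in response.xpath('//td[2]/a/@href').extract():
            yield scrapy.Request(response.urljoin(link), callback=self.parse_page)

        next_page = response.xpath(
            '//a[contains(., "Go to page 2")]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def parse_page(self, response):
        link = response.xpath('//iframe/@src').extract_first()
        if link and '/embed/' in link:
            self.video_ids.append(link.split('/embed/')[1].split('?')[0])

    def closed(self, reason):
        # runs once when the crawl finishes: build a single playlist URL
        playlist = ('https://www.youtube.com/watch_videos?video_ids='
                    + ','.join(self.video_ids)
                    + '&source=Open YouTube Playlist')
        self.logger.info(playlist)

Note that closed() cannot yield items into the normal feed export, so the sketch just logs the final URL; writing it straight to a file at that point would work the same way.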
I'm writing a crawler to get some pages from Yelp. I define the Yelp Item like this:
yelpItem.py:
import scrapy
class YelpItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    link = scrapy.Field()
In the spider file, I use YelpItem in the parse function:
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//h3/span/a[contains(@class, "biz-name")]')
    items = []
    for site in sites:
        item = YelpItem()
When running it, it says:
NameError: global name 'YelpItem' is not defined
I searched several webpages and tried adding code like:
from hw1.items import YelpItem
(hw1 is my project name), but it does not help. It leads to an error like: No module named items
Can anyone help me to figure out how to deal with this? Thanks!
Use
from hw1.yelpItem import YelpItem
When you try from hw1.items, you are referencing the items.py file, but your YelpItem is in the yelpItem.py file, so you have to update the import path accordingly.
You can read about the background of why this is so here.
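For illustration, here is a minimal sketch of what the spider file could look like with the corrected import; the file name, spider name, start URL, and XPath details are assumptions based on the snippets in the question:

# hw1/spiders/yelp_spider.py  (hypothetical file name)
import scrapy
from hw1.yelpItem import YelpItem  # import from yelpItem.py, not items.py

class YelpSpider(scrapy.Spider):
    name = "yelp"
    start_urls = ["https://www.yelp.com/search?find_desc=restaurants"]

    def parse(self, response):
        for site in response.xpath('//h3/span/a[contains(@class, "biz-name")]'):
            item = YelpItem()
            item['name'] = site.xpath('text()').extract_first()
            item['link'] = site.xpath('@href').extract_first()
            yield item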
I am brand new to Scrapy, have worked my way through the tutorial, and am trying to figure out how to apply what I have learned so far to a seemingly basic task. I know very little Python so far and am using this as a learning experience, so if I ask a simple question, I apologize.
My goal for this program is to follow this link http://ucmwww.dnr.state.la.us/ucmsearch/FindDocuments.aspx?idx=xwellserialnumber&val=971683 and to extract the well serial number to a csv file. Eventually I want to run this spider on several thousand different well files and retrieve specific data. However, I am starting with the basics first.
Right now the spider doesn't crawl any web page that I enter. There are no errors when I run it; it just states that 0 pages were crawled. I can't quite figure out what I am doing wrong. I am positive the start URL is fine, as I have checked it. Do I need a specific type of spider to accomplish what I am trying to do?
import scrapy
from scrapy import Spider
from scrapy.selector import Selector

class Sonrisdataaccess(Spider):
    name = "serial"
    allowed_domains = ["sonris.com"]
    start_urls = [
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=972498"]

    def parse(self, response):
        questions = Selector(response).xpath('/html/body/table[1]/tbody/tr[2]/td[1]')
        for question in questions:
            item = SonrisdataaccessItem()
            item['serial'] = question.xpath('/html/body/table[1]/tbody/tr[2]/td[1]').extract()[0]
            yield item
Thank you for any help, I greatly appreciate it!
First of all, I do not understand what you are doing in your for loop: if you already have a selector, you should not select the whole HTML document again inside it.
Nevertheless, the interesting part is that the browser represents the table quite differently from what Scrapy downloads. If you look at the response in your parse method, you will see that there is no tbody element in the first table. This is why your selection does not return anything.
So to get the first serial number (as it is in your XPath) change your parse function to this:
def parse(self, response):
    item = SonrisdataaccessItem()
    item['serial'] = response.xpath('/html/body/table[1]/tr[2]/td[1]/text()').extract()[0]
    yield item
For later changes you may have to alter the XPath expression to get more data.
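For example, to pull a second value out of the same row you could extend the same pattern; the well_name field and the td index below are assumptions, so adjust them to the actual table and to the fields defined in your item:

def parse(self, response):
    item = SonrisdataaccessItem()
    # remember: the downloaded HTML has no tbody, even though the browser shows one
    item['serial'] = response.xpath('/html/body/table[1]/tr[2]/td[1]/text()').extract_first()
    item['well_name'] = response.xpath('/html/body/table[1]/tr[2]/td[2]/text()').extract_first()
    yield item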
I want to search for all documents inside a fairly large Plone site that contain a specific snippet of html in the body (list items with headings inside them, urgh ...) and then change that html (drop the headings).
Pointers on how to do that are much appreciated!
You should create a browser view (or run the instance in debug mode) and run this code:
from Products.CMFCore.utils import getToolByName
import re

ctool = getToolByName(context, 'portal_catalog')
results = ctool.searchResults(portal_type='Document')

for i in results:
    obj = i.getObject()
    field = obj.getField('text')
    text = field.get(obj)
    # find and remove your html with the re module, e.g. drop headings
    # that sit directly inside list items (adjust the pattern to your markup)
    new_text = re.sub(r'<li>\s*<h\d[^>]*>(.*?)</h\d>', r'<li>\1', text, flags=re.DOTALL)
    if new_text != text:
        field.set(obj, new_text)
        obj.reindexObject()
If you need to do this many times, you could consider adding a custom index to simplify the job.
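If only a small share of the pages contain the snippet, one way to avoid loading every Document is to pre-filter the catalog query with the SearchableText index; this is my own suggestion, not part of the answer above, and it only narrows the candidates, since SearchableText indexes plain text rather than markup:

# only wake up objects whose text mentions the heading wording at all
# ('the heading text to drop' is a placeholder for your actual wording)
results = ctool.searchResults(portal_type='Document',
                              SearchableText='the heading text to drop')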
I have not tried it in a while, but check out GoReplace