I am using Scrapy to retrieve information about projects on https://www.indiegogo.com. I want to scrape all pages with the url format www.indiegogo.com/projects/[NameOfProject]. However, I am not sure how to reach all of those pages during a crawl. I can't find a master page that hardcodes links to all of the /projects/ pages. All projects seem to be accessible from https://www.indiegogo.com/explore (through visible links and the search function), but I cannot determine the set of links/search queries that would return all pages. My spider code is given below. These start_urls and rules scrape about 6000 pages, but I hear that there should be closer to 10x that many.
About the urls with parameters: The filter_quick parameter values used come from the "Trending", "Final Countdown", "New This Week", and "Most Funded" links on the Explore page and obviously miss unpopular and poorly funded projects. There is no max value on the per_page url parameter.
Any suggestions? Thanks!
class IndiegogoSpider(CrawlSpider):
    name = "indiegogo"
    allowed_domains = ["indiegogo.com"]
    start_urls = [
        "https://www.indiegogo.com/sitemap",
        "https://www.indiegogo.com/explore",
        "http://go.indiegogo.com/blog/category/campaigns-2",
        "https://www.indiegogo.com/explore?filter_browse_balance=true&filter_quick=countdown&per_page=50000",
        "https://www.indiegogo.com/explore?filter_browse_balance=true&filter_quick=new&per_page=50000",
        "https://www.indiegogo.com/explore?filter_browse_balance=true&filter_quick=most_funded&per_page=50000",
        "https://www.indiegogo.com/explore?filter_browse_balance=true&filter_quick=popular_all&per_page=50000"
    ]

    rules = (
        Rule(LinkExtractor(allow=('/explore?'))),
        Rule(LinkExtractor(allow=('/campaigns-2/'))),
        Rule(LinkExtractor(allow=('/projects/')), callback='parse_item'),
    )

    def parse_item(self, response):
        [...]
Sidenote: there are other URL formats www.indiegogo.com/projects/[NameOfProject]/[OtherStuff] that either redirect to the desired URL format or give 404 errors when I try to load them in the browser. I am assuming that Scrapy is handling the redirects and blank pages correctly, but would be open to hearing ways to verify this.
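For reference, one lightweight way to check this from inside the spider (assuming the default RedirectMiddleware and HttpErrorMiddleware are enabled; the logging lines are just an illustration) is to log the status and the redirect chain that Scrapy records on each request:

def parse_item(self, response):
    # RedirectMiddleware stores the chain of redirected URLs under the 'redirect_urls' meta key
    redirect_chain = response.request.meta.get('redirect_urls', [])
    if redirect_chain:
        self.logger.info("Redirected %s -> %s (status %s)",
                         redirect_chain[0], response.url, response.status)
    # Non-2xx responses (e.g. 404s) are dropped by HttpErrorMiddleware by default,
    # so only successfully fetched pages ever reach this callback.
    [...]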
Well, if you have the link to the sitemap, then it will be faster to let Scrapy fetch the pages from there and process them.
This will work something like the example below.
from scrapy.contrib.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/robots.txt']
    # you can set rules for extracting URLs under sitemap_rules
    sitemap_rules = [
        ('/shop/', 'parse_shop'),
    ]
    sitemap_follow = ['/sitemap_shops']

    def parse_shop(self, response):
        pass  # ... scrape shop here ...
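Adapted to your case, a minimal sketch could look like the following (the spider name, the assumption that Indiegogo's robots.txt references its sitemap, and the '/projects/' rule pattern are all assumptions you'd need to verify):

from scrapy.contrib.spiders import SitemapSpider

class IndiegogoSitemapSpider(SitemapSpider):
    name = "indiegogo_sitemap"
    # Assumption: the sitemap (or a sitemap index) is referenced from robots.txt
    sitemap_urls = ['https://www.indiegogo.com/robots.txt']
    # Only URLs containing /projects/ are passed to the callback
    sitemap_rules = [
        ('/projects/', 'parse_item'),
    ]

    def parse_item(self, response):
        # reuse the same parsing logic as in your CrawlSpider's parse_item
        pass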
Try the code below; it will crawl the site and only follow and parse the indiegogo.com/projects/ pages.
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from sitemap.items import myitem

class DmozSpider(CrawlSpider):
    name = 'indiego'
    allowed_domains = ['indiegogo.com']
    start_urls = [
        'http://indiegogo.com'
    ]
    # LinkExtractor's allow_domains only accepts bare domains, so the /projects/
    # restriction goes into the allow pattern instead
    rules = (Rule(LinkExtractor(allow=[r'/projects/']), callback='parse_items', follow=True),)

    def parse_items(self, response):
        item = myitem()
        item['link'] = response.request.url
        item['title'] = response.xpath('//title').extract()
        yield item
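With the name above you can then run the spider and export what it collects, for example (assuming the project is set up so that sitemap.items.myitem exists):

scrapy crawl indiego -o projects.json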
Hi, so I am crawling a website with articles, and within each article there is a link to a file. I managed to crawl all the article links; now I want to access each one and collect the link inside it, instead of having to save the result of the first crawl to a JSON file and then write another script.
The thing is, I am new to Scrapy, so I don't really know how to do that. Thanks in advance!
import scrapy

class SgbdSpider(scrapy.Spider):
    name = "sgbd"
    start_urls = [
        "http://www.sante.gouv.sn/actualites/"
    ]

    def parse(self, response):
        base = "http://www.sante.gouv.sn/actualites/"
        for link in response.css(".card-title a"):
            title = link.css("a::text").get()
            href = link.css("a::attr(href)").extract()
            # here instead of yield, i want to parse the href and then maybe yield the result of THAT parse.
            yield {
                "title": title,
                "href": href
            }
        # next step for each href, parse again and get link in that page for pdf file
        # pdf link can easily be collected with response.css(".file a::attr(href)").get()
        # then write that link in a json file
        next_page = response.css("li.pager-next a::attr(href)").get()
        if next_page is not None and next_page.split("?page=")[-1] != "35":
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
You can yield a request to those PDF links with a new callback that holds the extraction logic.
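As a rough sketch of that idea (parse_article and the item fields are illustrative names; response.follow and cb_kwargs assume Scrapy 1.7 or newer):

def parse(self, response):
    for link in response.css(".card-title a"):
        title = link.css("::text").get()
        href = link.css("::attr(href)").get()
        # follow the article and carry the title along to the next callback
        yield response.follow(href, callback=self.parse_article, cb_kwargs={"title": title})

    next_page = response.css("li.pager-next a::attr(href)").get()
    if next_page is not None and next_page.split("?page=")[-1] != "35":
        yield response.follow(next_page, callback=self.parse)

def parse_article(self, response, title):
    # the selector you mentioned for the file link
    pdf_link = response.css(".file a::attr(href)").get()
    yield {
        "title": title,
        "article": response.url,
        "pdf": response.urljoin(pdf_link) if pdf_link else None,
    }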
A crawl spider rather than the simple basic spider is better suited to handle this. The basic spider template is generated by default so you have to specify the template to use when generating the spider.
Assuming you've created the project & are in the root folder:
$ scrapy genspider -t crawl sgbd sante.sec.gouv.sn
Opening up the sgbd.py file, you'll notice the difference between it and the basic spider template.
If you're unfamiliar with XPath, here's a run-through
LinkExtractor & Rule will define your spider's behavior as per the documentation
Edit the file:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class SgbdSpider(CrawlSpider):
    name = 'sgbd'
    allowed_domains = ['sante.sec.gouv.sn']
    start_urls = ['https://sante.sec.gouv.sn/actualites']
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'

    def set_user_agent(self, request, spider):
        request.headers['User-Agent'] = self.user_agent
        return request

    # First rule gets the links to the articles; callback is the function executed after following the link to each article
    # Second rule handles pagination
    # Couldn't get it to work when passing css selectors in LinkExtractor as restrict_css,
    # used XPaths instead
    rules = (
        Rule(
            LinkExtractor(restrict_xpaths='//*[@id="main-content"]/div[1]/div/div[2]/ul/li[11]/a'),
            callback='parse_item',
            follow=True,
            process_request='set_user_agent',
        ),
        Rule(
            LinkExtractor(restrict_xpaths='//*[@id="main-content"]/div[1]/div/div[1]/div/div/div/div[3]/span/div/h4/a'),
            process_request='set_user_agent',
        )
    )

    # Extract title & link to pdf
    def parse_item(self, response):
        yield {
            'title': response.xpath('//*[@id="main-content"]/section/div[1]/div[1]/article/div[2]/div[2]/div/span/a/font/font/text()').get(),
            'href': response.xpath('//*[@id="main-content"]/section/div[1]/div[1]/article/div[2]/div[2]/div/span/a/@href').get()
        }
Unfortunately this is as far as I could go, as the site was inaccessible even with different proxies; it was taking too long to respond. You might have to tweak those XPaths a little further. Better luck on your side.
Run the spider & save output to json
$ scrapy crawl sgbd -o results.json
Parse links in another function. Then parse again in yet another function. You can yield whatever results you want in any of those functions.
I agree with what @bens_ak47 and @user9958765 said: use a separate function.
For example, change this:
yield scrapy.Request(next_page, callback=self.parse)
to this:
yield scrapy.Request(next_page, callback=self.parse_pdffile)
then add the new method:
def parse_pdffile(self, response):
    print(response.url)
I'm trying to scrape an online newspaper. I want to get all the URLs within the domain, and if there are any external URLs (articles from other domains) mentioned in an article, I may want to fetch those as well. In other words, I want to allow the spider to go to a depth of 3 (is that two clicks away from start_urls?). Can someone let me know if the snippet below is right or wrong?
Any help is greatly appreciated.
Here is my code snippet:
start_urls = ['www.example.com']
master_domain = tldextract.extract(start_urls[0]).domain
allowed_domains = ['www.example.com']

rules = (Rule(LinkExtractor(deny=(r"/search", r'showComment=', r'/search/')),
              callback="parse_item", follow=True),
         )

def parse_item(self, response):
    url = response.url
    master_domain = self.master_domain
    self.logger.info(master_domain)
    current_domain = tldextract.extract(url).domain
    referer = response.request.headers.get('Referer')
    depth = response.meta.get('depth')
    if current_domain == master_domain:
        yield {'url': url,
               'referer': referer,
               'depth': depth}
    elif current_domain != master_domain:
        if depth < 2:
            yield {'url': url,
                   'referer': referer,
                   'depth': depth}
        else:
            self.logger.debug('depth is greater than 3')
Open your settings and add:
DEPTH_LIMIT = 2
For more details, see the DEPTH_LIMIT entry in the Scrapy settings documentation.
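If you'd rather keep the limit with the spider than in settings.py, the same setting can be applied per spider through custom_settings; a minimal sketch (the spider name and start URL are placeholders):

from scrapy.spiders import CrawlSpider

class MySpider(CrawlSpider):
    name = 'example'
    start_urls = ['https://www.example.com']
    # Equivalent to DEPTH_LIMIT = 2 in settings.py, but scoped to this spider only
    custom_settings = {'DEPTH_LIMIT': 2}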
There is also no need to check the domain with
if current_domain == master_domain:
When you have allowed_domains set, the spider will automatically follow only the domains listed in allowed_domains.
Relatively new to Splash. I'm trying to scrape a website which needs a login. I started off with the Splash API for which I was able to login perfectly. However, when I put my code in a scrapy spider script, using SplashRequest, it's not able to login.
import scrapy
from scrapy_splash import SplashRequest

class Payer1Spider(scrapy.Spider):
    name = "payer1"
    start_url = "https://provider.wellcare.com/provider/claims/search"

    lua_script = """
    function main(splash,args)
        assert(splash:go(args.url))
        splash:wait(0.5)
        local search_input = splash:select('#Username')
        search_input:send_text('')
        local search_input = splash:select('#Password')
        search_input:send_text('')
        assert(splash:wait(0.5))
        local login_button = splash:select('#btnSubmit')
        login_button:mouse_click()
        assert(splash:wait(7))
        return {splash:html()}
    end
    """

    def start_requests(self):
        yield SplashRequest(self.start_url, self.parse_result, args={'lua_source': self.lua_script},)

    def parse_result(self, response):
        yield {'doc_title': response.text}
The output HTML is the login page and not the one after logging in.
You have to add endpoint='execute' to your SplashRequest to execute the Lua script:
yield SplashRequest(self.start_url, self.parse_result, args={'lua_source': self.lua_script}, endpoint='execute')
I believe you don't actually need Splash to log in to this site. You can try the following:
Get https://provider.wellcare.com and then...
# Get the request verification token
token = response.css('input[name=__RequestVerificationToken]::attr(value)').get()

# Forge the POST request payload
data = [
    ('__RequestVerificationToken', token),
    ('Username', 'user'),
    ('Password', 'pass'),
    ('ReturnUrl', '/provider/claims/search'),
]

# Make a dict from the list of tuples
formdata = dict(data)

# And then execute the request
scrapy.FormRequest(
    url='https://provider.wellcare.com/api/sitecore/Login',
    formdata=formdata
)
Not completely sure if all of this will work. But you can try.
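Wired into a spider, a minimal sketch of that idea might look like this (the method names and the yield of the FormRequest are my additions; the URLs and form fields are taken from the snippet above):

import scrapy

class Payer1LoginSpider(scrapy.Spider):
    name = "payer1_login"
    start_urls = ["https://provider.wellcare.com"]

    def parse(self, response):
        # Pull the request verification token from the login page
        token = response.css('input[name=__RequestVerificationToken]::attr(value)').get()
        yield scrapy.FormRequest(
            url='https://provider.wellcare.com/api/sitecore/Login',
            formdata={
                '__RequestVerificationToken': token,
                'Username': 'user',      # replace with real credentials
                'Password': 'pass',
                'ReturnUrl': '/provider/claims/search',
            },
            callback=self.after_login,
        )

    def after_login(self, response):
        # If the login succeeded, this should no longer be the login page
        yield {'doc_title': response.text}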
I am new to Scrapy and trying to scrape the https://socialblade.com/ website to get the channel IDs of the most viewed and most subscribed YouTubers of a country.
The way I am doing it is to click the link to a YouTuber on the main listing page (e.g. https://socialblade.com/youtube/top/country/pk/mostsubscribed). This opens a new page, and the last part of the newly opened page's URL contains the channel ID (e.g. https://socialblade.com/youtube/channel/UC4JCksJF76g_MdzPVBJoC3Q).
Here is my code:
import scrapy

class SocialBladeSpider(scrapy.Spider):
    name = "socialblade"

    def start_requests(self):
        urls = [
            'https://socialblade.com/youtube/top/country/pk/mostviewed',
            'https://socialblade.com/youtube/top/country/pk/mostsubscribed'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse_url(self, response):
        data = {
            'url': response.url.split('/')[-1],
            'displayName': response.css('div#YouTubeUserTopInfoBlockTop div h1::text').extract_first()
        }
        yield {
            response.meta['country']: {
                response.meta['key']: data
            }
        }

    def parse(self, response):
        key = response.url.split("/")[-1]
        country = response.url.split("/")[-2]
        for a in response.css('a[href^="/youtube/user/"]'):
            request = scrapy.Request(url='https://socialblade.com' + a.css('::attr(href)').extract_first(), callback=self.parse_url)
            request.meta['key'] = key
            request.meta['country'] = country
            yield request
The issue is that after scraping these two URLs I should get 500 records in total, but I am only getting 348. I did some research but was unable to find a solution.
Does anyone have any advice on how to solve this?
Pass dont_filter=True to your requests if you do not want to filter out duplicate requests.
For more information, see the documentation about Request.
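In your parse method that would be a small change to the request you build for each channel page, something like:

request = scrapy.Request(
    url='https://socialblade.com' + a.css('::attr(href)').extract_first(),
    callback=self.parse_url,
    dont_filter=True,  # do not let the duplicate filter drop channels that appear in both lists
)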
I've been trying to scrape some lists from this website, http://www.golf.org.au; it's ASP.NET based. I did some research and it appears that I must pass some values in a POST request to make the website fetch the data into the tables. I did that, but I'm still failing. Any idea what I'm missing?
Here is my code:
# -*- coding: utf-8 -*-
import scrapy

class GolfscraperSpider(scrapy.Spider):
    name = "golfscraper"
    allowed_domains = ["golf.org.au", "www.golf.org.au"]
    ids = ['3012801330', '3012801331', '3012801332', '3012801333']
    start_urls = []
    for id in ids:
        start_urls.append('http://www.golf.org.au/handicap/%s' % id)

    def parse(self, response):
        scrapy.FormRequest('http://www.golf.org.au/default.aspx?s=handicap',
                           formdata={
                               '__VIEWSTATE': response.css('input#__VIEWSTATE::attr(value)').extract_first(),
                               'ctl11$ddlHistoryInMonths': '48',
                               '__EVENTTARGET': 'ctl11$ddlHistoryInMonths',
                               '__EVENTVALIDATION': response.css('input#__EVENTVALIDATION::attr(value)').extract_first(),
                               'gaHandicap': '6.5',
                               'golflink_No': '2012003003',
                               '__VIEWSTATEGENERATOR': 'CA0B0334',
                           },
                           callback=self.parse_details)

    def parse_details(self, response):
        for name in response.css('div.rnd-course::text').extract():
            yield {'name': name}
Yes, ASP pages are tricky to scrape. Most probably some little parameter is missing.
Solution for this:
instead of creating the request through scrapy.FormRequest(...) use the scrapy.FormRequest.from_response() method (see code example below). This will capture most or even all of the hidden form data and use it to prepopulate the FormRequest's data.
It seems you forgot to return the request; maybe that's another potential problem too.
As far as I recall, __VIEWSTATEGENERATOR will also change each time and has to be extracted from the page.
If this doesn't work, fire up your Firefox browser with Firebug plugin or Chrome's developer tools, do the request in the browser and then check the full request header and body data against the same data in your request. There will be some difference.
Example code with all my suggestions:
def parse(self, response):
    req = scrapy.FormRequest.from_response(
        response,
        formdata={
            '__VIEWSTATE': response.css('input#__VIEWSTATE::attr(value)').extract_first(),
            'ctl11$ddlHistoryInMonths': '48',
            '__EVENTTARGET': 'ctl11$ddlHistoryInMonths',
            '__EVENTVALIDATION': response.css('input#__EVENTVALIDATION::attr(value)').extract_first(),
            'gaHandicap': '6.5',
            'golflink_No': '2012003003',
            '__VIEWSTATEGENERATOR': 'CA0B0334',
        },
        callback=self.parse_details)
    self.logger.info(req.headers)
    self.logger.info(req.body)
    return req