Scrape https://socialblade.com/ - web-scraping

I am new to Scrapy and trying to scrape https://socialblade.com/ to get the channel IDs of the most viewed and most subscribed YouTubers of a country.
The way I am doing it is to follow the link to each YouTuber on the main listing page (e.g. https://socialblade.com/youtube/top/country/pk/mostsubscribed). This opens a new page, and the last part of that page's URL contains the channel ID (e.g. https://socialblade.com/youtube/channel/UC4JCksJF76g_MdzPVBJoC3Q).
Here is my code:
import scrapy


class SocialBladeSpider(scrapy.Spider):
    name = "socialblade"

    def start_requests(self):
        urls = [
            'https://socialblade.com/youtube/top/country/pk/mostviewed',
            'https://socialblade.com/youtube/top/country/pk/mostsubscribed'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse_url(self, response):
        data = {
            'url': response.url.split('/')[-1],
            'displayName': response.css('div#YouTubeUserTopInfoBlockTop div h1::text').extract_first()
        }
        yield {
            response.meta['country']: {
                response.meta['key']: data
            }
        }

    def parse(self, response):
        key = response.url.split("/")[-1]
        country = response.url.split("/")[-2]
        for a in response.css('a[href^="/youtube/user/"]'):
            request = scrapy.Request(url='https://socialblade.com' + a.css('::attr(href)').extract_first(), callback=self.parse_url)
            request.meta['key'] = key
            request.meta['country'] = country
            yield request
The issue is: after scraping these two URLs I should get 500 records in total, but I am only getting 348. I have done some digging but have been unable to find a solution.
Does anyone have any advice on how to solve this?

Pass dont_filter=True to your requests if you do not want to filter out duplicate requests.
For more information, see the documentation about Request.
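A minimal sketch of applying that suggestion to the parse method above: channels that appear in both the mostviewed and mostsubscribed lists produce requests to the same channel URL, which the duplicate filter would otherwise drop.
def parse(self, response):
    key = response.url.split("/")[-1]
    country = response.url.split("/")[-2]
    for a in response.css('a[href^="/youtube/user/"]'):
        yield scrapy.Request(
            url='https://socialblade.com' + a.css('::attr(href)').extract_first(),
            callback=self.parse_url,
            meta={'key': key, 'country': country},
            dont_filter=True,  # keep requests already seen via the other listing page
        )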

Related

How to get the url trail followed by spider for each item extracted?

I am currently working on a spider that crawls an e-commerce website and extracts data. I also need to save the URL trail in each product item, such as:
{
    'product_name': "apple iphone 12",
    'trail': ["https://www.apple.com/", "https://www.apple.com/iphone/", "https://www.apple.com/iphone-12/"]
}
That is, the same path a user would follow from the start page to the product.
I am using scrapy 2.4.1
I passed the previous URL as a keyword argument to the callback.
source: https://docs.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-request-callback-arguments
def parse(self, response):
    request = scrapy.Request('http://www.example.com/index.html',
                             callback=self.parse_page2,
                             cb_kwargs=dict(main_url=response.url))
    request.cb_kwargs['foo'] = 'bar'  # add more arguments for the callback
    yield request

def parse_page2(self, response, main_url, foo):
    yield dict(
        main_url=main_url,
        other_url=response.url,
        foo=foo,
    )
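Applied to the trail use case, a sketch along the same lines can accumulate the list hop by hop (the selectors and the parse_product callback name here are placeholders, not from the original spider):
def parse(self, response):
    # trail collected so far, plus the page we are on now
    trail = response.cb_kwargs.get('trail', []) + [response.url]
    for href in response.css('a.category-link::attr(href)').getall():  # placeholder selector
        yield response.follow(href, callback=self.parse, cb_kwargs={'trail': trail})
    for href in response.css('a.product-link::attr(href)').getall():  # placeholder selector
        yield response.follow(href, callback=self.parse_product, cb_kwargs={'trail': trail})

def parse_product(self, response, trail):
    yield {
        'product_name': response.css('h1::text').get(),  # placeholder selector
        'trail': trail + [response.url],
    }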

Scrapy get all urls from a domain and go beyond the domain by the depth of 2

I'm trying to scrape an online newspaper. I want to get all the URLs within the domain, and if there are any external URLs (articles from other domains) mentioned in an article, I may want to fetch those as well. In other words, I want to allow the spider to go to a depth of 3 (is that two clicks away from start_urls?). Can someone tell me whether the snippet below is right or wrong?
Any help is greatly appreciated.
Here is my code snippet:
start_urls = ['www.example.com']
master_domain = tldextract.extract(start_urls[0]).domain
allowed_domains = ['www.example.com']
rules = (
    Rule(LinkExtractor(deny=(r"/search", r'showComment=', r'/search/')),
         callback="parse_item", follow=True),
)

def parse_item(self, response):
    url = response.url
    master_domain = self.master_domain
    self.logger.info(master_domain)
    current_domain = tldextract.extract(url).domain
    referer = response.request.headers.get('Referer')
    depth = response.meta.get('depth')
    if current_domain == master_domain:
        yield {'url': url,
               'referer': referer,
               'depth': depth}
    elif current_domain != master_domain:
        if depth < 2:
            yield {'url': url,
                   'referer': referer,
                   'depth': depth}
        else:
            self.logger.debug('depth is greater than 3')
Open your settings and add
DEPTH_LIMIT = 2
For more details, see the Scrapy documentation on the DEPTH_LIMIT setting.
There is no need to check the domain with
if current_domain == master_domain:
When you set allowed_domains, the spider will automatically follow only links on the domains listed there.
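If you prefer to keep the limit next to the spider rather than in the project settings, custom_settings works as well. A sketch with placeholder names, not the asker's full spider:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class NewsSpider(CrawlSpider):  # placeholder name
    name = 'news'
    allowed_domains = ['www.example.com']
    start_urls = ['https://www.example.com']
    custom_settings = {
        'DEPTH_LIMIT': 2,  # stop following links more than two hops from start_urls
    }
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {'url': response.url, 'depth': response.meta.get('depth')}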

I want to fetch the data in data.json using this query or code

I am using VS Code + Git Bash to scrape this data into JSON, but the JSON file ends up empty; I am not getting any data at all.
import scrapy


class ContactsSpider(scrapy.Spider):
    name = 'contacts'
    start_urls = [
        'https://app.cartinsight.io/sellers/all/amazon/'
    ]

    def parse(self, response):
        for contacts in response.xpath("//td[@title='Show Contact']"):
            yield {
                'show_contacts_td': contacts.xpath(".//td[@id='show_contacts_td']").extract_first()
            }
        next_page = response.xpath("//li[@class='stores-desc hidden-xs']").extract_first()
        if next_page is not None:
            next_page_link = response.urljoin(next_page)
            yield scrapy.Request(url=next_page_link, callback=self.parse)
The URL https://app.cartinsight.io/sellers/all/amazon/ you want to scrape redirects to https://app.cartinsight.io/. That second page contains nothing matching the XPath "//td[@title='Show Contact']", so the for loop in the parse method never runs and you get no results.
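One quick way to confirm this from the spider itself is to compare the requested URL with the final response URL, or to stop Scrapy from following the redirect so you can inspect the original response. A sketch using the question's start URL (the spider name is a placeholder, and 301/302 are just the usual redirect status codes):
import scrapy


class RedirectCheckSpider(scrapy.Spider):  # placeholder spider just for the check
    name = 'redirect_check'

    def start_requests(self):
        yield scrapy.Request(
            url='https://app.cartinsight.io/sellers/all/amazon/',
            callback=self.parse,
            # keep the redirect response instead of following it
            meta={'dont_redirect': True, 'handle_httpstatus_list': [301, 302]},
        )

    def parse(self, response):
        # response.url is the final URL; response.request.url is what was requested
        self.logger.info('status %s for %s (final url: %s)',
                         response.status, response.request.url, response.url)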

scraping content of ASP.NET based website using scrapy

I've been trying to scrape some lists from this website, http://www.golf.org.au. It is ASP.NET based. I did some research, and it appears that I must pass some values in a POST request to make the website fetch the data into the tables. I did that, but I'm still failing. Any idea what I'm missing?
Here is my code:
# -*- coding: utf-8 -*-
import scrapy


class GolfscraperSpider(scrapy.Spider):
    name = "golfscraper"
    allowed_domains = ["golf.org.au", "www.golf.org.au"]
    ids = ['3012801330', '3012801331', '3012801332', '3012801333']
    start_urls = []
    for id in ids:
        start_urls.append('http://www.golf.org.au/handicap/%s' % id)

    def parse(self, response):
        scrapy.FormRequest('http://www.golf.org.au/default.aspx?s=handicap',
                           formdata={
                               '__VIEWSTATE': response.css('input#__VIEWSTATE::attr(value)').extract_first(),
                               'ctl11$ddlHistoryInMonths': '48',
                               '__EVENTTARGET': 'ctl11$ddlHistoryInMonths',
                               '__EVENTVALIDATION': response.css('input#__EVENTVALIDATION::attr(value)').extract_first(),
                               'gaHandicap': '6.5',
                               'golflink_No': '2012003003',
                               '__VIEWSTATEGENERATOR': 'CA0B0334',
                           },
                           callback=self.parse_details)

    def parse_details(self, response):
        for name in response.css('div.rnd-course::text').extract():
            yield {'name': name}
Yes, ASP.NET pages are tricky to scrape. Most probably some small parameter is missing.
Suggestions:
Instead of creating the request through scrapy.FormRequest(...), use the scrapy.FormRequest.from_response() method (see the code example below). It captures most or even all of the hidden form data and uses it to prepopulate the FormRequest's data.
It also seems you forgot to return the request; that may be another problem.
As far as I recall, the __VIEWSTATEGENERATOR also changes each time and has to be extracted from the page.
If this doesn't work, fire up Firefox with the Firebug plugin or Chrome's developer tools, perform the request in the browser, and then compare the full request headers and body against the same data in your Scrapy request. There will be some difference.
Example code with all my suggestions:
def parse(self, response):
    req = scrapy.FormRequest.from_response(
        response,
        formdata={
            '__VIEWSTATE': response.css('input#__VIEWSTATE::attr(value)').extract_first(),
            'ctl11$ddlHistoryInMonths': '48',
            '__EVENTTARGET': 'ctl11$ddlHistoryInMonths',
            '__EVENTVALIDATION': response.css('input#__EVENTVALIDATION::attr(value)').extract_first(),
            'gaHandicap': '6.5',
            'golflink_No': '2012003003',
            '__VIEWSTATEGENERATOR': 'CA0B0334',
        },
        callback=self.parse_details)
    self.logger.info(req.headers)
    self.logger.info(req.body)
    return req
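Since from_response() already copies the page's form fields, including hidden inputs such as __VIEWSTATE, __EVENTVALIDATION and __VIEWSTATEGENERATOR, a leaner variant only overrides the fields that actually change. A sketch, assuming the drop-down postback is the only thing that needs to be set explicitly:
def parse(self, response):
    # hidden ASP.NET fields are taken from the page's <form> automatically
    return scrapy.FormRequest.from_response(
        response,
        formdata={
            '__EVENTTARGET': 'ctl11$ddlHistoryInMonths',
            'ctl11$ddlHistoryInMonths': '48',
        },
        callback=self.parse_details)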

Find and scrape all URLs with specific format using Scrapy

I am using Scrapy to retrieve information about projects on https://www.indiegogo.com. I want to scrape all pages with the url format www.indiegogo.com/projects/[NameOfProject]. However, I am not sure how to reach all of those pages during a crawl. I can't find a master page that hardcodes links to all of the /projects/ pages. All projects seem to be accessible from https://www.indiegogo.com/explore (through visible links and the search function), but I cannot determine the set of links/search queries that would return all pages. My spider code is given below. These start_urls and rules scrape about 6000 pages, but I hear that there should be closer to 10x that many.
About the urls with parameters: The filter_quick parameter values used come from the "Trending", "Final Countdown", "New This Week", and "Most Funded" links on the Explore page and obviously miss unpopular and poorly funded projects. There is no max value on the per_page url parameter.
Any suggestions? Thanks!
class IndiegogoSpider(CrawlSpider):
    name = "indiegogo"
    allowed_domains = ["indiegogo.com"]
    start_urls = [
        "https://www.indiegogo.com/sitemap",
        "https://www.indiegogo.com/explore",
        "http://go.indiegogo.com/blog/category/campaigns-2",
        "https://www.indiegogo.com/explore?filter_browse_balance=true&filter_quick=countdown&per_page=50000",
        "https://www.indiegogo.com/explore?filter_browse_balance=true&filter_quick=new&per_page=50000",
        "https://www.indiegogo.com/explore?filter_browse_balance=true&filter_quick=most_funded&per_page=50000",
        "https://www.indiegogo.com/explore?filter_browse_balance=true&filter_quick=popular_all&per_page=50000"
    ]
    rules = (
        Rule(LinkExtractor(allow=('/explore?'))),
        Rule(LinkExtractor(allow=('/campaigns-2/'))),
        Rule(LinkExtractor(allow=('/projects/')), callback='parse_item'),
    )

    def parse_item(self, response):
        [...]
Sidenote: there are other URL formats www.indiegogo.com/projects/[NameOfProject]/[OtherStuff] that either redirect to the desired URL format or give 404 errors when I try to load them in the browser. I am assuming that Scrapy is handling the redirects and blank pages correctly, but would be open to hearing ways to verify this.
Well, if you have the link to the sitemap, then it will be faster to let Scrapy fetch the pages from there and process them.
This will work something like the example below.
from scrapy.contrib.spiders import SitemapSpider


class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/robots.txt']
    # you can set rules for extracting URLs under sitemap_rules
    sitemap_rules = [
        ('/shop/', 'parse_shop'),
    ]
    sitemap_follow = ['/sitemap_shops']

    def parse_shop(self, response):
        pass  # ... scrape shop here ...
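Adapted to the question, a sketch might look like this. The robots.txt location is the usual place Scrapy discovers sitemaps from (check https://www.indiegogo.com/robots.txt for the real entries), the import path is the modern scrapy.spiders one, and the callback body is a placeholder:
from scrapy.spiders import SitemapSpider


class IndiegogoSitemapSpider(SitemapSpider):
    name = 'indiegogo_sitemap'
    # Scrapy reads the Sitemap: entries from robots.txt and follows them
    sitemap_urls = ['https://www.indiegogo.com/robots.txt']
    sitemap_rules = [
        ('/projects/', 'parse_project'),  # only project pages reach the callback
    ]

    def parse_project(self, response):
        yield {
            'link': response.url,
            'title': response.xpath('//title/text()').get(),
        }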
Try the code below; it will crawl the site but only scrape pages under "indiegogo.com/projects/".
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from sitemap.items import myitem


class DmozSpider(CrawlSpider):
    name = 'indiego'
    allowed_domains = ['indiegogo.com']
    start_urls = [
        'http://indiegogo.com'
    ]
    rules = (
        Rule(LinkExtractor(allow=('/projects/',)), callback='parse_items', follow=True),
    )

    def parse_items(self, response):
        item = myitem()
        item['link'] = response.request.url
        item['title'] = response.xpath('//title').extract()
        yield item
