My code is only giving me data for 44 links instead of 102. Can someone tell me why it is extracting only part of them, and how I can extract all of them properly? I would appreciate your help.
import scrapy

class ProjectItem(scrapy.Item):
    title = scrapy.Field()
    owned = scrapy.Field()
    Revenue2014 = scrapy.Field()
    Revenue2015 = scrapy.Field()
    Website = scrapy.Field()
    Rank = scrapy.Field()
    Employees = scrapy.Field()
    headquarters = scrapy.Field()
    FoundedYear = scrapy.Field()

class ProjectSpider(scrapy.Spider):
    name = "cin100"
    allowed_domains = ['cincinnati.com']
    start_urls = ['http://www.cincinnati.com/story/money/2016/11/26/see-which-companies-16-deloitte-100/94441104/']

    def parse(self, response):
        # get selector for all 100 companies
        sel_companies = response.xpath('//p[contains(.,"click or tap here.")]/following-sibling::p/a')
        # create request for every single company detail page from href
        for sel_companie in sel_companies:
            href = sel_companie.xpath('./@href').extract_first()
            url = response.urljoin(href)
            request = scrapy.Request(url, callback=self.parse_company_detail)
            yield request

    def parse_company_detail(self, response):
        # On detail page create item
        item = ProjectItem()
        # get detail information with specific XPath statements
        # e.g. title is the first paragraph
        item['title'] = response.xpath('//div[@role="main"]/p[1]//text()').extract_first().rsplit('-')[1]
        # e.g. family owned has a label we can select
        item['owned'] = response.xpath('//div[@role="main"]/p[contains(.,"Family owned")]/text()').extract_first()
        item['Revenue2014'] = '$' + response.xpath('//div[@role="main"]/p[contains(.,"2014")]/text()').extract_first().rsplit('$')[1]
        item['Revenue2015'] = '$' + response.xpath('//div[@role="main"]/p[contains(.,"$")]/text()').extract_first().rsplit('$')[1]
        item['Website'] = response.xpath('//div[@role="main"]/p/a[contains(.,"www.")]/@href').extract_first()
        item['Rank'] = response.xpath('//div[@role="main"]/p[contains(.,"rank")]/text()').extract_first()
        item['Employees'] = response.xpath('//div[@role="main"]/p[contains(.,"Employ")]/text()').extract_first()
        item['headquarters'] = response.xpath('//div[@role="main"]/p[10]//text()').extract()
        item['FoundedYear'] = response.xpath('//div[@role="main"]/p[contains(.,"founded")]/text()').extract()
        # Finally: yield the item
        yield item
Looking closer at the output of scrapy you'll find that after a few dozen requests they get redirected as shown below:
DEBUG: Redirecting (302) to <GET http://www.cincinnati.com/get-access/?return=http%3A%2F%2Fwww.cincinnati.com%2Fstory%2Fmoney%2F2016%2F11%2F27%2Ffrischs-restaurants%2F94430718%2F> from <GET http://www.cincinnati.com/story/money/2016/11/27/frischs-restaurants/94430718/>
The page that gets requested says: We hope you have enjoyed your complimentary access.
So it looks like they offer only limited access to anonymous users. You probably need to register with their service to get full access to the data.
There are also a few potential problems with your XPaths:
It's usually a bad idea to make XPaths look for text that's on a page. Text can change from one minute to the next; the layout and HTML structure are much longer-lived.
Using following-sibling is also a last-resort XPath feature that is quite vulnerable to slight changes on the website.
What I would be doing instead:
# iterate all paragraphs within the article:
for para in response.xpath("//*[@itemprop='articleBody']/p"):
    url = para.xpath("./a/@href").extract()
    # ... etc
len(response.xpath("//*[@itemprop='articleBody']/p")) gives me the expected 102, by the way.
You might have to filter the URLs to remove non-company URLs like the one labeled with "click or tap here".
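A minimal sketch of such a filter, using made-up link data (the text/href pairs and the filter criteria are assumptions for illustration, not taken from the live page):

```python
# Hypothetical (text, href) pairs as they might come out of the article body.
links = [
    ("Frisch's Restaurants", "/story/money/2016/11/27/frischs-restaurants/94430718/"),
    ("click or tap here.", "/get-access/"),
    ("Some Company", "/story/money/2016/11/27/some-company/12345678/"),
]

def company_links(links):
    """Keep only links that look like company detail pages."""
    return [
        href for text, href in links
        if "click or tap here" not in text and "/story/money/" in href
    ]
```

Passing the sample data through `company_links` drops the "click or tap here" entry and keeps the two story URLs.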
I am trying to scrape a betting site. However, when I check for the retrieved data in scrapy shell, I receive nothing.
The XPath to what I need is //*[@id="yui_3_5_0_1_1562259076537_31330"], and when I write it in the shell this is what I get:
In [18]: response.xpath('//*[@id="yui_3_5_0_1_1562259076537_31330"]')
Out[18]: []
The output is [] but I expected something from which I could extract the href.
When I use the "inspect" tool in Chrome while the site is still loading, this id is outlined in purple. Does this mean that the site is using JavaScript? And if so, is that the reason why Scrapy does not find the item and returns []?
I tried scraping the site using just Scrapy, and this is my result.
This is the items.py file:
import scrapy

class LifeMatchsItem(scrapy.Item):
    Event = scrapy.Field()  # Name of event
    Match = scrapy.Field()  # Team1 vs Team2
    Date = scrapy.Field()   # Date of Match
This is my Spider code
import scrapy
from LifeMatchesProject.items import LifeMatchsItem

class LifeMatchesSpider(scrapy.Spider):
    name = 'life_matches'
    start_urls = ['http://www.betfair.com/sport/home#sscpl=ro/']
    custom_settings = {'FEED_EXPORT_ENCODING': 'utf-8'}

    def parse(self, response):
        for event in response.xpath('//div[contains(@class,"events-title")]'):
            for element in event.xpath('./following-sibling::ul[1]/li'):
                item = LifeMatchsItem()
                item['Event'] = event.xpath('./a/@title').get()
                item['Match'] = element.xpath('.//div[contains(@class,"event-name-info")]/a/@data-event').get()
                item['Date'] = element.xpath('normalize-space(.//div[contains(@class,"event-name-info")]/a//span[@class="date"]/text())').get()
                yield item
And this is the result
I am new to Scrapy and web scraping. Please don't get mad. I am trying to scrape profilecanada.com. When I ran the code below, no errors were given, but I think it is still not scraping. My code starts on a page containing a list of links. Each link leads to a page with another list of links, and those links in turn lead to the pages holding the data that I need to extract and save into a JSON file. In general, it is something like "nested link scraping"; I don't know what it is actually called. Please see the image below for the result when I ran the spider. Thank you in advance for your help.
import scrapy

class ProfilecanadaSpider(scrapy.Spider):
    name = 'profilecanada'
    allowed_domains = ['http://www.profilecanada.com']
    start_urls = ['http://www.profilecanada.com/browse_by_category.cfm/']

    def parse(self, response):
        # urls from start_url
        category_list_urls = response.css('div.div_category_list > div.div_category_list_column > ul > li.li_category > a::attr(href)').extract()
        # start_u = 'http://www.profilecanada.com/browse_by_category.cfm/'

        # for each category of company
        for url in category_list_urls:
            url = url[3:]
            url = response.urljoin(url)
            return scrapy.Request(url=url, callback=self.profileCategoryPages)

    def profileCategoryPages(self, response):
        company_list_url = response.css('div.dv_en_block_name_frame > a::attr(href)').extract()

        # for each company in the list
        for url in company_list_url:
            url = response.urljoin(url)
            return scrapy.Request(url=url, callback=self.companyDetails)

    def companyDetails(self, response):
        return {
            'company_name': response.css('span#name_frame::text').extract_first(),
            'street_address': str(response.css('span#frame_addr::text').extract_first()),
            'city': str(response.css('span#frame_city::text').extract_first()),
            'region_or_province': str(response.css('span#frame_province::text').extract_first()),
            'postal_code': str(response.css('span#frame_postal::text').extract_first()),
            'country': str(response.css('div.type6_GM > div > div::text')[-1].extract())[2:],
            'phone_number': str(response.css('span#frame_phone::text').extract_first()),
            'fax_number': str(response.css('span#frame_fax::text').extract_first()),
            'email': str(response.css('span#frame_email::text').extract_first()),
            'website': str(response.css('span#frame_website > a::attr(href)').extract_first()),
        }
IMAGE RESULT IN CMD:
The result in cmd when I ran the spider
You should change allowed_domains to allowed_domains = ['profilecanada.com'], and change every return scrapy.Request(...) to yield scrapy.Request(...), and it'll start working. Keep in mind that obeying robots.txt is not always enough; you should also throttle your requests if necessary.
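The return-vs-yield difference is easy to see in plain Python: a return inside a for loop ends the function on the first iteration, while yield turns the function into a generator that emits every item. (The URLs below are made up for the demonstration.)

```python
urls = ["/page/1", "/page/2", "/page/3"]

def crawl_with_return(urls):
    for url in urls:
        return url  # the function exits on the very first URL

def crawl_with_yield(urls):
    for url in urls:
        yield url  # every URL is emitted

print(crawl_with_return(urls))       # -> /page/1
print(list(crawl_with_yield(urls)))  # -> ['/page/1', '/page/2', '/page/3']
```

This is exactly why a Scrapy callback that produces multiple requests or items must yield them instead of returning inside the loop.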
Here's an example: http://www.cincinnati.com/story/money/2016/11/26/see-which-companies-16-deloitte-100/94441104/
Ideally I'd like to see a neatly crawled and extracted output data array with the following fields:
Company Name
2016 Rank
2015 Rank
Years in Business
Business Description
Website
2015 Revenues
2014 Revenues
HQ City
Year Founded
Employees
Is family owned?
from each of the specific company data pages. I'm a complete beginner to Scrapy and I want to know how to extract links automatically. Here in this code I'm feeding them in manually. Can anyone help me here?
import scrapy
from spy.items import SpyItem
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.linkextractors import LinkExtractor

class ProjectSpider(CrawlSpider):
    name = "project"
    allowed_domains = ["cincinnati.com/story/money/2016/11/26/see-which-companies-16-deloitte-100/94441104/"]
    start_urls = [100Links in here]

    def parse(self, response):
        item = SpyItem()
        item['title'] = response.xpath('//*[@id="overlay"]/div[2]/article/div[3]/p[1]/strong/text()').extract()
        item['Business'] = response.xpath('//*[@id="overlay"]/div[2]/article/div[3]/p[4]/text()').extract()
        item['website'] = response.xpath('//p[5]/a/text()').extract()
        item['Ranking'] = response.xpath('//*[@id="overlay"]/div[2]/article/div[3]/p[2]/text()[1]').extract()
        item['HQ'] = response.css('p:nth-child(12)::text').extract()
        item['Revenue2015'] = response.xpath('//*[@id="overlay"]/div[2]/article/div[3]/p[7]/text()').extract()
        item['Revenue2014'] = response.css('p:nth-child(10)::text').extract()
        item['YearFounded'] = response.xpath('//p[11]/text()').extract().encode('utf-8')
        item['Employees'] = response.xpath('//article/div[3]/p[12]/text()').extract()
        item['FamilyOwned'] = response.xpath('//*[@id="overlay"]/div[2]/article/div[3]/p[13]/text()').extract()
        yield item
There are at least two issues with your code:
allowed_domains has to contain just a domain, nothing more.
You use a CrawlSpider, which is meant to be used with Rules, but you don't define any rules.
In the following there is some tested code as a starting point:
import scrapy

class ProjectItem(scrapy.Item):
    title = scrapy.Field()
    owned = scrapy.Field()

class ProjectSpider(scrapy.Spider):
    name = "cin100"
    allowed_domains = ['cincinnati.com']
    start_urls = ['http://www.cincinnati.com/story/money/2016/11/26/see-which-companies-16-deloitte-100/94441104/']

    def parse(self, response):
        # get selector for all 100 companies
        sel_companies = response.xpath('//p[contains(.,"Here are the companies")]/following-sibling::p/a')
        # create request for every single company detail page from href
        for sel_companie in sel_companies:
            href = sel_companie.xpath('./@href').extract_first()
            url = response.urljoin(href)
            request = scrapy.Request(url, callback=self.parse_company_detail)
            yield request

    def parse_company_detail(self, response):
        # On detail page create item
        item = ProjectItem()
        # get detail information with specific XPath statements
        # e.g. title is the first paragraph
        item['title'] = response.xpath('//div[@role="main"]/p[1]//text()').extract_first()
        # e.g. family owned has a label we can select
        item['owned'] = response.xpath('//div[@role="main"]/p[contains(.,"Family owned")]/text()').extract_first()
        # find clever XPaths for other fields ...
        # ...
        # Finally: yield the item
        yield item
I need help to correctly scrape the headquarters data from all the links on the http://www.cincinnati.com/story/money/2016/11/26/see-which-companies-16-deloitte-100/94441104/ website.
class ProjectSpider(scrapy.Spider):
    name = "cin100"
    allowed_domains = ['cincinnati.com']
    start_urls = ['http://www.cincinnati.com/story/money/2016/11/26/see-which-companies-16-deloitte-100/94441104/']

    def parse(self, response):
        # get selector for all 100 companies
        sel_companies = response.xpath('//p[contains(.,"Here are the companies")]/following-sibling::p/a')
        # create request for every single company detail page from href
        for sel_companie in sel_companies:
            href = sel_companie.xpath('./@href').extract_first()
            url = response.urljoin(href)
            request = scrapy.Request(url, callback=self.parse_company_detail)
            yield request

    def parse_company_detail(self, response):
        # On detail page create item
        item = ProjectItem()
        # get detail information with specific XPath statements
        # e.g. title is the first paragraph
        item['title'] = response.xpath('//div[@role="main"]/p[1]//text()').extract_first().rsplit('-')[1]
        # e.g. family owned has a label we can select
        item['owned'] = response.xpath('//div[@role="main"]/p[contains(.,"Family owned")]/text()').extract_first()
        item['Revenue2014'] = '$' + response.xpath('//div[@role="main"]/p[contains(.,"2014")]/text()').extract_first().rsplit('$')[1]
        item['Revenue2015'] = '$' + response.xpath('//div[@role="main"]/p[contains(.,"$")]/text()').extract_first().rsplit('$')[1]
        item['Website'] = response.xpath('//div[@role="main"]/p/a[contains(.,"com")]/text()').extract_first()
        item['Rank'] = response.xpath('//div[@role="main"]/p[contains(.,"rank")]/text()').extract_first()
        item['Employees'] = response.xpath('//div[@role="main"]/p[contains(.,"Employ")]/text()').extract_first()
        item['headquarters'] = response.xpath('//div[@role="main"]/p[10]//text()').extract()
        item['FoundedYear'] = response.xpath('//div[@role="main"]/p[contains(.,"founded")]/text()').extract()
        # Finally: yield the item
        yield item
Since the headquarters entry has a preceding header labeled "Headquarters", I would take this as an anchor and select the content of the next following <p> tag, like so:
//p[contains(.,"Headquarters")]/following-sibling::p[1]
Maybe you want to have a look at this XPath Tutorial to get a better understanding of the commands (and find a better solution).
I am just starting to use Scrapy, and I am learning to use it as I go along. Please can someone explain why there is an error in my code, and what this error is? Is this error related to an invalid URL I have provided, and/or is it connected with invalid xpaths?
Here is my code:
from scrapy.spider import Spider
from scrapy.selector import Selector

class CatswikiSpider(Spider):
    name = "catswiki"
    allowed_domains = ["http://en.wikipedia.org/wiki/Cat"]
    start_urls = [
        "http://en.wikipedia.org/wiki/Cat"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//body/div')
        for site in sites:
            title = ('//h1/span/text()').extract()
            subtitle = ('//h2/span/text()').extract()
            boldtext = ('//p/b').extract()
            links = ('//a/@href').extract()
            imagelinks = ('//img/@src').re(r'.*cat.*').extract()
            print title, subtitle, boldtext, links, imagelinks
            #filename = response.url.split("/")[-2]
            #open(filename, 'wb').write(response.body)
And here are some attachments, showing the errors in the command prompt:
You need a function call before all your extract lines. I'm not familiar with scrapy, but it's probably something like:
title = site.xpath('//h1/span/text()').extract()