Get structured output with Scrapy

I'm just starting to use scrapy and this is one of my first few projects. I am trying to scrape some company metadata from https://www.baincapitalprivateequity.com/portfolio/ . I have figured out my selectors but I'm unable to structure the output. I'm currently getting everything in one cell but I want the output to be one row for each company. If someone could help with where I'm going wrong, it'll be really great.
import scrapy
from ..items import BainpeItem

class BainPeSpider(scrapy.Spider):
    name = 'Bain-PE'
    allowed_domains = ['baincapitalprivateequity.com']
    start_urls = ['https://www.baincapitalprivateequity.com/portfolio/']

    def parse(self, response):
        items = BainpeItem()
        all_cos = response.css('div.grid')
        for i in all_cos:
            company = i.css('ul li::text').extract()
            about = i.css('div.companyDetail p').extract()
            items['company'] = company
            items['about'] = about
        yield items

You can just yield each item in the for loop:
for i in all_cos:
    item = BainpeItem()
    company = i.css('ul li::text').extract()
    about = i.css('div.companyDetail p').extract()
    item['company'] = company
    item['about'] = about
    yield item
This way each item will arrive in the pipeline separately.
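If each div.grid block still contains several companies (so company and about come back as multi-element lists), you can go one level deeper and pair names with descriptions. This is a minimal sketch, assuming the li names and the companyDetail paragraphs line up one-to-one, which is worth verifying in scrapy shell first:
for i in all_cos:
    names = i.css('ul li::text').extract()
    abouts = i.css('div.companyDetail p::text').extract()
    # zip() pairs each company name with its description;
    # this assumes both lists have the same length and order
    for name, about in zip(names, abouts):
        item = BainpeItem()
        item['company'] = name.strip()
        item['about'] = about.strip()
        yield item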

Related

Duplication in data while scraping data using Scrapy

I am using Scrapy to scrape data from a website, where I want to scrape graphics card titles, prices, and whether they are in stock or not. The problem is that my code is looping twice, so instead of 10 products I am getting 20.
import scrapy

class ThespiderSpider(scrapy.Spider):
    name = 'Thespider'
    start_urls = ['https://www.czone.com.pk/graphic-cards-pakistan-ppt.154.aspx?page=2']

    def parse(self, response):
        data = {}
        cards = response.css('div.row')
        for card in cards:
            for c in card.css('div.product'):
                data['Title'] = c.css('h4 a::text').getall()
                data['Price'] = c.css('div.price span::text').getall()
                data['Stock'] = c.css('div.product-stock span.product-data::text').getall()
                yield data
You're using a nested for loop where one isn't necessary: each card can be captured directly by the CSS selector response.css('div.product').
Code Example
def parse(self, response):
    cards = response.css('div.product')
    for card in cards:
        # build a fresh dict per card so each yielded item is independent
        data = {}
        data['Title'] = card.css('h4 a::text').getall()
        data['Price'] = card.css('div.price span::text').getall()
        data['Stock'] = card.css('div.product-stock span.product-data::text').getall()
        yield data
Additional Information
Use get() instead of getall(): getall() returns a list, and you'll probably want the single string that get() gives you.
If you're thinking about multiple pages, a Scrapy Item may be better than yielding a plain dictionary. Invariably there will be something you need to alter, and an Item gives you more flexibility to do that.
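For example, here is a sketch combining both suggestions; the GraphicCardItem class is hypothetical and would normally live in items.py:
import scrapy

class GraphicCardItem(scrapy.Item):
    # hypothetical Item class; in a real project this belongs in items.py
    title = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()

class ThespiderSpider(scrapy.Spider):
    name = 'Thespider'
    start_urls = ['https://www.czone.com.pk/graphic-cards-pakistan-ppt.154.aspx?page=2']

    def parse(self, response):
        for card in response.css('div.product'):
            item = GraphicCardItem()
            # get() returns the first match as a string (or None), not a list
            item['title'] = card.css('h4 a::text').get()
            item['price'] = card.css('div.price span::text').get()
            item['stock'] = card.css('div.product-stock span.product-data::text').get()
            yield item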

When using scrapy shell, I get no data from response.xpath

I am trying to scrape a betting site. However, when I check for the retrieved data in scrapy shell, I receive nothing.
The XPath to what I need is //*[@id="yui_3_5_0_1_1562259076537_31330"], and when I write it in the shell this is what I get:
In [18]: response.xpath('//*[@id="yui_3_5_0_1_1562259076537_31330"]')
Out[18]: []
The output is [] but I expected to be something from which I could extract the href.
When I use the "inspect" tool in Chrome while the site is still loading, this id is outlined in purple. Does this mean that the site is using JavaScript? And if so, is that the reason why Scrapy does not find the item and returns []?
I tried scraping the site using just Scrapy, and this is my result.
This is the items.py file:
import scrapy

class LifeMatchsItem(scrapy.Item):
    Event = scrapy.Field()  # name of the event
    Match = scrapy.Field()  # Team1 vs Team2
    Date = scrapy.Field()   # date of the match
This is my spider code:
import scrapy
from LifeMatchesProject.items import LifeMatchsItem

class LifeMatchesSpider(scrapy.Spider):
    name = 'life_matches'
    start_urls = ['http://www.betfair.com/sport/home#sscpl=ro/']
    custom_settings = {'FEED_EXPORT_ENCODING': 'utf-8'}

    def parse(self, response):
        for event in response.xpath('//div[contains(@class,"events-title")]'):
            for element in event.xpath('./following-sibling::ul[1]/li'):
                item = LifeMatchsItem()
                item['Event'] = event.xpath('./a/@title').get()
                item['Match'] = element.xpath('.//div[contains(@class,"event-name-info")]/a/@data-event').get()
                item['Date'] = element.xpath('normalize-space(.//div[contains(@class,"event-name-info")]/a//span[@class="date"]/text())').get()
                yield item
And this is the result (screenshot of the scraped items not reproduced here).
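As an aside on the original question: an id like yui_3_5_0_1_1562259076537_31330 is generated by the page's JavaScript at load time (the long number is a timestamp), so it changes on every visit and should never be used in a selector. A quick way to check whether content is rendered by JavaScript is to look for it in the raw HTML Scrapy receives, for example in scrapy shell:
# scrapy shell 'http://www.betfair.com/sport/home'
# If a string you can see in the browser is missing from response.text,
# it was inserted by JavaScript after the page loaded.
'event-name-info' in response.text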

How to scrape this data from the website?

Here's an example: http://www.cincinnati.com/story/money/2016/11/26/see-which-companies-16-deloitte-100/94441104/
Ideally I'd like to see a neatly crawled and extracted output data array with the following fields:
Company Name
2016 Rank
2015 Rank
Years in Business
Business Description
Website
2015 Revenues
2014 Revenues
HQ City
Year Founded
Employees
Is family owned?
from each of the specific company data pages. I'm a complete beginner with Scrapy and I want to know how to extract links automatically. In this code I'm feeding them in manually. Can anyone help me here?
import scrapy
from spy.items import SpyItem
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.linkextractors import LinkExtractor

class ProjectSpider(CrawlSpider):
    name = "project"
    allowed_domains = ["cincinnati.com/story/money/2016/11/26/see-which-companies-16-deloitte-100/94441104/"]
    start_urls = [100Links in here]

    def parse(self, response):
        item = SpyItem()
        item['title'] = response.xpath('//*[@id="overlay"]/div[2]/article/div[3]/p[1]/strong/text()').extract()
        item['Business'] = response.xpath('//*[@id="overlay"]/div[2]/article/div[3]/p[4]/text()').extract()
        item['website'] = response.xpath('//p[5]/a/text()').extract()
        item['Ranking'] = response.xpath('//*[@id="overlay"]/div[2]/article/div[3]/p[2]/text()[1]').extract()
        item['HQ'] = response.css('p:nth-child(12)::text').extract()
        item['Revenue2015'] = response.xpath('//*[@id="overlay"]/div[2]/article/div[3]/p[7]/text()').extract()
        item['Revenue2014'] = response.css('p:nth-child(10)::text').extract()
        item['YearFounded'] = response.xpath('//p[11]/text()').extract().encode('utf-8')
        item['Employees'] = response.xpath('//article/div[3]/p[12]/text()').extract()
        item['FamilyOwned'] = response.xpath('//*[@id="overlay"]/div[2]/article/div[3]/p[13]/text()').extract()
        yield item
There are at least two issues with your code.
allowed_domains entries have to be bare domains, e.g. cincinnati.com, and nothing more.
You use a CrawlSpider, which is meant to be used with Rules, but you don't define any.
The following is some tested code as a starting point:
import scrapy

class ProjectItem(scrapy.Item):
    title = scrapy.Field()
    owned = scrapy.Field()

class ProjectSpider(scrapy.Spider):
    name = "cin100"
    allowed_domains = ['cincinnati.com']
    start_urls = ['http://www.cincinnati.com/story/money/2016/11/26/see-which-companies-16-deloitte-100/94441104/']

    def parse(self, response):
        # get selector for all 100 companies
        sel_companies = response.xpath('//p[contains(.,"Here are the companies")]/following-sibling::p/a')
        # create a request for every single company detail page from its href
        for sel_company in sel_companies:
            href = sel_company.xpath('./@href').extract_first()
            url = response.urljoin(href)
            request = scrapy.Request(url, callback=self.parse_company_detail)
            yield request

    def parse_company_detail(self, response):
        # on the detail page, create the item
        item = ProjectItem()
        # get detail information with specific XPath statements
        # e.g. title is the first paragraph
        item['title'] = response.xpath('//div[@role="main"]/p[1]//text()').extract_first()
        # e.g. family owned has a label we can select
        item['owned'] = response.xpath('//div[@role="main"]/p[contains(.,"Family owned")]/text()').extract_first()
        # find clever XPaths for the other fields ...
        # ...
        # finally: yield the item
        yield item
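For reference, if you wanted to keep the CrawlSpider approach from the question, it would need Rules along these lines. This is a rough sketch: the restrict_xpaths expression just reuses the XPath from the code above and remains an assumption about the page structure. Note that CrawlSpider uses parse() internally for its rule handling, so the callback must have a different name:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ProjectCrawlSpider(CrawlSpider):
    name = "cin100crawl"
    allowed_domains = ['cincinnati.com']
    start_urls = ['http://www.cincinnati.com/story/money/2016/11/26/see-which-companies-16-deloitte-100/94441104/']

    rules = (
        # extract links only from the paragraphs listing the companies
        # and send each detail page to parse_company_detail
        Rule(LinkExtractor(restrict_xpaths='//p[contains(.,"Here are the companies")]/following-sibling::p'),
             callback='parse_company_detail'),
    )

    def parse_company_detail(self, response):
        # same extraction logic as in parse_company_detail above
        ...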

Scrapy Data Table extract

I am trying to scrape https://www.expireddomains.net/deleted-com-domains/ for the expired domain data list.
I always get empty item fields with the following code:
class ExpiredSpider(BaseSpider):
    name = "expired"
    allowed_domains = ["example.com"]
    start_urls = ['https://www.expireddomains.net/deleted-com-domains/']

    def parse(self, response):
        log.msg('parse(%s)' % response.url, level=log.DEBUG)
        rows = response.xpath('//table[@class="base1"]/tbody/tr')
        for row in rows:
            item = DomainItem()
            item['domain'] = row.xpath('td[1]/text()').extract()
            item['bl'] = row.xpath('td[2]/text()').extract()
            yield item
Can somebody point out what is wrong? Thanks.
As a first note, you should use scrapy.Spider instead of BaseSpider, which is deprecated.
Secondly, the .extract() method returns a list rather than a single element.
This is how the item extraction should look:
item['domain'] = row.xpath('td[1]/text()').extract_first()
item['bl'] = row.xpath('td[2]/text()').extract_first()
Also, you should use the built-in Python logging library:
import logging
logging.debug("parse("+response.url+")")
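Inside a spider, the more idiomatic option is the logger Scrapy attaches to every spider instance:
def parse(self, response):
    # self.logger is a standard logging.Logger named after the spider
    self.logger.debug('parse(%s)', response.url)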

Scraping data from wikipedia using Scrapy - why/when do errors occur due to processing URLs?

I am just starting to use Scrapy, and I am learning to use it as I go along. Can someone please explain why there is an error in my code, and what this error is? Is the error related to an invalid URL I have provided, and/or is it connected with invalid XPaths?
Here is my code:
from scrapy.spider import Spider
from scrapy.selector import Selector

class CatswikiSpider(Spider):
    name = "catswiki"
    allowed_domains = ["http://en.wikipedia.org/wiki/Cat"]
    start_urls = [
        "http://en.wikipedia.org/wiki/Cat"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//body/div')
        for site in sites:
            title = ('//h1/span/text()').extract()
            subtitle = ('//h2/span/text()').extract()
            boldtext = ('//p/b').extract()
            links = ('//a/@href').extract()
            imagelinks = ('//img/@src').re(r'.*cat.*').extract()
            print title, subtitle, boldtext, links, imagelinks
            #filename = response.url.split("/")[-2]
            #open(filename, 'wb').write(response.body)
And here are some attachments showing the errors in the command prompt (screenshots not reproduced here).
You need a function call before all your extract lines. I'm not familiar with Scrapy, but it's probably something like:
title = site.xpath('//h1/span/text()').extract()
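Putting that together, the whole loop might look like the sketch below (keeping the Python 2 print of the original question). Note the leading ./ in each expression, which keeps the query relative to the current site selector; a bare // would search the whole document on every pass. Also, re() already returns a list of strings, so no extract() should follow it:
def parse(self, response):
    sel = Selector(response)
    for site in sel.xpath('//body/div'):
        title = site.xpath('.//h1/span/text()').extract()
        subtitle = site.xpath('.//h2/span/text()').extract()
        boldtext = site.xpath('.//p/b').extract()
        links = site.xpath('.//a/@href').extract()
        # re() returns the matching strings directly
        imagelinks = site.xpath('.//img/@src').re(r'.*cat.*')
        print title, subtitle, boldtext, links, imagelinks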
