How to iterate to scrape each item no matter the position - web-scraping

I'm using Scrapy and I'm trying to scrape technical descriptions from products, but I can't find any tutorial for what I'm looking for.
I'm using this site: Air Conditioner 1
For example, I need to extract the model of that product:
Modelo ---> KCIN32HA3AN
It's in the 5th position:
(//span[@class='gb-tech-spec-module-list-description'])[5]
But if I go to this other product:
Air Conditioner 2
the model is: Modelo ---> ALS35-WCCR
and it's in the 6th position, so I only get "60 m3", which is what occupies the 5th position there.
I don't know how to iterate so that I get each model no matter its position.
This is the code I'm using right now:
from scrapy.item import Field
from scrapy.item import Item
from scrapy.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.loader.processors import MapCompose
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader


class Hotel(Item):
    titulo = Field()
    precio = Field()
    marca = Field()
    modelo = Field()


class TripAdvisor(CrawlSpider):
    name = 'Hoteles'
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/71.0.3578.80 Chrome/71.0.3578.80 Safari/537.36',
        'CLOSESPIDER_PAGECOUNT': 20
    }
    start_urls = ['https://www.garbarino.com/productos/aires-acondicionados-split/4278']
    download_delay = 2

    rules = (
        Rule(
            LinkExtractor(
                allow=r'/?page=\d+'
            ), follow=True),
        Rule(
            LinkExtractor(
                allow=r'/aire-acondicionado-split'
            ), follow=True, callback='parse_items'),
    )

    def parse_items(self, response):
        sel = Selector(response)
        item = ItemLoader(Hotel(), sel)
        item.add_xpath('titulo', '//h1/text()')
        item.add_xpath('precio', '//*[@id="final-price"]/text()')
        item.add_xpath('marca', '(//span[@class="gb-tech-spec-module-list-description"])[1]/text()',
                       MapCompose(lambda i: i.replace('\n', ' ').replace('\r', ' ').strip()))
        item.add_xpath('modelo', '(//span[@class="gb-tech-spec-module-list-description"])[5]/text()',
                       MapCompose(lambda i: i.replace('\n', ' ').replace('\r', ' ').strip()))
        yield item.load_item()

It's not a good idea to select elements by position: the website layout can change many times, and every change forces you to fix your crawler.
Instead, use a reference that is more closely tied to the element you want than its position.
For example, I accessed the site you linked and opened a product page; note that the element holding the modelo value sits right next to the element that labels it:
<ul>
  <li>
    <h3 class="gb-tech-spec-module-list-title">Modelo</h3>
    <span class="gb-tech-spec-module-list-description">BSI26WCCR</span>
  </li>
  <li>
    <h3 class="gb-tech-spec-module-list-title">Tipo de Tecnología</h3>
    <span class="gb-tech-spec-module-list-description">Inverter</span>
  </li>
  ...
</ul>
So, you can do the following:
//*[contains(text(), "Modelo")]/following-sibling::*[contains(@class, "description")]/text()
That way, the XPath does not depend on the position.
Reference on how to use following-sibling.
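Plugged into the spider above, that anchored XPath would replace the positional one. Here is a minimal sketch of the adjusted parse_items; it assumes a "Marca" label exists alongside "Modelo" on the page, which is an assumption about this site's markup:
    def parse_items(self, response):
        sel = Selector(response)
        item = ItemLoader(Hotel(), sel)
        item.add_xpath('titulo', '//h1/text()')
        item.add_xpath('precio', '//*[@id="final-price"]/text()')
        # anchor on the <h3> label text instead of the element position
        item.add_xpath('marca',
                       '//h3[contains(text(), "Marca")]/following-sibling::span[contains(@class, "description")]/text()',
                       MapCompose(lambda i: i.strip()))
        item.add_xpath('modelo',
                       '//h3[contains(text(), "Modelo")]/following-sibling::span[contains(@class, "description")]/text()',
                       MapCompose(lambda i: i.strip()))
        yield item.load_item()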

For those two pages, you can use the following CSS selector:
ul:nth-child(2) > li:nth-child(1) > span
and take the first returned match.
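If you go that route, a minimal sketch of using it inside the spider callback (the selector is the one suggested above and still depends on the page keeping that list layout):
    def parse_items(self, response):
        # take the first match of the position-based CSS selector
        modelo = response.css('ul:nth-child(2) > li:nth-child(1) > span::text').get()
        yield {'modelo': modelo.strip() if modelo else None}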

Related

Find sub class of a class and return list of elements

I intend to scrape certain countries listed under Chapter 4 of a webpage and return a list of those countries. The challenge is that I cannot retrieve the tag.
USING READ HTML
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

reqUS = Request('https://www.state.gov/reports/country-reports-on-terrorism-2019/',
                headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'})
US = urlopen(reqUS).read()
# print(US)

# Create a soup object
soup = BeautifulSoup(US, 'html.parser')

# find class "floated-right well"
#Terrorist_list = soup.find_all(attrs={"class": "report__section-title"})
Chapter4 = soup.find('h2', class_="report__section-title", id="report-toc__section-7")
#print(Chapter4)

# Give location where text is stored which you wish to alter
unordered_list = soup.find("h2", {"id": "report-toc__section-7"})
print(unordered_list)
You could use #report-toc__section-7 as an anchor, with TERRORIST SAFE HAVENS as the start point and COUNTERING TERRORISM ON THE ECONOMIC FRONT as the endpoint. Pass those strings to :-soup-contains to filter with CSS selectors and obtain only the p tags between them that have a child strong tag (using :has). You also need :not to drop the p tags with child strong elements after, and including, the endpoint. From that filtering, pull out the child strong tags, which hold the locality and country names.
Loop over the returned list and test whether the strong text is all uppercase; if so, it is a locality and becomes the key of a dictionary, to which you append the following strong values as a list of countries, repeating as you encounter each new locality. You can then pull out specific countries by locality.
For older bs4 versions replace :-soup-contains with :contains.
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.state.gov/reports/country-reports-on-terrorism-2019/')
soup = bs(r.content, 'lxml')
items = soup.select('section:has(>#report-toc__section-7) p:has(strong:-soup-contains("TERRORIST SAFE HAVENS")) ~ p:has(strong):not(p:has(strong:-soup-contains("COUNTERING TERRORISM ON THE ECONOMIC FRONT")), p:has(strong:-soup-contains("COUNTERING TERRORISM ON THE ECONOMIC FRONT")) ~ p) > strong')

d = {}
for i in items:
    if i.text.isupper():
        key = i.text
        d[key] = []
    else:
        value = i.text.strip()
        if value:
            d[key].append(value)

print(d)
Prints
Read more about css selectors here: https://developer.mozilla.org/en-US/docs/Web/CSS/Pseudo-classes

CSS selector or XPath that gets information between two i tags?

I'm trying to scrape price information, and the HTML of the website looks like this
<span class="def-price" datasku='....'>
<i>$</i>
"999"
<i>.00<i>
</span>
I want to get 999 (I don't want the dollar sign or the .00). I currently have
product_price_sn = product.css('.def-price i').extract()
I know it's wrong but I'm not sure how to fix it. Any idea how to scrape that price information? Thanks!
You can use this XPath: //span[@class="def-price"]/text()
Make sure you are using /text() and not //text(); otherwise it will return all text nodes inside the span tag.
Or use this CSS selector: .def-price::text. When using the CSS selector, don't write .def-price ::text (with a space); that returns all text nodes, just like //text() in XPath.
Using scrapy response.xpath object
from scrapy.http import Request, HtmlResponse as Response

content = '''<span class="def-price" datasku='....'>
<i>$</i>
"999"
<i>.00<i>
</span>'''.encode('utf-8')

url = 'https://stackoverflow.com/questions/62849500'

''' mocking scrapy request object '''
request = Request(url=url)

''' mocking scrapy response object '''
response = Response(url=url, request=request, body=content)

''' using xpath '''
print(response.xpath('//span[@class="def-price"]/text()').extract())
# outputs ['\n ', '\n "999"\n ']
print(''.join(response.xpath('//span[@class="def-price"]/text()').extract()).strip())
# outputs "999"

''' using css selector '''
print(response.css('.def-price::text').extract())
# outputs ['\n ', '\n "999"\n ']
print(''.join(response.css('.def-price::text').extract()).strip())
# outputs "999"
Using lxml html parser
from lxml import html

parser = html.fromstring("""
<span class="def-price" datasku='....'>
<i>$</i>
"999"
<i>.00<i>
</span>
""")

print(parser.xpath('//span[@class="def-price"]/text()'))
# outputs ['\n ', '\n "999"\n ']
print(''.join(parser.xpath('//span[@class="def-price"]/text()')).strip())
# outputs "999"
With BeautifulSoup, you can use the CSS selector .def-price and then .find_all(text=True, recursive=0) to get all immediate text.
For example:
from bs4 import BeautifulSoup
txt = '''<span class="def-price" datasku='....'>
<i>$</i>
"999"
<i>.00<i>
</span>'''
soup = BeautifulSoup(txt, 'html.parser')
print( ''.join(soup.select_one('.def-price').find_all(text=True, recursive=0)).strip() )
Prints:
"999"
Scrapy implements an extension for that as it isn't standard for CSS selectors. So this should work for you:
product_price_sn = product.css('.def-price i::text').extract()
Here is what the docs say:
Per W3C standards, CSS selectors do not support selecting text nodes or attribute values. But selecting these is so essential in a web scraping context that Scrapy (parsel) implements a couple of non-standard pseudo-elements:
to select text nodes, use ::text
to select attribute values, use ::attr(name), where name is the name of the attribute that you want the value of
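As a minimal sketch of those pseudo-elements using parsel (the selector library behind Scrapy), combined with re_first() to keep only the digits; the HTML is the snippet from the question, with the trailing <i> closed for clarity:
from parsel import Selector

html = '''<span class="def-price" datasku='....'>
<i>$</i>
"999"
<i>.00</i>
</span>'''

sel = Selector(text=html)

# ::text selects the span's own text nodes (not the text inside the <i> children)
print(sel.css('span.def-price::text').getall())

# ::attr(name) selects an attribute value
print(sel.css('span.def-price::attr(datasku)').get())     # '....'

# with a space before ::text, descendant text nodes ($ and .00) are included too,
# so a regex is used to keep only the first number
print(sel.css('span.def-price ::text').re_first(r'\d+'))  # '999'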

Beautiful Soup Pagination, find_all not finding text within next_page class. Need also to extract data from URLS

I've been working on this for a week and am determined to get this working!
My ultimate goal is to write a webscraper where you can insert the county name and the scraper will produce a csv file of information from mugshots - Name, Location, Eye Color, Weight, Hair Color and Height (it's a genetics project I am working on).
The site organization is: primary site page --> state page --> county page --> 120 mugshots with name and URL --> URL with the data I am ultimately after, plus next links to another set of 120.
I thought the best way to do this would be to write a scraper that grabs the URLs and names from the table of 120 mugshots and then uses pagination to grab all the URLs and names from the rest of the county (in some cases there are tens of thousands). I can get the first 120, but my pagination doesn't work, so I'm ending up with a CSV of 120 names and URLs.
I closely followed this article, which was very helpful.
from bs4 import BeautifulSoup
import requests
import lxml
import pandas as pd

county_name = input('Please, enter a county name: /Arizona/Maricopa-County-AZ \n')
print(f'Searching {county_name}. Wait, please...')

base_url = 'https://www.mugshots.com'
search_url = f'https://mugshots.com/US-Counties/{county_name}/'

data = {'Name': [], 'URL': []}

def export_table_and_print(data):
    table = pd.DataFrame(data, columns=['Name', 'URL'])
    table.index = table.index + 1
    table.to_csv('mugshots.csv', index=False)
    print('Scraping done. Here are the results:')
    print(table)

def get_mugshot_attributes(mugshot):
    name = mugshot.find('div', attrs={'class', 'label'})
    url = mugshot.find('a', attrs={'class', 'image-preview'})
    name = name.text
    url = mugshot.get('href')
    url = base_url + url
    data['Name'].append(name)
    data['URL'].append(url)

def parse_page(next_url):
    page = requests.get(next_url)
    if page.status_code == requests.codes.ok:
        bs = BeautifulSoup(page.text, 'lxml')
        list_all_mugshot = bs.find_all('a', attrs={'class', 'image-preview'})
        for mugshot in list_all_mugshot:
            get_mugshot_attributes(mugshot)
        next_page_text = mugshot.find('a class', attrs={'next page'})
        if next_page_text == 'Next':
            next_page_text = mugshot.get_text()
            next_page_url = mugshot.get('href')
            next_page_url = base_url + next_page_url
            print(next_page_url)
            parse_page(next_page_url)
        else:
            export_table_and_print(data)

parse_page(search_url)
Any ideas on how to get the pagination to work and also how to eventually get the data from the list of URLs I scrape?
I appreciate your help! I've been working in python for a few months now, but the BS4 and Scrapy stuff is so confusing for some reason.
Thank you so much community!
Anna
It seems you want to know the logic for getting content from URLs collected while traversing the pages, including the next pages. This is how you can parse all the links from each page, follow the next-page link, and then use those links to get the content from their inner pages.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://mugshots.com/"
base = "https://mugshots.com"

def get_next_pages(link):
    print("**" * 20, "current page:", link)
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select("[itemprop='name'] > a[href^='/Current-Events/']"):
        yield from get_main_content(urljoin(base, item.get("href")))
    next_page = soup.select_one(".pagination > a:contains('Next')")
    if next_page:
        next_page = urljoin(url, next_page.get("href"))
        yield from get_next_pages(next_page)

def get_main_content(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    item = soup.select_one("h1#item-title > span[itemprop='name']").text
    yield item

if __name__ == '__main__':
    for elem in get_next_pages(url):
        print(elem)

How to loop in dropdown menu in Aspx dynamic websites using python requests and BeautifulSoup and scrape data

For my question I read the posts "request using python to asp.net page" and "Data Scraping, aspx", and I found most of what I was looking for, but there are some minor items still to solve.
I want to scrape the website http://up-rera.in/, which is a dynamic ASPX site. Inspecting the elements shows that the data actually comes from a different link: http://upreraportal.cloudapp.net/View_projects.aspx
How can I loop over all the dropdown options and click Search to get the page content? For example, I am able to scrape Agra and get its page details.
Since I am still learning, I am avoiding Selenium for this.
Here is my code:
import requests
from bs4 import BeautifulSoup
import os
import time
import csv
final_data = []
url = "http://upreraportal.cloudapp.net/View_projects.aspx"
headers= {'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Content-Type':'application/x-www-form-urlencoded',
'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
formfields={'__VIEWSTATE':'9VAv5iAKM/uLKHgQ6U91ShYmoKdKfrPqrxB2y86PhSY8pOPAulcgfrsPDINzwmvXGr+vdlE7FT6eBQCKtAFsJPQ9gQ9JIBTBCGCIjYwFuixL3vz6Q7R0OZTH2cwYmyfPHLOqxh8JbLDfyKW3r3e2UgP5N/4pI1k6DNoNAcEmNYGPzGwFHgUdJz3LYfYuFDZSydsVrwSB5CHAy/pErTJVDMmOackTy1q6Y+TNw7Cnq2imnKnBc70eldJn0gH/rtkrlPMS+WP3CXke6G7nLOzaUVIlnbHVoA232CPRcWuP1ykPjSfX12hAao6srrFMx5GUicO3Dvpir+z0U1BDEjux86Cu5/aFML2Go+3k9iHiaS3+WK/tNNui5vNAbQcPiZrnQy9wotJnw18bfHZzU/77uy22vaC+8vX1cmomiV70Ar33szSWTQjbrByyhbFbz9PHd3IVebHPlPGpdaUPxju5xkFQIJRnojsOARjc76WzTYCf479BiXUKNKflMFmr3Fp5S3BOdKFLBie1fBDgwaXX4PepOeZVm1ftY0YA4y8ObPxkJBcGh5YLxZ4vJr2z3pd8LT2i/2fyXJ9aXR9+SJzlWziu9bV8txiuJHSQNojr10mQv8MSCUAKUjT/fip8F3UE9l+zeQBOC++LEeQiTurHZD0GkNix8zQAHbNpGLBfvgocXZd/4KqqnBCLLwBVQobhRbJhbQJXbGYNs6zIXrnkx7CD9PjGKvRx9Eil19Yb5EqRLJQHSg5OdwafD1U+oyZwr3iUMXP/pJw5cTHMsK3X+dH4VkNxsG+KFzBzynKPdF17fQknzqwgmcQOxD6NN6158pi+9cM1UR4R7iwPwuBCOK04UaW3V1A9oWFGvKLls9OXbLq2DS4L3EyuorEHnxO+p8rrGWIS4aXpVVr4TxR3X79j4i8OVHhIUt8H+jo5deRZ6aG13+mXgZQd5Qu1Foo66M4sjUGs7VUcwYCXE/DP/NHToeU0hUi0sJs7+ftRy07U2Be/93TZjJXKIrsTQxxeNfyxQQMwBYZZRPPlH33t3o3gIo0Hx18tzGYj2v0gaBb+xBpx9mU9ytkceBdBPnZI1kJznArLquQQxN3IPjt6+80Vow74wy4Lvp7D+JCThAnQx4K8QbdKMWzCoKR63GTlBwLK2TiYMAVisM77XdrlH6F0g56PlGQt/RMtU0XM1QXgZvWr3KJDV8UTe0z1bj29sdTsHVJwME9eT62JGZFQAD4PoiqYl7nAB61ajAkcmxu0Zlg7+9N9tXbL44QOcY672uOQzRgDITmX6QdWnBqMjgmkIjSo1qo/VpUEzUXaVo5GHUn8ZOWI9xLrJWcOZeFl0ucyKZePMnIxeUU32EK/NY34eE6UfSTUkktkguisYIenZNfoPYehQF9ASL7t4qLiH5jca4FGgZW2kNKb3enjEmoKqbWDFMkc8/1lsk2eTd/GuhcTysVSxtvpDSlR0tjg8A2hVpR67t2rYm8iO/L1m8ImY48=',
'__VIEWSTATEGENERATOR':'4F1A7E70',
'__EVENTVALIDATION':'jVizPhFNJmo9F/GVlIrlMWMsjQe1UKHfYE4jlpTDfXZHWu9yAcpHUvT/1UsRpbgxYwZczJPd6gsvas8ilVSPkfwP1icGgOTXlWfzykkU86LyIEognwkhOfO1+suTK2e598vAjyLXRf555BXMtCO+oWoHcMjbVX2cHKtpBS1GyyqyyVB8IchAAtDEMD3G5bbzhvof6PX4Iwt5Sv1gXkHRKOR333OcYzmSGJvZgLsmo3qQ+5EOUIK5D71x/ZENmubZXvwbU0Ni6922E96RjCLh5cKgFSne5PcRDUeeDuEQhJLyD04K6N45Ow2RKyu7HN1n1YQGFfgAO3nMCsP51i7qEAohXK957z3m/H+FasHWF2u05laAWGVbPwT35utufotpPKi9qWAbCQSw9vW9HrvN01O97scG8HtWxIOnOdI6/nhke44FSpnvY1oPq+BuY2XKrb2404fKl5EPR4sjvNSYy1/8mn6IDH0eXvzoelNMwr/pKtKBESo3BthxTkkx5MR0J42qhgHURB9eUKlsGulAzjF27pyK4vjXxzlOlHG1pRiQm/wzB4om9dJmA27iaD7PJpQGgSwp7cTpbOuQgnwwrwUETxMOxuf3u1P9i+DzJqgKJbQ+pbKqtspwYuIpOR6r7dRh9nER2VXXD7fRfes1q2gQI29PtlbrRQViFM6ZlxqxqoAXVM8sk/RfSAL1LZ6qnlwGit2MvVYnAmBP9wtqcvqGaWjNdWLNsueL6DyUZ4qcLv42fVcOrsi8BPRnzJx0YiOYZ7gg7edHrJwpysSGDR1P/MZIYFEEUYh238e8I2EAeQZM70zHgQRsviD4o5r38VQf/cM9fjFii99E/mZ+6e0mIprhlM/g69MmkSahPQ5o/rhs8IJiM/GibjuZHSNfYiOspQYajMg0WIGeKWnywfaplt6/cqvcEbqt77tIx2Z0yGcXKYGehmhyHTWfaVkMuKbQP5Zw+F9X4Fv5ws76uCZkOxKV3wj3BW7+T2/nWwWMfGT1sD3LtQxiw0zhOXfY1bTB2XfxuL7+k5qE7TZWhKF4EMwLoaML9/yUA0dcXhoZBnSc',
'ctl00$ContentPlaceHolder1$DdlprojectDistrict':'Agra',
'ctl00$ContentPlaceHolder1$txtProject':'',
'ctl00$ContentPlaceHolder1$btnSearch':'Search'}
# here in form details I checked Agra; I am able to scrape one city only,
# how to loop for all cities?
r = requests.post(url, data=formfields, headers=headers)
data = r.text
soup = BeautifulSoup(data, "html.parser")

get_list = soup.find_all('option')  # gets list of all <option> tags
for element in get_list:
    cities = element["value"]
    #final.append(cities)
    #print(final)

get_details = soup.find_all("table", attrs={"id":"ContentPlaceHolder1_GridView1"})
for details in get_details:
    text = details.find_all("tr")[1:]
    for tds in text:
        td = tds.find_all("td")[1]
        rera = td.find_all("span")
        rnumber = ""
        for num in rera:
            rnumber = num.text
        print(rnumber)
Try the code below. It will give you all the results you are after; only a little tweak was needed. I scraped the different names from the dropdown menu and used them in a loop so that you can get all the data one by one. I did nothing else except add a few lines. Your code would be better if you wrapped it in a function.
By the way, I've put the two giant strings into two variables so that you don't need to worry about them, which also makes the code a little slimmer.
This is the rectified code:
import requests
from bs4 import BeautifulSoup

url = "http://upreraportal.cloudapp.net/View_projects.aspx"

response = requests.get(url).text
soup = BeautifulSoup(response, "lxml")

VIEWSTATE = soup.select("#__VIEWSTATE")[0]['value']
EVENTVALIDATION = soup.select("#__EVENTVALIDATION")[0]['value']

for title in soup.select("#ContentPlaceHolder1_DdlprojectDistrict [value]")[:-1]:
    search_item = title.text
    # print(search_item)
    headers = {'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
               'Content-Type':'application/x-www-form-urlencoded',
               'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
    formfields = {'__VIEWSTATE':VIEWSTATE,  # put the scraped value in this variable
                  '__VIEWSTATEGENERATOR':'4F1A7E70',
                  '__EVENTVALIDATION':EVENTVALIDATION,  # put the scraped value in this variable
                  'ctl00$ContentPlaceHolder1$DdlprojectDistrict':search_item,  # this is where the city name changes in each iteration
                  'ctl00$ContentPlaceHolder1$txtProject':'',
                  'ctl00$ContentPlaceHolder1$btnSearch':'Search'}

    res = requests.post(url, data=formfields, headers=headers).text
    soup = BeautifulSoup(res, "html.parser")

    get_list = soup.find_all('option')  # gets list of all <option> tags
    for element in get_list:
        cities = element["value"]
        #final.append(cities)
        #print(final)

    get_details = soup.find_all("table", attrs={"id":"ContentPlaceHolder1_GridView1"})
    for details in get_details:
        text = details.find_all("tr")[1:]
        for tds in text:
            td = tds.find_all("td")[1]
            rera = td.find_all("span")
            rnumber = ""
            for num in rera:
                rnumber = num.text
            print(rnumber)

How to scrape this data from the website?

Here's an example: http://www.cincinnati.com/story/money/2016/11/26/see-which-companies-16-deloitte-100/94441104/
Ideally, I'd like to see a neatly crawled and extracted output data array with the following fields:
Company Name
2016 Rank
2015 Rank
Years in Business
Business Description
Website
2015 Revenues
2014 Revenues
HQ City
Year Founded
Employees
Is family owned?
from each of the specific company data pages. I'm a complete beginner with Scrapy and want to know how to extract the links automatically. In this code I'm feeding them in manually. Can anyone help me here?
import scrapy
from spy.items import SpyItem
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.linkextractors import LinkExtractor

class ProjectSpider(CrawlSpider):
    name = "project"
    allowed_domains = ["cincinnati.com/story/money/2016/11/26/see-which-companies-16-deloitte-100/94441104/"]
    start_urls = [100Links in here]

    def parse(self, response):
        item = SpyItem()
        item['title'] = response.xpath('//*[@id="overlay"]/div[2]/article/div[3]/p[1]/strong/text()').extract()
        item['Business'] = response.xpath('//*[@id="overlay"]/div[2]/article/div[3]/p[4]/text()').extract()
        item['website'] = response.xpath('//p[5]/a/text()').extract()
        item['Ranking'] = response.xpath('//*[@id="overlay"]/div[2]/article/div[3]/p[2]/text()[1]').extract()
        item['HQ'] = response.css('p:nth-child(12)::text').extract()
        item['Revenue2015'] = response.xpath('//*[@id="overlay"]/div[2]/article/div[3]/p[7]/text()').extract()
        item['Revenue2014'] = response.css('p:nth-child(10)::text').extract()
        item['YearFounded'] = response.xpath('//p[11]/text()').extract().encode('utf-8')
        item['Employees'] = response.xpath('//article/div[3]/p[12]/text()').extract()
        item['FamilyOwned'] = response.xpath('//*[@id="overlay"]/div[2]/article/div[3]/p[13]/text()').extract()
        yield item
There are at least two issues with your code:
allowed_domains has to be a domain, nothing more.
You use a CrawlSpider, which is meant to be used with Rules, but you don't have any rules.
The following is some tested code as a starting point:
import scrapy

class ProjectItem(scrapy.Item):
    title = scrapy.Field()
    owned = scrapy.Field()

class ProjectSpider(scrapy.Spider):
    name = "cin100"
    allowed_domains = ['cincinnati.com']
    start_urls = ['http://www.cincinnati.com/story/money/2016/11/26/see-which-companies-16-deloitte-100/94441104/']

    def parse(self, response):
        # get selector for all 100 companies
        sel_companies = response.xpath('//p[contains(.,"Here are the companies")]/following-sibling::p/a')

        # create request for every single company detail page from href
        for sel_companie in sel_companies:
            href = sel_companie.xpath('./@href').extract_first()
            url = response.urljoin(href)
            request = scrapy.Request(url, callback=self.parse_company_detail)
            yield request

    def parse_company_detail(self, response):
        # On detail page create item
        item = ProjectItem()
        # get detail information with specific XPath statements
        # e.g. title is the first paragraph
        item['title'] = response.xpath('//div[@role="main"]/p[1]//text()').extract_first()
        # e.g. family owned has a label we can select
        item['owned'] = response.xpath('//div[@role="main"]/p[contains(.,"Family owned")]/text()').extract_first()
        # find clever XPaths for other fields ...
        # ...
        # Finally: yield the item
        yield item
