I am trying to extract quotes from https://www.goodreads.com/quotes. It seems that I am only getting the first page and the next page part is not working.
Here is my code:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://www.goodreads.com/quotes'
    ]

    def parse(self, response):
        for quote in response.xpath("//div[@class='quote']"):
            yield {
                'quoteText': quote.xpath(".//div[@class='quoteText']").extract_first()
            }
        next_page = response.css("a").xpath("@href").extract()
        if next_page is not None:
            next_page_link = response.urljoin(next_page)
            yield scrapy.Request(url=next_page_link, callback=self.parse)
You have to get the href of the next page's link. Your current selector, response.css("a").xpath("@href").extract(), collects every href on the page and returns a list, so urljoin never receives a single next-page URL. Use this to get the next page URL:
next_page=response.css("a.next_page::attr(href)").get()
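With that in place, here is a minimal sketch of the corrected parse method (the quote XPaths are taken from your code and assumed to still match the page):

def parse(self, response):
    for quote in response.xpath("//div[@class='quote']"):
        yield {
            'quoteText': quote.xpath(".//div[@class='quoteText']").extract_first()
        }
    # urljoin needs a single href string, not a list of every link on the page
    next_page = response.css("a.next_page::attr(href)").get()
    if next_page is not None:
        yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse)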
You can read more about selectors here:
https://docs.scrapy.org/en/latest/topics/selectors.html
Related
I want to scrape multiple URLs stored in a CSV file using Scrapy. My code runs without errors, but it only scrapes the last URL, not all of them. Please tell me what I'm doing wrong. I want to scrape all the URLs and save the scraped text together. I have already tried a lot of the suggestions found on Stack Overflow. My code:
import scrapy
from scrapy import Request
from ..items import personalprojectItem


class ArticleSpider(scrapy.Spider):
    name = 'articles'

    with open('C:\\Users\\Admin\\Documents\\Bhavya\\input_urls.csv') as file:
        for line in file:
            start_urls = line

    def start_requests(self):
        request = Request(url=self.start_urls)
        yield request

    def parse(self, response):
        item = personalprojectItem()
        article = response.css('div p::text').extract()
        item['article'] = article
        yield item
The with/for block in your class body rebinds start_urls to a single string on every iteration, so only the last line of the CSV survives, and start_requests then issues just that one request (start_urls should also be a list, not a string). Below is a minimal example of how you can include a list of URLs from a file in a Scrapy project.
We have a text file with the following links, inside the scrapy project folder:
https://www.theguardian.com/technology/2022/nov/18/elon-musk-twitter-engineers-workers-mass-resignation
https://www.theguardian.com/world/2022/nov/18/iranian-protesters-set-fire-to-ayatollah-khomeinis-ancestral-home
https://www.theguardian.com/world/2022/nov/18/canada-safari-park-shooting-animals-two-charged
The spider code looks like this (again, minimal example):
import scrapy


class GuardianSpider(scrapy.Spider):
    name = 'guardian'
    allowed_domains = ['theguardian.com']
    # one URL per line; strip the trailing newline from each
    start_urls = [x.strip() for x in open('urls_list.txt', 'r').readlines()]

    def parse(self, response):
        title = response.xpath('//h1/text()').get()
        header = response.xpath('//div[@data-gu-name="standfirst"]//p/text()').get()
        yield {
            'title': title,
            'header': header
        }
If we run the spider with scrapy crawl guardian -o guardian_news.json, we get a JSON file looking like this:
[
{"title": "Elon Musk summons Twitter engineers amid mass resignations and puts up poll on Trump ban", "header": "Reports show nearly 1,200 workers left company after demand for \u2018long hours at high intensity\u2019, while Musk starts poll on whether to reinstate Donald Trump"},
{"title": "Iranian protesters set fire to Ayatollah Khomeini\u2019s ancestral home", "header": "Social media images show what is now a museum commemorating the Islamic Republic founder ablaze as protests continue"},
{"title": "Two Canadian men charged with shooting animals at safari park", "header": "Mathieu Godard and Jeremiah Mathias-Polson accused of breaking into Parc Omega in Quebec and killing three wild boar and an elk"}
]
Scrapy documentation can be found here: https://docs.scrapy.org/en/latest/
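If you would rather keep reading the URLs from your CSV, here is a minimal sketch that yields one request per row by overriding start_requests (the file path and item field are illustrative; adjust them to your project):

import csv

import scrapy


class ArticleSpider(scrapy.Spider):
    name = 'articles'

    def start_requests(self):
        # yield one request per row instead of rebinding start_urls in a loop
        with open('input_urls.csv', newline='') as f:
            for row in csv.reader(f):
                if row:
                    yield scrapy.Request(url=row[0].strip(), callback=self.parse)

    def parse(self, response):
        yield {'article': response.css('div p::text').getall()}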
This is the URL: https://www.eazydiner.com/goa/restaurants/
import scrapy


class WebScraperSpider(scrapy.Spider):
    name = 'web_scraper'
    allowed_domains = ['https://www.eazydiner.com']
    start_urls = ['http://https://www.eazydiner.com/goa/restaurants//']

    def parse(self, response):
        yield {
            'name': response.xpath('//span[@class="res_name"]/text()').get(),
            'location': response.xpath('//span[@class="res_loc inline-block"]/text()').get(),
            'cuisine': response.xpath('//span[@class="padding-l-10 greyres_cuisine"]/text()').get(),
        }
Here is my code. I am using the Scrapy basic spider template and I am getting a DNS lookup failed error. Where is my mistake?
class TopmoviesSpider(scrapy.Spider):
    name = 'topmovies'
    allowed_domains = ['www.imdb.com']
    start_urls = ['https://https://www.imdb.com/chart/top/']

    def parse(self, response):
        movies = response.xpath("//td[@class='titleColumn']/a")
        for movie in movies:
            link = movie.xpath(".//@href").get()
            yield response.follow(url=link, callback=self.scrape_movie)

    def scrape_movie(self, response):
        rating = response.xpath("//span[@itemprop='ratingValue']/text()").get()
        for mov in response.xpath("//div[@class='title_wrapper']"):
            yield {
                'title': mov.xpath(".//h1/text()").get(),
                'year_of_release': mov.xpath(".//span/a/text()").get(),
                'duration': mov.xpath(".//div[@class='subtext']/time/text()").get(),
                'genre': mov.xpath(".//div[@class='subtext']/a/text()").get(),
                'date_of_release': mov.xpath(".//div[@class='subtext']/a[2]/text()").get(),
                'rating': rating
            }
Check your start_urls: you have given an invalid URL (the scheme appears twice, as in https://https://...). If you are trying to crawl IMDb, check this post.
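For reference, the corrected start_urls and allowed_domains would look roughly like this (a sketch; allowed_domains takes bare domains, not schemes):

# eazydiner spider
allowed_domains = ['eazydiner.com']
start_urls = ['https://www.eazydiner.com/goa/restaurants/']

# imdb spider
allowed_domains = ['imdb.com']
start_urls = ['https://www.imdb.com/chart/top/']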
I am developing a scraper for internal use and evaluation of my company's partner website onestop.jdsu.com. The website is actually an ASPX site.
I can't get scrapy to login to the page: https://onestop.jdsu.com/_layouts/JDSU.OneStop/Login.aspx?ReturnUrl=%2f_layouts%2fAuthenticate.aspx%3fSource%3d%252F&Source=%2F
There are actually two methods of login on the page and I think I'm having a problem with distinguishing them in the scrapy spider. The one I'm most interested in is the "partner login" although login using the employee login, which is actually a script that displays a drop-down login window, would be fine.
I've used "loginform" to extract the relevant fields from both forms. Unfortunately no combination of relevant POST data seems to make a difference. Perhaps I'm not clicking the button on the partner form ("ctl00$PlaceHolderMain$loginControl$login","")?
Also the "Login failed" message does not come through even when I know the login had to have failed.
The spider below ignores "__VIEWSTATE" and "__EVENTVALIDATION" because they don't make a difference if included and they don't seem to have anything to do with the partner login in the HTML of the page.
Any help would be very much appreciated!
LOGINFORM TEST OUTPUT
python ./test.py https://onestop.jdsu.com/_layouts/JDSU.OneStop/Login.aspx?ReturnUrl=%2f_layouts%2fAuthenticate.aspx%3fSource%3d%252F&Source=%2F
[1] 1273
peter-macbook:_loginform-master peter$ [
"https://onestop.jdsu.com/_layouts/JDSU.OneStop/Login.aspx?ReturnUrl=%2f_layouts%2fAuthenticate.aspx%3fSource%3d%252F",
[
[
[
"__VIEWSTATE",
"/wEPDwUKMTEzNDkwMDAxNw9kFgJmD2QWAgIBD2QWAgIDD2QWCAIDDxYCHgdWaXNpYmxlaGQCBQ8WAh8AaGQCCw9kFgYCAQ8WAh4EaHJlZgUhL193aW5kb3dzL2RlZmF1bHQuYXNweD9SZXR1cm5Vcmw9ZAIDD2QWAgIDDw8WAh8AaGRkAgUPFgIfAGhkAg0PFgIfAGgWAgIBDw8WAh4ISW1hZ2VVcmwFIS9fbGF5b3V0cy8xMDMzL2ltYWdlcy9jYWxwcmV2LnBuZ2RkZP7gVj0vs2N5c/DzKfAu4DwrFihP"
],
[
"__EVENTVALIDATION",
"/wEWBALlpOFKAoyn3a4JAuj7pusEAsXI9Y8HY+WYdEUkWKmn7tesA+BODBefeYE="
],
[
"ctl00$PlaceHolderMain$loginControl$UserName",
"USER"
],
[
"ctl00$PlaceHolderMain$loginControl$password",
"PASS"
],
[
"ctl00$PlaceHolderMain$loginControl$login",
""
]
],
"https://onestop.jdsu.com/_layouts/JDSU.OneStop/Login.aspx?ReturnUrl=%2f_layouts%2fAuthenticate.aspx%3fSource%3d%252F",
"POST"
]
]
SCRAPY SPIDER FOR PARTNER LOGIN
import scrapy
from tutorial.items import WaveReadyItem
#from scrapy import log
#from scrapy.shell import inspect_response


class WaveReadySpider(scrapy.Spider):
    name = "onestop_home-page-3"
    allowed_domains = ["https://onestop.jdsu.com"]
    start_urls = [
        "https://onestop.jdsu.com/_layouts/JDSU.OneStop/Login.aspx?ReturnUrl=%2f_layouts%2fAuthenticate.aspx%3fSource%3d%252F&Source=%2F",
        "https://onestop.jdsu.com/Products/network-systems/Pages/default.aspx"
    ]

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'ctl00$PlaceHolderMain$loginControl$UserName': 'MY-USERID',
                      'ctl00$PlaceHolderMain$loginControl$password': 'MY-PASSWD',
                      'ctl00$PlaceHolderMain$loginControl$login': ''},
            callback=self.after_login
        )

    def after_login(self, response):
        # check that the login succeeded before going on
        if "Invalid ID or Password" in response.body:
            self.log("Login failed", level=log.ERROR)
            return

    def parse(self, response):
        #=============================================================================
        # HOME PAGE: PICK UP OTHER LANDING PAGES IN CENTER COLUMN
        #=============================================================================
        # etc.
I don't know exactly why it fails for you, but here is how I use loginform:
from loginform import fill_login_form
from scrapy.http import FormRequest

def parse(self, response):
    args, url, method = fill_login_form(response.url, response.body,
                                        self.username, self.password)
    return FormRequest(url, method=method, formdata=args, callback=self.after_login)
The fill_login_form function does its best to locate the login form on the page and then returns everything needed to perform the login, including hidden fields such as __VIEWSTATE. If you fill in the form manually, something may be missed.
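For context, here is a minimal sketch of how that call can sit in a full spider (assuming the loginform package is installed; the spider name and the username/password placeholders are illustrative):

from loginform import fill_login_form
from scrapy.http import FormRequest
import scrapy


class OneStopLoginSpider(scrapy.Spider):
    name = 'onestop_login'
    start_urls = ['https://onestop.jdsu.com/_layouts/JDSU.OneStop/Login.aspx']
    username = 'MY-USERID'  # illustrative placeholders
    password = 'MY-PASSWD'

    def parse(self, response):
        # loginform locates the form and returns every field it needs to submit,
        # including the hidden __VIEWSTATE and __EVENTVALIDATION values
        args, url, method = fill_login_form(response.url, response.body,
                                            self.username, self.password)
        return FormRequest(url, method=method, formdata=args,
                           callback=self.after_login)

    def after_login(self, response):
        if b"Invalid ID or Password" in response.body:
            self.logger.error("Login failed")
            return
        # continue with the authenticated crawl here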
I am crawling a site which may contain a lot of start_urls, like:
http://www.a.com/list_1_2_3.htm
I want to populate start_urls like [list_\d+_\d+_\d+\.htm],
and extract items from URLs like [node_\d+\.htm] during crawling.
Can I use CrawlSpider to achieve this?
And how can I generate the start_urls dynamically while crawling?
The best way to generate URLs dynamically is to override the start_requests method of the spider:
from scrapy.http import Request

def start_requests(self):
    with open('urls.txt', 'r') as urls:
        for url in urls:
            yield Request(url.strip(), self.parse)
There are two questions:
1) Yes, you can realize this functionality by using rules, e.g.:
rules = (Rule(SgmlLinkExtractor(allow=(r'node_\d+\.htm',)), callback='parse_item'),)
(Avoid naming the callback parse: CrawlSpider uses parse internally for its own logic.)
suggested reading
2) Yes, you can generate start_urls dynamically; start_urls is just a list, e.g.:
>>> start_urls = ['http://www.a.com/%d_%d_%d' % (n, n+1, n+2) for n in range(0, 26)]
>>> start_urls
['http://www.a.com/0_1_2', 'http://www.a.com/1_2_3', 'http://www.a.com/2_3_4', 'http://www.a.com/3_4_5', 'http://www.a.com/4_5_6', 'http://www.a.com/5_6_7', 'http://www.a.com/6_7_8', 'http://www.a.com/7_8_9', 'http://www.a.com/8_9_10','http://www.a.com/9_10_11', 'http://www.a.com/10_11_12', 'http://www.a.com/11_12_13', 'http://www.a.com/12_13_14', 'http://www.a.com/13_14_15', 'http://www.a.com/14_15_16', 'http://www.a.com/15_16_17', 'http://www.a.com/16_17_18', 'http://www.a.com/17_18_19', 'http://www.a.com/18_19_20', 'http://www.a.com/19_20_21', 'http://www.a.com/20_21_22', 'http://www.a.com/21_22_23', 'http://www.a.com/22_23_24', 'http://www.a.com/23_24_25', 'http://www.a.com/24_25_26', 'http://www.a.com/25_26_27']
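Putting the two parts together, here is a sketch of a CrawlSpider that builds start_urls dynamically and follows the node_ pages (the list-page pattern and the parse_item fields are illustrative; modern Scrapy uses LinkExtractor instead of SgmlLinkExtractor):

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class NodeSpider(CrawlSpider):
    name = 'nodes'
    allowed_domains = ['www.a.com']
    # generate the list pages dynamically instead of hard-coding them
    start_urls = ['http://www.a.com/list_%d_%d_%d.htm' % (n, n + 1, n + 2)
                  for n in range(0, 26)]

    rules = (
        # follow node_XXX.htm links found on the list pages and parse them
        Rule(LinkExtractor(allow=(r'node_\d+\.htm',)), callback='parse_item'),
    )

    def parse_item(self, response):
        yield {'url': response.url, 'title': response.xpath('//title/text()').get()}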