Here is my code. I am using the Scrapy basic spider template and I am getting a "DNS lookup failed" error. Where is my mistake?
import scrapy

class TopmoviesSpider(scrapy.Spider):
    name = 'topmovies'
    allowed_domains = ['www.imdb.com']
    start_urls = ['https://https://www.imdb.com/chart/top/']

    def parse(self, response):
        movies = response.xpath("//td[@class='titleColumn']/a")
        for movie in movies:
            link = movie.xpath(".//@href").get()
            yield response.follow(url=link, callback=self.scrape_movie)

    def scrape_movie(self, response):
        rating = response.xpath("//span[@itemprop='ratingValue']/text()").get()
        for mov in response.xpath("//div[@class='title_wrapper']"):
            yield {
                'title': mov.xpath(".//h1/text()").get(),
                'year_of_release': mov.xpath(".//span/a/text()").get(),
                'duration': mov.xpath(".//div[@class='subtext']/time/text()").get(),
                'genre': mov.xpath(".//div[@class='subtext']/a/text()").get(),
                'date_of_release': mov.xpath("//div[@class='subtext']/a[2]/text()").get(),
                'rating': rating
            }
Check your start_urls. You have given an invalid URL (the scheme appears twice). If you are trying to crawl IMDb, check this post.
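For reference, once the duplicated scheme is removed, the start URL would simply be:

start_urls = ['https://www.imdb.com/chart/top/']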
This is the URL: https://www.eazydiner.com/goa/restaurants/
import scrapy

class WebScraperSpider(scrapy.Spider):
    name = 'web_scraper'
    allowed_domains = ['https://www.eazydiner.com']
    start_urls = ['http://https://www.eazydiner.com/goa/restaurants//']

    def parse(self, response):
        yield {
            'name': response.xpath('//span[@class="res_name"]/text()').get(),
            'location': response.xpath('//span[@class="res_loc inline-block"]/text()').get(),
            'cuisine': response.xpath('//span[@class="padding-l-10 greyres_cuisine"]/text()').get(),
        }
I am trying to extract quotes from https://www.goodreads.com/quotes. It seems that I am only getting the first page and the next page part is not working.
Here is my code:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://www.goodreads.com/quotes'
    ]

    def parse(self, response):
        for quote in response.xpath("//div[@class='quote']"):
            yield {
                'quoteText': quote.xpath(".//div[@class='quoteText']").extract_first()
            }

        next_page = response.css("a").xpath("@href").extract()
        if next_page is not None:
            next_page_link = response.urljoin(next_page)
            yield scrapy.Request(url=next_page_link, callback=self.parse)
You have to get the href of the next page's link. Use this to get the next-page URL:
next_page = response.css("a.next_page::attr(href)").get()
You can read more about selectors here:
https://docs.scrapy.org/en/latest/topics/selectors.html
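Putting it together, the end of parse could then look something like this (a sketch; the a.next_page selector assumes Goodreads' current pagination markup):

next_page = response.css("a.next_page::attr(href)").get()
if next_page is not None:
    # follow the relative link; response.follow resolves it against the current URL
    yield response.follow(next_page, callback=self.parse)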
I am trying to get the titles of Booking.com comments from this website:
https://www.booking.com/reviews/co/hotel/ibis-bogota-museo.es.html?page=1;r_lang=all;rows=75,
where r_lang=all basically says that the website should show comments in every language.
In order to obtain the titles from this page I do this:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.booking.com/reviews/co/hotel/ibis-bogota-museo.es.html?page=1;r_lang=all;rows=75"
page = urlopen(url)
soup = BeautifulSoup(page, "html.parser")

reviews = soup.findAll("li", {"class": "review_item clearfix "})
for review in reviews:
    print(review.find("div", {"class": "review_item_header_content"}).text)
From the website (see screenshot), the first two titles should be "Sencillamente placentera" and "It could have been great.". However, somehow the URL only loads comments in Spanish:
“Sencillamente placentera”
“La atención de la chica del restaurante”
“El desayuno estilo buffet, completo ”
“Me gusto la ubicación, y la vista.”
“Su ubicación es muy buena.”
I noticed that if I change 'museo.es.' to 'museo.en.' in the URL, I get the headers of the English comments. But this is inconsistent, because if I load the original URL in a browser, I get comments in English, French, Spanish, etc. How can I fix this? Thanks.
Servers can be configured to send different responses based on the browser making the request. Adding a User-Agent seems to fix the problem.
import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.booking.com/reviews/co/hotel/ibis-bogota-museo.es.html?page=1;r_lang=all;rows=75'
req = urllib.request.Request(
    url,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36',
    }
)
f = urllib.request.urlopen(req)
soup = BeautifulSoup(f.read().decode('utf-8'), 'html.parser')

reviews = soup.findAll("li", {"class": "review_item clearfix "})
for review in reviews:
    print(review.find("div", {"class": "review_item_header_content"}).text)
Output:
“Sencillamente placentera”
“It could had been great.”
“will never stay their in the future.”
“Hôtel bien situé.”
...
You could always use a browser as plan B; Selenium doesn't have this problem:
from selenium import webdriver
d = webdriver.Chrome()
d.get('https://www.booking.com/reviews/co/hotel/ibis-bogota-museo.es.html?page=1;r_lang=all;rows=75')
titles = [item.text for item in d.find_elements_by_css_selector('.review_item_review_header [itemprop=name]')]
print(titles)
A newer way to access Booking.com reviews is to use the reviewlist.html endpoint. For example, for the hotel in the original question, the reviews are located at:
https://www.booking.com/reviewlist.html?pagename=ibis-bogota-museo&type=total&lang=en-us&sort=f_recent_desc&cc1=co&dist=1&rows=25&offset=0
This endpoint is particularly great because it supports many filters and offers up to 25 reviews per page.
Here's a snippet in Python with parsel and httpx:
import asyncio
from typing import List
from urllib.parse import urlencode

from parsel import Selector


def parse_reviews(html: str) -> List[dict]:
    """parse review page for review data"""
    sel = Selector(text=html)
    parsed = []
    for review_box in sel.css('.review_list_new_item_block'):
        get_css = lambda css: review_box.css(css).get("").strip()
        parsed.append({
            "id": review_box.xpath('@data-review-url').get(),
            "score": get_css('.bui-review-score__badge::text'),
            "title": get_css('.c-review-block__title::text'),
            "date": get_css('.c-review-block__date::text'),
            "user_name": get_css('.bui-avatar-block__title::text'),
            "user_country": get_css('.bui-avatar-block__subtitle::text'),
            "text": ''.join(review_box.css('.c-review__body ::text').getall()),
            "lang": review_box.css('.c-review__body::attr(lang)').get(),
        })
    return parsed


async def scrape_reviews(hotel_id: str, session) -> List[dict]:
    """scrape all reviews of a hotel"""

    async def scrape_page(page, page_size=25):  # 25 is the largest page size for this endpoint
        url = "https://www.booking.com/reviewlist.html?" + urlencode(
            {
                "type": "total",
                # we can configure language preference
                "lang": "en-us",
                # we can configure sorting order here, in this case recent reviews are first
                "sort": "f_recent_desc",
                "cc1": "gb",  # this varies by hotel country, e.g. in OP's case it would be "co" for Colombia
                "dist": 1,
                "pagename": hotel_id,
                "rows": page_size,
                "offset": (page - 1) * page_size,  # pages are 1-indexed, offsets start at 0
            }
        )
        return await session.get(url)

    first_page = await scrape_page(1)
    total_pages = Selector(text=first_page.text).css(".bui-pagination__link::attr(data-page-number)").getall()
    total_pages = max(int(page) for page in total_pages)
    other_pages = await asyncio.gather(*[scrape_page(i) for i in range(2, total_pages + 1)])

    results = []
    for response in [first_page, *other_pages]:
        results.extend(parse_reviews(response.text))
    return results
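To run the coroutine, an httpx client is passed in as session. A minimal driver might look like the sketch below; the ibis-bogota-museo pagename comes from the question, the headers are an assumption, and as the comment in scrape_reviews notes, cc1 would need to be "co" for this Colombian hotel:

import asyncio
import httpx

async def run():
    # a browser-like User-Agent helps here too, as in the earlier answers
    async with httpx.AsyncClient(
        headers={"User-Agent": "Mozilla/5.0", "Accept-Language": "en-US,en;q=0.9"}
    ) as session:
        reviews = await scrape_reviews("ibis-bogota-museo", session)
        print(f"scraped {len(reviews)} reviews")

asyncio.run(run())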
I write more about scraping this endpoint on my blog post How to Scrape Booking.com, which has more illustrations and videos if more information is needed.
I am developing a scraper for internal use and evaluation of my company's partner website onestop.jdsu.com. The website is actually an ASPX site.
I can't get scrapy to login to the page: https://onestop.jdsu.com/_layouts/JDSU.OneStop/Login.aspx?ReturnUrl=%2f_layouts%2fAuthenticate.aspx%3fSource%3d%252F&Source=%2F
There are actually two methods of login on the page and I think I'm having a problem with distinguishing them in the scrapy spider. The one I'm most interested in is the "partner login" although login using the employee login, which is actually a script that displays a drop-down login window, would be fine.
I've used "loginform" to extract the relevant fields from both forms. Unfortunately no combination of relevant POST data seems to make a difference. Perhaps I'm not clicking the button on the partner form ("ctl00$PlaceHolderMain$loginControl$login","")?
Also the "Login failed" message does not come through even when I know the login had to have failed.
The spider below ignores "__VIEWSTATE" and "__EVENTVALIDATION" because they don't make a difference if included and they don't seem to have anything to do with the partner login in the HTML of the page.
Any help would be very much appreciated!
LOGINFORM TEST OUTPUT
python ./test.py https://onestop.jdsu.com/_layouts/JDSU.OneStop/Login.aspx?ReturnUrl=%2f_layouts%2fAuthenticate.aspx%3fSource%3d%252F&Source=%2F
[1] 1273
peter-macbook:_loginform-master peter$ [
    "https://onestop.jdsu.com/_layouts/JDSU.OneStop/Login.aspx?ReturnUrl=%2f_layouts%2fAuthenticate.aspx%3fSource%3d%252F",
    [
        [
            [
                "__VIEWSTATE",
                "/wEPDwUKMTEzNDkwMDAxNw9kFgJmD2QWAgIBD2QWAgIDD2QWCAIDDxYCHgdWaXNpYmxlaGQCBQ8WAh8AaGQCCw9kFgYCAQ8WAh4EaHJlZgUhL193aW5kb3dzL2RlZmF1bHQuYXNweD9SZXR1cm5Vcmw9ZAIDD2QWAgIDDw8WAh8AaGRkAgUPFgIfAGhkAg0PFgIfAGgWAgIBDw8WAh4ISW1hZ2VVcmwFIS9fbGF5b3V0cy8xMDMzL2ltYWdlcy9jYWxwcmV2LnBuZ2RkZP7gVj0vs2N5c/DzKfAu4DwrFihP"
            ],
            [
                "__EVENTVALIDATION",
                "/wEWBALlpOFKAoyn3a4JAuj7pusEAsXI9Y8HY+WYdEUkWKmn7tesA+BODBefeYE="
            ],
            [
                "ctl00$PlaceHolderMain$loginControl$UserName",
                "USER"
            ],
            [
                "ctl00$PlaceHolderMain$loginControl$password",
                "PASS"
            ],
            [
                "ctl00$PlaceHolderMain$loginControl$login",
                ""
            ]
        ],
        "https://onestop.jdsu.com/_layouts/JDSU.OneStop/Login.aspx?ReturnUrl=%2f_layouts%2fAuthenticate.aspx%3fSource%3d%252F",
        "POST"
    ]
]
SCRAPY SPIDER FOR PARTNER LOGIN
import scrapy
from tutorial.items import WaveReadyItem
#from scrapy import log
#from scrapy.shell import inspect_response

class WaveReadySpider(scrapy.Spider):
    name = "onestop_home-page-3"
    allowed_domains = ["https://onestop.jdsu.com"]
    start_urls = [
        "https://onestop.jdsu.com/_layouts/JDSU.OneStop/Login.aspx?ReturnUrl=%2f_layouts%2fAuthenticate.aspx%3fSource%3d%252F&Source=%2F",
        "https://onestop.jdsu.com/Products/network-systems/Pages/default.aspx"
    ]

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'ctl00$PlaceHolderMain$loginControl$UserName': 'MY-USERID', 'ctl00$PlaceHolderMain$loginControl$password': 'MY-PASSWD', 'ctl00$PlaceHolderMain$loginControl$login': ''},
            callback=self.after_login
        )

    def after_login(self, response):
        # check login succeed before going on
        if "Invalid ID or Password" in response.body:
            self.log("Login failed", level=log.ERROR)
            return

    def parse(self, response):
        #=============================================================================
        # HOME PAGE: PICK UP OTHER LANDING PAGES IN CENTER COLUMN
        #=============================================================================
        etc.
I don't know why yours fails, but here is how I use loginform:
from loginform import fill_login_form
from scrapy.http import FormRequest

def parse(self, response):
    args, url, method = fill_login_form(response.url, response.body, self.username, self.password)
    return FormRequest(url, method=method, formdata=args, callback=self.after_login)
The fill_login_form function will try its best to locate the correct login form and then return everything needed to perform the login. If you fill in the form manually, something may be missed.
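For context, a minimal spider using this pattern might look like the sketch below; the credentials, the start URL, and the "Invalid ID or Password" check are taken from the question, while the class name and the username/password attributes are just placeholders:

import scrapy
from scrapy.http import FormRequest
from loginform import fill_login_form

class OneStopLoginSpider(scrapy.Spider):
    name = "onestop_login"
    start_urls = [
        "https://onestop.jdsu.com/_layouts/JDSU.OneStop/Login.aspx?ReturnUrl=%2f_layouts%2fAuthenticate.aspx%3fSource%3d%252F&Source=%2F"
    ]
    username = "MY-USERID"
    password = "MY-PASSWD"

    def parse(self, response):
        # let loginform locate and pre-fill the login form, then submit it
        args, url, method = fill_login_form(response.url, response.body, self.username, self.password)
        return FormRequest(url, method=method, formdata=args, callback=self.after_login)

    def after_login(self, response):
        # the failure marker from the question; adjust to whatever the site actually returns
        if b"Invalid ID or Password" in response.body:
            self.logger.error("Login failed")
            return
        # logged in: continue crawling from here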
I am crawling a site which may have a lot of start URLs, like:
http://www.a.com/list_1_2_3.htm
I want to populate start_urls with URLs matching [list_\d+_\d+_\d+\.htm],
and extract items from URLs like [node_\d+\.htm] during crawling.
Can I use CrawlSpider to realize this?
And how can I generate the start_urls dynamically while crawling?
The best way to generate URLs dynamically is to override the start_requests method of the spider:
from scrapy.http.request import Request

def start_requests(self):
    with open('urls.txt') as urls:
        for url in urls:
            yield Request(url.strip(), callback=self.parse)
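If the list pages follow the list_\d+_\d+_\d+.htm pattern from the question rather than living in a file, they can just as well be generated inside start_requests (a sketch reusing the same Request import; the numeric range is illustrative):

def start_requests(self):
    # generate list_1_2_3.htm, list_2_3_4.htm, ... on the fly
    for n in range(1, 27):
        url = "http://www.a.com/list_%d_%d_%d.htm" % (n, n + 1, n + 2)
        yield Request(url, callback=self.parse)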
There are two questions here:
1) Yes, you can realize this functionality by using rules, e.g.:
rules = (Rule(SgmlLinkExtractor(allow=('node_\d+\.htm',)), callback='parse_item'),)
(Note that with CrawlSpider the callback should not be named parse, because CrawlSpider uses parse internally.)
suggested reading
2) Yes, you can generate start_urls dynamically; start_urls is a list, e.g.:
>>> start_urls = ['http://www.a.com/%d_%d_%d' % (n, n+1, n+2) for n in range(0, 26)]
>>> start_urls
['http://www.a.com/0_1_2', 'http://www.a.com/1_2_3', 'http://www.a.com/2_3_4', 'http://www.a.com/3_4_5', 'http://www.a.com/4_5_6', 'http://www.a.com/5_6_7', 'http://www.a.com/6_7_8', 'http://www.a.com/7_8_9', 'http://www.a.com/8_9_10','http://www.a.com/9_10_11', 'http://www.a.com/10_11_12', 'http://www.a.com/11_12_13', 'http://www.a.com/12_13_14', 'http://www.a.com/13_14_15', 'http://www.a.com/14_15_16', 'http://www.a.com/15_16_17', 'http://www.a.com/16_17_18', 'http://www.a.com/17_18_19', 'http://www.a.com/18_19_20', 'http://www.a.com/19_20_21', 'http://www.a.com/20_21_22', 'http://www.a.com/21_22_23', 'http://www.a.com/22_23_24', 'http://www.a.com/23_24_25', 'http://www.a.com/24_25_26', 'http://www.a.com/25_26_27']
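Putting both parts together, a CrawlSpider along these lines should work (a sketch using the modern LinkExtractor in place of the deprecated SgmlLinkExtractor; the domain, URL patterns, and parse_item fields are illustrative):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class NodeSpider(CrawlSpider):
    name = "node_spider"
    allowed_domains = ["www.a.com"]
    # start URLs generated dynamically, matching list_\d+_\d+_\d+.htm
    start_urls = ["http://www.a.com/list_%d_%d_%d.htm" % (n, n + 1, n + 2) for n in range(1, 27)]

    # follow node_<id>.htm links found on the list pages and parse them
    rules = (
        Rule(LinkExtractor(allow=(r"node_\d+\.htm",)), callback="parse_item"),
    )

    def parse_item(self, response):
        yield {
            "url": response.url,
            "title": response.xpath("//title/text()").get(),
        }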