I am trying to learn neural networks for visualization and want to use chickens as my example. I figured I could scrape all the pictures of chickens off Google Images, since when I search for images of chickens on Google I get a bunch of results that keep scrolling down. However, after scraping I ended up with only 20 images. I thought the problem was that the pictures might be indexed by pages, but as I said, in my browser there are no pages, only a single page that keeps scrolling down, so I don't know how to scrape the rest of the pictures after the first 20.
from bs4 import BeautifulSoup
import requests
import os
os.mkdir('chickens')
r = requests.get('https://www.google.com/search?q=chickens&client=firefox-b-1-d&sxsrf=AOaemvLwoKYN8RyvBYe-XTRPazSsDAiQuQ:1641698866084&source=lnms&tbm=isch&sa=X&ved=2ahUKEwiLp_bt3KP1AhWHdt8KHZR9C-UQ_AUoAXoECAIQAw&biw=1536&bih=711&dpr=1.25')
soup = BeautifulSoup(r.text, 'html.parser')
images = soup.findAll('img')
images = images[1:]
print(len(images))
Not a perfect solution but I think it will work...
First, Google's server has to recognize you as a mobile client, so that you get a "Next" button at the end of the screen.
Use this link for your search: https://www.google.com/search?ie=ISO-8859-1&hl=en&source=hp&biw=&bih=&q=chickens&iflsig=ALs-wAMAAAAAYdo4U4mFc_xRYkggo_zUXeCf6jUYWUjl&gbv=2&oq=chickens&gs_l=heirloom-hp.3..0i512i433j0i512i433i457j0i402l2j0i512l6.4571.6193.0.6957.8.7.0.1.1.0.134.611.6j1.7.0....0...1ac.1.34.heirloom-hp..0.8.613.OJ31YrPZ-B0
Then, since you have a "Next" button, you can scrape its href.
https://i.stack.imgur.com/nOJCG.png
After you have the href, you can do another requests.get(new_url) and repeat.
To visualize what I'm talking about: the screenshot above shows the "Next" button, and requesting its href gives you the next page of results.
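For concreteness, here is a rough sketch of that loop. It assumes the basic-HTML results page (gbv=2) and that the "Next" pagination link can be found by its text; both are assumptions and may need adjusting to the markup Google actually serves you.

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0"}  # a simple client so the basic-HTML page is served
url = "https://www.google.com/search?q=chickens&tbm=isch&gbv=2"

image_urls = []
for _ in range(5):  # follow up to five result pages
    soup = BeautifulSoup(requests.get(url, headers=headers).text, "html.parser")
    image_urls += [img["src"] for img in soup.find_all("img") if img.get("src")]

    # look for the "Next" pagination link and follow its href
    next_link = soup.find("a", string=lambda text: text and "Next" in text)
    if next_link is None:
        break
    url = "https://www.google.com" + next_link["href"]

print(len(image_urls))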
This looks like a half-automated scraping case, so you may manually scroll the page to the end and then use Python to scrape all the images.
There could be a "Show more" button when scrolling down the page; you can click it and continue. There were 764 images in total in my search, and they can easily be scraped with Python.
Note that findAll('img') will get all images, including non-result ones (icons, logos), so you may want to filter them or try another library for the scraping.
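A minimal sketch of that idea, assuming you scrolled the results to the end and saved the page locally (the filename "chickens.html" is just an example):

from bs4 import BeautifulSoup

# parse the page you saved (Ctrl+S) after manually scrolling to the bottom
with open("chickens.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

# keep only http(s) sources to drop icons and inline data URIs
sources = [img.get("src") or img.get("data-src") for img in soup.find_all("img")]
sources = [src for src in sources if src and src.startswith("http")]
print(len(sources))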
We can scrape Google Images data from the inline JSON, because the data you need is rendered dynamically.
It can be extracted with regular expressions. To find it, search for the first image title in the page source (Ctrl+U); if the matches sit inside <script> elements, it is most likely inline JSON, and we can extract the data from there.
First of all, we use a regular expression to find the part of the code that contains the information we need about the images:
# https://regex101.com/r/48UZhY/4
matched_images_data = "".join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))
In the next step, we decode the returned data and select only the part of the JSON where the images (thumbnails and originals) are located:
matched_images_data_fix = json.dumps(matched_images_data)
matched_images_data_json = json.loads(matched_images_data_fix)
# https://regex101.com/r/VPz7f2/1
matched_google_image_data = re.findall(r'\"b-GRID_STATE0\"(.*)sideChannel:\s?{}}', matched_images_data_json)
Then find thumbnails:
# https://regex101.com/r/Jt5BJW/1
matched_google_images_thumbnails = ", ".join(
    re.findall(r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]',
               str(matched_google_image_data))).split(", ")
thumbnails = [bytes(bytes(thumbnail, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape") for thumbnail in matched_google_images_thumbnails]
And finally find images in original resolution:
# removing previously matched thumbnails for easier full resolution image matches.
removed_matched_google_images_thumbnails = re.sub(
    r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]', "", str(matched_google_image_data))
# https://regex101.com/r/fXjfb1/4
# https://stackoverflow.com/a/19821774/15164646
matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]", removed_matched_google_images_thumbnails)
full_res_images = [
    bytes(bytes(img, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape") for img in matched_google_full_resolution_images
]
To get absolutely all images, you must use browser automation, such as selenium or playwright. Also, you can use the "ijn" URL parameter, which defines the page number to get (greater than or equal to 0); see the pagination sketch after the example output below.
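For the browser-automation route, a rough Selenium sketch of the scroll-until-done idea might look like this. Treat it only as a sketch: the wait time and the way thumbnails are collected are assumptions, and Google's markup and lazy-loading behavior change often.

import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.google.com/search?q=chickens&tbm=isch")

# keep scrolling until the page height stops growing
last_height = 0
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the lazy loader time to fetch more results
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# collect whatever thumbnails are present after scrolling
thumbnails = [img.get_attribute("src")
              for img in driver.find_elements(By.TAG_NAME, "img")
              if img.get_attribute("src")]
print(len(thumbnails))
driver.quit()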
The full code for the requests-based approach described above:
import requests, re, json, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36",
}

params = {
    "q": "chickens",  # search query
    "tbm": "isch",    # image results
    "hl": "en",       # language of the search
    "gl": "us",       # country where search comes from
}

html = requests.get("https://google.com/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

google_images = []

all_script_tags = soup.select("script")

# https://regex101.com/r/eteSIT/1
matched_images_data = "".join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))

matched_images_data_fix = json.dumps(matched_images_data)
matched_images_data_json = json.loads(matched_images_data_fix)

# https://regex101.com/r/VPz7f2/1
matched_google_image_data = re.findall(r'\"b-GRID_STATE0\"(.*)sideChannel:\s?{}}', matched_images_data_json)

# https://regex101.com/r/Jt5BJW/1
matched_google_images_thumbnails = ", ".join(
    re.findall(r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]',
               str(matched_google_image_data))).split(", ")

thumbnails = [bytes(bytes(thumbnail, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape") for thumbnail in matched_google_images_thumbnails]

# removing previously matched thumbnails for easier full resolution image matches
removed_matched_google_images_thumbnails = re.sub(
    r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]', "", str(matched_google_image_data))

# https://regex101.com/r/fXjfb1/4
# https://stackoverflow.com/a/19821774/15164646
matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]", removed_matched_google_images_thumbnails)

full_res_images = [
    bytes(bytes(img, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape") for img in matched_google_full_resolution_images
]

for index, (metadata, thumbnail, original) in enumerate(zip(soup.select('.isv-r.PNCib.MSM1fd.BUooTd'), thumbnails, full_res_images), start=1):
    google_images.append({
        "title": metadata.select_one(".VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb")["title"],
        "link": metadata.select_one(".VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb")["href"],
        "source": metadata.select_one(".fxgdke").text,
        "thumbnail": thumbnail,
        "original": original
    })

print(json.dumps(google_images, indent=2, ensure_ascii=False))
Example output
[
  {
    "title": "Chicken - Wikipedia",
    "link": "https://en.wikipedia.org/wiki/Chicken",
    "source": "en.wikipedia.org",
    "thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTM_XkDqM-gjEHUeniZF4HYdjmA4G_lKckEylFzHxxa_SiN0LV4-6M_QPuCVMleDm52doI&usqp=CAU",
    "original": "https://upload.wikimedia.org/wikipedia/commons/thumb/8/84/Male_and_female_chicken_sitting_together.jpg/640px-Male_and_female_chicken_sitting_together.jpg"
  },
  {
    "title": "Chickens | The Humane Society of the United States",
    "link": "https://www.humanesociety.org/animals/chickens",
    "source": "humanesociety.org",
    "thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSYa5_tlXtxNpxDQAU02DWkwK2hVlB3lkY_ljILmh9ReKoVK_pT9TS2PV0-RUuOY5Kkkzs&usqp=CAU",
    "original": "https://www.humanesociety.org/sites/default/files/styles/1240x698/public/2018/06/chickens-in-grass_0.jpg?h=56ab1ba7&itok=uou5W86U"
  },
  {
    "title": "chicken | bird | Britannica",
    "link": "https://www.britannica.com/animal/chicken",
    "source": "britannica.com",
    "thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQCl4LDGrSpsA6eFOY3M1ITTH7KlIIkvctOHuB_CbztbDRsdE4KKJNwArQJVJ7WvwCVr14&usqp=CAU",
    "original": "https://cdn.britannica.com/07/183407-050-C35648B5/Chicken.jpg"
  },
  # ...
]
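To fetch more than the first page with this requests-based approach, the "ijn" parameter mentioned earlier can be added to the params dict. A hedged sketch of the idea, reusing the params and headers from the code above (Google may change or stop honoring this parameter):

# re-request the results page with an increasing "ijn" value and re-run the
# same AF_initDataCallback extraction steps on each response
for page_number in range(3):  # first three pages as an example
    params["ijn"] = page_number  # 0 -> first page, 1 -> second page, ...
    html = requests.get("https://google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")
    # ... repeat the parsing steps above for this page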
Or you can use the Google Images API from SerpApi. It's a paid API with a free plan.
The difference is that it bypasses blocks (including CAPTCHA) from Google, so there is no need to create and maintain the parser yourself.
Simple code example:
from serpapi import GoogleSearch
import os, json

image_results = []

# search query parameters
params = {
    "engine": "google",              # search engine: Google, Bing, Yahoo, Naver, Baidu...
    "q": "chicken",                  # search query
    "tbm": "isch",                   # image results
    "num": "100",                    # number of images per page
    "ijn": 0,                        # page number: 0 -> first page, 1 -> second...
    "api_key": os.getenv("API_KEY")  # your serpapi api key
    # other query parameters: hl (lang), gl (country), etc.
}

search = GoogleSearch(params)  # where data extraction happens

images_is_present = True
while images_is_present:
    results = search.get_dict()  # JSON -> Python dictionary

    # checks for "Google hasn't returned any results for this query."
    if "error" not in results:
        for image in results["images_results"]:
            if image["original"] not in image_results:
                image_results.append(image["original"])

        # update to the next page
        params["ijn"] += 1
    else:
        images_is_present = False
        print(results["error"])

print(json.dumps(image_results, indent=2))
Output:
[
"https://www.spendwithpennies.com/wp-content/uploads/2020/07/1200-Grilled-Chicken-Breast-22.jpeg",
"https://assets.bonappetit.com/photos/6282c9277e593c16bfea9c61/2:3/w_2430,h_3645,c_limit/0622-Sweet-and-Sticky-Grilled-Chicken.jpg",
"https://kristineskitchenblog.com/wp-content/uploads/2021/04/grilled-chicken-1200-square-0400-2.jpg",
"https://thecozycook.com/wp-content/uploads/2021/09/Creamy-Garlic-Chicken-f.jpg",
"https://www.jocooks.com/wp-content/uploads/2020/01/instant-pot-chicken-breasts-1-10.jpg",
"https://www.healthbenefitstimes.com/9/uploads/2018/04/Know-about-Chicken-and-health-benefits-702x459.png",
"https://www.tasteofhome.com/wp-content/uploads/2022/03/Air-Fryer-Rotisserie-Chicken_EXPS_FT22_237368_F_0128_1.jpg?fit=700,1024",
"https://www.militarytimes.com/resizer/-1j4zK-eaI1KPote1gyV1fw9XVg=/1024x0/filters:format(png):quality(70)/cloudfront-us-east-1.images.arcpublishing.com/archetype/BFPDC4MPLVGONPK2D5XXN7QOXI.png",
# ...
]
There's a Scrape and download Google Images with Python blog post if you need a little bit more code explanation.
Disclaimer: I work for SerpApi.
Here is my code. I am using the Scrapy basic spider template and I am getting a DNS lookup failed error. Where is my mistake?
class TopmoviesSpider(scrapy.Spider):
    name = 'topmovies'
    allowed_domains = ['www.imdb.com']
    start_urls = ['https://https://www.imdb.com/chart/top/']

    def parse(self, response):
        movies = response.xpath("//td[@class='titleColumn']/a")
        for movie in movies:
            link = movie.xpath(".//@href").get()
            yield response.follow(url=link, callback=self.scrape_movie)

    def scrape_movie(self, response):
        rating = response.xpath("//span[@itemprop='ratingValue']/text()").get()
        for mov in response.xpath("//div[@class='title_wrapper']"):
            yield {
                'title': mov.xpath(".//h1/text()").get(),
                'year_of_release': mov.xpath(".//span/a/text()").get(),
                'duration': mov.xpath(".//div[@class='subtext']/time/text()").get(),
                'genre': mov.xpath(".//div[@class='subtext']/a/text()").get(),
                'date_of_release': mov.xpath("//div[@class='subtext']/a[2]/text()"),
                'rating': rating
            }
Check your start_urls: you have given an invalid URL (the https:// scheme appears twice). If you are trying to crawl IMDb, check this post.
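A minimal sketch of the fix; everything else in the spider can stay as it is:

import scrapy


class TopmoviesSpider(scrapy.Spider):
    name = 'topmovies'
    allowed_domains = ['www.imdb.com']
    # the original start_urls repeated the scheme ("https://https://..."), which
    # is why the DNS lookup fails; only one "https://" belongs here:
    start_urls = ['https://www.imdb.com/chart/top/']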
I am trying to extract quotes from https://www.goodreads.com/quotes. It seems that I am only getting the first page and the next page part is not working.
Here is my code:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://www.goodreads.com/quotes'
    ]

    def parse(self, response):
        for quote in response.xpath("//div[@class='quote']"):
            yield {
                'quoteText': quote.xpath(".//div[@class ='quoteText']").extract_first()
            }

        next_page = response.css("a").xpath("@href").extract()
        if next_page is not None:
            next_page_link = response.urljoin(next_page)
            yield scrapy.Request(url=next_page_link, callback=self.parse)
You have to get the href of the next page's link.
Use this for getting next page URL:
next_page=response.css("a.next_page::attr(href)").get()
You can read more about selectors here:
https://docs.scrapy.org/en/latest/topics/selectors.html
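Putting it together, the spider could look roughly like this; it is a sketch that assumes the pagination anchor carries the next_page class used in the selector above:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://www.goodreads.com/quotes']

    def parse(self, response):
        for quote in response.xpath("//div[@class='quote']"):
            yield {
                'quoteText': quote.xpath(".//div[@class='quoteText']").extract_first()
            }

        # follow the single "next page" link instead of collecting every href
        next_page = response.css("a.next_page::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)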
I am trying to get the titles of Booking.com comments from this website:
https://www.booking.com/reviews/co/hotel/ibis-bogota-museo.es.html?page=1;r_lang=all;rows=75,
where r_lang=all basically says that the website should show comments in every language.
In order to obtain the titles from this page I do this:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.booking.com/reviews/co/hotel/ibis-bogota-museo.es.html?page=1;r_lang=all;rows=75"
page = urlopen(url)
soup = BeautifulSoup(page)
reviews = soup.findAll("li", {"class": "review_item clearfix "})
for review in reviews:
    print(review.find("div", {"class": "review_item_header_content"}).text)
From the website (see screenshot), the first two titles should be "Sencillamente placentera" and "It could have been great.". However, the URL somehow only loads comments in Spanish:
“Sencillamente placentera”
“La atención de la chica del restaurante”
“El desayuno estilo buffet, completo ”
“Me gusto la ubicación, y la vista.”
“Su ubicación es muy buena.”
I noticed that if in the URL I change 'museo.es.' to 'museo.en.', I get the headers of the English comments. But this is inconsistent, because when I load the original URL in my browser I get comments in English, French, Spanish, etc. How can I fix this? Thanks
Servers can be configured to send different responses based on the browser making the request. Adding a User-Agent seems to fix the problem.
import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.booking.com/reviews/co/hotel/ibis-bogota-museo.es.html?page=1;r_lang=all;rows=75'
req = urllib.request.Request(
    url,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36',
    }
)
f = urllib.request.urlopen(req)
soup = BeautifulSoup(f.read().decode('utf-8'), 'html.parser')
reviews = soup.findAll("li", {"class": "review_item clearfix "})
for review in reviews:
    print(review.find("div", {"class": "review_item_header_content"}).text)
Output:
“Sencillamente placentera”
“It could had been great.”
“will never stay their in the future.”
“Hôtel bien situé.”
...
You could always use a browser as a plan B; Selenium doesn't have this problem.
from selenium import webdriver
d = webdriver.Chrome()
d.get('https://www.booking.com/reviews/co/hotel/ibis-bogota-museo.es.html?page=1;r_lang=all;rows=75')
titles = [item.text for item in d.find_elements_by_css_selector('.review_item_review_header [itemprop=name]')]
print(titles)
A newer way to access Booking.com reviews is to use the reviewlist.html endpoint. For example, for the hotel in the original question, the reviews are located at:
https://www.booking.com/reviewlist.html?pagename=ibis-bogota-museo&type=total&lang=en-us&sort=f_recent_desc&cc1=co&dist=1&rows=25&offset=0
This endpoint is particularly great because it supports many filters and offers up to 25 reviews per page.
Here's a snippet in Python with parsel and httpx:
import asyncio
from typing import List
from urllib.parse import urlencode

from parsel import Selector


def parse_reviews(html: str) -> List[dict]:
    """parse review page for review data"""
    sel = Selector(text=html)
    parsed = []
    for review_box in sel.css('.review_list_new_item_block'):
        get_css = lambda css: review_box.css(css).get("").strip()
        parsed.append({
            "id": review_box.xpath('@data-review-url').get(),
            "score": get_css('.bui-review-score__badge::text'),
            "title": get_css('.c-review-block__title::text'),
            "date": get_css('.c-review-block__date::text'),
            "user_name": get_css('.bui-avatar-block__title::text'),
            "user_country": get_css('.bui-avatar-block__subtitle::text'),
            "text": ''.join(review_box.css('.c-review__body ::text').getall()),
            "lang": review_box.css('.c-review__body::attr(lang)').get(),
        })
    return parsed


async def scrape_reviews(hotel_id: str, session) -> List[dict]:
    """scrape all reviews of a hotel"""

    async def scrape_page(page, page_size=25):  # 25 is the largest page size for this endpoint
        url = "https://www.booking.com/reviewlist.html?" + urlencode(
            {
                "type": "total",
                # we can configure language preference
                "lang": "en-us",
                # we can configure sorting order here, in this case recent reviews are first
                "sort": "f_recent_desc",
                "cc1": "gb",  # this varies by hotel country, e.g. in OP's case it would be "co" for Colombia
                "dist": 1,
                "pagename": hotel_id,
                "rows": page_size,
                "offset": page * page_size,
            }
        )
        return await session.get(url)

    first_page = await scrape_page(1)
    total_pages = Selector(text=first_page.text).css(".bui-pagination__link::attr(data-page-number)").getall()
    total_pages = max(int(page) for page in total_pages)
    other_pages = await asyncio.gather(*[scrape_page(i) for i in range(2, total_pages + 1)])

    results = []
    for response in [first_page, *other_pages]:
        results.extend(parse_reviews(response.text))
    return results
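A hedged usage sketch, assuming an httpx.AsyncClient is passed in as the session (the header value is illustrative, and cc1 inside scrape_reviews should be set to the hotel's country, e.g. "co" for the hotel in the question):

import asyncio
import httpx


async def main():
    # a plain User-Agent makes the request look like a regular browser
    async with httpx.AsyncClient(
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=30,
    ) as session:
        reviews = await scrape_reviews("ibis-bogota-museo", session)
        print(len(reviews))

asyncio.run(main())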
I write more about scraping this endpoint in my blog post How to Scrape Booking.com, which has more illustrations and videos if you need more information.
I am developing a scraper for internal use and evaluation of my company's partner website onestop.jdsu.com. The website is actually an ASPX site.
I can't get scrapy to login to the page: https://onestop.jdsu.com/_layouts/JDSU.OneStop/Login.aspx?ReturnUrl=%2f_layouts%2fAuthenticate.aspx%3fSource%3d%252F&Source=%2F
There are actually two methods of login on the page and I think I'm having a problem with distinguishing them in the scrapy spider. The one I'm most interested in is the "partner login" although login using the employee login, which is actually a script that displays a drop-down login window, would be fine.
I've used "loginform" to extract the relevant fields from both forms. Unfortunately no combination of relevant POST data seems to make a difference. Perhaps I'm not clicking the button on the partner form ("ctl00$PlaceHolderMain$loginControl$login","")?
Also the "Login failed" message does not come through even when I know the login had to have failed.
The spider below ignores "__VIEWSTATE" and "__EVENTVALIDATION" because they don't make a difference if included and they don't seem to have anything to do with the partner login in the HTML of the page.
Any help would be very much appreciated!
LOGINFORM TEST OUTPUT
python ./test.py https://onestop.jdsu.com/_layouts/JDSU.OneStop/Login.aspx?ReturnUrl=%2f_layouts%2fAuthenticate.aspx%3fSource%3d%252F&Source=%2F
[1] 1273
peter-macbook:_loginform-master peter$
[
  "https://onestop.jdsu.com/_layouts/JDSU.OneStop/Login.aspx?ReturnUrl=%2f_layouts%2fAuthenticate.aspx%3fSource%3d%252F",
  [
    [
      [
        "__VIEWSTATE",
        "/wEPDwUKMTEzNDkwMDAxNw9kFgJmD2QWAgIBD2QWAgIDD2QWCAIDDxYCHgdWaXNpYmxlaGQCBQ8WAh8AaGQCCw9kFgYCAQ8WAh4EaHJlZgUhL193aW5kb3dzL2RlZmF1bHQuYXNweD9SZXR1cm5Vcmw9ZAIDD2QWAgIDDw8WAh8AaGRkAgUPFgIfAGhkAg0PFgIfAGgWAgIBDw8WAh4ISW1hZ2VVcmwFIS9fbGF5b3V0cy8xMDMzL2ltYWdlcy9jYWxwcmV2LnBuZ2RkZP7gVj0vs2N5c/DzKfAu4DwrFihP"
      ],
      [
        "__EVENTVALIDATION",
        "/wEWBALlpOFKAoyn3a4JAuj7pusEAsXI9Y8HY+WYdEUkWKmn7tesA+BODBefeYE="
      ],
      [
        "ctl00$PlaceHolderMain$loginControl$UserName",
        "USER"
      ],
      [
        "ctl00$PlaceHolderMain$loginControl$password",
        "PASS"
      ],
      [
        "ctl00$PlaceHolderMain$loginControl$login",
        ""
      ]
    ],
    "https://onestop.jdsu.com/_layouts/JDSU.OneStop/Login.aspx?ReturnUrl=%2f_layouts%2fAuthenticate.aspx%3fSource%3d%252F",
    "POST"
  ]
]
SCRAPY SPIDER FOR PARTNER LOGIN
import scrapy
from tutorial.items import WaveReadyItem
#from scrapy import log
#from scrapy.shell import inspect_response


class WaveReadySpider(scrapy.Spider):
    name = "onestop_home-page-3"
    allowed_domains = ["https://onestop.jdsu.com"]
    start_urls = [
        "https://onestop.jdsu.com/_layouts/JDSU.OneStop/Login.aspx?ReturnUrl=%2f_layouts%2fAuthenticate.aspx%3fSource%3d%252F&Source=%2F",
        "https://onestop.jdsu.com/Products/network-systems/Pages/default.aspx"
    ]

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'ctl00$PlaceHolderMain$loginControl$UserName': 'MY-USERID', 'ctl00$PlaceHolderMain$loginControl$password': 'MY-PASSWD', 'ctl00$PlaceHolderMain$loginControl$login': ''},
            callback=self.after_login
        )

    def after_login(self, response):
        # check login succeed before going on
        if "Invalid ID or Password" in response.body:
            self.log("Login failed", level=log.ERROR)
            return

    def parse(self, response):
        #=============================================================================
        #HOME PAGE: PICK UP OTHER LANDING PAGES IN CENTER COLUMN
        #=============================================================================
        etc.
I don't know why your login fails, but here is how I use "loginform":
def parse(self, response):
    args, url, method = fill_login_form(response.url, response.body, self.username, self.password)
    return FormRequest(url, method=method, formdata=args, callback=self.after_login)
The fill_login_form method will try its best to locate the correct login form and then return everything needed to perform the login. If you fill in the form manually, something may be missed.
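For context, here is a sketch of how that parse method could sit inside a full spider. The credential handling, spider name, and the "Invalid ID or Password" check are assumptions borrowed from the question, not your actual setup:

import scrapy
from scrapy.http import FormRequest
from loginform import fill_login_form


class OneStopLoginSpider(scrapy.Spider):
    name = "onestop_login"
    start_urls = [
        "https://onestop.jdsu.com/_layouts/JDSU.OneStop/Login.aspx?ReturnUrl=%2f_layouts%2fAuthenticate.aspx%3fSource%3d%252F&Source=%2F"
    ]

    def __init__(self, username="MY-USERID", password="MY-PASSWD", *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.username = username
        self.password = password

    def parse(self, response):
        # let loginform pick the right form and fill in the credential fields
        args, url, method = fill_login_form(response.url, response.body,
                                            self.username, self.password)
        return FormRequest(url, method=method, formdata=args,
                           callback=self.after_login)

    def after_login(self, response):
        if b"Invalid ID or Password" in response.body:
            self.logger.error("Login failed")
            return
        # continue with the authenticated crawl here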