I am trying to learn about neural networks for visualization and want to use chickens as my example. I figured I could scrape all the pictures of chickens from Google Images, since a search there returns results that keep loading as I scroll down. However, after scraping, my list contains only 20 images. I thought the results might be split into pages, but in my browser there are no pages, just a single page that keeps scrolling, so I don't know how to scrape the rest of the pictures after the first 20.
from bs4 import BeautifulSoup
import requests
import os
r = requests.get('https://www.google.com/search?q=chickens&client=firefox-b-1-d&sxsrf=AOaemvLwoKYN8RyvBYe-XTRPazSsDAiQuQ:1641698866084&source=lnms&tbm=isch&sa=X&ved=2ahUKEwiLp_bt3KP1AhWHdt8KHZR9C-UQ_AUoAXoECAIQAw&biw=1536&bih=711&dpr=1.25')
soup = BeautifulSoup(r.text, 'html.parser')
images = soup.findAll('img')
images = images[1:]
Not a perfect solution but I think it will work...
First, Google's server has to recognize you as a mobile client, so that you get a "Next" button at the end of the page.
Use this link for your search: https://www.google.com/search?ie=ISO-8859-1&hl=en&source=hp&biw=&bih=&q=chickens&iflsig=ALs-wAMAAAAAYdo4U4mFc_xRYkggo_zUXeCf6jUYWUjl&gbv=2&oq=chickens&gs_l=heirloom-hp.3..0i512i433j0i512i433i457j0i402l2j0i512l6.4571.6193.0.6957.
Since you now have a "Next" button, you can scrape its href.
Once you have the href, you can do another requests.get(new_url)
and repeat
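The loop described above can be sketched as follows. This is only an illustration under assumptions: the link is taken to be an <a> element whose text starts with "Next", the page fetcher is passed in as a function (in practice something like `lambda u: requests.get(u).text`), and the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkParser(HTMLParser):
    """Collects <img src> values and the href of a link whose text starts with 'Next'."""

    def __init__(self):
        super().__init__()
        self.images = []
        self.next_href = None
        self._current_href = None  # href of the <a> we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])
        elif tag == "a":
            self._current_href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "a":
            self._current_href = None

    def handle_data(self, data):
        # Assumption: the pagination link is labelled "Next".
        if self._current_href and data.strip().lower().startswith("next"):
            self.next_href = self._current_href


def crawl(start_url, fetch, max_pages=5):
    """Follow 'Next' links, collecting image URLs from each page."""
    url, images = start_url, []
    for _ in range(max_pages):
        parser = NextLinkParser()
        parser.feed(fetch(url))
        images.extend(parser.images)
        if not parser.next_href:
            break
        url = urljoin(url, parser.next_href)
    return images
```

Capping the loop with `max_pages` keeps the sketch from requesting pages indefinitely if the link never disappears.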
To visualize what I'm talking about
The next page you would get if you were to request the next button href
This looks like a case for semi-automated scraping: manually scroll the page to the end, then use Python to scrape all the images.
There may be a "Show more" button while scrolling down the page; click it and continue. My search found 764 images in total, and they can be easily scraped with Python.
findAll('img') returns every image on the page, including ones that are not search results. You may want to try other libraries for the scraping.
We can scrape Google Images data from inline JSON, because the data you need is rendered dynamically.
It can be extracted with regular expressions. To find it, search the page source (Ctrl+U) for the title of the first image; if there are matches inside <script> elements, the data is most likely inline JSON, and we can extract it from there.
First of all, we use a regular expression to find the part of the code that contains the information we need about the images:
# https://regex101.com/r/48UZhY/4
matched_images_data = "".join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))
In the next step, we take the returned data and select only the part of the JSON where the images are located (thumbnails and originals):
matched_images_data_fix = json.dumps(matched_images_data)
matched_images_data_json = json.loads(matched_images_data_fix)
# https://regex101.com/r/VPz7f2/1
matched_google_image_data = re.findall(r'\"b-GRID_STATE0\"(.*)sideChannel:\s?{}}', matched_images_data_json)
Then find thumbnails:
# https://regex101.com/r/Jt5BJW/1
matched_google_images_thumbnails = ", ".join(
    re.findall(r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]',
               str(matched_google_image_data))).split(", ")
thumbnails = [bytes(bytes(thumbnail, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape") for thumbnail in matched_google_images_thumbnails]
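A small sketch of what the double decode does, on a made-up sample. The sample URL is hypothetical, and the reading that the extra escaping layer comes from the earlier json.dumps step is an assumption, not something the answer states.

```python
def decode_twice(escaped: str) -> str:
    """Apply the unicode-escape codec twice, as the list comprehension above does."""
    once = bytes(escaped, "ascii").decode("unicode-escape")
    return bytes(once, "ascii").decode("unicode-escape")


# Hypothetical sample: an "=" that has been escaped twice (two backslashes + u003d).
sample = r"https://example.test/images?q\\u003dtbn:abc"
print(decode_twice(sample))  # https://example.test/images?q=tbn:abc
```

The first pass turns the doubled backslash into a single one; the second pass turns the resulting \u003d sequence into "=".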
And finally find images in original resolution:
# removing previously matched thumbnails for easier full resolution image matches.
removed_matched_google_images_thumbnails = re.sub(
r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]', "", str(matched_google_image_data))
# https://regex101.com/r/fXjfb1/4
# https://stackoverflow.com/a/19821774/15164646
matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]", removed_matched_google_images_thumbnails)
full_res_images = [
    bytes(bytes(img, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape") for img in matched_google_full_resolution_images
]
To get absolutely all images, you need browser automation such as Selenium or Playwright. Alternatively, you can use the "ijn" URL parameter, which defines the page number to fetch (greater than or equal to 0).
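As a rough illustration of the "ijn" approach, the sketch below only builds the URLs for successive result pages; fetching and parsing each one would work as in the code in this answer. The function name is hypothetical, and how many pages actually return results is not something this sketch can know.

```python
from urllib.parse import urlencode


def image_page_urls(query, pages):
    """Yield Google Images search URLs for ijn = 0 .. pages - 1."""
    base = "https://www.google.com/search"
    for ijn in range(pages):
        params = {"q": query, "tbm": "isch", "ijn": ijn}
        yield f"{base}?{urlencode(params)}"


for url in image_page_urls("chickens", 3):
    print(url)
```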
Check code in online IDE.
import requests, re, json, lxml
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36",
}
params = {
    "q": "chickens", # search query
    "tbm": "isch", # image results
    "hl": "en", # language of the search
    "gl": "us", # country where search comes from
}
html = requests.get("https://google.com/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")
google_images = []
all_script_tags = soup.select("script")
# https://regex101.com/r/eteSIT/1
matched_images_data = "".join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))
matched_images_data_fix = json.dumps(matched_images_data)
matched_images_data_json = json.loads(matched_images_data_fix)
# https://regex101.com/r/VPz7f2/1
matched_google_image_data = re.findall(r'\"b-GRID_STATE0\"(.*)sideChannel:\s?{}}', matched_images_data_json)
# https://regex101.com/r/Jt5BJW/1
matched_google_images_thumbnails = ", ".join(
    re.findall(r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]',
               str(matched_google_image_data))).split(", ")
thumbnails = [bytes(bytes(thumbnail, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape") for thumbnail in matched_google_images_thumbnails]
# removing previously matched thumbnails for easier full resolution image matches.
removed_matched_google_images_thumbnails = re.sub(
r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]', "", str(matched_google_image_data))
# https://regex101.com/r/fXjfb1/4
# https://stackoverflow.com/a/19821774/15164646
matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]", removed_matched_google_images_thumbnails)
full_res_images = [
    bytes(bytes(img, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape") for img in matched_google_full_resolution_images
]
for index, (metadata, thumbnail, original) in enumerate(zip(soup.select('.isv-r.PNCib.MSM1fd.BUooTd'), thumbnails, full_res_images), start=1):
    # collect one dictionary per image result
    google_images.append({
        "title": metadata.select_one(".VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb")["title"],
        "link": metadata.select_one(".VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb")["href"],
        "source": metadata.select_one(".fxgdke").text,
        "thumbnail": thumbnail,
        "original": original
    })
print(json.dumps(google_images, indent=2, ensure_ascii=False))
Example output
[
  {
    "title": "Chicken - Wikipedia",
    "link": "https://en.wikipedia.org/wiki/Chicken",
    "source": "en.wikipedia.org",
    "thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTM_XkDqM-gjEHUeniZF4HYdjmA4G_lKckEylFzHxxa_SiN0LV4-6M_QPuCVMleDm52doI&usqp=CAU",
    "original": "https://upload.wikimedia.org/wikipedia/commons/thumb/8/84/Male_and_female_chicken_sitting_together.jpg/640px-Male_and_female_chicken_sitting_together.jpg"
  },
  {
    "title": "Chickens | The Humane Society of the United States",
    "link": "https://www.humanesociety.org/animals/chickens",
    "source": "humanesociety.org",
    "thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSYa5_tlXtxNpxDQAU02DWkwK2hVlB3lkY_ljILmh9ReKoVK_pT9TS2PV0-RUuOY5Kkkzs&usqp=CAU",
    "original": "https://www.humanesociety.org/sites/default/files/styles/1240x698/public/2018/06/chickens-in-grass_0.jpg?h=56ab1ba7&itok=uou5W86U"
  },
  {
    "title": "chicken | bird | Britannica",
    "link": "https://www.britannica.com/animal/chicken",
    "source": "britannica.com",
    "thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQCl4LDGrSpsA6eFOY3M1ITTH7KlIIkvctOHuB_CbztbDRsdE4KKJNwArQJVJ7WvwCVr14&usqp=CAU",
    "original": "https://cdn.britannica.com/07/183407-050-C35648B5/Chicken.jpg"
  },
  # ...
]
Alternatively, you can use the Google Images API from SerpApi. It's a paid API with a free plan.
The difference is that it bypasses blocks from Google (including CAPTCHA), so there is no need to create and maintain a parser.
Simple code example:
from serpapi import GoogleSearch
import os, json
image_results = []
# search query parameters
params = {
    "engine": "google", # search engine. Google, Bing, Yahoo, Naver, Baidu...
    "q": "chicken", # search query
    "tbm": "isch", # image results
    "num": "100", # number of images per page
    "ijn": 0, # page number: 0 -> first page, 1 -> second...
    "api_key": os.getenv("API_KEY") # your serpapi api key
    # other query parameters: hl (lang), gl (country), etc
}
search = GoogleSearch(params) # where data extraction happens
images_is_present = True
while images_is_present:
    results = search.get_dict() # JSON -> Python dictionary
    # checks for "Google hasn't returned any results for this query."
    if "error" not in results:
        for image in results["images_results"]:
            if image["original"] not in image_results:
                image_results.append(image["original"])
        # update to the next page
        params["ijn"] += 1
    else:
        images_is_present = False
print(json.dumps(image_results, indent=2))
# ...
I am trying to scrape AirBNB by plain HTTP requests and noticed something.
Let's say we use this search string: "New York, New York, United States".
The simplest working request (stripped of unnecessary headers and fields) I can use to get the desired results is this:
GET /api/v3/ExploreSections?operationName=ExploreSections&locale=en&currency=USD&variables=%7B%22isInitialLoad%22%3Atrue%2C%22hasLoggedIn%22%3Afalse%2C%22cdnCacheSafe%22%3Afalse%2C%22source%22%3A%22EXPLORE%22%2C%22exploreRequest%22%3A%7B%22metadataOnly%22%3Afalse%2C%22version%22%3A%221.8.3%22%2C%22itemsPerGrid%22%3A20%2C%22placeId%22%3A%22ChIJOwg_06VPwokRYv534QaPC8g%22%2C%22query%22%3A%22New%20York%2C%20New%20York%2C%20United%20States%22%2C%22cdnCacheSafe%22%3Afalse%2C%22screenSize%22%3A%22large%22%2C%22isInitialLoad%22%3Atrue%2C%22hasLoggedIn%22%3Afalse%7D%2C%22removeDuplicatedParams%22%3Atrue%7D&extensions=%7B%22persistedQuery%22%3A%7B%22version%22%3A1%2C%22sha256Hash%22%3A%2282cc0732fe2a6993a26859942d1342b6e42830704b1005aeb2d25f78732275e7%22%7D%7D HTTP/2
Host: www.airbnb.com
X-Airbnb-Api-Key: d306zoyjsyarp7ifhu67rjxn52tv0t20
Accept-Encoding: gzip, deflate
At this point, that API key is pretty much public, so not a concern.
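For reference, the percent-encoded JSON parameters in the request line can be made readable with the standard library. The sketch below decodes only the "extensions" parameter (the "variables" value is left out for brevity); the helper name is hypothetical.

```python
import json
from urllib.parse import parse_qs

# Query string from the request above, with the long "variables" value omitted.
query = (
    "operationName=ExploreSections&locale=en&currency=USD"
    "&extensions=%7B%22persistedQuery%22%3A%7B%22version%22%3A1%2C%22sha256Hash%22%3A%22"
    "82cc0732fe2a6993a26859942d1342b6e42830704b1005aeb2d25f78732275e7%22%7D%7D"
)


def decode_json_param(query_string, name):
    """Percent-decode one query parameter and parse its value as JSON."""
    raw = parse_qs(query_string)[name][0]  # parse_qs already undoes the %XX escapes
    return json.loads(raw)


extensions = decode_json_param(query, "extensions")
print(json.dumps(extensions, indent=2))
```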
The readable content of the "variables" parameter is this:
{
  "isInitialLoad": true,
  "hasLoggedIn": false,
  "cdnCacheSafe": false,
  "source": "EXPLORE",
  "exploreRequest": {
    "metadataOnly": false,
    "version": "1.8.3",
    "itemsPerGrid": 20,
    "placeId": "ChIJOwg_06VPwokRYv534QaPC8g",
    "query": "New York, New York, United States",
    "cdnCacheSafe": false,
    "screenSize": "large",
    "isInitialLoad": true,
    "hasLoggedIn": false
  },
  "removeDuplicatedParams": true
}
The readable content of the "extensions" parameter is this:
{
  "persistedQuery": {
    "version": 1,
    "sha256Hash": "82cc0732fe2a6993a26859942d1342b6e42830704b1005aeb2d25f78732275e7"
  }
}
I am trying to figure out where that hash comes from.
It seems it's calculated from a GraphQL query but I don't know anything else and there is no documentation about it.
Any help?
I had the same issue (I wanted to get the prices). After inspecting the HAR files that you can export from Chrome, I found that this value comes from a JavaScript file called PdpPlatformRoute.xxx.js
To get the hash, load the file PdpPlatformRoute.xxx.js, then parse it to find an "operationId".
If it helps, this is how I did it.
// contentPage is the HTML content of the listing page (e.g. https://www.airbnb.com/rooms/1234567)
function getPdpPlatformRouteUrl(contentPage) {
  return 'https://a0.muscache.com/airbnb/static/packages/web/en/frontend/gp-stays-pdp-route/routes/' + `${contentPage}`.match(/(PdpPlatformRoute\.\w+\.js)/)?.[1];
}
// textContent is the JS content that you get when you fetch the previously found URL
function getSha256(textContent) {
  // [^']+ stops at the first closing quote, so the match cannot run on to a later quote on the same line
  return `${textContent}`.match(/name:'StaysPdpSections',type:'query',operationId:'([^']+)'/)?.[1];
}
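The same extraction could be sketched in Python for anyone scraping with plain HTTP requests. This assumes the bundle still contains the `name:'...',type:'query',operationId:'...'` pattern from the snippet above and that the id is lowercase hex; the function name and the sample text are hypothetical.

```python
import re


def extract_operation_id(js_text, operation_name="StaysPdpSections"):
    """Return the operationId that follows the given operation name, or None."""
    pattern = (
        rf"name:'{re.escape(operation_name)}',type:'query',"
        r"operationId:'([0-9a-f]+)'"
    )
    match = re.search(pattern, js_text)
    return match.group(1) if match else None


# Made-up fragment in the shape the answer describes.
sample_js = "e.exports={name:'StaysPdpSections',type:'query',operationId:'0123abcd'}"
print(extract_operation_id(sample_js))
```

Returning None instead of raising makes it easy to detect when the bundle layout has changed.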
Right now it looks like a mystery. Please help me in solving it.
I use iTunes public API to fetch an album: "Metallica" by Metallica (see it in browser: US region, MV region). I construct the following URLs to fetch it via API:
US region https://itunes.apple.com/lookup?id=579372950&country=US&entity=album - works
MV region https://itunes.apple.com/lookup?id=579372950&country=MV&entity=album - doesn't work
Here's the actual behaviour I observe:
If I query GET https://itunes.apple.com/lookup?id=579372950&country=MV&entity=album in a Spring app (using RestTemplate + Jackson HttpMessageConverter), I get an empty response:
{
  "results": []
}
If I navigate to https://itunes.apple.com/lookup?id=579372950&country=MV&entity=album in a browser, I get a file download prompt. The file contains an empty response:
{
  "results": []
}
If I query the API using HTTPie (http get https://itunes.apple.com/lookup?id=579372950&country=MV&entity=album), I get a non-empty response:
{
  "resultCount": 1,
  "results": [
    {
      "amgArtistId": 4906,
      "artistId": 3996865,
      "artistName": "Metallica",
      "artistViewUrl": "https://music.apple.com/us/artist/metallica/3996865?uo=4",
      "artworkUrl100": "https://is1-ssl.mzstatic.com/image/thumb/Music/v4/0b/9c/d2/0b9cd2e7-6e76-8912-0357-14780cc2616a/source/100x100bb.jpg",
      "artworkUrl60": "https://is1-ssl.mzstatic.com/image/thumb/Music/v4/0b/9c/d2/0b9cd2e7-6e76-8912-0357-14780cc2616a/source/60x60bb.jpg",
      "collectionCensoredName": "Metallica",
      "collectionExplicitness": "notExplicit",
      "collectionId": 579372950,
      "collectionName": "Metallica",
      "collectionPrice": 9.99,
      "collectionType": "Album",
      "collectionViewUrl": "https://music.apple.com/us/album/metallica/579372950?uo=4",
      "copyright": "℗ 1991 Blackened Recordings",
      "country": "USA",
      "currency": "USD",
      "primaryGenreName": "Metal",
      "releaseDate": "1991-08-12T07:00:00Z",
      "trackCount": 13,
      "wrapperType": "collection"
    }
  ]
}
I tried it multiple times and the results seem to be consistent. I compared the requests and they seem to be identical.
Why does iTunes respond differently to different clients? I can't understand. What important detail am I missing?
This problem happens to the following regions (it's a complete list):
LI https://itunes.apple.com/lookup?id=579372950&country=LI&entity=album
MV https://itunes.apple.com/lookup?id=579372950&country=MV&entity=album
MM https://itunes.apple.com/lookup?id=579372950&country=MM&entity=album
ET https://itunes.apple.com/lookup?id=579372950&country=ET&entity=album
RS https://itunes.apple.com/lookup?id=579372950&country=RS&entity=album
I spotted a difference:
http get 'https://itunes.apple.com/lookup?id=579372950&country=MV&entity=album' -> empty response
curl 'https://itunes.apple.com/lookup?id=579372950&country=MV&entity=album' -> empty response
http get https://itunes.apple.com/lookup?id=579372950&country=MV&entity=album -> 1 album in response
curl https://itunes.apple.com/lookup?id=579372950&country=MV&entity=album -> 1 album in response
If I don't put quotes around the URL, the shell treats each & as a command separator, so the request is sent as GET https://itunes.apple.com/lookup?id=579372950. The default country is US, and therefore I see 1 US album in the response.
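To make the difference concrete, here is a small sketch that compares the query parameters of the full URL with those of the part the client receives when the URL is unquoted. Cutting at the first "&" is a simplification of what the shell does; the helper name is hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

full_url = "https://itunes.apple.com/lookup?id=579372950&country=MV&entity=album"

# Unquoted, the shell ends the command at the first "&", so the client
# only receives the text before it.
seen_by_client = full_url.split("&", 1)[0]


def query_params(url):
    """Return the query parameters of a URL as a dict of lists."""
    return parse_qs(urlsplit(url).query)


print(query_params(full_url))        # {'id': ['579372950'], 'country': ['MV'], 'entity': ['album']}
print(query_params(seen_by_client))  # {'id': ['579372950']}
```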
I am trying to get the titles of Booking.com comments from this website:
where r_lang=all basically says that the website should show comments in every language.
In order to obtain the titles from this page I do this:
from urllib.request import urlopen
from bs4 import BeautifulSoup
page = urlopen(url)
soup = BeautifulSoup(page, "html.parser")
reviews = soup.findAll("li", {"class": "review_item clearfix "})
for review in reviews:
    print(review.find("div", {"class": "review_item_header_content"}).text)
From the website (see screenshot), the first two titles should be "Sencillamente placentera" and "It could have been great.". However, the URL somehow only loads comments in Spanish:
“Sencillamente placentera”
“La atención de la chica del restaurante”
“El desayuno estilo buffet, completo ”
“Me gusto la ubicación, y la vista.”
“Su ubicación es muy buena.”
I noticed that if I change 'museo.es.' to 'museo.en.' in the URL, I get the headers of the English comments. But this is inconsistent, because if I load the original URL in a browser, I get comments in English, French, Spanish, etc. How can I fix this? Thanks
Servers can be configured to send different responses based on the browser making the request. Adding a User-Agent seems to fix the problem.
import urllib.request
from bs4 import BeautifulSoup
req = urllib.request.Request(
    url,  # the review page URL from the question
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36',
    },
)
f = urllib.request.urlopen(req)
soup = BeautifulSoup(f.read().decode('utf-8'), 'html.parser')
reviews = soup.findAll("li", {"class": "review_item clearfix "})
for review in reviews:
    print(review.find("div", {"class": "review_item_header_content"}).text)
“Sencillamente placentera”
“It could had been great.”
“will never stay their in the future.”
“Hôtel bien situé.”
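You could always use a browser as a plan B. Selenium doesn't have this problem.
from selenium import webdriver
from selenium.webdriver.common.by import By
d = webdriver.Chrome()
d.get(url)  # url: the review page URL from the question
# find_elements(By.CSS_SELECTOR, ...) replaces find_elements_by_css_selector, which Selenium 4 removed
titles = [item.text for item in d.find_elements(By.CSS_SELECTOR, '.review_item_review_header [itemprop=name]')]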
A newer way to access Booking.com reviews is the reviewlist.html endpoint. For example, for the hotel in the original question, the reviews are located at:
This endpoint is particularly great because it supports many filters and offers up to 25 reviews per page.
Here's a snippet in Python with parsel and httpx:
import asyncio
from typing import List
from urllib.parse import urlencode

from parsel import Selector


def parse_reviews(html: str) -> List[dict]:
    """parse review page for review data """
    sel = Selector(text=html)
    parsed = []
    for review_box in sel.css('.review_list_new_item_block'):
        # helper: first match for a CSS selector, or "" when nothing matches
        get_css = lambda css: review_box.css(css).get("").strip()
        parsed.append({
            "id": review_box.xpath('@data-review-url').get(),
            "score": get_css('.bui-review-score__badge::text'),
            "title": get_css('.c-review-block__title::text'),
            "date": get_css('.c-review-block__date::text'),
            "user_name": get_css('.bui-avatar-block__title::text'),
            "user_country": get_css('.bui-avatar-block__subtitle::text'),
            "text": ''.join(review_box.css('.c-review__body ::text').getall()),
            "lang": review_box.css('.c-review__body::attr(lang)').get(),
        })
    return parsed
async def scrape_reviews(hotel_id: str, session) -> List[dict]:
    """scrape all reviews of a hotel"""
    async def scrape_page(page, page_size=25):  # 25 is largest possible page size for this endpoint
        url = "https://www.booking.com/reviewlist.html?" + urlencode(
            {
                "type": "total",
                # we can configure language preference
                "lang": "en-us",
                # we can configure sorting order here, in this case recent reviews are first
                "sort": "f_recent_desc",
                "cc1": "gb",  # this varies by hotel country, e.g. in OP's case it would be "co" for Colombia.
                "dist": 1,
                "pagename": hotel_id,
                "rows": page_size,
                "offset": page * page_size,
            }
        )
        return await session.get(url)

    first_page = await scrape_page(1)
    total_pages = Selector(text=first_page.text).css(".bui-pagination__link::attr(data-page-number)").getall()
    total_pages = max(int(page) for page in total_pages)
    other_pages = await asyncio.gather(*[scrape_page(i) for i in range(2, total_pages + 1)])

    results = []
    for response in [first_page, *other_pages]:
        # parse every fetched page and collect its reviews
        results.extend(parse_reviews(response.text))
    return results
I am trying to scrape all the objects with the same tag from a specific site (Google Scholar) with BeautifulSoup, but it doesn't scrape the objects hidden behind the "show more" button at the end of the page. How can I fix it?
Here's an example of my code:
# -*- coding: cp1253 -*-
from urllib import urlopen
from bs4 import BeautifulSoup
for t in soup.findAll('a',{"class":"gsc_a_at"}):
    print t.text
You have to pass pagination parameters in the request URL.
cstart - defines the result offset, i.e. how many results to skip. It is used for pagination (e.g., 0 (default) is the first page of results, 20 is the second page, 40 is the third page, etc.).
pagesize - defines the number of results to return (e.g., 20 (default) returns 20 results, 40 returns 40 results, etc.). The maximum number of results to return is 100.
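A minimal sketch of how these two parameters combine into page URLs for an author profile. The "user" parameter and the /citations path are taken from the links in the example output in this answer; the function name and the total of 250 articles are made up for illustration.

```python
from urllib.parse import urlencode


def author_page_urls(user_id, total_articles, page_size=100):
    """Yield profile URLs that together cover total_articles results."""
    base = "https://scholar.google.com/citations"
    for cstart in range(0, total_articles, page_size):
        params = {"user": user_id, "hl": "en", "cstart": cstart, "pagesize": page_size}
        yield f"{base}?{urlencode(params)}"


for url in author_page_urls("FwuKA4UAAAAJ", total_articles=250):
    print(url)
```

Each URL can then be fetched and parsed with the same selector as in the question (a.gsc_a_at), stopping when a page returns no articles if the total is not known in advance.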
You could also use a third party solution like SerpApi to do this for you. It's a paid API with a free trial.
Example Python code (libraries for other languages are also available) to retrieve the second page of results:
from serpapi import GoogleSearch
params = {
    "engine": "google_scholar_author",
    "hl": "en",
    "author_id": "FwuKA4UAAAAJ",
    "start": "20",
    "api_key": "secret_api_key"
}
search = GoogleSearch(params)
results = search.get_dict()
Example JSON output:
"articles": [
  {
    "title": "MuseumScrabble: Design of a mobile game for children’s interaction with a digitally augmented cultural space",
    "link": "https://scholar.google.com/citations?view_op=view_citation&hl=en&user=FwuKA4UAAAAJ&cstart=20&citation_for_view=FwuKA4UAAAAJ:RHpTSmoSYBkC",
    "citation_id": "FwuKA4UAAAAJ:RHpTSmoSYBkC",
    "authors": "C Sintoris, A Stoica, I Papadimitriou, N Yiannoutsou, V Komis, N Avouris",
    "publication": "Social and organizational impacts of emerging mobile devices: Evaluating use …, 2012",
    "cited_by": {
      "value": 69,
      "link": "https://scholar.google.com/scholar?oi=bibs&hl=en&cites=6286720977869955347",
      "serpapi_link": "https://serpapi.com/search.json?cites=6286720977869955347&engine=google_scholar&hl=en",
      "cites_id": "6286720977869955347"
    },
    "year": "2012"
  },
  {
    "title": "The effective combination of hybrid usability methods in evaluating educational applications of ICT: Issues and challenges",
    "link": "https://scholar.google.com/citations?view_op=view_citation&hl=en&user=FwuKA4UAAAAJ&cstart=20&citation_for_view=FwuKA4UAAAAJ:hqOjcs7Dif8C",
    "citation_id": "FwuKA4UAAAAJ:hqOjcs7Dif8C",
    "authors": "N Tselios, N Avouris, V Komis",
    "publication": "Education and Information Technologies 13 (1), 55-76, 2008",
    "cited_by": {
      "value": 68,
      "link": "https://scholar.google.com/scholar?oi=bibs&hl=en&cites=1046912849634390721",
      "serpapi_link": "https://serpapi.com/search.json?cites=1046912849634390721&engine=google_scholar&hl=en",
      "cites_id": "1046912849634390721"
    },
    "year": "2008"
  }
]
In Chrome, try F12 --> Network, select 'Preserve log' and disable cache.
Now hit the show more button.
Check the GET/POST request being sent. You will know what to do next.
I'm using Google Analytics to track data across multiple domains in a single profile.
By default, reporting only shows the path, not the full URL. This makes it quite confusing where multiple pages on our different domains have the same paths (e.g. '/index' or '/about').
To get round this, I've implemented the filter advised by Google to display the full URL in reporting:
Filter Type: Custom filter > Advanced
Field A: Hostname Extract A: (.*)
Field B: Request URI Extract: (.*)
Output To: Request URI Constructor: $A1$B1
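To illustrate what this filter does to the data, here is a rough simulation in Python. It is not Google Analytics code; the hostnames are hypothetical and the function only mimics the Extract and Constructor fields above.

```python
import re


def apply_full_url_filter(hostname, request_uri):
    """Mimic the filter: capture each field with (.*) and output $A1$B1."""
    a1 = re.match(r"(.*)", hostname).group(1)     # Field A: Hostname
    b1 = re.match(r"(.*)", request_uri).group(1)  # Field B: Request URI
    return a1 + b1                                # Constructor: $A1$B1


# Two hypothetical domains sharing the same path become distinguishable.
print(apply_full_url_filter("www.example.com", "/about"))   # www.example.com/about
print(apply_full_url_filter("shop.example.org", "/about"))  # shop.example.org/about
```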
This works just fine; the only downside is that the 'preview link' button in reporting always appends the domain, resulting in a 404 error.
....clicking the 'link preview' icon results in......
Does anyone know a way around this, either by preventing GA from appending the domain or by a better way of displaying the full URLs in reporting?
Thanks Eike - I took your advice and wrote a small browser extension for Chrome. Obviously this isn't essential, but I wanted to address it as our marketing team uses the feature so frequently.
The manifest JSON:
{
  "manifest_version": 2,
  "name": "Analytics cross-domain link shortcut",
  "version": "1.0",
  "description": "Makes the links shortcuts in analytics work when using a 'full url' filter!",
  "content_scripts": [
    {
      "matches": ["*://*/*"],
      "js": ["myscript.js"],
      "run_at": "document_start"
    }
  ]
}
And the script:
if (window.opener && document.referrer == "") {
  var currentLocation = window.location.href;
  if (currentLocation.indexOf("www.appendedurl.com") > -1) {
    var newLocation = currentLocation.substr(30); // where '30' is the length of the appended URL
    window.location.href = "http://" + newLocation;
  }
}
So it's essentially just snipping off the appended URL (if present) on freshly opened popup windows.