Failing to get image with requests.get - python-requests

My goal is to compute the average brightness of an image displayed at the given url.
import requests
from PIL import ImageStat, UnidentifiedImageError
from PIL import Image as ImPIL
def brightness(imgUrl):
    response = requests.get(imgUrl)
    try:
        img = ImPIL.open(response.raw)
        img = img.convert('L')
        stat = ImageStat.Stat(img)
        return stat.mean[0]  # Average brightness of pixels in given image
    except UnidentifiedImageError:
        print(imgUrl)
The images I am using are paintings hosted on Wikidata/Wikimedia.
Here are some of the URLs that trigger the exception:
http://commons.wikimedia.org/wiki/Special:FilePath/15-10-27-Els%20Quatre%20Gats-RalfR-WMA%202740a.jpg
http://commons.wikimedia.org/wiki/Special:FilePath/Garrote%20vil%2C%20de%20Ram%C3%B3n%20Casas.jpg
http://commons.wikimedia.org/wiki/Special:FilePath/La%20morfina%20%28Santiago%20Rusi%C3%B1ol%29.jpg
http://commons.wikimedia.org/wiki/Special:FilePath/Ramon%20Casas%20-%20Over%20My%20Dead%20Body%20-%20Google%20Art%20Project.jpg
What am I missing?

Turns out Wikimedia returns a 403 error because of limits on queries (I had 100 URLs to go through).
Here is an answer with more details on these limits: https://stackoverflow.com/a/61805827/18124080
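As a sketch of how this could be made more robust, the snippet below surfaces HTTP errors instead of handing an error page to PIL, sends a descriptive User-Agent, and throttles the loop; the header value, the one-second delay, and the urls list are illustrative assumptions, and the image is read from response.content via BytesIO rather than response.raw:
import io
import time
import requests
from PIL import ImageStat, UnidentifiedImageError
from PIL import Image as ImPIL

HEADERS = {"User-Agent": "brightness-script/0.1 (contact: you@example.com)"}  # illustrative value

def brightness(imgUrl):
    response = requests.get(imgUrl, headers=HEADERS)
    response.raise_for_status()  # raises on 403 instead of feeding an error page to PIL
    try:
        img = ImPIL.open(io.BytesIO(response.content)).convert('L')
        return ImageStat.Stat(img).mean[0]  # average brightness
    except UnidentifiedImageError:
        print(imgUrl)

for url in urls:      # assumes `urls` holds the ~100 image URLs
    print(brightness(url))
    time.sleep(1)     # simple throttle to stay within Wikimedia's limits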

Related

Why is requests.get() giving me the information in Spanish?

I'm trying to request the weather from Google for a specific place at a specific time. When I get the response, the text is in Spanish instead of English, i.e. instead of "Mostly cloudy" I get "parcialmente nublado". I'm using the requests library and BeautifulSoup.
from bs4 import BeautifulSoup
import requests
url = "https://www.google.com/search?q=weather+Nissan+Stadium+Nashville+TN+Thursday+December+29+2022+8:15+PM"
page = requests.get(url)
soup = BeautifulSoup(page.content,"html.parser")
clima = soup.find("div",class_="tAd8D")
print(clima.text)
Output
jueves
Mayormente nublado
Máxima: 16°C Mínima: 8°C
Desired output:
Thursday
Mostly cloudy
Maximum: x (Fahrenheit) Minimum: x (Fahrenheit)
The most likely explanation is that Google associates your IP address with a primarily Spanish-speaking region and defaults to giving you results in Spanish.
Try specifying English in your search string by adding hl=en:
https://www.google.com/search?hl=en&q=my+search+string
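As a sketch, the same request can ask for English both through the hl parameter and an Accept-Language header; note that the tAd8D class name is taken from the question and may break whenever Google changes its markup:
from bs4 import BeautifulSoup
import requests

url = ("https://www.google.com/search?hl=en"
       "&q=weather+Nissan+Stadium+Nashville+TN+Thursday+December+29+2022+8:15+PM")
headers = {"Accept-Language": "en-US,en;q=0.9"}  # explicitly prefer English responses

page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, "html.parser")
clima = soup.find("div", class_="tAd8D")
if clima is not None:
    print(clima.text)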

Scraping: No attribute find_all for <p>

Good morning :)
I am trying to scrape the content of this website: https://public.era.nih.gov/pubroster/preRosIndex.era?AGENDA=438653&CID=102313
The text I am trying to get seems to be located inside some <p> tags and separated by <br> tags.
For some reason, whenever I try to access a <p>, I get the following error: "ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?", and this happens even if I use find() instead of find_all().
My code is below (it is a very simple thing with no loop yet; I just want to identify where the error comes from):
from selenium import webdriver
import time
from bs4 import BeautifulSoup
options = webdriver.ChromeOptions()
options.add_argument("headless")
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(executable_path='MYPATH/chromedriver',options=options)
url= "https://public.era.nih.gov/pubroster/preRosIndex.era?AGENDA=446096&CID=102313"
driver.maximize_window()
driver.implicitly_wait(5) # wait up to 5 seconds before calls to find elements time out
driver.get(url)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"html.parser")
column = soup.find_all("div", class_="col-sm-12")
people_in_column = column.find_all("p").find_all("br")
Is there anything obvious I am not understanding here?
Thanks a lot in advance for your help!
You are calling find_all() on a ResultSet, i.e. on a list of elements rather than a single element, and you do it twice without iterating. Iterate over the ResultSet and call find() on each element instead. The correct way is as follows; hope it works for you.
columns = soup.find_all("div", class_="col-sm-12")
for column in columns:
    people_in_column = column.find("p").get_text(strip=True)
    print(people_in_column)
Full working code as an example:
from selenium import webdriver
import time
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
options = Options()
options.add_argument("headless")
options.add_experimental_option('excludeSwitches', ['enable-logging'])
webdriver_service = Service("./chromedriver")  # Your chromedriver path
driver = webdriver.Chrome(service=webdriver_service, options=options)
url = "https://public.era.nih.gov/pubroster/preRosIndex.era?AGENDA=446096&CID=102313"
driver.maximize_window()
driver.implicitly_wait(5)  # wait up to 5 seconds before calls to find elements time out
driver.get(url)
content = driver.page_source
soup = BeautifulSoup(content, "html.parser")
columns = soup.find_all("div", class_="col-sm-12")
for column in columns:
    people_in_column = column.find("p").get_text(strip=True)
    print(people_in_column)
Output:
Notice of NIH Policy to All Applicants: Meeting rosters are provided for information purposes only. Applicant investigators and institutional officials must not communicate directly with study section members about an application before or after the review. Failure to observe this policy will create a serious breach of integrity in the peer review process, and may lead to actions outlined in NOT-OD-22-044, including removal of the application from immediate review.

What Caused the Python NoneType Error During My Splinter 'click()' Call?

When trying to scrape the county data from multiple Politico state web pages, such as this one, I concluded the best method was to first click the button that expands the county list before grabbing the table body's data (when present). However, my attempt at clicking the button had failed:
from bs4 import BeautifulSoup as bs
import requests
from splinter import Browser
state_page_url = "https://www.politico.com/2020-election/results/washington/"
executable_path = {'executable_path': 'chrome-driver/chromedriver.exe'}
browser = Browser('chrome', **executable_path, headless=False)
browser.visit(state_page_url)
state_soup = bs(browser.html, 'html.parser')
reveal_button = state_soup.find('button', class_='jsx-3713440361')
if (reveal_button == None):
    # Steps to take when the button isn't present
    # ...
else:
    reveal_button.click()
The error returned when following the else-condition is for my click() call: "TypeError: 'NoneType' object is not callable". This doesn't make sense to me, since I thought the if-statement implied that reveal_button was not None. Am I misinterpreting the error message, how reveal_button was set, or what I'm working with after making state_soup?
The root cause is that reveal_button is a BeautifulSoup Tag, i.e. static parsed HTML: a Tag has no click() method, so the attribute lookup falls back to find('click'), which returns None, and calling that None raises the TypeError. Clicking has to happen in the live browser instead. Based on the comment thread for the question, and this solution to a similar question, I came across the following fix:
from bs4 import BeautifulSoup as bs
import requests
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
# Navigate the page to click the desired button
state_page_url = "https://www.politico.com/2020-election/results/alabama/"
driver = webdriver.Chrome(executable_path='chrome-driver/chromedriver.exe')
driver.get(state_page_url)
button_list = driver.find_elements(By.CLASS_NAME, 'jsx-3713440361')
if button_list == []:
    # Actions to take when no button is found
    # ...
else:
    button_list[-1].click()  # The index was determined through trial/error specific to the web page

# Now to grab the table and its data
state_soup = bs(driver.page_source, 'html.parser')
state_county_results_table = state_soup.find('tbody', class_='jsx-3713440361')
Note that this required Selenium for navigation and interaction, while BeautifulSoup4 was used to parse the page for the information I needed.

How to update payload info for python scraping

I have a python scraper that works for this site:
https://dhhr.wv.gov/COVID-19/Pages/default.aspx
It will scrape the tooltips from one of the graphs that is navigated to by clicking the "Positive Case Trends" link in the above URL.
Here is my code:
import re
import requests
import json
from datetime import date
url4 = 'https://wabi-us-gov-virginia-api.analysis.usgovcloudapi.net/public/reports/querydata?synchronous=true'
# payload:
x=r'{"version":"1.0.0","queries":[{"Query":{"Commands":[{"SemanticQueryDataShapeCommand":{"Query":{"Version":2,"From":[{"Name":"c","Entity":"Case Data"}],"Select":[{"Column":{"Expression":{"SourceRef":{"Source":"c"}},"Property":"Lab Report Date"},"Name":"Case Data.Lab Add Date"},{"Aggregation":{"Expression":{"Column":{"Expression":{"SourceRef":{"Source":"c"}},"Property":"Daily Confirmed Cases"}},"Function":0},"Name":"Sum(Case Data.Daily Confirmed Cases)"},{"Aggregation":{"Expression":{"Column":{"Expression":{"SourceRef":{"Source":"c"}},"Property":"Daily Probable Cases"}},"Function":0},"Name":"Sum(Case Data.Daily Probable Cases)"}]},"Binding":{"Primary":{"Groupings":[{"Projections":[0,1,2]}]},"DataReduction":{"DataVolume":4,"Primary":{"BinnedLineSample":{}}},"Version":1}}}]},"CacheKey":"{\"Commands\":[{\"SemanticQueryDataShapeCommand\":{\"Query\":{\"Version\":2,\"From\":[{\"Name\":\"c\",\"Entity\":\"Case Data\"}],\"Select\":[{\"Column\":{\"Expression\":{\"SourceRef\":{\"Source\":\"c\"}},\"Property\":\"Lab Report Date\"},\"Name\":\"Case Data.Lab Add Date\"},{\"Aggregation\":{\"Expression\":{\"Column\":{\"Expression\":{\"SourceRef\":{\"Source\":\"c\"}},\"Property\":\"Daily Confirmed Cases\"}},\"Function\":0},\"Name\":\"Sum(Case Data.Daily Confirmed Cases)\"},{\"Aggregation\":{\"Expression\":{\"Column\":{\"Expression\":{\"SourceRef\":{\"Source\":\"c\"}},\"Property\":\"Daily Probable Cases\"}},\"Function\":0},\"Name\":\"Sum(Case Data.Daily Probable Cases)\"}]},\"Binding\":{\"Primary\":{\"Groupings\":[{\"Projections\":[0,1,2]}]},\"DataReduction\":{\"DataVolume\":4,\"Primary\":{\"BinnedLineSample\":{}}},\"Version\":1}}}]}","QueryId":"","ApplicationContext":{"DatasetId":"fb9b182d-de95-4d65-9aba-3e505de8eb75","Sources":[{"ReportId":"dbabbc9f-cc0d-4dd0-827f-5d25eeca98f6"}]}}],"cancelQueries":[],"modelId":339580}'
x=x.replace("\\\'","'")
json_data = json.loads(x)
final_data2 = requests.post(url4, json=json_data, headers={'X-PowerBI-ResourceKey': 'ab4e5874-7bbf-44c9-9443-0701abdee612'}).json()
print(json.dumps(final_data2))
The issue is that some days it stops working because the payload and the X-PowerBI-ResourceKey header values change, and I have to manually copy and paste the new values from the browser's network inspector into my source. Is there a way to programmatically obtain these from the webpage and construct them in my code?
I'm pretty sure the resource key is part of the iframe URL, encoded as base64.
from base64 import b64decode
from bs4 import BeautifulSoup
import json
import requests
resp = requests.get('https://dhhr.wv.gov/COVID-19/Pages/default.aspx')
soup = BeautifulSoup(resp.text, 'html.parser')
data = soup.find_all('iframe')[0]['src'].split('=').pop()
decoded = json.loads(b64decode(data).decode())
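Assuming the decoded value follows the usual Power BI embed format, where the resource key sits under a "k" field (an assumption about the payload's structure, not something the page itself confirms), the key could then be reused with the question's url4 and json_data:
# Assumed shape of the decoded blob: {"k": "<resource key>", "t": "<tenant id>", ...}
resource_key = decoded.get('k')
final_data = requests.post(url4, json=json_data,
                           headers={'X-PowerBI-ResourceKey': resource_key}).json()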

OSError Err 22 [Invalid Argument] when scraping NASA website using python script

I've just put together a little scraping script within my Windows environment in order to download all pictures from the associated URLs within the main index URL. A few pictures download into the subdirectory, but the code dumps out with the error:
"OSError: [Errno 22] Invalid argument: 'apod_pictures\start?type=1x1'"
What am I doing wrong? Any help would be appreciated.
Thanks,
Alun.
import os
import urllib.request
from urllib.parse import urljoin
from bs4 import BeautifulSoup
# Download index page
base_url = "https://apod.nasa.gov/apod/astropix.html"
download_directory = "apod_pictures"
content = urllib.request.urlopen(base_url).read()
# For each link on the index page
for link in BeautifulSoup(content, "lxml").findAll("a"):
    print("Following link:", link)
    href = urljoin(base_url, link["href"])
    # Follow the link and pull down the image held on that page
    content = urllib.request.urlopen(href).read()
    for img in BeautifulSoup(content, "lxml").findAll("img"):
        img_href = urljoin(href, img["src"])
        print("Downloading image:", img_href)
        img_name = img_href.split("/")[-1]
        "".join(x for x in img_name if x.isalnum())
        "".join(x for x in download_directory if x.isalnum())
        urllib.request.urlretrieve(img_href, os.path.join(download_directory, img_name))
"OSError: [Errno 22] Invalid argument: 'apod_pictures\\start?type=1x1'"
When I first saw the issue, I tried inserting the lines of code seen above and below, which didn't seem to correct things:
"".join(x for x in img_name if x.isalnum())
"".join(x for x in download_directory if x.isalnum())
