Beautiful Soup having trouble parsing output from mechanize

def return_with_soup(url):
    # uses mechanize to tell the browser we aren't a bot
    # and to retrieve webpage
    # returns a soupified webpage
    browser = mechanize.Browser() # I am made of human
    browser.set_handle_robots(False) # no bots here, no sir
    browser.open(url)
    # print browser.response().read()
    soup = BeautifulSoup(browser.response().read()) # this is where it breaks
    return soup
It throws this error in reference to the second-to-last line: "TypeError: 'module' object is not callable".
What's going on exactly?

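BeautifulSoup was imported as a module, so the name BeautifulSoup referred to the module rather than the class inside it, and calling it raised the error. I had to change that line to: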
BeautifulSoup.BeautifulSoup(...)
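The same error can be reproduced with any module that contains a class of the same name. Here is a minimal sketch that uses the standard library's datetime as a stand-in for the old BeautifulSoup module (the stand-in is my assumption for illustration, it is not from the original code):

```python
import datetime  # the name "datetime" now refers to the module, not the class

try:
    datetime(2022, 12, 29)  # calling the module itself, like BeautifulSoup(...)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# Going through the module to reach the class works,
# just like BeautifulSoup.BeautifulSoup(...)
moment = datetime.datetime(2022, 12, 29)
print(moment.year)
```

With the newer bs4 package, "from bs4 import BeautifulSoup" imports the class directly, so BeautifulSoup(...) is callable as written.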

Related

Why is requests.get() giving me the information in Spanish?

I'm trying to request the weather from Google for a specific place at a specific time. When I get the response, the text is in Spanish instead of English, i.e. instead of "Mostly cloudy" I get "parcialmente nublado". I'm using the requests library and BeautifulSoup.
from bs4 import BeautifulSoup
import requests
url = "https://www.google.com/search?q=weather+Nissan+Stadium+Nashville+TN+Thursday+December+29+2022+8:15+PM"
page = requests.get(url)
soup = BeautifulSoup(page.content,"html.parser")
clima = soup.find("div",class_="tAd8D")
print(clima.text)
Output
jueves
Mayormente nublado
Máxima: 16°C Mínima: 8°C
Desired output:
Thursday
Mostly cloudy
Maximum: x (Fahrenheit) Minimum: x (Fahrenheit)
The most likely explanation is that Google associates your IP address with a primarily Spanish-speaking region and defaults to giving you results in Spanish.
Try specifying English in your search string by adding hl=en:
https://www.google.com/search?hl=en&q=my+search+string
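If you build the URL in code, it may be easier to keep hl=en as a separate parameter instead of editing the string by hand. A small sketch using only the standard library (the helper name google_search_url is made up for this example):

```python
from urllib.parse import urlencode


def google_search_url(query, language="en"):
    """Build a Google search URL that asks for results in the given language."""
    params = {"hl": language, "q": query}
    # urlencode escapes the values and turns spaces into "+"
    return "https://www.google.com/search?" + urlencode(params)


url = google_search_url("weather Nissan Stadium Nashville TN")
print(url)
```

The resulting URL can be passed to requests.get(url) as before. requests.get also accepts the same dictionary through its params argument. Whether Google honors hl for every result block is something you would have to check against the actual response.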

Scraping: No attribute find_all for <p>

Good morning :)
I am trying to scrape the content of this website: https://public.era.nih.gov/pubroster/preRosIndex.era?AGENDA=438653&CID=102313
The text I am trying to get seems to be located inside some <p> and separated by <br>.
For some reason, whenever I try to access a <p>, I get the following error: "ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?", and this happens even if I use find() instead of find_all().
My code is below (it is a very simple thing with no loop yet, I would just like to identify where the error comes from):
from selenium import webdriver
import time
from bs4 import BeautifulSoup
options = webdriver.ChromeOptions()
options.add_argument("headless")
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(executable_path='MYPATH/chromedriver',options=options)
url= "https://public.era.nih.gov/pubroster/preRosIndex.era?AGENDA=446096&CID=102313"
driver.maximize_window()
driver.implicitly_wait(5) # wait up to 5 seconds before calls to find elements time out
driver.get(url)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"html.parser")
column = soup.find_all("div", class_="col-sm-12")
people_in_column = column.find_all("p").find_all("br")
Is there anything obvious I am not understanding here?
Thanks a lot in advance for your help!
You are calling find_all on a ResultSet, which is a list of elements rather than a single element. find_all returns a ResultSet, so chaining a second find_all onto it without iterating is incorrect. Iterate over the results instead:
columns = soup.find_all("div", class_="col-sm-12")
for column in columns:
    people_in_column = column.find("p").get_text(strip=True)
    print(people_in_column)
Full working code as an example:
from selenium import webdriver
import time
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
options = Options()
options.add_argument("headless")
options.add_experimental_option('excludeSwitches', ['enable-logging'])
webdriver_service = Service("./chromedriver") #Your chromedriver path
driver = webdriver.Chrome(service=webdriver_service, options=options) # pass options so headless mode is applied
url= "https://public.era.nih.gov/pubroster/preRosIndex.era?AGENDA=446096&CID=102313"
driver.maximize_window()
driver.implicitly_wait(5) # wait up to 5 seconds before calls to find elements time out
driver.get(url)
content = driver.page_source
soup = BeautifulSoup(content,"html.parser")
columns = soup.find_all("div", class_="col-sm-12")
for column in columns:
    people_in_column = column.find("p").get_text(strip=True)
    print(people_in_column)
Output:
Notice of NIH Policy to All Applicants:Meeting rosters are provided for information purposes only. Applicant investigators and institutional officials must not communicate directly with study section members about an application before or after the review. Failure to observe this policy will create a serious breach of integrity in the peer review process, and may lead to actions outlined inNOT-OD-22-044, including removal of the application from immediate
review.
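Since the question mentions text separated by <br> inside the <p> tags, here is a rough sketch of how the separate pieces could be collected once you iterate properly. The HTML is a made-up sample, not the real NIH page, and the names in it are placeholders:

```python
from bs4 import BeautifulSoup

# Made-up sample standing in for the real page source
html = """
<div class="col-sm-12">
  <p>Ada Example<br>Bob Sample<br>Cy Placeholder</p>
</div>
<div class="col-sm-12"><span>no paragraph here</span></div>
"""

soup = BeautifulSoup(html, "html.parser")

people = []
for column in soup.find_all("div", class_="col-sm-12"):
    # find_all on a single Tag is fine; a column without <p> just yields nothing
    for paragraph in column.find_all("p"):
        # stripped_strings yields each text piece between the <br> tags
        people.extend(paragraph.stripped_strings)

print(people)
```

Looping with find_all("p") on each column also avoids calling get_text() on None when a column has no <p> at all.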

Multiple classes, unable to return desired page(s)

First, I want to say that I am a first-time poster, so I am sorry in advance if any part of my question or the way it is asked/presented "sucks." With that being said, I've been trying to scrape a table from barchart.com using Jupyter and BeautifulSoup. The table is spread over multiple pages, and while I have been successful in returning the entire page as a whole, I haven't had much luck trying to return the specific pages I need. I did include some images, the first three of which reference the elements that I am currently "choosing" from:
the 'div' element that highlights the entire table
another 'div' element within the first 'div' that also has the entire table I need
The 'table' element that I would use but it doesn't include the left most column that includes the tickers/stock symbols
Regardless of what I have tried to put in my code, I always get a "[]" back and haven't been able to figure out how to write the multiple parts of each 'div' or 'table', if that makes sense.
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen, Request
stonks_url = Request('https://www.barchart.com/options/unusual-activity/stocks', headers={'User-Agent': 'Mozilla/5.0'})
stonks_data = urlopen(stonks_url)
stonks_html = stonks_data.read()
stonks_data.close()
page_soup = soup(stonks_html, 'html.parser')
uoa_table = page_soup.findAll('tbody', {'data-ng-repeat': 'rows in content'})
print(uoa_table)
Thanks in advance for any advice or guidance!
The table on this page is built by JavaScript, so the HTML returned by a plain request does not contain it. You need to use Selenium to load the page, then take the page source and use it for processing the table:
from bs4 import BeautifulSoup
from urllib.request import urlopen, Request
from selenium import webdriver
driver= webdriver.Chrome()
driver.get('https://www.barchart.com/options/unusual-activity/stocks')
soup = BeautifulSoup(driver.page_source, 'html.parser')
# get text
text = soup.get_text()
print(text)
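Once you have the rendered page source, the rows can be pulled out instead of dumping all the text. This is only a sketch: it assumes the rendered page contains ordinary tr/th/td elements, which I have not verified for barchart.com, and the sample HTML and the helper name table_rows are invented for the example:

```python
from bs4 import BeautifulSoup


def table_rows(rendered_html):
    """Return every table row as a list of cell texts."""
    soup = BeautifulSoup(rendered_html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:  # skip rows without any cells
            rows.append(cells)
    return rows


# Invented sample; with Selenium you would pass driver.page_source instead
sample = """
<table>
  <thead><tr><th>Symbol</th><th>Volume</th></tr></thead>
  <tbody>
    <tr><td>AAA</td><td>100</td></tr>
    <tr><td>BBB</td><td>250</td></tr>
  </tbody>
</table>
"""

rows = table_rows(sample)
print(rows)
```

If this returns an empty list on the real page source, the page is probably using different elements for the table, and the selectors need to be adjusted after inspecting it.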

What Caused the Python NoneType Error During My Splinter 'click()' Call?

When trying to scrape the county data from multiple Politico state web pages, such as this one, I concluded the best method was to first click the button that expands the county list before grabbing the table body's data (when present). However, my attempt at clicking the button had failed:
from bs4 import BeautifulSoup as bs
import requests
from splinter import Browser
state_page_url = "https://www.politico.com/2020-election/results/washington/"
executable_path = {'executable_path': 'chrome-driver/chromedriver.exe'}
browser = Browser('chrome', **executable_path, headless=False)
browser.visit(state_page_url)
state_soup = bs(browser.html, 'html.parser')
reveal_button = state_soup.find('button', class_='jsx-3713440361')
if (reveal_button == None):
    # Steps to take when the button isn't present
    # ...
else:
    reveal_button.click()
The error returned when following the else-condition is for my click() call: "TypeError: 'NoneType' object is not callable". This doesn't make sense to me since I thought that the if-statement implied the reveal_button was not a NoneType. Am I misinterpreting the error message, how the reveal_button was set, or what I'm working with after making state_soup?
Based on the comment thread for the question, and this solution to a similar question, I came across the following fix:
from bs4 import BeautifulSoup as bs
import requests
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
# Navigate the page to click the desired button
state_page_url = "https://www.politico.com/2020-election/results/alabama/"
driver = webdriver.Chrome(executable_path='chrome-driver/chromedriver.exe')
driver.get(state_page_url)
button_list = driver.find_elements(By.CLASS_NAME, 'jsx-3713440361')
if button_list == []:
    # Actions to take when no button is found
    # ...
    pass
else:
    button_list[-1].click() # The index was determined through trial/error specific to the web page
# Now to grab the table and its data
state_soup = bs(driver.page_source, 'html.parser')
state_county_results_table = state_soup.find('tbody', class_='jsx-3713440361')
Note that Selenium was required for navigation and interaction, while BeautifulSoup4 was used to parse the resulting page for the information I needed.
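As for why the original error mentioned NoneType even though reveal_button was found: a BeautifulSoup Tag treats an unknown attribute such as .click as a search for a child tag with that name, so reveal_button.click is the same as reveal_button.find("click"), which is None. Calling it then fails. A small sketch with a made-up button shows this:

```python
from bs4 import BeautifulSoup

# Made-up markup; only the class name is taken from the question
soup = BeautifulSoup('<button class="jsx-3713440361">Show all</button>', "html.parser")
reveal_button = soup.find("button", class_="jsx-3713440361")

# The button itself was found, so the if-branch is skipped as expected
print(reveal_button is not None)

# Attribute access on a Tag looks for a child tag called <click>
looked_up = reveal_button.click
print(looked_up)

try:
    reveal_button.click()  # this calls None
    outcome = "clicked"
except TypeError as exc:
    outcome = "TypeError: " + str(exc)
print(outcome)
```

So the parsed soup is a static copy of the HTML and has nothing to click. Clicking has to go through the browser object (Selenium or Splinter), not through the Tag.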

How can I use Beautiful Soup to get the following data from Kickstarter?

I am trying to get some data from Kickstarter. How can I use the Beautiful Soup library?
Kickstarter link:
https://www.kickstarter.com/discover/advanced?woe_id=2347575&sort=magic&seed=2600008&page=7
This is the information I need:
Crowdfunding goal
Total crowdfunding
Total backers
Length of the campaign (# of days)
This is my current code
import requests
r = requests.get('https://www.kickstarter.com/discover/advanced?woe_id=2347575&sort=magic&seed=2600008&page=1')
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('div', attrs={'js-react-proj-card grid-col-12 grid-col-6-sm grid-col-4-lg'})
len(results)
I'll give you some hints that I know, and hope you can do the rest by yourself.
1. Crawling can cause legal problems if you violate the site's Terms of Service.
2. find_all should be used with a for statement. It works like "find all" on a web page (Ctrl+F).
e.g.
for a in soup.find_all('div', attrs={'js-react-proj-card grid-col-12 grid-col-6-sm grid-col-4-lg'}):
    print(a)
3. The links should be opened inside a for statement. - https://www.kickstarte...seed=2600008&page=1
The page number at the end of the link is what changes on each pass of the for statement, so you can crawl all the data in order.
4. You should follow links twice. The link above is a list of projects, so you need to get the link of each of those projects.
So the code's algorithm looks like this:
for i in range(0,10000):
    url = www.kick.....page=i
    for pj_link in find_all(each pj's link):
        r2 = requests.get(pj_link)
        soup2 = BeautifulSoup(r2.text, 'html.parser')
        ......
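To make the outline above a bit more concrete, here is a rough sketch of the two-level loop. Several things are assumptions of mine: the class name "project-link" is hypothetical (inspect the real page for the actual selector), the fetch argument is a stand-in for something like "lambda u: requests.get(u).text", and the fake_site dictionary only exists so the sketch runs offline:

```python
from urllib.parse import urlencode, urljoin

from bs4 import BeautifulSoup

LISTING_BASE = "https://www.kickstarter.com/discover/advanced"


def listing_url(page):
    """Build the listing URL from the question for a given page number."""
    params = {"woe_id": 2347575, "sort": "magic", "seed": 2600008, "page": page}
    return LISTING_BASE + "?" + urlencode(params)


def project_links(listing_html, base_url):
    """Collect project links from one listing page.

    ASSUMPTION: project anchors carry the hypothetical class "project-link".
    """
    soup = BeautifulSoup(listing_html, "html.parser")
    links = []
    for anchor in soup.find_all("a", class_="project-link", href=True):
        link = urljoin(base_url, anchor["href"])
        if link not in links:  # keep order, drop duplicates
            links.append(link)
    return links


def crawl(fetch, pages):
    """Yield (project_url, soup) for every project on the given listing pages."""
    for page in pages:
        url = listing_url(page)
        for link in project_links(fetch(url), url):
            yield link, BeautifulSoup(fetch(link), "html.parser")


# Offline stand-in for the network so the sketch runs without requests
fake_site = {
    listing_url(1): '<a class="project-link" href="/projects/demo/one">One</a>'
                    '<a class="project-link" href="/projects/demo/two">Two</a>',
    "https://www.kickstarter.com/projects/demo/one": "<h1>One</h1>",
    "https://www.kickstarter.com/projects/demo/two": "<h1>Two</h1>",
}

results = [(link, soup.h1.get_text()) for link, soup in crawl(fake_site.__getitem__, [1])]
print(results)
```

Inside the inner loop you would then search each project's soup for the goal, pledged amount, backers and campaign length. Where those values sit in the real markup is not something this sketch knows, and if the listing is filled in by JavaScript you may need Selenium for the fetch step.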
