Scraping: No attribute find_all for <p> - web-scraping

Good morning :)
I am trying to scrape the content of this website: https://public.era.nih.gov/pubroster/preRosIndex.era?AGENDA=438653&CID=102313
The text I am trying to get seems to be located inside some <p> tags and separated by <br> tags.
For some reason, whenever I try to access a <p>, I get the following error: "ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?", and this happens even if I use find() instead of find_all().
My code is below (it is very simple, with no loop yet; I just want to identify where the error comes from):
from selenium import webdriver
import time
from bs4 import BeautifulSoup
options = webdriver.ChromeOptions()
options.add_argument("headless")
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(executable_path='MYPATH/chromedriver',options=options)
url= "https://public.era.nih.gov/pubroster/preRosIndex.era?AGENDA=446096&CID=102313"
driver.maximize_window()
driver.implicitly_wait(5) # wait up to 5 seconds before calls to find elements time out
driver.get(url)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"html.parser")
column = soup.find_all("div", class_="col-sm-12")
people_in_column = column.find_all("p").find_all("br")
Is there anything obvious I am not understanding here?
Thanks a lot in advance for your help!

You are treating a ResultSet, i.e. a list of items, like a single element: you call find_all() twice, but never iterate over the first result. find_all() returns a list, so either loop over it or use find(), which returns a single element. The correct way is as follows; hopefully it works.
columns = soup.find_all("div", class_="col-sm-12")
for column in columns:
    people_in_column = column.find("p").get_text(strip=True)
    print(people_in_column)
Full working code as an example:
from selenium import webdriver
import time
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
options = Options()
options.add_argument("headless")
options.add_experimental_option('excludeSwitches', ['enable-logging'])
webdriver_service = Service("./chromedriver") # your chromedriver path
driver = webdriver.Chrome(service=webdriver_service, options=options)
url= "https://public.era.nih.gov/pubroster/preRosIndex.era?AGENDA=446096&CID=102313"
driver.maximize_window()
driver.implicitly_wait(5) # wait up to 5 seconds before calls to find elements time out
driver.get(url)
content = driver.page_source
soup = BeautifulSoup(content,"html.parser")
columns = soup.find_all("div", class_="col-sm-12")
for column in columns:
    people_in_column = column.find("p").get_text(strip=True)
    print(people_in_column)
Output:
Notice of NIH Policy to All Applicants: Meeting rosters are provided for information purposes only. Applicant investigators and institutional officials must not communicate directly with study section members about an application before or after the review. Failure to observe this policy will create a serious breach of integrity in the peer review process, and may lead to actions outlined in NOT-OD-22-044, including removal of the application from immediate review.
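The snippet above prints only the combined text of the first <p> in each column. If you also want the individual roster members that the question says are separated by <br> tags, get_text() with a newline separator can split them apart. A minimal sketch, assuming each member sits on its own <br>-separated line inside the paragraph:

for column in soup.find_all("div", class_="col-sm-12"):
    paragraph = column.find("p")
    if paragraph is None:
        continue # some columns may not contain a <p> at all
    # get_text() with a separator turns every <br> into a newline
    for line in paragraph.get_text(separator="\n", strip=True).splitlines():
        print(line)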

Related

Why is requests.get() giving me the information in Spanish?

I'm trying to request the weather from Google for a specific place at a specific time. When I get the response, the text is in Spanish instead of English; i.e. instead of "Mostly cloudy" I get "parcialmente nublado". I'm using the requests library and BeautifulSoup.
from bs4 import BeautifulSoup
import requests
url = "https://www.google.com/search?q=weather+Nissan+Stadium+Nashville+TN+Thursday+December+29+2022+8:15+PM"
page = requests.get(url)
soup = BeautifulSoup(page.content,"html.parser")
clima = soup.find("div",class_="tAd8D")
print(clima.text)
Output
jueves
Mayormente nublado
Máxima: 16°C Mínima: 8°C
Desired output:
Thursday
Mostly cloudy
Maximum: x (Fahrenheit) Minimum: x (Fahrenheit)
The most likely explanation is that Google associates your IP address with a primarily Spanish-speaking region and defaults to giving you results in Spanish.
Try specifying English in your search string by adding hl=en:
https://www.google.com/search?hl=en&q=my+search+string
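A sketch of the same request with the language pinned to English. The tAd8D class comes from the question and may change on Google's side; the Accept-Language header is an extra nudge toward English content:

from bs4 import BeautifulSoup
import requests

params = {
    "q": "weather Nissan Stadium Nashville TN Thursday December 29 2022 8:15 PM",
    "hl": "en", # interface/result language
}
# Accept-Language also tells the server which language the client prefers
headers = {"Accept-Language": "en-US,en;q=0.9"}
page = requests.get("https://www.google.com/search", params=params, headers=headers)
soup = BeautifulSoup(page.content, "html.parser")
clima = soup.find("div", class_="tAd8D")
print(clima.text if clima else "weather widget not found")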

Multiple classes, unable to return desired page(s)

First, I want to say that I am a first-time poster, so I apologize in advance if any part of my question or the way it is asked/presented "sucks." With that being said, I've been trying to scrape a table from barchart.com (which spans multiple pages) using Jupyter and BeautifulSoup, and while I have been successful in returning an entire page, I haven't had much luck returning the specific parts I need. I included some images, the first three of which show the elements that I am currently "choosing" from:
the 'div' element that highlights the entire table
another 'div' element within the first 'div' that also has the entire table I need
The 'table' element that I would use but it doesn't include the left most column that includes the tickers/stock symbols
Regardless of what I put in my code, I always get "[]" back, and I haven't been able to figure out how to write a selector for the multiple parts of each 'div' or 'table', if that makes sense.
My code:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen, Request
stonks_url = Request('https://www.barchart.com/options/unusual-activity/stocks', headers={'User-Agent': 'Mozilla/5.0'})
stonks_data = urlopen(stonks_url)
stonks_html = stonks_data.read()
stonks_data.close()
page_soup = soup(stonks_html, 'html.parser')
uoa_table = page_soup.findAll('tbody', {'data-ng-repeat': 'rows in content'})
print(uoa_table)
Thanks in advance to any advice or guidance!
This page builds its table with JavaScript, so a plain request doesn't return the data. You need Selenium to render the page, then grab the page source and process the table from it:
from bs4 import BeautifulSoup
from urllib.request import urlopen, Request
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.barchart.com/options/unusual-activity/stocks')
soup = BeautifulSoup(driver.page_source, 'html.parser')
# get text
text = soup.get_text()
print(text)
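That dumps the whole page text. To pull just the table, one option is to wait for the rows to render and then select the tbody the question targeted. A sketch, assuming the data-ng-repeat attribute from the question still matches the rendered markup:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.barchart.com/options/unusual-activity/stocks')
# wait until the page's JavaScript has rendered at least one table row
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'tbody tr'))
)
soup = BeautifulSoup(driver.page_source, 'html.parser')
tbody = soup.find('tbody', {'data-ng-repeat': 'rows in content'})
if tbody:
    for row in tbody.find_all('tr'):
        print([td.get_text(strip=True) for td in row.find_all('td')])
driver.quit()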

What Caused the Python NoneType Error During My Splinter 'click()' Call?

When trying to scrape the county data from multiple Politico state web pages, such as this one, I concluded the best method was to first click the button that expands the county list before grabbing the table body's data (when present). However, my attempt at clicking the button failed:
from bs4 import BeautifulSoup as bs
import requests
from splinter import Browser
state_page_url = "https://www.politico.com/2020-election/results/washington/"
executable_path = {'executable_path': 'chrome-driver/chromedriver.exe'}
browser = Browser('chrome', **executable_path, headless=False)
browser.visit(state_page_url)
state_soup = bs(browser.html, 'html.parser')
reveal_button = state_soup.find('button', class_='jsx-3713440361')
if (reveal_button == None):
    # Steps to take when the button isn't present
    # ...
else:
    reveal_button.click()
The error returned when following the else branch is for my click() call: "TypeError: 'NoneType' object is not callable". This doesn't make sense to me, since I thought the if statement implied that reveal_button was not None. Am I misinterpreting the error message, how reveal_button was set, or what I'm actually working with after making state_soup?
Based on the comment thread for the question, and this solution to a similar question, I came across the following fix. The underlying cause: BeautifulSoup only parses static HTML and cannot interact with the page. A Tag has no click() method, and attribute access on a Tag looks for a child tag of that name, returning None when there is none, so reveal_button.click evaluates to None and calling it raises the TypeError. The click has to happen in the browser, via Selenium or Splinter:
from bs4 import BeautifulSoup as bs
import requests
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
# Navigate the page to click the desired button
state_page_url = "https://www.politico.com/2020-election/results/alabama/"
driver = webdriver.Chrome(executable_path='chrome-driver/chromedriver.exe')
driver.get(state_page_url)
button_list = driver.find_elements(By.CLASS_NAME, 'jsx-3713440361')
if button_list == []:
    # Actions to take when no button is found
    # ...
else:
    button_list[-1].click() # the index was determined through trial and error specific to the web page
# Now to grab the table and its data
state_soup = bs(driver.page_source)
state_county_results_table = state_soup.find('tbody', class_='jsx-3713440361')
Note that this required Selenium for navigation and interaction, while BeautifulSoup4 was used to parse the page source for the information I needed.
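As an alternative to the trial-and-error index, an explicit wait for a clickable element is usually more robust. A sketch, assuming the jsx-3713440361 class from the question still identifies the expand button; note it clicks the first match, so if several elements share the class a more specific selector would be needed:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.politico.com/2020-election/results/alabama/")
try:
    # wait up to 10 seconds for the expand button to become clickable
    button = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, "button.jsx-3713440361"))
    )
    button.click()
except Exception:
    pass # no expandable list on this page; the table may already be complete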

How can I use beautiful soup to get the following data from kick starter?

I am trying to get some data from kick starter. How can use beautiful soup library?
Kick Starter link
https://www.kickstarter.com/discover/advanced?woe_id=2347575&sort=magic&seed=2600008&page=7
These are the following information I need
Crowdfunding goal
Total crowdfunding
Total backers
Length of the campaign (# of days)
This is my current code
import requests
r = requests.get('https://www.kickstarter.com/discover/advanced?woe_id=2347575&sort=magic&seed=2600008&page=1')
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('div', attrs={'js-react-proj-card grid-col-12 grid-col-6-sm grid-col-4-lg'})
len(results)
I'll give you some hints from what I know, and hope you can do the rest yourself.
1. Crawling can cause legal problems if you abuse a site's Terms of Service.
2. find_all() should be used with a for statement; it works like "find all" on a web page (Ctrl+F).
e.g.
for a in soup.find_all('div', class_='js-react-proj-card'):
    print(a)
3. The links should be opened in a for statement: https://www.kickstarte...seed=2600008&page=1. The page number at the end repeats, so increment it in the for statement and you can crawl all the data in order.
4. You need to follow links twice: the page above contains a list of projects, and you need to get the link of each project.
So the code's algorithm looks like this:
for i in range(0, 10000):
    url = www.kick.....page=i
    for pj_link in find_all(each pj's link):
        r2 = requests.get(pj_link)
        soup2 = BeautifulSoup(r2.text, 'html.parser')
        ......
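A runnable sketch of that idea, under loud assumptions: Kickstarter renders much of its discover page with JavaScript, and the data-project attribute that project cards have historically carried (a JSON blob with goal, pledged, backers_count, launched_at and deadline fields) may have changed, so verify the selector and field names against the live page before relying on them:

import json
import requests
from bs4 import BeautifulSoup

for page in range(1, 4): # a few pages as a demo; respect the site's Terms of Service
    url = ('https://www.kickstarter.com/discover/advanced'
           '?woe_id=2347575&sort=magic&seed=2600008&page=%d' % page)
    r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(r.text, 'html.parser')
    # each project card has historically embedded its data as JSON
    for card in soup.select('div[data-project]'):
        project = json.loads(card['data-project'])
        # campaign length in days, from launch and deadline unix timestamps
        days = (project.get('deadline', 0) - project.get('launched_at', 0)) // 86400
        print(project.get('goal'), project.get('pledged'),
              project.get('backers_count'), days)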

How to close a window when you click a button to open another window

I am working on a program that will allow someone to enter details in order to write a CV. I am using the Tkinter module (as extra practice) but am already stuck on the menu!
At the moment I have three different options the user can choose: Write CV, Review CV and Exit. I have created a button for each option, and when the user presses a button, the corresponding window opens; however, the menu window remains open (there is a different subroutine for each option).
I understand that you need to do something like window.destroy(); however, I'm not sure how to give a button two commands without doing something too fiddly like creating more subroutines.
The other option, which I think I'd prefer, is if I could clear the menu screen instead.
Here is the programming I have at the moment:
def Main_Menu():
    import tkinter
    main_menu = tkinter.Tk()
    main_menu.title("CV Writer")
    main_menu.geometry("300x300")
    main_menu.wm_iconbitmap('cv_icon.ico')
    title = tkinter.Label(main_menu, text = "Main Menu", font=("Helvetica",25))
    title.pack()
    gap = tkinter.Label(main_menu, text = "")
    gap.pack()
    write_cv = tkinter.Button(main_menu, text = "1) Write CV", font=("Helvetica"), command=Write_CV)
    write_cv.pack()
    review_cv = tkinter.Button(main_menu, text = "2) Review CV", font=("Helvetica"), command=Review_CV)
    review_cv.pack()
    leave = tkinter.Button(main_menu, text = "3) Exit", font=("Helvetica"), command=Exit)
    leave.pack()
    main_menu.mainloop()

def Write_CV():
    import tkinter
    write_cv = tkinter.Tk()
    write_cv.geometry("300x300")
    write_cv.title("Write CV")

def Review_CV():
    import tkinter
    review_cv = tkinter.Tk()
    review_cv.geometry("300x300")
    review_cv.title("Review CV")

def Exit():
    import tkinter
    leave = tkinter.Tk()
    leave.geometry("300x300")
    leave.title("Exit")

Main_Menu()
Running the program should help make this question make more sense!
I am so sorry for the wordy question, but any kind of help would be appreciated! Please bear in mind I am only a GCSE student so simple language would also be so nice! Thank you!
I don't know why you are importing tkinter inside each function; it's completely unnecessary. Simply import it once at the beginning of your file, like this:
import tkinter as tk
So that you can refer to the widgets simply through the tk prefix:
btn = tk.Button(None, text='I can simply refer to a widget with tk')
Apart from this, the structure of your program is really problematic. In my opinion, you should not instantiate Tk inside your Main_Menu function, because the instance will only be visible inside it. If you later want to refer to the master (or root, or whatever you want to call your instance of Tk), you can't, because it's local to that function.
I usually instantiate Tk in the main function of my program, or in the following if __name__ == '__main__': construct:
if __name__ == '__main__':
    master = tk.Tk() # note I am using "tk"
    # create your objects or call your functions here
    master.mainloop()
You are creating an instance of Tk in each of your functions; that is really bad practice, never do that. You should create only one instance of Tk per Tkinter application.
You should use the object-oriented paradigm, or make all your widgets global, to structure your application.
Apart from these details, you can simply call master.destroy() when you want to destroy your main window and all of its child widgets, where master is the Tk instance.
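A minimal sketch of that advice, with hypothetical names: one Tk root holds the menu, withdraw() hides it instead of destroying it when a secondary Toplevel window opens, and closing the secondary window brings the menu back:

import tkinter as tk

def open_write_cv():
    root.withdraw() # hide the menu window
    write_cv = tk.Toplevel(root) # secondary window owned by the single Tk root
    write_cv.geometry("300x300")
    write_cv.title("Write CV")
    # bring the menu back when this window is closed
    write_cv.protocol("WM_DELETE_WINDOW",
                      lambda: (write_cv.destroy(), root.deiconify()))

root = tk.Tk()
root.title("CV Writer")
root.geometry("300x300")
tk.Button(root, text="1) Write CV", command=open_write_cv).pack()
tk.Button(root, text="3) Exit", command=root.destroy).pack()
root.mainloop()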
In general, you have a lot of errors and bad practices here. My advice is to read a tutorial on Python first, and then one on Tkinter, before proceeding.