BeautifulSoup not getting all Children elements - web-scraping

Have an issue when trying to scrape the following URL: https://www.hiperlibertad.com.ar/lacteos/leches
I used the following simple code as a starter:
import requests
from bs4 import BeautifulSoup

def Disco_scrape(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    main = soup.find_all("body")
    main2 = soup.find_all("div")
    return soup, main, main2
However, the vital info I'm looking for is in the main element inside body, and it doesn't show up. I'm attaching an image of the HTML code:
Any suggestions why this is happening?

The data is loaded dynamically, so requests alone won't see it. However, the data is available by sending a GET request to the site's API. For example, to print the product descriptions, you can access the response as a Python dictionary (dict) and read its keys / values:
import requests

URL = "https://www.hiperlibertad.com.ar/api/catalog_system/pub/products/search/lacteos/leches?O=OrderByTopSaleDESC&_from=0&_to=23&ft&sc=1"
response = requests.get(URL).json()
for data in response:
    print(data["description"])
Output:
Leche Entera UAT MANFREY Larga Vida 1 L
Leche Parcialmente Descremada UAT MANFREY Larga Vida 1 L
Leche descremada larga vida La Serenísima 1% 1 Lt
...
...
An alternative (slower) approach would be to use Selenium together with BeautifulSoup to scrape the page.
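A minimal sketch of that Selenium approach, in case it helps. This assumes Selenium 4+ with Chrome installed; the fixed sleep is a crude stand-in for a proper wait and may need tuning:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import time

options = Options()
options.add_argument("--headless=new")  # render the page without opening a browser window
driver = webdriver.Chrome(options=options)

driver.get("https://www.hiperlibertad.com.ar/lacteos/leches")
time.sleep(10)  # crude wait for the JavaScript to render the product list
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

# The rendered DOM now contains the <main> element the question is after
main = soup.find("main")
print(main is not None)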

Related

Why is requests.get() giving me the information in Spanish?

I'm trying to request the weather from Google for a specific place at a specific time. When I get the response, the text is in Spanish instead of English, i.e. instead of "Mostly cloudy" I get "parcialmente nublado". I'm using the requests library and BeautifulSoup.
from bs4 import BeautifulSoup
import requests
url = "https://www.google.com/search?q=weather+Nissan+Stadium+Nashville+TN+Thursday+December+29+2022+8:15+PM"
page = requests.get(url)
soup = BeautifulSoup(page.content,"html.parser")
clima = soup.find("div",class_="tAd8D")
print(clima.text)
Output
jueves
Mayormente nublado
Máxima: 16°C Mínima: 8°C
Desired output:
Thursday
Mostly cloudy
Maximum: x (Fahrenheit) Minimum: x (Fahrenheit)
The most likely explanation is that Google associates your IP address with a primarily Spanish-speaking region and defaults to giving you results in Spanish.
Try specifying English in your search string by adding hl=en:
https://www.google.com/search?hl=en&q=my+search+string
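With requests this can be done by passing hl=en in the query parameters. A minimal sketch reusing the question's code (the tAd8D class name comes from the question and may change; the Accept-Language header is an extra nudge that sometimes helps):

from bs4 import BeautifulSoup
import requests

params = {
    "hl": "en",  # ask Google for English results
    "q": "weather Nissan Stadium Nashville TN Thursday December 29 2022 8:15 PM",
}
headers = {"Accept-Language": "en-US,en;q=0.9"}  # language hint in the request headers

page = requests.get("https://www.google.com/search", params=params, headers=headers)
soup = BeautifulSoup(page.content, "html.parser")

clima = soup.find("div", class_="tAd8D")
if clima is not None:
    print(clima.text)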

Scraping: No attribute find_all for <p>

Good morning :)
I am trying to scrape the content of this website: https://public.era.nih.gov/pubroster/preRosIndex.era?AGENDA=438653&CID=102313
The text I am trying to get seems to be located inside some <p> and separated by <br>.
For some reason, whenever I try to access a <p>, I get the following error: "ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?", and this happens even if I use find() instead of find_all().
My code is below (it is very simple, with no loop yet; I just want to identify where the mistake comes from):
from selenium import webdriver
import time
from bs4 import BeautifulSoup
options = webdriver.ChromeOptions()
options.add_argument("headless")
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(executable_path='MYPATH/chromedriver',options=options)
url= "https://public.era.nih.gov/pubroster/preRosIndex.era?AGENDA=446096&CID=102313"
driver.maximize_window()
driver.implicitly_wait(5) # wait up to 5 seconds before calls to find elements time out
driver.get(url)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"html.parser")
column = soup.find_all("div", class_="col-sm-12")
people_in_column = column.find_all("p").find_all("br")
Is there anything obvious I am not understanding here?
Thanks a lot in advance for your help!
You are calling find_all() on a ResultSet, i.e. a list of elements, which is incorrect: you use the find_all() method twice but never iterate over the first result. The correct way is as follows. Hope it works.
columns = soup.find_all("div", class_="col-sm-12")
for column in columns:
    people_in_column = column.find("p").get_text(strip=True)
    print(people_in_column)
Full working code as an example:
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

options = Options()
options.add_argument("headless")
options.add_experimental_option('excludeSwitches', ['enable-logging'])
webdriver_service = Service("./chromedriver")  # your chromedriver path
driver = webdriver.Chrome(service=webdriver_service, options=options)

url = "https://public.era.nih.gov/pubroster/preRosIndex.era?AGENDA=446096&CID=102313"
driver.maximize_window()
driver.implicitly_wait(5)  # wait up to 5 seconds before calls to find elements time out
driver.get(url)

content = driver.page_source
soup = BeautifulSoup(content, "html.parser")
columns = soup.find_all("div", class_="col-sm-12")
for column in columns:
    people_in_column = column.find("p").get_text(strip=True)
    print(people_in_column)
Output:
Notice of NIH Policy to All Applicants:Meeting rosters are provided for information purposes only. Applicant investigators and institutional officials must not communicate directly with study section members about an application before or after the review. Failure to observe this policy will create a serious breach of integrity in the peer review process, and may lead to actions outlined inNOT-OD-22-044, including removal of the application from immediate review.
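If you also need the individual names that the <br> tags separate inside each <p>, get_text() accepts a separator. A small standalone sketch of that technique (the markup below is invented purely for illustration, not taken from the NIH page):

from bs4 import BeautifulSoup

html = "<div class='col-sm-12'><p>DOE, JANE<br/>SMITH, JOHN<br/>LEE, ALEX</p></div>"
soup = BeautifulSoup(html, "html.parser")

for column in soup.find_all("div", class_="col-sm-12"):
    p = column.find("p")
    # a separator keeps one entry per <br>-separated line instead of joining them
    names = [line.strip() for line in p.get_text(separator="\n").split("\n") if line.strip()]
    print(names)  # ['DOE, JANE', 'SMITH, JOHN', 'LEE, ALEX']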

Python Request returning only two elements

I am building a web scraper for a real estate website. The requests module only returns the first two elements I am looking for. I have tried using HTMLSession, as well as adding headers to the requests.get argument, but it still doesn't work.
import requests
from bs4 import BeautifulSoup

url = requests.get('https://www.fotocasa.es/es/comprar/viviendas/lugo-provincia/todas-las-zonas/l').text
soup = BeautifulSoup(url, 'lxml')
activos = soup.find_all('div', class_='re-CardPackPremium-info')
for activo in activos:
    try:
        link = get_link(activo, 'a', 'href')  # get_link is a helper defined elsewhere in the scraper
    except:
        link = 'NaN'
    print(link)
It returns only the first two links:
http://fotocasa.es//es/comprar/vivienda/lugo-capital/parking-trastero-ascensor/164412301/d?from=list
http://fotocasa.es//es/comprar/vivienda/lugo-capital/san-roque-as-fontinas/162586475/d?from=list
Thank you in advance

Get JSON request from link into R

I am trying to learn how to collect data from the web into R. There's a website from the Brazilian Ministry of Health that shares the numbers of the disease here in Brazil; it is a public portal.
COVIDBRASIL
So, on this page, I am interested in the graph that displays the daily reporting of cases here in Brazil. Using the inspector in Google Chrome I can access the JSON file feeding the data to this chart; my question is how I could get this file automatically with R. When I try to open the JSON in a new tab outside the inspector's "Response" tab, I get an "Unauthorized" message. Is there any way of doing this, or would I have to manually copy the JSON from the inspector and update my R script every time?
In my case, I am interested in the "PortalDias" response. Thank you.
URL PORTAL DIAS
You need to set some headers to prevent this "Unauthorized" message. I copied them from the 'Headers' section in the browser 'Network' window.
library(curl)
library(jsonlite)

url <- "https://xx9p7hp1p7.execute-api.us-east-1.amazonaws.com/prod/PortalDias"
h <- new_handle()
handle_setheaders(h,
  Host = "xx9p7hp1p7.execute-api.us-east-1.amazonaws.com",
  `Accept-Encoding` = "gzip, deflate, br",
  `X-Parse-Application-Id` = "unAFkcaNDeXajurGB7LChj8SgQYS2ptm")
fromJSON(rawToChar(curl_fetch_memory(url, handle = h)$content))
# $results
# objectId label createdAt updatedAt qtd_confirmado qtd_obito
# 1 6vr9rUPbd4 26/02 2020-03-25T16:25:53.970Z 2020-03-25T22:25:42.967Z 1 123
# 2 FUNHS00sng 27/02 2020-03-25T16:27:34.040Z 2020-03-25T22:25:55.169Z 0 34
# 3 t4qW51clpj 28/02 2020-03-25T19:08:36.689Z 2020-03-25T22:26:02.427Z 0 35
# ...

How can I use Beautiful Soup to get the following data from Kickstarter?

I am trying to get some data from Kickstarter. How can I use the Beautiful Soup library?
Kickstarter link
https://www.kickstarter.com/discover/advanced?woe_id=2347575&sort=magic&seed=2600008&page=7
These are the following information I need
Crowdfunding goal
Total crowdfunding
Total backers
Length of the campaign (# of days)
This is my current code
import requests
r = requests.get('https://www.kickstarter.com/discover/advanced?woe_id=2347575&sort=magic&seed=2600008&page=1')
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('div', attrs={'js-react-proj-card grid-col-12 grid-col-6-sm grid-col-4-lg'})
len(results)
I'll give you some hints based on what I know, and hope you can do the rest yourself.
1. Crawling can cause legal problems when you abuse a site's Terms of Service.
2. find_all() should be used inside a for statement; it works like "find all" on a web page (Ctrl + F).
e.g.
for a in soup.find_all('div', attrs={'js-react-proj-card grid-col-12 grid-col-6-sm grid-col-4-lg'}):
    print(a)
3. The links should also be opened in a for statement. - https://www.kickstarte...seed=2600008&page=1
The page number at the end is what you vary in the for statement, so you can crawl all of the data in order.
4. You should follow links twice. - The link above gives a list of projects, and you then have to get the link of each of those projects.
So the code's algorithm looks like this (see the runnable sketch after this pseudocode):
for i in range(0, 10000):
    url = www.kick.....page=i
    for pj_link in find_all(each pj's link):
        r2 = requests.get(pj_link)
        soup2 = BeautifulSoup(r2.text, 'html.parser')
        ......
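A rough, runnable sketch of that algorithm, for orientation only: the card class comes from the question, the assumption that each card contains an <a href> pointing at the project page is mine, and the listing may be rendered with JavaScript, so plain requests can return few or no cards.

import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE = "https://www.kickstarter.com/discover/advanced?woe_id=2347575&sort=magic&seed=2600008&page={}"

for page_number in range(1, 4):  # outer loop: the listing pages
    listing = requests.get(BASE.format(page_number))
    soup = BeautifulSoup(listing.text, "html.parser")

    for card in soup.select("div.js-react-proj-card"):  # the project cards from the question
        link = card.find("a", href=True)
        if link is None:
            continue
        project_url = urljoin("https://www.kickstarter.com", link["href"])

        project_page = requests.get(project_url)  # second request: the project's own page
        project_soup = BeautifulSoup(project_page.text, "html.parser")
        # ... extract the goal, total pledged, backers and campaign length from project_soup ...
        print(project_url)

    time.sleep(1)  # be polite between requests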
