I am trying to extract the "Four Factors" table from the following URL: https://www.basketball-reference.com/boxscores/201810160GSW.html. When I use the findAll method in the BeautifulSoup library to search for tables, I do not see that table, nor do I see the "Line Score" table. I am only concerned with the "Four Factors" table, but I figured the note about the "Line Score" table could be useful information.
import requests
import bs4

URL2 = 'https://www.basketball-reference.com/boxscores/201810160GSW.html'
page2 = requests.get(URL2)
page2 = page2.text
soup2 = bs4.BeautifulSoup(page2, 'html.parser')
content = soup2.findAll('table')
If you look at content, you can find the other 4 tables on the page, but the "Four Factors" and "Line Score" tables do not show up there. In addition to helping me extract the "Four Factors" table, can you explain why it doesn't show up in content?
The table is delivered inside one of the HTML comments, which is why you weren't finding it, I think.
import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd

r = requests.get('https://www.basketball-reference.com/boxscores/201810160GSW.html')
soup = BeautifulSoup(r.text, 'lxml')
# The secondary tables are shipped inside HTML comments, so search the
# comment nodes rather than the rendered markup.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for comment in comments:
    if 'id="four_factors"' in comment:
        # Re-parse the comment's text to get at the table inside it.
        soup = BeautifulSoup(comment, 'lxml')
        break
table = soup.select_one('#four_factors')
df = pd.read_html(str(table))[0].fillna('')
print(df)
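As a self-contained illustration of the comment-unwrapping trick (the HTML below is a made-up miniature, not real basketball-reference markup), the same approach works on any page that hides a table inside a comment:

```python
from bs4 import BeautifulSoup, Comment

# Miniature document with a table hidden inside an HTML comment,
# mimicking how basketball-reference serves its secondary tables.
html = """
<div id="all_four_factors">
<!--
<table id="four_factors">
  <tr><th>Team</th><th>Pace</th></tr>
  <tr><td>OKC</td><td>103.8</td></tr>
</table>
-->
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
# find_all('table') on the outer soup finds nothing: the table is
# comment text, not a parsed tag.
print(soup.find_all('table'))  # []

# Re-parse the comment's text to reach the table inside it.
comment = soup.find(string=lambda text: isinstance(text, Comment))
inner = BeautifulSoup(comment, 'html.parser')
table = inner.select_one('#four_factors')
print(table.find_all('td')[0].get_text())  # OKC
```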
Related
I am trying to scrape data from the shareholding disclosures page of the Hong Kong Exchange, yet when I look for 'tr' elements, my code only grabs the previous-balance number of shares rather than the whole tr row, including the name and ticker.
url = "https://di.hkex.com.hk/di/summary/DSM20220218C1.htm"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
result = soup.find('table', id="Table3")
for stock in result:
    rows = stock.find('tr')
    print(rows)
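For what it's worth, iterating over a Tag iterates its direct children (including bare whitespace strings), so `stock.find('tr')` never sees a whole row. Asking the table itself for every row with `find_all('tr')` returns complete rows. A minimal sketch on stand-in markup (the real Table3 layout on the HKEX page may differ):

```python
from bs4 import BeautifulSoup

# Stand-in for the disclosure table; the real column layout may differ.
html = """
<table id="Table3">
  <tr><td>00001</td><td>CKH Holdings</td><td>1,000</td></tr>
  <tr><td>00005</td><td>HSBC Holdings</td><td>2,500</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', id='Table3')

# Ask the table for every row directly instead of iterating its children.
rows = []
for row in table.find_all('tr'):
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    rows.append(cells)
    print(cells)
```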
So I am trying to scrape the country names from the table on the website https://www.theguardian.com/world/2020/oct/25/covid-world-map-countries-most-coronavirus-cases-deaths as a list. But when I print it out, it just gives me an empty list instead of a list containing country names. Could anybody explain why I am getting this? The code is below:
import requests
from bs4 import BeautifulSoup
webpage = requests.get("https://www.theguardian.com/world/2020/oct/25/covid-world-map-countries-most-coronavirus-cases-deaths")
soup = BeautifulSoup(webpage.content, "html.parser")
countries = soup.find_all("div", attrs={"class": 'gv-cell gv-country-name'})
print(countries)
list_of_countries = []
for country in countries:
    list_of_countries.append(country.get_text())
print(list_of_countries)
This is the output I am getting:
[]
[]
Also, not only here: I was getting the same result (an empty list) when I was trying to scrape a product's information from Amazon's website.
The list is dynamically retrieved from another endpoint, which you can find in the network tab; it returns JSON. Something like the following should work:
import requests
r = requests.get('https://interactive.guim.co.uk/2020/coronavirus-central-data/latest.json').json() #may need to add headers
countries = [i['attributes']['Country_Region'] for i in r['features']]
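The comprehension above assumes the JSON payload is shaped roughly as below (the live endpoint may have changed since); with inline stand-in data you can see what it extracts:

```python
# Inline stand-in for the JSON the endpoint returns; the live payload
# may differ, but the comprehension assumes this shape.
payload = {
    'features': [
        {'attributes': {'Country_Region': 'US'}},
        {'attributes': {'Country_Region': 'India'}},
    ]
}

# Pull one country name out of each feature record.
countries = [i['attributes']['Country_Region'] for i in payload['features']]
print(countries)  # ['US', 'India']
```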
I'm having trouble selecting a table (or actually anything) from the specified URL. I used the find_all method to no avail. There are multiple tables on the webpage, separated by <h3> tags (the image below shows a table under the <h3> tag 'Basic').
import requests
from bs4 import BeautifulSoup
res = requests.get('Example_URL')
soup = BeautifulSoup(res.text,'html.parser')
table = soup.find_all('table')[0]
print(table)
I expect to be able to select the table and then print its contents to the console, but the script raises IndexError: list index out of range.
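The IndexError means find_all('table') returned an empty list: the markup requests fetched contains no <table> tags at all, typically because the tables are injected by JavaScript or hidden inside HTML comments. A small sketch of the failure mode and a guard against it (the HTML here is a stand-in, not the real page):

```python
from bs4 import BeautifulSoup

# Stand-in for a page whose tables are injected client-side, so the
# fetched HTML contains no <table> tags.
html = "<h3>Basic</h3><p>tables are rendered by JavaScript here</p>"
soup = BeautifulSoup(html, 'html.parser')

# find_all returns [] when nothing matches; indexing [0] then raises
# IndexError, so guard before indexing.
tables = soup.find_all('table')
if tables:
    print(tables[0])
else:
    print('no <table> tags in the fetched HTML')
```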
I am parsing a certain webpage with BeautifulSoup, trying to retrieve all links that are inside h3 tags:
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www....')
soup = BeautifulSoup(page.text, "html.parser")
links = []
for item in soup.find_all('h3'):
    links.append(item.a['href'])
However, the links found are different from the links present in the page. For example, where the page contains the link http://www.estense.com/?p=116872, BeautifulSoup returns http://www.estense.com/%3Fp%3D116872, replacing '?' with '%3F' and '=' with '%3D'. Why is that?
Thanks.
You can unquote the URL using urllib.parse:
from urllib import parse
parse.unquote(item.a['href'])
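Put together as a runnable snippet on the example URL from the question, the percent-decoding recovers the original query string:

```python
from urllib.parse import unquote

# Percent-decoding turns %3F back into '?' and %3D back into '='.
url = 'http://www.estense.com/%3Fp%3D116872'
print(unquote(url))  # http://www.estense.com/?p=116872
```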
I am trying to get all the meanings under the "noun" heading for the word the user enters.
This is my code for now:
import requests
from bs4 import BeautifulSoup

word = raw_input("Enter word: ").lower()
url = 'http://www.dictionary.com/browse/' + word
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
try:
    meaning = soup.find("div", attrs={"class": "def-content"}).get_text()
    print "Meaning of", word, "is: "
    print meaning
except AttributeError:
    print "Sorry, we were not able to find the word."
finally:
    print "Thank you for using our dictionary."
Now suppose the user enters the word "today"; my output will be:
this present day: Today is beautiful.
I don't understand why it leaves so many spaces, and why the part "Today is beautiful" doesn't come down onto its own line.
Anyway, when you look up that word on this site, you can see there are two meanings, yet my program only shows one.
I want the output to be:
1.this present day:
Today is beautiful.
2.
this present time or age:
the world of today.
Can anyone explain what's wrong and how I can fix it? I have no idea what's wrong, so please don't think I didn't try.
You are getting only the first noun meaning with the above code.
I have rewritten the code as below:
from bs4 import BeautifulSoup
import requests

word = raw_input("Enter word: ").lower()
url = 'http://www.dictionary.com/browse/' + word
r = requests.get(url)
bsObj = BeautifulSoup(r.content, "lxml")
# The first section with this class holds the complete noun block.
nouns = bsObj.find("section", {"class": "def-pbk ce-spot"})
data = nouns.findAll('div', {'class': 'def-content'})
count = 1
for item in data:
    # Collapse runs of whitespace to single spaces and number each meaning.
    temp = ' '.join(item.get_text().strip().split())
    print str(count) + '. ' + temp
    count += 1
Explanation:
Yes. Assuming the website shows the noun meanings first, I retrieve the first section, which contains the complete noun data. Then I find all the meanings under that section (the data variable), iterate over them in a loop, and fetch the text of each meaning. To remove the extra spaces, I split the fetched text and rejoin it with single spaces, adding a number at the beginning of each meaning.
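The split-and-rejoin trick is independent of the scraping; on any whitespace-padded string it collapses runs of spaces and newlines:

```python
# str.split() with no argument splits on any run of whitespace, so
# rejoining with single spaces normalizes the text in one pass.
raw = "  this present day:\n\n     Today is beautiful.  "
clean = ' '.join(raw.strip().split())
print(clean)  # this present day: Today is beautiful.
```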
try:
    meaning = soup.find(attrs={"class": "def-pbk ce-spot"}).get_text(separator="\n", strip=True)
You can strip the whitespace from the text by passing strip=True to get_text().
The reason you didn't get all the text is that your selector was wrong; you need to select a wider range (the whole section rather than a single def-content div).
I added separator="\n" to get_text() to format the output.
If you have any questions, you can read the BeautifulSoup documentation.
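To see what those two get_text() arguments do, here is a sketch on miniature stand-in markup (the real dictionary.com page differs): strip=True trims each text fragment and drops whitespace-only ones, and separator="\n" puts each fragment on its own line instead of running them together.

```python
from bs4 import BeautifulSoup

# Miniature stand-in for the dictionary markup; the real page differs.
html = """
<section class="def-pbk ce-spot">
  <div class="def-content"> this present day:  <em>Today is beautiful.</em> </div>
  <div class="def-content"> this present time or age:  <em>the world of today.</em> </div>
</section>
"""

soup = BeautifulSoup(html, 'html.parser')
# Select the whole section, then join its stripped text fragments
# with newlines so each piece lands on its own line.
text = soup.find(attrs={"class": "def-pbk ce-spot"}).get_text(separator="\n", strip=True)
print(text)
```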