I'm having trouble selecting a table (or anything, really) from the specified URL. I used the find_all method, to no avail. There are multiple tables on the webpage, separated by <h3> tags (the image below shows the table under the <h3> heading 'Basic').
import requests
from bs4 import BeautifulSoup
res = requests.get('Example_URL')
soup = BeautifulSoup(res.text,'html.parser')
table = soup.find_all('table')[0]
print(table)
I expect to be able to select the table and then print its contents to the console, but soup.find_all('table')[0] raises IndexError: list index out of range.
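For what it's worth, an IndexError on [0] means find_all returned an empty list, often because the tables are rendered by JavaScript and are absent from the raw HTML. A guarded sketch (using stand-in markup, since the real URL isn't shown):

```python
from bs4 import BeautifulSoup

# Stand-in for res.text; the real page's tables may be injected by JavaScript
html = "<h3>Basic</h3><p>no table present in the raw HTML</p>"
soup = BeautifulSoup(html, "html.parser")

tables = soup.find_all("table")
if tables:  # guard before indexing to avoid IndexError
    print(tables[0])
else:
    print("No <table> found; inspect res.text for the table markup.")
```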
I have a piece of code that I wrote in Google Colab that essentially just scrapes a piece of data from a website, shown below:
#imports
#<div class="sc-aef7b723-0 dDQUel priceTitle">
#<div class="priceValue ">
from bs4 import BeautifulSoup
import requests
import time
url = 'https://coinmarketcap.com/currencies/index-cooperative/'
HTML = requests.get(url)
soup = BeautifulSoup(HTML.text, 'html.parser')
text = soup.find('div', attrs={'class':'sc-aef7b723-0 dDQUel priceTitle'}).find('div', attrs={'class':'priceValue '}).text
print(text)
I need this to run as a py file on my computer, but when it runs as a py file, I get the error:
text = soup.find('div', attrs={'class':'sc-aef7b723-0 dDQUel priceTitle'}).find('div', attrs={'class':'priceValue '}).text
AttributeError: 'NoneType' object has no attribute 'text'
I was wondering why this happens, as it is the exact same code. All of my packages are at their most recent versions as well.
You just need to remove the trailing space in the class name: attrs={'class':'priceValue'}. When you run the specified web page through html.parser, it corrects the HTML in some ways.
In this case it removes the trailing space present on the web page, because a trailing space doesn't really make sense in a class name. Spaces are only needed when an element has more than one class.
So the parsed page stored in your soup variable has that div looking like this: <div class="priceValue"><span>$1.74</span></div>. The soup.find function does care about trailing spaces, so it couldn't match the padded name 'priceValue ' against the class priceValue.
To match the class regardless of trailing or leading whitespace, you could use the soup.select function instead. It matches elements with CSS selectors, so it doesn't care about the spaces; you could find an element of that class like this (with any amount of trailing and/or leading whitespace):
css_selected_value = soup.select("[class= priceValue ]")[0].text
print(css_selected_value)
That being said, I'm not sure why your code works properly on Google Colab; I've never tried it. Maybe I'll dig into it later.
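To illustrate the normalization, here is a minimal sketch using stand-in markup with the same trailing space as the live page:

```python
from bs4 import BeautifulSoup

# Stand-in markup mimicking the page: note the trailing space in the class
html = '<div class="priceValue "><span>$1.74</span></div>'
soup = BeautifulSoup(html, "html.parser")

# The parser normalizes the class list, so the padded name no longer matches...
print(soup.find("div", attrs={"class": "priceValue "}))  # None
# ...while the trimmed name does
print(soup.find("div", attrs={"class": "priceValue"}).text)  # $1.74
```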
So I am trying to scrape the country names from the table on the website https://www.theguardian.com/world/2020/oct/25/covid-world-map-countries-most-coronavirus-cases-deaths as a list. But when I print it out, it just gives me an empty list instead of a list containing the country names. Could anybody explain why I am getting this? The code is below:
import requests
from bs4 import BeautifulSoup
webpage = requests.get("https://www.theguardian.com/world/2020/oct/25/covid-world-map-countries-most-coronavirus-cases-deaths")
soup = BeautifulSoup(webpage.content, "html.parser")
countries = soup.find_all("div", attrs={"class": 'gv-cell gv-country-name'})
print(countries)
list_of_countries = []
for country in countries:
    list_of_countries.append(country.get_text())
print(list_of_countries)
This is the output I am getting:
[]
[]
Also, not only here: I was getting the same result (an empty list) when I was trying to scrape a product's information from Amazon's website.
The list is dynamically retrieved from another endpoint, which you can find in the network tab; it returns JSON. Something like the following should work:
import requests
r = requests.get('https://interactive.guim.co.uk/2020/coronavirus-central-data/latest.json').json() #may need to add headers
countries = [i['attributes']['Country_Region'] for i in r['features']]
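To show the shape that comprehension expects, here is a sketch against a hand-written sample payload (the field names follow the code above; the values are made up):

```python
# Hypothetical sample mimicking the endpoint's JSON structure
payload = {
    "features": [
        {"attributes": {"Country_Region": "US"}},
        {"attributes": {"Country_Region": "India"}},
    ]
}

countries = [i["attributes"]["Country_Region"] for i in payload["features"]]
print(countries)  # ['US', 'India']
```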
I am trying to scrape the star rating for the "value" data from Trip Advisor hotels, but I am not able to get the data using the class name:
Below is the code which I have tried to use:
import requests
from bs4 import BeautifulSoup

review_pages = requests.get("https://www.tripadvisor.com/Hotel_Review-g60745-d94367-Reviews-Harborside_Inn-Boston_Massachusetts.html")
soup3 = BeautifulSoup(review_pages.text, 'html.parser')
value = soup3.find_all(class_='hotels-review-list-parts-AdditionalRatings__bubbleRating--2WcwT')
Value_1 = soup3.find_all(class_="hotels-review-list-parts-AdditionalRatings__ratings--3MtoD")
When I try to capture the values, it returns an empty list. Any direction would be really helpful. I have tried multiple class names from that page and can get various fields such as dates and reviews, but I am not able to get the bubble ratings for just one category, such as service.
You can use an attribute = value CSS selector, passing the class in with its fixed prefix and the ^ (starts with) operator, to allow for the different star values that form part of the attribute value.
Or, more simply, use the span type selector to select the child spans:
.hotels-hotel-review-about-with-photos-Reviews__subratings--3DGjN span
In this line:
values=soup3.select('.hotels-hotel-review-about-with-photos-Reviews__subratings--3DGjN [class^="ui_bubble_rating bubble_"]')
the first part of the selector, reading from left to right, selects the parent class of those ratings. The space that follows is a descendant combinator, combining it with the attribute = value selector that gathers the list of qualifying children. As mentioned, you can replace that last part with just span.
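A self-contained illustration of the selector against stand-in markup (the markup is invented here; only the class names are taken from the page):

```python
from bs4 import BeautifulSoup

# Stand-in markup; bubble_40 would encode a 4-bubble rating
html = ('<div class="hotels-hotel-review-about-with-photos-Reviews__subratings--3DGjN">'
        '<span class="ui_bubble_rating bubble_40"></span></div>')
soup = BeautifulSoup(html, "html.parser")

# Parent class, descendant combinator, then starts-with attribute selector
hits = soup.select('.hotels-hotel-review-about-with-photos-Reviews__subratings--3DGjN '
                   '[class^="ui_bubble_rating bubble_"]')
print(hits[0]["class"][1])  # bubble_40
```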
Code:
import requests
from bs4 import BeautifulSoup
import re
review_pages=requests.get("https://www.tripadvisor.com/Hotel_Review-g60745-d94367-Reviews-Harborside_Inn-Boston_Massachusetts.html")
soup3=BeautifulSoup(review_pages.content,'lxml')
values=soup3.select('.hotels-hotel-review-about-with-photos-Reviews__subratings--3DGjN [class^="ui_bubble_rating bubble_"]') #.hotels-hotel-review-about-with-photos-Reviews__subratings--3DGjN span
Value_1 = values[-1]
print(Value_1['class'][1])
stars = re.search(r'\d', Value_1['class'][1]).group(0)
print(stars)
Although I use re here, I think it is overkill; you could simply use replace.
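For example, the replace-based version might look like this (bubble_45 is a made-up class token of the same shape as those on the page):

```python
# A token like 'bubble_45' encodes the rating; strip the fixed prefix
token = "bubble_45"
stars = token.replace("bubble_", "")
print(stars)  # 45
```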
I am parsing a certain webpage with Beautiful Soup, trying to retrieve all links that are inside h3 tags:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www....")
soup = BeautifulSoup(page.text, "html.parser")

links = []
for item in soup.find_all('h3'):
    links.append(item.a['href'])
However, the links found are different from the links present in the page. For example, where the link http://www.estense.com/?p=116872 is present in the page, Beautiful Soup returns http://www.estense.com/%3Fp%3D116872, replacing '?' with '%3F' and '=' with '%3D'. Why is that?
Thanks.
You can unquote the URL using urllib.parse:
from urllib import parse
parse.unquote(item.a['href'])
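For example, with the URL from the question:

```python
from urllib import parse

url = "http://www.estense.com/%3Fp%3D116872"
print(parse.unquote(url))  # http://www.estense.com/?p=116872
```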
I am trying to get all the meanings in the "noun" heading of the word the user enters.
This is my code for now:
import requests
from bs4 import BeautifulSoup
word=raw_input("Enter word: ").lower()
url=('http://www.dictionary.com/browse/'+word)
r=requests.get(url)
soup=BeautifulSoup(r.content,"html.parser")
try:
    meaning=soup.find("div",attrs={"class":"def-content"}).get_text()
    print "Meaning of",word,"is: "
    print meaning
except AttributeError:
    print "Sorry, we were not able to find the word."
    pass
finally:
    print "Thank you for using our dictionary."
Now suppose the user enters the word "today" and my output will be:
this present day: Today is beautiful.
I don't understand why it leaves so many spaces, and why the part
"Today is beautiful"
doesn't come down onto its own line.
Anyway, when you look up that word on this site, you can see there are 2 meanings, yet my program only shows one.
I want the output to be:
1.this present day:
Today is beautiful.
2.
this present time or age:
the world of today.
Can anyone explain to me what's wrong and how I can fix it? I have no idea what's wrong, so please don't think I didn't try.
With the above code you are only getting the first noun meaning.
I have rewritten the code; it is below:
from bs4 import BeautifulSoup
import requests
word = raw_input("Enter word: ").lower()
url = ('http://www.dictionary.com/browse/' + word)
r = requests.get(url)
bsObj = BeautifulSoup(r.content, "lxml")
nouns = bsObj.find("section", {"class": "def-pbk ce-spot"})
data = nouns.findAll('div', {'class': 'def-content'})
count = 1
for item in data:
    temp = ' '.join(item.get_text().strip().split())
    print str(count) + '. ' + temp
    count += 1
Explanation:
Yes. Assuming the website shows the noun meanings first, I retrieve the first section, which contains the complete noun data. Then I find all the meanings under that section in the data variable, iterate over it in a loop, and fetch the text of each meaning. Finally, to remove all the extra spaces, I split the fetched text and rejoin it with single spaces, adding a number at the beginning.
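The whitespace clean-up on its own, as a small sketch (the sample string is made up):

```python
# Collapse runs of spaces and newlines into single spaces
raw = "  this present day: \n\n  Today is beautiful.  "
temp = " ".join(raw.strip().split())
print(temp)  # this present day: Today is beautiful.
```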
try:
    meaning = soup.find(attrs={"class": "def-pbk ce-spot"}).get_text(separator="\n", strip=True)
You can strip the whitespace of the text by passing strip=True to get_text().
The reason you didn't get all the text is that your selector was wrong; you need to select a larger scope.
I added separator='\n' to get_text() to format the output.
If you have any questions, you can read the BeautifulSoup documentation.
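A minimal sketch of what separator and strip do to get_text(), using stand-in markup:

```python
from bs4 import BeautifulSoup

html = "<div><p> this present day: </p><p> Today is beautiful. </p></div>"
soup = BeautifulSoup(html, "html.parser")

# strip=True trims each text fragment; separator joins fragments with newlines
print(soup.div.get_text(separator="\n", strip=True))
```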